Important Spark Questions
At a high level, every Apache Spark application consists of a driver program that
launches various parallel operations on executor Java Virtual Machines (JVMs)
running either in a cluster or locally on the same machine. In Databricks, the notebook
interface is the driver program. The driver program contains the application's main loop; it creates distributed datasets on the cluster and then applies operations (transformations and actions) to those datasets.
Driver programs access Apache Spark through a SparkSession object regardless of
deployment location.
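As a minimal sketch (the application name is illustrative), a driver program typically obtains its SparkSession like this; in a Databricks notebook the session is already created for you as spark:

    from pyspark.sql import SparkSession

    # Create (or reuse) the SparkSession the driver uses to talk to the cluster.
    spark = SparkSession.builder.appName("example-app").getOrCreate()

    # The lower-level RDD entry point is reachable through the session.
    sc = spark.sparkContext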
At a high level, when an action is called on an RDD, Spark creates the DAG and submits it to the DAG scheduler.
• The DAG scheduler divides the operators into stages of tasks. A stage is made up of tasks based on partitions of the input data. The DAG scheduler pipelines operators together; for example, many map operators can be scheduled in a single stage. The final output of the DAG scheduler is a set of stages.
• The stages are passed on to the task scheduler. The task scheduler launches the tasks via the cluster manager (Spark Standalone/YARN/Mesos). The task scheduler does not know about the dependencies between stages.
• The worker (slave node) executes the tasks (see the sketch below).
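A minimal PySpark sketch of that flow (the data is invented): the narrow operations are pipelined into one stage, reduceByKey introduces a shuffle and therefore a second stage, and nothing runs until the action is called.

    from pyspark.sql import SparkSession

    sc = SparkSession.builder.getOrCreate().sparkContext

    rdd = sc.parallelize(["a b", "b c", "a c"], 2)

    # Narrow transformations: pipelined together into a single stage.
    pairs = rdd.flatMap(lambda line: line.split()).map(lambda w: (w, 1))

    # Wide transformation: requires a shuffle, so the DAG scheduler cuts a stage boundary here.
    counts = pairs.reduceByKey(lambda a, b: a + b)

    # Only the action builds the DAG, splits it into stages, and submits tasks.
    print(counts.collect())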
1) Spark SQL
Many data scientists, analysts, and general business intelligence users rely on
interactive SQL queries for exploring data. Spark SQL is a Spark module for
structured data processing. It provides a programming abstraction called
DataFrame and can also act as a distributed SQL query engine. It enables unmodified
Hadoop Hive queries to run up to 100x faster on existing deployments and data. It
also provides powerful integration with the rest of the Spark ecosystem (e.g.,
integrating SQL query processing with machine learning).
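A small hedged sketch of both entry points (the column names and rows are invented):

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.getOrCreate()
    df = spark.createDataFrame([("Alice", 34), ("Bob", 45)], ["name", "age"])

    # DataFrame API
    df.filter(df.age > 40).show()

    # The same query through the distributed SQL engine
    df.createOrReplaceTempView("people")
    spark.sql("SELECT name FROM people WHERE age > 40").show()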
2) Spark Streaming
Many applications need the ability to process and analyse not only batch data,
but also streams of new data in real-time. Running on top of Spark, Spark
Streaming enables powerful interactive and analytical applications across both
streaming and historical data, while inheriting Spark’s ease of use and fault
tolerance characteristics. It readily integrates with a wide variety of popular data
sources, including HDFS, Flume, Kafka, and Twitter.
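A minimal sketch using the classic DStream API (the socket source on localhost:9999 is a placeholder; in practice it could be Kafka, Flume, or HDFS):

    from pyspark import SparkContext
    from pyspark.streaming import StreamingContext

    sc = SparkContext(appName="streaming-demo")
    ssc = StreamingContext(sc, 10)          # 10-second micro-batches

    lines = ssc.socketTextStream("localhost", 9999)   # placeholder source
    counts = (lines.flatMap(lambda line: line.split())
                   .map(lambda w: (w, 1))
                   .reduceByKey(lambda a, b: a + b))
    counts.pprint()

    ssc.start()
    ssc.awaitTermination()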
3) Machine Learning: MLlib
Machine learning has quickly emerged as a critical piece in mining Big Data for
actionable insights. Built on top of Spark, MLlib is a scalable machine learning
library that delivers both high-quality algorithms (e.g., multiple iterations to
increase accuracy) and blazing speed (up to 100x faster than MapReduce). The
library is usable in Java, Scala, and Python as part of Spark applications, so that
you can include it in complete workflows.
4) Graph Computation: GraphX
GraphX is a graph computation engine built on top of Spark that enables users
to interactively build, transform and reason about graph structured data at
scale. It comes complete with a library of common algorithms.
5) General Execution: Spark Core
Spark Core is the underlying general execution engine for the Spark platform that
all other functionality is built on top of. It provides in-memory computing
capabilities to deliver speed, a generalized execution model to support a wide
variety of applications, and Java, Scala, and Python APIs for ease of development.
A Single Node cluster is a cluster consisting of a Spark driver and no Spark workers.
Such clusters support Spark jobs and all Spark data sources, including Delta Lake.
In contrast, Standard clusters require at least one Spark worker to run Spark jobs.
Single Node cluster use cases:
• Running single-node machine learning workloads that need Spark to load and save data
• Lightweight exploratory data analysis (EDA)
Single Node cluster properties:
• Runs Spark locally with as many executor threads as logical cores on the cluster (the number of cores on the driver minus 1).
• Has 0 workers, with the driver node acting as both master and worker.
• The executor stderr, stdout, and log4j logs are in the driver log.
• Cannot be converted to a Standard cluster. Instead, create a new cluster with the mode set to Standard.
RDD lineage (also called the RDD operator graph or RDD dependency graph) is a graph of all the parent RDDs of an RDD. It is built as a result of applying transformations to the RDD and forms a logical execution plan.
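The lineage Spark has recorded for an RDD can be inspected with toDebugString; the operations below are only an illustration:

    from pyspark.sql import SparkSession

    sc = SparkSession.builder.getOrCreate().sparkContext

    rdd = sc.parallelize(range(100), 4)
    reduced = rdd.map(lambda x: (x % 10, x)).reduceByKey(lambda a, b: a + b)

    # Prints the chain of parent RDDs, i.e. the lineage / logical execution plan.
    print(reduced.toDebugString().decode("utf-8"))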
At a high level, there are two kinds of transformations that can be applied to RDDs: narrow transformations and wide transformations. Wide transformations result in stage boundaries.
The DAG scheduler then submits the stages to the task scheduler. The number of tasks submitted depends on the number of partitions present in the textFile. For example, if there are 4 partitions, then 4 sets of tasks will be created and submitted in parallel, provided there are enough workers/cores. The diagram below illustrates this in more detail:
[Figure: Repartition. Source: https://pixipanda.com/]
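A rough sketch of the idea (the input path is hypothetical): the number of tasks in the first stage equals the number of partitions of the input, and repartition() changes that parallelism at the cost of a shuffle.

    from pyspark.sql import SparkSession

    sc = SparkSession.builder.getOrCreate().sparkContext

    rdd = sc.textFile("/data/input.txt", minPartitions=4)   # hypothetical path
    print(rdd.getNumPartitions())    # e.g. 4 -> 4 tasks in the first stage

    # repartition() shuffles the data into a new number of partitions.
    wider = rdd.repartition(8)
    print(wider.getNumPartitions())  # 8 -> 8 tasks for the following stages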
14. What is data skew?
Data Skew
Often the data is split into partitions based on a key, for instance the first letter of a name. If values are not evenly distributed across this key, then more data will be placed in one partition than in another. For example, partition A might end up with three times as many records as partitions B and C. Because the A partition is 3 times larger than the other two, it will take approximately 3 times as long to compute. As the next stage of processing cannot begin until all three partitions are evaluated, the overall results from the stage will be delayed.
A skew hint must contain at least the name of the relation with skew. A relation is
a table, view, or a subquery. All joins with this relation then use skew join
optimization.
There might be multiple joins on a relation and only some of them will suffer from
skew. Skew join optimization has some overhead so it is better to use it only when
needed. For this purpose, the skew hint accepts column names. Only joins with
these columns use skew join optimization.
You can also specify skew values in the hint. Depending on the query and data,
the skew values might be known (for example, because they never change) or might
be easy to find out. Doing this reduces the overhead of skew join optimization.
Otherwise, Delta Lake detects them automatically.
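On Databricks the skew hint is written roughly as follows (the table names, column name, and skew values are invented, and the exact syntax may vary by runtime version):

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.getOrCreate()

    # Hint on the whole relation.
    spark.sql("""
      SELECT /*+ SKEW('orders') */ *
      FROM orders JOIN customers ON orders.cust_id = customers.cust_id
    """)

    # Hint restricted to a column, with the known skewed values supplied.
    spark.sql("""
      SELECT /*+ SKEW('orders', 'cust_id', (0, 1)) */ *
      FROM orders JOIN customers ON orders.cust_id = customers.cust_id
    """)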
Key/value RDDs are commonly used to perform aggregations, and often we will
do some initial ETL (extract, transform, and load) to get our data into a key/value
format. Key/value RDDs expose new operations (e.g., counting up reviews for
each product, grouping together data with the same key, and grouping together
two different RDDs).
Common transformations on one pair RDD (example RDD: {(1, 2), (3, 4), (3, 6)}):
• reduceByKey(func) - combine values with the same key.
• groupByKey() - group values with the same key.
• mapValues(func) - apply a function to each value of a pair RDD without changing the key.
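A quick sketch of these pair RDD operations on that example RDD (output ordering may differ across partitions):

    from pyspark.sql import SparkSession

    sc = SparkSession.builder.getOrCreate().sparkContext
    rdd = sc.parallelize([(1, 2), (3, 4), (3, 6)])

    print(rdd.reduceByKey(lambda a, b: a + b).collect())   # [(1, 2), (3, 10)]
    print(rdd.groupByKey().mapValues(list).collect())      # [(1, [2]), (3, [4, 6])]
    print(rdd.mapValues(lambda v: v + 1).collect())        # [(1, 3), (3, 5), (3, 7)]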
A shuffle occurs when data is rearranged between partitions. This is required when
a transformation requires information from other partitions, such as summing all
the values in a column. Spark will gather the required data from each partition and
combine it into a new partition, likely on a different executor.
During a shuffle, data is written to disk and transferred across the network,
halting Spark’s ability to do processing in-memory and causing a performance
bottleneck. Consequently we want to try to reduce the number of shuffles being
done or reduce the amount of data being shuffled.
Cluster Mode
In cluster mode, the Spark driver (the application master) is started on one of the worker machines. The client that submits the application can therefore go away after initiating it, or continue with other work: it works on a "fire and forget" basis.
Client Mode
In client mode, the client that submits the Spark application starts the driver, and the driver maintains the SparkContext. Until that particular job finishes, the driver manages the tasks, so the client has to stay in touch with the cluster and remain online until the job completes.
In client mode, the client can keep receiving information about the status of the job and any changes happening to it, so if we want to keep monitoring the status of a particular job, we can submit it in client mode. In this mode the entire application depends on the local machine, since the driver resides there; if anything goes wrong on the local machine, the driver goes down and the entire application goes down with it. Hence this mode is not suitable for production use cases. However, it is good for debugging or testing, since the outputs are printed to the driver terminal on the local machine.
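The deploy mode is chosen with the --deploy-mode flag of spark-submit; the master URL and application file below are placeholders:

    # Driver runs on a worker inside the cluster ("fire and forget").
    spark-submit --master yarn --deploy-mode cluster my_app.py

    # Driver runs on the submitting (client) machine.
    spark-submit --master yarn --deploy-mode client my_app.py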
20. What file formats are used in big data, and what are the differences between them?
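Common formats are plain text/CSV and JSON (row-based, human-readable), Avro (row-based binary, good for write-heavy pipelines and schema evolution), and the columnar formats Parquet and ORC (good for analytical reads and compression). A hedged sketch of reading and writing them in Spark (paths are placeholders; Avro requires the external spark-avro package):

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.getOrCreate()

    df = spark.read.option("header", "true").csv("/data/in.csv")      # row-based text
    df.write.mode("overwrite").json("/data/out_json")                 # row-based text
    df.write.mode("overwrite").parquet("/data/out_parquet")           # columnar, compressed
    df.write.mode("overwrite").orc("/data/out_orc")                   # columnar, Hive-friendly
    df.write.mode("overwrite").format("avro").save("/data/out_avro")  # row-based binary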
Spark operates on data in fault-tolerant file systems like HDFS or S3. Hence, all of
the RDDs generated from the fault-tolerant data are also fault-tolerant. However,
this is not the case for Spark Streaming as the data in most cases is received over
the network (except when fileStream is used). To achieve the same fault-tolerance
properties for all of the generated RDDs, the received data is replicated among multiple Spark executors on worker nodes in the cluster (the default replication factor is 2). This leads to two kinds of data in the system that need to be recovered in the event of failures:
1. Data received and replicated - This data survives failure of a single worker
node as a copy of it exists on one of the other nodes.
2. Data received but buffered for replication - Since this is not replicated,
the only way to recover this data is to get it again from the source.
Furthermore, there are two kinds of failures that we should be concerned about:
1. Failure of a worker node - Any of the worker nodes running executors can fail, and all in-memory data on those nodes will be lost. If any receivers were running on the failed nodes, then their buffered data will be lost.
2. Failure of the driver node - If the driver node running the Spark Streaming application fails, then obviously the SparkContext is lost, and all executors with their in-memory data are lost.
With this basic knowledge, let us understand the fault-tolerance semantics of
Spark Streaming.
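In practice the standard mitigations are checkpointing to a fault-tolerant store and enabling the receiver write-ahead log, so even buffered-but-not-yet-replicated data can be recovered. A hedged sketch (the checkpoint path and socket source are placeholders):

    from pyspark import SparkConf, SparkContext
    from pyspark.streaming import StreamingContext

    conf = (SparkConf()
            .setAppName("ft-demo")
            # Write received blocks to a write-ahead log before processing them.
            .set("spark.streaming.receiver.writeAheadLog.enable", "true"))

    sc = SparkContext(conf=conf)
    ssc = StreamingContext(sc, 10)

    # Checkpoint to a fault-tolerant file system such as HDFS or S3.
    ssc.checkpoint("hdfs:///checkpoints/ft-demo")

    lines = ssc.socketTextStream("localhost", 9999)   # placeholder source
    lines.count().pprint()

    ssc.start()
    ssc.awaitTermination()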
The reduceByKey() transformation gathers together pairs that have the same key
and applies a function to two associated values at a time. reduceByKey() operates
by applying the function first within each partition on a per-key basis and then
across the partitions.
While both the groupByKey() and reduceByKey() transformations can often be used
to solve the same problem and will produce the same answer,
the reduceByKey() transformation works much better for large distributed datasets.
This is because Spark knows it can combine output with a common key on each
partition before shuffling (redistributing) the data across nodes. Only use
groupByKey() if the operation would not benefit from reducing the data before the
shuffle occurs.
groupByKey() simply groups your dataset based on a key. It results in a data shuffle when the RDD is not already partitioned appropriately; the sketch below contrasts reduceByKey, groupByKey, and foldByKey.
• foldByKey merges the values for each key using an associative function and a
neutral "zero value".
23. What is lazy evaluation in Spark, and what are its benefits?
Lazy Evaluation: transformations in Spark are evaluated lazily. Calling a transformation only records it in the logical plan (the DAG); nothing is executed until an action is invoked. This lets Spark optimize the whole chain of operations and avoid computing results that are never used.
Spark SQL is one of the most technically involved components of Apache Spark. It
powers both SQL queries and the DataFrame API. At the core of Spark SQL is the
Catalyst optimizer, which leverages advanced programming language features (e.g.
Scala’s pattern matching and quasiquotes) in a novel way to build an extensible
query optimizer.
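A minimal sketch of lazy evaluation and the Catalyst plans (the computation is invented): the transformations only build a plan, explain() shows what Catalyst produced, and nothing executes until the action.

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.getOrCreate()
    df = spark.range(1000000)

    # Transformations only build up a logical plan; nothing has executed yet.
    filtered = df.filter(df.id % 2 == 0).select((df.id * 2).alias("doubled"))

    # Inspect the analyzed, optimized, and physical plans produced by Catalyst.
    filtered.explain(True)

    # Only an action such as count() actually triggers execution.
    print(filtered.count())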
When one side of a join is relatively small, we choose to broadcast it to avoid a shuffle and improve performance. But because the broadcast table is first collected to the driver and then redundantly distributed to every executor, broadcasting a relatively large table puts heavy pressure on both the driver and the executors.
But because Spark is a distributed computing engine, a large amount of data can be divided into n smaller data sets and processed in parallel. Applying this idea to the join gives the Shuffle Hash Join: Spark SQL partitions both tables of the join by the join key into n partitions and then hash-joins the corresponding partitions of the two tables. To a certain extent this reduces the pressure that broadcasting would place on the driver, and it also means no single executor has to hold the entire broadcast table in memory.
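On Spark 3.0 and later the choice can be nudged with a join hint; a hedged sketch (the tables are hypothetical and assumed too large to broadcast):

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.getOrCreate()

    orders = spark.table("orders")        # hypothetical large table
    customers = spark.table("customers")  # hypothetical large table

    # Ask for a shuffle hash join instead of the default sort-merge join.
    joined = orders.join(customers.hint("shuffle_hash"), "cust_id")
    joined.explain()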
As we all know, in common database models (such as the star model or the snowflake model), tables are generally divided into two types: fact tables and dimension tables. Dimension tables (small tables) generally refer to fixed, slowly changing data such as contacts or items, and their size is usually limited. Fact tables generally record transactions, such as sales records, and usually keep growing over time; in other words, they are large tables.
Because a join connects the records of two tables that share the same key value, the most direct way to join two tables in Spark SQL is to partition both tables by the key and then, within each partition, connect the records that have the same key value. But this inevitably involves a shuffle, and a shuffle is a relatively time-consuming operation in Spark, so we should try to design Spark applications to avoid a lot of shuffling.
When a dimension table is joined with a fact table, we can avoid the shuffle by distributing all of the data of the (limited-size) dimension table to every node for the fact table to use. Each executor stores all of the dimension table's data, so to a certain extent we sacrifice space in exchange for avoiding a lot of time-consuming shuffle work. In Spark SQL this is called a Broadcast Join.
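A hedged sketch of a broadcast join in PySpark (the fact and dimension tables are hypothetical):

    from pyspark.sql import SparkSession
    from pyspark.sql.functions import broadcast

    spark = SparkSession.builder.getOrCreate()

    sales = spark.table("sales")   # large fact table (hypothetical)
    items = spark.table("items")   # small dimension table (hypothetical)

    # Broadcasting the dimension table avoids shuffling the large fact table.
    joined = sales.join(broadcast(items), "item_id")
    joined.explain()   # the physical plan should show a BroadcastHashJoin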