SPARK Interview Questions

Spark is a big data processing engine that distributes data across partitions to process them in parallel. It was initially released around 2014 and handed over to the Apache Software Foundation. Spark is faster than MapReduce because it processes data in memory instead of writing intermediate results to disk, providing 10-100 times faster execution.

1. What is Spark?

• Spark is a big data processing engine.


• It distributes the data in partitions and processes them in parallel.
• Spark was initially released around 2014.
• After its initial development, Spark was handed over to the Apache Software Foundation, an open-source community, and since then it has been known as Apache Spark.
• Spark was a game changer among big data frameworks as it overcame the limitations of MapReduce by processing data in-memory, thereby providing 10-100 times faster execution.
• Spark itself has been developed using the Scala programming language.
• Development in Spark can be done using Scala, Python, R and SQL.

2. What are features of Spark?


• Parallel processing: Spark distributes data into partitions and processes them in parallel.
• In-memory processing: Spark processes data in memory, which delivers execution up to 10-100 times faster compared to MapReduce.
• Fault tolerance: Spark is fault-tolerant; it has the ability to recover from failures.
• Immutability: It is one of the core features built into Spark, due to which once an RDD is created it cannot be changed.
• Lazy evaluation: Spark doesn't start execution until an action is called. Spark waits until an action is called so it can optimize the execution by combining transformations into a single step or re-arranging the sequence of transformations.

3. How is Spark fast compared to MapReduce?


• Firstly, note that MapReduce is a processing solution in the Hadoop framework.
• The problem with MapReduce is that you need to write intermediate results to disk and read them again for the next step/transformation. This consumes a lot of time, and the cost grows as the data size increases.
• Hence, Spark was developed to overcome these limitations.
• Once Spark reads the data, it processes everything in memory, no matter how many intermediate steps/transformations there are, and writes the result back at the end. Since there is no need to write intermediate results to disk and processing happens in memory, it is a game changer and you eventually get 10-100 times faster execution.
4. Which language is used to develop Spark?
• Most of Spark itself has been developed using the Scala programming language.

5. Which different languages can be used to develop Spark programs?


• Spark development can be done using Scala, Python, R and SQL.
• Scala remains the top choice; however, with Python being popular and widely adopted, an API was released for Spark development in Python, which was named PySpark.
• Going forward, you will find more support for SQL in Spark compared to the past, which is good news for those who love SQL and want to avoid programming; however, not everything can be done with SQL (see the sketch below).
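
As a quick illustration of the same logic written with the Python (PySpark) API and with SQL, here is a minimal sketch; the data, column names and view name are made up for illustration.

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("LanguageDemo").getOrCreate()

    # Hypothetical sample data
    df = spark.createDataFrame([("alice", 34), ("bob", 45)], ["name", "age"])

    # DataFrame API in Python
    df.filter(df.age > 40).show()

    # The same query expressed in SQL via a temporary view
    df.createOrReplaceTempView("people")
    spark.sql("SELECT name, age FROM people WHERE age > 40").show()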

6. Which architecture does Spark follow?


• Spark follows a Master-Slave Architecture.
• A master slave architecture consists of a Master node and multiple slave nodes.
• Together these nodes form a cluster.
• The master node is also called the primary node.
• The slave nodes are also called worker nodes.
• The most important part in Spark is to distribute the data and process it in parallel.
• The master node is responsible for distributing the data to slave nodes and managing several other operations in the cluster.
• The master node doesn't process the data; it is the worker nodes that process the data in parallel across the cluster.

[Figure: Simple cluster with a master node and multiple worker nodes]
7. How do you start developing spark programs (what is the starting point)?
• Important point to note: you cannot use Scala or Python on a standalone basis to achieve parallel processing with speed, optimizations and performance; you need Spark in these cases. At the same time, you cannot use Spark alone to develop programs, hence you need to blend Scala with Spark or Python with Spark. When we use Scala with Spark we call it Spark-Scala, and when we use Python with Spark we call it PySpark.
• Coming to beginning development in Spark, you need either a SparkContext or a SparkSession to begin a Spark program. Both are entry points to Spark.

8. Explain SparkContext and SparkSession.


• Spark is all about parallel processing.
• SparkContext is an entry point to Spark. It was released in version 1.
• It exposes various operations to interact with Spark.
• Functions like parallelize and textFile enable users to process the data in parallel and return objects called RDDs.
• SparkSession was introduced in Version 2 of Spark. It is also an entry point to Spark.
• The best part about SparkSession is that it includes SparkContext, SQLContext, HiveContext and StreamingContext, which was not the case in Version 1 where all of these were separate (see the sketch below).
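
A minimal PySpark sketch of the two entry points; the application name is arbitrary.

    from pyspark.sql import SparkSession

    # SparkSession (Spark 2.x onwards) is the unified entry point
    spark = SparkSession.builder.appName("EntryPointDemo").getOrCreate()

    # The SparkContext from Version 1 is still available underneath it
    sc = spark.sparkContext

    # SparkContext exposes RDD-creating functions such as parallelize and textFile
    rdd = sc.parallelize([1, 2, 3, 4, 5])
    print(rdd.count())  # 5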

9. What is RDD?
• RDD stands for Resilient Distributed Dataset.
• Resilient means ability to recover from failure.
• It is the fundamental/basic unit of storage in Spark.
• RDD’s are immutable, once created cannot be changed.
• RDD’s are fault tolerant, in case of failure they can recover from parent RDD.
• RDD’s were released in initial version of Spark.
• RDD’s are low level Api’s.
• RDD can be created using different ways in Spark using functions parallelize and textFile in
SparkContext.
• You can perform several operations on top of RDD after you create it.
• These operations are nothing but various transformations on data that can be performed to achieve
desired results.
• Mostly when data is not structed, you use RDD’s.
• When data is structured should use optimized objects in Spark which are DataFrames/Datasets and those
are high level Api’s.
• When you read a textfile, you get RDD, this RDD is partitioned across the cluster in multiple worker
nodes.
• The immutability feature helps spark to recover from failure by taking the parent RDD and then moving
further.
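
A minimal sketch of creating RDDs, assuming the SparkSession spark from the earlier sketch; the list and the file path /tmp/sample.txt are purely illustrative.

    sc = spark.sparkContext

    # RDD from an in-memory collection, split into 4 partitions
    numbers = sc.parallelize([1, 2, 3, 4, 5, 6], numSlices=4)

    # RDD from a text file, one element per line (the path is hypothetical)
    lines = sc.textFile("/tmp/sample.txt")

    print(numbers.getNumPartitions())               # 4
    print(numbers.map(lambda x: x * 10).collect())  # [10, 20, 30, 40, 50, 60]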

10. What are different types of operations performed on RDD?


• There are 2 types of operations performed on RDD’s: Transformations and Actions.
• map, flatMap, reduceByKey, etc. are transformations.
• count, collect, etc. are actions.
• Transformations are not executed physically until an action is called, which makes transformations lazy (see the sketch below).
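
A short PySpark illustration of the two operation types, assuming sc from the earlier sketches; the numbers are made up.

    nums = sc.parallelize([1, 2, 3, 4, 5, 6])

    # Transformations: only describe the computation, nothing runs yet
    evens = nums.filter(lambda x: x % 2 == 0)
    doubled = evens.map(lambda x: x * 2)

    # Actions: trigger execution and bring results back to the driver
    print(doubled.count())    # 3
    print(doubled.collect())  # [4, 8, 12]
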
11. Explain shuffle in spark.
• When data needs to be redistributed or re-copied across the worker nodes, this process is called a shuffle.
• It is best to avoid it, as it involves additional computation and time. However, at times we don't have an option to avoid it, so we try to minimize it.
• When data is distributed across nodes, the processing starts where transformations are to be applied.
• Some transformations can be processed within the worker node itself and don't need data to be redistributed; hence there is no shuffle. Such transformations are called narrow transformations.
• Some transformations require redistribution of data and invoke a shuffle; they are called wide transformations.
• Transformations like map and flatMap are narrow transformations.
• Transformations like reduceByKey and groupByKey are wide transformations (see the sketch below).
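
A minimal sketch contrasting a narrow and a wide transformation, assuming sc as before; the key-value pairs are illustrative.

    pairs = sc.parallelize([("a", 1), ("b", 2), ("a", 3), ("b", 4)])

    # Narrow: map works inside each partition, no data movement
    doubled = pairs.map(lambda kv: (kv[0], kv[1] * 2))

    # Wide: reduceByKey must gather all values of a key together, so it triggers a shuffle
    summed = doubled.reduceByKey(lambda a, b: a + b)

    print(summed.collect())  # e.g. [('a', 8), ('b', 12)]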

12. What is lazy evaluation in spark?


• Lazy evaluation is one of the important features in Spark.
• There are 2 kinds of operations in Spark: Transformations and Actions.
• Spark doesn't start execution until an action is called on a transformation, so the execution is delayed.
• Spark does so to achieve optimizations wherever possible, instead of immediately running each step or transformation (see the sketch below).
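
A small sketch of lazy evaluation, assuming sc as before; nothing executes until the action on the last line.

    rdd = sc.parallelize(range(1, 1001))

    # Transformations are only recorded in the lineage here
    evens = rdd.filter(lambda x: x % 2 == 0)
    squares = evens.map(lambda x: x * x)

    # Execution actually starts only when this action is called
    print(squares.take(5))  # [4, 16, 36, 64, 100]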

13. What is a Spark application?


• A Spark application is nothing but a program written by a user/developer using any of the supported languages: Scala, Python, R or SQL.

14. What is a lineage graph?


• A lineage graph is the flow of operations on Spark RDDs.
• It lists the dependencies between RDDs.
• When you apply a transformation on an RDD you get a new RDD, and so on; in case there is a failure in any step, the RDD is recovered from its parent RDD. This makes Spark fault tolerant and resilient.
• You can access the lineage by using the toDebugString method (see the sketch below).
• It is part of the logical plan.
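
A minimal sketch of viewing the lineage with toDebugString, assuming sc as before; the input strings are made up.

    rdd = sc.parallelize(["a b", "b c", "c a"])
    counts = (rdd.flatMap(lambda line: line.split(" "))
                 .map(lambda w: (w, 1))
                 .reduceByKey(lambda a, b: a + b))

    # Prints the chain of parent RDDs (the lineage) of this RDD
    print(counts.toDebugString().decode("utf-8"))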

15. What are Dataframes and Datasets?


• DataFrames and Datasets are high-level APIs in Spark, as compared to RDDs which are low-level APIs.
• Both have richer optimizations under the hood, and Spark's optimized SQL execution engine is one of their major benefits.
• Both are table-like structures, similar to a table in a database.
• DataFrames are available in both Scala and Python.
• Datasets are only available in Scala and not in Python, but Python already has most of their features available through DataFrames.
• DataFrames are built on top of RDDs (see the sketch below).
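
A minimal DataFrame sketch in PySpark, assuming the SparkSession spark from earlier; the schema and rows are made up.

    df = spark.createDataFrame(
        [("alice", 34), ("bob", 45)],
        ["name", "age"],
    )

    df.printSchema()
    df.where(df.age > 40).select("name").show()

    # A DataFrame sits on top of an RDD of Row objects
    print(df.rdd.take(1))
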
16. What is DAG?
• DAG stands for Directed Acyclic Graph.
• Directed means operations are executed in an order.
• Acyclic means there are no loops or cycles.
• The graph shows the flow of operations.
• The DAG can be seen in the Spark UI.
• The DAG consists of stages and their details.

17. What is a Spark driver?


• The Spark driver is a process launched on the master node.
• When a job is triggered in Spark, the Spark driver is launched first.
• Then the Spark driver creates the SparkContext.
• The SparkContext passes the program to the driver.
• The driver then creates the DAG.
• Next, the driver splits the application into tasks and schedules them to run on executors.
• The driver coordinates with the executors, and the executors report their status to the driver.

[Figure: the master node hosts the driver program and SparkContext, which communicates with the cluster manager; each worker node runs an executor, and each executor runs tasks.]
18. What is DAG scheduler?
• DAG scheduler is a scheduling layer in spark.
• It implements stage wise scheduling.
• It computes a DAG for each job and finds the minimum schedule to run the job.
• It then submits the stages to the underlying Task Scheduler to run the tasks on the cluster.
• The DAG scheduler transforms a logical execution plan (lineage) into a physical execution plan (stages).

19. What is Task Scheduler?


• Task scheduler is responsible for scheduling of tasks.
• DAG scheduler submits tasks to Task Scheduler.
• It launches execution of tasks via cluster manager on executors.
• Spark context creates the task scheduler.

20. What are Spark deployment modes?


• Firstly, understand what Spark deployment is. Spark deployment is nothing but submitting your code for execution. When you are using Hadoop, you need to submit the code using the spark-submit command. However, this is not needed when you are using Databricks, as it has been made easy: you just need to start your cluster in Databricks and run your code.
• Coming to deployment modes: there are 2 deploy modes in Spark: Client and Cluster.
• When the Spark driver is launched on a client machine outside the cluster, it is called client mode.
• When the Spark driver is launched within the cluster, it is called cluster mode (see the example below).
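
For reference, a hedged example of choosing the deploy mode with spark-submit; the master, resource sizes and application file (my_app.py) are placeholders, not values from this document.

    spark-submit \
      --master yarn \
      --deploy-mode cluster \
      --num-executors 4 \
      --executor-memory 2g \
      my_app.py

    # Use --deploy-mode client to keep the driver on the submitting machine instead.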

21. Explain execution process in Spark.


• When execution is triggered for a Spark application, it goes through several operations.
• Once your application code is ready you trigger the execution.
• Once execution is triggered (a Spark job is triggered), the Spark driver is launched.
• Your application code is basically a set of instructions which is sent to the driver for execution.
• The driver creates the SparkContext.
• The SparkContext passes the application code to the driver.
• The driver builds the DAG using the lineage.
• Based on narrow and wide transformations, the stages in the DAG are created; each stage is divided into tasks.
• Now, the driver doesn't perform the execution of the code; the execution is supposed to be done on the worker nodes to achieve parallelism.
• Here, the cluster manager comes into the picture.
• The driver requests the cluster manager for resources to execute the code.
• The cluster manager starts executors on the worker nodes.
• The driver schedules the tasks to run on executors through the task scheduler.
• An executor is a JVM process.
• Executors register themselves with the driver; hence the driver is aware of how many executors are running.
• Executors are the actual entities which perform the execution.
• The driver divides the entire process into the smallest units of execution, called tasks.
• The driver creates the logical and physical plans.
• After the physical plan is generated, the driver sends the tasks to the executors.
• Tasks run on executors; once completed, results are returned to the driver.
• In the end, Spark releases all resources (executors) back to the cluster manager.
22. What is heartbeat in executor?
• A heartbeat is a signal or message sent from an executor to the driver.
• It is in place to convey that the executor is in working condition (liveness).
• So, the executor is supposed to send a heartbeat to the driver after a specific interval.
• The spark.executor.heartbeatInterval property defines the heartbeat interval (see the sketch below).
• In case the driver doesn't receive a heartbeat within this interval, the executor is marked as failed and its tasks are allocated to another executor.
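
A minimal sketch of setting this property when building the session; the 20s value is purely illustrative (the default is 10s), and the app name is arbitrary.

    from pyspark.sql import SparkSession

    spark = (SparkSession.builder
             .appName("HeartbeatDemo")                           # illustrative app name
             .config("spark.executor.heartbeatInterval", "20s")  # illustrative interval
             .getOrCreate())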

23. How does spark break down the execution?


• Remember spark doesn’t start the execution unless an action is called on RDD.
• Once an action is called Job is triggered which is divided into stages depending on the transformations
namely narrow and wide.
• Each stage is further divided into tasks.

Job → Stages → Tasks


• The number of actions on an RDD defines the number of jobs.
• So, a Spark application can have one to many jobs, each job can have one to many stages, and each stage can have one to many tasks (see the sketch below).
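
A small sketch to make the breakdown concrete, assuming sc as before; each action triggers its own job, and the reduceByKey adds a stage boundary within each job.

    rdd = sc.parallelize(["a", "b", "a", "c"])
    counts = rdd.map(lambda w: (w, 1)).reduceByKey(lambda x, y: x + y)

    counts.count()    # action 1 -> job 1 (stages split at the shuffle)
    counts.collect()  # action 2 -> job 2

    # The jobs, their stages and their tasks can be inspected in the Spark UI
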
24. Which are commonly used operations on RDD?
1. Transformations:

map: Guaranteed single output for every single input.

flatMap: One to many outputs for every single input.

reduceByKey: Accumulates values for each key.

2. Actions:

collect: Fetches the results from the executors to the driver and returns them.

take: Extracts the first n elements from the RDD (see the sketch below).
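
A minimal sketch tying these operations together, assuming sc as before; the input lines are made up.

    lines = sc.parallelize(["spark is fast", "spark is lazy"])

    words = lines.flatMap(lambda line: line.split(" "))  # one line -> many words
    pairs = words.map(lambda w: (w, 1))                  # one word -> exactly one pair
    counts = pairs.reduceByKey(lambda a, b: a + b)       # accumulate counts per key

    print(counts.collect())  # everything back on the driver
    print(counts.take(2))    # just the first 2 elements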

25. Who assists to parallelize data in spark?


1. SparkContext assists in parallelizing data in Spark.
2. SparkContext exposes functions like parallelize and textFile which distribute the data and process it in parallel.
3. The parallelize function takes in a collection like a list or an array, processes it in parallel and returns an RDD.
4. The textFile function takes the filepath of a text file, processes it in parallel and returns an RDD.

26. Explain how to parallelize a list step by step with lineage and DAG.
[Figures: Spark UI DAG views, including one for a textFile example]
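
Since the step-by-step walkthrough is not reproduced here, the following is a minimal sketch of one way to parallelize a list and inspect its lineage, assuming sc as before; the list is illustrative, and the resulting job's DAG can then be viewed in the Spark UI.

    data = ["spark", "rdd", "spark", "dag"]

    # Step 1: parallelize the list into an RDD with 2 partitions
    rdd = sc.parallelize(data, numSlices=2)

    # Step 2: apply transformations (recorded in the lineage, not executed yet)
    counts = rdd.map(lambda w: (w, 1)).reduceByKey(lambda a, b: a + b)

    # Step 3: inspect the lineage
    print(counts.toDebugString().decode("utf-8"))

    # Step 4: call an action; the job, its stages and its DAG appear in the Spark UI
    print(counts.collect())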
