SPARK Interview Questions
7. How do you start developing Spark programs (what is the starting point)?
• Important point to note: you cannot use Scala or Python on a standalone basis to achieve parallel
processing with speed, optimization and performance; for that you need Spark. At the same time, you
cannot use Spark alone to develop programs, so you blend Scala with Spark or Python with Spark.
When we use Scala with Spark we call it Spark-Scala, and when we use Python with Spark we call it
PySpark.
• Coming to development in Spark, you need either a SparkContext or a SparkSession to begin a Spark
program. Both are entry points to Spark; a minimal sketch is shown below.
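A minimal sketch in PySpark (the application name "InterviewDemo" and the local master are illustrative assumptions, not fixed values):

from pyspark.sql import SparkSession

# SparkSession is the unified entry point (Spark 2.0+).
spark = SparkSession.builder \
    .appName("InterviewDemo") \
    .master("local[*]") \
    .getOrCreate()

# The lower-level SparkContext is available from the session.
sc = spark.sparkContext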
9. What is RDD?
• RDD stands for Resilient Distributed Dataset.
• Resilient means the ability to recover from failure.
• It is the fundamental/basic data abstraction in Spark.
• RDDs are immutable; once created, they cannot be changed.
• RDDs are fault tolerant; in case of failure, lost partitions can be recomputed from the parent RDD.
• RDDs were released in the initial version of Spark.
• RDDs are low-level APIs.
• RDDs can be created in different ways in Spark, e.g., using the parallelize and textFile functions on
SparkContext (see the sketch after this list).
• You can perform several operations on top of an RDD after you create it.
• These operations are various transformations of the data, performed to achieve the desired
results.
• Mostly, when data is not structured, you use RDDs.
• When data is structured, you should use the optimized objects in Spark, which are DataFrames/Datasets;
those are high-level APIs.
• When you read a text file, you get an RDD; this RDD is partitioned across the cluster on multiple worker
nodes.
• The immutability feature helps Spark recover from failure: a lost partition is recomputed from its parent
RDD and the computation then moves forward.
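A minimal sketch of the two creation paths (PySpark; "data.txt" is a placeholder path):

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("RDDDemo").master("local[*]").getOrCreate()
sc = spark.sparkContext

# Create an RDD from an in-memory collection.
numbers = sc.parallelize([1, 2, 3, 4, 5])

# Create an RDD from a text file ("data.txt" is a placeholder path).
lines = sc.textFile("data.txt")

# Transformations are lazy and return new, immutable RDDs.
doubled = numbers.map(lambda x: x * 2)
print(doubled.collect())  # [2, 4, 6, 8, 10]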
[Figure: Simple Cluster — the Driver Program (containing the SparkContext) on the Master Node communicates with the Cluster Manager, which allocates Executors on worker nodes; each Executor runs Tasks.]
18. What is the DAG scheduler?
• The DAG scheduler is a scheduling layer in Spark.
• It implements stage-wise scheduling.
• It computes a DAG for each job and finds a minimal schedule to run the job.
• It then submits the stages to the underlying task scheduler, which runs the tasks on the cluster.
• The DAG scheduler transforms the logical execution plan (the lineage) into a physical execution plan
(stages); a sketch showing a stage boundary follows.
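To make the stage split concrete, here is a small sketch (PySpark): reduceByKey introduces a shuffle, so the DAG scheduler splits this job into two stages, which you can see in the lineage printed by toDebugString.

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("DAGDemo").master("local[*]").getOrCreate()
sc = spark.sparkContext

pairs = sc.parallelize([("a", 1), ("b", 1), ("a", 1)])
counts = pairs.reduceByKey(lambda x, y: x + y)  # shuffle => stage boundary

# toDebugString returns the lineage as bytes; the indentation marks the stage boundary.
print(counts.toDebugString().decode("utf-8"))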
2. Actions:
collect: Fetches the results from the executors to the driver and returns them to the driver program (for
example, as a Python list); a short sketch follows.
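For example (a PySpark sketch; the application name is illustrative):

from pyspark.sql import SparkSession

sc = SparkSession.builder.appName("CollectDemo").master("local[*]").getOrCreate().sparkContext

rdd = sc.parallelize(range(5))
result = rdd.collect()  # pulls all partitions back to the driver
print(result)           # [0, 1, 2, 3, 4]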
26. Explain how to parallelize a list step by step with lineage and DAG.
• Step 1: Create the entry point (a SparkContext, directly or via a SparkSession).
• Step 2: Call parallelize on the SparkContext with the list; this distributes the list across the cluster as
an RDD.
• Step 3: Apply transformations (map, filter, etc.); each one extends the lineage, but nothing executes yet.
• Step 4: Inspect the lineage with toDebugString; this is the logical plan Spark uses to recover lost
partitions.
• Step 5: Call an action (e.g., collect); only now does the DAG scheduler turn the lineage into a DAG of
stages and submit them to the task scheduler.
DAG: parallelize -> transformations -> action.
A worked sketch of these steps follows.
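A worked sketch of the steps above (PySpark; the application name is illustrative):

from pyspark.sql import SparkSession

# Step 1: entry point.
spark = SparkSession.builder.appName("ParallelizeDemo").master("local[*]").getOrCreate()
sc = spark.sparkContext

# Step 2: parallelize a list into an RDD.
rdd = sc.parallelize([1, 2, 3, 4, 5])

# Step 3: lazy transformations extend the lineage; nothing runs yet.
evens = rdd.filter(lambda x: x % 2 == 0)
squared = evens.map(lambda x: x * x)

# Step 4: inspect the lineage (the logical plan used for recovery).
print(squared.toDebugString().decode("utf-8"))

# Step 5: an action triggers the DAG scheduler to build stages and run tasks.
print(squared.collect())  # [4, 16]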