SPARK Interview Questions

Spark is a big data processing engine that distributes data across partitions to process them in parallel. It was initially released around 2014 and handed over to the Apache Software Foundation. Spark is faster than MapReduce because it processes data in memory instead of writing intermediate results to disk, providing 10-100 times faster execution.

1. What is Spark?

• Spark is a big data processing engine.


• It distributes the data in partitions and processes them in parallel.
• Spark was initially released around 2014.
• After its initial development, Spark was handed over to the Apache Software Foundation, an open-source community, and since then it has been known as Apache Spark.
• Spark was a game changer among big data frameworks as it overcame the limitations of MapReduce by processing data in-memory, thereby providing 10-100 times faster execution.
• Spark itself has been developed using the Scala programming language.
• Development in Spark can be done using Scala, Python, R and SQL.

2. What are features of Spark?


• Parallel processing: Spark distributes data into partitions and processes them in parallel.
• In-memory processing: Spark processes data in memory, which delivers execution up to 10-100 times faster compared to MapReduce.
• Fault tolerance: Spark is fault-tolerant; it has the ability to recover from failures.
• Immutability: It is one of the core features built into Spark, due to which once an RDD is created it cannot be changed.
• Lazy evaluation: Spark doesn't start execution until an action is called. Spark waits until an action is called so it can optimize the execution by combining transformations into a single step or re-arranging the sequence of transformations.

3. How is Spark fast compared to MapReduce?


• Firstly, note that MapReduce is a processing solution in the Hadoop framework.
• The problem with MapReduce is that you need to write intermediate results to disk and read them again for the next step/transformation. This consumes a lot of time, and the cost grows as the data size increases.
• Hence, Spark was developed to overcome these limitations.
• Once Spark reads the data, it processes everything in memory, no matter how many intermediate steps/transformations there are, and writes the result back at the end. Since there is no need to write intermediate results to disk and processing happens in memory, it is a game changer and you eventually get 10-100 times faster execution.
4. Which language is used to develop Spark?
• Most of Spark itself has been developed using the Scala programming language.

5. Which different languages can be used to develop Spark programs?


• Spark development can be done using Scala, Python, R and SQL.
• Scala remains the top choice; however, with Python being popular and widely adopted, an API was released for Spark development in Python, which was named PySpark.
• Going forward, you will find more support for SQL in Spark compared to the past, which is good news for those who love SQL and want to avoid programming; however, not everything can be done with SQL (see the sketch below).
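
As a quick illustration of the same logic written with the Python (PySpark) API and with SQL, here is a minimal sketch; the data, column names and view name are made up for illustration.

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("LanguageDemo").getOrCreate()

    # Hypothetical sample data
    df = spark.createDataFrame([("alice", 34), ("bob", 45)], ["name", "age"])

    # DataFrame API in Python
    df.filter(df.age > 40).show()

    # The same query expressed in SQL via a temporary view
    df.createOrReplaceTempView("people")
    spark.sql("SELECT name, age FROM people WHERE age > 40").show()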

6. Which architecture does Spark follow?


• Spark follows a Master-Slave Architecture.
• A master slave architecture consists of a Master node and multiple slave nodes.
• Together these nodes form a cluster.
• The master node is also called the primary node.
• The slave nodes are also called worker nodes.
• The most important part in Spark is to distribute the data and process it in parallel.
• The master node is responsible for distributing the data to slave nodes and managing several other operations in the cluster.
• The master node doesn't process the data; it is the worker nodes that process the data in parallel across the cluster.

[Figure: Simple cluster with a master node and multiple worker nodes]
7. How do you start developing spark programs (what is the starting point)?
• Important point to note: you cannot use Scala or Python on a standalone basis to achieve parallel processing with speed, optimizations and performance; you need Spark in these cases. At the same time, you cannot use Spark alone to develop programs, hence you need to blend Scala with Spark or Python with Spark. When we use Scala with Spark we call it Spark-Scala, and when we use Python with Spark we call it PySpark.
• Coming to beginning development in Spark, you need either a SparkContext or a SparkSession to begin a Spark program. Both are entry points to Spark.

8. Explain SparkContext and SparkSession.


• Spark is all about parallel processing.
• SparkContext is an entry point to Spark. It was released in version 1.
• It exposes various operations to interact with Spark.
• Functions like parallelize and textFile enable users to process the data in parallel and return objects called RDDs.
• SparkSession was introduced in Version 2 of Spark. It is also an entry point to Spark.
• The best part about SparkSession is that it includes SparkContext, SQLContext, HiveContext and StreamingContext, which was not the case in Version 1 where all of these were separate (see the sketch below).
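
A minimal PySpark sketch of the two entry points; the application name is arbitrary.

    from pyspark.sql import SparkSession

    # SparkSession (Spark 2.x onwards) is the unified entry point
    spark = SparkSession.builder.appName("EntryPointDemo").getOrCreate()

    # The SparkContext from Version 1 is still available underneath it
    sc = spark.sparkContext

    # SparkContext exposes RDD-creating functions such as parallelize and textFile
    rdd = sc.parallelize([1, 2, 3, 4, 5])
    print(rdd.count())  # 5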

9. What is RDD?
• RDD stands for Resilient Distributed Dataset.
• Resilient means ability to recover from failure.
• It is the fundamental/basic unit of storage in Spark.
• RDD’s are immutable, once created cannot be changed.
• RDD’s are fault tolerant, in case of failure they can recover from parent RDD.
• RDD’s were released in initial version of Spark.
• RDD’s are low level Api’s.
• RDD can be created using different ways in Spark using functions parallelize and textFile in
SparkContext.
• You can perform several operations on top of RDD after you create it.
• These operations are nothing but various transformations on data that can be performed to achieve
desired results.
• Mostly when data is not structed, you use RDD’s.
• When data is structured should use optimized objects in Spark which are DataFrames/Datasets and those
are high level Api’s.
• When you read a textfile, you get RDD, this RDD is partitioned across the cluster in multiple worker
nodes.
• The immutability feature helps spark to recover from failure by taking the parent RDD and then moving
further.
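
A minimal sketch of creating RDDs, assuming the SparkSession spark from the earlier sketch; the list and the file path /tmp/sample.txt are purely illustrative.

    sc = spark.sparkContext

    # RDD from an in-memory collection, split into 4 partitions
    numbers = sc.parallelize([1, 2, 3, 4, 5, 6], numSlices=4)

    # RDD from a text file, one element per line (the path is hypothetical)
    lines = sc.textFile("/tmp/sample.txt")

    print(numbers.getNumPartitions())               # 4
    print(numbers.map(lambda x: x * 10).collect())  # [10, 20, 30, 40, 50, 60]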

10. What are different types of operations performed on RDD?


• There are 2 types of operations performed on RDD’s: Transformations and Actions.
• map, flatMap, reduceByKey, etc. are transformations.
• count, collect, etc. are actions.
• Transformations are not executed physically until an action is called, which makes transformations lazy (see the sketch below).
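
A short PySpark illustration of the two operation types, assuming sc from the earlier sketches; the numbers are made up.

    nums = sc.parallelize([1, 2, 3, 4, 5, 6])

    # Transformations: only describe the computation, nothing runs yet
    evens = nums.filter(lambda x: x % 2 == 0)
    doubled = evens.map(lambda x: x * 2)

    # Actions: trigger execution and bring results back to the driver
    print(doubled.count())    # 3
    print(doubled.collect())  # [4, 8, 12]
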
11. Explain shuffle in spark.
• When data needs to be redistributed or re-copied across the worker nodes, this process is called a shuffle.
• It is best to avoid it, as it involves additional computation and time. However, at times we don't have an option to avoid it, so we try to minimize it.
• When data is distributed across nodes, the processing starts where transformations are to be applied.
• Some transformations can be processed within the worker node itself and don't need data to be redistributed; hence there is no shuffle. Such transformations are called narrow transformations.
• Some transformations require redistribution of data and invoke a shuffle; they are called wide transformations.
• Transformations like map and flatMap are narrow transformations.
• Transformations like reduceByKey and groupByKey are wide transformations (see the sketch below).
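
A minimal sketch contrasting a narrow and a wide transformation, assuming sc as before; the key-value pairs are illustrative.

    pairs = sc.parallelize([("a", 1), ("b", 2), ("a", 3), ("b", 4)])

    # Narrow: map works inside each partition, no data movement
    doubled = pairs.map(lambda kv: (kv[0], kv[1] * 2))

    # Wide: reduceByKey must gather all values of a key together, so it triggers a shuffle
    summed = doubled.reduceByKey(lambda a, b: a + b)

    print(summed.collect())  # e.g. [('a', 8), ('b', 12)]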

12. What is lazy evaluation in spark?


• Lazy evaluation is one of the important features in Spark.
• There are 2 kinds of operations in Spark: Transformations and Actions.
• Spark doesn't start execution until an action is called on a transformation, so the execution is delayed.
• Spark does so to achieve optimizations wherever possible, instead of immediately running each step or transformation (see the sketch below).
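
A small sketch of lazy evaluation, assuming sc as before; nothing executes until the action on the last line.

    rdd = sc.parallelize(range(1, 1001))

    # Transformations are only recorded in the lineage here
    evens = rdd.filter(lambda x: x % 2 == 0)
    squares = evens.map(lambda x: x * x)

    # Execution actually starts only when this action is called
    print(squares.take(5))  # [4, 16, 36, 64, 100]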

13. What is a Spark application?


• A Spark application is nothing but a program written by a user/developer using any of the supported languages: Scala, Python, R or SQL.

14. What is a lineage graph?


• A lineage graph is the flow of operations on Spark RDDs.
• It lists the dependencies between RDDs.
• When you apply a transformation on an RDD you get a new RDD, and so on; in case there is a failure in any step, the RDD is recovered from its parent RDD. This makes Spark fault tolerant and resilient.
• You can access the lineage by using the toDebugString method (see the sketch below).
• It is part of the logical plan.
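
A minimal sketch of viewing the lineage with toDebugString, assuming sc as before; the input strings are made up.

    rdd = sc.parallelize(["a b", "b c", "c a"])
    counts = (rdd.flatMap(lambda line: line.split(" "))
                 .map(lambda w: (w, 1))
                 .reduceByKey(lambda a, b: a + b))

    # Prints the chain of parent RDDs (the lineage) of this RDD
    print(counts.toDebugString().decode("utf-8"))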

15. What are Dataframes and Datasets?


• DataFrames and Datasets are high-level APIs in Spark, as compared to RDDs which are low-level APIs.
• Both have richer optimizations under the hood, and Spark's optimized SQL execution engine is one of their major benefits.
• Both are table-like structures, similar to a table in a database.
• DataFrames are available in both Scala and Python.
• Datasets are only available in Scala and not in Python, but Python already has most of their features available through DataFrames.
• DataFrames are built on top of RDDs (see the sketch below).
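
A minimal DataFrame sketch in PySpark, assuming the SparkSession spark from earlier; the schema and rows are made up.

    df = spark.createDataFrame(
        [("alice", 34), ("bob", 45)],
        ["name", "age"],
    )

    df.printSchema()
    df.where(df.age > 40).select("name").show()

    # A DataFrame sits on top of an RDD of Row objects
    print(df.rdd.take(1))
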
16. What is DAG?
• DAG stands for Directed Acyclic Graph.
• Directed means operations are executed in an order.
• Acyclic means there are no loops or cycles.
• The graph shows the flow of operations.
• The DAG can be seen in the Spark UI.
• The DAG consists of stages and their details.

17. What is a Spark driver?


• The Spark driver is a process launched on the master node.
• When a job is triggered in Spark, the Spark driver is launched first.
• Then the Spark driver creates the SparkContext.
• The SparkContext passes the program to the driver.
• The driver then creates the DAG.
• Next, the driver splits the application into tasks and schedules them to run on executors.
• The driver coordinates with the executors, and the executors report their status to the driver.

[Figure: the master node hosts the driver program and SparkContext, which communicates with the cluster manager; each worker node runs an executor, and each executor runs tasks.]
18. What is DAG scheduler?
• DAG scheduler is a scheduling layer in spark.
• It implements stage wise scheduling.
• It computes a DAG for each job and finds the minimum schedule to run the job.
• It then submits the stages to the underlying Task Scheduler to run the tasks on the cluster.
• The DAG scheduler transforms a logical execution plan (lineage) into a physical execution plan (stages).

19. What is Task Scheduler?


• Task scheduler is responsible for scheduling of tasks.
• DAG scheduler submits tasks to Task Scheduler.
• It launches execution of tasks via cluster manager on executors.
• Spark context creates the task scheduler.

20. What are Spark deployment modes?


• Firstly, understand what Spark deployment is. Spark deployment is nothing but submitting your code for execution. When you are using Hadoop, you need to submit the code using the spark-submit command. However, this is not needed when you are using Databricks, as it has been made easy: you just need to start your cluster in Databricks and run your code.
• Coming to deployment modes: there are 2 deploy modes in Spark: Client and Cluster.
• When the Spark driver is launched on a client machine outside the cluster, it is called client mode.
• When the Spark driver is launched within the cluster, it is called cluster mode (see the example below).
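
For reference, a hedged example of choosing the deploy mode with spark-submit; the master, resource sizes and application file (my_app.py) are placeholders, not values from this document.

    spark-submit \
      --master yarn \
      --deploy-mode cluster \
      --num-executors 4 \
      --executor-memory 2g \
      my_app.py

    # Use --deploy-mode client to keep the driver on the submitting machine instead.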

21. Explain execution process in Spark.


• When execution is triggered for a Spark application, it goes through several operations.
• Once your application code is ready you trigger the execution.
• Once execution is triggered (a Spark job is triggered), the Spark driver is launched.
• Your application code is basically a set of instructions which is sent to the driver for execution.
• The driver creates the SparkContext.
• The SparkContext passes the application code to the driver.
• The driver builds the DAG using the lineage.
• Based on narrow and wide transformations, the stages in the DAG are created; each stage is divided into tasks.
• Now, the driver doesn't perform the execution of the code; the execution is supposed to be done on the worker nodes to achieve parallelism.
• Here, the cluster manager comes into the picture.
• The driver requests the cluster manager for resources to execute the code.
• The cluster manager starts executors on the worker nodes.
• The driver schedules the tasks to run on executors through the task scheduler.
• An executor is a JVM process.
• Executors register themselves with the driver; hence the driver is aware of how many executors are running.
• Executors are the actual entities which perform the execution.
• The driver divides the entire process into the smallest units of execution, called tasks.
• The driver creates the logical and physical plans.
• After the physical plan is generated, the driver sends the tasks to the executors.
• Tasks run on executors; once completed, results are returned to the driver.
• In the end, Spark releases all resources (executors) back to the cluster manager.
22. What is heartbeat in executor?
• A heartbeat is a signal or message sent from an executor to the driver.
• It is in place to convey that the executor is in working condition (liveness).
• So, the executor is supposed to send a heartbeat to the driver after a specific interval.
• The spark.executor.heartbeatInterval property defines the heartbeat interval (see the sketch below).
• In case the driver doesn't receive a heartbeat within this interval, the executor is marked as failed and its tasks are allocated to another executor.
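
A minimal sketch of setting this property when building the session; the 20s value is purely illustrative (the default is 10s), and the app name is arbitrary.

    from pyspark.sql import SparkSession

    spark = (SparkSession.builder
             .appName("HeartbeatDemo")                           # illustrative app name
             .config("spark.executor.heartbeatInterval", "20s")  # illustrative interval
             .getOrCreate())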

23. How does spark break down the execution?


• Remember spark doesn’t start the execution unless an action is called on RDD.
• Once an action is called Job is triggered which is divided into stages depending on the transformations
namely narrow and wide.
• Each stage is further divided into tasks.

Job → Stages → Tasks


• The number of actions on an RDD defines the number of jobs.
• So, a Spark application can have one to many jobs, each job can have one to many stages, and each stage can have one to many tasks (see the sketch below).
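
A small sketch to make the breakdown concrete, assuming sc as before; each action triggers its own job, and the reduceByKey adds a stage boundary within each job.

    rdd = sc.parallelize(["a", "b", "a", "c"])
    counts = rdd.map(lambda w: (w, 1)).reduceByKey(lambda x, y: x + y)

    counts.count()    # action 1 -> job 1 (stages split at the shuffle)
    counts.collect()  # action 2 -> job 2

    # The jobs, their stages and their tasks can be inspected in the Spark UI
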
24. Which are commonly used operations on RDD?
1. Transformations:

map: Guaranteed single output for every single input.

flatMap: One to many outputs for every single input.

reduceByKey: Accumulates values for each key.

2. Actions:

collect: Fetches the results from the executors to the driver and returns them.

take: Extracts the first n elements from the RDD (see the sketch below).
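
A minimal sketch tying these operations together, assuming sc as before; the input lines are made up.

    lines = sc.parallelize(["spark is fast", "spark is lazy"])

    words = lines.flatMap(lambda line: line.split(" "))  # one line -> many words
    pairs = words.map(lambda w: (w, 1))                  # one word -> exactly one pair
    counts = pairs.reduceByKey(lambda a, b: a + b)       # accumulate counts per key

    print(counts.collect())  # everything back on the driver
    print(counts.take(2))    # just the first 2 elements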

25. Who assists to parallelize data in spark?


1. SparkContext assists in parallelizing data in Spark.
2. SparkContext exposes functions like parallelize and textFile which distribute the data and process it in parallel.
3. The parallelize function takes in a collection like a list or an array, processes it in parallel and returns an RDD.
4. The textFile function takes the filepath of a text file, processes it in parallel and returns an RDD.

26. Explain how to parallelize a list step by step with lineage and DAG.
[Figures: Spark UI DAG views, including one for a textFile example]
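
Since the step-by-step walkthrough is not reproduced here, the following is a minimal sketch of one way to parallelize a list and inspect its lineage, assuming sc as before; the list is illustrative, and the resulting job's DAG can then be viewed in the Spark UI.

    data = ["spark", "rdd", "spark", "dag"]

    # Step 1: parallelize the list into an RDD with 2 partitions
    rdd = sc.parallelize(data, numSlices=2)

    # Step 2: apply transformations (recorded in the lineage, not executed yet)
    counts = rdd.map(lambda w: (w, 1)).reduceByKey(lambda a, b: a + b)

    # Step 3: inspect the lineage
    print(counts.toDebugString().decode("utf-8"))

    # Step 4: call an action; the job, its stages and its DAG appear in the Spark UI
    print(counts.collect())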
