Apache Spark IQ
INTERVIEW QUESTIONS
ABHINANDAN PATRA
DATA ENGINEER
1. What is Apache Spark?
Apache Spark is an open-source, distributed computing system that provides an interface for
programming entire clusters with implicit data parallelism and fault tolerance. It is designed
to process large-scale data efficiently.
Apache Spark is used because it is faster than traditional big data tools like Hadoop
MapReduce due to its in-memory processing capabilities, supports multiple languages (Scala,
Python, R, Java), provides libraries for various tasks (SQL, machine learning, graph
processing, etc.), and has robust fault tolerance.
The main components of the Spark ecosystem are:
Spark Core: The foundational engine for large-scale parallel and distributed data
processing.
Spark SQL: For structured data processing.
Spark Streaming: For real-time data processing.
MLlib: A library for scalable machine learning.
GraphX: For graph and graph-parallel computation.
Spark Core is the general execution engine for the Spark platform, responsible for tasks such
as scheduling, distributing, and monitoring applications.
Spark provides APIs in the following languages:
Scala
Python
Java
R
SQL
Spark improves on Hadoop MapReduce in several ways: faster processing through in-memory
computation, ease of use with APIs in multiple programming languages, flexibility through
built-in libraries for diverse tasks, and a rich set of APIs for transformations and actions.
7. What are the different methods to run Spark over Apache Hadoop?
Spark can run over Hadoop in three ways: Standalone (Spark's own cluster manager running
alongside HDFS), on YARN (applications are submitted to the Hadoop resource manager), and
SIMR (Spark In MapReduce, which runs Spark jobs inside MapReduce without requiring
administrative access to the cluster).
SparkContext is the entry point for any Spark application. It acts as a connection to the
Spark cluster, allowing Spark jobs to be executed.
SparkSession is the unified entry point to work with DataFrames, Datasets, and SQL in
Apache Spark. It replaces SQLContext and HiveContext.
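A minimal sketch in Scala of creating a SparkSession (the application name and local master URL
below are illustrative, not prescribed):

    import org.apache.spark.sql.SparkSession

    val spark = SparkSession.builder()
      .appName("ExampleApp")      // hypothetical application name
      .master("local[*]")         // local mode for illustration; use a cluster URL in practice
      .getOrCreate()

    val sc = spark.sparkContext   // the underlying SparkContext, still available for RDD APIs

In the spark-shell, both spark (a SparkSession) and sc (a SparkContext) are already created for you.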
RDDs are immutable to provide fault tolerance and support functional programming
principles, allowing Spark to rebuild lost data from the lineage information.
Paired RDDs are RDDs where each element is a pair (key-value). They are used for
operations like aggregation, grouping, and joins.
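A small sketch of a paired RDD, assuming the spark-shell where sc is predefined:

    val pairs  = sc.parallelize(Seq("a", "b", "a", "c")).map(word => (word, 1))
    val counts = pairs.reduceByKey(_ + _)                             // aggregation: (a,2), (b,1), (c,1)
    val joined = counts.join(sc.parallelize(Seq(("a", "alpha"), ("b", "beta"))))
    joined.collect().foreach(println)                                 // e.g. (a,(2,alpha)), (b,(1,beta))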
RDD is an in-memory data structure optimized for processing, while Distributed Storage
(like HDFS) focuses on data storage and retrieval.
16. Explain transformation and action in RDD in Apache Spark.
Transformation: Lazy operations that define a new RDD without executing until an
action is called (e.g., map, filter).
Action: Triggers the execution of transformations (e.g., count, collect).
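A short illustration of this laziness (spark-shell, sc predefined):

    val nums    = sc.parallelize(1 to 10)
    val doubled = nums.map(_ * 2)          // transformation: returns a new RDD, nothing runs yet
    val bigOnes = doubled.filter(_ > 10)   // transformation: still lazy
    val n       = bigOnes.count()          // action: triggers the job and returns 5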
A lineage graph tracks the sequence of transformations that created an RDD, used for
recomputing lost data due to node failures.
21. By default, how many partitions are created in RDD in Apache Spark?
By default, Spark creates partitions based on the number of cores available or the input file's
HDFS block size.
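You can inspect this directly in the spark-shell (the HDFS path below is hypothetical):

    println(sc.defaultParallelism)                    // usually the number of available cores
    val rdd = sc.parallelize(1 to 100)
    println(rdd.getNumPartitions)                     // defaults to sc.defaultParallelism
    val file = sc.textFile("hdfs:///data/input.txt")  // hypothetical path
    println(file.getNumPartitions)                    // typically one partition per HDFS block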
DataFrames are distributed collections of data organized into named columns, similar to
tables in a relational database.
Benefits include optimizations (Catalyst query optimizer), improved performance, and easier
manipulation using SQL-like syntax.
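A minimal DataFrame sketch (spark-shell, where spark is a predefined SparkSession; the sample
data is made up):

    import spark.implicits._

    val df = Seq(("Alice", 34), ("Bob", 45)).toDF("name", "age")
    df.filter($"age" > 40).select("name").show()

    df.createOrReplaceTempView("people")              // query the same data with plain SQL
    spark.sql("SELECT name FROM people WHERE age > 40").show()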
Advantages of Datasets include compile-time type safety, optimizations through Tungsten and
Catalyst, and the ability to work with strongly typed JVM objects via encoders.
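A small Dataset sketch in Scala (spark-shell; the Person case class and data are illustrative):

    import spark.implicits._

    case class Person(name: String, age: Int)    // schema carried by a JVM type

    val ds     = Seq(Person("Alice", 34), Person("Bob", 45)).toDS()
    val adults = ds.filter(p => p.age > 40)      // typed lambda, checked at compile time
    adults.show()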
A DAG in Spark represents a sequence of computations performed on data, where each node
is an RDD and edges represent transformations. It's used to optimize execution plans.
The DAG allows Spark to optimize execution by scheduling tasks efficiently, minimizing
data shuffling, and managing dependencies.
29. What is the difference between Caching and Persistence in Apache Spark?
cache() is simply persist() with the default storage level (MEMORY_ONLY for RDDs), whereas
persist() lets you choose the storage level explicitly, such as MEMORY_AND_DISK,
MEMORY_ONLY_SER, or DISK_ONLY.
Write-Ahead Log is a fault-tolerance mechanism where every received data is first written to
a log file (disk) before processing, ensuring no data loss.
Catalyst is Spark SQL's query optimizer that uses rule-based and cost-based optimization
techniques to generate efficient execution plans.
Shared variables are variables that can be used by tasks running on different nodes: broadcast
variables (read-only values cached on each executor) and accumulators (variables that are only
added to, used for aggregation).
Spark stores metadata like lineage information, partition data, and task details in the driver
and worker nodes, managing it using its DAG scheduler.
MLlib is Spark's scalable machine learning library, providing algorithms and utilities for
classification, regression, clustering, collaborative filtering, and more. Commonly used
algorithms include (a short example follows the list):
Linear Regression
Logistic Regression
Decision Trees
Random Forests
Gradient-Boosted Trees
K-Means Clustering
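A minimal K-Means sketch using the DataFrame-based ML API (spark-shell; the four sample points
and the choice of k = 2 are purely illustrative):

    import org.apache.spark.ml.clustering.KMeans
    import org.apache.spark.ml.linalg.Vectors

    val points = Seq(
      Vectors.dense(0.0, 0.0), Vectors.dense(0.1, 0.1),
      Vectors.dense(9.0, 9.0), Vectors.dense(9.1, 9.1)
    ).map(Tuple1.apply)

    val training = spark.createDataFrame(points).toDF("features")
    val model    = new KMeans().setK(2).setSeed(1L).fit(training)
    model.clusterCenters.foreach(println)        // two centers, near (0,0) and (9,9)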
Lazy evaluation defers execution until an action is performed, optimizing the execution plan
by reducing redundant computations.
Benefits of lazy evaluation include: fewer and better-optimized computations (transformations
can be pipelined and unnecessary work skipped), reduced memory and network usage, and a
complete execution plan (DAG) built before any data is processed.
Apache Spark is generally up to 100x faster than Hadoop for in-memory processing and up to
10x faster for on-disk data.
44. What are the ways to launch Apache Spark over YARN?
Spark supports two deploy modes on YARN: cluster mode, where the driver runs inside a YARN
ApplicationMaster, and client mode, where the driver runs in the client process that submitted
the application.
47. How can data transfer be minimized when working with Apache Spark?
Data transfer (shuffling) can be minimized by using broadcast variables for small lookup data,
using accumulators instead of collecting counters to the driver, and preferring transformations
that combine data locally (reduceByKey, combineByKey) over groupByKey.
48. What are the cases where Apache Spark surpasses Hadoop?
Spark surpasses Hadoop MapReduce for iterative algorithms (such as machine learning),
interactive and ad-hoc queries, real-time stream processing, and any workload that benefits from
keeping intermediate data in memory instead of writing it to disk between stages.
49. What is an action, and how does it process data in Apache Spark?
An action (such as count, collect, or saveAsTextFile) triggers evaluation of the lineage: Spark
builds the DAG, splits it into stages and tasks, runs them on the executors, and then either
returns the result to the driver or writes it to external storage.
The Spark Driver is responsible for converting the user's code into tasks, scheduling them on
executors, and collecting the results.
A worker node is a machine in a Spark cluster where the actual data processing tasks are
executed.
Transformations are lazy to build an optimized execution plan (DAG) and to avoid
unnecessary computation.
Yes, Spark can run independently using its built-in cluster manager or other managers like
Mesos and Kubernetes.
An accumulator is a variable used for aggregating information across executors, like counters
in MapReduce.
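A small accumulator sketch (spark-shell; the "badRecords" name and sample data are made up):

    val badRecords = sc.longAccumulator("badRecords")   // named accumulator, visible in the UI

    sc.parallelize(Seq("1", "2", "oops", "4")).foreach { s =>
      if (scala.util.Try(s.toInt).isFailure) badRecords.add(1)   // updated on the executors
    }
    println(badRecords.value)    // read reliably on the driver after the action completes: 1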
The Driver program coordinates the execution of tasks, maintains the SparkContext, and
communicates with the cluster manager.
57. How to identify that a given operation is a Transformation or Action in
your program?
Transformations return RDDs (e.g., map, filter), while actions return non-RDD values (e.g.,
collect, count).
58. Name the two types of shared variables available in Apache Spark.
Broadcast Variables
Accumulators
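A broadcast-variable sketch (spark-shell; the lookup map and country codes are illustrative).
An accumulator example appears earlier in this document:

    val lookup = sc.broadcast(Map("IN" -> "India", "US" -> "United States"))  // shipped once per executor

    val codes = sc.parallelize(Seq("IN", "US", "IN"))
    val named = codes.map(code => lookup.value.getOrElse(code, "unknown"))
    println(named.collect().mkString(", "))   // India, United States, India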
59. What are the common faults of developers while using Apache Spark?
Common mistakes include triggering unnecessary shuffles (for example, using groupByKey where
reduceByKey would do), collecting large datasets to the driver, keeping the work on a single
node instead of distributing it, calling external services from inside tasks, and mis-sizing
partitions or executor memory.
60. By default, how many partitions are created in RDD in Apache Spark?
The default number of partitions is based on the number of cores available in the cluster or
the HDFS block size.
61. Why do we need compression, and what are the different compression
formats supported?
Compression reduces the storage size of data and speeds up data transfer. Spark supports
several compression formats (a short configuration sketch follows the list):
Snappy
Gzip
Bzip2
LZ4
Zstandard (Zstd)
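A hedged sketch of where compression is commonly configured, assuming a Scala application (the
app name and output paths are hypothetical):

    import org.apache.spark.sql.SparkSession

    // Codec for Spark-internal data (shuffle, spills, broadcasts) is a configuration setting:
    val spark = SparkSession.builder()
      .appName("CompressionDemo")                       // hypothetical app name
      .config("spark.io.compression.codec", "zstd")     // e.g. lz4 (default), snappy, zstd
      .getOrCreate()

    // Codec for output files is chosen per write:
    val df = spark.range(5).toDF("n")
    df.write.option("compression", "snappy").parquet("/tmp/out_parquet")   // hypothetical path
    df.write.option("compression", "gzip").csv("/tmp/out_csv")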
The filter transformation creates a new RDD by selecting only elements that satisfy a given
predicate function.
sortByKey() sorts an RDD of key-value pairs by the key in ascending or descending order.
foreach() applies a function to each element in the RDD, typically used for side effects like
updating an external data store.
groupByKey: Groups values by key and shuffles all data across the network, which
can be less efficient.
reduceByKey: Combines values for each key locally before shuffling, reducing
network traffic.
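A word-count sketch showing both (spark-shell; sample words are made up):

    val pairs = sc.parallelize(Seq("spark", "hadoop", "spark", "spark")).map(w => (w, 1))

    val viaReduce = pairs.reduceByKey(_ + _)              // combines on each partition before shuffling
    val viaGroup  = pairs.groupByKey().mapValues(_.sum)   // shuffles every single (word, 1) pair

    println(viaReduce.collect().toMap)    // Map(spark -> 3, hadoop -> 1)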
map is a transformation that applies a function to each element in the RDD, resulting in a new
RDD.
fold() aggregates the elements of an RDD using an associative function and a "zero value"
(an initial value).
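A small fold() sketch (spark-shell). Because the zero value is applied once per partition and
once more when merging, it must be the identity of the operation (0 for addition):

    val nums  = sc.parallelize(Seq(1, 2, 3, 4), 2)
    val total = nums.fold(0)(_ + _)       // zero value 0 is neutral for addition; result: 10
    println(total)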
textFile(): Reads a text file and creates an RDD of strings, each representing a line.
wholeTextFiles(): Reads entire files and creates an RDD of (filename, content) pairs.
cogroup() groups data from two or more RDDs sharing the same key.
pipe() passes each partition of an RDD to an external script or program and returns the
output as an RDD.
coalesce() reduces the number of partitions in an RDD, useful for minimizing shuffling
when reducing the data size.
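A coalesce() sketch (spark-shell). Unlike repartition(), coalesce() merges existing partitions
without a full shuffle by default:

    val rdd = sc.parallelize(1 to 1000, 8)
    println(rdd.getNumPartitions)          // 8
    val narrowed = rdd.coalesce(2)         // merges partitions, avoids a full shuffle
    println(narrowed.getNumPartitions)     // 2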
fullOuterJoin() returns all keys from both RDDs; where a key has no match on one side, the
value for that side is None (null).
leftOuterJoin(): Returns all key-value pairs from the left RDD and matching pairs
from the right, filling with null where no match is found.
rightOuterJoin(): Returns all key-value pairs from the right RDD and matching pairs
from the left, filling with null where no match is found.
sum(), max(), and min() compute the sum, maximum, and minimum of the elements in an RDD,
respectively.
countByValue() returns a map of the counts of each unique value in the RDD.
lookup() returns the list of values associated with a given key in a paired RDD.
saveAsTextFile() saves the RDD content as a text file or set of text files.
reduceByKey() applies a reducing function to the elements with the same key, reducing
them to a single element per key.
flatMap() applies a function that returns an iterable to each element and flattens the results
into a single RDD.
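A short comparison of map and flatMap (spark-shell; sample lines are made up):

    val lines = sc.parallelize(Seq("hello world", "hello spark"))
    val words = lines.flatMap(_.split(" "))          // 4 elements, flattened into one RDD
    val arrays = lines.map(_.split(" "))             // 2 elements, each an Array[String]
    println(words.collect().mkString(", "))          // hello, world, hello, spark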
Limitations include high memory consumption, not ideal for OLTP (transactional
processing), lack of a mature security framework, and dependency on cluster resources.
Spark SQL is a Spark module for structured data processing, providing a DataFrame API and
allowing SQL queries to be executed.
Transformations include: map, filter, flatMap, mapPartitions, union, distinct, groupByKey,
reduceByKey, join, and coalesce.
Starvation occurs when all tasks are waiting for resources that are occupied by other long-
running tasks, leading to delays or deadlocks.
103. What are the different input sources for Spark Streaming?
Kafka
Flume
Kinesis
Socket
HDFS or S3
Spark Streaming can receive real-time data streams over a socket using
socketTextStream().
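A minimal DStream sketch that counts words arriving on a socket (spark-shell; the host and port
are hypothetical):

    import org.apache.spark.streaming.{Seconds, StreamingContext}

    val ssc   = new StreamingContext(sc, Seconds(5))            // 5-second micro-batches
    val lines = ssc.socketTextStream("localhost", 9999)         // hypothetical host and port
    val counts = lines.flatMap(_.split(" ")).map((_, 1)).reduceByKey(_ + _)
    counts.print()

    ssc.start()
    ssc.awaitTermination()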
The file system manages data storage, access, and security, ensuring data integrity and
availability.
106. How do you parse data in XML? Which kind of class do you use with
Java to parse data?
To parse XML data in Java, you can use classes from the javax.xml.parsers package, such as the
following (a short usage sketch appears after the list):
DocumentBuilder: Used with the Document Object Model (DOM) for in-memory
tree representation.
SAXParser: Used with the Simple API for XML (SAX) for event-driven parsing.
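The same javax.xml.parsers classes can be called from any JVM language; a DOM-based sketch in
Scala (the file name and "title" element are hypothetical):

    import java.io.File
    import javax.xml.parsers.DocumentBuilderFactory

    val builder  = DocumentBuilderFactory.newInstance().newDocumentBuilder()
    val document = builder.parse(new File("books.xml"))          // hypothetical input file
    val titles   = document.getElementsByTagName("title")        // hypothetical element name

    for (i <- 0 until titles.getLength)
      println(titles.item(i).getTextContent)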
PageRank is an algorithm used to rank web pages in search engine results, based on the
number and quality of links to a page. In Spark, it can be implemented using RDDs or
DataFrames to compute the rank of nodes in a graph.
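A compact RDD-based PageRank sketch (spark-shell; the three-page graph and 10 iterations are
purely illustrative):

    val links = sc.parallelize(Seq(
      ("a", Seq("b", "c")),
      ("b", Seq("c")),
      ("c", Seq("a"))
    )).cache()                                             // (page, outgoing links)

    var ranks = links.mapValues(_ => 1.0)                  // start every page at rank 1.0

    for (_ <- 1 to 10) {
      val contribs = links.join(ranks).values.flatMap {
        case (urls, rank) => urls.map(url => (url, rank / urls.size))
      }
      ranks = contribs.reduceByKey(_ + _).mapValues(0.15 + 0.85 * _)
    }

    ranks.collect().foreach(println)                       // (page, rank) pairs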
108. What are the roles and responsibilities of worker nodes in the Apache
Spark cluster? Is the Worker Node in Spark the same as the Slave Node?
Worker Nodes: Execute tasks assigned by the Spark Driver, manage executors, and
store data in memory or disk as required.
Slave Nodes: Worker nodes in Spark are commonly referred to as slave nodes. Both
terms are used interchangeably.
110. On what basis can you differentiate RDD, DataFrame, and DataSet?
RDD: A low-level, unstructured collection of JVM objects with no schema; type-safe at
compile time but not optimized by Catalyst or Tungsten.
DataFrame: Data organized into named columns (rows with a schema); optimized by Catalyst
and Tungsten, but field access is not type-checked at compile time.
DataSet: A strongly typed DataFrame (Scala and Java only) that combines compile-time type
safety with Catalyst and Tungsten optimizations.