SPARK Architecture
RDD, DAG
Basics
● Resilient Distributed Datasets (RDD)
● Directed Acyclic Graph (DAG)
Resilient Distributed Datasets (RDD)
val b = a.filter(...) // filter() returns a new RDD; a is left unchanged
In-memory computation
● Transformations
● Actions
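Transformations are lazy: they only describe a computation, and nothing runs until an action asks for a result. As a rough plain-Python sketch of that behavior (illustrative only, not the Spark API), lazy iterators behave similarly:

```python
# Plain-Python sketch of lazy "transformations" vs. an eager "action".
# (Illustrative analogy only; this is not the Spark API.)
data = range(1, 6)

# "Transformations": build a lazy pipeline; nothing is computed yet.
doubled = map(lambda x: x * 2, data)      # like rdd.map(...)
big = filter(lambda x: x > 4, doubled)    # like rdd.filter(...)

# "Action": forces evaluation and returns a result to the caller.
result = list(big)                        # like rdd.collect()
print(result)  # [6, 8, 10]
```

Just as here, inspecting a Spark RDD's lineage costs nothing; work happens only when an action such as collect() or count() is invoked.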
Transformations:
● Spark RDD transformations are functions that take an RDD as input and produce one or
more RDDs as output.
● They do not change the input RDD (RDDs are immutable, so they cannot be modified),
but instead produce one or more new RDDs by applying the computations they represent,
e.g. map(), filter(), reduceByKey().
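To make the reduceByKey() mention concrete, here is a plain-Python emulation of its per-key folding semantics in a word count (a sketch of the behavior only, not Spark's distributed implementation; `reduce_by_key` is a hypothetical helper):

```python
from collections import defaultdict
from functools import reduce

def reduce_by_key(pairs, func):
    """Group (key, value) pairs and fold the values per key,
    roughly what rdd.reduceByKey(func) computes."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return {k: reduce(func, vs) for k, vs in groups.items()}

words = ["a", "b", "a", "c", "b", "a"]
pairs = [(w, 1) for w in words]                 # like rdd.map(lambda w: (w, 1))
counts = reduce_by_key(pairs, lambda x, y: x + y)
print(counts)  # {'a': 3, 'b': 2, 'c': 1}
```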
Actions:
Actions trigger evaluation of the transformation lineage and return a value to the driver
program or write data to external storage, e.g. count(), collect(), saveAsTextFile().

Types of Transformations:
There are two kinds of transformations: narrow transformations and wide transformations.
Narrow transformations can be pipelined, an optimization Spark uses to improve the
performance of computations.
a. Narrow Transformations
These result from operations such as map() and filter(), where the data comes from a
single partition only, i.e. the computation is self-sufficient. Each partition of the
output RDD contains records that originate from a single partition of the parent RDD,
so only that one parent partition is needed to calculate the result.
Spark groups consecutive narrow transformations into a single stage; this is known as
pipelining.
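A minimal plain-Python sketch of this pipelining idea (the partition lists and the fused task are assumptions for illustration, not Spark internals):

```python
# Sketch of pipelining narrow transformations (illustrative only).
# Each partition is processed independently; a filter and a map fuse
# into one pass over the data, with no cross-partition communication.
partitions = [[1, 2, 3], [4, 5, 6]]

def pipelined_task(partition):
    # One fused pass per partition: filter evens, then scale by 10.
    # No data from any other partition is needed.
    return [x * 10 for x in partition if x % 2 == 0]

# Output partition i depends only on input partition i.
out = [pipelined_task(p) for p in partitions]
print(out)  # [[20], [40, 60]]
```

Because every output partition reads exactly one input partition, Spark can run all such fused tasks in parallel inside a single stage.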
b. Wide Transformations
These result from operations such as groupByKey() and reduceByKey(). The data required
to compute the records in a single output partition may live in many partitions of the
parent RDD, so records must be redistributed across partitions (a shuffle).
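A plain-Python sketch of why wide transformations need a shuffle (the hash routing below is an assumption for illustration, not Spark's actual shuffle machinery):

```python
from collections import defaultdict

# Sketch of a wide transformation (illustrative only).
# To group by key, records with the same key must be moved ("shuffled")
# out of every input partition into one and the same output partition.
partitions = [[("a", 1), ("b", 2)], [("a", 3), ("b", 4), ("c", 5)]]
num_out = 2

# Shuffle step: route each record to an output partition by key hash.
shuffled = [defaultdict(list) for _ in range(num_out)]
for part in partitions:
    for key, value in part:
        shuffled[hash(key) % num_out][key].append(value)

# groupByKey-style result: each key's values now live in a single partition.
out = [dict(p) for p in shuffled]
```

Note that unlike the narrow case, every output partition here may read from every input partition, which is why wide transformations force a stage boundary in Spark's DAG.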