
SPARK Architecture

RDD, DAG
Basics
● Resilient Distributed Datasets (RDD)
● Directed Acyclic Graph (DAG)
Resilient Distributed Datasets (RDD)

● Resilient Distributed Datasets (RDD) is a fundamental data structure of Spark. It is an immutable distributed collection of objects.
● Each dataset in RDD is divided into logical partitions, which may be computed on
different nodes of the cluster.
● RDDs can contain any type of Python, Java, or Scala objects, including user-defined
classes.
● Formally, an RDD is a read-only, partitioned collection of records. RDDs can be created
through deterministic operations on either data on stable storage or other RDDs.
● RDD is a fault-tolerant collection of elements that can be operated on in parallel.
Contd...
There are two ways to create RDDs: parallelizing an existing collection in your driver
program, or referencing a dataset in an external storage system, such as a shared file
system, HDFS, HBase, or any data source offering a Hadoop InputFormat.

Parallelizing an existing collection:
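The slide's code example does not survive in this copy. In PySpark the real call is sc.parallelize(data, numSlices); the pure-Python sketch below is not the Spark API, only an illustration of the idea: slicing a driver-side collection into partitions.

```python
# Conceptual sketch of sc.parallelize(data, numSlices) -- NOT the real
# Spark API, just an illustration of slicing a local collection into
# partitions that could be computed on different nodes.
def parallelize(data, num_slices):
    n = len(data)
    return [data[i * n // num_slices:(i + 1) * n // num_slices]
            for i in range(num_slices)]

partitions = parallelize([1, 2, 3, 4, 5, 6], 3)
# partitions -> [[1, 2], [3, 4], [5, 6]]
```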


Contd...
Referencing a dataset in an external storage system:
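The slide's example is missing here as well. In PySpark this is sc.textFile(path) (or an equivalent Hadoop InputFormat source); a minimal pure-Python stand-in for the idea, reading a local file lazily line by line:

```python
import os
import tempfile

def text_file(path):
    # Stand-in for sc.textFile(path): yield lines lazily, the way an
    # RDD materializes records only when an action runs.
    with open(path) as f:
        for line in f:
            yield line.rstrip("\n")

# usage: write a small file, then read it back through the generator
with tempfile.NamedTemporaryFile("w", suffix=".txt", delete=False) as f:
    f.write("spark\nrdd\n")
    path = f.name
lines = list(text_file(path))
os.unlink(path)
# lines -> ["spark", "rdd"]
```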
Resilient Distributed Datasets

● Fault tolerant distributed dataset


● Lazy Evaluation
● Caching
● In memory computation
○ Spark RDDs support in-memory computation: intermediate results are stored in distributed memory (RAM) instead of stable storage (disk).
● Immutability
● Partitioning
Contd...
● Resilient, i.e. fault-tolerant: with the help of the RDD lineage graph (DAG), Spark can recompute partitions that are missing or damaged due to node failures.
● Distributed, since data resides on multiple nodes.
● Dataset represents the records of the data you work with. The user can load the dataset externally, e.g. from a JSON file, CSV file, text file, or a database via JDBC, with no specific data structure imposed.
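The recompute-from-lineage idea above can be sketched in plain Python (illustrative only, not Spark's implementation): a lost partition of a derived dataset is rebuilt by re-applying the recorded transformation to the surviving parent partition.

```python
# Parent data and the transformation recorded in the lineage graph.
parent_partitions = [[1, 2], [3, 4]]
transform = lambda part: [x * 10 for x in part]

# Derived dataset, one output partition per parent partition.
derived = [transform(p) for p in parent_partitions]

# Simulate a node failure losing partition 1, then recover it by
# replaying the lineage on the corresponding parent partition.
derived[1] = None
derived[1] = transform(parent_partitions[1])
# derived -> [[10, 20], [30, 40]]
```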
Caching

val b = a.filter(...)
b.cache()  // mark b to be kept in memory once it is first computed
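A toy Python model of what cache() buys (a hypothetical SketchRDD class, not the Spark API): without it, every action replays the lineage; with it, the first action materializes the result in memory and later actions reuse it.

```python
class SketchRDD:
    """Toy lazy dataset that counts how often its lineage is replayed."""

    def __init__(self, compute_fn):
        self._compute = compute_fn   # lazy: runs only when an action fires
        self._cache_enabled = False
        self._cached = None
        self.compute_count = 0

    def cache(self):
        # Like RDD.cache(): mark for in-memory reuse after first compute.
        self._cache_enabled = True
        return self

    def collect(self):
        # The action: returns the cached result if present, else recomputes.
        if self._cached is not None:
            return self._cached
        self.compute_count += 1
        result = self._compute()
        if self._cache_enabled:
            self._cached = result
        return result

cached = SketchRDD(lambda: [x for x in range(10) if x % 2 == 0]).cache()
cached.collect(); cached.collect()      # lineage replayed once
uncached = SketchRDD(lambda: [x for x in range(10) if x % 2 == 0])
uncached.collect(); uncached.collect()  # lineage replayed twice
```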
In-memory computation

In in-memory computation, the data is kept in random access memory (RAM) instead of slow disk drives and is processed in parallel.
Partitioning
● Data is split up into partitions.
● Partition size depends on the data source you are using.
● For HDFS one block is one partition.
● Single partition → Single task
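The "single partition → single task" rule can be sketched in plain Python (illustrative only): the scheduler launches one task per partition, and each task sees only its own slice of the data.

```python
from concurrent.futures import ThreadPoolExecutor

# Three partitions, so three tasks run; each task processes exactly
# one partition of the data.
partitions = [[1, 2], [3, 4], [5, 6]]

def run_task(partition):
    return sum(partition)

with ThreadPoolExecutor() as pool:
    task_results = list(pool.map(run_task, partitions))
# task_results -> [3, 7, 11]
```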
DAG (Introduction)
A DAG (Directed Acyclic Graph) in Apache Spark is a set of vertices and edges, where the vertices represent RDDs and the edges represent the operations to be applied to those RDDs. Every edge in a Spark DAG is directed from earlier to later in the sequence. When an action is called, the resulting DAG is submitted to the DAG Scheduler, which splits the graph into stages of tasks.
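A DAG in this sense can be modeled in a few lines of Python (illustrative; the RDD names below are made up): vertices are RDDs, edges point from earlier RDDs to later ones, and "acyclic" means a topological order always exists.

```python
# Adjacency lists: each RDD maps to the RDDs derived from it.
edges = {
    "rdd_lines": ["rdd_words"],
    "rdd_words": ["rdd_pairs"],
    "rdd_pairs": [],
}

def topo_order(edges):
    # Kahn's algorithm: repeatedly emit vertices with no remaining
    # incoming edges; this succeeds on every DAG.
    indegree = {v: 0 for v in edges}
    for targets in edges.values():
        for t in targets:
            indegree[t] += 1
    ready = [v for v, d in indegree.items() if d == 0]
    order = []
    while ready:
        v = ready.pop()
        order.append(v)
        for t in edges[v]:
            indegree[t] -= 1
            if indegree[t] == 0:
                ready.append(t)
    return order

# topo_order(edges) -> ["rdd_lines", "rdd_words", "rdd_pairs"]
```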
DAG
● Spark builds a graph as you enter code in the Spark console.
● When an action is called on a Spark RDD, Spark submits the graph to the DAG Scheduler.
● The DAG Scheduler divides the operators into stages of tasks.
● The stages are passed to the Task Scheduler, which launches the tasks via the cluster manager.
Task: a single unit of work sent to an executor; each task processes one partition of its stage.
Spark Context:
● Establishes a connection to the Spark execution environment.
● It can be used to create RDDs, accumulators, and broadcast variables.
Spark Architecture
Why do we need RDD in Spark?
● Iterative algorithms.
● DSM (Distributed Shared Memory) is a very general abstraction, but this generality makes it harder to implement efficiently and fault-tolerantly on commodity clusters. This is where the need for RDD comes into the picture.
● In distributed computing systems, data is stored in an intermediate stable distributed store such as HDFS or Amazon S3. This makes job computation slower, since it involves many IO operations, replications, and serializations.
● To achieve fault tolerance efficiently, RDDs provide a restricted form of shared memory, based on coarse-grained transformations rather than fine-grained updates to shared state.
Spark RDD Operations
RDD in Apache Spark supports two types of operations:

● Transformations
● Actions

Transformations:

● Spark RDD Transformations are functions that take an RDD as the input and produce one or
many RDDs as the output.
● They do not change the input RDD (RDDs are immutable, so they cannot be changed), but always produce one or more new RDDs by applying the computations they represent, e.g. map(), filter(), reduceByKey(), etc.
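The behaviour described above — transformations never mutate their input and only describe work to be done later — can be sketched with a tiny plan object (plain Python, not the Spark API; all names here are made up for illustration):

```python
# A "plan" records the source data and a list of pending operations;
# transformations return NEW plans and never touch the old one.
def make_plan(data):
    return {"data": data, "ops": []}

def map_t(plan, fn):
    return {"data": plan["data"], "ops": plan["ops"] + [("map", fn)]}

def filter_t(plan, pred):
    return {"data": plan["data"], "ops": plan["ops"] + [("filter", pred)]}

def collect(plan):
    # The action: only here does the pipeline actually execute.
    out = plan["data"]
    for kind, fn in plan["ops"]:
        out = [fn(x) for x in out] if kind == "map" else [x for x in out if fn(x)]
    return out

base = make_plan([1, 2, 3, 4])
doubled = map_t(base, lambda x: x * 2)
evens_over_4 = filter_t(doubled, lambda x: x > 4)
# collect(evens_over_4) -> [6, 8]; base is untouched (immutability)
```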
Contd...
Actions:

● An Action in Spark returns the final result of the RDD computations.
● It triggers execution using the lineage graph to load the data into the original RDD, carry out all intermediate transformations, and return the final results to the driver program or write them out to the file system.
● The lineage graph is the dependency graph of all the parent RDDs of an RDD.
● Example: collect(), take(), count(), etc.
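Toy Python equivalents of these actions over a plain list (illustrative only; the real RDD methods carry the same names but run distributed):

```python
data = [5, 3, 8, 1]

collected = list(data)   # like collect(): bring all records to the driver
first_two = data[:2]     # like take(2): the first n records
n = len(data)            # like count(): the number of records
```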
Two kinds of transformations

Certain transformations can be pipelined, which is an optimization method Spark uses to improve the performance of computations. There are two kinds of transformations: narrow transformations and wide transformations.
a. Narrow Transformations
These result from operations such as map() and filter(), where the data comes from a single partition only, i.e. each partition is self-sufficient. Each partition of the output RDD has records that originate from a single partition in the parent RDD, and only a limited subset of partitions is used to calculate the result.
Spark groups narrow transformations into a single stage, an optimization known as pipelining.
Contd...
b. Wide Transformations

These result from functions such as groupByKey() and reduceByKey(). The data required to compute the records in a single partition may live in many partitions of the parent RDD.
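Putting the two kinds together: stages break exactly at wide dependencies, while runs of narrow transformations are pipelined into one stage. A sketch in plain Python (the lineage below is made up, and real Spark stage boundaries are computed from the dependency graph, not a flat list):

```python
# Each step in the lineage is tagged with how it depends on its parent.
lineage = [
    ("textFile", "narrow"),
    ("map", "narrow"),
    ("reduceByKey", "wide"),   # shuffle: data crosses partitions
    ("filter", "narrow"),
]

def split_into_stages(lineage):
    stages, current = [], []
    for op, dep in lineage:
        if dep == "wide" and current:
            stages.append(current)   # stage boundary at the shuffle
            current = []
        current.append(op)
    if current:
        stages.append(current)
    return stages

# split_into_stages(lineage)
# -> [["textFile", "map"], ["reduceByKey", "filter"]]
```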
