
SPARK Architecture

RDD, DAG
Basics
● Resilient Distributed Datasets (RDD)
● Directed Acyclic Graph (DAG)
Resilient Distributed Datasets (RDD)

● Resilient Distributed Datasets (RDD) is a fundamental data structure of Spark. It is an immutable distributed collection of objects.
● Each dataset in RDD is divided into logical partitions, which may be computed on
different nodes of the cluster.
● RDDs can contain any type of Python, Java, or Scala objects, including user-defined
classes.
● Formally, an RDD is a read-only, partitioned collection of records. RDDs can be created
through deterministic operations on either data on stable storage or other RDDs.
● RDD is a fault-tolerant collection of elements that can be operated on in parallel.
Contd...
There are two ways to create RDDs: parallelizing an existing collection in your driver
program, or referencing a dataset in an external storage system, such as a shared file
system, HDFS, HBase, or any data source offering a Hadoop InputFormat.

Parallelizing an existing collection:
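The slide's code example does not survive in this copy. In PySpark the real call is sc.parallelize(data, numSlices); the pure-Python sketch below is not the Spark API, only an illustration of the idea: slicing a driver-side collection into partitions.

```python
# Conceptual sketch of sc.parallelize(data, numSlices) -- NOT the real
# Spark API, just an illustration of slicing a local collection into
# partitions that could be computed on different nodes.
def parallelize(data, num_slices):
    n = len(data)
    return [data[i * n // num_slices:(i + 1) * n // num_slices]
            for i in range(num_slices)]

partitions = parallelize([1, 2, 3, 4, 5, 6], 3)
# partitions -> [[1, 2], [3, 4], [5, 6]]
```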


Contd...
Referencing a dataset in an external storage system:
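The slide's example is missing here as well. In PySpark this is sc.textFile(path) (or an equivalent Hadoop InputFormat source); a minimal pure-Python stand-in for the idea, reading a local file lazily line by line:

```python
import os
import tempfile

def text_file(path):
    # Stand-in for sc.textFile(path): yield lines lazily, the way an
    # RDD materializes records only when an action runs.
    with open(path) as f:
        for line in f:
            yield line.rstrip("\n")

# usage: write a small file, then read it back through the generator
with tempfile.NamedTemporaryFile("w", suffix=".txt", delete=False) as f:
    f.write("spark\nrdd\n")
    path = f.name
lines = list(text_file(path))
os.unlink(path)
# lines -> ["spark", "rdd"]
```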
Resilient Distributed Datasets

● Fault tolerant distributed dataset


● Lazy Evaluation
● Caching
● In memory computation
○ Spark RDDs support in-memory computation: intermediate results are stored in distributed memory (RAM) instead of stable storage (disk).
● Immutability
● Partitioning
Contd...
● Resilient, i.e. fault-tolerant: with the help of the RDD lineage graph (DAG), Spark can recompute partitions that are missing or damaged due to node failures.
● Distributed, since data resides on multiple nodes.
● Dataset represents the records of the data you work with. The user can load the dataset externally, e.g. from a JSON file, CSV file, text file, or a database via JDBC, with no specific data structure imposed.
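The recompute-from-lineage idea above can be sketched in plain Python (illustrative only, not Spark's implementation): a lost partition of a derived dataset is rebuilt by re-applying the recorded transformation to the surviving parent partition.

```python
# Parent data and the transformation recorded in the lineage graph.
parent_partitions = [[1, 2], [3, 4]]
transform = lambda part: [x * 10 for x in part]

# Derived dataset, one output partition per parent partition.
derived = [transform(p) for p in parent_partitions]

# Simulate a node failure losing partition 1, then recover it by
# replaying the lineage on the corresponding parent partition.
derived[1] = None
derived[1] = transform(parent_partitions[1])
# derived -> [[10, 20], [30, 40]]
```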
Caching

val b = a.filter(...)
b.cache()  // mark b to be kept in memory once it is first computed
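A toy Python model of what cache() buys (a hypothetical SketchRDD class, not the Spark API): without it, every action replays the lineage; with it, the first action materializes the result in memory and later actions reuse it.

```python
class SketchRDD:
    """Toy lazy dataset that counts how often its lineage is replayed."""

    def __init__(self, compute_fn):
        self._compute = compute_fn   # lazy: runs only when an action fires
        self._cache_enabled = False
        self._cached = None
        self.compute_count = 0

    def cache(self):
        # Like RDD.cache(): mark for in-memory reuse after first compute.
        self._cache_enabled = True
        return self

    def collect(self):
        # The action: returns the cached result if present, else recomputes.
        if self._cached is not None:
            return self._cached
        self.compute_count += 1
        result = self._compute()
        if self._cache_enabled:
            self._cached = result
        return result

cached = SketchRDD(lambda: [x for x in range(10) if x % 2 == 0]).cache()
cached.collect(); cached.collect()      # lineage replayed once
uncached = SketchRDD(lambda: [x for x in range(10) if x % 2 == 0])
uncached.collect(); uncached.collect()  # lineage replayed twice
```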
In-memory computation

In in-memory computation, the data is kept in random access memory (RAM) instead of slow disk drives and is processed in parallel.
Partitioning
● Data is split up into partitions.
● Partition size depends on the data source you are using.
● For HDFS one block is one partition.
● Single partition → Single task
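The "single partition → single task" rule can be sketched in plain Python (illustrative only): the scheduler launches one task per partition, and each task sees only its own slice of the data.

```python
from concurrent.futures import ThreadPoolExecutor

# Three partitions, so three tasks run; each task processes exactly
# one partition of the data.
partitions = [[1, 2], [3, 4], [5, 6]]

def run_task(partition):
    return sum(partition)

with ThreadPoolExecutor() as pool:
    task_results = list(pool.map(run_task, partitions))
# task_results -> [3, 7, 11]
```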
DAG (Introduction)
A DAG (Directed Acyclic Graph) in Apache Spark is a set of vertices and edges, where the vertices represent RDDs and the edges represent the operations to be applied to those RDDs. Every edge in a Spark DAG is directed from earlier to later in the sequence. When an action is called, the resulting DAG is submitted to the DAG Scheduler, which splits the graph into stages of tasks.
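A DAG in this sense can be modeled in a few lines of Python (illustrative; the RDD names below are made up): vertices are RDDs, edges point from earlier RDDs to later ones, and "acyclic" means a topological order always exists.

```python
# Adjacency lists: each RDD maps to the RDDs derived from it.
edges = {
    "rdd_lines": ["rdd_words"],
    "rdd_words": ["rdd_pairs"],
    "rdd_pairs": [],
}

def topo_order(edges):
    # Kahn's algorithm: repeatedly emit vertices with no remaining
    # incoming edges; this succeeds on every DAG.
    indegree = {v: 0 for v in edges}
    for targets in edges.values():
        for t in targets:
            indegree[t] += 1
    ready = [v for v, d in indegree.items() if d == 0]
    order = []
    while ready:
        v = ready.pop()
        order.append(v)
        for t in edges[v]:
            indegree[t] -= 1
            if indegree[t] == 0:
                ready.append(t)
    return order

# topo_order(edges) -> ["rdd_lines", "rdd_words", "rdd_pairs"]
```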
DAG
● Spark builds a graph as you enter code in the Spark console.
● When an action is called on a Spark RDD, Spark submits the graph to the DAG Scheduler.
● The DAG Scheduler divides the operators into stages of tasks.
● The stages are passed to the Task Scheduler, which launches the tasks via the cluster manager.
Task: a single unit of work sent to an executor; each task processes one partition of its stage.
Spark Context:
● Establishes a connection to the Spark execution environment.
● It can be used to create RDDs, accumulators, and broadcast variables.
Spark Architecture
Why do we need RDD in Spark?
● Iterative algorithms.
● DSM (Distributed Shared Memory) is a very general abstraction, but this generality makes it harder to implement efficiently and fault-tolerantly on commodity clusters. This is where the need for RDD comes into the picture.
● In distributed computing systems, data is stored in an intermediate stable distributed store such as HDFS or Amazon S3. This makes job computation slower, since it involves many IO operations, replications, and serializations.
● To achieve fault tolerance efficiently, RDDs provide a restricted form of shared memory, based on coarse-grained transformations rather than fine-grained updates to shared state.
Spark RDD Operations
RDD in Apache Spark supports two types of operations:

● Transformations
● Actions

Transformations:

● Spark RDD Transformations are functions that take an RDD as the input and produce one or
many RDDs as the output.
● They do not change the input RDD (RDDs are immutable, so they cannot be changed), but always produce one or more new RDDs by applying the computations they represent, e.g. map(), filter(), reduceByKey(), etc.
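The behaviour described above — transformations never mutate their input and only describe work to be done later — can be sketched with a tiny plan object (plain Python, not the Spark API; all names here are made up for illustration):

```python
# A "plan" records the source data and a list of pending operations;
# transformations return NEW plans and never touch the old one.
def make_plan(data):
    return {"data": data, "ops": []}

def map_t(plan, fn):
    return {"data": plan["data"], "ops": plan["ops"] + [("map", fn)]}

def filter_t(plan, pred):
    return {"data": plan["data"], "ops": plan["ops"] + [("filter", pred)]}

def collect(plan):
    # The action: only here does the pipeline actually execute.
    out = plan["data"]
    for kind, fn in plan["ops"]:
        out = [fn(x) for x in out] if kind == "map" else [x for x in out if fn(x)]
    return out

base = make_plan([1, 2, 3, 4])
doubled = map_t(base, lambda x: x * 2)
evens_over_4 = filter_t(doubled, lambda x: x > 4)
# collect(evens_over_4) -> [6, 8]; base is untouched (immutability)
```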
Contd...
Actions:

● An Action in Spark returns the final result of the RDD computations.
● It triggers execution using the lineage graph to load the data into the original RDD, carry out all intermediate transformations, and return the final results to the driver program or write them out to the file system.
● The lineage graph is the dependency graph of all the parent RDDs of an RDD.
● Example: collect(), take(), count(), etc.
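Toy Python equivalents of these actions over a plain list (illustrative only; the real RDD methods carry the same names but run distributed):

```python
data = [5, 3, 8, 1]

collected = list(data)   # like collect(): bring all records to the driver
first_two = data[:2]     # like take(2): the first n records
n = len(data)            # like count(): the number of records
```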
Two kinds of transformations

Certain transformations can be pipelined, which is an optimization method Spark uses to improve the performance of computations. There are two kinds of transformations: narrow transformations and wide transformations.
a. Narrow Transformations
These result from operations such as map() and filter(), where the data comes from a single partition only, i.e. each partition is self-sufficient. Each partition of the output RDD has records that originate from a single partition in the parent RDD, and only a limited subset of partitions is used to calculate the result.
Spark groups narrow transformations into a single stage, an optimization known as pipelining.
Contd...
b. Wide Transformations

These result from functions such as groupByKey() and reduceByKey(). The data required to compute the records in a single partition may live in many partitions of the parent RDD.
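Putting the two kinds together: stages break exactly at wide dependencies, while runs of narrow transformations are pipelined into one stage. A sketch in plain Python (the lineage below is made up, and real Spark stage boundaries are computed from the dependency graph, not a flat list):

```python
# Each step in the lineage is tagged with how it depends on its parent.
lineage = [
    ("textFile", "narrow"),
    ("map", "narrow"),
    ("reduceByKey", "wide"),   # shuffle: data crosses partitions
    ("filter", "narrow"),
]

def split_into_stages(lineage):
    stages, current = [], []
    for op, dep in lineage:
        if dep == "wide" and current:
            stages.append(current)   # stage boundary at the shuffle
            current = []
        current.append(op)
    if current:
        stages.append(current)
    return stages

# split_into_stages(lineage)
# -> [["textFile", "map"], ["reduceByKey", "filter"]]
```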
