
Lecture 19

RDD in Apache Spark

By
Dr. Aditya Bhardwaj

aditya.bhardwaj@bennett.edu.in

Big Data Analytics and Business Intelligence (CSET/CMCA-580)


Why RDD is Needed - Partitions in Apache Spark
• The performance and ergonomics of working with distributed data are largely a function of how that data is distributed.
• In Spark, data is distributed in a master-worker fashion and, where possible, kept entirely in memory.
• The Resilient Distributed Dataset (RDD) is the data structure and API for dealing with distributed data.
• Under the hood, an RDD stores its data in partitions. Partitioning the data gives the best performance and minimizes the time spent moving data around; a short sketch of inspecting partitions follows below.
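This is not part of the original slides, but a minimal PySpark sketch of how partitioning looks in practice may help; the partition count and data are illustrative assumptions.

from pyspark.sql import SparkSession

# Illustrative sketch (not from the original slides): inspecting partitions.
spark = SparkSession.builder.appName("Partition Sketch").getOrCreate()
sc = spark.sparkContext

# Distribute a small list over 4 partitions (numSlices chosen for illustration).
rdd = sc.parallelize(range(10), numSlices=4)

print("Number of partitions:", rdd.getNumPartitions())
# glom() gathers the elements of each partition into a list, so we can see
# which elements landed in which partition.
print("Elements per partition:", rdd.glom().collect())

spark.stop()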
What is RDD in Spark?
• An RDD (Resilient Distributed Dataset) is a core data structure in Apache Spark and has formed its backbone since the project's inception. It represents an immutable, fault-tolerant collection of elements that can be processed in parallel across a cluster of machines.

• RDDs serve as the fundamental building blocks in Spark, upon which newer data structures like Datasets and DataFrames are constructed.

• RDDs are designed for distributed computing, dividing the dataset into logical partitions. This logical partitioning enables efficient and scalable processing by distributing different data segments across different nodes within the cluster. The sketch below illustrates the immutability property.
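The immutability mentioned above can be seen directly in PySpark: a transformation never changes an existing RDD but returns a new one. A minimal sketch, with illustrative data:

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("Immutability Sketch").getOrCreate()
sc = spark.sparkContext

numbers = sc.parallelize([1, 2, 3, 4, 5])
# map() returns a brand-new RDD; 'numbers' itself is never modified.
doubled = numbers.map(lambda x: x * 2)

print(numbers.collect())   # [1, 2, 3, 4, 5]  -- original RDD is intact
print(doubled.collect())   # [2, 4, 6, 8, 10] -- result lives in a new RDD

spark.stop()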
Expanding RDD in Spark
Resilient: RDDs are fault-tolerant and automatically recover from failures. They achieve this by tracking the lineage of operations performed on the data; a sketch of inspecting that lineage follows below.

Distributed: RDDs are distributed across multiple nodes in a cluster, enabling parallel data processing.

Dataset: An RDD represents a collection of data records. Datasets such as JSON files, text files, etc. can be loaded into an RDD.
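The lineage that makes an RDD resilient can be inspected with toDebugString(). The chain of transformations below is a minimal illustrative sketch, not taken from the slides.

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("Lineage Sketch").getOrCreate()
sc = spark.sparkContext

rdd = sc.parallelize(range(100))
evens = rdd.filter(lambda x: x % 2 == 0)
squares = evens.map(lambda x: x * x)

# toDebugString() shows the lineage graph Spark would replay to rebuild
# any lost partitions of 'squares'.
print(squares.toDebugString().decode("utf-8"))

spark.stop()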
Features of RDD
RDDs support a wide range of operations, including transformations (such as map, filter, and flatMap) and actions (such as count, collect, and reduce). These operations allow users to perform complex data manipulations and computations on RDDs. RDDs also provide fault tolerance: because RDDs are immutable, performing an operation on an existing RDD produces a new RDD rather than modifying the original, and Spark records the lineage of those operations. Any lost data can therefore be recovered and recreated easily, which is what makes Spark RDDs fault-tolerant. A sketch contrasting transformations and actions follows below.
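A minimal sketch contrasting transformations (lazy, each returns a new RDD) with actions (which trigger computation and return a value to the driver); the data and lambdas are illustrative.

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("Transformations vs Actions").getOrCreate()
sc = spark.sparkContext

rdd = sc.parallelize([1, 2, 3, 4, 5, 6])

# Transformations: lazy, nothing executes yet.
evens = rdd.filter(lambda x: x % 2 == 0)
tripled = evens.map(lambda x: x * 3)

# Actions: force the computation and bring results back to the driver.
print("count  :", tripled.count())                    # 3
print("collect:", tripled.collect())                  # [6, 12, 18]
print("reduce :", tripled.reduce(lambda a, b: a + b)) # 36

spark.stop()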
Workflow of RDD
The workflow of RDD in Apache Spark begins with the creation of RDDs, either by loading data from external sources or by distributing an existing collection. Transformations are then applied to existing RDDs to derive new RDDs. Each RDD is divided into logical partitions, which enables parallel processing on different nodes of the cluster.
In short, the RDD workflow includes creating RDDs, applying transformations, performing actions, partitioning the data for parallel processing, and cleaning up RDDs when they are no longer needed. This workflow makes Apache Spark a powerful framework for data analytics and processing tasks; an end-to-end sketch of the workflow follows below.
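A minimal end-to-end sketch of that workflow (create, transform, act, clean up); the dataset and the cache/unpersist step are illustrative assumptions rather than part of the original lecture.

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("Workflow Sketch").getOrCreate()
sc = spark.sparkContext

# 1. Create an RDD (here from an in-memory collection, split into 3 partitions).
rdd = sc.parallelize(range(1, 21), numSlices=3)

# 2. Apply transformations to derive new RDDs (still lazy at this point).
filtered = rdd.filter(lambda x: x % 2 == 0)
scaled = filtered.map(lambda x: x * 10)

# 3. Optionally cache an RDD that several actions will reuse.
scaled.cache()

# 4. Perform actions, which trigger the actual distributed computation.
print("total:", scaled.reduce(lambda a, b: a + b))
print("first five:", scaled.take(5))

# 5. Clean up: release the cached RDD and stop the session when done.
scaled.unpersist()
spark.stop()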
How to create RDD?
In Apache Spark, RDDs are most frequently created in the following ways.

• Using the parallelize() method, which distributes an already existing collection from the driver program.

• Deriving a new RDD from an existing RDD by applying a transformation.

Both ways are shown in the sketch below.
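A minimal sketch of both creation methods; loading from an external file (for example with sc.textFile()) is also common, but is omitted here because any file path would be hypothetical.

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("Creating RDDs").getOrCreate()
sc = spark.sparkContext

# Way 1: parallelize an existing collection from the driver program.
words = sc.parallelize(["spark", "rdd", "partition", "lineage"])

# Way 2: derive a new RDD from an existing RDD via a transformation.
upper_words = words.map(lambda w: w.upper())

print(upper_words.collect())   # ['SPARK', 'RDD', 'PARTITION', 'LINEAGE']

spark.stop()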
Practical Demo 1 - Example of RDD operations

from pyspark.sql import SparkSession

# Initialize Spark session
spark = SparkSession.builder \
    .appName("Simple RDD Example") \
    .getOrCreate()

# Create an RDD from a Python list
data = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]
rdd = spark.sparkContext.parallelize(data)

# Perform an action: collect the elements of the RDD
collected_data = rdd.collect()

# Print the collected data
print("Collected RDD elements:", collected_data)

# Stop the Spark session
spark.stop()

Explanation:
• SparkSession: used to initialize a Spark application. The .builder.appName() method gives a name to your Spark application, and .getOrCreate() creates a session if none exists.
• Creating an RDD: the spark.sparkContext.parallelize(data) method creates an RDD from the Python list data.
• Action - collect(): the collect() action retrieves all the elements of the RDD into a list.
• Output: the collected elements are printed to the console.
Practical Demo 2- Example of RDD operations
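The code for this demo is not present in the extracted text (it was likely an image on the slide). What follows is a hedged, minimal sketch of chained RDD transformations followed by an action; the specific data and operations are assumptions, not the original demo.

from pyspark.sql import SparkSession

# Illustrative sketch only; the original Demo 2 code is not available.
spark = SparkSession.builder.appName("RDD Demo 2 (sketch)").getOrCreate()
sc = spark.sparkContext

numbers = sc.parallelize(range(1, 11))
# Chain two transformations, then trigger them with collect().
result = numbers.filter(lambda x: x % 2 == 0).map(lambda x: x ** 2)

print("Squares of even numbers:", result.collect())   # [4, 16, 36, 64, 100]

spark.stop()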
Practical Demo 3- Example of RDD operations
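As with Demo 2, the original code is not in the extracted text. A plausible minimal sketch is a word count using flatMap, map, and reduceByKey; the input lines are assumed.

from pyspark.sql import SparkSession

# Illustrative sketch only; the original Demo 3 code is not available.
spark = SparkSession.builder.appName("RDD Demo 3 (sketch)").getOrCreate()
sc = spark.sparkContext

lines = sc.parallelize(["spark makes rdds", "rdds power spark"])
counts = (lines.flatMap(lambda line: line.split())
               .map(lambda word: (word, 1))
               .reduceByKey(lambda a, b: a + b))

print("Word counts:", counts.collect())

spark.stop()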
Thanks Note

THANKS

