
Lecture 19

RDD in Apache Spark

By
Dr. Aditya Bhardwaj

aditya.bhardwaj@bennett.edu.in

Big Data Analytics and Business Intelligence (CSET/CMCA-580)


Why RDD is Needed - Partitions in Apache Spark
• The performance and ergonomics of working with distributed data are largely a function of how that data is distributed.
• In Spark, data is distributed in a master-worker fashion and, where possible, kept entirely in memory.
• The Resilient Distributed Dataset (RDD) is the data structure and API for dealing with distributed data.
• Under the hood, an RDD stores its data in partitions. Partitioning the data gives the best performance and minimizes the time spent moving data around; a short sketch of inspecting partitions follows below.
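This is not part of the original slides, but a minimal PySpark sketch of how partitioning looks in practice may help; the partition count and data are illustrative assumptions.

from pyspark.sql import SparkSession

# Illustrative sketch (not from the original slides): inspecting partitions.
spark = SparkSession.builder.appName("Partition Sketch").getOrCreate()
sc = spark.sparkContext

# Distribute a small list over 4 partitions (numSlices chosen for illustration).
rdd = sc.parallelize(range(10), numSlices=4)

print("Number of partitions:", rdd.getNumPartitions())
# glom() gathers the elements of each partition into a list, so we can see
# which elements landed in which partition.
print("Elements per partition:", rdd.glom().collect())

spark.stop()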
What is RDD in Spark?
• An RDD (Resilient Distributed Dataset) is a core data structure in Apache Spark and has formed its backbone since the project's inception. It represents an immutable, fault-tolerant collection of elements that can be processed in parallel across a cluster of machines.

• RDDs serve as the fundamental building blocks in Spark, upon which newer data structures like Datasets and DataFrames are constructed.

• RDDs are designed for distributed computing, dividing the dataset into logical partitions. This logical partitioning enables efficient and scalable processing by distributing different data segments across different nodes within the cluster. The sketch below illustrates the immutability property.
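The immutability mentioned above can be seen directly in PySpark: a transformation never changes an existing RDD but returns a new one. A minimal sketch, with illustrative data:

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("Immutability Sketch").getOrCreate()
sc = spark.sparkContext

numbers = sc.parallelize([1, 2, 3, 4, 5])
# map() returns a brand-new RDD; 'numbers' itself is never modified.
doubled = numbers.map(lambda x: x * 2)

print(numbers.collect())   # [1, 2, 3, 4, 5]  -- original RDD is intact
print(doubled.collect())   # [2, 4, 6, 8, 10] -- result lives in a new RDD

spark.stop()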
Expanding RDD in Spark
Resilient: RDDs are fault-tolerant and automatically recover from failures. They achieve this by tracking the lineage of operations performed on the data; a sketch of inspecting that lineage follows below.

Distributed: RDDs are distributed across multiple nodes in a cluster, enabling parallel data processing.

Dataset: An RDD represents a collection of data records. Datasets such as JSON files, text files, etc. can be loaded into an RDD.
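The lineage that makes an RDD resilient can be inspected with toDebugString(). The chain of transformations below is a minimal illustrative sketch, not taken from the slides.

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("Lineage Sketch").getOrCreate()
sc = spark.sparkContext

rdd = sc.parallelize(range(100))
evens = rdd.filter(lambda x: x % 2 == 0)
squares = evens.map(lambda x: x * x)

# toDebugString() shows the lineage graph Spark would replay to rebuild
# any lost partitions of 'squares'.
print(squares.toDebugString().decode("utf-8"))

spark.stop()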
Features of RDD
RDDs support a wide range of operations, including transformations (such as map, filter, and flatMap) and actions (such as count, collect, and reduce). These operations allow users to perform complex data manipulations and computations on RDDs. RDDs also provide fault tolerance: because RDDs are immutable, performing an operation on an existing RDD produces a new RDD rather than modifying the original, and Spark records the lineage of those operations. Any lost data can therefore be recovered and recreated easily, which is what makes Spark RDDs fault-tolerant. A sketch contrasting transformations and actions follows below.
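A minimal sketch contrasting transformations (lazy, each returns a new RDD) with actions (which trigger computation and return a value to the driver); the data and lambdas are illustrative.

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("Transformations vs Actions").getOrCreate()
sc = spark.sparkContext

rdd = sc.parallelize([1, 2, 3, 4, 5, 6])

# Transformations: lazy, nothing executes yet.
evens = rdd.filter(lambda x: x % 2 == 0)
tripled = evens.map(lambda x: x * 3)

# Actions: force the computation and bring results back to the driver.
print("count  :", tripled.count())                    # 3
print("collect:", tripled.collect())                  # [6, 12, 18]
print("reduce :", tripled.reduce(lambda a, b: a + b)) # 36

spark.stop()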
Workflow of RDD
The workflow of RDD in Apache Spark begins with the creation of RDDs, either by loading data from external sources or by distributing an existing collection. Transformations are then applied to existing RDDs to derive new RDDs. Each RDD is divided into logical partitions, which enables parallel processing on different nodes of the cluster.
In short, the RDD workflow includes creating RDDs, applying transformations, performing actions, partitioning the data for parallel processing, and cleaning up RDDs when they are no longer needed. This workflow makes Apache Spark a powerful framework for data analytics and processing tasks; an end-to-end sketch of the workflow follows below.
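A minimal end-to-end sketch of that workflow (create, transform, act, clean up); the dataset and the cache/unpersist step are illustrative assumptions rather than part of the original lecture.

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("Workflow Sketch").getOrCreate()
sc = spark.sparkContext

# 1. Create an RDD (here from an in-memory collection, split into 3 partitions).
rdd = sc.parallelize(range(1, 21), numSlices=3)

# 2. Apply transformations to derive new RDDs (still lazy at this point).
filtered = rdd.filter(lambda x: x % 2 == 0)
scaled = filtered.map(lambda x: x * 10)

# 3. Optionally cache an RDD that several actions will reuse.
scaled.cache()

# 4. Perform actions, which trigger the actual distributed computation.
print("total:", scaled.reduce(lambda a, b: a + b))
print("first five:", scaled.take(5))

# 5. Clean up: release the cached RDD and stop the session when done.
scaled.unpersist()
spark.stop()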
How to create RDD?
In Apache Spark, RDDs are most frequently created in the following ways.

• Using the parallelize() method, which distributes an already existing collection from the driver program.

• Deriving a new RDD from an existing RDD by applying a transformation.

Both ways are shown in the sketch below.
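A minimal sketch of both creation methods; loading from an external file (for example with sc.textFile()) is also common, but is omitted here because any file path would be hypothetical.

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("Creating RDDs").getOrCreate()
sc = spark.sparkContext

# Way 1: parallelize an existing collection from the driver program.
words = sc.parallelize(["spark", "rdd", "partition", "lineage"])

# Way 2: derive a new RDD from an existing RDD via a transformation.
upper_words = words.map(lambda w: w.upper())

print(upper_words.collect())   # ['SPARK', 'RDD', 'PARTITION', 'LINEAGE']

spark.stop()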
Practical Demo 1 - Example of RDD operations

from pyspark.sql import SparkSession

# Initialize Spark session
spark = SparkSession.builder \
    .appName("Simple RDD Example") \
    .getOrCreate()

# Create an RDD from a Python list
data = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]
rdd = spark.sparkContext.parallelize(data)

# Perform an action: collect the elements of the RDD
collected_data = rdd.collect()

# Print the collected data
print("Collected RDD elements:", collected_data)

# Stop the Spark session
spark.stop()

Explanation:
• SparkSession: used to initialize a Spark application. The .builder.appName() method gives a name to your Spark application, and .getOrCreate() creates a session if none exists.
• Creating an RDD: the spark.sparkContext.parallelize(data) method creates an RDD from the Python list data.
• Action - collect(): the collect() action retrieves all the elements of the RDD into a list.
• Output: the collected elements are printed to the console.
Practical Demo 2- Example of RDD operations
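The code for this demo is not present in the extracted text (it was likely an image on the slide). What follows is a hedged, minimal sketch of chained RDD transformations followed by an action; the specific data and operations are assumptions, not the original demo.

from pyspark.sql import SparkSession

# Illustrative sketch only; the original Demo 2 code is not available.
spark = SparkSession.builder.appName("RDD Demo 2 (sketch)").getOrCreate()
sc = spark.sparkContext

numbers = sc.parallelize(range(1, 11))
# Chain two transformations, then trigger them with collect().
result = numbers.filter(lambda x: x % 2 == 0).map(lambda x: x ** 2)

print("Squares of even numbers:", result.collect())   # [4, 16, 36, 64, 100]

spark.stop()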
Practical Demo 3- Example of RDD operations
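As with Demo 2, the original code is not in the extracted text. A plausible minimal sketch is a word count using flatMap, map, and reduceByKey; the input lines are assumed.

from pyspark.sql import SparkSession

# Illustrative sketch only; the original Demo 3 code is not available.
spark = SparkSession.builder.appName("RDD Demo 3 (sketch)").getOrCreate()
sc = spark.sparkContext

lines = sc.parallelize(["spark makes rdds", "rdds power spark"])
counts = (lines.flatMap(lambda line: line.split())
               .map(lambda word: (word, 1))
               .reduceByKey(lambda a, b: a + b))

print("Word counts:", counts.collect())

spark.stop()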
Thanks Note

THANKS

