Lecture 19 - RDD in Spark
By
Dr. Aditya Bhardwaj
aditya.bhardwaj@bennett.edu.in
• RDDs are designed for distributed computing: the dataset is divided into
logical partitions. This logical partitioning enables efficient and scalable
processing by distributing different data segments across different
nodes of the cluster.
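A minimal PySpark sketch of this partitioning (assuming a local Spark installation; the app name, data, and partition count are only illustrative):

from pyspark import SparkContext

sc = SparkContext("local[*]", "rdd-partitioning-demo")  # illustrative app name

# Create an RDD from an in-memory collection, explicitly asking for 4 partitions.
numbers = sc.parallelize(range(1, 101), numSlices=4)

print(numbers.getNumPartitions())          # -> 4
print(numbers.glom().map(len).collect())   # elements per partition, e.g. [25, 25, 25, 25]

sc.stop()

Each partition can be processed on a different node, which is what gives RDDs their parallelism.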
Expanding RDD in Spark
Resilient: RDDs are fault-tolerant and automatically recover from
failures. They achieve this by tracking the lineage of operations
performed on the data.
Distributed: The data is partitioned and spread across the nodes of the
cluster so that it can be processed in parallel.
Dataset: It represents the records of the data. Datasets such as JSON
files, text files, etc., can be loaded into an RDD.
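A small PySpark sketch of these two ideas (the file path is a placeholder for any real dataset): the RDD is built from a dataset, and its lineage can be inspected, which is what Spark replays to recover lost partitions.

from pyspark import SparkContext

sc = SparkContext("local[*]", "rdd-lineage-demo")

lines = sc.textFile("data/sample.txt")             # placeholder path
words = lines.flatMap(lambda line: line.split())

# toDebugString() shows the lineage (the chain of transformations) that Spark
# would re-execute if a partition of `words` were lost.
lineage = words.toDebugString()
print(lineage.decode("utf-8") if isinstance(lineage, bytes) else lineage)

sc.stop()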
Features of RDD
RDDs support a wide range of operations, including transformations
(such as map, filter, and flatMap) and actions (such as count, collect, and reduce).
These operations allow users to perform complex data manipulations and
computations on RDDs. RDDs also provide fault tolerance: because RDDs are
immutable, any operation on an existing RDD produces a new RDD rather than
modifying the original, and Spark records the lineage of these operations.
Any lost data can therefore be recovered easily by recomputing it from its
lineage. This feature makes Spark RDDs fault-tolerant.
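A minimal PySpark sketch of these operations (the data is illustrative): transformations build new RDDs lazily, while actions trigger the actual computation and return results to the driver.

from pyspark import SparkContext

sc = SparkContext("local[*]", "rdd-operations-demo")

rdd = sc.parallelize([1, 2, 3, 4, 5, 6])

# Transformations: each returns a *new* RDD; the original is never modified.
evens   = rdd.filter(lambda x: x % 2 == 0)
squared = evens.map(lambda x: x * x)

# Actions: these trigger execution and return results to the driver.
print(squared.collect())                    # [4, 16, 36]
print(squared.count())                      # 3
print(squared.reduce(lambda a, b: a + b))   # 56

sc.stop()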
Workflow of RDD
The workflow of RDD in Apache Spark begins with the creation of RDDs by loading
data from external sources or by parallelizing existing collections. Transformations
are then applied to existing RDDs to derive new RDDs. Each RDD is divided into logical
partitions, which enables parallel processing on different nodes of the cluster.
In summary, the RDD workflow in Apache Spark includes creating RDDs, applying
transformations, performing actions, partitioning the data for parallel processing,
and cleaning up RDDs when they are no longer needed. This workflow makes Apache
Spark a powerful framework for data analytics and processing tasks.
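One possible end-to-end sketch of this workflow in PySpark (the file path and app name are placeholders): load data, transform it, run actions, and release the RDD when it is no longer needed.

from pyspark import SparkContext

sc = SparkContext("local[*]", "rdd-workflow-demo")

# 1. Create an RDD from an external source (placeholder path), with 4 partitions.
logs = sc.textFile("data/access.log", minPartitions=4)

# 2. Apply transformations to derive new RDDs; cache the result for reuse.
errors = logs.filter(lambda line: "ERROR" in line).cache()

# 3. Perform actions; each action runs in parallel across the partitions.
print("error lines:", errors.count())
print("first few:", errors.take(5))

# 4. Clean up the cached RDD once it is no longer needed.
errors.unpersist()
sc.stop()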
How to create RDD?
In Apache Spark, RDDs are most commonly created in the following ways: by
parallelizing an existing in-memory collection, by loading data from an external
source, or by transforming an existing RDD, as shown in the sketch below.
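A brief PySpark sketch of these creation methods (the file path is a placeholder for any real dataset):

from pyspark import SparkContext

sc = SparkContext("local[*]", "rdd-creation-demo")

# 1. Parallelize an existing in-memory collection.
rdd_from_list = sc.parallelize(["spark", "rdd", "demo"])

# 2. Load an external dataset (text file, JSON lines, etc.).
rdd_from_file = sc.textFile("data/sample.txt")   # placeholder path

# 3. Transform an existing RDD; the result is itself a new RDD.
rdd_from_rdd = rdd_from_list.map(lambda w: w.upper())

print(rdd_from_rdd.collect())   # ['SPARK', 'RDD', 'DEMO']
sc.stop()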
THANKS