
SPARK RDD Transformations & Actions:

Queries and explanations:

1. map() ==> performs a one-to-one mapping, i.e. one input element gives exactly one output element
1.1 To convert input into an array
-->map(x=>x.split(",")) ==> it will split each input line on commas and convert it into an array
//input
//vaish,big data,reading,23

//output
//Array(vaish, big data, reading, 23)
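A minimal runnable sketch of 1.1 (assuming a SparkContext named sc, as in spark-shell; the sample line is made up):

// split each comma-separated line into an Array of fields (one output per input)
val lines = sc.parallelize(Seq("vaish,big data,reading,23"))
val fields = lines.map(line => line.split(","))
fields.collect().foreach(a => println(a.mkString("Array(", ", ", ")")))
// prints: Array(vaish, big data, reading, 23)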

1.2 To access particular elements of the rdd tuple


X.map(x => (x._1, x._3))

//input
//("vaish","passion","big data")

//output
//("vaish","big data")

1.3 To reverse elements of the rdd tuple


X.map(x => (x._2, x._1))

//input
//("vaish","big data")

//output
//("big data","vaish")

1.4 To access elements of a sub-tuple rdd (when the value is itself a tuple)


X.map(x => (x._1, x._2._1))
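A small sketch tying 1.2-1.4 together (assuming sc exists; the tuples are made up):

val triples = sc.parallelize(Seq(("vaish", "passion", "big data")))
triples.map(x => (x._1, x._3)).collect() // Array((vaish,big data)) -- pick fields
val pairs = sc.parallelize(Seq(("vaish", "big data")))
pairs.map(x => (x._2, x._1)).collect() // Array((big data,vaish)) -- reverse fields
val nested = sc.parallelize(Seq(("vaish", ("big data", 23))))
nested.map(x => (x._1, x._2._1)).collect() // Array((vaish,big data)) -- sub-tuple access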

1.5 To work directly on the values of a (key, value) pair rdd


X.mapValues(x => x._1.toDouble / x._2)
//input
//("vaish",(100,200))
//("vaish",(300,300))

//output
//("vaish",0.5)
//("vaish",1.0)

1.6 To add values to the rdd tuple


X.map(x => (x, 1))

("yashe") ("vaish") ==> ("yashe",1) ("vaish",1)

1.7 To perform map operations at the partition level


X.mapPartitions(iter => iter.map(x => (x, 1)))
1.8 To perform map operations at the partition level and track them
X.mapPartitionsWithIndex((index, iter) => iter.map(x => (x, index)))
Here we can track the index of the partition whose mapping operations are ongoing. Note that mapPartitions receives a whole partition as an Iterator, not a single element, as shown in the sketch below.
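A sketch of both partition-level operations (assuming sc; 2 partitions forced for illustration):

// mapPartitions receives each whole partition as an Iterator
val data = sc.parallelize(Seq("a", "b", "c", "d"), 2)
data.mapPartitions(iter => iter.map(x => (x, 1))).collect()
// mapPartitionsWithIndex also receives the partition's index, so we can track it
data.mapPartitionsWithIndex((idx, iter) => iter.map(x => (x, idx))).collect()
// e.g. Array((a,0), (b,0), (c,1), (d,1))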

2.flatMap() : performs a one-to-many mapping, i.e. one input can give many outputs


2.1 x.flatMap(x=>x.split(","))

("vaishu,yashe,anju") ==> ("vaishu") ("yashe") ("anju")

2.2 x.flatMap(x=>x.split(",")).map(x => (x, 1)) ==> to split and then add values to the rdd tuple

("vaishu,yashe,anju") ==> ("vaishu",1) ("yashe",1) ("anju",1)

2.3 To flatten the pairs based on values


rdd.flatMapValues(x=>x.split(" "))

(25, "big data course") => (25,big) (25,data) (25,course)
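A sketch of the flatMap family (assuming sc; the data is made up):

val lines = sc.parallelize(Seq("vaishu,yashe,anju"))
lines.flatMap(_.split(",")).collect() // Array(vaishu, yashe, anju)
lines.flatMap(_.split(",")).map(w => (w, 1)).collect() // Array((vaishu,1), (yashe,1), (anju,1))
// flatMapValues splits only the value, repeating the key on every output element
val kv = sc.parallelize(Seq((25, "big data course")))
kv.flatMapValues(_.split(" ")).collect() // Array((25,big), (25,data), (25,course))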

3.filter() : To filter tuples based on a condition


X.filter(x => x._2 == "data")

(1,data) (2,big) (3,data) ==> (1,data) (3,data)
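The same filter as a runnable sketch (assuming sc):

val rows = sc.parallelize(Seq((1, "data"), (2, "big"), (3, "data")))
rows.filter(x => x._2 == "data").collect() // Array((1,data), (3,data))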

4.reduceByKey() : To perform aggregations on a set of values based on the "key"


X.reduceByKey((x,y) => x + y)
X.reduceByKey((x,y) => (x._1 + y._1, x._2 + y._2))

(data,1) (big,1) (data,1) => (data,2) (big,1)

(data,(100,200)) (big,(1000,500)) (data,(500,300)) => (data,(600,500)) (big,(1000,500))
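Both reduceByKey forms as a sketch (assuming sc; the numbers are made up):

val words = sc.parallelize(Seq(("data", 1), ("big", 1), ("data", 1)))
words.reduceByKey((x, y) => x + y).collect() // Array((data,2), (big,1))
// when the value is a tuple, merge each field pairwise
val stats = sc.parallelize(Seq(("data", (100, 200)), ("big", (1000, 500)), ("data", (500, 300))))
stats.reduceByKey((x, y) => (x._1 + y._1, x._2 + y._2)).collect()
// Array((data,(600,500)), (big,(1000,500)))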

5.To sort the data


5.1 x.sortBy(x => x._2) //To sort data based on the second field in the tuple. It sorts in ascending order

5.2 x.sortBy(x => x._2, false) //It sorts the data in descending order

5.3 First swap the elements to make the "value" the "key": rdd.map(x => (x._2, x._1)). Now you can use the function
below
x.sortByKey() //It is a transformation (**the UI may even show it as an action, because internally it samples the data to compute partition ranges)
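A sorting sketch (assuming sc; the counts are made up):

val counts = sc.parallelize(Seq(("big", 3), ("data", 7), ("spark", 1)))
counts.sortBy(x => x._2).collect() // ascending by the second field
counts.sortBy(x => x._2, false).collect() // descending
counts.map(x => (x._2, x._1)).sortByKey().collect() // swap fields, then sort by the new key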

6.To join the tables


val joinRdd = newRdd.join(firstRdd)
newRdd ==> (chapter_id, user_id)
firstRdd ==> (chapter_id, course_id)

-->when you are joining 2 tables/files, make sure the first field is the "join field" (the key)
-->the resultant rdd will be a tuple of (chapter_id, (user_id, course_id))
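A join sketch (assuming sc; the ids and names are made up):

// join matches rows of the two pair rdds on their first field (the key)
val newRdd = sc.parallelize(Seq((1, "user_a"), (2, "user_b"))) // (chapter_id, user_id)
val firstRdd = sc.parallelize(Seq((1, "course_x"), (2, "course_y"))) // (chapter_id, course_id)
newRdd.join(firstRdd).collect()
// Array((1,(user_a,course_x)), (2,(user_b,course_y)))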

7. To get the count of each distinct value


val counts = rdd1.countByValue() //note: this is an action; it returns a Map to the driver, not an RDD
//input [5],[5],[5],[3],[5],[5],[3],[4]
//output Map(5 -> 5, 3 -> 2, 4 -> 1)
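As a sketch (assuming sc):

// countByValue is an action: it returns a plain Scala Map on the driver, not an RDD
val nums = sc.parallelize(Seq(5, 5, 5, 3, 5, 5, 3, 4))
nums.countByValue() // Map(5 -> 5, 3 -> 2, 4 -> 1)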

8. To increase/decrease the no. of partitions


rdd.repartition(10) //to increase the no. of partitions (does a full shuffle)
rdd.coalesce(3) //to decrease the no. of partitions (merges partitions, avoids a full shuffle)
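As a sketch (assuming sc):

val rdd = sc.parallelize(1 to 100, 5) // start with 5 partitions
rdd.repartition(10).getNumPartitions // 10 (full shuffle)
rdd.coalesce(3).getNumPartitions // 3 (no full shuffle)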

9. To group the output based on key


rdd.groupByKey() //groups all values for each key into one iterable; it takes no function (you will learn more on this in the next chapters)
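A groupByKey sketch (assuming sc; compare with reduceByKey above):

// groupByKey gathers all values per key into one iterable, then we sum them
val pairs = sc.parallelize(Seq(("data", 1), ("big", 1), ("data", 1)))
pairs.groupByKey().mapValues(_.sum).collect() // Array((data,2), (big,1))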

10. To perform union/intersection of data sets


rdd.union(dataset)
rdd.intersection(dataset)

11. To get distinct values from an rdd


rdd.distinct()
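A combined sketch of 10 and 11 (assuming sc; element order in the output may vary):

val a = sc.parallelize(Seq(1, 2, 3, 3))
val b = sc.parallelize(Seq(3, 4))
a.union(b).collect() // Array(1, 2, 3, 3, 3, 4) -- keeps duplicates
a.intersection(b).collect() // Array(3)
a.distinct().collect() // Array(1, 2, 3)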

ACTIONS:

1. To return all the output values from the transformations


rdd.collect().foreach(println) // collect the output to the driver and print each line

2. To return the first row from the output


rdd.first() //returns the first output row

3. To return n output rows


rdd.take(n)

4. To return the no. of elements


rdd.count()

5. To reduce the input to a single output value


rdd.reduce((x,y) => x + y) //reduce needs a function, e.g. one that sums all elements
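The actions above as one sketch (assuming sc):

val nums = sc.parallelize(Seq(10, 20, 30, 40))
nums.collect().foreach(println) // bring every element to the driver and print it
nums.first() // 10
nums.take(2) // Array(10, 20)
nums.count() // 4
nums.reduce((x, y) => x + y) // 100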

6. To save output into a file


6.1 x.saveAsTextFile("<output path>")
6.2 rdd.saveAsSequenceFile("path")
6.3 rdd.saveAsObjectFile("path")

7. To show the first 20 records of an output


df.show() //note: show() belongs to DataFrames/Datasets, not plain RDDs; for an RDD use rdd.take(20).foreach(println)

How to remove the header from a file using an RDD?


val fileRdd = sc.textFile("path")
val firstElement = fileRdd.first() // the header is returned
val filterRdd = fileRdd.filter(x => x != firstElement) //filter out the header
filterRdd.collect().foreach(println) //print all elements after the header is filtered out

Other transformations covered: repartition(), coalesce(), groupByKey(), distinct(), union(), intersection()


Other actions covered: count(), first(), take(), countByKey(), foreach(), saveAsSequenceFile(), saveAsObjectFile()

More on Spark RDDs:


https://spark.apache.org/docs/latest/rdd-programming-guide.html#transformations
KEYWORDS:
Spark: an in-memory processing engine
Why Spark is fast: due to fewer disk I/O reads and writes
RDD: the core data structure used to store data in Spark
When an RDD fails: using the lineage graph we track which RDD failed and reprocess it
Why RDDs are immutable: so they can be recovered after a failure, and so we can track which RDD failed
Operations in Spark: transformations and actions
Transformation: changes data from one form to another; lazy. Spark records the sequence of steps as a DAG (lineage).
Action: triggers execution of the accumulated transformations; not lazy.
Port number: localhost:4040 → Spark UI

**Note:
1 hdfs block = 1 rdd partition = 128 MB (default)
1 local file block = 1 rdd partition in a local spark cluster = 32 MB
1 rdd ~ can have n partitions in it
1 node = 1 machine; a cluster is a group of such machines
N cores = N tasks (partitions) can run in parallel on each machine
N wide transformations => N + 1 stages
N tasks in each stage = N partitions of that rdd/data frame in that stage
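A small word-count sketch to make the stage rule concrete (assuming sc; "input.txt" is a hypothetical path):

val wc = sc.textFile("input.txt") // hypothetical input file
  .flatMap(_.split(" ")) // narrow transformation
  .map(w => (w, 1)) // narrow transformation
  .reduceByKey(_ + _) // wide transformation => one stage boundary
wc.collect() // action: triggers the job, which runs in 2 stages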
