
SPARK TRANSFORMATIONS AND ACTIONS

Reading Data
Reading a csv file:
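
The original code isn't preserved in this copy; a minimal sketch, assuming a spark-shell session (which provides a SparkSession as spark) and a hypothetical people.csv with a header row:

scala> val df = spark.read.option("header", "true").csv("people.csv")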

Displaying Data

a. Displaying a csv file

b. Displaying a json file
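
The original code isn't preserved here either; minimal sketches for both items, reusing the hypothetical df above and a hypothetical people.json:

scala> df.show()                              // a. csv-backed DataFrame rendered as a table
scala> spark.read.json("people.json").show()  // b. json file read and displayed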

map

Returns a new distributed dataset, formed by passing each element of the source through a function
func.

scala> sc.parallelize(List(1,2,3)).map(x=>List(x,x,x)).collect
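
Each input element becomes a three-element list, so the result is Array(List(1, 1, 1), List(2, 2, 2), List(3, 3, 3)).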

flatMap

Similar to map, but each input item can be mapped to 0 or more output items (so func should return
a Seq rather than a single item).

scala> sc.parallelize(List(1,2,3)).flatMap(x=>List(x,x,x)).collect
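
Because flatMap flattens the per-element lists, the result here is Array(1, 1, 1, 2, 2, 2, 3, 3, 3) rather than an array of lists.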

filter
Returns a new dataset formed by selecting those elements of the source on which func returns true.

scala> val file = sc.textFile("catalina.txt")
scala> val errors = file.filter(line => line.contains("ERROR")).collect
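
Note that collect materializes every matching line on the driver; for a large log, an action such as count or saveAsTextFile is usually a better fit.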

mapPartitions

Similar to map, but runs separately on each partition (block) of the RDD, so func must be of type
Iterator[T] => Iterator[U] when running on an RDD of type T.

scala> val parallel = sc.parallelize(1 to 9, 3)
scala> parallel.mapPartitions( x => List(x.next).iterator).collect

scala> val parallel = sc.parallelize(1 to 9)
scala> parallel.mapPartitions( x => List(x.next).iterator).collect
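
With three explicit partitions, x.next takes the first element of each partition, so the first call returns Array(1, 4, 7); the second call's result depends on the shell's default parallelism.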

mapPartitionsWithIndex

Similar to mapPartitions, but also provides func with an integer value representing the index of the
partition, so func must be of type (Int, Iterator[T]) => Iterator[U] when running on an RDD of type T.

scala> val parallel = sc.parallelize(1 to 9)
scala> parallel.mapPartitionsWithIndex( (index: Int, it: Iterator[Int]) => it.toList.map(x => index + ", " + x).iterator).collect

scala> val parallel = sc.parallelize(1 to 9, 3)
scala> parallel.mapPartitionsWithIndex( (index: Int, it: Iterator[Int]) => it.toList.map(x => index + ", " + x).iterator).collect
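
Each element is tagged with the index of the partition holding it; with three explicit partitions the second call returns Array(0, 1, 0, 2, 0, 3, 1, 4, 1, 5, 1, 6, 2, 7, 2, 8, 2, 9), i.e. each value of 1 to 9 prefixed by its partition index, while the first call's grouping depends on the default parallelism.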

sample

Samples a fraction of the data, with or without replacement, optionally using a given random number
generator seed.

scala> val parallel = sc.parallelize(1 to 9)
scala> parallel.sample(true, .2).count
scala> parallel.sample(true, .1)
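
The fraction is the expected proportion of the sample, not an exact count, so sample(true, .2).count can differ from run to run; results are reproducible only if a seed is supplied as the optional third argument, e.g. sample(true, .2, 7L).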

union

Returns a new dataset that contains the union of the elements in the source dataset and the argument.

scala> val parallel = sc.parallelize(1 to 9)
scala> val par2 = sc.parallelize(5 to 15)
scala> parallel.union(par2).collect
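
union does not deduplicate, so the values 5 through 9 appear twice and the result has 20 elements.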

intersection

Returns a new RDD that contains the intersection of elements in the source dataset and the
argument.

scala> val parallel = sc.parallelize(1 to 9)
scala> val par2 = sc.parallelize(5 to 15)
scala> parallel.intersection(par2).collect
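
Only the overlapping values 5 through 9 survive; the ordering of the result is not guaranteed.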

distinct

Returns a new dataset that contains the distinct elements of the source dataset.

scala> val parallel = sc.parallelize(1 to 9)
scala> val par2 = sc.parallelize(5 to 15)
scala> parallel.union(par2).distinct.collect
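
The duplicated values 5 through 9 collapse to a single copy, leaving the 15 distinct values 1 through 15.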

join

When called on datasets of type (K, V) and (K, W), returns a dataset of (K, (V, W)) pairs with all pairs
of elements for each key. Outer joins are supported through leftOuterJoin, rightOuterJoin, and
fullOuterJoin.

scala> val names1 = sc.parallelize(List("abe", "abby", "apple")).map(a => (a, 1))
scala> val names2 = sc.parallelize(List("apple", "beatty", "beatrice")).map(a => (a, 1))
scala> names1.join(names2).collect
scala> names1.leftOuterJoin(names2).collect
scala> names1.rightOuterJoin(names2).collect
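
Only apple occurs in both RDDs, so join returns Array((apple,(1,1))). leftOuterJoin keeps abe and abby with None on the right (the values become (V, Option[W])), and rightOuterJoin keeps beatty and beatrice with None on the left.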

groupByKey

When called on a dataset of (K, V) pairs, returns a dataset of (K, Iterable[V]) pairs. Note: if you are
grouping in order to perform an aggregation (such as a sum or average) over each key, using
reduceByKey or aggregateByKey will yield much better performance.

scala> val babyNames = sc.textFile("baby_names.csv")
scala> val rows = babyNames.map(line => line.split(","))
scala> val namesToCounties = rows.map(name => (name(1), name(2)))
scala> namesToCounties.groupByKey.collect

scala> val data = sc.parallelize(Seq(("C", 3), ("A", 1), ("B", 4), ("A", 2), ("B", 5)))
scala> data.collect
scala> val groupfunc = data.groupByKey()
scala> groupfunc.collect
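
Each key is paired with all of its values, e.g. Array((A,CompactBuffer(1, 2)), (B,CompactBuffer(4, 5)), (C,CompactBuffer(3))); CompactBuffer is simply the Iterable implementation Spark returns.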

reduceByKey

When called on a dataset of (K, V) pairs, returns a dataset of (K, V) pairs where the values for each
key are aggregated using the given reduce function func, which must be of type (V, V) => V.

scala> val filteredRows = babyNames.filter(line => !line.contains("Count")).map(line => line.split(","))
scala> filteredRows.map(n => (n(1), n(4).toInt)).reduceByKey((v1, v2) => v1 + v2).collect

scala> val data = sc.parallelize(Array(("C", 3), ("A", 1), ("B", 4), ("A", 2), ("B", 5)))
scala> data.collect
scala> val reducefunc = data.reduceByKey((value, x) => (value + x))
scala> reducefunc.collect
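
For the small dataset, the per-key sums are Array((A,3), (B,9), (C,3)).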

aggregateByKey

When called on a dataset of (K, V) pairs, returns a dataset of (K, U) pairs where the values for each
key are aggregated using the given combine functions and a neutral "zero" value. Allows an
aggregated value type that is different from the input value type, while avoiding unnecessary
allocations.

scala> val babyNamesCSV = sc.parallelize(List(("David", 6), ("Abby", 4), ("David", 5), ("Abby", 5)))
scala> babyNamesCSV.aggregateByKey(0)((acc, v) => acc + v, (acc1, acc2) => acc1 + acc2).collect
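
With a zero value of 0 and addition as both the within-partition and cross-partition function, this produces the same per-key sums as reduceByKey (David -> 11, Abby -> 9). The operation earns its keep when the aggregated type differs from the value type; a minimal sketch (a hypothetical variation, not from the original) collecting each key's counts into a Set:

scala> babyNamesCSV.aggregateByKey(Set.empty[Int])((set, v) => set + v, (s1, s2) => s1 ++ s2).collect  // Array((David,Set(6, 5)), (Abby,Set(4, 5)))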


sortByKey

When called on a dataset of (K, V) pairs where K implements Ordered, returns a dataset of (K, V)
pairs sorted by keys in ascending or descending order, as specified in the Boolean ascending
argument.

scala> val filteredRows = babyNames.filter(line => !line.contains("Count")).map(line => line.split(","))
scala> filteredRows.map(n => (n(1), n(4))).sortByKey().foreach(println _)

scala> filteredRows.map(n => (n(1), n(4))).sortByKey(false).foreach(println _) // descending order
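
Note that foreach(println _) prints wherever each partition is processed; on a cluster the lines land in executor logs rather than the shell, so collect first if you want the output at the driver.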

scala> val data = sc.parallelize(Seq(("C", 3), ("A", 1), ("D", 4), ("B", 2), ("E", 5)))
scala> data.collect
scala> val sortfunc = data.sortByKey()
scala> sortfunc.collect
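
sortfunc.collect returns the pairs ordered by key: Array((A,1), (B,2), (C,3), (D,4), (E,5)).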
