
SPARK TRANSFORMATIONS AND ACTIONS

Reading Data
Reading a csv file:
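
The original code isn't preserved in this copy; a minimal sketch, assuming a spark-shell session (which provides a SparkSession as spark) and a hypothetical people.csv with a header row:

scala> val df = spark.read.option("header", "true").csv("people.csv")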

Displaying Data

a. Displaying a csv file

b. Displaying a json file
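
The original code isn't preserved here either; minimal sketches for both items, reusing the hypothetical df above and a hypothetical people.json:

scala> df.show()                              // a. csv-backed DataFrame rendered as a table
scala> spark.read.json("people.json").show()  // b. json file read and displayed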

map

Returns a new distributed dataset, formed by passing each element of the source through a function
func.

scala> sc.parallelize(List(1,2,3)).map(x=>List(x,x,x)).collect
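
Each input element becomes a three-element list, so the result is Array(List(1, 1, 1), List(2, 2, 2), List(3, 3, 3)).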

flatMap

Similar to map, but each input item can be mapped to 0 or more output items (so func should return
a Seq rather than a single item).

scala> sc.parallelize(List(1,2,3)).flatMap(x=>List(x,x,x)).collect
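
Because flatMap flattens the per-element lists, the result here is Array(1, 1, 1, 2, 2, 2, 3, 3, 3) rather than an array of lists.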

filter
Returns a new dataset formed by selecting those elements of the source on which func returns true.

scala> val file = sc.textFile("catalina.txt")
scala> val errors = file.filter(line => line.contains("ERROR")).collect
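
Note that collect materializes every matching line on the driver; for a large log, an action such as count or saveAsTextFile is usually a better fit.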

mapPartitions

Similar to map, but runs separately on each partition (block) of the RDD, so func must be of type
Iterator[T] => Iterator[U] when running on an RDD of type T.

scala> val parallel = sc.parallelize(1 to 9, 3)
scala> parallel.mapPartitions( x => List(x.next).iterator).collect

scala> val parallel = sc.parallelize(1 to 9)
scala> parallel.mapPartitions( x => List(x.next).iterator).collect
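
With three explicit partitions, x.next takes the first element of each partition, so the first call returns Array(1, 4, 7); the second call's result depends on the shell's default parallelism.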

mapPartitionsWithIndex

Similar to mapPartitions, but also provides func with an integer value representing the index of the
partition, so func must be of type (Int, Iterator[T]) => Iterator[U] when running on an RDD of type T.

scala> val parallel = sc.parallelize(1 to 9)
scala> parallel.mapPartitionsWithIndex( (index: Int, it: Iterator[Int]) => it.toList.map(x => index + ", " + x).iterator).collect

scala> val parallel = sc.parallelize(1 to 9, 3)
scala> parallel.mapPartitionsWithIndex( (index: Int, it: Iterator[Int]) => it.toList.map(x => index + ", " + x).iterator).collect
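
Each element is tagged with the index of the partition holding it; with three explicit partitions the second call returns Array(0, 1, 0, 2, 0, 3, 1, 4, 1, 5, 1, 6, 2, 7, 2, 8, 2, 9), i.e. each value of 1 to 9 prefixed by its partition index, while the first call's grouping depends on the default parallelism.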

sample

Samples a fraction of the data, with or without replacement, optionally using a given random number
generator seed.

scala> val parallel = sc.parallelize(1 to 9)
scala> parallel.sample(true, .2).count
scala> parallel.sample(true, .1)
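
The fraction is the expected proportion of the sample, not an exact count, so sample(true, .2).count can differ from run to run; results are reproducible only if a seed is supplied as the optional third argument, e.g. sample(true, .2, 7L).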

union

Returns a new dataset that contains the union of the elements in the source dataset and the argument.

scala> val parallel = sc.parallelize(1 to 9)
scala> val par2 = sc.parallelize(5 to 15)
scala> parallel.union(par2).collect
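
union does not deduplicate, so the values 5 through 9 appear twice and the result has 20 elements.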

intersection

Returns a new RDD that contains the intersection of elements in the source dataset and the
argument.

scala> val parallel = sc.parallelize(1 to 9)
scala> val par2 = sc.parallelize(5 to 15)
scala> parallel.intersection(par2).collect
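
Only the overlapping values 5 through 9 survive; the ordering of the result is not guaranteed.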

distinct

Returns a new dataset that contains the distinct elements of the source dataset.

scala> val parallel = sc.parallelize(1 to 9)
scala> val par2 = sc.parallelize(5 to 15)
scala> parallel.union(par2).distinct.collect
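
The duplicated values 5 through 9 collapse to a single copy, leaving the 15 distinct values 1 through 15.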

join

When called on datasets of type (K, V) and (K, W), returns a dataset of (K, (V, W)) pairs with all pairs
of elements for each key. Outer joins are supported through leftOuterJoin, rightOuterJoin, and
fullOuterJoin.

scala> val names1 = sc.parallelize(List("abe", "abby", "apple")).map(a => (a, 1))
scala> val names2 = sc.parallelize(List("apple", "beatty", "beatrice")).map(a => (a, 1))
scala> names1.join(names2).collect
scala> names1.leftOuterJoin(names2).collect
scala> names1.rightOuterJoin(names2).collect
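
Only apple occurs in both RDDs, so join returns Array((apple,(1,1))). leftOuterJoin keeps abe and abby with None on the right (the values become (V, Option[W])), and rightOuterJoin keeps beatty and beatrice with None on the left.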

groupByKey

When called on a dataset of (K, V) pairs, returns a dataset of (K, Iterable[V]) pairs. Note: if you are
grouping in order to perform an aggregation (such as a sum or average) over each key, using
reduceByKey or aggregateByKey will yield much better performance.

scala> val babyNames = sc.textFile("baby_names.csv")
scala> val rows = babyNames.map(line => line.split(","))
scala> val namesToCounties = rows.map(name => (name(1), name(2)))
scala> namesToCounties.groupByKey.collect

scala> val data = sc.parallelize(Seq(("C", 3), ("A", 1), ("B", 4), ("A", 2), ("B", 5)))
scala> data.collect
scala> val groupfunc = data.groupByKey()
scala> groupfunc.collect
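
Each key is paired with all of its values, e.g. Array((A,CompactBuffer(1, 2)), (B,CompactBuffer(4, 5)), (C,CompactBuffer(3))); CompactBuffer is simply the Iterable implementation Spark returns.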

reduceByKey

When called on a dataset of (K, V) pairs, returns a dataset of (K, V) pairs where the values for each
key are aggregated using the given reduce function func, which must be of type (V, V) => V.

scala> val filteredRows = babyNames.filter(line => !line.contains("Count")).map(line => line.split(","))
scala> filteredRows.map(n => (n(1), n(4).toInt)).reduceByKey((v1, v2) => v1 + v2).collect

scala> val data = sc.parallelize(Array(("C", 3), ("A", 1), ("B", 4), ("A", 2), ("B", 5)))
scala> data.collect
scala> val reducefunc = data.reduceByKey((value, x) => (value + x))
scala> reducefunc.collect
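
For the small dataset, the per-key sums are Array((A,3), (B,9), (C,3)).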

aggregateByKey

When called on a dataset of (K, V) pairs, returns a dataset of (K, U) pairs where the values for each
key are aggregated using the given combine functions and a neutral "zero" value. Allows an
aggregated value type that is different from the input value type, while avoiding unnecessary
allocations.

scala> val babyNamesCSV = sc.parallelize(List(("David", 6), ("Abby", 4), ("David", 5), ("Abby", 5)))
scala> babyNamesCSV.aggregateByKey(0)((acc, v) => acc + v, (acc1, acc2) => acc1 + acc2).collect
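
With a zero value of 0 and addition as both the within-partition and cross-partition function, this produces the same per-key sums as reduceByKey (David -> 11, Abby -> 9). The operation earns its keep when the aggregated type differs from the value type; a minimal sketch (a hypothetical variation, not from the original) collecting each key's counts into a Set:

scala> babyNamesCSV.aggregateByKey(Set.empty[Int])((set, v) => set + v, (s1, s2) => s1 ++ s2).collect  // Array((David,Set(6, 5)), (Abby,Set(4, 5)))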


sortByKey

When called on a dataset of (K, V) pairs where K implements Ordered, returns a dataset of (K, V)
pairs sorted by keys in ascending or descending order, as specified in the Boolean ascending
argument.

scala> val filteredRows = babyNames.filter(line => !line.contains("Count")).map(line => line.split(","))
scala> filteredRows.map(n => (n(1), n(4))).sortByKey().foreach(println _)

scala> filteredRows.map(n => (n(1), n(4))).sortByKey(false).foreach(println _) // descending order
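
Note that foreach(println _) prints wherever each partition is processed; on a cluster the lines land in executor logs rather than the shell, so collect first if you want the output at the driver.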

scala> val data = sc.parallelize(Seq(("C", 3), ("A", 1), ("D", 4), ("B", 2), ("E", 5)))
scala> data.collect
scala> val sortfunc = data.sortByKey()
scala> sortfunc.collect
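
sortfunc.collect returns the pairs ordered by key: Array((A,1), (B,2), (C,3), (D,4), (E,5)).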
