3 - Spark
Introduction to SPARK
Distributed Computing vs. Parallel Computing
Distributed Computing Benefits
Spark for Distributed Computing
Spark vs. MapReduce
MapReduce
Apache Spark
Functional Programming
Functional Programming
Functional Programming – Passing Operations
Scala
Python
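A minimal Python sketch of passing an operation (a function) as an argument; the names increment and apply_twice are illustrative, not from the slides.
def increment(x):
    return x + 1

def apply_twice(f, value):
    # f is an operation passed in as data
    return f(f(value))

print(apply_twice(increment, 3))        # 5
print(apply_twice(lambda x: x * 2, 3))  # 12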
Functional Programming – using Arrays
Scala
Python
add = lambda x, y: x + y
numbers = [(1,2), (3,4), (5,6)]
sums = list(map(lambda pair: add(pair[0], pair[1]), numbers))
print(sums) # Prints: [3, 7, 11]
Unified Platform for Big Data Processing
Why Unification?
Spark Abstractions
Word Count in Spark
import sys
from pyspark import SparkContext
sc = SparkContext(appName="PythonWordCount")
lines = sc.textFile(sys.argv[1])
counts = lines.flatMap(lambda line: line.split()).map(lambda w: (w, 1)).reduceByKey(lambda a, b: a + b)
output = counts.collect()
RDD Transformations
Map vs. FlatMap
Input (for both): "Hello world", "This is a test"
Example operation with map: splits each line into words (one list of words per line).
Example operation with flatMap: splits each line into words and merges them into a single collection.
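A small PySpark sketch of the difference, assuming an existing SparkContext sc; the sample lines are made up.
lines = sc.parallelize(["Hello world", "This is a test"])
# map: one output element per input element (a list of words per line)
print(lines.map(lambda line: line.split()).collect())
# [['Hello', 'world'], ['This', 'is', 'a', 'test']]
# flatMap: the per-line lists are flattened into one collection of words
print(lines.flatMap(lambda line: line.split()).collect())
# ['Hello', 'world', 'This', 'is', 'a', 'test']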
Transformation: flatMap
Transformation: filter
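A minimal filter sketch, assuming an existing SparkContext sc.
nums = sc.parallelize([1, 2, 3, 4, 5, 6])
# keep only the elements for which the predicate returns True
evens = nums.filter(lambda n: n % 2 == 0)
print(evens.collect())  # [2, 4, 6]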
Transformation: map
Transformation: reduceByKey
RDD Actions
Action: collect
output = counts.collect()
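Besides collect, a few other common actions, sketched on the counts RDD from the word-count example; the output path is hypothetical.
print(counts.count())   # number of elements in the RDD
print(counts.take(3))   # first 3 elements, returned to the driver
counts.saveAsTextFile("word_counts_out")  # write the RDD as text files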
Lazy Execution
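A short sketch of lazy execution, assuming an existing SparkContext sc; data.txt is a hypothetical file.
rdd = sc.textFile("data.txt")               # nothing is read yet
upper = rdd.map(lambda line: line.upper())  # still nothing runs; only the lineage is recorded
print(upper.count())                        # this action triggers reading the file and applying the map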
RDD Lineage
Application
mydata_filt = sc.textFile('file.txt') \
    .map(lambda line: line.upper()) \
    .filter(lambda line: line.startswith('I'))
Lineage
print(mydata_filt.toDebugString())
Pipelining
Pipelining
Pipelining
Creating RDDs
Creating RDDs from Collections
SparkContext.parallelize(collection)
mydata = ['Alice', 'Jack', 'Andrew', 'Frank']
myRDD = sc.parallelize(mydata)
myRDD.take(2)
Output: ['Alice', 'Jack']
Useful for:
Testing
Integrating
Creating RDDs from Files
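A minimal sketch, assuming an existing SparkContext sc; the path is hypothetical.
# one RDD element per line of the input file(s)
logs = sc.textFile("hdfs:///data/logs/")
print(logs.take(5))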
Creating RDDs from Other File Formats
Using Input/output Formats
Example:
input_rdd = sc.newAPIHadoopFile(
    "path/to/textfile.txt",
    "org.apache.hadoop.mapreduce.lib.input.TextInputFormat",
    "org.apache.hadoop.io.LongWritable",
    "org.apache.hadoop.io.Text")
Whole-file-based RDDs
sc.textFile puts each line in a separate element.
What if you are processing XML or JSON files?
sc.wholeTextFiles(directory) creates a single RDD element for the whole content of each file in the input directory.
This creates a special type of RDD (a pair RDD), discussed later.
Works only for small files (each element must fit in memory).
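A sketch of wholeTextFiles, assuming an existing SparkContext sc; the directory path is hypothetical.
# each element is a (filename, whole-file-content) pair
files = sc.wholeTextFiles("hdfs:///data/json_files/")
line_counts = files.mapValues(lambda content: content.count("\n"))
print(line_counts.take(2))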
RDD Content
Other General RDD Transformations
zip: [1, 2, 3] zip ['a', 'b', 'c'] → [(1, 'a'), (2, 'b'), (3, 'c')]
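A sketch of zip on two RDDs, assuming an existing SparkContext sc; zip requires both RDDs to have the same number of partitions and elements.
nums = sc.parallelize([1, 2, 3], 2)
letters = sc.parallelize(['a', 'b', 'c'], 2)
print(nums.zip(letters).collect())  # [(1, 'a'), (2, 'b'), (3, 'c')]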
Pair RDDs
Creating Pair RDDs
Your root RDD can already be a pair RDD, e.g., one created with sc.wholeTextFiles
You can use a transformation to put the data into (key, value) pairs, e.g.:
map
flatMap/flatMapValues
keyBy
Example
users = sc.textFile(file) \
    .map(lambda line: line.split('\t')) \
    .map(lambda elems: (elems[0], elems[1]))
# alternatively, key each line by its first tab-separated field (the whole line remains the value)
users = sc.textFile(file).keyBy(lambda line: line.split('\t')[0])
Transformation: reduceByKey
Other Pair RDD Transformations
countByKey
Returns the number of occurrences of each key (in Spark this is an action: the counts come back to the driver as a map)
groupByKey
Similar to the input of a Hadoop Reducer: (key, [list of values])
sortByKey(ascending=True/False)
Returns a pair RDD sorted by the key
join
Takes two input pair RDDs with the same keys: (key, value1) and (key, value2)
Returns (key, (value1, value2))
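Short sketches of these operations, assuming an existing SparkContext sc; the sample data is made up.
pairs = sc.parallelize([("a", 1), ("b", 2), ("a", 3)])
print(pairs.countByKey())                            # roughly {'a': 2, 'b': 1}, returned to the driver
print(pairs.groupByKey().mapValues(list).collect())  # [('a', [1, 3]), ('b', [2])] (order may vary)
print(pairs.sortByKey(ascending=False).collect())    # pairs sorted by key in descending order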
Examples
Example: join by key
orders = orderItems.join(orderTotals)
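A fuller sketch of the join with made-up order data; the RDD names follow the slide.
orderItems = sc.parallelize([(1, "keyboard"), (1, "mouse"), (2, "monitor")])
orderTotals = sc.parallelize([(1, 49.99), (2, 180.00)])
orders = orderItems.join(orderTotals)
print(orders.collect())
# [(1, ('keyboard', 49.99)), (1, ('mouse', 49.99)), (2, ('monitor', 180.0))] (order may vary)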
Running a Spark Job on YARN
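A typical submission command, sketched under the assumption that the script is wordcount.py and the input lives on HDFS; exact options depend on the cluster.
spark-submit --master yarn --deploy-mode cluster --num-executors 4 --executor-memory 2g wordcount.py hdfs:///data/input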
RDD Partitions
Parallel Operations on Partitions
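A small sketch of inspecting partitions and running a per-partition operation, assuming an existing SparkContext sc; the path is hypothetical.
rdd = sc.textFile("hdfs:///data/logs/", minPartitions=4)
print(rdd.getNumPartitions())

# mapPartitions runs the function once per partition rather than once per element
def count_lines(partition):
    yield sum(1 for _ in partition)

print(rdd.mapPartitions(count_lines).collect())  # one line count per partition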
Stages
All operations that can work on the same data partition are grouped into a stage.
Tasks within a stage are pipelined together.
Spark divides the DAG of the job into stages.
How does Spark calculate stages? Based on RDD dependencies:
Narrow dependencies
Only one child depends on the RDD
No shuffle required
Wide (shuffle) dependencies
Multiple children depend on the RDD
Defines a new stage
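A sketch showing the two kinds of dependencies, assuming an existing SparkContext sc; toDebugString prints the lineage, including the shuffle introduced by reduceByKey.
words = sc.parallelize(["a", "b", "a", "c"])
pairs = words.map(lambda w: (w, 1))             # narrow dependency: stays in the same stage
counts = pairs.reduceByKey(lambda x, y: x + y)  # wide (shuffle) dependency: starts a new stage
print(counts.toDebugString())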
Task Pipelining
RDD Persistence