Lecture 25
HTTP://WWW.CS.CORNELL.EDU/COURSES/CS5412/2018SP
Recap
MapReduce
• A framework for easily writing applications that process vast amounts of data in parallel on large clusters in a reliable, fault-tolerant manner
• Takes care of scheduling tasks, monitoring them, and re-executing failed tasks
HDFS & MapReduce: running on the same set of nodes; compute nodes and storage nodes are the same (keeping data close to the computation) → very high throughput
YARN & MapReduce: a single master resource manager, one slave node manager per node, and one AppMaster per application
Today’s Topics
•Motivation
•Spark Basics
•Spark Programming
History of Hadoop and Spark
Apache Spark
** Spark can connect to several types of cluster managers (either
Spark’s own standalone cluster manager, Mesos or YARN)
[Figure: Hadoop vs. Spark]
Apache Hadoop Lacks a Unified Vision
• Sparse Modules
• Diversity of APIs
• Higher Operational Costs
Spark Ecosystem: A Unified Pipeline
Note: Spark is not designed for real-time IoT workloads. The streaming layer is used for continuous input streams, such as financial data from stock markets, where events occur steadily and must be processed as they occur; there is no notion of direct I/O from sensors/actuators. For such IoT use cases, Spark would not be suitable.
Key ideas
In Hadoop, each developer tends to invent his or her own style of work
With Spark, there is a serious effort to standardize around the idea that people are writing parallel code that often runs for many “cycles” or “iterations” in which a lot of reuse of information occurs.
How this works
You express your application as a graph of RDDs.
The graph is only evaluated as needed, and Spark only computes the RDDs actually needed for the output you have requested.
Spark can then be told to cache the reusable information either in memory, in SSD storage, or even on disk, based on when it will be needed again, how big it is, and how costly it would be to recreate.
You write the RDD logic and control all of this via hints (a small sketch follows below).
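As a minimal sketch of such caching hints (assuming the Spark Shell's preconfigured `sc`; the input file and the particular storage levels are illustrative choices, not prescribed by the lecture):

from pyspark import StorageLevel

words = sc.textFile("hamlet.txt").flatMap(lambda line: line.split(" "))  # illustrative input
words.persist(StorageLevel.MEMORY_AND_DISK)   # hint: keep in RAM, spill to disk if it does not fit
# words.persist(StorageLevel.DISK_ONLY)       # alternative hint for large, rarely reused data
print(words.count())               # first action computes the RDD and caches it
print(words.distinct().count())    # second job reuses the cached partitions
words.unpersist()                  # drop the hint once the data is no longer needed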
Motivation (1)
MapReduce: The original scalable, general processing engine of the Hadoop ecosystem
• Disk-based data processing framework (HDFS files)
• Persists intermediate results to disk
• Data is reloaded from disk with every query → Costly I/O
• Best for ETL-like workloads (batch processing)
• Costly I/O → not appropriate for iterative or stream processing workloads
Motivation (2)
Spark: General purpose computational framework that
substantially improves performance of MapReduce, but
retains the basic model
• Memory-based data processing framework → avoids costly I/O by keeping intermediate results in memory
• Leverages distributed memory
• Remembers operations applied to the dataset
• Data-locality-based computation → high performance
• Best for both iterative (or stream processing) and batch workloads
Motivation - Summary
Software engineering point of view
Hadoop code base is huge
Contributions/extensions to Hadoop are cumbersome
Java-only hinders wide adoption, but Java support is fundamental
System/framework point of view
Unified pipeline
Simplified data flow
Faster processing speed
Data abstraction point of view
New fundamental abstraction: RDD
Easy to extend with new operators
More descriptive computing model
Today’s Topics
•Motivation
•Spark Basics
•Spark Programming
Spark Basics (1)
Spark: Flexible, in-memory data processing framework written in Scala
Goals:
• Simplicity (easier to use): rich APIs for Scala, Java, and Python
• Generality: APIs for different types of workloads (batch, streaming, machine learning, graph)
• Low latency (performance): in-memory processing and caching
• Fault tolerance: faults shouldn't be a special case
Spark Basics (2)
There are two ways to manipulate data in Spark:
• Spark Shell: interactive, for learning or data exploration (Python or Scala)
• Spark Applications: for large-scale data processing (Python, Scala, or Java)
Spark Core: Code Base (2012)
Spark Shell
The Spark Shell provides interactive data exploration (REPL)
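For instance (a hypothetical session; the file name is illustrative), after launching bin/pyspark the shell has already created `sc`, and data can be explored directly:

>>> lines = sc.textFile("README.md")                  # `sc` is pre-created by the shell
>>> lines.count()                                     # number of lines in the file
>>> lines.filter(lambda l: "Spark" in l).count()      # lines mentioning "Spark"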
Spark Fundamentals
Example of an
application:
• Spark Context
• Resilient Distributed
Data
• Transformations
• Actions
Spark Context (1)
•Every Spark application requires a spark context: the main
entry point to the Spark API
•Spark Shell provides a preconfigured Spark Context called “sc”
Spark Context (2)
• In a standalone application, the driver code creates a Spark Context
• The Spark Context holds configuration information and represents the connection to a Spark cluster
[Figure: a standalone application (which drives the computation) connecting to a Spark cluster through its Spark Context]
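A minimal sketch of such driver code (the application name and master URL are placeholders, not taken from the slides):

from pyspark import SparkConf, SparkContext

conf = SparkConf().setAppName("MyApp").setMaster("local[*]")  # placeholder configuration
sc = SparkContext(conf=conf)        # represents the connection to the cluster
rdd = sc.parallelize(range(10))     # work is expressed against this context
print(rdd.sum())
sc.stop()                           # close the connection when the application is done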
Spark Context (3)
The Spark Context works as a client and represents the connection to a Spark cluster
Spark Fundamentals
Example of an application:
• Spark Context
• Resilient Distributed
Data
• Transformations
• Actions
Resilient Distributed Dataset
RDD (Resilient Distributed Dataset) is the fundamental unit of data in Spark: an immutable collection of objects (or records, or elements) that can be operated on “in parallel” (spread across a cluster)
Resilient -- if data in memory is lost, it can be recreated
• Recover from node failures
• An RDD keeps its lineage information, so it can be recreated from parent RDDs
Distributed -- processed across the cluster
• Each RDD is composed of one or more partitions (more partitions, more parallelism)
Dataset -- initial data can come from a file or be created in memory
RDDs
Key Idea: Write applications in terms of transformations
on distributed datasets. One RDD per transformation.
• Organize the RDDs into a DAG showing how data flows.
• RDD can be saved and reused or recomputed. Spark can
save it to disk if the dataset does not fit in memory
• Built through parallel transformations (map, filter, group-by,
join, etc). Automatically rebuilt on failure
• Controllable persistence (e.g. caching in RAM)
RDDs are designed to be “immutable”
• Create once, then reuse without changes. Spark knows the lineage, so an RDD can be recreated at any time → fault tolerance
• Avoids data inconsistency problems (no simultaneous updates) → correctness
• Can live in memory as easily as on disk → caching; safe to share across processes/tasks → improves performance
• Tradeoff: (fault tolerance & correctness) vs (disk, memory & CPU)
Creating a RDD
Three ways to create an RDD (a small sketch follows below):
• From a file or set of files
• From data in memory
• From another RDD
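A minimal sketch of the three paths (assuming the shell's `sc`; names and values are illustrative):

rdd_from_file = sc.textFile("mydata.txt")            # 1. from a file (one element per line)
rdd_in_memory = sc.parallelize([1, 2, 3, 4, 5])      # 2. from data already in memory
rdd_from_rdd = rdd_in_memory.map(lambda x: x * 2)    # 3. from another RDD, via a transformation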
Example: A File-based RDD
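The slide's original code figure is not preserved here; a plausible minimal equivalent (file name illustrative):

mydata = sc.textFile("mydata.txt")     # each line of the file becomes one RDD element
print(mydata.count())                  # action: number of lines in the file
print(mydata.take(2))                  # action: first two lines, returned to the driver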
Spark Fundamentals
Example of an application:
• Spark Context
• Resilient Distributed
Data
• Transformations
• Actions
RDD Operations
Two types of operations:
• Transformations: define a new RDD based on the current RDD(s)
• Actions: return values
RDD Transformations
•Set of operations on an RDD that define how it should be transformed
•As in relational algebra, applying a transformation to an RDD yields a new RDD (because RDDs are immutable)
•Transformations are lazily evaluated, which allows optimizations to take place before execution
•Examples: map(), filter(), groupByKey(), sortByKey(), etc.
Example: map and filter Transformations
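The slide's figure is not reproduced here; a minimal equivalent sketch of the two transformations (assuming the shell's `sc`):

nums = sc.parallelize([1, 2, 3, 4])
squares = nums.map(lambda x: x * x)            # transformation: 1, 4, 9, 16
evens = squares.filter(lambda x: x % 2 == 0)   # transformation: 4, 16
print(evens.collect())                         # action: [4, 16] returned to the driver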
RDD Actions
• Apply transformation chains on RDDs, eventually performing
some additional operations (e.g., counting)
• Some actions only store data to an external data source (e.g.
HDFS), others fetch data from the RDD (and its transformation
chain) upon which the action is applied, and convey it to the
driver
• Some common actions
count() – return the number of elements
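As a small sketch of the two kinds of actions described above (the output path is illustrative):

data = sc.parallelize(["a", "b", "a"])
print(data.count())                  # 3: fetches a value from the RDD and conveys it to the driver
data.saveAsTextFile("out_dir")       # stores the RDD's elements to an external destination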
Lazy Execution of RDDs
Data in RDDs is not processed until an action is performed
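A small sketch of this laziness (assuming the shell's `sc`; the file name is illustrative):

lines = sc.textFile("mydata.txt")              # nothing is read yet
upper = lines.map(lambda l: l.upper())         # still nothing: only the lineage is recorded
print(upper.toDebugString())                   # shows the recorded lineage without running it
print(upper.count())                           # the action finally triggers reading and mapping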
Example: Mine e rror logs
Load error messages from a log into memory, then interactively search for various patterns:
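The slide's code is not preserved; the following is a sketch in the same spirit as the well-known log-mining example (the file path and search patterns are illustrative):

lines = sc.textFile("server.log")
errors = lines.filter(lambda l: l.startswith("ERROR"))
errors.cache()                                           # keep the error lines in memory
print(errors.count())                                    # materializes and caches the RDD
print(errors.filter(lambda l: "timeout" in l).count())   # interactive search, reusing the cache
print(errors.filter(lambda l: "disk" in l).count())      # another pattern, served from memory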
Key Idea: Elastic parallelism
Spark will spread the task over the nodes where the data resides, offering a highly concurrent execution that minimizes delays. Term: “partitioned computation”.
If some component crashes or even is just slow, Spark simply kills that task and launches a substitute.
RDD and Partitions (Parallelism example)
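The slide's figure is not preserved; a minimal sketch of the idea (the partition count is illustrative):

rdd = sc.parallelize(range(1000), 8)   # request 8 partitions explicitly
print(rdd.getNumPartitions())          # 8: up to 8 tasks can process this RDD in parallel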
RDD Graph: Data Set vs Partition Views
Much like in Hadoop MapReduce, each RDD is associated with (input) partitions
RDDs: Data Locality
•Data Locality Principle
Keep high-value RDDs precomputed, in cache or on SSD
Run tasks that need the specific RDD with those same inputs
on the node where the cached copy resides.
This can maximize in-memory computational performance.
Lifetime of a Job in Spark
Anatomy of a Spark Application
[Figure: anatomy of a Spark application, including the cluster manager (YARN/Mesos)]
Typical RDD pattern of use
Instead of doing a lot of work in each RDD, developers split tasks into lots of small RDDs
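For example (an illustrative sketch, not from the slide), each small step gets its own named RDD, so Spark can cache and reuse intermediate results:

raw = sc.textFile("events.log")                             # illustrative input
fields = raw.map(lambda line: line.split(","))              # small step: parse
valid = fields.filter(lambda f: len(f) == 3)                # small step: clean
by_user = valid.map(lambda f: (f[0], 1)).reduceByKey(lambda a, b: a + b)  # small step: aggregate
valid.cache()                                               # a reusable intermediate RDD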
Why is this a good strategy?
Spark tries to run tasks that will need the same intermediary data on the same nodes.
If MapReduce jobs were arbitrary programs, this wouldn’t help because reuse would be very rare.
But in fact the MapReduce model is very repetitious and iterative, and often applies the same transformations again and again to the same input files.
Those particular RDDs become great candidates for caching.
The MapReduce programmer may not know how many iterations will occur, but Spark itself is smart enough to evict RDDs if they don’t actually get reused.
Iterative Algorithms: Spark vs MapReduce
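As a sketch of why caching matters for iteration (a hypothetical computation, not the benchmark behind the slide): the parsed input is cached once, and every pass reuses it from memory instead of re-reading it from disk, which is where MapReduce pays repeated I/O.

nums = sc.textFile("numbers.txt").map(float)   # illustrative input: one number per line
nums.cache()                                   # parsed data stays in memory across iterations
n = nums.count()                               # first action materializes and caches the RDD
guess = 0.0
for i in range(10):                            # each pass reuses the cached RDD
    error = nums.map(lambda x: x - guess).reduce(lambda a, b: a + b) / n
    guess += 0.5 * error                       # nudge the guess toward the mean
print(guess)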
Today’s Topics
•Motivation
•Spark Basics
•Spark Programming
Spark Programming (1)
Creating RDDs
# Turn a Python collection into an RDD
sc.parallelize([1, 2, 3])
Spark Programming (3)
Basic Actions
nums = sc.parallelize([1, 2, 3])
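The rest of the slide is not preserved; a hedged sketch of standard basic actions on such an RDD (the output path is illustrative):

print(nums.collect())                     # [1, 2, 3]: retrieve all elements to the driver
print(nums.take(2))                       # [1, 2]: the first two elements
print(nums.count())                       # 3: number of elements
print(nums.reduce(lambda x, y: x + y))    # 6: aggregate the elements
nums.saveAsTextFile("nums_out")           # write the elements out as text files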
Example: Word Count
lines = sc.textFile("hamlet.txt")
counts = (lines.flatMap(lambda line: line.split(" "))
               .map(lambda word: (word, 1))
               .reduceByKey(lambda x, y: x + y))
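The slide stops at the transformation chain; nothing runs until an action is applied, e.g. (output path illustrative):

counts.saveAsTextFile("hamlet_counts")    # triggers the whole flatMap/map/reduceByKey chain
# or: print(counts.take(5))               # inspect a few (word, count) pairs in the driver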
Example: Spark Streaming
Represents streams as a series of RDDs over time (typically sub-second intervals, but it is configurable)
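A minimal DStream sketch of that idea (the host, port, and 1-second batch interval are illustrative; assumes the shell's `sc`):

from pyspark.streaming import StreamingContext

ssc = StreamingContext(sc, 1)                     # each 1-second batch becomes one RDD
lines = ssc.socketTextStream("localhost", 9999)   # hypothetical text source
counts = (lines.flatMap(lambda l: l.split(" "))
               .map(lambda w: (w, 1))
               .reduceByKey(lambda a, b: a + b))
counts.pprint()                                   # print each batch's results
ssc.start()
ssc.awaitTermination()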
Spark: Combining Libraries (Unified Pipeline)
# Load data using Spark SQL
points = spark.sql("select latitude, longitude from tweets")
# Apply it to a stream (illustrative pseudocode: twitterStream is not a built-in SparkContext method)
sc.twitterStream(...)
  .map(lambda t: (model.predict(t.location), 1))
  .reduceByWindow("5s", lambda a, b: a + b)
# Passing a number (here 5) to shuffle operations sets how many tasks/partitions are used
words.reduceByKey(lambda x, y: x + y, 5)
words.groupByKey(5)
visits.join(pageViews, 5)
Summary