Lecture 25

Spark is a flexible in-memory data processing framework that improves on MapReduce by keeping intermediate results in memory instead of disk. This avoids costly I/O and allows Spark to be faster than MapReduce for both iterative and streaming workloads. The core abstraction in Spark is the resilient distributed dataset (RDD), which allows data to be distributed across a cluster and operated on in parallel. RDDs can be transformed using operations like map, filter, and join and cached in memory for reuse.


CS5412 / Lecture 25: Apache Spark and RDDs

Kishore Pusukuri, Spring 2019

HTTP://WWW.CS.CORNELL.EDU/COURSES/CS5412/2018SP 1
Recap
MapReduce
• A framework for easily writing applications that process vast amounts of data in parallel on large clusters in a reliable, fault-tolerant manner
• Takes care of scheduling tasks, monitoring them, and re-executing failed tasks
HDFS & MapReduce: running on the same set of nodes → compute nodes and storage nodes are the same (keeping data close to the computation) → very high throughput
YARN & MapReduce: a single master resource manager, one slave node manager per node, and one AppMaster per application
2
Today’s Topics
•Motivation
•Spark Basics
•Spark Programming

3
History of Hadoop and Spark

4
Apache Spark
** Spark can connect to several types of cluster managers (either
Spark’s own standalone cluster manager, Mesos or YARN)

[Figure: the Hadoop and Spark stacks side by side.
Processing layer: Spark Streaming, Spark SQL, Spark ML, and other applications, all on top of Spark Core.
Resource managers: Spark's standalone scheduler, Mesos, or Yet Another Resource Negotiator (YARN).
Data storage: Hadoop Distributed File System (HDFS), S3, Cassandra, NoSQL databases such as HBase, and other storage systems.
Data ingestion systems: e.g., Apache Kafka, Flume.]
5
Apache Hadoop Lacks a Unified Vision

• Sparse Modules
• Diversity of APIs
• Higher Operational Costs
6
Spark Ecosystem: A Unified Pipeline

Note: Spark is not designed for IoT real-time processing. The streaming layer is used for
continuous input streams such as financial data from stock markets, where events arrive
steadily and must be processed as they occur; there is no support for direct I/O with
sensors or actuators. For such IoT use cases, Spark would not be suitable.
7
Key ideas

In Hadoop, each developer tends to invent his or her own style of work.

With Spark, there is a serious effort to standardize around the idea that people are
writing parallel code that often runs for many “cycles” or “iterations” in
which a lot of reuse of information occurs.

Spark centers on Resilient Distributed Datasets (RDDs), which capture the
information being reused.

8
How this works
You express your application as a graph of RDDs.

The graph is evaluated only as needed, and Spark computes only the RDDs
actually needed for the output you have requested.

Then Spark can be told to cache the reusable information either in
memory, in SSD storage, or even on disk, based on when it will be needed
again, how big it is, and how costly it would be to recreate.

You write the RDD logic and control all of this via hints (a sketch follows below).
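A minimal PySpark sketch of such a hint, assuming a SparkContext named sc (as in the Spark shell); the path and names are illustrative only:

from pyspark import StorageLevel

# Build an RDD lazily; nothing is read yet.
logs = sc.textFile("hdfs://namenode:9000/logs/*.txt")   # hypothetical path
errors = logs.filter(lambda line: line.startswith("ERROR"))

# Hint: keep this RDD around, spilling to disk if it does not fit in memory.
errors.persist(StorageLevel.MEMORY_AND_DISK)

# The first action materializes and caches; later actions reuse the cached copy.
print(errors.count())
print(errors.take(10))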
9
Motivation (1)
MapReduce: the original scalable, general processing
engine of the Hadoop ecosystem
• Disk-based data processing framework (HDFS files)
• Persists intermediate results to disk
• Data is reloaded from disk with every query → costly I/O
• Best for ETL-like workloads (batch processing)
• Costly I/O → not appropriate for iterative or stream processing workloads

10
Motivation (2)
Spark: a general-purpose computational framework that
substantially improves the performance of MapReduce, but
retains the basic model
• Memory-based data processing framework → avoids costly I/O by keeping intermediate results in memory
• Leverages distributed memory
• Remembers the operations applied to a dataset
• Data-locality-based computation → high performance
• Best for both iterative (or stream processing) and batch workloads
11
Motivation - Summary
Software engineering point of view
• Hadoop code base is huge
• Contributions/extensions to Hadoop are cumbersome
• Java-only hinders wide adoption, but Java support is fundamental
System/framework point of view
• Unified pipeline
• Simplified data flow
• Faster processing speed
Data abstraction point of view
• New fundamental abstraction: RDD
• Easy to extend with new operators
• More descriptive computing model

12
Today’s Topics
•Motivation
•Spark Basics
•Spark Programming

13
Spark Basics (1)
Spark: a flexible, in-memory data processing framework written in Scala
Goals:
• Simplicity (easier to use):
  - Rich APIs for Scala, Java, and Python
• Generality: APIs for different types of workloads
  - Batch, Streaming, Machine Learning, Graph
• Low latency (performance): in-memory processing and caching
• Fault-tolerance: faults shouldn't be a special case
14
Spark Basics (2)
There are two ways to manipulate data in Spark
• Spark Shell:
  - Interactive – for learning or data exploration
  - Python or Scala
• Spark Applications
  - For large-scale data processing
  - Python, Scala, or Java
15
Spark Core: Code Base (2012)

16
Spark Shell
The Spark Shell provides interactive data exploration
(REPL)

REPL: Read/Evaluate/Print Loop

17
Spark Fundamentals
Example of an
application:
• Spark Context
• Resilient Distributed
Data
• Transformations
• Actions

18
Spark Context (1)
•Every Spark application requires a spark context: the main
entry point to the Spark API
•Spark Shell provides a preconfigured Spark Context called “sc”

19
Spark Context (2)
• Standalone applications → driver code → Spark Context
• Spark Context holds configuration information and represents the
connection to a Spark cluster

[Figure: a standalone application (which drives the computation) contains the driver code and its Spark Context.]
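A minimal sketch of driver code for a standalone application that creates its own Spark Context (assuming PySpark is installed; the app name and master setting are illustrative):

from pyspark import SparkConf, SparkContext

conf = SparkConf().setAppName("MyApp").setMaster("local[*]")  # or "yarn" on a cluster
sc = SparkContext(conf=conf)

data = sc.parallelize(range(1000))
print(data.sum())   # => 499500

sc.stop()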

20
Spark Context (3)
Spark Context works as a client and represents the connection to a Spark cluster

21
Spark Fundamentals
Example of an application:

• Spark Context
• Resilient Distributed
Data
• Transformations
• Actions

22
Resilient Distributed Dataset
RDD (Resilient Distributed Dataset) is the fundamental unit of data in Spark: an
immutable collection of objects (or records, or elements) that can be operated on “in
parallel” (spread across a cluster)
Resilient -- if data in memory is lost, it can be recreated
• Recover from node failures
• An RDD keeps its lineage information → it can be recreated from parent RDDs
Distributed -- processed across the cluster
• Each RDD is composed of one or more partitions (more partitions – more parallelism)
Dataset -- initial data can come from a file or be created (a small inspection sketch follows below)
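A small sketch of inspecting partitions and lineage from PySpark (assuming the preconfigured sc; names are illustrative):

nums = sc.parallelize(range(100), 4)     # request 4 partitions -> 4-way parallelism
doubled = nums.map(lambda x: x * 2)

print(doubled.getNumPartitions())        # => 4
print(doubled.toDebugString())           # prints the lineage used to recreate this RDD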

23
RDDs
Key Idea: Write applications in terms of transformations
on distributed datasets. One RDD per transformation.
• Organize the RDDs into a DAG showing how data flows.
• RDD can be saved and reused or recomputed. Spark can
save it to disk if the dataset does not fit in memory
• Built through parallel transformations (map, filter, group-by,
join, etc). Automatically rebuilt on failure
• Controllable persistence (e.g. caching in RAM)

24
RDDs are designed to be “immutable” (illustrated below)
• Create once, then reuse without changes. Spark knows the
lineage → it can be recreated at any time → fault-tolerance
• Avoids data inconsistency problems (no simultaneous
updates) → correctness
• Easily lives in memory as well as on disk → caching → safe to share
across processes/tasks → improves performance
• Tradeoff: (fault-tolerance & correctness) vs (disk memory & CPU)
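A quick PySpark illustration that transformations return new RDDs rather than changing the original (assuming sc):

nums = sc.parallelize([1, 2, 3, 4])
evens = nums.filter(lambda x: x % 2 == 0)   # a new RDD; nums itself is unchanged

print(nums.collect())    # => [1, 2, 3, 4]
print(evens.collect())   # => [2, 4]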

25
Creating an RDD
Three ways to create an RDD (a sketch of each follows below):
• From a file or set of files
• From data in memory
• From another RDD
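A minimal sketch of all three in PySpark, assuming sc; the HDFS path is illustrative only:

# 1. From a file (or set of files)
lines = sc.textFile("hdfs://namenode:9000/data/file.txt")   # hypothetical path

# 2. From data already in memory
nums = sc.parallelize([1, 2, 3, 4, 5])

# 3. From another RDD, via a transformation
squares = nums.map(lambda x: x * x)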

26
Example: A File-based RDD

27
Spark Fundamentals
Example of an application:

• Spark Context
• Resilient Distributed
Data
• Transformations
• Actions

28
RDD Operations
Two types of operations
Transformations: Define a
new RDD based on current
RDD(s)
Actions: return values

29
RDD Transformations
•Set of operations on an RDD that define how it should
be transformed
•As in relational algebra, the application of a
transformation to an RDD yields a new RDD (because
RDDs are immutable)
•Transformations are lazily evaluated, which allows
optimizations to take place before execution
•Examples: map(), filter(), groupByKey(), sortByKey(),
etc.
30
Example: map and filter Transformations
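The original slide shows this example as a figure; a minimal PySpark sketch of the same two transformations (assuming sc; the data is made up):

log = sc.parallelize(["INFO start", "ERROR disk full", "INFO done", "ERROR timeout"])

upper = log.map(lambda line: line.upper())                   # transform every element
errors = log.filter(lambda line: line.startswith("ERROR"))   # keep only matching elements

print(errors.collect())   # => ['ERROR disk full', 'ERROR timeout']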

31
RDD Actions
• Apply transformation chains on RDDs, eventually performing
some additional operations (e.g., counting)
• Some actions only store data to an external data source (e.g.
HDFS), others fetch data from the RDD (and its transformation
chain) upon which the action is applied, and convey it to the
driver
• Some common actions (a short sketch follows this list)
count() – return the number of elements

take(n) – return an array of the first n elements

collect()– return an array of all elements

saveAsTextFile(file) – save to text file(s)
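A short sketch exercising these actions (assuming sc; the output directory is illustrative):

nums = sc.parallelize([5, 3, 1, 4, 2])

print(nums.count())      # => 5
print(nums.take(3))      # => [5, 3, 1]
print(nums.collect())    # => [5, 3, 1, 4, 2]
nums.saveAsTextFile("hdfs://namenode:9000/out/nums")   # hypothetical output directory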

32
Lazy Execution of RDDs (1)
Data in RDDs is not processed
until an action is performed

33
Lazy Execution of RDDs (2)
Data in RDDs is not processed
until an action is performed

34
Lazy Execution of RDDs (3)
Data in RDDs is not processed
until an action is performed

35
Lazy Execution of RDDs (4)
Data in RDDs is not processed
until an action is performed

36
Lazy Execution of RDDs (5)
Data in RDDs is not processed
until an action is performed

Output Action “triggers” computation, pull model
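A minimal sketch of this pull model (assuming sc): the two transformations below only build lineage; nothing is computed until the action.

data = sc.parallelize(range(10))
doubled = data.map(lambda x: x * 2)      # no computation yet
big = doubled.filter(lambda x: x > 10)   # still no computation

print(big.collect())   # the action pulls data through the whole chain => [12, 14, 16, 18]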

37
Example: Mine error logs
Load error messages from a log into memory, then interactively
search for various patterns:

lines = sc.textFile("hdfs://...")                         # HadoopRDD
errors = lines.filter(lambda s: s.startswith("ERROR"))    # FilteredRDD
messages = errors.map(lambda s: s.split("\t")[2])
messages.cache()
messages.filter(lambda s: "foo" in s).count()

Result: full-text search of Wikipedia in 0.5 sec (vs 20 sec for on-disk data)

38
Key Idea: Elastic parallelism

RDD operations are designed to offer embarrassing parallelism.

Spark will spread the task over the nodes where data resides, offering a highly
concurrent execution that minimizes delays. Term: “partitioned computation”.

If some component crashes or even is just slow, Spark simply kills that task and
launches a substitute.

39
RDD and Partitions (Parallelism example)

40
RDD Graph: Data Set vs Partition Views
Much like in Hadoop MapReduce, each RDD is associated with
(input) partitions

41
RDDs: Data Locality
•Data Locality Principle
  - Keep high-value RDDs precomputed, in cache or SSD
  - Run tasks that need the specific RDD with those same inputs
    on the node where the cached copy resides.
  - This can maximize in-memory computational performance.

This requires cooperation between your hints to Spark when you
build the RDD, the Spark runtime and optimization planner, and the
underlying YARN resource manager.
42
RDDs -- Summary
RDDs are partitioned, locality-aware, distributed collections
  - RDDs are immutable
RDDs are data structures that:
  - Either point to a direct data source (e.g. HDFS)
  - Or apply some transformations to their parent RDD(s) to
    generate new data elements
Computations on RDDs
  - Represented by lazily evaluated lineage DAGs composed
    of chained RDDs

43
Lifetime of a Job in Spark

44
Anatomy of a Spark Application

Cluster Manager
(YARN/Mesos)

45
Typical RDD pattern of use
Instead of doing a lot of work in each RDD, developers split
tasks into lots of small RDDs

These are then organized into a DAG.

The developer anticipates which will be costly to recompute and
hints to Spark that it should cache those (a sketch follows below).
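A hedged sketch of that pattern (assuming sc; the input path and record format are made up): each step is its own small RDD, and only the one expected to be reused is cached.

raw = sc.textFile("hdfs://namenode:9000/events.csv")      # hypothetical path
parsed = raw.map(lambda line: line.split(","))            # one small step per RDD
cleaned = parsed.filter(lambda fields: len(fields) == 3)
keyed = cleaned.map(lambda fields: (fields[0], float(fields[2])))

keyed.cache()   # hint: costly to recompute and reused by the two pipelines below

totals = keyed.reduceByKey(lambda a, b: a + b)
maxima = keyed.reduceByKey(max)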

46
Why is this a good strategy?

Spark tries to run tasks that will need the same intermediary data on the same
nodes.
If MapReduce jobs were arbitrary programs, this wouldn't help because reuse
would be very rare.
But in fact the MapReduce model is very repetitious and iterative, and often
applies the same transformations again and again to the same input files.
• Those particular RDDs become great candidates for caching.
• The MapReduce programmer may not know how many iterations will occur, but
Spark itself is smart enough to evict RDDs if they don't actually get reused.

47
Iterative Algorithms: Spark vs MapReduce

48
Today’s Topics
•Motivation
•Spark Basics
•Spark Programming

49
Spark Programming (1)
Creating RDDs
# Turn a Python collection into an RDD
sc.parallelize([1, 2, 3])

# Load text file from local FS, HDFS, or S3
sc.textFile("file.txt")
sc.textFile("directory/*.txt")
sc.textFile("hdfs://namenode:9000/path/file")

# Use existing Hadoop InputFormat (Java/Scala only)
sc.hadoopFile(keyClass, valClass, inputFmt, conf)
50
Spark Programming (2)
Basic Transformations

nums = sc.parallelize([1, 2, 3])

# Pass each element through a function
squares = nums.map(lambda x: x*x)   # => [1, 4, 9]

# Keep elements passing a predicate
even = squares.filter(lambda x: x % 2 == 0)   # => [4]

51
Spark Programming (3)
Basic Actions
nums = sc.parallelize([1, 2, 3])

# Retrieve RDD contents as a local collection


nums.collect() # => [1, 2, 3]

# Return first K elements


nums.take(2) # => [1, 2]

# Count number of elements


nums.count() # => 3

# Merge elements with an associative function


nums.reduce(lambda x, y: x + y) # => 6
52
Spark Programming (4)
Working with Key-Value Pairs
Spark’s “distributed reduce” transformations operate on RDDs of
key-value pairs

Python: pair = (a, b)


pair[0] # => a
pair[1] # => b

Scala: val pair = (a, b)


pair._1 // => a
pair._2 // => b

Java: Tuple2 pair = new Tuple2(a, b);


pair._1 // => a
pair._2 // => b
53
Spark Programming (5)
Some Key-Value Operations

pets = sc.parallelize([("cat", 1), ("dog", 1), ("cat", 2)])

pets.reduceByKey(lambda x, y: x + y) # => {(cat, 3), (dog, 1)}

pets.groupByKey() # => {(cat, [1, 2]), (dog, [1])}

pets.sortByKey() # => {(cat, 1), (cat, 2), (dog, 1)}

54
Example: Word Count
lines = sc.textFile("hamlet.txt")
counts = (lines.flatMap(lambda line: line.split(" "))
               .map(lambda word: (word, 1))
               .reduceByKey(lambda x, y: x + y))
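As written, counts is still only a recipe; a hedged follow-up showing an action that materializes it (the output path is illustrative):

print(counts.take(5))                                      # e.g., the first five (word, count) pairs
counts.saveAsTextFile("hdfs://namenode:9000/out/counts")   # hypothetical output directory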

55
Example: Spark Streaming

Represents streams as a series of RDDs over time (typically sub-second intervals, but it
is configurable)

val spammers = sc.sequenceFile("hdfs://spammers.seq")

sc.twitterStream(...)
  .filter(t => t.text.contains("Santa Clara University"))
  .transform(tweets => tweets.map(t => (t.user, t)).join(spammers))
  .print()

56
Spark: Combining Libraries (Unified Pipeline)
# Load data using Spark SQL
points = spark.sql("select latitude, longitude from tweets")

# Train a machine learning model
model = KMeans.train(points, 10)

# Apply it to a stream
sc.twitterStream(...)
  .map(lambda t: (model.predict(t.location), 1))
  .reduceByWindow("5s", lambda a, b: a + b)

HTTP://WWW.CS.CORNELL.EDU/COURSES/CS5412/2018SP 57


Spark: Setting the Level of Parallelism
All the pair RDD operations take an optional second
parameter for the number of tasks

words.reduceByKey(lambda x, y: x + y, 5)
words.groupByKey(5)
visits.join(pageViews, 5)

58
Summary

Spark is a powerful “manager” for big data computing.

It centers on a job scheduler for Hadoop (MapReduce) that is smart
about where to run each task: co-locate the task with its data.

The data objects are “RDDs”: a kind of recipe for generating a file from
an underlying data collection. RDD caching allows Spark to run mostly
from memory-mapped data, for speed.

• Online tutorials: spark.apache.org/docs/latest


59
