bd1718 10 Spark
Julian M. Kunkel
julian.kunkel@googlemail.com
2018-01-19
Disclaimer: Big Data software is constantly updated, code samples may be outdated.
Outline
1 Concepts
2 Architecture
3 Computation
4 Managing Jobs
5 Examples
6 Higher-Level Abstractions
7 Summary
Concepts
[Figure: an RDD X split into partitions P1–P4]
Architecture
[Figure: Spark cluster overview (driver program, cluster manager, worker nodes with executors). Source: [12]]
Execution of code
1 The closure is computed: variables/methods needed for execution
2 The driver serializes the closure together with the task (code)
Broadcast variables are useful since they do not have to be packed with each task (see the sketch after this list)
3 The driver sends the closure to the executors
4 Tasks on the executor run the closure, which manipulates the local data
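A minimal PySpark sketch of this flow (assuming an existing SparkContext sc; the lookup table and values are made up for illustration). The lambda passed to map() is the closure that is serialized with the task, while the broadcast variable is shipped to each executor only once:

lookup = sc.broadcast({"a": 1, "b": 2, "c": 3})  # distributed once to every executor

rdd = sc.parallelize(["a", "b", "c", "a"])
# the closure only captures the small broadcast handle, not the dictionary itself
total = rdd.map(lambda k: lookup.value[k]).sum()
print(total)  # 1 + 2 + 3 + 1 = 7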
Persistence [13]
Concepts
The data lineage of an RDD is stored
Actions trigger computation, no intermediate results are kept
The methods cache() and persist() enable preserving results
The first time an RDD is computed, it is kept for further usage
Each executor keeps its local data
cache() keeps data in memory (level: MEMORY_ONLY)
persist() allows choosing the storage level
Spark manages the memory cache automatically
LRU cache, old RDDs are evicted to secondary storage (or deleted)
If an RDD is not in cache, re-computation may be triggered
Storage levels
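A minimal PySpark sketch of caching and choosing a storage level (assuming an existing SparkContext sc; MEMORY_ONLY and MEMORY_AND_DISK are two of the standard levels):

from pyspark import StorageLevel

squares = sc.parallelize(range(1000000)).map(lambda x: x * x)
squares.cache()        # equivalent to persist(StorageLevel.MEMORY_ONLY)
squares.count()        # the first action materializes the RDD and fills the cache

pairs = squares.map(lambda x: (x % 10, x))
pairs.persist(StorageLevel.MEMORY_AND_DISK)   # spill partitions to disk if memory is full
pairs.count()
pairs.unpersist()      # drop the cached partitions explicitly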
Parallelism [13]
Spark runs one task for each partition of the RDD
Recommendation: create 2-4 partitions for each CPU
When creating an RDD, a default value is set, but it can be changed manually
# Distribute the data into 10 partitions
sc.parallelize(data, 10)
Computation
Lazy execution: operations are only applied when results are needed (by actions); see the sketch after this list
Intermediate RDDs can be re-computed multiple times
Users can persist RDDs (in-memory or disk) for later use
Many operations apply user-defined functions or lambda expressions
Code and closure are serialized on the driver and sent to the executors
Note: when using instance methods of a class, the whole object is serialized
RDD partitions are processed in parallel (data parallelism)
Use local data where possible
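A minimal sketch of lazy execution in PySpark (the HDFS path is a made-up placeholder): the transformations only build up the lineage, and the final action triggers the actual computation on the executors:

# transformations are lazy: nothing is computed here
words = sc.textFile("hdfs:///tmp/input.txt").flatMap(lambda line: line.split())
long_words = words.filter(lambda w: len(w) > 3)

# the action triggers the whole pipeline and returns the result to the driver
print(long_words.count())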
Simple Example
Example session when using pyspark
To run with a specific Python version, e.g., use
PYSPARK_PYTHON=python3 pyspark --master yarn-client
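A minimal interactive session could then look as follows (the data is made up; sc is the SparkContext provided by the shell):

>>> rdd = sc.parallelize([1, 2, 3, 4])
>>> rdd.map(lambda x: x * 2).collect()
[2, 4, 6, 8]
>>> rdd.reduce(lambda a, b: a + b)
10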
Compute PI [20]
Approach: randomly throw NUM_SAMPLES darts into the unit square and count how many land inside the unit circle; the fraction approximates π/4
Python
from random import random

def sample(p):
    x, y = random(), random()
    return 1 if x*x + y*y < 1 else 0

count = sc.parallelize(range(0, NUM_SAMPLES)).map(sample).reduce(lambda a, b: a + b)
print("Pi is roughly %f" % (4.0 * count / NUM_SAMPLES))
Java
long count = spark.parallelize(makeRange(1, NUM_SAMPLES)).filter(
    new Function<Integer, Boolean>() {
        public Boolean call(Integer i) {
            double x = Math.random();
            double y = Math.random();
            return x*x + y*y < 1;
        }
    }).count();
System.out.println("Pi is roughly " + 4.0 * count / NUM_SAMPLES);
Shuffle [13]
Concepts
A shuffle redistributes data across partitions, changing the mapping and requiring communication between executors
Operations
Shuffle operations include, e.g., repartition() and coalesce(); coalesce() is more efficient than repartition() when decreasing the number of partitions (see the sketch below)
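A minimal PySpark sketch (data made up for illustration): reduceByKey() shuffles records so that all values of a key meet in one partition, while coalesce() merges partitions without a full shuffle:

pairs = sc.parallelize([("a", 1), ("b", 1), ("a", 1)], 4)   # 4 partitions

counts = pairs.reduceByKey(lambda a, b: a + b)   # shuffle: values are grouped by key
fewer = counts.coalesce(2)                       # merge down to 2 partitions
print(fewer.getNumPartitions())                  # 2
print(fewer.collect())                           # [('b', 1), ('a', 2)] (order may vary)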
Managing Jobs
Batch Applications
Submit batch applications via spark-submit
Supports JARs (Scala or Java)
Supports Python code
To inspect results, check the output (tracking URL)
Build self-contained Spark applications (see [24])
spark-submit --master <master-URL> --class <MAIN>   # for Java/Scala applications
    --conf <key>=<value> --py-files x,y,z           # add files to the PYTHONPATH
    --jars <(hdfs|http|file|local)>://<FILE>        # provide JARs for the classpath
    <APPLICATION> [APPLICATION ARGUMENTS]
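Two hypothetical invocations (application, class and file names are placeholders only):

spark-submit --master yarn-client --class org.example.WordCount wordcount.jar /input /output
spark-submit --master yarn-client --py-files helpers.py wordcount.py /input /output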
Web UI
The driver provides a web UI (default port: 4040) showing jobs, stages, storage and executors
Change the port by adding --conf spark.ui.port=PORT to, e.g., pyspark
Julian M. Kunkel Lecture BigData Analytics, WiSe 17/18 25 / 54
Concepts Architecture Computation Managing Jobs Examples Higher-Level Abstractions Summary
[Screenshot: Web UI overview]
[Screenshot: Web UI RDD details]
Examples
Higher-Level Abstractions
# When using an SQL statement to create the table, the table is visible in HCatalog!
p = sqlContext.sql("CREATE TABLE IF NOT EXISTS data (key INT, value STRING)")

# Bulk load data by appending it to the table data (if it existed)
sqlContext.sql("LOAD DATA LOCAL INPATH 'data.txt' INTO TABLE data")

# The result of an SQL query is a DataFrame, an RDD of rows
rdd = sqlContext.sql("SELECT * from data")

# Treat the RDD as a SchemaRDD; access row members using the column name
o = rdd.map(lambda x: x.key)  # Access the column by name, here "key"
# To print the distributed values, they have to be collected
print(o.collect())

sqlContext.cacheTable("data")  # Cache the table in memory

# Save as a JSON file/directory in the local file system
rdd.write.json("data.json", mode="overwrite")
# e.g., {"key":10,"value":"test"}

sqlContext.sql("DROP TABLE data")  # Remove the table
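The same data can also be processed with the DataFrame API instead of SQL strings; a minimal sketch, reusing the JSON file written above:

df = sqlContext.read.json("data.json")        # infer the schema from the JSON records
df.printSchema()
df.filter(df.key > 5).select("value").show()  # equivalent to a WHERE/SELECT query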
Machine Learning (MLlib)
Statistics
  Descriptive statistics, hypothesis testing, random data generation
Classification and regression
  Linear models, decision trees, Naive Bayes
Clustering
  k-means
Frequent pattern mining
  Association rules
Higher-level APIs for complex pipelines
  Feature extraction, transformation and selection
  Classification and regression trees
  Multilayer perceptron classifier
Clustering [25]
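A minimal k-means sketch with MLlib (the 2-D points are made up; in practice they would be loaded, e.g., from HDFS):

from pyspark.mllib.clustering import KMeans

points = sc.parallelize([[0.0, 0.0], [1.0, 1.0], [9.0, 8.0], [8.0, 9.0]])
model = KMeans.train(points, k=2, maxIterations=10)

print(model.clusterCenters)        # the two computed cluster centres
print(model.predict([0.5, 0.5]))   # index of the closest centre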
Integration into R
Integrated R shell: sparkR
Features
Store/retrieve data frames in/from Spark
In-memory SQL and access to HDFS data and Hive tables
Provides functions to (lazily) access/derive data and to use ML algorithms
Enables (lazy) parallelism in R!
Summary
Spark is an in-memory processing and storage engine
It is based on the concept of RDDs
An RDD is an immutable list of tuples (or of key/value pairs)
Computation is programmed by transforming RDDs
Data is distributed by partitioning an RDD / DataFrame / DataSet
Computation of transformations is done on local partitions
Shuffle operations change the mapping and require communication
Actions return data to the driver or perform I/O
Fault-tolerance is provided by re-computing partitions
Driver program controls the executors and provides code closures
Lazy evaluation: All computation is deferred until needed by actions
Higher-level APIs enable SQL, streaming and machine learning
Interactions with the Hadoop ecosystem
Accessing HDFS data
Sharing tables with Hive
Can use YARN resource management
Bibliography
10 Wikipedia
12 http://spark.apache.org/docs/latest/cluster-overview.html
13 http://spark.apache.org/docs/latest/programming-guide.html
14 http://spark.apache.org/docs/latest/api/python/pyspark.sql.html
18 http://www.cloudera.com/content/www/en-us/documentation/enterprise/latest/topics/cdh_ig_running_spark_on_yarn.html
20 http://spark.apache.org/examples.html
22 http://spark.apache.org/docs/latest/mllib-guide.html
25 http://spark.apache.org/docs/latest/mllib-clustering.html
26 https://en.wikipedia.org/wiki/In-memory_processing
28 https://databricks.com/blog/2016/07/14/a-tale-of-three-apache-spark-apis-rdds-dataframes-and-datasets.html