BDA Unit III
(Part-I)
Spark was introduced by the Apache Software Foundation to speed up the Hadoop
computational process.
Contrary to a common belief, Spark is not a modified version of Hadoop and is not really
dependent on Hadoop, because it has its own cluster management. Hadoop is just one of the
ways to implement Spark.
Spark uses Hadoop in two ways – one is storage and the second is processing. Since Spark
has its own cluster management and computation, it uses Hadoop for storage purposes only.
Apache Spark
Apache Spark is a lightning-fast cluster computing technology, designed for fast
computation. It is based on Hadoop MapReduce and extends the MapReduce model to
efficiently use it for more types of computations, including interactive queries and stream
processing. The main feature of Spark is its in-memory cluster computing, which increases the
processing speed of an application.
Spark is designed to cover a wide range of workloads such as batch applications, iterative
algorithms, interactive queries, and streaming. Apart from supporting all these workloads in the
same system, it reduces the management burden of maintaining separate tools.
Features:
Apache Spark has the following features:
Speed: Spark helps run an application on a Hadoop cluster up to 100 times faster in memory,
and 10 times faster when running on disk. This is possible because it reduces the number of
read/write operations to disk by storing the intermediate processing data in memory.
Supports multiple languages: Spark provides built-in APIs in Java, Scala, and Python, so you
can write applications in different languages. Spark also comes with 80 high-level operators
for interactive querying.
Advanced Analytics: Spark supports not only ‘Map’ and ‘Reduce’ but also SQL queries,
streaming data, machine learning (ML), and graph algorithms.
Spark Built on Hadoop:
Spark can be deployed with Hadoop components in three ways, as explained below.
Standalone: In a standalone deployment, Spark occupies the place on top of HDFS (Hadoop
Distributed File System) and space is allocated for HDFS explicitly. Here, Spark and
MapReduce run side by side to cover all Spark jobs on the cluster.
Hadoop YARN: In a YARN deployment, Spark simply runs on YARN without any pre-installation
or root access required. This helps integrate Spark into the Hadoop ecosystem or
Hadoop stack, and allows other components to run on top of the stack.
Spark in MapReduce (SIMR): SIMR is used to launch a Spark job in addition to the
standalone deployment. With SIMR, a user can start Spark and use its shell without any
administrative access.
Components of Spark
Spark Core
Spark Core is the underlying general execution engine for the Spark platform upon which all
other functionality is built. It provides in-memory computing and the ability to reference
datasets in external storage systems.
Spark SQL
Spark SQL is a component on top of Spark Core that introduces a new data abstraction
called SchemaRDD, which provides support for structured and semi-structured data.
Spark Streaming
Spark Streaming leverages Spark Core's fast scheduling capability to perform streaming
analytics. It ingests data in mini-batches and performs RDD (Resilient Distributed Datasets)
transformations on those mini-batches of data.
MLlib (Machine Learning Library)
MLlib is a distributed machine learning framework on top of Spark, built on the distributed
memory-based Spark architecture. According to benchmarks done by the MLlib developers
against the Alternating Least Squares (ALS) implementations, Spark MLlib is nine times as
fast as the Hadoop disk-based version of Apache Mahout (before Mahout gained a Spark
interface).
GraphX
GraphX is a distributed graph-processing framework on top of Spark. It provides an API
for expressing graph computations that can model user-defined graphs using the Pregel
abstraction API. It also provides an optimized runtime for this abstraction.
Cluster Managers
To support this wide range of workloads while maximizing flexibility, Spark can run over a
variety of cluster managers, including Hadoop YARN, Apache Mesos, and a simple cluster
manager included in Spark itself called the Standalone Scheduler.
RDD Basics
An RDD in Spark is simply an immutable distributed collection of objects. Each RDD is
split into multiple partitions, which may be computed on different nodes of the cluster. RDDs
can contain any type of Python, Java, or Scala objects, including user-defined classes.
Resilient: fault-tolerant with the help of the RDD lineage graph (DAG), and so able to
recompute missing or damaged partitions due to node failures.
Distributed: data resides on multiple nodes.
Dataset: represents the records of the data you work with. The user can load the dataset
externally, which can be a JSON file, CSV file, text file, or a database via JDBC, with no
specific data structure.
Users create RDDs in three ways: (i) parallelizing an existing collection in your driver
program, (ii) referencing a dataset in an external storage system, such as a shared file
system, HDFS, HBase, or any data source offering a Hadoop InputFormat, or (iii) through
deterministic operations (transformations) on other RDDs. An RDD is a fault-tolerant
collection of elements that can be operated on in parallel.
Once created, RDDs offer two types of operations: transformations and actions.
Transformations construct a new RDD from a previous one. For example, one common
transformation is filtering data that matches a predicate. In our text file example, we can use
this to create a new RDD holding just the lines that contain the word spark, as shown in
Example 1: Calling the filter() transformation
scala>val data = sc.textFile("f1.txt")
scala>val DFData = data.filter(line => line.contains("spark"))
Actions, on the other hand, compute a result based on an RDD, and either return it to the
driver program or save it to an external storage system (e.g., HDFS). One example of an action
we called earlier is first(), which returns the first element of an RDD; another is foreach(),
which applies a given function to every element of an RDD (for example, printing each one),
as sketched below.
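For instance, continuing the text-file example above (a minimal sketch reusing the DFData RDD defined earlier):
scala>DFData.first()           // returns the first line containing "spark"
scala>DFData.foreach(println)  // prints every matching line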
Features of RDD:
1. In-memory computation
2. Immutability
3. Lazy evaluation
4. Persistence
5. Fault tolerance
Creating RDDs
Spark provides three ways to create RDDs:
(i) parallelizing a collection in your driver program
(ii) loading an external dataset, and
(iii) from other RDDs.
(i) parallelizing a collection in your driver program
The simplest way to create RDDs is to take an existing collection in your program and
pass it to SparkContext's parallelize() method, as shown in the example below. This approach is very
useful when you are learning Spark, since you can quickly create your own RDDs in the shell
and perform operations on them. Keep in mind, however, that outside of prototyping and
testing, this is not widely used since it requires that you have your entire dataset in memory on
one machine.
Example: parallelize() method in Scala
val rdd1 = spark.sparkContext.parallelize(Array("jan","feb","mar","april","may","jun"),3)
rdd1.foreach(println)
Output: It will display all the array elements.
(ii) loading an external dataset
A more common way to create RDDs is to load data from external storage. One method that
loads a text file as an RDD of strings is SparkContext.textFile(), which is shown in the
example below.
Example: textFile() method in Scala
val lines = sc.textFile("/home/user/f1.txt")
(iii) from other RDD
Create some input RDDs from external data and transform them to define new RDDs
using transformations such as filter(), map(), etc.
Example:
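A minimal sketch (assuming the lines RDD created above with textFile(); words and wordPair are illustrative RDDs derived from it):
val words = lines.flatMap(line => line.split(" "))   // split each line into words
val wordPair = words.map(w => (w.charAt(0), w))      // pair each word with its first character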
wordPair.foreach(println)
RDD Operations
RDDs support two types of operations: transformations and actions. Transformations are
operations on RDDs that return a new RDD, such as map() and filter(). Actions are operations
that return a result to the driver program or write it to storage, and kick off a computation, such
as count() and first().
Transformations
Transformations are operations on RDDs that return a new RDD. Transformed RDDs are
computed lazily, only when you use them in an action. Many transformations are element-wise;
that is, they work on one element at a time; but this is not true for all transformations.
As an example, suppose that we have a logfile, log.txt, with a number of messages, and
we want to select only the error messages. We can use the filter() transformation seen before.
Example: filter() transformation in Scala
scala>val inputRDD = sc.textFile("log.txt")
scala>val errorsRDD = inputRDD.filter(line =>line.contains("error"))
scala>errorsRDD.foreach(println)
filter() operation does not mutate the existing inputRDD. Instead, it returns a pointer to an
entirely new RDD. Let's use inputRDD again to search for lines with the word warning in
them. Then, we’ll use another transformation, union(), to print out the number of lines that
contained either error or warning.
union() is a bit different than filter(), in that it operates on two RDDs instead of one.
Transformations can actually operate on any number of input RDDs.
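A sketch of that sequence, reusing inputRDD and errorsRDD from the previous example (warningsRDD and badLinesRDD are illustrative names):
scala>val warningsRDD = inputRDD.filter(line => line.contains("warning"))
scala>val badLinesRDD = errorsRDD.union(warningsRDD)
scala>println(badLinesRDD.count())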
Finally, as you derive new RDDs from each other using transformations, Spark keeps
track of the set of dependencies between different RDDs, called the lineage graph. It uses this
information to compute each RDD on demand and to recover lost data if part of a persistent
RDD is lost.
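As an illustration, the standard toDebugString method prints this lineage for an RDD (a minimal sketch using errorsRDD from above):
scala>println(errorsRDD.toDebugString)   // shows the textFile() and filter() steps it depends on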
Actions
Actions are the second type of RDD operation. They are the operations that return a final
value to the driver program or write data to an external storage system. Actions force the
evaluation of the transformations required for the RDD since they produce output.
In the log example from the previous section, we want to print out some information about
the errorsRDD. To do that, we use two actions: count(), which returns the count as a number,
and take(), which collects a number of elements from the RDD.
Example: Scala error count using actions
println("Input had " + errorsRDD.count() + " concerning lines")
println("Here are 3 examples:")
errorsRDD.take(3).foreach(println)
In this example, we used take() to retrieve a small number of elements in the RDD at the
driver program. We then iterate over them locally to print out information at the driver. RDDs
also have a collect() function to retrieve the entire RDD. This is useful when the RDD has been
filtered down to a very small size that we want to deal with locally. Since the entire dataset
must fit in memory on one machine, collect() shouldn't be used on large datasets.
In most cases RDDs can’t just be collect()ed to the driver because they are too large. In
these cases, it’s common to write data out to a distributed storage system such as HDFS or
Amazon S3. We can save the contents of an RDD using the saveAsTextFile() action,
saveAsSequenceFile(), or any of a number of actions for various built-in formats.
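For example, saving the error lines from above (the output path here is only illustrative):
scala>errorsRDD.saveAsTextFile("hdfs:///user/output/errors")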
It is important to note that each time we call a new action, the entire RDD must be computed
“from scratch.” To avoid this inefficiency, users can persist intermediate results using
“Persistence (Caching)”, described later.
Lazy Evaluation
Transformations on RDDs are lazily evaluated, meaning that Spark will not begin to execute
until it sees an action.
Lazy evaluation means that when we call a transformation on an RDD (for instance, calling
map()), the operation is not immediately performed. Instead, Spark internally records metadata
to indicate that this operation has been requested. Rather than thinking of an RDD as containing
specific data, it is best to think of each RDD as consisting of instructions on how to compute
the data that we build up through transformations.
Loading data into an RDD is lazily evaluated in the same way transformations are. So,
when we call sc.textFile(), the data is not loaded until it is necessary. As with transformations,
the operation (in this case, reading the data) can occur multiple times. Although transformations
are lazy, we can force Spark to execute them at any time by running an action, such as
count().Spark uses lazy evaluation to reduce the number of passes it has to take over our data
by grouping operations together.
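A small sketch of this behaviour in the shell (the file name is illustrative):
scala>val lines = sc.textFile("log.txt")                          // nothing is read yet
scala>val errors = lines.filter(line => line.contains("error"))   // still no work is done
scala>errors.count()                                              // the action triggers reading and filtering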
Passing Functions to Spark
Most of Spark's transformations, and some of its actions, depend on passing in functions that
are used by Spark to compute data.
Scala
In Scala, we can pass in functions defined inline, references to methods, or static functions
as we do for Scala’s other functional APIs.
Example:
scala> def sayhello(){
| println("Hello")
|}
To check the output, call it:
scala>sayhello()
Output: Hello
Example: with parameters
scala>def sum(a:Int,b:Int)
|{
| println(a+b)
|}
To check the output:
scala>sum(2,6)
Output: 8
Example: using return
scala> def sum(a:Int,b:Int):Int={
| return a+b
|}
sum: (a: Int, b: Int)Int
scala> sum(2,6)
res7: Int = 8
Example:
def outer(a:Int){
| println("In outer")
| def inner(){
| println(a*3)
|}
| inner()
|}
outer: (a: Int)Unit
Output:
scala> outer(3)
In outer
9
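Bringing these together, a named function can be passed to a Spark transformation just like an inline one (a minimal sketch; isError is an illustrative helper):
scala>def isError(line: String): Boolean = line.contains("error")
scala>val errors = sc.textFile("log.txt").filter(isError)
scala>println(errors.count())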
1. map()
Spark's map() is a function which expresses a one-to-one transformation: it transforms each
element of a collection into one element of the resulting collection.
It is useful to note that map()'s return type does not have to be the same as its input type,
so if we had an RDD of strings and our map() function were to parse the strings and return a
Double, our input RDD type would be RDD[String] and the resulting RDD type would be
RDD[Double].
Example: Scala squaring the values in an RDD
val input = sc.parallelize(List(1, 2, 3, 4))
val result = input.map(x => x * x)
println(result.collect().mkString(","))
2. flatMap()
Spark's flatMap() is a function which expresses a one-to-many transformation: it transforms
each element to 0 or more elements. Sometimes we want to produce multiple output elements
for each input element. The operation to do this is called flatMap(). As with map(), the function
we provide to flatMap() is called individually for each element in our input RDD. Instead of
returning a single element, we return an iterator with our return values. Rather than producing
an RDD of iterators, we get back an RDD that consists of the elements from all of the iterators.
A simple usage of flatMap() is splitting up an input string into words:
Example: flatMap() in Scala, splitting lines into multiple words
scala>val lines = sc.parallelize(List("hello world", "hi welcome to spark"))
scala>val words = lines.flatMap(line =>line.split(" "))
scala>words.first() // returns "hello"
scala>words.foreach(println) //returns all words.
We illustrate the difference between flatMap() and map() in Figure 3-3: flatMap() "flattens"
the iterators returned to it, so that instead of ending up with an RDD of lists we have an RDD
of the elements in those lists.
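The contrast can be seen on the same input (a sketch reusing the lines RDD from the flatMap() example above):
scala>lines.map(line => line.split(" ")).first()      // Array(hello, world), because map() gives an RDD of arrays
scala>lines.flatMap(line => line.split(" ")).first()  // hello, because flatMap() gives an RDD of individual words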
4. groupByKey()
When we use groupByKey() on a dataset of (K, V) pairs, the data is shuffled according to
the key K and the values for each key are grouped together in another RDD.
Example:
scala>val data =
spark.sparkContext.parallelize(Array(('k',5),('s',3),('s',4),('p',7),('p',5),('t',8),('k',6)),3)
scala>val group = data.groupByKey().collect()
scala>group.foreach(println)
5. reduceByKey(func, [numTasks])
When we use reduceByKey() on a dataset of (K, V) pairs, the pairs on the same machine with
the same key are combined (using func) before the data is shuffled.
Example:
val words = Array("one","two","two","four","five","six","six","eight","nine","ten")
val data = spark.sparkContext.parallelize(words).map(w => (w,1)).reduceByKey(_+_)
data.foreach(println)
6. sortByKey()
When we apply the sortByKey() function on a dataset of (K, V) pairs, the data is sorted
according to the key K in another RDD.
Example:
val data = spark.sparkContext.parallelize(Seq(("maths",52), ("english",75), ("science",82),
("computer",65), ("maths",85)))
val sorted = data.sortByKey()
sorted.foreach(println)
7. coalesce()
To avoid a full shuffle of the data we use the coalesce() function. coalesce() reuses existing
partitions so that less data is shuffled, and with it we can cut down the number of partitions.
Suppose we have data in four partitions and we want only two; the data from the extra
partitions is then moved onto the partitions we keep.
Example:
val rdd1 = spark.sparkContext.parallelize(Array("jan","feb","mar","april","may","jun"),3)
val result = rdd1.coalesce(2)
result.foreach(println)
8. join()
Join is a database term: it combines fields from two tables using common values. The
join() operation in Spark is defined on pair RDDs, i.e. RDDs in which each element is a tuple
whose first element is the key and whose second element is the value.
Example:
val data = spark.sparkContext.parallelize(Array(('A',1),('b',2),('c',3)))
val data2 =spark.sparkContext.parallelize(Array(('A',4),('A',6),('b',7),('c',3),('c',8)))
val result = data.join(data2)
println(result.collect().mkString(","))
Pseudo set operations
RDDs support many of the operations of mathematical sets, such as union and intersection,
even when the RDDs themselves are not properly sets. Four operations are shown in Figure 3-
4. It’s important to note that all of these operations require that the RDDs being operated on
are of the same type.
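For instance, union() keeps duplicates unless distinct() is applied afterwards (a minimal sketch with two small RDDs of the same type):
scala>val nums1 = sc.parallelize(Seq(1, 2, 3, 4))
scala>val nums2 = sc.parallelize(Seq(3, 4, 5, 6))
scala>nums1.union(nums2).collect()              // Array(1, 2, 3, 4, 3, 4, 5, 6)
scala>nums1.union(nums2).distinct().collect()   // duplicates removed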
intersection()
Example:
scala>val rdd1 = spark.sparkContext.parallelize(Seq((1,"jan",2016),(3,"nov",2014),
(16,"feb",2014)))
scala>val rdd2 = sc.parallelize(Seq((5,"dec",2014),(1,"jan",2016)))
scala>val common = rdd1.intersection(rdd2)
scala>common.foreach(println)
subtract()
Sometimes we need to remove some data from consideration. The subtract(other) function
takes in another RDD and returns an RDD that has only values present in the first RDD and
not the second RDD. Like intersection(), it performs a shuffle.
scala>val rdd1 = sc.parallelize(Seq((1,"jan",2016),(3,"nov",2014), (16,"feb",2014)))
scala>val rdd2 = sc.parallelize(Seq((5,"dec",2014),(1,"jan",2016)))
scala>val sub = rdd1.subtract(rdd2)
scala>sub.foreach(println)
cartesian()
We can also compute a Cartesian product between two RDDs, as shown in Figure 3-5. The
cartesian(other) transformation returns all possible pairs of (a,b) where a is in the source RDD
and b is in the other RDD.
scala>val rdd1 = sc.parallelize(Seq((1,"jan",2016),(3,"nov",2014), (16,"feb",2014)))
scala>val rdd2 = sc.parallelize(Seq((5,"dec",2014),(1,"jan",2016)))
scala>val cart = rdd1.cartesian(rdd2)
scala>cart.foreach(println)
Actions
1. reduce()
The most common action on basic RDDs is reduce(), which takes a function that operates
on two elements of the type in your RDD and returns a new element of the same type.
A simple example of such a function is +, which we can use to sum our RDD. With
reduce(), we can easily sum the elements of our RDD, count the number of elements, and
perform other types of aggregations.
Example: reduce() in Scala
val rdd1 = spark.sparkContext.parallelize(List(20,32,45,62,8,5))
val sum = rdd1.reduce(_+_)
println(sum)
Similar to reduce() is fold(), which also takes a function with the same signature as needed
for reduce(), but in addition takes a “zero value” to be used for the initial call on each partition.
The zero value you provide should be the identity element for your operation; that is, applying
it multiple times with your function should not change the value (e.g., 0 for +, 1 for *, or an
empty list for concatenation).
You can minimize object creation in fold() by modifying and returning the first of the two
parameters in place. However, you should not modify the second parameter.
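A sketch of fold() with 0 as the zero value for addition (the values are the same as in the reduce() example above):
val rdd1 = spark.sparkContext.parallelize(List(20,32,45,62,8,5))
val total = rdd1.fold(0)(_+_)   // 0 is the identity element for +
println(total)                  // prints 172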
2. aggregate()
The aggregate() function frees us from the constraint of having the return be the same type
as the RDD we are working on. With aggregate(), like fold(), we supply an initial zero value
of the type we want to return. We then supply a function to combine the elements from our
RDD with the accumulator. Finally, we need to supply a second function to merge two
accumulators, given that each node accumulates its own results locally. We can use aggregate()
to compute the average of an RDD.
Example: aggregate() in Scala
val result = input.aggregate((0, 0))(
(acc, value) => (acc._1 + value, acc._2 + 1),
(acc1, acc2) => (acc1._1 + acc2._1, acc1._2 + acc2._2))
val avg = result._1 / result._2.toDouble
collect()
Another operation that returns data to our driver program is collect(), which returns the entire
RDD's contents. collect() is commonly used in unit tests where the entire contents of the RDD
are expected to fit in memory, since that makes it easy to compare the value of our RDD with our expected
result.
collect() suffers from the restriction that all of your data must fit on a single machine, as it
all needs to be copied to the driver.
Example:
val data = sc.parallelize(Array(('A',1),('b',2),('c',3)))
val data2 =sc.parallelize(Array(('A',4),('A',6),('b',7),('c',3),('c',8)))
val result = data.join(data2)
println(result.collect().mkString(","))
take()
take(n) returns n elements from the RDD and attempts to minimize the number of
partitions it accesses, so it may represent a biased collection. It's important to note that these
operations do not return the elements in the order you might expect.
These operations are useful for unit tests and quick debugging, but may introduce
bottlenecks when dealing with large amounts of data.
Example:
val data = sc.parallelize(Array(('k',5),('s',3),('s',4),('p',7),('p',5),('t',8),('k',6)),3)
val group = data.groupByKey().collect()
val twoRec = group.take(2)
twoRec.foreach(println)
top()
We can also extract the top elements from an RDD using top(). top() will use the default
ordering on the data, but we can supply our own comparison function to extract the top
elements.
The takeSample(withReplacement, num, seed) function allows us to take a sample of our
data either with or without replacement; a sketch follows the top() example below.
Example:
val data = spark.read.textFile("spark_test.txt").rdd
val mapFile = data.map(line => (line,line.length))
val res = mapFile.top(3)
res.foreach(println)
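A minimal sketch of takeSample() mentioned above (sampling three elements without replacement; the seed argument is optional):
val nums = sc.parallelize(List(1,2,3,4,5,6,7,8))
val sample = nums.takeSample(false, 3)   // three randomly chosen elements
sample.foreach(println)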
foreach()
The foreach() action lets us perform computations on each element in the RDD without
bringing it back locally.
Example:
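A minimal sketch (assuming a small RDD of integers created with parallelize()):
scala>val nums = sc.parallelize(List(1, 2, 3, 4))
scala>nums.foreach(x => println(x * 2))   // the function runs on the executors for each element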
Converting Between RDD Types
Some functions are available only on certain types of RDDs, such as mean() and variance()
on numeric RDDs or join() on key/value pair RDDs.
Scala
In Scala the conversion to RDDs with special functions (e.g., to expose numeric functions
on an RDD[Double]) is handled automatically using implicit conversions. These implicits turn
an RDD into various wrapper classes, such as DoubleRDDFunctions (for RDDs of numeric
data) and PairRDDFunctions (for key/value pairs), to expose additional
functions such as mean() and variance().
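For instance (a minimal sketch; the implicit conversion to DoubleRDDFunctions happens automatically for an RDD of Doubles):
val nums = sc.parallelize(List(1.0, 2.0, 3.0, 4.0))
println(nums.mean())       // 2.5
println(nums.variance())   // 1.25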
Persistence (Caching)
Spark RDDs are lazily evaluated, and sometimes we may wish to use the same RDD
multiple times. If we do this naively, Spark will recompute the RDD and all of its dependencies
each time we call an action on the RDD. This can be especially expensive for iterative
algorithms, which look at the data many times. Another simple example would be doing a count
and then writing out the same RDD, as shown in the example below.
Example: Double execution in Scala
val result = input.map(x => x*x)
println(result.count())
println(result.collect().mkString(","))
To avoid computing an RDD multiple times, we can ask Spark to persist the data. When
we ask Spark to persist an RDD, the nodes that compute the RDD store their partitions. If a
node that has data persisted on it fails, Spark will recompute the lost partitions of the data when
needed. We can also replicate our data on multiple nodes if we want to be able to handle node
failure without slowdown.
In Scala and Java, the default persist() will store the data in the JVM heap as unserialized
objects (see the sketch below). In Python, we always serialize the data that persist stores, so
the default is instead stored in the JVM heap as pickled objects. When we write data out to disk or off-heap
storage, that data is also always serialized.
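A sketch of persisting the squared RDD from the double-execution example before calling two actions on it (MEMORY_ONLY is the default level used by persist() and cache() on RDDs in Scala):
import org.apache.spark.storage.StorageLevel
val result = input.map(x => x * x)
result.persist(StorageLevel.MEMORY_ONLY)    // or simply result.cache()
println(result.count())                     // computed once and cached
println(result.collect().mkString(","))     // served from the cached partitions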
Table 3-6. Persistence levels from org.apache.spark.storage.StorageLevel and
pyspark.StorageLevel; if desired, we can replicate the data on two machines by adding _2
to the end of the storage level.