BDA Unit III
(Part-I)
Spark was introduced by the Apache Software Foundation to speed up the Hadoop
computational process.
Contrary to a common belief, Spark is not a modified version of Hadoop and is not really
dependent on Hadoop, because it has its own cluster management. Hadoop is just one of the
ways to implement Spark.
Spark uses Hadoop in two ways – one is storage and the second is processing. Since Spark
has its own cluster management and computation, it uses Hadoop for storage purposes only.
Apache Spark
Apache Spark is a lightning-fast cluster computing technology, designed for fast
computation. It is based on Hadoop MapReduce and extends the MapReduce model to
efficiently use it for more types of computations, including interactive queries and stream
processing. The main feature of Spark is its in-memory cluster computing, which increases the
processing speed of an application.
Spark is designed to cover a wide range of workloads such as batch applications, iterative
algorithms, interactive queries, and streaming. Apart from supporting all these workloads in the
same system, it reduces the management burden of maintaining separate tools.
Features:
Apache Spark has the following features:
Speed: Spark helps run an application on a Hadoop cluster up to 100 times faster in memory,
and 10 times faster when running on disk. This is possible because it reduces the number of
read/write operations to disk by storing the intermediate processing data in memory.
Supports multiple languages: Spark provides built-in APIs in Java, Scala, and Python, so you
can write applications in different languages. Spark also comes with 80 high-level operators
for interactive querying.
Advanced Analytics: Spark supports not only ‘Map’ and ‘Reduce’ but also SQL queries,
streaming data, machine learning (ML), and graph algorithms.
Spark Built on Hadoop:
Spark can be deployed with Hadoop components in three ways, as explained below.
Standalone: In a standalone deployment, Spark occupies the place on top of HDFS (Hadoop
Distributed File System) and space is allocated for HDFS explicitly. Here, Spark and
MapReduce run side by side to cover all Spark jobs on the cluster.
Hadoop YARN: In a YARN deployment, Spark simply runs on YARN without any pre-installation
or root access required. This helps integrate Spark into the Hadoop ecosystem or
Hadoop stack, and allows other components to run on top of the stack.
Spark in MapReduce (SIMR): SIMR is used to launch a Spark job in addition to the
standalone deployment. With SIMR, a user can start Spark and use its shell without any
administrative access.
Components of Spark
Spark Core
Spark Core is the underlying general execution engine for the Spark platform upon which all
other functionality is built. It provides in-memory computing and the ability to reference
datasets in external storage systems.
Spark SQL
Spark SQL is a component on top of Spark Core that introduces a new data abstraction
called SchemaRDD, which provides support for structured and semi-structured data.
Spark Streaming
Spark Streaming leverages Spark Core's fast scheduling capability to perform streaming
analytics. It ingests data in mini-batches and performs RDD (Resilient Distributed Datasets)
transformations on those mini-batches of data.
MLlib (Machine Learning Library)
MLlib is a distributed machine learning framework on top of Spark, built on the distributed
memory-based Spark architecture. According to benchmarks done by the MLlib developers
against the Alternating Least Squares (ALS) implementations, Spark MLlib is nine times as
fast as the Hadoop disk-based version of Apache Mahout (before Mahout gained a Spark
interface).
GraphX
GraphX is a distributed graph-processing framework on top of Spark. It provides an API
for expressing graph computations that can model user-defined graphs using the Pregel
abstraction API. It also provides an optimized runtime for this abstraction.
Cluster Managers
To support this wide range of workloads while maximizing flexibility, Spark can run over a
variety of cluster managers, including Hadoop YARN, Apache Mesos, and a simple cluster
manager included in Spark itself called the Standalone Scheduler.
RDD Basics
An RDD in Spark is simply an immutable distributed collection of objects. Each RDD is
split into multiple partitions, which may be computed on different nodes of the cluster. RDDs
can contain any type of Python, Java, or Scala objects, including user-defined classes.
Resilient: fault-tolerant with the help of the RDD lineage graph (DAG), and so able to
recompute missing or damaged partitions due to node failures.
Distributed: data resides on multiple nodes.
Dataset: represents the records of the data you work with. The user can load the dataset
externally, which can be a JSON file, CSV file, text file, or a database via JDBC, with no
specific data structure.
Users create RDDs in three ways: (i) parallelizing an existing collection in your driver
program, (ii) referencing a dataset in an external storage system, such as a shared file
system, HDFS, HBase, or any data source offering a Hadoop InputFormat, or (iii) through
deterministic operations (transformations) on other RDDs. An RDD is a fault-tolerant
collection of elements that can be operated on in parallel.
Once created, RDDs offer two types of operations: transformations and actions.
Transformations construct a new RDD from a previous one. For example, one common
transformation is filtering data that matches a predicate. In our text file example, we can use
this to create a new RDD holding just the lines that contain the word spark, as shown in
Example 1: Calling the filter() transformation
scala>val data = sc.textFile("f1.txt")
scala>val DFData = data.filter(line => line.contains("spark"))
Actions, on the other hand, compute a result based on an RDD, and either return it to the
driver program or save it to an external storage system (e.g., HDFS). One example of an action
we called earlier is first(), which returns the first element of an RDD; another is foreach(),
which applies a given function to every element of an RDD (for example, printing each one),
as sketched below.
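For instance, continuing the text-file example above (a minimal sketch reusing the DFData RDD defined earlier):
scala>DFData.first()           // returns the first line containing "spark"
scala>DFData.foreach(println)  // prints every matching line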
Features of RDD:
1. In-memory computation
2. Immutability
3. Lazy evaluation
4. Persistence
5. Fault tolerance
Creating RDDs
Spark provides three ways to create RDDs:
(i) parallelizing a collection in your driver program
(ii) loading an external dataset, and
(iii) from other RDDs.
(i) parallelizing a collection in your driver program
The simplest way to create RDDs is to take an existing collection in your program and
pass it to SparkContext's parallelize() method, as shown in the example below. This approach is very
useful when you are learning Spark, since you can quickly create your own RDDs in the shell
and perform operations on them. Keep in mind, however, that outside of prototyping and
testing, this is not widely used since it requires that you have your entire dataset in memory on
one machine.
Example: parallelize() method in Scala
val rdd1 = spark.sparkContext.parallelize(Array("jan","feb","mar","april","may","jun"),3)
rdd1.foreach(println)
Output: It will display all the array elements.
(ii) loading an external dataset
A more common way to create RDDs is to load data from external storage. One method that
loads a text file as an RDD of strings is SparkContext.textFile(), which is shown in the
example below.
Example: textFile() method in Scala
val lines = sc.textFile("/home/user/f1.txt")
(iii) from other RDD
Create some input RDDs from external data and transform them to define new RDDs
using transformations such as filter(), map(), etc.
Example:
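A minimal sketch (assuming the lines RDD created above with textFile(); words and wordPair are illustrative RDDs derived from it):
val words = lines.flatMap(line => line.split(" "))   // split each line into words
val wordPair = words.map(w => (w.charAt(0), w))      // pair each word with its first character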
wordPair.foreach(println)
RDD Operations
RDDs support two types of operations: transformations and actions. Transformations are
operations on RDDs that return a new RDD, such as map() and filter(). Actions are operations
that return a result to the driver program or write it to storage, and kick off a computation, such
as count() and first().
Transformations
Transformations are operations on RDDs that return a new RDD. Transformed RDDs are
computed lazily, only when you use them in an action. Many transformations are element-wise;
that is, they work on one element at a time; but this is not true for all transformations.
As an example, suppose that we have a logfile, log.txt, with a number of messages, and
we want to select only the error messages. We can use the filter() transformation seen before.
Example: filter() transformation in Scala
scala>val inputRDD = sc.textFile("log.txt")
scala>val errorsRDD = inputRDD.filter(line =>line.contains("error"))
scala>errorsRDD.foreach(println)
filter() operation does not mutate the existing inputRDD. Instead, it returns a pointer to an
entirely new RDD. Let's use inputRDD again to search for lines with the word warning in
them. Then, we’ll use another transformation, union(), to print out the number of lines that
contained either error or warning.
union() is a bit different than filter(), in that it operates on two RDDs instead of one.
Transformations can actually operate on any number of input RDDs.
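A sketch of that sequence, reusing inputRDD and errorsRDD from the previous example (warningsRDD and badLinesRDD are illustrative names):
scala>val warningsRDD = inputRDD.filter(line => line.contains("warning"))
scala>val badLinesRDD = errorsRDD.union(warningsRDD)
scala>println(badLinesRDD.count())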
Finally, as you derive new RDDs from each other using transformations, Spark keeps
track of the set of dependencies between different RDDs, called the lineage graph. It uses this
information to compute each RDD on demand and to recover lost data if part of a persistent
RDD is lost.
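As an illustration, the standard toDebugString method prints this lineage for an RDD (a minimal sketch using errorsRDD from above):
scala>println(errorsRDD.toDebugString)   // shows the textFile() and filter() steps it depends on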
Actions
Actions are the second type of RDD operation. They are the operations that return a final
value to the driver program or write data to an external storage system. Actions force the
evaluation of the transformations required for the RDD since they produce output.
In the log example from the previous section, we want to print out some information about
the errorsRDD. To do that, we use two actions: count(), which returns the count as a number,
and take(), which collects a number of elements from the RDD.
Example: Scala error count using actions
println("Input had " + errorsRDD.count() + " concerning lines")
println("Here are 3 examples:")
errorsRDD.take(3).foreach(println)
In this example, we used take() to retrieve a small number of elements in the RDD at the
driver program. We then iterate over them locally to print out information at the driver. RDDs
also have a collect() function to retrieve the entire RDD. This is useful when the RDD has been
filtered down to a very small size that we want to deal with locally. Since the entire dataset
must fit in memory on one machine, collect() shouldn't be used on large datasets.
In most cases RDDs can’t just be collect()ed to the driver because they are too large. In
these cases, it’s common to write data out to a distributed storage system such as HDFS or
Amazon S3. We can save the contents of an RDD using the saveAsTextFile() action,
saveAsSequenceFile(), or any of a number of actions for various built-in formats.
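For example, saving the error lines from above (the output path here is only illustrative):
scala>errorsRDD.saveAsTextFile("hdfs:///user/output/errors")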
It is important to note that each time we call a new action, the entire RDD must be computed
“from scratch.” To avoid this inefficiency, users can persist intermediate results using
“Persistence (Caching)”, described later.
Lazy Evaluation
Transformations on RDDs are lazily evaluated, meaning that Spark will not begin to execute
until it sees an action.
Lazy evaluation means that when we call a transformation on an RDD (for instance, calling
map()), the operation is not immediately performed. Instead, Spark internally records metadata
to indicate that this operation has been requested. Rather than thinking of an RDD as containing
specific data, it is best to think of each RDD as consisting of instructions on how to compute
the data that we build up through transformations.
Loading data into an RDD is lazily evaluated in the same way transformations are. So,
when we call sc.textFile(), the data is not loaded until it is necessary. As with transformations,
the operation (in this case, reading the data) can occur multiple times. Although transformations
are lazy, we can force Spark to execute them at any time by running an action, such as
count().Spark uses lazy evaluation to reduce the number of passes it has to take over our data
by grouping operations together.
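A small sketch of this behaviour in the shell (the file name is illustrative):
scala>val lines = sc.textFile("log.txt")                          // nothing is read yet
scala>val errors = lines.filter(line => line.contains("error"))   // still no work is done
scala>errors.count()                                              // the action triggers reading and filtering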
Passing Functions to Spark
Most of Spark's transformations, and some of its actions, depend on passing in functions that
are used by Spark to compute data.
Scala
In Scala, we can pass in functions defined inline, references to methods, or static functions
as we do for Scala’s other functional APIs.
Example:
scala> def sayhello(){
| println("Hello")
|}
To check the output, call it:
scala>sayhello()
Output: Hello
Example: with parameters
scala>def sum(a:Int,b:Int)
|{
| println(a+b)
|}
To check the output:
scala>sum(2,6)
Output: 8
Example: using return
scala> def sum(a:Int,b:Int):Int={
| return a+b
|}
sum: (a: Int, b: Int)Int
scala> sum(2,6)
res7: Int = 8
Example:
def outer(a:Int){
| println("In outer")
| def inner(){
| println(a*3)
|}
| inner()
|}
outer: (a: Int)Unit
Output:
scala> outer(3)
In outer
9
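Bringing these together, a named function can be passed to a Spark transformation just like an inline one (a minimal sketch; isError is an illustrative helper):
scala>def isError(line: String): Boolean = line.contains("error")
scala>val errors = sc.textFile("log.txt").filter(isError)
scala>println(errors.count())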
1. map()
Spark's map() is a function which expresses a one-to-one transformation: it transforms each
element of a collection into one element of the resulting collection.
It is useful to note that map()'s return type does not have to be the same as its input type,
so if we had an RDD of strings and our map() function were to parse the strings and return a
Double, our input RDD type would be RDD[String] and the resulting RDD type would be
RDD[Double].
Example: Scala squaring the values in an RDD
val input = sc.parallelize(List(1, 2, 3, 4))
val result = input.map(x => x * x)
println(result.collect().mkString(","))
2. flatMap()
Spark's flatMap() is a function which expresses a one-to-many transformation: it transforms
each element to 0 or more elements. Sometimes we want to produce multiple output elements
for each input element. The operation to do this is called flatMap(). As with map(), the function
we provide to flatMap() is called individually for each element in our input RDD. Instead of
returning a single element, we return an iterator with our return values. Rather than producing
an RDD of iterators, we get back an RDD that consists of the elements from all of the iterators.
A simple usage of flatMap() is splitting up an input string into words:
Example: flatMap() in Scala, splitting lines into multiple words
scala>val lines = sc.parallelize(List("hello world", "hi welcome to spark"))
scala>val words = lines.flatMap(line =>line.split(" "))
scala>words.first() // returns "hello"
scala>words.foreach(println) //returns all words.
We illustrate the difference between flatMap() and map() in Figure 3-3: flatMap() "flattens"
the iterators returned to it, so that instead of ending up with an RDD of lists we have an RDD
of the elements in those lists.
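The contrast can be seen on the same input (a sketch reusing the lines RDD from the flatMap() example above):
scala>lines.map(line => line.split(" ")).first()      // Array(hello, world), because map() gives an RDD of arrays
scala>lines.flatMap(line => line.split(" ")).first()  // hello, because flatMap() gives an RDD of individual words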
4. groupByKey()
When we use groupByKey() on a dataset of (K, V) pairs, the data is shuffled according to
the key K and the values for each key are grouped together in another RDD.
Example:
scala>val data =
spark.sparkContext.parallelize(Array(('k',5),('s',3),('s',4),('p',7),('p',5),('t',8),('k',6)),3)
scala>val group = data.groupByKey().collect()
scala>group.foreach(println)
5. reduceByKey(func, [numTasks])
When we use reduceByKey() on a dataset of (K, V) pairs, the pairs on the same machine with
the same key are combined (using func) before the data is shuffled.
Example:
val words = Array("one","two","two","four","five","six","six","eight","nine","ten")
val data = spark.sparkContext.parallelize(words).map(w => (w,1)).reduceByKey(_+_)
data.foreach(println)
6. sortByKey()
When we apply the sortByKey() function on a dataset of (K, V) pairs, the data is sorted
according to the key K in another RDD.
Example:
val data = spark.sparkContext.parallelize(Seq(("maths",52), ("english",75), ("science",82),
("computer",65), ("maths",85)))
val sorted = data.sortByKey()
sorted.foreach(println)
7. coalesce()
To avoid a full shuffle of the data we use the coalesce() function. coalesce() reuses existing
partitions so that less data is shuffled, and with it we can cut down the number of partitions.
Suppose we have data in four partitions and we want only two; the data from the extra
partitions is then moved onto the partitions we keep.
Example:
val rdd1 = spark.sparkContext.parallelize(Array("jan","feb","mar","april","may","jun"),3)
val result = rdd1.coalesce(2)
result.foreach(println)
8. join()
Join is a database term: it combines fields from two tables using common values. The
join() operation in Spark is defined on pair RDDs, i.e. RDDs in which each element is a tuple
whose first element is the key and whose second element is the value.
Example:
val data = spark.sparkContext.parallelize(Array(('A',1),('b',2),('c',3)))
val data2 =spark.sparkContext.parallelize(Array(('A',4),('A',6),('b',7),('c',3),('c',8)))
val result = data.join(data2)
println(result.collect().mkString(","))
Pseudo set operations
RDDs support many of the operations of mathematical sets, such as union and intersection,
even when the RDDs themselves are not properly sets. Four operations are shown in Figure 3-
4. It’s important to note that all of these operations require that the RDDs being operated on
are of the same type.
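For instance, union() keeps duplicates unless distinct() is applied afterwards (a minimal sketch with two small RDDs of the same type):
scala>val nums1 = sc.parallelize(Seq(1, 2, 3, 4))
scala>val nums2 = sc.parallelize(Seq(3, 4, 5, 6))
scala>nums1.union(nums2).collect()              // Array(1, 2, 3, 4, 3, 4, 5, 6)
scala>nums1.union(nums2).distinct().collect()   // duplicates removed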
intersection()
Example:
scala>val rdd1 = spark.sparkContext.parallelize(Seq((1,"jan",2016),(3,"nov",2014),
(16,"feb",2014)))
scala>val rdd2 = sc.parallelize(Seq((5,"dec",2014),(1,"jan",2016)))
scala>val common = rdd1.intersection(rdd2)
scala>common.foreach(println)
subtract()
Sometimes we need to remove some data from consideration. The subtract(other) function
takes in another RDD and returns an RDD that has only values present in the first RDD and
not the second RDD. Like intersection(), it performs a shuffle.
scala>val rdd1 = sc.parallelize(Seq((1,"jan",2016),(3,"nov",2014), (16,"feb",2014)))
scala>val rdd2 = sc.parallelize(Seq((5,"dec",2014),(1,"jan",2016)))
scala>val sub = rdd1.subtract(rdd2)
scala>sub.foreach(println)
cartesian()
We can also compute a Cartesian product between two RDDs, as shown in Figure 3-5. The
cartesian(other) transformation returns all possible pairs of (a,b) where a is in the source RDD
and b is in the other RDD.
scala>val rdd1 = sc.parallelize(Seq((1,"jan",2016),(3,"nov",2014), (16,"feb",2014)))
scala>val rdd2 = sc.parallelize(Seq((5,"dec",2014),(1,"jan",2016)))
scala>val cart = rdd1.cartesian(rdd2)
scala>cart.foreach(println)
Actions
1. reduce()
The most common action on basic RDDs is reduce(), which takes a function that operates
on two elements of the type in your RDD and returns a new element of the same type.
A simple example of such a function is +, which we can use to sum our RDD. With
reduce(), we can easily sum the elements of our RDD, count the number of elements, and
perform other types of aggregations.
Example: reduce() in Scala
val rdd1 = spark.sparkContext.parallelize(List(20,32,45,62,8,5))
val sum = rdd1.reduce(_+_)
println(sum)
Similar to reduce() is fold(), which also takes a function with the same signature as needed
for reduce(), but in addition takes a “zero value” to be used for the initial call on each partition.
The zero value you provide should be the identity element for your operation; that is, applying
it multiple times with your function should not change the value (e.g., 0 for +, 1 for *, or an
empty list for concatenation).
You can minimize object creation in fold() by modifying and returning the first of the two
parameters in place. However, you should not modify the second parameter.
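A sketch of fold() with 0 as the zero value for addition (the values are the same as in the reduce() example above):
val rdd1 = spark.sparkContext.parallelize(List(20,32,45,62,8,5))
val total = rdd1.fold(0)(_+_)   // 0 is the identity element for +
println(total)                  // prints 172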
2. aggregate()
The aggregate() function frees us from the constraint of having the return be the same type
as the RDD we are working on. With aggregate(), like fold(), we supply an initial zero value
of the type we want to return. We then supply a function to combine the elements from our
RDD with the accumulator. Finally, we need to supply a second function to merge two
accumulators, given that each node accumulates its own results locally. We can use aggregate()
to compute the average of an RDD.
Example: aggregate() in Scala
val result = input.aggregate((0, 0))(
(acc, value) => (acc._1 + value, acc._2 + 1),
(acc1, acc2) => (acc1._1 + acc2._1, acc1._2 + acc2._2))
val avg = result._1 / result._2.toDouble
collect()
Another operation that returns data to our driver program is collect(), which returns the entire
RDD's contents. collect() is commonly used in unit tests where the entire contents of the RDD
are expected to fit in memory, since that makes it easy to compare the value of our RDD with our expected
result.
collect() suffers from the restriction that all of your data must fit on a single machine, as it
all needs to be copied to the driver.
Example:
val data = sc.parallelize(Array(('A',1),('b',2),('c',3)))
val data2 =sc.parallelize(Array(('A',4),('A',6),('b',7),('c',3),('c',8)))
val result = data.join(data2)
println(result.collect().mkString(","))
take()
take(n) returns n elements from the RDD and attempts to minimize the number of
partitions it accesses, so it may represent a biased collection. It's important to note that these
operations do not return the elements in the order you might expect.
These operations are useful for unit tests and quick debugging, but may introduce
bottlenecks when dealing with large amounts of data.
Example:
val data = sc.parallelize(Array(('k',5),('s',3),('s',4),('p',7),('p',5),('t',8),('k',6)),3)
val group = data.groupByKey().collect()
val twoRec = group.take(2)
twoRec.foreach(println)
top()
We can also extract the top elements from an RDD using top(). top() will use the default
ordering on the data, but we can supply our own comparison function to extract the top
elements.
The takeSample(withReplacement, num, seed) function allows us to take a sample of our
data either with or without replacement; a sketch follows the top() example below.
Example:
val data = spark.read.textFile("spark_test.txt").rdd
val mapFile = data.map(line => (line,line.length))
val res = mapFile.top(3)
res.foreach(println)
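A minimal sketch of takeSample() mentioned above (sampling three elements without replacement; the seed argument is optional):
val nums = sc.parallelize(List(1,2,3,4,5,6,7,8))
val sample = nums.takeSample(false, 3)   // three randomly chosen elements
sample.foreach(println)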
foreach()
The foreach() action lets us perform computations on each element in the RDD without
bringing it back locally.
Example:
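A minimal sketch (assuming a small RDD of integers created with parallelize()):
scala>val nums = sc.parallelize(List(1, 2, 3, 4))
scala>nums.foreach(x => println(x * 2))   // the function runs on the executors for each element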
Converting Between RDD Types
Some functions are available only on certain types of RDDs, such as mean() and variance()
on numeric RDDs or join() on key/value pair RDDs.
Scala
In Scala the conversion to RDDs with special functions (e.g., to expose numeric functions
on an RDD[Double]) is handled automatically using implicit conversions. These implicits turn
an RDD into various wrapper classes, such as DoubleRDDFunctions (for RDDs of numeric
data) and PairRDDFunctions (for key/value pairs), to expose additional
functions such as mean() and variance().
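For instance (a minimal sketch; the implicit conversion to DoubleRDDFunctions happens automatically for an RDD of Doubles):
val nums = sc.parallelize(List(1.0, 2.0, 3.0, 4.0))
println(nums.mean())       // 2.5
println(nums.variance())   // 1.25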
Persistence (Caching)
Spark RDDs are lazily evaluated, and sometimes we may wish to use the same RDD
multiple times. If we do this naively, Spark will recompute the RDD and all of its dependencies
each time we call an action on the RDD. This can be especially expensive for iterative
algorithms, which look at the data many times. Another simple example would be doing a count
and then writing out the same RDD, as shown in the example below.
Example: Double execution in Scala
val result = input.map(x => x*x)
println(result.count())
println(result.collect().mkString(","))
To avoid computing an RDD multiple times, we can ask Spark to persist the data. When
we ask Spark to persist an RDD, the nodes that compute the RDD store their partitions. If a
node that has data persisted on it fails, Spark will recompute the lost partitions of the data when
needed. We can also replicate our data on multiple nodes if we want to be able to handle node
failure without slowdown.
In Scala and Java, the default persist() will store the data in the JVM heap as unserialized
objects (see the sketch below). In Python, we always serialize the data that persist stores, so
the default is instead stored in the JVM heap as pickled objects. When we write data out to disk or off-heap
storage, that data is also always serialized.
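A sketch of persisting the squared RDD from the double-execution example before calling two actions on it (MEMORY_ONLY is the default level used by persist() and cache() on RDDs in Scala):
import org.apache.spark.storage.StorageLevel
val result = input.map(x => x * x)
result.persist(StorageLevel.MEMORY_ONLY)    // or simply result.cache()
println(result.count())                     // computed once and cached
println(result.collect().mkString(","))     // served from the cached partitions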
Table 3-6. Persistence levels from org.apache.spark.storage.StorageLevel and
pyspark.StorageLevel; if desired, we can replicate the data on two machines by adding _2
to the end of the storage level.