If it is executed successfully, you will find the output given below. The OK in the following output is for user identification and is the last line printed by the program itself. If you read the rest of the output carefully, you will see Spark shutting down, for example stopping the web UI, the DAGScheduler, and the SparkContext:
OK
15/07/08 13:56:13 INFO SparkContext: Invoking stop() from shutdown hook
15/07/08 13:56:13 INFO SparkUI: Stopped Spark web UI at http://192.168.1.217:4040
15/07/08 13:56:13 INFO DAGScheduler: Stopping DAGScheduler
15/07/08 13:56:14 INFO MapOutputTrackerMasterEndpoint: MapOutputTrackerMasterEndpoint stopped!
15/07/08 13:56:14 INFO Utils: path = /tmp/spark-45a07b83-42ed-42b3-b2c2-823d8d99c5af/blockmgr-ccdda9e3-24f6-491b-b509-3d15a9e05818, already present as root for deletion.
15/07/08 13:56:14 INFO MemoryStore: MemoryStore cleared
15/07/08 13:56:14 INFO BlockManager: BlockManager stopped
15/07/08 13:56:14 INFO BlockManagerMaster: BlockManagerMaster stopped
15/07/08 13:56:14 INFO SparkContext: Successfully stopped SparkContext
15/07/08 13:56:14 INFO Utils: Shutdown hook called
15/07/08 13:56:14 INFO Utils: Deleting directory /tmp/spark-45a07b83-42ed-42b3-b2c2-823d8d99c5af
15/07/08 13:56:14 INFO OutputCommitCoordinator$OutputCommitCoordinatorEndpoint: OutputCommitCoordinator stopped!
The following commands open the outfile directory and list the files in it:
$ cd outfile
$ ls
part-00000 part-00001 _SUCCESS
$ cat part-00000
(people,1)
(are,2)
(not,1)
(as,8)
(beautiful,2)
(they,7)
(look,1)
$ cat part-00001
(walk,1)
(or,1)
(talk,1)
(only,1)
(love,1)
(care,1)
(share,1)
Go through the following section to learn more about the spark-submit command.
Spark-submit Syntax
spark-submit [options] <app jar | python file> [app arguments]
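For example, a word-count application might be submitted to a local master as follows (SparkWordCount and wordcount.jar are hypothetical names used for illustration):
$ spark-submit --class SparkWordCount --master local wordcount.jar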
Options
The list given below describes a subset of the most commonly used options; run spark-submit --help for the complete, version-specific list.
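These descriptions are a sketch based on the help text of Spark 1.x-era spark-submit:
--master MASTER_URL          spark://host:port, mesos://host:port, yarn, or local.
--deploy-mode DEPLOY_MODE    Whether to launch the driver program locally ("client") or on one of the worker machines inside the cluster ("cluster"). Default: client.
--class CLASS_NAME           Your application's main class (for Java/Scala apps).
--name NAME                  A name for your application.
--jars JARS                  Comma-separated list of local jars to include on the driver and executor classpaths.
--files FILES                Comma-separated list of files to be placed in the working directory of each executor.
--conf PROP=VALUE            An arbitrary Spark configuration property.
--driver-memory MEM          Memory for the driver (e.g. 1000M, 2G).
--executor-memory MEM        Memory per executor (e.g. 1000M, 2G).
--verbose, -v                Print additional debug output.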
6. ADVANCED SPARK PROGRAMMING
Spark provides two types of shared variables: broadcast variables and accumulators.
Broadcast Variables
Broadcast variables allow the programmer to keep a read-only variable cached on each machine rather than shipping a copy of it with tasks. They can be used, for example, to give every node a copy of a large input dataset in an efficient manner. Spark also
attempts to distribute broadcast variables using efficient broadcast algorithms to reduce
communication cost.
Spark actions are executed through a set of stages, separated by distributed “shuffle”
operations. Spark automatically broadcasts the common data needed by tasks within
each stage.
The data broadcast this way is cached in serialized form and is deserialized before running each task. This means that explicitly creating broadcast variables is only useful when tasks across multiple stages need the same data, or when caching the data in deserialized form is important.
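A broadcast variable is created from a variable v by calling SparkContext.broadcast(v), and its value is accessed through its value method. A minimal sketch in the Spark shell (the echoed result lines may vary by Spark version):
scala> val broadcastVar = sc.broadcast(Array(1, 2, 3))
broadcastVar: org.apache.spark.broadcast.Broadcast[Array[Int]] = Broadcast(0)

scala> broadcastVar.value
res0: Array[Int] = Array(1, 2, 3)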
After the broadcast variable is created, it should be used instead of the value v in any functions run on the cluster, so that v is not shipped to the nodes more than once. In addition, the object v should not be modified after it is broadcast, in order to ensure that all nodes get the same value of the broadcast variable.
Accumulators
Accumulators are variables that are only "added" to through an associative operation and can therefore be efficiently supported in parallel. They can be used to implement
counters (as in MapReduce) or sums. Spark natively supports accumulators of numeric
types, and programmers can add support for new types. If accumulators are created
with a name, they will be displayed in Spark’s UI. This can be useful for understanding
the progress of running stages (NOTE: this is not yet supported in Python).
The code given below shows an accumulator being used to add up the elements of an
array:
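A minimal sketch in the Spark shell, using the Spark 1.x accumulator API (sc.accumulator); the accumulator starts at 0 and each task adds its elements to it:
scala> val accum = sc.accumulator(0, "My Accumulator")
accum: org.apache.spark.Accumulator[Int] = 0

scala> sc.parallelize(Array(1, 2, 3, 4)).foreach(x => accum += x)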
If you want to see the output of the above code, use the following command:
scala> accum.value
Output
res2: Int = 10
Numeric RDD Operations
Spark provides a number of predefined numeric operations on RDDs, some of which are listed in the following table:
S.No   Method     Meaning
1      count()    Number of elements in the RDD.
2      mean()     Average of the elements in the RDD.
3      sum()      Total value of the elements in the RDD.
4      max()      Maximum value among all elements in the RDD.
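A short sketch of these methods in the Spark shell; the RDD contents here are illustrative:
scala> val nums = sc.parallelize(Array(1.0, 2.0, 3.0, 4.0))

scala> nums.count()
res0: Long = 4

scala> nums.mean()
res1: Double = 2.5

scala> nums.sum()
res2: Double = 10.0

scala> nums.max()
res3: Double = 4.0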