
Apache Spark

Submit the spark application using the following command:

spark-submit --class SparkWordCount --master local wordcount.jar

If it is executed successfully, then you will find the output given below. The OK text in the
following output is for user identification and comes from the last line of the program. If
you read the following output carefully, you will find different things, such as:

•  successfully started service 'sparkDriver' on port 42954
•  MemoryStore started with capacity 267.3 MB
•  Started SparkUI at http://192.168.1.217:4040
•  Added JAR file:/home/hadoop/piapplication/count.jar
•  ResultStage 1 (saveAsTextFile at SparkPi.scala:11) finished in 0.566 s
•  Stopped Spark web UI at http://192.168.1.217:4040
•  MemoryStore cleared

15/07/08 13:56:04 INFO Slf4jLogger: Slf4jLogger started


15/07/08 13:56:04 INFO Utils: Successfully started service 'sparkDriver' on
port 42954.
15/07/08 13:56:04 INFO Remoting: Remoting started; listening on addresses
:[akka.tcp://sparkDriver@192.168.1.217:42954]
15/07/08 13:56:04 INFO MemoryStore: MemoryStore started with capacity 267.3 MB
15/07/08 13:56:05 INFO HttpServer: Starting HTTP Server
15/07/08 13:56:05 INFO Utils: Successfully started service 'HTTP file server'
on port 56707.
15/07/08 13:56:06 INFO SparkUI: Started SparkUI at http://192.168.1.217:4040
15/07/08 13:56:07 INFO SparkContext: Added JAR
file:/home/hadoop/piapplication/count.jar at
http://192.168.1.217:56707/jars/count.jar with timestamp 1436343967029
15/07/08 13:56:11 INFO Executor: Adding file:/tmp/spark-45a07b83-42ed-42b3-
b2c2-823d8d99c5af/userFiles-df4f4c20-a368-4cdd-a2a7-39ed45eb30cf/count.jar to
class loader
15/07/08 13:56:11 INFO HadoopRDD: Input split:
file:/home/hadoop/piapplication/in.txt:0+54
15/07/08 13:56:12 INFO Executor: Finished task 0.0 in stage 0.0 (TID 0). 2001
bytes result sent to driver
(MapPartitionsRDD[5] at saveAsTextFile at SparkPi.scala:11), which is now
runnable
15/07/08 13:56:12 INFO DAGScheduler: Submitting 1 missing tasks from
ResultStage 1 (MapPartitionsRDD[5] at saveAsTextFile at SparkPi.scala:11)
15/07/08 13:56:13 INFO DAGScheduler: ResultStage 1 (saveAsTextFile at
SparkPi.scala:11) finished in 0.566 s
15/07/08 13:56:13 INFO DAGScheduler: Job 0 finished: saveAsTextFile at
SparkPi.scala:11, took 2.892996 s


OK
15/07/08 13:56:13 INFO SparkContext: Invoking stop() from shutdown hook
15/07/08 13:56:13 INFO SparkUI: Stopped Spark web UI at
http://192.168.1.217:4040
15/07/08 13:56:13 INFO DAGScheduler: Stopping DAGScheduler
15/07/08 13:56:14 INFO MapOutputTrackerMasterEndpoint:
MapOutputTrackerMasterEndpoint stopped!
15/07/08 13:56:14 INFO Utils: path = /tmp/spark-45a07b83-42ed-42b3-b2c2-
823d8d99c5af/blockmgr-ccdda9e3-24f6-491b-b509-3d15a9e05818, already present as
root for deletion.
15/07/08 13:56:14 INFO MemoryStore: MemoryStore cleared
15/07/08 13:56:14 INFO BlockManager: BlockManager stopped
15/07/08 13:56:14 INFO BlockManagerMaster: BlockManagerMaster stopped
15/07/08 13:56:14 INFO SparkContext: Successfully stopped SparkContext
15/07/08 13:56:14 INFO Utils: Shutdown hook called
15/07/08 13:56:14 INFO Utils: Deleting directory /tmp/spark-45a07b83-42ed-42b3-
b2c2-823d8d99c5af
15/07/08 13:56:14 INFO OutputCommitCoordinator$OutputCommitCoordinatorEndpoint:
OutputCommitCoordinator stopped!

Step 5: Checking output


After successful execution of the program, you will find the directory named outfile in
the spark-application directory.

The following commands are used to open the outfile directory and list the files in it.

$ cd outfile
$ ls
part-00000  part-00001  _SUCCESS

The command for checking the output in the part-00000 file is:

$ cat part-00000
(people,1)
(are,2)
(not,1)
(as,8)
(beautiful,2)
(they,7)


(look,1)

The command for checking the output in the part-00001 file is:

$ cat part-00001
(walk,1)
(or,1)
(talk,1)
(only,1)
(love,1)
(care,1)
(share,1)

Go through the following section to learn more about the ‘spark-submit’ command.

Spark-submit Syntax
spark-submit [options] <app jar | python file> [app arguments]

Options
The table given below describes the list of options:

S.No  Option                   Description

1     --master                 spark://host:port, mesos://host:port, yarn, or local.
2     --deploy-mode            Whether to launch the driver program locally ("client") or on one of the worker machines inside the cluster ("cluster") (Default: client).
3     --class                  Your application's main class (for Java / Scala apps).
4     --name                   A name of your application.
5     --jars                   Comma-separated list of local jars to include on the driver and executor classpaths.
6     --packages               Comma-separated list of maven coordinates of jars to include on the driver and executor classpaths.
7     --repositories           Comma-separated list of additional remote repositories to search for the maven coordinates given with --packages.
8     --py-files               Comma-separated list of .zip, .egg, or .py files to place on the PYTHONPATH for Python apps.
9     --files                  Comma-separated list of files to be placed in the working directory of each executor.
10    --conf (prop=val)        Arbitrary Spark configuration property.
11    --properties-file        Path to a file from which to load extra properties. If not specified, this will look for conf/spark-defaults.conf.
12    --driver-memory          Memory for driver (e.g. 1000M, 2G) (Default: 512M).
13    --driver-java-options    Extra Java options to pass to the driver.
14    --driver-library-path    Extra library path entries to pass to the driver.
15    --driver-class-path      Extra class path entries to pass to the driver. Note that jars added with --jars are automatically included in the classpath.
16    --executor-memory        Memory per executor (e.g. 1000M, 2G) (Default: 1G).
17    --proxy-user             User to impersonate when submitting the application.
18    --help, -h               Show this help message and exit.
19    --verbose, -v            Print additional debug output.
20    --version                Print the version of the current Spark.
21    --driver-cores NUM       Cores for driver (Default: 1).
22    --supervise              If given, restarts the driver on failure.
23    --kill                   If given, kills the driver specified.
24    --status                 If given, requests the status of the driver specified.
25    --total-executor-cores   Total cores for all executors.
26    --executor-cores         Number of cores per executor. (Default: 1 in YARN mode, or all available cores on the worker in standalone mode.)
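
For example, a hypothetical invocation that combines several of these options with the wordcount.jar submitted earlier (the option values here are only illustrative, not recommendations) might look like this:

spark-submit \
  --class SparkWordCount \
  --master local[4] \
  --name WordCountApp \
  --driver-memory 1G \
  --executor-memory 2G \
  wordcount.jar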


6. ADVANCED SPARK PROGRAMMING

Spark contains two different types of shared variables: broadcast variables and
accumulators.

•  Broadcast variables: used to efficiently distribute large values.

•  Accumulators: used to aggregate information from a particular collection.

Broadcast Variables
Broadcast variables allow the programmer to keep a read-only variable cached on each
machine rather than shipping a copy of it with tasks. They can be used, for example, to
give every node a copy of a large input dataset in an efficient manner. Spark also
attempts to distribute broadcast variables using efficient broadcast algorithms to reduce
communication cost.

Spark actions are executed through a set of stages, separated by distributed “shuffle”
operations. Spark automatically broadcasts the common data needed by tasks within
each stage.

The data broadcasted this way is cached in serialized form and is deserialized before
running each task. This means that explicitly creating broadcast variables is only useful
when tasks across multiple stages need the same data or when caching the data in
deserialized form is important.

Broadcast variables are created from a variable v by calling SparkContext.broadcast(v).
The broadcast variable is a wrapper around v, and its value can be accessed by calling
the value method. The code given below shows this:

scala> val broadcastVar = sc.broadcast(Array(1, 2, 3))

Output:

broadcastVar: org.apache.spark.broadcast.Broadcast[Array[Int]] = Broadcast(0)
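
To read the broadcast value back on the driver, call the value method on the wrapper; a minimal sketch follows (the res index will vary with your shell session):

scala> broadcastVar.value

Output:

res1: Array[Int] = Array(1, 2, 3)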

After the broadcast variable is created, it should be used instead of the value v in any
functions run on the cluster, so that v is not shipped to the nodes more than once. In
addition, the object v should not be modified after it is broadcast, to ensure that
all nodes get the same value of the broadcast variable.
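
The following is a minimal sketch of this usage, assuming the broadcastVar created above and a hypothetical RDD named nums; each task reads the shared array through broadcastVar.value instead of capturing the array itself:

scala> val nums = sc.parallelize(Array(10, 20, 30))

scala> // the task function refers to the broadcast wrapper, not to the raw array
scala> nums.map(x => x * broadcastVar.value.sum).collect()

Because the broadcast array sums to 6, the collected result would be Array(60, 120, 180).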

Accumulators
Accumulators are variables that are only “added” to through an associative operation
and can therefore be efficiently supported in parallel. They can be used to implement
counters (as in MapReduce) or sums. Spark natively supports accumulators of numeric
types, and programmers can add support for new types. If accumulators are created
with a name, they will be displayed in Spark’s UI. This can be useful for understanding
the progress of running stages (NOTE: this is not yet supported in Python).

An accumulator is created from an initial value v by calling SparkContext.accumulator(v).
Tasks running on the cluster can then add to it using the add method or the += operator
(in Scala and Python). However, they cannot read its value. Only the driver program can
read the accumulator’s value, using its value method.

The code given below shows an accumulator being used to add up the elements of an
array:

scala> val accum = sc.accumulator(0)

scala> sc.parallelize(Array(1, 2, 3, 4)).foreach(x => accum += x)

If you want to see the output of the above code, use the following command:

scala> accum.value

Output

res2: Int = 10
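
As noted above, accumulators created with a name appear in Spark’s UI. The following is a minimal sketch of such a counter (the names blankLines and "Blank Lines" are only illustrative):

scala> // named accumulator; each task increments it, only the driver reads it
scala> val blankLines = sc.accumulator(0, "Blank Lines")

scala> sc.parallelize(Array("hello", "", "world", "")).foreach(line => if (line.isEmpty) blankLines += 1)

scala> blankLines.value

For this input, the driver would read back the value 2, and the "Blank Lines" accumulator would be displayed in the Spark UI.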

Numeric RDD Operations


Spark allows you to perform different operations on numeric data using one of the
predefined API methods. Spark’s numeric operations are implemented with a streaming
algorithm that allows building the model one element at a time.

These operations are computed and returned as a StatCounter object by calling the
stats() method.
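
A minimal sketch, assuming a numeric RDD built with sc.parallelize (the variable names are only illustrative):

scala> val doubles = sc.parallelize(Array(1.0, 2.0, 3.0, 4.0))

scala> // stats() makes a single pass over the data and returns a StatCounter
scala> val stats = doubles.stats()

scala> stats.mean

Here stats.mean would return 2.5, and the same StatCounter object provides the methods listed below.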

The following is a list of numeric methods available in StatCounter.

S.No  Method    Meaning

1     count()   Number of elements in the RDD.
2     mean()    Average of the elements in the RDD.
3     sum()     Total value of the elements in the RDD.
4     max()     Maximum value among all elements in the RDD.
