Spark Questions

RDD persistence in Spark is an optimization technique that saves intermediate results to reduce computation time, particularly useful for iterative tasks. RDDs can be persisted using the cache() and persist() methods, with various storage levels available. RDDs can be created in Spark through parallelizing collections, loading external datasets, or transforming existing RDDs.

What do you mean by persistence?

Explain RDD Persistence in Spark.

RDD persistence
RDD persistence in Spark is an optimization technique that saves the intermediate results of an RDD so they can be reused in later evaluations. This reduces computation time and is especially helpful in iterative tasks, where computations are repeated on the same RDDs.
An RDD can be persisted by two methods: cache() and persist().
For the cache() method, the default storage level is MEMORY_ONLY, i.e., when we persist an RDD, each node stores in memory the partitions it computes and reuses them in later actions, which speeds up the computation.
The persist() method accepts various storage levels:
MEMORY_ONLY, MEMORY_AND_DISK, MEMORY_ONLY_SER (RDD stored as serialized Java objects), MEMORY_AND_DISK_SER, DISK_ONLY.
Spark's cache is fault tolerant: if any partition of an RDD is lost, it is recomputed using the transformations that originally created it.
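A minimal Scala sketch of both methods, assuming a spark-shell session (the RDD contents and the chosen storage level are illustrative assumptions):
scala > import org.apache.spark.storage.StorageLevel
scala > val cached = sc.parallelize(1 to 100000).map(_ * 2)
scala > cached.cache()                                 // same as persist(StorageLevel.MEMORY_ONLY)
scala > cached.count                                   // first action computes and caches the partitions
scala > cached.count                                   // later actions reuse the cached partitions
scala > val spilled = sc.parallelize(1 to 100000).map(_ + 1)
scala > spilled.persist(StorageLevel.MEMORY_AND_DISK)  // partitions that do not fit in memory are spilled to disk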
For more on persistence and caching, refer to the link: RDD Persistence and Caching Mechanism in Apache Spark.

List the ways of creating RDDs in Spark.


Describe how RDDs are created in Apache Spark.

Resilient Distributed Dataset (RDD) is Spark's core abstraction.
It is an immutable (read-only) distributed collection of objects.
Each dataset in an RDD is divided into logical partitions,
which may be computed on different nodes of the cluster.
RDDs may contain any type of Python, Java, or Scala objects, including user-defined classes.
There are three ways to create an RDD in Apache Spark:
1. By parallelizing a collection of objects
2. By loading an external dataset
3. From existing Apache Spark RDDs
1. Using parallelized collection
RDDs are generally created by parallelizing an existing collection, i.e., by taking an existing collection in the program and passing it to SparkContext's parallelize() method.
scala > val data = Array(1, 2, 3, 4, 5)
scala > val dataRDD = sc.parallelize(data)   // distribute the local collection across the cluster
scala > dataRDD.count                        // action: returns 5
2. External Datasets
In Spark, a distributed dataset can be formed from any data source supported by Hadoop.

val dataRDD = spark.read.textFile("F:/BigData/DataFlair/Spark/Posts.xml").rdd

3. Creating RDD from existing RDD


A transformation is the way to create an RDD from an already existing RDD.
A transformation acts as a function that takes an RDD as input and produces another resultant RDD.
The input RDD is not changed, since RDDs are immutable.
Some of the transformations applied on RDDs are filter, map, and flatMap.
val dataRDD = spark.read.textFile("F:/Mritunjay/BigData/DataFlair/Spark/Posts.xml").rdd

val resultRDD = data.filter{line => {line.trim().startsWith("<row")}


}
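The other transformations mentioned above can be sketched in the same way (the computations below are illustrative assumptions, not part of the original example):
val lengths = resultRDD.map(line => line.length)           // map: exactly one output element per input element
val words   = resultRDD.flatMap(line => line.split(" "))   // flatMap: zero or more output elements per input element
lengths.count
words.count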

How can we launch Spark application on YARN?


Explain the technique to launch Apache Spark over Hadoop YARN.
Apache Spark has two modes of running applications on YARN: cluster mode and client mode.
In cluster mode the Spark driver runs inside a YARN application master process on the cluster; in client mode the driver runs in the process that submitted the application.
An application is launched with spark-submit (or an interactive session with spark-shell), passing the YARN master and the deploy mode:
spark-submit --master yarn --deploy-mode cluster <application>
spark-submit --master yarn --deploy-mode client <application>
(Older Spark versions expressed the same choice as --master yarn-cluster or --master yarn-client.)
To know more about cluster managers, follow the link: Apache Spark Cluster Managers – YARN, Mesos & Standalone.
