Spark Questions
RDD persistence
RDD persistence in Spark is an optimization technique used to save the intermediate results of an RDD so they can be reused in later evaluations if required, which reduces computation time. This is especially helpful in iterative tasks, where the same RDDs are computed repeatedly.
An RDD can be persisted using two methods: cache() and persist().
The cache() method uses the default storage level MEMORY_ONLY: when we persist an RDD, each node stores the partitions it computes in memory and reuses them in later actions on that dataset, which speeds up the computation.
The persist() method accepts various storage levels:
MEMORY_ONLY, MEMORY_AND_DISK, MEMORY_ONLY_SER (RDD stored as serialized Java objects), MEMORY_AND_DISK_SER, DISK_ONLY
Spark's cache is fault tolerant: if any partition of an RDD is lost, it is recomputed using the transformations that originally created it.
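A minimal REPL sketch of both methods, assuming a live SparkContext named sc (as in a spark-shell session); the sample data here is invented for illustration:
scala> import org.apache.spark.storage.StorageLevel
scala> val nums = sc.parallelize(1 to 1000)               // example data, made up for illustration
scala> nums.cache()                                       // equivalent to persist(StorageLevel.MEMORY_ONLY)
scala> val words = sc.parallelize(Seq("spark", "rdd"))    // example data, made up for illustration
scala> words.persist(StorageLevel.MEMORY_AND_DISK_SER)    // choose an explicit storage level
scala> nums.count                                         // first action computes and caches the partitions
scala> nums.unpersist()                                   // drop the cached partitions when no longer needed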
For more on persistence and caching, refer to: RDD Persistence and Caching Mechanism in Apache Spark
Resilient Distributed Dataset (RDD) is Spark's core abstraction.
It is an immutable (read-only) distributed collection of objects.
Each dataset in an RDD is divided into logical partitions,
which may be computed on different nodes of the cluster.
RDDs may contain any type of Python, Java, or Scala object, including user-defined classes.
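As a small sketch of partitioning, assuming a spark-shell session with SparkContext sc, the number of partitions can be set when the RDD is created and inspected afterwards:
scala> val nums = sc.parallelize(1 to 100, 4)   // illustrative data, split into 4 logical partitions
scala> nums.getNumPartitions                    // returns 4; each partition may be processed on a different node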
There are three ways to create an RDD in Apache Spark:
1. By parallelizing a collection of objects
2. By loading an external dataset
3. From existing Apache Spark RDDs
1. Using a parallelized collection
RDDs are commonly created by parallelizing an existing collection, i.e., by taking a collection in the driver program and passing it to SparkContext's parallelize() method.
scala> val data = Array(1, 2, 3, 4, 5)
scala> val dataRDD = sc.parallelize(data)
scala> dataRDD.count
2. External Datasets
In Spark, a distributed dataset can be formed from any data source supported by Hadoop.
For example, a text file can be loaded and converted to an RDD:
val dataRDD = spark.read.textFile("F:/Mritunjay/BigData/DataFlair/Spark/Posts.xml").rdd

RDD Transformations
A transformation acts as a function that takes an RDD as input and produces another RDD as its result. The input RDD is not changed, because RDDs are immutable. Some of the transformations applied to an RDD are: filter, map, flatMap.
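A short sketch of these transformations, assuming a spark-shell session with SparkContext sc; the sample lines are invented for illustration. Each call returns a new RDD derived from the previous one, which is also the third way of creating an RDD listed above:
scala> val lines = sc.parallelize(Seq("spark makes rdds", "rdds are immutable"))  // invented sample data
scala> val words = lines.flatMap(line => line.split(" "))        // split every line into words
scala> val rWords = words.filter(word => word.startsWith("r"))   // keep only words starting with "r"
scala> val pairs = rWords.map(word => (word, 1))                 // pair each word with a count of 1
scala> pairs.collect()                                           // Array((rdds,1), (rdds,1)) for this sample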