Ch. 4
MapReduce jobs used to have to write their data back to disk after a calculation finished, then load the data back up, do more calculations, then store the new data back to disk, and repeat.
To address these problems, Hadoop has moved to a more general resource management
framework for computation: YARN.
Whereas previously the MapReduce application allocated resources (processors,
memory) to jobs specifically for mappers and reducers,
YARN provides more general resource access to Hadoop applications.
The result is that specialized tools no longer have to be decomposed into a series
of MapReduce jobs and can become more complex.
By generalizing the management of the cluster, the programming model first imagined
in MapReduce can be expanded to include new abstractions and operations.
SPARK
Importantly, Spark was designed from the ground up to support big data applications
and data science in particular.
Instead of a programming model that only supports map and reduce, the Spark API has
many other powerful distributed abstractions similarly related to
functional programming, including sample, filter, join, and collect, to name a few.
Moreover, while Spark is implemented in Scala, programming APIs in Scala, Java, R,
and Python make Spark much more accessible.
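For instance, a minimal PySpark sketch (the SparkContext sc and the tiny datasets here are assumed purely for illustration) might chain a few of these operations together:

# Hypothetical pair RDDs: (user_id, amount) and (user_id, name)
purchases = sc.parallelize([(1, 9.99), (2, 19.99), (1, 4.50)])
users = sc.parallelize([(1, "alice"), (2, "bob")])

large = purchases.filter(lambda kv: kv[1] > 5.0)   # keep purchases over 5.00
joined = large.join(users)                         # (user_id, (amount, name))
sampled = joined.sample(False, 0.5, seed=42)       # roughly 50% sample, no replacement
print(sampled.collect())                           # bring the results back to the driver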
Unlike MapReduce, Spark keeps the dataset in memory as much as possible throughout the
course of the application, preventing the reloading of data between iterations.
Spark programmers therefore do not simply specify map and reduce steps, but rather
an entire series of data flow transformations to be applied to the
input data before performing some action that requires coordination like a
reduction or a write to disk.
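As a rough sketch of what this looks like in practice (the HDFS path and the loop body are made up for illustration), a dataset can be parsed once, kept in memory, and then reused across iterations, with the action at the end of each pass triggering the actual work:

# Parse the input once and ask Spark to keep the resulting RDD in memory.
points = sc.textFile("hdfs:///data/points.txt") \
           .map(lambda line: [float(v) for v in line.split()]) \
           .cache()

for i in range(10):
    # Each pass reuses the cached RDD instead of rereading the file from disk;
    # count() is the action that forces the queued transformations to run.
    n = points.filter(lambda p: p[0] > i).count()
    print(i, n)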
Because data flows can be described using directed acyclic graphs (DAGs), Spark’s
execution engine knows ahead of time how to distribute
the computation across the cluster and manages the details of the computation,
similar to how MapReduce abstracts distributed computation.
Spark focuses purely on computation rather than data storage and as such is
typically run in a cluster that implements data warehousing and cluster management
tools.
Spark exposes its primary programming abstraction to developers through the Spark
Core module.
This module contains basic and general functionality, including the API that
defines resilient distributed datasets (RDDs).
RDDs, which we will describe in more detail in the next section, are the essential
functionality upon which all Spark computation resides.
Spark does not deal with distributed data storage, relying on Hadoop to provide
this functionality, and
instead focuses on reliable distributed computation through a framework called
resilient distributed datasets.
RDDs are operated upon with functional programming constructs that include and
expand upon map and reduce.
Programmers create new RDDs by loading data from an input source, or by
transforming an existing collection to generate a new one.
The history of applied transformations is primarily what defines the RDD’s lineage,
and because the collection is immutable (not directly modifiable),
transformations can be reapplied to part or all of the collection in order to
recover from failure.
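A small PySpark sketch (assuming a SparkContext named sc and a hypothetical log path) shows both ways of creating RDDs and how the resulting lineage can be inspected:

lines = sc.textFile("hdfs:///logs/access.log")       # RDD loaded from an input source
nums = sc.parallelize(range(1000))                   # RDD from an in-memory collection

errors = lines.filter(lambda line: "ERROR" in line)  # transformation: a new, immutable RDD
fields = errors.map(lambda line: line.split("\t"))

# toDebugString() describes the chain of transformations that defines this RDD's
# lineage; Spark replays that chain to rebuild any partitions lost to failure.
print(fields.toDebugString())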
The Spark API is therefore essentially a collection of operations that create,
transform, and export RDDs.
2 TYPES OF OPERATIONS
The fundamental programming model therefore describes how RDDs are created and
modified via programmatic operations.
There are 2 types of operations that can be applied to RDDs: transformations and
actions.
- Transformations (map, filter, join) are operations applied to an existing RDD to create a new RDD; for example, applying a filter operation on an RDD generates a smaller RDD of filtered values (see the sketch after this list).
- Actions, however, are operations that actually return a result back to the Spark driver program, resulting in a coordination or aggregation of all partitions in an RDD.
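A short PySpark sketch (with made-up data) makes the distinction concrete: the transformations only describe new RDDs, and nothing executes until the action at the end returns a value to the driver.

nums = sc.parallelize(range(100))           # an existing RDD
evens = nums.filter(lambda n: n % 2 == 0)   # transformation: new RDD, nothing runs yet
doubled = evens.map(lambda n: n * 2)        # another transformation, still lazy
total = doubled.reduce(lambda a, b: a + b)  # action: computed across all partitions
print(total)                                # 4900, returned to the driver program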
CLOSURE
A closure is a function that includes its own independent data environment.
As a result of this independence, a closure operates with no outside information
and is thus parallelizable.
Because it is a closed operation, running it on the same input will always end in the same result.
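A minimal sketch of a closure in a Spark job (assuming a SparkContext named sc and made-up data): the lambda captures the local variable threshold, and Spark ships the function together with a copy of that value to every worker, so each partition can be evaluated independently.

threshold = 10                                   # local variable on the driver
data = sc.parallelize([3, 8, 12, 20, 7, 15])

# The lambda closes over threshold; no other outside state is needed.
above = data.filter(lambda x: x > threshold)
print(above.collect())                           # [12, 20, 15]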
RDDs are best suited for batch applications that apply the same operation to all
elements of a dataset.
RDDs are less suitable for applications that make asynchronous fine-grained updates to shared state (fine-grained updates refer to the ability to efficiently update individual elements), such as a storage system for a web application or an incremental web crawler.
The Spark driver is the process that runs the user's main program. It connects to a cluster of workers and invokes actions, operations that return a value to the driver. It also tracks the lineage of RDDs.
Workers are long-lived processes that can store RDD partitions in RAM across operations.
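A bare-bones driver program might look like the following sketch (the cluster URL, application name, and data are placeholders):

from pyspark import SparkConf, SparkContext

conf = SparkConf().setAppName("example").setMaster("spark://master:7077")
sc = SparkContext(conf=conf)      # the driver connects to the cluster of workers

# Partitions of the cached RDD may be kept in worker RAM across operations.
rdd = sc.parallelize(range(1000)).map(lambda x: x * x).cache()
print(rdd.sum())                  # action: workers compute, the result returns to the driver
sc.stop()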
One final question is why previous frameworks have not offered the same level of
generality. We believe that
this is because these systems explored specific problems that MapReduce and Dryad
do not handle well, such as
iteration, without observing that the common cause of these problems was a lack of
data sharing abstractions.