Ch. 4

The maturation of Hadoop has led to a stable computing environment that is general enough to build specialist tools for tasks such as graph processing, micro-batch processing, SQL querying, data warehousing, and machine learning.
However, as Hadoop became more widely adopted, more specializations were required for a wider variety of new use cases, and it became clear that the batch processing model of MapReduce was not well suited to common workflows, including iterative, interactive, or on-demand computations upon a single dataset.

In practice, MapReduce programs had to write intermediate data from RAM back to disk (HDFS) after each calculation finished, then load the data again, do more calculations, store the new results back to disk, and repeat.

The primary MapReduce abstraction (specification of computation as a mapping then a reduction)
is parallelizable, easy to understand, and hides the details of distributed
computing, thus allowing Hadoop to guarantee correctness.
However, in order to achieve coordination and fault tolerance, the MapReduce model
uses a pull execution model that requires intermediate writes of
data back to HDFS. Unfortunately, the input/output (I/O) of moving data from where
it’s stored to where it needs to be computed upon is the largest
time cost in any computing system; as a result, while MapReduce is incredibly safe
and resilient, it is also necessarily slow on a per-task basis.
Worse, almost all applications must chain multiple MapReduce jobs together in
multiple steps, creating a data flow toward the final required result.
This results in huge amounts of intermediate data written to HDFS that is not
required by the user, creating additional costs in terms of disk usage.

To address these problems, Hadoop has moved to a more general resource management
framework for computation: YARN.
Whereas previously the MapReduce application allocated resources (processors,
memory) to jobs specifically for mappers and reducers,
YARN provides more general resource access to Hadoop applications.
The result is that specialized tools no longer have to be decomposed into a series
of MapReduce jobs and can become more complex.
By generalizing the management of the cluster, the programming model first imagined
in MapReduce can be expanded to include new abstractions and operations.

SPARK

Apache Spark is a cluster-computing platform that provides an API for distributed programming similar to the MapReduce model,
but is designed to be fast for interactive queries and iterative algorithms.
It primarily achieves this by caching data required for computation in the memory
of the nodes in the cluster.
In-memory cluster computation enables Spark to run iterative algorithms, as
programs can checkpoint data and refer back to it without reloading it from disk;
in addition, it supports interactive querying and streaming data analysis at
extremely fast speeds.
Because Spark is compatible with YARN, it can run on an existing Hadoop cluster and
access any Hadoop data source, including HDFS, S3, HBase, and Cassandra.

Importantly, Spark was designed from the ground up to support big data applications
and data science in particular.
Instead of a programming model that only supports map and reduce, the Spark API has
many other powerful distributed abstractions similarly related to
functional programming, including sample, filter, join, and collect, to name a few.
Moreover, while Spark is implemented in Scala, programming APIs in Scala, Java, R, and Python make Spark much more accessible.

In order to program an iterative algorithm of this kind in MapReduce (for example, one that repeatedly refines the parameters of a target function), the parameters of the target function would have to be mapped to every instance in the dataset, and the error computed and reduced. After the reduce phase, the parameters would be updated and fed into the next MapReduce job.
This is possible by chaining the error computation and update jobs together; however, on each job the data would have to be read from disk and the errors written back to it, causing significant I/O-related delay.

Instead, Spark keeps the dataset in memory as much as possible throughout the
course of the application, preventing the reloading of data between iterations.
Spark programmers therefore do not simply specify map and reduce steps, but rather
an entire series of data flow transformations to be applied to the
input data before performing some action that requires coordination like a
reduction or a write to disk.
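For concreteness, here is a minimal PySpark sketch of this idea (the HDFS path and the two-column numeric file layout are assumed for illustration, not taken from the text): the dataset is cached in memory once and then reused across the iterations of a simple parameter-update loop, instead of being reloaded from disk on every pass.

from pyspark import SparkContext

sc = SparkContext(appName="IterativeSketch")

# Parse a hypothetical whitespace-separated file of (x, y) pairs and cache it.
points = (sc.textFile("hdfs:///data/points.txt")
            .map(lambda line: [float(v) for v in line.split()])
            .cache())

n = points.count()   # action; also materializes the cache
w = 0.0              # a single model parameter, purely for illustration
for i in range(10):
    # Each pass reuses the cached RDD rather than re-reading it from HDFS.
    gradient = points.map(lambda p: (w * p[0] - p[1]) * p[0]).sum()
    w -= 0.1 * gradient / n

print("fitted parameter:", w)
sc.stop()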

Because data flows can be described using directed acyclic graphs (DAGs), Spark’s
execution engine knows ahead of time how to distribute
the computation across the cluster and manages the details of the computation,
similar to how MapReduce abstracts distributed computation.

By combining acyclic data flow and in-memory (RAM) computing, Spark is extremely fast, particularly when the cluster is large enough to hold all of the data in memory.
In fact, when the cluster is scaled up so that its total memory can hold an entire, very large dataset, Spark is fast enough to be used interactively, making the user a key participant in the analytical processes running on the cluster.
As Spark evolved, the notion of user interaction became essential to its model of
distributed computation;
in fact, it is probably for this reason that so many languages are supported.

Spark focuses purely on computation rather than data storage, and as such it is typically run in a cluster that also provides data storage and cluster management tools.

Spark exposes its primary programming abstraction to developers through the Spark
Core module.
This module contains basic and general functionality, including the API that
defines resilient distributed datasets (RDDs).
RDDs, which we will describe in more detail in the next section, are the essential abstraction upon which all Spark computation is built.

RESILIENT DISTRIBUTED DATASETS

In Chapter 2, we described Hadoop as a distributed computing framework that dealt with two primary problems: how to distribute data across a cluster, and how to distribute computation.

Spark does not deal with distributed data storage, relying on Hadoop to provide
this functionality, and
instead focuses on reliable distributed computation through a framework called
resilient distributed datasets.

RDDs are essentially a programming abstraction that represents a read-only collection of objects that are partitioned across a set of machines.

RDDs are operated upon with functional programming constructs that include and
expand upon map and reduce.
Programmers create new RDDs by loading data from an input source, or by
transforming an existing collection to generate a new one.

The history of applied transformations is primarily what defines the RDD’s lineage,
and because the collection is immutable (not directly modifiable),
transformations can be reapplied to part or all of the collection in order to
recover from failure.
The Spark API is therefore essentially a collection of operations that create,
transform, and export RDDs.
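As a brief illustration of the two creation paths just described (the HDFS path below is hypothetical), a new RDD can come either from an input source or from an existing collection, and transformations then produce further RDDs without modifying the originals:

from pyspark import SparkContext

sc = SparkContext(appName="RDDCreation")

# 1) Create an RDD by loading data from an input source (a text file here).
lines = sc.textFile("hdfs:///data/sample.txt")

# 2) Create an RDD from an existing in-memory collection in the driver.
numbers = sc.parallelize(range(1, 1001))

# Transformations generate new, immutable RDDs; the originals are unchanged,
# which is what allows lineage to be replayed for fault recovery.
long_lines = lines.filter(lambda line: len(line) > 80)
squares = numbers.map(lambda n: n * n)

sc.stop()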

2 TYPES OF OPERATIONS
The fundamental programming model is therefore one of describing how RDDs are created and modified via programmatic operations.
There are 2 types of operations that can be applied to RDDs: transformations and actions.
-Transformations (map, filter, join) are operations applied to an existing RDD to create a new RDD; for example, applying a filter operation on an RDD generates a smaller RDD of filtered values.
-Actions, by contrast, are operations that return a result back to the Spark driver program, resulting in a coordination or aggregation across all partitions of an RDD (see the sketch below).
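The following minimal PySpark sketch contrasts the two operation types: transformations are lazy and simply describe new RDDs, while actions trigger execution and return results to the driver.

from pyspark import SparkContext

sc = SparkContext(appName="TransformationsVsActions")

rdd = sc.parallelize([1, 2, 3, 4, 5, 6])

evens = rdd.filter(lambda n: n % 2 == 0)   # transformation: nothing runs yet
doubled = evens.map(lambda n: n * 2)       # transformation: still lazy

print(doubled.count())                     # action: job runs, returns 3 to the driver
print(doubled.collect())                   # action: gathers all partitions -> [4, 8, 12]
print(doubled.reduce(lambda a, b: a + b))  # action: aggregates -> 24

sc.stop()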

CLOSURE
A closure is a function that includes its own independent data environment.
As a result of this independence, a closure operates with no outside information
and is thus parallelizable.
Because it is a closed operation, it will always produce the same result for the same input.
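A small sketch of a closure in this sense: the returned function carries its own data (the threshold) with it, so it can be shipped to workers and evaluated without any outside information. The threshold-filter example is hypothetical, for illustration only.

from pyspark import SparkContext

def make_threshold_filter(threshold):
    # 'threshold' is captured inside the returned function's environment.
    def above_threshold(value):
        return value > threshold
    return above_threshold

sc = SparkContext(appName="ClosureSketch")
rdd = sc.parallelize([1, 5, 10, 15, 20])

is_large = make_threshold_filter(10)   # closure over threshold=10
print(rdd.filter(is_large).collect())  # -> [15, 20]

sc.stop()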

2 TYPES OF SHARED VARIABLES

If external data is required, Spark provides two types of shared variables that can be interacted with by all workers in a restricted fashion:
broadcast variables and accumulators. Broadcast variables are distributed to all
workers, but are read-only and are often used as lookup tables or stopword lists.
Accumulators are variables that workers can “add” to using associative operations
and are typically used as counters.
These data structures are similar to the MapReduce distributed cache and counters,
and serve a similar role.
However, because Spark allows for general interprocess communication, these data
structures are perhaps used in a wider variety of applications.
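A minimal PySpark sketch of the two shared-variable types, using a hypothetical stopword list: the broadcast variable is read-only on the workers and used as a lookup table, while the accumulator is an add-only counter whose value is read back on the driver.

from pyspark import SparkContext

sc = SparkContext(appName="SharedVariables")

stopwords = sc.broadcast({"the", "a", "an", "of"})  # read-only lookup on workers
skipped = sc.accumulator(0)                          # add-only counter

def keep(word):
    if word in stopwords.value:
        skipped.add(1)   # workers can only add; they cannot read the total
        return False
    return True

words = sc.parallelize(["the", "quick", "fox", "a", "lazy", "dog"])
kept = words.filter(keep).collect()  # the action runs the filter once

print(kept)           # -> ['quick', 'fox', 'lazy', 'dog']
print(skipped.value)  # accumulator total, read on the driver -> 2
sc.stop()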

PARTS 1-4, 7 of RDD PAPER


The RDD approach to fault tolerance is to work with coarse-grained processing: the same operation is applied to many data items. This allows RDDs to provide fault tolerance efficiently by logging the transformations used to build a dataset (its lineage) rather than the actual data.
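A short PySpark sketch of this lineage idea: the derived RDD records the chain of transformations that produced it, which can be inspected with toDebugString() and replayed to recompute lost partitions instead of replicating the data itself.

from pyspark import SparkContext

sc = SparkContext(appName="LineageSketch")

base = sc.parallelize(range(100), numSlices=4)
derived = base.map(lambda n: n * 2).filter(lambda n: n % 3 == 0)

# Prints the chain of transformations (the lineage), not the data itself.
lineage = derived.toDebugString()
print(lineage.decode("utf-8") if isinstance(lineage, bytes) else lineage)

sc.stop()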

Other iterative interfaces:
Pregel - a system for iterative graph computations
HaLoop - an iterative MapReduce interface

RDDs are best suited for batch applications that apply the same operation to all elements of a dataset.
RDDs would be less suitable for applications that make asynchronous fine-grained updates to shared state (fine-grained updates refer to the ability to efficiently update individual elements), such as a storage system for a web application or an incremental web crawler.

The Spark driver runs as a single process. It connects to a cluster of workers and invokes actions, i.e., operations that return a value to the driver.
It also tracks the lineage of RDDs.
Workers are long-lived processes that can store RDD partitions in RAM across operations.

RDDs have 5 attributes (see the sketch after this list):
-a set of partitions (atomic blocks)
-a set of dependencies on parent RDDs
-a function for computing the dataset based on its parents
-metadata about its partitioning scheme
-metadata about its data placement (preferred locations)

Why are RDDs able to express these diverse programming models?

The reason is that the restrictions on RDDs have little impact in many parallel applications. In particular, although RDDs can only be created through bulk transformations, many parallel programs naturally apply the same operation to many records, making them easy to express.
Similarly, the immutability of RDDs is not an obstacle, because one can create multiple RDDs to represent versions of the same dataset.

One final question is why previous frameworks have not offered the same level of
generality. We believe that
this is because these systems explored specific problems that MapReduce and Dryad
do not handle well, such as
iteration, without observing that the common cause of these problems was a lack of
data sharing abstractions.
