Ch. 4

The maturation of Hadoop has led to a stable computing environment that is general enough to build specialist tools for tasks such as graph processing, micro-batch processing, SQL querying, data warehousing, and machine learning.
However, as Hadoop became more widely adopted, more specializations were required for a wider variety of new use cases, and it became clear that the batch processing model of MapReduce was not well suited to common workflows, including iterative, interactive, or on-demand computations upon a single dataset.

In practice, MapReduce programs had to write intermediate data from RAM back to disk (HDFS) after each calculation finished, then load the data again, do more calculations, store the new results back to disk, and repeat.

The primary MapReduce abstraction (specification of computation as a mapping then a reduction)
is parallelizable, easy to understand, and hides the details of distributed
computing, thus allowing Hadoop to guarantee correctness.
However, in order to achieve coordination and fault tolerance, the MapReduce model
uses a pull execution model that requires intermediate writes of
data back to HDFS. Unfortunately, the input/output (I/O) of moving data from where
it’s stored to where it needs to be computed upon is the largest
time cost in any computing system; as a result, while MapReduce is incredibly safe
and resilient, it is also necessarily slow on a per-task basis.
Worse, almost all applications must chain multiple MapReduce jobs together in
multiple steps, creating a data flow toward the final required result.
This results in huge amounts of intermediate data written to HDFS that is not
required by the user, creating additional costs in terms of disk usage.

To address these problems, Hadoop has moved to a more general resource management
framework for computation: YARN.
Whereas previously the MapReduce application allocated resources (processors,
memory) to jobs specifically for mappers and reducers,
YARN provides more general resource access to Hadoop applications.
The result is that specialized tools no longer have to be decomposed into a series
of MapReduce jobs and can become more complex.
By generalizing the management of the cluster, the programming model first imagined
in MapReduce can be expanded to include new abstractions and operations.

SPARK

Apache Spark is a cluster-computing platform that provides an API for distributed programming similar to the MapReduce model,
but is designed to be fast for interactive queries and iterative algorithms.
It primarily achieves this by caching data required for computation in the memory
of the nodes in the cluster.
In-memory cluster computation enables Spark to run iterative algorithms, as
programs can checkpoint data and refer back to it without reloading it from disk;
in addition, it supports interactive querying and streaming data analysis at
extremely fast speeds.
Because Spark is compatible with YARN, it can run on an existing Hadoop cluster and
access any Hadoop data source, including HDFS, S3, HBase, and Cassandra.

Importantly, Spark was designed from the ground up to support big data applications
and data science in particular.
Instead of a programming model that only supports map and reduce, the Spark API has
many other powerful distributed abstractions similarly related to
functional programming, including sample, filter, join, and collect, to name a few.
Moreover, while Spark is implemented in Scala, programming APIs in Scala, Java, R, and Python make Spark much more accessible.

In order to program an iterative algorithm of this kind in MapReduce (for example, one that repeatedly refines the parameters of a target function), the parameters of the target function would have to be mapped to every instance in the dataset, and the error computed and reduced. After the reduce phase, the parameters would be updated and fed into the next MapReduce job.
This is possible by chaining the error computation and update jobs together; however, on each job the data would have to be read from disk and the errors written back to it, causing significant I/O-related delay.

Instead, Spark keeps the dataset in memory as much as possible throughout the
course of the application, preventing the reloading of data between iterations.
Spark programmers therefore do not simply specify map and reduce steps, but rather
an entire series of data flow transformations to be applied to the
input data before performing some action that requires coordination like a
reduction or a write to disk.
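For concreteness, here is a minimal PySpark sketch of this idea (the HDFS path and the two-column numeric file layout are assumed for illustration, not taken from the text): the dataset is cached in memory once and then reused across the iterations of a simple parameter-update loop, instead of being reloaded from disk on every pass.

from pyspark import SparkContext

sc = SparkContext(appName="IterativeSketch")

# Parse a hypothetical whitespace-separated file of (x, y) pairs and cache it.
points = (sc.textFile("hdfs:///data/points.txt")
            .map(lambda line: [float(v) for v in line.split()])
            .cache())

n = points.count()   # action; also materializes the cache
w = 0.0              # a single model parameter, purely for illustration
for i in range(10):
    # Each pass reuses the cached RDD rather than re-reading it from HDFS.
    gradient = points.map(lambda p: (w * p[0] - p[1]) * p[0]).sum()
    w -= 0.1 * gradient / n

print("fitted parameter:", w)
sc.stop()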

Because data flows can be described using directed acyclic graphs (DAGs), Spark’s
execution engine knows ahead of time how to distribute
the computation across the cluster and manages the details of the computation,
similar to how MapReduce abstracts distributed computation.

By combining acyclic data flow and in-memory (RAM) computing, Spark is extremely fast, particularly when the cluster is large enough to hold all of the data in memory.
In fact, when the cluster is scaled up so that its total memory can hold an entire, very large dataset, Spark is fast enough to be used interactively, making the user a key participant in the analytical processes running on the cluster.
As Spark evolved, the notion of user interaction became essential to its model of
distributed computation;
in fact, it is probably for this reason that so many languages are supported.

Spark focuses purely on computation rather than data storage, and as such it is typically run in a cluster that also provides data storage and cluster management tools.

Spark exposes its primary programming abstraction to developers through the Spark
Core module.
This module contains basic and general functionality, including the API that
defines resilient distributed datasets (RDDs).
RDDs, which we will describe in more detail in the next section, are the essential abstraction upon which all Spark computation is built.

RESILIENT DISTRIBUTED DATASETS

In Chapter 2, we described Hadoop as a distributed computing framework that dealt with two primary problems: how to distribute data across a cluster, and how to distribute computation.

Spark does not deal with distributed data storage, relying on Hadoop to provide
this functionality, and
instead focuses on reliable distributed computation through a framework called
resilient distributed datasets.

RDDs are essentially a programming abstraction that represents a read-only collection of objects that are partitioned across a set of machines.

RDDs are operated upon with functional programming constructs that include and
expand upon map and reduce.
Programmers create new RDDs by loading data from an input source, or by
transforming an existing collection to generate a new one.

The history of applied transformations is primarily what defines the RDD’s lineage,
and because the collection is immutable (not directly modifiable),
transformations can be reapplied to part or all of the collection in order to
recover from failure.
The Spark API is therefore essentially a collection of operations that create,
transform, and export RDDs.
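As a brief illustration of the two creation paths just described (the HDFS path below is hypothetical), a new RDD can come either from an input source or from an existing collection, and transformations then produce further RDDs without modifying the originals:

from pyspark import SparkContext

sc = SparkContext(appName="RDDCreation")

# 1) Create an RDD by loading data from an input source (a text file here).
lines = sc.textFile("hdfs:///data/sample.txt")

# 2) Create an RDD from an existing in-memory collection in the driver.
numbers = sc.parallelize(range(1, 1001))

# Transformations generate new, immutable RDDs; the originals are unchanged,
# which is what allows lineage to be replayed for fault recovery.
long_lines = lines.filter(lambda line: len(line) > 80)
squares = numbers.map(lambda n: n * n)

sc.stop()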

2 TYPES OF OPERATIONS
The fundamental programming model is therefore one of describing how RDDs are created and modified via programmatic operations.
There are 2 types of operations that can be applied to RDDs: transformations and actions.
-Transformations (map, filter, join) are operations applied to an existing RDD to create a new RDD; for example, applying a filter operation on an RDD generates a smaller RDD of filtered values.
-Actions, by contrast, are operations that return a result back to the Spark driver program, resulting in a coordination or aggregation across all partitions of an RDD (see the sketch below).
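The following minimal PySpark sketch contrasts the two operation types: transformations are lazy and simply describe new RDDs, while actions trigger execution and return results to the driver.

from pyspark import SparkContext

sc = SparkContext(appName="TransformationsVsActions")

rdd = sc.parallelize([1, 2, 3, 4, 5, 6])

evens = rdd.filter(lambda n: n % 2 == 0)   # transformation: nothing runs yet
doubled = evens.map(lambda n: n * 2)       # transformation: still lazy

print(doubled.count())                     # action: job runs, returns 3 to the driver
print(doubled.collect())                   # action: gathers all partitions -> [4, 8, 12]
print(doubled.reduce(lambda a, b: a + b))  # action: aggregates -> 24

sc.stop()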

CLOSURE
A closure is a function that includes its own independent data environment.
As a result of this independence, a closure operates with no outside information
and is thus parallelizable.
Because it is a closed operation, it will always produce the same result for the same input.
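A small sketch of a closure in this sense: the returned function carries its own data (the threshold) with it, so it can be shipped to workers and evaluated without any outside information. The threshold-filter example is hypothetical, for illustration only.

from pyspark import SparkContext

def make_threshold_filter(threshold):
    # 'threshold' is captured inside the returned function's environment.
    def above_threshold(value):
        return value > threshold
    return above_threshold

sc = SparkContext(appName="ClosureSketch")
rdd = sc.parallelize([1, 5, 10, 15, 20])

is_large = make_threshold_filter(10)   # closure over threshold=10
print(rdd.filter(is_large).collect())  # -> [15, 20]

sc.stop()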

2 TYPES OF SHARED VARIABLES

If external data is required, Spark provides two types of shared variables that can be interacted with by all workers in a restricted fashion:
broadcast variables and accumulators. Broadcast variables are distributed to all
workers, but are read-only and are often used as lookup tables or stopword lists.
Accumulators are variables that workers can “add” to using associative operations
and are typically used as counters.
These data structures are similar to the MapReduce distributed cache and counters,
and serve a similar role.
However, because Spark allows for general interprocess communication, these data
structures are perhaps used in a wider variety of applications.
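A minimal PySpark sketch of the two shared-variable types, using a hypothetical stopword list: the broadcast variable is read-only on the workers and used as a lookup table, while the accumulator is an add-only counter whose value is read back on the driver.

from pyspark import SparkContext

sc = SparkContext(appName="SharedVariables")

stopwords = sc.broadcast({"the", "a", "an", "of"})  # read-only lookup on workers
skipped = sc.accumulator(0)                          # add-only counter

def keep(word):
    if word in stopwords.value:
        skipped.add(1)   # workers can only add; they cannot read the total
        return False
    return True

words = sc.parallelize(["the", "quick", "fox", "a", "lazy", "dog"])
kept = words.filter(keep).collect()  # the action runs the filter once

print(kept)           # -> ['quick', 'fox', 'lazy', 'dog']
print(skipped.value)  # accumulator total, read on the driver -> 2
sc.stop()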

PARTS 1-4, 7 of RDD PAPER


The RDD approach to fault tolerance is to work with coarse-grained processing: the same operation is applied to many data items. This allows RDDs to provide fault tolerance efficiently by logging the transformations used to build a dataset (its lineage) rather than the actual data.
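A short PySpark sketch of this lineage idea: the derived RDD records the chain of transformations that produced it, which can be inspected with toDebugString() and replayed to recompute lost partitions instead of replicating the data itself.

from pyspark import SparkContext

sc = SparkContext(appName="LineageSketch")

base = sc.parallelize(range(100), numSlices=4)
derived = base.map(lambda n: n * 2).filter(lambda n: n % 3 == 0)

# Prints the chain of transformations (the lineage), not the data itself.
lineage = derived.toDebugString()
print(lineage.decode("utf-8") if isinstance(lineage, bytes) else lineage)

sc.stop()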

Other iterative interfaces:
Pregel - a system for iterative graph computations
HaLoop - an iterative MapReduce interface

RDDs are best suited for batch applications that apply the same operation to all elements of a dataset.
RDDs would be less suitable for applications that make asynchronous fine-grained updates to shared state (fine-grained updates refer to the ability to efficiently update individual elements), such as a storage system for a web application or an incremental web crawler.

The Spark driver runs as a single process. It connects to a cluster of workers and invokes actions, i.e., operations that return a value to the driver.
It also tracks the lineage of RDDs.
Workers are long-lived processes that can store RDD partitions in RAM across operations.

RDDs have 5 attributes (see the sketch after this list):
-a set of partitions (atomic blocks)
-a set of dependencies on parent RDDs
-a function for computing the dataset based on its parents
-metadata about its partitioning scheme
-metadata about its data placement (preferred locations)

Why are RDDs able to express these diverse programming models?

The reason is that the restrictions on RDDs have little impact in many parallel applications. In particular, although RDDs can only be created through bulk transformations, many parallel programs naturally apply the same operation to many records, making them easy to express.
Similarly, the immutability of RDDs is not an obstacle, because one can create multiple RDDs to represent versions of the same dataset.

One final question is why previous frameworks have not offered the same level of
generality. We believe that
this is because these systems explored specific problems that MapReduce and Dryad
do not handle well, such as
iteration, without observing that the common cause of these problems was a lack of
data sharing abstractions.
