Hadoop

The Apache Hadoop project develops open-source software for distributed computing, enabling the processing of large data sets across clusters of computers. Originating from Google's MapReduce framework, Hadoop has evolved to include various components like Hive for data warehousing and YARN for resource management, supporting scalability and fault tolerance. The ecosystem consists of over 100 libraries, facilitating a range of applications from batch processing to real-time analytics.


1


2 The Apache™ Hadoop® project

 Develops open-source software for reliable, scalable, distributed computing
 The Apache Hadoop software library is a framework that allows for the distributed processing of large data sets across clusters of computers using simple programming models
 It scales up from single servers to thousands of machines, each offering local computation and storage
 Instead of focusing on hardware-based high availability, the Hadoop software detects and handles failures at the application layer, delivering high availability on commodity clusters that are prone to failures
Hadoop’s Developers

 2005: Doug Cutting and Michael J. Cafarella developed Hadoop to support distribution for the Nutch search engine project (photo: Doug Cutting)
 The project was funded by Yahoo!
 2006: Yahoo! gave the project to the Apache Software Foundation
Google Origins (figure: timeline of the Google publications from 2003, 2004, and 2006 that preceded Hadoop)
5 History of Hadoop

 In 2004, Google developed the MapReduce programming framework on top of the Google File System (GFS) for indexing websites
 In 2005, Yahoo! released an open-source framework based on MapReduce, called Hadoop
 In the following years, more software libraries providing different capabilities, such as Hive and HBase, were added to Hadoop
 There are now over 100 open-source libraries for Hadoop, and the number is growing
6 History of Hadoop

 The focus is on supporting redundancy, distributed architectures, and parallel processing
 Completely written in Java
 In 2009, Yahoo! used Hadoop to sort 1 TB of data in only 62 seconds
 The software libraries are organized in a layered (sometimes also called stacked) package
7 History of Hadoop

 Applications moved from “running” on the Hadoop DFS to “running” in the Hadoop operating system, YARN
8 Hadoop Distributions (figure)
9 Hadoop Distributions
Distributions are commercially packaged and supported editions of open-source Apache Hadoop-related projects.
Distributions provide access to applications, query/reporting tools, machine learning, and data management infrastructure components.
10 Layer diagram for Apache Hadoop 2.x (figure)
11 Ecosystem
Hadoop Common – contains libraries and utilities needed by other Hadoop modules
12 Hadoop Main Components
13
14 Hive
 In the Hadoop ecosystem, Hive also plays a vital role as an analytics tool.
 It can be seen as a data warehouse package mounted on the top layer of the Hadoop ecosystem for managing and processing the huge amounts of data generated.
 It features a simple SQL-like interface, so users don’t need to bother with complex MapReduce programs.
 Hive can also access data stored in HBase. It can be summarized as:
✓ Data warehouse infrastructure
✓ Definer of a query language popularly known as HiveQL (similar to SQL)
✓ Provider of various tools for easy extraction, transformation, and loading of data
✓ Hive allows its users to embed customized mappers and reducers.
15 Pig
 Pig is one of the popular big data tools; it provides an abstraction over Hadoop’s MapReduce.
 It is a high-level platform that creates programs that run on Apache Hadoop.
 Pig analyses large datasets, expressed in a high-level language, and turns them into data flows.
 It is one of the key components of the Hadoop ecosystem, similar to Hive.
 It makes it easy to write small programs on a high-level data flow system and uses Pig Latin to express queries.
 It supports a multi-query approach, is easy to read and write, is as simple to understand as SQL, and is equipped with nested data types like maps and tuples that are not featured in Hadoop’s MapReduce.
16
 Storm:
17  This analytics tool is widely used for reliable, real-time analysis of unbounded streaming data, and applications for it can be developed in any programming language.
 Its processing speed is millions of tuples per second per node. It is a free, open-source, fault-tolerant, real-time computation system.
 Spark:
 An analytics tool for streaming, predictive, and graph computational analytics.
 An open-source engine.
 Its processing speed is measured at up to 100 times faster than MapReduce, because it allows immediate data reuse in memory.
 Many research organizations use Hadoop and Spark together, although Spark was designed as an alternative to Hadoop.
 In fact the two are complementary big data frameworks; together they make an extremely powerful machine, from batch processing over HDFS to streaming and in-memory distributed processing in real time.
18
19 HBase:
 It is also an open-source tool.
 It is a distributed, non-relational database, developed in the open-source environment after Google’s Bigtable.
 Its core is written in Java, it is developed under the Apache Software Foundation, and it is natively integrated with Hadoop.
 Mapping is performed on the large data, splitting it into a number of datasets, each of which is slated into n-tuples.
 The reduction of the output is performed as a separate task, after which all of the pieces are merged back into the original one.
20
21
22 HDFS: Assumptions
1. Hardware failures are common
2. Files are large
3. Two main types of reads: large streaming reads and small random reads
4. Sequential writes that append to files
5. Files are rarely modified
6. High sustained bandwidth (throughput) is preferred over low latency
23 HDFS: Basic Features
 Highly fault-tolerant
 Thousands of server machines
 Failure is the norm rather than the exception
 High throughput
 Internet-scale workloads
 Move compute to data (e.g., MapReduce)
 Suitable for applications with large data sets
 Streaming access to file system data
 Can be built out of commodity hardware
 Scales to thousands of nodes in a cluster
 Write-once-read-many: a file, once created, rarely needs to be changed
24 Physical Cluster Organization (figure by Brad Hedlund)

26 Hadoop Cluster at Yahoo! (figure)
Architecture Overview
28 Namenode
● Master/slave architecture
● HDFS cluster consists of a single Namenode, a master server
that manages the file system namespace and regulates
access to files by clients.
• Transaction log for file deletes/adds, etc. Does not use
transactions for whole blocks or file-streams, only metadata.
• Handles creation of more replica blocks when necessary
after a DataNode failure
29 Datanodes
● There are a number of DataNodes, usually one per node in a cluster.
● The DataNodes manage storage attached to the nodes that they run on.
● DataNodes serve read and write requests, and perform block creation, deletion, and replication upon instruction from the Namenode.
30 File system Namespace
 Hierarchical file system with directories and files
 Create, remove, move, rename etc.
 Namenode maintains the file system
 Any meta information changes to the file system
recorded by the Namenode (fsImage and editLog)
 An application can specify the number of replicas of
the file needed: replication factor of the file. This
information is stored in the Namenode.
Secondary Name Node (figure): the Name Node periodically ships its FS image and edit log to the Secondary Name Node, which performs housekeeping (merging the edit log into the FS image) and keeps a backup of the NN metadata; Data Nodes reply to the Name Node with control information embedded.

32 Secondary Name Node
HDFS: a block-structured file system

 HDFS is designed to store very large files across machines in a large cluster.
 A file can be made of several blocks, which are stored across a cluster of machines with data storage capacity. Each file is a sequence of blocks.
 All blocks in the file except the last are of the same size.
 In a nutshell, HDFS is a block-structured file system: files are broken into blocks of 128 MB (per-file configurable; see the sketch below).
 Fewer block-access requests
 Minimized overhead: disk seek time is almost constant
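As a hedged illustration (not from the slides), a client can control these settings through Hadoop's Java FileSystem API; the path and property values below are purely illustrative.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class BlockSettingsExample {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();   // reads core-site.xml / hdfs-site.xml
    conf.set("dfs.blocksize", "134217728");     // 128 MB blocks (per-file configurable)
    conf.set("dfs.replication", "3");           // replication factor for new files

    FileSystem fs = FileSystem.get(conf);
    Path file = new Path("/data/example.txt");  // hypothetical path

    // Write a file; as it grows it is split into 128 MB blocks.
    try (FSDataOutputStream out = fs.create(file)) {
      out.writeUTF("hello HDFS");
    }

    // The replication factor can also be changed later, per file.
    fs.setReplication(file, (short) 2);
    fs.close();
  }
}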
34
35 Replication for Fault Tolerance
HDFS Inside: Name Node (figure)
The Name Node keeps the FS image (a snapshot of the file system) and the edit log (a record of changes to the FS), plus the block map, e.g.:

Filename | Replication factor | Block IDs
File 1   | 3                  | [1, 2, 3]
File 2   | 2                  | [4, 5, 6]
File 3   | 1                  | [7, 8]

The blocks themselves are spread over the Data Nodes according to each file’s replication factor.
37 Data Replication
● Each block of a file is replicated across a number of machines, to prevent loss of data.
● Block size and number of replicas are configurable per file.
● The Namenode receives a Heartbeat (every 3 seconds) and a BlockReport (with every 10th heartbeat) from each DataNode in the cluster.
● A BlockReport lists all the blocks on a Datanode.
38 Replica Placement
 The placement of the replicas is critical to HDFS reliability and performance.
 Optimizing replica placement distinguishes HDFS from other distributed file systems.
 Rack-aware replica placement:
 Goal: improve reliability, availability, and network bandwidth utilization
 There are many racks; communication between racks goes through switches.
 Network bandwidth between machines on the same rack is greater than between machines on different racks.
 The Namenode determines the rack id of each DataNode.
 Replicas are typically placed on unique racks
 Simple but non-optimal
 Writes are expensive
 With a replication factor of 3, replicas are placed: one on a node in a local rack, one on a different node in the local rack, and one on a node in a different rack.
 One third of the replicas are on one node, two thirds of the replicas are on one rack, and the remaining third is distributed evenly across the remaining racks.
39 Replica Selection
 Replica selection for READ operation: HDFS tries to minimize
the bandwidth consumption and latency.
 If there is a replica on the Reader node then that is preferred.
 HDFS cluster may span multiple data centers: replica in the
local data center is preferred over the remote one.
Write Mechanism: Setting Pipeline
40 Ensure that all Data Nodes which are expected to have a
copy of this block are ready to receive it
41 Pipelined Write
42 Acknowledgement
43 Multi-Block Write in Parallel
HDFS Inside: Read (figure: client, Name Node, and Data Nodes DN1 … DNn, steps 1–4)

1. The client connects to the NN to read data
2. The NN tells the client where to find the data blocks
3. The client reads blocks directly from the data nodes (without going through the NN; see the sketch below)
4. In case of node failures, the client connects to another node that serves the missing block
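A minimal client-side read, sketched with Hadoop's Java FileSystem API (the path is hypothetical); the block lookups against the Name Node and the direct Data Node reads happen inside the returned stream.

import java.io.IOException;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IOUtils;

public class HdfsReadExample {
  public static void main(String[] args) throws IOException {
    Configuration conf = new Configuration();
    FileSystem fs = FileSystem.get(conf);

    // Steps 1-2: the client asks the NameNode for the block locations of this file.
    // Step 3: the returned stream then reads the blocks directly from DataNodes.
    Path file = new Path("/data/example.txt");   // hypothetical path
    try (FSDataInputStream in = fs.open(file)) {
      IOUtils.copyBytes(in, System.out, 4096, false);  // stream file contents to stdout
    }
    fs.close();
  }
}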
HDFS Inside: Read

• Q: Why does HDFS choose such a design for read? Why


not ask client to read blocks through NN?
• Reasons:
• Prevent NN from being the bottleneck of the cluster
• Allow HDFS to scale to large number of concurrent clients
• Spread the data traffic across the cluster
HDFS Inside: Read

• Q: Given multiple replicas of the same block, how does


NN decide which replica the client should read?
• HDFS Solution:
• Rack awareness based on network topology
47 Read Example
HDFS Disadvantages
(figure: a multiple-rack cluster with switches, the Name Node (NN), Secondary Name Node (SNN), and Data Nodes (DN) on Rack 1 … Rack N; the NN — “do not ask me, I am down” — is a single point of failure)


49 HDFS Summary
50 YARN (Yet Another Resource
Negotiator)
 Provides flexible resource management for Hadoop cluster
 The fundamental idea of YARN is to split up the
functionalities of resource management and job
scheduling/monitoring into separate daemons
 to have a global Resource Manager (RM) and per-
application Application Master (AM). An application is
either a single job or a DAG of jobs.
 Improves resource efficiency
 For large volume data processing, it is quite necessary to
manage the available resources properly so that every
application can leverage them.
 Through its various components, it can dynamically allocate
various resources and schedule the application processing
51 Hadoop 1.0 to Hadoop 2.0 (figure)
52 MapReduce on Hadoop 1.0 (figure)
53 Disadvantages (figure)
54
 YARN evolved to be known as a large-scale distributed operating system used for Big Data processing
 The YARN architecture basically separates the resource management layer from the processing layer
 It extends Hadoop to enable multiple frameworks such as MapReduce, Giraph, Spark, and Flink
 In Hadoop 2.x, YARN provides a standard framework that supports customized application development
 Giraph for graph analytics
 Storm for streaming applications
 Spark for in-memory applications
 Flink for streaming data-flow applications
 etc.
55 YARN Features

 Scalability: The scheduler in Resource manager of YARN


architecture allows Hadoop to extend and manage
thousands of nodes and clusters.
 Compatibility: YARN supports the existing map-reduce
applications without disruptions thus making it compatible
with Hadoop 1.0 as well.
 Cluster Utilization: YARN supports Dynamic utilization of cluster
in Hadoop, which enables optimized Cluster Utilization.
 Multi-tenancy: It allows multiple engine access thus giving
organizations a benefit of multi-tenancy.
56 YARN Architecture (figure)

57 Main components of YARN Architecture
Client: submits jobs.
Container: a collection of physical resources such as RAM, CPU cores, and disk on a single node. Containers are invoked via a Container Launch Context (CLC), a record that contains information such as environment variables, security tokens, dependencies, etc.
58 Main components of YARN Architecture
Node Manager:
 It takes care of an individual node in the Hadoop cluster and manages the applications and workflow on that particular node.
 Its primary job is to keep up with the Resource Manager.
 It monitors resource usage, performs log management, and also kills a container based on directions from the Resource Manager.
 It is also responsible for creating the container process and starting it at the request of the Application Master.
Application Master:
 An application is a single job submitted to a framework.
 The Application Master is responsible for negotiating resources with the Resource Manager and for tracking the status and monitoring the progress of a single application.
 The Application Master requests a container from the Node Manager by sending a Container Launch Context (CLC), which includes everything the application needs to run. Once the application is started, it sends health reports to the Resource Manager from time to time.
59 Main components of YARN Architecture
Resource Manager: the master daemon of YARN, responsible for resource assignment and management among all the applications. Whenever it receives a processing request, it forwards it to the corresponding Node Manager and allocates resources for the completion of the request accordingly. It has two major components:
Scheduler: performs scheduling based on the allocated application and available resources. It is a pure scheduler, meaning it does not perform other tasks such as monitoring or tracking and does not guarantee a restart if a task fails. The YARN scheduler supports plugins such as the Capacity Scheduler and the Fair Scheduler to partition the cluster resources.
Application Manager: responsible for accepting the application and negotiating the first container for the application-specific Application Master. It also restarts the Application Master container if a task fails.
60 Application Workflow in YARN

1. The client submits an application
2. The Resource Manager allocates a container to start the Application Master
3. The Application Master registers itself with the Resource Manager
4. The Application Master negotiates containers from the Resource Manager
5. The Application Master notifies the Node Manager to launch containers
6. Application code is executed in the container
7. The client contacts the Resource Manager to monitor the application’s status (see the client-side sketch below)
8. Once the processing is complete, the Application Master un-registers with the Resource Manager
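As a small, hedged illustration of steps 1 and 7 from the client's point of view (not an Application Master implementation), the YarnClient API can be used to talk to the Resource Manager and poll cluster and application status.

import java.util.List;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.yarn.api.records.ApplicationReport;
import org.apache.hadoop.yarn.api.records.NodeReport;
import org.apache.hadoop.yarn.api.records.NodeState;
import org.apache.hadoop.yarn.client.api.YarnClient;

public class YarnStatusExample {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();     // picks up yarn-site.xml
    YarnClient yarnClient = YarnClient.createYarnClient();
    yarnClient.init(conf);
    yarnClient.start();

    // Nodes managed by the Resource Manager (one Node Manager per node).
    List<NodeReport> nodes = yarnClient.getNodeReports(NodeState.RUNNING);
    for (NodeReport node : nodes) {
      System.out.println(node.getNodeId() + " capability=" + node.getCapability());
    }

    // Step 7 of the workflow: the client polls the RM for application status.
    for (ApplicationReport app : yarnClient.getApplications()) {
      System.out.println(app.getApplicationId() + " " + app.getYarnApplicationState()
          + " progress=" + app.getProgress());
    }
    yarnClient.stop();
  }
}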
61 HDFS & YARN on Hadoop 2.0 (figure)
62 YARN: more functionalities
 YARN supports the notion of resource reservation via the Reservation System
 a component that allows users to specify a profile of resources over time and temporal constraints (e.g., deadlines), and to reserve resources to ensure the predictable execution of important jobs
 The Reservation System tracks resources over time, performs admission control for reservations, and dynamically instructs the underlying scheduler to ensure that the reservation is fulfilled

 In order to scale YARN beyond a few thousand nodes, YARN supports the notion of Federation via the YARN Federation feature
 Federation allows multiple YARN (sub-)clusters to be transparently wired together and made to appear as a single massive cluster
 This can be used to achieve larger scale, and/or to allow multiple independent clusters to be used together for very large jobs, or for tenants who have capacity across all of them.
63 MapReduce Paradigm
 Programming model developed at Google
 Initially, it was intended for their internal search/indexing application,
but now used extensively by more organizations (e.g., Yahoo,
Amazon.com, IBM, etc.)
 Sort/merge based distributed computing
 It is functional style programming (e.g., LISP) that is naturally
parallelizable across a large cluster of workstations or PCs.
 MapReduce programming model
 Data abstraction: KeyValue pairs
 Computation pattern: Map tasks and Reduce tasks
64 Features of the MapReduce Model
➢ It was developed for writing applications that process huge amounts of data in a parallel manner on large clusters of commodity machines.

 Designed for big data: supported by a local-disk-based distributed file system (GFS / HDFS)
 Designed for large numbers of servers: large clusters of commodity machines
 MapReduce works in a reliable, fault-tolerant manner, and also supports locality optimization and load balancing
 The underlying system takes care of the partitioning of the input data, scheduling the program’s execution across several machines, handling machine failures, and managing the required inter-machine communication.
MapReduce: A Real World Analogy — Coins Deposit (figure: “?”)

MapReduce: A Real World Analogy — Coins Deposit (figure: a coins counting machine)

MapReduce: A Real World Analogy — Coins Deposit
Distribute the coins to multiple machines:
Mapper: on each machine, categorize coins by their face values (key/value pairs, e.g. <10, 45>)
Reducer: count the coins of each face value from all mappers
Master-Slave framework
68
➢ MapReduce framework consists of a single master JobTracker and
several slave TaskTrackers
➢ The JobTracker runs on Master Node whereas TaskTrackers run on
Slave Nodes.
➢ Client applications submit MapReduce jobs to JobTracker.

➢ The JobTracker schedules MapReduce tasks out to available


TaskTracker nodes in the cluster, striving to keep the work as close to
the data as possible.
➢ TaskTrackers run the tasks and report the status of task to JobTracker.

➢ The master JobTracker is also responsible for monitoring the


component tasks of the job and re-executing the failed tasks.

▪ MapReduce adopts a pull scheduling strategy rather than


a push one
▪ I.e., JT does not push map and reduce tasks to TTs but rather TTs
pull them by making pertaining requests
MapReduce Master-Slave framework
69
HDFS & MapReduce Architecture
70
71 Programming Concept
▪ The programmer in MapReduce has to specify two functions, the map function and the reduce function, which implement the Mapper and the Reducer of a MapReduce program.
▪ In MapReduce, data elements are always structured as key-value (i.e., (K, V)) pairs.
▪ The map and reduce functions receive and emit (K, V) pairs:

Input Splits (K, V) pairs → Map Function → Intermediate Outputs (K', V') pairs → Reduce Function → Final Outputs (K'', V'') pairs
72 Programming Concept

 Map
 Perform a function on individual values in a data set to create a new list of values
 Example: square x = x * x
map square [1,2,3,4,5]
returns [1,4,9,16,25]
 Reduce
 Combine the values in a data set to create a new value
 Example: sum (for each elem in the list, total += elem)
reduce (+) [1,2,3,4,5]
returns 15 (the sum of the elements)
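The same map/reduce flavor can be seen in ordinary Java streams; this small, illustrative snippet mirrors the square and sum examples above.

import java.util.Arrays;
import java.util.List;
import java.util.stream.Collectors;

public class MapReduceFlavor {
  public static void main(String[] args) {
    List<Integer> input = Arrays.asList(1, 2, 3, 4, 5);

    // "map square [1,2,3,4,5]" -> [1, 4, 9, 16, 25]
    List<Integer> squared = input.stream().map(x -> x * x).collect(Collectors.toList());
    System.out.println(squared);            // [1, 4, 9, 16, 25]

    // "reduce (+) [1,2,3,4,5]" -> 15
    int sum = input.stream().reduce(0, Integer::sum);
    System.out.println(sum);                // 15
  }
}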
73 Partitions
▪ In MapReduce, intermediate output values are not usually reduced together.
▪ All values with the same key are presented to a single Reducer together.
▪ More specifically, a different subset of the intermediate key space is assigned to each Reducer.
▪ These subsets are known as partitions.
(figure: different colors represent different keys, potentially from different Mappers; partitions are the input to Reducers)


74 MapReduce (figure: chunks C0–C3 → Mappers M0–M3 → intermediate outputs IO0–IO3 → shuffling → Reducers R0, R1 → final outputs FO0, FO1)
▪ When an input dataset is provided to a MapReduce job, independent data chunks are processed in parallel by the Map tasks, called Mappers.
▪ The outputs from the mappers are denoted intermediate outputs (IOs) and are brought into a second set of tasks called Reducers.
▪ The process of bringing the IOs together into a set of Reducers is known as the shuffling process.
▪ The Reducers produce the final outputs (FOs).
▪ Overall, MapReduce breaks the data flow into two phases: the map phase and the reduce phase.
Partitioners are responsible for dividing up the intermediate key space and assigning intermediate key-value pairs to reducers.
75 MapReduce Architecture: Master-Slaves (figure: Job Client, Job Tracker, Task Trackers running Map and Reduce tasks, Name Node, and HDFS holding inputs and outputs)

Job Client: submits jobs (a Job = MapReduce functions + configuration)
Job Tracker: coordinates jobs (scheduling, phase coordination, etc.)
Task Trackers: execute jobs
MapReduce Architecture: Workflow (figure)
1. The client submits the job to the Job Tracker and copies the code to HDFS
2. The Job Tracker talks to the NN to find the data it needs
3. The Job Tracker creates an execution plan and submits work to the Task Trackers
4. The Task Trackers do the work and report progress/status to the Job Tracker
5. The Job Tracker manages the task phases
6. The Job Tracker finishes the job and updates the status
MapReduce: example
 Consider the problem of counting the number of
occurrences of each word in a large collection of
documents
 How would you do it in parallel ?
 Solution:
 Divide documents among workers
 Each worker parses document to find all words, outputs (word,
count) pairs (where count=1)
 Partition (word, count) pairs across workers based on word
 For each word at a worker, locally add up counts
 Add all counts for each word
MapReduce Example: Word Count (figure)
Input: “Deer Beer River”, “Car Car River”, “Deer Car Beer”
Split: each line goes to its own mapper
Map: emit (word, 1) for every word, e.g. (Deer, 1), (Beer, 1), (River, 1)
Shuffle/Sort: group the pairs by key, e.g. (Beer, [1, 1]), (Car, [1, 1, 1]), (Deer, [1, 1]), (River, [1, 1])
Reduce: sum the counts per key: (Beer, 2), (Car, 3), (Deer, 2), (River, 2)
Output: Beer 2, Car 3, Deer 2, River 2

A similar flavor to the coins deposit analogy? ☺
MapReduce Example: Word Count (same figure)

Q: What are the key and value pairs of Map and Reduce?
Map: key = word, value = 1
Reduce: key = word, value = aggregated count
Mapper and Reducer of Word Count
 Map(key, value){
// key: line number
// value: words in a line
for each word w in value:
Emit(w, "1");}
 Reduce(key, list of values){
// key: a word
// list of values: a list of counts
int result = 0;
for each v in values:
result += ParseInt(v);
Emit(key, result);}
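Written against Hadoop's Java org.apache.hadoop.mapreduce API, the same mapper and reducer look roughly like the sketch below (essentially the standard WordCount example; the input and output paths are taken from the command line). Note how the reducer class is also registered as the combiner, anticipating the combiner discussion later.

import java.io.IOException;
import java.util.StringTokenizer;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCount {

  public static class TokenizerMapper extends Mapper<Object, Text, Text, IntWritable> {
    private static final IntWritable ONE = new IntWritable(1);
    private final Text word = new Text();

    @Override
    protected void map(Object key, Text value, Context context)
        throws IOException, InterruptedException {
      // key: byte offset of the line; value: the line itself
      StringTokenizer itr = new StringTokenizer(value.toString());
      while (itr.hasMoreTokens()) {
        word.set(itr.nextToken());
        context.write(word, ONE);          // Emit(w, 1)
      }
    }
  }

  public static class IntSumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
    private final IntWritable result = new IntWritable();

    @Override
    protected void reduce(Text key, Iterable<IntWritable> values, Context context)
        throws IOException, InterruptedException {
      int sum = 0;
      for (IntWritable v : values) {
        sum += v.get();                    // aggregate counts for this word
      }
      result.set(sum);
      context.write(key, result);          // Emit(word, total count)
    }
  }

  public static void main(String[] args) throws Exception {
    Job job = Job.getInstance(new Configuration(), "word count");
    job.setJarByClass(WordCount.class);
    job.setMapperClass(TokenizerMapper.class);
    job.setCombinerClass(IntSumReducer.class);   // combiner = same code as the reducer
    job.setReducerClass(IntSumReducer.class);
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(IntWritable.class);
    FileInputFormat.addInputPath(job, new Path(args[0]));
    FileOutputFormat.setOutputPath(job, new Path(args[1]));
    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}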
MapReduce Example: Word Count (same figure)

Q: Do you see any place where we can improve the efficiency?
Local aggregation at the mapper can improve MapReduce efficiency.
MapReduce: Combiner
 Combiner: do a local aggregation/combine task at the mapper
(figure: e.g. a mapper’s output (Car, 1), (Car, 1), (Car, 1), (River, 1) becomes (Car, 3), (River, 1) after local combining)

 Q: What are the benefits of using a combiner?
 Reduces the memory/disk requirements of Map tasks
 Reduces network traffic

 Q: Can we remove the reduce function?
 No, the reducer still needs to process records with the same key coming from different mappers

 Q: How would you implement the combiner?
 It is the same as the Reducer!
84 Partitioner and Combiner

 Partitioning function: The users of MapReduce specify the number of reduce tasks/output files that they desire (R). Data gets partitioned across these tasks using a partitioning function on the intermediate key. A default partitioning function is provided that uses hashing (e.g., hash(key) mod R). In some cases it may be useful to partition data by some other function of the key; the user of the MapReduce library can provide a special partitioning function (see the sketch below).
 Combiner function: The user can specify a Combiner function that does partial merging of the intermediate local-disk data before it is sent over the network. The Combiner function is executed on each machine that performs a map task. Typically the same code is used to implement both the combiner and the reduce functions.
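A custom partitioning function can be sketched in Java as below; the default HashPartitioner already implements hash(key) mod R, and routing by first letter here is only an illustrative alternative, not a recommendation.

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Partitioner;

// Routes each word to a reducer based on its first letter (illustrative only).
public class FirstLetterPartitioner extends Partitioner<Text, IntWritable> {
  @Override
  public int getPartition(Text key, IntWritable value, int numPartitions) {
    String word = key.toString();
    if (word.isEmpty()) {
      return 0;
    }
    char first = Character.toLowerCase(word.charAt(0));
    // Ensure a non-negative index in [0, numPartitions)
    return (first & Integer.MAX_VALUE) % numPartitions;
  }
}

// In the driver: job.setPartitionerClass(FirstLetterPartitioner.class);
//                job.setNumReduceTasks(R);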
85
MapReduce with Combiner
86 Shuffling

 Shuffling is the process of moving the intermediate


data provided by the partitioner to the reducer node.
During this phase, there are sorting and merging
subphases:
 Merging - combines all key-value pairs which have same
keys and returns (Key, List[Value]).
 Sorting - takes output from Merging step and sort all key-
value pairs by using Keys. This step also returns (Key,
List[Value]) output but with sorted key-value pairs.
 Output of shuffle-sort phase is sent directly to reducers.
87 Shuffle/Sort
MapReduce WordCount 2

 New goal: output all words sorted by their frequencies (total counts) in a document.
 Question: How would you adapt the basic word count program to solve this?
 Solution:
 Sort the words by their counts in the reducer
 Problem: what happens if we have more than one reducer?
MapReduce WordCount 2

 New goal: output all words sorted by their frequencies (total counts) in a document.
 Question: How would you adapt the basic word count program to solve this?
 Solution (see the sketch below):
 Do two rounds of MapReduce
 In the 2nd round, take the output of WordCount as input but switch the key and value of each pair!
 Do not merge
 Use only one reducer in the second round of MapReduce
 Leverage the sorting capability of shuffle/sort to do the global sorting!
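A hedged sketch of the second round: assuming the first WordCount job wrote its (word, count) pairs to a SequenceFile, a trivial mapper swaps key and value so the shuffle sorts by count, and a single reducer then writes the globally sorted list (Hadoop also ships an InverseMapper class that performs the same swap).

import java.io.IOException;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

// Second-round mapper: (word, count) -> (count, word), so shuffle/sort orders by frequency.
public class SwapMapper extends Mapper<Text, IntWritable, IntWritable, Text> {
  @Override
  protected void map(Text word, IntWritable count, Context context)
      throws IOException, InterruptedException {
    context.write(count, word);   // key = count, value = word
  }
}

// Driver hints (assumptions): the first job used SequenceFileOutputFormat, so here
//   job.setInputFormatClass(SequenceFileInputFormat.class);
//   job.setNumReduceTasks(1);          // single reducer => one globally sorted output file
// IntWritable keys sort in ascending order by default; a descending comparator can be
// supplied via job.setSortComparatorClass(...) if the most frequent words should come first.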
Example 2
 Problem: Find the maximum monthly temperature for each year from
weather reports
 Input: A set of records with format as:
<Year/Month, Average Temperature of that month>
- (200707,100), (200706,90)
- (200508, 90), (200607,100)
- (200708, 80), (200606,80)
 Question: write down the Map and Reduce function to solve this
problem
 Assume we split the input by line
Mapper and Reducer of Max Temperature

 Map(key, value){
// key: line number
// value: tuples in a line
for each tuple t in value:
Emit(t->year, t->temperature);}
 Reduce(key, list of values){
// key: year
// list of values: a list of monthly temperatures
int max_temp = -100;
for each v in values:
max_temp = max(v, max_temp);
Emit(key, max_temp);}

Note: here the Combiner is the same as the Reducer.
MapReduce Example: Max Temperature

Input: (200707,100), (200706,90) | (200508, 90), (200607,100) | (200708, 80), (200606,80)

Map: (2007,100), (2007,90) | (2005,90), (2006,100) | (2007,80), (2006,80)

Combine: (2007,100) | (2005,90), (2006,100) | (2007,80), (2006,80)

Partition (assuming there are 2 reducers):
  reducer 1: (2005,90)
  reducer 2: (2007,80), (2007,100), (2006,100), (2006,80)

Shuffle/Sort: (2005,[90]) | (2006,[100, 80]) | (2007,[100, 80])

Reduce (apply the reduce function to each key): (2005,90), (2006,100), (2007,100)
Example 2

 Key-Value Pair of Map and Reduce:


 Map: (year, temperature)
 Reduce: (year, maximum temperature of the year)

Question: How to use the above Map


Reduce program (that contains the
combiner) with slight changes to find the
average monthly temperature of the year?
Mapper and Reducer of Average
Temperature
 Map(key, value){
// key: line number
// value: tuples in a line
for each tuple t in value:
Emit(t->year, t->temperature);}
 Reduce(key, list of values){
// key: year
// list of values: a list of monthly temperatures
int total_temp = 0;
for each v in values:
total_temp= total_temp+v;
Emit(key, total_temp/size_of(values));}
MapReduce Example: Average Temperature (naive combiner)

Input: (200707,100), (200706,90) | (200508, 90), (200607,100) | (200708, 80), (200606,80)
(the real average of 2007 is 90)

Map: (2007,100), (2007,90) | (2005,90), (2006,100) | (2007,80), (2006,80)

Combine: (2007,95) | (2005,90), (2006,100) | (2007,80), (2006,80)

Shuffle/Sort: (2005,[90]) | (2006,[100, 80]) | (2007,[95, 80])

Reduce: (2005,90), (2006,90), (2007,87.5)
Example 2: Combiner

 The problem is with the combiner!
 Here is a simple counterexample:
 (2007, 100), (2007,90) -> (2007, 95); (2007,80) -> (2007,80)
 The average of the above is (2007, 87.5)
 However, the real average is (2007, 90)

 However, we can do a small trick to get around this:
 At the mappers (combiner): (2007, 100), (2007,90) -> (2007, <190,2>); (2007,80) -> (2007, <80,1>)
 At the reducer: (2007, <270,3>) -> (2007, 90)
MapReduce Example: Average Temperature (with <sum, count> pairs)

Input: (200707,100), (200706,90) | (200508, 90), (200607,100) | (200708, 80), (200606,80)

Map: (2007,100), (2007,90) | (2005,90), (2006,100) | (2007,80), (2006,80)

Combine: (2007,<190,2>) | (2005,<90,1>), (2006,<100,1>) | (2007,<80,1>), (2006,<80,1>)

Shuffle/Sort: (2005,[<90,1>]) | (2006,[<100,1>, <80,1>]) | (2007,[<190,2>, <80,1>])

Reduce: (2005,90), (2006,90), (2007,90)


Mapper, Combiner, and Reducer of Average Temperature
 Map(key, value){
// key: line number
// value: tuples in a line
for each tuple t in value:
Emit(t->year, t->temperature);}
 Combine(key, list of values){
// key: year
// list of values: a list of monthly temperatures
int total_temp = 0;
for each v in values:
total_temp = total_temp + v;
Emit(key, <total_temp, size_of(values)>);}
 Reduce(key, list of values){
// key: year
// list of values: a list of <temperature sum, count> tuples
int total_temp = 0;
int total_count = 0;
for each v in values:
total_temp = total_temp + v->sum;
total_count = total_count + v->count;
Emit(key, total_temp/total_count);}
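A hedged Java sketch of the same idea, assuming each input line holds a single “YYYYMM,temperature” record: the mapper emits the temperature together with a count of 1, so the combiner and reducer both consume and produce “sum,count” pairs and the job stays correct whether or not the combiner actually runs.

import java.io.IOException;
import org.apache.hadoop.io.DoubleWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

// Mapper: (year, "temperature,1") — the count travels with the sum.
class AvgTempMapper extends Mapper<Object, Text, Text, Text> {
  @Override
  protected void map(Object key, Text value, Context ctx) throws IOException, InterruptedException {
    // value: e.g. "200707,100"  (assumed: one record per line)
    String[] parts = value.toString().trim().split(",");
    String year = parts[0].substring(0, 4);
    ctx.write(new Text(year), new Text(parts[1].trim() + ",1"));
  }
}

// Combiner: partial sums — (year, "sum,count"); register with job.setCombinerClass(...).
class SumCountCombiner extends Reducer<Text, Text, Text, Text> {
  @Override
  protected void reduce(Text year, Iterable<Text> values, Context ctx) throws IOException, InterruptedException {
    long sum = 0, count = 0;
    for (Text v : values) {
      String[] p = v.toString().split(",");
      sum += Long.parseLong(p[0]);
      count += Long.parseLong(p[1]);
    }
    ctx.write(year, new Text(sum + "," + count));
  }
}

// Reducer: divides the total sum by the total count to get the true average.
class AvgTempReducer extends Reducer<Text, Text, Text, DoubleWritable> {
  @Override
  protected void reduce(Text year, Iterable<Text> values, Context ctx) throws IOException, InterruptedException {
    long sum = 0, count = 0;
    for (Text v : values) {
      String[] p = v.toString().split(",");
      sum += Long.parseLong(p[0]);
      count += Long.parseLong(p[1]);
    }
    ctx.write(year, new DoubleWritable((double) sum / count));
  }
}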
Map Reduce Problems
Discussion
 Problem 3: Find Common Friends
 Statement: Given a group of people on online social media (e.g.,
Facebook), each has a list of friends, use Map-Reduce to find
common friends of any two persons who are friends
 Question:
 What are the Mapper and Reducer Functions?
Map Reduce Problems Discussion
 Problem 3: Find Common Friends
 Simple example (figure: the friendship graph over A, B, C, D):
Input:
A -> B,C,D
B -> A,C,D
C -> A,B
D -> A,B
MapReduce Output:
(A,B) -> C,D
(A,C) -> B
(A,D) -> ..
….
Mapper and Reducer of Common Friends
 Map(key, value){
// key: person_id
// value: the list of friends of the person
for each friend f_id in value:
// order the pair so that (A,B) and (B,A) produce the same key
Emit(<person_id, f_id>, value);}
 Reduce(key, list of values){
// key: <friend pair>
// list of values: the friend lists associated with the friend pair
for v1, v2 in values:
common_friends = v1 intersects v2;
Emit(key, common_friends);}
Map Reduce Problems
Discussion
 Problem 3: Find Common Friends
 Mapper and Reducer:

Mapper(friend list of a person)


{ for each person in the friend list:
Emit (<friend pair>, <list of friends>) }
Reducer(output of map)
{ Emit (<friend pair>, Intersection of two
(i.e, the one in friend pair) friend lists)}
Map Reduce Problems Discussion
 Problem 3: Find Common Friends (suggest friends ☺)
 Mapper and Reducer, worked example:

Input:            Map output:           Reduce output:
A -> B,C,D        (A,B) -> B,C,D        (A,B) -> C,D
B -> A,C,D        (A,C) -> B,C,D        (A,C) -> B
C -> A,B          (A,D) -> B,C,D        (A,D) -> B
D -> A,B          (A,B) -> A,C,D        (B,C) -> A
                  (B,C) -> A,C,D        (B,D) -> A
                  (B,D) -> A,C,D
                  (A,C) -> A,B
                  (B,C) -> A,B
                  (A,D) -> A,B
                  (B,D) -> A,B
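A hedged Java sketch of the same mapper and reducer, assuming input lines of the form “A -> B,C,D”; the pair key is ordered so that both friends' lists meet at the same reducer, which then intersects them.

import java.io.IOException;
import java.util.Arrays;
import java.util.LinkedHashSet;
import java.util.Set;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

// Mapper: for input "A -> B,C,D" emit ((A,B), B,C,D), ((A,C), B,C,D), ((A,D), B,C,D).
class CommonFriendsMapper extends Mapper<Object, Text, Text, Text> {
  @Override
  protected void map(Object key, Text value, Context ctx) throws IOException, InterruptedException {
    String[] parts = value.toString().split("->");
    String person = parts[0].trim();
    String friendList = parts[1].trim();
    for (String friend : friendList.split(",")) {
      friend = friend.trim();
      // Order the pair so (A,B) and (B,A) reach the same reducer.
      String pair = person.compareTo(friend) < 0 ? person + "," + friend : friend + "," + person;
      ctx.write(new Text(pair), new Text(friendList));
    }
  }
}

// Reducer: each pair key receives the two friend lists; emit their intersection.
class CommonFriendsReducer extends Reducer<Text, Text, Text, Text> {
  @Override
  protected void reduce(Text pair, Iterable<Text> values, Context ctx) throws IOException, InterruptedException {
    Set<String> common = null;
    for (Text v : values) {
      Set<String> friends = new LinkedHashSet<>(Arrays.asList(v.toString().split("\\s*,\\s*")));
      if (common == null) {
        common = friends;            // first friend list
      } else {
        common.retainAll(friends);   // intersect with the other list
      }
    }
    ctx.write(pair, new Text(String.join(",", common)));
  }
}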
Finding the Shortest Path
• A common graph
search application is
finding the shortest
path from a start node
to one or more target
nodes
• Commonly done on a
single machine with
Dijkstra's Algorithm
• Can we use BFS to
find the shortest path
via MapReduce?

This is called the single-source shortest path problem. (a.k.a. SSSP)


Finding the Shortest Path
● Consider the simple case of equal edge weights
● The solution to the problem can be defined inductively
● Here’s the intuition:
– Define: b is reachable from a if b is on the adjacency list of a
– DISTANCETO(s) = 0
– For all nodes p directly reachable from s, DISTANCETO(p) = 1
– For all nodes n reachable from some other set of nodes M,
  DISTANCETO(n) = 1 + min(DISTANCETO(m), m ∈ M)
(figure: node s reaches node n via intermediate nodes m1, m2, m3 at distances d1, d2, d3)
From Intuition to Algorithm

■ A map task receives a node n as a key, and


(D, points-to) as its value
□ D is the distance to the node from the start
□points-to is a list of nodes reachable from n
□ ∀p ∈ points-to, emit (p, D+1)

■ Reduce task gathers possible distances to


a given p and selects the minimum one
From Intuition to Algorithm
● Data representation:
– Key: node n
– Value: d (distance from start), adjacency list (list of nodes reachable
from n)
– Initialization: for all nodes except for start node, d = ∞
● Mapper:
– ∀m ∈ adjacency list: emit (m, d + 1)
● Sort/Shuffle
– Groups distances by reachable nodes
● Reducer:
– Selects minimum distance path for each reachable node
– Additional bookkeeping needed to keep track of actual path

Multiple Iterations Needed

● Each MapReduce iteration advances the “known frontier” by


one hop
– Subsequent iterations include more and more reachable nodes as
frontier expands
– Multiple iterations are needed to explore entire graph
● Preserving graph structure:
– Problem: Where did the adjacency list go?
– Solution: mapper emits (n, adjacency list) as well

109 Shortest path problem (figure)
110 Shortest path problem: mapper (figure)
111 Shortest path problem: reducer (figure)
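The mapper and reducer code on slides 110–111 did not survive extraction; below is a hedged Java sketch of one parallel-BFS iteration (equal edge weights, as in the preceding slides), assuming KeyValueTextInputFormat and node records encoded as “nodeId <TAB> distance|adj1,adj2,…” where distance is an integer or “inf”. The reducer's output uses the same encoding, so iterations can be chained.

import java.io.IOException;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

// Mapper: emit D+1 for every neighbour, and re-emit the node structure itself.
class BfsMapper extends Mapper<Text, Text, Text, Text> {
  @Override
  protected void map(Text nodeId, Text value, Context ctx) throws IOException, InterruptedException {
    String[] parts = value.toString().split("\\|", 2);
    String dist = parts[0];
    String adjList = parts.length > 1 ? parts[1] : "";

    // Preserve the graph structure across the iteration.
    ctx.write(nodeId, new Text("NODE|" + dist + "|" + adjList));

    if (!dist.equals("inf") && !adjList.isEmpty()) {
      long d = Long.parseLong(dist);
      for (String neighbour : adjList.split(",")) {
        ctx.write(new Text(neighbour.trim()), new Text("DIST|" + (d + 1)));
      }
    }
  }
}

// Reducer: keep the minimum distance seen and re-attach the adjacency list.
class BfsReducer extends Reducer<Text, Text, Text, Text> {
  @Override
  protected void reduce(Text nodeId, Iterable<Text> values, Context ctx) throws IOException, InterruptedException {
    long best = Long.MAX_VALUE;
    String adjList = "";
    for (Text v : values) {
      String[] parts = v.toString().split("\\|");
      if (parts[0].equals("NODE")) {
        adjList = parts.length > 2 ? parts[2] : "";
        if (!parts[1].equals("inf")) best = Math.min(best, Long.parseLong(parts[1]));
      } else {
        best = Math.min(best, Long.parseLong(parts[1]));
      }
    }
    String dist = best == Long.MAX_VALUE ? "inf" : Long.toString(best);
    ctx.write(nodeId, new Text(dist + "|" + adjList));
  }
}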
Visualizing Parallel BFS (figure: an example graph with nodes n0–n9)
113 Example: SSSP – Parallel BFS in MapReduce
● Adjacency matrix over nodes A–E (figure: weighted graph with source A at distance 0 and all other nodes at ∞)
● Adjacency list:
A: (B, 10), (D, 5)
B: (C, 1), (D, 2)
C: (E, 4)
D: (B, 3), (C, 9), (E, 2)
E: (A, 7), (C, 6)
Example: SSSP – Parallel BFS in MapReduce
● Map input: <node ID, <dist, adj list>>
<A, <0, <(B, 10), (D, 5)>>>
<B, <inf, <(C, 1), (D, 2)>>>
<C, <inf, <(E, 4)>>>
<D, <inf, <(B, 3), (C, 9), (E, 2)>>>
<E, <inf, <(A, 7), (C, 6)>>>
● Map output: <dest node ID, dist>
from A: <B, 10> <D, 5>   from B: <C, inf> <D, inf>   from C: <E, inf>
from D: <B, inf> <C, inf> <E, inf>   from E: <A, inf> <C, inf>
(plus the node records <n, <dist, adj list>> re-emitted to preserve the graph structure)
Flushed to local disk!!
115 Example: SSSP – Parallel BFS in MapReduce
● Reduce input: <node ID, dist>
<A, <0, <(B, 10), (D, 5)>>>  <A, inf>
<B, <inf, <(C, 1), (D, 2)>>>  <B, 10>  <B, inf>
<C, <inf, <(E, 4)>>>  <C, inf>  <C, inf>  <C, inf>
<D, <inf, <(B, 3), (C, 9), (E, 2)>>>  <D, 5>  <D, inf>
<E, <inf, <(A, 7), (C, 6)>>>  <E, inf>  <E, inf>
117 Example: SSSP – Parallel BFS in MapReduce
● Reduce output: <node ID, <dist, adj list>> = Map input for the next iteration (flushed to DFS!!)
<A, <0, <(B, 10), (D, 5)>>>
<B, <10, <(C, 1), (D, 2)>>>
<C, <inf, <(E, 4)>>>
<D, <5, <(B, 3), (C, 9), (E, 2)>>>
<E, <inf, <(A, 7), (C, 6)>>>
● Map output: <dest node ID, dist>
from A: <B, 10> <D, 5>   from B: <C, 11> <D, 12>   from C: <E, inf>
from D: <B, 8> <C, 14> <E, 7>   from E: <A, inf> <C, inf>
(plus the node records; flushed to local disk!!)
Example: SSSP – Parallel BFS in MapReduce
● Reduce input: <node ID, dist>
<A, <0, <(B, 10), (D, 5)>>>  <A, inf>
<B, <10, <(C, 1), (D, 2)>>>  <B, 10>  <B, 8>
<C, <inf, <(E, 4)>>>  <C, 11>  <C, 14>  <C, inf>
<D, <5, <(B, 3), (C, 9), (E, 2)>>>  <D, 5>  <D, 12>
<E, <inf, <(A, 7), (C, 6)>>>  <E, inf>  <E, 7>
Example: SSSP – Parallel BFS in MapReduce
● Reduce output: <node ID, <dist, adj list>> = Map input for the next iteration (flushed to DFS!!)
<A, <0, <(B, 10), (D, 5)>>>
<B, <8, <(C, 1), (D, 2)>>>
<C, <11, <(E, 4)>>>
<D, <5, <(B, 3), (C, 9), (E, 2)>>>
<E, <7, <(A, 7), (C, 6)>>>
… the rest omitted …
Blow-up and Termination

■ This algorithm starts from one node


■ Subsequent iterations include many more
nodes of the graph as frontier advances
■ Does this ever terminate?
□ Yes! Eventually, routes between nodes will
stop being discovered and no better distances
will be found. When distance is the same, we
stop
□Mapper should emit (n, D) to ensure that
“current distance” is carried into the reducer
Stopping Criterion
● How many iterations are needed in parallel BFS (equal edge
weight case)?
● Convince yourself: when a node is first “discovered”, we’ve
found the shortest path
● Now answer the question...
– Six degrees of separation?
● Practicalities of implementation in MapReduce

Comparison to Dijkstra

■ Dijkstra's algorithm is more efficient


because at any step it only pursues edges
from the minimum-cost path inside the
frontier
■ MapReduce version explores all paths in
parallel; not as efficient overall, but the
architecture is more scalable
■ Equivalent to Dijkstra for weight=1 case
124 Iterations of the MapReduce job

 Only one hop of the graph is completed in each MapReduce iteration
 Several iterations of MapReduce are needed
 How many iterations?
 Six (the diameter of a large general graph)
 Or until a criterion is met (e.g., all node distances have been calculated)
 The MapReduce counter API keeps count of the various events
 Driver code executes the iterations of the MapReduce job
125 Summary of MapReduce Graph Algorithms (figure)
PageRank: Random Walks
Over The Web

■ If a user starts at a random web page and


surfs by clicking links and randomly
entering new URLs, what is the probability
that s/he will arrive at a given page?
■ The PageRank of a page captures this
notion
□ More “popular” or “worthwhile” pages get a
higher rank
PageRank: Visually (figure: links among www.cnn.com, en.wikipedia.org, and www.nytimes.com)
PageRank: Formula

Given page A, and pages T1 through Tn


linking to A, PageRank is defined as:

PR(A) = (1-d) + d (PR(T1)/C(T1) + ... +


PR(Tn)/C(Tn))

C(P) is the cardinality (out-degree) of page P


d is the damping (“random URL”) factor
PageRank: Intuition

■ Calculation is iterative: PRi+1 is based on PRi


■ Each page distributes its PRi to all pages it
links to. Linkees add up their awarded rank
fragments to find their PRi+1
■ d is a tunable parameter (usually = 0.85)
encapsulating the “random jump factor”

PR(A) = (1-d) + d (PR(T1)/C(T1) + ... + PR(Tn)/C(Tn))


PageRank: First Implementation

■ Create two tables 'current' and 'next' holding


the PageRank for each page. Seed 'current'
with initial PR values
■ Iterate over all pages in the graph,
distributing PR from 'current' into 'next' of
linkees
■ current := next; next := fresh_table();
■ Go back to iteration step or end if converged
Distribution of the Algorithm

■ Key insights allowing parallelization:


□ The 'next' table depends on 'current', but not on
any other rows of 'next'
□Individual rows of the adjacency matrix can be
processed in parallel
□Sparse matrix rows are relatively small
Distribution of the Algorithm
Map step: break page rank into even fragments to distribute to link targets

■ Consequences of insights:
□ We can map each row of 'current' to a list of
PageRank “fragments” to assign to linkees
□These fragments
Reduce can
step: add together beintoreduced
fragments next PageRankinto a single
PageRank value for a page by summing
□Graph representation can be even more
compact; since each element is simply 0 or 1,
only transmit column numbers where it's 1

Iterate for next step...


Phase 1: Parse HTML

■ Map task takes (URL, page content) pairs


and maps them to (URL, (PRinit, list-of-urls))
□PRinit is the “seed” PageRank for URL
□list-of-urls contains all pages pointed to by URL

■ Reduce task is just the identity function


Phase 2: PageRank Distribution

■ Map task takes (URL, (cur_rank, url_list))


□ For each u in url_list, emit (u, cur_rank/|url_list|)
□ Emit (URL, url_list) to carry the points-to list
along through iterations

PR(A) = (1-d) + d (PR(T1)/C(T1) + ... + PR(Tn)/C(Tn))


Phase 2: PageRank Distribution

■ Reduce task gets (URL, url_list) and many


(URL, val) values
□ Sum vals and fix up with d
□ Emit (URL, (new_rank, url_list))

PR(A) = (1-d) + d (PR(T1)/C(T1) + ... + PR(Tn)/C(Tn))
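A hedged Java sketch of one Phase 2 iteration, assuming KeyValueTextInputFormat and records encoded as “url <TAB> rank|out1,out2,…”: the mapper distributes rank fragments and carries the points-to list along, and the reducer sums the fragments and applies the damping factor.

import java.io.IOException;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

// Mapper: emit cur_rank/|url_list| to every out-link, plus the structure record.
class PageRankMapper extends Mapper<Text, Text, Text, Text> {
  @Override
  protected void map(Text url, Text value, Context ctx) throws IOException, InterruptedException {
    String[] parts = value.toString().split("\\|", 2);
    double rank = Double.parseDouble(parts[0]);
    String[] outLinks = (parts.length > 1 && !parts[1].isEmpty()) ? parts[1].split(",") : new String[0];

    // Carry the points-to list along through the iteration.
    ctx.write(url, new Text("LINKS|" + (parts.length > 1 ? parts[1] : "")));

    // Distribute this page's rank evenly over its out-links.
    for (String target : outLinks) {
      ctx.write(new Text(target.trim()), new Text("RANK|" + rank / outLinks.length));
    }
  }
}

// Reducer: sum the fragments, fix up with d, re-attach the out-link list.
class PageRankReducer extends Reducer<Text, Text, Text, Text> {
  private static final double D = 0.85;   // damping factor

  @Override
  protected void reduce(Text url, Iterable<Text> values, Context ctx) throws IOException, InterruptedException {
    double sum = 0.0;
    String links = "";
    for (Text v : values) {
      String[] parts = v.toString().split("\\|", 2);
      if (parts[0].equals("LINKS")) {
        links = parts.length > 1 ? parts[1] : "";
      } else {
        sum += Double.parseDouble(parts[1]);
      }
    }
    double newRank = (1 - D) + D * sum;    // PR(A) = (1-d) + d * (sum of fragments)
    ctx.write(url, new Text(newRank + "|" + links));
  }
}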


Finishing up...

■ A subsequent component determines


whether convergence has been achieved
(Fixed number of iterations? Comparison of
key values?)
■ If so, write out the PageRank lists - done!
■ Otherwise, feed output of Phase 2 into
another Phase 2 iteration
PageRank Conclusions
■ MapReduce runs the “heavy lifting” in
iterated computation
■ Key element in parallelization is
independent PageRank computations in a
given step
■ Parallelization requires thinking about
minimum data partitions to transmit (e.g.,
compact representations of graph rows)
□ Even the implementation shown today doesn't
actually scale to the whole Internet; but it works
for intermediate-sized graphs
138 MapReduce Algorithm for k-means clustering (figure)
139 How to find the mean? (figure)
140 Mapper for k-means (figure)
141 Reducer for k-means (figure)
142 MapReduce algorithm for k-means clustering (figure)
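The figures for the k-means mapper and reducer did not survive extraction. As a hedged sketch of the standard MapReduce formulation (one iteration over 2-D points; the “kmeans.centroids” configuration key is hypothetical): the mapper assigns each point to its nearest current centroid, and the reducer averages each cluster's points to obtain the new centroid. The driver would rerun the job with the updated centroids until they converge.

import java.io.IOException;
import java.util.ArrayList;
import java.util.List;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

// Mapper: points are lines "x,y"; current centroids come from the job configuration,
// e.g. conf.set("kmeans.centroids", "1.0,1.0;5.0,5.0")  (hypothetical key).
class KMeansMapper extends Mapper<Object, Text, IntWritable, Text> {
  private final List<double[]> centroids = new ArrayList<>();

  @Override
  protected void setup(Context ctx) {
    for (String c : ctx.getConfiguration().get("kmeans.centroids").split(";")) {
      String[] p = c.split(",");
      centroids.add(new double[] {Double.parseDouble(p[0]), Double.parseDouble(p[1])});
    }
  }

  @Override
  protected void map(Object key, Text value, Context ctx) throws IOException, InterruptedException {
    String[] p = value.toString().split(",");
    double x = Double.parseDouble(p[0]), y = Double.parseDouble(p[1]);
    int nearest = 0;
    double best = Double.MAX_VALUE;
    for (int i = 0; i < centroids.size(); i++) {
      double dx = x - centroids.get(i)[0], dy = y - centroids.get(i)[1];
      double dist = dx * dx + dy * dy;            // squared Euclidean distance
      if (dist < best) { best = dist; nearest = i; }
    }
    ctx.write(new IntWritable(nearest), value);   // (cluster id, point)
  }
}

// Reducer: the new centroid of a cluster is the mean of its points.
class KMeansReducer extends Reducer<IntWritable, Text, IntWritable, Text> {
  @Override
  protected void reduce(IntWritable cluster, Iterable<Text> points, Context ctx)
      throws IOException, InterruptedException {
    double sumX = 0, sumY = 0;
    long n = 0;
    for (Text point : points) {
      String[] p = point.toString().split(",");
      sumX += Double.parseDouble(p[0]);
      sumY += Double.parseDouble(p[1]);
      n++;
    }
    ctx.write(cluster, new Text((sumX / n) + "," + (sumY / n)));
  }
}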
143 MapReduce for k-NN classification Algorithm
 Let TR be a training set and TS a test set, of arbitrary sizes, stored in HDFS
 The map phase starts by dividing the TR set into a given number of disjoint subsets; each map task (Map1, Map2, ..., Mapm) will work on an associated subset TRj
 We compute the distance of each xt in TS against the instances of TRj
 For each test example xt, the class labels of the k closest neighbours (minimum distance) in TRj and their distances will be saved
 As a result, we obtain a matrix CDj of <class, distance> pairs with dimension n × k

 Therefore, at row t of CDj, we have the distances and the classes of the k nearest neighbours of xt
144 MapReduce for k-NN classification Algorithm (figure)
145 Mapper for k-NN
146 Reducer for k-NN
147 Reducer for k-NN

 The reduce phase consists of determining which of the tentative k nearest neighbours from the maps are the closest ones for the complete TS
 The setup operation will allocate a class-distance matrix CDreducer of fixed size (size(TS) × k neighbours)

 This matrix will be initialized with random values for the classes and positive infinity for the distances
148 Reducer for k-NN
149 Reducer for k-NN
150 More examples

 Distributed grep (as in the Unix grep command)
 Count of URL access frequency
 Reverse Web-Link Graph: list of all source URLs associated with a given target URL
 Inverted index: produces <word, list(Document ID)> pairs
 Distributed sort
151 MapReduce-Fault tolerance

 Worker failure: The master pings every worker


periodically. If no response is received from a worker in
a certain amount of time, the master marks the worker
as failed. Any map tasks completed by the worker are
reset back to their initial idle state, and therefore
become eligible for scheduling on other workers.
Similarly, any map task or reduce task in progress on a
failed worker is also reset to idle and becomes eligible
for rescheduling.
 Master Failure: It is easy to make the master write periodic
checkpoints of the master data structures described above. If the
master task dies, a new copy can be started from the last
checkpointed state. However, in most cases, the user restarts the
job.
152 Mapping workers to
Processors
 The input data (on HDFS) is stored on the local disks of the
machines in the cluster. HDFS divides each file into 128 MB
blocks, and stores several copies of each block (typically 3 copies)
on different machines.

 The MapReduce master takes the location information of the input


files into account and attempts to schedule a map task on a
machine that contains a replica of the corresponding input data.
Failing that, it attempts to schedule a map task near a replica of
that task's input data. When running large MapReduce operations
on a significant fraction of the workers in a cluster, most input data
is read locally and consumes no network bandwidth.
153 Task Granularity

 The map phase has M pieces and the reduce phase has R pieces.
 M and R should be much larger than the number of worker machines.
 Having each worker perform many different tasks improves dynamic load balancing, and also speeds up recovery when a worker fails.
 The larger M and R are, the more decisions the master must make:
 O(M+R) decisions and O(M*R) states

 R is often constrained by users because the output of each reduce task ends up in a separate output file.
 Typically, M is chosen so that each mapper handles 128 MB of input data, and R is a small multiple of the number of worker machines.
 Typically (at Google), M = 200,000 and R = 5,000, using 2,000 worker machines.
Map Reduce vs. Parallel
Databases
 Map Reduce widely used for parallel processing
 Google, Yahoo, and 100’s of other companies
 Example uses: compute PageRank, build keyword
indices, do data analysis of web click logs, ….
 Database people say: but parallel databases
have been doing this for decades
 Map Reduce people say:
 we operate at scales of 1000’s of machines
 We handle failures seamlessly
 We allow procedural code in map and reduce and
allow data of any type
