BDT Unit03.pptx
Unit-III-Introduction to Hadoop:
• What is Hadoop? Hadoop Overview, Processing Data with Hadoop,
• HDFS - Introduction, Concepts, Design, HDFS interfaces, HDFS Read
• Architecture, HDFS Write Architecture,
• Managing Resources and Applications with Hadoop YARN (Yet Another
Resource Negotiator),
• Introduction to MapReduce Programming: Mapper, Reducer, Combiner,
Partitioner, Searching, Sorting
UNIT III
Challenges of Big Data
The term Big Data refers to very large volumes of data, and big
data analytics is used to find the hidden patterns in that data which are
useful for predicting future marketing strategies, offers,
customer retention, promotional content, pricing strategies,
etc.
UNIT III 5
Why Hadoop
• Distributed cluster system
• Platform for massively scalable applications
• Enables parallel data processing
Hadoop Features
• Hadoop provides access to the file systems
• The Hadoop Common package contains the necessary JAR files
and scripts
• The package also provides source code, documentation and a
contribution section that includes projects from the Hadoop
Community.
Hadoop : Big Data Processing Framework
• Hadoop is a well-adopted, standards-based, open-source
software framework built on the foundation of Google’s
MapReduce and Google File System papers.
Hadoop Major Component
Major Component of Hadoop
• Hadoop Common: The common utilities that support the other Hadoop
modules. A set of components and interfaces for distributed filesystems
and general I/O (serialization, Java RPC, persistent data structures).
Hadoop Ecosystem(Zoo)
Original Google Stack
Facebook’s Version of the Stack
Yahoo’s Version of the Stack
LinkedIn’s Version of the Stack
Cloudera’s Version of the Stack
Hadoop Ecosystem
HDFS
• A distributed filesystem that runs on large clusters of
commodity machines.
MapReduce
• A distributed data processing model and execution
environment that runs on large clusters of commodity
machines.
Pig
• A dataflow language and execution environment for exploring very large
datasets.
Hive
• A distributed data warehouse; it manages data stored in HDFS and provides a
query language based on SQL.
ZooKeeper
• A distributed, highly available coordination service.
Sqoop
• A tool for efficiently transferring bulk data between relational databases
and HDFS.
HDFS
Hadoop Distributed File System:
• It is a distributed file system designed to run on commodity hardware
• It is highly fault tolerant
▪ Features of HDFS
▪ It is suitable for the distributed storage and processing.
▪ Hadoop provides a command interface to interact with HDFS.
▪ The built-in servers of namenode and datanode help users to easily check the
status of cluster.
▪ Streaming access to file system data.
▪ HDFS provides file permissions and authentication.
HDFS
▪ Assumptions and Goals
▪ Hardware Failure
▪ Streaming Data Access
▪ Large Data Sets
▪ Simple Coherency Model
▪ Moving Computation is cheaper than Moving data
▪ Portability across Heterogeneous Hardware and software platforms
HDFS
HDFS Cluster Architecture
HDFS Data Blocks
Replication
• The default replication factor is 3
Rack awareness in HDFS
Hadoop: How it Works
HDFS Architecture
Hadoop Distributed File System:
Apache Hadoop HDFS follows a master/slave architecture, where a
cluster comprises a single NameNode (master node) and all the other nodes are
DataNodes (slave nodes).
Hadoop Architecture
Hadoop Distributed File System (HDFS)
Centralized NameNode
- Maintains metadata info about files
(Figure: a file F split into numbered blocks 1–5, distributed across DataNodes)
HDFS Architecture
Hadoop Distributed File System:
HDFS Read Mechanism
Client Request:
A client application requests to read a file from HDFS. The request is sent to the NameNode, which manages the file system metadata.
Block Information Retrieval:
The NameNode identifies the file's location by checking its metadata. It breaks the file into blocks and returns the block locations (list of DataNodes that store each block) to the client.
Client Contacts DataNodes:
The client connects directly to the DataNodes that contain the blocks, not to the NameNode. The client retrieves blocks in parallel from the closest DataNodes based on the block information provided.
Data Transfer:
The DataNodes send the requested data blocks to the client. Each DataNode streams its respective block to the client, ensuring efficient data transfer through a pipeline mechanism.
Reconstruct the File:
The client reconstructs the original file from the blocks received from the DataNodes and processes or stores the file as needed.
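The read path above can be sketched as a toy Python simulation. This is not the real Hadoop API: the `NameNode`/`DataNode` classes and the tiny 4-byte block size are assumptions for illustration, but the division of labor (NameNode holds only metadata; the client pulls blocks directly from DataNodes and reassembles them) matches the steps above.

```python
# Toy simulation of the HDFS read path.
BLOCK_SIZE = 4  # bytes; tiny for illustration (real HDFS uses 64/128 MB)

class DataNode:
    def __init__(self, name):
        self.name = name
        self.blocks = {}          # block_id -> bytes

    def read_block(self, block_id):
        return self.blocks[block_id]

class NameNode:
    def __init__(self):
        self.metadata = {}        # filename -> [(block_id, [DataNodes])]

    def get_block_locations(self, filename):
        return self.metadata[filename]

def write_file(namenode, datanodes, filename, data):
    """Split data into blocks and place each block on every DataNode
    (replication simplified to 'everywhere' for brevity)."""
    blocks = []
    for i in range(0, len(data), BLOCK_SIZE):
        block_id = f"{filename}_blk{i // BLOCK_SIZE}"
        for dn in datanodes:
            dn.blocks[block_id] = data[i:i + BLOCK_SIZE]
        blocks.append((block_id, list(datanodes)))
    namenode.metadata[filename] = blocks

def read_file(namenode, filename):
    """Client: ask the NameNode for locations, then pull each block
    directly from a DataNode and reconstruct the file."""
    parts = []
    for block_id, locations in namenode.get_block_locations(filename):
        parts.append(locations[0].read_block(block_id))  # first = closest replica
    return b"".join(parts)

nn = NameNode()
dns = [DataNode("dn1"), DataNode("dn2")]
write_file(nn, dns, "demo.txt", b"hello hdfs!")
assert read_file(nn, "demo.txt") == b"hello hdfs!"
```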
HDFS Write Mechanism
Hadoop Distributed File System(Write Mechanism):
Client Request to Write:
A client requests to write a file to HDFS. The request goes to the NameNode, which checks the metadata to ensure that the file doesn't already exist (HDFS is write-once).
Block Allocation:
The NameNode allocates a list of DataNodes for each block. The client sends the first block of data to the first chosen DataNode; this DataNode stores the block and forwards it to the next DataNode in the replication pipeline until the required number of replicas (default is 3) is satisfied.
Block Acknowledgement:
Once a block is replicated to all assigned DataNodes, each DataNode sends an acknowledgment back to the previous node, and eventually, the client receives confirmation that the block has been successfully written.
Write Completion:
After all blocks are written and replicated, the client informs the NameNode that the file write operation is complete. The NameNode updates the file system metadata, and the file becomes available for reading.
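The replication pipeline described above can be sketched as a toy Python model (not the real Hadoop client; plain lists stand in for DataNode storage, and the ack check is simplified):

```python
# Toy write pipeline: the block is forwarded node-to-node until the
# replication factor is met; acks flow back from the last node.
def pipeline_write(block, datanodes, replication=3):
    """Return True once `replication` DataNodes have stored the block
    and acknowledged back up the pipeline."""
    chain = datanodes[:replication]
    for dn in chain:                       # data forwarded down the chain
        dn.append(block)
    acks = [dn[-1] == block for dn in reversed(chain)]  # acks in reverse
    return all(acks) and len(chain) == replication

dn1, dn2, dn3, dn4 = [], [], [], []
ok = pipeline_write("blk_001", [dn1, dn2, dn3, dn4])
assert ok
assert dn1 == dn2 == dn3 == ["blk_001"]
assert dn4 == []   # the fourth node is not part of the pipeline
```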
Advantages of HDFS
Hadoop Distributed File System:
HDFS Architecture: Component
• NameNode:
• HDFS cluster consists of a single NameNode, a master server that manages the file
system namespace and regulates access to files by clients.
• The NameNode executes file system namespace operations like opening, closing, and
renaming files and directories.
• It also determines the mapping of blocks to DataNodes
• DataNode:
• There are a number of DataNodes, usually one per node in the cluster, which manage
storage attached to the nodes that they run on.
• HDFS exposes a file system namespace and allows user data to be stored in files.
• Internally, a file is split into one or more blocks and these blocks are stored in a set of
DataNodes
• The DataNodes are responsible for serving read and write requests from the file
system’s clients.
• The DataNodes also perform block creation, deletion, and replication upon instruction
from the NameNode.
HDFS Architecture: Component
• The File System Namespace:
• HDFS supports a traditional hierarchical file organization.
• The NameNode maintains the file system namespace.
• Any change to the file system namespace or its properties is recorded by
the NameNode.
• An application can specify the number of replicas of a file that should be
maintained by HDFS.
• The number of copies of a file is called the replication factor of that file.
This information is stored by the NameNode.
HDFS Architecture: Component
• Data Replication :
• HDFS is designed to reliably store very large files across machines in a large cluster.
• It stores each file as a sequence of blocks; all blocks in a file except the last block are the
same size.
• The blocks of a file are replicated for fault tolerance.
• The block size and replication factor are configurable per file
• An application can specify the number of replicas of a file.
HDFS Architecture: Component
• Data Replication (contd.)
• The replication factor can be specified at file creation time and can be changed later.
• Files in HDFS are write-once and have strictly one writer at any time.
• The NameNode makes all decisions regarding replication of blocks.
• It periodically receives a Heartbeat (a message indicating availability) and a Blockreport from
each of the DataNodes in the cluster.
• Receipt of a Heartbeat implies that the DataNode is functioning properly. A Blockreport
contains a list of all blocks on a DataNode.
1. Replica Placement: The First Baby Steps
• The purpose of a rack-aware replica placement policy is to improve data reliability,
availability, and network bandwidth utilization.
• Replica Selection:
• to minimize global bandwidth consumption and read latency, HDFS tries to satisfy
a read request from a replica that is closest to the reader.
• If there exists a replica on the same rack as the reader node, then that replica is
preferred to satisfy the read request. If an HDFS cluster spans multiple data
centers, then a replica that is resident in the local data center is preferred over any
remote replica.
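The replica-selection rule above (same rack, then local data center, then remote) can be sketched as a distance function; the rack and data-center names below are made up for illustration:

```python
# Replica selection: prefer same rack, then same data center.
def select_replica(replicas, reader_rack, reader_dc):
    """replicas: list of (datanode, rack, datacenter) tuples."""
    def distance(rep):
        _, rack, dc = rep
        if rack == reader_rack:
            return 0          # same rack: cheapest read
        if dc == reader_dc:
            return 1          # same data center
        return 2              # remote data center
    return min(replicas, key=distance)

replicas = [("dnA", "rack1", "dc1"),
            ("dnB", "rack2", "dc1"),
            ("dnC", "rack7", "dc2")]
assert select_replica(replicas, "rack2", "dc1")[0] == "dnB"  # same rack wins
assert select_replica(replicas, "rack9", "dc2")[0] == "dnC"  # local DC wins
```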
Safemode
On startup, the NameNode enters Safemode, where data block replication is paused.
The NameNode receives Heartbeat and Blockreport messages from DataNodes, indicating the blocks they host.
A block is safely replicated when its minimum replica count is met, and after a configurable percentage of blocks confirm this, the NameNode exits Safemode.
The NameNode identifies and replicates any blocks that have fewer than the required replicas.
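The Safemode exit condition above can be sketched as a simple check; the 0.999 default threshold is an assumption here, mirroring HDFS's configurable safemode threshold percentage:

```python
# Safemode exit: enough blocks must have met their minimum replica count.
def can_exit_safemode(block_replica_counts, min_replicas=1, threshold=0.999):
    """block_replica_counts: block_id -> number of replicas reported
    via Blockreports. Returns True when the configured fraction of
    blocks is safely replicated."""
    if not block_replica_counts:
        return True
    safe = sum(1 for n in block_replica_counts.values() if n >= min_replicas)
    return safe / len(block_replica_counts) >= threshold

assert can_exit_safemode({"b1": 3, "b2": 2}, threshold=0.9)          # all safe
assert not can_exit_safemode({"b1": 3, "b2": 0}, threshold=0.9)      # b2 unreported
```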
HDFS Architecture: Component
• The Persistence of File System Metadata :
• The HDFS namespace is stored by the NameNode.
• The NameNode uses a transaction log called the EditLog to persistently record
every change that occurs to file system metadata.
• For example, creating a new file in HDFS causes the NameNode to insert a record
into the EditLog indicating this.
• The NameNode uses a file in its local host OS file system to store the EditLog.
The entire file system namespace, including the mapping of blocks to files and
file system properties, is stored in a file called the FsImage.
• The FsImage is stored as a file in the NameNode’s local file system too.
• When a DataNode starts up, it scans through its local file system, generates a list
of all HDFS data blocks that correspond to each of these local files and sends this
report to the NameNode: this is the Blockreport.
HDFS Architecture: Component
• Robustness:
• The primary objective of HDFS is to store data reliably even in the presence of
failures.
• Cluster Rebalancing
• The HDFS architecture is compatible with data rebalancing schemes. A scheme
might automatically move data from one DataNode to another if the free space
on a DataNode falls below a certain threshold.
HDFS Architecture: Component
• Data Integrity
• It is possible that a block of data fetched from a DataNode arrives corrupted.
• The HDFS client software implements checksum checking on the contents of HDFS
files.
• Snapshots
• Snapshots support storing a copy of data at a particular instant of time. One usage of the
snapshot feature may be to roll back a corrupted HDFS instance to a previously known good
point in time. HDFS does not currently support snapshots but will in a future release.
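The data-integrity idea above (checksum on write, verify on read, fall back to another replica on mismatch) can be sketched as follows; real HDFS uses CRC checksums per chunk, SHA-256 here is just for illustration:

```python
# Checksum a block when it is stored; verify on every read.
import hashlib

def store_block(data):
    return {"data": data, "checksum": hashlib.sha256(data).hexdigest()}

def read_block(block):
    if hashlib.sha256(block["data"]).hexdigest() != block["checksum"]:
        raise IOError("block corrupted; client should try another replica")
    return block["data"]

blk = store_block(b"some block contents")
assert read_block(blk) == b"some block contents"

blk["data"] = b"flipped bits"          # simulate corruption in transit
try:
    read_block(blk)
    assert False, "corruption not detected"
except IOError:
    pass
```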
HDFS Architecture: Component
• Data Organization
1. Data Blocks:
• HDFS is designed to support very large files.
• Applications that are compatible with HDFS are those that deal with large data sets.
• These applications write their data only once but they read it one or more times and require these
reads to be satisfied at streaming speeds.
• HDFS supports write-once-read-many semantics on files.
• A typical block size used by HDFS is 64 MB.
• Thus, an HDFS file is chopped up into 64 MB chunks, and if possible, each chunk will reside on a
different DataNode.
2. Staging:
• A client request to create a file does not reach the NameNode immediately; the HDFS
client first caches the file data into a local buffer.
• If a client wrote to a remote file directly without any client-side buffering, the network
speed and the congestion in the network would impact throughput considerably.
• This approach is not without precedent. Earlier distributed file systems, e.g. AFS, have
used client side caching to improve performance. A POSIX requirement has been relaxed
to achieve higher performance of data uploads.
3. Replication Pipelining:
• When a client writes a block, the data is pushed to the first DataNode, which forwards it
to the next DataNode in its list, and so on, forming a pipeline until all replicas are written.
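The 64 MB chunking rule under Data Blocks above can be sketched in a few lines: a file is cut into fixed-size blocks and only the last block may be smaller.

```python
# Split a file of a given size into HDFS-style fixed-size blocks.
BLOCK_SIZE = 64 * 1024 * 1024   # 64 MB

def split_into_blocks(file_size, block_size=BLOCK_SIZE):
    """Return (offset, length) pairs for each block of a file."""
    blocks = []
    offset = 0
    while offset < file_size:
        length = min(block_size, file_size - offset)
        blocks.append((offset, length))
        offset += length
    return blocks

# A 150 MB file becomes two full 64 MB blocks plus one 22 MB tail block.
blocks = split_into_blocks(150 * 1024 * 1024)
assert len(blocks) == 3
assert blocks[0][1] == blocks[1][1] == BLOCK_SIZE
assert blocks[2][1] == 22 * 1024 * 1024
```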
HDFS Commands
ls: This command is used to list all the files.
hdfs dfs -ls
copyFromLocal (or) put: To copy files/folders from the local file system to the HDFS store.
hdfs dfs -copyFromLocal <local source> <hdfs destination>
copyToLocal (or) get: To copy files/folders from the HDFS store to the local file system.
hdfs dfs -copyToLocal <hdfs source> <local destination>
YARN
12/13/2022
Hadoop 1.0
Limitations of Hadoop 1.0
Need of YARN
YARN
What is YARN?
Workloads running on YARN
YARN Components
YARN - Resource Manager
YARN Architecture
Running an application in YARN
Step 1 - Application submitted to the Resource Manager
Step 2 - Resource Manager allocates Container
Step 3 - Application Master contacts Node Manager
Step 4 - Resource Manager Launches Container
Step 5 - Container Executes the Application Master
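The five steps above can be traced with a toy Python model. This is only an illustrative sketch, not the real YARN API: the class and method names (`ResourceManager.submit_application`, `NodeManager.launch_container`) are made up to mirror the step sequence on the slides.

```python
# Toy trace of the five YARN steps; the classes stand in for the real
# daemons and only record the sequence of interactions.
log = []

class NodeManager:
    def launch_container(self, app_master):
        log.append("step 5: container executes the ApplicationMaster")
        return app_master()

class ResourceManager:
    def __init__(self, node_managers):
        self.node_managers = node_managers

    def submit_application(self, app_master):
        log.append("step 1: application submitted to the ResourceManager")
        log.append("step 2: ResourceManager allocates a container")
        nm = self.node_managers[0]
        log.append("step 3: ApplicationMaster contacts the NodeManager")
        log.append("step 4: ResourceManager launches the container")
        return nm.launch_container(app_master)

def demo_app_master():
    """Stand-in for an ApplicationMaster body."""
    return "application finished"

rm = ResourceManager([NodeManager()])
result = rm.submit_application(demo_app_master)
assert result == "application finished"
assert len(log) == 5
```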
Introduction to HBase
HBase
Companies Using HBASE
HBASE Use Case
HBase Commands
hbase(main):001:0> list
TABLE
create '<table name>','<column family>'
hbase(main):002:0> create 'emp', 'personal data', 'professional data'
hbase(main):028:0> scan 'emp'
hbase(main):006:0> describe 'emp'
put '<table name>','<row>','<colfamily:colname>','<value>'
hbase(main):005:0> put 'emp','1','personal data:name','raju'
0 row(s) in 0.6600 seconds
get '<table name>','<row>'
hbase(main):012:0> get 'emp', '1'
delete '<table name>', '<row>', '<column name>', '<time stamp>'
hbase(main):006:0> delete 'emp', '1', 'personal data:city'
count '<table name>'
hbase(main):023:0> count 'emp'
disable '<table name>'
hbase(main):018:0> disable 'emp'
Introduction to MapReduce: Use cases
• At Google
• Index construction for Google Search (replaced in 2010 by Caffeine)
• Article clustering for Google News
• Statistical machine translation
• At Yahoo!
• “Web map” powering Yahoo! Search
• Spam detection for Yahoo! Mail
• At Facebook
• Data mining, Web log processing
• SearchBox (with Cassandra)
• Facebook Messages (with HBase)
• Ad optimization
• Spam detection
Introduction to MapReduce
• Problem: How to compute the PageRank for a crawled set of
websites on a cluster of machines?
• Main Challenges:
• How to break up a large problem into smaller tasks that can be executed
in parallel?
• How to assign tasks to machines?
• How to partition and distribute data?
• How to share intermediate results?
• How to coordinate synchronization, scheduling, fault-tolerance?
The solution is MapReduce:
• Algorithms that can be expressed as (or mapped to) a sequence of Map()
and Reduce() functions are automatically parallelized by the framework.
MapReduce Workflow
• Map Phase
• Raw data is read and converted to key/value pairs
• The Map() function is applied to each pair
• Shuffle and Sort Phase
• All key/value pairs are sorted and grouped by their keys
• Reduce Phase
• All values with the same key are processed within the same reduce() function
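The three phases above can be sketched with the classic word-count example, using plain Python to stand in for the Hadoop framework:

```python
# Word count expressed as Map -> Shuffle & Sort -> Reduce.
from itertools import groupby
from operator import itemgetter

def map_phase(records):
    """Map: each input (offset, line) pair yields (word, 1) pairs."""
    for _, line in records:
        for word in line.split():
            yield (word, 1)

def shuffle_sort(pairs):
    """Shuffle & sort: sort pairs by key and group all values per key."""
    for key, group in groupby(sorted(pairs, key=itemgetter(0)), key=itemgetter(0)):
        yield key, [v for _, v in group]

def reduce_phase(grouped):
    """Reduce: all values with the same key are summed together."""
    for key, values in grouped:
        yield key, sum(values)

records = [(0, "big data big ideas"), (1, "data wins")]
result = dict(reduce_phase(shuffle_sort(map_phase(records))))
assert result == {"big": 2, "data": 2, "ideas": 1, "wins": 1}
```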
MapReduce Programming Model
• Every MapReduce program must specify a Mapper and typically a Reducer
• The Mapper has a map() function that transforms input (key, value) pairs
into any number of intermediate (out_key, intermediate_value) pairs
• The Reducer has a reduce() function that transforms intermediate (out_key,
list(intermediate_value)) aggregates into any number of output (value’) pairs
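The Partitioner from the unit outline sits between Mapper and Reducer: it decides which Reducer receives each intermediate key. The sketch below mirrors the idea behind Hadoop's default hash partitioner (hash of key modulo number of reduce tasks); `crc32` is used here only because Python's built-in `hash()` is salted per run.

```python
# Hash partitioning: same key always goes to the same reducer.
from zlib import crc32

def partition(key, num_reducers):
    # Stable hash of the key, reduced modulo the number of reducers.
    return crc32(key.encode()) % num_reducers

pairs = [("apple", 1), ("banana", 1), ("apple", 1)]
buckets = {r: [] for r in range(2)}
for key, value in pairs:
    buckets[partition(key, 2)].append((key, value))

# All occurrences of "apple" must land in a single reducer's bucket.
apple_buckets = {r for r, kvs in buckets.items() for k, _ in kvs if k == "apple"}
assert len(apple_buckets) == 1
```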
MapReduce Programming Model
MapReduce Execution Details
MapReduce Data flow
Word Count Problem using MapReduce
Word Count Problem using MapReduce
References
1. Hadoop For Dummies, Special Edition, Robert D. Schneider, John Wiley & Sons, Inc.
2. Hadoop: The Definitive Guide, Tom White, 3rd edition, O’Reilly Media / Yahoo Press
3. Learning Spark: Lightning-Fast Big Data Analysis, Holden Karau, Andy Konwinski,
Patrick Wendell, and Matei Zaharia, O’Reilly Media, 2015
4. https://spark.apache.org/
5. https://hadoop.apache.org/
6. https://hadoop.apache.org/docs/r1.2.1/hdfs_design.pdf
7. https://www.knowledgehut.com/blog/big-data/5-best-data-processing-frameworks
8. https://opensourceforu.com/2018/03/a-quick-comparison-of-the-five-best-big-data-frameworks/
9. https://data-flair.training/blogs/hadoop-ecosystem-components/
10. http://dbis.informatik.uni-freiburg.de/content/courses/SS12/Praktikum/Datenbanken%20und%20Cloud%20Computing/slides/MapReduce.pdf
11. https://www.edureka.co/blog/mapreduce-tutorial/
References
1. Introduction to Analytics and Big Data - Hadoop, Thomas Rivera, Hitachi Data
Systems, SNIA Education, 2014
2. Hadoop: A Framework for Data-Intensive Distributed Computing, CS561 Spring
2012, WPI, Mohamed Y. Eltabakh