BDT Unit03.pptx

The document provides an overview of Big Data technologies, focusing on Hadoop and its components, including HDFS, YARN, and MapReduce. It discusses the challenges of Big Data such as information growth, processing power, and costs, as well as the advantages of using Hadoop for distributed data processing. Additionally, it explains the architecture and mechanisms of HDFS, including read and write processes, data replication, and the role of NameNode and DataNodes.

CET4001B Big Data Technologies

Unit-III-Introduction to Hadoop:
• What is Hadoop? Hadoop Overview, Processing Data with Hadoop
• HDFS: Introduction, Concepts, Design, HDFS Interfaces, HDFS Read Architecture, HDFS Write Architecture
• Managing Resources and Applications with Hadoop YARN (Yet Another Resource Negotiator)
• Introduction to MapReduce Programming: Mapper, Reducer, Combiner, Partitioner, Searching, Sorting

Challenges of Big Data

The term Big Data refers to extremely large volumes of data, and big data
analytics is used to find hidden patterns that are useful for predicting
future marketing strategies, offers, customer retention, promotional content,
pricing strategies, etc.

Challenges of Big Data:

1. Information growth
2. Processing power
3. Physical storage
4. Data issues
5. Costs
Challenges of Big Data
• Information Growth: Over 80 percent of the data in the enterprise consists of
unstructured data, which tends to be growing at a much faster pace than traditional
relational information.
• Processing power: The customary approach of using a single, expensive, powerful
computer to crunch information just doesn’t scale for Big Data.
• Physical storage: Capturing and managing all this information can consume
enormous resources, outstripping all budgetary expectations.
• Data issues: Lack of data mobility, proprietary formats, and interoperability obstacles
can all make working with Big Data complicated.
• Costs: Extract, transform, and load (ETL) processes for Big Data can be expensive and
time-consuming, particularly in the absence of specialized, well-designed software.

This large amount of data needs more robust computing software and a distributed
data-processing framework.
Big Data Processing Frameworks
1. Hadoop: This is an open-source batch processing framework that can be used for the
distributed storage and processing of big data sets.
2. Apache Spark: Apache Spark is a general-purpose, lightning-fast cluster computing
system. It is a batch processing framework that also has stream processing capability,
making it a hybrid framework.
3. Apache Storm: Apache Storm is a stream processing framework that focuses on extremely
low latency and is perhaps the best option for workloads that require near-real-time
processing.
4. Apache Samza: Apache Samza is another open-source framework that offers a
near-real-time, asynchronous framework for distributed stream processing.

Why Hadoop
• Distributed cluster system
• Platform for massively scalable applications
• Enables parallel data processing

Hadoop Features
• Hadoop provides access to the file systems
• The Hadoop Common package contains the necessary JAR files
and scripts
• The package also provides source code, documentation and a
contribution section that includes projects from the Hadoop
Community.

Hadoop : Big Data Processing Framework
• Hadoop is a well-adopted, standards-based, open-source
software framework built on the foundation of Google’s
MapReduce and Google File System papers.

• It’s meant to leverage the power of massively parallel processing to take
advantage of Big Data, generally by using lots of inexpensive commodity servers.

• The Hadoop software library is a framework that allows for the distributed
processing of large data sets across clusters of computers using simple
programming models.
Major Component of Hadoop

• Hadoop Common: The common utilities that support the other Hadoop
modules. A set of components and interfaces for distributed filesystems
and general I/O (serialization, Java RPC, persistent data structures).

• Hadoop Distributed File System (HDFS): A distributed file system that
provides high-throughput access to application data.

• Hadoop YARN (Yet Another Resource Negotiator): A framework for job
scheduling and cluster resource management.

• Hadoop MapReduce: A YARN-based system for parallel processing of
large data sets.
Hadoop Ecosystem (Zoo)

Original Google Stack

Facebook's Version of the Stack

Yahoo's Version of the Stack

LinkedIn's Version of the Stack

Cloudera's Version of the Stack

Hadoop Ecosystem
Hadoop Ecosystem Components

HDFS
• A distributed filesystem that runs on large clusters of commodity machines.

MapReduce
• A distributed data processing model and execution environment that runs on
large clusters of commodity machines.

Pig
• A data flow language and execution environment for exploring very large datasets.

Hive
• A distributed data warehouse with a query language based on SQL (translated by
the runtime engine to MapReduce jobs) for querying the data.

HBase
• A distributed, column-oriented database.

ZooKeeper
• A distributed, highly available coordination service.

Sqoop
• A tool for efficiently moving data between relational databases and HDFS.
HDFS

• Distributed file system
• Traditional hierarchical file organization
• Single namespace for the entire cluster
• Write-once-read-many access model
• Aware of the network topology
HDFS
Hadoop Distributed File System:
• It is a distributed file system designed to run on commodity hardware.
• It is highly fault-tolerant.

▪ Features of HDFS
▪ It is suitable for distributed storage and processing.
▪ Hadoop provides a command interface to interact with HDFS.
▪ The built-in servers of the NameNode and DataNodes help users easily check the
status of the cluster.
▪ Streaming access to file system data.
▪ HDFS provides file permissions and authentication.
HDFS
▪ Assumptions and Goals
▪ Hardware Failure
▪ Streaming Data Access
▪ Large Data Sets
▪ Simple Coherency Model
▪ Moving Computation is cheaper than Moving data
▪ Portability across Heterogeneous Hardware and software platforms

HDFS Cluster Architecture

HDFS Data Blocks

DataNode Failure

Replication
• The default replication factor is 3.

Rack Awareness in HDFS

Hadoop: How it Works
HDFS Architecture
Hadoop Distributed File System:
Apache Hadoop HDFS follows a master/slave architecture, where a cluster
comprises a single NameNode (master node) and all the other nodes are
DataNodes (slave nodes).
Hadoop Architecture

• Distributed file system (HDFS)
• Execution engine (MapReduce)
• Master node (single node)
• Many slave nodes
Hadoop Distributed File System (HDFS)

• Centralized NameNode: maintains metadata info about files.
• Files are divided into blocks (64 MB); e.g., File F is split into blocks 1, 2, 3, 4, 5.
• Many DataNodes (1000s): store the actual data; each block is replicated N times
(default N = 3).
HDFS Architecture
Hadoop Distributed File System:

HDFS Read Mechanism
Client Request:

A client application requests to read a file from HDFS by interacting with the Hadoop Distributed File System (HDFS). The request is sent to the NameNode, which manages the file system metadata.
Block Information Retrieval:

The NameNode identifies the file's location by checking its metadata. It breaks the file into blocks and returns the block locations (list of DataNodes that store each block) to the client.
Client Contacts DataNodes:

The client connects directly to the DataNodes that contain the blocks, not to the NameNode. The client retrieves blocks in parallel from the closest DataNodes based on the block information provided.
Data Transfer:

The DataNodes send the requested data blocks to the client. Each DataNode streams its respective block to the client, ensuring efficient data transfer through a pipeline mechanism.
Reconstruct the File:

The client reconstructs the original file from the blocks received from the DataNodes and processes or stores the file as needed.
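
To make these steps concrete, here is a minimal, hedged sketch of an HDFS read using the Java FileSystem API (not the deck's own code: the path /data/input.txt is made up, and cluster settings are assumed to come from core-site.xml/hdfs-site.xml on the classpath):

// Sketch: reading a file from HDFS with the Java FileSystem API.
// open() asks the NameNode for block locations (steps 1-2 above);
// the returned stream then reads blocks directly from DataNodes (steps 3-5).
import java.io.BufferedReader;
import java.io.InputStreamReader;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsRead {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();   // loads core-site.xml / hdfs-site.xml
        FileSystem fs = FileSystem.get(conf);
        try (FSDataInputStream in = fs.open(new Path("/data/input.txt"));
             BufferedReader reader = new BufferedReader(new InputStreamReader(in))) {
            String line;
            while ((line = reader.readLine()) != null) {
                System.out.println(line);           // data streamed from the closest replicas
            }
        }
    }
}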

HDFS Write Mechanism
Client Request to Write:

A client requests to write a file to HDFS. The request goes to the NameNode, which checks the metadata to ensure that the file doesn't already exist (HDFS is write-once).

Block Allocation:

The NameNode splits the file into blocks (typically 128 MB, or 256 MB in newer versions) and determines where to store each block. The DataNodes that will store the replicas are also chosen at this point.

Replication Pipeline Setup:

The client sends the first block of data to the first chosen DataNode. This DataNode stores the block and forwards it to the next DataNode in the replication pipeline until the required number of replicas (default is 3) is satisfied.

Block Acknowledgement:

Once a block is replicated to all assigned DataNodes, each DataNode sends an acknowledgment back to the previous node, and eventually the client receives confirmation that the block has been successfully written.

File Commit:

After all blocks are written and replicated, the client informs the NameNode that the file write operation is complete. The NameNode updates the file system metadata, and the file becomes available for reading.
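
As a hedged counterpart to the read sketch earlier (again with a made-up path and configuration assumed from the standard XML files), the write flow maps onto the Java FileSystem API like this:

// Sketch: writing a file to HDFS with the Java FileSystem API.
// create() registers the file with the NameNode; the returned stream
// pushes packets down the DataNode replication pipeline, and close()
// completes the file commit described above.
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsWrite {
    public static void main(String[] args) throws Exception {
        FileSystem fs = FileSystem.get(new Configuration());
        try (FSDataOutputStream out = fs.create(new Path("/data/output.txt"))) {
            out.writeBytes("hello hdfs\n");
        } // close() tells the NameNode the file is complete
    }
}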
Advantages of HDFS
HDFS Architecture: Component

• NameNode:
• An HDFS cluster consists of a single NameNode, a master server that manages the file
system namespace and regulates access to files by clients.
• The NameNode executes file system namespace operations like opening, closing, and
renaming files and directories.
• It also determines the mapping of blocks to DataNodes

• DataNode:
• There are a number of DataNodes, usually one per node in the cluster, which manage
storage attached to the nodes that they run on.
• HDFS exposes a file system namespace and allows user data to be stored in files.
• Internally, a file is split into one or more blocks and these blocks are stored in a set of
DataNodes
• The DataNodes are responsible for serving read and write requests from the file
system’s clients.
• The DataNodes also perform block creation, deletion, and replication upon instruction
from the NameNode.

HDFS Architecture: Component
• The File System Namespace:
• HDFS supports a traditional hierarchical file organization.
• The NameNode maintains the file system namespace.
• Any change to the file system namespace or its properties is recorded by
the NameNode.
• An application can specify the number of replicas of a file that should be
maintained by HDFS.
• The number of copies of a file is called the replication factor of that file.
This information is stored by the NameNode.

HDFS Architecture: Component
• Data Replication :
• HDFS is designed to reliably store very large files across machines in a large cluster.
• It stores each file as a sequence of blocks; all blocks in a file except the last block are the
same size.
• The blocks of a file are replicated for fault tolerance.
• The block size and replication factor are configurable per file
• An application can specify the number of replicas of a file.
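
As an illustrative sketch of this per-file control (the path and factor below are made-up values), the replication factor can be changed through the Java FileSystem API; the shell equivalent is hdfs dfs -setrep:

// Sketch: asking the NameNode to change a file's replication factor.
// Re-replication (or deletion of excess replicas) then happens in the background.
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class SetReplication {
    public static void main(String[] args) throws Exception {
        FileSystem fs = FileSystem.get(new Configuration());
        fs.setReplication(new Path("/data/input.txt"), (short) 2);
    }
}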

HDFS Architecture: Component
Data Replication(contd)
• The replication factor can be specified at file creation time and can be changed later.
• Files in HDFS are write-once and have strictly one writer at any time.
• The NameNode makes all decisions regarding replication of blocks.
• It periodically receives a Heartbeat(message to indicate availability) and a Blockreport from
each of the DataNodes in the cluster.
• Receipt of a Heartbeat implies that the DataNode is functioning properly. A Blockreport
contains a list of all blocks on a DataNode.
1. Replica Placement: The First Baby Steps
• The purpose of a rack-aware replica placement policy is to improve data reliability,
availability, and network bandwidth utilization.
• Replica Selection:
• to minimize global bandwidth consumption and read latency, HDFS tries to satisfy
a read request from a replica that is closest to the reader.
• If there exists a replica on the same rack as the reader node, then that replica is
preferred to satisfy the read request. If an HDFS cluster spans multiple data
centers, then a replica that is resident in the local data center is preferred over any
remote replica.


HDFS Architecture: Component


2. Safemode
• On startup, the NameNode enters a special state called Safemode.
• Replication of data blocks does not occur when the NameNode is in the Safemode
state.
• The NameNode receives Heartbeat and Blockreport messages from the
DataNodes.
• A Blockreport contains the list of data blocks that a DataNode is hosting.
• Each block has a specified minimum number of replicas.
• A block is considered safely replicated when the minimum number of replicas of
that data block has checked in with the NameNode.
• After a configurable percentage of safely replicated data blocks checks in with the
NameNode (plus an additional 30 seconds), the NameNode exits the Safemode
state.
• It then determines the list of data blocks (if any) that still have fewer than the
specified number of replicas.
• The NameNode then replicates these blocks to other DataNodes.

HDFS Architecture: Component
• The Persistence of File System Metadata :
• The HDFS namespace is stored by the NameNode.
• The NameNode uses a transaction log called the EditLog to persistently record
every change that occurs to file system metadata.
• For example, creating a new file in HDFS causes the NameNode to insert a record
into the EditLog indicating this.
• The NameNode uses a file in its local host OS file system to store the EditLog.
The entire file system namespace, including the mapping of blocks to files and
file system properties, is stored in a file called the FsImage.
• The FsImage is stored as a file in the NameNode’s local file system too.
• When a DataNode starts up, it scans through its local file system, generates a list
of all HDFS data blocks that correspond to each of these local files and sends this
report to the NameNode: this is the Blockreport.

• The Communication Protocols:


• All HDFS communication protocols are layered on top of the TCP/IP protocol.
HDFS Architecture: Component
• Robustness:
• The primary objective of HDFS is to store data reliably even in the presence of
failures.

• Data Disk Failure, Heartbeats and Re-Replication


• The NameNode constantly tracks which blocks need to be replicated and initiates
replication whenever necessary.
• The necessity for re-replication may arise due to many reasons: a DataNode may
become unavailable, a replica may become corrupted, a hard disk on a DataNode
may fail, or the replication factor of a file may be increased.

• Cluster Rebalancing
• The HDFS architecture is compatible with data rebalancing schemes. A scheme
might automatically move data from one DataNode to another if the free space
on a DataNode falls below a certain threshold.

HDFS Architecture: Component
• Data Integrity
• It is possible that a block of data fetched from a DataNode arrives corrupted.
• The HDFS client software implements checksum checking on the contents of HDFS
files.

• Metadata Disk Failure:


• The FsImage and the EditLog are central data structures of HDFS. A corruption of these
files can cause the HDFS instance to be non-functional.
• For this reason, the NameNode can be configured to support maintaining multiple copies
of the FsImage and EditLog. Any update to either the FsImage or EditLog causes each of
the FsImages and EditLogs to get updated synchronously.

• Snapshots
• Snapshots support storing a copy of data at a particular instant of time. One usage of the
snapshot feature may be to roll back a corrupted HDFS instance to a previously known good
point in time. (Early HDFS releases did not support snapshots; modern releases do, via
hdfs dfsadmin -allowSnapshot and hdfs dfs -createSnapshot.)

HDFS Architecture: Component
• Data Organization
1. Data Blocks:
• HDFS is designed to support very large files.
• Applications that are compatible with HDFS are those that deal with large data sets.
• These applications write their data only once but they read it one or more times and require these
reads to be satisfied at streaming speeds.
• HDFS supports write-once-read-many semantics on files.
• A typical block size used by HDFS is 64 MB (128 MB in current releases).
• Thus, an HDFS file is chopped up into block-sized chunks, and if possible, each chunk will reside on a
different DataNode (see the worked example after this list).
2. Staging:
• A client request to create a file does not reach the NameNode immediately
• If a client writes to a remote file directly without any client side buffering, the network
speed and the congestion in the network impacts throughput considerably.
• This approach is not without precedent. Earlier distributed file systems, e.g. AFS, have
used client side caching to improve performance. A POSIX requirement has been relaxed
to achieve higher performance of data uploads.
3. Replication Pipelining
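
As a quick worked example of the block arithmetic in item 1: at a 64 MB block size, a 200 MB file is stored as four blocks of 64 + 64 + 64 + 8 MB (the final partial block occupies only its actual length on disk), and with the default replication factor of 3 it consumes 3 × 200 MB = 600 MB of raw cluster capacity.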

HDFS Commands
ls: This command is used to list all the files.
hdfs dfs -ls

mkdir: To create a directory. In Hadoop dfs there is no home directory by default.

hdfs dfs -mkdir <folder name>

touchz: It creates an empty file.

hdfs dfs -touchz <file_path>

copyFromLocal (or) put: To copy files/folders from local file system to hdfs store

hdfs dfs -copyFromLocal <local file path> <dest(present on hdfs)>

hdfs dfs -put ../Desktop/AI.txt /folder

copyToLocal (or) get: To copy files/folders from hdfs store to local file system.

Syntax:

hdfs dfs -copyToLocal <srcfile(on hdfs)> <local file dest>

hdfs dfs -get /geeks/myfile.txt ../Desktop/hero

cat: To print file contents.

hdfs dfs -cat <path>

YARN

Hadoop 1.0

Limitations of Hadoop 1.0

Need for YARN

What is YARN?

Workloads running on YARN

YARN Components

YARN - Resource Manager

YARN Architecture

Running an application in YARN

Step 1 - Application submitted to the Resource Manager

Step 2 - Resource Manager allocates a Container

Step 3 - Application Master contacts the Node Manager

Step 4 - Resource Manager launches the Container

Step 5 - Container executes the Application Master
Introduction to HBase

Companies Using HBase

HBase Use Case
HBase Commands

hbase(main):001:0> list
TABLE

create '<table name>','<column family>'
hbase(main):002:0> create 'emp', 'personal data', 'professional data'

hbase(main):028:0> scan 'emp'
hbase(main):006:0> describe 'emp'

put '<table name>','row1','<colfamily:colname>','<value>'
hbase(main):005:0> put 'emp','1','personal data:name','raju'
0 row(s) in 0.6600 seconds

get '<table name>','row1'
hbase(main):012:0> get 'emp', '1'

delete '<table name>', '<row>', '<column name>', '<time stamp>'
hbase(main):006:0> delete 'emp', '1', 'personal data:city'

count '<table name>'
hbase(main):023:0> count 'emp'

hbase(main):018:0> disable 'emp'
hbase(main):019:0> drop 'emp'
0 row(s) in 0.3060 seconds
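
As a hedged companion to the shell session above, the same put/get pair might look like this via the HBase Java client (a sketch only: connection settings are assumed to come from hbase-site.xml, and the table/values mirror the shell example):

// Sketch: 'put' and 'get' on the emp table via the HBase Java client API.
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.*;
import org.apache.hadoop.hbase.util.Bytes;

public class EmpExample {
    public static void main(String[] args) throws Exception {
        Configuration conf = HBaseConfiguration.create();   // reads hbase-site.xml
        try (Connection conn = ConnectionFactory.createConnection(conf);
             Table table = conn.getTable(TableName.valueOf("emp"))) {
            // put 'emp','1','personal data:name','raju'
            Put put = new Put(Bytes.toBytes("1"));
            put.addColumn(Bytes.toBytes("personal data"), Bytes.toBytes("name"),
                          Bytes.toBytes("raju"));
            table.put(put);
            // get 'emp','1'
            Result result = table.get(new Get(Bytes.toBytes("1")));
            System.out.println(Bytes.toString(
                result.getValue(Bytes.toBytes("personal data"), Bytes.toBytes("name"))));
        }
    }
}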
Introduction to MapReduce
• Traditional techniques for working with information simply don’t scale to
Big Data: they’re too costly, time-consuming, and complex, which is where
MapReduce comes in.
• MapReduce is:
• a programming framework created and successfully deployed by Google
• it uses the divide-and-conquer method (and lots of commodity servers)
• it breaks down complex Big Data problems into small units of work, and then
processes them in parallel.
• MapReduce is a programming framework that allows us to perform
distributed and parallel processing on large data sets in a distributed
environment.
• E.g., processing 100 TB of data:
• On 1 node → scanning @50 MB/s = 23 days
• On a 1000-node cluster → scanning @50 MB/s = 33 min
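
Checking the arithmetic: 100 TB ÷ 50 MB/s = 2 × 10^6 seconds ≈ 23 days on a single node; divided across 1,000 nodes, the same scan takes about 2,000 seconds ≈ 33 minutes.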

Introduction to MapReduce: Use cases
• At Google
• Index construction for Google Search (replaced in 2010 by Caffeine)
• Article clustering for Google News
• Statistical machine translation
• At Yahoo!
• “Web map” powering Yahoo! Search
• Spam detection for Yahoo! Mail
• At Facebook
• Data mining, Web log processing
• SearchBox (with Cassandra)
• Facebook Messages (with HBase)
• Ad optimization
• Spam detection

Introduction to MapReduce
• Problem: How to compute the PageRank for a crawled set of
websites on a cluster of machines?
• Main Challenges:
• How to break up a large problem into smaller tasks, that can be executed
in parallel?
• How to assign tasks to machines?
• How to partition and distribute data?
• How to share intermediate results?
• How to coordinate synchronization, scheduling, fault-tolerance?
Solution is MapReduce:
• Algorithms that can be expressed as (or mapped to) a sequence of Map()
and Reduce() functions are automatically parallelized by the framework.

MapReduce Workflow
• Map Phase
• Raw data read and converted to key/value pairs
• Map() function applied to any pair
• Shuffle and Sort Phase
• All key/value pairs are sorted and grouped by their keys
• Reduce Phase
• All values with the same key are processed within the same reduce() function
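
For instance, map output pairs (the, 1), (cat, 1), (the, 1) are grouped during shuffle and sort into (the, [1, 1]) and (cat, [1]), and each group is handed to a single reduce() call.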

MapReduce Programming Model
• Every MapReduce program must specify a Mapper and typically a Reducer.
• The Mapper has a map() function that transforms input (key, value) pairs
into any number of intermediate (out_key, intermediate_value) pairs.
• The Reducer has a reduce() function that transforms each intermediate (out_key,
list(intermediate_value)) aggregate into any number of output values.
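
A minimal sketch of this model in Java, using the classic word-count Mapper and Reducer against the standard org.apache.hadoop.mapreduce API (class names are illustrative):

// map(): (offset, line) -> (word, 1) for every word in the line.
import java.io.IOException;
import java.util.StringTokenizer;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

class TokenizerMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
    private static final IntWritable ONE = new IntWritable(1);
    private final Text word = new Text();

    @Override
    protected void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException {
        StringTokenizer itr = new StringTokenizer(value.toString());
        while (itr.hasMoreTokens()) {
            word.set(itr.nextToken());
            context.write(word, ONE);   // emit (word, 1)
        }
    }
}

// reduce(): (word, [1, 1, ...]) -> (word, total count).
class IntSumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
    @Override
    protected void reduce(Text key, Iterable<IntWritable> values, Context context)
            throws IOException, InterruptedException {
        int sum = 0;
        for (IntWritable v : values) sum += v.get();
        context.write(key, new IntWritable(sum));
    }
}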
MapReduce Execution Details

• The complete execution process (execution of both Map and Reduce tasks) is
controlled by two types of entities:
• JobTracker: acts like a master (responsible for complete execution of the submitted job)
• Multiple TaskTrackers: act like slaves, each of them performing part of the job
• For every job submitted for execution in the system, there is one JobTracker
that resides on the NameNode, and there are multiple TaskTrackers which reside
on DataNodes.
MapReduce Data flow

Word Count Problem using MapReduce

• Problem: Given a document, we want to count the occurrences of each word.
• Input: Document with words.
• Output: List of words and their occurrence counts, e.g.
"Infrastructure" 12, "the" 259, ...
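
A hedged sketch of the corresponding driver, wiring the TokenizerMapper and IntSumReducer defined earlier into a Job (the same Reducer can also serve as a Combiner, since summing is associative; input/output paths come from the command line):

// Driver sketch for the word-count job.
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCount {
    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "word count");
        job.setJarByClass(WordCount.class);
        job.setMapperClass(TokenizerMapper.class);
        job.setCombinerClass(IntSumReducer.class);   // local pre-aggregation on map output
        job.setReducerClass(IntSumReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}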
