BDT Unit03.pptx
Unit-III-Introduction to Hadoop:
• What is Hadoop? Hadoop Overview, Processing Data with Hadoop,
• HDFS - Introduction, Concepts, Design, HDFS interfaces, HDFS Read
• Architecture, HDFS Write Architecture,
• Managing Resources and Applications with Hadoop YARN (Yet Another
Resource Negotiator),
• Introduction to MapReduce Programming: Mapper, Reducer, Combiner,
Partitioner, Searching, Sorting
UNIT III
Challenges of Big Data
The term Big Data refers to very large volumes of data, and big
data analytics is used to find the hidden patterns in that data which are
useful for predicting future marketing strategies, offers,
customer retention, promotional content, pricing strategies,
etc.
UNIT III 5
Why Hadoop
• Distributed cluster system
• Platform for massively scalable applications
• Enables parallel data processing
Hadoop Features
• Hadoop provides access to the file systems
• The Hadoop Common package contains the necessary JAR files
and scripts
• The package also provides source code, documentation and a
contribution section that includes projects from the Hadoop
Community.
Hadoop : Big Data Processing Framework
• Hadoop is a well-adopted, standards-based, open-source
software framework built on the foundation of Google’s
MapReduce and Google File System papers.
Hadoop Major Component
Major Component of Hadoop
• Hadoop Common: The common utilities that support the other Hadoop
modules. A set of components and interfaces for distributed filesystems
and general I/O (serialization, Java RPC, persistent data structures).
Hadoop Ecosystem(Zoo)
Original Google Stack
Facebook’s Version of the Stack
Yahoo’s Version of the Stack
LinkedIn’s Version of the Stack
Cloudera’s Version of the Stack
Hadoop Ecosystem
HDFS
• A distributed filesystem that runs on large clusters of
commodity machines.
MapReduce
• A distributed data processing model and execution
environment that runs on large clusters of commodity
machines.
Pig
• A dataflow language and execution environment for exploring very large
datasets.
Hive
• A distributed data warehouse; it manages data stored in HDFS and provides a
query language based on SQL.
ZooKeeper
• A distributed, highly available coordination service.
Sqoop
• A tool for efficiently transferring bulk data between relational databases
and HDFS.
HDFS
Hadoop Distributed File System:
• It is a distributed file system designed to run on commodity hardware
• It is highly fault tolerant
▪ Features of HDFS
▪ It is suitable for the distributed storage and processing.
▪ Hadoop provides a command interface to interact with HDFS.
▪ The built-in servers of namenode and datanode help users to easily check the
status of cluster.
▪ Streaming access to file system data.
▪ HDFS provides file permissions and authentication.
HDFS
▪ Assumptions and Goals
▪ Hardware Failure
▪ Streaming Data Access
▪ Large Data Sets
▪ Simple Coherency Model
▪ Moving Computation is cheaper than Moving data
▪ Portability across Heterogeneous Hardware and software platforms
HDFS
HDFS Cluster Architecture
HDFS Data Blocks
Replication
• The default replication factor is 3
Rack awareness in HDFS
Hadoop: How it Works
HDFS Architecture
Hadoop Distributed File System:
Apache Hadoop HDFS follows a master/slave architecture, where a
cluster comprises a single NameNode (master node) and all the other nodes are
DataNodes (slave nodes).
Hadoop Architecture
Hadoop Distributed File System (HDFS)
Centralized NameNode
- Maintains metadata info about files
(Figure: a file F split into numbered blocks 1–5, distributed across DataNodes)
HDFS Architecture
Hadoop Distributed File System:
HDFS Read Mechanism
Client Request:
A client application requests to read a file from HDFS. The request is sent to the NameNode, which manages the file system metadata.
Block Information Retrieval:
The NameNode identifies the file's location by checking its metadata. It breaks the file into blocks and returns the block locations (list of DataNodes that store each block) to the client.
Client Contacts DataNodes:
The client connects directly to the DataNodes that contain the blocks, not to the NameNode. The client retrieves blocks in parallel from the closest DataNodes based on the block information provided.
Data Transfer:
The DataNodes send the requested data blocks to the client. Each DataNode streams its respective block to the client, ensuring efficient data transfer through a pipeline mechanism.
Reconstruct the File:
The client reconstructs the original file from the blocks received from the DataNodes and processes or stores the file as needed.
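The read path above can be sketched as a toy Python simulation. This is not the real Hadoop API: the `NameNode`/`DataNode` classes and the tiny 4-byte block size are assumptions for illustration, but the division of labor (NameNode holds only metadata; the client pulls blocks directly from DataNodes and reassembles them) matches the steps above.

```python
# Toy simulation of the HDFS read path.
BLOCK_SIZE = 4  # bytes; tiny for illustration (real HDFS uses 64/128 MB)

class DataNode:
    def __init__(self, name):
        self.name = name
        self.blocks = {}          # block_id -> bytes

    def read_block(self, block_id):
        return self.blocks[block_id]

class NameNode:
    def __init__(self):
        self.metadata = {}        # filename -> [(block_id, [DataNodes])]

    def get_block_locations(self, filename):
        return self.metadata[filename]

def write_file(namenode, datanodes, filename, data):
    """Split data into blocks and place each block on every DataNode
    (replication simplified to 'everywhere' for brevity)."""
    blocks = []
    for i in range(0, len(data), BLOCK_SIZE):
        block_id = f"{filename}_blk{i // BLOCK_SIZE}"
        for dn in datanodes:
            dn.blocks[block_id] = data[i:i + BLOCK_SIZE]
        blocks.append((block_id, list(datanodes)))
    namenode.metadata[filename] = blocks

def read_file(namenode, filename):
    """Client: ask the NameNode for locations, then pull each block
    directly from a DataNode and reconstruct the file."""
    parts = []
    for block_id, locations in namenode.get_block_locations(filename):
        parts.append(locations[0].read_block(block_id))  # first = closest replica
    return b"".join(parts)

nn = NameNode()
dns = [DataNode("dn1"), DataNode("dn2")]
write_file(nn, dns, "demo.txt", b"hello hdfs!")
assert read_file(nn, "demo.txt") == b"hello hdfs!"
```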
HDFS Write Mechanism
Hadoop Distributed File System(Write Mechanism):
Client Request to Write:
A client requests to write a file to HDFS. The request goes to the NameNode, which checks the metadata to ensure that the file doesn't already exist (HDFS is write-once).
Block Allocation:
The NameNode allocates a list of DataNodes for each block. The client sends the first block of data to the first chosen DataNode; this DataNode stores the block and forwards it to the next DataNode in the replication pipeline until the required number of replicas (default is 3) is satisfied.
Block Acknowledgement:
Once a block is replicated to all assigned DataNodes, each DataNode sends an acknowledgment back to the previous node, and eventually, the client receives confirmation that the block has been successfully written.
Write Completion:
After all blocks are written and replicated, the client informs the NameNode that the file write operation is complete. The NameNode updates the file system metadata, and the file becomes available for reading.
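The replication pipeline described above can be sketched as a toy Python model (not the real Hadoop client; plain lists stand in for DataNode storage, and the ack check is simplified):

```python
# Toy write pipeline: the block is forwarded node-to-node until the
# replication factor is met; acks flow back from the last node.
def pipeline_write(block, datanodes, replication=3):
    """Return True once `replication` DataNodes have stored the block
    and acknowledged back up the pipeline."""
    chain = datanodes[:replication]
    for dn in chain:                       # data forwarded down the chain
        dn.append(block)
    acks = [dn[-1] == block for dn in reversed(chain)]  # acks in reverse
    return all(acks) and len(chain) == replication

dn1, dn2, dn3, dn4 = [], [], [], []
ok = pipeline_write("blk_001", [dn1, dn2, dn3, dn4])
assert ok
assert dn1 == dn2 == dn3 == ["blk_001"]
assert dn4 == []   # the fourth node is not part of the pipeline
```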
Advantages of HDFS
Hadoop Distributed File System:
HDFS Architecture: Component
• NameNode:
• HDFS cluster consists of a single NameNode, a master server that manages the file
system namespace and regulates access to files by clients.
• The NameNode executes file system namespace operations like opening, closing, and
renaming files and directories.
• It also determines the mapping of blocks to DataNodes
• DataNode:
• There are a number of DataNodes, usually one per node in the cluster, which manage
storage attached to the nodes that they run on.
• HDFS exposes a file system namespace and allows user data to be stored in files.
• Internally, a file is split into one or more blocks and these blocks are stored in a set of
DataNodes
• The DataNodes are responsible for serving read and write requests from the file
system’s clients.
• The DataNodes also perform block creation, deletion, and replication upon instruction
from the NameNode.
HDFS Architecture: Component
• The File System Namespace:
• HDFS supports a traditional hierarchical file organization.
• The NameNode maintains the file system namespace.
• Any change to the file system namespace or its properties is recorded by
the NameNode.
• An application can specify the number of replicas of a file that should be
maintained by HDFS.
• The number of copies of a file is called the replication factor of that file.
This information is stored by the NameNode.
HDFS Architecture: Component
• Data Replication :
• HDFS is designed to reliably store very large files across machines in a large cluster.
• It stores each file as a sequence of blocks; all blocks in a file except the last block are the
same size.
• The blocks of a file are replicated for fault tolerance.
• The block size and replication factor are configurable per file
• An application can specify the number of replicas of a file.
HDFS Architecture: Component
• Data Replication (contd.)
• The replication factor can be specified at file creation time and can be changed later.
• Files in HDFS are write-once and have strictly one writer at any time.
• The NameNode makes all decisions regarding replication of blocks.
• It periodically receives a Heartbeat (a message indicating availability) and a Blockreport from
each of the DataNodes in the cluster.
• Receipt of a Heartbeat implies that the DataNode is functioning properly. A Blockreport
contains a list of all blocks on a DataNode.
1. Replica Placement: The First Baby Steps
• The purpose of a rack-aware replica placement policy is to improve data reliability,
availability, and network bandwidth utilization.
• Replica Selection:
• to minimize global bandwidth consumption and read latency, HDFS tries to satisfy
a read request from a replica that is closest to the reader.
• If there exists a replica on the same rack as the reader node, then that replica is
preferred to satisfy the read request. If an HDFS cluster spans multiple data
centers, then a replica that is resident in the local data center is preferred over any
remote replica.
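The replica-selection rule above (same rack, then local data center, then remote) can be sketched as a distance function; the rack and data-center names below are made up for illustration:

```python
# Replica selection: prefer same rack, then same data center.
def select_replica(replicas, reader_rack, reader_dc):
    """replicas: list of (datanode, rack, datacenter) tuples."""
    def distance(rep):
        _, rack, dc = rep
        if rack == reader_rack:
            return 0          # same rack: cheapest read
        if dc == reader_dc:
            return 1          # same data center
        return 2              # remote data center
    return min(replicas, key=distance)

replicas = [("dnA", "rack1", "dc1"),
            ("dnB", "rack2", "dc1"),
            ("dnC", "rack7", "dc2")]
assert select_replica(replicas, "rack2", "dc1")[0] == "dnB"  # same rack wins
assert select_replica(replicas, "rack9", "dc2")[0] == "dnC"  # local DC wins
```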
Safemode
On startup, the NameNode enters Safemode, where data block replication is paused.
The NameNode receives Heartbeat and Blockreport messages from DataNodes, indicating the blocks they host.
A block is safely replicated when its minimum replica count is met, and after a configurable percentage of blocks confirm this, the NameNode exits Safemode.
The NameNode identifies and replicates any blocks that have fewer than the required replicas.
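The Safemode exit condition above can be sketched as a simple check; the 0.999 default threshold is an assumption here, mirroring HDFS's configurable safemode threshold percentage:

```python
# Safemode exit: enough blocks must have met their minimum replica count.
def can_exit_safemode(block_replica_counts, min_replicas=1, threshold=0.999):
    """block_replica_counts: block_id -> number of replicas reported
    via Blockreports. Returns True when the configured fraction of
    blocks is safely replicated."""
    if not block_replica_counts:
        return True
    safe = sum(1 for n in block_replica_counts.values() if n >= min_replicas)
    return safe / len(block_replica_counts) >= threshold

assert can_exit_safemode({"b1": 3, "b2": 2}, threshold=0.9)          # all safe
assert not can_exit_safemode({"b1": 3, "b2": 0}, threshold=0.9)      # b2 unreported
```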
HDFS Architecture: Component
• The Persistence of File System Metadata :
• The HDFS namespace is stored by the NameNode.
• The NameNode uses a transaction log called the EditLog to persistently record
every change that occurs to file system metadata.
• For example, creating a new file in HDFS causes the NameNode to insert a record
into the EditLog indicating this.
• The NameNode uses a file in its local host OS file system to store the EditLog.
The entire file system namespace, including the mapping of blocks to files and
file system properties, is stored in a file called the FsImage.
• The FsImage is stored as a file in the NameNode’s local file system too.
• When a DataNode starts up, it scans through its local file system, generates a list
of all HDFS data blocks that correspond to each of these local files and sends this
report to the NameNode: this is the Blockreport.
HDFS Architecture: Component
• Robustness:
• The primary objective of HDFS is to store data reliably even in the presence of
failures.
• Cluster Rebalancing
• The HDFS architecture is compatible with data rebalancing schemes. A scheme
might automatically move data from one DataNode to another if the free space
on a DataNode falls below a certain threshold.
HDFS Architecture: Component
• Data Integrity
• It is possible that a block of data fetched from a DataNode arrives corrupted.
• The HDFS client software implements checksum checking on the contents of HDFS
files.
• Snapshots
• Snapshots support storing a copy of data at a particular instant of time. One usage of the
snapshot feature may be to roll back a corrupted HDFS instance to a previously known good
point in time. HDFS does not currently support snapshots but will in a future release.
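The data-integrity idea above (checksum on write, verify on read, fall back to another replica on mismatch) can be sketched as follows; real HDFS uses CRC checksums per chunk, SHA-256 here is just for illustration:

```python
# Checksum a block when it is stored; verify on every read.
import hashlib

def store_block(data):
    return {"data": data, "checksum": hashlib.sha256(data).hexdigest()}

def read_block(block):
    if hashlib.sha256(block["data"]).hexdigest() != block["checksum"]:
        raise IOError("block corrupted; client should try another replica")
    return block["data"]

blk = store_block(b"some block contents")
assert read_block(blk) == b"some block contents"

blk["data"] = b"flipped bits"          # simulate corruption in transit
try:
    read_block(blk)
    assert False, "corruption not detected"
except IOError:
    pass
```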
HDFS Architecture: Component
• Data Organization
1. Data Blocks:
• HDFS is designed to support very large files.
• Applications that are compatible with HDFS are those that deal with large data sets.
• These applications write their data only once but they read it one or more times and require these
reads to be satisfied at streaming speeds.
• HDFS supports write-once-read-many semantics on files.
• A typical block size used by HDFS is 64 MB.
• Thus, an HDFS file is chopped up into 64 MB chunks, and if possible, each chunk will reside on a
different DataNode.
2. Staging:
• A client request to create a file does not reach the NameNode immediately; the HDFS
client first caches the file data into a local buffer.
• If a client wrote to a remote file directly without any client-side buffering, the network
speed and the congestion in the network would impact throughput considerably.
• This approach is not without precedent. Earlier distributed file systems, e.g. AFS, have
used client side caching to improve performance. A POSIX requirement has been relaxed
to achieve higher performance of data uploads.
3. Replication Pipelining:
• When a client writes a block, the data is pushed to the first DataNode, which forwards it
to the next DataNode in its list, and so on, forming a pipeline until all replicas are written.
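The 64 MB chunking rule under Data Blocks above can be sketched in a few lines: a file is cut into fixed-size blocks and only the last block may be smaller.

```python
# Split a file of a given size into HDFS-style fixed-size blocks.
BLOCK_SIZE = 64 * 1024 * 1024   # 64 MB

def split_into_blocks(file_size, block_size=BLOCK_SIZE):
    """Return (offset, length) pairs for each block of a file."""
    blocks = []
    offset = 0
    while offset < file_size:
        length = min(block_size, file_size - offset)
        blocks.append((offset, length))
        offset += length
    return blocks

# A 150 MB file becomes two full 64 MB blocks plus one 22 MB tail block.
blocks = split_into_blocks(150 * 1024 * 1024)
assert len(blocks) == 3
assert blocks[0][1] == blocks[1][1] == BLOCK_SIZE
assert blocks[2][1] == 22 * 1024 * 1024
```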
HDFS Commands
ls: This command is used to list all the files.
hdfs dfs -ls
copyFromLocal (or) put: To copy files/folders from the local file system to the HDFS store.
hdfs dfs -copyFromLocal <local source> <hdfs destination>
copyToLocal (or) get: To copy files/folders from the HDFS store to the local file system.
hdfs dfs -copyToLocal <hdfs source> <local destination>
YARN
12/13/2022
Hadoop 1.0
Limitations of Hadoop 1.0
Need of YARN
YARN
What is YARN?
Workloads running on YARN
YARN Components
YARN - Resource Manager
YARN Architecture
Running an application in YARN
Step 1 - Application submitted to the Resource Manager
Step 2 - Resource Manager allocates Container
Step 3 - Application Master contacts Node Manager
Step 4 - Resource Manager Launches Container
Step 5 - Container Executes the Application Master
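The five steps above can be traced with a toy Python model. This is only an illustrative sketch, not the real YARN API: the class and method names (`ResourceManager.submit_application`, `NodeManager.launch_container`) are made up to mirror the step sequence on the slides.

```python
# Toy trace of the five YARN steps; the classes stand in for the real
# daemons and only record the sequence of interactions.
log = []

class NodeManager:
    def launch_container(self, app_master):
        log.append("step 5: container executes the ApplicationMaster")
        return app_master()

class ResourceManager:
    def __init__(self, node_managers):
        self.node_managers = node_managers

    def submit_application(self, app_master):
        log.append("step 1: application submitted to the ResourceManager")
        log.append("step 2: ResourceManager allocates a container")
        nm = self.node_managers[0]
        log.append("step 3: ApplicationMaster contacts the NodeManager")
        log.append("step 4: ResourceManager launches the container")
        return nm.launch_container(app_master)

def demo_app_master():
    """Stand-in for an ApplicationMaster body."""
    return "application finished"

rm = ResourceManager([NodeManager()])
result = rm.submit_application(demo_app_master)
assert result == "application finished"
assert len(log) == 5
```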
Introduction to HBase
HBase
Companies Using HBASE
HBASE Use Case
HBase Commands
hbase(main):001:0> list
TABLE
create '<table name>','<column family>'
hbase(main):002:0> create 'emp', 'personal data', 'professional data'
hbase(main):028:0> scan 'emp'
hbase(main):006:0> describe 'emp'
put '<table name>','<row>','<colfamily:colname>','<value>'
hbase(main):005:0> put 'emp','1','personal data:name','raju'
0 row(s) in 0.6600 seconds
get '<table name>','<row>'
hbase(main):012:0> get 'emp', '1'
delete '<table name>', '<row>', '<column name>', '<time stamp>'
hbase(main):006:0> delete 'emp', '1', 'personal data:city'
count '<table name>'
hbase(main):023:0> count 'emp'
disable '<table name>'
hbase(main):018:0> disable 'emp'
Introduction to MapReduce: Use cases
• At Google
• Index construction for Google Search (replaced in 2010 by Caffeine)
• Article clustering for Google News
• Statistical machine translation
• At Yahoo!
• “Web map” powering Yahoo! Search
• Spam detection for Yahoo! Mail
• At Facebook
• Data mining, Web log processing
• SearchBox (with Cassandra)
• Facebook Messages (with HBase)
• Ad optimization
• Spam detection
Introduction to MapReduce
• Problem: How to compute the PageRank for a crawled set of
websites on a cluster of machines?
• Main Challenges:
• How to break up a large problem into smaller tasks that can be executed
in parallel?
• How to assign tasks to machines?
• How to partition and distribute data?
• How to share intermediate results?
• How to coordinate synchronization, scheduling, fault-tolerance?
The solution is MapReduce:
• Algorithms that can be expressed as (or mapped to) a sequence of Map()
and Reduce() functions are automatically parallelized by the framework.
MapReduce Workflow
• Map Phase
• Raw data is read and converted to key/value pairs
• The Map() function is applied to each pair
• Shuffle and Sort Phase
• All key/value pairs are sorted and grouped by their keys
• Reduce Phase
• All values with the same key are processed within the same reduce() function
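The three phases above can be sketched with the classic word-count example, using plain Python to stand in for the Hadoop framework:

```python
# Word count expressed as Map -> Shuffle & Sort -> Reduce.
from itertools import groupby
from operator import itemgetter

def map_phase(records):
    """Map: each input (offset, line) pair yields (word, 1) pairs."""
    for _, line in records:
        for word in line.split():
            yield (word, 1)

def shuffle_sort(pairs):
    """Shuffle & sort: sort pairs by key and group all values per key."""
    for key, group in groupby(sorted(pairs, key=itemgetter(0)), key=itemgetter(0)):
        yield key, [v for _, v in group]

def reduce_phase(grouped):
    """Reduce: all values with the same key are summed together."""
    for key, values in grouped:
        yield key, sum(values)

records = [(0, "big data big ideas"), (1, "data wins")]
result = dict(reduce_phase(shuffle_sort(map_phase(records))))
assert result == {"big": 2, "data": 2, "ideas": 1, "wins": 1}
```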
MapReduce Programming Model
• Every MapReduce program must specify a Mapper and typically a Reducer
• The Mapper has a map() function that transforms input (key, value) pairs
into any number of intermediate (out_key, intermediate_value) pairs
• The Reducer has a reduce() function that transforms intermediate (out_key,
list(intermediate_value)) aggregates into any number of output (value’) pairs
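The Partitioner from the unit outline sits between Mapper and Reducer: it decides which Reducer receives each intermediate key. The sketch below mirrors the idea behind Hadoop's default hash partitioner (hash of key modulo number of reduce tasks); `crc32` is used here only because Python's built-in `hash()` is salted per run.

```python
# Hash partitioning: same key always goes to the same reducer.
from zlib import crc32

def partition(key, num_reducers):
    # Stable hash of the key, reduced modulo the number of reducers.
    return crc32(key.encode()) % num_reducers

pairs = [("apple", 1), ("banana", 1), ("apple", 1)]
buckets = {r: [] for r in range(2)}
for key, value in pairs:
    buckets[partition(key, 2)].append((key, value))

# All occurrences of "apple" must land in a single reducer's bucket.
apple_buckets = {r for r, kvs in buckets.items() for k, _ in kvs if k == "apple"}
assert len(apple_buckets) == 1
```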
MapReduce Programming Model
MapReduce Execution Details
MapReduce Data flow
Word Count Problem using MapReduce
Word Count Problem using MapReduce
References
1. Hadoop For Dummies, Special Edition, Robert D. Schneider, John Wiley & Sons, Inc.
2. Hadoop: The Definitive Guide, Tom White, 3rd edition, O’Reilly Media / Yahoo Press
3. Learning Spark: Lightning-Fast Big Data Analysis, Holden Karau, Andy Konwinski,
Patrick Wendell, and Matei Zaharia, O’Reilly Media, 2015
4. https://spark.apache.org/
5. https://hadoop.apache.org/
6. https://hadoop.apache.org/docs/r1.2.1/hdfs_design.pdf
7. https://www.knowledgehut.com/blog/big-data/5-best-data-processing-frameworks
8. https://opensourceforu.com/2018/03/a-quick-comparison-of-the-five-best-big-data-frameworks/
9. https://data-flair.training/blogs/hadoop-ecosystem-components/
10. http://dbis.informatik.uni-freiburg.de/content/courses/SS12/Praktikum/Datenbanken%20und%20Cloud%20Computing/slides/MapReduce.pdf
11. https://www.edureka.co/blog/mapreduce-tutorial/
References
1. Introduction to Analytics and Big Data - Hadoop, Thomas Rivera, Hitachi Data
Systems, SNIA Education, 2014
2. Hadoop: A Framework for Data-Intensive Distributed Computing, CS561 Spring
2012, WPI, Mohamed Y. Eltabakh