BDA_UNIT-IV
Why Hadoop?
Hadoop is better suited for big data analytics and machine learning
tasks than an RDBMS, which struggles to manage the
volume, variety, and velocity of big data.
Hadoop Overview
Hadoop is an open-source software framework designed to store and
process massive amounts of data in a distributed manner using clusters
of commodity hardware. Its primary tasks are distributed storage
(through HDFS) and distributed processing (through MapReduce).
NameNode
o EditLog: a transaction log maintained by the NameNode that records
every change made to the file system metadata.
o Single Point of Failure: the NameNode is a single point of failure;
if it goes down, the entire file system becomes inaccessible.
Client Application: The client interacts with the HDFS through the
Hadoop File System Client, which communicates with the
NameNode to manage file operations.
o For example: when a client wants to read a file, it first asks the
NameNode for the locations of the file's blocks.
DataNode
Communication:
o DataNodes send heartbeat messages to the NameNode at
regular intervals to confirm they are active and functional.
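The heartbeat mechanism can be sketched as a toy liveness monitor. This is an illustration, not Hadoop's implementation: the class name, the 10-second timeout, and the node IDs are all invented for the example (HDFS's actual default heartbeat interval is 3 seconds).

```python
import time

HEARTBEAT_TIMEOUT = 10  # seconds without a heartbeat before a node is presumed dead (illustrative)

class NameNodeMonitor:
    """Toy sketch of how a NameNode might track DataNode liveness."""

    def __init__(self):
        self.last_heartbeat = {}  # DataNode id -> timestamp of its last heartbeat

    def receive_heartbeat(self, datanode_id, now=None):
        # Each heartbeat refreshes the DataNode's "last seen" timestamp.
        self.last_heartbeat[datanode_id] = now if now is not None else time.time()

    def live_datanodes(self, now=None):
        # A DataNode is live if it reported within the timeout window.
        now = now if now is not None else time.time()
        return [dn for dn, ts in self.last_heartbeat.items()
                if now - ts <= HEARTBEAT_TIMEOUT]

monitor = NameNodeMonitor()
monitor.receive_heartbeat("dn1", now=100.0)
monitor.receive_heartbeat("dn2", now=105.0)
print(monitor.live_datanodes(now=112.0))  # dn1 timed out; prints ['dn2']
```

A real NameNode also uses these heartbeats to carry block reports and commands back to the DataNodes; the sketch models only the liveness check.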
NameNode Operations
Anatomy of File Read
1. The client calls open() on the DistributedFileSystem.
2. The DistributedFileSystem contacts the NameNode via RPC to get the
locations of the blocks of the file.
3. The client calls read() on the FSDataInputStream returned by open().
4. The stream connects to the closest DataNode holding each block in
turn and streams the data back to the client.
5. After all blocks are read, the client closes the FSDataInputStream.
Anatomy of File Write (Figure 5.19)
1. The client calls create() on the DistributedFileSystem to create a
new file.
2. The DistributedFileSystem makes an RPC call to the NameNode to
create the file in the namespace, with no blocks associated with it yet.
3. As the client writes data, it is split into packets, which are queued
and streamed to a pipeline of DataNodes allocated by the NameNode.
4. Each DataNode in the pipeline stores a packet and forwards it to the
next DataNode in the pipeline.
5. A packet is removed from the acknowledgement queue only after every
DataNode in the pipeline has acknowledged it.
6. When the client finishes writing, it calls close(), and the NameNode
is notified that the file is complete.
The output from map tasks is shuffled and sorted based on keys.
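The shuffle-and-sort step above can be illustrated with a minimal in-memory word-count sketch. This is plain Python, not the Hadoop API; the function names and the sample input lines are invented for the example.

```python
from collections import defaultdict

def map_phase(lines):
    # Map: emit a (word, 1) pair for every word in the input.
    for line in lines:
        for word in line.split():
            yield (word, 1)

def shuffle_and_sort(pairs):
    # Shuffle: group all values by key; sort keys so each
    # reducer sees its keys in order.
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return sorted(groups.items())

def reduce_phase(grouped):
    # Reduce: sum the counts collected for each word.
    return {key: sum(values) for key, values in grouped}

lines = ["big data big", "data tools"]
counts = reduce_phase(shuffle_and_sort(map_phase(lines)))
print(counts)  # {'big': 2, 'data': 2, 'tools': 1}
```

In real Hadoop the shuffle moves map output across the network to the reducer nodes; here the grouping and sorting happen in one process, but the data flow is the same.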
JobTracker
The JobTracker is the master daemon of MapReduce: it accepts jobs from
clients, schedules their map and reduce tasks on TaskTrackers, monitors
progress, and re-runs failed tasks.
Hadoop Configuration Files
1.1. core-site.xml - core Hadoop settings, such as the default file
system URI (fs.defaultFS).
1.2. hdfs-site.xml - HDFS settings, such as the replication factor and
the NameNode/DataNode storage directories.
1.3. mapred-site.xml - MapReduce framework settings.
1.4. yarn-site.xml - YARN ResourceManager and NodeManager settings.
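As an example of the format these files share, a minimal core-site.xml might look like the sketch below. The localhost URI is an illustrative single-node value, not a required setting; `fs.defaultFS` is the standard property name.

```xml
<?xml version="1.0"?>
<!-- core-site.xml: core Hadoop settings -->
<configuration>
  <property>
    <name>fs.defaultFS</name>
    <!-- default file system URI (single-node example value) -->
    <value>hdfs://localhost:9000</value>
  </property>
</configuration>
```

The other three files use the same `<configuration>`/`<property>` structure with their own property names.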
3. Column-Oriented Storage
5. Strong Consistency
7. Fault Tolerance
o Built on HDFS, meaning data is replicated across multiple
nodes for reliability.
Row-Based vs Column-Based Query:
Table: users (with columns such as Name, Age, City)
Advantage:
If we need only "Age", HBase can scan just the "Age" column
instead of reading every entire row.
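The advantage can be illustrated with a toy comparison of the two layouts. These are plain Python dictionaries, not HBase's actual storage format, and the sample names and values are invented.

```python
# Row-oriented layout: each row stores all of its columns together.
rows = [
    {"Name": "Asha", "Age": 29, "City": "Pune"},
    {"Name": "Ravi", "Age": 34, "City": "Delhi"},
]

# Column-oriented layout: each column's values are stored together.
columns = {
    "Name": ["Asha", "Ravi"],
    "Age": [29, 34],
    "City": ["Pune", "Delhi"],
}

# Row store: reading every Age means touching every cell of every row.
ages_row_store = [row["Age"] for row in rows]
cells_touched_row = sum(len(row) for row in rows)   # 6 cells

# Column store: only the Age column is read.
ages_col_store = columns["Age"]
cells_touched_col = len(columns["Age"])             # 2 cells

print(ages_row_store, cells_touched_row)  # [29, 34] 6
print(ages_col_store, cells_touched_col)  # [29, 34] 2
```

Both layouts return the same answer, but the column store touches a third of the data; with wide tables and billions of rows that ratio is what makes column-oriented scans cheap.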
Architecture of HBase
HBase architecture has 3 main components: HMaster, Region
Server, Zookeeper.
HMaster (Master Server in HBase)
The HMaster is the main process that manages the HBase cluster. It is
responsible for:
- Assigning Regions to Region Servers.
- Performing DDL (Data Definition Language) operations such as
creating, deleting, and modifying tables.
- Monitoring all Region Servers in the cluster.
- Running background threads for tasks like load balancing and failover
handling.
- Handling automatic region splitting when a region grows beyond a
predefined size.
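Automatic region splitting can be sketched as a size-triggered split of a sorted key range. The 10-row threshold, the key format, and the function name are illustrative only; real HBase splits by on-disk store-file size, not row count.

```python
MAX_REGION_SIZE = 10  # rows per region before a split triggers (illustrative)

def maybe_split(region):
    """Split a region's sorted row keys in half once it exceeds the threshold."""
    keys = region["keys"]
    if len(keys) <= MAX_REGION_SIZE:
        return [region]  # still small enough; no split
    mid = len(keys) // 2
    # Two daughter regions, each covering half of the parent's key range.
    return [{"keys": keys[:mid]}, {"keys": keys[mid:]}]

region = {"keys": [f"row{i:02d}" for i in range(12)]}
daughters = maybe_split(region)
print(len(daughters))            # 2
print(daughters[0]["keys"][-1])  # row05
print(daughters[1]["keys"][0])   # row06
```

After a real split, the HMaster reassigns the two daughter regions to Region Servers, which is why splitting and assignment are both listed among its duties.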
Zookeeper
- It acts as a coordinator for the HBase cluster.
- It provides services such as maintaining configuration information,
naming, distributed synchronization, and server-failure notification.
HIVE
Apache Hive is a data warehouse infrastructure built on top of Hadoop
that allows users to process and analyze large datasets using SQL-like
queries. Instead of writing complex MapReduce programs, users can
write Hive Query Language (HiveQL), which is then converted into
MapReduce, Tez, or Spark jobs for execution.
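To make the translation concrete, the sketch below pairs a HiveQL query with the map and reduce logic it conceptually compiles to. The `employees` rows are invented sample data, and real Hive generates far more machinery than this; the sketch shows only the shape of the compiled job.

```python
from collections import defaultdict

# The HiveQL a user writes instead of MapReduce code:
hiveql = "SELECT dept, COUNT(*) FROM employees GROUP BY dept"

employees = [("Asha", "HR"), ("Ravi", "IT"), ("Meena", "IT")]  # invented rows

# Conceptually, Hive compiles the query into a map step and a reduce step:
def map_step(rows):
    for name, dept in rows:
        yield (dept, 1)        # emit (group-by key, 1)

def reduce_step(pairs):
    counts = defaultdict(int)
    for dept, one in pairs:
        counts[dept] += one    # COUNT(*) per dept
    return dict(counts)

print(reduce_step(map_step(employees)))  # {'HR': 1, 'IT': 2}
```

The GROUP BY column becomes the map output key, and the aggregate function becomes the reducer, which is the general pattern for HiveQL aggregation queries.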
Apache Pig
Apache Pig is a high-level scripting language used for processing large
datasets in Hadoop. It provides an easy way to write complex data
transformations using a procedural scripting language called Pig Latin.
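A typical Pig Latin pipeline (LOAD, FILTER, GROUP, FOREACH ... GENERATE) can be mimicked step by step in plain Python to show its procedural style. The user data and the age threshold are invented for illustration; the Pig Latin shown in the comment is standard syntax.

```python
from itertools import groupby

# The Pig Latin a user would write (shown for comparison):
#   users  = LOAD 'users' AS (name, age);
#   adults = FILTER users BY age >= 18;
#   byage  = GROUP adults BY age;
#   counts = FOREACH byage GENERATE group, COUNT(adults);

users = [("Asha", 29), ("Ravi", 17), ("Meena", 29), ("Kiran", 18)]

adults = [u for u in users if u[1] >= 18]         # FILTER
adults.sort(key=lambda u: u[1])                   # groupby needs sorted input
counts = [(age, len(list(group)))                 # FOREACH ... GENERATE COUNT
          for age, group in groupby(adults, key=lambda u: u[1])]

print(counts)  # [(18, 1), (29, 2)]
```

Each Pig Latin statement names an intermediate relation, which is why the Python version reads naturally as one assignment per transformation step.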
Pig - Architecture
Hive vs Pig
Both Apache Hive and Apache Pig simplify Big Data processing but
serve different use cases. Here’s a comparison:
Feature     | Hive                     | Pig
Best for    | Querying structured data | Complex data transformations
Ease of use | Easier for SQL users     | Easier for programmers
4. Visualization Capabilities
Data Transformation
Descriptive Statistics
Data Visualization
Regression Metrics
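Standard regression metrics such as Mean Absolute Error and Mean Squared Error (assumed here to be the metrics this heading covers) can be computed directly; the sample values are invented for illustration.

```python
def mae(y_true, y_pred):
    # Mean Absolute Error: average absolute difference.
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)

def mse(y_true, y_pred):
    # Mean Squared Error: average squared difference.
    return sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)

y_true = [3.0, 5.0, 2.0]
y_pred = [2.5, 5.0, 4.0]
print(mae(y_true, y_pred))  # (0.5 + 0 + 2) / 3 ~ 0.833
print(mse(y_true, y_pred))  # (0.25 + 0 + 4) / 3 ~ 1.417
```

MSE penalizes large errors more heavily than MAE because the differences are squared, which is why the 2.0 miss dominates its value here.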
Classification Metrics
o Accuracy → (Correct Predictions / Total Predictions)
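The accuracy formula above translates directly into code; the label vectors below are invented for illustration.

```python
def accuracy(y_true, y_pred):
    # Accuracy = correct predictions / total predictions.
    correct = sum(1 for t, p in zip(y_true, y_pred) if t == p)
    return correct / len(y_true)

y_true = [1, 0, 1, 1, 0]
y_pred = [1, 0, 0, 1, 0]
print(accuracy(y_true, y_pred))  # 4 correct out of 5 -> 0.8
```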
Mathematical Approach: