

NOTES ON
BIG DATA AND
ANALYTICS (KDS 601)

B.TECH III YEAR - 6th SEM (2023-24)

DEPARTMENT OF INFORMATION TECHNOLOGY


JSSATE Noida

COURSE OBJECTIVES:
The objectives of this course are
To learn the need for Big Data and the various challenges involved, and to acquire knowledge about different analytical architectures.
To understand Hadoop architecture and its ecosystems.
To understand the Hadoop ecosystem and acquire knowledge about NoSQL databases.
To acquire knowledge about the NewSQL, MongoDB and Cassandra databases.
To imbibe the processing of Big Data with advanced architectures like Spark.


UNIT – I
Introduction to Big Data: Types of digital data, history of Big Data innovation, introduction to Big
Data platform, drivers for Big Data, Big Data architecture and characteristics, 5 Vs of Big Data, Big
Data technology components, Big Data importance and applications, Big Data features – security,
compliance, auditing and protection, Big Data privacy and ethics, Big Data Analytics, Challenges of
conventional systems, intelligent data analysis, nature of data, analytic processes and tools, analysis vs
reporting, modern data analytic tools.

UNIT – II
Hadoop: History of Hadoop, Apache Hadoop, the Hadoop Distributed File System, components of
Hadoop, data format, analyzing data with Hadoop, scaling out, Hadoop streaming, Hadoop pipes,
Hadoop Ecosystem. Map-Reduce: Map-Reduce framework and basics, how Map Reduce works,
developing a Map Reduce application, unit tests with MRUnit, test data and local tests, anatomy of a
Map Reduce job run, failures, job scheduling, shuffle and sort, task execution, Map Reduce types, input
formats, output formats, Map Reduce features, Real-world Map Reduce

TEXT BOOKS:
Seema Acharya and Subhashini Chellappan, “Big Data and Analytics”, Wiley India Pvt.
Ltd., 2016.
Mike Frampton, “Mastering Apache Spark”, Packt Publishing, 2015.

REFERENCE BOOKS:
Tom White, “Hadoop: The Definitive Guide”, O’Reilly, 4th Edition, 2015.
Mohammed Guller, “Big Data Analytics with Spark”, Apress, 2015.
Donald Miner, Adam Shook, “MapReduce Design Patterns”, O’Reilly, 2012.

COURSE OUTCOMES:
On successful completion of the course, students will be able to
Demonstrate knowledge of Big Data, Data Analytics, and the challenges and their solutions in Big Data.
Analyze the Hadoop framework and its ecosystems.
Analyze MapReduce and YARN, and work in a NoSQL environment.
Work in a NewSQL environment, with MongoDB and Cassandra.
Apply Big Data processing using MapReduce programming in both the Hadoop and Spark frameworks.


Unit 1

What is Data?
Data is defined as individual facts, such as numbers, words, measurements, observations or
just descriptions of things.

For example, data might include individual prices, weights, addresses, ages, names, temperatures, dates,
or distances.

There are two main types of data:


 Quantitative data is provided in numerical form, like the weight, volume, or cost of
an item.
 Qualitative data is descriptive, but non-numerical, like the name, sex, or eye
colour of a person.

Characteristics of Data
The following six key characteristics of data are discussed below:
Accuracy
Validity
Reliability
Timeliness
Relevance
Completeness

1. Accuracy
 Data should be sufficiently accurate for the intended use and should be captured only
once, although it may have multiple uses.
 Data should be captured at the point of activity.


2. Validity
 Data should be recorded and used in compliance with relevant requirements, including
the correct application of any rules or definitions.
 This will ensure consistency between periods and with similar organizations,
measuring what is intended to be measured.

3. Reliability
 Data should reflect stable and consistent data collection processes across collection
points and over time.
 Progress toward performance targets should reflect real changes rather than variations
in data collection approaches or methods.
 Source data is clearly identified and readily available from manual, automated, or
other systems and records.

4. Timeliness
 Data should be captured as quickly as possible after the event or activity and must be
available for the intended use within a reasonable time period.
 Data must be available quickly and frequently enough to support information needs
and to influence service or management decisions.

5. Relevance
 Data captured should be relevant to the purposes for which it is to be used. This will
require a periodic review of requirements to reflect changing needs.

6. Completeness
 Data requirements should be clearly specified based on the information needs of the
organization and data collection processes matched to these requirements.

Types of Digital Data


 Digital data is the electronic representation of information in a format or language that
machines can read and understand.
 In more technical terms, Digital data is a binary format of information that's converted
into a machine-readable digital format.
 The power of digital data is that any analog inputs, from very simple text documents to genome sequencing results, can be represented with the binary system.

Types of Digital Data:


 Structured
 Unstructured
 Semi Structured Data

1. Structured Data:
 Structured data refers to any data that resides in a fixed field within a record or
file.
 Having a particular Data Model.
 Meaningful data.
 Data arranged in rows and columns.
 Structured data has the advantage of being easily entered, stored, queried and
analysed.
 E.g.: Relational Data Base, Spread sheets.
 Structured data is often managed using Structured Query Language (SQL); a brief query sketch is shown below.
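
As an illustration of the point above, here is a minimal Java/JDBC sketch (not part of the original notes) for querying structured data; the connection URL, credentials, and the hypothetical customers table with its name, city, and age columns are placeholders, and a suitable JDBC driver is assumed to be on the classpath.

```java
// Illustrative sketch only: querying structured data that lives in fixed rows
// and columns. The URL, credentials, table and column names are placeholders.
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;

public class StructuredDataQuery {
    public static void main(String[] args) throws Exception {
        try (Connection conn = DriverManager.getConnection(
                     "jdbc:mysql://localhost:3306/shop", "user", "password");
             Statement stmt = conn.createStatement();
             // A fixed schema means the query can rely on well-known columns.
             ResultSet rs = stmt.executeQuery(
                     "SELECT name, city FROM customers WHERE age > 30")) {
            while (rs.next()) {
                System.out.println(rs.getString("name") + " - " + rs.getString("city"));
            }
        }
    }
}
```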

Sources of Structured Data:


 SQL Databases
 Spreadsheets such as Excel
 OLTP Systems
 Online forms
 Sensors such as GPS or RFID tags
 Network and Web server logs
 Medical devices

Advantages of Structured Data:


 Easy to understand and use: Structured data has a well-defined schema or data
model, making it easy to understand and use. This allows for easy data retrieval,
analysis, and reporting.
 Consistency: The well-defined structure of structured data ensures consistency and
accuracy in the data, making it easier to compare and analyze data across different
sources.

 Efficient storage and retrieval: Structured data is typically stored in relational databases, which are designed to efficiently store and retrieve large amounts of data. This makes it easy to access and process data quickly.
 Enhanced data security: Structured data can be more easily secured than
unstructured or semi-structured data, as access to the data can be controlled
through database security protocols.
 Clear data lineage: Structured data typically has a clear lineage or history, making
it easy to track changes and ensure data quality.

Disadvantages of Structured Data:


 Inflexibility: Structured data can be inflexible in terms of accommodating new
types of data, as any changes to the schema or data model require significant
changes to the database.
 Limited complexity: Structured data is often limited in terms of the complexity of
relationships between data entities. This can make it difficult to model complex
real-world scenarios.
 Limited context: Structured data often lacks the additional context and
information that unstructured or semi-structured data can provide, making it more
difficult to understand the meaning and significance of the data.
 Expensive: Structured data requires the use of relational databases and related
technologies, which can be expensive to implement and maintain.
 Data quality: The structured nature of the data can sometimes lead to missing or
incomplete data, or data that does not fit cleanly into the defined schema, leading
to data quality issues.

2. Unstructured Data:
 Unstructured data cannot readily be classified and fitted into a neat box.
 Also called unclassified data.
 It does not conform to any data model.
 Business rules are not applied.
 Indexing is not required.

 E.g.: photos and graphic images, videos, streaming instrument data, webpages, PDF files, PowerPoint presentations, emails, blog entries, wikis and word-processing documents.

Sources of Unstructured Data:


 Web pages
 Images (JPEG, GIF, PNG, etc.)
 Videos
 Memos
 Reports
 Word documents and PowerPoint presentations
 Surveys

Advantages of Unstructured Data:


 It supports data that lacks a proper format or sequence.
 The data is not constrained by a fixed schema.
 Very flexible due to the absence of a schema.
 Data is portable.
 It is very scalable.
 It can deal easily with the heterogeneity of sources.
 These types of data have a variety of business intelligence and analytics applications.

Disadvantages of Unstructured Data:


 It is difficult to store and manage unstructured data due to the lack of schema and structure.
 Indexing the data is difficult and error-prone because the structure is unclear and there are no pre-defined attributes, so search results are not very accurate.
 Ensuring data security is a difficult task.

3. Semi-structured Data:

 Self-describing data.
 Contains metadata (data about data).
 Also called quasi-structured data: data that lies between structured and unstructured data.
 It is a type of structured data that does not follow a rigid data model.
 Data which does not have a rigid structure.
 E.g.: e-mails, word-processing documents.
 XML and other markup languages are often used to manage semi-structured data (a small illustrative sketch follows below).
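
To make the contrast with structured data concrete, the following is a small illustrative Java sketch (an addition, not from the original notes); the Customer record and the XML fragment are made-up examples, and a recent JDK (16+) is assumed for records and text blocks.

```java
// Illustrative sketch: the same customer fact as structured data (fixed schema)
// and as semi-structured data (self-describing tags, optional fields).
public class DataShapes {
    // Structured: every record has exactly these fields, like a table row.
    record Customer(int id, String name, String city) {}

    public static void main(String[] args) {
        Customer structured = new Customer(17, "Asha", "Noida");

        // Semi-structured: the markup describes the data itself, and optional
        // elements (e.g. <loyaltyTier>) may appear in some records but not others.
        String semiStructured = """
                <customer id="17">
                  <name>Asha</name>
                  <city>Noida</city>
                  <loyaltyTier>gold</loyaltyTier>
                </customer>
                """;

        System.out.println(structured);
        System.out.println(semiStructured);
    }
}
```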


Sources of semi-structured Data:


 E-mails
 XML and other markup languages
 Binary executables
 TCP/IP packets
 Zipped files
 Integration of data from different sources
 Web pages

Advantages of Semi-structured Data:


 Data is portable
 It is possible to view structured data as semi-structured data
 It supports users who cannot express their needs in SQL.
 It can deal easily with the heterogeneity of sources.
 Flexibility: Semi-structured data provides more flexibility in terms of data storage
and management, as it can accommodate data that does not fit into a strict,
predefined schema. This makes it easier to incorporate new types of data into an
existing database or data processing pipeline.
 Scalability: Semi-structured data is particularly well-suited for managing large
volumes of data, as it can be stored and processed using distributed computing
systems, such as Hadoop or Spark, which can scale to handle massive amounts of
data.
 Faster data processing: Semi-structured data can be processed more quickly than
traditional structured data, as it can be indexed and queried in a more flexible way.
This makes it easier to retrieve specific subsets of data for analysis and reporting.
 Improved data integration: Semi-structured data can be more easily integrated
with other types of data, such as unstructured data, making it easier to combine
and analyze data from multiple sources.
 Richer data analysis: Semi-structured data often contains more contextual
information than traditional structured data, such as metadata or tags. This can
provide additional insights and context that can improve the accuracy and
relevance of data analysis.

Disadvantages of Semi-structured Data
 The lack of a fixed, rigid schema makes the data difficult to store.
 Interpreting the relationships between data items is difficult, as there is no separation between the schema and the data.
 Queries are less efficient than those on structured data.
 Complexity: Semi-structured data can be more complex to manage and process
than structured data, as it may contain a wide variety of formats, tags, and
metadata. This can make it more difficult to develop and maintain data models and
processing pipelines.
 Lack of standardization: Semi-structured data often lacks the standardization and
consistency of structured data, which can make it more difficult to ensure data
quality and accuracy. This can also make it harder to compare and analyze data
across different sources.
 Reduced performance: Processing semi-structured data can be more resource-
intensive than processing structured data, as it often requires more complex
parsing and indexing operations. This can lead to reduced performance and longer
processing times.
 Limited tooling: While there are many tools and technologies available for
working with structured data, there are fewer options for working with semi-
structured data. This can make it more challenging to find the right tools and
technologies for a particular use case.
 Data security: Semi-structured data can be more difficult to secure than structured
data, as it may contain sensitive information in unstructured or less-visible parts
of the data. This can make it more challenging to identify and protect sensitive
information from unauthorized access.
Overall, while semi-structured data offers many advantages in terms of flexibility and
scalability, it also presents some challenges and limitations that need to be carefully
considered when designing and implementing data processing and analysis pipelines.


Big Data
Big Data is a collection of data that is huge in volume and growing exponentially with time. It is data of such large size and complexity that no traditional data management tool can store or process it efficiently. In short, Big Data is still data, but of very large size.

What is an Example of Big Data?


Following are some of the Big Data examples-

 New York Stock Exchange: The New York Stock Exchange is an example of Big Data; it generates about one terabyte of new trade data per day.

 Social Media: Statistics show that 500+ terabytes of new data get ingested into the databases of the social media site Facebook every day. This data is mainly generated through photo and video uploads, message exchanges, comments, etc.

 Jet engine: A single jet engine can generate 10+ terabytes of data in 30 minutes of flight time. With many thousands of flights per day, data generation reaches many petabytes.


Big Data Characteristics

1. Volume:
 The name Big Data itself is related to an enormous size.
 Big Data is a vast ‘volume’ of data generated from many sources daily, such as
business processes, machines, social media platforms, networks, human
interactions, and many more.

2. Variety:
 Big Data can be structured, unstructured, and semi-structured that are being
collected from different sources.
 In the past, data was collected only from databases and spreadsheets, but these days data comes in many forms: PDFs, emails, audio, social media posts, photos, videos, etc.

3. Veracity
 Veracity means how reliable the data is. There are many ways to filter or translate the data.
 Veracity is about being able to handle and manage data of varying quality efficiently, which is essential when Big Data is used for business development.

4. Value
 Value is an essential characteristic of Big Data. What matters is not simply the data that we process or store, but that the data we store, process, and analyze is valuable and reliable.

5. Velocity
 Velocity plays an important role compared to the other characteristics. Velocity is the speed at which data is created in real time. It covers the speed of incoming data sets, the rate of change, and bursts of activity. A primary aspect of Big Data is providing the data that is demanded rapidly.
 Big Data velocity deals with the speed at which data flows from sources like application logs, business processes, networks, social media sites, sensors, mobile devices, etc.

Why Big Data?

Big Data has become a critical asset for organizations, with 93% of companies rating Big Data initiatives as
“extremely important.” Leveraging Big Data analytics helps organizations unlock strategic value and maximize
their potential.

Key Benefits of Big Data for Organizations

1. Customer Insights and Engagement

 Understand where, when, and why customers buy.


 Protect the company’s client base with improved loyalty programs.
 Seize cross-selling and upselling opportunities.
 Provide targeted promotional information to customers.

2. Operational Efficiency and Optimization

 Optimize workforce planning and operations.


 Improve inefficiencies in the supply chain.
 Predict market trends and future needs.

3. Competitive Advantage and Innovation

 Helps companies become more innovative and competitive.


 Enables businesses to discover new sources of revenue.
 Enhances decision-making by analyzing historical and real-time data.

Big Data and Customer-Centric Strategies

 Companies use Big Data to identify customer preferences, best customers, and reasons behind product
choices.
 Machine Learning and predictive analytics help in designing market strategies tailored to customer needs.
 Businesses can continuously improve and update marketing strategies to be more responsive to consumer
demands.

By leveraging Big Data, companies can gain deep insights into customer behavior, optimize operations, and stay
ahead in a competitive market.

Importance of big data

The significance of Big Data does not depend on the volume of data a company possesses but on how effectively it
is utilized. Companies that efficiently analyze and leverage their data experience faster growth and greater
success.

Why Companies Need to Collect and Analyze Big Data

1. Cost Savings

 Big Data tools like Apache Hadoop and Spark help businesses store and manage vast amounts of data cost-
effectively.
 These tools assist in identifying more efficient business processes, reducing operational expenses.

2. Time-Saving

 Real-time in-memory analytics enables companies to collect and process data instantly.
 Tools like Hadoop facilitate immediate analysis, allowing for quick and informed decision-making.

3. Market Understanding

 Big Data analytics helps businesses gain insights into market conditions.

 Customer purchasing behavior analysis allows companies to identify high-demand products and adjust production accordingly.
 This ensures a competitive edge over rivals.

4. Social Media Listening

 Companies can perform sentiment analysis using Big Data tools.


 These tools help businesses track online feedback, understanding who is saying what about the company.
 Helps improve the brand’s online presence and reputation.

5. Boost Customer Acquisition and Retention

 Customers are the core asset of any business.


 Big Data analytics identifies customer trends and behavior patterns, leading to better engagement
strategies.
 Understanding customer needs prevents client loss and drives business growth.

6. Solving Advertisers’ Challenges & Marketing Insights

 Big Data analytics optimizes business operations by aligning with customer expectations.
 Helps in refining product lines and designing powerful marketing campaigns.

7. Driving Innovation and Product Development

 Big Data empowers companies to innovate and enhance their products.


 Continuous analysis of consumer needs and trends leads to new product development and market
leadership.

By leveraging Big Data, businesses can optimize their strategies, improve customer satisfaction, and drive
innovation, ensuring long-term success.

Challenges of Big Data and Their Solutions

Implementing a Big Data solution comes with several challenges. Below are some of the most common issues
businesses face and their potential solutions.

1. Managing Massive Amounts of Data

 Challenge:
o Companies collect increasing amounts of data daily, making traditional data storage solutions
insufficient.
o 43% of IT decision-makers worry about infrastructure overload.
 Solution:
o Cloud migration: Businesses are shifting to cloud-based storage solutions that scale dynamically.
o Big Data tools: Software like Apache Hadoop and Spark enable efficient data storage and quick
access.

2. Integrating Data from Multiple Sources

 Challenge:
o Data comes from various sources (web analytics, social media, CRM, emails, etc.), making
integration difficult.
o Different structures require standardization for effective analysis.
 Solution:
o ETL (Extract, Transform, Load) software: Helps map disparate data sources into a common
structure.
o Business intelligence tools: Combine multiple data sets for accurate reporting and insights.

3. Ensuring Data Quality

 Challenge:
o Incomplete, corrupted, or inaccurate data leads to flawed analytics and incorrect predictions.
o Increasing data volume makes quality control difficult.
 Solution:
o Data governance tools: Organize, manage, and secure data while validating sources.
o Data quality software: Cleans up corrupted and incomplete data before processing.

4. Keeping Data Secure

 Challenge:
o Sensitive data (company strategies, financial records, customer details) makes businesses targets for
hackers.
o Breaches can lead to financial loss, identity theft, and reputational damage.
 Solution:
o Encryption: Ensures data is unusable without a decryption key.
o Identity & access controls: Restrict data access to authorized users.
o Endpoint protection & real-time monitoring: Prevents malware infections and detects threats
instantly.

5. Selecting the Right Big Data Tools

 Challenge:
o The wide range of Big Data tools makes it difficult to choose the right one.
o Overlapping functionalities can lead to inefficiencies.
 Solution:
o Consulting Big Data professionals: Experts assess business needs and recommend appropriate
tools.
o Enterprise data streaming & ETL solutions: Aggregate data from multiple sources for seamless
processing.
o Dynamic cloud configuration: Ensures the system scales efficiently with minimal maintenance.

6. Scaling Systems and Costs Efficiently

 Challenge:
o Without proper planning, companies may store and process irrelevant or excessive data.
o Uncontrolled expansion leads to unnecessary costs.
 Solution:
o Goal-oriented planning: Define objectives before starting a data project.
o Schema design: Structure data storage efficiently.
o Data retention policies: Purge outdated and unnecessary data periodically.

7. Lack of Skilled Data Professionals

 Challenge:
o Many businesses lack trained personnel for handling Big Data.
o Untrained employees can cause workflow disruptions and data errors.
 Solution:
o Hiring specialists: Employ full-time data experts or consultants to guide teams.
o Training programs: Upskill existing employees in Big Data management.
o Self-service analytics tools: Enable non-data professionals to work with analytics solutions.

8. Organizational Resistance

 Challenge:
o Larger organizations often resist adopting Big Data due to cost concerns or reluctance to change.
o Leaders may not see immediate value in analytics and machine learning.
 Solution:
o Pilot projects: Start small and use successful results to demonstrate value.
o Leadership changes: Appoint data-driven leaders to guide transformation.

Conclusion

Big Data presents various challenges, from infrastructure scalability to security concerns. However, with the right
strategies—cloud adoption, data governance, skilled professionals, and efficient planning—businesses can harness
Big Data’s full potential for growth and innovation.


Unit 2

Requirement of Hadoop Framework


 Hadoop is an Apache open-source framework written in Java that allows distributed processing of large datasets across clusters of computers using simple programming models.

 The Hadoop framework application works in an environment that provides


distributed storage and computation across clusters of computers.

 Hadoop is designed to scale up from single server to thousands of machines, each


offering local computation and storage.

Design Principles of Hadoop

Hadoop is designed to efficiently process large-scale data across distributed clusters. The core design principles of
Hadoop are:

1. Fault Tolerance: Self-Healing System

 The system shall manage and heal itself.


 Automatically and transparently routes around failures.
 Speculative execution of redundant tasks if certain nodes are detected to be slow.

2. Scalability: Linear Performance Scaling

 Performance scales linearly with an increase in resources.


 Ensures proportional growth in capacity as nodes are added.

3. Data Locality: Moving Computation to Data

 Computation is performed closer to where data is stored.


 Reduces latency and minimizes bandwidth usage for data transfer.

4. Economy: Simple, Modular, and Extensible Structure

 Designed with a simple core architecture.


 Modular and extensible framework to support various big data applications.

 Cost-effective as it runs on commodity hardware.

These principles ensure Hadoop’s efficiency, reliability, and scalability in processing large datasets across
distributed environments.

Comparison with other systems like SQL

Parameter | Hadoop | SQL
Architecture | Open-source framework that distributes data across clusters for parallel processing. | Structured Query Language (SQL) used for managing relational databases.
Operations | Stores, processes, retrieves, and extracts patterns from various data formats (XML, Text, JSON, etc.). | Stores, processes, retrieves, and mines patterns from structured relational data.
Data Type/Update | Supports structured and unstructured data; writes data once and reads multiple times. | Works only with structured data; allows multiple writes and reads.
Data Volume Processed | Handles large data volumes (terabytes to petabytes). | Works better with low data volumes (gigabytes).
Data Storage | Stores data in key-value pairs, hashes, maps, tables, etc., with dynamic schemas in a distributed system. | Uses tabular format with fixed schemas.
Schema Structure | Supports dynamic schemas. | Supports static schemas.
Data Structures Supported | Supports NoSQL data structures and columnar data but requires additional coding for transactions. | Works on ACID (Atomicity, Consistency, Isolation, Durability) principles.
Fault Tolerance | Highly fault-tolerant. | Good fault tolerance.
Availability | Uses distributed computing and MapReduce for high availability across multiple locations. | Available on-premise or in the cloud but lacks distributed computing benefits.
Integrity | Low integrity. | High integrity.
Scaling | Uses horizontal scaling, connecting multiple computers, making it cost-effective and flexible. | Requires additional SQL servers, making scaling expensive and time-consuming.
Data Processing | Supports large-scale batch data processing (OLAP - Online Analytical Processing). | Supports real-time data processing (OLTP - Online Transaction Processing).
Execution Time | Executes queries quickly, even for millions of records. | SQL syntax can be slower when handling millions of rows.
Interaction | Uses JDBC to transfer and receive data from SQL systems. | SQL systems can read and write data to Hadoop.
Support for ML & AI | Strong support for machine learning and artificial intelligence. | Limited support for machine learning and AI.
Skill Level | Requires advanced skills, making it challenging for beginners. | Intermediate skill level; easier to learn for beginners.
Language Supported | Built with the Java programming language. | Uses traditional database languages like MySQL, Oracle, and SQL Server.
Use Case | Best for handling large volumes of structured, semi-structured, and unstructured data. | Ideal for moderate data volumes with structured data.
Hardware Configuration | Requires commodity hardware installation on the server. | Uses proprietary hardware installation.
Pricing | Free, open-source framework. | Mostly licensed systems.

Comparison with other systems like RDBMS


Below is a comparison table of Hadoop and RDBMS.

Feature | RDBMS | Hadoop
Data Variety | Mainly for structured data | Used for structured, semi-structured, and unstructured data
Data Storage | Average-size data (GBs) | Used for large data sets (TBs and PBs)
Querying | SQL language | HQL (Hive Query Language)
Schema | Required on write (static schema) | Required on read (dynamic schema)
Speed | Reads are fast | Both reads and writes are fast
Cost | Licensed | Free
Use Case | OLTP (Online Transaction Processing) | Analytics (audio, video, logs, etc.), data discovery
Data Objects | Works on relational tables | Works on key/value pairs
Throughput | Low | High
Scalability | Vertical | Horizontal
Hardware Profile | High-end servers | Commodity/utility hardware
Integrity | High (ACID) | Low

Components of Hadoop

There are three core components of Hadoop as mentioned earlier. They are HDFS,
MapReduce, and YARN. These together form the Hadoop framework architecture.

1) HDFS (Hadoop Distributed File System):


 It is a data storage system. Since the data sets are huge, it uses a distributed system to
store this data.
 Data is stored in blocks, each 128 MB by default. HDFS consists of a NameNode and DataNodes; there can be only one active NameNode but multiple DataNodes.

Features:
 The storage is distributed to handle a large data pool
 Distribution increases data security
 It is fault-tolerant; replicated copies of a block on other nodes can take over if one copy fails (a small usage sketch follows below)
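
As a hedged usage sketch, the following Java example uses the Hadoop FileSystem API to write and read a small file on HDFS; it assumes a reachable cluster, and the NameNode address and the /user/demo/hello.txt path are placeholders, not values from these notes.

```java
// Minimal sketch of a client writing and reading a file on HDFS.
// The NameNode address and file path below are placeholders.
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsExample {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        conf.set("fs.defaultFS", "hdfs://namenode-host:9000"); // placeholder NameNode address

        FileSystem fs = FileSystem.get(conf);
        Path file = new Path("/user/demo/hello.txt");

        // The client asks the NameNode for metadata; the bytes themselves are
        // split into blocks (128 MB by default) and stored on DataNodes.
        try (FSDataOutputStream out = fs.create(file, true)) {
            out.writeUTF("hello hdfs");
        }
        try (FSDataInputStream in = fs.open(file)) {
            System.out.println(in.readUTF());
        }
        fs.close();
    }
}
```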

2) MapReduce:
 The MapReduce framework is the processing unit. All data is distributed and
processed parallelly.
 There is a MasterNode that distributes data amongst SlaveNodes. The SlaveNodes do
the processing and send it back to the MasterNode.
Features:
 Consists of two phases, Map Phase and Reduce Phase.
 Processes big data faster, with multiple nodes working in parallel

3) YARN (Yet Another Resource Negotiator):
 It is the resource management unit of the Hadoop framework.
 The data which is stored can be processed with help of YARN using data processing
engines like interactive processing. It can be used to fetch any sort of data analysis.

Features:
 It acts like an operating system for the cluster, managing resources for processing the data stored on HDFS
 It helps to schedule tasks so as to avoid overloading any system

Hadoop Architecture
o The Hadoop architecture is a package of the file system, MapReduce engine and the
HDFS (Hadoop Distributed File System). The MapReduce engine can be
MapReduce/MR1 or YARN/MR2.

o A Hadoop cluster consists of a single master and multiple slave nodes. The master node
includes Job Tracker, Task Tracker, NameNode, and DataNode whereas the slave node
includes DataNode and TaskTracker.


Hadoop Distributed File System


 The Hadoop Distributed File System (HDFS) is a distributed file system for Hadoop. It uses a master/slave architecture consisting of a single NameNode, which performs the role of master, and multiple DataNodes, which perform the role of slaves.
 Both the NameNode and DataNodes are capable of running on commodity machines. HDFS is developed in Java, so any machine that supports Java can easily run the NameNode and DataNode software.

Components of HDFS
a) NameNode
 It is a single master server exist in the HDFS cluster.
 As it is a single node, it may become a single point of failure.
 It manages the file system namespace by executing an operation like the opening,
renaming and closing the files.
 It simplifies the architecture of the system.

b) DataNode
 The HDFS cluster contains multiple DataNodes.
 Each DataNode contains multiple data blocks.
 These data blocks are used to store data.
 It is the responsibility of DataNode to read and write requests from the file
system's clients.
 It performs block creation, deletion, and replication upon instruction from the
NameNode.

c) Job Tracker
 The role of the Job Tracker is to accept MapReduce jobs from clients and process the data using the NameNode.
 In response, the NameNode provides metadata to the Job Tracker.

d) Task Tracker
 It works as a slave node for the Job Tracker.
 It receives tasks and code from the Job Tracker and applies that code to the file. This process can also be called a Mapper.


MapReduce Layer
 MapReduce processing comes into play when the client application submits a MapReduce job to the Job Tracker.
 In response, the Job Tracker sends the request to the appropriate Task Trackers. Sometimes a TaskTracker fails or times out; in such a case, that part of the job is rescheduled.

Difference between Hadoop 1 and Hadoop 2


 Hadoop is an open source software programming framework for storing a large
amount of data and performing the computation. Its framework is based on Java
programming with some native code in C and shell scripts.

Hadoop 1 vs Hadoop 2

Components: In Hadoop 1 we have MapReduce, but Hadoop 2 has YARN (Yet Another Resource Negotiator) and MapReduce version 2.

Hadoop 1 | Hadoop 2
HDFS | HDFS
MapReduce | YARN / MRv2

Daemons:

Hadoop 1 | Hadoop 2
Namenode | Namenode
Datanode | Datanode
Secondary Namenode | Secondary Namenode
Job Tracker | Resource Manager
Task Tracker | Node Manager

Working:

 In Hadoop 1, HDFS is used for storage, and on top of it, MapReduce works as both resource management and data processing. This double workload on MapReduce affects performance.
 In Hadoop 2, HDFS is again used for storage, and on top of HDFS there is YARN, which works as resource management. It allocates the resources and keeps everything running.

Limitations:

 Hadoop 1 is a master-slave architecture. It consists of a single master and multiple slaves. If the master node crashes, then irrespective of your best slave nodes, the cluster is destroyed. Recreating that cluster, which means copying system files, image files, etc. onto another system, is too time-consuming, which organizations cannot tolerate today.
 Hadoop 2 is also a master-slave architecture, but it consists of multiple masters (i.e., active NameNodes and standby NameNodes) and multiple slaves. If the master node crashes, a standby master node takes over. You can configure multiple combinations of active-standby nodes. Thus Hadoop 2 eliminates the problem of a single point of failure.

Ecosystem

 Oozie is basically a workflow scheduler. It decides the particular time for jobs to execute according to their dependencies.
 Pig, Hive and Mahout are data processing tools that work on top of Hadoop.
 Sqoop is used to import and export structured data; you can directly move data between HDFS and a SQL database.
 Flume is used to ingest unstructured and streaming data.

Hadoop Ecosystem

The Hadoop Ecosystem is a platform that provides various services to solve big data problems. It includes
Apache projects and various commercial tools and solutions. The four major elements of Hadoop are:

 HDFS (Hadoop Distributed File System)


 MapReduce (Programming-based Data Processing)
 YARN (Yet Another Resource Negotiator)
 Hadoop Common

Most of the tools in the ecosystem supplement or support these major elements, working collectively to provide services such as data ingestion, analysis, storage, and maintenance.

Components of the Hadoop Ecosystem

Core Components

1. HDFS (Hadoop Distributed File System)


o Responsible for storing large datasets of structured or unstructured data.
o Maintains metadata in the form of log files.
o Consists of:
 Name Node (stores metadata)
 Data Nodes (store actual data on commodity hardware)
2. YARN (Yet Another Resource Negotiator)
o Manages resources across clusters.
o Performs scheduling and resource allocation.
o Consists of:
 Resource Manager (allocates resources to applications)
 Node Managers (manage resources like CPU, memory, and bandwidth per machine)
 Application Manager (interface between Resource Manager and Node Manager)
3. MapReduce
o Uses distributed and parallel algorithms to process large datasets.
o Works in two stages:
 Map(): Sorts and filters data, organizes it into key-value pairs.
 Reduce(): Summarizes and aggregates the mapped data.

Supporting Components

4. Apache Spark
o In-memory data processing for real-time analytics.
o Handles batch processing, interactive/iterative processing, and graph conversions.
5. PIG & HIVE
o PIG: Developed by Yahoo, uses Pig Latin (query-based language similar to SQL) for structuring
data flow and analysis.
o HIVE: Uses HQL (Hive Query Language) for reading/writing large datasets, supporting real-time
and batch processing.
6. Apache HBase

o A NoSQL database that supports structured, semi-structured, and unstructured data.
o Inspired by Google's BigTable, designed for real-time data retrieval.
7. Mahout & Spark MLlib
o Libraries for machine learning algorithms.
o Provide functionalities like collaborative filtering, clustering, and classification.
8. Solr & Lucene
o Lucene: Java-based search library supporting spell-checking and indexing.
o Solr: Uses Lucene for searching and indexing large datasets.
9. Zookeeper
o Manages coordination, synchronization, inter-component communication, and clustering.
10. Oozie
o Job scheduling framework.
o Supports:
 Oozie Workflows (sequential job execution)
 Oozie Coordinator Jobs (triggered by data availability or external events)

Introduction to MapReduce in Hadoop


 MapReduce is a software framework and programming model used for processing huge amounts of data. MapReduce programs work in two phases, namely Map and Reduce.
 Map tasks deal with the splitting and mapping of data, while Reduce tasks shuffle and reduce the data.
 Hadoop is capable of running MapReduce programs written in various languages: Java, Ruby, Python, and C++. MapReduce programs are parallel in nature and are thus very useful for performing large-scale data analysis using multiple machines in the cluster.
 The input to each phase is key-value pairs. In addition, every programmer needs to
specify two functions: map function and reduce function.

Processing data with Hadoop using MapReduce


 MapReduce is a programming framework that allows us to perform distributed and


parallel processing on large data sets in a distributed environment.

MapReduce: Key Concepts and Process

MapReduce consists of two distinct tasks: Map and Reduce.

1. Execution Order
o As the name suggests, the Reducer phase takes place after the Mapper phase is completed.
2. Map Phase
o The Map job is the first step in the process.
o It reads and processes a block of data.
o Produces key-value pairs as intermediate outputs.
3. Intermediate Data Transfer
o The output of a Mapper (key-value pairs) is passed as input to the Reducer.
o The Reducer receives key-value pairs from multiple Map jobs.
4. Reduce Phase
o The Reducer aggregates intermediate key-value pairs into a smaller set of tuples.
o This results in the final output of the MapReduce process.
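
The classic word-count example makes this key-value flow concrete. The sketch below is illustrative (not taken from these notes) and uses the org.apache.hadoop.mapreduce API: the Mapper emits (word, 1) pairs, and after shuffle and sort the Reducer sums the values for each word.

```java
// Word-count sketch: the Map phase emits (word, 1) pairs, the Reduce phase
// sums the values for each key.
import java.io.IOException;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

public class WordCount {

    public static class TokenMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
        private static final IntWritable ONE = new IntWritable(1);
        private final Text word = new Text();

        @Override
        protected void map(LongWritable offset, Text line, Context context)
                throws IOException, InterruptedException {
            for (String token : line.toString().split("\\s+")) {
                if (!token.isEmpty()) {
                    word.set(token);
                    context.write(word, ONE);      // intermediate key-value pair
                }
            }
        }
    }

    public static class SumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
        @Override
        protected void reduce(Text word, Iterable<IntWritable> counts, Context context)
                throws IOException, InterruptedException {
            int sum = 0;
            for (IntWritable c : counts) {
                sum += c.get();                    // aggregate all values for this key
            }
            context.write(word, new IntWritable(sum));
        }
    }
}
```

For an input line such as "cat dog cat", the Mapper emits (cat,1), (dog,1), (cat,1); after shuffle and sort the Reducer receives (cat,[1,1]) and (dog,[1]) and writes (cat,2) and (dog,1).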

MapReduce and Its Components


MapReduce majorly consists of the following three classes:

1. Mapper Class

 The first stage in data processing using MapReduce.
 The RecordReader processes each input record and generates the corresponding key-value pair.
 Hadoop stores this intermediate Mapper output on the local disk.

Input Split

 A logical representation of data.


 Represents a block of work containing a single map task in the MapReduce program.

RecordReader

 Interacts with the Input Split.


 Converts the obtained data into key-value pairs.

2. Reducer Class

 The intermediate output generated from the Mapper is fed to the Reducer.
 The Reducer processes the data and generates the final output, which is then saved in HDFS.

3. Driver Class

 The major component in a MapReduce job.


 Responsible for setting up a MapReduce job to run in Hadoop.
 Specifies:
o Names of the Mapper and Reducer classes.
o Data types and their respective job names.
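
A minimal driver sketch is shown below, assuming the WordCount Mapper and Reducer classes from the earlier sketch; the input and output paths are read from the command line and are illustrative, not part of the original notes.

```java
// Minimal driver sketch: configures and submits the job, naming the Mapper,
// Reducer, output key/value types, and the input/output paths.
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCountDriver {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        Job job = Job.getInstance(conf, "word count");

        job.setJarByClass(WordCountDriver.class);
        job.setMapperClass(WordCount.TokenMapper.class);   // classes from the sketch above
        job.setReducerClass(WordCount.SumReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);

        FileInputFormat.addInputPath(job, new Path(args[0]));    // e.g. an HDFS directory
        FileOutputFormat.setOutputPath(job, new Path(args[1]));  // must not already exist

        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}
```

Such a driver is typically packaged into a JAR and launched with a command like hadoop jar wordcount.jar WordCountDriver /input /output (paths illustrative).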

Introduction to YARN
Yet Another Resource Negotiator (YARN) takes programming beyond Java, making Hadoop interactive and allowing other applications like HBase, Spark, etc. to work on it.

 Different YARN applications can co-exist on the same cluster.


 This enables MapReduce, HBase, and Spark to run simultaneously, leading to better manageability and
optimized cluster utilization.

Components of YARN
1. Client
o Used for submitting MapReduce jobs.
2. Resource Manager
o Manages the use of resources across the cluster.
3. Node Manager
o Launches and monitors compute containers on machines in the cluster.
4. MapReduce Application Master
o Checks tasks running the MapReduce job.
o Both the Application Master and MapReduce tasks run in containers.
o These containers are scheduled by the Resource Manager and managed by the Node Managers.

Evolution from Job Tracker & Task Tracker


 In previous versions of Hadoop, JobTracker & TaskTracker were responsible for handling resources
and managing progress.

 However, in Hadoop 2.0, Resource Manager and Node Manager replaced JobTracker & TaskTracker to overcome their shortcomings.


Hadoop Yarn Architecture

The Apache YARN framework consists of a master daemon known as the “Resource Manager”, a slave daemon called the Node Manager (one per slave node), and an Application Master (one per application).

1. Resource Manager (RM)

 It is the master daemon of Yarn. RM manages the global assignments of resources


(CPU and memory) among all the applications.
 It arbitrates system resources between competing applications.
 Resource Manager has two Main components:

o Scheduler
o Application manager

a) Scheduler

 The scheduler is responsible for allocating the resources to the running application.
 The scheduler is a pure scheduler: it performs no monitoring or tracking of the application, and it does not guarantee restarting failed tasks, whether they fail due to application failure or hardware failure.

b) Application Manager

 It manages running Application Masters in the cluster, i.e., it is responsible for


starting application masters and for monitoring and restarting them on different nodes
in case of failures.

2.Node Manager (NM)

 It is the slave daemon of Yarn.


 The NM is responsible for launching containers, monitoring their resource usage, and reporting it to the Resource Manager. It manages the user processes on that machine.
 Yarn Node Manager also tracks the health of the node on which it is running.
 The design also allows plugging long-running auxiliary services to the NM; these are
application-specific services, specified as part of the configurations and loaded by the
NM during startup.
 Shuffle is a typical auxiliary service provided by the NMs for MapReduce applications on YARN.

3.Application Master (AM)

 One Application Master runs per application. It negotiates resources from the Resource Manager and works with the Node Manager. It manages the application life cycle.
 The AM acquires containers from the RM’s Scheduler before contacting the
corresponding NMs to start the application’s individual tasks.

