
Big Data Analytics.

UNIT-II
Syllabus: Introducing Technologies for Handling Big Data: Distributed and Parallel
Computing for Big Data, Introducing Hadoop, and Cloud Computing in Big Data.
Understanding the Hadoop Ecosystem: Hadoop Ecosystem, Hadoop Distributed File System,
MapReduce, Hadoop YARN, Hive, Pig, Sqoop, Zookeeper, Flume, Oozie.

Difference between Parallel Computing and Distributed Computing
There are mainly two types of computation: parallel computing and distributed computing. A
computer system performs tasks according to the instructions given to it, but a single
processor can execute only one task at a time, which is not efficient. Parallel computing
solves this problem by allowing numerous processors to accomplish tasks simultaneously, and
modern computers support parallel processing to improve system performance. In contrast,
distributed computing enables several computers to communicate with one another over a
network and collaborate to achieve a common goal. Distributed computing is commonly used by
organizations such as Facebook and Google that allow people to share resources.

In this article, you will learn about the difference between Parallel Computing and Distributed
Computing. But before discussing the differences, you must know about parallel computing and
distributed computing.

What is Parallel Computing?


It is also known as parallel processing. It utilizes several processors, each of which
completes the portion of the task allocated to it. In other words, parallel computing involves
performing numerous tasks simultaneously. Either a shared memory or a distributed memory
system can be used for parallel computing. In shared memory systems, all CPUs share the same
memory; in distributed memory systems, each processor has its own local memory.

Parallel computing provides numerous advantages. It helps to increase CPU utilization and
improve performance because several processors work simultaneously. Moreover, the failure of
one CPU has no impact on the functionality of the other CPUs. However, if one processor needs
results or instructions from another, the communication between them can introduce latency.
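To make the idea concrete, here is a minimal Java sketch (added for illustration, not part of the original notes) that computes the same sum of squares sequentially and in parallel. The parallel version splits the work across the machine's CPU cores, which is the shared-memory style of parallelism described above.

import java.util.stream.LongStream;

public class ParallelSum {
    public static void main(String[] args) {
        // Sequential version: one thread walks the whole range.
        long serial = LongStream.rangeClosed(1, 1_000_000)
                                .map(n -> n * n)
                                .sum();

        // Parallel version: the range is split across the available CPU cores.
        // All threads share the same heap, i.e. a shared-memory model.
        long parallel = LongStream.rangeClosed(1, 1_000_000)
                                  .parallel()
                                  .map(n -> n * n)
                                  .sum();

        // Same result, but the second computation ran on several processors at once.
        System.out.println(serial == parallel);
    }
}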

Advantages and Disadvantages of Parallel Computing


There are various advantages and disadvantages of parallel computing. Some of the advantages
and disadvantages are as follows:

Advantages

1. It saves time and money because many resources working together cut down on time and costs.
2. Larger problems that are difficult to solve with serial computing can be tackled.
3. You can do many things at once using many computing resources.
4. Parallel computing is much better than serial computing for modeling, simulating, and
comprehending complicated real-world events.

Disadvantages

1. The multi-core architectures consume a lot of power.


2. Parallel solutions are more difficult to implement, debug, and prove correct due to the
complexity of communication and coordination, and poorly designed parallel programs can even
perform worse than their serial equivalents.

What is Distributed Computing?


It comprises several software components that reside on different systems but operate as a single
system. A distributed system's computers can be physically close together and linked by a local
network or geographically distant and linked by a wide area network (WAN). A distributed
system can be made up of any number of different configurations, such as mainframes, PCs,
workstations, and minicomputers. The main aim of distributed computing is to make a network
work as a single computer.

There are various benefits of using distributed computing. It enables scalability and makes it
simpler to share resources. It also aids in the efficiency of computation processes.

Advantages and Disadvantages of Distributed Computing


There are various advantages and disadvantages of distributed computing. Some of the
advantages and disadvantages are as follows:

Advantages

1. It is flexible, making it simple to install, use, and debug new services.


2. In distributed computing, you may add multiple machines as required.
3. If the system crashes on one server, that doesn't affect other servers.
4. A distributed computer system may combine the computational capacity of several computers,
making it faster than traditional systems.

Disadvantages

1. Data security and sharing are the main issues in distributed systems because of the open
nature of such systems.
2. Because of the distribution across multiple servers, troubleshooting and diagnostics are more
challenging.
3. The main disadvantage of distributed computer systems is the lack of software support.

Key differences between Parallel Computing and Distributed Computing

Here, you will learn the various key differences between parallel computing and distributed
computing. Some of the key differences are as follows:
1. Parallel computing is a sort of computation in which various tasks or processes are run at the same
time. In contrast, distributed computing is that type of computing in which the components are
located on various networked systems that interact and coordinate their actions by passing
messages to one another.
2. In parallel computing, processors communicate with one another via a bus. On the other
hand, computer systems in distributed computing connect with one another via a network.
3. Parallel computing takes place on a single computer. In contrast, distributed computing takes
place on several computers.
4. Parallel computing aids in improving system performance. On the other hand, distributed
computing allows for scalability, resource sharing, and the efficient completion of computation
tasks.
5. The computer in parallel computing can have shared or distributed memory. In contrast, every
system in distributed computing has its memory.
6. Multiple processors execute multiple tasks simultaneously in parallel computing. In contrast,
many computer systems execute tasks simultaneously in distributed computing.

Head-to-head Comparison between Parallel Computing and Distributed Computing

Definition
  Parallel Computing: A type of computation in which various processes run simultaneously.
  Distributed Computing: A type of computing in which the components are located on various
  networked systems that interact and coordinate their actions by passing messages to one another.

Communication
  Parallel Computing: The processors communicate with one another via a bus.
  Distributed Computing: The computer systems connect with one another via a network.

Functionality
  Parallel Computing: Several processors execute various tasks simultaneously.
  Distributed Computing: Several computers execute tasks simultaneously.

Number of Computers
  Parallel Computing: It occurs in a single computer system.
  Distributed Computing: It involves various computers.

Memory
  Parallel Computing: The system may have distributed or shared memory.
  Distributed Computing: Each computer system has its own memory.

Usage
  Parallel Computing: It helps to improve system performance.
  Distributed Computing: It allows for scalability, resource sharing, and the efficient
  completion of computation tasks.

Conclusion
There are two types of computations: parallel computing and distributed computing. Parallel
computing allows several processors to accomplish their tasks at the same time. In contrast,
distributed computing splits a single task among numerous systems to achieve a common goal.

Role of Cloud Computing in Big Data Analytics
In an age where information is everything, organizations are
overwhelmed by data. This information, often called “big
data,” refers to huge, complicated datasets that ordinary tools
and procedures cannot process. Businesses are increasingly turning to
cloud computing in order to unlock the true value of big data and
make use of it.
This article examines how cloud platforms can be used for storing,
managing and analyzing vast amounts of data effectively. It covers
the benefits that cloud computing brings to big data analytics,
the services offered by cloud providers, and the considerations
involved in adopting a cloud-based big data strategy.
Table of Content
 The Challenges of Big Data
 Cloud Computing: The Big Data Solution
 Cloud Services for Big Data Analytics
 Benefits Beyond Core Analytics Services
 Choosing the Right Cloud Platform for Big Data Analytics
 Security Considerations for Cloud-Based Big Data Analytics
 Real-World Examples: Unveiling Insights Across Industries
 The Future Of Cloud Computing And Big Data Analytics
 Conclusion
The Challenges of Big Data
Big data poses several problems that impede traditional methods
of analyzing data. These include:
1. Volume: The amount of data being created today is mind-
bogglingly large. Regular storage systems do not have enough
space to accommodate all these massive sets.
2. Variety: Big data comes in different forms such as structured
(relational databases), unstructured (text files, pictures
or videos from social media posts), and semi-structured
logs or emails. Traditional tools struggle with this complexity.
3. Velocity: The speed at which new records are generated
keeps rising, so real-time analysis becomes difficult when
processing is slow.
4. Veracity: Accurate findings require accurate data, since the
garbage-in, garbage-out rule applies here too. Cleaning a very
large dataset with traditional methods can take an extremely
long time.

Cloud Computing: The Big Data Solution
Cloud computing offers an effective solution for dealing with very
large data sets. Organizations can store, manage and analyze their
big data efficiently by leveraging the scalability and on-demand
resources (such as storage and compute capacity) provided by the
cloud. Here’s how:
 Scalability: Cloud platforms provide large amounts of storage
and processing power on demand, without having to buy any
hardware infrastructure in advance. If a lot of processing power
is required during certain periods, scaling up is quick and easy.
 Cost Effectiveness: Organizations pay only for what they use,
unlike maintaining on-site infrastructure that may sit idle for
much of the year, which results in huge savings.
 Performance: Cloud providers offer high-performance computing
resources, such as servers with advanced networking and
in-memory capabilities, which enable faster data processing and
real-time analytics.
 Accessibility: Cloud-based solutions are accessible from
anywhere with an internet connection, so geographical location
never hinders a business from getting value out of its
information stores. This encourages teamwork among members who
are far apart geographically and enables analysis to happen
around the clock.
 Security: Sensitive data must be well guarded against
unauthorized access, modification or loss, so cloud providers
invest heavily in security measures such as encryption, access
control and data residency options for compliance purposes.

Cloud Services for Big Data Analytics


1. Data Ingestion
 Managed data pipelines: These services automate the
collection, transformation and loading of data from different
sources into your cloud storage, e.g. Apache Airflow or AWS
Glue, offered by various service providers.
 Streaming ingestion: Real-time ingestion can be achieved
using services such as Apache Kafka, which allows integration
with social media feeds and other continuous sources.
2. Data Storage
 Object storage: Highly scalable and cost-effective object
storage options such as Amazon S3, Azure Blob Storage and
Google Cloud Storage are the best choice for storing vast
quantities of unstructured and semi-structured data.
 Data lakes: A cloud data lake serves as a centralized
storage system that saves all of the data in its original format,
giving users the opportunity to examine and analyze it at a
later time. This saves time because flexible processing can be
applied to the data whenever it is needed.
 Data warehouses: A cloud data warehouse stores large
datasets using structured schemas for storage and analysis,
which makes querying and reporting easier and faster.
3. Data Processing and Transformation:
 Managed Hadoop and Spark environments: Complex
infrastructure setup can be avoided by using pre-configured
managed Hadoop clusters or Spark clusters provided by
various cloud services.
 Serverless information processing: With serverless
compute services like AWS Lambda or Azure Functions, you
can run data processing tasks without managing servers. This
simplifies development and scaling.
 Data anonymization and masking: Cloud platforms provide
tools and services to comply with privacy regulations by
anonymizing or masking confidential datasets.
4. Data Analytics and Visualization:
 Business intelligence (BI) tools: Some cloud-based BI
applications like Tableau, Power BI, Looker etc. provide
interactive dashboards and reports for visual big data analysis.
 Managed machine learning (ML) platforms: Services such
as Google Cloud AI Platform, Amazon SageMaker and Azure
Machine Learning allow ML models to be developed, tested and
deployed on massive datasets.
 Predictive analytics and data mining: Cloud platforms are
equipped with built-in facilities both for predictive
analytics and data mining that can help you find patterns or
trends in your data to assist you in future forecasting or better
decision making.
Benefits Beyond Core Analytics Services
 Collaboration: Data scientists, analysts and business users
can collaborate easily because all team members access the same
centralized data and can share insights through the shared
storage and communication channels provided by these platforms.
 Disaster Recovery: If something unexpected happens, such as a
power failure, most cloud providers ensure minimal downtime
thanks to their robust disaster recovery capabilities.
 Innovation: Organizations can take advantage of cutting-edge
technologies available through cloud platforms, such as
Artificial Intelligence (AI), to build new data-driven solutions.
By using the comprehensive suite of services from different
cloud providers, organizations can create an elastic and
scalable ecosystem for big data analytics that enables
maximum value extraction from their information assets.
Choosing the Right Cloud Platform for Big Data Analytics
When choosing a cloud platform for big data analytics, there
are several factors that need to be considered:
 Scalability requirements: Evaluate whether the platform can
scale resources up or down to match your fluctuating needs for
processing power and storage.
 Security features: Make sure the chosen provider has strong
security measures in place, especially when dealing with
sensitive datasets, so that the privacy of the individuals
involved is not compromised during analysis.
 Cost considerations: Compare the pricing models offered by
various providers against your expected usage patterns and
current budget, and then select the most appropriate one.
 Integration capabilities: Check how well the platform
integrates with your existing data infrastructure (databases,
warehouses and ETL tools such as Informatica PowerCenter) so
that compatibility issues do not arise later during
implementation.
 Vendor lock-in: Choose a platform that supports open
standards; this provides the flexibility needed if you later
decide to migrate away from the current vendor or product line,
a move that can otherwise require significant investment in
both time and money.
Security Considerations for Cloud-Based Big Data Analytics
Security is always paramount when dealing with large volumes of
information. Here are some key security considerations regarding
cloud-based big-data analytics:
 Data encryption: Ensure all stored data is encrypted, both at
rest and in transit, to safeguard against unauthorized access,
especially when data travels over unsecured networks where it
could be intercepted before reaching its intended recipients.
 Access control: Make sure that only authorized personnel have
access rights, granted individually or by role, to a particular
dataset or storage location (an S3 bucket, for example), so
that security is not compromised during the analysis phase.
 Compliance regulations: Confirm that the cloud provider
complies fully with the industry standards and data protection
regulations relevant to you, especially when dealing with
health-sector information that must remain confidential
throughout its lifecycle.
 Regular security audits: Conduct comprehensive security
audits of your cloud environment regularly to identify potential
vulnerabilities and address them before they can be exploited by
malicious actors, which could otherwise damage the
organization’s reputation or cause financial loss.
 Data Backup and Restoration: Keep an all-inclusive plan for
data backup and restoration so that you can retrieve your files
if a security breach occurs.
Real-World Examples: Unveiling Insights Across Industries
Cloud-supported big data analysis is changing the ways of working
and decision-making in many companies. Here are a few interesting
instances that demonstrate the technology’s capabilities:

Examples
1. Retail Industry: The Power of Personalization
Think about a retail environment where product
recommendations seem uncannily accurate and marketing
campaigns speak to your soul. This is made possible by cloud-
based big data analytics. Retailers use these tools to process
immense volumes of customer information, such as
purchase history, browsing habits and social media
sentiment. They then apply this knowledge to:
 Customize marketing campaigns: Higher conversion rates
and increased customer satisfaction are achieved through
targeted email blasts and social media ads that cater for
individual preferences.
 Optimize product recommendations: Recommender
systems driven by big data analytics propose products
customers are likely to find interesting thereby increasing sales
and reducing cart abandonment rates.
 Enhance inventory management: Retailers can optimize
their inventory levels by scrutinizing sales trends alongside
customer demand patterns which eliminates stockouts while
minimizing clearance sales.
2. Healthcare: From Diagnosis to Personalized Care
The healthcare industry has rapidly adopted cloud-based big data
analytics for better patient care and operational efficiency. Here’s
how:
 Improved diagnosis: Healthcare providers can now diagnose
patients faster and more accurately by analyzing medical
records together with imaging scans besides wearable device
sensor data.
 Individual treatment plans: Big data analytics makes it
possible to create individualized treatment plans through
identification of factors affecting response to certain drugs or
therapies.
 Predictive prevention care: Through cloud based analytics it
is possible to identify people at high risk of particular illnesses
before they actually occur thus leading to better outcomes for
patients and lower healthcare expenses.
3. Financial Services: Risk Management & Fraud
Detection
Effectively managing risks and making informed decisions are
crucial in the ever changing banking industry. Here’s how
financial companies can use big data analytics in the cloud:
 Identify fraudulent activity: By using advanced algorithms
to make sense of real-time transaction patterns, banks are able
to detect and prevent fraudulent transactions from taking
place, thereby protecting both themselves and customers.
 Evaluate credit riskiness: By checking borrowers’ financial
histories against other types of relevant data points, lenders
can make better choices concerning approvals on loans and
interest rates hence reducing credit risk.
 Develop cutting-edge financial products: Banks can use
big data analytics to craft unique financial products for
different market segments as they continue studying their
clients’ desires and preferences.
These are only a few instances of the current industry
transformations brought about by cloud-based big data analytics.
It is inevitable that as technology advances and data quantities
expand, more inventive applications will surface, enabling
businesses to obtain more profound insights, make fact-based
decisions, and accomplish remarkable outcomes.
The Future Of Cloud Computing And Big Data Analytics
The future of big data analysis is directly related to that of cloud
computing. The significance of cloud platforms will only increase
as enterprises grapple with information overload and seek deeper
insights. The following are some tendencies to watch out for:
 Hybrid and Multi-Cloud Environments: As per their unique
needs, companies will use more and more Hybrid and Multi
Cloud approaches to take advantage of the specific
capabilities typical for different providers.
 Serverless Computing: Businesses will increasingly adopt
serverless computing because it frees teams from managing the
underlying infrastructure so they can concentrate on analytics.
 Integration Of AI & ML: Cloud platforms will seamlessly
integrate artificial intelligence (AI) alongside machine
learning (ML) functionalities thus enabling advanced
analytics as well as automated decision making.
 Emphasis on Data Governance and Privacy: To keep pace
with shifting rules on data security and privacy, businesses will
need more advanced means of governing their information,
which cloud providers can supply.
Conclusion
Cloud computing has become the bedrock of big data analytics; it
is inexpensive, flexible, secure, and capable of accommodating
large quantities of information that companies can use to make
sense of what’s going on around them. As cloud technology
and big data analytics continue to evolve, we can expect
even more powerful tools and services to emerge,
enabling organizations to unlock the true potential of their
data and make data-driven decisions that fuel innovation
and success.


Introduction to Hadoop Distributed File System (HDFS)


With growing data velocity, the data size easily outgrows the
storage limit of a single machine. A solution is to store the data
across a network of machines. Such filesystems are
called distributed filesystems. Since data is stored across a
network, all the complications of a network come into play.
This is where Hadoop comes in. It provides one of the most
reliable filesystems. HDFS (Hadoop Distributed File System) is a
uniquely designed filesystem that provides storage for extremely
large files with a streaming data access pattern, and it runs on
commodity hardware. Let’s elaborate on these terms:
 Extremely large files: Here we are talking about data in the
range of petabytes (1 PB = 1000 TB).
 Streaming data access pattern: HDFS is designed on the
principle of write-once, read-many-times. Once data is written,
large portions of the dataset can be processed any number of
times.
 Commodity hardware: Hardware that is inexpensive and easily
available in the market. This is one of the features that
especially distinguishes HDFS from other file systems.
Nodes: Master-slave nodes typically form the HDFS cluster.
1. NameNode (Master Node):
 Manages all the slave nodes and assigns work to them.
 It executes filesystem namespace operations like opening,
closing, and renaming files and directories.
 It should be deployed on reliable, high-end hardware, not on
commodity hardware.
2. DataNode (Slave Node):
 Actual worker nodes, which do the actual work like reading,
writing and processing data.
 They also perform creation, deletion and replication of blocks
upon instruction from the master.
 They can be deployed on commodity hardware.
HDFS daemons: Daemons are the processes running in the
background.
 NameNode:
o Runs on the master node.
o Stores metadata (data about data) such as file paths, the
number of blocks, and block IDs.
o Requires a large amount of RAM.
o Stores metadata in RAM for fast retrieval, i.e. to reduce
seek time, though a persistent copy is also kept on disk.
 DataNodes:
o Run on slave nodes.
o Require a large amount of storage, as the data is actually
stored here.
Data storage in HDFS: Now let’s see how the data is stored in a
distributed manner.
Suppose a 100 TB file is inserted. The master node (NameNode)
first divides the file into blocks (the default block size is
128 MB in Hadoop 2.x and above). These blocks are then stored
across different DataNodes (slave nodes). The DataNodes replicate
the blocks among themselves, and information about which blocks
they contain is sent to the master. The default replication factor
is 3, which means 3 replicas of each block are kept in the cluster
(including the original). We can increase or decrease the
replication factor by editing the configuration in hdfs-site.xml.
Note: The master node keeps a record of everything; it knows the
location and details of each and every DataNode and the blocks
they contain, i.e. nothing is done without the permission of the
master node.
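As an illustration of how a client interacts with HDFS, the following hedged Java sketch uses the standard org.apache.hadoop.fs.FileSystem API to write a file with an explicit block size and replication factor. The NameNode address, path and values shown are assumptions made for the example, not fixed requirements.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsWriteExample {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        // Assumed NameNode address; in practice this usually comes from core-site.xml.
        conf.set("fs.defaultFS", "hdfs://namenode:9000");

        FileSystem fs = FileSystem.get(conf);
        Path file = new Path("/user/demo/sample.txt");

        // Create the file with a 128 MB block size and replication factor 3,
        // matching the defaults discussed above.
        try (FSDataOutputStream out =
                     fs.create(file, true, 4096, (short) 3, 128L * 1024 * 1024)) {
            out.writeUTF("hello hdfs");
        }

        System.out.println("Replication: " + fs.getFileStatus(file).getReplication());
        fs.close();
    }
}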
Why divide the file into blocks?
Answer: Suppose we don’t divide the file; it would be very
difficult to store a 100 TB file on a single machine. Even if we
could store it, every read and write operation on that whole file
would incur a very high seek time. But if we have multiple blocks
of size 128 MB, it becomes easier to perform various read and
write operations on them compared to doing it on the whole file at
once. So we divide the file to get faster data access, i.e. to
reduce seek time.
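The block arithmetic itself is simple; the small Java calculation below (added only for illustration) shows how many 128 MB blocks, and how many stored block copies, a 100 TB file implies with the default replication factor of 3.

public class BlockMath {
    public static void main(String[] args) {
        long fileBytes  = 100L * 1024 * 1024 * 1024 * 1024; // 100 TB
        long blockBytes = 128L * 1024 * 1024;               // default 128 MB block

        long blocks   = (fileBytes + blockBytes - 1) / blockBytes; // round up
        long replicas = blocks * 3;                                // replication factor 3

        System.out.println(blocks + " blocks, " + replicas + " block copies in the cluster");
        // Prints: 819200 blocks, 2457600 block copies in the cluster
    }
}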
Why replicate the blocks across data nodes while storing?
Answer: Suppose we don’t replicate and a particular block is
present only on DataNode D1. If D1 crashes, we lose that block,
which makes the overall data inconsistent and faulty. So we
replicate the blocks to achieve fault tolerance.
Terms related to HDFS:
 HeartBeat: It is the signal that a DataNode continuously sends
to the NameNode. If the NameNode doesn’t receive a heartbeat from
a DataNode, it considers that node dead.
 Balancing: If a DataNode crashes, the blocks present on it are
gone too, and those blocks become under-replicated compared to
the remaining blocks. The master node (NameNode) then signals the
DataNodes containing replicas of the lost blocks to replicate
them, so that the overall distribution of blocks is balanced.
 Replication: It is performed by the DataNodes.
Note: No two replicas of the same block are placed on the same
DataNode.
Features:
 Distributed data storage.
 Blocks reduce seek time.
 The data is highly available as the same block is present at
multiple datanodes.
 Even if multiple datanodes are down we can still do our work,
thus making it highly reliable.
 High fault tolerance.
Limitations: Though HDFS provides many features, there are some
areas where it doesn’t work well.
 Low-latency data access: Applications that require low-latency
access to data, i.e. in the range of milliseconds, will not work
well with HDFS, because HDFS is designed for high throughput of
data even at the cost of latency.
 Small file problem: Having lots of small files results in lots
of seeks and lots of hops from one DataNode to another to
retrieve each small file, which is a very inefficient data access
pattern.

HADOOP ECOSYSTEM
Overview: Apache Hadoop is an open source framework intended to make interaction
with big data easier. However, for those who are not acquainted with this technology, one
question arises: what is big data? Big data is a term for data sets which can’t be processed
efficiently with traditional methodologies such as an RDBMS. Hadoop has made its place in the
industries and companies that need to work on large data sets which are sensitive and need
efficient handling. Hadoop is a framework that enables the processing of large data sets which
reside in the form of clusters. Being a framework, Hadoop is made up of several modules that
are supported by a large ecosystem of technologies.
Introduction: The Hadoop Ecosystem is a platform or suite which provides various services
to solve big data problems. It includes Apache projects and various commercial tools
and solutions. There are four major elements of Hadoop, i.e. HDFS, MapReduce, YARN,
and Hadoop Common Utilities. Most of the other tools or solutions are used to supplement or
support these major elements. All these tools work collectively to provide services such as
ingestion, analysis, storage and maintenance of data.
Following are the components that collectively form a Hadoop ecosystem:
 HDFS: Hadoop Distributed File System
 YARN: Yet Another Resource Negotiator
 MapReduce: Programming based Data Processing
 Spark: In-Memory data processing
 PIG, HIVE: Query based processing of data services
 HBase: NoSQL Database
 Mahout, Spark MLLib: Machine Learning algorithm libraries
 Solr, Lucene: Searching and Indexing
 Zookeeper: Managing cluster
 Oozie: Job Scheduling

Note: Apart from the above-mentioned components, there are many other components too
that are part of the Hadoop ecosystem.
All these toolkits or components revolve around one thing: data. That’s the beauty of
Hadoop: everything revolves around data, which makes processing it easier.
HDFS:
 HDFS is the primary or major component of Hadoop ecosystem and is responsible for
storing large data sets of structured or unstructured data across various nodes and
thereby maintaining the metadata in the form of log files.
 HDFS consists of two core components i.e.
1. Name node
2. Data Node
 Name Node is the prime node that contains metadata (data about data) and requires
comparatively fewer resources than the Data Nodes, which store the actual data. These Data
Nodes are commodity hardware in the distributed environment, which undoubtedly makes
Hadoop cost effective.
 HDFS maintains all the coordination between the clusters and hardware, thus working
at the heart of the system.
YARN:
 Yet Another Resource Negotiator, as the name implies, helps to manage the resources
across the clusters. In short, it performs scheduling and resource allocation for the
Hadoop system.
 It consists of three major components, i.e.
1. Resource Manager
2. Node Manager
3. Application Manager
 The Resource Manager has the privilege of allocating resources for the applications in the
system, whereas the Node Managers work on the allocation of resources such as CPU,
memory and bandwidth per machine and later acknowledge the Resource Manager. The
Application Manager works as an interface between the Resource Manager and the Node
Managers and performs negotiations as per the requirements of the two.
MapReduce:
 By making use of distributed and parallel algorithms, MapReduce makes it possible to
carry the processing logic over to the data and helps to write applications which transform
big data sets into manageable ones.
 MapReduce makes use of two functions, i.e. Map() and Reduce(), whose tasks are:
1. Map() performs sorting and filtering of data and thereby organizes it into groups.
Map generates key-value pair based results which are later processed by the
Reduce() method.
2. Reduce(), as the name suggests, performs summarization by aggregating the mapped
data. In short, Reduce() takes the output generated by Map() as input and combines
those tuples into a smaller set of tuples.
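The classic word-count example below shows what Map() and Reduce() look like in practice. It is a standard Hadoop MapReduce sketch written here for illustration; the class and field names are our own.

import java.io.IOException;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

public class WordCount {

    // Map(): splits each input line into words and emits (word, 1) pairs.
    public static class TokenizerMapper
            extends Mapper<LongWritable, Text, Text, IntWritable> {
        private static final IntWritable ONE = new IntWritable(1);
        private final Text word = new Text();

        @Override
        protected void map(LongWritable key, Text value, Context context)
                throws IOException, InterruptedException {
            for (String token : value.toString().split("\\s+")) {
                if (!token.isEmpty()) {
                    word.set(token);
                    context.write(word, ONE);
                }
            }
        }
    }

    // Reduce(): receives (word, [1, 1, ...]) and sums the counts.
    public static class SumReducer
            extends Reducer<Text, IntWritable, Text, IntWritable> {
        @Override
        protected void reduce(Text key, Iterable<IntWritable> values, Context context)
                throws IOException, InterruptedException {
            int sum = 0;
            for (IntWritable v : values) {
                sum += v.get();
            }
            context.write(key, new IntWritable(sum));
        }
    }
}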
PIG:
Pig was developed by Yahoo. It works on Pig Latin, a query-based language similar to SQL.
 It is a platform for structuring the data flow and for processing and analyzing huge data sets.
 Pig does the work of executing commands, and in the background all the activities of
MapReduce are taken care of. After processing, Pig stores the result in HDFS.
 The Pig Latin language is specially designed for this framework and runs on Pig Runtime,
just the way Java runs on the JVM.
 Pig provides ease of programming and optimization and hence is a major segment of the
Hadoop Ecosystem.
HIVE:
 With the help of an SQL-like methodology and interface, HIVE performs reading and writing
of large data sets. Its query language is called HQL (Hive Query Language).
 It is highly scalable, as it supports both interactive and batch processing. Also, all the
common SQL data types are supported by Hive, making query processing easier.
 Similar to other query processing frameworks, HIVE comes with two
components: JDBC Drivers and the HIVE Command Line.
 JDBC, along with ODBC drivers, works on establishing data storage permissions and
connections, whereas the HIVE Command Line helps in the processing of queries.
Mahout:
 Mahout adds machine learning capability to a system or application. Machine learning, as
the name suggests, helps a system to develop itself based on patterns, user/environment
interaction, or algorithms.
 It provides various libraries and functionalities such as collaborative filtering, clustering,
and classification, which are core concepts of machine learning. It allows invoking
algorithms as per our need with the help of its own libraries.
Apache Spark:
 It’s a platform that handles all the process-intensive tasks like batch processing,
interactive or iterative real-time processing, graph processing, visualization, etc.
 It uses in-memory resources, which makes it faster than MapReduce in terms of
optimization.
 Spark is best suited for real-time data, whereas Hadoop MapReduce is best suited for
structured data or batch processing; hence both are used in most companies, often side by side.
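For comparison with the MapReduce word count shown earlier, here is a hedged sketch of the same job using Spark's Java API; it keeps intermediate data in memory rather than writing it out between the map and reduce stages. The input and output paths are assumptions for the example.

import java.util.Arrays;

import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaPairRDD;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;

import scala.Tuple2;

public class SparkWordCount {
    public static void main(String[] args) {
        SparkConf conf = new SparkConf().setAppName("SparkWordCount").setMaster("local[*]");
        JavaSparkContext sc = new JavaSparkContext(conf);

        // Read lines from HDFS (assumed path), split into words,
        // and count occurrences, with intermediate results held in memory.
        JavaRDD<String> lines = sc.textFile("hdfs:///user/demo/input.txt");
        JavaPairRDD<String, Integer> counts = lines
                .flatMap(line -> Arrays.asList(line.split("\\s+")).iterator())
                .mapToPair(word -> new Tuple2<>(word, 1))
                .reduceByKey(Integer::sum);

        counts.saveAsTextFile("hdfs:///user/demo/wordcount-output");
        sc.stop();
    }
}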
Apache HBase:
 It’s a NoSQL database which supports all kinds of data and is thus capable of handling
anything in a Hadoop database. It provides capabilities similar to Google’s BigTable and is
thus able to work on big data sets effectively.
 At times when we need to search for or retrieve a few small records in a huge database, the
request must be processed within a very short span of time. At such times, HBase comes in
handy, as it gives us a fault-tolerant way of storing and looking up such data.
Other Components: Apart from all of these, there are some other components too that
carry out a huge task in order to make Hadoop capable of processing large datasets. They
are as follows:
 Solr, Lucene: These are two services that perform the task of searching and indexing with
the help of Java libraries. Lucene is a Java library that also provides spell-check
capabilities; Solr is a search platform that is driven by (built on top of) Lucene.
 Zookeeper: There was a huge problem with the management of coordination and
synchronization among the resources or components of Hadoop, which often resulted in
inconsistency. Zookeeper overcame these problems by providing synchronization,
inter-component communication, grouping, and maintenance.
 Oozie: Oozie simply performs the task of a scheduler, scheduling jobs and binding them
together as a single unit. There are two kinds of jobs, i.e. Oozie workflow jobs and Oozie
coordinator jobs. Oozie workflow jobs are those that need to be executed in a sequentially
ordered manner, whereas Oozie coordinator jobs are those that are triggered when some
data or an external stimulus is given to them.

Hadoop YARN Architecture

YARN stands for “Yet Another Resource Negotiator”. It was introduced in Hadoop 2.0 to
remove the bottleneck of the Job Tracker which was present in Hadoop 1.0. YARN was
described as a “Redesigned Resource Manager” at the time of its launch, but it has now
evolved to be known as a large-scale distributed operating system used for big data
processing.
The YARN architecture basically separates the resource management layer from the processing
layer. The responsibility of the Hadoop 1.0 Job Tracker is now split between the Resource
Manager and the Application Master.

YARN also allows different data processing engines like graph processing, interactive
processing, stream processing as well as batch processing to run and process data stored in
HDFS (Hadoop Distributed File System) thus making the system much more efficient.
Through its various components, it can dynamically allocate various resources and schedule
the application processing. For large volume data processing, it is quite necessary to manage
the available resources properly so that every application can leverage them.
YARN Features: YARN gained popularity because of the following features-
 Scalability: The scheduler in Resource manager of YARN architecture allows Hadoop to
extend and manage thousands of nodes and clusters.
 Compatibility: YARN supports the existing map-reduce applications without disruptions
thus making it compatible with Hadoop 1.0 as well.
 Cluster Utilization: Since YARN supports dynamic utilization of the cluster in Hadoop,
it enables optimized cluster utilization.
 Multi-tenancy: It allows multiple engine access thus giving organizations a benefit of
multi-tenancy.

Hadoop YARN Architecture

The main components of YARN architecture include:

 Client: It submits map-reduce jobs.


 Resource Manager: It is the master daemon of YARN and is responsible for resource
assignment and management among all the applications. Whenever it receives a
processing request, it forwards it to the corresponding node manager and allocates
resources for the completion of the request accordingly. It has two major components:
o Scheduler: It performs scheduling based on the allocated application and
available resources. It is a pure scheduler, means it does not perform other
tasks such as monitoring or tracking and does not guarantee a restart if a task
fails. The YARN scheduler supports plugins such as Capacity Scheduler and
Fair Scheduler to partition the cluster resources.
o Application manager: It is responsible for accepting the application and
negotiating the first container from the resource manager. It also restarts the
Application Master container if a task fails.
 Node Manager: It takes care of an individual node in the Hadoop cluster and manages the
applications and workflow on that particular node. Its primary job is to keep up with the
Resource Manager. It registers with the Resource Manager and sends heartbeats with the
health status of the node. It monitors resource usage, performs log management and also
kills a container based on directions from the Resource Manager. It is also responsible for
creating the container process and starting it at the request of the Application Master.
 Application Master: An application is a single job submitted to the framework. The
Application Master is responsible for negotiating resources with the Resource Manager and
for tracking the status and monitoring the progress of a single application. The Application
Master asks the Node Manager to launch a container by sending it a Container Launch
Context (CLC), which includes everything the application needs to run. Once the
application is started, it sends health reports to the Resource Manager from time to
time.
 Container: It is a collection of physical resources such as RAM, CPU cores and disk on a
single node. The containers are invoked by Container Launch Context(CLC) which is a
record that contains information such as environment variables, security tokens,
dependencies etc.
Application workflow in Hadoop YARN:

1. The client submits an application
2. The Resource Manager allocates a container to start the Application Master
3. The Application Master registers itself with the Resource Manager
4. The Application Master negotiates containers from the Resource Manager
5. The Application Master notifies the Node Manager to launch the containers
6. The application code is executed in the container
7. The client contacts the Resource Manager/Application Master to monitor the application’s
status
8. Once the processing is complete, the Application Master un-registers with the Resource
Manager
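The sketch below, added for illustration, shows how a client program can talk to the Resource Manager through the YarnClient API to list the applications it is tracking and the number of registered Node Managers. It assumes a yarn-site.xml on the classpath pointing at the cluster.

import java.util.List;

import org.apache.hadoop.yarn.api.records.ApplicationReport;
import org.apache.hadoop.yarn.client.api.YarnClient;
import org.apache.hadoop.yarn.conf.YarnConfiguration;

public class ListYarnApps {
    public static void main(String[] args) throws Exception {
        // Picks up yarn-site.xml from the classpath; the Resource Manager
        // address is whatever the cluster configuration defines.
        YarnConfiguration conf = new YarnConfiguration();

        YarnClient client = YarnClient.createYarnClient();
        client.init(conf);
        client.start();

        // Ask the Resource Manager for the applications it is tracking.
        List<ApplicationReport> apps = client.getApplications();
        for (ApplicationReport app : apps) {
            System.out.printf("%s  %s  %s%n",
                    app.getApplicationId(), app.getName(),
                    app.getYarnApplicationState());
        }

        System.out.println("Node managers: "
                + client.getYarnClusterMetrics().getNumNodeManagers());
        client.stop();
    }
}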

Advantages :
 Flexibility: YARN offers flexibility to run various types of distributed processing
systems such as Apache Spark, Apache Flink, Apache Storm, and others. It allows
multiple processing engines to run simultaneously on a single Hadoop cluster.
 Resource Management: YARN provides an efficient way of managing resources in the
Hadoop cluster. It allows administrators to allocate and monitor the resources required by
each application in a cluster, such as CPU, memory, and disk space.
 Scalability: YARN is designed to be highly scalable and can handle thousands of nodes
in a cluster. It can scale up or down based on the requirements of the applications running
on the cluster.
 Improved Performance: YARN offers better performance by providing a centralized
resource management system. It ensures that the resources are optimally utilized, and
applications are efficiently scheduled on the available resources.
 Security: YARN provides robust security features such as Kerberos authentication,
Secure Shell (SSH) access, and secure data transmission. It ensures that the data stored
and processed on the Hadoop cluster is secure.

Disadvantages :

 Complexity: YARN adds complexity to the Hadoop ecosystem. It requires additional


configurations and settings, which can be difficult for users who are not familiar with
YARN.
 Overhead: YARN introduces additional overhead, which can slow down the
performance of the Hadoop cluster. This overhead is required for managing resources and
scheduling applications.
 Latency: YARN introduces additional latency in the Hadoop ecosystem. This latency can
be caused by resource allocation, application scheduling, and communication between
components.
 Single Point of Failure: YARN can be a single point of failure in the Hadoop cluster. If
YARN fails, it can cause the entire cluster to go down. To avoid this, administrators need
to set up a backup YARN instance for high availability.
 Limited Support: YARN has limited support for non-Java programming languages.
Although it supports multiple processing engines, some engines have limited language
support, which can limit the usability of YARN in certain environments.

Apache Hive
Prerequisites – Introduction to Hadoop, Computing Platforms and Technologies
Apache Hive is a data warehouse and ETL tool (ETL stands for "extract, transform, and
load": a process that combines data from multiple sources into a single repository, such as
a data warehouse, data store, or data lake) which provides an SQL-like interface
between the user and the Hadoop Distributed File System (HDFS).
It is built on top of Hadoop. It is a software project that provides data query and analysis. It
facilitates reading, writing and handling wide datasets that are stored in distributed storage
and queried using Structured Query Language (SQL) syntax. It is not built for Online
Transaction Processing (OLTP) workloads. It is frequently used for data warehousing tasks
like data encapsulation, ad-hoc queries, and analysis of huge datasets. It is designed for
scalability, extensibility, performance, fault-tolerance and loose coupling with its input
formats.
Hive was initially developed by Facebook and is now used by companies such as Amazon and
Netflix; it delivers standard SQL functionality for analytics. Without Hive, traditional
SQL-style queries have to be implemented in the MapReduce Java API to execute over
distributed data. Hive provides portability because most data warehousing applications work
with SQL-based query languages.
Apache Hive is a data warehouse software project that is built on top of the Hadoop
ecosystem. It provides an SQL-like interface to query and analyze large datasets stored in
Hadoop’s distributed file system (HDFS) or other compatible storage systems.
Hive uses a language called HiveQL, which is similar to SQL, to allow users to express
data queries, transformations, and analyses in a familiar syntax. HiveQL statements are
compiled into MapReduce jobs, which are then executed on the Hadoop cluster to process
the data.
Hive includes many features that make it a useful tool for big data analysis, including
support for partitioning, indexing, and user-defined functions (UDFs). It also provides a
number of optimization techniques to improve query performance, such as predicate
pushdown, column pruning, and query parallelization.
Hive can be used for a variety of data processing tasks, such as data warehousing, ETL
(extract, transform, load) pipelines, and ad-hoc data analysis. It is widely used in the big
data industry, especially in companies that have adopted the Hadoop ecosystem as their
primary data processing platform.
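As a concrete illustration of the SQL-like interface, the hedged Java sketch below connects to a HiveServer2 instance over JDBC and runs HiveQL. The host, port, table and columns are assumptions made up for the example.

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;

public class HiveQlExample {
    public static void main(String[] args) throws Exception {
        // HiveServer2 endpoint; host, port and database are assumptions here.
        String url = "jdbc:hive2://localhost:10000/default";
        Class.forName("org.apache.hive.jdbc.HiveDriver");

        try (Connection conn = DriverManager.getConnection(url, "hive", "");
             Statement stmt = conn.createStatement()) {

            // HiveQL looks like SQL; behind the scenes the query is compiled
            // into jobs that run on the cluster.
            stmt.execute("CREATE TABLE IF NOT EXISTS page_views ("
                    + "user_id STRING, url STRING, view_time TIMESTAMP)");

            try (ResultSet rs = stmt.executeQuery(
                    "SELECT url, COUNT(*) AS hits FROM page_views GROUP BY url")) {
                while (rs.next()) {
                    System.out.println(rs.getString("url") + " -> " + rs.getLong("hits"));
                }
            }
        }
    }
}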
Components of Hive:
1. HCatalog –
It is a Hive component and is a table and storage management layer for Hadoop. It
enables users and various data processing tools like Pig and MapReduce to read and write
data on the grid easily.
2. WebHCat –
It provides a service which can be used to run Hadoop MapReduce (or YARN), Pig, or
Hive jobs, or to perform Hive metadata operations, through an HTTP (REST) interface.
Modes of Hive:
1. Local Mode –
It is used when Hadoop is set up in pseudo-distributed mode with only one data node,
when the data size is small enough to be restricted to a single local machine, and when
processing will be faster on the smaller datasets present on the local machine.
2. MapReduce Mode –
It is used when Hadoop is built with multiple data nodes and the data is divided across
various nodes. It works on huge datasets, executes queries in parallel, and achieves
better performance when processing large datasets.
Characteristics of Hive:
1. Databases and tables are created before loading the data.
2. Hive, as a data warehouse, is built to manage and query only structured data which
resides in tables.
3. When handling structured data, MapReduce lacks optimization and usability features
such as UDFs, whereas the Hive framework provides both optimization and usability.
4. Programming in Hadoop deals directly with files, so Hive can partition the data
using directory structures to improve performance on certain queries.
5. Hive is compatible with various file formats such as TEXTFILE,
SEQUENCEFILE, ORC, RCFILE, etc.
6. Hive uses the Derby database for single-user metadata storage and MySQL for
multi-user or shared metadata.
Features of Hive:
1. It provides indexes, including bitmap indexes, to accelerate queries (compaction and
bitmap index types are available as of version 0.10).
2. Metadata is stored in an RDBMS, which reduces the time needed to perform semantic
checks during query execution.
3. It has built-in user-defined functions (UDFs) to manipulate strings, dates, and other
data-mining-related values, and the UDF set can be extended to handle use cases not
covered by the predefined functions.
4. DEFLATE, BWT, Snappy, etc. are algorithms available for operating on compressed data
stored in the Hadoop ecosystem.
5. It stores schemas in a database and processes the data in the Hadoop Distributed
File System (HDFS).
6. It is built for Online Analytical Processing (OLAP).
7. It provides a querying language frequently known as Hive Query Language (HQL or
HiveQL).
Advantages:
Scalability: Apache Hive is designed to handle large volumes of data, making it a scalable
solution for big data processing.
Familiar SQL-like interface: Hive uses a SQL-like language called HiveQL, which makes
it easy for SQL users to learn and use.
Integration with Hadoop ecosystem: Hive integrates well with the Hadoop ecosystem,
enabling users to process data using other Hadoop tools like Pig, MapReduce, and Spark.
Supports partitioning and bucketing: Hive supports partitioning and bucketing, which
can improve query performance by limiting the amount of data scanned.
User-defined functions: Hive allows users to define their own functions, which can be
used in HiveQL queries.
Disadvantages:
Limited real-time processing: Hive is designed for batch processing, which means it may
not be the best tool for real-time data processing.
Slow performance: Hive can be slower than traditional relational databases because it is
built on top of Hadoop, which is optimized for batch processing rather than interactive
querying.
Steep learning curve: While Hive uses a SQL-like language, it still requires users to have
knowledge of Hadoop and distributed computing, which can make it difficult for beginners
to use.
Limited flexibility: Hive is not as flexible as other data warehousing tools because it is
designed to work specifically with Hadoop, which can limit its usability in other
environments.

Introduction to Apache Pig

Pig represents big data as data flows. Pig is a high-level platform or tool used to process
large datasets. It provides a high level of abstraction over MapReduce. It provides a
high-level scripting language, known as Pig Latin, which is used to develop data analysis
code. To process data stored in HDFS, programmers write scripts using the Pig Latin
language. Internally, the Pig Engine (a component of Apache Pig) converts all these scripts
into specific map and reduce tasks, but these are not visible to the programmers, which
provides a high level of abstraction. Pig Latin and the Pig Engine are the two main
components of the Apache Pig tool. The result of Pig is always stored in HDFS.
Note: The Pig Engine has two types of execution environment, i.e. a local execution
environment in a single JVM (used when the dataset is small) and a distributed execution
environment in a Hadoop cluster.
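The hedged sketch below embeds Pig in a Java program through the PigServer API, running a few Pig Latin statements in the local execution environment mentioned in the note above. The input file, aliases and schema are assumptions made up for illustration.

import java.util.Iterator;

import org.apache.pig.ExecType;
import org.apache.pig.PigServer;
import org.apache.pig.data.Tuple;

public class PigLatinExample {
    public static void main(String[] args) throws Exception {
        // Local execution environment: everything runs in this JVM.
        PigServer pig = new PigServer(ExecType.LOCAL);

        // Pig Latin: load a space-separated log, group by URL, count hits.
        pig.registerQuery("logs = LOAD 'access.log' USING PigStorage(' ') "
                + "AS (ip:chararray, url:chararray);");
        pig.registerQuery("grouped = GROUP logs BY url;");
        pig.registerQuery("counts = FOREACH grouped GENERATE group AS url, COUNT(logs) AS hits;");

        // Iterate over the result; behind the scenes Pig compiles this plan
        // into map and reduce stages.
        Iterator<Tuple> it = pig.openIterator("counts");
        while (it.hasNext()) {
            System.out.println(it.next());
        }
        pig.shutdown();
    }
}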
Need of Pig: One limitation of MapReduce is that the development cycle is very long.
Writing the mapper and reducer, compiling and packaging the code, submitting the job and
retrieving the output is a time-consuming task. Apache Pig reduces the development time
through its multi-query approach. Pig is also beneficial for programmers who are not from
a Java background: 200 lines of Java code can be written in only about 10 lines of Pig
Latin, and programmers who already have SQL knowledge need less effort to learn Pig Latin.
 It uses a multi-query approach which reduces the length of the code.
 Pig Latin is an SQL-like language.
 It provides many built-in operators.
 It provides nested data types (tuples, bags, maps).
Evolution of Pig: Apache Pig was developed by Yahoo’s researchers in 2006. At that time,
the main idea behind Pig was to execute MapReduce jobs on extremely large datasets. In 2007
it moved to the Apache Software Foundation (ASF), which made it an open source project. The
first version (0.1) of Pig came out in 2008, and the latest version, 0.17, was released in
2017.
Features of Apache Pig:
 For performing several operations, Apache Pig provides a rich set of operators for
filtering, joining, sorting, aggregation, etc.
 Easy to learn, read and write; especially for SQL programmers, Apache Pig is a boon.
 Apache Pig is extensible, so you can create your own processes and user-defined
functions (UDFs) written in Python, Java or other programming languages.
 Join operations are easy in Apache Pig.
 Fewer lines of code.
 Apache Pig allows splits in the pipeline.
 By integrating with other components of the Apache Hadoop ecosystem, such as Apache
Hive, Apache Spark, and Apache ZooKeeper, Apache Pig enables users to take advantage
of these components’ capabilities while transforming data.
 The data structure is multivalued, nested, and richer.
 Pig can handle the analysis of both structured and unstructured data.

Difference between Pig and MapReduce

 Apache Pig is a scripting language, whereas MapReduce is a compiled programming language.
 In Pig, abstraction is at a higher level; in MapReduce, abstraction is at a lower level.
 Pig needs fewer lines of code compared to MapReduce, which requires more lines of code.
 Less development effort is needed for Apache Pig, while more development effort is required
for MapReduce.
 Code efficiency of Pig is lower compared to MapReduce; MapReduce code efficiency is higher.
 Pig provides built-in functions for ordering, sorting and union, whereas in MapReduce it is
hard to perform such data operations.
 Pig allows nested data types like map, tuple and bag, whereas MapReduce does not allow
nested data types.

Applications of Apache Pig:

 Pig scripting is used for exploring large datasets.
 It provides support for ad-hoc queries across large data sets.
 It is used in prototyping large data-set processing algorithms.
 It is used to process time-sensitive data loads.
 It is used for collecting large amounts of data in the form of search logs and web crawls.
 It is used where analytical insights are needed through sampling.
Types of Data Models in Apache Pig: Pig consists of 4 types of data models, as follows:
 Atom: It is an atomic data value which is stored as a string. It can be used both as a
number and as a string.
 Tuple: It is an ordered set of fields.
 Bag: It is a collection of tuples.
 Map: It is a set of key/value pairs.

Overview of SQOOP in Hadoop

SQOOP:
Before Hadoop and the concept of big data existed, all data was typically stored in
relational database management systems. Nowadays, after the introduction of big data
concepts, data needs to be stored in a more concise and effective way, and this is where
Sqoop comes into existence.
All the data stored in relational database management systems needs to be transferred into
the Hadoop structure, and transferring such a large amount of data manually is not possible,
but with the help of Sqoop we are able to do it. Thus, Sqoop is defined as the tool used to
perform data transfer operations between a relational database management system and the
Hadoop server. It helps in the transfer of bulk data from one point of source to another.
Some of the important features of Sqoop:
 Sqoop helps us to load the results of SQL queries into the Hadoop Distributed File System.
 Sqoop helps us to load the processed data directly into Hive or HBase.
 It performs security operations on data with the help of Kerberos.
 With the help of Sqoop, we can perform compression of the processed data.
 Sqoop is highly powerful and efficient in nature.
There are two major operations performed in Sqoop:
1. Import
2. Export
Sqoop Working:

SQOOP ARCHITECTURE
The operations that take place in Sqoop are user-friendly. Sqoop uses a command-line
interface to process the user’s commands, and it can also be used through Java APIs. When
Sqoop receives a command from the user, it handles the command and performs the further
processing. Sqoop can only import and export data based on user commands; it is not able to
perform aggregations on data.
Sqoop works in the following manner: it first parses the arguments provided by the user in
the command-line interface and passes them on to a map-only job. Once the arguments are
received, Sqoop launches multiple mappers, the number being defined by the user as an
argument on the command line. For the import command, each mapper task is assigned its own
part of the data to be imported, based on a key defined by the user in the command-line
interface. To increase the efficiency of the process, Sqoop uses parallel processing,
distributing the data equally among the mappers. Each mapper then creates an individual
connection with the database using JDBC and fetches the part of the data assigned to it by
Sqoop. Once the data is fetched, it is written into HDFS, HBase or Hive depending on the
arguments provided on the command line. This completes the Sqoop import process.

The export of data in Sqoop is performed in a similar way. The Sqoop export tool performs the
operation by allowing a set of files to be moved from the Hadoop Distributed File System back
to a relational database management system. The files given as input during the import process
are called records. When the user submits a job, it is mapped into map tasks that bring the
data files from Hadoop storage, and these data files are exported to a structured data
destination in the form of a relational database management system such as MySQL, SQL Server,
or Oracle.
Let us now understand the two main operations in detail:
Sqoop Import:
The Sqoop import command implements this operation. With the help of the import command, we can
import a table from a relational database management system into the Hadoop database server.
Records in the Hadoop structure are stored in text files, and each record is imported as a
separate record in the Hadoop database server. We can also create load and partition in Hive
while importing data. Sqoop also supports incremental import, which means that if we have
already imported a database and want to add some more rows, we can add only the new rows to the
existing data rather than importing the complete database again. A sample import command is
sketched below.
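A minimal sketch of the import command (the connection string, table, column, and directory names are hypothetical); the second invocation shows an incremental append import:

# Import the 'employees' table into HDFS using 4 parallel mappers
sqoop import \
  --connect jdbc:mysql://dbhost:3306/company \
  --username dbuser -P \
  --table employees \
  --split-by emp_id \
  --target-dir /user/hadoop/employees \
  -m 4

# Incremental import: append only rows whose emp_id is greater than the last imported value
sqoop import \
  --connect jdbc:mysql://dbhost:3306/company \
  --username dbuser -P \
  --table employees \
  --incremental append \
  --check-column emp_id \
  --last-value 1000 \
  --target-dir /user/hadoop/employees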
Sqoop Export:
The Sqoop export command implements the reverse operation. With the help of the export command,
we can transfer data from the Hadoop distributed file system to a relational database
management system. The data to be exported is processed into records before the operation is
completed. The export of data is done in two steps: the first is to examine the database for
metadata, and the second involves the migration of the data. A sample export command is
sketched below.
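A minimal sketch of the export command (the connection string, table, and directory names are hypothetical):

# Export HDFS records back into an existing relational table
sqoop export \
  --connect jdbc:mysql://dbhost:3306/company \
  --username dbuser -P \
  --table employees_backup \
  --export-dir /user/hadoop/employees \
  --input-fields-terminated-by ','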
This gives an idea of how the import and export operations are performed in Hadoop with the
help of Sqoop.
Advantages of Sqoop:
 With the help of Sqoop, we can perform data transfer operations with a variety of
structured data stores such as Oracle, Teradata, etc.
 Sqoop helps us perform ETL operations in a very fast and cost-effective manner.
 With the help of Sqoop, we can perform parallel processing of data, which speeds up the
overall process.
 Sqoop uses the MapReduce mechanism for its operations, which also provides fault
tolerance.
Disadvantages of Sqoop:
 A failure during the execution of an operation needs a special solution to handle the
problem.
 Sqoop uses a JDBC connection to connect with the relational database management
system, which can be an inefficient way to transfer data.
 The performance of the Sqoop export operation depends upon the hardware configuration
of the relational database management system.

What is Apache ZooKeeper?


ZooKeeper is a distributed, open-source coordination service for distributed applications. It
exposes a simple set of primitives that can be used to implement higher-level services for
synchronization, configuration maintenance, and groups and naming.
In a distributed system, there are multiple nodes or machines that need to communicate with
each other and coordinate their actions. ZooKeeper provides a way to ensure that these nodes
are aware of each other and can coordinate their actions. It does this by maintaining a
hierarchical tree of data nodes called “Znodes“, which can be used to store and retrieve data
and maintain state information. ZooKeeper provides a set of primitives, such as locks,
barriers, and queues, that can be used to coordinate the actions of nodes in a distributed
system. It also provides features such as leader election, failover, and recovery, which can
help ensure that the system is resilient to failures. ZooKeeper is widely used in distributed
systems such as Hadoop, Kafka, and HBase, and it has become an essential component of
many distributed applications.

Why do we need it?

 Coordination services: The integration/communication of services in a distributed


environment.
 Coordination services are complex to get right. They are especially prone to errors such
as race conditions and deadlocks.
 Race condition – two or more systems trying to perform the same task at the same time.
 Deadlock – two or more operations waiting for each other.
 To make coordination between distributed environments easy, developers came up with
ZooKeeper, so that distributed applications are relieved of the responsibility of
implementing coordination services from scratch.

What is a distributed system?

 Multiple computer systems working on a single problem.


 It is a network that consists of autonomous computers that are connected using distributed
middleware.
 Key features: concurrency, resource sharing, independence, a global view, greater fault
tolerance, and a much better price/performance ratio.
 Key Goals: Transparency, Reliability, Performance, Scalability.
 Challenges: Security, Fault, Coordination, and resource sharing.

Coordination Challenge

 Why is coordination in a distributed system a hard problem?
 Coordination or configuration management is required for a distributed application that has
many systems.
 There is a master node where the cluster data is stored.
 Worker nodes or slave nodes get the data from this master node.
 The master node is a single point of failure.
 Synchronization is not easy.
 Careful design and implementation are needed.

Apache Zookeeper

Apache Zookeeper is a distributed, open-source coordination service for distributed systems.


It provides a central place for distributed applications to store data, communicate with one
another, and coordinate activities. Zookeeper is used in distributed systems to coordinate
distributed processes and services. It provides a simple, tree-structured data model, a simple
API, and a distributed protocol to ensure data consistency and availability. Zookeeper is
designed to be highly reliable and fault-tolerant, and it can handle high levels of read and
write throughput.
Zookeeper is implemented in Java and is widely used in distributed systems, particularly in
the Hadoop ecosystem. It is an Apache Software Foundation project and is released under the
Apache License 2.0.

Architecture of Zookeeper

Zookeeper Services
The ZooKeeper architecture consists of a hierarchy of nodes called znodes, organized in a
tree-like structure. Each znode can store data and has a set of permissions that control access
to the znode. The znodes are organized in a hierarchical namespace, similar to a file system.
At the root of the hierarchy is the root znode, and all other znodes are children of the root
znode. The hierarchy is similar to a file system hierarchy, where each znode can have
children and grandchildren, and so on.

Important Components in Zookeeper


ZooKeeper Services
 Leader & Follower
 Request Processor – Active only in the Leader node; it is responsible for processing write
requests. After processing, it sends the changes to the Follower nodes.
 Atomic Broadcast – Present in both the Leader node and the Follower nodes. It is
responsible for sending the changes to the other nodes.
 In-memory Databases (Replicated Databases) – Responsible for storing the data in
ZooKeeper. Every node contains its own database. Data is also written to the file system,
providing recoverability in case of any problems with the cluster.
Other Components
 Client – One of the nodes in our distributed application cluster. It accesses information
from the server, and every client sends a message to the server to let it know that the
client is alive.
 Server – Provides all the services to the client and gives acknowledgments to the client.
 Ensemble – A group of ZooKeeper servers. The minimum number of nodes required to
form an ensemble is 3.

Zookeeper Data Model

ZooKeeper data model


In Zookeeper, data is stored in a hierarchical namespace, similar to a file system. Each node
in the namespace is called a Znode, and it can store data and have children. Znodes are
similar to files and directories in a file system. Zookeeper provides a simple API for creating,
reading, writing, and deleting Znodes. It also provides a mechanism, called watches, for
detecting changes to the data stored in Znodes. Znodes maintain a stat structure that
includes: version number, ACL, timestamp, and data length.
Types of Znodes (see the sketch after this list):
 Persistent: Alive until they are explicitly deleted.
 Ephemeral: Alive only as long as the client connection that created them is alive.
 Sequential: Can be either persistent or ephemeral; ZooKeeper appends a monotonically
increasing counter to the znode name.
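A minimal sketch (the znode paths are hypothetical) of creating each znode type with the ZooKeeper Java API:

import org.apache.zookeeper.CreateMode;
import org.apache.zookeeper.ZooDefs.Ids;
import org.apache.zookeeper.ZooKeeper;

public class ZnodeTypesExample {
    public static void main(String[] args) throws Exception {
        ZooKeeper zk = new ZooKeeper("localhost:2181", 3000, null);

        // Persistent znode: survives until it is explicitly deleted
        zk.create("/config", "v1".getBytes(), Ids.OPEN_ACL_UNSAFE, CreateMode.PERSISTENT);

        // Ephemeral znode: removed automatically when this client's session ends
        zk.create("/config/active-worker", new byte[0], Ids.OPEN_ACL_UNSAFE, CreateMode.EPHEMERAL);

        // Sequential znode: ZooKeeper appends a monotonically increasing counter to the name,
        // e.g. /config/task-0000000000
        String created = zk.create("/config/task-", new byte[0], Ids.OPEN_ACL_UNSAFE,
                                   CreateMode.PERSISTENT_SEQUENTIAL);
        System.out.println("Created " + created);

        zk.close();
    }
}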

Why do we need ZooKeeper in Hadoop?

Zookeeper is used to manage and coordinate the nodes in a Hadoop cluster, including the
NameNode, DataNode, and ResourceManager. In a Hadoop cluster, Zookeeper helps to:
 Maintain configuration information: Zookeeper stores the configuration information for
the Hadoop cluster, including the location of the NameNode, DataNode, and
ResourceManager.
 Manage the state of the cluster: Zookeeper tracks the state of the nodes in the Hadoop
cluster and can be used to detect when a node has failed or become unavailable.
 Coordinate distributed processes: Zookeeper can be used to coordinate distributed
processes, such as job scheduling and resource allocation, across the nodes in a Hadoop
cluster.
Zookeeper helps to ensure the availability and reliability of a Hadoop cluster by providing a
central coordination service for the nodes in the cluster.

How Does ZooKeeper Work in Hadoop?

ZooKeeper operates much like a distributed file system and exposes a simple set of APIs that
enable clients to read and write data. It stores its data in a tree-like structure of znodes,
each of which can be thought of as a file or a directory in a traditional file system.
ZooKeeper uses a consensus algorithm to ensure that all of its servers have a consistent view
of the data stored in the Znodes. This means that if a client writes data to a znode, that data
will be replicated to all of the other servers in the ZooKeeper ensemble.
One important feature of ZooKeeper is its ability to support the notion of a “watch.” A watch
allows a client to register for notifications when the data stored in a znode changes. This can
be useful for monitoring changes to the data stored in ZooKeeper and reacting to those
changes in a distributed system.
In Hadoop, ZooKeeper is used for a variety of purposes, including:
 Storing configuration information: ZooKeeper is used to store configuration information
that is shared by multiple Hadoop components. For example, it might be used to store the
locations of NameNodes in a Hadoop cluster or the addresses of JobTracker nodes.
 Providing distributed synchronization: ZooKeeper is used to coordinate the activities of
various Hadoop components and ensure that they are working together in a consistent
manner. For example, it might be used to ensure that only one NameNode is active at a
time in a Hadoop cluster.
 Maintaining naming: ZooKeeper is used to maintain a centralized naming service for
Hadoop components. This can be useful for identifying and locating resources in a
distributed system.
ZooKeeper is an essential component of Hadoop and plays a crucial role in coordinating the
activity of its various subcomponents.

Reading and Writing in Apache Zookeeper

ZooKeeper provides a simple and reliable interface for reading and writing data. The data is
stored in a hierarchical namespace, similar to a file system, with nodes called znodes. Each
znode can store data and have children znodes. ZooKeeper clients can read and write data to
these znodes by using the getData() and setData() methods, respectively. Here is an example
of reading and writing data using the ZooKeeper Java API:


import org.apache.zookeeper.CreateMode;
import org.apache.zookeeper.ZooDefs.Ids;
import org.apache.zookeeper.ZooKeeper;

public class ZkReadWriteExample {
    public static void main(String[] args) throws Exception {
        // Connect to the ZooKeeper ensemble (session timeout of 3000 ms, no watcher)
        ZooKeeper zk = new ZooKeeper("localhost:2181", 3000, null);

        // Write data to the znode "/myZnode"
        String path = "/myZnode";
        String data = "hello world";
        zk.create(path, data.getBytes(), Ids.OPEN_ACL_UNSAFE, CreateMode.PERSISTENT);

        // Read data from the znode "/myZnode"
        byte[] bytes = zk.getData(path, false, null);
        String readData = new String(bytes);

        // Prints "hello world"
        System.out.println(readData);

        // Close the connection to the ZooKeeper ensemble
        zk.close();
    }
}

Session and Watches

Session
 Requests in a session are executed in FIFO order.
 Once a session is established, a session id is assigned to the client.
 The client sends heartbeats to keep the session valid.
 The session timeout is usually represented in milliseconds.

Watches
 Watches are a mechanism for clients to get notifications about changes in ZooKeeper (see
the sketch after this list).
 A client can set a watch while reading a particular znode.
 Znode changes are modifications of the data associated with a znode or changes in the
znode's children.
 Watches are triggered only once.
 If the session expires, its watches are also removed.
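The sketch below (the znode path is hypothetical) registers a watch with getData(); the watch fires at most once on the next change and must be re-registered to keep observing.

import org.apache.zookeeper.WatchedEvent;
import org.apache.zookeeper.Watcher;
import org.apache.zookeeper.ZooKeeper;

public class ZkWatchExample {
    public static void main(String[] args) throws Exception {
        ZooKeeper zk = new ZooKeeper("localhost:2181", 3000, null);

        // The watcher is triggered only once, on the next change to the znode
        Watcher watcher = new Watcher() {
            @Override
            public void process(WatchedEvent event) {
                System.out.println("Event " + event.getType() + " on " + event.getPath());
                // Re-register here (e.g. call getData again) to keep watching
            }
        };

        // Read the znode and register the watch in the same call
        byte[] data = zk.getData("/myZnode", watcher, null);
        System.out.println(new String(data));

        Thread.sleep(60000);   // keep the session alive long enough to observe a change
        zk.close();
    }
}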

What is Flume?
Apache Flume is a tool/service/data-ingestion mechanism for collecting, aggregating, and
transporting large amounts of streaming data, such as log files and events, from various
sources to a centralized data store.

Flume is a highly reliable, distributed, and configurable tool. It is principally designed to
copy streaming data (log data) from various web servers to HDFS.
Applications of Flume

Assume an e-commerce web application wants to analyze the customer behavior from a
particular region. To do so, they would need to move the available log data into Hadoop for
analysis. Here, Apache Flume comes to our rescue.

Flume is used to move the log data generated by application servers into HDFS at a higher
speed.

Advantages of Flume

Here are the advantages of using Flume −

 Using Apache Flume we can store the data into any of the centralized stores (HBase,
HDFS).
 When the rate of incoming data exceeds the rate at which data can be written to the
destination, Flume acts as a mediator between data producers and the centralized
stores and provides a steady flow of data between them.
 Flume provides the feature of contextual routing.
 The transactions in Flume are channel-based where two transactions (one sender and
one receiver) are maintained for each message. It guarantees reliable message
delivery.
 Flume is reliable, fault tolerant, scalable, manageable, and customizable.

Features of Flume

Some of the notable features of Flume are as follows −

 Flume ingests log data from multiple web servers into a centralized store (HDFS,
HBase) efficiently.
 Using Flume, we can get the data from multiple servers immediately into Hadoop.
 Along with the log files, Flume is also used to import huge volumes of event data
produced by social networking sites like Facebook and Twitter, and e-commerce
websites like Amazon and Flipkart.
 Flume supports a large set of source and destination types (a sample source-channel-sink
configuration is sketched after this list).
 Flume supports multi-hop flows, fan-in fan-out flows, contextual routing, etc.
 Flume can be scaled horizontally.
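A minimal sketch of a Flume agent configuration (the agent, source, channel, and sink names, the log path, and the HDFS path are hypothetical) that tails a web-server log with an exec source, buffers events in a memory channel, and delivers them to HDFS:

# Name the components of the agent
agent1.sources  = src1
agent1.channels = ch1
agent1.sinks    = sink1

# Source: read new lines appended to the access log
agent1.sources.src1.type = exec
agent1.sources.src1.command = tail -F /var/log/httpd/access_log
agent1.sources.src1.channels = ch1

# Channel: buffer events in memory between the source and the sink
agent1.channels.ch1.type = memory
agent1.channels.ch1.capacity = 10000

# Sink: deliver the events to HDFS
agent1.sinks.sink1.type = hdfs
agent1.sinks.sink1.hdfs.path = /flume/weblogs
agent1.sinks.sink1.hdfs.fileType = DataStream
agent1.sinks.sink1.channel = ch1

The agent can then be started with a command along the lines of: flume-ng agent --conf conf --conf-file flume.conf --name agent1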

Apache Oozie - Introduction


What is Apache Oozie?
Apache Oozie is a scheduler system to run and manage Hadoop jobs in a distributed
environment. It allows multiple complex jobs to be combined and run in a sequential order to
achieve a bigger task. Within a sequence of tasks, two or more jobs can also be programmed
to run parallel to each other.
One of the main advantages of Oozie is that it is tightly integrated with the Hadoop stack,
supporting various Hadoop jobs like Hive, Pig, and Sqoop, as well as system-specific jobs
like Java and Shell.
Oozie is an open-source Java web application available under the Apache License 2.0. It is
responsible for triggering the workflow actions, which in turn use the Hadoop execution
engine to actually execute the tasks. Hence, Oozie is able to leverage the existing Hadoop
machinery for load balancing, fail-over, etc.
Oozie detects the completion of tasks through callback and polling. When Oozie starts a task,
it provides a unique callback HTTP URL to the task and is notified at that URL when the task
is complete. If the task fails to invoke the callback URL, Oozie can poll the task for
completion.

Following three types of jobs are common in Oozie −

 Oozie Workflow Jobs − These are represented as Directed Acyclic Graphs (DAGs)
to specify a sequence of actions to be executed.
 Oozie Coordinator Jobs − These consist of workflow jobs triggered by time and data
availability.
 Oozie Bundle − These can be referred to as a package of multiple coordinator and
workflow jobs.

We will look into each of these in detail in the following chapters.

A sample workflow with controls (Start, Decision, Fork, Join, and End) and actions (Hive,
Shell, Pig) will look like the following diagram −
A workflow always starts with a start tag and ends with an end tag. A minimal workflow
definition is sketched below.
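A minimal sketch of a workflow.xml with a single Pig action (the workflow name, the script name, and the ${jobTracker} and ${nameNode} parameters are hypothetical placeholders resolved from the job properties):

<!-- Minimal Oozie workflow: start -> pig action -> end (or kill on error) -->
<workflow-app name="sample-wf" xmlns="uri:oozie:workflow:0.5">
    <start to="pig-node"/>
    <action name="pig-node">
        <pig>
            <job-tracker>${jobTracker}</job-tracker>
            <name-node>${nameNode}</name-node>
            <script>cleanup.pig</script>
        </pig>
        <ok to="end"/>
        <error to="fail"/>
    </action>
    <kill name="fail">
        <message>Pig action failed</message>
    </kill>
    <end name="end"/>
</workflow-app>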

Use-Cases of Apache Oozie

Apache Oozie is used by Hadoop system administrators to run complex log analysis
on HDFS. Hadoop Developers use Oozie for performing ETL operations on data in a
sequential order and saving the output in a specified format (Avro, ORC, etc.) in HDFS.

In an enterprise, Oozie jobs are scheduled as coordinators or bundles.

Oozie Editors

Before we dive into Oozie, let us have a quick look at the available editors for Oozie.

Most of the time, you won't need an editor and will write the workflows using any popular
text editor (like Notepad++, Sublime, or Atom), as we will be doing in this tutorial.

But as a beginner it makes sense to create a workflow by the drag-and-drop method using an
editor and then see how the workflow gets generated, and to map the GUI to the actual
workflow.xml created by the editor. This is the only section where we will discuss Oozie
editors; we won't use them in our tutorial.
The most popular among Oozie editors is Hue.

Hue Editor for Oozie

This editor is very handy to use and is available with almost all Hadoop vendors’ solutions.

The following screenshot shows an example workflow created by this editor.


You can drag and drop controls and actions and add your job inside these actions.

A good resource to learn more on this topic −

http://gethue.com/new-apache-oozie-workflow-coordinator-bundle-editors/

Oozie Eclipse Plugin (OEP)

Oozie Eclipse plugin (OEP) is an Eclipse plugin for editing Apache Oozie workflows
graphically. It is a graphical editor for editing Apache Oozie workflows inside Eclipse.

Composing Apache Oozie workflows becomes much simpler: it is a matter of drag-and-drop and
of connecting lines between the nodes.

The following screenshots are examples of OEP.
