Hadoop Ecosystem
Note: Apart from the components covered here, there are many other components that are part of the Hadoop ecosystem.
All these toolkits or components revolve around one term, i.e., data. That's the beauty of Hadoop: it revolves around data and hence makes its synthesis easier.
YARN:
As the name Yet Another Resource Negotiator implies, YARN is the one that helps to manage the resources across the clusters. In short, it performs scheduling and resource allocation for the Hadoop system.
It consists of three major components:
1. Resource Manager
2. Node Manager
3. Application Manager
The Resource Manager has the privilege of allocating resources for the applications in the system, whereas Node Managers work on the allocation of resources such as CPU, memory, and bandwidth per machine and later acknowledge the Resource Manager. The Application Manager works as an interface between the Resource Manager and Node Managers and performs negotiations as per the requirement of the two.
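As a rough sketch of talking to the Resource Manager programmatically (the cluster settings are assumed to come from a yarn-site.xml on the classpath), the YARN client API can be used to ask for node reports, i.e. the resources each Node Manager offers:

import java.util.List;
import org.apache.hadoop.yarn.api.records.NodeReport;
import org.apache.hadoop.yarn.api.records.NodeState;
import org.apache.hadoop.yarn.client.api.YarnClient;
import org.apache.hadoop.yarn.conf.YarnConfiguration;

public class ClusterNodes {
    public static void main(String[] args) throws Exception {
        YarnClient yarnClient = YarnClient.createYarnClient();
        yarnClient.init(new YarnConfiguration());   // picks up yarn-site.xml from the classpath
        yarnClient.start();
        // Ask the Resource Manager which Node Managers are running and what resources they offer.
        List<NodeReport> nodes = yarnClient.getNodeReports(NodeState.RUNNING);
        for (NodeReport node : nodes) {
            System.out.println(node.getNodeId() + " capability=" + node.getCapability()
                    + " containers=" + node.getNumContainers());
        }
        yarnClient.stop();
    }
}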
HIVE:
With the help of an SQL-like methodology and interface, HIVE performs reading and writing of large data sets. Its query language is called HQL (Hive Query Language).
It is highly scalable, as it allows both real-time processing and batch processing. Also, all the SQL datatypes are supported by Hive, making query processing easier.
Like other query-processing frameworks, HIVE comes with two components: JDBC drivers and the HIVE command line. JDBC, along with ODBC drivers, works on establishing the data storage permissions and connection, whereas the HIVE command line helps in the processing of queries.
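As a rough sketch of how the JDBC driver and HQL fit together (the HiveServer2 endpoint, database, and table names here are placeholders, and the Hive JDBC driver jar is assumed to be on the classpath), a Java client could run a query like this:

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;

public class HiveJdbcExample {
    public static void main(String[] args) throws Exception {
        // HiveServer2 JDBC URL: jdbc:hive2://<host>:<port>/<database>
        String url = "jdbc:hive2://localhost:10000/default";   // placeholder endpoint
        try (Connection con = DriverManager.getConnection(url, "", "");
             Statement stmt = con.createStatement()) {
            // HQL looks like SQL; "employees" is a hypothetical table.
            ResultSet rs = stmt.executeQuery("SELECT dept, COUNT(*) FROM employees GROUP BY dept");
            while (rs.next()) {
                System.out.println(rs.getString(1) + " -> " + rs.getLong(2));
            }
        }
    }
}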
Apache Spark:
It's a platform that handles all the process-consumptive tasks like batch processing, interactive or iterative real-time processing, graph conversions, and visualization. It consumes in-memory resources, thus being faster than the earlier frameworks in terms of optimization.
Spark is best suited for real-time data, whereas Hadoop is best suited for structured data or batch processing; hence both are used in most companies interchangeably.
Apache HBase:
It's a NoSQL database which supports all kinds of data and is thus capable of handling anything within a Hadoop database. It provides the capabilities of Google's BigTable and is thus able to work on Big Data sets effectively.
At times when we need to search for or retrieve the occurrences of something small in a huge database, the request must be processed within a short span of time. At such times, HBase comes in handy, as it gives us a tolerant way of storing limited data.
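To make that kind of small lookup concrete, here is a minimal sketch using the HBase Java client to fetch a single row by key; the table name, row key, and column family/qualifier are hypothetical:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.Connection;
import org.apache.hadoop.hbase.client.ConnectionFactory;
import org.apache.hadoop.hbase.client.Get;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.client.Table;
import org.apache.hadoop.hbase.util.Bytes;

public class HBaseLookup {
    public static void main(String[] args) throws Exception {
        Configuration conf = HBaseConfiguration.create();   // reads hbase-site.xml from the classpath
        try (Connection conn = ConnectionFactory.createConnection(conf);
             Table table = conn.getTable(TableName.valueOf("users"))) {   // hypothetical table
            Get get = new Get(Bytes.toBytes("user#42"));                  // row key to look up
            get.addColumn(Bytes.toBytes("profile"), Bytes.toBytes("email"));
            Result result = table.get(get);
            byte[] value = result.getValue(Bytes.toBytes("profile"), Bytes.toBytes("email"));
            System.out.println(value == null ? "not found" : Bytes.toString(value));
        }
    }
}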
Other Components: Apart from all of these, there are some other components too
that carry out a huge task in order to make Hadoop capable of processing large
datasets. They are as follows:
Solr, Lucene: These are two services that perform the task of searching and indexing with the help of some Java libraries. In particular, Lucene is based on Java and also provides a spell-check mechanism, while Solr is built on top of Lucene.
Zookeeper: There was a huge issue of managing coordination and synchronization among the resources or the components of Hadoop, which often resulted in inconsistency. Zookeeper overcame all these problems by performing synchronization, inter-component communication, grouping, and maintenance.
Oozie: Oozie simply performs the task of a scheduler, scheduling jobs and binding them together as a single unit. There are two kinds of jobs, i.e., Oozie workflow jobs and Oozie coordinator jobs. Oozie workflow jobs are those that need to be executed in a sequentially ordered manner, whereas Oozie coordinator jobs are those that are triggered when some data or an external stimulus is given to them.
Hadoop – Architecture
As we all know, Hadoop is a framework written in Java that utilizes a large cluster of commodity hardware to maintain and store big data. Hadoop works on the MapReduce programming algorithm that was introduced by Google. Today lots of big brand companies use Hadoop in their organizations to deal with big data, e.g. Facebook, Yahoo, Netflix, eBay, etc. The Hadoop architecture mainly consists of 4 components:
MapReduce
HDFS(Hadoop Distributed File System)
YARN(Yet Another Resource Negotiator)
Common Utilities or Hadoop Common
1. MapReduce
MapReduce is nothing but an algorithm or a data structure that is based on the YARN framework. The major feature of MapReduce is that it performs distributed processing in parallel in a Hadoop cluster, which makes Hadoop work so fast. When you are dealing with Big Data, serial processing is no longer of any use. MapReduce has mainly 2 tasks which are divided phase-wise: in the first phase, Map is utilized, and in the next phase, Reduce is utilized.
Here, we can see that the input is provided to the Map() function; then its output is used as an input to the Reduce() function, and after that we receive our final output. Let's understand what Map() and Reduce() do.
As we can see, an input is provided to Map(), and since we are using Big Data, the input is a set of data blocks. The Map() function breaks these data blocks into tuples, which are nothing but key-value pairs. These key-value pairs are then sent as input to Reduce(). The Reduce() function combines the broken tuples (key-value pairs) based on their key, forms a set of tuples, and performs some operation like sorting or a summation-type job on them, which is then sent to the final output node. Finally, the output is obtained.
The data processing is always done in the Reducer, depending upon the business requirement of that industry. This is how first Map() and then Reduce() are utilized one by one.
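The classic word-count program is a convenient sketch of this flow (the class and field names below are illustrative, not taken from the text above): the Mapper emits (word, 1) pairs and the Reducer sums the values that share a key.

import java.io.IOException;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

public class WordCount {
    // Map(): break each input line into (word, 1) key-value pairs.
    public static class TokenMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
        private static final IntWritable ONE = new IntWritable(1);
        private final Text word = new Text();
        @Override
        protected void map(LongWritable key, Text value, Context context)
                throws IOException, InterruptedException {
            for (String token : value.toString().split("\\s+")) {
                if (!token.isEmpty()) {
                    word.set(token);
                    context.write(word, ONE);       // intermediate key-value pair
                }
            }
        }
    }

    // Reduce(): combine all values that share the same key (summation in this case).
    public static class SumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
        @Override
        protected void reduce(Text key, Iterable<IntWritable> values, Context context)
                throws IOException, InterruptedException {
            int sum = 0;
            for (IntWritable v : values) {
                sum += v.get();
            }
            context.write(key, new IntWritable(sum));   // final output record
        }
    }
}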
Let’s understand the Map Task and Reduce Task in detail.
Map Task:
Combiner: The intermediate output of the Map is combined with the help of the combiner. Using a combiner is not necessary, as it is optional.
Partitioner: The partitioner is responsible for fetching the key-value pairs generated in the Mapper phase. The partitioner generates the shards corresponding to each reducer. The hashcode of each key is also fetched by the partitioner, and the partitioner then performs a modulus of that hashcode with the number of reducers (key.hashCode() % numberOfReducers), as sketched below.
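A partitioner along these lines could be sketched as follows; the Text/IntWritable key and value types simply match the word-count sketch above and are an assumption:

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Partitioner;

// Routes each intermediate (key, value) pair to one of the reducers
// using key.hashCode() % numberOfReducers, as described above.
public class HashKeyPartitioner extends Partitioner<Text, IntWritable> {
    @Override
    public int getPartition(Text key, IntWritable value, int numReduceTasks) {
        // Mask the sign bit so the result is never negative.
        return (key.hashCode() & Integer.MAX_VALUE) % numReduceTasks;
    }
}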
Reduce Task
Shuffle and Sort: The task of the Reducer starts with this step. The process in which the Mapper generates the intermediate key-value pairs and transfers them to the Reducer task is known as Shuffling. Using the shuffling process, the system can sort the data by its key value. Shuffling begins once some of the mapping tasks are done, which is why it is a faster process: it does not wait for the completion of all the tasks performed by the Mapper.
Reduce: The main function or task of the Reduce step is to gather the tuples generated from Map and then perform some sorting and aggregation on those key-value pairs depending on their key element.
OutputFormat: Once all the operations are performed, the key-value pairs are written into the file with the help of a record writer, each record on a new line, and the key and value in a space-separated manner.
2. HDFS
HDFS has two main components:
NameNode (Master)
DataNode (Slave)
NameNode: The NameNode works as a master in a Hadoop cluster that guides the DataNodes (slaves). The NameNode is mainly used for storing the metadata, i.e., the data about the data. The metadata can be the transaction logs that keep track of the user's activity in the Hadoop cluster.
The metadata can also be the name of the file, its size, and the information about the location (block number, block IDs) on the DataNodes that the NameNode stores in order to find the closest DataNode for faster communication. The NameNode instructs the DataNodes with operations like delete, create, replicate, etc.
DataNode: DataNodes work as slaves. DataNodes are mainly utilized for storing the data in a Hadoop cluster; the number of DataNodes can be from 1 to 500 or even more. The more DataNodes there are, the more data the Hadoop cluster will be able to store. So it is advised that the DataNodes should have a high storage capacity to store a large number of file blocks.
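The block metadata kept by the NameNode can be inspected from client code; the sketch below asks HDFS for the block locations of an assumed file path:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.BlockLocation;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class BlockInfo {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();              // reads core-site.xml / hdfs-site.xml
        FileSystem fs = FileSystem.get(conf);
        Path file = new Path("/data/input/sample.txt");        // hypothetical file already in HDFS
        FileStatus status = fs.getFileStatus(file);
        // The NameNode answers with block offsets, lengths and the DataNodes holding each block.
        BlockLocation[] blocks = fs.getFileBlockLocations(status, 0, status.getLen());
        for (BlockLocation block : blocks) {
            System.out.println("offset=" + block.getOffset()
                    + " length=" + block.getLength()
                    + " hosts=" + String.join(",", block.getHosts()));
        }
        fs.close();
    }
}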
High Level Architecture Of Hadoop
File Block In HDFS: Data in HDFS is always stored in terms of blocks. So a single file is divided into multiple blocks of size 128 MB, which is the default, and you can also change it manually.
Let's understand this concept of breaking a file down into blocks with an example. Suppose you have uploaded a file of 400 MB to your HDFS. This file gets divided into blocks of 128 MB + 128 MB + 128 MB + 16 MB = 400 MB, meaning 4 blocks are created, each of 128 MB except the last one. Hadoop doesn't know or care about what data is stored in these blocks, so it considers the final file block a partial record, as it does not have any idea regarding it. In the Linux file system, the size of a file block is about 4 KB, which is very much less than the default size of file blocks in the Hadoop file system. As we all know, Hadoop is mainly configured for storing large data, which can be in petabytes; this is what makes the Hadoop file system different from other file systems, as it can be scaled. Nowadays file blocks of 128 MB to 256 MB are used in Hadoop.
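The block size is ultimately a per-file setting; as a minimal sketch (the path and sizes are illustrative), a file can be written with an explicit block size through the FileSystem API, or the dfs.blocksize property can be set cluster-wide in hdfs-site.xml:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class WriteWithBlockSize {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(conf);
        long blockSize = 128L * 1024 * 1024;      // 128 MB blocks (the default in recent Hadoop)
        short replication = 3;                    // default replication factor
        // create(path, overwrite, bufferSize, replication, blockSize)
        try (FSDataOutputStream out = fs.create(new Path("/data/output/big-file.bin"),
                true, 4096, replication, blockSize)) {
            out.writeBytes("payload...");         // a 400 MB file would be split into 3 x 128 MB + 16 MB
        }
        fs.close();
    }
}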
Replication In HDFS: Replication ensures the availability of the data. Replication is making a copy of something, and the number of copies you make of that particular thing is its replication factor. As we have seen with file blocks, HDFS stores the data in the form of various blocks, and at the same time Hadoop is also configured to make copies of those file blocks.
By default, the replication factor for Hadoop is set to 3, which is configurable, meaning you can change it manually as per your requirement. In the above example we made 4 file blocks, which means that 3 replicas or copies of each file block are made, so a total of 4 × 3 = 12 blocks are made for backup purposes.
This is because for running Hadoop we are using commodity hardware (inexpensive system hardware) which can crash at any time. We are not using supercomputers for our Hadoop setup. That is why we need such a feature in HDFS which can make copies of those file blocks for backup purposes; this is known as fault tolerance.
Now one thing we also need to notice is that after making so many replicas of our file blocks we are using a lot of extra storage, but for big brand organizations the data is much more important than the storage, so nobody worries about this extra storage. You can configure the replication factor in your hdfs-site.xml file.
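As a sketch of the two usual ways to change it (dfs.replication is the standard HDFS property name; the file path is a placeholder): set it in the configuration for files created afterwards, or change it on an existing file from client code:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class ChangeReplication {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        conf.setInt("dfs.replication", 2);        // default for files created through this client
        FileSystem fs = FileSystem.get(conf);
        // Change the replication factor of an existing file (hypothetical path).
        boolean ok = fs.setReplication(new Path("/data/input/sample.txt"), (short) 2);
        System.out.println("replication changed: " + ok);
        fs.close();
    }
}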
Rack Awareness: A rack is nothing but the physical collection of nodes in our Hadoop cluster (maybe 30 to 40). A large Hadoop cluster consists of many racks. With the help of this rack information, the NameNode chooses the closest DataNode to achieve maximum performance while performing read/write operations, which reduces network traffic.
HDFS Architecture
3. YARN (Yet Another Resource Negotiator)
Features of YARN:
Multi-Tenancy
Scalability
Cluster-Utilization
Compatibility
4. Common Utilities (Hadoop Common)
Hadoop Common, or the common utilities, is nothing but our Java library and Java files, or we can say the Java scripts, that are needed by all the other components present in a Hadoop cluster. These utilities are used by HDFS, YARN, and MapReduce for running the cluster. Hadoop Common assumes that hardware failure in a Hadoop cluster is common, so it needs to be handled automatically in software by the Hadoop framework.
Hadoop YARN Architecture
YARN allows different data processing engines like graph processing, interactive processing, stream processing, and batch processing to run and process data stored in HDFS (Hadoop Distributed File System), thus making the system much more efficient. Through its various components, it can dynamically allocate various resources and schedule the application processing. For large-volume data processing, it is quite necessary to manage the available resources properly so that every application can leverage them.
YARN Features: YARN gained popularity because of features such as scalability, compatibility, cluster utilization, and multi-tenancy.
The main components of YARN architecture include the Resource Manager, Node Manager, Application Master, and Container.
Application Master: The Application Master is responsible for negotiating resources with the resource manager, tracking the status, and monitoring the progress of a single application. The application master requests the container from the node manager by sending a Container Launch Context (CLC), which includes everything an application needs to run. Once the application is started, it sends the health report to the resource manager from time to time.
Container: It is a collection of physical resources such as RAM, CPU cores, and disk on a single node. Containers are invoked via a Container Launch Context (CLC), which is a record that contains information such as environment variables, security tokens, dependencies, etc.
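To give the CLC a concrete shape, the sketch below outlines how a client could build a Container Launch Context and submit an application through the YARN client API. The launch command, memory, and core counts are arbitrary placeholders, and a real application would also need localized resources, a queue, and an actual application master; treat this only as a sketch of the records involved.

import java.util.Collections;
import org.apache.hadoop.yarn.api.records.ApplicationSubmissionContext;
import org.apache.hadoop.yarn.api.records.ContainerLaunchContext;
import org.apache.hadoop.yarn.api.records.Resource;
import org.apache.hadoop.yarn.client.api.YarnClient;
import org.apache.hadoop.yarn.client.api.YarnClientApplication;
import org.apache.hadoop.yarn.conf.YarnConfiguration;

public class SubmitSketch {
    public static void main(String[] args) throws Exception {
        YarnClient yarnClient = YarnClient.createYarnClient();
        yarnClient.init(new YarnConfiguration());
        yarnClient.start();

        YarnClientApplication app = yarnClient.createApplication();
        ApplicationSubmissionContext appContext = app.getApplicationSubmissionContext();
        appContext.setApplicationName("clc-demo");               // placeholder name

        // The CLC: local resources, environment and the command that starts the application master.
        ContainerLaunchContext clc = ContainerLaunchContext.newInstance(
                Collections.emptyMap(),                          // local resources (jars, files)
                Collections.emptyMap(),                          // environment variables
                Collections.singletonList("sleep 60"),           // placeholder launch command
                null, null, null);                               // service data, tokens, ACLs
        appContext.setAMContainerSpec(clc);

        // Physical resources requested for the application master container.
        appContext.setResource(Resource.newInstance(1024, 1));   // 1024 MB, 1 vCore
        yarnClient.submitApplication(appContext);
        yarnClient.stop();
    }
}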
Application workflow in Hadoop YARN:
Advantages:
Multi-tenancy: YARN allows multiple processing engines to run simultaneously on a single Hadoop cluster.
Resource Management: YARN provides an efficient way of managing
resources in the Hadoop cluster. It allows administrators to allocate and monitor
the resources required by each application in a cluster, such as CPU, memory,
and disk space.
Scalability: YARN is designed to be highly scalable and can handle thousands
of nodes in a cluster. It can scale up or down based on the requirements of the
applications running on the cluster.
Improved Performance: YARN offers better performance by providing a
centralized resource management system. It ensures that the resources are
optimally utilized, and applications are efficiently scheduled on the available
resources.
Security: YARN provides robust security features such as Kerberos
authentication, Secure Shell (SSH) access, and secure data transmission. It
ensures that the data stored and processed on the Hadoop cluster is secure.
Disadvantages:
MapReduce Architecture
MapReduce and HDFS are the two major components of Hadoop which make it so powerful and efficient to use. MapReduce is a programming model used for efficient processing in parallel over large data sets in a distributed manner. The data is first split and then combined to produce the final result. Libraries for MapReduce have been written in many programming languages, with various optimizations. The purpose of MapReduce in Hadoop is to map each of the jobs and then reduce them to equivalent tasks, providing less overhead over the cluster network and reducing the processing power needed. The MapReduce task is mainly divided into two phases: the Map phase and the Reduce phase.
MapReduce Architecture:
1. Client: The MapReduce client is the one who brings the Job to the MapReduce
for processing. There can be multiple clients available that continuously send
jobs for processing to the Hadoop MapReduce Manager.
2. Job: The MapReduce Job is the actual work that the client wanted to do which
is comprised of so many smaller tasks that the client wants to process or
execute.
3. Hadoop MapReduce Master: It divides the particular job into subsequent job-
parts.
4. Job-Parts: The tasks or sub-jobs that are obtained after dividing the main job. The results of all the job-parts are combined to produce the final output.
5. Input Data: The data set that is fed to MapReduce for processing.
6. Output Data: The final result obtained after the processing.
In MapReduce, we have a client. The client will submit a job of a particular size to the Hadoop MapReduce Master. Now, the MapReduce master will divide this job into further equivalent job-parts. These job-parts are then made available for the Map and Reduce tasks. The Map and Reduce tasks will contain the program as per the requirement of the use case that the particular company is solving. The developer writes their logic to fulfill the requirement that the industry requires. The input data we are using is then fed to the Map task, and the Map will generate intermediate key-value pairs as its output. The output of Map, i.e., these key-value pairs, is then fed to the Reducer, and the final output is stored on HDFS. There can be n number of Map and Reduce tasks made available for processing the data as per the requirement. The algorithms for Map and Reduce are written in a very optimized way so that the time complexity or space complexity is minimal.
Let’s discuss the MapReduce phases to get a better understanding of its
architecture:
The MapReduce task is mainly divided into 2 phases i.e. Map phase and Reduce
phase.
1. Map: As the name suggests, its main use is to map the input data into key-value pairs. The input to the map may be a key-value pair where the key can be the ID of some kind of address and the value is the actual value that it keeps. The Map() function will be executed in its memory repository on each of these input key-value pairs and generates the intermediate key-value pairs, which work as input for the Reducer or Reduce() function.
2. Reduce: The intermediate key-value pairs that work as input for the Reducer are shuffled, sorted, and sent to the Reduce() function. The Reducer aggregates or groups the data based on its key-value pairs as per the reducer algorithm written by the developer. A minimal driver that wires the two phases together is sketched below.
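Wiring the two phases together happens in a small driver program; a minimal sketch, reusing the hypothetical WordCount mapper and reducer from the earlier sketch and placeholder input/output paths, could look like this:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCountDriver {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        // Emit key and value separated by a space (the default separator is a tab).
        conf.set("mapreduce.output.textoutputformat.separator", " ");

        Job job = Job.getInstance(conf, "word count");
        job.setJarByClass(WordCountDriver.class);
        job.setMapperClass(WordCount.TokenMapper.class);      // mapper from the earlier sketch
        job.setCombinerClass(WordCount.SumReducer.class);     // optional combiner (same logic as reducer)
        job.setReducerClass(WordCount.SumReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);

        FileInputFormat.addInputPath(job, new Path("/data/input"));     // placeholder paths
        FileOutputFormat.setOutputPath(job, new Path("/data/output"));
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}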
How the Job Tracker and the Task Tracker deal with MapReduce:
1. Job Tracker: The work of the Job Tracker is to manage all the resources and all the jobs across the cluster and also to schedule each map on the Task Tracker running on the same data node, since there can be hundreds of data nodes available in the cluster.
2. Task Tracker: The Task Tracker can be considered the actual slave that works on the instructions given by the Job Tracker. A Task Tracker is deployed on each of the nodes available in the cluster and executes the Map and Reduce tasks as instructed by the Job Tracker.
Hadoop – Pros and Cons
Big Data has become necessary as industries are growing; the goal is to congregate information and find the hidden facts behind the data. Data defines how industries can improve their activities and affairs. A large number of industries revolve around data; a large amount of data is gathered and analyzed through various processes with various tools. Hadoop is one of the tools to deal with this huge amount of data, as it can easily extract information from data. Hadoop has its advantages and disadvantages when we deal with Big Data.
Pros
1. Cost
Hadoop is open-source and uses cost-effective commodity hardware, which provides a cost-efficient model, unlike traditional relational databases that require expensive hardware and high-end processors to deal with Big Data. The problem with traditional relational databases is that storing a massive volume of data is not cost-effective, so companies started to remove the raw data, which may not give the correct picture of their business. This means Hadoop provides us 2 main benefits with respect to cost: one is that it is open-source and free to use, and the other is that it uses commodity hardware, which is also inexpensive.
2. Scalability
Hadoop is a highly scalable model. A large amount of data is divided across multiple inexpensive machines in a cluster and processed in parallel. The number of these machines or nodes can be increased or decreased as per the enterprise's requirements. In a traditional RDBMS (Relational DataBase Management System), the systems cannot be scaled to approach large amounts of data.
3. Flexibility
Hadoop is designed in such a way that it can deal with any kind of dataset, like structured (MySQL data), semi-structured (XML, JSON), and unstructured (images and videos), very efficiently. This means it can easily process any kind of data independent of its structure, which makes it highly flexible. This is very useful for enterprises, as they can process large datasets easily, so businesses can use Hadoop to analyze valuable insights from data sources like social media, email, etc. With this flexibility, Hadoop can be used for log processing, data warehousing, fraud detection, etc.
4. Speed
Hadoop uses a distributed file system to manage its storage, i.e., HDFS (Hadoop Distributed File System). In a DFS (Distributed File System), a large file is broken into small file blocks and then distributed among the nodes available in a Hadoop cluster. This massive number of file blocks is processed in parallel, which makes Hadoop faster and provides high-level performance compared to traditional database management systems. When you are dealing with a large amount of unstructured data, speed is an important factor; with Hadoop you can easily access TBs of data in just a few minutes.
5. Fault Tolerance
Hadoop uses commodity hardware (inexpensive systems) which can crash at any moment. In Hadoop, data is replicated on various DataNodes in the cluster, which ensures the availability of data if any of your systems crash. If the machine you are reading data from faces a technical issue, the data can also be read from other nodes in the Hadoop cluster, because the data is copied or replicated by default. Hadoop makes 3 copies of each file block and stores them on different nodes.
6. High Throughput
Hadoop works on a distributed file system where various jobs are assigned to various DataNodes in the cluster, and the data for these jobs is processed in parallel in the Hadoop cluster, which produces high throughput. Throughput is nothing but the work or jobs done per unit time.
7. Minimum Network Traffic
In Hadoop, each task is divided into various small sub-tasks, which are then assigned to the DataNodes available in the Hadoop cluster. Each DataNode processes a small amount of data, which leads to low traffic in the Hadoop cluster.
Cons
1. Problem with Small Files
Hadoop can efficiently perform over a small number of files of large size. Hadoop stores files in the form of file blocks, which range from 128 MB (by default) to 256 MB in size. Hadoop fails when it needs to access a large number of small files. So many small files overload the NameNode and make it difficult to work.
2. Vulnerability
Hadoop is a framework that is written in Java, and Java is one of the most commonly used programming languages, which makes it more insecure, as it can be easily exploited by cyber-criminals.
3. Low Performance In Small Data Surroundings
Hadoop is mainly designed for dealing with large datasets, so it can be efficiently utilized by organizations that generate a massive volume of data. Its efficiency decreases when working in small-data surroundings.
4. Lack of Security
Data is everything for an organization; by default, the security feature in Hadoop is turned off. So whoever manages the data needs to be careful with this security aspect and should take appropriate action on it. Hadoop uses Kerberos for its security features, which is not easy to manage. Storage and network encryption are missing in Kerberos, which makes us more concerned about it.
5. High Up Processing
Read/write operations in Hadoop are costly, since we are dealing with large data sizes in TB or PB. In Hadoop, data is read from or written to the disk, which makes it difficult to perform in-memory calculations and leads to processing overhead, or "high up processing".
6. Supports Only Batch Processing
A batch process is nothing but a process that runs in the background and does not have any kind of interaction with the user. The engines used for these processes inside the Hadoop core are not that efficient. Producing the output with low latency is not possible with it.
Features of Hadoop
2. Highly Scalable Cluster:
Hadoop is a highly scalable model. A large amount of data is divided across multiple inexpensive machines in a cluster and processed in parallel. The number of these machines or nodes can be increased or decreased as per the enterprise's requirements. In a traditional RDBMS (Relational DataBase Management System), the systems cannot be scaled to approach large amounts of data.
3. Fault Tolerance is Available:
Hadoop uses commodity hardware (inexpensive systems) which can crash at any moment. In Hadoop, data is replicated on various DataNodes in the cluster, which ensures the availability of data if any of your systems crash. If the machine you are reading data from faces a technical issue, the data can also be read from other nodes in the Hadoop cluster, because the data is copied or replicated by default. By default, Hadoop makes 3 copies of each file block and stores them on different nodes. This replication factor is configurable and can be changed by changing the replication property in the hdfs-site.xml file.
4. High Availability is Provided:
Fault tolerance provides high availability in the Hadoop cluster. High availability means the availability of data on the Hadoop cluster. Due to fault tolerance, in case any of the DataNodes goes down, the same data can be retrieved from any other node where the data is replicated. A highly available Hadoop cluster also has 2 or more NameNodes, i.e., an active NameNode and a passive NameNode, also known as a standby NameNode. In case the active NameNode fails, the passive node takes over the responsibility of the active node and provides the same data as the active NameNode, which can easily be utilized by the user.
5. Cost-Effective:
Hadoop is open-source and uses cost-effective commodity hardware, which provides a cost-efficient model, unlike traditional relational databases that require expensive hardware and high-end processors to deal with Big Data. The problem with traditional relational databases is that storing a massive volume of data is not cost-effective, so companies started to remove the raw data, which may not give the correct picture of their business. This means Hadoop provides us 2 main benefits with respect to cost: one is that it is open-source and free to use, and the other is that it uses commodity hardware, which is also inexpensive.
6. Hadoop Provides Flexibility:
Hadoop is designed in such a way that it can deal with any kind of dataset, like structured (MySQL data), semi-structured (XML, JSON), and unstructured (images and videos), very efficiently. This means it can easily process any kind of data independent of its structure, which makes it highly flexible. It is very useful for enterprises, as they can process large datasets easily, so businesses can use Hadoop to analyze valuable insights from data sources like social media, email, etc. With this flexibility, Hadoop can be used for log processing, data warehousing, fraud detection, etc.
7. Easy to Use:
Hadoop is easy to use, since the developers need not worry about any of the distributed processing work, as it is managed by Hadoop itself. The Hadoop ecosystem is also very large and comes with lots of tools like Hive, Pig, Spark, HBase, Mahout, etc.
8. Hadoop Uses Data Locality:
The concept of data locality is used to make Hadoop processing fast. In the data locality concept, the computation logic is moved near the data rather than moving the data to the computation logic. Moving data around HDFS is costly, and with the help of the data locality concept, the bandwidth utilization in the system is minimized.
9. Provides Faster Data Processing:
Hadoop uses a distributed file system to manage its storage, i.e., HDFS (Hadoop Distributed File System). In a DFS (Distributed File System), a large file is broken into small file blocks and then distributed among the nodes available in a Hadoop cluster. This massive number of file blocks is processed in parallel, which makes Hadoop faster and provides high-level performance compared to traditional database management systems.
10. Support for Multiple Data Formats:
Hadoop supports multiple data formats like CSV, JSON, Avro, and more, making it
easier to work with different types of data sources. This makes it more convenient
for developers and data analysts to handle large volumes of data with different
formats.
11. High Processing Speed:
Hadoop’s distributed processing model allows it to process large amounts of data
at high speeds. This is achieved by distributing data across multiple nodes and
processing it in parallel. As a result, Hadoop can process data much faster than
traditional database systems.
12. Machine Learning Capabilities:
Hadoop offers machine learning capabilities through its ecosystem tools like
Mahout, which is a library for creating scalable machine learning applications. With
these tools, data analysts and developers can build machine learning models to
analyze and process large datasets.
13. Integration with Other Tools:
Hadoop integrates with other popular tools like Apache Spark, Apache Flink, and
Apache Storm, making it easier to build data processing pipelines. This integration
allows developers and data analysts to use their favorite tools and frameworks for
building data pipelines and processing large datasets.
14. Secure:
Hadoop provides built-in security features like authentication, authorization, and
encryption. These features help to protect data and ensure that only authorized
users have access to it. This makes Hadoop a more secure platform for processing
sensitive data.
15. Community Support:
Hadoop has a large community of users and developers who contribute to its
development and provide support to users. This means that users can access a
wealth of resources and support to help them get the most out of Hadoop.
Difference between Hadoop and MapReduce:
Based on  |  Hadoop  |  MapReduce
Definition  |  The Apache Hadoop is a software that allows all the distributed processing of large data sets across clusters of computers using simple programming.  |  MapReduce is a programming model which is an implementation for processing and generating big data sets with a distributed algorithm on a cluster.
Concept  |  The Apache Hadoop is an eco-system which provides an environment which is reliable, scalable and ready for distributed computing.  |  MapReduce is a submodule of this project which is a programming model and is used to process huge datasets which sit on HDFS (Hadoop Distributed File System).
Conversion  |  It converts the job into map-reduce functions.  |  It converts the query into map-reduce functions.
S.No.  |  MapReduce  |  Pig
4.  |  It is difficult for the user to perform join operations.  |  It makes it easy for the user to perform join operations.
5.  |  The user has to write 10 times more lines of code to perform a similar task than in Pig.  |  The user has to write fewer lines of code because it supports the multi-query approach.
6.  |  It has several jobs, therefore execution time is more.  |  The code execution time is more, but the development effort is less.
1. Pig : Pig programs are internally converted into map and reduce tasks. Pig abstraction is at a higher level; it contains fewer lines of code as compared to MapReduce.
2. Hive :
Hive is built on top of Hadoop and is used to process structured data in Hadoop. Hive was developed by Facebook. It provides a querying language which is frequently known as Hive Query Language. Apache Hive is a data warehouse which provides an SQL-like interface between the user and the Hadoop Distributed File System (HDFS), which integrates Hadoop.
Difference between Pig and Hive :
S.No.  |  Pig  |  Hive
1.  |  Pig operates on the client side of a cluster.  |  Hive operates on the server side of a cluster.
2.  |  Pig uses the Pig Latin language.  |  Hive uses the HiveQL language.
3.  |  Pig is a Procedural Data Flow Language.  |  Hive is a Declarative SQL-ish Language.
4.  |  It was developed by Yahoo.  |  It was developed by Facebook.
5.  |  It is used by researchers and programmers.  |  It is mainly used by data analysts.
6.  |  It is used to handle structured and semi-structured data.  |  It is mainly used to handle structured data.
7.  |  It is used for programming.  |  It is used for creating reports.
8.  |  Pig scripts end with the .pig extension.  |  In Hive, all extensions are supported.
9.  |  It does not support partitioning.  |  It supports partitioning.
10.  |  It loads data quickly.  |  It loads data slowly.
11.  |  It does not support JDBC.  |  It supports JDBC.
12.  |  It does not support ODBC.  |  It supports ODBC.
13.  |  Pig does not have a dedicated metadata database.  |  Hive makes use of the exact variation of dedicated SQL-DDL language by defining tables beforehand.
14.  |  It supports the Avro file format.  |  It does not support the Avro file format.
15.  |  Pig is suitable for complex and nested data structures.  |  Hive is suitable for batch-processing OLAP systems.
16.  |  Pig does not support schema to store data.  |  Hive supports schema for data insertion in tables.
17.  |  It is very easy to write UDFs to calculate matrices.  |  It does support UDFs but is much harder to debug.
Difference between Hive and MongoDB:
S.No.  |  Hive  |  MongoDB
2.  |  It is an open-source software.  |  It is also an open-source software.
5.  |  It supports C++, Java, PHP, and Python programming languages.  |  It supports many programming languages like C, C#, Java, JavaScript, PHP, Lua, Python, R, Ruby, etc.
8.  |  JDBC, ODBC, and Thrift are used as APIs and other access methods.  |  A proprietary protocol using JSON is used as the API and other access methods.
9.  |  It does not support in-memory capabilities.  |  It supports in-memory capabilities.
10.  |  No transaction concepts.  |  ACID properties of transactions are used.
Below is a table of differences between RDBMS and Hadoop:
Hadoop 1 vs Hadoop 2
1. Components:
Hadoop 1: HDFS, Map Reduce
Hadoop 2: HDFS, YARN / MRv2
2. Daemons:
Hadoop 1: Namenode, Datanode, Secondary Namenode, Job Tracker, Task Tracker
Hadoop 2: Namenode, Datanode, Secondary Namenode, Resource Manager, Node Manager
3. Working:
In Hadoop 1, there is HDFS, which is used for storage, and on top of it sits Map Reduce, which works as resource management as well as data processing. Due to this workload on Map Reduce, the performance is affected.
In Hadoop 2, there is again HDFS, which is again used for storage, and on top of HDFS there is YARN, which works as resource management. It basically allocates the resources and keeps everything going.
4. Limitations: Hadoop 1 is a master-slave architecture. It consists of a single master and multiple slaves. If the master node crashes, then irrespective of your best slave nodes, your cluster will be destroyed. Again, recreating that cluster, which means copying system files, image files, etc., onto another system, is too time-consuming, which will not be tolerated by organizations in today's time. Hadoop 2 is also a master-slave architecture, but it consists of multiple masters (i.e., active namenodes and standby namenodes) and multiple slaves. If the master node crashes here, then the standby master node will take over. You can make multiple combinations of active-standby nodes. Thus Hadoop 2 eliminates the problem of a single point of failure.
5. Ecosystem:
Oozie is basically a workflow scheduler. It decides the particular time for jobs to execute according to their dependencies.
Pig, Hive, and Mahout are data processing tools that work on top of Hadoop.
Sqoop is used to import and export structured data. You can directly import and export data between HDFS and a SQL database.
Flume is used to import and export unstructured data and streaming data.
6. Windows Support:
In Hadoop 1 there is no support for Microsoft Windows provided by Apache, whereas in Hadoop 2 there is support for Microsoft Windows.
Difference Between Hadoop and SQL
Hadoop: It is a framework that stores Big Data in distributed systems and then processes it in parallel. The four main components of Hadoop are the Hadoop Distributed File System (HDFS), YARN, MapReduce, and libraries. It involves not only large data but a mixture of structured, semi-structured, and unstructured information. Amazon, IBM, Microsoft, Cloudera, ScienceSoft, Pivotal, and Hortonworks are some of the companies using Hadoop technology.
SQL: Structured Query Language is a domain-specific language used in computing to handle data management in relational database management systems; it also processes data streams in relational data stream management systems. In a nutshell, SQL is a standard database language that is used for creating, storing, and extracting data from relational databases such as MySQL, Oracle, SQL Server, etc.
Below is a table of differences between Hadoop and SQL:
Feature  |  Hadoop  |  SQL
Fault Tolerance  |  Hadoop is highly fault tolerant.  |  SQL has good fault tolerance.
Data Update  |  Write data once, read data multiple times.  |  Read and write data multiple times.
Cloudera, ScienceSoft, Pivotal, Hortonworks are some of the companies using
Hadoop technology.
HBase: HBase is an open-source database from Apache that runs on a Hadoop cluster. It falls under the non-relational database management systems. Three important components of HBase are the HMaster, Region Server, and Zookeeper. Capital One, JPMorgan Chase, Apple, MTB, AT&T, and Lockheed Martin are some of the companies using HBase.
S.No.  |  Hadoop  |  HBase
4.  |  Data is stored in the form of chunks.  |  Data is stored in the form of key/value pairs.
7.  |  Hadoop has high-latency operations.  |  HBase has low-latency operations.
Hive: Hive is an application that runs over the Hadoop framework and provides an SQL-like interface for processing/querying the data. Hive was designed and developed by Facebook before becoming part of the Apache Hadoop project. Hive runs its queries using HQL (Hive Query Language). Hive has the same structure as an RDBMS, and almost the same commands can be used in Hive. Hive can store the data in external tables, so it is not mandatory to use HDFS; it also supports file formats such as ORC, Avro, SequenceFile, and text files.
Hadoop  |  Hive
In the simple Hadoop ecosystem, there is a need to write complex Java programs for the same data.  |  Using Hive, one can process/query the data without complex programming.
On one side, Hadoop frameworks need hundreds of lines for preparing a Java-based MR program.  |  Hive can query the same data using 8 to 10 lines of HQL.
Difference between Data Warehouse and Hadoop
1. Data Warehouse: It is a technique for gathering and managing information from different sources to supply significant business insights. A data warehouse is commonly used to join and analyze business information from heterogeneous sources. It acts as the heart of the BI system, which is constructed for data evaluation and reporting.
2. Hadoop: It is an open-source software framework for storing information and running applications on clusters of commodity hardware. It offers large storage for any sort of data, extensive processing strength, and the ability to deal with virtually limitless concurrent tasks or jobs.
Data Warehouse  |  Hadoop
1.  |  In this, we first analyze the data and then further do the processing.  |  It can process various types of data, such as structured data, unstructured data, or raw data.
Hadoop Version 3.0 – What's New?
Hadoop is a framework written in Java used to solve Big Data problems. The initial version of Hadoop was released in April 2006. The Apache community has made many changes since the day of the first release of Hadoop in the market. The journey of Hadoop was started in 2005 by Doug Cutting and Mike Cafarella. The reason behind developing Hadoop was to support distribution for the Nutch Search Engine project.
In 2008, Hadoop beat the supercomputers and became the fastest system ever made to sort terabytes of stored data. Hadoop has come a long way and has accommodated many changes from its previous version, i.e., Hadoop 2.x. In this article, we are going to discuss the changes made by Apache to Hadoop version 3.x to make it more efficient and faster.
1. Minimum Supported Version of Java
Since Oracle ended public updates for JDK 7 in 2015, to use Hadoop 3 users have to upgrade their Java version to JDK 8 or above to compile and run all the Hadoop files. JDK versions below 8 are no longer supported for using Hadoop 3.
2. Support for Erasure Coding in HDFS
Erasure coding is used to recover data when a computer hard disk fails. It is a high-level RAID (Redundant Array of Independent Disks) technology used by many IT companies to recover their data. The Hadoop file system, HDFS (Hadoop Distributed File System), uses erasure coding to provide fault tolerance in the Hadoop cluster. Since we are using commodity hardware to build our Hadoop cluster, failure of a node is normal. Hadoop 2 uses a replication mechanism to provide a similar kind of fault tolerance to that of erasure coding in Hadoop 3.
In Hadoop 2, replicas of the data blocks are made and then stored on different nodes in the Hadoop cluster. Erasure coding consumes less storage, about half of that of replication in Hadoop 2, to provide the same level of fault tolerance. With the increasing amount of data in the industry, developers can save a large amount of storage with erasure coding. Erasure coding minimizes the requirement of hard disks and improves fault tolerance by 50% with similar resources.
3. Support for More than Two NameNodes
The previous version of Hadoop supports a single active NameNode and a single standby NameNode. In the latest version of Hadoop, i.e., Hadoop 3.x, the data block replication is done among three JournalNodes (JNs). With the help of that, the Hadoop 3.x architecture is more capable of handling fault tolerance than its previous version. For big data problems where high fault tolerance is needed, Hadoop 3.x is very useful. In Hadoop 3.x, users can manage the number of standby nodes according to the requirement, since the facility of multiple standby NameNodes is provided.
For example, developers can now easily configure three NameNodes and five JournalNodes, with which our Hadoop cluster is capable of tolerating the failure of two nodes rather than a single one.
4. Shell Script Rewriting
The Hadoop file system utilizes various shell-type commands that directly interact with HDFS and the other file systems that Hadoop supports, such as WebHDFS, Local FS, S3 FS, etc. Multiple functionalities of Hadoop are controlled by the shell. The shell scripts used in the latest version of Hadoop, i.e., Hadoop 3.x, have fixed lots of bugs. Hadoop 3.x shell scripts also provide the functionality of rewriting the shell script.
5. Timeline Service v.2 for YARN
The YARN Timeline Service stores and retrieves the applicant's information (the information can be ongoing or historical). Timeline Service v.2 was important for improving the reliability and scalability of Hadoop. System usability is enhanced with the help of flows and aggregation. With Timeline Service v.1, users can only run a single instance of the reader/writer and storage architecture, which cannot be scaled further.
Timeline Service v.2 uses a distributed writer architecture where data read and write operations are separable. Here, distributed collectors are provided for every YARN (Yet Another Resource Negotiator) application. Timeline Service v.2 uses HBase for storage purposes, which can be scaled to massive size while providing good response times for reading and writing operations.
The information that Timeline Service v.2 stores can be of two major types:
A. Generic information about the completed application:
user information
queue name
count of attempts made per application
container information that runs for each attempt of the application
B. Per-framework information about running and completed applications:
count of Map and Reduce tasks
counters
information broadcast by the developer to the Timeline Server with the help of the Timeline client.
6. Filesystem Connector Support
The new Hadoop version 3.x now supports Azure Data Lake and the Aliyun Object Storage System, which are additional options for a Hadoop-compatible filesystem.
7. Change in Default Ports
In the previous version of Hadoop, multiple service ports for Hadoop were in the Linux ephemeral port range (32768-61000). In this kind of configuration, due to conflicts with some other application, the services would sometimes fail to bind to the ports. So to overcome this problem, Hadoop 3.x has moved the conflicting ports out of the Linux ephemeral port range, and new ports have been assigned, as shown below.
// The newly assigned ports
Namenode ports:     50470 -> 9871, 50070 -> 9870, 8020 -> 9820
Datanode ports:     50020 -> 9867, 50010 -> 9866, 50475 -> 9865, 50075 -> 9864
Secondary NN ports: 50091 -> 9869, 50090 -> 9868
8. Intra-Datanode Balancer
DataNodes are utilized in the Hadoop cluster for storage purposes. A DataNode handles multiple disks at a time. These disks get filled evenly during write operations, but adding or removing a disk can cause significant skew within a DataNode. The existing HDFS balancer cannot handle this significant skew, since it concerns itself with inter-, not intra-, DataNode skew. The latest intra-DataNode balancing feature can manage this situation; it is invoked with the help of the HDFS disk balancer CLI.
9. Reworked Daemon and Task Heap Management
In Hadoop version 3.x we can easily configure the Hadoop daemon heap size with some newly added options. With the help of the memory size of the host, auto-tuning is made available. Instead of HADOOP_HEAPSIZE, developers can use the HADOOP_HEAPSIZE_MAX and HADOOP_HEAPSIZE_MIN variables. The JAVA_HEAP_SIZE internal variable is also removed in this latest Hadoop version 3.x. The previous default heap sizes are also removed, and auto-tuning by the JVM (Java Virtual Machine) is used instead. If you want to use the older default, enable it by configuring HADOOP_HEAPSIZE_MAX in the hadoop-env.sh file.
Hadoop – Introduction
The definition of a powerful person has changed in this world. A powerful person is one who has access to the data. This is because data is increasing at a tremendous rate. Suppose we are living in a 100% data world; then 90% of the data was produced in the last 2 to 4 years. This is because now, when a child is born, before her mother, she first faces the flash of the camera. All these pictures and videos are nothing but data. Similarly, there is data from emails, various smartphone applications, statistical data, etc. All this data has the enormous power to affect various incidents and trends. This data is not only used by companies to affect their consumers but also by politicians to affect elections. This huge data is referred to as Big Data. In such a world, where data is being produced at such an exponential rate, it needs to be maintained, analyzed, and tackled. This is where Hadoop creeps in.
Hadoop is a framework of an open-source set of tools distributed under the Apache License. It is used to manage data, store data, and process data for various big data applications running under clustered systems. In previous years, Big Data was defined by the "3 Vs", but now there are "5 Vs" of Big Data, which are also termed the characteristics of Big Data.
Nobody except Google knew about this till that time. So, in the year 2003, Google released some papers on GFS. But it was not enough to understand the overall working of Google. So in 2004, Google again released the remaining papers. The two enthusiasts Doug Cutting and Michael Cafarella studied those papers and designed what is called Hadoop in the year 2005. Doug's son had a toy elephant whose name was Hadoop, and thus Doug and Michael gave their new creation the name "Hadoop", and hence the symbol of a toy elephant. This is how Hadoop evolved. Thus the designs of HDFS and MapReduce, though created by Doug Cutting and Michael Cafarella, were originally inspired by Google. For more details about the evolution of Hadoop, you can refer to Hadoop | History or Evolution.
Traditional Approach: Suppose we want to process some data. In the traditional approach, we used to store data on local machines. This data was then processed. Now, as data started increasing, the local machines or computers were not capable enough to store this huge data set. So, data then started to be stored on remote servers. Now suppose we need to process that data. In the traditional approach, this data has to be fetched from the servers and then processed. Suppose this data is 500 GB. Now, practically it is very complex and expensive to fetch this data. This approach is also called the Enterprise Approach.
In the new Hadoop approach, instead of fetching the data onto local machines, we send the query to the data. Obviously, the query to process the data will not be as huge as the data itself. Moreover, at the server, the query is divided into several parts. All these parts process the data simultaneously. This is called parallel execution and is possible because of MapReduce. So, now not only is there no need to fetch the data, but also the processing takes less time. The result of the query is then sent to the user. Thus Hadoop makes data storage, processing, and analysis way easier than the traditional approach.
Components of Hadoop: Hadoop has three components:
1. HDFS: Hadoop Distributed File System is a dedicated file system to store big
data with a cluster of commodity hardware or cheaper hardware with streaming
access pattern. It enables data to be stored at multiple nodes in the cluster
which ensures data security and fault tolerance.
2. Map Reduce : Data once stored in the HDFS also needs to be processed upon.
Now suppose a query is sent to process a data set in the HDFS. Now, Hadoop
identifies where this data is stored, this is called Mapping. Now the query is
broken into multiple parts and the results of all these multiple parts are
combined and the overall result is sent back to the user. This is called reduce
process. Thus while HDFS is used to store the data, Map Reduce is used to
process the data.
3. YARN : YARN stands for Yet Another Resource Negotiator. It is a dedicated
operating system for Hadoop which manages the resources of the cluster and
also functions as a framework for job scheduling in Hadoop. The various types
of scheduling are First Come First Serve, Fair Share Scheduler and Capacity
Scheduler etc. The First Come First Serve scheduling is set by default in YARN.
How the components of Hadoop make it as a solution for Big Data?
1. Hadoop Distributed File System: On our local PC, by default the block size on the hard disk is 4 KB. When we install Hadoop, HDFS by default changes the block size to 64 MB, since it is used to store huge data. We can also change the block size to 128 MB. Now HDFS works with a Data Node and a Name Node. While the Name Node is a master service and keeps the metadata about which commodity hardware the data is residing on, the Data Node stores the actual data. Now, since the block size is 64 MB, the storage required to store metadata is reduced, thus making HDFS better. Also, Hadoop stores three copies of every dataset at three different locations. This ensures that Hadoop is not prone to a single point of failure.
2. Map Reduce: In the simplest manner, it can be understood that MapReduce
breaks a query into multiple parts and now each part process the data
coherently. This parallel execution helps to execute a query faster and makes
Hadoop a suitable and optimal choice to deal with Big Data.
3. YARN: As we know that Yet Another Resource Negotiator works like an
operating system to Hadoop and as operating systems are resource managers
so YARN manages the resources of Hadoop so that Hadoop serves big data in
a better way.
Hadoop Versions: Till now there are three versions of Hadoop as follows.
Hadoop 1: This is the first and most basic version of Hadoop. It includes
Hadoop Common, Hadoop Distributed File System (HDFS), and Map Reduce.
Hadoop 2: The main difference from Hadoop 1 is the addition of YARN (Yet Another Resource Negotiator), which separates resource management from data processing.
Hadoop 3: This is the recent version of Hadoop. Along with the merits of the
first two versions, Hadoop 3 has one most important merit. It has resolved the
issue of single point failure by having multiple name nodes. Various other
advantages like erasure coding, use of GPU hardware and Dockers makes it
superior to the earlier versions of Hadoop.
Economically Feasible: It is cheaper to store data and process it than
it was in the traditional approach. Since the actual machines used to
store data are only commodity hardware.
Easy to Use: The projects or set of tools provided by Apache Hadoop
are easy to work upon in order to analyze complex data sets.
Open Source: Since Hadoop is distributed as an open source
software under Apache License, so one does not need to pay for it,
just download it and use it.
Fault Tolerance: Since Hadoop stores three copies of data, so even if
one copy is lost because of any commodity hardware failure, the data
is safe. Moreover, as Hadoop version 3 has multiple name nodes, so
even the single point of failure of Hadoop has also been removed.
Scalability: Hadoop is highly scalable in nature. If one needs to scale
up or scale down the cluster, one only needs to change the number of
commodity hardware in the cluster.
Distributed Processing: HDFS and Map Reduce ensures distributed
storage and processing of the data.
Locality of Data: This is one of the most alluring and promising
features of Hadoop. In Hadoop, to process a query over a data set,
instead of bringing the data to the local computer we send the query to
the server and fetch the final result from there. This is called data
locality.