
Hadoop Ecosystem

Apache Hadoop is an open-source framework intended to make working with big
data easier. For those who are not acquainted with this technology, one question
arises: what is big data? Big data is a term given to data sets that cannot be
processed efficiently with traditional methodologies such as an RDBMS. Hadoop has
made its place in industries and companies that need to work on large, sensitive
data sets that require efficient handling. Hadoop is a framework that enables the
processing of large data sets that reside across clusters of machines. Being a
framework, Hadoop is made up of several modules that are supported by a large
ecosystem of technologies.
Introduction: The Hadoop Ecosystem is a platform, or suite, that provides various
services to solve big data problems. It includes Apache projects as well as various
commercial tools and solutions. There are four major elements of
Hadoop: HDFS, MapReduce, YARN, and Hadoop Common Utilities. Most of
the other tools or solutions are used to supplement or support these major elements. All
these tools work collectively to provide services such as ingestion, analysis,
storage, and maintenance of data.
Following are the components that collectively form a Hadoop ecosystem:

 HDFS: Hadoop Distributed File System
 YARN: Yet Another Resource Negotiator
 MapReduce: Programming-based data processing
 Spark: In-memory data processing
 Pig, Hive: Query-based processing of data services
 HBase: NoSQL database
 Mahout, Spark MLlib: Machine learning algorithm libraries
 Solr, Lucene: Searching and indexing
 Zookeeper: Cluster management and coordination
 Oozie: Job scheduling

Note: Apart from the above-mentioned components, there are many other
components that are part of the Hadoop ecosystem.
All these toolkits or components revolve around one thing: data. That is the
beauty of Hadoop: everything revolves around data, which makes its analysis
easier.
HDFS:

 HDFS is the primary, or major, component of the Hadoop ecosystem and is
responsible for storing large data sets of structured or unstructured data across
various nodes, thereby maintaining the metadata in the form of log files.
 HDFS consists of two core components:
1. Name Node
2. Data Node
 The Name Node is the prime node and contains metadata (data about data),
requiring comparatively fewer resources than the Data Nodes, which store the
actual data. These Data Nodes are commodity hardware in the distributed
environment, which undoubtedly makes Hadoop cost effective.
 HDFS maintains all the coordination between the clusters and hardware, thus
working at the heart of the system.
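As a concrete illustration, the sketch below uses the standard HDFS Java client API (org.apache.hadoop.fs.FileSystem) to write and then read a small file; the NameNode address and file path are hypothetical placeholders, and in practice the configuration would normally be picked up from core-site.xml.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsReadWriteSketch {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        // Hypothetical NameNode address; normally taken from core-site.xml
        conf.set("fs.defaultFS", "hdfs://namenode:9000");

        try (FileSystem fs = FileSystem.get(conf)) {
            Path file = new Path("/user/demo/hello.txt");

            // Write a small file; the Name Node records the metadata,
            // the Data Nodes store the actual blocks
            try (FSDataOutputStream out = fs.create(file, true)) {
                out.writeUTF("Hello HDFS");
            }

            // Read it back
            try (FSDataInputStream in = fs.open(file)) {
                System.out.println(in.readUTF());
            }
        }
    }
}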
YARN:

 Yet Another Resource Negotiator, as the name implies, is the component that
helps manage the resources across the clusters. In short, it performs
scheduling and resource allocation for the Hadoop system.
 It consists of three major components:
1. Resource Manager
2. Node Manager
3. Application Manager
 The Resource Manager has the privilege of allocating resources for the
applications in the system, whereas Node Managers work on the allocation of
resources such as CPU, memory, and bandwidth per machine and later
acknowledge the Resource Manager. The Application Manager works as an
interface between the Resource Manager and Node Managers and performs
negotiations as per the requirements of the two.
MapReduce:

 By making use of distributed and parallel algorithms, MapReduce makes it
possible to carry the processing logic to the data and helps write applications
that transform big data sets into manageable ones.
 MapReduce uses two functions, Map() and Reduce(), whose tasks are:
1. Map() performs sorting and filtering of data, thereby organizing it into
groups. Map generates key-value pair results which are later
processed by the Reduce() method.
2. Reduce(), as the name suggests, does the summarization by aggregating the
mapped data. In simple terms, Reduce() takes the output generated by Map() as
input and combines those tuples into a smaller set of tuples. A word-count
sketch of these two functions is shown below.
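The following is a minimal word-count sketch of the two functions, written against the standard Hadoop MapReduce Java API (org.apache.hadoop.mapreduce); the class names are illustrative only, and in a real project each class would typically live in its own file.

import java.io.IOException;
import java.util.StringTokenizer;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

// Map(): splits each input line into words and emits (word, 1) pairs
class WordCountMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
    private static final IntWritable ONE = new IntWritable(1);
    private final Text word = new Text();

    @Override
    protected void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException {
        StringTokenizer tokens = new StringTokenizer(value.toString());
        while (tokens.hasMoreTokens()) {
            word.set(tokens.nextToken());
            context.write(word, ONE);
        }
    }
}

// Reduce(): sums the counts for each word produced by the mappers
class WordCountReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
    @Override
    protected void reduce(Text key, Iterable<IntWritable> values, Context context)
            throws IOException, InterruptedException {
        int sum = 0;
        for (IntWritable value : values) {
            sum += value.get();
        }
        context.write(key, new IntWritable(sum));
    }
}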
PIG:
Pig was originally developed by Yahoo. It works on Pig Latin, a query-based
language similar to SQL.
 It is a platform for structuring the data flow and for processing and analyzing huge data
sets.
 Pig does the work of executing commands, and in the background all the
activities of MapReduce are taken care of. After the processing, Pig stores the
result in HDFS.
 The Pig Latin language is specially designed for this framework and runs on Pig
Runtime, just the way Java runs on the JVM.
 Pig helps to achieve ease of programming and optimization and hence is a
major segment of the Hadoop ecosystem.
HIVE:

 With the help of an SQL-like methodology and interface, Hive performs reading and
writing of large data sets. Its query language is called HQL (Hive
Query Language).
 It is highly scalable, as it allows both real-time processing and batch processing.
Also, all the SQL data types are supported by Hive, making query
processing easier.
 Similar to other query-processing frameworks, Hive comes with two
components: JDBC drivers and the Hive command line.
 The JDBC and ODBC drivers establish the data storage
permissions and connection, whereas the Hive command line helps in the
processing of queries; a JDBC sketch follows below.
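As an illustration of the JDBC access path mentioned above, the sketch below connects to a HiveServer2 instance with the standard Hive JDBC driver (org.apache.hive.jdbc.HiveDriver) and runs a simple HQL query; the host, port, user, and table name are hypothetical.

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;

public class HiveJdbcSketch {
    public static void main(String[] args) throws Exception {
        // Register the Hive JDBC driver
        Class.forName("org.apache.hive.jdbc.HiveDriver");

        // Hypothetical HiveServer2 endpoint and database
        String url = "jdbc:hive2://hiveserver:10000/default";
        try (Connection conn = DriverManager.getConnection(url, "hiveuser", "");
             Statement stmt = conn.createStatement();
             ResultSet rs = stmt.executeQuery(
                     "SELECT category, COUNT(*) FROM sales GROUP BY category")) {
            while (rs.next()) {
                System.out.println(rs.getString(1) + " -> " + rs.getLong(2));
            }
        }
    }
}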
Mahout:

 Mahout adds machine-learning capability to a system or application. Machine
learning, as the name suggests, helps a system to improve itself based on
patterns, user/environment interaction, or algorithms.
 It provides libraries for collaborative filtering,
clustering, and classification, which are all core concepts of machine
learning. It allows invoking algorithms as per our needs with the help of its own
libraries.
Apache Spark:

 It is a platform that handles all the process-intensive tasks like batch
processing, interactive or iterative real-time processing, graph processing, and
visualization.
 It consumes in-memory resources, making it faster than the previous options in
terms of optimization.
 Spark is best suited for real-time data, whereas Hadoop is best suited for
structured data or batch processing; hence both are used in most
companies.
Apache HBase:

 It is a NoSQL database that supports all kinds of data and is thus capable of
handling anything in a Hadoop database. It provides the capabilities of Google's
BigTable and is therefore able to work on big data sets effectively.
 When we need to search for or retrieve a few small occurrences
in a huge database, the request must be processed within a short
span of time. At such times HBase comes in handy, as it gives us a fault-tolerant
way of storing and looking up limited data; a small client sketch follows below.
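The sketch below shows the standard HBase Java client API doing exactly that kind of small, keyed lookup; the table name, column family, and row key are hypothetical.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.Connection;
import org.apache.hadoop.hbase.client.ConnectionFactory;
import org.apache.hadoop.hbase.client.Get;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.client.Table;
import org.apache.hadoop.hbase.util.Bytes;

public class HBaseLookupSketch {
    public static void main(String[] args) throws Exception {
        Configuration conf = HBaseConfiguration.create();
        try (Connection connection = ConnectionFactory.createConnection(conf);
             Table table = connection.getTable(TableName.valueOf("users"))) {
            // Fetch a single row by its key instead of scanning the whole table
            Get get = new Get(Bytes.toBytes("user#1001"));
            Result result = table.get(get);
            byte[] name = result.getValue(Bytes.toBytes("info"), Bytes.toBytes("name"));
            System.out.println("name = " + (name == null ? "<missing>" : Bytes.toString(name)));
        }
    }
}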
Other Components: Apart from all of these, there are some other components
that carry out important tasks in order to make Hadoop capable of processing large
datasets. They are as follows:

 Solr, Lucene: These are two services that perform the task of searching
and indexing with the help of Java libraries. Lucene is a Java library that
also provides spell-check functionality, and Solr is built on top of Lucene.
 Zookeeper: There was a huge issue with the management of coordination and
synchronization among the resources and components of Hadoop, which
often resulted in inconsistency. Zookeeper overcame these problems by
providing synchronization, inter-component communication, grouping,
and maintenance.
 Oozie: Oozie simply performs the task of a scheduler, scheduling jobs and
binding them together as a single unit. There are two kinds of jobs: Oozie
workflow jobs and Oozie coordinator jobs. Oozie workflow jobs are those that need to be
executed in a sequentially ordered manner, whereas Oozie coordinator jobs are
those that are triggered when some data or external stimulus is given to them.
Hadoop – Architecture
As we all know, Hadoop is a framework written in Java that utilizes a large cluster
of commodity hardware to maintain and store big data. Hadoop works on the
MapReduce programming algorithm that was introduced by Google. Today, lots of
big-brand companies use Hadoop in their organizations to deal with big data,
e.g. Facebook, Yahoo, Netflix, eBay, etc. The Hadoop architecture mainly consists
of 4 components.

 MapReduce
 HDFS(Hadoop Distributed File System)
 YARN(Yet Another Resource Negotiator)
 Common Utilities or Hadoop Common

Let's understand the role of each of these components in detail.

1. MapReduce

MapReduce is nothing but an algorithmic model that runs on top of
the YARN framework. The major feature of MapReduce is that it performs
distributed processing in parallel in a Hadoop cluster, which is what makes Hadoop
so fast. When you are dealing with big data, serial processing is of no
use anymore. MapReduce mainly has 2 tasks which are divided phase-wise:
in the first phase Map is used, and in the next phase Reduce is used.

Here, the input is provided to the Map() function, then its output is
used as input to the Reduce() function, and after that we receive our final output.
Let's understand what Map() and Reduce() do.
As we can see, an input is provided to Map(), and since we are dealing with big
data, that input is a set of data. The Map() function breaks these data blocks
into tuples, which are nothing but key-value pairs. These key-value pairs are then
sent as input to Reduce(). The Reduce() function combines these
tuples (key-value pairs) based on their key, forms a smaller set of tuples, and
performs operations like sorting, summation, etc., which are then sent to
the final output node. Finally, the output is obtained.
The data processing is always done in the Reducer, depending upon the business
requirement of that industry. This is how first Map() and then Reduce() are used
one by one.
Let’s understand the Map Task and Reduce Task in detail.
Map Task:

 RecordReader: The purpose of the RecordReader is to break the input into records. It is
responsible for providing the key-value pairs to the Map() function. The key is
typically positional information and the value is the data associated with it.
 Map: A map is nothing but a user-defined function whose work is to process the
tuples obtained from the RecordReader. The Map() function may
generate zero, one, or multiple key-value pairs for each input tuple.
 Combiner: The combiner is used for grouping the data in the map workflow. It is
similar to a local reducer. The intermediate key-value pairs generated in the
map phase are combined with the help of this combiner. Using a combiner is not
necessary, as it is optional.
 Partitioner: The partitioner is responsible for fetching the key-value pairs generated in
the mapper phase. The partitioner generates the shards corresponding to each
reducer. The hash code of each key is fetched by the partitioner, which then
computes that hash code modulo the number of reducers (key.hashCode() %
numberOfReducers); a sketch of such a partitioner is shown after this list.
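A minimal custom partitioner that mirrors the hash-modulo rule described above, written against the standard org.apache.hadoop.mapreduce.Partitioner API; this is essentially what the built-in HashPartitioner does, and the class name is illustrative.

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Partitioner;

// Routes each intermediate (word, count) pair to a reducer
// using key.hashCode() % (number of reducers)
public class WordPartitioner extends Partitioner<Text, IntWritable> {
    @Override
    public int getPartition(Text key, IntWritable value, int numReduceTasks) {
        // Mask the sign bit so the result is always a valid partition index
        return (key.hashCode() & Integer.MAX_VALUE) % numReduceTasks;
    }
}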
Reduce Task

 Shuffle and Sort: The reducer's task starts with this step. The process in
which the mapper-generated intermediate key-value pairs are transferred to
the reducer task is known as shuffling. Using the shuffling process, the system
can sort the data by its key value.
Shuffling begins once some of the map tasks are done, which is why it is a
fast process; it does not wait for the completion of all the work performed by the
mappers.
 Reduce: The main task of the reducer is to gather the tuples
generated by the maps and then perform sorting and aggregation on those
key-value pairs, depending on their key element.
 OutputFormat: Once all the operations are performed, the key-value pairs are
written into a file with the help of the record writer, each record on a new line,
with the key and value separated (by default, by a tab character). A driver
sketch that wires these pieces together is shown below.
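For completeness, here is a minimal driver sketch using the standard org.apache.hadoop.mapreduce.Job API that wires the mapper, combiner, partitioner, reducer, and output format together. It assumes the WordCountMapper, WordCountReducer, and WordPartitioner classes sketched earlier, and the input/output paths are placeholders.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.mapreduce.lib.output.TextOutputFormat;

public class WordCountDriver {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        Job job = Job.getInstance(conf, "word count");
        job.setJarByClass(WordCountDriver.class);

        job.setMapperClass(WordCountMapper.class);
        job.setCombinerClass(WordCountReducer.class);    // local reducer
        job.setPartitionerClass(WordPartitioner.class);  // hash % reducers
        job.setReducerClass(WordCountReducer.class);
        job.setNumReduceTasks(2);

        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        job.setOutputFormatClass(TextOutputFormat.class); // key<TAB>value per line

        FileInputFormat.addInputPath(job, new Path("/input"));     // placeholder
        FileOutputFormat.setOutputPath(job, new Path("/output"));  // placeholder
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}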

2. HDFS

HDFS (Hadoop Distributed File System) is used as the storage layer. It is
mainly designed for working on commodity hardware (inexpensive
devices) and follows a distributed file system design. HDFS is designed in such a
way that it prefers storing data in large blocks rather than
storing many small data blocks.
HDFS provides fault tolerance and high availability to the storage layer
and the other devices present in the Hadoop cluster. The data storage nodes in
HDFS are:

 NameNode(Master)
 DataNode(Slave)
NameNode: The NameNode works as the master in a Hadoop cluster and guides the
DataNodes (slaves). The NameNode is mainly used for storing the metadata, i.e. the data
about the data. The metadata can be the transaction logs that keep track of the user's
activity in the Hadoop cluster.
The metadata can also be the name of a file, its size, and the information about the
location (block number, block IDs) of the DataNodes, which the NameNode stores to find the
closest DataNode for faster communication. The NameNode instructs the DataNodes
with operations like delete, create, replicate, etc.
DataNode: DataNodes work as slaves. DataNodes are mainly used for storing
the data in a Hadoop cluster; the number of DataNodes can range from 1 to 500 or
even more. The more DataNodes there are, the more data the Hadoop cluster will be
able to store. It is therefore advised that DataNodes have a high
storage capacity so they can hold a large number of file blocks.
High Level Architecture Of Hadoop

File Block In HDFS: Data in HDFS is always stored in terms of blocks. A
single file is divided into multiple blocks of 128MB each by default,
and you can also change the block size manually.

Let's understand this concept of breaking a file into blocks with an example.
Suppose you upload a 400MB file to HDFS. The
file gets divided into blocks of 128MB + 128MB + 128MB + 16MB = 400MB, i.e.
4 blocks are created, each of 128MB except the last one. Hadoop doesn't
know or care about what data is stored in these blocks, so it considers the
final file block a partial record, as it has no idea about its contents. In the
Linux file system the size of a file block is about 4KB, which is much less than
the default size of file blocks in the Hadoop file system. As we all know, Hadoop is
mainly configured for storing large data sets, measured in petabytes; this is what
makes the Hadoop file system different from other file systems: it can be scaled.
Nowadays, file blocks of 128MB to 256MB are used in Hadoop.
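The block split in the example above can be reproduced with a few lines of arithmetic; the sketch below is purely illustrative and is not a Hadoop API.

public class BlockSplitSketch {
    public static void main(String[] args) {
        long fileSize = 400L * 1024 * 1024;   // 400 MB file
        long blockSize = 128L * 1024 * 1024;  // 128 MB default HDFS block size

        long fullBlocks = fileSize / blockSize;   // 3 full blocks of 128 MB
        long lastBlock = fileSize % blockSize;    // 16 MB remainder block
        long totalBlocks = fullBlocks + (lastBlock > 0 ? 1 : 0);

        System.out.println(totalBlocks + " blocks, last block = "
                + (lastBlock / (1024 * 1024)) + " MB");
        // Prints: 4 blocks, last block = 16 MB
    }
}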
Replication In HDFS: Replication ensures the availability of the data. Replication
means making a copy of something, and the number of copies made of a
particular thing is its replication factor. As we have seen,
HDFS stores data in the form of various blocks, and at the same
time Hadoop is also configured to make copies of those file blocks.
By default, the replication factor for Hadoop is set to 3, which is configurable,
meaning you can change it manually as per your requirement. In the above example
we made 4 file blocks, which means that 3 replicas (copies) of each file block
are made, so a total of 4 × 3 = 12 blocks are kept for backup purposes.
This is because to run Hadoop we are using commodity hardware
(inexpensive system hardware) which can crash at any time; we are not
using supercomputers for our Hadoop setup. That is why we need a
feature in HDFS that can make copies of the file blocks for backup purposes;
this is known as fault tolerance.
One thing we also need to note is that by making so many replicas of our file
blocks we use much more storage, but for big-brand organizations
the data is far more important than the storage, so nobody worries about this extra
storage. You can configure the replication factor in your hdfs-site.xml file.
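If you prefer doing this programmatically instead of editing hdfs-site.xml, the sketch below sets the dfs.replication property through the standard Configuration/FileSystem Java API; the file path is a placeholder.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class ReplicationSketch {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        // Default replication factor for files created by this client
        conf.set("dfs.replication", "3");

        try (FileSystem fs = FileSystem.get(conf)) {
            // Change the replication factor of an existing file (placeholder path)
            fs.setReplication(new Path("/user/demo/hello.txt"), (short) 2);
        }
    }
}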
Rack Awareness: A rack is nothing but a physical collection of nodes in
our Hadoop cluster (maybe 30 to 40). A large Hadoop cluster consists of many
racks. With the help of this rack information, the NameNode chooses the
closest DataNode to achieve maximum performance while performing
read/write operations, which reduces network traffic.
HDFS Architecture

3. YARN(Yet Another Resource Negotiator)

YARN is the framework on which MapReduce works. YARN performs 2 operations:
job scheduling and resource management. The purpose of the job
scheduler is to divide a big task into small jobs so that each job can be assigned to
various slaves in a Hadoop cluster and processing can be maximized. The job
scheduler also keeps track of which job is important, which job has higher priority,
dependencies between the jobs, and all other information like job timing, etc.
The use of the Resource Manager is to manage all the resources that are made
available for running the Hadoop cluster.
Features of YARN

 Multi-Tenancy
 Scalability
 Cluster-Utilization
 Compatibility

4. Hadoop common or Common Utilities

Hadoop Common, or the common utilities, is nothing but the Java libraries and files
needed by all the other components present in
a Hadoop cluster. These utilities are used by HDFS, YARN, and MapReduce for
running the cluster. Hadoop Common assumes that hardware failure in a Hadoop
cluster is common, so it needs to be handled automatically in software by the Hadoop
framework.

Hadoop YARN Architecture


YARN stands for "Yet Another Resource Negotiator". It was introduced in
Hadoop 2.0 to remove the bottleneck of the Job Tracker, which was present in Hadoop
1.0. YARN was described as a "redesigned resource manager" at the time of its
launch, but it has now evolved into a large-scale distributed operating
system used for big data processing.

The YARN architecture basically separates the resource management layer from the
processing layer. With YARN, the responsibility of the Hadoop 1.0 Job Tracker is
split between the resource manager and the application manager.

YARN also allows different data processing engines like graph processing,
interactive processing, stream processing as well as batch processing to run and
process data stored in HDFS (Hadoop Distributed File System) thus making the
system much more efficient. Through its various components, it can dynamically
allocate various resources and schedule the application processing. For large
volume data processing, it is quite necessary to manage the available resources
properly so that every application can leverage them.
YARN Features: YARN gained popularity because of the following features-

 Scalability: The scheduler in the Resource Manager of the YARN architecture allows
Hadoop to extend to and manage thousands of nodes and clusters.
 Compatibility: YARN supports existing MapReduce applications without
disruption, making it compatible with Hadoop 1.0 as well.
 Cluster Utilization: YARN supports dynamic utilization of the cluster in
Hadoop, which enables optimized cluster utilization.
 Multi-tenancy: It allows multiple engines to access the cluster, giving organizations
the benefit of multi-tenancy.

Hadoop YARN Architecture

The main components of YARN architecture include:

 Client: It submits MapReduce jobs.
 Resource Manager: It is the master daemon of YARN and is responsible for
resource assignment and management among all the applications. Whenever it
receives a processing request, it forwards it to the corresponding Node Manager
and allocates resources for the completion of the request accordingly. It has two
major components:
 Scheduler: It performs scheduling based on the allocated application
and available resources. It is a pure scheduler, meaning it does not
perform other tasks such as monitoring or tracking and does not
guarantee a restart if a task fails. The YARN scheduler supports
plugins such as the Capacity Scheduler and the Fair Scheduler to partition the
cluster resources.
 Application Manager: It is responsible for accepting the application
and negotiating the first container from the Resource Manager. It also
restarts the Application Master container if a task fails.
 Node Manager: It takes care of an individual node in the Hadoop cluster and manages
the applications and workflow on that particular node. Its primary job is to keep up
with the Resource Manager. It registers with the Resource Manager and sends
heartbeats with the health status of the node. It monitors resource usage,
performs log management, and also kills a container based on directions from
the Resource Manager. It is also responsible for creating the container process
and starting it at the request of the Application Master.
 Application Master: An application is a single job submitted to the framework.
The Application Master is responsible for negotiating resources with the
Resource Manager, tracking the status, and monitoring the progress of a single
application. The Application Master requests a container from the Node
Manager by sending a Container Launch Context (CLC), which includes
everything the application needs to run. Once the application is started, it sends
a health report to the Resource Manager from time to time.
 Container: It is a collection of physical resources such as RAM, CPU cores, and
disk on a single node. Containers are invoked via the Container Launch
Context (CLC), which is a record that contains information such as environment
variables, security tokens, dependencies, etc.
Application workflow in Hadoop YARN:

1. The client submits an application
2. The Resource Manager allocates a container to start the Application Manager
3. The Application Manager registers itself with the Resource Manager
4. The Application Manager negotiates containers from the Resource Manager
5. The Application Manager notifies the Node Manager to launch containers
6. The application code is executed in the container
7. The client contacts the Resource Manager/Application Manager to monitor the
application's status
8. Once the processing is complete, the Application Manager un-registers with the
Resource Manager
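Steps 1 and 7 of this workflow can be observed from the client side through the standard org.apache.hadoop.yarn.client.api.YarnClient API. The sketch below only lists running applications and reads their reports; actual submission (step 1) needs a fully populated ApplicationSubmissionContext and is elided here.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.yarn.api.records.ApplicationReport;
import org.apache.hadoop.yarn.client.api.YarnClient;

public class YarnMonitorSketch {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        YarnClient yarnClient = YarnClient.createYarnClient();
        yarnClient.init(conf);
        yarnClient.start();

        // Step 7: ask the Resource Manager for the status of applications
        for (ApplicationReport report : yarnClient.getApplications()) {
            System.out.println(report.getApplicationId()
                    + " [" + report.getName() + "] "
                    + report.getYarnApplicationState()
                    + " progress=" + report.getProgress());
        }

        yarnClient.stop();
    }
}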

Advantages:

 Flexibility: YARN offers the flexibility to run various types of distributed processing
systems such as Apache Spark, Apache Flink, Apache Storm, and others. It
allows multiple processing engines to run simultaneously on a single Hadoop
cluster.
 Resource Management: YARN provides an efficient way of managing
resources in the Hadoop cluster. It allows administrators to allocate and monitor
the resources required by each application in a cluster, such as CPU, memory,
and disk space.
 Scalability: YARN is designed to be highly scalable and can handle thousands
of nodes in a cluster. It can scale up or down based on the requirements of the
applications running on the cluster.
 Improved Performance: YARN offers better performance by providing a
centralized resource management system. It ensures that the resources are
optimally utilized, and applications are efficiently scheduled on the available
resources.
 Security: YARN provides robust security features such as Kerberos
authentication, Secure Shell (SSH) access, and secure data transmission. It
ensures that the data stored and processed on the Hadoop cluster is secure.

Disadvantages:

 Complexity: YARN adds complexity to the Hadoop ecosystem. It requires
additional configuration and settings, which can be difficult for users who are
not familiar with YARN.
 Overhead: YARN introduces additional overhead, which can slow down the
performance of the Hadoop cluster. This overhead is required for managing
resources and scheduling applications.
 Latency: YARN introduces additional latency in the Hadoop ecosystem. This
latency can be caused by resource allocation, application scheduling, and
communication between components.
 Single Point of Failure: YARN can be a single point of failure in the Hadoop
cluster. If YARN fails, it can cause the entire cluster to go down. To avoid this,
administrators need to set up a backup YARN instance for high availability.
 Limited Support: YARN has limited support for non-Java programming
languages. Although it supports multiple processing engines, some engines
have limited language support, which can limit the usability of YARN in certain
environments.

MapReduce Architecture
MapReduce and HDFS are the two major components of Hadoop which make it
so powerful and efficient to use. MapReduce is a programming model used for
efficient parallel processing of large data sets in a distributed manner. The
data is first split and then combined to produce the final result. Libraries for
MapReduce have been written in many programming languages, with various
optimizations. The purpose of MapReduce in Hadoop is to map each of
the jobs and then reduce them to equivalent tasks, providing less overhead
on the cluster network and reducing the processing power needed. The MapReduce
task is mainly divided into two phases: the Map phase and the Reduce phase.
MapReduce Architecture:

Components of MapReduce Architecture:

1. Client: The MapReduce client is the one who brings the job to MapReduce
for processing. There can be multiple clients that continuously send
jobs for processing to the Hadoop MapReduce Manager.
2. Job: The MapReduce job is the actual work that the client wants to do, which
is comprised of many smaller tasks that the client wants to process or
execute.
3. Hadoop MapReduce Master: It divides the particular job into subsequent
job-parts.
4. Job-Parts: The tasks or sub-jobs that are obtained after dividing the main job.
The results of all the job-parts are combined to produce the final output.
5. Input Data: The data set that is fed to MapReduce for processing.
6. Output Data: The final result obtained after the processing.
In MapReduce, we have a client. The client submits a job of a particular size
to the Hadoop MapReduce Master. The MapReduce Master divides this job
into further equivalent job-parts. These job-parts are then made available to the
Map and Reduce tasks. The Map and Reduce tasks contain the program required by
the use case that the particular company is solving; the
developer writes the logic to fulfill the requirement that the industry has. The
input data is fed to the Map task, and the Map generates
intermediate key-value pairs as its output. The output of Map, i.e. these
key-value pairs, is then fed to the Reducer, and the final output is stored on
HDFS. There can be any number of Map and Reduce tasks made available for
processing the data as per the requirement. The algorithms for Map and Reduce are
written in an optimized way such that the time and space complexity
are minimal.
Let’s discuss the MapReduce phases to get a better understanding of its
architecture:
The MapReduce task is mainly divided into 2 phases i.e. Map phase and Reduce
phase.
1. Map: As the name suggests, its main use is to map the input data into key-value
pairs. The input to the map may be a key-value pair where the key can be the ID
of some kind of address and the value is the actual content.
The Map() function is executed in its memory repository on each of these
input key-value pairs and generates intermediate key-value pairs, which
work as input for the Reducer, or Reduce(), function.

2. Reduce: The intermediate key-value pairs that work as input for the Reducer are
shuffled, sorted, and sent to the Reduce() function. The Reducer aggregates or
groups the data based on its key-value pairs as per the reducer algorithm written
by the developer.
How Job tracker and the task tracker deal with MapReduce:
1. Job Tracker: The work of the Job Tracker is to manage all the resources and all the
jobs across the cluster and also to schedule each map on the Task Tracker
running on the same data node, since there can be hundreds of data nodes
available in the cluster.

2. Task Tracker: The Task Tracker can be considered the actual slave that
works on the instructions given by the Job Tracker. A Task Tracker is
deployed on each of the nodes available in the cluster and executes the Map
and Reduce tasks as instructed by the Job Tracker.

Hadoop – Pros and Cons
Big data has become essential as industries grow; the goal is to
gather information and find the hidden facts behind the data. Data defines how
industries can improve their activities and affairs. A large number of industries
revolve around data; a large amount of data is gathered and
analyzed through various processes with various tools. Hadoop is one of the tools
used to deal with this huge amount of data, as it can easily extract information from
data. Hadoop has its advantages and disadvantages when we deal with big data.

Pros

1. Cost
Hadoop is open-source and uses cost-effective commodity hardware, which
provides a cost-efficient model, unlike traditional relational databases that require
expensive hardware and high-end processors to deal with big data. The problem
with traditional relational databases is that storing a massive volume of data is
not cost-effective, so companies started to remove the raw data, which may
not give a correct picture of their business. Hadoop thus provides 2
main cost benefits: it is open-source and free to use, and it
uses commodity hardware, which is also inexpensive.
2. Scalability
Hadoop is a highly scalable model. A large amount of data is divided among multiple
inexpensive machines in a cluster and processed in parallel. The number of
these machines or nodes can be increased or decreased as per the enterprise's
requirements. In a traditional RDBMS (Relational DataBase Management System),
the system cannot be scaled to handle large amounts of data.
3. Flexibility
Hadoop is designed in such a way that it can deal with any kind of dataset, like
structured (MySQL data), semi-structured (XML, JSON), and unstructured (images and
videos), very efficiently. This means it can easily process any kind of data
independent of its structure, which makes it highly flexible. This is very
useful for enterprises, as they can process large datasets easily, so businesses
can use Hadoop to analyze valuable data insights from sources like social
media, email, etc. With this flexibility, Hadoop can be used for log processing, data
warehousing, fraud detection, etc.
4. Speed
Hadoop uses a distributed file system to manage its storage, i.e. HDFS (Hadoop
Distributed File System). In a DFS (Distributed File System), a large file is broken
into small file blocks and then distributed among the nodes available in a Hadoop
cluster; this massive number of file blocks is processed in parallel, which makes
Hadoop fast and gives it high performance compared
to traditional database management systems. When you are
dealing with a large amount of unstructured data, speed is an important factor; with
Hadoop you can easily access TBs of data in just a few minutes.
5. Fault Tolerance
Hadoop uses commodity hardware (inexpensive systems) which can crash at
any moment. In Hadoop, data is replicated on various DataNodes in the
cluster, which ensures the availability of data if any of your systems
crashes. If the machine you are reading data from faces a
technical issue, the data can also be read from other nodes in the Hadoop cluster,
because the data is copied or replicated by default. Hadoop makes 3 copies of
each file block and stores them on different nodes.
6. High Throughput
Hadoop works on a distributed file system where various jobs are assigned to
various DataNodes in the cluster; these chunks of data are processed in parallel in
the Hadoop cluster, which produces high throughput. Throughput is nothing but the
amount of work done per unit of time.
7. Minimum Network Traffic
In Hadoop, each task is divided into various small sub-tasks which are then assigned
to the DataNodes available in the Hadoop cluster. Each DataNode processes a small
amount of data, which leads to low traffic in the Hadoop cluster.

Cons
1. Problem with Small Files
Hadoop performs efficiently over a small number of files of large size. Hadoop
stores files in the form of file blocks which range from 128MB (by default) to
256MB in size. Hadoop struggles when it needs to access a large number of small
files; so many small files overload the NameNode and make it difficult to work with.
2. Vulnerability
Hadoop is a framework written in Java, and Java is one of the most
commonly used programming languages, which makes it more insecure, as it can
be more easily exploited by cyber-criminals.
3. Low Performance In Small Data Surroundings
Hadoop is mainly designed for dealing with large datasets, so it is most efficiently
utilized by organizations that generate a massive volume of data. Its
efficiency decreases when working with small amounts of data.
4. Lack of Security
Data is everything for an organization, yet by default the security features in Hadoop are
turned off. So the person managing the data needs to be careful with this security aspect
and take appropriate action. Hadoop uses Kerberos for its security
features, which is not easy to manage. Storage and network encryption are missing
in Kerberos, which makes it more of a concern.
5. High Processing Overhead
Read/write operations in Hadoop are expensive, since we are dealing with large
amounts of data, in TBs or PBs. In Hadoop, data is read from and written to the disk,
which makes it difficult to perform in-memory calculations and leads to processing
overhead.
6. Supports Only Batch Processing
A batch process is nothing but a process that runs in the background
and does not have any kind of interaction with the user. The engines used for
these processes inside the Hadoop core are not that efficient; producing
output with low latency is not possible with them.

Hadoop – Features of Hadoop Which Make It Popular
Today, tons of companies are adopting Hadoop big data tools to solve their big
data queries and serve their customer market segments. There are lots of other tools
available in the market, like HPCC developed by LexisNexis Risk Solutions,
Storm, Qubole, Cassandra, Statwing, CouchDB, Pentaho, OpenRefine, Flink, etc.
So why is Hadoop so popular among all of them? Here we will discuss some
essential, industry-ready features that make Hadoop so popular and an industry
favorite.
Hadoop is a framework written in Java, with some code in C and shell script, that
works over a collection of simple commodity hardware to deal with
large datasets using a very basic programming model. It was developed by Doug
Cutting and Mike Cafarella and now comes under the Apache License 2.0.
Hadoop is considered a must-learn skill for data scientists and big
data technology. Companies are investing big in it, and it will remain an in-demand
skill in the future. Hadoop 3.x is the latest version of Hadoop. Hadoop
consists mainly of 3 components.
1. HDFS (Hadoop Distributed File System): HDFS works as the storage layer
of Hadoop. The data is always stored in the form of data blocks on HDFS,
where the default size of each data block is 128 MB, which is
configurable. Hadoop works on the MapReduce algorithm, which is a master-
slave architecture. HDFS has a NameNode and DataNodes that work in a similar
pattern.
2. MapReduce: MapReduce works as the processing layer of Hadoop. Map-
Reduce is a programming model that is mainly divided into two
phases, the Map phase and the Reduce phase. It is designed for processing data in
parallel, divided across various machines (nodes).
3. YARN (Yet Another Resource Negotiator): YARN is the job scheduling and
resource management layer of Hadoop. The data stored on HDFS is processed
and run with the help of data processing engines such as graph processing,
interactive processing, batch processing, etc. The overall performance of
Hadoop is improved with the help of the YARN framework.
Features of Hadoop Which Make It Popular
Let's discuss the key features which make Hadoop more reliable to use, an
industry favorite, and the most powerful big data tool.
1. Open Source:
Hadoop is open-source, which means it is free to use. Since it is an open-source
project the source-code is available online for anyone to understand it or make
some modifications as per their industry requirement.

2. Highly Scalable Cluster:
Hadoop is a highly scalable model. A large amount of data is divided among multiple
inexpensive machines in a cluster and processed in parallel. The number of
these machines or nodes can be increased or decreased as per the enterprise's
requirements. In a traditional RDBMS (Relational DataBase Management System),
the system cannot be scaled to handle large amounts of data.
3. Fault Tolerance is Available:
Hadoop uses commodity hardware (inexpensive systems) which can crash at
any moment. In Hadoop, data is replicated on various DataNodes in the
cluster, which ensures the availability of data if any of your systems
crashes. If the machine you are reading data from faces a
technical issue, the data can also be read from other nodes in the Hadoop cluster,
because the data is copied or replicated by default. By default, Hadoop makes 3
copies of each file block and stores them on different nodes. This replication factor is
configurable and can be changed by changing the replication property in the hdfs-
site.xml file.
4. High Availability is Provided:
Fault tolerance provides high availability in the Hadoop cluster. High availability
means the availability of data on the Hadoop cluster. Due to fault tolerance, in case
any of the DataNodes goes down, the same data can be retrieved from any other
node where the data is replicated. A highly available Hadoop cluster also has 2 or
more NameNodes, i.e. an active NameNode and a passive NameNode, also
known as a standby NameNode. In case the active NameNode fails, the passive
node takes over the responsibility of the active node and provides the same data as
the active NameNode, which can easily be utilized by the user.
5. Cost-Effective:
Hadoop is open-source and uses cost-effective commodity hardware, which
provides a cost-efficient model, unlike traditional relational databases that require
expensive hardware and high-end processors to deal with big data. The problem
with traditional relational databases is that storing a massive volume of data is
not cost-effective, so companies started to remove the raw data, which may
not give a correct picture of their business. Hadoop thus provides 2
main cost benefits: it is open-source and free to use, and it
uses commodity hardware, which is also inexpensive.
6. Hadoop Provides Flexibility:
Hadoop is designed in such a way that it can deal with any kind of dataset, like
structured (MySQL data), semi-structured (XML, JSON), and unstructured (images and
videos), very efficiently. This means it can easily process any kind of data
independent of its structure, which makes it highly flexible. It is very
useful for enterprises, as they can process large datasets easily, so businesses
can use Hadoop to analyze valuable data insights from sources like social media, email,
etc. With this flexibility, Hadoop can be used for log processing, data
warehousing, fraud detection, etc.

7. Easy to Use:
Hadoop is easy to use, since the developers need not worry about any of the
distributed processing plumbing; it is managed by Hadoop itself. The Hadoop ecosystem is
also very large and comes with lots of tools like Hive, Pig, Spark, HBase, Mahout,
etc.
8. Hadoop Uses Data Locality:
The concept of data locality is used to make Hadoop processing fast. In the data
locality concept, the computation logic is moved near the data rather than moving the
data to the computation logic. Moving data in HDFS is the costliest operation, and
with the help of the data locality concept, the bandwidth utilization of the system is
minimized.
9. Provides Faster Data Processing:
Hadoop uses a distributed file system to manage its storage, i.e. HDFS (Hadoop
Distributed File System). In a DFS (Distributed File System), a large file is broken
into small file blocks and then distributed among the nodes available in a Hadoop
cluster; this massive number of file blocks is processed in parallel, which makes
Hadoop fast and gives it high performance compared to traditional
database management systems.
10. Support for Multiple Data Formats:
Hadoop supports multiple data formats like CSV, JSON, Avro, and more, making it
easier to work with different types of data sources. This makes it more convenient
for developers and data analysts to handle large volumes of data with different
formats.
11. High Processing Speed:
Hadoop’s distributed processing model allows it to process large amounts of data
at high speeds. This is achieved by distributing data across multiple nodes and
processing it in parallel. As a result, Hadoop can process data much faster than
traditional database systems.
12. Machine Learning Capabilities:
Hadoop offers machine learning capabilities through its ecosystem tools like
Mahout, which is a library for creating scalable machine learning applications. With
these tools, data analysts and developers can build machine learning models to
analyze and process large datasets.
13. Integration with Other Tools:
Hadoop integrates with other popular tools like Apache Spark, Apache Flink, and
Apache Storm, making it easier to build data processing pipelines. This integration
allows developers and data analysts to use their favorite tools and frameworks for
building data pipelines and processing large datasets.
14. Secure:
Hadoop provides built-in security features like authentication, authorization, and
encryption. These features help to protect data and ensure that only authorized
users have access to it. This makes Hadoop a more secure platform for processing
sensitive data.

15. Community Support:
Hadoop has a large community of users and developers who contribute to its
development and provide support to users. This means that users can access a
wealth of resources and support to help them get the most out of Hadoop.

Difference Between Hadoop and MapReduce


Hadoop: The Hadoop software is a framework that permits the distributed
processing of huge data sets across clusters of computers using simple
programming models. In simple terms, Hadoop is a framework for processing 'Big
Data'. Hadoop was created by Doug Cutting and Mike Cafarella.
It is designed to scale up from single servers to thousands of machines, each offering
local computation and storage. Hadoop is open-source software. The core of
Apache Hadoop consists of a storage part, known as the Hadoop Distributed File
System (HDFS), and a processing part, which is the MapReduce programming
model. Hadoop splits files into large blocks and distributes them across nodes
in a cluster. It then transfers packaged code to the nodes to process the data in
parallel.
MapReduce: MapReduce is a programming model that is used for processing and
generating large data sets on clusters of computers. It was introduced by Google.
MapReduce is a concept, or a method, for large-scale parallelization. It is inspired by
functional programming's map() and reduce() functions.
A MapReduce program is executed in three stages:
 Mapping: The mapper's job is to process the input data. Each node applies the map
function to its local data.
 Shuffle: Here data is redistributed among nodes based on the output keys
(the output keys are produced by the map function).
 Reduce: Nodes now process each group of output data, per key, in
parallel.

Below is a table of differences between Hadoop and MapReduce:

1. Definition
Hadoop: Apache Hadoop is software that allows the distributed processing of large data sets across clusters of computers using simple programming models.
MapReduce: MapReduce is a programming model which is an implementation for processing and generating big data sets with a distributed algorithm on a cluster.

2. Meaning
Hadoop: The name "Hadoop" was taken from Doug Cutting's son's toy elephant. He named the project "Hadoop" because it was easy to pronounce.
MapReduce: The name "MapReduce" came into existence from the functionality itself of mapping and reducing key-value pairs.

3. Framework
Hadoop: Hadoop not only has a storage framework which stores the data (creating name nodes and data nodes), it also has other frameworks, which include MapReduce itself.
MapReduce: MapReduce is a programming framework which uses key-value mappings to sort/process the data.

4. Invention
Hadoop: Hadoop was created by Doug Cutting and Mike Cafarella.
MapReduce: MapReduce was invented by Google.

5. Features
Hadoop: Hadoop is open source; a Hadoop cluster is highly scalable.
MapReduce: MapReduce provides fault tolerance; MapReduce provides high availability.

6. Concept
Hadoop: Apache Hadoop is an ecosystem which provides an environment that is reliable, scalable, and ready for distributed computing.
MapReduce: MapReduce is a submodule of this project; it is a programming model used to process huge datasets which sit on HDFS (Hadoop Distributed File System).

7. Language
Hadoop: Hadoop is a collection of all modules and hence may include other programming/scripting languages too.
MapReduce: MapReduce is basically written in the Java programming language.

8. Pre-requisites
Hadoop: Hadoop runs on HDFS (Hadoop Distributed File System).
MapReduce: MapReduce can run on HDFS/GFS/NDFS or any other distributed file system, for example MapR-FS.

Difference between MapReduce and Pig


MapReduce is a model that works over Hadoop to efficiently access big data
stored in HDFS (Hadoop Distributed File System). It is the core component of
Hadoop, which divides big data into small chunks and processes them in parallel.
Features of MapReduce:
 It can store and distribute huge data across various servers.
 It allows users to store data in a map-and-reduce form to get it processed.
 It protects the system from unauthorized access.
 It supports the parallel processing model.
Pig is an open-source tool that is built on the Hadoop ecosystem to provide
better processing of big data. It provides an SQL-like, high-level scripting
language commonly known as Pig Latin. Pig scripts let you
create user-defined functions for analyzing and processing data. It works on
HDFS (Hadoop Distributed File System), which supports the use of various types
of data. MapReduce tasks can be performed easily by using Pig, even without
having a good knowledge of Java.
Features of Pig:
 It allows the user to create custom user-defined functions.
 It is extensible.
 It supports a variety of data types such as char, long, float, schemas, and functions.
 It provides different operations on HDFS such as GROUP, FILTER, JOIN, and SORT.

Difference between MapReduce and Pig:

1. MapReduce: It is a data processing language. Pig: It is a data flow language.
2. MapReduce: It converts the job into map-reduce functions. Pig: It converts the query into map-reduce functions.
3. MapReduce: It is a low-level language. Pig: It is a high-level language.
4. MapReduce: It is difficult for the user to perform join operations. Pig: It makes it easy for the user to perform join operations.
5. MapReduce: The user has to write about 10 times more lines of code to perform a similar task than in Pig. Pig: The user has to write fewer lines of code because it supports the multi-query approach.
6. MapReduce: It has several jobs, therefore the execution time is longer. Pig: It has less compilation time, as the Pig operator converts it into MapReduce jobs.
7. MapReduce: It is supported by recent versions of Hadoop. Pig: It is supported by all versions of Hadoop.

Difference Between MapReduce and Hive


MapReduce is a model that works over Hadoop to efficiently access big data
stored in HDFS (Hadoop Distributed File System). It is the core component of
Hadoop, which divides big data into small chunks and processes them in parallel.
Features of MapReduce:
 It can store and distribute huge data across various servers.
 It allows users to store data in a map-and-reduce form to get it processed.
 It protects the system from unauthorized access.
 It supports the parallel processing model.
Hive is an initiative started by Facebook to provide a traditional data warehouse
interface for MapReduce programming. Queries for MapReduce are written in SQL
fashion, and the Hive compiler converts them in the background to be executed on the
Hadoop cluster. It helps programmers use their SQL knowledge rather than
focusing on developing a new language.
Features of Hive:
 It provides an SQL-like language called HQL.
 It helps in querying large data sets stored in HDFS (Hadoop Distributed File
System).
 It is an open-source tool.
 It supports flexible project views and makes data visualization easy.
MapReduce vs Hive

1. MapReduce: It is a data processing language. Hive: It is an SQL-like query language.
2. MapReduce: The developer writes the job as map-reduce functions. Hive: It converts HQL (Hive-QL) queries into MapReduce functions in the background.
3. MapReduce: It provides a low level of abstraction. Hive: It provides a high level of abstraction.
4. MapReduce: It is difficult for the user to perform join operations. Hive: It makes it easy for the user to perform SQL-like operations on HDFS.
5. MapReduce: The user has to write many more lines of code to perform a similar task. Hive: The user has to write only a few lines of code.
6. MapReduce: It has several jobs, therefore the execution time is longer. Hive: The code execution time is longer, but the development effort is less.
7. MapReduce: It is supported by all versions of Hadoop. Hive: It is supported by recent versions of Hadoop.

Difference between Pig and Hive


1. Pig:
Pig is used for the analysis of large amounts of data. It is an abstraction over
MapReduce. Pig is used to perform all kinds of data manipulation operations in
Hadoop. It provides the Pig Latin language to write code, with many
built-in functions like join, filter, etc. The two parts of Apache Pig are Pig Latin
and the Pig Engine. The Pig Engine is used to convert these scripts into specific map
and reduce tasks. Pig's abstraction is at a higher level; it requires fewer lines of code
as compared to MapReduce.
2. Hive:
Hive is built on top of Hadoop and is used to process structured data in
Hadoop. Hive was developed by Facebook. It provides a querying
language frequently known as the Hive Query Language. Apache Hive is a
data warehouse that provides an SQL-like interface between the user and
the Hadoop Distributed File System (HDFS).
Difference between Pig and Hive:
1. Pig: Pig operates on the client side of a cluster. Hive: Hive operates on the server side of a cluster.
2. Pig: Pig uses the Pig Latin language. Hive: Hive uses the HiveQL language.
3. Pig: Pig is a procedural data flow language. Hive: Hive is a declarative SQL-like language.
4. Pig: It was developed by Yahoo. Hive: It was developed by Facebook.
5. Pig: It is used by researchers and programmers. Hive: It is mainly used by data analysts.
6. Pig: It is used to handle structured and semi-structured data. Hive: It is mainly used to handle structured data.
7. Pig: It is used for programming. Hive: It is used for creating reports.
8. Pig: Pig scripts end with the .pig extension. Hive: In Hive, all extensions are supported.
9. Pig: It does not support partitioning. Hive: It supports partitioning.
10. Pig: It loads data quickly. Hive: It loads data slowly.
11. Pig: It does not support JDBC. Hive: It supports JDBC.
12. Pig: It does not support ODBC. Hive: It supports ODBC.
13. Pig: Pig does not have a dedicated metadata database. Hive: Hive makes use of a dedicated SQL-DDL-like language, defining tables beforehand.
14. Pig: It supports the Avro file format. Hive: It does not support the Avro file format.
15. Pig: Pig is suitable for complex and nested data structures. Hive: Hive is suitable for batch-processing OLAP systems.
16. Pig: It does not support a schema to store data. Hive: Hive supports a schema for data insertion in tables.
17. Pig: It is very easy to write UDFs to calculate matrices. Hive: It supports UDFs, but they are much harder to debug.

Difference between Hive and MongoDB


1. Hive:
Hive is data warehouse software for querying and managing large distributed
datasets, built on Hadoop. It was developed by the Apache Software Foundation in 2012.
It builds on two Hadoop modules, MapReduce and the Hadoop Distributed File
System (HDFS). It stores its schema in a database and the processed data in HDFS. It
resides on top of Hadoop to summarize big data and make querying and
analysis easy.
2. MongoDB:
MongoDB is a cross-platform, document-oriented, non-relational (NoSQL)
database program. It is an open-source document database that stores data in
the form of key-value pairs. MongoDB is developed by MongoDB Inc. and was initially
released on 11 February 2009. It is written in the C++, Go, JavaScript, and Python
languages. MongoDB offers high speed, high availability, and high scalability.

Difference between Hive and MongoDB :


1. Hive: It was developed by the Apache Software Foundation in 2012. MongoDB: It was developed by MongoDB Inc. in 2009.
2. Hive: It is open-source software. MongoDB: It is also open-source software.
3. Hive: The server operating system for Hive is any OS with a Java VM. MongoDB: The server operating systems for MongoDB are Solaris, Linux, OS X, and Windows.
4. Hive: The replication method that Hive supports is a selectable replication factor. MongoDB: The replication method that MongoDB supports is master-slave replication.
5. Hive: It supports the C++, Java, PHP, and Python programming languages. MongoDB: It supports many programming languages like C, C#, Java, JavaScript, PHP, Lua, Python, R, Ruby, etc.
6. Hive: It supports the sharding partitioning method. MongoDB: It also supports the sharding partitioning method.
7. Hive: The primary database model is relational DBMS. MongoDB: The primary database model is document store.
8. Hive: JDBC, ODBC, and Thrift are used as APIs and other access methods. MongoDB: A proprietary protocol using JSON is used as the API and other access methods.
9. Hive: It does not support in-memory capabilities. MongoDB: It supports in-memory capabilities.
10. Hive: It has no transaction concepts. MongoDB: ACID properties of transactions are used.

Difference Between RDBMS and Hadoop


RDBMS (Relational Database Management System): RDBMS is an information
management system which is based on a data model. In an RDBMS, tables are used
for information storage. Each row of a table represents a record and each column
represents an attribute of the data. The organization of data and its manipulation
processes are different in an RDBMS from other databases. An RDBMS ensures the ACID
(atomicity, consistency, isolation, durability) properties required for designing a
database. The purpose of an RDBMS is to store, manage, and retrieve data as quickly
and reliably as possible.
Hadoop: It is an open-source software framework used for storing data and
running applications on a group of commodity hardware. It has large storage
capacity and high processing power. It can manage multiple concurrent processes
at the same time. It is used in predictive analytics, data mining, and machine
learning. It can handle both structured and unstructured forms of data. It is more
flexible in storing, processing, and managing data than a traditional RDBMS. Unlike
traditional systems, Hadoop enables multiple analytical processes on the same
data at the same time. It supports scalability very flexibly.

Below is a table of differences between RDBMS and Hadoop:

1. RDBMS: Traditional row-column based databases, basically used for data storage, manipulation, and retrieval. Hadoop: An open-source software framework used for storing data and running applications or processes concurrently.
2. RDBMS: Mostly structured data is processed. Hadoop: Both structured and unstructured data are processed.
3. RDBMS: It is best suited for OLTP environments. Hadoop: It is best suited for big data.
4. RDBMS: It is less scalable than Hadoop. Hadoop: It is highly scalable.
5. RDBMS: Data normalization is required. Hadoop: Data normalization is not required.
6. RDBMS: It stores transformed and aggregated data. Hadoop: It stores huge volumes of data.
7. RDBMS: It has no latency in response. Hadoop: It has some latency in response.
8. RDBMS: The data schema is static. Hadoop: The data schema is dynamic.
9. RDBMS: High data integrity is available. Hadoop: Lower data integrity is available than in an RDBMS.
10. RDBMS: Cost applies for the licensed software. Hadoop: Free of cost, as it is open-source software.

Difference between Hadoop 1 and Hadoop 2


Hadoop is an open source software programming framework for storing a large
amount of data and performing the computation. Its framework is based on Java
programming with some native code in C and shell scripts.

Hadoop 1 vs Hadoop 2

1. Components: In Hadoop 1 we have MapReduce, but Hadoop 2 has YARN (Yet Another Resource Negotiator) and MapReduce version 2.

Hadoop 1 | Hadoop 2
HDFS | HDFS
MapReduce | YARN / MRv2
2. Daemons:

Hadoop 1 | Hadoop 2
Namenode | Namenode
Datanode | Datanode
Secondary Namenode | Secondary Namenode
Job Tracker | Resource Manager
Task Tracker | Node Manager

3. Working:
 In Hadoop 1, HDFS is used for storage and, on top of it, MapReduce works as both the resource manager and the data processing engine. This double workload on MapReduce affects performance.
 In Hadoop 2, HDFS is again used for storage, and on top of HDFS there is YARN, which works as the resource manager. It basically allocates the resources and keeps everything running.

4. Limitations: Hadoop 1 has a master-slave architecture. It consists of a single master and multiple slaves. If the master node crashes, then irrespective of your best slave nodes, your cluster is destroyed. Recreating that cluster, which means copying system files, image files, etc. onto another system, is too time-consuming, which organizations today cannot tolerate. Hadoop 2 is also a master-slave architecture, but it consists of multiple masters (i.e. active NameNodes and standby NameNodes) and multiple slaves. If the master node crashes here, the standby master node takes over. You can make multiple combinations of active-standby nodes. Thus Hadoop 2 eliminates the problem of a single point of failure.
5. Ecosystem:

 Oozie is basically a workflow scheduler. It decides the particular time for jobs to execute according to their dependencies.
 Pig, Hive and Mahout are data processing tools that work on top of Hadoop.
 Sqoop is used to import and export structured data. You can directly import and export data between an SQL database and HDFS.
 Flume is used to import and export unstructured data and streaming data.
6. Windows Support:
In Hadoop 1 there is no support for Microsoft Windows provided by Apache, whereas in Hadoop 2 there is support for Microsoft Windows.
Difference Between Hadoop and SQL

Hadoop: It is a framework that stores Big Data in distributed systems and then processes it in parallel. The four main components of Hadoop are the Hadoop Distributed File System (HDFS), YARN, MapReduce, and the Hadoop Common libraries. It involves not only large data but a mixture of structured, semi-structured, and unstructured information. Amazon, IBM, Microsoft, Cloudera, ScienceSoft, Pivotal and Hortonworks are some of the companies using Hadoop technology.
SQL: Structured Query Language is a domain-specific language used in computing to handle data management in relational database management systems; it also processes data streams in relational data stream management systems. In a nutshell, SQL is a standard database language that is used for creating, storing and extracting data from relational databases such as MySQL, Oracle, SQL Server, etc.
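As a point of reference for the comparison below, this is roughly how an application issues SQL against a relational database over JDBC; a minimal sketch, where the MySQL URL, credentials, database name (shop) and table name (orders) are illustrative assumptions and the MySQL JDBC driver is assumed to be on the classpath:

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.PreparedStatement;
import java.sql.ResultSet;

public class SqlQuickstart {
    public static void main(String[] args) throws Exception {
        // URL, credentials, database and table names are illustrative assumptions
        String url = "jdbc:mysql://localhost:3306/shop";
        try (Connection conn = DriverManager.getConnection(url, "user", "password");
             PreparedStatement ps = conn.prepareStatement(
                     "SELECT id, amount FROM orders WHERE amount > ?")) {
            ps.setDouble(1, 100.0);
            try (ResultSet rs = ps.executeQuery()) {
                while (rs.next()) {
                    System.out.println(rs.getLong("id") + " -> " + rs.getDouble("amount"));
                }
            }
        }
    }
}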
Below is a table of differences between Hadoop and SQL:

Feature | Hadoop | SQL
Technology | Modern | Traditional
Volume | Usually in petabytes | Usually in gigabytes
Operations | Storage, processing, retrieval and pattern extraction from data | Storage, processing, retrieval and pattern mining of data
Fault Tolerance | Hadoop is highly fault tolerant | SQL has good fault tolerance
Storage | Stores data in the form of key-value pairs, tables, hash maps etc. in distributed systems | Stores structured data in tabular format with a fixed schema
Scaling | Linear | Non-linear
Providers | Cloudera, Hortonworks, AWS etc. provide Hadoop systems | Well-known industry leaders in SQL systems are Microsoft, SAP, Oracle etc.
Data Access | Batch-oriented data access | Interactive and batch-oriented data access
Cost | It is open source and systems can be cost-effectively scaled | It is licensed and costs a fortune to buy a SQL server; moreover, additional charges also emerge if the system runs out of storage
Time | Statements are executed very quickly | SQL syntax is slow when executed on millions of rows
Optimization | It stores data in HDFS and processes it through MapReduce with huge optimization techniques | It does not have any advanced optimization techniques
Structure | Dynamic schema, capable of storing and processing log data, real-time data, images, videos, sensor data etc. (both structured and unstructured) | Static schema, capable of storing data (fixed schema) in tabular format only (structured)
Data Update | Write data once, read data multiple times | Read and write data multiple times
Integrity | Low | High
Interaction | Hadoop uses JDBC (Java Database Connectivity) to communicate with SQL systems to send and receive data | SQL systems can read and write data to Hadoop systems
Hardware | Uses commodity hardware | Uses proprietary hardware
Training | Learning Hadoop is moderately hard for entry-level as well as seasoned professionals | Learning SQL is easy even for entry-level professionals

Difference Between Hadoop and HBase


Hadoop: Hadoop is an open source framework from Apache that is used to store and process large datasets distributed across a cluster of servers. The four main components of Hadoop are the Hadoop Distributed File System (HDFS), YARN, MapReduce, and the Hadoop Common libraries. It involves not only large data but a mixture of structured, semi-structured, and unstructured information. Amazon, IBM, Microsoft, Cloudera, ScienceSoft, Pivotal and Hortonworks are some of the companies using Hadoop technology.
HBase: HBase is an open source database from Apache that runs on a Hadoop cluster. It falls under the non-relational (NoSQL) database management systems. Three important components of HBase are the HMaster, the Region Server and ZooKeeper. Capital One, JPMorgan Chase, Apple, MTB, AT&T and Lockheed Martin are some of the companies using HBase.
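A minimal sketch of writing and reading one of HBase's key/value cells through the HBase Java client API; the table name (users), column family (info) and row key are illustrative assumptions, and an hbase-site.xml pointing at the cluster is assumed to be on the classpath:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.Connection;
import org.apache.hadoop.hbase.client.ConnectionFactory;
import org.apache.hadoop.hbase.client.Get;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.client.Table;
import org.apache.hadoop.hbase.util.Bytes;

public class HBaseQuickstart {
    public static void main(String[] args) throws Exception {
        Configuration conf = HBaseConfiguration.create();   // reads hbase-site.xml from the classpath
        try (Connection connection = ConnectionFactory.createConnection(conf);
             Table table = connection.getTable(TableName.valueOf("users"))) {   // assumed table

            // Write one cell: row key -> column family:qualifier -> value
            Put put = new Put(Bytes.toBytes("row-1"));
            put.addColumn(Bytes.toBytes("info"), Bytes.toBytes("name"), Bytes.toBytes("alice"));
            table.put(put);

            // Read it back by row key
            Result result = table.get(new Get(Bytes.toBytes("row-1")));
            byte[] value = result.getValue(Bytes.toBytes("info"), Bytes.toBytes("name"));
            System.out.println(Bytes.toString(value));
        }
    }
}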

Below is a table of differences between Hadoop and HBase:

S.No. | Hadoop | HBase
1 | Hadoop is a collection of software tools. | HBase is a part of the Hadoop ecosystem.
2 | Stores data sets in a distributed environment. | Stores data in a column-oriented manner.
3 | Hadoop is a framework. | HBase is a NoSQL database.
4 | Data are stored in the form of chunks. | Data are stored in the form of key/value pairs.
5 | Hadoop does not allow run-time changes. | HBase allows run-time changes.
6 | Files can be written only once and read many times. | Data can be read and written multiple times.
7 | Hadoop has high-latency batch operations. | HBase has low-latency random access operations.
8 | HDFS can be accessed through MapReduce. | HBase can be accessed through shell commands, the Java API, REST.

Difference Between Hadoop and Hive


Hadoop: Hadoop is a framework or software that was invented to manage huge data, i.e. Big Data. Hadoop is used for storing and processing large data distributed across a cluster of commodity servers. Hadoop stores the data using the Hadoop Distributed File System and processes/queries it using the MapReduce programming model.

Hive: Hive is an application that runs over the Hadoop framework and provides an SQL-like interface for processing/querying the data. Hive was designed and developed by Facebook before becoming part of the Apache Hadoop project. Hive runs its queries using HQL (Hive Query Language). Hive has a structure similar to an RDBMS, and almost the same commands can be used in Hive. Hive can store the data in external tables, so using HDFS is not mandatory; it also supports file formats such as ORC, Avro, SequenceFile and text files.
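A minimal sketch of querying Hive over JDBC through HiveServer2, assuming the hive-jdbc driver is on the classpath; the connection URL, table name (page_views) and HDFS location are illustrative assumptions. The CREATE EXTERNAL TABLE statement also illustrates Hive's schema-on-read style: a schema is projected onto files that already sit in HDFS.

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;

public class HiveQuickstart {
    public static void main(String[] args) throws Exception {
        Class.forName("org.apache.hive.jdbc.HiveDriver");   // explicitly register the Hive JDBC driver
        // HiveServer2 URL, table name and HDFS path are illustrative assumptions
        String url = "jdbc:hive2://localhost:10000/default";
        try (Connection conn = DriverManager.getConnection(url, "", "");
             Statement stmt = conn.createStatement()) {

            // Schema-on-read: the table definition is laid over files already in HDFS
            stmt.execute("CREATE EXTERNAL TABLE IF NOT EXISTS page_views ("
                    + " user_id STRING, url STRING, ts BIGINT)"
                    + " ROW FORMAT DELIMITED FIELDS TERMINATED BY ','"
                    + " LOCATION '/data/page_views'");

            // A few lines of HQL replace a hand-written MapReduce job
            try (ResultSet rs = stmt.executeQuery(
                    "SELECT url, COUNT(*) AS hits FROM page_views GROUP BY url")) {
                while (rs.next()) {
                    System.out.println(rs.getString("url") + " -> " + rs.getLong("hits"));
                }
            }
        }
    }
}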

Below is a table of differences between Hadoop and Hive:

Hadoop | Hive
Hadoop is a framework to process/query the Big Data. | Hive is an SQL-based tool that builds over Hadoop to process the data.
Hadoop can understand MapReduce only. | Hive processes/queries all the data using HQL (Hive Query Language), an SQL-like language.
MapReduce is an integral part of Hadoop. | Hive's queries are first converted into MapReduce jobs and then processed by Hadoop to query the data.
Hadoop understands SQL using Java-based MapReduce only. | Hive works on SQL-like queries.
In Hadoop, one has to write complex MapReduce programs using Java, which is not similar to traditional Java. | In Hive, the traditional "relational database" commands used earlier can also be used to query the big data.
Hadoop is meant for all types of data, whether structured, unstructured or semi-structured. | Hive can only process/query structured data.
In the plain Hadoop ecosystem, one needs to write complex Java programs for the same data. | Using Hive, one can process/query the data without complex programming.
On one side, Hadoop frameworks need hundreds of lines for preparing a Java-based MR program. | Hive can query the same data using 8 to 10 lines of HQL.

Difference between Data Warehouse and Hadoop
1. Data Warehouse :
It is a technique for collecting and managing data from different sources to provide meaningful business insights. A data warehouse is commonly used to join and analyze business data from heterogeneous sources. It acts as the heart of a BI system, which is built for data analysis and reporting.
2. Hadoop :
It is an open-source software framework for storing data and running applications on clusters of commodity hardware. It offers large storage for any kind of data, enormous processing power, and the ability to handle virtually limitless concurrent tasks or jobs.

Difference between Data Warehouse and Hadoop:


S.No. | Data Warehouse | Hadoop
1. | In this, we first analyze the data and then do further processing. | It can process various types of data such as structured data, unstructured data, or raw data.
2. | It is convenient for storing a small volume of data. | It deals with a large volume of data.
3. | It uses schema-on-write logic to process the data. | It uses schema-on-read logic to process the data.
4. | It is much less agile as compared to Hadoop. | It is more agile as compared to a data warehouse.
5. | It is of fixed configuration. | It can be configured or reconfigured accordingly.
6. | It has high security for storing different data. | Security is a great concern, and it is improving and being worked on.
7. | It is mainly used by business professionals. | It mainly deals with data engineering and data science.

Hadoop Version 3.0 – What’s New?
Hadoop is a framework written in Java used to solve Big Data problems. The initial version of Hadoop was released in April 2006. The Apache community has made many changes since the day of the first release of Hadoop in the market. The journey of Hadoop was started in 2005 by Doug Cutting and Mike Cafarella, and the reason behind developing Hadoop was to support distribution for the Nutch search engine project.
In 2008 Hadoop beat the supercomputers and became the fastest system to sort terabytes of stored data. Hadoop has come a long way and has accommodated many changes from its previous version, i.e. Hadoop 2.x. In this article, we are going to discuss the changes made by Apache in Hadoop version 3.x to make it more efficient and faster.

What’s New in Hadoop 3.0?

1. JDK 8.0 is the Minimum JAVA Version Supported by Hadoop 3.x

Since Oracle ended public updates for JDK 7 in 2015, users have to upgrade their Java version to JDK 8 or above to compile and run all the Hadoop 3 files. JDK versions below 8 are no longer supported for Hadoop 3.

2. Erasure Coding is Supported

Erasure coding is used to recover the data when a computer hard disk fails. It is a high-level RAID (Redundant Array of Independent Disks) technique used by many IT companies to recover their data. The Hadoop file system, HDFS (Hadoop Distributed File System), uses erasure coding to provide fault tolerance in the Hadoop cluster. Since we are using commodity hardware to build our Hadoop cluster, node failure is normal. Hadoop 2 uses a replication mechanism to provide a similar kind of fault tolerance to that of erasure coding in Hadoop 3.
In Hadoop 2, replicas of the data blocks are made and then stored on different nodes in the Hadoop cluster. Erasure coding consumes roughly half the storage of replication in Hadoop 2 while providing the same level of fault tolerance, as shown in the worked comparison below. With the increasing amount of data in the industry, developers can save a large amount of storage with erasure coding. Erasure coding minimizes the hard-disk requirement and improves fault tolerance with about 50% of the storage overhead, given similar resources.
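A small worked comparison of the storage cost, assuming the Reed-Solomon RS(6,3) policy (6 data blocks protected by 3 parity blocks) that the Hadoop 3.x documentation describes as its default erasure-coding scheme:

3-way replication      : 6 data blocks x 3 copies        = 18 blocks  ->  (18 - 6) / 6 = 200% storage overhead
RS(6,3) erasure coding : 6 data blocks + 3 parity blocks =  9 blocks  ->  (9 - 6) / 6  =  50% storage overhead

Replication survives the loss of 2 copies of a block, while RS(6,3) survives the loss of any 3 blocks in a stripe, so erasure coding gives comparable or better fault tolerance at half the raw storage.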

3. More Than Two NameNodes Supported

The previous version of Hadoop supports a single active NameNode and a single standby NameNode. In the latest version of Hadoop, i.e. Hadoop 3.x, edit-log replication is done among three JournalNodes (JNs). With the help of that, the Hadoop 3.x architecture is better able to handle fault tolerance than its previous version. For big data problems where high fault tolerance is needed, Hadoop 3.x is very useful. In Hadoop 3.x, users can manage the number of standby NameNodes according to the requirement, since the facility of multiple standby nodes is provided.
For example, developers can now easily configure three NameNodes and five JournalNodes; with that, our Hadoop cluster is able to tolerate the failure of two NameNodes rather than a single one.

4. Shell Script Rewriting

The Hadoop file system utilizes various shell-type commands that directly interact with HDFS and the other file systems that Hadoop supports, such as WebHDFS, Local FS, S3 FS, etc. Multiple functionalities of Hadoop are controlled by the shell. The shell scripts used in the latest version of Hadoop, i.e. Hadoop 3.x, have fixed lots of bugs. Hadoop 3.x shell scripts also provide the functionality of rewriting the shell scripts.
5. Timeline Service v.2 for YARN
The YARN Timeline Service stores and retrieves application information (the information can be ongoing or historical). Timeline Service v.2 was much needed to improve the reliability and scalability of Hadoop. System usability is enhanced with the help of flows and aggregation. With Timeline Service v.1, users can only make a single instance of a reader/writer and storage architecture that cannot be scaled further.
Timeline Service v.2 uses a distributed writer architecture where data read and write operations are separable. Here, distributed collectors are provided for every YARN (Yet Another Resource Negotiator) application. Timeline Service v.2 uses HBase for storage, which can be scaled to massive size while providing good response times for read and write operations.
The information that Timeline Service v.2 stores is of 2 major types:
A. Generic information of the completed application
 user information
 queue name
 count of attempts made per application
 container information which runs for each attempt on application
B. Per framework information about running and completed application
 count of Map and Reduce Task
 counters
 information broadcast by the developer for TimeLine Server with the help of
Timeline client.

6. Filesystem Connector Support

The new Hadoop version 3.x now supports Azure Data Lake and Aliyun Object Storage System, which are additional options for Hadoop-compatible filesystems.

7. Default Multiple Service Ports Have Been Changed

In the previous version of Hadoop, multiple service ports for Hadoop were in the Linux ephemeral port range (32768-61000). In this kind of configuration, due to conflicts with other applications, the services sometimes failed to bind to their ports. To overcome this problem, Hadoop 3.x has moved the conflicting ports out of the Linux ephemeral port range, and new ports have been assigned as shown below.
// The newly assigned ports
Namenode ports:     50470 -> 9871, 50070 -> 9870, 8020 -> 9820
Datanode ports:     50020 -> 9867, 50010 -> 9866, 50475 -> 9865, 50075 -> 9864
Secondary NN ports: 50091 -> 9869, 50090 -> 9868

8. Intra-Datanode Balancer

DataNodes are used in the Hadoop cluster for storage purposes. A DataNode handles multiple disks at a time, and these disks get filled evenly during write operations. Adding or removing a disk, however, can cause significant skew within a DataNode. The existing HDFS balancer cannot handle this skew, because it concerns itself with inter-, not intra-, DataNode skew. The new intra-DataNode balancing feature can manage this situation and is invoked through the HDFS disk balancer CLI.

9. Shaded Client Jars

The new hadoop-client-api and hadoop-client-runtime artifacts are made available in Hadoop 3.x, which provide the Hadoop dependencies in a single package or single jar file. In Hadoop 3.x, hadoop-client-api has compile-time scope while hadoop-client-runtime has runtime scope. Both of these contain the third-party dependencies provided by hadoop-client. Now developers can easily bundle all the dependencies in a single jar file and can easily test the jars for any version conflicts. In this way, the Hadoop dependencies can be easily withdrawn from the application classpath.

10. Task Heap and Daemon Management

In Hadoop version 3.x we can easily configure the Hadoop daemon heap size in some newly added ways. Auto-tuning based on the memory size of the host is made available. Instead of HADOOP_HEAPSIZE, developers can use the HADOOP_HEAPSIZE_MAX and HADOOP_HEAPSIZE_MIN variables. The internal JAVA_HEAP_SIZE variable is also removed in the latest Hadoop version 3.x. The default heap sizes are also removed, which allows auto-tuning by the JVM (Java Virtual Machine). If you want to use the older default, then enable it by configuring HADOOP_HEAPSIZE_MAX in the hadoop-env.sh file.

Hadoop – Introduction
The definition of a powerful person has changed in this world. A powerful person is one who has access to data. This is because data is increasing at a tremendous rate. Suppose we are living in a world of 100% data; then about 90% of that data has been produced in the last 2 to 4 years. This is because now, when a child is born, she faces the flash of a camera even before her mother. All these pictures and videos are nothing but data. Similarly, there is data from emails, various smartphone applications, statistical data, etc. All this data has enormous power to affect various incidents and trends. This data is not only used by companies to influence their consumers but also by politicians to influence elections. This huge data is referred to as Big Data. In such a world, where data is being produced at such an exponential rate, it needs to be maintained, analyzed, and handled. This is where Hadoop comes in.
Hadoop is a framework of open-source tools distributed under the Apache License. It is used to manage, store, and process data for various big data applications running on clustered systems. In previous years, Big Data was defined by the "3 Vs", but now there are "5 Vs" of Big Data, which are also termed the characteristics of Big Data.

1. Volume: With increasing dependence on technology, data is being produced in large volumes. Common examples are the data produced by various social networking sites, sensors, scanners, airlines and other organizations.
2. Velocity: A huge amount of data is generated every second. It was estimated that by the end of 2020, every individual would produce about 3 MB of data per second. This large volume of data is being generated with great velocity.
3. Variety: The data being produced by different means is of three types:
 Structured Data: Relational data that is stored in the form of rows and columns.
 Unstructured Data: Texts, pictures, videos etc. are examples of unstructured data, which can't be stored in the form of rows and columns.
 Semi-Structured Data: Log files are an example of this type of data.
4. Veracity: The term veracity is coined for inconsistent or incomplete data, which results in the generation of doubtful or uncertain information. Often data inconsistency arises because of the volume or amount of data; e.g. data in bulk could create confusion, whereas a smaller amount of data could convey half or incomplete information.
5. Value: After taking the 4 Vs into account, there comes one more V, which stands for Value. Bulk data that has no value is of no good to a company unless it is turned into something useful. Data in itself is of no use or importance; it needs to be converted into something valuable in order to extract information. Hence, you can state that Value is the most important of the 5 Vs.

Evolution of Hadoop: Hadoop was designed by Doug Cutting and Michael Cafarella in 2005. The design of Hadoop was inspired by Google. Hadoop stores huge amounts of data through a system called the Hadoop Distributed File System (HDFS) and processes this data with the MapReduce technology. The designs of HDFS and MapReduce were inspired by the Google File System (GFS) and MapReduce. In the year 2000, Google suddenly overtook all existing search engines and became the most popular and profitable search engine. The success of Google was attributed to its unique Google File System and MapReduce. No one except Google knew about this until that time. So, in the year 2003, Google released some papers on GFS. But it was not enough to understand the overall working of Google, so in 2004 Google released the remaining papers. The two enthusiasts Doug Cutting and Michael Cafarella studied those papers and designed what is called Hadoop in the year 2005. Doug's son had a toy elephant whose name was Hadoop, and thus Doug and Michael gave their new creation the name "Hadoop", and hence the toy-elephant symbol. This is how Hadoop evolved. Thus the designs of HDFS and MapReduce, though created by Doug Cutting and Michael Cafarella, were originally inspired by Google. For more details about the evolution of Hadoop, you can refer to Hadoop | History or Evolution.
Traditional Approach: Suppose we want to process some data. In the traditional approach, we used to store data on local machines, and this data was then processed. As data started increasing, the local machines or computers were no longer capable of storing this huge data set, so data then started to be stored on remote servers. Now suppose we need to process that data. In the traditional approach, this data has to be fetched from the servers and then processed. Suppose this data is 500 GB in size; practically, it is very complex and expensive to fetch it. This approach is also called the Enterprise Approach.
In the new Hadoop approach, instead of fetching the data onto local machines, we send the query to the data. Obviously, the query to process the data will not be as huge as the data itself. Moreover, at the server, the query is divided into several parts, and all these parts process the data simultaneously. This is called parallel execution and is possible because of MapReduce. So now not only is there no need to fetch the data, but the processing also takes less time. The result of the query is then sent back to the user. Thus Hadoop makes data storage, processing and analysis much easier than the traditional approach.
Components of Hadoop: Hadoop has three components:
1. HDFS: Hadoop Distributed File System is a dedicated file system to store big data on a cluster of commodity (cheaper) hardware with a streaming access pattern. It enables data to be stored at multiple nodes in the cluster, which ensures data security and fault tolerance.
2. Map Reduce: Data stored in HDFS also needs to be processed. Suppose a query is sent to process a data set in HDFS. Hadoop first identifies where this data is stored; this is called Mapping. Then the query is broken into multiple parts, the results of all these parts are combined, and the overall result is sent back to the user; this is called the Reduce process. Thus, while HDFS is used to store the data, Map Reduce is used to process the data (a minimal word-count sketch is given after this list).
3. YARN: YARN stands for Yet Another Resource Negotiator. It is a dedicated operating system for Hadoop that manages the resources of the cluster and also functions as a framework for job scheduling in Hadoop. The various types of scheduling are First Come First Serve, Fair Share Scheduler and Capacity Scheduler, etc. First Come First Serve scheduling is set by default in YARN.
How do the components of Hadoop make it a solution for Big Data?

1. Hadoop Distributed File System: On a local PC, the default file system block size is about 4 KB. When we install Hadoop, HDFS by default uses a much larger block size of 64 MB (128 MB by default in Hadoop 2.x and later), since it is used to store huge data; we can also change the block size, for example to 128 MB. HDFS works with a Data Node and a Name Node: while the Name Node is a master service and keeps the metadata about which commodity hardware the data resides on, the Data Node stores the actual data. Since the block size is large, the storage required to store the metadata is reduced, making HDFS more efficient. Also, Hadoop stores three copies of every data block at three different locations; this ensures that Hadoop is not prone to a single point of failure (a small sketch of writing and reading a file through the HDFS Java API is given after this list).
2. Map Reduce: In the simplest terms, MapReduce breaks a query into multiple parts, and each part processes the data in parallel. This parallel execution helps to execute a query faster and makes Hadoop a suitable and optimal choice to deal with Big Data.
3. YARN: As we know, Yet Another Resource Negotiator works like an operating system for Hadoop, and as operating systems are resource managers, YARN manages the resources of Hadoop so that Hadoop serves big data in a better way.
Hadoop Versions: Till now there are three versions of Hadoop, as follows.
 Hadoop 1: This is the first and most basic version of Hadoop. It includes Hadoop Common, the Hadoop Distributed File System (HDFS), and Map Reduce.
 Hadoop 2: The only difference between Hadoop 1 and Hadoop 2 is that Hadoop 2 additionally contains YARN (Yet Another Resource Negotiator). YARN helps in resource management and task scheduling through its daemons: the Resource Manager, which allocates cluster resources, and the per-node Node Manager, with a per-application Application Master handling job tracking and progress monitoring.

 Hadoop 3: This is the most recent version of Hadoop. Along with the merits of the first two versions, Hadoop 3 has one more important merit: it has resolved the issue of a single point of failure by supporting multiple NameNodes. Various other advantages, like erasure coding, the use of GPU hardware and Docker containers, make it superior to the earlier versions of Hadoop.
Advantages of Hadoop:
 Economically Feasible: It is cheaper to store and process data than it was in the traditional approach, since the machines used to store data are only commodity hardware.
 Easy to Use: The projects or set of tools provided by Apache Hadoop are easy to work with in order to analyze complex data sets.
 Open Source: Since Hadoop is distributed as open-source software under the Apache License, one does not need to pay for it; just download it and use it.
 Fault Tolerance: Since Hadoop stores three copies of data, the data is safe even if one copy is lost because of a commodity hardware failure. Moreover, as Hadoop version 3 has multiple NameNodes, even the single point of failure of Hadoop has been removed.
 Scalability: Hadoop is highly scalable in nature. If one needs to scale the cluster up or down, one only needs to change the number of commodity hardware nodes in the cluster.
 Distributed Processing: HDFS and Map Reduce ensure distributed storage and processing of the data.
 Locality of Data: This is one of the most alluring and promising features of Hadoop. In Hadoop, to process a query over a data set, instead of bringing the data to the local computer we send the query to the server and fetch the final result from there. This is called data locality.

