Big Data Analytics
Big data refers to extremely large and diverse collections of structured, unstructured,
and semi-structured data that continues to grow exponentially over time. These
datasets are so huge and complex in volume, velocity, and variety that traditional data
management systems cannot store, process, or analyze them.
Big data describes large and diverse datasets that are huge in volume and also rapidly
grow in size over time. Big data is used in machine learning, predictive modeling, and
other advanced analytics to solve business problems and make informed decisions.
Companies use big data in their systems to improve operations, provide better customer service,
create personalized marketing campaigns and take other actions that, ultimately, can increase
revenue and profits. Businesses that use it effectively hold a potential competitive advantage
over those that don't because they're able to make faster and more informed business decisions.
For example, big data provides valuable insights into customers that companies can use to refine
their marketing, advertising and promotions in order to increase customer engagement and
conversion rates. Both historical and real-time data can be analyzed to assess the evolving
preferences of consumers or corporate buyers, enabling businesses to become more responsive to
customer wants and needs.
Big data is also used by medical researchers to identify disease signs and risk factors and by
doctors to help diagnose illnesses and medical conditions in patients. In addition, a combination
of data from electronic health records, social media sites, the web and other sources gives
healthcare organizations and government agencies up-to-date information on infectious disease
threats or outbreaks.
Here are some more examples of how big data is used by organizations:
● In the energy industry, big data helps oil and gas companies identify potential drilling
locations and monitor pipeline operations; likewise, utilities use it to track electrical
grids.
● Financial services firms use big data systems for risk management and real-time
analysis of market data.
● Manufacturers and transportation companies rely on big data to manage their supply
chains and optimize delivery routes.
● Other government uses include emergency response, crime prevention and smart city
initiatives.
These examples illustrate some of the business benefits organizations can gain by using big data.
In addition to data from internal systems, big data environments often incorporate external data
on consumers, financial markets, weather and traffic conditions, geographic information,
scientific research and more. Images, videos and audio files are forms of big data, too, and many
big data applications involve streaming data that is processed and collected on a continual basis.
Data preparation is typically carried out once, early in the project, before any iterative modeling
begins. Data wrangling, by contrast, is done during iterative analysis and model construction,
particularly at the feature engineering stage.
2. Data exploration
The initial phase in data analysis is called data exploration, and it involves looking at and
visualizing data to find insights right away or point out regions or patterns that need further
investigation. Users may more quickly gain insights by using interactive dashboards and
point-and-click data exploration to better understand the broader picture.
3. Scalability
To scale up, or vertically scale, a system, you add a faster server with more powerful processors
and memory. This approach uses less network gear and less energy, but it may be only a
temporary fix for a big data analytics platform, especially if further growth is anticipated.
Due to the big data revolution, new forms, stages, and types of data analysis have evolved. Data
analytics is a growing topic in boardrooms all over the world, promising enterprise-wide strategies
for commercial success. What does this mean for businesses? Gaining the appropriate expertise,
which yields usable information, enables organizations to develop a competitive edge, which is
crucial if enterprises are to successfully leverage Big Data. The main goal of big data analytics is
to help firms make better business decisions.
Big data analytics shouldn't be thought of as a universal fix. The best data scientists and analysts
are also distinguished from the competition by their aptitude for identifying the many forms of
analytics that can be applied to benefit the business the most. The three most typical categories
are descriptive, predictive, and prescriptive analytics.
5. Version control
Version control, often known as source control, is the process of keeping track of and controlling
changes to software code. Version control systems are computerized tools that help software
development teams keep track of changes to source code over time.
6. Data management
The process of obtaining, storing, and using data in a cost-effective, effective, and secure way is
known as data management. Data management assists people, organizations, and connected
things in optimizing the use of data within the bounds of policy and regulation, enabling
decision-making and actions that will benefit the business as much as feasible. As businesses
increasingly rely on intangible assets to create value, an efficient data management strategy is
more important than ever.
7. Data Integration
Data integration is the process of combining information from several sources to give people a
cohesive perspective. The fundamental idea behind data integration is to open up data and make
it simpler for individuals and systems to access, use, and process. When done correctly, data
integration can enhance data quality, free up resources, lower IT costs, and stimulate creativity
without significantly modifying current applications or data structures. Aside from the fact that
IT firms have always needed to integrate, the benefits of doing so may have never been as large
as they are now.
8. Data Governance
Data governance is the process of ensuring that data is trustworthy, accurate, available, and
usable. It describes the actions people must take, the rules they must follow, and the technology
that will support them throughout the data life cycle.
9. Data security
Data security is the practice of protecting digital data from unauthorized access, corruption, or
theft at any point in its lifecycle. It is a concept that encompasses every aspect of information
security, including administrative and access controls, logical security of applications, and
physical security of hardware and storage devices. It also covers the policies and practices of the
organization. Data security is one of the key features of data analytics.
It's more crucial than ever to have easy ways to see and comprehend data in our increasingly
data-driven environment. Employers are, after all, increasingly seeking employees with data
skills. Data and its ramifications must be understood by all employees and business owners.
Big data has many qualities—it’s unstructured, dynamic, and complex. But, perhaps most
importantly: Big data is big. Humans and IoT sensors are producing trillions of gigabytes of data
each year. But this isn’t yesterday’s data—it’s modern data, in an increasingly diverse range of
formats and from an ever-broader variety of sources.
This is leading to a chasm between today's data and yesterday's systems. The sheer size and
scale, along with its speed and complexity, is putting a new kind of stress on traditional data
storage systems. Many are just plain ill-equipped, and organizations that want to make use of this
goldmine of data are running into roadblocks.
Why is this happening? What are the key big data challenges to know? If you’re looking to
harness the power of big data, will your storage solutions be enough to overcome them?
Perhaps the most obvious of the big data challenges is its enormous scale. We typically measure
it in petabytes (so that’s 1,024 terabytes or 1,048,576 gigabytes).
To give you an idea of how big big data can get, here’s an example: Facebook users upload at
least 14.58 million photos per hour. Each photo garners interactions stored along with it, such as
likes and comments. Users have “liked” at least a trillion posts, comments, and other data points.
But it’s not just tech giants like Facebook that are storing and analyzing huge quantities of data.
Even a small business taking a slice of social media information—for example, to see what
people are saying about its brand—requires high-capacity data storage architecture.
Traditional data storage systems can, in theory, handle large amounts of data. But when tasked to
deliver the efficiency and insights we need, many simply can’t keep up with the demands of
modern data.
Relational SQL databases are trusty, timeworn methods to house, read, and write data. But these
databases can struggle to operate efficiently, even before they’ve met maximum capacity. A
relational database containing large quantities of data can become slow for many reasons. For
example, each time you insert a record into a relational database, the index must update itself.
This operation takes longer as the number of records increases. Inserting, updating, deleting, and
other operations can also take longer depending on the number of relationships a table has to
other tables.
Simply put: The more data that is in a relational database, the longer each operation takes.
It’s also possible to scale traditional data storage systems to improve performance. But because
traditional data storage systems are centralized, you’re forced to scale “up” rather than “out.”
Scaling up is less resource-efficient than scaling out, as it requires you to add new systems,
migrate data, and then manage the load across multiple systems. Traditional data storage
architecture soon becomes too sprawling and unwieldy to manage properly.
Attempting to use traditional storage architecture for big data is doomed to fail in part because
the quantity of data makes it unrealistic to scale up sufficiently. This makes scaling out the only
realistic option. Using a distributed storage architecture, you can add new nodes to a cluster once
you reach a given capacity—and you can do so pretty much indefinitely.
Another major challenge for traditional storage when it comes to big data? The complexity of
data types. Traditional data is “structured.” You can organize it in tables with rows and columns
that bear a straightforward relation to one another.
A relational database can be relatively large and complex: It may consist of thousands of rows
and columns. But crucially, with a relational database, you can access a piece of data by
reference to its relation to another piece of data.
Big data doesn’t always fit neatly into the relational rows and columns of a traditional data
storage system. It’s largely unstructured, consisting of myriad file types and often including
images, videos, audio, and social media content. That’s why traditional storage solutions are
unsuitable for working with big data: They can’t properly categorize it.
Modern containerized applications also create new storage challenges. For example, Kubernetes
applications are more complex than traditional applications. These applications contain many
parts—such as pods, volumes, and configmaps—and they require frequent updates. Traditional
storage can’t offer the necessary functionality to run Kubernetes effectively.
Using a non-relational (NoSQL) database such as MongoDB, Cassandra, or Redis can allow you
to gain valuable insights into complex and varied sets of unstructured data.
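As a minimal sketch of this idea (assuming a locally running MongoDB instance and the official MongoDB Java sync driver; the database, collection, and field names below are purely illustrative), unstructured social-media posts can be stored and queried without a predefined schema:

import com.mongodb.client.MongoClient;
import com.mongodb.client.MongoClients;
import com.mongodb.client.MongoCollection;
import com.mongodb.client.model.Filters;
import org.bson.Document;

public class BrandMentions {
    public static void main(String[] args) {
        // Connect to a local MongoDB instance (connection string is an assumption).
        try (MongoClient client = MongoClients.create("mongodb://localhost:27017")) {
            MongoCollection<Document> posts =
                    client.getDatabase("social").getCollection("posts");

            // No fixed schema: each document can carry different fields.
            posts.insertOne(new Document("user", "alice")
                    .append("text", "Loving the new release!")
                    .append("likes", 42)
                    .append("hashtags", java.util.Arrays.asList("brand", "launch")));

            // Query by any field without a predefined table structure.
            for (Document d : posts.find(Filters.eq("user", "alice"))) {
                System.out.println(d.toJson());
            }
        }
    }
}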
Traditional data storage systems are designed for steady data retention. You can add more data regularly
and then perform analysis on the new data set. But big data grows almost instantaneously, and
analysis often needs to occur in real time. An RDBMS isn’t designed for rapid fluctuations.
Take sensor data, for example. Internet of things (IoT) devices need to process large amounts of
sensor data with minimal latency. Sensors transmit data from the “real world” at a near-constant
rate. Traditional storage systems struggle to store and analyze data arriving at such a velocity.
Or, another example: cybersecurity. IT departments must inspect each packet of data arriving
through a company’s firewall to check whether it contains suspicious code. Many gigabytes
might be passing through the network each day. To avoid falling victim to cybercrime, analysis
must occur instantaneously—storing all the data in a table until the end of the day is not an
option.
The high-velocity nature of big data is not kind to traditional storage systems, which can be a
root cause of project failure or unrealized ROI.
Traditional storage architectures are suitable for working with structured data. But when it comes
to the vast, complex, and high-velocity nature of unstructured big data, businesses must find
alternative solutions to start getting the outcomes they’re looking for.
Distributed, scalable, non-relational storage systems can process large quantities of complex data
in real time. This approach can help organizations overcome big data challenges with ease—and
start gleaning breakthrough-driving insights.
If your storage architecture is struggling to keep up with your business needs—or if you want to
gain the competitive edge of a data-mature company—upgrading to a modern storage solution
capable of harnessing the power of big data may make sense.
Broadly, big data comes from three primary sources:
● Machine data
● Social data, and
● Transactional data.
In addition to this, companies also generate data internally through direct customer engagement.
This data is usually stored in the company’s firewall. It is then imported externally into the
management and analytics system.
Another critical factor to consider about Big data sources is whether it is structured or
unstructured. Unstructured data doesn’t have any predefined model of storage and management.
Therefore, it requires far more resources to extract meaning out of unstructured data and make it
business-ready.
Now, we’ll take a look at the three primary sources of big data:
1. Machine Data
Machine data is generated automatically by hardware such as sensors and other connected
devices. In a broader context, it also encompasses information churned out by servers, user
applications, websites, cloud programs, and so on.
2. Social Data
Social data is derived from social media platforms through tweets, retweets, likes, video uploads,
and comments shared on Facebook, Instagram, Twitter, YouTube, LinkedIn, etc. The extensive
data generated through social media platforms and online channels offers qualitative and
quantitative insights into each crucial facet of brand-customer interaction.
Social media data spreads like wildfire and reaches an extensive audience base. It yields
important insights into customer behavior and customer sentiment regarding products and services.
This is why brands capitalizing on social media channels can build a strong connection with their
online demographic. Businesses can harness this data to understand their target market and
customer base. This inevitably enhances their decision-making process.
3. Transactional Data
As the name suggests, transactional data is information gathered via online and offline
transactions during different points of sale. The data includes vital details like transaction time,
location, products purchased, product prices, payment methods, discounts/coupons used, and
other relevant quantifiable information related to transactions. Typical forms include:
● Payment orders
● Invoices
● Storage records and
● E-receipts
However, transactional data demands a separate set of experts to process, analyze, interpret, and
manage it. Moreover, this type of data is among the most challenging for most businesses to
interpret.
Structured Data
Structured data can be crudely defined as the data that resides in a fixed field within a
record.
It is the type of data most familiar from our everyday lives, for example a birthday or an address.
A certain schema binds it, so all the data has the same set of properties. Structured data is
also called relational data. It is split into multiple tables to enhance the integrity of the
data by creating a single record to depict an entity. Relationships are enforced by the
application of table constraints.
The business value of structured data lies within how well an organization can utilize its
existing systems and processes for analysis purposes.
Structured data is stored in a data warehouse with rigid constraints and a definite schema.
Any change in requirements would mean updating all of that structured data to meet the
new needs. This is a massive drawback in terms of resource and time management.
Semi-Structured Data
Semi-structured data is not bound by any rigid schema for data storage and handling. The
data is not in the relational format and is not neatly organized into rows and columns like
that in a spreadsheet. However, there are some features like key-value pairs that help in
discerning the different entities from each other.
A data serialization language is used to exchange semi-structured data across systems that
may even have varied underlying infrastructure.
Semi-structured content is often used to store metadata about a business process but it can
also include files containing machine instructions for computer programs.
This type of information typically comes from external sources such as social media
platforms or other web-based data feeds.
Data is created in plain text so that different text-editing tools can be used to draw valuable
insights. Due to a simple format, data serialization readers can be implemented on hardware with
limited processing resources and bandwidth.
Data Serialization Languages
Software developers use serialization languages to write in-memory data to files so that it can be
transmitted, stored, and parsed. The sender and the receiver don't need to know anything about the
other system: as long as the same serialization language is used, both systems can understand the
data comfortably. There are three predominantly used serialization languages.
1. XML– XML stands for eXtensible Markup Language. It is a text-based markup language
designed to store and transport data. XML parsers can be found in almost all popular
development platforms. It is human and machine-readable. XML has definite standards for
schema, transformation, and display. It is self-descriptive. Below is an example of a
programmer’s details in XML.
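For illustration, a programmer's details might be represented in XML as follows (the tag and attribute names FirstName and Type come from the text; the values are placeholders):

<programmer Type="permanent">
    <FirstName>Jane</FirstName>
    <LastName>Doe</LastName>
    <Skills>Java, Hadoop, SQL</Skills>
</programmer>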
XML expresses the data using tags (text within angle brackets) to shape the data (for example,
FirstName) and attributes (for example, Type) to further describe it. However, because XML is a
verbose and voluminous language, other formats have gained more popularity.
2. JSON– JSON (JavaScript Object Notation) is a lightweight open-standard file format for data
interchange. JSON is easy to use and uses human/machine-readable text to store and transmit
data objects.
This format isn't as formal as XML. It's more like a key/value pair model than a formal data
depiction. JavaScript has built-in support for JSON. Although JSON is very popular amongst web
developers, non-technical personnel find it tedious to work with JSON due to its heavy
dependence on JavaScript and structural characters (braces, commas, etc.).
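For illustration, the same programmer record might look like this in JSON (names and values are placeholders):

{
  "programmer": {
    "type": "permanent",
    "firstName": "Jane",
    "lastName": "Doe",
    "skills": ["Java", "Hadoop", "SQL"]
  }
}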
3. YAML– YAML is a user-friendly data serialization language. It is a recursive acronym for YAML
Ain't Markup Language. It is adopted by technical and non-technical users all across the globe
owing to its simplicity. The data structure is defined by line separation and indentation, which
reduces the dependency on structural characters. YAML is extremely comprehensible, and its
popularity is a result of its human-machine readability.
YAML example
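For illustration, the same programmer record might look like this in YAML, with indentation replacing braces and brackets (names and values are placeholders):

programmer:
  type: permanent
  firstName: Jane
  lastName: Doe
  skills:
    - Java
    - Hadoop
    - SQL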
A product catalog organized by tags is an example of semi-structured data.
Unstructured Data
Unstructured data is the kind of data that doesn’t adhere to any definite schema or set of
rules. Its arrangement is unplanned and haphazard.
Photos, videos, text documents, and log files can be generally considered unstructured
data. Even though the metadata accompanying an image or a video may be
semi-structured, the actual data being dealt with is unstructured.
Features of GFS
Advantages of GFS
1. High availability: data is still accessible even if a few nodes fail, thanks to replication.
Component failures are treated as the norm rather than the exception.
2. High throughput: many nodes operate concurrently.
3. Dependable storage: data that has been corrupted can be detected and re-replicated.
Disadvantages of GFS
Modules of Hadoop
1. HDFS: Hadoop Distributed File System. Google published its paper GFS and on the
basis of that HDFS was developed. It states that the files will be broken into blocks and
stored in nodes over the distributed architecture.
2. YARN: Yet Another Resource Negotiator is used for job scheduling and managing the
cluster.
3. Map Reduce: This is a framework which helps Java programs to do the parallel
computation on data using key value pairs. The Map task takes input data and converts it
into a data set which can be computed in Key value pairs. The output of Map task is
consumed by reduce task and then the output of reducer gives the desired result.
4. Hadoop Common: These Java libraries are used to start Hadoop and are used by other
Hadoop modules.
Hadoop Architecture
The Hadoop architecture is a package of the file system, MapReduce engine and the HDFS
(Hadoop Distributed File System). The MapReduce engine can be MapReduce/MR1 or
YARN/MR2.
A Hadoop cluster consists of a single master and multiple slave nodes. The master node runs the
Job Tracker and NameNode (and, in small setups, may also host a Task Tracker and DataNode),
whereas each slave node runs a DataNode and Task Tracker.
Hadoop Distributed File System
The Hadoop Distributed File System (HDFS) is a distributed file system for Hadoop. It contains
a master/slave architecture. This architecture consists of a single NameNode performing the role
of master, and multiple DataNodes performing the role of a slave.
Both NameNode and DataNode are capable enough to run on commodity machines. The Java
language is used to develop HDFS. So any machine that supports Java language can easily run
the NameNode and DataNode software.
NameNode
DataNode
Job Tracker
○ The role of Job Tracker is to accept the MapReduce jobs from clients and process the
data by using NameNode.
○ In response, NameNode provides metadata to Job Tracker.
Task Tracker
MapReduce Layer
The MapReduce comes into existence when the client application submits the MapReduce job to
Job Tracker. In response, the Job Tracker sends the request to the appropriate Task Trackers.
Sometimes, the TaskTracker fails or time out. In such a case, that part of the job is rescheduled.
Advantages of Hadoop
○ Fast: In HDFS the data is distributed over the cluster and is mapped which helps in faster
retrieval. Even the tools to process the data are often on the same servers, thus reducing
the processing time. It is able to process terabytes of data in minutes and petabytes in
hours.
○ Scalable: Hadoop cluster can be extended by just adding nodes in the cluster.
○ Cost Effective: Hadoop is open source and uses commodity hardware to store data so it
is really cost effective as compared to traditional relational database management
systems.
○ Resilient to failure: HDFS has the property with which it can replicate data over the
network, so if one node is down or some other network failure happens, then Hadoop
takes the other copy of data and uses it. Normally, data are replicated thrice but the
replication factor is configurable.
History of Hadoop
Hadoop was started by Doug Cutting and Mike Cafarella in 2002. Its origin was the Google File
System paper, published by Google.
○ In 2002, Doug Cutting and Mike Cafarella started to work on a project, Apache Nutch. It
is an open source web crawler software project.
○ While working on Apache Nutch, they were dealing with big data. Storing that data
required a lot of money, which threatened the viability of the project. This problem
became one of the important reasons for the emergence of Hadoop.
○ In 2003, Google introduced a file system known as GFS (Google file system). It is a
proprietary distributed file system developed to provide efficient access to data.
○ In 2004, Google released a white paper on Map Reduce. This technique simplifies the
data processing on large clusters.
○ In 2005, Doug Cutting and Mike Cafarella introduced a new file system known as NDFS
(Nutch Distributed File System). This file system also includes Map reduce.
○ In 2006, Doug Cutting joined Yahoo. On the basis of the Nutch project, Doug Cutting
introduced a new project, Hadoop, with a file system known as HDFS (Hadoop
Distributed File System). Hadoop's first version, 0.1.0, was released that year.
○ Doug Cutting named his project Hadoop after his son's toy elephant.
○ In 2007, Yahoo ran two clusters of 1000 machines.
○ In 2008, Hadoop became the fastest system to sort 1 terabyte of data on a 900 node
cluster within 209 seconds.
○ In 2013, Hadoop 2.2 was released.
○ In 2017, Hadoop 3.0 was released.
Hadoop – Architecture
As we all know Hadoop is a framework written in Java that utilizes a large cluster of commodity
hardware to maintain and store big size data. Hadoop works on the MapReduce Programming
Algorithm that was introduced by Google. Today lots of Big Brand Companies are using Hadoop
in their Organization to deal with big data, eg. Facebook, Yahoo, Netflix, eBay, etc. The Hadoop
Architecture Mainly consists of 4 components.
● MapReduce
● HDFS(Hadoop Distributed File System)
● YARN(Yet Another Resource Negotiator)
● Common Utilities or Hadoop Common
Let’s understand the role of each one of these components in detail.
1. MapReduce
MapReduce is an algorithmic framework that, in Hadoop 2, runs on top of YARN. Its major
feature is performing distributed processing in parallel across a Hadoop cluster, which is what
makes Hadoop so fast. When you are dealing with big data, serial processing is no longer
practical. MapReduce has two main tasks, divided phase-wise: in the first phase the Map task is
utilized, and in the next phase the Reduce task is utilized.
Here, the input is provided to the Map() function, its output is used as the input to the Reduce()
function, and after that we receive our final output. Let's understand what Map() and Reduce() do.
An input, which for big data is a set of data blocks, is provided to Map(). The Map() function
breaks these data blocks into tuples, which are simply key-value pairs. These key-value pairs are
then sent as input to Reduce(). The Reduce() function combines the tuples based on their key,
performs operations such as sorting or summation on them, and sends the resulting set of tuples
to the final output node, where the output is obtained.
The processing done in the Reducer always depends on the business requirement of the industry.
This is how first Map() and then Reduce() are utilized, one after the other. A minimal word-count
sketch in Java follows.
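As a concrete illustration of this Map/Reduce flow, here is the classic word-count program sketched against Hadoop's new Java API (org.apache.hadoop.mapreduce); the class name and input/output paths are placeholders, not part of the original text.

import java.io.IOException;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCount {

    // Map(): break each input line into (word, 1) key-value pairs.
    public static class TokenizerMapper
            extends Mapper<LongWritable, Text, Text, IntWritable> {
        private static final IntWritable ONE = new IntWritable(1);
        private final Text word = new Text();

        @Override
        protected void map(LongWritable key, Text value, Context context)
                throws IOException, InterruptedException {
            for (String token : value.toString().split("\\s+")) {
                if (!token.isEmpty()) {
                    word.set(token);
                    context.write(word, ONE);
                }
            }
        }
    }

    // Reduce(): after shuffle and sort, sum the counts for each word.
    public static class IntSumReducer
            extends Reducer<Text, IntWritable, Text, IntWritable> {
        @Override
        protected void reduce(Text key, Iterable<IntWritable> values, Context context)
                throws IOException, InterruptedException {
            int sum = 0;
            for (IntWritable v : values) {
                sum += v.get();
            }
            context.write(key, new IntWritable(sum));
        }
    }

    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "word count");
        job.setJarByClass(WordCount.class);
        job.setMapperClass(TokenizerMapper.class);
        job.setCombinerClass(IntSumReducer.class);  // local aggregation on the map side
        job.setReducerClass(IntSumReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));   // e.g. an HDFS input directory
        FileOutputFormat.setOutputPath(job, new Path(args[1])); // e.g. an HDFS output directory
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}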
Map Task:
Reduce Task
● Shuffle and Sort: The Task of Reducer starts with this step, the process in which the
Mapper generates the intermediate key-value and transfers them to the Reducer task
is known as Shuffling. Using the Shuffling process the system can sort the data using
its key value.
Once some of the Mapping tasks are done Shuffling begins, that is why it is a faster
process and does not wait for the completion of the task performed by Mapper.
● Reduce: The main function or task of the Reduce is to gather the Tuple generated
from Map and then perform some sorting and aggregation sort of process on those
key-values depending on its key element.
● OutputFormat: Once all the operations are performed, the key-value pairs are
written into the file with the help of record writer, each record in a new line, and the
key and value in a space-separated manner.
2. HDFS
HDFS (Hadoop Distributed File System) is utilized as the storage layer. It is mainly designed for
working on commodity hardware (inexpensive devices), using a distributed file system design.
HDFS is designed in such a way that it prefers storing data in large blocks rather than in many
small blocks.
HDFS provides fault tolerance and high availability to the storage layer and the other devices
present in that Hadoop cluster. The data storage nodes in HDFS are:
● NameNode(Master)
● DataNode(Slave)
NameNode: The NameNode works as the master and stores the metadata of the files in HDFS.
Metadata can be the name of the file, its size, and information about the location (block number,
block IDs) of the DataNodes, which the NameNode uses to find the closest DataNode for faster
communication. The NameNode instructs the DataNodes with operations like delete, create,
replicate, etc.
DataNode: DataNodes work as slaves. DataNodes are mainly utilized for storing the data in a
Hadoop cluster; the number of DataNodes can range from 1 to 500 or even more. The more
DataNodes there are, the more data the Hadoop cluster can store. It is therefore advised that each
DataNode have a high storage capacity to hold a large number of file blocks.
File Block In HDFS: Data in HDFS is always stored in terms of blocks. A single file is divided
into multiple blocks of 128MB, which is the default size, and this size can also be changed
manually.
Let's understand this concept of breaking down a file into blocks with an example. Suppose you
have uploaded a file of 400MB to HDFS. The file gets divided into blocks of
128MB + 128MB + 128MB + 16MB = 400MB, which means 4 blocks are created, each of
128MB except the last one. Hadoop doesn't know, and doesn't care, what data is stored in these
blocks, so the last (partial) block simply holds whatever data remains. In the Linux file system,
the size of a file block is about 4KB, which is very much smaller than the default block size in
the Hadoop file system. As we all know, Hadoop is mainly configured for storing large data, on
the order of petabytes; this is what makes the Hadoop file system different from other file
systems, as it can be scaled. Nowadays file blocks of 128MB to 256MB are commonly used in
Hadoop.
Replication In HDFS: Replication ensures the availability of the data. Replication is making a
copy of something and the number of times you make a copy of that particular thing can be
expressed as its Replication Factor. As we have seen in File blocks that the HDFS stores the data
in the form of various blocks at the same time Hadoop is also configured to make a copy of those
file blocks.
By default, the replication factor for Hadoop is set to 3, and it is configurable, meaning you can
change it manually as per your requirement. In the example above we made 4 file blocks, so with
3 replicas (copies) of each file block, a total of 4×3 = 12 blocks are kept for backup purposes.
This is because, for running Hadoop, we are using commodity hardware (inexpensive system
hardware) which can crash at any time; we are not using supercomputers for our Hadoop setup.
That is why we need a feature in HDFS that can make copies of the file blocks for backup
purposes; this is known as fault tolerance.
One thing we also need to note is that by making so many replicas of our file blocks we are using
much more storage, but for large organizations the data is far more important than the storage, so
the extra storage is an accepted cost. You can configure the replication factor in your
hdfs-site.xml file, as sketched below.
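A minimal hdfs-site.xml sketch setting these values might look like the following (the numbers shown are just examples; 134217728 bytes is 128MB):

<configuration>
  <property>
    <name>dfs.replication</name>
    <value>3</value>         <!-- copies kept of each file block -->
  </property>
  <property>
    <name>dfs.blocksize</name>
    <value>134217728</value> <!-- block size in bytes (128MB) -->
  </property>
</configuration>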
Rack Awareness: A rack is simply the physical collection of nodes in our Hadoop cluster (maybe
30 to 40). A large Hadoop cluster consists of many racks. With the help of this rack information,
the NameNode chooses the closest DataNode, which maximizes performance while reading or
writing data and reduces network traffic.
HDFS Architecture
3. YARN
YARN is a framework on which MapReduce works. YARN performs two operations: job
scheduling and resource management. The purpose of the job scheduler is to divide a big task into
small jobs so that each job can be assigned to various slaves in a Hadoop cluster and processing
can be maximized. The job scheduler also keeps track of which job is important, which job has
higher priority, dependencies between jobs, and other information such as job timing. The
resource manager is used to manage all the resources that are made available for running a
Hadoop cluster.
Features of YARN
● Multi-Tenancy
● Scalability
● Cluster-Utilization
● Compatibility
4. Common Utilities or Hadoop Common
Hadoop Common, or the common utilities, is nothing but the Java library and Java files, or we
can say the Java scripts, that are needed by all the other components present in a Hadoop cluster.
These utilities are used by HDFS, YARN, and MapReduce for running the cluster. Hadoop
Common assumes that hardware failures in a Hadoop cluster are common, so they need to be
handled automatically in software by the Hadoop framework.
Hadoop Cluster
A Hadoop cluster is a collection of computers, known as nodes, that are networked together to
perform parallel computations on big data sets. Unlike other computer clusters,
Hadoop clusters are designed specifically to store and analyze mass amounts of structured and
unstructured data in a distributed computing environment. Further distinguishing Hadoop
ecosystems from other computer clusters are their unique structure and architecture. Hadoop
clusters consist of a network of connected master and slave nodes that utilize high availability,
low-cost commodity hardware. The ability to linearly scale and quickly add or subtract nodes as
volume demands makes them well-suited to big data analytics jobs with data sets highly variable
in size.
Hadoop clusters are composed of a network of master and worker nodes that orchestrate and
execute the various jobs across the Hadoop distributed file system. The master nodes typically
utilize higher quality hardware and include a NameNode, Secondary NameNode, and
JobTracker, with each running on a separate machine. The workers consist of virtual machines,
running both DataNode and TaskTracker services on commodity hardware, and do the actual
work of storing and processing the jobs as directed by the master nodes. The final part of the
system are the Client Nodes, which are responsible for loading the data and fetching the results.
● Master nodes are responsible for storing data in HDFS and overseeing key operations,
such as running parallel computations on the data using MapReduce.
● The worker nodes comprise most of the virtual machines in a Hadoop cluster, and
perform the job of storing the data and running computations. Each worker node runs the
DataNode and TaskTracker services, which are used to receive the instructions from the
master nodes.
● Client nodes are in charge of loading the data into the cluster. Client nodes first submit
MapReduce jobs describing how data needs to be processed and then fetch the results
once the processing is finished.
A Hadoop cluster's size is a set of metrics that defines the storage and compute capabilities
available to run Hadoop workloads, namely:
● Number of nodes: the number of master nodes, edge nodes, and worker nodes.
● Configuration of each type of node: number of cores per node, RAM, and disk volume.
Hadoop – Different Modes of Operation
As we all know, Hadoop is an open-source framework which is mainly used for storage and for
maintaining and analyzing a large amount of data or datasets on clusters of commodity hardware,
which means it is actually a data management tool. Hadoop also possesses a scale-out storage
property, which means that we can scale up or scale down the number of nodes as per our future
requirements, which is a really useful feature.
1. Standalone Mode
In Standalone Mode none of the daemons will run, i.e., NameNode, DataNode, Secondary
NameNode, Job Tracker, and Task Tracker. We use the job tracker and task tracker for processing purposes
in Hadoop1. For Hadoop2 we use Resource Manager and Node Manager. Standalone Mode also
means that we are installing Hadoop only in a single system. By default, Hadoop is made to run
in this Standalone Mode or we can also call it the Local mode. We mainly use Hadoop in this
Mode for the purpose of learning, testing, and debugging. Hadoop runs fastest in this mode
among all three modes. HDFS (Hadoop Distributed File System), one of the major components of
Hadoop and the one utilized for storage, is not used in this mode; the local file system is used
instead. You can think of HDFS as analogous to the file systems available on Windows, i.e.,
NTFS (New Technology File System) and FAT32 (File Allocation Table, which uses 32-bit
entries). When Hadoop works in this mode there is no need to configure the
files – hdfs-site.xml, mapred-site.xml, core-site.xml for Hadoop environment. In this Mode, all of
your Processes will run on a single JVM(Java Virtual Machine) and this mode can only be used
for small development purposes.
2. Pseudo-distributed Mode
In Pseudo-distributed Mode we also use only a single node, but the main thing is that the cluster
is simulated, which means that all the processes inside the cluster run independently of each
other. All the daemons, that is, NameNode, DataNode, Secondary NameNode, Resource Manager,
Node Manager, etc., run as separate processes on separate JVMs (Java Virtual Machines), or we
can say they run as different Java processes; that is why it is called Pseudo-distributed. One
thing we should remember is that as we are using only the single node set up so all the Master
and Slave processes are handled by the single system. Namenode and Resource Manager are
used as Master and Datanode and Node Manager is used as a slave. A secondary name node is
also used as a Master. The purpose of the Secondary Name node is to just keep the hourly based
backup of the Name node. In this Mode,
● Hadoop is used for development and for debugging purposes both.
● Our HDFS(Hadoop Distributed File System ) is utilized for managing the Input and
Output processes.
● We need to change the configuration files mapred-site.xml, core-site.xml, and
hdfs-site.xml to set up the environment, as sketched below.
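An indicative single-node (pseudo-distributed) configuration sketch is shown below; the hostname and port are assumptions, and only the most common properties are listed.

core-site.xml:
<configuration>
  <property>
    <name>fs.defaultFS</name>
    <value>hdfs://localhost:9000</value>  <!-- where the NameNode runs -->
  </property>
</configuration>

hdfs-site.xml:
<configuration>
  <property>
    <name>dfs.replication</name>
    <value>1</value>  <!-- a single node cannot hold multiple replicas -->
  </property>
</configuration>

mapred-site.xml:
<configuration>
  <property>
    <name>mapreduce.framework.name</name>
    <value>yarn</value>  <!-- run MapReduce jobs on YARN -->
  </property>
</configuration>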
3. Fully Distributed Mode
This is the most important mode, the one in which multiple nodes are used: a few of them run the
master daemons, NameNode and Resource Manager, and the rest of them run the slave daemons,
DataNode and Node Manager. Here Hadoop runs on a cluster of machines or nodes, and the data
that is used is distributed across the different nodes. This is the production mode of Hadoop, so
let's understand it in more physical terms. When you download Hadoop as a tar or zip file and
install it on a single system, you run all the processes on that one system; but here, in fully
distributed mode, we extract this tar or zip file onto each of the nodes in the Hadoop cluster and
then use a particular node for a particular process. Once you distribute the processes among the
nodes, you define which nodes work as masters and which of them work as slaves.
Configuration Files are the files which are located in the extracted tar.gz file in the etc/hadoop/
directory.
All Configuration Files in Hadoop are listed below,
1) HADOOP-ENV.sh->>It specifies the environment variables that affect the JDK used by
Hadoop Daemon (bin/hadoop). We know that the Hadoop framework is written in Java and uses
JRE, so one of the environment variables in the Hadoop daemons is $JAVA_HOME in hadoop-env.sh.
2) CORE-SITE.XML->>It is one of the important configuration files which is required for
runtime environment settings of a Hadoop cluster. It informs Hadoop daemons where the
NAMENODE runs in the cluster. It also informs the Name Node as to which IP and ports it
should bind.
3) HDFS-SITE.XML->>It is one of the important configuration files which is required for the
runtime environment settings of Hadoop. It contains the configuration settings for the
NAMENODE, DATANODE, and SECONDARY NAMENODE. It is used to specify the default
block replication. The actual number of replications can also be specified when the file is created.
4) MAPRED-SITE.XML->>It is one of the important configuration files which is required for the
runtime environment settings of Hadoop. It contains the configuration settings for MapReduce.
In this file, we specify the framework name for MapReduce by setting
mapreduce.framework.name.
5) Masters->>It is used to determine the master nodes in a Hadoop cluster. It informs the Hadoop
daemon about the location of the SECONDARY NAMENODE.
The Masters file on a slave node is blank.
6) Slave->>It is used to determine the slave Nodes in a Hadoop cluster.
The Slave file at Master Node contains a list of hosts, one per line.
The Slave file at Slave server contains the IP address of Slave nodes.
UNIT II
Map Reduce
MapReduce is a processing technique and a program model for distributed computing based on
java. The MapReduce algorithm contains two important tasks, namely Map and Reduce. Map
takes a set of data and converts it into another set of data, where individual elements are broken
down into tuples (key/value pairs). Secondly, reduce task, which takes the output from a map as
an input and combines those data tuples into a smaller set of tuples. As the sequence of the name
MapReduce implies, the reduce task is always performed after the map job.
The major advantage of MapReduce is that it is easy to scale data processing over multiple
computing nodes. Under the MapReduce model, the data processing primitives are called
mappers and reducers. Decomposing a data processing application into mappers and reducers is
sometimes nontrivial. But, once we write an application in the MapReduce form, scaling the
application to run over hundreds, thousands, or even tens of thousands of machines in a cluster is
merely a configuration change. This simple scalability is what has attracted many programmers
to use the MapReduce model.
The Algorithm
Generally the MapReduce paradigm is based on sending the computation to where the data
resides.
MapReduce program executes in three stages, namely map stage, shuffle stage, and
reduce stage.
Map stage − The map or mapper’s job is to process the input data. Generally the
input data is in the form of file or directory and is stored in the Hadoop file
system (HDFS). The input file is passed to the mapper function line by line. The
mapper processes the data and creates several small chunks of data.
Reduce stage − This stage is the combination of the Shuffle stage and the
Reduce stage. The Reducer’s job is to process the data that comes from the
mapper. After processing, it produces a new set of output, which will be stored in
the HDFS.
During a MapReduce job, Hadoop sends the Map and Reduce tasks to the appropriate
servers in the cluster.
The framework manages all the details of data-passing such as issuing tasks, verifying
task completion, and copying data around the cluster between the nodes.
Most of the computing takes place on nodes with data on local disks that reduces the
network traffic.
After completion of the given tasks, the cluster collects and reduces the data to form an
appropriate result, and sends it back to the Hadoop server.
PayLoad − Applications implement the Map and the Reduce functions, and form the
core of the job.
Mapper − Mapper maps the input key/value pairs to a set of intermediate key/value pair.
NameNode − Node that manages the Hadoop Distributed File System (HDFS).
DataNode − Node where data is presented in advance before any processing takes place.
MasterNode − Node where JobTracker runs and which accepts job requests from clients.
SlaveNode − Node where Map and Reduce program runs.
JobTracker − Schedules jobs and tracks the jobs assigned to the Task Tracker.
Task Tracker − Tracks the task and reports status to JobTracker.
Job − A program is an execution of a Mapper and Reducer across a dataset.
Task − An execution of a Mapper or a Reducer on a slice of data.
Task Attempt − A particular instance of an attempt to execute a task on a SlaveNode.
A job run in classic MapReduce is illustrated in Figure 1. At the highest level, there are four
independent entities:
• The client, which submits the MapReduce job.
• The jobtracker, which coordinates the job run. The jobtracker is a Java application whose main
class is JobTracker.
• The tasktrackers, which run the tasks that the job has been split into. Tasktrackers are Java
applications whose main class is TaskTracker.
• The distributed filesystem (normally HDFS), which is used for sharing job files between the
other entities.
Figure 1. How Hadoop runs a MapReduce job using the classic framework
Job Submission
The submit() method on Job creates an internal JobSubmitter instance and calls
submitJobInternal() on it (step 1 in Figure 1). Having submitted the job, waitForCompletion()
polls the job's progress once a second and reports the progress to the console if it has changed
since the last report.
• Asks the jobtracker for a new job ID (by calling getNewJobId() on JobTracker) (step 2).
• Checks the output specification of the job. For example, if the output directory has not been
specified or it already exists, the job is not submitted and an error is thrown to the MapReduce
program.
• Computes the input splits for the job. If the splits cannot be computed, because the input paths
don’t exist, for example, then the job is not submitted and an error is thrown to the MapReduce
program.
• Copies the resources needed to run the job, including the job JAR file, the configuration file,
and the computed input splits, to the jobtracker's filesystem in a directory named after the job ID.
The job JAR is copied with a high replication factor (controlled by the mapred.submit.replication
property, which defaults to 10) so that there are lots of copies across the cluster for the
tasktrackers to access when they run tasks for the job (step 3).
• Tells the jobtracker that the job is ready for execution (by calling submitJob() on JobTracker)
(step 4).
Job Initialization
When the JobTracker receives a call to its submitJob() method, it puts it into an internal queue
from where the job scheduler will pick it up and initialize it (step 5).
To create the list of tasks to run, the job scheduler first retrieves the input splits computed
by the client from the shared filesystem (step 6). It then creates one map task for each split. The
number of reduce tasks to create is determined by the mapred.reduce.tasks property in the Job,
which is set by the setNumReduceTasks() method, and the scheduler simply creates this number
of reduce tasks to be run.
In addition to the map and reduce tasks, two further tasks are created: a job setup task and
a job cleanup task. These are run by tasktrackers and are used to run code to setup the job before
any map tasks run, and to cleanup after all the reduce tasks are complete.
Task Assignment
Tasktrackers run a simple loop that periodically sends heartbeat method calls to the jobtracker.
Heartbeats tell the jobtracker that a tasktracker is alive. As a part of the heartbeat, a tasktracker
will indicate whether it is ready to run a new task, and if it is, the jobtracker will allocate it a
task, which it communicates to the tasktracker using the heartbeat return value (step 7).
Tasktrackers have a fixed number of slots for map tasks and for reduce tasks: for
example, a tasktracker may be able to run two map tasks and two reduce tasks simultaneously.
(The precise number depends on the number of cores and the amount of memory on the
tasktracker.)
Task Execution Now that the tasktracker has been assigned a task, the next step is for it to run the
task. First, it localizes the job JAR by copying it from the shared filesystem to the
tasktracker's filesystem. It also copies any files needed from the distributed cache by the
application to the local disk. Then, it creates a local working directory for the task, and un-jars
the contents of the JAR into this directory. Third, it creates an instance of TaskRunner to run the
task. TaskRunner launches a new Java Virtual Machine (step 9) to run each task in (step 10), so
that any bugs in the user-defined map and reduce functions don’t affect the tasktracker.
MapReduce jobs are long-running batch jobs, taking anything from minutes to hours to run.
Because this is a significant length of time, it’s important for the user to get feedback on how the
job is progressing. A job and each of its tasks have a status, which includes such things as the
state of the job or task (e.g., running, successfully completed, failed), the progress of maps and
reduces, the values of the job’s counters, and a status message or description.
When a task is running, it keeps track of its progress, that is, the proportion of the task
completed. For map tasks, this is the proportion of the input that has been processed. For reduce
tasks, it’s a little more complex, but the system can still estimate the proportion of the reduce
input processed.
Job Completion
When the jobtracker receives a notification that the last task for a job is complete (this will be the
special job cleanup task), it changes the status for the job to “successful.” Then, when the Job
polls for status, it learns that the job has completed successfully, so it prints a message to tell the
user and then returns from the waitForCompletion() method. Last, the jobtracker cleans up its
working state for the job and instructs tasktrackers to do the same (so intermediate output is
deleted, for example).
For very large clusters in the region of 4000 nodes and higher, the MapReduce system
described in the previous section begins to hit scalability bottlenecks, so in 2010 a group at
Yahoo! began to design the next generation of MapReduce. The result was YARN, short for Yet
Another Resource Negotiator.
YARN (MapReduce 2)
I. Resource Manager
YARN splits the responsibilities of the jobtracker into two separate entities: a resource manager
to manage the use of resources across the cluster, and an application master to manage the
lifecycle of applications running on the cluster.
The idea is that an application master negotiates with the resource manager for cluster
resources—described in terms of a number of containers each with a certain memory limit—then
runs application-specific processes in those containers. The containers are overseen by node
managers running on cluster nodes, which ensure that the application does not use more
resources than it has been allocated. In contrast to the jobtracker, each instance of an
application—here a MapReduce job —has a dedicated application master, which runs for the
duration of the application.
The beauty of YARN’s design is that different YARN applications can co-exist on the same
cluster—so a MapReduce application can run at the same time as an MPI application, for
example—which brings great benefits for manageability and cluster utilization. Furthermore, it is
even possible for users to run different versions of MapReduce on the same YARN cluster, which
makes the process of upgrading MapReduce more manageable.
MapReduce on YARN involves more entities than classic MapReduce. They are:
• The YARN resource manager, which coordinates the allocation of compute resources on the
cluster.
• The YARN node managers, which launch and monitor the compute containers on machines in
the cluster.
• The MapReduce application master, which coordinates the tasks running the MapReduce job.
The application master and the MapReduce tasks run in containers that are scheduled by the
resource manager, and managed by the node managers.
• The distributed filesystem (normally HDFS), which is used for sharing
job files between the other entities. The process of running a job is shown in Figure 2, and
described in the following sections.
Job Submission
Jobs are submitted in MapReduce 2 using the same user API as MapReduce 1 (step 1).
MapReduce 2 has an implementation of ClientProtocol that is activated when
mapreduce.framework.name is set to yarn. The submission process is very similar to the classic
implementation. The new job ID is retrieved from the resource manager rather than the
jobtracker.
Job Initialization
When the resource manager receives a call to its submitApplication(), it hands off the request to
the scheduler. The scheduler allocates a container, and the resource manager then launches the
application master’s process there, under the node manager’s management (steps 5a and 5b).
The application master initializes the job by creating a number of bookkeeping objects to
keep track of the job’s progress, as it will receive progress and completion reports from the tasks
(step 6). Next, it retrieves the input splits computed in the client from the shared filesystem (step
7). It then creates a map task object for each split, and a number of reduce task objects
determined by the mapreduce.job.reduces property.
Task Assignment
The application master requests containers for all the map and reduce tasks in the job from the
resource manager (step 8). Each request, which is piggybacked on heartbeat calls, includes
information about each map task’s data locality, in particular the hosts and corresponding racks
that the input split resides on.
Task Execution
Once a task has been assigned a container by the resource manager’s scheduler, the application
master starts the container by contacting the node manager (steps 9a and 9b). The task is
executed by a Java application whose main class is YarnChild. Before it can run the task it
localizes the resources that the task needs, including the job configuration and JAR file, and any
files from the distributed cache (step 10). Finally, it runs the map or reduce task (step 11).
Hadoop provides two Java MapReduce APIs, referred to as the old and the new API. The main differences between them are as follows.
1. The new API favors abstract classes over interfaces, since these are easier to evolve. For example, you can add a method (with a default implementation) to an abstract class without breaking old implementations of the class. For this reason, the Mapper and Reducer interfaces in the old API are abstract classes in the new API.
2. The new API is in the org.apache.hadoop.mapreduce package (and subpackages), while the old API can still be found in org.apache.hadoop.mapred.
3. The new API makes extensive use of context objects that allow the user code to communicate with the MapReduce system. The new Context, for example, essentially unifies the roles of the JobConf, the OutputCollector, and the Reporter from the old API.
4. In both APIs, key-value record pairs are pushed to the mapper and reducer, but in addition, the new API allows both mappers and reducers to control the execution flow by overriding the run() method. In the old API this is possible for mappers by writing a MapRunnable, but no equivalent exists for reducers. A sketch of a run() override appears after this list.
5. Configuration has been unified. The old API has a special JobConf object for job configuration. In the new API, this distinction is dropped, so job configuration is done through a Configuration.
6. Job control is performed through the Job class in the new API, rather than the old JobClient, which no longer exists in the new API.
7. Output files are named slightly differently: in the old API both map and reduce outputs are named part-nnnnn, while in the new API map outputs are named part-m-nnnnn and reduce outputs are named part-r-nnnnn (where nnnnn is an integer designating the part number, starting from zero).
8. In the new API the reduce() method passes values as a java.lang.Iterable, rather than a java.lang.Iterator (as the old API does). This change makes it easier to iterate over the values using Java's for-each loop construct: for (VALUEIN value : values) { ... }
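To make point 4 concrete, here is a minimal sketch (not taken from the original text; the class name LoggingMapper and the record counter are illustrative assumptions) of a new-API Mapper that overrides run() to drive the record loop itself:

import java.io.IOException;
import org.apache.hadoop.mapreduce.Mapper;

public class LoggingMapper<KEYIN, VALUEIN, KEYOUT, VALUEOUT>
        extends Mapper<KEYIN, VALUEIN, KEYOUT, VALUEOUT> {
    @Override
    public void run(Context context) throws IOException, InterruptedException {
        setup(context);
        long records = 0;
        // Pull each record explicitly instead of relying on the default loop.
        while (context.nextKeyValue()) {
            map(context.getCurrentKey(), context.getCurrentValue(), context);
            records++;
        }
        cleanup(context);
        System.err.println("Processed " + records + " records in this map task");
    }
}

The default run() in the new API performs exactly the setup/loop/cleanup sequence shown here; overriding it lets a mapper add its own logic, such as the simple record count above.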
Basic programs of Hadoop MapReduce:
Let us assume we are in the home directory of a Hadoop user (e.g., /home/hadoop).
Follow the steps given below to compile and execute the ProcessUnits.java program.
Step 1
The following command is to create a directory to store the compiled java classes.
$ mkdir units
Step 2
Step 3
The following commands are used for compiling the ProcessUnits.java program and
creating a jar for the program.
Step 4
Step 5
The following command is used to copy the input file named sample.txt into the input
directory of HDFS.
$HADOOP_HOME/bin/hadoop fs -put /home/hadoop/sample.txt input_dir
Step 6
The following command is used to verify the files in the input directory.
Step 7
The following command is used to run the Eleunit_max application by taking the input
files from the input directory.
Wait a while until the job finishes. After execution, as shown below, the output will
report the number of input splits, the number of map tasks, the number of reducer
tasks, and so on.
completed successfully
14/10/31 06:02:52
Map-Reduce Framework
Spilled Records = 10
Shuffled Maps = 2
Failed Shuffles = 0
Bytes Written = 40
Step 8
The following command is used to verify the resultant files in the output folder.
$HADOOP_HOME/bin/hadoop fs -ls output_dir/
Step 9
The following command is used to see the output in the part-00000 file. This file is
generated by the reduce task and stored in HDFS.
1981 34
1984 40
1985 45
Step 10
The following command is used to copy the output folder from HDFS to the local file
system for analyzing.
Important Commands
All Hadoop commands are invoked by the $HADOOP_HOME/bin/hadoop command.
Running the Hadoop script without any arguments prints the description for all
commands.
The following list shows the available options and their descriptions.
secondarynamenode - Runs the DFS secondary namenode.
namenode - Runs the DFS namenode.
datanode - Runs a DFS datanode.
dfsadmin - Runs a DFS admin client.
mradmin - Runs a Map-Reduce admin client.
fsck - Runs a DFS filesystem checking utility.
fs - Runs a generic filesystem user client.
balancer - Runs a cluster balancing utility.
oiv - Applies the offline fsimage viewer to an fsimage.
fetchdt - Fetches a delegation token from the NameNode.
jobtracker - Runs the MapReduce JobTracker node.
pipes - Runs a Pipes job.
tasktracker - Runs a MapReduce TaskTracker node.
historyserver - Runs the job history server as a standalone daemon.
job - Manipulates MapReduce jobs.
queue - Gets information regarding job queues.
version - Prints the version.
jar <jar> - Runs a jar file.
classpath - Prints the class path needed to get the Hadoop jar and the required libraries.
daemonlog - Gets/sets the log level for each daemon.
The following options are used with the hadoop job command.
-submit <job-file> - Submits the job.
-status <job-id> - Prints the map and reduce completion percentage and all job counters.
-counter <job-id> <group-name> <countername> - Prints the counter value.
-kill <job-id> - Kills the job.
-list [all] - -list all displays all jobs; -list displays only jobs which are yet to complete.
-kill-task <task-id> - Kills the task. Killed tasks are NOT counted against failed attempts.
-fail-task <task-id> - Fails the task. Failed tasks are counted against failed attempts.
Driver code
The driver initializes the job, instructs the Hadoop platform to execute your code on a set of
input files, and controls where the output files are placed.
A Job object forms the specification of the job. It gives you control over how the job is run.
When we run this job on a Hadoop cluster, we will package the code into a JAR file (which
Hadoop will distribute around the cluster). Rather than explicitly specify the name of the JAR
file, we can pass a class in the Job’s setJarByClass() method, which Hadoop will use to locate the
relevant JAR file by looking for the JAR file containing this class.
Having constructed a Job object, we specify the input and output paths. An input path is
specified by calling the static addInputPath() method on FileInputFormat, and it can be a single
file, a directory (in which case, the input forms all the files in that directory), or a file pattern.
The output path (of which there is only one) is specified by the static setOutputPath() method on
FileOutputFormat. It specifies a directory where the output files from the reducer functions are
written.
Next, we specify the map and reduce types to use via the setMapperClass() and
setReducerClass() methods.
The setOutputKeyClass() and setOutputValueClass() methods control the output types for the
map and the reduce functions, which are often the same, as they are in our case.
If they are different, then the map output types can be set using the methods
setMapOutputKeyClass() and setMapOutputValueClass().
The input types are controlled via the input format, which we have not explicitly set since we are
using the default TextInputFormat.
After setting the classes that define the map and reduce functions, we are ready to run the job.
The waitForCompletion() method on Job submits the job and waits for it to finish.
The return value of the waitForCompletion() method is a boolean indicating success (true) or
failure (false), which we translate into the program's exit code of 0 or 1. The driver code for
the weather (maximum temperature) program is given below.
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class MaxTemperature {
    public static void main(String[] args) throws Exception {
        if (args.length != 2) {
            System.err.println("Usage: MaxTemperature <input path> <output path>");
            System.exit(-1);
        }
        Job job = new Job();                              // create the job specification
        job.setJarByClass(MaxTemperature.class);          // locate the JAR via this class
        job.setJobName("Max temperature");
        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        job.setMapperClass(MaxTemperatureMapper.class);
        job.setReducerClass(MaxTemperatureReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}
Mapper code
The Mapper class is a generic type, with four formal type parameters that specify the input key,
input value, output key, and output value types of the map function. For the present example, the
input key is a long integer offset, the input value is a line of text, the output key is a year, and the
output value is an air temperature (an integer).
It converts the Text value containing the line of input into a Java String, then uses substring() to
extract the columns of interest.
import java.io.IOException;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
public class MaxTemperatureMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
    private static final int MISSING = 9999;
    @Override
    public void map(LongWritable key, Text value, Context context) throws IOException, InterruptedException {
        String line = value.toString();
        String year = line.substring(15, 19);
        int airTemperature;
        if (line.charAt(87) == '+')   // parseInt doesn't like leading plus signs
            airTemperature = Integer.parseInt(line.substring(88, 92));
        else
            airTemperature = Integer.parseInt(line.substring(87, 92));
        String quality = line.substring(92, 93);
        if (airTemperature != MISSING && quality.matches("[01459]")) {
            context.write(new Text(year), new IntWritable(airTemperature));
        }
    }
}
The map() method is passed a key and a value. We convert the Text value containing the
line of input into a Java String, then use its substring() method to extract the columns we are
interested in.
The map() method also provides an instance of Context to write the output to. In this case,
we write the year as a Text object (since we are just using it as a key), and the temperature is
wrapped in an IntWritable. We write an output record only if the temperature is present and the
quality code indicates the temperature reading is OK.
Reducer code
Again, four formal type parameters are used to specify the input and output types, this time for
the reduce function. The input types of the reduce function must match the output types of the
map function: Text and IntWritable. And in this case, the output types of the reduce function are
Text and IntWritable, for a year and its maximum temperature, which we find by iterating
through the temperatures and comparing each with a record of the highest found so far.
import java.io.IOException;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;
public class MaxTemperatureReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
    @Override
    public void reduce(Text key, Iterable<IntWritable> values, Context context) throws IOException, InterruptedException {
        int maxValue = Integer.MIN_VALUE;
        for (IntWritable value : values) {          // for-each over the Iterable of values
            maxValue = Math.max(maxValue, value.get());
        }
        context.write(key, new IntWritable(maxValue));
    }
}
Record Reader
RecordReader is responsible for creating key / value pair which has been fed to Map task to
process.
Each InputFormat has to provide its own RecordReader implementation to generate key / value
pairs.
For example, the default TextInputFormat provides LineRecordReader, which generates the byte
offset within the file as the key and each newline-terminated line of the input file as the value.
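As a minimal sketch (an illustration, not code from the original text), a custom InputFormat only has to hand back a RecordReader; the class below simply reuses Hadoop's LineRecordReader, which produces (byte offset, line) pairs just as TextInputFormat does:

import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.InputSplit;
import org.apache.hadoop.mapreduce.RecordReader;
import org.apache.hadoop.mapreduce.TaskAttemptContext;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.input.LineRecordReader;

public class PlainTextInputFormat extends FileInputFormat<LongWritable, Text> {
    @Override
    public RecordReader<LongWritable, Text> createRecordReader(InputSplit split,
            TaskAttemptContext context) {
        // Each InputFormat supplies its own RecordReader; LineRecordReader emits
        // the byte offset as the key and the line contents as the value.
        return new LineRecordReader();
    }
}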
Combiner code
Many MapReduce jobs are limited by the bandwidth available on the cluster, so it pays to
minimize the data transferred between map and reduce tasks. Hadoop allows the user to specify a
combiner function to be run on the map output; the combiner function's output forms the input
to the reduce function. Since the combiner function is an optimization, Hadoop does not provide a
guarantee of how many times it will call it for a particular map output record, if at all. In other
words, calling the combiner function zero, one, or many times should produce the same output
from the reducer. The combiner function doesn't replace the reduce function, but it can help cut
down the amount of data shuffled between the maps and reduces.
// Imports are the same as for the MaxTemperature driver above.
public class MaxTemperatureWithCombiner {
    public static void main(String[] args) throws Exception {
        if (args.length != 2) {
            System.err.println("Usage: MaxTemperatureWithCombiner <input path> " + "<output path>");
            System.exit(-1);
        }
        Job job = new Job();
        job.setJarByClass(MaxTemperatureWithCombiner.class);
        job.setJobName("Max temperature");
        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        job.setMapperClass(MaxTemperatureMapper.class);
        job.setCombinerClass(MaxTemperatureReducer.class);   // the reducer doubles as the combiner
        job.setReducerClass(MaxTemperatureReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}
Partitioner code
The partitioning phase takes place after the map phase and before the reduce phase. The number
of partitions is equal to the number of reducers. The data gets partitioned across the reducers
according to the partitioning function. The difference between a partitioner and a combiner is
that the partitioner divides the data according to the number of reducers so that all the data in a
single partition gets processed by a single reducer. The combiner, by contrast, functions similarly
to the reducer and processes the data within each partition; it is an optimization to the reducer.
The default partitioning function is the hash partitioning function, where the hashing is done on
the key. However, it might be useful to partition the data according to some other function of the
key or the value, as the sketch below illustrates.
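Here is a minimal sketch (not from the original text; the class name FirstLetterPartitioner and the partitioning rule are assumptions chosen only to show the mechanics) of a custom partitioner that sends all keys starting with the same character to the same reducer:

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Partitioner;

public class FirstLetterPartitioner extends Partitioner<Text, IntWritable> {
    @Override
    public int getPartition(Text key, IntWritable value, int numPartitions) {
        // Partition on the first character of the key rather than on its hash.
        int firstChar = key.getLength() == 0 ? 0 : key.toString().charAt(0);
        return (firstChar & Integer.MAX_VALUE) % numPartitions;
    }
}

It would be registered on the job with job.setPartitionerClass(FirstLetterPartitioner.class).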
UNIT III
Serialization
Serialization is the process of turning structured objects into a byte stream for
transmission over a network or for writing to persistent storage. Deserialization is the
reverse process of turning a byte stream back into a series of structured objects.
Serialization appears in two quite distinct areas of distributed data processing: for
interprocess communication and for persistent storage. In Hadoop, interprocess
communication between nodes in the system is implemented using remote procedure
calls (RPCs). The RPC protocol uses serialization to render the message into a binary
stream to be sent to the remote node, which then deserializes the binary stream into the
original message. In general, it is desirable that an RPC serialization format is:
Compact
A compact format makes the best use of network bandwidth, which is the most scarce
resource in a data center.
Fast
Interprocess communication forms the backbone for a distributed system, so it is essential
that there is as little performance overhead as possible for the serialization and
deserialization process.
Extensible
Protocols change over time to meet new requirements, so it should be straightforward to
evolve the protocol in a controlled manner for clients and servers.
Interoperable
For some systems, it is desirable to be able to support clients that are written in different
languages to the server, so the format needs to be designed to make this possible.
Hadoop uses its own serialization format, Writables, which is certainly compact and fast,
but not so easy to extend or use from languages other than Java.
Writable is an interface in Hadoop, and the key and value types used in MapReduce must implement this interface.
Hadoop provides these writable wrappers for almost all Java primitive types and some
other types, but sometimes we need to pass custom objects and these custom objects
should implement Hadoop's Writable interface. Hadoop MapReduce uses
implementations of Writables for interacting with user-provided Mappers and
Reducers.
To implement the Writable interface we must provide two methods:
public interface Writable {
void readFields(DataInput in) throws IOException;
void write(DataOutput out) throws IOException;
}
Now the question is whether Writables are necessary for Hadoop. The Hadoop framework
definitely needs a Writable type of interface in order to perform the following tasks: implement
serialization, transfer data between clusters and across the network, and store the deserialized
data on the local disk of the system. Implementing Writable is similar to implementing any
interface in Java: it is done by simply writing the keyword 'implements' and overriding the
Writable methods.
Writable is a central interface in Hadoop which, when serializing data, keeps the serialized size
small so that data can be exchanged easily within the network. It has separate read and write
methods (readFields() and write()) to read data from the network and write data to the local disk,
respectively. Every key and value type inside Hadoop should implement the Writable interface
(and keys the WritableComparable interface). We have seen how Writables reduce the data-size
overhead and make data transfer in the network easier; a sketch of a custom Writable follows.
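A minimal sketch of such a custom type (the class name YearTemperatureWritable and its fields are assumptions for illustration, not code from the original text):

import java.io.DataInput;
import java.io.DataOutput;
import java.io.IOException;
import org.apache.hadoop.io.Writable;

public class YearTemperatureWritable implements Writable {
    private int year;
    private int temperature;

    public YearTemperatureWritable() { }             // Hadoop needs a no-arg constructor

    public YearTemperatureWritable(int year, int temperature) {
        this.year = year;
        this.temperature = temperature;
    }

    @Override
    public void write(DataOutput out) throws IOException {
        out.writeInt(year);                          // serialize the fields in a fixed order
        out.writeInt(temperature);
    }

    @Override
    public void readFields(DataInput in) throws IOException {
        year = in.readInt();                         // deserialize in exactly the same order
        temperature = in.readInt();
    }
}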
Why use Hadoop Writable(s)?
As we already know, data needs to be transmitted between different nodes in a distributed
computing environment. This requires serialization and deserialization of data to convert
the data that is in structured format to byte stream and vice-versa.
Hadoop therefore uses a simple and efficient serialization protocol to serialize data between the
map and reduce phases; these types are called Writables. Some examples of Writables, as already
mentioned, are IntWritable, LongWritable, BooleanWritable and FloatWritable.
The key-value pairs (K2,V2) are called the intermediary key-value pairs. They are passed
from the mapper to the reducer. Before these intermediary key-value pairs reach the
reducer, a shuffle and sort step is performed.
The shuffle is the assignment of the intermediary keys (K2) to reducers, and the sort is the
sorting of these keys. By implementing a RawComparator to compare the intermediary keys,
this extra effort can be avoided and sorting greatly improved, because the RawComparator
compares the keys byte by byte. If we did not use a RawComparator, the intermediary keys
would have to be completely deserialized to perform a comparison.
Data Type
A data type is a set of data with values having predefined characteristics. There are
several kinds of data types in Java. For example- int, short, byte, long, char etc. These are
called as primitive data types.
Each of these primitive data types is bound to a class called its wrapper class. For example,
int, short, byte and long are wrapped by the Integer, Short, Byte and Long classes respectively.
These wrapper classes are predefined in Java.
Interface in Java: An interface in Java is effectively a fully abstract class. The methods within an
interface are abstract methods which do not have a body, and the fields within the interface are
public, static and final, which means that the fields cannot be modified.
The structure of an interface is similar to that of a class. We cannot create an object of an
interface; the only way to use an interface is to implement it in another class by using the
'implements' keyword, as in the sketch below.
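A small sketch of these two ideas (the names Shape and Circle are illustrative assumptions, not from the original text):

interface Shape {
    double PI = 3.14159;          // implicitly public, static and final
    double area();                // implicitly public and abstract, no body
}

class Circle implements Shape {   // the only way to use the interface
    private final double radius;
    Circle(double radius) { this.radius = radius; }
    @Override
    public double area() { return PI * radius * radius; }
}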
Writable Classes
Hadoop comes with a large selection of Writable classes in the org.apache.hadoop.io
package. They form the class hierarchy shown in Figure 1.
Text is the Writable for UTF-8 encoded strings, and it is indexed in terms of bytes rather than
Java char code units. A simple test (sketched below) confirms that the length of a String is the
number of char code units it contains (5 for the sample string used in the TextIterator example
later: one for each of the first three characters, plus a surrogate pair for the last), whereas the
length of a Text object is the number of bytes in its UTF-8 encoding (10 = 1 + 2 + 3 + 4).
Similarly, the indexOf() method in String returns an index in char code units, and find() for
Text is a byte offset.
The charAt() method in String returns the char code unit for the given index, which in the
case of a surrogate pair will not represent a whole Unicode character. The codePointAt()
method, indexed by char code unit, is needed to retrieve a single Unicode character
represented as an int. In fact, the charAt() method in Text is more like the codePointAt()
method than its namesake in String. The only difference is that it is indexed by byte
offset.
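A minimal sketch of such a test, assuming JUnit and Hamcrest are on the classpath; it uses the same sample string as the TextIterator example below:

import static org.hamcrest.CoreMatchers.is;
import static org.junit.Assert.assertThat;
import org.apache.hadoop.io.Text;
import org.junit.Test;

public class StringVersusTextTest {
    @Test
    public void indexingDiffers() {
        String s = "\u0041\u00DF\u6771\uD801\uDC00";
        Text t = new Text(s);
        assertThat(s.length(), is(5));                 // char code units
        assertThat(t.getLength(), is(10));             // bytes in the UTF-8 encoding
        assertThat(s.indexOf("\uD801\uDC00"), is(3));  // index in char code units
        assertThat(t.find("\uD801\uDC00"), is(6));     // byte offset
        assertThat(s.charAt(2), is('\u6771'));         // a char code unit
        assertThat(t.charAt(3), is(0x6771));           // an int code point, indexed by byte offset
        assertThat(s.codePointAt(3), is(0x10400));     // whole code point for the surrogate pair
        assertThat(t.charAt(6), is(0x10400));
    }
}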
Iteration
Iterating over the Unicode characters in Text is complicated by the use of byte offsets for
indexing, since you can’t just increment the index. Turn the Text object into a
java.nio.ByteBuffer, then repeatedly call the bytesToCodePoint() static method on Text
with the buffer. This method extracts the next code point as an int and updates the
position in the buffer. The end of the string is detected when bytesToCodePoint() returns
–1. See the following example.
import java.nio.ByteBuffer;
import org.apache.hadoop.io.Text;
public class TextIterator
{
public static void main(String[] args)
{
Text t = new Text("\u0041\u00DF\u6771\uD801\uDC00");
ByteBuffer buf = ByteBuffer.wrap(t.getBytes(), 0, t.getLength());
int cp;
while (buf.hasRemaining() && (cp = Text.bytesToCodePoint(buf)) != -1)
{
System.out.println(Integer.toHexString(cp));
}
}
}
Example: Iterating over the characters in a Text object
Running the program prints the code points for the four characters in the string:
% hadoop TextIterator
41
df
6771
10400
Another difference with String is that Text is mutable. We can reuse a Text instance by
calling one of the set() methods on it. For example:
Text t = new Text("hadoop");
t.set("pig");
assertThat(t.getLength(), is(3));
assertThat(t.getBytes().length, is(3));
Resorting to String
Text doesn't have as rich an API for manipulating strings as java.lang.String, so in many cases
you need to convert the Text object to a String. This is done in the usual way, using the
toString() method:
assertThat(new Text("hadoop").toString(), is("hadoop"));
BytesWritable
BytesWritable is a wrapper for an array of binary data. Its serialized format is an integer
field (4 bytes) that specifies the number of bytes to follow, followed by the bytes
themselves. For example, the byte array of length two with values 3 and 5 is serialized as
a 4-byte integer (00000002) followed by the two bytes from the array (03 and 05):
BytesWritable b = new BytesWritable(new byte[] { 3, 5 });
byte[] bytes = serialize(b);
assertThat(StringUtils.byteToHexString(bytes), is("000000020305"));
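The serialize() helper used above is not shown in the original text; a minimal sketch of such a helper (the class name WritableTestUtil is an assumption), which writes any Writable into an in-memory byte array, might look like this:

import java.io.ByteArrayOutputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import org.apache.hadoop.io.Writable;

public class WritableTestUtil {
    public static byte[] serialize(Writable writable) throws IOException {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        DataOutputStream dataOut = new DataOutputStream(out);
        writable.write(dataOut);   // let the Writable serialize itself
        dataOut.close();
        return out.toByteArray();  // the raw serialized bytes
    }
}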
BytesWritable is mutable, and its value may be changed by calling its set() method.
NullWritable
NullWritable is a special type of Writable, as it has a zero-length serialization. No bytes
are written to, or read from, the stream. It is used as a placeholder; for example, in
MapReduce, a key or a value can be declared as a NullWritable when you don’t need to
use that position—it effectively stores a constant empty value. NullWritable can also be
useful as a key in SequenceFile when you want to store a list of values, as opposed to
key-value pairs.
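For example, a minimal sketch (an illustration, not from the original text) of a mapper that emits only keys and uses NullWritable for the value:

import java.io.IOException;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

public class DistinctLineMapper extends Mapper<LongWritable, Text, Text, NullWritable> {
    @Override
    public void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException {
        // Only the line itself matters; NullWritable.get() supplies the empty placeholder value.
        context.write(value, NullWritable.get());
    }
}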
ObjectWritable and GenericWritable
ObjectWritable is a general-purpose wrapper for the following: Java primitives, String,
enum, Writable, null, or arrays of any of these types.
GenericWritable is useful when a field can be of more than one type: for example, if the
values in a SequenceFile have multiple types, then you can declare the value type as a
GenericWritable and wrap each value in a GenericWritable.
Writable collections
There are six Writable collection types in the org.apache.hadoop.io package: ArrayWritable,
ArrayPrimitiveWritable, TwoDArrayWritable, MapWritable, SortedMapWritable, and
EnumSetWritable.
ArrayWritable and TwoDArrayWritable are Writable implementations for arrays and
two-dimensional arrays (array of arrays) of Writable instances. All the elements of an
ArrayWritable or a TwoDArrayWritable must be instances of the same class, which is
specified at construction, as follows:
ArrayWritable writable = new ArrayWritable(Text.class);
In contexts where the Writable is defined by type, such as in SequenceFile keys or values,
or as input to MapReduce in general, you need to subclass ArrayWritable (or
TwoDArrayWritable, as appropriate) to set the type statically. For example:
public class TextArrayWritable extends ArrayWritable
{
public TextArrayWritable()
{
super(Text.class);
}
}
ArrayWritable and TwoDArrayWritable both have get() and set() methods, as well as a
toArray() method, which creates a shallow copy of the array.
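A small usage sketch (an assumption, not from the original text) of setting values on the TextArrayWritable defined above and reading them back:

import org.apache.hadoop.io.Text;
import org.apache.hadoop.io.Writable;

public class TextArrayWritableDemo {
    public static void main(String[] args) {
        TextArrayWritable writable = new TextArrayWritable();
        writable.set(new Text[] { new Text("hadoop"), new Text("pig") });
        Writable[] values = writable.get();   // the stored elements
        System.out.println(values.length);    // prints 2
    }
}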
The following RawComparator (defined as a static nested class of the custom TextPair Writable,
a pair of Text fields) compares the serialized byte representations directly:
public static class Comparator extends WritableComparator {
    private static final Text.Comparator TEXT_COMPARATOR = new Text.Comparator();
    public Comparator() {
        super(TextPair.class);
    }
    @Override
    public int compare(byte[] b1, int s1, int l1, byte[] b2, int s2, int l2) {
        try {
            int firstL1 = WritableUtils.decodeVIntSize(b1[s1]) + readVInt(b1, s1);
            int firstL2 = WritableUtils.decodeVIntSize(b2[s2]) + readVInt(b2, s2);
            int cmp = TEXT_COMPARATOR.compare(b1, s1, firstL1, b2, s2, firstL2);
            if (cmp != 0) {
                return cmp;
            }
            return TEXT_COMPARATOR.compare(b1, s1 + firstL1, l1 - firstL1,
                    b2, s2 + firstL2, l2 - firstL2);
        } catch (IOException e) {
            throw new IllegalArgumentException(e);
        }
    }
}
static {
    WritableComparator.define(TextPair.class, new Comparator());
}
Example 4-8. A RawComparator for comparing TextPair byte representations
We actually subclass WritableComparator rather than implement RawComparator
directly, since it provides some convenience methods and default implementations. The
subtle part of this code is calculating firstL1 and firstL2, the lengths of the first Text field
in each byte stream. Each is made up of the length of the variable-length integer (returned
by decodeVIntSize() on WritableUtils) and the value it is encoding (returned by
readVInt()).
The static block registers the raw comparator so that whenever MapReduce sees the
TextPair class, it knows to use the raw comparator as its default comparator.
Custom comparators
As we can see with TextPair, writing raw comparators takes some care, since you have to
deal with details at the byte level. Custom comparators should also be written to be
RawComparators, if possible. These are comparators that implement a different sort order
from the natural sort order defined by the default comparator. The following example shows a
comparator for TextPair, called FirstComparator, that considers only the first string of the
pair. Note that we override the compare() method that takes objects so both compare()
methods have the same semantics.
public static class FirstComparator extends WritableComparator
{
private static final Text.Comparator TEXT_COMPARATOR = new Text.Comparator();
public FirstComparator() {
super(TextPair.class);
}
@Override
public int compare(byte[] b1, int s1, int l1,
byte[] b2, int s2, int l2)
{
try
{
int firstL1 = WritableUtils.decodeVIntSize(b1[s1]) + readVInt(b1, s1);
int firstL2 = WritableUtils.decodeVIntSize(b2[s2]) + readVInt(b2, s2);
return TEXT_COMPARATOR.compare(b1, s1, firstL1, b2, s2, firstL2);
}
catch (IOException e)
{
throw new IllegalArgumentException(e);
}
}
@Override
public int compare(WritableComparable a, WritableComparable b)
{
if (a instanceof TextPair && b instanceof TextPair)
{
return ((TextPair) a).first.compareTo(((TextPair) b).first);
}
return super.compare(a, b);
}
}
Example: A custom RawComparator for comparing the first field of TextPair byte
representations
UNIT IV
Java MapReduce programs and the Hadoop Distributed File System (HDFS) provide you with a
powerful distributed computing framework, but they come with one major drawback — relying
on them limits the use of Hadoop to Java programmers who can think in Map and Reduce terms
when writing programs. Pig is a programming tool attempting to have the best of both worlds: a
declarative query language inspired by SQL and a low-level procedural programming language
that can generate MapReduce code. This lowers the bar when it comes to the level of technical
knowledge needed to exploit the power of Hadoop. Pig was initially developed at Yahoo! in
2006 as part of a research project tasked with coming up with ways for people using Hadoop to
focus more on analyzing large data sets rather than spending lots of time writing Java
MapReduce programs. The goal here was a familiar one: Allow users to focus more on what
they want to do and less on how it's done. Not long after, in 2007, Pig officially became an
Apache project. As such, it is included in most Hadoop distributions.
The Pig programming language is designed to handle any kind of data tossed its way —
structured, semistructured, unstructured data. Pigs, of course, have a reputation for eating
anything they come across. According to the Apache Pig philosophy, pigs eat anything, live
anywhere, are domesticated and can fly to boot. Pigs "living anywhere" refers to the fact that
Pig is a parallel data processing programming language and is not committed to any particular
parallel framework, including Hadoop. What makes it a domesticated animal? Well, if
"domesticated" means "plays well with humans," then it's definitely the case that Pig prides
itself on being easy for humans to code and maintain. Lastly, Pig is smart: in data processing
lingo this means there is an optimizer that does the hard work of figuring out how to process
the data quickly. Pig is not just going to be quick; it's going to fly.
Listing: Sample pig code to illustrate the data processing data flow
Some of the text in this example actually looks like English. Looking at each line in turn, you
can see the basic flow of a Pig program. This code can either be part of a script or issued on the
interactive shell called Grunt.
Load: You first load (LOAD) the data you want to manipulate. As in a typical
MapReduce job, that data is stored in HDFS.
For a Pig program to access the data, you first tell Pig what file or files to use. For that
task, you use the LOAD 'data_file' command. Here, 'data_file' can specify either an
HDFS file or a directory. If a directory is specified, all files in that directory are loaded
into the program. If the data is stored in a file format that isn't natively accessible to Pig,
you can optionally add the USING function to the LOAD statement to specify a
user-defined function that can read in (and interpret) the data.
Transform: You run the data through a set of transformations which are translated into a
set of Map and Reduce tasks.
The transformation logic is where all the data manipulation happens.
You can FILTER out rows that aren't of interest, JOIN two sets of data files, GROUP
data to build aggregations, ORDER results, and do much, much more.
Keep it simple.
Pig Latin provides a streamlined method for interacting with Java MapReduce. It's an
abstraction, in other words, that simplifies the creation of parallel programs on the
Hadoop cluster for data flows and analysis. Complex tasks may require a series of
interrelated data transformations — such series are encoded as data flow sequences.
Writing data transformation and flows as Pig Latin scripts instead of Java MapReduce
programs makes these programs easier to write, understand, and maintain because a)
you don't have to write the job in Java, b) you don't have to think in terms of
MapReduce, and c) you don't need to come up with custom code to support rich data
types.
Pig Latin provides a simpler language to exploit your Hadoop cluster, thus making it
easier for more people to leverage the power of Hadoop and become productive sooner.
Make it smart.
You may recall that the Pig Latin Compiler does the work of transforming a Pig Latin
program into a series of Java MapReduce jobs. The trick is to make sure that the
compiler can optimize the execution of these Java MapReduce jobs automatically,
allowing the user to focus on semantics rather than on how to optimize and access the
data. SQL is set up as a declarative query language that you use to access structured data stored in
an RDBMS. The RDBMS engine first translates the query to a data access method and
then looks at the statistics and generates a series of data access approaches. The
cost-based optimizer chooses the most efficient approach for execution.
Don’t limit development.
Make Pig extensible so that developers can add functions to address their particular
business problems.
The problem we're trying to solve involves calculating the total number of flights flown by
every carrier. The following listing is the Pig Latin script we'll use to answer this question.
Tuple: A tuple is a record that consists of a sequence of fields. Each field can be of any type
('Diego', 'Gomez', or 6, for example). Think of a tuple as a row in a table.
Bag: A bag is a collection of non-unique tuples. The schema of the bag is flexible —
each tuple in the collection can contain an arbitrary number of fields, and each field can
be of any type.
Map: A map is a collection of key value pairs. Any type can be stored in the value, and the
key needs to be unique. The key of a map must be a chararray and the value can be of any
type.
Figure -2 offers some fine examples of Tuple, Bag, and Map data types, as well.
Figure 2:Sample Pig Data Types
In a Hadoop context, accessing data means allowing developers to load, store, and stream data,
whereas transforming data means taking advantage of Pig's ability to group, join, combine,
split, filter, and sort data. Table 1 gives an overview of the operators associated with each
operation.
Pig also provides a few operators that are helpful for debugging and troubleshooting, as shown in
Table 2:
Table 2 Operators for Debugging and Troubleshooting
The optional USING statement defines how to map the data structure within the file to the Pig
data model; in this case, the PigStorage() function, which parses delimited text files.
The optional AS clause defines a schema for the data that is being mapped. If you don't use an
AS clause, you're basically telling the default load function (PigStorage) to expect a plain text
file that is tab delimited.
Script: This method is nothing more than a file containing Pig Latin commands,
identified by the .pig suffix (FlightData.pig, for example). Ending your Pig program
with the .pig extension is a convention but not required. The commands are interpreted
by the Pig Latin compiler and executed in the order determined by the Pig optimizer.
Grunt: Grunt acts as a command interpreter where you can interactively enter Pig Latin
at the Grunt command line and immediately see the response. This method is helpful for
prototyping during initial development and with what-if scenarios.
Embedded: Pig Latin statements can be executed within Java, Python, or JavaScript
programs.
Pig scripts, Grunt shell Pig commands, and embedded Pig programs can run in either Local
mode or MapReduce mode. The Grunt shell provides an interactive shell to submit Pig
commands or run Pig scripts. To start the Grunt shell in Interactive mode, just submit the
command pig at your shell. To specify whether a script or the Grunt shell is executed locally or in
Hadoop mode, just specify it via the -x flag to the pig command.
The following is an example of how you'd specify running your Pig script
in local mode: pig -x local milesPerCarrier.pig
Here's how you'd run the Pig script in Hadoop mode, which is the default if you don't
specify the flag: pig -x mapreduce milesPerCarrier.pig
By default, when you specify the pig command without any parameters, it starts the Grunt shell
in Hadoop mode. If you want to start the Grunt shell in local mode, just add the -x local flag to
the command. Here is an example:
pig -x local