
UNIT I

Big data refers to extremely large and diverse collections of structured, unstructured, and semi-structured data that continue to grow exponentially over time. These datasets are so huge and complex in volume, velocity, and variety that traditional data management systems cannot store, process, and analyze them.

The amount and availability of data are growing rapidly, spurred on by advances in digital technology such as connectivity, mobility, the Internet of Things (IoT), and artificial intelligence (AI). As data continues to expand and proliferate, new big data tools are emerging to help companies collect, process, and analyze data at the speed needed to gain the most value from it.

Big data describes large and diverse datasets that are huge in volume and also rapidly
grow in size over time. Big data is used in machine learning, predictive modeling, and
other advanced analytics to solve business problems and make informed decisions.

The three Vs of big data


Volume : The amount of data matters. With big data, you’ll have to process high volumes of low-density, unstructured data. This can be data of unknown value, such as Twitter data feeds, clickstreams on a web page or a mobile app, or readings from sensor-enabled equipment. For some organizations, this might be tens of terabytes of data. For others, it may be hundreds of petabytes.
Velocity : Velocity is the fast rate at which data is received and (perhaps) acted on. Normally, the
highest velocity of data streams directly into memory versus being written to disk. Some
internet-enabled smart products operate in real time or near real time and will require real-time
evaluation and action.
Variety : Variety refers to the many types of data that are available. Traditional data types were
structured and fit neatly in a relational database. With the rise of big data, data comes in new
unstructured data types. Unstructured and semistructured data types, such as text, audio, and
video, require additional preprocessing to derive meaning and support metadata.

Why is big data important?

Companies use big data in their systems to improve operations, provide better customer service,
create personalized marketing campaigns and take other actions that, ultimately, can increase
revenue and profits. Businesses that use it effectively hold a potential competitive advantage
over those that don't because they're able to make faster and more informed business decisions.

For example, big data provides valuable insights into customers that companies can use to refine
their marketing, advertising and promotions in order to increase customer engagement and
conversion rates. Both historical and real-time data can be analyzed to assess the evolving
preferences of consumers or corporate buyers, enabling businesses to become more responsive to
customer wants and needs.

Big data is also used by medical researchers to identify disease signs and risk factors and by
doctors to help diagnose illnesses and medical conditions in patients. In addition, a combination
of data from electronic health records, social media sites, the web and other sources gives
healthcare organizations and government agencies up-to-date information on infectious disease
threats or outbreaks.

Here are some more examples of how big data is used by organizations:

● In the energy industry, big data helps oil and gas companies identify potential drilling
locations and monitor pipeline operations; likewise, utilities use it to track electrical
grids.
● Financial services firms use big data systems for risk management and real-time
analysis of market data.
● Manufacturers and transportation companies rely on big data to manage their supply
chains and optimize delivery routes.
● Other government uses include emergency response, crime prevention and smart city
initiatives.

These examples illustrate some of the business benefits organizations can gain by using big data.

What are examples of big data?


Big data comes from myriad sources -- some examples are transaction processing systems,
customer databases, documents, emails, medical records, internet clickstream logs, mobile apps
and social networks. It also includes machine-generated data, such as network and server log
files and data from sensors on manufacturing machines, industrial equipment and internet of
things devices.

In addition to data from internal systems, big data environments often incorporate external data
on consumers, financial markets, weather and traffic conditions, geographic information,
scientific research and more. Images, videos and audio files are forms of big data, too, and many
big data applications involve streaming data that is processed and collected on a continual basis.

Types of Big Data


Now that we know what big data is, let’s look at the types of big data:
a) Structured: Structured data is data that can be processed, stored, and retrieved in a fixed format. It refers to highly organized information that can be readily and seamlessly stored in and accessed from a database by simple search-engine algorithms. For instance, the employee table in a company database is structured: the employee details, their job positions, their salaries, etc., are present in an organized manner.
b) Unstructured: Unstructured data refers to data that lacks any specific form or structure whatsoever, which makes it very difficult and time-consuming to process and analyze. Email is an example of unstructured data.
c) Semi-structured: Semi-structured data contains elements of both formats mentioned above, structured and unstructured. To be precise, it refers to data that has not been classified under a particular repository (database), yet contains vital information or tags that separate individual elements within the data.

Features of Big Data Analytics


These technologies are necessary for data scientists to speed up and increase the efficiency of the
process. The main features of big data analytics are:

1. Data wrangling and Preparation

Data preparation is typically carried out once, early in a project, before any iterative modeling begins. Data wrangling, by contrast, is performed repeatedly during iterative analysis and model construction, for example at the feature-engineering stage.
2. Data exploration

The initial phase in data analysis is called data exploration, and it involves looking at and
visualizing data to find insights right away or point out regions or patterns that need further
investigation. Users may more quickly gain insights by using interactive dashboards and
point-and-click data exploration to better understand the broader picture.

3. Scalability

To scale up, or vertically scale, a system, a faster server with more powerful processors and
memory is needed. This technique utilizes less network gear and uses less energy, but it may
only be a temporary cure for many big data analytics platform characteristics, especially if more
growth is anticipated.

4. Support for various types of Analytics

Due to the big data revolution, new forms, stages, and types of data analysis have evolved. Data analytics is spreading across boardrooms all over the world, offering enterprise-wide techniques for commercial success. But what does this mean for businesses? Gaining the appropriate expertise, and the information that results from it, enables organizations to develop a competitive edge, which is crucial for successfully leveraging Big Data. Big data analytics' main goal is to help firms make better business decisions.

Big data analytics shouldn't be thought of as a universal fix. The best data scientists and analysts are also distinguished from the competition by their aptitude for identifying the many forms of analytics that can be applied to benefit the business the most. The three most typical categories are descriptive, predictive, and prescriptive analytics.

5. Version control

Version control, often known as source control, is the process of keeping track of and controlling
changes to software code. Version control systems are computerized tools that help software
development teams keep track of changes to source code over time.

6. Data management

The process of obtaining, storing, and using data in a cost-effective, effective, and secure way is
known as data management. Data management assists people, organizations, and connected
things in optimizing the use of data within the bounds of policy and regulation, enabling
decision-making and actions that will benefit the business as much as feasible. As businesses
increasingly rely on intangible assets to create value, an efficient data management strategy is
more important than ever.

7. Data Integration
Data integration is the process of combining information from several sources to give people a
cohesive perspective. The fundamental idea behind data integration is to open up data and make
it simpler for individuals and systems to access, use, and process. When done correctly, data
integration can enhance data quality, free up resources, lower IT costs, and stimulate creativity
without significantly modifying current applications or data structures. Aside from the fact that
IT firms have always needed to integrate, the benefits of doing so may have never been as large
as they are now.

8. Data Governance

Data governance is the process of ensuring that data is trustworthy, accurate, available, and
usable. It describes the actions people must take, the rules they must follow, and the technology
that will support them throughout the data life cycle.

9. Data security

Data security is the practice of preventing digital data from being accessed by unauthorized parties, corrupted, or stolen at any point in its lifecycle. It encompasses all elements of information security, including administrative and access controls, logical program security, and physical hardware and storage device security, as well as the policies and practices of the organization. Data security is one of the key features of big data analytics.

10. Data visualization

It's more crucial than ever to have easy ways to see and comprehend data in our increasingly
data-driven environment. Employers are, after all, increasingly seeking employees with data
skills. Data and its ramifications must be understood by all employees and business owners.

Challenges with Big Data


The challenges in Big Data are the real implementation hurdles. They require immediate attention and must be handled, because otherwise the technology may fail, which can lead to unpleasant results. Big data challenges include storing and analyzing extremely large and fast-growing data. Some of the Big Data challenges are:
1. Sharing and Accessing Data:
● Perhaps the most frequent challenge in big data efforts is the
inaccessibility of data sets from external sources.
● Sharing data can cause substantial challenges.
● It includes the need for inter- and intra- institutional legal documents.
● Accessing data from public repositories leads to multiple difficulties.
● Data needs to be available in an accurate, complete and timely manner, because decisions based on the company's information system can only be accurate and timely if the underlying data is.
2. Privacy and Security:
● It is another very important challenge with Big Data. This challenge has sensitive, conceptual, technical as well as legal significance.
● Most organizations are unable to maintain regular checks due to the large amounts of data generated. Ideally, security checks and monitoring should be performed in real time, as this is most beneficial.
● Some information about a person, when combined with large external datasets, can reveal facts that the person considers private and would not want others to know.
● Some organizations collect information about people in order to add value to their business. This is done by deriving insights into their lives that they are unaware of.
3. Analytical Challenges:
● Big data poses some huge analytical challenges, which arise from basic questions such as: how do you deal with a problem when the data volume gets too large?
● Or how do you find out which data points are important?
● Or how do you use the data to the best advantage?
● The large amounts of data on which this type of analysis is to be done can be structured (organized data), semi-structured (semi-organized data) or unstructured (unorganized data). There are two techniques through which decision making can be done:
● Either incorporate massive data volumes in the analysis.
● Or determine upfront which Big data is relevant.
4. Technical challenges:
● Quality of data:
● Collecting and storing large amounts of data comes at a cost. Big companies, business leaders and IT leaders always want large data storage.
● For better results and conclusions, Big data focuses on storing quality data rather than irrelevant data.
● This further raises the questions of how it can be ensured that data is relevant, how much data would be enough for decision making, and whether the stored data is accurate or not.
● Fault tolerance:
● Fault tolerance is another technical challenge; fault-tolerant computing is extremely hard, involving intricate algorithms.
● Newer technologies like cloud computing and big data are designed so that whenever a failure occurs, the damage stays within an acceptable threshold and the whole task does not have to begin from scratch.
● Scalability:
● Big data projects can grow and evolve rapidly. The scalability
issue of Big Data has led towards cloud computing.
● It leads to various challenges like how to run and execute
various jobs so that the goal of each workload can be achieved
cost-effectively.
● It also requires dealing with the system failures in an efficient
manner. This leads to a big question again: what kinds of
storage devices are to be used.

Problems with Traditional Large-Scale System

Big data has many qualities—it’s unstructured, dynamic, and complex. But, perhaps most
importantly: Big data is big. Humans and IoT sensors are producing trillions of gigabytes of data
each year. But this isn’t yesterday’s data—it’s modern data, in an increasingly diverse range of
formats and from an ever-broader variety of sources.

This is leading to a chasm between today’s data and yesterday’s systems. The sheer size and scale of modern data, along with its speed and complexity, are putting a new kind of stress on traditional data storage systems. Many are just plain ill-equipped, and organizations that want to make use of this goldmine of data are running into roadblocks.

Why is this happening? What are the key big data challenges to know? If you’re looking to
harness the power of big data, will your storage solutions be enough to overcome them?

1. Big Data Is Too Big for Traditional Storage

Perhaps the most obvious of the big data challenges is its enormous scale. We typically measure
it in petabytes (so that’s 1,024 terabytes or 1,048,576 gigabytes).

To give you an idea of how big big data can get, here’s an example: Facebook users upload at
least 14.58 million photos per hour. Each photo garners interactions stored along with it, such as
likes and comments. Users have “liked” at least a trillion posts, comments, and other data points.

But it’s not just tech giants like Facebook that are storing and analyzing huge quantities of data.
Even a small business taking a slice of social media information—for example, to see what
people are saying about its brand—requires high-capacity data storage architecture.
Traditional data storage systems can, in theory, handle large amounts of data. But when tasked to
deliver the efficiency and insights we need, many simply can’t keep up with the demands of
modern data.

The Relational Database Conundrum

Relational SQL databases are trusty, timeworn methods to house, read, and write data. But these
databases can struggle to operate efficiently, even before they’ve met maximum capacity. A
relational database containing large quantities of data can become slow for many reasons. For
example, each time you insert a record into a relational database, the index must update itself.
This operation takes longer each time the number of records increases. Inserting, updating,
deleting, and performing other operations can take longer depending on the number of
relationships they have to other tables.

Simply put: The more data that is in a relational database, the longer each operation takes.

Scaling Up vs. Scaling Out

It’s also possible to scale traditional data storage systems to improve performance. But because traditional data storage systems are centralized, you’re forced to scale “up” rather than “out.”

Scaling up is less resource-efficient than scaling out: each upgrade means moving to a more powerful system, migrating the data onto it, and re-balancing the load, and there is a hard limit to how far a single system can grow. Traditional data storage architecture soon becomes too sprawling and unwieldy to manage properly.

Attempting to use traditional storage architecture for big data is doomed to fail in part because
the quantity of data makes it unrealistic to scale up sufficiently. This makes scaling out the only
realistic option. Using a distributed storage architecture, you can add new nodes to a cluster once
you reach a given capacity—and you can do so pretty much indefinitely.

2. Big Data Is Too Complex for Traditional Storage

Another major challenge for traditional storage when it comes to big data? The complexity of data types. Traditional data is “structured.” You can organize it in tables with rows and columns
that bear a straightforward relation to one another.

A relational database—the type of database that stores traditional data—consists of records containing clearly defined fields. You can access this type of database using a relational database management system (RDBMS) such as MySQL, Oracle DB, or SQL Server.

A relational database can be relatively large and complex: It may consist of thousands of rows
and columns. But crucially, with a relational database, you can access a piece of data by
reference to its relation to another piece of data.

Big data doesn’t always fit neatly into the relational rows and columns of a traditional data
storage system. It’s largely unstructured, consisting of myriad file types and often including
images, videos, audio, and social media content. That’s why traditional storage solutions are
unsuitable for working with big data: They can’t properly categorize it.
Modern containerized applications also create new storage challenges. For example, Kubernetes
applications are more complex than traditional applications. These applications contain many
parts—such as pods, volumes, and configmaps—and they require frequent updates. Traditional
storage can’t offer the necessary functionality to run Kubernetes effectively.

Using a non-relational (NoSQL) database such as MongoDB, Cassandra, or Redis can allow you
to gain valuable insights into complex and varied sets of unstructured data.
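For illustration, here is a minimal sketch of storing schema-flexible records with the MongoDB Java driver; the connection string, database, collection, and field names are all hypothetical, and the point is simply that documents with different shapes can sit side by side without a schema change.

import java.util.Arrays;

import com.mongodb.client.MongoClient;
import com.mongodb.client.MongoClients;
import com.mongodb.client.MongoCollection;
import org.bson.Document;

public class UnstructuredStoreExample {
    public static void main(String[] args) {
        // Connect to a hypothetical local MongoDB instance.
        try (MongoClient client = MongoClients.create("mongodb://localhost:27017")) {
            MongoCollection<Document> posts =
                    client.getDatabase("analytics").getCollection("social_posts");

            // Two documents with different fields can live in the same collection;
            // no schema migration is needed when new fields appear.
            posts.insertOne(new Document("user", "alice")
                    .append("text", "Loving the new phone!")
                    .append("tags", Arrays.asList("review", "mobile")));
            posts.insertOne(new Document("user", "bob")
                    .append("video_url", "https://example.com/clip.mp4")
                    .append("duration_seconds", 42));
        }
    }
}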

3. Big Data Is Too Fast for Traditional Storage

Traditional data storage systems are for steady data retention. You can add more data regularly
and then perform analysis on the new data set. But big data grows almost instantaneously, and
analysis often needs to occur in real time. An RDBMS isn’t designed for rapid fluctuations.

Take sensor data, for example. Internet of things (IoT) devices need to process large amounts of
sensor data with minimal latency. Sensors transmit data from the “real world” at a near-constant
rate. Traditional storage systems struggle to store and analyze data arriving at such a velocity.

Or, another example: cybersecurity. IT departments must inspect each packet of data arriving
through a company’s firewall to check whether it contains suspicious code. Many gigabytes
might be passing through the network each day. To avoid falling victim to cybercrime, analysis
must occur instantaneously—storing all the data in a table until the end of the day is not an
option.

The high-velocity nature of big data is not kind to traditional storage systems, which can be a
root cause of project failure or unrealized ROI.

Big Data Challenges Require Modern Storage Solutions

Traditional storage architectures are suitable for working with structured data. But when it comes
to the vast, complex, and high-velocity nature of unstructured big data, businesses must find
alternative solutions to start getting the outcomes they’re looking for.

Distributed, scalable, non-relational storage systems can process large quantities of complex data
in real time. This approach can help organizations overcome big data challenges with ease—and
start gleaning breakthrough-driving insights.

If your storage architecture is struggling to keep up with your business needs—or if you want to
gain the competitive edge of a data-mature company—upgrading to a modern storage solution
capable of harnessing the power of big data may make sense.

Sources of Big Data


The Primary Sources of Big Data:
A significant part of big data is generated from three primary resources:

● Machine data
● Social data, and
● Transactional data.

In addition to this, companies also generate data internally through direct customer engagement. This data is usually stored behind the company’s firewall and is then imported into the management and analytics system.

Another critical factor to consider about Big data sources is whether it is structured or
unstructured. Unstructured data doesn’t have any predefined model of storage and management.
Therefore, it requires far more resources to extract meaning out of unstructured data and make it
business-ready.

Now, we’ll take a look at the three primary sources of big data:

1. Machine Data

Machine data is generated automatically, either in response to a specific event or on a fixed schedule. It comes from multiple sources such as smart sensors, SIEM logs, medical devices and wearables, road cameras, IoT devices, satellites, desktops, mobile phones, industrial machinery, etc. These sources enable companies to track consumer behavior. Data extracted from machine sources grows exponentially along with the changing external environment of the market.

In a more broad context, machine data also encompasses information churned by servers, user
applications, websites, cloud programs, and so on.

2. Social Data

It is derived from social media platforms through tweets, retweets, likes, video uploads, and comments shared on Facebook, Instagram, Twitter, YouTube, LinkedIn, etc. The extensive data generated through social media platforms and online channels offers qualitative and quantitative insights into each crucial facet of brand-customer interaction.

Social media data spreads like wildfire and reaches an extensive audience base. It yields important insights regarding customer behavior and their sentiment toward products and services.
This is why brands capitalizing on social media channels can build a strong connection with their
online demographic. Businesses can harness this data to understand their target market and
customer base. This inevitably enhances their decision-making process.

3. Transactional Data
As the name suggests, transactional data is information gathered via online and offline
transactions during different points of sale. The data includes vital details like transaction time,
location, products purchased, product prices, payment methods, discounts/coupons used, and
other relevant quantifiable information related to transactions.

The sources of transactional data include:

● Payment orders
● Invoices
● Storage records and
● E-receipts

Transactional data is a key source of business intelligence. The unique characteristic of transactional data is its time stamp. Since all transactional data carries a time stamp, it is time-sensitive and highly volatile. In plain words, transactional data loses its credibility and importance if not used in due time. Thus, companies that use transactional data promptly can gain the upper hand in the market.

However, transactional data demands a separate set of experts to process, analyze, interpret, and manage it. Moreover, this type of data is the most challenging for most businesses to interpret.

Types of Big Data


2.5 quintillion bytes of data are generated every day by users. Predictions by Statista suggest that by the end of 2021, 74 zettabytes (74 trillion GB) of data would be generated by the internet. Managing such a voluminous and continuous outpouring of data is increasingly difficult. So, to manage such huge and complex data, Big data approaches were introduced. Big data is concerned with turning large and complex data into meaningful information that can’t be extracted or analyzed by traditional methods.
All data cannot be stored in the same way. The methods for data storage can be accurately
evaluated after the type of data has been identified. A Cloud Service, like Microsoft Azure, is a
one-stop destination for storing all kinds of data; blobs, queues, files, tables, disks, and
applications data. However, even within the Cloud, there are special services to deal with
specific sub-categories of data.
For example, Azure Cloud Services like Azure SQL and Azure Cosmos DB help in handling and
managing sparsely varied kinds of data.
Applications Data is the data that is created, read, updated, deleted, or processed by applications.
This data could be generated via web apps, android apps, iOS apps, or any applications
whatsoever. Due to a varied diversity in the kinds of data being used, determining the storage
approach is a little nuanced.
Types of Big Data
Structured Data

Structured data can be crudely defined as the data that resides in a fixed field within a
record.

It is the type of data most familiar from our everyday lives, for example birthdays and addresses.

A certain schema binds it, so all the data has the same set of properties. Structured data is
also called relational data. It is split into multiple tables to enhance the integrity of the
data by creating a single record to depict an entity. Relationships are enforced by the
application of table constraints.

The business value of structured data lies within how well an organization can utilize its
existing systems and processes for analysis purposes.

Sources of structured data


A Structured Query Language (SQL) is used to bring the data together. Structured data is easy to enter, query, and analyze, since all of the data follows the same format. However, forcing a consistent structure also means that any alteration of that structure is difficult, as each record has to be updated to adhere to the new schema. Examples of structured data include numbers, dates, strings, etc. The business data of an e-commerce website can be considered structured data.
Structured data can only be leveraged in cases of predefined functionalities. This means
that structured data has limited flexibility and is suitable for certain specific use cases
only.

Structured data is stored in a data warehouse with rigid constraints and a definite schema.
Any change in requirements would mean updating all of that structured data to meet the
new needs. This is a massive drawback in terms of resource and time management.

Semi-Structured Data

Semi-structured data is not bound by any rigid schema for data storage and handling. The
data is not in the relational format and is not neatly organized into rows and columns like
that in a spreadsheet. However, there are some features like key-value pairs that help in
discerning the different entities from each other.

Since semi-structured data doesn’t need a structured query language, it is commonly called NoSQL data.

A data serialization language is used to exchange semi-structured data across systems that
may even have varied underlying infrastructure.

Semi-structured content is often used to store metadata about a business process but it can
also include files containing machine instructions for computer programs.

This type of information typically comes from external sources such as social media
platforms or other web-based data feeds.
Semi-Structured Data
Data is created in plain text so that different text-editing tools can be used to draw valuable
insights. Due to a simple format, data serialization readers can be implemented on hardware with
limited processing resources and bandwidth.
Data Serialization Languages
Software developers use serialization languages to write in-memory data to files and to transmit, store, and parse it. The sender and the receiver don’t need to know anything about each other’s systems: as long as the same serialization language is used, the data can be understood comfortably by both. There are three predominantly used serialization languages.

1. XML– XML stands for eXtensible Markup Language. It is a text-based markup language
designed to store and transport data. XML parsers can be found in almost all popular
development platforms. It is human and machine-readable. XML has definite standards for
schema, transformation, and display. It is self-descriptive. Below is an example of a
programmer’s details in XML.
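A small illustrative record (the element names, attribute names, and values below are placeholders) might look like this:

<Programmer>
    <FirstName>Asha</FirstName>
    <LastName>Verma</LastName>
    <ContactNo Type="mobile">9876543210</ContactNo>
    <Skills>
        <Skill>Java</Skill>
        <Skill>Hadoop</Skill>
    </Skills>
</Programmer>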

XML expresses the data using tags (text within angle brackets) to shape the data (for example, FirstName) and attributes (for example, Type) to further describe it. However, being a verbose and voluminous language, XML has lost ground to other formats.
2. JSON– JSON (JavaScript Object Notation) is a lightweight open-standard file format for data
interchange. JSON is easy to use and uses human/machine-readable text to store and transmit
data objects.
This format isn’t as formal as XML. It’s more like a key/value pair model than a formal data
depiction. Javascript has inbuilt support for JSON. Although JSON is very popular amongst web
developers, non-technical personnel find it tedious to work with JSON due to its heavy
dependence on JavaScript and structural characters (braces, commas, etc.)
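The same illustrative programmer record (placeholder values again) expressed as JSON would be:

{
  "firstName": "Asha",
  "lastName": "Verma",
  "contact": { "type": "mobile", "number": "9876543210" },
  "skills": ["Java", "Hadoop"]
}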
3. YAML– YAML is a user-friendly data serialization language. Figuratively, it stands for YAML
Ain’t Markup Language. It is adopted by technical and non-technical handlers all across the
globe owing to its simplicity. The data structure is defined by line separation and indentation and
reduces the dependency on structural characters. YAML is extremely easy to comprehend, and its popularity is a result of its readability by both humans and machines.

YAML example
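A minimal sketch of the same illustrative record in YAML (placeholder values) is:

firstName: Asha
lastName: Verma
contact:
  type: mobile
  number: "9876543210"
skills:
  - Java
  - Hadoop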
A product catalog organized by tags is an example of semi-structured data.

Unstructured Data

Unstructured data is the kind of data that doesn’t adhere to any definite schema or set of
rules. Its arrangement is unplanned and haphazard.

Photos, videos, text documents, and log files can be generally considered unstructured
data. Even though the metadata accompanying an image or a video may be
semi-structured, the actual data being dealt with is unstructured.

Additionally, unstructured data is also known as “dark data” because it cannot be analyzed without the proper software tools.
Unstructured Data

Google File System


Google Inc. developed the Google File System (GFS), a scalable distributed file system
(DFS), to meet the company’s growing data processing needs. GFS offers fault tolerance,
dependability, scalability, availability, and performance to big networks and connected
nodes. GFS is made up of a number of storage systems constructed from inexpensive
commodity hardware parts. The search engine, which creates enormous volumes of data
that must be kept, is only one example of how it is customized to meet Google’s various
data use and storage requirements.
The Google File System is designed to tolerate frequent hardware failures while running on inexpensive, commercially available servers.
GoogleFS is another name for GFS. It manages two types of data namely File metadata
and File Data.
The GFS node cluster consists of a single master and several chunk servers that various
client systems regularly access. On local discs, chunk servers keep data in the form of
Linux files. Large (64 MB) pieces of the stored data are split up and replicated at least
three times around the network. Reduced network overhead results from the greater
chunk size.
Without hindering applications, GFS is made to meet Google’s huge cluster requirements.
Hierarchical directories with path names are used to store files. The master is in charge of
managing metadata, including namespace, access control, and mapping data. The master
communicates with each chunk server by timed heartbeat messages and keeps track of its
status updates.
More than 1,000 nodes with 300 TB of disc storage capacity make up the largest GFS
clusters. This is available for constant access by hundreds of clients.
Components of GFS

A group of computers makes up GFS. A cluster is just a group of connected computers.


There could be hundreds or even thousands of computers in each cluster. There are three
basic entities included in any GFS cluster as follows:
● GFS Clients: They can be computer programs or applications which may be
used to request files. Requests may be made to access and modify
already-existing files or add new files to the system.
● GFS Master Server: It serves as the cluster’s coordinator. It preserves a
record of the cluster’s actions in an operation log. Additionally, it keeps track
of the data that describes chunks, or metadata. The chunks’ place in the overall
file and which files they belong to are indicated by the metadata to the master
server.
● GFS Chunk Servers: They are the GFS’s workhorses. They store file chunks 64 MB in size. The chunk servers do not send chunks through the master server; instead, they deliver the requested chunks directly to the client. To ensure reliability, GFS makes numerous copies of each chunk and stores them on various chunk servers; the default is three copies. Each copy is referred to as a replica.

Features of GFS

● Namespace management and locking.


● Fault tolerance.
● Reduced client and master interaction because of the large chunk size.
● High availability.
● Critical data replication.
● Automatic and efficient data recovery.
● High aggregate throughput.

Advantages of GFS

1. High accessibility: data is still accessible even if a few nodes fail, thanks to replication. Component failures are the norm rather than the exception, as the saying goes.
2. High throughput: many nodes operate concurrently.
3. Dependable storage: data that has been corrupted can be detected and re-replicated.

Disadvantages of GFS

1. Not the best fit for small files.
2. The master may act as a bottleneck.
3. Not suited to random writes.
4. Best suited to data that is written once (or only appended) and then read many times.
Hadoop Distributed File System(HDFS)
Hadoop is an open source framework from Apache and is used to store, process and analyze data which is very huge in volume. Hadoop is written in Java and is not OLAP (online analytical processing); it is used for batch/offline processing. It is used by Facebook, Yahoo, Google, Twitter, LinkedIn and many more. Moreover, it can be scaled up just by adding nodes to the cluster.

Modules of Hadoop

1. HDFS: Hadoop Distributed File System. Google published its paper GFS and on the
basis of that HDFS was developed. It states that the files will be broken into blocks and
stored in nodes over the distributed architecture.
2. Yarn: Yet another Resource Negotiator is used for job scheduling and managing the
cluster.
3. Map Reduce: This is a framework which helps Java programs to do the parallel
computation on data using key value pairs. The Map task takes input data and converts it
into a data set which can be computed in Key value pairs. The output of Map task is
consumed by reduce task and then the output of reducer gives the desired result.
4. Hadoop Common: These Java libraries are used to start Hadoop and are used by other
Hadoop modules.

Hadoop Architecture

The Hadoop architecture is a package of the file system, MapReduce engine and the HDFS
(Hadoop Distributed File System). The MapReduce engine can be MapReduce/MR1 or
YARN/MR2.

A Hadoop cluster consists of a single master and multiple slave nodes. The master node includes the NameNode and JobTracker (and, in a single-node setup, may also run a DataNode and TaskTracker), whereas the slave nodes include a DataNode and TaskTracker.
Hadoop Distributed File System

The Hadoop Distributed File System (HDFS) is a distributed file system for Hadoop. It contains
a master/slave architecture. This architecture consists of a single NameNode performing the role
of master, and multiple DataNodes performing the role of a slave.

Both NameNode and DataNode are capable enough to run on commodity machines. The Java
language is used to develop HDFS. So any machine that supports Java language can easily run
the NameNode and DataNode software.

NameNode

○ It is a single master server that exists in the HDFS cluster.


○ As it is a single node, it may become a single point of failure.
○ It manages the file system namespace by executing an operation like the opening,
renaming and closing the files.
○ It simplifies the architecture of the system.

DataNode

○ The HDFS cluster contains multiple DataNodes.


○ Each DataNode contains multiple data blocks.
○ These data blocks are used to store data.
○ It is the responsibility of the DataNode to serve read and write requests from the file system's clients (see the client sketch after this list).
○ It performs block creation, deletion, and replication upon instruction from the
NameNode.
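As a sketch of how a client interacts with these roles, the snippet below writes and reads a file through Hadoop's Java FileSystem API: the client obtains metadata from the NameNode and then streams block data to and from the DataNodes. The NameNode URI and file path here are hypothetical.

import java.nio.charset.StandardCharsets;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IOUtils;

public class HdfsClientExample {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        // Assumed NameNode address; normally this comes from core-site.xml (fs.defaultFS).
        conf.set("fs.defaultFS", "hdfs://namenode-host:9000");

        try (FileSystem fs = FileSystem.get(conf)) {
            Path file = new Path("/user/demo/hello.txt");

            // Write: the NameNode records the metadata, the DataNodes store the blocks.
            try (FSDataOutputStream out = fs.create(file, true)) {
                out.write("Hello HDFS".getBytes(StandardCharsets.UTF_8));
            }

            // Read: block locations come from the NameNode, bytes from the DataNodes.
            try (FSDataInputStream in = fs.open(file)) {
                IOUtils.copyBytes(in, System.out, 4096, false);
            }
        }
    }
}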

Job Tracker

○ The role of Job Tracker is to accept the MapReduce jobs from clients and process the
data by using NameNode.
○ In response, NameNode provides metadata to Job Tracker.

Task Tracker

○ It works as a slave node for Job Tracker.


○ It receives tasks and code from Job Tracker and applies that code on the file. This process
can also be called a Mapper.

MapReduce Layer
MapReduce comes into play when the client application submits a MapReduce job to the Job Tracker. In response, the Job Tracker sends the request to the appropriate Task Trackers. Sometimes a TaskTracker fails or times out; in such a case, that part of the job is rescheduled.

Advantages of Hadoop

○ Fast: In HDFS the data is distributed over the cluster and is mapped, which helps in faster retrieval. Even the tools to process the data are often on the same servers, thus reducing the processing time. Hadoop is able to process terabytes of data in minutes and petabytes in hours.
○ Scalable: Hadoop cluster can be extended by just adding nodes in the cluster.
○ Cost Effective: Hadoop is open source and uses commodity hardware to store data so it
is really cost effective as compared to traditional relational database management
systems.
○ Resilient to failure: HDFS has the property with which it can replicate data over the
network, so if one node is down or some other network failure happens, then Hadoop
takes the other copy of data and uses it. Normally, data are replicated thrice but the
replication factor is configurable.

History of Hadoop
Hadoop was started by Doug Cutting and Mike Cafarella in 2002. Its origin was the Google File
System paper, published by Google.

Let's focus on the history of Hadoop in the following steps: -

○ In 2002, Doug Cutting and Mike Cafarella started to work on a project, Apache Nutch. It
is an open source web crawler software project.
○ While working on Apache Nutch, they were dealing with big data. Storing that data required spending a lot of money, which became a drawback for the project. This problem became one of the important reasons for the emergence of Hadoop.
○ In 2003, Google introduced a file system known as GFS (Google file system). It is a
proprietary distributed file system developed to provide efficient access to data.
○ In 2004, Google released a white paper on Map Reduce. This technique simplifies the
data processing on large clusters.
○ In 2005, Doug Cutting and Mike Cafarella introduced a new file system known as NDFS
(Nutch Distributed File System). This file system also includes Map reduce.
○ In 2006, Doug Cutting joined Yahoo. On the basis of the Nutch project, Doug Cutting introduced a new project, Hadoop, with a file system known as HDFS (Hadoop Distributed File System). Hadoop's first version, 0.1.0, was released that year.
○ Doug Cutting named his project Hadoop after his son's toy elephant.
○ In 2007, Yahoo ran two clusters of 1000 machines.
○ In 2008, Hadoop became the fastest system to sort 1 terabyte of data on a 900 node
cluster within 209 seconds.
○ In 2013, Hadoop 2.2 was released.
○ In 2017, Hadoop 3.0 was released.

Hadoop – Architecture
As we all know Hadoop is a framework written in Java that utilizes a large cluster of commodity
hardware to maintain and store big size data. Hadoop works on the MapReduce Programming
Algorithm that was introduced by Google. Today lots of Big Brand Companies are using Hadoop
in their Organization to deal with big data, eg. Facebook, Yahoo, Netflix, eBay, etc. The Hadoop
Architecture Mainly consists of 4 components.

● MapReduce
● HDFS(Hadoop Distributed File System)
● YARN(Yet Another Resource Negotiator)
● Common Utilities or Hadoop Common
Let’s understand the role of each one of these components in detail.

1. MapReduce

MapReduce is essentially a programming model, built on the YARN framework, whose major feature is to perform distributed processing in parallel across a Hadoop cluster, which is what makes Hadoop so fast. When you are dealing with Big Data, serial processing is no longer of any use. MapReduce has mainly 2 tasks which are divided phase-wise:

In the first phase, Map is utilized and in next phase Reduce is utilized.
Here, we can see that the Input is provided to the Map() function then its output is used as an
input to the Reduce function and after that, we receive our final output. Let’s understand What
this Map() and Reduce() does.

As we can see, an Input is provided to the Map(); since we are using Big Data, the input is a set of data. The Map() function breaks these DataBlocks into tuples, which are nothing but key-value pairs. These key-value pairs are then sent as input to the Reduce(). The Reduce() function combines these tuples based on their key value, forms a set of tuples, and performs some operation like sorting or a summation-type job, which is then sent to the final output node. Finally, the output is obtained.

The data processing is always done in Reducer depending upon the business requirement of that
industry. This is How First Map() and then Reduce is utilized one by one.

Let’s understand the Map Task and Reduce Task in detail.

Map Task:

● RecordReader: The purpose of the RecordReader is to break the input into records. It is responsible for providing key-value pairs to the Map() function. The key is typically the record's positional information (such as its byte offset) and the value is the data associated with it.
● Map: A map is nothing but a user-defined function whose work is to process the tuples obtained from the RecordReader. The Map() function may generate zero, one, or many key-value pairs from these tuples.
● Combiner: The combiner is used for grouping the data in the Map workflow. It is similar to a local reducer. The intermediate key-values generated in the Map are combined with the help of this combiner. Using a combiner is optional.
● Partitioner: The partitioner is responsible for routing the key-value pairs generated in the Mapper phase. It produces the shards (partitions) corresponding to each reducer. The hash code of each key is computed, and the partitioner takes that hash code modulo the number of reducers (key.hashCode() % numberOfReducers); a small sketch follows this list.
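A minimal sketch of that hash-and-modulus routing is shown below; the class name and key/value types are illustrative, and Hadoop's default HashPartitioner applies essentially the same formula.

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Partitioner;

// Routes each (word, count) pair to a reducer by hashing the key and
// taking the result modulo the number of reducers.
public class WordPartitioner extends Partitioner<Text, IntWritable> {
    @Override
    public int getPartition(Text key, IntWritable value, int numReduceTasks) {
        // Mask the sign bit so the partition index is never negative.
        return (key.hashCode() & Integer.MAX_VALUE) % numReduceTasks;
    }
}

A job would opt into it with job.setPartitionerClass(WordPartitioner.class); if no partitioner is set, the default HashPartitioner is used.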

Reduce Task

● Shuffle and Sort: The Task of Reducer starts with this step, the process in which the
Mapper generates the intermediate key-value and transfers them to the Reducer task
is known as Shuffling. Using the Shuffling process the system can sort the data using
its key value.

Once some of the Mapping tasks are done Shuffling begins, that is why it is a faster
process and does not wait for the completion of the task performed by Mapper.
● Reduce: The main function or task of the Reduce is to gather the Tuple generated
from Map and then perform some sorting and aggregation sort of process on those
key-values depending on its key element.
● OutputFormat: Once all the operations are performed, the key-value pairs are
written into the file with the help of record writer, each record in a new line, and the
key and value in a space-separated manner.
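To tie the Map and Reduce phases together, here is a minimal word-count sketch written against the standard org.apache.hadoop.mapreduce API; the class names are illustrative and the input/output paths are supplied on the command line.

import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCount {

    // Map phase: the RecordReader supplies (byte offset, line) pairs;
    // the mapper emits (word, 1) for every word in the line.
    public static class TokenizerMapper
            extends Mapper<LongWritable, Text, Text, IntWritable> {
        private static final IntWritable ONE = new IntWritable(1);
        private final Text word = new Text();

        @Override
        public void map(LongWritable key, Text value, Context context)
                throws IOException, InterruptedException {
            StringTokenizer itr = new StringTokenizer(value.toString());
            while (itr.hasMoreTokens()) {
                word.set(itr.nextToken());
                context.write(word, ONE);
            }
        }
    }

    // Reduce phase: after shuffle and sort, all counts for the same word
    // arrive together and are summed into a single (word, total) pair.
    public static class IntSumReducer
            extends Reducer<Text, IntWritable, Text, IntWritable> {
        @Override
        public void reduce(Text key, Iterable<IntWritable> values, Context context)
                throws IOException, InterruptedException {
            int sum = 0;
            for (IntWritable v : values) {
                sum += v.get();
            }
            context.write(key, new IntWritable(sum));
        }
    }

    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "word count");
        job.setJarByClass(WordCount.class);
        job.setMapperClass(TokenizerMapper.class);
        job.setCombinerClass(IntSumReducer.class); // optional local reducer
        job.setReducerClass(IntSumReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}

Packaged into a JAR, such a job would typically be submitted with a command along the lines of: hadoop jar wordcount.jar WordCount /input /output.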
2. HDFS

HDFS(Hadoop Distributed File System) is utilized for storage permission. It is mainly designed
for working on commodity Hardware devices(inexpensive devices), working on a distributed file
system design. HDFS is designed in such a way that it believes more in storing the data in a large
chunk of blocks rather than storing small data blocks.

HDFS in Hadoop provides Fault-tolerance and High availability to the storage layer and the
other devices present in that Hadoop cluster. Data storage Nodes in HDFS.

● NameNode(Master)
● DataNode(Slave)

NameNode: NameNode works as a Master in a Hadoop cluster and guides the DataNodes (Slaves).


Namenode is mainly used for storing the Metadata i.e. the data about the data. Metadata can be
the transaction logs that keep track of the user’s activity in a Hadoop cluster.

Metadata can also be the name of the file, size, and the information about the location(Block
number, Block ids) of Datanode that Namenode stores to find the closest DataNode for Faster
Communication. Namenode instructs the DataNodes with the operation like delete, create,
Replicate, etc.

DataNode: DataNodes work as Slaves. DataNodes are mainly utilized for storing the data in a Hadoop cluster; the number of DataNodes can range from 1 to 500 or even more. The more DataNodes there are, the more data the Hadoop cluster will be able to store. It is therefore advised that DataNodes should have a high storage capacity to store a large number of file blocks.

High Level Architecture Of Hadoop

File Block In HDFS: Data in HDFS is always stored in terms of blocks. A file is divided into blocks of 128 MB, which is the default size, and you can also change it manually.
Let’s understand this concept of breaking down a file into blocks with an example. Suppose you have uploaded a file of 400 MB to your HDFS; this file gets divided into blocks of 128 MB + 128 MB + 128 MB + 16 MB = 400 MB, meaning 4 blocks are created, each of 128 MB except the last one. Hadoop doesn’t know or care what data is stored in these blocks, so the final block may simply hold a partial record. In the Linux file system, the size of a file block is about 4 KB, which is very much
less than the default size of file blocks in the Hadoop file system. As we all know Hadoop is
mainly configured for storing the large size data which is in petabyte, this is what makes Hadoop
file system different from other file systems as it can be scaled, nowadays file blocks of 128MB
to 256MB are considered in Hadoop.

Replication In HDFS Replication ensures the availability of the data. Replication is making a
copy of something and the number of times you make a copy of that particular thing can be
expressed as its Replication Factor. As we have seen in File blocks that the HDFS stores the data
in the form of various blocks at the same time Hadoop is also configured to make a copy of those
file blocks.

By default, the Replication Factor for Hadoop is set to 3, and it can be configured, meaning you can change it manually as per your requirements. In the example above we made 4 file blocks, so with 3 replicas (copies) of each file block a total of 4 × 3 = 12 blocks are stored for redundancy.

This is because for running Hadoop we use commodity hardware (inexpensive system hardware) which can crash at any time; we are not using a supercomputer for our Hadoop setup. That is why we need a feature in HDFS that can make copies of file blocks for backup purposes; this is known as fault tolerance.

Now, one thing we also need to notice is that after making so many replicas of our file blocks we use a lot of extra storage, but for large organizations the data is far more important than the storage, so nobody minds this extra storage. You can configure the Replication factor in your hdfs-site.xml file.
Rack Awareness: A rack is nothing but the physical collection of nodes in our Hadoop cluster (maybe 30 to 40). A large Hadoop cluster consists of many racks. With the help of this rack information, the NameNode chooses the closest DataNode, achieving maximum performance while performing reads and writes, which reduces network traffic.

HDFS Architecture

3. YARN(Yet Another Resource Negotiator)

YARN is a Framework on which MapReduce works. YARN performs 2 operations that are Job
scheduling and Resource Management. The Purpose of Job scheduler is to divide a big task into
small jobs so that each job can be assigned to various slaves in a Hadoop cluster and Processing
can be Maximized. Job Scheduler also keeps track of which job is important, which job has more
priority, dependencies between the jobs and all the other information like job timing, etc. And the
use of Resource Manager is to manage all the resources that are made available for running a
Hadoop cluster.

Features of YARN
● Multi-Tenancy
● Scalability
● Cluster-Utilization
● Compatibility

4. Hadoop common or Common Utilities

Hadoop Common, or Common Utilities, is nothing but the Java libraries and Java files that are needed by all the other components present in a Hadoop cluster. These utilities are used by HDFS, YARN, and MapReduce for running the cluster. Hadoop Common assumes that hardware failures in a Hadoop cluster are common, so they need to be handled automatically in software by the Hadoop framework.
Hadoop Cluster
A Hadoop cluster is a collection of computers, known as nodes, that are networked together to perform parallel computations on big data sets. Unlike other computer clusters, Hadoop clusters are designed specifically to store and analyze mass amounts of structured and unstructured data in a distributed computing environment. Further distinguishing Hadoop
ecosystems from other computer clusters are their unique structure and architecture. Hadoop
clusters consist of a network of connected master and slave nodes that utilize high availability,
low-cost commodity hardware. The ability to linearly scale and quickly add or subtract nodes as
volume demands makes them well-suited to big data analytics jobs with data sets highly variable
in size.
Hadoop clusters are composed of a network of master and worker nodes that orchestrate and
execute the various jobs across the Hadoop distributed file system. The master nodes typically
utilize higher quality hardware and include a NameNode, Secondary NameNode, and
JobTracker, with each running on a separate machine. The workers consist of virtual machines,
running both DataNode and TaskTracker services on commodity hardware, and do the actual
work of storing and processing the jobs as directed by the master nodes. The final part of the
system are the Client Nodes, which are responsible for loading the data and fetching the results.

● Master nodes are responsible for storing data in HDFS and overseeing key operations,
such as running parallel computations on the data using MapReduce.
● The worker nodes comprise most of the virtual machines in a Hadoop cluster, and
perform the job of storing the data and running computations. Each worker node runs the
DataNode and TaskTracker services, which are used to receive the instructions from the
master nodes.
● Client nodes are in charge of loading the data into the cluster. Client nodes first submit
MapReduce jobs describing how data needs to be processed and then fetch the results
once the processing is finished.

A Hadoop cluster size is a set of metrics that defines storage and compute capabilities to run
Hadoop workloads, namely :

● Number of nodes : number of Master nodes, number of Edge Nodes, number of Worker
Nodes.
● Configuration of each type node: number of cores per node, RAM and Disk Volume.
Hadoop – Different Modes of Operation
As we all know Hadoop is an open-source framework which is mainly used for storage purpose
and maintaining and analyzing a large amount of data or datasets on the clusters of commodity
hardware, which means it is actually a data management tool. Hadoop also possesses a scale-out storage property, which means that we can increase or decrease the number of nodes as per our future requirements, which is a really useful feature.

Hadoop Mainly works on 3 different Modes:


1. Standalone Mode
2. Pseudo-distributed Mode
3. Fully-Distributed Mode

1. Standalone Mode

In Standalone Mode none of the Daemon will run i.e. Namenode, Datanode, Secondary Name
node, Job Tracker, and Task Tracker. We use job-tracker and task-tracker for processing purposes
in Hadoop1. For Hadoop2 we use Resource Manager and Node Manager. Standalone Mode also
means that we are installing Hadoop only in a single system. By default, Hadoop is made to run
in this Standalone Mode or we can also call it the Local mode. We mainly use Hadoop in this
Mode for the purpose of learning, testing, and debugging. Hadoop works fastest in this mode among all three modes. HDFS (Hadoop Distributed File System), one of the major components of Hadoop that is utilized for storage, is not utilized in this mode; the local file system is used instead, comparable to the file systems available on Windows, i.e. NTFS (New Technology File System) and FAT32 (File Allocation Table, 32-bit). When Hadoop works in this mode there is no need to configure the files hdfs-site.xml, mapred-site.xml, or core-site.xml for the Hadoop environment. In this mode, all of your processes run on a single JVM (Java Virtual Machine), and this mode can only be used for small development purposes.

2. Pseudo Distributed Mode (Single Node Cluster)

In Pseudo-distributed Mode we also use only a single node, but the main thing is that the cluster is simulated, which means that all the processes inside the cluster run independently of each other. All the daemons, namely the Namenode, Datanode, Secondary NameNode, Resource Manager, Node Manager, etc., run as separate processes on separate JVMs (Java Virtual Machines), or we can say as different Java processes; that is why it is called pseudo-distributed. One thing we should remember is that, since we are using only a single-node setup, all the Master and Slave processes are handled by the single system. The Namenode and Resource Manager are used as Masters, and the Datanode and Node Manager are used as Slaves. A Secondary NameNode is also used as a Master; its purpose is to take periodic checkpoints of the NameNode's metadata rather than to act as a live backup. In this Mode,
● Hadoop is used for development and for debugging purposes both.
● Our HDFS(Hadoop Distributed File System ) is utilized for managing the Input and
Output processes.
● We need to change the configuration files mapred-site.xml, core-site.xml,
hdfs-site.xml for setting up the environment.

3. Fully Distributed Mode (Multi-Node Cluster)

This is the most important mode, in which multiple nodes are used: a few of them run the Master daemons, namely the Namenode and Resource Manager, and the rest of them run the Slave daemons, namely the DataNodes and Node Managers. Here Hadoop runs on a cluster of machines or nodes, and the data that is used is distributed across different nodes. This is actually the production mode of Hadoop; let's clarify or understand this mode in a better way in physical terminology. When you download Hadoop in a tar or zip file format, you could install it on a single system and run all the processes there, but in fully distributed mode we extract this tar or zip file onto each of the nodes in the Hadoop cluster and then use a particular node for a particular process. Once you distribute the processes among the nodes, you define which nodes work as masters and which ones work as slaves.

Configuring XML Files

Configuration files are the files located in the extracted tar.gz file under the etc/hadoop/ directory.
All configuration files in Hadoop are listed below (a short sketch of reading these settings from code follows this list):
1) hadoop-env.sh ->> It specifies the environment variables that affect the JDK used by the Hadoop daemons (bin/hadoop). Since the Hadoop framework is written in Java and uses the JRE, one of the environment variables set in hadoop-env.sh is JAVA_HOME.
2) core-site.xml ->> It is one of the important configuration files required for the runtime environment settings of a Hadoop cluster. It informs the Hadoop daemons where the NAMENODE runs in the cluster, and it tells the NameNode which IP address and port it should bind to.
3) hdfs-site.xml ->> It is one of the important configuration files required for the runtime environment settings of Hadoop. It contains the configuration settings for the NAMENODE, DATANODE, and SECONDARY NAMENODE, and it is used to specify the default block replication. The actual number of replicas can also be specified when a file is created.
4) mapred-site.xml ->> It is one of the important configuration files required for the runtime environment settings of Hadoop. It contains the configuration settings for MapReduce. In this file, we specify the framework to use for MapReduce by setting the mapreduce.framework.name property.
5) masters ->> It is used to determine the master nodes in a Hadoop cluster. It informs the Hadoop daemons of the location of the SECONDARY NAMENODE.
The masters file on a slave node is blank.
6) slaves ->> It is used to determine the slave nodes in a Hadoop cluster.
The slaves file at the master node contains a list of hosts, one per line.
The slaves file at a slave server contains the IP address of that slave node.
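
As a rough illustration of how the properties set in these XML files are read by a Hadoop program, the sketch below uses the org.apache.hadoop.conf.Configuration class. The property values shown in the comments are placeholder examples chosen for this sketch, not settings taken from these notes.

import org.apache.hadoop.conf.Configuration;

public class ShowConfig {
    public static void main(String[] args) {
        // Configuration automatically loads core-default.xml and core-site.xml
        // from the classpath; other resources can be added explicitly.
        Configuration conf = new Configuration();
        conf.addResource("hdfs-site.xml");
        conf.addResource("mapred-site.xml");

        // Typical properties set in the XML files (example values only):
        //   fs.defaultFS              -> hdfs://localhost:9000  (core-site.xml)
        //   dfs.replication           -> 3                      (hdfs-site.xml)
        //   mapreduce.framework.name  -> yarn                   (mapred-site.xml)
        System.out.println(conf.get("fs.defaultFS", "file:///"));
        System.out.println(conf.getInt("dfs.replication", 3));
        System.out.println(conf.get("mapreduce.framework.name", "local"));
    }
}
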
UNIT II
Map Reduce
MapReduce is a processing technique and a programming model for distributed computing; Hadoop's implementation of it is written in Java. The MapReduce algorithm contains two important tasks, namely Map and Reduce. Map takes a set of data and converts it into another set of data, where individual elements are broken down into tuples (key/value pairs). The reduce task takes the output from a map as its input and combines those data tuples into a smaller set of tuples. As the name MapReduce implies, the reduce task is always performed after the map task.

The major advantage of MapReduce is that it is easy to scale data processing over multiple
computing nodes. Under the MapReduce model, the data processing primitives are called
mappers and reducers. Decomposing a data processing application into mappers and reducers is
sometimes nontrivial. But, once we write an application in the MapReduce form, scaling the
application to run over hundreds, thousands, or even tens of thousands of machines in a cluster is
merely a configuration change. This simple scalability is what has attracted many programmers
to use the MapReduce model.

The Algorithm
​ Generally, the MapReduce paradigm is based on sending the computation to where the data resides.
​ A MapReduce program executes in three stages, namely the map stage, the shuffle stage, and the reduce stage (a minimal sketch of the map and reduce stages follows this list).
​ Map stage − The map or mapper’s job is to process the input data. Generally the
input data is in the form of file or directory and is stored in the Hadoop file
system (HDFS). The input file is passed to the mapper function line by line. The
mapper processes the data and creates several small chunks of data.
​ Reduce stage − This stage is the combination of the Shuffle stage and the
Reduce stage. The Reducer’s job is to process the data that comes from the
mapper. After processing, it produces a new set of output, which will be stored in
the HDFS.
​ During a MapReduce job, Hadoop sends the Map and Reduce tasks to the appropriate
servers in the cluster.
​ The framework manages all the details of data-passing such as issuing tasks, verifying
task completion, and copying data around the cluster between the nodes.
​ Most of the computing takes place on nodes with data on local disks that reduces the
network traffic.
​ After completion of the given tasks, the cluster collects and reduces the data to form an
appropriate result, and sends it back to the Hadoop server.
​ PayLoad − Applications implement the Map and the Reduce functions, and form the
core of the job.
​ Mapper − Mapper maps the input key/value pairs to a set of intermediate key/value pairs.
​ NameNode − Node that manages the Hadoop Distributed File System (HDFS).
​ DataNode − Node where data is present in advance, before any processing takes place.
​ MasterNode − Node where the JobTracker runs and which accepts job requests from clients.
​ SlaveNode − Node where the Map and Reduce programs run.
​ JobTracker − Schedules jobs and tracks the assigned jobs on the TaskTrackers.
​ TaskTracker − Tracks the task and reports status to the JobTracker.
​ Job − An execution of a Mapper and Reducer across a dataset.
​ Task − An execution of a Mapper or a Reducer on a slice of data.
​ Task Attempt − A particular instance of an attempt to execute a task on a SlaveNode.

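To make the stages above concrete, here is a minimal word-count sketch using the new MapReduce API. The class names WordCountMapper and WordCountReducer are illustrative only and do not appear elsewhere in these notes: the map stage emits intermediate (word, 1) pairs, the shuffle groups them by word, and the reduce stage sums them.

import java.io.IOException;
import java.util.StringTokenizer;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

// Map stage: each input line is split into words, and each word becomes (word, 1).
class WordCountMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
    private static final IntWritable ONE = new IntWritable(1);
    private final Text word = new Text();

    @Override
    public void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException {
        StringTokenizer itr = new StringTokenizer(value.toString());
        while (itr.hasMoreTokens()) {
            word.set(itr.nextToken());
            context.write(word, ONE);            // intermediate (K2, V2)
        }
    }
}

// Reduce stage: after the shuffle, all counts for the same word arrive together.
class WordCountReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
    @Override
    public void reduce(Text key, Iterable<IntWritable> values, Context context)
            throws IOException, InterruptedException {
        int sum = 0;
        for (IntWritable v : values) {
            sum += v.get();
        }
        context.write(key, new IntWritable(sum));  // final (K3, V3)
    }
}
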
Weather Data Analysis For Analyzing Hot And Cold Days
Weather sensors are collecting weather information across the globe in a large volume of log data. This weather data is semi-structured and record-oriented.
The data is stored in a line-oriented ASCII format, where each row represents a single record. Each row has many fields, such as longitude, latitude, daily max-min temperature, daily average temperature, etc. For simplicity, we will focus on the main element, i.e. temperature. We will use data from the National Centers for Environmental Information (NCEI). It has a massive amount of historical weather data that we can use for our data analysis.
The data files are organised by date and weather station.
STEP 1:
The datasets are processed using MapReduce: the analysis is expressed as a MapReduce program and run on Hadoop, where the raw input data is the data from NCEI.
STEP 2:
The Mapper then pulls out the year and the temperature, as per the requirements.
STEP 3:
The divided data is then reduced using the Reducer to generate the required output from it.
Classic MapReduce (MapReduce 1)

A job run in classic MapReduce is illustrated in Figure 1. At the highest level, there are four independent entities:

• The client, which submits the MapReduce job.

• The jobtracker, which coordinates the job run. The jobtracker is a Java application whose main
class is JobTracker.

• The tasktrackers, which run the tasks that the job has been split into. Tasktrackers are Java applications whose main class is TaskTracker.

• The distributed filesystem (normally HDFS, covered in Chapter 3), which is used for sharing
job files between the other entities.

Figure 1. How Hadoop runs a MapReduce job using the classic framework

Job Submission

The submit() method on Job creates an internal JobSubmitter instance and calls submitJobInternal() on it (step 1 in Figure 1). Having submitted the job, waitForCompletion() polls the job's progress once a second and reports the progress to the console if it has changed since the last report.

The job submission process implemented by JobSubmitter does the following:

• Asks the jobtracker for a new job ID (by calling getNewJobId() on JobTracker) (step 2).

• Checks the output specification of the job. For example, if the output directory has not been
specified or it already exists, the job is not submitted and an error is thrown to the MapReduce
program.

• Computes the input splits for the job. If the splits cannot be computed, because the input paths
don’t exist, for example, then the job is not submitted and an error is thrown to the MapReduce
program.

• Copies the resources needed to run the job, including the job JAR file, the configuration file, and the computed input splits, to the jobtracker's filesystem in a directory named after the job ID.
The job JAR is copied with a high replication factor (controlled by the mapred.submit.replication
property, which defaults to 10) so that there are lots of copies across the cluster for the
tasktrackers to access when they run tasks for the job (step 3).
• Tells the jobtracker that the job is ready for execution (by calling submitJob() on JobTracker)
(step 4).

Job Initialization

When the JobTracker receives a call to its submitJob() method, it puts it into an internal queue
from where the job scheduler will pick it up and initialize it (step 5).

To create the list of tasks to run, the job scheduler first retrieves the input splits computed
by the client from the shared filesystem (step 6). It then creates one map task for each split. The
number of reduce tasks to create is determined by the mapred.reduce.tasks property in the Job,
which is set by the setNumReduceTasks() method, and the scheduler simply creates this number
of reduce tasks to be run.

In addition to the map and reduce tasks, two further tasks are created: a job setup task and a job cleanup task. These are run by tasktrackers and are used to run code to set up the job before any map tasks run, and to clean up after all the reduce tasks are complete.
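
The number of reduce tasks is not computed from the input: it comes directly from the job property described above. A minimal sketch of setting it from the driver is shown below; the class name ReduceCountExample and the value 4 are arbitrary examples, not taken from these notes.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.mapreduce.Job;

public class ReduceCountExample {
    public static void main(String[] args) throws Exception {
        Job job = new Job(new Configuration());
        job.setJobName("reduce-count-example");
        // One map task is created per input split; the number of reduce tasks
        // comes from this setting (the classic-API property is mapred.reduce.tasks).
        job.setNumReduceTasks(4);   // 4 is an arbitrary example value
        System.out.println(job.getNumReduceTasks());
    }
}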

Task Assignment

Tasktrackers run a simple loop that periodically sends heartbeat method calls to the jobtracker.
Heartbeats tell the jobtracker that a tasktracker is alive. As a part of the heartbeat, a tasktracker
will indicate whether it is ready to run a new task, and if it is, the jobtracker will allocate it a
task, which it communicates to the tasktracker using the heartbeat return value (step 7).

Tasktrackers have a fixed number of slots for map tasks and for reduce tasks: for example, a tasktracker may be able to run two map tasks and two reduce tasks simultaneously. (The precise number depends on the number of cores and the amount of memory on the tasktracker.)

Task Execution

Now that the tasktracker has been assigned a task, the next step is for it to run the task. First, it localizes the job JAR by copying it from the shared filesystem to the tasktracker's filesystem. It also copies any files needed from the distributed cache by the application to the local disk. Second, it creates a local working directory for the task, and un-jars the contents of the JAR into this directory. Third, it creates an instance of TaskRunner to run the task. TaskRunner launches a new Java Virtual Machine (step 9) to run each task in (step 10), so that any bugs in the user-defined map and reduce functions don't affect the tasktracker.

Progress and Status Updates

MapReduce jobs are long-running batch jobs, taking anything from minutes to hours to run.
Because this is a significant length of time, it’s important for the user to get feedback on how the
job is progressing. A job and each of its tasks have a status, which includes such things as the
state of the job or task (e.g., running, successfully completed, failed), the progress of maps and
reduces, the values of the job’s counters, and a status message or description.

When a task is running, it keeps track of its progress, that is, the proportion of the task
completed. For map tasks, this is the proportion of the input that has been processed. For reduce
tasks, it’s a little more complex, but the system can still estimate the proportion of the reduce
input processed.

Job Completion

When the jobtracker receives a notification that the last task for a job is complete (this will be the
special job cleanup task), it changes the status for the job to “successful.” Then, when the Job
polls for status, it learns that the job has completed successfully, so it prints a message to tell the
user and then returns from the waitForCompletion() method. Last, the jobtracker cleans up its
working state for the job and instructs tasktrackers to do the same (so intermediate output is
deleted, for example).

For very large clusters in the region of 4000 nodes and higher, the MapReduce system
described in the previous section begins to hit scalability bottlenecks, so in 2010 a group at
Yahoo! began to design the next generation of MapReduce. The result was YARN, short for Yet
Another Resource Negotiator.

YARN (MapReduce 2)

YARN addresses the scalability shortcomings of "classic" MapReduce by splitting the responsibilities of the jobtracker into separate entities. The jobtracker takes care of both job scheduling (matching tasks with tasktrackers) and task progress monitoring (keeping track of tasks and restarting failed or slow tasks, and doing task bookkeeping such as maintaining counter totals).

YARN separates these two roles into two independent daemons:

I. Resource Manager

II. Application master

A resource manager manages the use of resources across the cluster, and an application master manages the lifecycle of applications running on the cluster.
The idea is that an application master negotiates with the resource manager for cluster
resources—described in terms of a number of containers each with a certain memory limit—then
runs application-specific processes in those containers. The containers are overseen by node
managers running on cluster nodes, which ensure that the application does not use more
resources than it has been allocated. In contrast to the jobtracker, each instance of an
application—here a MapReduce job —has a dedicated application master, which runs for the
duration of the application.

The beauty of YARN’s design is that different YARN applications can co-exist on the same
cluster—so a MapReduce application can run at the same time as an MPI application, for
example—which brings great benefits for manageability and cluster utilization. Furthermore, it is
even possible for users to run different versions of MapReduce on the same YARN cluster, which
makes the process of upgrading MapReduce more manageable.

Figure 2. How Hadoop runs a MapReduce job using YARN

MapReduce on YARN involves more entities than classic MapReduce. They are:

• The client, which submits the MapReduce job.

• The YARN resource manager, which coordinates the allocation of compute resources on the
cluster.

• The YARN node managers, which launch and monitor the compute containers on machines in
the cluster.

• The MapReduce application master, which coordinates the tasks running the MapReduce job.
The application master and the MapReduce tasks run in containers that are scheduled by the
resource manager, and managed by the node managers.

• The distributed filesystem (normally HDFS, covered in Chapter 3), which is used for sharing
job files between the other entities. The process of running a job is shown in Figure 2, and
described in the following sections.

Job Submission

Jobs are submitted in MapReduce 2 using the same user API as MapReduce 1 (step 1).
MapReduce 2 has an implementation of ClientProtocol that is activated when
mapreduce.framework.name is set to yarn. The submission process is very similar to the classic
implementation. The new job ID is retrieved from the resource manager rather than the
jobtracker.
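
A minimal client-side sketch of selecting the YARN runtime is shown below. In practice this property is normally set once in mapred-site.xml rather than in code, and the class name YarnSubmitSketch is purely illustrative.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.mapreduce.Job;

public class YarnSubmitSketch {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        // Selecting the YARN ClientProtocol implementation (MapReduce 2);
        // "local" would run everything in a single JVM instead.
        conf.set("mapreduce.framework.name", "yarn");
        Job job = new Job(conf);
        job.setJobName("yarn-submit-sketch");
        // submit() returns immediately; waitForCompletion() would poll progress.
        // job.submit();
    }
}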

Job Initialization
When the resource manager receives a call to its submitApplication(), it hands off the request to
the scheduler. The scheduler allocates a container, and the resource manager then launches the
application master’s process there, under the node manager’s management (steps 5a and 5b).

The application master initializes the job by creating a number of bookkeeping objects to
keep track of the job’s progress, as it will receive progress and completion reports from the tasks
(step 6). Next, it retrieves the input splits computed in the client from the shared filesystem (step
7). It then creates a map task object for each split, and a number of reduce task objects
determined by the mapreduce.job.reduces property.

Task Assignment

The application master requests containers for all the map and reduce tasks in the job from the resource manager (step 8). Each request, which is piggybacked on a heartbeat call, includes information about the map task's data locality, in particular the hosts and corresponding racks that the input split resides on.

Task Execution

Once a task has been assigned a container by the resource manager’s scheduler, the application
master starts the container by contacting the node manager (steps 9a and 9b). The task is
executed by a Java application whose main class is YarnChild. Before it can run the task it
localizes the resources that the task needs, including the job configuration and JAR file, and any
files from the distributed cache (step 10). Finally, it runs the map or reduce task (step 11).

Understanding Hadoop API for MapReduce Framework (Old and New)


Hadoop provides two Java MapReduce APIs named as old and new respectively.

There are several notable differences between the two APIs:

1. The new API favors abstract classes over interfaces, since these are easier to evolve. For example, you can add a method (with a default implementation) to an abstract class without breaking old implementations of the class. Thus, the Mapper and Reducer interfaces in the old API are abstract classes in the new API.

2. The new API is in the org.apache.hadoop.mapreduce package (and subpackages).

The old API can still be found in org.apache.hadoop.mapred.


3. The new API makes extensive use of context objects that allow the user code to communicate
with the MapReduce system. The new Context, for example, essentially unifies the role of the
JobConf, the OutputCollector, and the Reporter from the old API.

4. In both APIs, key-value record pairs are pushed to the mapper and reducer, but in addition, the
new API allows both mappers and reducers to control the execution flow by overriding the run()
method.

In the old API this is possible for mappers by writing a MapRunnable, but no equivalent exists
for reducers.

5. Configuration has been unified. The old API has a special JobConfobject for job
configuration. In the new API, this distinction is dropped, so job configuration is done through a
Configuration.

6. Job control is performed through the Job class in the new API, rather than the oldJobClient,
which no longer exists in the new API.

7. Output files are named slightly differently: in the old API both map and reduce outputs are named part-nnnnn, while in the new API map outputs are named part-m-nnnnn and reduce outputs are named part-r-nnnnn (where nnnnn is an integer designating the part number, starting from zero).

8. In the new API the reduce() method passes values as a java.lang.Iterable, rather than a java.lang.Iterator (as the old API does). This change makes it easier to iterate over the values using Java's for-each loop construct: for (VALUEIN value : values) { ... }
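
To make the contrast concrete, here is a sketch of the same identity mapper written against both APIs. The class names are illustrative only; the point is the package change and the replacement of OutputCollector and Reporter by a single Context object.

import java.io.IOException;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.MapReduceBase;
import org.apache.hadoop.mapred.Mapper;
import org.apache.hadoop.mapred.OutputCollector;
import org.apache.hadoop.mapred.Reporter;

// Old API (org.apache.hadoop.mapred): Mapper is an interface; output goes
// through OutputCollector and progress through Reporter.
class OldApiIdentityMapper extends MapReduceBase
        implements Mapper<LongWritable, Text, LongWritable, Text> {
    public void map(LongWritable key, Text value,
                    OutputCollector<LongWritable, Text> output, Reporter reporter)
            throws IOException {
        output.collect(key, value);
    }
}

// New API (org.apache.hadoop.mapreduce): Mapper is an abstract class, and a
// single Context object replaces JobConf, OutputCollector, and Reporter.
class NewApiIdentityMapper
        extends org.apache.hadoop.mapreduce.Mapper<LongWritable, Text, LongWritable, Text> {
    @Override
    public void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException {
        context.write(key, value);
    }
}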

Basic programs of Hadoop MapReduce:

Compilation and Execution of Process Units Program

Let us assume we are in the home directory of a Hadoop user (e.g. /home/hadoop).

Follow the steps given below to compile and execute the above program.

Step 1

The following command is to create a directory to store the compiled java classes.

$ mkdir units

Step 2

Download hadoop-core-1.2.1.jar, which is used to compile and execute the MapReduce program. Visit the following link mvnrepository.com to download the jar. Let us assume the downloaded folder is /home/hadoop/.

Step 3

The following commands are used for compiling the ProcessUnits.java program and
creating a jar for the program.

$ javac -classpath hadoop-core-1.2.1.jar -d units ProcessUnits.java

$ jar -cvf units.jar -C units/ .

Step 4

The following command is used to create an input directory in HDFS.

$HADOOP_HOME/bin/hadoop fs -mkdir input_dir

Step 5

The following command is used to copy the input file named sample.txt into the input directory of HDFS.
$HADOOP_HOME/bin/hadoop fs -put /home/hadoop/sample.txt input_dir

Step 6

The following command is used to verify the files in the input directory.

$HADOOP_HOME/bin/hadoop fs -ls input_dir/

Step 7

The following command is used to run the ProcessUnits application by taking the input files from the input directory.

$HADOOP_HOME/bin/hadoop jar units.jar hadoop.ProcessUnits input_dir output_dir

Wait for a while until the file is executed. After execution, as shown below, the output
will contain the number of input splits, the number of Map tasks, the number of reducer
tasks, etc.

INFO mapreduce.Job: Job job_1414748220717_0002 completed successfully

14/10/31 06:02:52 INFO mapreduce.Job: Counters: 49

File System Counters

FILE: Number of bytes read = 61

FILE: Number of bytes written = 279400

FILE: Number of read operations = 0

FILE: Number of large read operations = 0

FILE: Number of write operations = 0


HDFS: Number of bytes read = 546

HDFS: Number of bytes written = 40

HDFS: Number of read operations = 9

HDFS: Number of large read operations = 0

HDFS: Number of write operations = 2

Job Counters

Launched map tasks = 2

Launched reduce tasks = 1

Data-local map tasks = 2

Total time spent by all maps in occupied slots (ms) = 146137

Total time spent by all reduces in occupied slots (ms) = 441

Total time spent by all map tasks (ms) = 14613

Total time spent by all reduce tasks (ms) = 44120

Total vcore-seconds taken by all map tasks = 146137

Total vcore-seconds taken by all reduce tasks = 44120

Total megabyte-seconds taken by all map tasks = 149644288

Total megabyte-seconds taken by all reduce tasks = 45178880

Map-Reduce Framework

Map input records = 5

Map output records = 5

Map output bytes = 45


Map output materialized bytes = 67

Input split bytes = 208

Combine input records = 5

Combine output records = 5

Reduce input groups = 5

Reduce shuffle bytes = 6

Reduce input records = 5

Reduce output records = 5

Spilled Records = 10

Shuffled Maps = 2

Failed Shuffles = 0

Merged Map outputs = 2

GC time elapsed (ms) = 948

CPU time spent (ms) = 5160

Physical memory (bytes) snapshot = 47749120

Virtual memory (bytes) snapshot = 2899349504

Total committed heap usage (bytes) = 277684224

File Output Format Counters

Bytes Written = 40

Step 8

The following command is used to verify the resultant files in the output folder.
$HADOOP_HOME/bin/hadoop fs -ls output_dir/

Step 9

The following command is used to see the output in the part-00000 file. This file is generated by the MapReduce program.

$HADOOP_HOME/bin/hadoop fs -cat output_dir/part-00000

Below is the output generated by the MapReduce program.

1981 34

1984 40

1985 45

Step 10

The following command is used to copy the output folder from HDFS to the local file
system for analyzing.

$HADOOP_HOME/bin/hadoop fs -get output_dir /home/hadoop

Important Commands
All Hadoop commands are invoked by the $HADOOP_HOME/bin/hadoop command.
Running the Hadoop script without any arguments prints the description for all
commands.

Usage − hadoop [--config confdir] COMMAND

The following table lists the options available and their description.

Sr.No. Option & Description


1 namenode -format
Formats the DFS filesystem.

2 secondarynamenode
Runs the DFS secondary namenode.

3 namenode
Runs the DFS namenode.

4 datanode
Runs a DFS datanode.

5 dfsadmin
Runs a DFS admin client.

6 mradmin
Runs a Map-Reduce admin client.

7 fsck
Runs a DFS filesystem checking utility.

8 fs
Runs a generic filesystem user client.

9 balancer
Runs a cluster balancing utility.

10 oiv
Applies the offline fsimage viewer to an fsimage.

11 fetchdt
Fetches a delegation token from the NameNode.

12 jobtracker
Runs the MapReduce job Tracker node.

13 pipes
Runs a Pipes job.

14 tasktracker
Runs a MapReduce task Tracker node.

15 historyserver
Runs job history servers as a standalone daemon.
16 job
Manipulates the MapReduce jobs.

17 queue
Gets information regarding JobQueues.

18 version
Prints the version.

19 jar <jar>
Runs a jar file.

20 distcp <srcurl> <desturl>


Copies file or directories recursively.

21 distcp2 <srcurl> <desturl>


DistCp version 2.

22 archive -archiveName NAME -p <parent path> <src>* <dest>


Creates a hadoop archive.

23 classpath
Prints the class path needed to get the Hadoop jar and the required
libraries.

24 daemonlog
Get/Set the log level for each daemon

How to Interact with MapReduce Jobs

Usage − hadoop job [GENERIC_OPTIONS]

The following are the Generic Options available in a Hadoop job.

Sr.No. GENERIC_OPTION & Description

1 -submit <job-file>
Submits the job.

2 -status <job-id>
Prints the map and reduce completion percentage and all job counters.
3 -counter <job-id> <group-name> <countername>
Prints the counter value.

4 -kill <job-id>
Kills the job.

5 -events <job-id> <fromevent-#> <#-of-events>


Prints the events' details received by jobtracker for the given range.

6 -history [all] <jobOutputDir>
Prints job details, and failed and killed tip details. More details about the job, such as successful tasks and task attempts made for each task, can be viewed by specifying the [all] option.

7 -list [all]
Displays all jobs. -list displays only jobs which are yet to complete.

8 -kill-task <task-id>
Kills the task. Killed tasks are NOT counted against failed attempts.

9 -fail-task <task-id>
Fails the task. Failed tasks are counted against failed attempts.

10 -set-priority <job-id> <priority>


Changes the priority of the job. Allowed priority values are
VERY_HIGH, HIGH, NORMAL, LOW, VERY_LOW

To see the status of job

$ $HADOOP_HOME/bin/hadoop job -status <JOB-ID>

e.g.

$ $HADOOP_HOME/bin/hadoop job -status job_201310191043_0004

To see the history of job output-dir


$ $HADOOP_HOME/bin/hadoop job -history <DIR-NAME>

e.g.

$ $HADOOP_HOME/bin/hadoop job -history /user/expert/output

To kill the job

$ $HADOOP_HOME/bin/hadoop job -kill <JOB-ID>

e.g.

$ $HADOOP_HOME/bin/hadoop job -kill job_201310191043_0004

Driver code

The driver initializes the job, instructs the Hadoop platform to execute your code on a set of input files, and controls where the output files are placed.

A Job object forms the specification of the job. It gives you control over how the job is run.
When we run this job on a Hadoop cluster, we will package the code into a JAR file (which
Hadoop will distribute around the cluster). Rather than explicitly specify the name of the JAR
file, we can pass a class in the Job’s setJarByClass() method, which Hadoop will use to locate the
relevant JAR file by looking for the JAR file containing this class.

Having constructed a Job object, we specify the input and output paths. An input path is
specified by calling the static addInputPath() method on FileInputFormat, and it can be a single
file, a directory (in which case, the input forms all the files in that directory), or a file pattern.

The output path (of which there is only one) is specified by the static setOutputPath() method on
FileOutputFormat. It specifies a directory where the output files from the reducer functions are
written.

Next, we specify the map and reduce types to use via the setMapperClass() and
setReducerClass() methods.

The setOutputKeyClass() and setOutputValueClass() methods control the output types for the
map and the reduce functions, which are often the same, as they are in our case.

If they are different, then the map output types can be set using the methods
setMapOutputKeyClass() and setMapOutputValueClass().

The input types are controlled via the input format, which we have not explicitly set since we are
using the default TextInputFormat.

After setting the classes that define the map and reduce functions, we are ready to run the job.
The waitForCompletion() method on Job submits the job and waits for it to finish.

The return value of the waitForCompletion() method is a boolean indicating success (true) or
failure (false), which we translate into the program’s exit code of 0 or 1. The driver code for
weather program is specified below.

import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class MaxTemperature {
  public static void main(String[] args) throws Exception {
    if (args.length != 2) {
      System.err.println("Usage: MaxTemperature <input path> <output path>");
      System.exit(-1);
    }
    // Create the job and let Hadoop locate the JAR containing this class
    Job job = new Job();
    job.setJarByClass(MaxTemperature.class);
    job.setJobName("Max temperature");
    // Input and output paths
    FileInputFormat.addInputPath(job, new Path(args[0]));
    FileOutputFormat.setOutputPath(job, new Path(args[1]));
    // Mapper, reducer, and output key/value types
    job.setMapperClass(MaxTemperatureMapper.class);
    job.setReducerClass(MaxTemperatureReducer.class);
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(IntWritable.class);
    // Submit the job and wait; the exit code reflects success or failure
    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}

Mapper code

The Mapper class is a generic type, with four formal type parameters that specify the input key,
input value, output key, and output value types of the map function. For the present example, the
input key is a long integer offset, the input value is a line of text, the output key is a year, and the
output value is an air temperature (an integer).

It converts the Text value containing the line of input into a Java String, then uses substring() to extract the columns.

The following example shows the implementation of our map method.

import java.io.IOException;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

public class MaxTemperatureMapper
    extends Mapper<LongWritable, Text, Text, IntWritable> {

  private static final int MISSING = 9999;

  @Override
  public void map(LongWritable key, Text value, Context context)
      throws IOException, InterruptedException {
    String line = value.toString();
    String year = line.substring(15, 19);
    int airTemperature;
    if (line.charAt(87) == '+') {
      // parseInt doesn't like leading plus signs, so skip it
      airTemperature = Integer.parseInt(line.substring(88, 92));
    } else {
      airTemperature = Integer.parseInt(line.substring(87, 92));
    }
    String quality = line.substring(92, 93);
    if (airTemperature != MISSING && quality.matches("[01459]")) {
      context.write(new Text(year), new IntWritable(airTemperature));
    }
  }
}
The map() method is passed a key and a value. We convert the Text value containing the
line of input into a Java String, then use its substring() method to extract the columns we are
interested in.

The map() method also provides an instance of Context to write the output to. In this case, we write the year as a Text object (since we are just using it as a key), and the temperature is wrapped in an IntWritable. We write an output record only if the temperature is present and the quality code indicates the temperature reading is OK.

Reducer code

Again, four formal type parameters are used to specify the input and output types, this time for
the reduce function. The input types of the reduce function must match the output types of the
map function: Text and IntWritable. And in this case, the output types of the reduce function are
Text and IntWritable, for a year and its maximum temperature, which we find by iterating
through the temperatures and comparing each with a record of the highest found so far.

import java.io.IOException;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;

public class MaxTemperatureReducer
    extends Reducer<Text, IntWritable, Text, IntWritable> {

  @Override
  public void reduce(Text key, Iterable<IntWritable> values, Context context)
      throws IOException, InterruptedException {
    int maxValue = Integer.MIN_VALUE;
    // Find the highest temperature seen for this year
    for (IntWritable value : values) {
      maxValue = Math.max(maxValue, value.get());
    }
    context.write(key, new IntWritable(maxValue));
  }
}

Record Reader

RecordReader is responsible for creating the key/value pairs that are fed to the Map task to process.
Each InputFormat has to provide its own RecordReader implementation to generate key/value pairs.
For example, the default TextInputFormat provides LineRecordReader, which generates the byte offset within the file as the key and the newline-separated line of the input file as the value.
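
As a rough sketch of this contract, the hypothetical input format below (the class name MyTextInputFormat is not from these notes) simply reuses the stock LineRecordReader, which is essentially what TextInputFormat does:

import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.InputSplit;
import org.apache.hadoop.mapreduce.RecordReader;
import org.apache.hadoop.mapreduce.TaskAttemptContext;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.input.LineRecordReader;

public class MyTextInputFormat extends FileInputFormat<LongWritable, Text> {
  @Override
  public RecordReader<LongWritable, Text> createRecordReader(
      InputSplit split, TaskAttemptContext context) {
    // Reuse the stock LineRecordReader: key = byte offset, value = line text.
    return new LineRecordReader();
  }
}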

Combiner code

Many MapReduce jobs are limited by the bandwidth available on the cluster, so it pays to minimize the data transferred between map and reduce tasks. Hadoop allows the user to specify a combiner function to be run on the map output; the combiner function's output forms the input to the reduce function. Since the combiner function is an optimization, Hadoop does not provide a guarantee of how many times it will call it for a particular map output record, if at all. In other words, calling the combiner function zero, one, or many times should produce the same output from the reducer. The combiner function doesn't replace the reduce function, but it can help cut down the amount of data shuffled between the maps and the reduces.

import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class MaxTemperatureWithCombiner {
  public static void main(String[] args) throws Exception {
    if (args.length != 2) {
      System.err.println("Usage: MaxTemperatureWithCombiner <input path> <output path>");
      System.exit(-1);
    }
    Job job = new Job();
    job.setJarByClass(MaxTemperatureWithCombiner.class);
    job.setJobName("Max temperature");
    FileInputFormat.addInputPath(job, new Path(args[0]));
    FileOutputFormat.setOutputPath(job, new Path(args[1]));
    job.setMapperClass(MaxTemperatureMapper.class);
    // The reducer doubles as the combiner because max() is commutative and associative
    job.setCombinerClass(MaxTemperatureReducer.class);
    job.setReducerClass(MaxTemperatureReducer.class);
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(IntWritable.class);
    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}

Partitioner code

The partitioning phase takes place after the map phase and before the reduce phase. The number
of partitions is equal to the number of reducers. The data gets partitioned across the reducers
according to the partitioning function. The difference between a partitioner and a combiner is
that the partitioner divides the data according to the number of reducers so that all the data in a
single partition gets executed by a single reducer. However, the combiner functions similar to the
reducer and processes the data in each partition. The combiner is an optimization to the reducer.
The default partitioning function is the hash partitioning function, where the hashing is done on the key. However, it might be useful to partition the data according to some other function of the key or the value, as in the sketch below.
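
These notes do not include a partitioner listing, so here is a minimal, hypothetical sketch for the weather example above: it partitions the year keys by decade instead of by the default hash. The class name DecadePartitioner is illustrative; to plug it in you would call job.setPartitionerClass(DecadePartitioner.class) in the driver.

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Partitioner;

public class DecadePartitioner extends Partitioner<Text, IntWritable> {
  @Override
  public int getPartition(Text key, IntWritable value, int numPartitions) {
    // Keys are assumed to be four-digit years; send whole decades to the same reducer.
    int decade = Integer.parseInt(key.toString()) / 10;
    // Mask off the sign bit so the partition number is always non-negative.
    return (decade & Integer.MAX_VALUE) % numPartitions;
  }
}
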
UNIT III

Serialization
Serialization is the process of turning structured objects into a byte stream for
transmission over a network or for writing to persistent storage. Deserialization is the
reverse process of turning a byte stream back into a series of structured objects.
Serialization appears in two quite distinct areas of distributed data processing: for
interprocess communication and for persistent storage. In Hadoop, interprocess
communication between nodes in the system is implemented using remote procedure
calls (RPCs). The RPC protocol uses serialization to render the message into a binary
stream to be sent to the remote node, which then deserializes the binary stream into the
original message. In general, an RPC serialization format is:
Compact
A compact format makes the best use of network bandwidth, which is the most scarce
resource in a data center.
Fast
Interprocess communication forms the backbone for a distributed system, so it is essential
that there is as little performance overhead as possible for the serialization and
deserialization process.
Extensible
Protocols change over time to meet new requirements, so it should be straightforward to
evolve the protocol in a controlled manner for clients and servers.
Interoperable
For some systems, it is desirable to be able to support clients that are written in different
languages to the server, so the format needs to be designed to make this possible.
Hadoop uses its own serialization format, Writables, which is certainly compact and fast,
but not so easy to extend or use from languages other than Java.

Writable is an interface in Hadoop, and the types used in Hadoop must implement this interface. Hadoop provides Writable wrappers for almost all Java primitive types and some other types, but sometimes we need to pass custom objects, and these custom objects should implement Hadoop's Writable interface. Hadoop MapReduce uses implementations of Writables for interacting with user-provided Mappers and Reducers.
To implement the Writable interface we require two methods:
public interface Writable {
void readFields(DataInput in) throws IOException;
void write(DataOutput out) throws IOException;
}
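
The snippets later in this unit call a serialize() helper that is not defined anywhere in these notes; a minimal version of such a helper (the class name WritableSerializer is illustrative) might look like this:

import java.io.ByteArrayOutputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import org.apache.hadoop.io.Writable;

public class WritableSerializer {
    // Turn any Writable into its raw byte representation.
    public static byte[] serialize(Writable writable) throws IOException {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        DataOutputStream dataOut = new DataOutputStream(out);
        writable.write(dataOut);
        dataOut.close();
        return out.toByteArray();
    }
}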

Why are Writables Introduced in Hadoop?

Now the question is whether Writables are necessary for Hadoop. The Hadoop framework definitely needs a Writable type of interface in order to perform the following tasks:
● implement serialization,
● transfer data between the nodes of a cluster and across networks, and
● store the deserialized data on the local disk of the system.
Implementing Writable is similar to implementing an interface in Java: it is done by simply writing the keyword 'implements' and overriding the default Writable methods.
Writable is a powerful interface in Hadoop which, while serializing the data, reduces the data size enormously, so that data can be exchanged easily within the network. It has separate read and write methods (readFields() and write()) to read data from the network and write data to the local disk, respectively. Every data type used inside Hadoop should accept the Writable and Comparable interface properties.
We have seen how Writables reduce the data size overhead and make data transfer easier in the network.
Why use Hadoop Writable(s)?
As we already know, data needs to be transmitted between different nodes in a distributed
computing environment. This requires serialization and deserialization of data to convert
the data that is in structured format to byte stream and vice-versa.
Hadoop therefore uses a simple and efficient serialization protocol to serialize data between the map and reduce phases, and these types are called Writable(s). Some examples of Writables, as already mentioned before, are IntWritable, LongWritable, BooleanWritable, and FloatWritable.
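
As a tiny illustration, a Java int can be wrapped in and unwrapped from its Writable counterpart with set() and get():

import org.apache.hadoop.io.IntWritable;

public class IntWritableDemo {
    public static void main(String[] args) {
        // Wrap a Java int in its Writable counterpart.
        IntWritable writable = new IntWritable();
        writable.set(163);
        System.out.println(writable.get());   // prints 163
    }
}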

The WritableComparable interface is just a subinterface of the Writable and java.lang.Comparable interfaces. For implementing a WritableComparable we must have a compareTo method apart from the readFields and write methods, as shown below:
public interface WritableComparable extends Writable, Comparable
{
void readFields(DataInput in);
void write(DataOutput out);
int compareTo(WritableComparable o);
}
Comparison of types is crucial for MapReduce, where there is a sorting phase during which keys are compared with one another. Implementing a comparator for WritableComparables, like the org.apache.hadoop.io.RawComparator interface, will definitely help speed up your Map/Reduce (MR) jobs. As you may recall, an MR job is composed of receiving and sending key-value pairs. The process looks like the following.

(K1,V1) –> Map –> (K2,V2)


(K2,List[V2]) –> Reduce –> (K3,V3)

The key-value pairs (K2,V2) are called the intermediary key-value pairs. They are passed
from the mapper to the reducer. Before these intermediary key-value pairs reach the
reducer, a shuffle and sort step is performed.
The shuffle is the assignment of the intermediary keys (K2) to reducers, and the sort is the sorting of these keys. Here, implementing a RawComparator to compare the intermediary keys greatly improves sorting. Sorting is improved because the RawComparator compares the keys byte by byte. If we did not use a RawComparator, the intermediary keys would have to be completely deserialized to perform a comparison.

Note (In Short):


1)WritableComparables can be compared to each other, typically via Comparators. Any
type which is to be used as a key in the Hadoop Map-Reduce framework should
implement this interface.
2) Any type which is to be used as a value in the Hadoop Map-Reduce framework should implement the Writable interface.

Writables and their Importance in Hadoop

Writable is an interface in Hadoop. Writable in Hadoop acts as a wrapper class for almost all the primitive data types of Java. That is how int in Java becomes IntWritable in Hadoop and String in Java becomes Text in Hadoop. Writables are used for creating serialized data types in Hadoop.
So, let us start by understanding what a data type, an interface, and serialization are.

Data Type

A data type is a set of data with values having predefined characteristics. There are several kinds of data types in Java, for example int, short, byte, long, char, etc. These are called primitive data types.
All these primitive data types are bound to classes called wrapper classes. For example, int is bound to the wrapper class Integer, short to Short, byte to Byte, and long to Long. These wrapper classes are predefined in Java.

Interface in Java

An interface in Java is a completely abstract class. The methods within an interface are abstract methods which do not have a body, and the fields within the interface are public, static, and final, which means that the fields cannot be modified.
The structure of an interface is most like that of a class. We cannot create an object of an interface, and the only way to use the interface is to implement it in another class by using the 'implements' keyword.

Writable Classes
Hadoop comes with a large selection of Writable classes in the org.apache.hadoop.io
package. They form the class hierarchy shown in Figure 1.

Figure 1. Writable class hierarchy


Writable wrappers for Java primitives
There are Writable wrappers for all the Java primitive types except char (which can be
stored in an IntWritable)
as shown in Table 1. All have a get() and a set() method for retrieving and storing the
wrapped value.

Table 1. Writable wrapper classes for java primitives


When encoding integers, there is a choice between the fixed-length formats (IntWritable
and LongWritable) and the variable-length formats (VIntWritable and VLongWritable).
The variable-length formats use only a single byte to encode the value if it is small
enough (between –112 and 127, inclusive); otherwise, they use the first byte to indicate
whether the value is positive or negative, and how many bytes follow.
For example, 163 requires two bytes:
byte[] data = serialize(new VIntWritable(163));
assertThat(StringUtils.byteToHexString(data), is("8fa3"));
Fixed-length encodings are good when the distribution of values is fairly uniform across the whole value space, such as with a (well-designed) hash function. Most numeric variables tend to have non-uniform distributions, and on average the variable-length encoding will save space. Another advantage of variable-length encodings is that you can switch from VIntWritable to VLongWritable, because their encodings are actually the same.
Text
Text is a Writable for UTF-8 sequences. It can be thought of as the Writable equivalent of
java.lang.String. The Text class uses an int (with a variable-length encoding) to store the
number of bytes in the string encoding, so the maximum value is 2 GB.
Indexing
Because of its emphasis on using standard UTF-8, there are some differences between
Text and the Java String class.
Here is an example to demonstrate the use of the charAt() method:
Text t = new Text("hadoop");
assertThat(t.getLength(), is(6));
assertThat(t.getBytes().length, is(6));
assertThat(t.charAt(2), is((int) 'd'));
assertThat("Out of bounds", t.charAt(100), is(-1));
Notice that charAt() returns an int representing a Unicode code point, unlike the String
variant that returns a char. Text also has a find() method, which is analogous to String’s
indexOf():
Text t = new Text("hadoop");
assertThat("Find a substring", t.find("do"), is(2));
assertThat("Finds first 'o'", t.find("o"), is(3));
assertThat("Finds 'o' from position 4 or later", t.find("o", 4), is(4));
assertThat("No match", t.find("pig"), is(-1));
Unicode:
When we start using characters that are encoded with more than a single byte, the
differences between Text and String become clear. Consider the Unicode characters
shown in Table2.

Table 2. Unicode characters


All but the last character in the table, U+10400, can be expressed using a single Java
char. U+10400 is a supplementary character and is represented by two Java chars, known
as a surrogate pair. The following example show the differences between String and Text
when processing a string of the four characters from Table 2.
public class StringTextComparisonTest {
@Test
public void string() throws UnsupportedEncodingException {
String s = "\u0041\u00DF\u6771\uD801\uDC00";
assertThat(s.length(), is(5));
assertThat(s.getBytes("UTF-8").length, is(10));
assertThat(s.indexOf("\u0041"), is(0));
assertThat(s.indexOf("\u00DF"), is(1));
assertThat(s.indexOf("\u6771"), is(2));
assertThat(s.indexOf("\uD801\uDC00"), is(3));
assertThat(s.charAt(0), is('\u0041'));
assertThat(s.charAt(1), is('\u00DF'));
assertThat(s.charAt(2), is('\u6771'));
assertThat(s.charAt(3), is('\uD801'));
assertThat(s.charAt(4), is('\uDC00'));
assertThat(s.codePointAt(0), is(0x0041));
assertThat(s.codePointAt(1), is(0x00DF));
assertThat(s.codePointAt(2), is(0x6771));
assertThat(s.codePointAt(3), is(0x10400));
}
@Test
public void text() {
Text t = new Text("\u0041\u00DF\u6771\uD801\uDC00");
assertThat(t.getLength(), is(10));
assertThat(t.find("\u0041"), is(0));
assertThat(t.find("\u00DF"), is(1));
assertThat(t.find("\u6771"), is(3));
assertThat(t.find("\uD801\uDC00"), is(6));
assertThat(t.charAt(0), is(0x0041));
assertThat(t.charAt(1), is(0x00DF));
assertThat(t.charAt(3), is(0x6771));
assertThat(t.charAt(6), is(0x10400));
}
}
Example . Tests showing the differences between the String and Text classes

The test confirms that the length of a String is the number of char code units it contains
(5, one from each of the first three characters in the string, and a surrogate pair from the
last), whereas the length of a Text object is the number of bytes in its UTF-8 encoding
(10 = 1+2+3+4). Similarly, the indexOf() method in String returns an index in char code
units, and find() for Text is a byte offset.
The charAt() method in String returns the char code unit for the given index, which in the
case of a surrogate pair will not represent a whole Unicode character. The codePointAt()
method, indexed by char code unit, is needed to retrieve a single Unicode character
represented as an int. In fact, the charAt() method in Text is more like the codePointAt()
method than its namesake in String. The only difference is that it is indexed by byte
offset.
Iteration
Iterating over the Unicode characters in Text is complicated by the use of byte offsets for
indexing, since you can’t just increment the index. Turn the Text object into a
java.nio.ByteBuffer, then repeatedly call the bytesToCodePoint() static method on Text
with the buffer. This method extracts the next code point as an int and updates the
position in the buffer. The end of the string is detected when bytesToCodePoint() returns
–1. See the following example.
public class TextIterator
{
public static void main(String[] args)
{
Text t = new Text("\u0041\u00DF\u6771\uD801\uDC00");
ByteBuffer buf = ByteBuffer.wrap(t.getBytes(), 0, t.getLength());
int cp;
while (buf.hasRemaining() && (cp = Text.bytesToCodePoint(buf)) != -1)
{
System.out.println(Integer.toHexString(cp));
}
}
}
Example . Iterating over the characters in a Text object
Running the program prints the code points for the four characters in the string:
% hadoop TextIterator
41
df
6771
10400
Another difference with String is that Text is mutable. We can reuse a Text instance by
calling one of the set() methods on it. For example:
Text t = new Text("hadoop");
t.set("pig");
assertThat(t.getLength(), is(3));
assertThat(t.getBytes().length, is(3));
Resorting to String

Text doesn't have as rich an API for manipulating strings as java.lang.String, so in many cases, you need to convert the Text object to a String. This is done in the usual way, using the toString() method:
assertThat(new Text("hadoop").toString(), is("hadoop"));

BytesWritable
BytesWritable is a wrapper for an array of binary data. Its serialized format is an integer
field (4 bytes) that specifies the number of bytes to follow, followed by the bytes
themselves. For example, the byte array of length two with values 3 and 5 is serialized as
a 4-byte integer (00000002) followed by the two bytes from the array (03 and 05):
BytesWritable b = new BytesWritable(new byte[] { 3, 5 });
byte[] bytes = serialize(b);
assertThat(StringUtils.byteToHexString(bytes), is("000000020305"));
BytesWritable is mutable, and its value may be changed by calling its set() method.
NullWritable
NullWritable is a special type of Writable, as it has a zero-length serialization. No bytes
are written to, or read from, the stream. It is used as a placeholder; for example, in
MapReduce, a key or a value can be declared as a NullWritable when you don’t need to
use that position—it effectively stores a constant empty value. NullWritable can also be
useful as a key in SequenceFile when you want to store a list of values, as opposed to
key-value pairs.
ObjectWritable and GenericWritable
ObjectWritable is a general-purpose wrapper for the following: Java primitives, String,
enum, Writable, null, or arrays of any of these types.
GenericWritable is useful when a field can be of more than one type: for example, if the values in a SequenceFile have multiple types, then you can declare the value type as a GenericWritable and wrap each type in a GenericWritable.
Writable collections
There are six Writable collection types in the org.apache.hadoop.io package: ArrayWritable, ArrayPrimitiveWritable, TwoDArrayWritable, MapWritable, SortedMapWritable, and EnumSetWritable.
ArrayWritable and TwoDArrayWritable are Writable implementations for arrays and
two-dimensional arrays (array of arrays) of Writable instances. All the elements of an
ArrayWritable or a TwoDArrayWritable must be instances of the same class, which is
specified at construction, as follows:
ArrayWritable writable = new ArrayWritable(Text.class);
In contexts where the Writable is defined by type, such as in SequenceFile keys or values,
or as input to MapReduce in general, you need to subclass ArrayWritable (or
TwoDArrayWritable, as appropriate) to set the type statically. For example:
public class TextArrayWritable extends ArrayWritable
{
public TextArrayWritable()
{
super(Text.class);
}
}
ArrayWritable and TwoDArrayWritable both have get() and set() methods, as well as a
toArray() method, which creates a shallow copy of the array.

ArrayPrimitiveWritable is a wrapper for arrays of Java primitives. The component type is


detected when you call set(), so there is no need to subclass to set the type.
MapWritable and SortedMapWritable are implementations of
java.util.Map<Writable,Writable> and java.util.SortedMap<WritableComparable,
Writable>, respectively. Here’s a demonstration of using a MapWritable with different
types for keys and values:
MapWritable src = new MapWritable();
src.put(new IntWritable(1), new Text("cat"));
src.put(new VIntWritable(2), new LongWritable(163));
MapWritable dest = new MapWritable();
WritableUtils.cloneInto(dest, src);
assertThat((Text) dest.get(new IntWritable(1)), is(new Text("cat")));
assertThat((LongWritable) dest.get(new VIntWritable(2)), is(new
LongWritable(163)));
Conspicuous by their absence are Writable collection implementations for sets and lists.
A general set can be emulated by using a MapWritable (or a SortedMapWritable for a
sorted set), with NullWritable values. There is also EnumSetWritable for sets of enum
types. For lists of a single type of Writable, ArrayWritable is adequate, but to store
different types of Writable in a single list, you can use GenericWritable to wrap the
elements in an ArrayWritable.
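
As a small sketch of the point above about emulating a set, a MapWritable with NullWritable values behaves like a Writable set of keys (the class name WritableSetDemo is illustrative):

import org.apache.hadoop.io.MapWritable;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.io.Text;

public class WritableSetDemo {
    public static void main(String[] args) {
        // Emulate a Writable "set" of Text keys: the values are just placeholders.
        MapWritable set = new MapWritable();
        set.put(new Text("cat"), NullWritable.get());
        set.put(new Text("dog"), NullWritable.get());
        System.out.println(set.containsKey(new Text("cat")));  // true
    }
}
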
Implementing a Custom Writable
Hadoop comes with a useful set of Writable implementations that serve most purposes; however, on occasion, you may need to write your own custom implementation. With a custom Writable, you have full control over the binary representation and the sort order. Because Writables are at the heart of the MapReduce data path, tuning the binary representation can have a significant effect on performance. To demonstrate how to create a custom Writable, we shall write an implementation that represents a pair of strings, called TextPair. The basic implementation is shown in the following example.
import java.io.*;
import org.apache.hadoop.io.*;
public class TextPair implements WritableComparable<TextPair>
{
private Text first;
private Text second;
public TextPair() {
set(new Text(), new Text());
}
public TextPair(String first, String second)
{
set(new Text(first), new Text(second));
}
public TextPair(Text first, Text second)
{
set(first, second);
}
public void set(Text first, Text second)
{
this.first = first;
this.second = second;
}
public Text getFirst()
{
return first;
}
public Text getSecond()
{
return second;
}
@Override
public void write(DataOutput out) throws IOException
{
first.write(out);
second.write(out);
}
@Override
public void readFields(DataInput in) throws IOException
{
first.readFields(in);
second.readFields(in);
}
@Override
public int hashCode()
{
return first.hashCode() * 163 + second.hashCode();
}
@Override
public boolean equals(Object o)
{
if (o instanceof TextPair) {
TextPair tp = (TextPair) o;
return first.equals(tp.first) && second.equals(tp.second);
}
return false;
}
@Override
public String toString()
{
return first + "\t" + second;
}
@Override
public int compareTo(TextPair tp)
{
int cmp = first.compareTo(tp.first);
if (cmp != 0) {
return cmp;
}
return second.compareTo(tp.second);
}
}

Example . A Writable implementation that stores a pair of Text objects


The first part of the implementation is straightforward: there are two Text instance
variables, first and second, and associated constructors, getters, and setters. All Writable
implementations must have a default constructor so that the MapReduce framework can
instantiate them, then populate their fields by calling readFields().
TextPair’s write() method serializes each Text object in turn to the output stream, by
delegating to the Text objects themselves. Similarly, readFields() deserializes the bytes
from the input stream by delegating to each Text object. The DataOutput and DataInput
interfaces have a rich set of methods for serializing and deserializing Java Primitives.
Just as you would for any value object you write in Java, you should override the
hashCode(), equals(), and toString() methods from java.lang.Object. The hashCode()
method is used by the HashPartitioner (the default partitioner in MapReduce) to choose a
reduce partition, so you should make sure that you write a good hash function that mixes
well to ensure reduce partitions are of a similar size.
If you ever plan to use your custom Writable with TextOutputFormat, then you must
implement its toString() method. TextOutputFormat calls toString() on keys and values
for their output representation. For TextPair, we write the underlying Text objects as
strings separated by a tab character.
TextPair is an implementation of WritableComparable, so it provides an implementation
of the compareTo() method that imposes the ordering you would expect: it sorts by the
first string followed by the second.
Implementing a RawComparator for speed
In the above example, when TextPair is used as a key in MapReduce, it has to be deserialized into an object before the compareTo() method can be invoked. It is possible, however, to compare two TextPair objects directly from their serialized representations, because a TextPair is the concatenation of two Text objects, and the binary representation of a Text object is a variable-length integer containing the number of bytes in the UTF-8 representation of the string, followed by the UTF-8 bytes themselves. The trick is to read the initial length, so we know how long the first Text object's byte representation is; then we can delegate to Text's RawComparator and invoke it with the appropriate offsets for the first or second string. Consider the following example for more details.
public static class Comparator extends WritableComparator
{
private static final Text.Comparator TEXT_COMPARATOR = new Text.Comparator();
public Comparator()
{
super(TextPair.class);
}
@Override
public int compare(byte[] b1, int s1, int l1,
byte[] b2, int s2, int l2)
{
try
{
int firstL1 = WritableUtils.decodeVIntSize(b1[s1]) + readVInt(b1, s1);
int firstL2 = WritableUtils.decodeVIntSize(b2[s2]) + readVInt(b2, s2);
int cmp = TEXT_COMPARATOR.compare(b1, s1, firstL1, b2, s2, firstL2);

if (cmp != 0)
{
return cmp;
}
return TEXT_COMPARATOR.compare(b1, s1 + firstL1, l1 - firstL1,
b2, s2 + firstL2, l2 - firstL2);
}
catch (IOException e)
{
throw new IllegalArgumentException(e);
}
}
}
static
{
WritableComparator.define(TextPair.class, new Comparator());
}
Example 4-8. A RawComparator for comparing TextPair byte representations
We actually subclass WritableComparator rather than implement RawComparator
directly, since it provides some convenience methods and default implementations. The
subtle part of this code is calculating firstL1 and firstL2, the lengths of the first Text field
in each byte stream. Each is made up of the length of the variable-length integer (returned
by decodeVIntSize() on WritableUtils) and the value it is encoding (returned by
readVInt()).
The static block registers the raw comparator so that whenever MapReduce sees the
TextPair class, it knows to use the raw comparator as its default comparator.
Custom comparators
As we can see with TextPair, writing raw comparators takes some care, since you have to deal with details at the byte level. It is also sometimes useful to define a custom comparator, one that implements a different sort order from the natural sort order defined by the default comparator; where possible, custom comparators should also be written as RawComparators. The following example shows a comparator for TextPair, called FirstComparator, that considers only the first string of the pair. Note that we override the compare() method that takes objects so that both compare() methods have the same semantics.
public static class FirstComparator extends WritableComparator
{
private static final Text.Comparator TEXT_COMPARATOR = new Text.Comparator();
public FirstComparator() {
super(TextPair.class);
}
@Override
public int compare(byte[] b1, int s1, int l1,
byte[] b2, int s2, int l2)
{
try
{
int firstL1 = WritableUtils.decodeVIntSize(b1[s1]) + readVInt(b1, s1);
int firstL2 = WritableUtils.decodeVIntSize(b2[s2]) + readVInt(b2, s2);
return TEXT_COMPARATOR.compare(b1, s1, firstL1, b2, s2, firstL2);
}
catch (IOException e)
{
throw new IllegalArgumentException(e);
}
}
@Override
public int compare(WritableComparable a, WritableComparable b)
{
if (a instanceof TextPair && b instanceof TextPair)
{
return ((TextPair) a).first.compareTo(((TextPair) b).first);
}
return super.compare(a, b);
}
}
Example . A custom RawComparator for comparing the first field of TextPair byte
representations
UNIT IV

Java MapReduce programs and the Hadoop Distributed File System (HDFS) provide you with a
powerful distributed computing framework, but they come with one major drawback — relying
on them limits the use of Hadoop to Java programmers who can think in Map and Reduce terms
when writing programs. Pig is a programming tool attempting to have the best of both worlds: a
declarative query language inspired by SQL and a low-level procedural programming language
that can generate MapReduce code. This lowers the bar when it comes to the level of technical
knowledge needed to exploit the power of Hadoop. Pig was initially developed at Yahoo! in
2006 as part of a research project tasked with coming up with ways for people using Hadoop to
focus more on analyzing large data sets rather than spending lots of time writing Java
MapReduce programs. The goal here was a familiar one: Allow users to focus more on what
they want to do and less on how it's done. Not long after, in 2007, Pig officially became an
Apache project. As such, it is included in most Hadoop distributions.
The Pig programming language is designed to handle any kind of data tossed its way: structured, semi-structured, or unstructured data. Pigs, of course, have a reputation for eating anything they come across. According to the Apache Pig philosophy, pigs eat anything, live anywhere, are domesticated, and can fly to boot. Pigs "living anywhere" refers to the fact that Pig is a parallel data processing language that is not committed to any particular parallel framework, including Hadoop. What makes it a domesticated animal? Well, if "domesticated" means "plays well with humans," then it's definitely the case that Pig prides itself on being easy for humans to code and maintain. Lastly, Pig is smart: in data processing lingo, this means there is an optimizer that does the hard work of figuring out how to get at the data quickly. Pig is not just going to be quick; it's going to fly.

Admiring the Pig Architecture


Pig is made up of two components:
a) The language itself:
The programming language for Pig is known as Pig Latin, a high-level language that
allows you to write data processing and analysis programs.
b) The Pig Latin compiler:
The Pig Latin compiler converts the Pig Latin code into executable code. The executable
code is either in the form of MapReduce jobs or it can spawn a process where a virtual
Hadoop instance is created to run the Pig code on a single node.
The sequence of MapReduce programs enables Pig programs to do data processing and analysis
in parallel, leveraging Hadoop MapReduce and HDFS. Running the Pig job in the virtual
Hadoop instance is a useful strategy for testing your Pig scripts. Figure 1 shows how Pig relates
to the Hadoop ecosystem.
Figure 1: Pig architecture.
Pig programs can run on MapReduce 1 or MapReduce 2 without any code changes, regardless of which mode your cluster is running in. Pig scripts can also run using the Tez API instead; Apache Tez provides a more efficient execution framework than MapReduce. YARN enables application frameworks other than MapReduce (such as Tez) to run on Hadoop. Hive can also run against the Tez framework.

Going with the Pig Latin Application Flow


At its core, Pig Latin is a dataflow language, where you define a data stream and a series of
transformations that are applied to the data as it flows through your application. This is in
contrast to a control flow language (like C or Java), where you write a series of instructions. In
control flow languages, we use constructs like loops and conditional logic (like an if statement). You won't find loops and if statements in Pig Latin.
If you need some convincing that working with Pig is significantly easier than having to write Map and Reduce programs, start by taking a look at some real Pig syntax. The following listing shows sample Pig code that illustrates the data processing dataflow.

Listing: Sample pig code to illustrate the data processing data flow
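A minimal sketch of such a dataflow, assuming a hypothetical comma-delimited clickstream file with user, url, and visits columns (the file name, schema, and aliases here are illustrative assumptions, not taken from the original listing), might look like this:

-- Load: read a comma-delimited file from HDFS and name its columns
raw_clicks = LOAD 'clickstream.csv' USING PigStorage(',')
             AS (user:chararray, url:chararray, visits:int);
-- Transform: filter, group, and aggregate the data
good_clicks = FILTER raw_clicks BY visits > 0;
by_user = GROUP good_clicks BY user;
user_totals = FOREACH by_user GENERATE group AS user, SUM(good_clicks.visits) AS total_visits;
-- Dump: execute the dataflow and print the result to the screen
DUMP user_totals;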
Some of the text in this example actually looks like English. Looking at each line in turn, you
can see the basic flow of a Pig program. This code can either be part of a script or issued on the
interactive shell called Grunt.
Load: You first load (LOAD) the data you want to manipulate. As in a typical
MapReduce job, that data is stored in HDFS.
For a Pig program to access the data, you first tell Pig what file or files to use. For that
task, you use the LOAD 'data_file' command. Here, 'data_file' can specify either an
HDFS file or a directory. If a directory is specified, all files in that directory are loaded
into the program. If the data is stored in a file format that isn't natively accessible to Pig, you can optionally add the USING clause to the LOAD statement to specify a user-defined function that can read in (and interpret) the data.

Transform: You run the data through a set of transformations, which are translated into a set of Map and Reduce tasks.
The transformation logic is where all the data manipulation happens. You can FILTER out rows that aren't of interest, JOIN two sets of data files, GROUP data to build aggregations, ORDER results, and do much, much more (a small sketch of these operators appears after this list).

Dump: Finally, you dump (DUMP) the results to the screen, or store (STORE) the results in a file somewhere.
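Here is a minimal sketch of those transformation operators chained together; the file names, schemas, and aliases are hypothetical and only show the general shape of FILTER, JOIN, GROUP, and ORDER.

-- Load two hypothetical comma-delimited inputs
flights = LOAD 'flights.csv' USING PigStorage(',') AS (carrier:chararray, dest:chararray, miles:int);
carriers = LOAD 'carriers.csv' USING PigStorage(',') AS (code:chararray, name:chararray);
-- FILTER out rows that are not of interest
long_flights = FILTER flights BY miles > 500;
-- JOIN the two data sets on the carrier code
joined = JOIN long_flights BY carrier, carriers BY code;
-- Project and disambiguate the joined fields
flat = FOREACH joined GENERATE carriers::name AS airline, long_flights::miles AS miles;
-- GROUP to build an aggregation, then ORDER the results
by_airline = GROUP flat BY airline;
totals = FOREACH by_airline GENERATE group AS airline, SUM(flat.miles) AS total_miles;
ordered = ORDER totals BY total_miles DESC;
DUMP ordered;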

Working through the ABCs of Pig Latin


Pig Latin is the language for Pig programs. Pig translates the Pig Latin script into MapReduce
jobs that can be executed within the Hadoop cluster. When coming up with Pig Latin, the
development team followed three key design principles:

Keep it simple.
Pig Latin provides a streamlined method for interacting with Java MapReduce. It's an abstraction, in other words, that simplifies the creation of parallel programs on the Hadoop cluster for data flows and analysis. Complex tasks may require a series of interrelated data transformations; such series are encoded as data flow sequences. Writing data transformations and flows as Pig Latin scripts instead of Java MapReduce programs makes these programs easier to write, understand, and maintain because a) you don't have to write the job in Java, b) you don't have to think in terms of MapReduce, and c) you don't need to come up with custom code to support rich data types.
Pig Latin provides a simpler language to exploit your Hadoop cluster, thus making it
easier for more people to leverage the power of Hadoop and become productive sooner.

Make it smart.
You may recall that the Pig Latin Compiler does the work of transforming a Pig Latin
program into a series of Java MapReduce jobs. The trick is to make sure that the
compiler can optimize the execution of these Java MapReduce jobs automatically,
allowing the user to focus on semantics rather than on how to optimize and access the
data. In much the same way, SQL is set up as a declarative query language that you use to access structured data stored in
an RDBMS. The RDBMS engine first translates the query to a data access method and
then looks at the statistics and generates a series of data access approaches. The
cost-based optimizer chooses the most efficient approach for execution.
Don’t limit development.
Make Pig extensible so that developers can add functions to address their particular
business problems.

Uncovering Pig Latin structures

The problem we're trying to solve involves calculating the total miles flown across all flights in the data set. The following listing is the Pig Latin script we'll use to answer this question.

Listing: Pig script calculating the total miles flown
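A minimal sketch of such a script, assuming a hypothetical comma-delimited flight-data file with a Distance column (the relation name records, the file name, and the other fields are assumptions; mileage_recs and total_miles follow the discussion below), might look like this:

-- Load the flight records and name the columns (hypothetical schema)
records = LOAD 'flight_data.csv' USING PigStorage(',') AS (carrier:chararray, origin:chararray, dest:chararray, Distance:int);
-- Group every record into a single group
mileage_recs = GROUP records ALL;
-- Sum the Distance column across all records
total_miles = FOREACH mileage_recs GENERATE SUM(records.Distance);
-- Execute the dataflow and print the result
DUMP total_miles;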


The Pig script is a lot smaller than the MapReduce application you'd need to accomplish the same task: the Pig script has only four lines of code. And not only is the code shorter, but it's even semi-human readable.
Most Pig scripts start with the LOAD statement to read data from HDFS.
In this case, we're loading data from a .csv file. Pig has its own data model, so next we need to map the file's data to the Pig data model. This is accomplished with the help of the USING clause. We then specify that it is a comma-delimited file with the PigStorage(',') function, followed by the AS clause defining the name of each of the columns.
Aggregations are commonly used in Pig to summarize data sets.
The GROUP statement is used to aggregate the records into a single relation named mileage_recs. The ALL keyword is used to aggregate all tuples into a single group. Note that some functions, including the SUM function that follows, require a preceding GROUP ALL statement for global sums.

FOREACH . . . GENERATE statements are used here to transform columns of data. In this case, we want to count the miles traveled, which are held in the Distance column. The SUM statement computes the sum of the Distance column into a single-column collection named total_miles.
The DUMP operator is used to execute the Pig Latin statements and display the results on the screen. DUMP is used in interactive mode, which means that the statements are executed immediately and the results are not saved. Typically, you will use either the DUMP or STORE operator at the end of your Pig script.
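As a sketch, if you wanted to persist the result from the example above instead of displaying it, you could replace the DUMP with a STORE statement (the output directory name here is just a placeholder):

-- Write the result to a directory instead of printing it
STORE total_miles INTO 'total_miles_out' USING PigStorage(',');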

Looking at Pig data types and syntax


Pig's data types make up the data model for how Pig thinks of the structure of the data it is
processing. With Pig, the data model gets defined when the data is loaded. Any data you load
into Pig from disk is going to have a particular schema and structure. In general terms, though,
Pig data types can be broken into two categories: scalar types and complex types. Scalar types
contain a single value, whereas complex types contain other types, such as the Tuple, Bag, and
Map types.
Pig Latin has these four types in its data model:
Atom: An atom is any single value, such as a string or a number ('Diego', for example). Pig's atomic values are scalar types that appear in most programming languages: int, long, float, double, chararray, and bytearray, for example. See Figure 2 for sample atom types.

Tuple: A tuple is a record that consists of a sequence of fields. Each field can be of any type ('Diego', 'Gomez', or 6, for example). Think of a tuple as a row in a table.

Bag: A bag is a collection of non-unique tuples. The schema of the bag is flexible —
each tuple in the collection can contain an arbitrary number of fields, and each field can
be of any type.

Map: A map is a collection of key/value pairs. The key of a map must be a chararray and must be unique, while the value can be of any type.
Figure 2 offers some fine examples of Tuple, Bag, and Map data types, as well.
Figure 2: Sample Pig data types
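As a rough illustration of these four types, here is how they might be written as Pig Latin constants (the specific values beyond 'Diego' and 'Gomez' are made up for the example):

-- Atom: a single scalar value
'Diego'
-- Tuple: an ordered sequence of fields, like a row in a table
('Diego', 'Gomez', 6)
-- Bag: a collection of tuples
{('Diego', 6), ('Gomez', 4)}
-- Map: key/value pairs, where each key is a chararray
['name'#'Diego', 'miles'#6]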
In a Hadoop context, accessing data means allowing developers to load, store, and stream data,
whereas transforming data means taking advantage of Pig's ability to group, join, combine,
split, filter, and sort data. Table 1 gives an overview of the operators associated with each
operation.

Table 1: Pig Latin Operators

Pig also provides a few operators that are helpful for debugging and troubleshooting, as shown in
Table 2:
Table 2: Operators for Debugging and Troubleshooting

The optional USING clause defines how to map the data structure within the file to the Pig data model; in this case, the PigStorage() function, which parses delimited text files. The optional AS clause defines a schema for the data that is being mapped. If you don't use an AS clause, you're basically telling the default load function to expect a plain-text file that is tab delimited.

Evaluating Local and Distributed Modes of Running Pig scripts


Before you can run your first Pig script, you need to have a handle on how Pig programs can be packaged with the Pig server. Pig has two modes for running scripts, as shown in Figure 3.

Figure 3. Pig modes


Local mode
All scripts are run on a single machine without requiring Hadoop MapReduce and HDFS. This can be useful for developing and testing Pig logic. If you're using a small set of data to develop or test your code, then local mode could be faster than going through the MapReduce infrastructure.
Local mode doesn't require Hadoop. When you run in local mode, the Pig program runs in the context of a local Java Virtual Machine, and data access is via the local file system of a single machine. Local mode is actually a local simulation of MapReduce, using Hadoop's LocalJobRunner class.

MapReduce mode (also known as Hadoop mode)


Pig is executed on the Hadoop cluster. In this case, the Pig script gets converted into a series of
MapReduce jobs that are then run on the Hadoop cluster. If you have a terabyte of data that you want to perform operations on and you want to interactively develop a program, you may soon find things slowing down considerably, and your storage needs may start to grow.
Local mode allows you to work with a subset of your data in a more interactive manner so that
you can figure out the logic (and work out the bugs) of your Pig program. After you have things
set up as you want them and your operations are running smoothly, you can then run the script
against the full data set using MapReduce mode.

Checking Out the Pig Script Interfaces


Pig programs can be packaged in three different ways:

Script: This method is nothing more than a file containing Pig Latin commands,
identified by the .pig suffix (FlightData.pig, for example). Ending your Pig program
with the .pig extension is a convention but not required. The commands are interpreted
by the Pig Latin compiler and executed in the order determined by the Pig optimizer.

Grunt: Grunt acts as a command interpreter where you can interactively enter Pig Latin
at the Grunt command line and immediately see the response. This method is helpful for
prototyping during initial development and with what-if scenarios.

Embedded: Pig Latin statements can be executed within Java, Python, or JavaScript
programs.

Pig scripts, Grunt shell Pig commands, and embedded Pig programs can run in either Local
mode or MapReduce mode. The Grunt shell provides an interactive shell to submit Pig
commands or run Pig scripts. To start the Grunt shell in Interactive mode, just submit the
command pig at your shell. To specify whether a script or Grunt shell is executed locally or in Hadoop mode, just specify it with the -x flag to the pig command.
The following is an example of how you'd specify running your Pig script in local mode:
pig -x local milesPerCarrier.pig
Here's how you'd run the Pig script in Hadoop mode, which is the default if you don't specify the flag:
pig -x mapreduce milesPerCarrier.pig
By default, when you specify the pig command without any parameters, it starts the Grunt shell in Hadoop mode. If you want to start the Grunt shell in local mode, just add the -x local flag to the command. Here is an example:
pig -x local

Scripting with Pig Latin


Hadoop is a rich and quickly evolving ecosystem with a growing set of new applications. Rather
than try to keep up with all the requirements for new capabilities, Pig is designed to be
extensible via user-defined functions, also known as UDFs. UDFs can be written in a number of
programming languages, including Java, Python, and JavaScript. Developers are also posting
and sharing a growing collection of UDFs online. (Look for Piggy Bank and DataFu, to name
just two examples of such online collections.) Some of the Pig UDFs that are part of these
repositories are LOAD/STORE functions (XML, for example), date time functions, text, math,
and stats functions.
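As a sketch of how a UDF is wired into a script, the statements below register a jar and invoke a function from it; the jar name, package, and class (myudfs.jar, com.example.pig.ToUpper) are hypothetical, not from a real repository.

-- Register a jar containing user-defined functions and give one a short alias
REGISTER 'myudfs.jar';
DEFINE TO_UPPER com.example.pig.ToUpper();
-- Use the UDF like any built-in function
names = LOAD 'names.txt' AS (name:chararray);
upper_names = FOREACH names GENERATE TO_UPPER(name);
DUMP upper_names;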
Pig can also be embedded in host languages such as Java, Python, and JavaScript, which allows
you to integrate Pig with your existing applications. It also helps overcome limitations in the Pig
language. One of the most commonly referenced limitations is that Pig doesn't support control flow statements: if/else, while loops, for loops, and conditional statements. Pig natively supports data flow, but needs to be embedded within another language to provide control flow. There are tradeoffs, however, to embedding Pig in a control-flow language. For example, if a Pig statement is embedded in a loop, every iteration of the loop that runs the Pig statement causes a separate MapReduce job to run.
