BIG DATA ANALYSIS
MODULE 5: FRAMEWORKS
APPLICATIONS ON BIG DATA USING PIG AND HIVE
Pig:
Pig is used for the analysis of large amounts of data. It is an abstraction over
MapReduce and is used to perform all kinds of data manipulation operations in
Hadoop. It provides the Pig Latin language for writing code, which contains many
built-in functions such as join, filter, etc. The two parts of Apache Pig are Pig
Latin and the Pig engine. The Pig engine converts Pig Latin scripts into
MapReduce jobs. Pig works at a higher level of abstraction and requires
fewer lines of code than the equivalent MapReduce program.
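For instance, a short Pig Latin sketch such as the following (the file and field names are illustrative, not from this module) replaces what would be a much longer Java MapReduce program:
-- load a tab-separated file of (visitor, url) pairs; names are illustrative
records = LOAD 'visits.txt' AS (visitor:chararray, url:chararray);
grouped = GROUP records BY url;
counts = FOREACH grouped GENERATE group, COUNT(records);
STORE counts INTO 'url_counts';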
Hive:
Hive is built on top of Hadoop and is used to process structured data in
Hadoop. Hive was developed by Facebook. It provides a querying language
known as Hive Query Language (HiveQL). Apache Hive is a data warehouse
that provides an SQL-like interface between the user and the Hadoop
Distributed File System (HDFS).
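As a rough illustration (the table and column names are assumptions, not part of this module), the same kind of aggregation in HiveQL might look like:
hive> CREATE TABLE visits (visitor STRING, url STRING)
> ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t';
hive> SELECT url, COUNT(*) FROM visits GROUP BY url;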
Difference between Pig and Hive:
1. Pig is a procedural data flow language, whereas Hive is a declarative SQL-like language.
2. Pig was developed by Yahoo; Hive was developed by Facebook.
3. Pig is used by researchers and programmers; Hive is mainly used by data analysts.
4. Pig is used to handle structured and semi-structured data; Hive is mainly used to handle structured data.
5. Pig is used for programming; Hive is used for creating reports.
6. Pig does not have a dedicated metadata database; Hive defines tables beforehand using a dedicated SQL-like DDL and stores their metadata.
7. Pig does not support schemas for storing data; Hive supports schemas for data insertion in tables.
Filtering Data
Once you have some data loaded into a relation, the next step is often to filter it
to remove the data that you are not interested in. By filtering early in the
processing pipeline, you minimize the amount of data flowing through the
system, which can improve efficiency.
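For example, using the relation A shown in the next section, a FILTER statement keeps only the tuples that satisfy a condition (a minimal sketch); for that data it would give:
grunt> apples = FILTER A BY $1 == 'apple';
grunt> DUMP apples;
(Ali,apple,3)
(Eve,apple,7)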
FOREACH...GENERATE
The FOREACH...GENERATE operator is used to act on every row in a relation.
It can be used to remove fields or to generate new ones.
In this example, we do both:
grunt> DUMP A;
(Joe,cherry,2)
(Ali,apple,3)
(Joe,banana,2)
(Eve,apple,7)
grunt> B = FOREACH A GENERATE $0, $2+1, 'Constant';
grunt> DUMP B;
(Joe,3,Constant)
(Ali,4,Constant)
(Joe,3,Constant)
(Eve,8,Constant)
Here we have created a new relation B with three fields. Its first field is a
projection of the first field ($0) of A. B’s second field is the third field of A ($2)
with one added to it. B’s third field is a constant field (every row in B has the
same third field) with the chararray value Constant. The
FOREACH...GENERATE operator has a nested form to support more complex
processing. A nested FOREACH...GENERATE must always have a
GENERATE statement as the last nested statement, which generates the summary
fields of interest using the grouped records, as well as the relations created in the
nested block.
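A minimal sketch of the nested form, assuming the relation A shown above, counts the distinct fruits bought by each person:
grouped = GROUP A BY $0;
counts = FOREACH grouped {
    -- fruits is a nested relation, valid only inside this block
    fruits = DISTINCT A.$1;
    GENERATE group, COUNT(fruits);
};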
STREAM
The STREAM operator allows you to transform data in a relation using an
external program or script.
JOIN
Pig provides support for joining relations with the JOIN operator. Consider the
relations A and B:
grunt> DUMP A;
(2,Tie)
(4,Coat)
(3,Hat)
(1,Scarf)
grunt> DUMP B;
(Joe,2)
(Hank,4)
(Ali,0)
(Eve,3)
(Hank,2)
We can join the two relations on the numerical (identity) field in each:
grunt> C = JOIN A BY $0, B BY $1;
grunt> DUMP C;
(2,Tie,Joe,2)
(2,Tie,Hank,2)
(3,Hat,Eve,3)
(4,Coat,Hank,4)
You should use the general join operator if all the relations being joined are too
large to fit in memory. If one of the relations is small enough to fit in memory,
there is a special type of join called a fragment replicate join, which is
implemented by distributing the small input to all the mappers and performing a
map-side join using an in-memory lookup table against the (fragmented) larger
relation. There is a special syntax for telling Pig to use a fragment replicate join:
grunt> C = JOIN A BY $0, B BY $1 USING 'replicated';
COGROUP
JOIN always gives a flat structure: a set of tuples. The COGROUP statement is
similar to JOIN, but creates a nested set of output tuples. This can be useful if you
want to exploit the structure in subsequent statements:
grunt> D = COGROUP A BY $0, B BY $1;
grunt> DUMP D;
(0,{},{(Ali,0)})
(1,{(1,Scarf)},{})
(2,{(2,Tie)},{(Joe,2),(Hank,2)})
(3,{(3,Hat)},{(Eve,3)})
(4,{(4,Coat)},{(Hank,4)})
COGROUP generates a tuple for each unique grouping key. The first field of each
tuple is the key, and the remaining fields are bags of tuples from the relations with
a matching key. The first bag contains the matching tuples from relation A with
the same key. Similarly, the second bag contains the matching tuples from
relation B with the same key.
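For example, a minimal sketch that keeps only the groups where both relations contribute at least one tuple, using Pig's built-in IsEmpty function:
grunt> E = FILTER D BY NOT IsEmpty(A) AND NOT IsEmpty(B);
grunt> DUMP E;
(2,{(2,Tie)},{(Joe,2),(Hank,2)})
(3,{(3,Hat)},{(Eve,3)})
(4,{(4,Coat)},{(Hank,4)})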
CROSS
Pig Latin includes the cross-product operator (also known as the Cartesian
product), which joins every tuple in a relation with every tuple in a second relation
(and with every tuple in further relations if supplied). The size of the output is the
product of the size of the inputs, potentially making the output very large:
grunt> I = CROSS A, B;
grunt> DUMP I;
(2,Tie,Joe,2)
(2,Tie,Hank,4)
(2,Tie,Ali,0)
(2,Tie,Eve,3)
(2,Tie,Hank,2)
(4,Coat,Joe,2)
(4,Coat,Hank,4)
(4,Coat,Ali,0)
(4,Coat,Eve,3)
(4,Coat,Hank,2)
(3,Hat,Joe,2)
(3,Hat,Hank,4)
(3,Hat,Ali,0)
(3,Hat,Eve,3)
(3,Hat,Hank,2)
(1,Scarf,Joe,2)
(1,Scarf,Hank,4)
(1,Scarf,Ali,0)
(1,Scarf,Eve,3)
(1,Scarf,Hank,2)
When dealing with large datasets, you should try to avoid operations that
generate intermediate representations that are quadratic (or worse) in size.
Computing the cross-product of the whole input dataset is rarely needed, if ever.
GROUP
Although COGROUP groups the data in two or more relations, the GROUP
statement groups the data in a single relation. GROUP supports grouping by more
than equality of keys: you can use an expression or user-defined function as the
group key. For example, consider the following relation A:
grunt> DUMP A;
(Joe,cherry)
(Ali,apple)
(Joe,banana)
(Eve,apple)
We can group A by the length of the second field (the fruit name):
grunt> B = GROUP A BY SIZE($1);
grunt> DUMP B;
(5,{(Ali,apple),(Eve,apple)})
(6,{(Joe,cherry),(Joe,banana)})
GROUP creates a relation whose first field is the grouping field, which is given
the alias group. The second field is a bag containing the grouped fields with the
same schema as the original relation (in this case, A).
There are also two special grouping operations: ALL and ANY. ALL groups all
the tuples in a relation in a single group, as if the GROUP function was a constant:
grunt> C = GROUP A ALL;
grunt> DUMP C;
(all,{(Joe,cherry),(Ali,apple),(Joe,banana),(Eve,apple)})
Note that there is no BY in this form of the GROUP statement. The ALL grouping
is commonly used to count the number of tuples in a relation.
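For example, a minimal sketch that counts the tuples in A using the grouped relation above:
grunt> cnt = FOREACH C GENERATE COUNT(A);
For the four-tuple relation shown, this yields 4.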
Sorting Data
Relations are unordered in Pig. There is no guarantee which order the rows
will be processed in. In particular, when retrieving the contents of A using DUMP
or STORE, the rows may be written in any order. If you want to impose an order
on the output, you can use the ORDER operator to sort a relation by one or more
fields. The default sort order compares fields of the same type using the natural
ordering, and different types are given an arbitrary, but deterministic, ordering.
The following example sorts A by the first field in ascending order and by the
second field in descending order:
grunt> B = ORDER A BY $0, $1 DESC;
grunt> DUMP B;
(1,2)
(2,4)
(2,3)
The LIMIT statement is useful for limiting the number of results, as a quick and
dirty way to get a sample of a relation; prototyping (the ILLUSTRATE
command) should be preferred for generating more representative samples of the
data. It can be used immediately after the ORDER statement to retrieve the first
n tuples. Usually, LIMIT will select any n tuples from a relation, but when used
immediately after an ORDER statement, the order is retained (in an exception to
the rule that processing a relation does not retain its order):
grunt> D = LIMIT B 2;
grunt> DUMP D;
(1,2)
(2,4)
If the limit is greater than the number of tuples in the relation, all tuples are
returned (so LIMIT has no effect).
Combining and Splitting Data
Sometimes you have several relations that you would like to combine into one.
For this, the UNION statement is used. For example:
grunt> DUMP A;
(2,3)
(1,2)
(2,4)
grunt> DUMP B;
(z,x,8)
(w,y,1)
grunt> C = UNION A, B;
grunt> DUMP C;
(2,3)
(1,2)
(2,4)
(z,x,8)
(w,y,1)
C is the union of relations A and B, and since relations are unordered, the order
of the tuples in C is undefined. Also, it’s possible to form the union of two
relations with different schemas or with different numbers of fields, as we have
done here. Pig attempts to merge the schemas from the relations that UNION is
operating on. In this case, they are incompatible, so C has no schema:
grunt> DESCRIBE A;
A: {f0: int,f1: int}
grunt> DESCRIBE B;
B: {f0: chararray,f1: chararray,f2: int}
grunt> DESCRIBE C;
Schema for C unknown.
If the output relation has no schema, your script needs to be able to handle tuples
that vary in the number of fields and/or types.
The SPLIT operator is the opposite of UNION; it partitions a relation into two
or more relations.
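A minimal sketch, using the numeric relation A shown above:
grunt> SPLIT A INTO X IF $0 <= 1, Y IF $0 > 1;
grunt> DUMP X;
(1,2)
grunt> DUMP Y;
(2,3)
(2,4)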
Hive Services
The Hive shell is only one of several services that you can run using the hive
command. You can specify the service to run using the --service option. Type
hive --service help to get a list of available service names; the most useful are
described below.
cli
The command line interface to Hive (the shell). This is the default service.
hiveserver
Runs Hive as a server exposing a Thrift service, enabling access from a range
of clients written in different languages. Applications using the Thrift, JDBC, and
ODBC connectors need to run a Hive server to communicate with Hive. Set the
HIVE_PORT environment variable to specify the port the server will listen on
(it defaults to 10000).
hwi
The Hive Web Interface.
jar
The Hive equivalent to hadoop jar, a convenient way to run Java applications
that include both Hadoop and Hive classes on the classpath.
Metastore
By default, the metastore is run in the same process as the Hive service.
Using this service, it is possible to run the metastore as a standalone (remote)
process. Set the METASTORE_PORT environment variable to specify the port
the server will listen on.
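For example, a sketch of starting the Thrift server on the default port (the exact invocation may vary between Hive versions):
% export HIVE_PORT=10000
% hive --service hiveserver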
HiveQL
Hive’s SQL dialect, called HiveQL, does not support the full SQL-92
specification. Hive has some extensions that are not in SQL-92, which have been
inspired by syntax from other database systems, notably MySQL. Table 12-2
provides a high-level comparison of SQL and HiveQL.
Data Types
Hive supports both primitive and complex data types. Primitives include numeric,
boolean, string, and timestamp types. The complex data types include arrays,
maps, and structs. Hive’s data types are listed in Table 12-3. Note that the literals
shown are those used from within HiveQL; they are not the serialized form used
in the table’s storage format.
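As an illustration (the table and column names are assumptions), the complex types can be declared in a table definition like this:
CREATE TABLE complex (
  c1 ARRAY<INT>,
  c2 MAP<STRING, INT>,
  c3 STRUCT<a:STRING, b:INT, c:DOUBLE>
);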
Primitive types
Hive’s primitive types correspond roughly to Java’s, although some names are
influenced by MySQL’s type names (some of which, in turn, overlap with SQL-
92). There are four signed integral types: TINYINT, SMALLINT, INT, and
BIGINT, which are equivalent to Java’s byte, short, int, and long primitive types,
respectively; they are 1-byte, 2-byte, 4-byte, and 8-byte signed integers.
Hive’s floating-point types, FLOAT and DOUBLE, correspond to Java’s float
and double, which are 32-bit and 64-bit floating point numbers. Unlike some
databases, there is no option to control the number of significant digits or decimal
places stored for floating point values.
Hive supports a BOOLEAN type for storing true and false values.
There is a single Hive data type for storing text, STRING, which is a variable-
length character string. Hive's STRING type is like VARCHAR in other
databases, although it does not require a maximum length to be declared.
Functions
When there is no built-in function that does what you want, you can write your
own user-defined function (UDF).
QUERYING DATA IN HIVE
This section discusses how to use various forms of the SELECT statement to
retrieve data from Hive.
Sorting and Aggregating
Sorting data in Hive can be achieved by use of a standard ORDER BY clause, but
there is a catch. ORDER BY produces a result that is totally sorted, as expected,
but to do so it sets the number of reducers to one, making it very inefficient for
large datasets.
When a globally sorted result is not required—and in many cases it isn’t—then
you can use Hive’s nonstandard extension, SORT BY instead. SORT BY
produces a sorted file per reducer.
In some cases, you want to control which reducer a particular row goes to,
typically so you can perform some subsequent aggregation. This is what Hive’s
DISTRIBUTE BY clause does. Here's an example that sorts the weather dataset by
year and temperature, in such a way as to ensure that all the rows for a given year
end up in the same reducer partition:
hive> FROM records2
> SELECT year, temperature
> DISTRIBUTE BY year
> SORT BY year ASC, temperature DESC;
1949 111
1949 78
1950 22
1950 0
1950 -11
If the columns for SORT BY and DISTRIBUTE BY are the same, you can use
CLUSTER BY as a shorthand for specifying both.
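For example, if we distributed and sorted only by year, the query above could be written as follows (a sketch):
hive> FROM records2
> SELECT year, temperature
> CLUSTER BY year;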
MapReduce Scripts
Before running the query, we need to register the script with Hive (using ADD
FILE), so that Hive ships it to the cluster.
If we use a nested form for the query, we can specify a map and a reduce function.
This time we use the MAP and REDUCE keywords, but SELECT TRANSFORM in both cases
would have the same result. A sketch of the complete nested query follows; the
reducer script name is illustrative, not part of the original example:
FROM (
  FROM records2
  MAP year, temperature, quality
  USING 'is_good_quality.py'
  AS year, temperature) map_output
REDUCE year, temperature
-- the reducer script name below is illustrative
USING 'max_temperature_reduce.py'
AS year, temperature;
Joins
One of the nice things about using Hive, rather than raw MapReduce, is that it
makes performing commonly used operations very simple. Join operations are a
case in point, given how involved they are to implement in MapReduce.
Inner joins
The simplest kind of join is the inner join, where each match in the input tables
results in a row in the output. Consider two small demonstration tables: sales,
which lists the names of people and the ID of the item they bought; and things,
which lists the item ID and its name:
hive> SELECT * FROM sales;
Joe 2
Hank 4
Ali 0
Eve 3
Hank 2
hive> SELECT * FROM things;
2 Tie
4 Coat
3 Hat
1 Scarf
We can perform an inner join of the two tables as follows:
hive> SELECT sales.*, things.*
> FROM sales JOIN things ON (sales.id = things.id);
Hive only supports equijoins, which means that only equality can be used in the
join predicate; here the join matches on the id column in both tables.
In Hive, you can join on multiple columns in the join predicate by specifying a
series of expressions, separated by AND keywords. You can also join more than
two tables by supplying additional JOIN...ON... clauses in the query. Hive is
intelligent about trying to minimize the number of MapReduce jobs to perform
the joins.
A single join is implemented as a single MapReduce job, but multiple joins can
be performed in less than one MapReduce job per join if the same column is used
in the join condition. You can see how many MapReduce jobs Hive will use for
any particular query by prefixing it with the EXPLAIN keyword:
EXPLAIN
SELECT sales.*, things.*
FROM sales JOIN things ON (sales.id = things.id);
The EXPLAIN output includes many details about the execution plan for the
query, including the abstract syntax tree, the dependency graph for the stages that
Hive will execute, and information about each stage. Stages may be MapReduce
jobs or operations such as file moves. For even more detail, prefix the query with
EXPLAIN EXTENDED.
Hive currently uses a rule-based query optimizer for determining how to execute
a query, but it’s likely that in the future a cost-based optimizer will be added.
Outer joins
Outer joins allow you to find nonmatches in the tables being joined. In the current
example, when we performed an inner join, the row for Ali did not appear in the
output, since the ID of the item she purchased was not present in the things table.
If we change the join type to LEFT OUTER JOIN, then the query will return a
row for every row in the left table (sales), even if there is no corresponding row
in the table it is being joined to (things):
hive> SELECT sales.*, things.*
> FROM sales LEFT OUTER JOIN things ON (sales.id = things.id);
Ali 0 NULL NULL
Joe 2 2 Tie
Hank 2 2 Tie
Eve 3 3 Hat
Hank 4 4 Coat
Hive also supports right outer joins, which reverse the roles of the tables relative
to the left join. In this case, all items from the things table are included, even those
that weren't purchased by anyone (a scarf):
hive> SELECT sales.*, things.*
> FROM sales RIGHT OUTER JOIN things ON (sales.id = things.id);
NULL NULL 1 Scarf
Joe 2 2 Tie
Hank 2 2 Tie
Eve 3 3 Hat
Hank 4 4 Coat
Finally, there is a full outer join, where the output has a row for each row from
both tables in the join:
hive> SELECT sales.*, things.*
> FROM sales FULL OUTER JOIN things ON (sales.id = things.id);
Ali 0 NULL NULL
NULL NULL 1 Scarf
Joe 2 2 Tie
Hank 2 2 Tie
Eve 3 3 Hat
Hank 4 4 Coat
Semi joins
Hive doesn’t support IN subqueries (at the time of writing), but you can use a
LEFT SEMI JOIN to do the same thing. Consider this IN subquery, which finds
all the items in the things table that are in the sales table:
SELECT * FROM things WHERE things.id IN (SELECT id from sales);
We can rewrite it with a LEFT SEMI JOIN as follows:
hive> SELECT *
> FROM things LEFT SEMI JOIN sales ON (sales.id = things.id);
There is a restriction that we must observe for LEFT SEMI JOIN queries: the
right table (sales) may only appear in the ON clause. It cannot be referenced in a
SELECT expression, for example.
Map joins
If one table is small enough to fit in memory, then Hive can load the smaller table
into memory to perform the join in each of the mappers. The syntax for specifying
a map join is a hint embedded in an SQL C-style comment:
SELECT /*+ MAPJOIN(things) */ sales.*, things.*
FROM sales JOIN things ON (sales.id = things.id);
The job to execute this query has no reducers, so this query would not work for a
RIGHT or FULL OUTER JOIN, since absence of matching can only be detected
in an aggregating (reduce) step across all the inputs.
Subqueries
The following query finds the mean maximum temperature for every year and
weather station:
SELECT station, year, AVG(max_temperature)
FROM (
SELECT station, year, MAX(temperature) AS max_temperature
FROM records2
WHERE temperature != 9999
AND (quality = 0 OR quality = 1 OR quality = 4 OR quality = 5 OR quality = 9)
GROUP BY station, year
) mt
GROUP BY station, year;
The subquery is used to find the maximum temperature for each station/year
combination, and the outer query then uses the AVG aggregate function to find the
average of those maximum temperature readings for each station/year combination.
The outer query accesses the results of the subquery like it does a table, which is
why the subquery must be given an alias (mt). The columns of the subquery have
to be given unique names so that the outer query can refer to them.
Views
A view is a sort of “virtual table” that is defined by a SELECT statement. Views
can be used to present data to users in a different way to the way it is actually
stored on disk. Often, the data from existing tables is simplified or aggregated in
a particular way that makes it convenient for further processing. Views may also
be used to restrict users’ access to particular subsets of tables that they are
authorized to see.
In Hive, a view is not materialized to disk when it is created; rather, the view’s
SELECT statement is executed when the statement that refers to the view is run.
If a view performs extensive transformations on the base tables, or is used
frequently, then you may choose to manually materialize it by creating a new
table that stores the contents of the view.
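A minimal sketch of creating a view over the records2 table used earlier (the view name is illustrative):
CREATE VIEW valid_records AS
SELECT *
FROM records2
WHERE temperature != 9999
AND (quality = 0 OR quality = 1 OR quality = 4 OR quality = 5 OR quality = 9);
The view can then be queried like a table, for example SELECT MAX(temperature) FROM valid_records;.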
HBase
HBase is a data model similar to Google's Bigtable that is designed to provide
random access to a high volume of structured or unstructured data. HBase is an
important component of the Hadoop ecosystem that leverages the fault-tolerance
feature of HDFS. HBase provides real-time read or write access to data
in HDFS. The HBase data model stores semi-structured data having different data
types and varying column and field sizes. The layout of the HBase data model eases
data partitioning and distribution across the cluster. The HBase data model consists of
several logical components: row key, column family, table name, timestamp, etc.
The row key is used to uniquely identify the rows in HBase tables. Column families
in HBase are static, whereas the columns, by themselves, are dynamic.
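As an illustration of the model (the table, column family, and values are assumptions), the HBase shell can create a table with one column family and store a cell addressed by row key and column family:qualifier:
hbase> create 'sensor_data', 'readings'
hbase> put 'sensor_data', 'row1', 'readings:temperature', '98'
hbase> get 'sensor_data', 'row1'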
i. HMaster
HMaster is the master process: it assigns regions to region servers, balances the
load across them, and handles administrative operations such as creating and
deleting tables.
ii. Region Server
Region servers are the worker nodes which handle read, write, update, and delete
requests from clients. The Region Server process runs on every node in the Hadoop
cluster. A Region Server runs on an HDFS DataNode and consists of the following
components:
• Block Cache – This is the read cache. The most frequently read data is stored
in the read cache, and whenever the block cache is full, the least recently used
data is evicted.
• MemStore – This is the write cache and stores new data that has not yet been
written to disk. Every column family in a region has its own MemStore.
• Write Ahead Log (WAL) – This is a file that stores new data that has not yet
been persisted to permanent storage; it is used for recovery in the case of failure.
• HFile – This is the actual storage file that stores the rows as sorted key-values
on disk.
iii. ZooKeeper
The ZooKeeper service keeps track of all the region servers in an HBase cluster:
how many region servers there are and which region servers are holding which
regions. HMaster contacts ZooKeeper to get the details of the region servers.
Services that ZooKeeper provides include maintaining configuration information,
naming, distributed synchronization, and group services.
InfoSphere BigInsights
Here are some of the standard open source utilities in InfoSphere BigInsights:
• PIG
• Hive / HCatalog
• Oozie
• HBASE
• Zookeeper
• Flume
• Avro
• Chukwa
Key software capabilities:
• InfoSphere Streams
VISUALIZATIONS
Visualization is the first step in making sense of data. To translate and present
complex data and relations in a simple way, data analysts use different
visualization techniques.
Charts
The easiest way to show the development of one or several data sets is a chart.
Charts vary from bar and line charts that show the relationship between elements
over time to pie charts that demonstrate the components or proportions between
the elements of one whole.
Plots
Plots allow you to distribute two or more data sets over a 2D or even 3D space to
show the relationship between these sets and the parameters on the plot. Plots also
vary: scatter and bubble plots are some of the most widely used visualizations.
When it comes to big data, analysts often use more complex box plots to visualize
the relationships between large volumes of data.
Maps
Maps are popular techniques used for data visualization in different industries.
They allow locating elements on relevant objects and areas — geographical maps,
building plans, website layouts, etc. Among the most popular map visualizations
are heat maps, dot distribution maps, and cartograms.
Matrix
A matrix is one of the advanced data visualization techniques; it helps determine
the correlation between multiple constantly updating (streaming) data sets.