BIGDATA ANALYSIS
MODULE 5 : FRAMEWORKS
APPLICATION ON BIGDATA USING PIG AND HIVE
Pig :
Pig is used for the analysis of large amounts of data. It is an abstraction over
MapReduce and is used to perform all kinds of data manipulation operations in
Hadoop. It provides the Pig Latin language to write code, with many inbuilt
functions such as join, filter, etc. The two parts of Apache Pig are Pig Latin and
the Pig Engine. The Pig Engine converts Pig Latin scripts into MapReduce tasks.
Because Pig's abstraction is at a higher level, it requires fewer lines of code than
MapReduce.
Hive :
Hive is built on top of Hadoop and is used to process structured data in
Hadoop. Hive was developed by Facebook. It provides a query language known as
Hive Query Language (HiveQL). Apache Hive is a data warehouse system that
provides an SQL-like interface between the user and data stored in the Hadoop
Distributed File System (HDFS).
Difference between Pig and Hive :
1. Pig operates on the client side of a cluster, whereas Hive operates on the server side of a cluster.
2. Pig uses the Pig Latin language, whereas Hive uses the HiveQL language.
3. Pig is a procedural data flow language, whereas Hive is a declarative, SQL-like language.
4. Pig was developed by Yahoo, whereas Hive was developed by Facebook.
5. Pig is used by researchers and programmers, whereas Hive is mainly used by data analysts.
6. Pig is used to handle structured and semi-structured data, whereas Hive is mainly used to handle structured data.
7. Pig is used for programming, whereas Hive is used for creating reports.
8. Pig scripts end with the .pig extension, whereas Hive supports all file extensions.
9. Pig does not support partitioning, whereas Hive supports partitioning.
10. Pig loads data quickly, whereas Hive loads data slowly.
11. Pig does not support JDBC, whereas Hive supports JDBC.
12. Pig does not support ODBC, whereas Hive supports ODBC.
13. Pig does not have a dedicated metadata database, whereas Hive uses a dedicated SQL-DDL-like language and defines tables beforehand, storing their metadata.
14. Pig supports the Avro file format, whereas Hive does not support the Avro file format.
15. Pig is suitable for complex and nested data structures, whereas Hive is suitable for batch-processing OLAP systems.
16. Pig does not require a schema to store data, whereas Hive requires a schema for data insertion in tables.

Data Processing Operators in Pig


Loading and Storing Data
The PigStorage function can be used to store tuples as plain-text values separated by a colon character:


grunt> STORE A INTO 'out' USING PigStorage(':');


grunt> cat out
Joe:cherry:2
Ali:apple:3
Joe:banana:2
Eve:apple:7
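
Loading data back follows the same pattern. A minimal sketch, assuming the
colon-delimited 'out' output written above (the alias A2 and the field names are
illustrative, not part of the original example):

grunt> A2 = LOAD 'out' USING PigStorage(':')
       AS (name:chararray, fruit:chararray, num:int);
grunt> DUMP A2;
(Joe,cherry,2)
(Ali,apple,3)
(Joe,banana,2)
(Eve,apple,7)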

Filtering Data
Once you have some data loaded into a relation, the next step is often to filter it
to remove the data that you are not interested in. By filtering early in the
processing pipeline, you minimize the amount of data flowing through the
system, which can improve efficiency.
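For example, a minimal FILTER sketch, assuming the relation A shown in the next
example (the alias F is illustrative):

grunt> F = FILTER A BY $1 == 'apple';
grunt> DUMP F;
(Ali,apple,3)
(Eve,apple,7)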
FOREACH...GENERATE
The FOREACH...GENERATE operator is used to act on every row in a relation.
It can be used to remove fields or to generate new ones.
In this example, we do both:
grunt> DUMP A;
(Joe,cherry,2)
(Ali,apple,3)
(Joe,banana,2)
(Eve,apple,7)
grunt> B = FOREACH A GENERATE $0, $2+1, 'Constant';
grunt> DUMP B;
(Joe,3,Constant)
(Ali,4,Constant)
(Joe,3,Constant)
(Eve,8,Constant)

Here we have created a new relation B with three fields. Its first field is a
projection of the first field ($0) of A. B’s second field is the third field of A ($2)
with one added to it. B’s third field is a constant field (every row in B has the
same third field) with the chararray value Constant. The
FOREACH...GENERATE operator has a nested form to support more complex
processing. A nested FOREACH...GENERATE must always have a GENERATE
statement as the last nested statement; it generates the summary fields of interest
using the grouped records, as well as the relations created in the nested block.
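For example, a minimal sketch of the nested form that counts the distinct fruits
bought by each person in A (the aliases and the use of DISTINCT are illustrative):

grunt> G = GROUP A BY $0;
grunt> H = FOREACH G {
           fruits = DISTINCT A.$1;
           GENERATE group, COUNT(fruits);
       };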
STREAM


The STREAM operator allows you to transform data in a relation using an
external program or script. It is named by analogy with Hadoop Streaming, which
provides a similar capability for MapReduce.
STREAM can use built-in commands with arguments. Here is an example that
uses the Unix cut command to extract the second field of each tuple in A. Note
that the command and its arguments are enclosed in backticks:
grunt> C = STREAM A THROUGH `cut -f 2`;
grunt> DUMP C;
(cherry)
(apple)
(banana)
(apple)

The STREAM operator uses PigStorage to serialize and deserialize relations to
and from the program's standard input and output streams. Tuples in A are
converted to tab-delimited lines that are passed to the script. The output of the
script is read one line at a time and split on tabs to create new tuples for the output
relation C. You can provide a custom serializer and deserializer, which implement
PigToStream and StreamToPig respectively (both in the org.apache.pig package),
using the DEFINE command. Pig streaming is most powerful when you write
custom processing scripts.
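For example, a streaming command can be given an alias with DEFINE (a minimal
sketch; the alias name cut_second is illustrative), and the same DEFINE statement
is where a custom serializer and deserializer would be attached:

grunt> DEFINE cut_second `cut -f 2`;
grunt> C = STREAM A THROUGH cut_second;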
Grouping and Joining Data
Joining datasets in MapReduce takes some work on the part of the programmer,
whereas Pig has very good built-in support for join operations, making it much more
approachable. Since the large datasets that are suitable for analysis by Pig (and
MapReduce in general) are not normalized, joins are used less frequently in
Pig than they are in SQL.
JOIN
Let’s look at an example of an inner join. Consider the relations A and B:
grunt> DUMP A;
(2,Tie)
(4,Coat)
(3,Hat)
(1,Scarf)
grunt> DUMP B;
(Joe,2)
(Hank,4)
(Ali,0)
(Eve,3)


(Hank,2)

We can join the two relations on the numerical (identity) field in each:
grunt> C = JOIN A BY $0, B BY $1;
grunt> DUMP C;
(2,Tie,Joe,2)
(2,Tie,Hank,2)
(3,Hat,Eve,3)
(4,Coat,Hank,4)

You should use the general join operator if all the relations being joined are too
large to fit in memory. If one of the relations is small enough to fit in memory,
there is a special type of join called a fragment replicate join, which is
implemented by distributing the small input to all the mappers and performing a
map-side join using an in-memory lookup table against the (fragmented) larger
relation. There is a special syntax for telling Pig to use a fragment replicate join:
grunt> C = JOIN A BY $0, B BY $1 USING 'replicated';

COGROUP
JOIN always gives a flat structure: a set of tuples. The COGROUP statement is
similar to JOIN, but creates a nested set of output tuples. This can be useful if you
want to exploit the structure in subsequent statements:
grunt> D = COGROUP A BY $0, B BY $1;
grunt> DUMP D;
(0,{},{(Ali,0)})
(1,{(1,Scarf)},{})
(2,{(2,Tie)},{(Joe,2),(Hank,2)})
(3,{(3,Hat)},{(Eve,3)})
(4,{(4,Coat)},{(Hank,4)})

COGROUP generates a tuple for each unique grouping key. The first field of each
tuple is the key, and the remaining fields are bags of tuples from the relations with
a matching key. The first bag contains the matching tuples from relation A with
the same key. Similarly, the second bag contains the matching tuples from
relation B with the same key.
CROSS
Pig Latin includes the cross-product operator (also known as the cartesian
product), which joins every tuple in a relation with every tuple in a second relation


(and with every tuple in further relations if supplied). The size of the output is the
product of the size of the inputs, potentially making the output very large:
grunt> I = CROSS A, B;
grunt> DUMP I;
(2,Tie,Joe,2)
(2,Tie,Hank,4)
(2,Tie,Ali,0)
(2,Tie,Eve,3)
(2,Tie,Hank,2)
(4,Coat,Joe,2)
(4,Coat,Hank,4)
(4,Coat,Ali,0)
(4,Coat,Eve,3)
(4,Coat,Hank,2)
(3,Hat,Joe,2)
(3,Hat,Hank,4)
(3,Hat,Ali,0)
(3,Hat,Eve,3)
(3,Hat,Hank,2)
(1,Scarf,Joe,2)
(1,Scarf,Hank,4)
(1,Scarf,Ali,0)
(1,Scarf,Eve,3)
(1,Scarf,Hank,2)

When dealing with large datasets, you should try to avoid operations that
generate intermediate representations that are quadratic (or worse) in size.
Computing the cross-product of the whole input dataset is rarely needed, if ever.
GROUP
Although COGROUP groups the data in two or more relations, the GROUP
statement groups the data in a single relation. GROUP supports grouping by more
than equality of keys: you can use an expression or user-defined function as the
group key. For example, consider the following relation A:
grunt> DUMP A;
(Joe,cherry)
(Ali,apple)
(Joe,banana)
(Eve,apple)

Let’s group by the number of characters in the second field:


grunt> B = GROUP A BY SIZE($1);
grunt> DUMP B;


(5,{(Ali,apple),(Eve,apple)})
(6,{(Joe,cherry),(Joe,banana)})

GROUP creates a relation whose first field is the grouping field, which is given
the alias group. The second field is a bag containing the grouped fields with the
same schema as the original relation (in this case, A).
There are also two special grouping operations: ALL and ANY. ALL groups all
the tuples in a relation in a single group, as if the GROUP function was a constant:
grunt> C = GROUP A ALL;
grunt> DUMP C;
(all,{(Joe,cherry),(Ali,apple),(Joe,banana),(Eve,apple)})

Note that there is no BY in this form of the GROUP statement. The ALL grouping
is commonly used to count the number of tuples in a relation.
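For example, a minimal sketch that counts the tuples in A using the ALL group
created above (the alias D is illustrative):

grunt> D = FOREACH C GENERATE COUNT(A);
grunt> DUMP D;
(4)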

Sorting Data
Relations are unordered in Pig. There is no guarantee which order the rows
will be processed in. In particular, when retrieving the contents of A using DUMP
or STORE, the rows may be written in any order. If you want to impose an order
on the output, you can use the ORDER operator to sort a relation by one or more
fields. The default sort order compares fields of the same type using the natural
ordering, and different types are given an arbitrary, but deterministic, ordering.
The following example sorts A by the first field in ascending order and by the
second field in descending order:
grunt> B = ORDER A BY $0, $1 DESC;
grunt> DUMP B;
(1,2)
(2,4)
(2,3)

The LIMIT statement is useful for limiting the number of results, as a quick-and-dirty
way to get a sample of a relation; for prototyping, the ILLUSTRATE
command should be preferred for generating more representative samples of the
data. LIMIT can be used immediately after the ORDER statement to retrieve the first
n tuples. Usually, LIMIT will select any n tuples from a relation, but when used
immediately after an ORDER statement, the order is retained (an exception to
the rule that processing a relation does not retain its order):


grunt> D = LIMIT B 2;
grunt> DUMP D;
(1,2)
(2,4)

If the limit is greater than the number of tuples in the relation, all tuples are
returned (so LIMIT has no effect).
Combining and Splitting Data
Sometimes you have several relations that you would like to combine into one.
For this, the UNION statement is used. For example:
grunt> DUMP A;
(2,3)
(1,2)
(2,4)
grunt> DUMP B;
(z,x,8)
(w,y,1)
grunt> C = UNION A, B;
grunt> DUMP C;
(2,3)
(1,2)
(2,4)
(z,x,8)
(w,y,1)

C is the union of relations A and B, and since relations are unordered, the order
of the tuples in C is undefined. Also, it’s possible to form the union of two
relations with different schemas or with different numbers of fields, as we have
done here. Pig attempts to merge the schemas from the relations that UNION is
operating on. In this case, they are incompatible, so C has no schema:
grunt> DESCRIBE A;
A: {f0: int,f1: int}
grunt> DESCRIBE B;
B: {f0: chararray,f1: chararray,f2: int}
grunt> DESCRIBE C;
Schema for C unknown.

If the output relation has no schema, your script needs to be able to handle tuples
that vary in the number of fields and/or types.
The SPLIT operator is the opposite of UNION ; it partitions a relation into two
or more relations.
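A minimal SPLIT sketch, assuming the relation A of (int,int) tuples shown above
(the aliases small and large are illustrative):

grunt> SPLIT A INTO small IF $1 <= 2, large IF $1 > 2;
grunt> DUMP small;
(1,2)
grunt> DUMP large;
(2,3)
(2,4)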


Hive Services
The Hive shell is only one of several services that you can run using the hive
command. You can specify the service to run using the --service option. Type
hive --service help to get a list of available service names; the most useful are
described below.
cli
The command line interface to Hive (the shell). This is the default service.
hiveserver
Runs Hive as a server exposing a Thrift service, enabling access from a range
of clients written in different languages. Applications using the Thrift, JDBC, and
ODBC connectors need to run a Hive server to communicate with Hive. Set the
HIVE_PORT environment variable to specify the port the server will listen on
(defaults to 10000).
hwi
The Hive Web Interface.
jar
The Hive equivalent to hadoop jar, a convenient way to run Java applications
that includes both Hadoop and Hive classes on the classpath.
Metastore
By default, the metastore is run in the same process as the Hive service.
Using this service, it is possible to run the metastore as a standalone (remote)
process. Set the METASTORE_PORT environment variable to specify the port
the server will listen on.
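
For example, these services might be started from the command line as in the
following sketch (the defaults are as described above):

% hive --service hiveserver
% hive --service hwi
% hive --service metastore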

HiveQL
Hive’s SQL dialect, called HiveQL, does not support the full SQL-92
specification. Hive has some extensions that are not in SQL-92, which have been
inspired by syntax from other database systems, notably MySQL. Table 12-2
provides a high-level comparison of SQL and HiveQL.


Data Types
Hive supports both primitive and complex data types. Primitives include numeric,
boolean, string, and timestamp types. The complex data types include arrays,
maps, and structs. Hive’s data types are listed in Table 12-3. Note that the literals
shown are those used from within HiveQL; they are not the serialized form used
in the table’s storage format.


Primitive types
Hive’s primitive types correspond roughly to Java’s, although some names are
influenced by MySQL’s type names (some of which, in turn, overlap with SQL-
92). There are four signed integral types: TINYINT, SMALLINT, INT, and
BIGINT, which are equivalent to Java’s byte, short, int, and long primitive types,
respectively; they are 1-byte, 2-byte, 4-byte, and 8-byte signed integers.
Hive’s floating-point types, FLOAT and DOUBLE, correspond to Java’s float
and double, which are 32-bit and 64-bit floating point numbers. Unlike some
databases, there is no option to control the number of significant digits or decimal
places stored for floating point values.
Hive supports a BOOLEAN type for storing true and false values.
There is a single Hive data type for storing text, STRING, which is a variable-
length character string. Hive's STRING type is like VARCHAR in other
databases, although there is no declaration of the maximum number of characters
to store with STRING. (The theoretical maximum size of a STRING that may be
stored is 2 GB, although in practice it may be inefficient to materialize such large
values.)
The BINARY data type is for storing variable-length binary data.
The TIMESTAMP data type stores timestamps with nanosecond precision. Hive
comes with UDFs for converting between Hive timestamps, Unix timestamps
(seconds since the Unix epoch), and strings, which makes most common date
operations tractable. TIMESTAMP does not encapsulate a timezone; however, the
to_utc_timestamp and from_utc_timestamp functions make it possible to do
timezone conversions.
Conversions
Primitive types form a hierarchy, which dictates the implicit type conversions
that Hive will perform. For example, a TINYINT will be converted to an INT if
an expression expects an INT; however, the reverse conversion will not occur
and Hive will return an error unless the CAST operator is used.
The implicit conversion rules can be summarized as follows. Any integral
numeric type can be implicitly converted to a wider type. All the integral numeric
types, FLOAT, and (perhaps surprisingly) STRING can be implicitly converted
to DOUBLE. TINYINT, SMALLINT, and INT can all be converted to FLOAT.
BOOLEAN types cannot be converted to any other type.
You can perform explicit type conversion using CAST. For example, CAST('1'
AS INT) will convert the string '1' to the integer value 1. If the cast fails—as it
does in CAST('X' AS INT), for example—then the expression returns NULL.
Complex types
Hive has three complex types: ARRAY, MAP, and STRUCT. ARRAY and MAP
are like their namesakes in Java, while a STRUCT is a record type which
encapsulates a set of named fields. Complex types permit an arbitrary level of
nesting. Complex type declarations must specify the type of the fields in the
collection, using an angled bracket notation, as illustrated in this table definition
which has three columns, one for each complex type:
CREATE TABLE complex (
col1 ARRAY<INT>,
col2 MAP<STRING,INT>,
col3 STRUCT <a:STRING, b:INT, c:DOUBLE>
);
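
Assuming a row has been loaded into this table, fields of the complex types can be
accessed as in the following sketch (the array index, the map key 'b', and the struct
field c are illustrative):

hive> SELECT col1[0], col2['b'], col3.c FROM complex;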


Operators and Functions


The usual set of SQL operators is provided by Hive: relational operators (such as
x = 'a' for testing equality, x IS NULL for testing nullity, x LIKE 'a%' for pattern
matching), arithmetic operators (such as x + 1 for addition), and logical operators
(such as x OR y for logical OR). The operators match those in MySQL, which
deviates from SQL-92 since || is logical OR, not string concatenation. Use the
concat function for the latter in both MySQL and Hive.
Hive comes with a large number of built-in functions—too many to list here—
divided into categories including mathematical and statistical functions, string
functions, date functions (for operating on string representations of dates),
conditional functions, aggregate functions, and functions for working with XML
(using the xpath function) and JSON.
You can retrieve a list of functions from the Hive shell by typing SHOW
FUNCTIONS. To get brief usage instructions for a particular function, use the
DESCRIBE command:
hive> DESCRIBE FUNCTION length;
length(str) - Returns the length of str

If there is no built-in function that does what you want, you can write your own
user-defined function (UDF).
QUERYING DATA IN HIVE
This section discusses how to use various forms of the SELECT statement to
retrieve data from Hive.
Sorting and Aggregating
Sorting data in Hive can be achieved by use of a standard ORDER BY clause, but
there is a catch. ORDER BY produces a result that is totally sorted, as expected,
but to do so it sets the number of reducers to one, making it very inefficient for
large datasets.
When a globally sorted result is not required—and in many cases it isn’t—then
you can use Hive’s nonstandard extension, SORT BY instead. SORT BY
produces a sorted file per reducer.
In some cases, you want to control which reducer a particular row goes to,
typically so you can perform some subsequent aggregation. This is what Hive’s
DISTRIBUTE BY clause does. Here's an example that sorts the weather dataset by
year and temperature, in such a way as to ensure that all the rows for a given year
end up in the same reducer partition:
hive> FROM records2
> SELECT year, temperature
> DISTRIBUTE BY year
> SORT BY year ASC, temperature DESC;
1949 111
1949 78
1950 22
1950 0
1950 -11

If the columns for SORT BY and DISTRIBUTE BY are the same, you can use
CLUSTER BY as a shorthand for specifying both.
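For example (a sketch; note that CLUSTER BY sorts only by the clustering columns,
so it is not identical to the query above, which also sorts by temperature):

hive> FROM records2
> SELECT year, temperature
> CLUSTER BY year;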

MapReduce Scripts

Using an approach like Hadoop Streaming, the TRANSFORM, MAP, and
REDUCE clauses make it possible to invoke an external script or program from
Hive.
For example:

hive> ADD FILE /path/to/is_good_quality.py;


hive> FROM records2
> SELECT TRANSFORM(year, temperature, quality)
> USING 'is_good_quality.py'
> AS year, temperature;
1949 111
1949 78
1950 0
1950 22
1950 -11

Before running the query, we need to register the script with Hive.
If we use a nested form for the query, we can specify a map and a reduce function. This time we use
the MAP and REDUCE keywords, but SELECT TRANSFORM in both cases would have the same result.

FROM
( FROM records2
MAP year, temperature, quality
USING 'is_good_quality.py'
AS year, temperature) map_output
REDUCE year, temperature
USING 'max_temperature_reduce.py' AS year, temperature;

Joins
One of the nice things about using Hive, rather than raw MapReduce, is that it
makes performing commonly used operations very simple. Join operations are a
case in point, given how involved they are to implement in MapReduce.

Inner joins

The simplest kind of join is the inner join, where each match in the input tables
results in a row in the output. Consider two small demonstration tables: sales,
which lists the names of people and the ID of the item they bought; and things,
which lists the item ID and its name:
hive> SELECT * FROM sales;
Joe 2
Hank 4
Ali 0
Eve 3
Hank 2
hive> SELECT * FROM things;
2 Tie
4 Coat
3 Hat
1 Scarf

We can perform an inner join on the two tables as follows:

hive> SELECT sales.*, things.*
> FROM sales JOIN things ON (sales.id = things.id);
Joe 2 2 Tie
Hank 2 2 Tie
Eve 3 3 Hat
Hank 4 4 Coat

Hive only supports equijoins, which means that only equality can be used in the
join predicate, which here matches on the id column in both tables.

In Hive, you can join on multiple columns in the join predicate by specifying a
series of expressions, separated by AND keywords. You can also join more than
two tables by supplying additional JOIN...ON... clauses in the query. Hive is
intelligent about trying to minimize the number of MapReduce jobs to perform
the joins.


A single join is implemented as a single MapReduce job, but multiple joins can
be performed in less than one MapReduce job per join if the same column is used
in the join condition. You can see how many MapReduce jobs Hive will use for
any particular query by prefixing it with the EXPLAIN keyword:
EXPLAIN
SELECT sales.*, things.*
FROM sales JOIN things ON (sales.id = things.id);

The EXPLAIN output includes many details about the execution plan for the
query, including the abstract syntax tree, the dependency graph for the stages that
Hive will execute, and information about each stage. Stages may be MapReduce
jobs or operations such as file moves. For even more detail, prefix the query with
EXPLAIN EXTENDED.

Hive currently uses a rule-based query optimizer for determining how to execute
a query, but it’s likely that in the future a cost-based optimizer will be added.

Outer joins

Outer joins allow you to find nonmatches in the tables being joined. In the current
example, when we performed an inner join, the row for Ali did not appear in the
output, since the ID of the item she purchased was not present in the things table.
If we change the join type to LEFT OUTER JOIN, then the query will return a
row for every row in the left table (sales), even if there is no corresponding row
in the table it is being joined to (things):
hive> SELECT sales.*, things.*
> FROM sales LEFT OUTER JOIN things ON (sales.id = things.id);
Ali 0 NULL NULL
Joe 2 2 Tie
Hank 2 2 Tie
Eve 3 3 Hat
Hank 4 4 Coat

Hive supports right outer joins, which reverse the roles of the tables relative to
the left join. In this case, all items from the things table are included, even those
that weren’t purchased by anyone (a scarf):
hive> SELECT sales.*, things.*
> FROM sales RIGHT OUTER JOIN things ON (sales.id = things.id);
NULL NULL 1 Scarf
Joe 2 2 Tie
Hank 2 2 Tie
Eve 3 3 Hat
Hank 4 4 Coat

Finally, there is a full outer join, where the output has a row for each row from
both tables in the join:
hive> SELECT sales.*, things.*
> FROM sales FULL OUTER JOIN things ON (sales.id = things.id);
Ali 0 NULL NULL
NULL NULL 1 Scarf
Joe 2 2 Tie
Hank 2 2 Tie
Eve 3 3 Hat
Hank 4 4 Coat

Semi joins

Hive doesn’t support IN subqueries (at the time of writing), but you can use a
LEFT SEMI JOIN to do the same thing. Consider this IN subquery, which finds
all the items in the things table that are in the sales table:
SELECT * FROM things WHERE things.id IN (SELECT id from sales);

We can rewrite it as follows:


hive> SELECT *
> FROM things LEFT SEMI JOIN sales ON (sales.id = things.id);
2 Tie
3 Hat
4 Coat

There is a restriction that we must observe for LEFT SEMI JOIN queries: the
right table (sales) may only appear in the ON clause. It cannot be referenced in a
SELECT expression, for example.

Map joins

If one table is small enough to fit in memory, then Hive can load the smaller table
into memory to perform the join in each of the mappers. The syntax for specifying
a map join is a hint embedded in an SQL C-style comment:
SELECT /*+ MAPJOIN(things) */ sales.*, things.*
FROM sales JOIN things ON (sales.id = things.id);

The job to execute this query has no reducers, so this query would not work for a
RIGHT or FULL OUTER JOIN, since absence of matching can only be detected
in an aggregating (reduce) step across all the inputs.


Subqueries

A subquery is a SELECT statement that is embedded in another SQL statement.


Hive has limited support for subqueries, only permitting a subquery in the FROM
clause of a SELECT statement.

The following query finds the mean maximum temperature for every year and
weather station:
SELECT station, year, AVG(max_temperature)
FROM (
SELECT station, year, MAX(temperature) AS max_temperature
FROM records2
WHERE temperature != 9999
AND (quality = 0 OR quality = 1 OR quality = 4 OR quality = 5 OR quality = 9)
GROUP BY station, year
) mt
GROUP BY station, year;

The subquery is used to find the maximum temperature for each station/date
combination, then the outer query uses the AVG aggregate function to find the
average of the maximum temperature readings for each station/date combination.

The outer query accesses the results of the subquery like it does a table, which is
why the subquery must be given an alias (mt). The columns of the subquery have
to be given unique names so that the outer query can refer to them.

Views
A view is a sort of “virtual table” that is defined by a SELECT statement. Views
can be used to present data to users in a different way from how it is actually
stored on disk. Often, the data from existing tables is simplified or aggregated in
a particular way that makes it convenient for further processing. Views may also
be used to restrict users’ access to particular subsets of tables that they are
authorized to see.

In Hive, a view is not materialized to disk when it is created; rather, the view’s
SELECT statement is executed when the statement that refers to the view is run.
If a view performs extensive transformations on the base tables, or is used
frequently, then you may choose to manually materialize it by creating a new
table that stores the contents of the view.
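
A minimal sketch of defining and querying a view, assuming the records2 table used
earlier (the view name and the filter are illustrative):

hive> CREATE VIEW valid_records
> AS
> SELECT * FROM records2
> WHERE temperature != 9999;
hive> SELECT COUNT(*) FROM valid_records;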

Fundamentals of HBase and Zookeeper


HBase is a data model similar to Google's Bigtable that is designed to provide
random access to high volumes of structured or unstructured data. HBase is an
important component of the Hadoop ecosystem that leverages the fault tolerance
feature of HDFS. HBase provides real-time read or write access to data
in HDFS. The HBase data model stores semi-structured data having different data
types, varying column sizes and field sizes. The layout of the HBase data model
eases data partitioning and distribution across the cluster. The HBase data model
consists of several logical components: row key, column family, table name,
timestamp, etc. The row key is used to uniquely identify the rows in HBase tables.
Column families in HBase are static, whereas the columns themselves are dynamic.
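For example, in the HBase shell a table is created with its column families declared
up front, while individual columns are added freely at write time (a sketch; the table
name 'users' and the column names are illustrative):

hbase> create 'users', 'info'
hbase> put 'users', 'row1', 'info:name', 'Ali'
hbase> put 'users', 'row1', 'info:city', 'Kochi'
hbase> get 'users', 'row1'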

Components of Apache HBase Architecture


HBase architecture has three important components: HMaster, Region Server, and
ZooKeeper.

i. HMaster

HBase HMaster is a lightweight process that assigns regions to region servers in
the Hadoop cluster for load balancing. Responsibilities of HMaster:

• Manages and Monitors the Hadoop Cluster


• Performs Administration (Interface for creating, updating and deleting
tables.)
• Controlling the failover
• DDL operations are handled by the HMaster
• Whenever a client wants to change the schema or modify any metadata,
HMaster is responsible for these operations.

ii. Region Server

These are the worker nodes which handle read, write, update, and delete requests
from clients. The Region Server process runs on every node in the Hadoop cluster.
A Region Server runs on an HDFS DataNode and consists of the following
components:

• Block Cache – This is the read cache. The most frequently read data is stored
in the read cache, and whenever the block cache is full, the least recently used
data is evicted.
• MemStore- This is the write cache and stores new data that is not yet
written to the disk. Every column family in a region has a MemStore.


• Write Ahead Log (WAL) is a file that stores new data that is not persisted
to permanent storage.
• HFile is the actual storage file that stores the rows as sorted key values on
a disk.

iii. Zookeeper

HBase uses ZooKeeper as a distributed coordination service for region
assignments and to recover from region server crashes by reassigning their regions
to other region servers that are functioning. ZooKeeper is a centralized monitoring
server that maintains configuration information and provides distributed
synchronization. Whenever a client wants to communicate with regions, it has to
approach ZooKeeper first. HMaster and the region servers are registered with the
ZooKeeper service, so a client needs to access the ZooKeeper quorum in order to
connect with region servers and HMaster. In case of node failure within an HBase
cluster, the ZooKeeper quorum will trigger error messages and start repairing
failed nodes.

The ZooKeeper service keeps track of all the region servers that are in an HBase
cluster, tracking how many region servers there are and which region servers
are holding which DataNode. HMaster contacts ZooKeeper to get the details of
region servers. Various services that ZooKeeper provides include:

• Establishing client communication with region servers.


• Tracking server failure and network partitions.
• Maintaining configuration information.
• Providing ephemeral nodes, which represent different region servers.

IBM InfoSphere BigInsights and Streams


InfoSphere BigInsights

InfoSphere BigInsights is the IBM enterprise-grade Hadoop offering. It is based
on industry-standard Apache Hadoop, but IBM provides extensive capabilities,
including installation and management facilities and additional tools and
utilities.

Special care is taken to ensure that InfoSphere BigInsights is 100% compatible
with open source Hadoop. The IBM capabilities provide the Hadoop developer
or administrator with additional choices and flexibility without locking them
into proprietary technology.

Here are some of the standard open source utilities in InfoSphere BigInsights:


• Pig
• Hive / HCatalog
• Oozie
• HBase
• ZooKeeper
• Flume
• Avro
• Chukwa
Key software capabilities

IBM InfoSphere BigInsights provides advanced software capabilities that are
not found in competing Hadoop distributions. Here are some of these
capabilities:

• Big SQL: Big SQL is a rich, ANSI-compliant SQL implementation. Big
SQL builds on 30 years of IBM experience in SQL and database
engineering. Big SQL has several advantages over competing SQL-on-
Hadoop implementations:
o SQL language compatibility
o Support for native data sources
o Performance
o Federation
o Security
• Big R: Big R is a set of libraries that provide end-to-end integration with
the popular R programming language that is included in InfoSphere
BigInsights. Big R provides a familiar environment for developers and data
scientists proficient with the R language.
• Big Sheets: Big Sheets is a spreadsheet-style data manipulation and
visualization tool that allows business users to access and analyze data in
Hadoop without the need to be knowledgeable in Hadoop scripting
languages or MapReduce programming.
• Application Accelerators: IBM InfoSphere BigInsights extends the
capabilities of open source Hadoop with accelerators that use pre-written
capabilities for common big data use cases to quickly build high-quality
applications. Here are some of the accelerators that are included in
InfoSphere BigInsights:
o Text Analytics Accelerators: A set of facilities for developing
applications that analyze text across multiple spoken languages
o Machine Data Accelerators: Tools that are aimed at developers that
make it easy to develop applications that process log files, including
web logs, mail logs, and various specialized file formats


o Social Data Accelerators: Tools to easily import and analyze social
data at scale from multiple online sources, including tweets, boards,
and blogs
• Adaptive MapReduce: An alternative, Hadoop-compatible scheduling
framework that provides enhanced performance for latency-sensitive
Hadoop MapReduce jobs.

InfoSphere Streams

InfoSphere® Streams consists of a programming language, an API, and an
integrated development environment (IDE) for applications, and a runtime
system that can run the applications on a single or distributed set of resources.
The Streams Studio IDE includes tools for authoring and creating visual
representations of streams processing applications.

InfoSphere Streams is designed to address the following data processing
platform objectives:

• A parallel and high-performance streams processing software platform that
can scale over a range of hardware environments
• Automated deployment of streams processing applications on configured
hardware
• Incremental deployment without restarting to extend streams processing
applications
• Secure and auditable run time environment

The InfoSphere Streams architecture represents a significant change in
computing system organization and capability. InfoSphere Streams provides a
runtime platform, programming model, and tools for applications that are
required to process continuous data streams. The need for such applications arises
in environments where information from one to many data streams can be used
to alert humans or other systems, or to populate knowledge bases for later queries.

InfoSphere Streams is especially powerful in environments where traditional
batch or transactional systems might not be sufficient, for example:

• The environment must process many data streams at high rates.


• Complex processing of the data streams is required.
• Low latency is needed when processing the data streams.

VISUALIZATIONS
Visualization is the first step to make sense of data. To translate and present
complex data and relations in a simple way, data analysts use different
methods of data visualization: charts, diagrams, maps, etc. Choosing the
right technique and its setup is often the only way to make data
understandable. Conversely, poorly selected tactics won't unlock the
full potential of the data and may even make it irrelevant.
5 factors that influence data visualization choices:
1. Audience. It’s important to adjust data representation to the specific
target audience. For example, fitness mobile app users who browse
through their progress can easily work with uncomplicated
visualizations. On the other hand, if data insights are intended for
researchers and experienced decision-makers who regularly work with
data, you can and often have to go beyond simple charts.
2. Content. The type of data you are dealing with will determine the
tactics. For example, if it’s time-series metrics, you will use line charts
to show the dynamics in many cases. To show the relationship between
two elements, scatter plots are often used. In turn, bar charts work well
for comparative analysis.
3. Context. You can use different data visualization approaches and read
data depending on the context. To emphasize a certain figure, for
example, significant profit growth, you can use the shades of one color
on the chart and highlight the highest value with the brightest one. On
the contrary, to differentiate elements, you can use contrast colors.
4. Dynamics. There are various types of data, and each type has a different
rate of change. For example, financial results can be measured monthly
or yearly, while time series and tracking data are changing constantly.
Depending on the rate of change, you may consider dynamic
representation (streaming) or static data visualization techniques in data
mining.
5. Purpose. The goal of data visualization affects the way it is
implemented. In order to make a complex analysis, visualizations are
compiled into dynamic and controllable dashboards equipped with
different tools for visual data analytics (comparison, formatting,
filtering, etc.). However, dashboards are not necessary to show a single
or occasional data insight.


Data visualization techniques

Depending on these factors, you can choose different data visualization
techniques and configure their features. Here are the common types of data
visualization techniques:

Charts
The easiest way to show the development of one or several data sets is a chart.
Charts vary from bar and line charts that show the relationship between elements
over time to pie charts that demonstrate the components or proportions between
the elements of one whole.

Plots
Plots allow you to distribute two or more data sets over a 2D or even 3D space to
show the relationship between these sets and the parameters on the plot. Plots also
vary: scatter and bubble plots are some of the most widely used visualizations.
When it comes to big data, analysts often use more complex box plots to visualize
the relationships between large volumes of data.

Maps


Maps are popular techniques used for data visualization in different industries.
They allow locating elements on relevant objects and areas: geographical maps,
building plans, website layouts, etc. Among the most popular map visualizations
are heat maps, dot distribution maps, and cartograms.

Diagrams and matrices


Diagrams are usually used to demonstrate complex data relationships and links
and include various types of data in one visual representation. They can be
hierarchical, multidimensional, tree-like.

A matrix is one of the advanced data visualization techniques that help determine
the correlation between multiple constantly updating (streaming) data sets.

