BIG DATA ANALYSIS
MODULE 5: FRAMEWORKS
APPLICATIONS ON BIG DATA USING PIG AND HIVE
Pig:
Pig is used for the analysis of large amounts of data. It is an abstraction over
MapReduce and is used to perform all kinds of data manipulation operations in
Hadoop. It provides the Pig Latin language for writing code, which contains many
built-in functions such as join, filter, etc. The two parts of Apache Pig are Pig
Latin and the Pig engine. The Pig engine converts Pig Latin scripts into
MapReduce jobs. Pig works at a higher level of abstraction and requires
fewer lines of code than the equivalent MapReduce program.
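For instance, a short Pig Latin sketch such as the following (the file and field names are illustrative, not from this module) replaces what would be a much longer Java MapReduce program:
-- load a tab-separated file of (visitor, url) pairs; names are illustrative
records = LOAD 'visits.txt' AS (visitor:chararray, url:chararray);
grouped = GROUP records BY url;
counts = FOREACH grouped GENERATE group, COUNT(records);
STORE counts INTO 'url_counts';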
Hive:
Hive is built on top of Hadoop and is used to process structured data in
Hadoop. Hive was developed by Facebook. It provides a querying language
known as Hive Query Language (HiveQL). Apache Hive is a data warehouse
that provides an SQL-like interface between the user and the Hadoop
Distributed File System (HDFS).
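As a rough illustration (the table and column names are assumptions, not part of this module), the same kind of aggregation in HiveQL might look like:
hive> CREATE TABLE visits (visitor STRING, url STRING)
> ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t';
hive> SELECT url, COUNT(*) FROM visits GROUP BY url;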
Difference between Pig and Hive:
1. Pig is a procedural data flow language, whereas Hive is a declarative SQL-like language.
2. Pig was developed by Yahoo; Hive was developed by Facebook.
3. Pig is used by researchers and programmers; Hive is mainly used by data analysts.
4. Pig is used to handle structured and semi-structured data; Hive is mainly used to handle structured data.
5. Pig is used for programming; Hive is used for creating reports.
6. Pig does not have a dedicated metadata database; Hive defines tables beforehand using a dedicated SQL-like DDL and stores their metadata.
7. Pig does not support schemas for storing data; Hive supports schemas for data insertion in tables.
Filtering Data
Once you have some data loaded into a relation, the next step is often to filter it
to remove the data that you are not interested in. By filtering early in the
processing pipeline, you minimize the amount of data flowing through the
system, which can improve efficiency.
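For example, using the relation A shown in the next section, a FILTER statement keeps only the tuples that satisfy a condition (a minimal sketch); for that data it would give:
grunt> apples = FILTER A BY $1 == 'apple';
grunt> DUMP apples;
(Ali,apple,3)
(Eve,apple,7)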
FOREACH...GENERATE
The FOREACH...GENERATE operator is used to act on every row in a relation.
It can be used to remove fields or to generate new ones.
In this example, we do both:
grunt> DUMP A;
(Joe,cherry,2)
(Ali,apple,3)
(Joe,banana,2)
(Eve,apple,7)
grunt> B = FOREACH A GENERATE $0, $2+1, 'Constant';
grunt> DUMP B;
(Joe,3,Constant)
(Ali,4,Constant)
(Joe,3,Constant)
(Eve,8,Constant)
Here we have created a new relation B with three fields. Its first field is a
projection of the first field ($0) of A. B’s second field is the third field of A ($2)
with one added to it. B’s third field is a constant field (every row in B has the
same third field) with the chararray value Constant. The
FOREACH...GENERATE operator has a nested form to support more complex
processing. A nested FOREACH...GENERATE must always have a
GENERATE statement as the last nested statement, which generates the summary
fields of interest using the grouped records, as well as the relations created in the
nested block.
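A minimal sketch of the nested form, assuming the relation A shown above, counts the distinct fruits bought by each person:
grouped = GROUP A BY $0;
counts = FOREACH grouped {
    -- fruits is a nested relation, valid only inside this block
    fruits = DISTINCT A.$1;
    GENERATE group, COUNT(fruits);
};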
STREAM
The STREAM operator allows you to transform data in a relation using an
external program or script.
JOIN
Pig provides support for joining relations with the JOIN operator. Consider the
relations A and B:
grunt> DUMP A;
(2,Tie)
(4,Coat)
(3,Hat)
(1,Scarf)
grunt> DUMP B;
(Joe,2)
(Hank,4)
(Ali,0)
(Eve,3)
(Hank,2)
We can join the two relations on the numerical (identity) field in each:
grunt> C = JOIN A BY $0, B BY $1;
grunt> DUMP C;
(2,Tie,Joe,2)
(2,Tie,Hank,2)
(3,Hat,Eve,3)
(4,Coat,Hank,4)
You should use the general join operator if all the relations being joined are too
large to fit in memory. If one of the relations is small enough to fit in memory,
there is a special type of join called a fragment replicate join, which is
implemented by distributing the small input to all the mappers and performing a
map-side join using an in-memory lookup table against the (fragmented) larger
relation. There is a special syntax for telling Pig to use a fragment replicate join:
grunt> C = JOIN A BY $0, B BY $1 USING 'replicated';
COGROUP
JOIN always gives a flat structure: a set of tuples. The COGROUP statement is
similar to JOIN, but creates a nested set of output tuples. This can be useful if you
want to exploit the structure in subsequent statements:
grunt> D = COGROUP A BY $0, B BY $1;
grunt> DUMP D;
(0,{},{(Ali,0)})
(1,{(1,Scarf)},{})
(2,{(2,Tie)},{(Joe,2),(Hank,2)})
(3,{(3,Hat)},{(Eve,3)})
(4,{(4,Coat)},{(Hank,4)})
COGROUP generates a tuple for each unique grouping key. The first field of each
tuple is the key, and the remaining fields are bags of tuples from the relations with
a matching key. The first bag contains the matching tuples from relation A with
the same key. Similarly, the second bag contains the matching tuples from
relation B with the same key.
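For example, a minimal sketch that keeps only the groups where both relations contribute at least one tuple, using Pig's built-in IsEmpty function:
grunt> E = FILTER D BY NOT IsEmpty(A) AND NOT IsEmpty(B);
grunt> DUMP E;
(2,{(2,Tie)},{(Joe,2),(Hank,2)})
(3,{(3,Hat)},{(Eve,3)})
(4,{(4,Coat)},{(Hank,4)})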
CROSS
Pig Latin includes the cross-product operator (also known as the Cartesian
product), which joins every tuple in a relation with every tuple in a second relation
(and with every tuple in further relations if supplied). The size of the output is the
product of the size of the inputs, potentially making the output very large:
grunt> I = CROSS A, B;
grunt> DUMP I;
(2,Tie,Joe,2)
(2,Tie,Hank,4)
(2,Tie,Ali,0)
(2,Tie,Eve,3)
(2,Tie,Hank,2)
(4,Coat,Joe,2)
(4,Coat,Hank,4)
(4,Coat,Ali,0)
(4,Coat,Eve,3)
(4,Coat,Hank,2)
(3,Hat,Joe,2)
(3,Hat,Hank,4)
(3,Hat,Ali,0)
(3,Hat,Eve,3)
(3,Hat,Hank,2)
(1,Scarf,Joe,2)
(1,Scarf,Hank,4)
(1,Scarf,Ali,0)
(1,Scarf,Eve,3)
(1,Scarf,Hank,2)
When dealing with large datasets, you should try to avoid operations that
generate intermediate representations that are quadratic (or worse) in size.
Computing the cross-product of the whole input dataset is rarely needed, if ever.
GROUP
Although COGROUP groups the data in two or more relations, the GROUP
statement groups the data in a single relation. GROUP supports grouping by more
than equality of keys: you can use an expression or user-defined function as the
group key. For example, consider the following relation A:
grunt> DUMP A;
(Joe,cherry)
(Ali,apple)
(Joe,banana)
(Eve,apple)
We can group A by the length of the second field (the fruit name):
grunt> B = GROUP A BY SIZE($1);
grunt> DUMP B;
(5,{(Ali,apple),(Eve,apple)})
(6,{(Joe,cherry),(Joe,banana)})
GROUP creates a relation whose first field is the grouping field, which is given
the alias group. The second field is a bag containing the grouped fields with the
same schema as the original relation (in this case, A).
There are also two special grouping operations: ALL and ANY. ALL groups all
the tuples in a relation in a single group, as if the GROUP function was a constant:
grunt> C = GROUP A ALL;
grunt> DUMP C;
(all,{(Joe,cherry),(Ali,apple),(Joe,banana),(Eve,apple)})
Note that there is no BY in this form of the GROUP statement. The ALL grouping
is commonly used to count the number of tuples in a relation.
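For example, a minimal sketch that counts the tuples in A using the grouped relation above:
grunt> cnt = FOREACH C GENERATE COUNT(A);
For the four-tuple relation shown, this yields 4.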
Sorting Data
Relations are unordered in Pig. There is no guarantee which order the rows
will be processed in. In particular, when retrieving the contents of A using DUMP
or STORE, the rows may be written in any order. If you want to impose an order
on the output, you can use the ORDER operator to sort a relation by one or more
fields. The default sort order compares fields of the same type using the natural
ordering, and different types are given an arbitrary, but deterministic, ordering.
The following example sorts A by the first field in ascending order and by the
second field in descending order:
grunt> B = ORDER A BY $0, $1 DESC;
grunt> DUMP B;
(1,2)
(2,4)
(2,3)
The LIMIT statement is useful for limiting the number of results, as a quick and
dirty way to get a sample of a relation; prototyping (the ILLUSTRATE
command) should be preferred for generating more representative samples of the
data. It can be used immediately after the ORDER statement to retrieve the first
n tuples. Usually, LIMIT will select any n tuples from a relation, but when used
immediately after an ORDER statement, the order is retained (in an exception to
the rule that processing a relation does not retain its order):
grunt> D = LIMIT B 2;
grunt> DUMP D;
(1,2)
(2,4)
If the limit is greater than the number of tuples in the relation, all tuples are
returned (so LIMIT has no effect).
Combining and Splitting Data
Sometimes you have several relations that you would like to combine into one.
For this, the UNION statement is used. For example:
grunt> DUMP A;
(2,3)
(1,2)
(2,4)
grunt> DUMP B;
(z,x,8)
(w,y,1)
grunt> C = UNION A, B;
grunt> DUMP C;
(2,3)
(1,2)
(2,4)
(z,x,8)
(w,y,1)
C is the union of relations A and B, and since relations are unordered, the order
of the tuples in C is undefined. Also, it’s possible to form the union of two
relations with different schemas or with different numbers of fields, as we have
done here. Pig attempts to merge the schemas from the relations that UNION is
operating on. In this case, they are incompatible, so C has no schema:
grunt> DESCRIBE A;
A: {f0: int,f1: int}
grunt> DESCRIBE B;
B: {f0: chararray,f1: chararray,f2: int}
grunt> DESCRIBE C;
Schema for C unknown.
If the output relation has no schema, your script needs to be able to handle tuples
that vary in the number of fields and/or types.
The SPLIT operator is the opposite of UNION; it partitions a relation into two
or more relations.
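A minimal sketch, using the numeric relation A shown above:
grunt> SPLIT A INTO X IF $0 <= 1, Y IF $0 > 1;
grunt> DUMP X;
(1,2)
grunt> DUMP Y;
(2,3)
(2,4)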
Hive Services
The Hive shell is only one of several services that you can run using the hive
command. You can specify the service to run using the --service option. Type
hive --service help to get a list of available service names; the most useful are
described below.
cli
The command line interface to Hive (the shell). This is the default service.
hiveserver
Runs Hive as a server exposing a Thrift service, enabling access from a range
of clients written in different languages. Applications using the Thrift, JDBC, and
ODBC connectors need to run a Hive server to communicate with Hive. Set the
HIVE_PORT environment variable to specify the port the server will listen on
(it defaults to 10000).
hwi
The Hive Web Interface.
jar
The Hive equivalent to hadoop jar, a convenient way to run Java applications
that include both Hadoop and Hive classes on the classpath.
Metastore
By default, the metastore is run in the same process as the Hive service.
Using this service, it is possible to run the metastore as a standalone (remote)
process. Set the METASTORE_PORT environment variable to specify the port
the server will listen on.
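For example, a sketch of starting the Thrift server on the default port (the exact invocation may vary between Hive versions):
% export HIVE_PORT=10000
% hive --service hiveserver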
HiveQL
Hive’s SQL dialect, called HiveQL, does not support the full SQL-92
specification. Hive has some extensions that are not in SQL-92, which have been
inspired by syntax from other database systems, notably MySQL. Table 12-2
provides a high-level comparison of SQL and HiveQL.
Data Types
Hive supports both primitive and complex data types. Primitives include numeric,
boolean, string, and timestamp types. The complex data types include arrays,
maps, and structs. Hive’s data types are listed in Table 12-3. Note that the literals
shown are those used from within HiveQL; they are not the serialized form used
in the table’s storage format.
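As an illustration (the table and column names are assumptions), the complex types can be declared in a table definition like this:
CREATE TABLE complex (
  c1 ARRAY<INT>,
  c2 MAP<STRING, INT>,
  c3 STRUCT<a:STRING, b:INT, c:DOUBLE>
);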
Primitive types
Hive’s primitive types correspond roughly to Java’s, although some names are
influenced by MySQL’s type names (some of which, in turn, overlap with SQL-
92). There are four signed integral types: TINYINT, SMALLINT, INT, and
BIGINT, which are equivalent to Java’s byte, short, int, and long primitive types,
respectively; they are 1-byte, 2-byte, 4-byte, and 8-byte signed integers.
Hive’s floating-point types, FLOAT and DOUBLE, correspond to Java’s float
and double, which are 32-bit and 64-bit floating point numbers. Unlike some
databases, there is no option to control the number of significant digits or decimal
places stored for floating point values.
Hive supports a BOOLEAN type for storing true and false values.
There is a single Hive data type for storing text, STRING, which is a variable-
length character string. Hive's STRING type is like VARCHAR in other
databases, although it does not require a maximum length to be declared.
Functions
When there is no built-in function that does what you want, you can write your
own user-defined function (UDF).
QUERYING DATA IN HIVE
This section discusses how to use various forms of the SELECT statement to
retrieve data from Hive.
Sorting and Aggregating
Sorting data in Hive can be achieved by use of a standard ORDER BY clause, but
there is a catch. ORDER BY produces a result that is totally sorted, as expected,
but to do so it sets the number of reducers to one, making it very inefficient for
large datasets.
When a globally sorted result is not required—and in many cases it isn’t—then
you can use Hive’s nonstandard extension, SORT BY instead. SORT BY
produces a sorted file per reducer.
In some cases, you want to control which reducer a particular row goes to,
typically so you can perform some subsequent aggregation. This is what Hive’s
DISTRIBUTE BY clause does. Here's an example that sorts the weather dataset by
year and temperature, in such a way as to ensure that all the rows for a given year
end up in the same reducer partition:
hive> FROM records2
> SELECT year, temperature
> DISTRIBUTE BY year
> SORT BY year ASC, temperature DESC;
1949 111
1949 78
1950 22
1950 0
1950 -11
If the columns for SORT BY and DISTRIBUTE BY are the same, you can use
CLUSTER BY as a shorthand for specifying both.
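For example, if we distributed and sorted only by year, the query above could be written as follows (a sketch):
hive> FROM records2
> SELECT year, temperature
> CLUSTER BY year;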
MapReduce Scripts
Before running the query, we need to register the script with Hive (using ADD
FILE), so that Hive ships it to the cluster.
If we use a nested form for the query, we can specify a map and a reduce function.
This time we use the MAP and REDUCE keywords, but SELECT TRANSFORM in both cases
would have the same result. A sketch of the complete nested query follows; the
reducer script name is illustrative, not part of the original example:
FROM (
  FROM records2
  MAP year, temperature, quality
  USING 'is_good_quality.py'
  AS year, temperature) map_output
REDUCE year, temperature
-- the reducer script name below is illustrative
USING 'max_temperature_reduce.py'
AS year, temperature;
Joins
One of the nice things about using Hive, rather than raw MapReduce, is that it
makes performing commonly used operations very simple. Join operations are a
case in point, given how involved they are to implement in MapReduce.
Inner joins
The simplest kind of join is the inner join, where each match in the input tables
results in a row in the output. Consider two small demonstration tables: sales,
which lists the names of people and the ID of the item they bought; and things,
which lists the item ID and its name:
hive> SELECT * FROM sales;
Joe 2
Hank 4
Ali 0
Eve 3
Hank 2
hive> SELECT * FROM things;
2 Tie
4 Coat
3 Hat
1 Scarf
We can perform an inner join of the two tables as follows:
hive> SELECT sales.*, things.*
> FROM sales JOIN things ON (sales.id = things.id);
Hive only supports equijoins, which means that only equality can be used in the
join predicate; here the join matches on the id column in both tables.
In Hive, you can join on multiple columns in the join predicate by specifying a
series of expressions, separated by AND keywords. You can also join more than
two tables by supplying additional JOIN...ON... clauses in the query. Hive is
intelligent about trying to minimize the number of MapReduce jobs to perform
the joins.
A single join is implemented as a single MapReduce job, but multiple joins can
be performed in less than one MapReduce job per join if the same column is used
in the join condition. You can see how many MapReduce jobs Hive will use for
any particular query by prefixing it with the EXPLAIN keyword:
EXPLAIN
SELECT sales.*, things.*
FROM sales JOIN things ON (sales.id = things.id);
The EXPLAIN output includes many details about the execution plan for the
query, including the abstract syntax tree, the dependency graph for the stages that
Hive will execute, and information about each stage. Stages may be MapReduce
jobs or operations such as file moves. For even more detail, prefix the query with
EXPLAIN EXTENDED.
Hive currently uses a rule-based query optimizer for determining how to execute
a query, but it’s likely that in the future a cost-based optimizer will be added.
Outer joins
Outer joins allow you to find nonmatches in the tables being joined. In the current
example, when we performed an inner join, the row for Ali did not appear in the
output, since the ID of the item she purchased was not present in the things table.
If we change the join type to LEFT OUTER JOIN, then the query will return a
row for every row in the left table (sales), even if there is no corresponding row
in the table it is being joined to (things):
hive> SELECT sales.*, things.*
> FROM sales LEFT OUTER JOIN things ON (sales.id = things.id);
Ali 0 NULL NULL
Joe 2 2 Tie
Hank 2 2 Tie
Eve 3 3 Hat
Hank 4 4 Coat
Hive also supports right outer joins, which reverse the roles of the tables relative
to the left join. In this case, all items from the things table are included, even those
that weren't purchased by anyone (a scarf):
hive> SELECT sales.*, things.*
> FROM sales RIGHT OUTER JOIN things ON (sales.id = things.id);
NULL NULL 1 Scarf
Joe 2 2 Tie
Hank 2 2 Tie
Eve 3 3 Hat
Hank 4 4 Coat
Finally, there is a full outer join, where the output has a row for each row from
both tables in the join:
hive> SELECT sales.*, things.*
> FROM sales FULL OUTER JOIN things ON (sales.id = things.id);
Ali 0 NULL NULL
NULL NULL 1 Scarf
Joe 2 2 Tie
Hank 2 2 Tie
Eve 3 3 Hat
Hank 4 4 Coat
Semi joins
Hive doesn’t support IN subqueries (at the time of writing), but you can use a
LEFT SEMI JOIN to do the same thing. Consider this IN subquery, which finds
all the items in the things table that are in the sales table:
SELECT * FROM things WHERE things.id IN (SELECT id from sales);
We can rewrite it with a LEFT SEMI JOIN as follows:
hive> SELECT *
> FROM things LEFT SEMI JOIN sales ON (sales.id = things.id);
There is a restriction that we must observe for LEFT SEMI JOIN queries: the
right table (sales) may only appear in the ON clause. It cannot be referenced in a
SELECT expression, for example.
Map joins
If one table is small enough to fit in memory, then Hive can load the smaller table
into memory to perform the join in each of the mappers. The syntax for specifying
a map join is a hint embedded in an SQL C-style comment:
SELECT /*+ MAPJOIN(things) */ sales.*, things.*
FROM sales JOIN things ON (sales.id = things.id);
The job to execute this query has no reducers, so this query would not work for a
RIGHT or FULL OUTER JOIN, since absence of matching can only be detected
in an aggregating (reduce) step across all the inputs.
Subqueries
The following query finds the mean maximum temperature for every year and
weather station:
SELECT station, year, AVG(max_temperature)
FROM (
SELECT station, year, MAX(temperature) AS max_temperature
FROM records2
WHERE temperature != 9999
AND (quality = 0 OR quality = 1 OR quality = 4 OR quality = 5 OR quality = 9)
GROUP BY station, year
) mt
GROUP BY station, year;
The subquery is used to find the maximum temperature for each station/year
combination, and the outer query then uses the AVG aggregate function to find the
average of those maximum temperature readings for each station/year combination.
The outer query accesses the results of the subquery like it does a table, which is
why the subquery must be given an alias (mt). The columns of the subquery have
to be given unique names so that the outer query can refer to them.
Views
A view is a sort of “virtual table” that is defined by a SELECT statement. Views
can be used to present data to users in a different way to the way it is actually
stored on disk. Often, the data from existing tables is simplified or aggregated in
a particular way that makes it convenient for further processing. Views may also
be used to restrict users’ access to particular subsets of tables that they are
authorized to see.
In Hive, a view is not materialized to disk when it is created; rather, the view’s
SELECT statement is executed when the statement that refers to the view is run.
If a view performs extensive transformations on the base tables, or is used
frequently, then you may choose to manually materialize it by creating a new
table that stores the contents of the view.
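A minimal sketch of creating a view over the records2 table used earlier (the view name is illustrative):
CREATE VIEW valid_records AS
SELECT *
FROM records2
WHERE temperature != 9999
AND (quality = 0 OR quality = 1 OR quality = 4 OR quality = 5 OR quality = 9);
The view can then be queried like a table, for example SELECT MAX(temperature) FROM valid_records;.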
HBase
HBase is a data model similar to Google's Bigtable that is designed to provide
random access to a high volume of structured or unstructured data. HBase is an
important component of the Hadoop ecosystem that leverages the fault-tolerance
feature of HDFS. HBase provides real-time read or write access to data
in HDFS. The HBase data model stores semi-structured data having different data
types and varying column and field sizes. The layout of the HBase data model eases
data partitioning and distribution across the cluster. The HBase data model consists of
several logical components: row key, column family, table name, timestamp, etc.
The row key is used to uniquely identify the rows in HBase tables. Column families
in HBase are static, whereas the columns, by themselves, are dynamic.
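As an illustration of the model (the table, column family, and values are assumptions), the HBase shell can create a table with one column family and store a cell addressed by row key and column family:qualifier:
hbase> create 'sensor_data', 'readings'
hbase> put 'sensor_data', 'row1', 'readings:temperature', '98'
hbase> get 'sensor_data', 'row1'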
i. HMaster
HMaster is the master process: it assigns regions to region servers, balances the
load across them, and handles administrative operations such as creating and
deleting tables.
ii. Region Server
Region servers are the worker nodes which handle read, write, update, and delete
requests from clients. The Region Server process runs on every node in the Hadoop
cluster. A Region Server runs on an HDFS DataNode and consists of the following
components:
• Block Cache – This is the read cache. The most frequently read data is stored
in the read cache, and whenever the block cache is full, the least recently used
data is evicted.
• MemStore – This is the write cache and stores new data that has not yet been
written to disk. Every column family in a region has its own MemStore.
• Write Ahead Log (WAL) – This is a file that stores new data that has not yet
been persisted to permanent storage; it is used for recovery in the case of failure.
• HFile – This is the actual storage file that stores the rows as sorted key-values
on disk.
iii. ZooKeeper
The ZooKeeper service keeps track of all the region servers in an HBase cluster:
how many region servers there are and which region servers are holding which
regions. HMaster contacts ZooKeeper to get the details of the region servers.
Services that ZooKeeper provides include maintaining configuration information,
naming, distributed synchronization, and group services.
InfoSphere BigInsights
Here are some of the standard open source utilities in InfoSphere BigInsights:
• PIG
• Hive / HCatalog
• Oozie
• HBASE
• Zookeeper
• Flume
• Avro
• Chukwa
Key software capabilities:
• InfoSphere Streams
VISUALIZATIONS
Visualization is the first step in making sense of data. To translate and present
complex data and relations in a simple way, data analysts use different
visualization techniques.
Charts
The easiest way to show the development of one or several data sets is a chart.
Charts vary from bar and line charts that show the relationship between elements
over time to pie charts that demonstrate the components or proportions between
the elements of one whole.
Plots
Plots allow you to distribute two or more data sets over a 2D or even 3D space to
show the relationship between these sets and the parameters on the plot. Plots also
vary: scatter and bubble plots are some of the most widely used visualizations.
When it comes to big data, analysts often use more complex box plots to visualize
the relationships between large volumes of data.
Maps
Maps are popular techniques used for data visualization in different industries.
They allow locating elements on relevant objects and areas — geographical maps,
building plans, website layouts, etc. Among the most popular map visualizations
are heat maps, dot distribution maps, and cartograms.
Matrix
A matrix is one of the advanced data visualization techniques; it helps determine
the correlation between multiple constantly updating (streaming) data sets.