
Course Name: Bigdata Analytics

Course Coordinator: V Gopinath

UNIT 4

HADOOP FRAMEWORKS & APPLICATIONS

(Pig, Hive, HBase, ZooKeeper)


APPLICATIONS ON BIG DATA USING PIG AND HIVE:

Both Apache Pig and Apache Hive are high-level frameworks built on top of Apache Hadoop,
designed to simplify and accelerate the development of data processing applications. They
provide an abstraction layer over the complexities of Hadoop, allowing developers and data
analysts to write queries and transformations using familiar languages (Pig Latin for Pig and
SQL-like HiveQL for Hive) without needing to write complex MapReduce jobs manually.

Here are some common applications of Apache Pig and Apache Hive in big data environments:

1. Data Transformation and ETL (Extract, Transform, Load):

• Pig and Hive are commonly used for preprocessing large volumes of data stored
in various formats (e.g., CSV, JSON, XML) into a format suitable for analysis or
loading into data warehouses.

2. Data Analysis and Exploration:

• HiveQL provides SQL-like querying capabilities, making it easy for analysts and
data scientists to write complex queries to analyze and explore large datasets
stored in Hadoop Distributed File System (HDFS) or Hadoop-compatible file
systems.

• Pig Latin offers a more procedural approach to data processing, allowing users to
define custom data flows and transformations using a scripting language. This
flexibility is useful for scenarios where SQL-like querying is insufficient.

3. Batch Processing:

• Both Pig and Hive are suitable for performing batch processing tasks on large
datasets. They leverage the parallel processing capabilities of Hadoop to distribute
computation across a cluster of machines, enabling efficient processing of large
volumes of data.

4. Data Warehousing:

• Hive is often used as a data warehousing solution on Hadoop clusters. It allows
users to define schema on read, making it easy to query structured data stored in
HDFS using familiar SQL-like syntax.

• Pig can also be used for data warehousing tasks, although it is more commonly
used for ETL and data processing tasks.

5. Ad-hoc Data Analysis:

• Hive's interactive querying capabilities make it well-suited for ad-hoc data
analysis tasks. Users can run SQL-like queries on large datasets stored in Hadoop
without the need to predefine data structures or write complex MapReduce jobs.

• Pig's scripting language allows users to write custom data processing scripts for
ad-hoc analysis tasks. While not as SQL-like as HiveQL, Pig Latin provides
greater flexibility and control over data processing workflows.

6. Data Pipelines and Workflow Orchestration:

• Pig and Hive can be integrated with workflow orchestration tools like Apache
Oozie or Apache Airflow to create complex data pipelines for ETL, data
processing, and analysis tasks. These tools enable users to define dependencies
between jobs and schedule their execution at regular intervals.

INTRODUCTION TO PIG:
Apache Pig is an abstraction over MapReduce. It is a tool/platform used to analyze large data
sets by representing them as data flows. Pig is generally used with Hadoop; we can perform all
the data manipulation operations in Hadoop using Pig.

To write data analysis programs, Pig provides a high-level language known as Pig Latin. This
language provides various operators using which programmers can develop their own functions
for reading, writing, and processing data.

To analyze data using Apache Pig, programmers need to write scripts using Pig Latin language.
All these scripts are internally converted to Map and Reduce tasks. Apache Pig has a component
known as Pig Engine that accepts the Pig Latin scripts as input and converts those scripts into
MapReduce jobs.

Apache Pig Vs MapReduce

Listed below are the major differences between Apache Pig and MapReduce.

• Apache Pig is a data flow language, whereas MapReduce is a data processing paradigm.
• Pig Latin is a high-level language, whereas MapReduce is low level and rigid.
• Performing a Join operation in Apache Pig is pretty simple, whereas it is quite difficult in MapReduce to perform a Join operation between datasets.
• Any novice programmer with a basic knowledge of SQL can work conveniently with Apache Pig, whereas exposure to Java is a must to work with MapReduce.
• Apache Pig uses a multi-query approach, thereby reducing the length of the code to a great extent, whereas MapReduce requires almost 20 times more lines of code to perform the same task.
• There is no need for compilation; on execution, every Apache Pig operator is converted internally into a MapReduce job, whereas MapReduce jobs have a long compilation process.

Features of Pig

Apache Pig comes with the following features −

• Rich set of operators − It provides many operators to perform operations like join, sort,
filter, etc.
• Ease of programming − Pig Latin is similar to SQL and it is easy to write a Pig script
if you are good at SQL.
• Optimization opportunities − The tasks in Apache Pig optimize their execution
automatically, so the programmers need to focus only on semantics of the language.

• Extensibility − Using the existing operators, users can develop their own functions to
read, process, and write data.
• UDFs − Pig provides the facility to create User Defined Functions in other
programming languages such as Java, and to invoke or embed them in Pig scripts.
• Handles all kinds of data − Apache Pig analyzes all kinds of data, both structured as
well as unstructured. It stores the results in HDFS.

APACHE PIG - ARCHITECTURE

The language used to analyze data in Hadoop using Pig is known as Pig Latin. It is a high level
data processing language which provides a rich set of data types and operators to perform
various operations on the data.

To perform a particular task using Pig, programmers need to write a Pig script in the Pig Latin
language and execute it using any of the execution mechanisms (Grunt shell, UDFs, embedded
mode). After execution, these scripts go through a series of transformations applied by the Pig
framework to produce the desired output.

Internally, Apache Pig converts these scripts into a series of MapReduce jobs, thus making the
programmer's job easy. The main components of the Apache Pig architecture are described below.

Apache Pig Components

There are various components in the Apache Pig framework. Let us take a look at the major
components.

Parser


Initially the Pig scripts are handled by the Parser. It checks the syntax of the script, does type
checking, and other miscellaneous checks. The output of the parser is a DAG (directed acyclic
graph), which represents the Pig Latin statements and logical operators.

In the DAG, the logical operators of the script are represented as nodes and the data flows are
represented as edges.

Optimizer

The logical plan (DAG) is passed to the logical optimizer, which carries out the logical
optimizations such as projection and pushdown.

Compiler

The compiler compiles the optimized logical plan into a series of MapReduce jobs.

Execution engine

Finally, the MapReduce jobs are submitted to Hadoop in a sorted order, where they are executed
to produce the desired results.

PIG LATIN DATA MODEL ( PIG DATA TYPES )

The data model of Pig Latin is fully nested and allows complex non-atomic data types such as
map and tuple. The elements of Pig Latin's data model are described below.

Atom: An atom is any single value, such as a string or a number ('Diego', for example). Pig's
atomic values are scalar types found in most programming languages: int, long, float, double,
chararray, and bytearray.

Tuple: A tuple is a record that consists of a sequence of fields. Each field can be of any type —
‘Diego’, ‘Gomez’, or 6, for example. Think of a tuple as a row in a table.
Bag: A bag is a collection of non-unique tuples. The schema of the bag is flexible — each tuple
in the collection can contain an arbitrary number of fields, and each field can be of any type.

Map: A map is a set of key-value pairs. The key must be unique and must be a chararray, while
the value can be of any type.
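
As a brief, hedged sketch (the file path and field names below are assumptions made only for illustration), a Pig Latin LOAD statement can declare all four kinds of values in its schema:

-- hypothetical input; the schema declares an atom, a tuple, a bag, and a map
people = LOAD '/data/people.txt'
    AS (name:chararray,                                   -- atom
        fullname:tuple(first:chararray, last:chararray),  -- tuple
        phones:bag{t:tuple(number:chararray)},            -- bag of tuples
        props:map[]);                                     -- map (chararray keys)
DUMP people;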


PIG LATIN OPERATIONS:

In a Hadoop context, accessing data means allowing developers to load, store, and stream data,
whereas transforming data means taking advantage of Pig’s ability to group, join, combine,
split, filter, and sort data.
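
As a hedged illustration of these operations (the input path, schema, and salary threshold below are assumptions, not part of the course material), a small Pig Latin script that loads, filters, groups, aggregates, and stores data might look like this:

-- load employee records from an assumed HDFS path with an assumed schema
emp = LOAD '/user/data/employees.txt' USING PigStorage(',')
      AS (id:int, name:chararray, salary:float, dept:chararray);

-- transform: keep well-paid employees, group them by department,
-- and compute a per-department average salary
high    = FILTER emp BY salary > 30000;
by_dept = GROUP high BY dept;
avg_sal = FOREACH by_dept GENERATE group AS dept, AVG(high.salary) AS avg_salary;

-- store the result back into HDFS (or use DUMP avg_sal; to print it)
STORE avg_sal INTO '/user/data/avg_salary_by_dept' USING PigStorage(',');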


Evaluating Local and Distributed Modes of Running Pig scripts

We can run Apache Pig in two modes, namely, Local Mode and HDFS mode.

Local Mode

In this mode, all the files are installed and run from your local host and local file system. There
is no need for Hadoop or HDFS. This mode is generally used for testing purposes.

MapReduce Mode

MapReduce mode is where we load or process the data that exists in the Hadoop File System
(HDFS) using Apache Pig. In this mode, whenever we execute the Pig Latin statements to
process the data, a MapReduce job is invoked in the back-end to perform a particular operation
on the data that exists in the HDFS.
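
The commands below sketch how each mode is typically selected when launching Pig (the script name is hypothetical): -x local runs against the local file system, while -x mapreduce (the default) runs against HDFS.

$ pig -x local sample_script.pig       # local mode: local files, no Hadoop cluster needed
$ pig -x mapreduce sample_script.pig   # MapReduce mode: reads and writes HDFS, runs MR jobs
$ pig -x local                         # with no script, this starts the Grunt shell in local mode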

Checking Out the Pig Script Interfaces

Apache Pig scripts can be executed in three ways, namely, interactive mode, batch mode, and
embedded mode.

• Interactive Mode (Grunt shell) − You can run Apache Pig in interactive mode using the
Grunt shell. In this shell, you can enter the Pig Latin statements and get the output (using
Dump operator).
• Batch Mode (Script) − You can run Apache Pig in Batch mode by writing the Pig Latin
script in a single file with .pig extension.
• Embedded Mode (UDF) − Apache Pig allows us to define our own functions (User
Defined Functions) in programming languages such as Java, and to use them in our
scripts.


HIVE Introduction
Hive is a data warehouse infrastructure tool to process structured data in Hadoop. It resides on
top of Hadoop to summarize Big Data, and makes querying and analyzing easy. The term 'Big
Data' is used for collections of large datasets that include huge volume, high velocity, and a
variety of data that is increasing day by day. Using traditional data management systems, it is
difficult to process Big Data. Therefore, the Apache Software Foundation introduced a
framework called Hadoop to solve Big Data management and processing challenges.

The Hadoop ecosystem contains different sub-projects (tools) such as Sqoop, Pig, and Hive that
are used to help Hadoop modules.

• Sqoop: It is used to import and export data between HDFS and RDBMS.
• Pig: It is a procedural language platform used to develop a script for MapReduce
operations.
• Hive: It is a platform used to develop SQL type scripts to do MapReduce operations.

Note: There are various ways to execute MapReduce operations:

• The traditional approach using a Java MapReduce program for structured, semi-structured,
and unstructured data.
• The scripting approach for MapReduce to process structured and semi-structured data
using Pig.
• The Hive Query Language (HiveQL or HQL) for MapReduce to process structured data
using Hive.

What is Hive

Hive is a data warehouse infrastructure tool to process structured data in Hadoop. It resides on
top of Hadoop to summarize Big Data, and makes querying and analyzing easy.
Initially Hive was developed by Facebook, later the Apache Software Foundation took it up and
developed it further as an open source under the name Apache Hive. It is used by different
companies. For example, Amazon uses it in Amazon Elastic MapReduce.

Hive is not

• A relational database
• A design for OnLine Transaction Processing (OLTP)
• A language for real-time queries and row-level updates

Features of Hive

• It stores schema in a database and processed data in HDFS.


• It is designed for OLAP.
• It provides SQL type language for querying called HiveQL or HQL.
• It is familiar, fast, scalable, and extensible.
HIVE SERVICES:
Apache Hive is a data warehouse infrastructure built on top of Hadoop for providing data
summarization, query, and analysis. Hive supports querying and managing large datasets stored
in Hadoop's HDFS (Hadoop Distributed File System) and other compatible file systems such as
Amazon S3, Azure Data Lake Storage, etc. Hive is primarily used for data warehousing and
analysis tasks, and it provides a SQL-like query language called HiveQL for interacting with data.
The core components of Apache Hive include:
1. Hive Metastore: The metastore is a centralized repository that stores metadata about
Hive tables, partitions, columns, and storage location. It provides schema information to
Hive clients and ensures that queries are executed efficiently.
2. Hive Query Processor: The query processor is responsible for parsing, optimizing, and
executing HiveQL queries. It translates HiveQL queries into a series of MapReduce, Tez,
or Spark jobs, depending on the execution engine configured.
3. Hive Execution Engines:
• MapReduce: Historically, Hive used MapReduce as its default execution engine,
where queries were translated into MapReduce jobs for processing.
• Tez: Apache Tez is an alternative execution engine for Hive that provides more
efficient query execution by optimizing task scheduling and reducing overhead.
• Spark: Hive also supports Apache Spark as an execution engine, leveraging
Spark's in-memory processing capabilities for faster query execution.
4. Hive CLI (Command-Line Interface): The Hive CLI is an interactive shell that allows
users to submit HiveQL queries and commands to the Hive service. It provides a familiar
command-line interface for interacting with Hive.
5. HiveServer2: HiveServer2 is a service that provides a Thrift and JDBC interface to Hive,
allowing remote clients to connect and execute HiveQL queries programmatically. It
supports multi-session and multi-user access to Hive.
6. WebHCat (Templeton): WebHCat is a REST API for Hadoop services, including Hive.


It allows users to submit HiveQL queries and manage Hive resources programmatically
over HTTP.
7. Hive Beeline: Beeline is a lightweight JDBC client provided by Hive for connecting to
HiveServer2. It allows users to run HiveQL queries and commands from the command
line or scripts.
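
As a small usage sketch (the host name, port, and user below are assumptions), a remote client can reach HiveServer2 through Beeline over JDBC:

$ beeline -u "jdbc:hive2://localhost:10000" -n hiveuser            # open an interactive session
$ beeline -u "jdbc:hive2://localhost:10000" -e "SHOW DATABASES;"   # run one HiveQL statement and exit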

ARCHITECTURE OF HIVE
The following component diagram depicts the architecture of Hive:

Apache Hive architecture


This component diagram contains different units. The following describes each unit:

UI: The user interface for users to submit queries and other operations to the system. As of
2011 the system had a command line interface, and a web based GUI was being developed.

Driver: The component which receives the queries. This component implements the notion of
session handles and provides execute and fetch APIs modeled on JDBC/ODBC interfaces.


Compiler: The component that parses the query, does semantic analysis on the different query
blocks and query expressions, and eventually generates an execution plan with the help of the
table and partition metadata looked up from the metastore.

Metastore: The component that stores all the structure information of the various tables and
partitions in the warehouse, including column and column type information, the serializers and
deserializers necessary to read and write data, and the corresponding HDFS files where the data
is stored.

Execution Engine: The component which executes the execution plan created by the compiler.
The plan is a DAG of stages. The execution engine manages the dependencies between these
different stages of the plan and executes these stages on the appropriate system components.

Working of Hive
The following diagram depicts the workflow between Hive and Hadoop.

The following table defines how Hive interacts with Hadoop framework:
Step No.   Operation
1 Execute Query
The Hive interface such as Command Line or Web UI sends query to Driver (any
database driver such as JDBC, ODBC, etc.) to execute.
2 Get Plan
The driver takes the help of the query compiler that parses the query to check the syntax
and the query plan or the requirement of the query.
3 Get Metadata
The compiler sends metadata request to Metastore (any database).
4 Send Metadata
Metastore sends metadata as a response to the compiler.

5 Send Plan
The compiler checks the requirement and resends the plan to the driver. Up to here, the
parsing and compiling of a query is complete.
6 Execute Plan
The driver sends the execute plan to the execution engine.
7 Execute Job
Internally, the process of executing the job is a MapReduce job. The execution engine
sends the job to the JobTracker, which is in the Name node, and it assigns this job to the
TaskTracker, which is in the Data node. Here, the query executes the MapReduce job.
7.1 Metadata Ops
Meanwhile, during execution, the execution engine can execute metadata operations with
the Metastore.
8 Fetch Result
The execution engine receives the results from Data nodes.
9 Send Results
The execution engine sends those resultant values to the driver.

10 Send Results
The driver sends the results to Hive Interfaces.

Hive Data Model

Data in Hive is organized into:

Tables – These are analogous to Tables in Relational Databases. Tables can be filtered,
projected, joined and unioned. Additionally all the data of a table is stored in a directory in
HDFS. Hive also supports the notion of external tables, wherein a table can be created on
pre-existing files or directories in HDFS by providing the appropriate location to the table
creation DDL. The rows in a table are organized into typed columns similar to Relational
Databases.

Partitions – Each Table can have one or more partition keys which determine how the data is
stored. For example, a table T with a date partition column ds has files with data for a particular
date stored in the <table location>/ds=<date> directory in HDFS. Partitions allow the system
to prune data to be inspected based on query predicates, for example a query that is interested
in rows from T that satisfy the predicate T.ds = '2008-09-01' would only have to look at files in
<table location>/ds=2008-09-01/ directory in HDFS.

Buckets – Data in each partition may in turn be divided into Buckets based on the hash of a
column in the table. Each bucket is stored as a file in the partition directory. Bucketing allows


the system to efficiently evaluate queries that depend on a sample of data (these are queries that
use the SAMPLE clause on the table).
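
The HiveQL below is a hedged sketch of these ideas (the table name, columns, bucket count, and storage format are illustrative assumptions): the table is partitioned by a date column ds, as in the example above, and each partition is bucketed by the hash of a column.

CREATE TABLE page_views (
    userid BIGINT,
    url STRING,
    ip STRING
)
PARTITIONED BY (ds STRING)
CLUSTERED BY (userid) INTO 32 BUCKETS
STORED AS ORC;

-- the partition predicate lets Hive prune directories such as <table location>/ds=2008-09-01/
SELECT url FROM page_views WHERE ds = '2008-09-01';

-- bucketing supports sampling; in HiveQL the sampling clause is written as TABLESAMPLE
SELECT * FROM page_views TABLESAMPLE(BUCKET 1 OUT OF 32 ON userid);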

HIVE - DATA TYPES


All the data types in Hive are classified into four types, given as follows:

• Column Types
• Literals
• Null Values
• Complex Types

➢ Column Types

Column type are used as column data types of Hive. They are as follows:

Integral Types

Integer type data can be specified using the integral data types, INT being the default. When the
data range exceeds the range of INT, you need to use BIGINT, and if the data range is smaller
than that of INT, you use SMALLINT. TINYINT is smaller than SMALLINT.

The following table depicts various INT data types:

Type Postfix Example


TINYINT Y 10Y
SMALLINT S 10S
INT - 10
BIGINT L 10L

String Types

String type data types can be specified using single quotes (' ') or double quotes (" "). It contains
two data types: VARCHAR and CHAR. Hive follows C-types escape characters.

The following table depicts various CHAR data types:

Data Type Length


VARCHAR 1 to 65535
CHAR 255


Timestamp

It supports the traditional UNIX timestamp with optional nanosecond precision. It supports the
java.sql.Timestamp format "YYYY-MM-DD HH:MM:SS.fffffffff" and the format
"yyyy-mm-dd hh:mm:ss.ffffffffff".

Dates

DATE values are described in year/month/day format in the form YYYY-MM-DD.

Decimals

The DECIMAL type in Hive is the same as the Big Decimal format of Java. It is used for
representing immutable arbitrary-precision values. The syntax and an example are as follows:
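
DECIMAL(precision, scale)

For example (the column name is hypothetical), a column declared as price DECIMAL(10,2) holds values with up to 10 digits in total, 2 of them after the decimal point.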

Union Types

Union is a collection of heterogeneous data types. You can create an instance using create_union.
The syntax and example are as follows:

UNIONTYPE<int, double, array<string>, struct<a:int,b:string>>

{0:1}

{1:2.0}

{2:["three","four"]}

{3:{"a":5,"b":"five"}}

{2:["six","seven"]}

{3:{"a":8,"b":"eight"}}

{0:9}

{1:10.0}

➢ Literals

The following literals are used in Hive:


Floating Point Types

Floating point types are nothing but numbers with decimal points. Generally, this type of data
is composed of the DOUBLE data type.

Decimal Type


Decimal type data is nothing but a floating point value with a higher range than the DOUBLE
data type. The range of the decimal type is approximately -10^-308 to 10^308.

➢ Null Value

Missing values are represented by the special value NULL.

➢ Complex Types

The Hive complex data types are as follows:

Arrays

Arrays in Hive are used the same way they are used in Java.
Syntax: ARRAY<data_type>

Maps

Maps in Hive are similar to Java Maps.
Syntax: MAP<primitive_type, data_type>

Structs

Structs in Hive are similar to using complex data with comments.
Syntax: STRUCT<col_name : data_type [COMMENT col_comment], ...>
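
A short HiveQL sketch tying these complex types together (the table name, columns, and delimiters are assumptions made for illustration):

CREATE TABLE employee_profile (
    name STRING,
    skills ARRAY<STRING>,
    phones MAP<STRING, STRING>,
    address STRUCT<city:STRING, zip:INT>
)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY '\t'
COLLECTION ITEMS TERMINATED BY ','
MAP KEYS TERMINATED BY ':';

-- element access uses [index] for arrays, [key] for maps, and dot notation for struct fields
SELECT name, skills[0], phones['office'], address.city FROM employee_profile;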


QUERYING DATA IN HIVE:
Querying data in Apache Hive involves writing HiveQL (Hive Query Language)
queries to retrieve, manipulate, and analyze data stored in Hive tables. Here are some
common operations and examples of querying data in Hive:
1. Selecting Data:
• Use the SELECT statement to retrieve data from one or more columns in a table.
• Example:
SELECT * FROM employees;
2. Filtering Data:
• Use the WHERE clause to filter rows based on specified conditions.
• Example:
SELECT * FROM employees WHERE department = 'IT';
3. Aggregating Data:
• Use aggregate functions like COUNT, SUM, AVG, MIN, MAX, etc., to perform
calculations on groups of rows.
• Example:
SELECT department, COUNT(*) AS num_employees FROM employees GROUP BY department;
4. Sorting Data:
• Use the ORDER BY clause to sort the result set based on one or more columns.
• Example:
SELECT * FROM employees ORDER BY salary DESC;
5. Joining Tables:
• Use join operations to combine data from multiple tables based on common
columns.
• Example:
SELECT e.name, d.department_name FROM employees e JOIN departments d ON
e.department_id = d.department_id;
6. Subqueries:
• Use subqueries to embed one query inside another query.
• Example:
SELECT * FROM employees WHERE department_id IN (SELECT department_id FROM
departments WHERE location = 'New York');
7. Conditional Logic:
• Use CASE statements to apply conditional logic within queries.
• Example:
SELECT name, CASE WHEN salary > 100000 THEN 'High' ELSE 'Low' END AS salary_level
FROM employees;
8. Window Functions:
• Use window functions for advanced analytics tasks such as ranking, aggregation,
and moving averages.
• Example:
SELECT name, salary, AVG(salary) OVER (PARTITION BY department_id) AS avg_salary
FROM employees;
9. Creating Views:
• Create views to store the results of queries as virtual tables for reuse.
• Example:
CREATE VIEW high_salary_employees AS SELECT * FROM employees WHERE salary > 100000;
10. Exporting Data:
• Use the INSERT OVERWRITE or INSERT INTO statements to export query
results to external storage or files.
• Example:
INSERT OVERWRITE DIRECTORY '/user/hive/output' SELECT * FROM employees
WHERE department = 'IT';
These are some common examples of querying data in Apache Hive using HiveQL.
Hive provides extensive support for SQL-like syntax, making it easy to perform
various data manipulation and analysis tasks on large datasets stored in Hadoop.

HIVE - CREATE DATABASE

• Hive is a database technology that can define databases and tables to analyze structured
data. The theme for structured data analysis is to store the data in a tabular manner, and
pass queries to analyze it. This chapter explains how to create a Hive database. Hive
contains a default database named default.

Create Database Statement

• Create Database is a statement used to create a database in Hive. A database in Hive is
a namespace or a collection of tables. The syntax for this statement is as follows:
CREATE DATABASE|SCHEMA [IF NOT EXISTS] <database name>
• Here, IF NOT EXISTS is an optional clause which prevents an error if a database with
the same name already exists. We can use SCHEMA in place of DATABASE in this
command. The following query is executed to create a database named userdb:

hive> CREATE DATABASE IF NOT EXISTS userdb;


or
hive> CREATE SCHEMA userdb;
• The following query is used to verify the list of databases:
hive> SHOW DATABASES;
default
userdb

DROP DATABASE STATEMENT

Drop Database is a statement that drops all the tables and deletes the database. Its syntax is as
follows:

DROP (DATABASE|SCHEMA) [IF EXISTS] database_name [RESTRICT | CASCADE];

The following queries are used to drop a database. Let us assume that the database name is userdb.
hive> DROP DATABASE IF EXISTS userdb;
The following query drops the database using CASCADE. It means dropping respective tables
before dropping the database.

hive> DROP DATABASE IF EXISTS userdb CASCADE;

The following query drops the database using SCHEMA.
hive> DROP SCHEMA userdb;
CREATE TABLE STATEMENT

Create Table is a statement used to create a table in Hive. The syntax and example are as follows:
Syntax
CREATE [TEMPORARY] [EXTERNAL] TABLE [IF NOT EXISTS] [db_name.]table_name
[(col_name data_type [COMMENT col_comment], ...)]
[COMMENT table_comment]
[ROW FORMAT row_format]
[STORED AS file_format]

Example

Let us assume you need to create a table named employee using the CREATE TABLE statement.
The following table lists the fields and their data types in the employee table:

S. No. Field Name Data Type


1 Eid int
2 Name String
3 Salary Float
4 Designation string

The following clauses specify the table comment and the row format fields such as the field
terminator, line terminator, and stored file type:

COMMENT 'Employee details'
FIELDS TERMINATED BY '\t'
LINES TERMINATED BY '\n'
STORED AS TEXTFILE

The following query creates a table named employee using the above data.


hive> CREATE TABLE IF NOT EXISTS employee (eid int, name String, salary Float, designation String)
COMMENT 'Employee details'
ROW FORMAT DELIMITED
FIELDS TERMINATED BY '\t'
LINES TERMINATED BY '\n'
STORED AS TEXTFILE;

If you add the option IF NOT EXISTS, Hive ignores the statement in case the table already
exists. On successful creation of the table, you get to see the following response:
OK
Time taken: 5.905 seconds
hive>
LOAD DATA STATEMENT

Generally, after creating a table in SQL, we can insert data using the Insert statement. But in
Hive, we can insert data using the LOAD DATA statement.

While inserting data into Hive, it is better to use LOAD DATA to store bulk records. There are
two ways to load data: one is from the local file system and the other is from the Hadoop file
system (HDFS).

Syntax

The syntax for load data is as follows:

LOAD DATA [LOCAL] INPATH 'filepath' [OVERWRITE] INTO TABLE tablename
[PARTITION (partcol1=val1, partcol2=val2 ...)]

• LOCAL is an identifier to specify the local path. It is optional.
• OVERWRITE is optional to overwrite the data in the table.
• PARTITION is optional.

Example

We will insert the following data into the table. It is a text file named sample.txt in the
/home/user directory.

1201  Gopal        45000  Technical manager
1202  Manisha      45000  Proof reader
1203  Masthanvali  40000  Technical writer
1204  Kiran        40000  Hr Admin
1205  Kranthi      30000  Op Admin


The following query loads the given text into the table.
hive> LOAD DATA LOCAL INPATH '/home/user/sample.txt' OVERWRITE INTO TABLE employee;
On successful load, you get to see the following response:
OK
Time taken: 15.905 seconds
hive>
ALTER TABLE STATEMENT
It is used to alter a table in Hive.

Syntax
The statement takes any of the following syntaxes based on what attributes we wish to modify
in a table.
ALTER TABLE name RENAME TO new_name
ALTER TABLE name ADD COLUMNS (col_spec[, col_spec ...])
ALTER TABLE name DROP [COLUMN] column_name
ALTER TABLE name CHANGE column_name new_name new_type
ALTER TABLE name REPLACE COLUMNS (col_spec[, col_spec ...])

Rename To… Statement
The following query renames the table from employee to emp.
hive> ALTER TABLE employee RENAME TO emp;

Change Statement
The following table contains the fields of the employee table and shows the fields to be changed.

Field name    Convert from data type    Change field name    Convert to data type
eid           int                       eid                  int
name          String                    ename                String
salary        Float                     salary               Double
designation   String                    designation          String
The following queries rename the column name and column data type using the above data:
hive> ALTER TABLE employee CHANGE name ename String;
hive> ALTER TABLE employee CHANGE salary salary Double;

Add Columns Statement

The following query adds a column named dept to the employee table.
hive> ALTER TABLE employee ADD COLUMNS (dept STRING COMMENT 'Department name');
Replace Statement
The following query deletes all the columns from the employee table and replaces them with
empid and name columns:
hive> ALTER TABLE employee REPLACE COLUMNS (empid Int, name String);
DROP TABLE STATEMENT
When you drop a table, Hive removes it from the Metastore. For a managed table (whose data
is controlled by Hive), both the table data and its metadata are removed; for an external table
(whose data lives in files outside Hive's control), only the metadata is removed and the
underlying files are left in place.
The syntax is as follows:
DROP TABLE [IF EXISTS] table_name;
The following query drops a table named employee:
hive> DROP TABLE IF EXISTS employee;
On successful execution of the query, you get to see the following response:
OK
Time taken: 5.3 seconds
hive>
The following query is used to verify the list of tables:
hive> SHOW TABLES;
emp
OK
Time taken: 2.1 seconds
hive>
Operators in HIVE:
There are four types of operators in Hive:
• Relational Operators
• Arithmetic Operators
• Logical Operators
• Complex Operators
Relational Operators: These operators are used to compare two operands. The following table
describes the relational operators available in Hive:

• A = B (all primitive types): TRUE if expression A is equivalent to expression B, otherwise FALSE.
• A != B (all primitive types): TRUE if expression A is not equivalent to expression B, otherwise FALSE.
• A < B (all primitive types): TRUE if expression A is less than expression B, otherwise FALSE.
• A <= B (all primitive types): TRUE if expression A is less than or equal to expression B, otherwise FALSE.
• A > B (all primitive types): TRUE if expression A is greater than expression B, otherwise FALSE.
• A >= B (all primitive types): TRUE if expression A is greater than or equal to expression B, otherwise FALSE.
• A IS NULL (all types): TRUE if expression A evaluates to NULL, otherwise FALSE.
• A IS NOT NULL (all types): FALSE if expression A evaluates to NULL, otherwise TRUE.
• A LIKE B (strings): TRUE if string pattern A matches B, otherwise FALSE.
• A RLIKE B (strings): NULL if A or B is NULL, TRUE if any substring of A matches the Java regular expression B, otherwise FALSE.
• A REGEXP B (strings): Same as RLIKE.
Example
Let us assume the employee table is composed of fields named Id, Name, Salary, Designation,
and Dept. Generate a query to retrieve the details of the employee whose Id is 1205.

The following query is executed to retrieve the employee details using the above table:
hive> SELECT * FROM employee WHERE Id=1205;

The following query is executed to retrieve the details of employees whose salary is more than
or equal to Rs 40000:
hive> SELECT * FROM employee WHERE Salary>=40000;

Arithmetic Operators
These operators support various common arithmetic operations on the operands. All of them
return number types. The following describes the arithmetic operators available in Hive:
• A + B (all number types): Gives the result of adding A and B.
• A - B (all number types): Gives the result of subtracting B from A.
• A * B (all number types): Gives the result of multiplying A and B.
• A / B (all number types): Gives the result of dividing A by B.
• A % B (all number types): Gives the remainder resulting from dividing A by B.
• A & B (all number types): Gives the result of bitwise AND of A and B.
• A | B (all number types): Gives the result of bitwise OR of A and B.
• A ^ B (all number types): Gives the result of bitwise XOR of A and B.
• ~A (all number types): Gives the result of bitwise NOT of A.
Example
The following query adds two numbers, 20 and 30.
hive> SELECT 20+30 ADD FROM temp;
On successful execution of the query, you get to see the following response:
+------+
| ADD  |
+------+
| 50   |
+------+
Logical operators
The operators are logical expressions. All of them return either TRUE or FALSE.
• A AND B (boolean): TRUE if both A and B are TRUE, otherwise FALSE.
• A && B (boolean): Same as A AND B.
• A OR B (boolean): TRUE if either A or B or both are TRUE, otherwise FALSE.
• A || B (boolean): Same as A OR B.
• NOT A (boolean): TRUE if A is FALSE, otherwise FALSE.
• !A (boolean): Same as NOT A.
Example
The following query is used to retrieve the details of employees whose Department is TP and
Salary is more than Rs 40000.
hive> SELECT * FROM employee WHERE Salary>40000 && Dept='TP';

Complex Operators
These operators provide an expression to access the elements of Complex Types.
• A[n] (A is an Array and n is an int): Returns the nth element in the array A. The first element has index 0.
• M[key] (M is a Map<K, V> and key has type K): Returns the value corresponding to the key in the map.
• S.x (S is a struct): Returns the x field of S.

HIVEQL - SELECT-WHERE
• The Hive Query Language (HiveQL) is a query language for Hive to process and analyze
structured data in a Metastore. This chapter explains how to use the SELECT statement
with WHERE clause.
• The SELECT statement is used to retrieve data from a table. The WHERE clause works
similar to a condition. It filters the data using the condition and gives you a finite result.
The built-in operators and functions generate an expression, which fulfills the condition.
Syntax
Given below is the syntax of the SELECT query:
SELECT [ALL | DISTINCT] select_expr, select_expr, ...
FROM table_reference [WHERE where_condition] [GROUP BY col_list] [HAVING
having_condition]
[CLUSTER BY col_list | [DISTRIBUTE BY col_list] [SORT BY col_list]][LIMIT
number];

Let us take an example for the SELECT…WHERE clause. Assume we have the employee table
with fields named Id, Name, Salary, Designation, and Dept. Generate a query to retrieve the
details of employees who earn a salary of more than Rs 30000.

The following query retrieves the employee details using the above scenario:
hive> SELECT * FROM employee WHERE salary>30000;

HIVEQL - SELECT-ORDER BY
This chapter explains how to use the ORDER BY clause in a SELECT statement. The ORDER
BY clause is used to retrieve the details based on one column and sort the result set by ascending
or descending order.
Syntax
Given below is the syntax of the ORDER BY clause:

SELECT [ALL | DISTINCT] select_expr, select_expr, ...
FROM table_reference
[WHERE where_condition]
[GROUP BY col_list]
[HAVING having_condition]
[ORDER BY col_list]
[LIMIT number];

Example
• Let us take an example for the SELECT...ORDER BY clause. Assume the employee table
with the fields named Id, Name, Salary, Designation, and Dept. Generate a query to
retrieve the employee details ordered by Department name.

The following query retrieves the employee details using the above scenario:
hive> SELECT Id, Name, Dept FROM employee ORDER BY DEPT;
HIVEQL - SELECT-GROUP BY
This chapter explains the details of the GROUP BY clause in a SELECT statement. The GROUP
BY clause is used to group all the records in a result set using a particular collection column. It
is used to query a group of records.
The syntax of the GROUP BY clause is as follows:
SELECT [ALL | DISTINCT] select_expr, select_expr, ...
FROM table_reference
[WHERE where_condition]
[GROUP BY col_list]
[HAVING having_condition]
[ORDER BY col_list]
[LIMIT number];
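
As an example (assuming the same employee table used in the earlier chapters), the following query counts the employees in each department:
hive> SELECT Dept, count(*) FROM employee GROUP BY Dept;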

HIVEQL - SELECT-JOINS
JOIN is a clause that is used for combining specific fields from two tables by using values
common to each one. It is used to combine records from two or more tables in the database. It
is more or less similar to SQL JOIN.

Syntax
join_table:
   table_reference JOIN table_factor [join_condition]
 | table_reference {LEFT|RIGHT|FULL} [OUTER] JOIN table_reference join_condition
 | table_reference LEFT SEMI JOIN table_reference join_condition
 | table_reference CROSS JOIN table_reference [join_condition]

We will use the following two tables in this chapter. Consider the following table named
CUSTOMERS.


Consider another table ORDERS as follows:

JOIN
There are different types of joins given as follows:
• JOIN
• LEFT OUTER JOIN
• RIGHT OUTER JOIN
• FULL OUTER JOIN
JOIN clause is used to combine and retrieve the records from multiple tables. JOIN is the same
as INNER JOIN in SQL. A JOIN condition is to be raised using the primary keys and foreign
keys of the tables.
The following query executes JOIN on the CUSTOMER and ORDER tables, and retrieves the
records:
hive> SELECT c.ID, c.NAME, c.AGE, o.AMOUNT FROM CUSTOMERS c JOIN
ORDERS o ON (c.ID = o.CUSTOMER_ID);

LEFT OUTER JOIN


The HiveQL LEFT OUTER JOIN returns all the rows from the left table, even if there are no
matches in the right table. This means, if the ON clause matches 0 (zero) records in the right
table, the JOIN still returns a row in the result, but with NULL in each column from the right
table. A LEFT JOIN returns all the values from the left table, plus the matched values from the
right table, or NULL in case of no matching JOIN predicate.
The following query demonstrates LEFT OUTER JOIN between CUSTOMER and ORDER
tables:

hive> SELECT c.ID, c.NAME, o.AMOUNT, o.DATE FROM CUSTOMERS c LEFT


OUTER JOIN ORDERS o ON (c.ID = o.CUSTOMER_ID);

RIGHT OUTER JOIN


The HiveQL RIGHT OUTER JOIN returns all the rows from the right table, even if there are
no matches in the left table. If the ON clause matches 0 (zero) records in the left table, the JOIN
still returns a row in the result, but with NULL in each column from the left table.
A RIGHT JOIN returns all the values from the right table, plus the matched values from the left
table, or NULL in case of no matching join predicate.
The following query demonstrates RIGHT OUTER JOIN between the CUSTOMER and
ORDERtables.
hive> SELECT c.ID, c.NAME, o.AMOUNT, o.DATE FROM CUSTOMERS c RIGHT
OUTER JOIN ORDERS o ON (c.ID = o.CUSTOMER_ID);

FULL OUTER JOIN


The HiveQL FULL OUTER JOIN combines the records of both the left and the right outer
tables that fulfil the JOIN condition. The joined table contains either all the records from both
the tables, or fills in NULL values for missing matches on either side.
The following query demonstrates FULL OUTER JOIN between CUSTOMER and ORDER
tables:
hive> SELECT c.ID, c.NAME, o.AMOUNT, o.DATE FROM CUSTOMERS c FULL
OUTER JOIN ORDERS o ON (c.ID = o.CUSTOMER_ID);


FUNDAMENTALS OF HBASE:
Apache HBase is a distributed, scalable, and column-oriented NoSQL database
built on top of Apache Hadoop. It is designed to provide real-time random
read/write access to large volumes of data, enabling users to store and manage big
data in a fault-tolerant and highly available manner. Here are the fundamentals of
HBase:
1. Data Model:
• HBase organizes data into tables, rows, and columns similar to a traditional
relational database.
• Tables are composed of rows, and each row has a unique row key.
• Rows are further divided into column families, which contain one or more
columns.
• Columns are identified by a column family and a qualifier.
• HBase supports dynamic schema, allowing you to add columns to a table
without predefining them.

2. Schema Design:
• Unlike traditional databases, HBase does not enforce a schema on data.
Instead, it allows flexible schema design to accommodate evolving data

requirements.
• Column families should be defined based on access patterns and data locality
to optimize performance.
• Denormalization is common in HBase schema design to minimize data
duplication and improve query performance.
3. Architecture:
• HBase follows a master-slave architecture. The HBase Master server
coordinates metadata operations and manages regionservers.
• Regionservers host the actual data and handle read and write requests from
clients.
• Each table is partitioned into regions, and each region is served by a single
regionserver.
• ZooKeeper is used for distributed coordination and leader election among
HBase components.

4. Storage:
• HBase stores data in Hadoop Distributed File System (HDFS), which provides
fault tolerance and high availability.
• Data is stored in sorted order based on row keys, enabling efficient range scans
and point queries.
• HBase employs a Write-Ahead Log (WAL) to ensure data durability in case of
node failures.
• Data is stored in indexed HFiles on disk, and an in-memory MemStore is used
for write buffering before data is flushed to disk.


5. Access Patterns:
• HBase supports random read and write access to individual rows based on row
keys.
• It is optimized for low-latency, real-time access to data, making it suitable for
use cases such as serving web applications, IoT data storage, and real-time
analytics.
• HBase also supports batch operations, scans, and filters to process large
volumes of data efficiently.
6. Consistency and Replication:
• HBase provides strong consistency guarantees within a region, ensuring that
read and write operations see consistent views of the data.
• It supports asynchronous replication to replicate data across multiple clusters
for disaster recovery, backup, and load balancing purposes.
7. Integration with Hadoop Ecosystem:
• HBase integrates seamlessly with other components of the Hadoop ecosystem,
such as Apache Spark, Apache Hive, Apache Pig, and Apache Flume.
• This integration allows users to leverage the power of distributed computing
frameworks for data processing and analytics on HBase data.
Overall, Apache HBase is a powerful NoSQL database that offers scalability, high
availability, and real-time access to large-scale data. Its flexible data model,
distributed architecture, and integration with Hadoop ecosystem make it well-suited
for a wide range of use cases in big data analytics and real-time applications.
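
As a hedged illustration of the data model and access patterns described above (the table, row key, column family, and qualifier names are hypothetical), the HBase shell can be used as follows:

hbase> create 'users', 'info', 'activity'                # a table with two column families
hbase> put 'users', 'row1', 'info:name', 'Diego'         # write one cell: row key, family:qualifier, value
hbase> put 'users', 'row1', 'activity:last_login', '2024-01-01'
hbase> get 'users', 'row1'                               # random read of a single row by row key
hbase> scan 'users', {STARTROW => 'row1', LIMIT => 10}   # range scan over the sorted row keys
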
FUNDAMENTALS OF ZOOKEEPER:
Apache ZooKeeper is a centralized service for maintaining configuration
information, providing distributed synchronization, and offering group services. It
acts as a distributed coordination service for distributed applications. Here are the
fundamentals of ZooKeeper:
1. Data Model:
• ZooKeeper maintains a hierarchical namespace similar to a file system, called
znodes.
• Znodes are organized in a tree-like structure, where each node can contain data
and can have children nodes.


• Each znode can have associated data and metadata such as permissions, version
numbers, and timestamps.
• Znodes can be ephemeral or persistent. Ephemeral znodes exist only as long as
the session that created them is active.

2. Coordination:
• ZooKeeper provides coordination primitives such as locks, semaphores,
barriers, and queues to facilitate coordination among distributed processes.
• Processes can create, read, write, and delete znodes to implement coordination
patterns such as leader election, distributed locks, and configuration
management.
3. Consensus:
• ZooKeeper implements the ZAB (ZooKeeper Atomic Broadcast) protocol to
provide strong consistency and fault tolerance.
• ZAB ensures that updates to the ZooKeeper state are linearizable and ordered,
even in the presence of failures.
• ZooKeeper uses a leader-follower model, where one server acts as the leader
and coordinates updates, while other servers act as followers and replicate state
changes.
4. Quorums:
• ZooKeeper uses a replicated state machine model, where multiple ZooKeeper
servers form an ensemble to replicate and synchronize data.
• To achieve fault tolerance, ZooKeeper requires a majority (quorum) of servers
to agree on updates before committing them.


• The size of the quorum is determined by the formula floor(n/2) + 1 (a strict majority),
where n is the total number of servers in the ensemble.
5. Session Management:
• Clients connect to ZooKeeper servers to create and manage sessions for
interacting with the ZooKeeper service.
• Sessions are associated with a specific client and are used to maintain client-
server communication state.
• ZooKeeper servers monitor client sessions and detect session timeouts to
remove inactive clients and release associated resources.
6. Watches:
• Clients can set watches on znodes to receive notifications when changes occur
to the watched znodes.
• Watches are one-time triggers that fire when specific events, such as node
creation, deletion, or data modification, occur on the watched znode.
• Watches allow clients to react to changes in the ZooKeeper state and
synchronize distributed processes.
7. Use Cases:
• ZooKeeper is used in distributed systems to manage configuration, coordinate
distributed processes, and maintain distributed locks.
• It is widely used in distributed databases, messaging systems, and big data
frameworks to provide coordination and consistency guarantees.
Overall, ZooKeeper plays a crucial role in building reliable and scalable distributed
systems by providing coordination, consistency, and synchronization services. Its
simple and efficient design makes it suitable for a wide range of distributed
applications and use cases.
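
A minimal sketch using the ZooKeeper command-line client (the znode paths and data are hypothetical; flag syntax follows the ZooKeeper 3.5+ CLI) shows the hierarchical namespace, ephemeral nodes, and watches described above:

$ zkCli.sh -server localhost:2181
create /app ""
create /app/config "db=hostA"
create /app/workers ""
create -e /app/workers/w1 ""
get -w /app/config
ls /app

Here /app/config is a persistent znode carrying configuration data, /app/workers/w1 is an ephemeral znode that disappears when the creating session ends, get -w reads the data and sets a one-time watch that fires on the next change to /app/config, and ls lists the children of /app.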

