BD - Unit - IV - Hive and Pig
BD - Unit - IV - Hive and Pig
Syllabus
1
Unit – IV
1. HIVE: It is a data warehouse infrastructure tool to
process structured data in Hadoop and it resides on top
of Hadoop to summarize Big Data, and makes querying
and analyzing easy.
4
1. Company Name: Initially Hive was developed by
facebook, later the Apache Software Foundation took it
up and developed it further as an open source under the
name Apache Hive.
7
4. Hive Architecture: Hive is a data warehouse system
for Hadoop that facilitates ad-hoc queries and the analysis
of large datasets stored in Hadoop.
10
6. Features / Benefits
It stores schema in a database and processed data
into HDFS.
It has very low maintenance and is very simple to learn & use.
12
8. APPLICATIONS
Log processing
Document indexing
Predictive modeling
Hypothesis testing
Customer facing BI
Data Mining
Call Center Apps
Marketing Apps
Create new Apps
Website.com Apps
Enterprise applications
13
Fig: APPLICATIONS 14
9. PICTURES / VIDEOS
Hive is a tool in Hadoop ecosystem which provides an
interface to organize and query data in a database like
fashion and write SQL like queries.
It is suitable for accessing and analyzing data in Hadoop
using SQL syntax.
Difference between RDBMS and Hive
• RDBMS:
RDBMS stands for Relational Database Management System.
RDBMS is a such type of database management system which
is specifically designed for relational databases. RDBMS is a
subset of DBMS. A relational database refers to a database
that stores data in a structured format using rows and
columns and that structured form is known as table. There are
some certain rules defined in RDBMS and that are known as
Codd’s rule.
• Hive:
Hive is a data warehouse software system that provides data
query and analysis. Hive gives an interface like SQL to query
data stored in various databases and file systems that
integrate with Hadoop. Hive helps with querying and
managing large datasets real fast. It is an ETL tool for Hadoop
16
ecosystem.
17
10. SOFTWARE (TRIAL VERSION)
Hive is a Data warehousing Software.
http://www.tutorialspoint.com/hive/hive_installation.htm
Step 1: Verifying JAVA Installation
Step 2: Verifying Hadoop Installation
Step 3: Downloading Hive
Step 4: Installing Hive
Step 5: Configuring Hive
Step 6: Downloading and Installing Apache Derby
Step 7: Configuring Metastore of Hive
Step 8: Verifying Hive Installation
11. References
1. https://hive.apache.org/
2. http://hadooptutorials.co.in/hive/
3. https://en.wikipedia.org/hive
4. http://www.rohitmenon.com/hive/
5. http://www-01.ibm.com/hive/
12. Case Studies / White Papers
Large-Scale Mining Software Repositories Studies
http://hadoop.apache.org/hive/
Amazon
Facebook
Google
IBM
New York Times
Yahoo!
Pig is a two part ecosystem, the actual language (Pig) and the
execution environment i.e. where the programmer enters the logic
called Pig Latin.
Pig scripts are translated into a series of MapReduce jobs that are
run on the Apache Hadoop cluster.
23
1. Company Name: The Company name is Apache and
Yahoo.
Yahoo! was the first big adopter of Hadoop, Hadoop
gained popularity in the company quickly.
Pig has join and order by operators that will handle this case and
rebalance the reducers.
Pig Components
Pig Latin: Command based language.
Execution Environment: The environment in which Pig Latin
commands are executed.
Pig compiler: Converts Pig Latin to MapReduce – Compiler
strives to optimize execution.
Fig: Pig Architecture
Pig user-defined functions: Pig provides extensive support
for user defined functions (UDFs) as a way to
specify custom processing.
Pig UDFs can currently be implemented in three
languages: Java, Python, and JavaScript.
Ease of programming
Mobile Programming
Branded email templates
Data Analytics
Join Datasets
Sort Datasets
Filters and Data Types
Group By
User Defined Functions
Extract-transform-load (ETL) data pipelines,
Iterative data processing
33
7. ADVANTAGES
Increases productivity
10 lines of Pig Latin ≈ 200 lines of Java
Quickly changing data processing requirements
Processing data from multiple channels
Quick hypothesis testing
Time sensitive data refreshes
Data profiling using sampling
Metadata not required, but used when available.
Support for nested types.
Web log processing.
Data processing for web search platforms.
Ad hoc queries across large data sets.
34
8. APPLICATIONS
https://pig.apache.org/
http://hortonworks.com/pig/
https://en.wikipedia.org/Pig_(programming_tool)
http://www.rohitmenon.com/apache-pig-tutorial-part-1/
http://www-01.ibm.com//pig/
12. Case Studies / White Papers:
Large-Scale Mining Software Repositories Studies
Flight Delay Analysis
YouTube
Yahoo
Google
Facebook
Microsoft