BD - Unit - IV - Hive and Pig

The document provides information about the Hive framework and Pig scripting language. It begins with an introduction to Hive, including that it is a data warehousing tool used to process structured data in Hadoop. It then discusses key aspects of Hive like architecture, features, and advantages. The document also provides an overview of Pig, including that it is a scripting language that enables complex data transformations without Java coding through its Pig Latin scripting language.


BIG DATA

Syllabus

Unit-I : Introduction to Big Data


Unit-II : Hadoop Frameworks and HDFS
Unit-III : MapReduce
Unit-IV : Hive and Pig
Unit-V : Mahout, Sqoop and CASE STUDY

Unit – IV
1. HIVE: Hive is a data warehouse infrastructure tool used to
process structured data in Hadoop. It resides on top of
Hadoop to summarize Big Data and makes querying and
analysis easy.

 Hive is a data warehousing tool built on Hadoop; Hadoop provides
massive scale-out on distributed infrastructure with a high degree of
fault tolerance for data storage and processing.

 Hive is a platform used to develop SQL-type scripts to perform
MapReduce operations.

 The Hive processor converts most queries into MapReduce jobs
that run on the Hadoop cluster.

 Hive is designed for easy and effective data aggregation, ad-


hoc querying and analysis of huge volumes of data.

 Hive provides a database query interface to Apache


Hadoop.

 Hive is not a relational database and is not designed for Online
Transaction Processing (OLTP), real-time queries, or row-level
updates.
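
As an illustration only, here is a minimal HiveQL sketch (the table and column names are hypothetical, not from the case study) of the kind of SQL-style query that Hive compiles into MapReduce jobs:

    CREATE TABLE page_views (user_id STRING, url STRING, view_time BIGINT)
    ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t';

    -- Find the ten most viewed URLs; Hive turns this into one or more MapReduce jobs.
    SELECT url, COUNT(*) AS hits
    FROM page_views
    GROUP BY url
    ORDER BY hits DESC
    LIMIT 10;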
 This Case Study consists of
1. Company Name
2. CEO
3. Introduction
4. Hive Architecture
5. Services
6. Features
7. Advantages
8. Applications
9. Pictures/videos
10. Software (Trial version)
11. References (URLs)
12. Case Studies / Whitepapers
13. Conclusion

 1. Company Name: Hive was initially developed by
Facebook; later the Apache Software Foundation took it
up and developed it further as an open-source project
under the name Apache Hive.

 2. CEO: Hive is an Apache project, and the CEO is Steven Farris.


– Dec 2004 – Google GFS paper published
– July 2005 – Nutch uses MapReduce
– Feb 2006 – Becomes Lucene subproject
– Apr 2007 – Yahoo! on 1000-node cluster
– Jan 2008 – An Apache Top Level Project
– Jul 2008 – A 4000 node test cluster
– Sept 2008 – Hive becomes a Hadoop subproject
 3. Introduction: Hive is a data warehousing system for
Hadoop to meet the needs of businesses, data scientists,
analysts and BI professionals.

 Hive supports analysis of large datasets stored in Hadoop file
systems, an SQL-like language called HiveQL, and custom
mappers and reducers when HiveQL isn't enough.

 Hive can help with a range of business problems, such as log
processing, predictive modelling, hypothesis testing and
business intelligence.
 There are two types of tables in Hive:
i. Managed table: both the data and the schema are under the
control of Hive.
ii. External table: only the schema is under the control of Hive.
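
A short HiveQL sketch of the two table types (the table names and HDFS path are hypothetical):

    -- Managed table: Hive controls both the schema and the data files.
    CREATE TABLE sales_managed (id INT, amount DOUBLE);

    -- External table: Hive controls only the schema; dropping the table leaves the files in place.
    CREATE EXTERNAL TABLE sales_external (id INT, amount DOUBLE)
    LOCATION '/data/sales';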

 4. Hive Architecture: Hive is a data warehouse system
for Hadoop that facilitates ad-hoc queries and the analysis
of large datasets stored in Hadoop.

 Hive provides a SQL-like language called HiveQL.


 Hive data is organized into:
i. Databases: Namespaces that keep tables and other data
units from naming conflicts.
ii. Tables: Homogeneous units of data which share the same
schema.
iii. Partitions: Each table can have one or more partition keys
which determine how the data is stored.
iv. Buckets (or Clusters): Data in each partition may in turn be
divided into buckets based on the value of a hash function of
some column of the table.
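
A minimal HiveQL sketch (hypothetical table) showing a partition key and bucketing:

    -- One partition per log_date; within each partition, rows are hashed on level into 4 buckets.
    CREATE TABLE logs (msg STRING, level STRING)
    PARTITIONED BY (log_date STRING)
    CLUSTERED BY (level) INTO 4 BUCKETS;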
Fig: Hive Architecture
5. SERVICES
 Metastore service: stores the metadata of Hive tables, partitions and Hive databases
 File system service
 Job Client service
 Hive Web Interface
 The Hive Metastore Server
 Disabling Bypass Mode
 Using Hive Gateways
 Hive server service

6. Features / Benefits
It stores the schema in a database and the processed
data in HDFS.

 It is designed for OLAP.

 It provides SQL type language for querying


called HiveQL or HQL.

 It is familiar, fast, scalable, and extensible.


7. ADVANTAGES

 It takes much less time to write a Hive query than equivalent
MapReduce code.

 It supports much of the SQL syntax, which makes it possible to
integrate Hive with existing BI tools.

 It is very easy to write queries involving joins in Hive.

 It has very low maintenance and is very simple to learn and use.

 Hive is built on Hadoop, so it inherits the capabilities Hadoop
provides, such as reliability, high performance and tolerance of
node failures.

8. APPLICATIONS

 Log processing
 Document indexing
 Predictive modeling
 Hypothesis testing
 Customer facing BI
 Data Mining
 Call Center Apps
 Marketing Apps
 Create new Apps
 Website.com Apps
 Enterprise applications
Fig: APPLICATIONS
9. PICTURES / VIDEOS
 Hive is a tool in the Hadoop ecosystem which provides an
interface to organize and query data in a database-like
fashion and to write SQL-like queries.
 It is suitable for accessing and analyzing data in Hadoop
using SQL syntax.
Difference between RDBMS and Hive
• RDBMS:
RDBMS stands for Relational Database Management System.
An RDBMS is a database management system designed
specifically for relational databases, and it is a subset of DBMS.
A relational database stores data in a structured format using
rows and columns, and that structured form is known as a table.
There are certain rules defined for an RDBMS, known as
Codd's rules.
• Hive:
Hive is a data warehouse software system that provides data
query and analysis. Hive gives an SQL-like interface to query
data stored in the various databases and file systems that
integrate with Hadoop. Hive helps with querying and managing
large datasets very quickly, and it serves as an ETL tool for the
Hadoop ecosystem.
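
A small HiveQL sketch (hypothetical table names) of the query-and-load pattern behind Hive's ETL role:

    -- Summarize raw events by day and load the result into a reporting table.
    INSERT OVERWRITE TABLE daily_summary
    SELECT log_date, COUNT(*) AS events
    FROM raw_logs
    GROUP BY log_date;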
10. SOFTWARE (TRIAL VERSION)
 Hive is a Data warehousing Software.

 The following simple steps are executed for Hive installation:

 http://www.tutorialspoint.com/hive/hive_installation.htm
 Step 1: Verifying JAVA Installation
 Step 2: Verifying Hadoop Installation
 Step 3: Downloading Hive
 Step 4: Installing Hive
 Step 5: Configuring Hive
 Step 6: Downloading and Installing Apache Derby
 Step 7: Configuring Metastore of Hive
 Step 8: Verifying Hive Installation
11. References

1. https://hive.apache.org/

2. http://hadooptutorials.co.in/hive/

3. https://en.wikipedia.org/hive

4. http://www.rohitmenon.com/hive/

5. http://www-01.ibm.com/hive/
12. Case Studies / White Papers
 Large-Scale Mining Software Repositories Studies
 http://hadoop.apache.org/hive/
 Amazon
 Facebook
 Google
 IBM
 New York Times
 Yahoo!

13. Conclusions: Hive is a data warehouse infrastructure


tool to process structured data in Hadoop. It resides on
top of Hadoop to summarize Big Data, and makes
querying and analyzing easy.
2. Pig: Pig is a high-level scripting language that is used with
Apache Hadoop.

 Pig is a framework or platform for the execution of


complex queries to analyze data.

 Pig enables data workers to write complex data


transformations without knowing Java.

 Pig is built on Hadoop and takes advantage of the distributed
nature and implementation of MapReduce.

 Pig works with data from many sources, including
structured and unstructured data, and stores the results in
the Hadoop Distributed File System (HDFS).
 Similar to pigs, which eat anything, the Pig programming language is
designed to work on any kind of data; that is why it is named Pig!

 Pig is a two-part ecosystem: the actual language, Pig Latin, in which
the programmer expresses the processing logic, and the execution
environment in which Pig Latin programs run.

 Pig scripts are translated into a series of MapReduce jobs that are
run on the Apache Hadoop cluster.

 Pig’s simple SQL-like scripting language is called Pig Latin, and


appeals to developers already familiar with scripting languages and
SQL.

 Pig Latin is a dataflow language: it allows users to describe how
data from one or more inputs should be read, processed, and then
stored to one or more outputs in parallel.
 This Case Study consists of
1. Company Name
2. CEO
3. Introduction
4. Pig Architecture
5. Services
6. Features
7. Advantages
8. Applications
9. Pictures/videos
10. Software / Tools
11. References (URLs)
12. Case Studies / Whitepapers
13. Conclusion

1. Company Name: The companies behind Pig are Apache and
Yahoo!.
 Yahoo! was the first big adopter of Hadoop, and Hadoop
quickly gained popularity in the company.

2. CEO: Pig was first built at Yahoo!, whose CEO is Marissa
Mayer; later Pig became a top-level Apache project, and the
CEO is Steven Farris.

 Pig was originally developed at Yahoo! Research around
2006.
3. Introduction: PIG is a platform for analyzing large data
sets that implements a high-level abstraction for
expressing data analysis.
 Pig consists of two components:
 Pig Latin: Which is a language
 Runtime environment: For running Pig Latin programs.
 Pig runs on Hadoop; it makes use of both the Hadoop
Distributed File System (HDFS) and Hadoop's processing
system, MapReduce.

Fig: Pig Programming Contains


Fig: Pig Latin Execution Engine
 Execution Modes: Pig has two execution modes:
i. Local mode: In this mode, Pig runs in a single JVM and
makes use of the local file system.
ii. MapReduce mode: In this mode, queries written in Pig Latin
are translated into MapReduce jobs and run on a Hadoop
cluster.

Fig: Pig Execution Modes


4. Pig Architecture: Pig Architecture is a combination of Pig Latin
scripts, MapReduce statements and HDFS.

 Pig has join and order by operators that handle skewed data and
rebalance the reducers.

 There are no if statements or for loops in Pig Latin; traditional
procedural and object-oriented programming languages describe
control flow, with data flow as a side effect of the program, whereas
Pig Latin describes the data flow itself.

 Pig Components
 Pig Latin: Command based language.
 Execution Environment: The environment in which Pig Latin
commands are executed.
 Pig compiler: Converts Pig Latin to MapReduce – Compiler
strives to optimize execution.
Fig: Pig Architecture
 Pig user-defined functions: Pig provides extensive support
for user defined functions (UDFs) as a way to
specify custom processing.
 Pig UDFs can currently be implemented in three
languages: Java, Python, and JavaScript.

 The following are UDF-related statements and base classes:


Register
Define
EvalFunc
FilterFunc
LoadFunc
StoreFunc
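
A short Pig Latin sketch of registering and calling a UDF; the jar name and class below are hypothetical placeholders for a Java EvalFunc:

    REGISTER 'myudfs.jar';                                   -- Register: make the jar's classes visible to Pig
    DEFINE TO_UPPER com.example.pig.UpperCase();             -- Define: give the (hypothetical) EvalFunc a short alias
    names   = LOAD '/data/names.txt' AS (name:chararray);
    shouted = FOREACH names GENERATE TO_UPPER(name);         -- call the UDF like a built-in function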
 Data Processing Operators
Loading
Storing
Filtering
Foreach
Generate
Streaming
Grouping
Joining
Cogroup
Cross
Describe
Explain
Illustrate
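
A small Pig Latin sketch (hypothetical data files) that touches several of these operators:

    orders  = LOAD '/data/orders.csv' USING PigStorage(',') AS (id:int, cust:chararray, amount:double);
    custs   = LOAD '/data/customers.csv' USING PigStorage(',') AS (cust:chararray, city:chararray);
    big     = FILTER orders BY amount > 100.0;                          -- Filtering
    joined  = JOIN big BY cust, custs BY cust;                          -- Joining
    grouped = GROUP joined BY custs::city;                              -- Grouping
    totals  = FOREACH grouped GENERATE group AS city,                   -- Foreach / Generate
                                       SUM(joined.big::amount) AS total;
    DESCRIBE totals;                                                    -- Describe prints the schema
    STORE totals INTO '/data/totals';                                   -- Storing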
5. SERVICES
 Extraction
 Transformation
 Loading
 Telecom Services
 Bigdata Advisory Services
 Bigdata Transformation Services
 Brokerage and Banking
 Financial Services
 Education Services
 Mailing Solutions
 Manufacturing Services
6. Features / Benefits

 Ease of programming
 Mobile Programming
 Branded email templates
 Data Analytics
 Join Datasets
 Sort Datasets
 Filters and Data Types
 Group By
 User Defined Functions
 Extract-transform-load (ETL) data pipelines,
 Iterative data processing
7. ADVANTAGES

 Increases productivity
 10 lines of Pig Latin ≈ 200 lines of Java
 Quickly changing data processing requirements
 Processing data from multiple channels
 Quick hypothesis testing
 Time sensitive data refreshes
 Data profiling using sampling
 Metadata not required, but used when available.
 Support for nested types.
 Web log processing.
 Data processing for web search platforms.
 Ad hoc queries across large data sets.
8. APPLICATIONS

 Call Center Apps


 Marketing Apps
 Chatter Applications
 Community Apps
 Big Data for Google AdWords
 Checkout / Checkin Apps in Organizations
 Add AppExchange Apps
 Create new Apps
 Website.Com
 Enterprise applications
9. PICTURES / VIDEOS

 Pig is a high level scripting language that is used with


Apache Hadoop. Pig enables data workers to write
complex data transformations without knowing Java.
10. SOFTWARES / TOOLS
 Pig Latin is a Data preprocessing Language.
 Running Pig
 Script: Execute commands in a file
 Grunt: Interactive Shell for executing Pig Commands
 Embedded: Execute Pig commands from Java using the PigServer class.
 Pig Steps
1. Load text into a bag (named ‘lines’)
2. Tokenize the text in the ‘lines’ bag
3. Retain first letter of each token
4. Group by letter
5. Count the number of occurrences in each group
6. Descending order the group by the count
7. Grab the first element => Most occurring letter
8. Persist result on a file system
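
A hedged Pig Latin sketch of these eight steps (the input and output paths are hypothetical; TOKENIZE, SUBSTRING and COUNT are Pig built-in functions):

    lines   = LOAD '/data/input.txt' AS (line:chararray);                      -- 1. load text into the 'lines' bag
    tokens  = FOREACH lines GENERATE FLATTEN(TOKENIZE(line)) AS token;         -- 2. tokenize the text
    letters = FOREACH tokens GENERATE SUBSTRING(token, 0, 1) AS letter;        -- 3. keep the first letter of each token
    grouped = GROUP letters BY letter;                                         -- 4. group by letter
    counts  = FOREACH grouped GENERATE group AS letter, COUNT(letters) AS n;   -- 5. count occurrences in each group
    ordered = ORDER counts BY n DESC;                                          -- 6. order groups by count, descending
    top1    = LIMIT ordered 1;                                                 -- 7. first element = most occurring letter
    STORE top1 INTO '/data/letter_count';                                      -- 8. persist the result on the file system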
 https://cwiki.apache.org/PigTools.
 https://issues.apache.or/PIG-366.
 https://en.wikipedia.org/Pig_(programming_tool)
 https://pig.apache.org/download
 http://www.slideshare.net/big-data-analytics-using-pig
11. Resources

 https://pig.apache.org/

 http://hortonworks.com/pig/
 https://en.wikipedia.org/Pig_(programming_tool)

 http://www.rohitmenon.com/apache-pig-tutorial-part-1/
 http://www-01.ibm.com//pig/
12. Case Studies / White Papers:
 Large-Scale Mining Software Repositories Studies
 Flight Delay Analysis
 YouTube
 Yahoo
 Google
 Facebook
 Microsoft

13. Conclusions : Pig is a high-level scripting language that


is used with Apache Hadoop and excels at describing
data analysis problems as data flows. Pig provides
common data processing operations. Pig supports rapid
iteration of ad-hoc queries.
Hive vs Pig
1. Hive is a DW tool; Pig is a procedural data-flow language.
2. Hive is used by data analysts; Pig is used by researchers and programmers.
3. Hive is used for creating reports; Pig is used for programming.
4. Hive operates on the server side of a cluster; Pig operates on the client side of a cluster.
5. Hive does not support the Avro file format; Pig supports Avro.
6. Hive directly leverages SQL and is easy to learn for database experts; Pig is SQL-like but varies to a great extent.
7. Hive makes use of an exact variation of the dedicated SQL DDL language by defining tables beforehand; Pig does not have a dedicated metadata database.
8. Hive is for structured data; Pig is for semi-structured data.
HBase vs RDBMS
1. HBase is column-oriented; an RDBMS is (mostly) row-oriented.
2. HBase has a flexible schema and columns can be added on the fly; an RDBMS has a fixed schema.
3. HBase is designed to store denormalized data; an RDBMS is designed to store normalized data.
4. HBase is good with sparse tables; an RDBMS is not optimized for sparse tables.
5. HBase performs joins using MapReduce, which is not optimized; an RDBMS is optimized for joins.
6. HBase has tight integration with MapReduce; an RDBMS has no integration with MapReduce.
7. HBase offers horizontal scalability (just add hardware); an RDBMS is hard to shard and scale.
8. HBase is good for semi-structured as well as structured data; an RDBMS is good for structured data.
