BD - Unit - IV - Hive and Pig

The document provides information about the Hive framework and Pig scripting language. It begins with an introduction to Hive, including that it is a data warehousing tool used to process structured data in Hadoop. It then discusses key aspects of Hive like architecture, features, and advantages. The document also provides an overview of Pig, including that it is a scripting language that enables complex data transformations without Java coding through its Pig Latin scripting language.


BIG DATA

Syllabus

Unit-I : Introduction to Big Data


Unit-II : Hadoop Frameworks and HDFS
Unit-III : MapReduce
Unit-IV : Hive and Pig
Unit-V : Mahout, Sqoop and CASE STUDY

Unit – IV
1. HIVE: Hive is a data warehouse infrastructure tool used to
process structured data in Hadoop. It resides on top of
Hadoop to summarize Big Data and makes querying and
analysis easy.

 Hive is a data warehousing tool built on Hadoop; Hadoop provides
massive scale-out on distributed infrastructure with a high degree of
fault tolerance for data storage and processing.

 Hive is a platform used to develop SQL-type scripts to perform
MapReduce operations.

 The Hive processor converts most queries into MapReduce jobs
that run on the Hadoop cluster.

 Hive is designed for easy and effective data aggregation, ad-


hoc querying and analysis of huge volumes of data.

 Hive provides a database query interface to Apache


Hadoop.

 Hive is not a relational database and is not designed for Online
Transaction Processing (OLTP), real-time queries, or row-level
updates.
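
As an illustration only, here is a minimal HiveQL sketch (the table and column names are hypothetical, not from the case study) of the kind of SQL-style query that Hive compiles into MapReduce jobs:

    CREATE TABLE page_views (user_id STRING, url STRING, view_time BIGINT)
    ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t';

    -- Find the ten most viewed URLs; Hive turns this into one or more MapReduce jobs.
    SELECT url, COUNT(*) AS hits
    FROM page_views
    GROUP BY url
    ORDER BY hits DESC
    LIMIT 10;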
 This Case Study consists of
1. Company Name
2. CEO
3. Introduction
4. Hive Architecture
5. Services
6. Features
7. Advantages
8. Applications
9. Pictures/videos
10. Software (Trial version)
11. References (URLs)
12. Case Studies / Whitepapers
13. Conclusion

 1. Company Name: Hive was initially developed by
Facebook; later the Apache Software Foundation took it
up and developed it further as an open-source project
under the name Apache Hive.

 2. CEO: Hive is an Apache project, and the CEO is Steven Farris.


– Dec 2004 – Google GFS paper published
– July 2005 – Nutch uses MapReduce
– Feb 2006 – Becomes Lucene subproject
– Apr 2007 – Yahoo! on 1000-node cluster
– Jan 2008 – An Apache Top Level Project
– Jul 2008 – A 4000 node test cluster
– Sept 2008 – Hive becomes a Hadoop subproject
 3. Introduction: Hive is a data warehousing system for
Hadoop to meet the needs of businesses, data scientists,
analysts and BI professionals.

 Hive supports analysis of large datasets stored in Hadoop file
systems, an SQL-like language called HiveQL, and custom
mappers and reducers when HiveQL isn't enough.

 Hive can help with a range of business problems, such as log
processing, predictive modelling, hypothesis testing and
business intelligence.
 There are two types of tables in Hive:
i. Managed table: both the data and the schema are under the
control of Hive.
ii. External table: only the schema is under the control of Hive.
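
A short HiveQL sketch of the two table types (the table names and HDFS path are hypothetical):

    -- Managed table: Hive controls both the schema and the data files.
    CREATE TABLE sales_managed (id INT, amount DOUBLE);

    -- External table: Hive controls only the schema; dropping the table leaves the files in place.
    CREATE EXTERNAL TABLE sales_external (id INT, amount DOUBLE)
    LOCATION '/data/sales';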

 4. Hive Architecture: Hive is a data warehouse system
for Hadoop that facilitates ad-hoc queries and the analysis
of large datasets stored in Hadoop.

 Hive provides a SQL-like language called HiveQL.


 Hive data is organized into:
i. Databases: Namespaces that keep tables and other data
units from naming conflicts.
ii. Tables: Homogeneous units of data which share the same
schema.
iii. Partitions: Each table can have one or more partition keys
which determine how the data is stored.
iv. Buckets (or Clusters): Data in each partition may in turn be
divided into buckets based on the value of a hash function of
some column of the table.
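
A minimal HiveQL sketch (hypothetical table) showing a partition key and bucketing:

    -- One partition per log_date; within each partition, rows are hashed on level into 4 buckets.
    CREATE TABLE logs (msg STRING, level STRING)
    PARTITIONED BY (log_date STRING)
    CLUSTERED BY (level) INTO 4 BUCKETS;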
Fig: Hive Architecture
5. SERVICES
 Metastore service: stores the metadata of Hive tables, partitions and Hive databases
 File system service
 Job Client service
 Hive Web Interface
 The Hive Metastore Server
 Disabling Bypass Mode
 Using Hive Gateways
 Hive server service

6. Features / Benefits
It stores the schema in a database and the processed
data in HDFS.

 It is designed for OLAP.

 It provides SQL type language for querying


called HiveQL or HQL.

 It is familiar, fast, scalable, and extensible.


7. ADVANTAGES

 It takes much less time to write a Hive query than equivalent
MapReduce code.

 It supports much of the SQL syntax, which makes it possible to
integrate Hive with existing BI tools.

 It is very easy to write queries involving joins in Hive.

 It has very low maintenance and is very simple to learn and use.

 Hive is built on Hadoop, so it inherits the capabilities Hadoop
provides, such as reliability, high performance and tolerance of
node failures.

8. APPLICATIONS

 Log processing
 Document indexing
 Predictive modeling
 Hypothesis testing
 Customer facing BI
 Data Mining
 Call Center Apps
 Marketing Apps
 Create new Apps
 Website.com Apps
 Enterprise applications
Fig: APPLICATIONS
9. PICTURES / VIDEOS
 Hive is a tool in the Hadoop ecosystem which provides an
interface to organize and query data in a database-like
fashion and to write SQL-like queries.
 It is suitable for accessing and analyzing data in Hadoop
using SQL syntax.
Difference between RDBMS and Hive
• RDBMS:
RDBMS stands for Relational Database Management System.
An RDBMS is a database management system designed
specifically for relational databases, and it is a subset of DBMS.
A relational database stores data in a structured format using
rows and columns, and that structured form is known as a table.
There are certain rules defined for an RDBMS, known as
Codd's rules.
• Hive:
Hive is a data warehouse software system that provides data
query and analysis. Hive gives an SQL-like interface to query
data stored in the various databases and file systems that
integrate with Hadoop. Hive helps with querying and managing
large datasets very quickly, and it serves as an ETL tool for the
Hadoop ecosystem.
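
A small HiveQL sketch (hypothetical table names) of the query-and-load pattern behind Hive's ETL role:

    -- Summarize raw events by day and load the result into a reporting table.
    INSERT OVERWRITE TABLE daily_summary
    SELECT log_date, COUNT(*) AS events
    FROM raw_logs
    GROUP BY log_date;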
10. SOFTWARE (TRIAL VERSION)
 Hive is a Data warehousing Software.

 The following simple steps are executed for Hive installation:

 http://www.tutorialspoint.com/hive/hive_installation.htm
 Step 1: Verifying JAVA Installation
 Step 2: Verifying Hadoop Installation
 Step 3: Downloading Hive
 Step 4: Installing Hive
 Step 5: Configuring Hive
 Step 6: Downloading and Installing Apache Derby
 Step 7: Configuring Metastore of Hive
 Step 8: Verifying Hive Installation
11. References

1. https://hive.apache.org/

2. http://hadooptutorials.co.in/hive/

3. https://en.wikipedia.org/hive

4. http://www.rohitmenon.com/hive/

5. http://www-01.ibm.com/hive/
12. Case Studies / White Papers
 Large-Scale Mining Software Repositories Studies
 http://hadoop.apache.org/hive/
 Amazon
 Facebook
 Google
 IBM
 New York Times
 Yahoo!

13. Conclusions: Hive is a data warehouse infrastructure


tool to process structured data in Hadoop. It resides on
top of Hadoop to summarize Big Data, and makes
querying and analyzing easy.
2. Pig: Pig is a high-level scripting language that is used with
Apache Hadoop.

 Pig is a framework or platform for the execution of


complex queries to analyze data.

 Pig enables data workers to write complex data


transformations without knowing Java.

 Pig is built on Hadoop and takes advantage of the distributed
nature and implementation of MapReduce.

 Pig works with data from many sources, including
structured and unstructured data, and stores the results in
the Hadoop Distributed File System (HDFS).
 Similar to pigs, which eat anything, the Pig programming language is
designed to work on any kind of data; that is why it is named Pig!

 Pig is a two-part ecosystem: the actual language, Pig Latin, in which
the programmer expresses the processing logic, and the execution
environment in which Pig Latin programs run.

 Pig scripts are translated into a series of MapReduce jobs that are
run on the Apache Hadoop cluster.

 Pig’s simple SQL-like scripting language is called Pig Latin, and


appeals to developers already familiar with scripting languages and
SQL.

 Pig Latin is a dataflow language: it allows users to describe how
data from one or more inputs should be read, processed, and then
stored to one or more outputs in parallel.
 This Case Study consists of
1. Company Name
2. CEO
3. Introduction
4. Pig Architecture
5. Services
6. Features
7. Advantages
8. Applications
9. Pictures/videos
10. Software / Tools
11. References (URLs)
12. Case Studies / Whitepapers
13. Conclusion

1. Company Name: The companies behind Pig are Apache and
Yahoo!.
 Yahoo! was the first big adopter of Hadoop, and Hadoop
quickly gained popularity in the company.

2. CEO: Pig was first built at Yahoo!, whose CEO is Marissa
Mayer; later Pig became a top-level Apache project, and the
CEO is Steven Farris.

 Pig was originally developed at Yahoo! Research around
2006.
3. Introduction: PIG is a platform for analyzing large data
sets that implements a high-level abstraction for
expressing data analysis.
 Pig consists of two components:
 Pig Latin: Which is a language
 Runtime environment: For running Pig Latin programs.
 Pig runs on Hadoop; it makes use of both the Hadoop
Distributed File System (HDFS) and Hadoop's processing
system, MapReduce.

Fig: Pig Programming Contains


Fig: Pig Latin Execution Engine
 Execution Modes: Pig has two execution modes:
i. Local mode: In this mode, Pig runs in a single JVM and
makes use of the local file system.
ii. MapReduce mode: In this mode, queries written in Pig Latin
are translated into MapReduce jobs and run on a Hadoop
cluster.

Fig: Pig Execution Modes


4. Pig Architecture: Pig Architecture is a combination of Pig Latin
scripts, MapReduce statements and HDFS.

 Pig has join and order by operators that handle skewed data and
rebalance the reducers.

 There are no if statements or for loops in Pig Latin; traditional
procedural and object-oriented programming languages describe
control flow, with data flow as a side effect of the program, whereas
Pig Latin describes the data flow itself.

 Pig Components
 Pig Latin: Command based language.
 Execution Environment: The environment in which Pig Latin
commands are executed.
 Pig compiler: Converts Pig Latin to MapReduce – Compiler
strives to optimize execution.
Fig: Pig Architecture
 Pig user-defined functions: Pig provides extensive support
for user defined functions (UDFs) as a way to
specify custom processing.
 Pig UDFs can currently be implemented in three
languages: Java, Python, and JavaScript.

 The following are UDF-related statements and base classes:


Register
Define
EvalFunc
FilterFunc
LoadFunc
StoreFunc
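
A short Pig Latin sketch of registering and calling a UDF; the jar name and class below are hypothetical placeholders for a Java EvalFunc:

    REGISTER 'myudfs.jar';                                   -- Register: make the jar's classes visible to Pig
    DEFINE TO_UPPER com.example.pig.UpperCase();             -- Define: give the (hypothetical) EvalFunc a short alias
    names   = LOAD '/data/names.txt' AS (name:chararray);
    shouted = FOREACH names GENERATE TO_UPPER(name);         -- call the UDF like a built-in function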
 Data Processing Operators
Loading
Storing
Filtering
Foreach
Generate
Streaming
Grouping
Joining
Cogroup
Cross
Describe
Explain
Illustrate
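
A small Pig Latin sketch (hypothetical data files) that touches several of these operators:

    orders  = LOAD '/data/orders.csv' USING PigStorage(',') AS (id:int, cust:chararray, amount:double);
    custs   = LOAD '/data/customers.csv' USING PigStorage(',') AS (cust:chararray, city:chararray);
    big     = FILTER orders BY amount > 100.0;                          -- Filtering
    joined  = JOIN big BY cust, custs BY cust;                          -- Joining
    grouped = GROUP joined BY custs::city;                              -- Grouping
    totals  = FOREACH grouped GENERATE group AS city,                   -- Foreach / Generate
                                       SUM(joined.big::amount) AS total;
    DESCRIBE totals;                                                    -- Describe prints the schema
    STORE totals INTO '/data/totals';                                   -- Storing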
5. SERVICES
 Extraction
 Transformation
 Loading
 Telecom Services
 Bigdata Advisory Services
 Bigdata Transformation Services
 Brokerage and Banking
 Financial Services
 Education Services
 Mailing Solutions
 Manufacturing Services
6. Features / Benefits

 Ease of programming
 Mobile Programming
 Branded email templates
 Data Analytics
 Join Datasets
 Sort Datasets
 Filters and Data Types
 Group By
 User Defined Functions
 Extract-transform-load (ETL) data pipelines,
 Iterative data processing
7. ADVANTAGES

 Increases productivity
 10 lines of Pig Latin ≈ 200 lines of Java
 Quickly changing data processing requirements
 Processing data from multiple channels
 Quick hypothesis testing
 Time sensitive data refreshes
 Data profiling using sampling
 Metadata not required, but used when available.
 Support for nested types.
 Web log processing.
 Data processing for web search platforms.
 Ad hoc queries across large data sets.
8. APPLICATIONS

 Call Center Apps


 Marketing Apps
 Chatter Applications
 Community Apps
 Big Data for Google AdWords
 Checkout / Checkin Apps in Organizations
 Add AppExchange Apps
 Create new Apps
 Website.Com
 Enterprise applications
9. PICTURES / VIDEOS

 Pig is a high level scripting language that is used with


Apache Hadoop. Pig enables data workers to write
complex data transformations without knowing Java.
10. SOFTWARES / TOOLS
 Pig Latin is a Data preprocessing Language.
 Running Pig
 Script: Execute commands in a file
 Grunt: Interactive Shell for executing Pig Commands
 Embedded: Execute Pig commands from Java using the PigServer class.
 Pig Steps
1. Load text into a bag (named ‘lines’)
2. Tokenize the text in the ‘lines’ bag
3. Retain first letter of each token
4. Group by letter
5. Count the number of occurrences in each group
6. Descending order the group by the count
7. Grab the first element => Most occurring letter
8. Persist result on a file system
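
A hedged Pig Latin sketch of these eight steps (the input and output paths are hypothetical; TOKENIZE, SUBSTRING and COUNT are Pig built-in functions):

    lines   = LOAD '/data/input.txt' AS (line:chararray);                      -- 1. load text into the 'lines' bag
    tokens  = FOREACH lines GENERATE FLATTEN(TOKENIZE(line)) AS token;         -- 2. tokenize the text
    letters = FOREACH tokens GENERATE SUBSTRING(token, 0, 1) AS letter;        -- 3. keep the first letter of each token
    grouped = GROUP letters BY letter;                                         -- 4. group by letter
    counts  = FOREACH grouped GENERATE group AS letter, COUNT(letters) AS n;   -- 5. count occurrences in each group
    ordered = ORDER counts BY n DESC;                                          -- 6. order groups by count, descending
    top1    = LIMIT ordered 1;                                                 -- 7. first element = most occurring letter
    STORE top1 INTO '/data/letter_count';                                      -- 8. persist the result on the file system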
 https://cwiki.apache.org/PigTools.
 https://issues.apache.or/PIG-366.
 https://en.wikipedia.org/Pig_(programming_tool)
 https://pig.apache.org/download
 http://www.slideshare.net/big-data-analytics-using-pig
11. Resources

 https://pig.apache.org/

 http://hortonworks.com/pig/
 https://en.wikipedia.org/Pig_(programming_tool)

 http://www.rohitmenon.com/apache-pig-tutorial-part-1/
 http://www-01.ibm.com//pig/
12. Case Studies / White Papers:
 Large-Scale Mining Software Repositories Studies
 Flight Delay Analysis
 YouTube
 Yahoo
 Google
 Facebook
 Microsoft

13. Conclusions : Pig is a high-level scripting language that


is used with Apache Hadoop and excels at describing
data analysis problems as data flows. Pig provides
common data processing operations. Pig supports rapid
iteration of ad-hoc queries.
Hive vs Pig
1. Hive is a DW tool; Pig is a procedural data-flow language.
2. Hive is used by data analysts; Pig is used by researchers and programmers.
3. Hive is used for creating reports; Pig is used for programming.
4. Hive operates on the server side of a cluster; Pig operates on the client side of a cluster.
5. Hive does not support the Avro file format; Pig supports Avro.
6. Hive directly leverages SQL and is easy to learn for database experts; Pig is SQL-like but varies to a great extent.
7. Hive makes use of an exact variation of the dedicated SQL DDL language by defining tables beforehand; Pig does not have a dedicated metadata database.
8. Hive is for structured data; Pig is for semi-structured data.
HBase vs RDBMS
1. HBase is column-oriented; an RDBMS is (mostly) row-oriented.
2. HBase has a flexible schema and columns can be added on the fly; an RDBMS has a fixed schema.
3. HBase is designed to store denormalized data; an RDBMS is designed to store normalized data.
4. HBase is good with sparse tables; an RDBMS is not optimized for sparse tables.
5. HBase performs joins using MapReduce, which is not optimized; an RDBMS is optimized for joins.
6. HBase has tight integration with MapReduce; an RDBMS has no integration with MapReduce.
7. HBase offers horizontal scalability (just add hardware); an RDBMS is hard to shard and scale.
8. HBase is good for semi-structured as well as structured data; an RDBMS is good for structured data.
