Lecture 11- Introduction to Apache Hive

The document provides an overview of Apache Hive, a data analysis tool designed for querying and managing large datasets on Hadoop using Hive Query Language (HQL). It compares Hive with other technologies like Hadoop and Pig, highlighting its strengths in handling large-scale data and its limitations in real-time analytics. Additionally, it discusses the history of Hive's development and its role in processing data for various industries.


Lecture 11

Apache Hive

By
Dr. Aditya Bhardwaj

aditya.bhardwaj@bennett.edu.in

Big Data Analytics and Business Intelligence (CSET/CMCA-580)


Apache Hive RoadMap

Lecture 11: Introduction to Apache Hive
Lecture 12-13: Working, Architecture, Commands
Lecture 14: Joins and Partitions in Hive
Lecture 15: Practical Demonstrations on HQL

Hadoop vs Pig vs Hive: Quick Look at Industrial Use Cases

| Use Case | Apache Hadoop | Apache Pig | Apache Hive |
|---|---|---|---|
| Data Storage and Management | Distributed storage and processing of large-scale datasets across multiple nodes using HDFS. | Used for ETL (Extract, Transform, Load) processes to prepare data stored in Hadoop for analysis. | Used for querying and managing large datasets stored in Hadoop with a SQL-like interface. |
| Real-Time Analytics | Not suitable for real-time analytics due to batch processing nature. | Suitable for batch data processing but not for real-time analytics. | Supports faster querying compared to raw MapReduce; however, not ideal for real-time analytics. |
| Financial Analysis and Fraud Detection | Provides a scalable environment for storing and processing transaction data and detecting anomalies. | Used to clean, aggregate, and transform raw financial data for further analysis. | Supports SQL-like queries for fast analysis and reporting of financial data stored in Hadoop. |
| Healthcare Data Processing | Stores and processes large datasets, such as electronic health records (EHR), medical imaging, etc. | Used to transform and clean healthcare data for analysis or model building. | Used for querying and analyzing structured healthcare data, such as patient records and clinical data. |

History of Hive
▪ At Facebook, data volume grew from gigabytes (2006) to 1 TB/day (2007), and later to more than 500 TB per day.
▪ This rapid growth overwhelmed traditional data warehousing, which could not process data at that scale.
▪ Hadoop offered an alternative for storing and processing large datasets.
▪ But MapReduce is very low-level and requires custom code for each job.
▪ Facebook developed Hive as a solution.
▪ Sept 2008: Hive becomes a Hadoop subproject.
▪ Apache continued the development of Hive.


Why Go for Hive When Pig is There?
Hive vs. SQL: Which One Is Better for Data Analysis?

Hive and SQL Server are comparable only in the syntax of their query languages. SQL Server is built to respond in real time from a single machine, whereas Hive is built to process large datasets that may span hundreds or thousands of machines.
Apache Hive is an open-source project run by volunteers at the Apache Software Foundation, used for querying, managing, and storing structured data on Hadoop.
Hive uses HQL (Hive Query Language), which lets you describe in SQL-like syntax the work that would otherwise be hand-written as map and reduce steps.
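To make the link between a SQL-like query and map and reduce steps concrete, here is a minimal Python sketch. It is not Hive code and does not reflect how Hive is implemented internally; the `employees` table, its rows, and all function names are assumptions made up for illustration. It shows how a one-line `GROUP BY` count could be expressed by hand as a map step, a shuffle, and a reduce step.

```python
from collections import defaultdict

# Hypothetical rows of an `employees` table: (name, dept).
EMPLOYEES = [
    ("asha", "sales"),
    ("ravi", "engineering"),
    ("meena", "sales"),
    ("john", "engineering"),
    ("li", "finance"),
]

# The HiveQL a user would write (shown for comparison, not executed here).
HQL = "SELECT dept, COUNT(*) FROM employees GROUP BY dept"


def map_step(row):
    """Map: emit one (key, value) pair per input row."""
    _name, dept = row
    yield dept, 1


def shuffle(pairs):
    """Shuffle: collect all values that share a key between the two phases."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_step(key, values):
    """Reduce: collapse the values for one key into a single result."""
    return key, sum(values)


def run_group_by_count(rows):
    """Run map, shuffle, and reduce in sequence over the given rows."""
    pairs = [pair for row in rows for pair in map_step(row)]
    grouped = shuffle(pairs)
    return dict(reduce_step(key, values) for key, values in grouped.items())


if __name__ == "__main__":
    print(HQL)
    print(run_group_by_count(EMPLOYEES))
```

The single query string carries the same intent as the three hand-written functions, which is the convenience the slide describes.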
Hive vs. SQL
Challenges of Hive
▪ Compared to Apache Pig, latency for Apache Hive queries is generally very high.
Key Summary Points on Hive
▪ Hive is not a database but a data analysis tool that uses SQL-like syntax.
▪ HiveQL (Hive Query Language) queries are automatically translated into MapReduce jobs that execute on Hadoop.
▪ Apache Hive converts SQL queries into MapReduce jobs and then submits them to the Hadoop cluster.
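The translate-then-submit flow in the summary points can be sketched as a toy model in Python. This is an illustration under stated assumptions, not Hive's actual planner or any Hadoop API: `compile_query`, `submit_job`, and the `sales` table are hypothetical names. The sketch turns the clauses of a filtered aggregate query into a mapper and a reducer, then runs them through a simulated map, sort, and reduce pipeline.

```python
from itertools import groupby
from operator import itemgetter

# Hypothetical rows of a `sales` table.
SALES = [
    {"region": "north", "amount": 250},
    {"region": "south", "amount": 80},
    {"region": "north", "amount": 120},
    {"region": "south", "amount": 300},
    {"region": "east", "amount": 50},
]

# Query being modelled (not executed here):
#   SELECT region, SUM(amount) FROM sales WHERE amount > 100 GROUP BY region


def compile_query(where, group_key, agg_column):
    """Toy 'compiler': turn the query's clauses into a mapper and a reducer."""

    def mapper(row):
        # In this sketch the WHERE filter runs in the map phase,
        # so rejected rows never reach the shuffle.
        if where(row):
            yield row[group_key], row[agg_column]

    def reducer(key, values):
        # SUM(...) runs in the reduce phase, once per group key.
        return key, sum(values)

    return mapper, reducer


def submit_job(rows, mapper, reducer):
    """Toy 'cluster': run map, sort/shuffle by key, then reduce."""
    mapped = [pair for row in rows for pair in mapper(row)]
    mapped.sort(key=itemgetter(0))
    return {
        key: reducer(key, [value for _, value in group])[1]
        for key, group in groupby(mapped, key=itemgetter(0))
    }


mapper, reducer = compile_query(
    where=lambda row: row["amount"] > 100,
    group_key="region",
    agg_column="amount",
)
result = submit_job(SALES, mapper, reducer)
print(result)
```

Separating `compile_query` from `submit_job` mirrors the two stages named in the summary: first the query is converted into map and reduce logic, then that logic is handed to the cluster to run.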
Reference

▪ https://hive.apache.org/
Thanks
