LECTURE 13
Zaeem Anwaar
Assistant Director IT
Hadoop Ecosystem
The Apache Hive data warehouse software facilitates reading, writing, and managing
large datasets residing in distributed storage using SQL queries.
Hive can only be used if the data is structured (as it is based on SQL commands)
It is an open source data warehouse system for querying and analyzing large datasets
stored in Hadoop files
Hive performs three main functions:
Data summarization (finding a compact description of a dataset)
Query
Analysis
Hive Hadoop Component is used mainly by data analysts
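The data summarization function can be illustrated in miniature. Below is a plain-Python sketch (not Hive itself; the table and column names are made up) of the kind of result an HQL GROUP BY query produces:

```python
# Toy stand-in for an HQL query such as:
#   SELECT dept, COUNT(*), AVG(salary) FROM employees GROUP BY dept;
# (hypothetical table/columns; real Hive runs this over files in HDFS)
from collections import defaultdict

employees = [
    {"dept": "IT", "salary": 900},
    {"dept": "IT", "salary": 1100},
    {"dept": "HR", "salary": 800},
]

groups = defaultdict(list)
for row in employees:
    groups[row["dept"]].append(row["salary"])

# Compact description of the dataset: count and average per department
summary = {d: (len(s), sum(s) / len(s)) for d, s in groups.items()}
print(summary)  # {'IT': (2, 1000.0), 'HR': (1, 800.0)}
```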
Hive (Continued)
Basically, HIVE is a data warehousing component that performs reading, writing and
managing of large datasets in a distributed environment using an SQL-like interface.
HIVE + SQL = HQL
The query language of Hive is called Hive Query Language (HQL), which is very similar
to SQL.
It has 2 basic components: Hive Command Line and JDBC/ODBC driver.
The Hive Command line interface is used to execute HQL commands.
Java Database Connectivity (JDBC) and Open Database Connectivity (ODBC) drivers are
used to establish connections to the data store.
PIG
High-level platform for creating programs that run on Apache Hadoop.
Scripting (The language for this platform is called Pig Latin)
Pig Latin is used to develop the data analysis code
Mapper and Reducer class code in Java is lengthy
To reduce the lines of code (e.g. from 200 to about 20), the Pig scripting libraries/APIs/functions are used
For carrying out special-purpose processing, users can create their own functions.
Pig Hadoop Component is generally used by Researchers and Programmers
Pig works with both structured and semi-structured data
10 lines of Pig Latin = approx. 200 lines of MapReduce Java code
PIG Working:
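The code-reduction claim above can be seen in miniature with a word count: the explicit map and reduce phases take many lines, while a higher-level API collapses them into one. A Python sketch (standard library only, not actual Pig Latin):

```python
from collections import Counter

text = "big data makes big clusters and big jobs"

# Low-level style: explicit map and reduce phases, as in MapReduce Java code
mapped = [(word, 1) for word in text.split()]   # map: emit (word, 1) pairs
reduced = {}
for word, one in mapped:                        # reduce: sum counts per key
    reduced[word] = reduced.get(word, 0) + one

# High-level style: one line, analogous to a Pig Latin GROUP/COUNT script
counts = dict(Counter(text.split()))

print(counts["big"])  # 3
```

The two styles produce the same result; the high-level version simply hides the map/reduce plumbing, which is the trade-off Pig makes at much larger scale.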
Mahout
Machine-learning Algorithms
Machine learning algorithms allow us to build self-learning
machines that evolve by themselves without being explicitly programmed
Supervised Learning vs Unsupervised Learning
Mahout provides an environment for creating machine learning
applications that are scalable.
It performs collaborative filtering, clustering and classification
It has a predefined set of libraries that already contain
built-in algorithms for different use cases.
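The idea behind collaborative filtering can be sketched with a toy user-based recommender (this is the technique, not Mahout's actual API; the users, items, and ratings are made up):

```python
# Toy user-based collaborative filtering via cosine similarity
# (illustrative only; Mahout provides scalable versions of this idea)
import math

ratings = {
    "alice": {"item1": 5, "item2": 3},
    "bob":   {"item1": 4, "item2": 2, "item3": 5},
    "carol": {"item2": 5, "item3": 1},
}

def cosine(u, v):
    common = set(u) & set(v)            # items both users rated
    if not common:
        return 0.0
    dot = sum(u[i] * v[i] for i in common)
    norm_u = math.sqrt(sum(x * x for x in u.values()))
    norm_v = math.sqrt(sum(x * x for x in v.values()))
    return dot / (norm_u * norm_v)

# Recommend to alice the unseen items rated by her most similar user
target = ratings["alice"]
best = max((u for u in ratings if u != "alice"),
           key=lambda u: cosine(target, ratings[u]))
unseen = [i for i in ratings[best] if i not in target]
print(best, unseen)  # bob ['item3']
```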
Oozie
Oozie is scalable
It can manage the timely execution of thousands of workflows
in a Hadoop cluster.
Oozie is very flexible as well.
One can easily start, stop, suspend and rerun jobs.
It is even possible to skip a specific failed node or rerun it
in Oozie.
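The rerun/skip behaviour can be sketched with a toy workflow runner (purely illustrative; real Oozie workflows are defined in XML and run actions on a Hadoop cluster, and all names here are made up):

```python
# Toy workflow runner illustrating Oozie-style rerun and skip of failed actions
def run_workflow(actions, max_retries=1, skip_on_failure=True):
    completed, skipped = [], []
    for name, action in actions:
        for attempt in range(max_retries + 1):
            try:
                action()
                completed.append(name)   # action succeeded (possibly on rerun)
                break
            except Exception:
                if attempt == max_retries:
                    if skip_on_failure:
                        skipped.append(name)   # skip the failed node
                    else:
                        raise                  # abort the whole workflow
    return completed, skipped

calls = {"n": 0}
def flaky():
    calls["n"] += 1
    if calls["n"] < 2:          # fails once, succeeds on rerun
        raise RuntimeError("transient failure")

done, skipped = run_workflow([("ingest", lambda: None),
                              ("transform", flaky),
                              ("load", lambda: None)])
print(done, skipped)  # ['ingest', 'transform', 'load'] []
```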
Zookeeper
It is a centralized service
A Hadoop Ecosystem component for maintaining configuration
information, naming, providing distributed synchronization, and
providing group services.
Zookeeper manages and coordinates a large cluster of machines.
Apache Zookeeper coordinates with various services in a distributed
environment.
It saves a lot of time by performing synchronization, configuration
maintenance, grouping and naming.
Zookeeper maintains a record of all transactions.
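The combination of named configuration entries, synchronization, and a transaction record can be sketched as a toy in-process coordinator (not ZooKeeper's real API, which exposes a tree of znodes over the network; the paths and values are made up):

```python
import threading

# Toy centralized coordination service: named configuration entries,
# a lock for synchronization, and a record of all transactions
# (illustrative only; real ZooKeeper serves a znode tree to many machines)
class TinyCoordinator:
    def __init__(self):
        self._data = {}
        self._lock = threading.Lock()
        self.transactions = []          # record of every write

    def set(self, path, value):
        with self._lock:                # synchronized across clients
            self._data[path] = value
            self.transactions.append(("set", path, value))

    def get(self, path):
        with self._lock:
            return self._data.get(path)

coord = TinyCoordinator()
coord.set("/config/replicas", 3)
coord.set("/services/hive", "host-a:10000")
print(coord.get("/config/replicas"), len(coord.transactions))  # 3 2
```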
Spark