Big Data Unit 5
HDFS is a distributed file system that is well suited for storing large files. It is designed to support batch processing
of data but does not provide fast individual record lookups. HBase is built on top of HDFS and is designed to provide
access to single rows of data in large tables. Overall, the key difference between HDFS and HBase is that HDFS
offers high-throughput sequential access to large files, whereas HBase offers low-latency random reads and writes of
individual rows stored on HDFS.
HBase Architecture
The HBase Physical Architecture consists of servers in a Master-Slave relationship as shown below. Typically, the
HBase cluster has one Master node, called HMaster and multiple Region Servers called HRegionServer. Each
Region Server contains multiple Regions – HRegions.
Just like in a Relational Database, data in HBase is stored in Tables and these Tables are stored in Regions. When a
Table becomes too big, the Table is partitioned into multiple Regions. These Regions are assigned to Region Servers
across the cluster. Each Region Server hosts roughly the same number of Regions.
The Data Model in HBase is designed to accommodate semi-structured data that could vary in field size, data type
and columns. Additionally, the layout of the data model makes it easier to partition the data and distribute it across
the cluster. The Data Model in HBase is made of different logical components such as Tables, Rows, Column
Families, Columns, Cells and Versions.
Tables – HBase Tables are more like a logical collection of rows stored in separate partitions called Regions. As
shown above, every Region is then served by exactly one Region Server. The figure above shows a representation of
a Table.
Rows – A row is one instance of data in a table and is identified by a rowkey. Rowkeys are unique in a Table and are
always treated as a byte[].
Column Families – Data in a row are grouped together as Column Families. Each Column Family has one or more
Columns, and these Columns in a family are stored together in a low-level storage file known as an HFile. Column
Families form the basic unit of physical storage to which certain HBase features like compression are applied. Hence
it’s important that proper care be taken when designing the Column Families in a table. The table above shows the
Customer and Sales Column Families. The Customer Column Family is made up of 2 columns – Name and City,
whereas the Sales Column Family is made up of 2 columns – Product and Amount.
Columns – A Column Family is made of one or more columns. A Column is identified by a Column Qualifier that
consists of the Column Family name concatenated with the Column name using a colon – example:
columnfamily:columnname. There can be multiple Columns within a Column Family, and Rows within a table can
have a varied number of Columns.
Cell – A Cell stores data and is essentially a unique combination of rowkey, Column Family and the Column
(Column Qualifier). The data stored in a Cell is called its value and the data type is always treated as byte[].
Version – The data stored in a cell is versioned and versions of data are identified by the timestamp. The number of
versions of data retained in a column family is configurable and this value by default is 3.
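To make these concepts concrete, here is a small HBase shell sketch; the table name customer_sales, the rowkey
row-001 and the cell values are hypothetical, chosen only to mirror the Customer and Sales column families
described above.
# create a table with two column families; keep up to 3 versions in 'Sales'
create 'customer_sales', {NAME => 'Customer'}, {NAME => 'Sales', VERSIONS => 3}
# each put writes one cell, addressed by rowkey + columnfamily:columnname
put 'customer_sales', 'row-001', 'Customer:Name', 'John'
put 'customer_sales', 'row-001', 'Customer:City', 'Pune'
put 'customer_sales', 'row-001', 'Sales:Product', 'Laptop'
put 'customer_sales', 'row-001', 'Sales:Amount', '45000'
# read back a single cell, asking for up to 3 timestamped versions
get 'customer_sales', 'row-001', {COLUMN => 'Sales:Amount', VERSIONS => 3}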
HBase clients
1. REST
HBase ships with a powerful REST server, which supports the complete client and
administrative API. It also provides support for different message formats, offering
many choices for a client application to communicate with the server.
REST Java client
The REST server also comes with a comprehensive Java client API. It is located in the
org.apache.hadoop.hbase.rest.client package.
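A minimal sketch of using this Java client is shown below. It assumes a REST gateway is already running on
localhost port 8080 and reuses the hypothetical customer_sales table from the earlier example; it is an illustration of
the client classes, not a definitive recipe.
import org.apache.hadoop.hbase.client.Get;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.rest.client.Client;
import org.apache.hadoop.hbase.rest.client.Cluster;
import org.apache.hadoop.hbase.rest.client.RemoteHTable;
import org.apache.hadoop.hbase.util.Bytes;

public class RestClientExample {
  public static void main(String[] args) throws Exception {
    // Point the client at the REST gateway (assumed to run on localhost:8080).
    Cluster cluster = new Cluster();
    cluster.add("localhost", 8080);
    Client client = new Client(cluster);

    // RemoteHTable mimics the usual table interface but talks to the REST server.
    RemoteHTable table = new RemoteHTable(client, "customer_sales");
    Get get = new Get(Bytes.toBytes("row-001"));
    get.addColumn(Bytes.toBytes("Customer"), Bytes.toBytes("Name"));
    Result result = table.get(get);
    System.out.println("Value: " + Bytes.toString(
        result.getValue(Bytes.toBytes("Customer"), Bytes.toBytes("Name"))));
    table.close();
  }
}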
2. Thrift
Apache Thrift is written in C++, but provides schema compilers for many programming
languages, including Java, C++, Perl, PHP, Python, Ruby, and more. Once you have
compiled a schema, you can exchange messages transparently between systems implemented
in one or more of those languages.
3. Avro
Apache Avro, like Thrift, provides schema compilers for many programming languages,
including Java, C++, PHP, Python, Ruby, and more. Once you have compiled a schema,
you can exchange messages transparently between systems implemented in one or more
of those languages.
Cassandra
The Cassandra data store is an open source Apache project available at http://cassandra.apache.org. Cassandra
originated at Facebook in 2007 to solve that company’s inbox
search problem, in which they had to deal with large volumes of data in a way that was
difficult to scale with traditional methods.
Main features
• Decentralized
Every node in the cluster has the same role. There is no single point of failure. Data is distributed across the cluster
(so each node contains different data), but there is no master as every node can service any request.
• Supports replication and multi data center replication
Replication strategies are configurable.[18] Cassandra is designed as a distributed system, for deployment of large
numbers of nodes across multiple data centers. Key features of Cassandra’s distributed architecture are specifically
tailored for multiple-data center deployment, for redundancy, for failover and disaster recovery.
• Scalability
Read and write throughput both increase linearly as new machines are added, with no downtime or interruption to
applications.
• Fault-tolerant
Data is automatically replicated to multiple nodes for fault-tolerance. Replication across multiple data centers is
supported. Failed nodes can be replaced with no downtime.
• Tunable consistency
Writes and reads offer a tunable level of consistency, all the way from "writes never fail" to "block for all replicas to
be readable", with the quorum level in the middle.
• MapReduce support
Cassandra has Hadoop integration, with MapReduce support. There is support also for Apache Pig and Apache
Hive.
• Query language
Cassandra introduces CQL (Cassandra Query Language), a SQL-like alternative to the traditional RPC interface.
Language drivers are available for Java (JDBC), Python, Node.js and Go (see the short cqlsh sketch below).
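As a small, hedged illustration of the last two features, the following cqlsh session reads at QUORUM consistency;
the shop keyspace and users table are hypothetical and assume a cluster is already running.
cqlsh> -- require a quorum of replicas to acknowledge subsequent reads and writes
cqlsh> CONSISTENCY QUORUM
cqlsh> SELECT name, city FROM shop.users WHERE user_id = 'u1';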
Cassandra data model
Cassandra is a hybrid between a key-value and a column-oriented NoSQL database. The key-value nature is
represented by a row object, in which the value is generally organized into columns. In short, Cassandra uses the
following terms:
1. Keyspace can be seen as a database schema in the SQL world.
2. Column family resembles a table in the SQL world (though, as described below, this analogy can be misleading).
3. Row has a key and, as its value, a set of Cassandra columns, but without the rigid schema of a relational table.
4. Column is a triplet := (name, value, timestamp).
5. Super column is a tuple := (name, collection of columns).
6. Data Types: Validators & Comparators
7. Indexes
Cassandra data model is illustrated in the following figure
KeySpaces
KeySpaces are the largest container, with an ordered list of ColumnFamilies; a KeySpace is similar to a database in
an RDBMS.
Column
A Column is the most basic element in Cassandra: a simple tuple that contains a name, value and timestamp. All
values are set by the client. That's an important consideration for the timestamp, as it means you'll need clock
synchronization.
SuperColumn
A SuperColumn is a column that stores an associative array of columns. You could think of it as similar to a
HashMap in Java, with an identifying column (name) that stores a list of columns inside (value). The key difference
between a Column and a SuperColumn is that the value of a Column is a string, whereas the value of a SuperColumn
is a map of Columns. Note that SuperColumns have no timestamp, just a name and a value.
ColumnFamily
A ColumnFamily holds a number of Rows, each of which is a sorted map that matches column names to column
values. A row is a set of columns, and a ColumnFamily is similar to the table concept from relational databases. The
column family holds an ordered list of columns which you can reference by column name.
A ColumnFamily can be of two types, Standard or Super. Standard ColumnFamilies contain a map of normal
columns, whereas Super ColumnFamilies contain a map of SuperColumns.
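Today this model is usually exposed through CQL, where a keyspace groups tables (column families). A hedged
sketch, with hypothetical keyspace, table and column names, might look like this:
CREATE KEYSPACE shop
  WITH replication = {'class': 'SimpleStrategy', 'replication_factor': 3};

USE shop;

-- a table (column family); user_id is the row key
CREATE TABLE users (
  user_id text PRIMARY KEY,
  name    text,
  city    text
);

-- each non-key value is stored as a column (name, value, timestamp)
INSERT INTO users (user_id, name, city) VALUES ('u1', 'John', 'Pune');
SELECT * FROM users WHERE user_id = 'u1';
Note that SuperColumns belong to the older Thrift-era model and are not exposed through CQL.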
Cassandra clients
4. Thrift
Thrift is a code generation library for clients in C++, C#, Erlang, Haskell, Java, Objective C/Cocoa, OCaml, Perl,
PHP, Python, Ruby, Smalltalk, and Squeak. Its goal is to provide an easy way to support efficient RPC calls in a wide
variety of popular languages, without requiring the overhead of something like SOAP.
5. Avro
The Apache Avro project is a data serialization and RPC system targeted as the replacement for Thrift in Cassandra.
Avro provides many features similar to those of Thrift and other data serialization and
RPC mechanisms including:
• Robust data structures
• An efficient, small binary format for RPC calls
• Easy integration with dynamically typed languages such as Python, Ruby, Smalltalk,
Perl, PHP, and Objective-C
Avro is the RPC and data serialization mechanism for
Cassandra. It generates code that remote clients can use to interact with the database.
It’s well-supported in the community and has the strength of growing out of the larger
and very well-known Hadoop project. It should serve Cassandra well for the foreseeable
future.
6. Hector
Hector is an open source project written in Java using the MIT license. It was created
by Ran Tavory of Outbrain (previously of Google) and is hosted at GitHub. It was one
of the early Cassandra clients and is used in production at Outbrain. It wraps Thrift
and offers JMX, connection pooling, and failover.
Hector is a well-supported and full-featured Cassandra client, with many users and an
active community. It offers the following:
High-level object-oriented API
Fail over support
Connection pooling
JMX (Java Management eXtensions) support
7. Chirper
Chirper is a port of Twissandra to .NET, written by Chaker Nakhli. It’s available under the Apache 2.0 license, and
the source code is on GitHub.
8. Chiton
Chiton is a Cassandra browser written by Brandon Williams that uses the Python GTK framework.
9. Pelops
Pelops is a free, open source Java client written by Dominic Williams. It is similar to
Hector in that it’s Java-based, but it was started more recently. This has become a very
popular client. Its goals include the following:
To create a simple, easy-to-use client
To completely separate concerns for data processing from lower-level items such
as connection pooling
To act as a close follower to Cassandra so that it’s readily up to date
10. Kundera
Kundera is an object-relational mapping (ORM) implementation for Cassandra written using Java annotations.
11. Fauna
Ryan King of Twitter and Evan Weaver created a Ruby client for the Cassandra database called Fauna.
Pig
Pig is a simple-to-understand data flow language used in the analysis of large data sets. Pig scripts are
automatically converted into MapReduce jobs by the Pig interpreter, so you can analyze the data in a Hadoop cluster
even if you aren't familiar with MapReduce.
Pig is made up of two components: the first is the language itself, which is called PigLatin, and the second is a
runtime environment where PigLatin programs are executed.
The Pig execution environment has two modes:
• Local mode: All scripts are run on a single machine. Hadoop MapReduce and HDFS are not required.
• Hadoop: Also called MapReduce mode, all scripts are run on a given Hadoop cluster.
Pig programs can be run in three different ways, all of them compatible with local and Hadoop mode:
1. Pig Latin Script: Simply a file containing Pig Latin commands, identified by the .pig suffix (for example,
file.pig or myscript.pig). The commands are interpreted by Pig and executed in sequential order.
2. Grunt shell: Grunt is a command interpreter. You can type Pig Latin on the grunt command line and Grunt
will execute the command on your behalf.
3. Embedded: Pig programs can be executed as part of a Java program.
Pig provides an engine for executing data flows in parallel on Hadoop. It includes a language, Pig Latin, for
expressing these data flows. Pig Latin includes operators for many of the traditional data operations (join, sort, filter,
etc.), as well as the ability for users to develop their own functions for reading, processing, and writing data.
• It is a large-scale data processing system.
• Scripts are written in Pig Latin, a dataflow language.
• Developed by Yahoo, and open source.
• Pig runs on Hadoop. It makes use of both the Hadoop Distributed File System, HDFS, and Hadoop’s
processing system, MapReduce.
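A short Pig Latin sketch of such a data flow is shown below; the input path, field names and output path are
hypothetical.
-- load raw page views, keep only 2023 entries, count views per user
views  = LOAD '/data/pageviews' USING PigStorage('\t')
             AS (userid:long, url:chararray, year:int);
recent = FILTER views BY year == 2023;
groups = GROUP recent BY userid;
counts = FOREACH groups GENERATE group AS userid, COUNT(recent) AS views;
sorted = ORDER counts BY views DESC;
STORE sorted INTO '/data/pageview_counts';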
Differences between Pig and MapReduce
Pig is a data flow language; the key focus of Pig is to manage the flow of data from the input source to the output
store. Pig is written specifically for managing the data flow of MapReduce-type jobs. Most, if not all, jobs in Pig are
MapReduce jobs or data movement jobs. Pig allows custom functions to be added for processing, and some default
functions, such as ordering, grouping, distinct, count, etc., are provided.
MapReduce, on the other hand, is a programming model, or framework, for processing large data sets in a
distributed manner using a large number of computers, i.e. nodes.
Pig commands are submitted as MapReduce jobs internally. An advantage Pig has over MapReduce is that the
former is more concise: 200 lines of Java code written for MapReduce can often be reduced to about 10 lines of Pig
code (see the word-count sketch below).
A disadvantage of Pig is that it is a bit slower than hand-written MapReduce, as Pig commands are translated into
MapReduce jobs prior to execution.
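For example, the classic word count, which takes a full Java MapReduce program, can be sketched in a handful of
Pig Latin lines (the paths here are illustrative):
lines  = LOAD '/data/input.txt' AS (line:chararray);
-- split each line into words, one word per record
words  = FOREACH lines GENERATE FLATTEN(TOKENIZE(line)) AS word;
grpd   = GROUP words BY word;
counts = FOREACH grpd GENERATE group AS word, COUNT(words) AS cnt;
STORE counts INTO '/data/wordcount';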
Pig Latin
Pig Latin has a very rich syntax. It supports operators for the following operations:
Loading and storing of data
Streaming data
Filtering data
Grouping and joining data
Sorting data
Combining and splitting data
Pig Latin also supports a wide variety of types, expressions, functions, diagnostic operators, macros, and file system
commands.
DUMP
Dump directs the output of your script to your screen.
Syntax:
DUMP alias;
LOAD: Loads data from the file system.
Syntax
LOAD 'data' [USING function] [AS schema];
'data' is the name of the file or directory, in single quotes. USING and AS are keywords. If the USING clause is
omitted, the default load function PigStorage is used. Schema – a schema defined using the AS keyword, enclosed in
parentheses.
Usage
Use the LOAD operator to load data from the file system.
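A minimal example combining LOAD and DUMP (the file name and schema are assumed for illustration):
-- read a comma-delimited file and print the relation to the screen
A = LOAD 'students.csv' USING PigStorage(',')
        AS (name:chararray, age:int, gpa:float);
DUMP A;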
Grunt
Grunt is Pig’s interactive shell. It enables users to enter Pig Latin interactively and provides a shell for users to
interact with HDFS.
In other words, it is a command interpreter. You can type Pig Latin on the grunt command line and Grunt will
execute the command on your behalf.
To enter Grunt, invoke Pig with no script or command to run. Typing:
$ pig -x local
will result in the prompt:
grunt>
This gives you a Grunt shell to interact with your local filesystem. To exit Grunt you can type quit or enter Ctrl-D.
Pig’s Data Model
This includes Pig’s data types, how it handles concepts such as missing data, and how you can describe your data to
Pig.
Types
Pig’s data types can be divided into two categories: scalar types, which contain a single value, and complex types,
which contain other types.
1. Scalar Types
Pig’s scalar types are simple types that appear in most programming languages.
int
An integer, stored as a four-byte signed value.
long
A long integer, stored as an eight-byte signed value.
float
A floating-point number that uses four bytes to store its value.
double
A double-precision floating-point number that uses eight bytes to store its value.
chararray
A string or character array, expressed as a string literal with single quotes.
bytearray
A blob or array of bytes.
2. Complex Types
Pig has several complex data types such as maps, tuples, and bags. All of these types can contain data of any type,
including other complex types. So it is possible to have a map where the value field is a bag, which contains a tuple
where one of the fields is a map.
Map
A map in Pig is a chararray to data element mapping, where that element can be any Pig type, including a complex
type. The chararray is called a key and is used as an index to find the element, referred to as the value.
Tuple
A tuple is a fixed-length, ordered collection of Pig data elements. Tuples are divided into fields, with each field
containing one data element. These elements can be of any type—they do not all need to be the same type. A tuple is
analogous to a row in SQL, with the fields being SQL columns.
Bag
A bag is an unordered collection of tuples. Because it has no order, it is not possible to reference tuples in a bag by
position. Like tuples, a bag can, but is not required to, have a schema associated with it. In the case of a bag, the
schema describes all tuples within the bag.
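These complex types usually appear in schemas declared with AS. The sketch below, with hypothetical file and field
names, declares a tuple, a bag and a map in one relation:
-- t is a tuple, friends is a bag of tuples, props is a map with chararray keys
A = LOAD 'people'
        AS (t:tuple(name:chararray, age:int),
            friends:bag{f:tuple(fname:chararray)},
            props:map[]);
DESCRIBE A;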
Nulls
Pig includes the concept of a data element being null. Data of any type can be null. It is important to understand that
in Pig the concept of null is the same as in SQL, which is completely different from the concept of null in C, Java,
Python, etc. In Pig a null data element means the value is unknown.
Casts
A cast converts content of one type to another type.
Hive
Hive was originally an internal Facebook project which eventually matured into a full-blown Apache project. It was
created to simplify access to MapReduce (MR) by exposing a SQL-based language for data manipulation. Hive also
maintains metadata in a metastore, which is stored in a relational database; this metadata contains information about
what tables exist, their columns, privileges, and more. Hive is an open source data warehousing solution built on top
of Hadoop, and its particular strength is in offering ad-hoc querying of data, in contrast to the compilation
requirement of Pig and Cascading.
Hive is a natural starting point for more full-featured business intelligence systems which offer a user friendly
interface for non-technical users.
Apache Hive supports analysis of large datasets stored in Hadoop's HDFS as well as easily compatible file systems
like Amazon S3 (Simple Storage Service). Amazon S3 is a scalable, high-speed, low-cost, Web-based service
designed for online backup and archiving of data as well as application programs. Hive provides SQL-like language
called HiveQL while maintaining full support for map/reduce, and to accelerate queries, it provides indexes,
including bitmap indexes. Apache Hive is a data warehouse infrastructure built on top of Hadoop for providing data
summarization, query, as well as analysis.
Advantages of Hive
Fits the low-level interface requirements of Hadoop
Supports external tables and ODBC/JDBC
Has an intelligent optimizer
Supports table-level partitioning to speed up query times
Its metadata store is a big plus in the architecture and makes lookups easy
Data Units
Hive data is organized into:
Databases: Namespaces that separate tables and other data units from naming conflicts.
Tables: Homogeneous units of data which have the same schema. An example of a table could be a page_views
table, where each row could comprise the following columns (schema):
timestamp - which is of INT type and corresponds to a UNIX timestamp of when the page was viewed.
userid - which is of BIGINT type and identifies the user who viewed the page.
page_url - which is of STRING type and captures the location of the page.
referer_url - which is of STRING type and captures the location of the page from where the user arrived at the
current page.
IP - which is of STRING type and captures the IP address from where the page request was made.
Partitions: Each Table can have one or more partition Keys which determines how the data is stored. Partitions -
apart from being storage units - also allow the user to efficiently identify the rows that satisfy a certain criteria. For
example, a date_partition of type STRING and country_partition of type STRING. Each unique value of the
partition keys defines a partition of the Table. For example all "US" data from "2009-12-23" is a partition of the
page_views table. Therefore, if you run analysis on only the "US" data for 2009-12-23, you can run that query only
on the relevant partition of the table thereby speeding up the analysis significantly.
Partition columns are virtual columns; they are not part of the data itself but are derived on load.
Buckets (or Clusters): Data in each partition may in turn be divided into Buckets based on the value of a hash
function of some column of the Table. For example the page_views table may be bucketed by userid, which is one of
the columns, other than the partitions columns, of the page_view table. These can be used to efficiently sample the
data.
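A hedged HiveQL sketch of the page_views table just described, partitioned by date and country and bucketed by
userid (the column names, partition names and bucket count are illustrative):
CREATE TABLE page_views (
  viewtime    INT,      -- unix timestamp of when the page was viewed
  userid      BIGINT,
  page_url    STRING,
  referer_url STRING,
  ip          STRING
)
PARTITIONED BY (dt STRING, country STRING)
CLUSTERED BY (userid) INTO 32 BUCKETS;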
FLOAT and DOUBLE are two floating-point data types. BOOLEAN is used to store true or false.
STRING is used to store character strings. Note that, in Hive, we do not specify a length for STRING as in other
databases; it is more flexible and variable in length.
TIMESTAMP can be an integer, which is interpreted as seconds since the UNIX epoch; a floating-point number, in
which the digits after the decimal point are interpreted as nanoseconds; or a string, which is interpreted according to
the JDBC date string format, i.e. YYYY-MM-DD hh:mm:ss.fffffffff. The time component is interpreted as UTC time.
BINARY is used to hold raw bytes which will not be interpreted by Hive. It is suitable for binary data.
Hive also supports the following collection (complex) data types (a sketch using all three follows the list):
1. STRUCT
2. MAP
3. ARRAY
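A hedged sketch of a table that uses all three collection types; this employees layout mirrors the one assumed by the
query examples later in this section:
CREATE TABLE employees (
  name         STRING,
  salary       FLOAT,
  subordinates ARRAY<STRING>,                        -- list of direct reports
  deductions   MAP<STRING, FLOAT>,                   -- deduction name -> percentage
  address      STRUCT<street:STRING, city:STRING, state:STRING, zip:INT>
);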
Hive File formats
Hive supports all the Hadoop file formats, plus Thrift encoding, as well as supporting pluggable SerDe
(serializer/deserializer) classes to support custom formats.
Among these is MAPFILE, which adds an index to a SEQUENCEFILE for faster retrieval of particular records.
Hive defaults to the following record and field delimiters, all of which are non-printable control characters and all of
which can be customized:
\n – the record (row) delimiter
^A (octal \001) – the field delimiter between columns
^B (octal \002) – the delimiter between items in an ARRAY or STRUCT, and between MAP entries
^C (octal \003) – the delimiter between the key and value within a MAP entry
Let us take an example to understand this. Assume an employee table with the structure below in Hive.
Note that \001 is the octal code for ^A, \002 is ^B, and \003 is ^C. For further explanation, we will use text instead
of the octal codes. Let’s assume one record as shown in the table below.
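In HiveQL these delimiters can also be spelled out explicitly. The following sketch declares an employee-style table
with the default delimiters written in octal form (the column names are illustrative):
CREATE TABLE employee (
  name         STRING,
  salary       FLOAT,
  subordinates ARRAY<STRING>,
  deductions   MAP<STRING, FLOAT>
)
ROW FORMAT DELIMITED
  -- ^A separates columns, ^B separates collection items, ^C separates map keys from values
  FIELDS TERMINATED BY '\001'
  COLLECTION ITEMS TERMINATED BY '\002'
  MAP KEYS TERMINATED BY '\003'
  LINES TERMINATED BY '\n'
STORED AS TEXTFILE;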
HiveQL
Hadoop is an open source framework for the distributed processing of large amounts of data across a cluster. It relies
upon the MapReduce paradigm to reduce complex tasks into smaller parallel tasks that can be executed concurrently
across multiple machines. However, writing MapReduce tasks on top of Hadoop for processing data is not for
everyone, since it requires learning a new framework and a new programming paradigm altogether. What is needed is
an easy-to-use abstraction on top of Hadoop that allows people not familiar with it to use its capabilities easily.
Hive aims to solve this problem by offering an SQL-like interface, called HiveQL, on top of Hadoop. Hive achieves
this by converting queries written in HiveQL into MapReduce tasks that are then run across the Hadoop cluster to
fetch the desired results.
Hive is best suited for batch processing large amounts of data (such as in data warehousing) but is not ideally
suitable as a routine transactional database because of its slow response times (it needs to fetch data from across a
cluster).
A common task for which Hive is used is the processing of logs of web servers. These logs have a regular structure
and hence can be readily converted into a format that Hive can understand and process.
Hive query language (HiveQL) supports SQL features like CREATE tables, DROP tables, SELECT ... FROM ...
WHERE clauses, Joins (inner, left outer, right outer and outer joins), Cartesian products, GROUP BY, SORT BY,
aggregations, union and many useful functions on primitive as well as complex data types. Metadata browsing
features such as listing databases, tables and so on are also provided. HiveQL does have limitations compared with
traditional RDBMS SQL. HiveQL allows creation of new tables in accordance with partitions (each table can have
one or more partitions in Hive) as well as buckets (the data in partitions is further distributed as buckets), and allows
insertion of data into single or multiple tables, but it does not allow deletion or updating of data.
1. SHOW DATABASES
At any time, you can see the databases that already exist as follows:
SHOW DATABASES;
output is
default
financials
human_resources
2. DESCRIBE database
- shows the directory location for the database.
The USE command sets a database as your working database, analogous to changing working directories in a
filesystem.
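For example, using the financials database from the listing above:
-- show the directory location for the database
DESCRIBE DATABASE financials;
-- make financials the working database, then list its tables
USE financials;
SHOW TABLES;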
Managed Tables
The tables we have created so far are called managed tables or sometimes internal tables, because Hive controls the
lifecycle of their data. As we’ve seen, Hive stores the data for these tables in a subdirectory under the directory
defined by hive.metastore.warehouse.dir (e.g., /user/hive/warehouse) by default.
When we drop a managed table, Hive deletes the data in the table.
Managed tables are less convenient for sharing with other tools.
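A minimal sketch (the table name is illustrative): dropping the managed table below removes its metadata and its
data directory under the warehouse path.
-- data lands under hive.metastore.warehouse.dir, e.g. /user/hive/warehouse/managed_logs
CREATE TABLE managed_logs (ts STRING, msg STRING);
-- dropping a managed table deletes the data as well as the metadata
DROP TABLE managed_logs;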
External Tables
CREATE EXTERNAL TABLE IF NOT EXISTS stocks (
exchange STRING,
symbol STRING,
ymd STRING,
price_open FLOAT,
price_high FLOAT,
price_low FLOAT,
price_close FLOAT,
volume INT,
price_adj_close FLOAT
)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
LOCATION '/data/stocks/';
The EXTERNAL keyword tells Hive this table is external, and the LOCATION … clause is required to tell Hive
where it is located. Because it is external, Hive does not assume it owns the data; dropping the table deletes only the
table’s metadata, not the data itself.
Partitioning tables changes how Hive structures the data storage. If we create this table in the mydb database, there
will still be an employees directory for the table:
Once created, the partition keys (country and state, in this case) behave like regular columns.
output is
OK
country=US/state=IL
Time taken: 0.145 seconds
This command will first create the directory for the partition, if it doesn’t already exist,
then copy the data to it.
With OVERWRITE, any previous contents of the partition are replaced. If you drop the keyword OVERWRITE or
replace it with INTO, Hive appends the data rather than replaces it.
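Putting these pieces together, a hedged sketch of a partitioned variant of the employees table (the local file path is
hypothetical):
CREATE TABLE employees (
  name         STRING,
  salary       FLOAT,
  subordinates ARRAY<STRING>,
  deductions   MAP<STRING, FLOAT>,
  address      STRUCT<street:STRING, city:STRING, state:STRING, zip:INT>
)
PARTITIONED BY (country STRING, state STRING);

-- creates the partition directory if needed, then copies the data into it
LOAD DATA LOCAL INPATH '/tmp/employees/us-il.txt'
OVERWRITE INTO TABLE employees
PARTITION (country = 'US', state = 'IL');

-- lists partitions such as country=US/state=IL
SHOW PARTITIONS employees;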
HiveQL queries
SELECT is the projection operator in SQL. The FROM clause identifies from which table, view, or nested query we
select records
Create employees
Load data
output is
When you select columns that are one of the collection types, Hive uses JSON (Java- Script Object Notation) syntax
for the output. First, let’s select the subordinates, an ARRAY, where a comma-separated list surrounded with […] is
used.
The deductions column is a MAP, where the JSON representation for maps is used, namely a comma-separated list
of key:value pairs, surrounded with {…}:
output is
Finally, the address is a STRUCT, which is also written using the JSON map format:
output is
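Hedged examples of such queries against the employees table sketched earlier (the map key shown is illustrative):
-- ARRAY column: printed as a JSON-style list [ ... ]
SELECT name, subordinates FROM employees;
-- MAP column: printed as { "key":value, ... }
SELECT name, deductions FROM employees;
-- STRUCT column: also printed using the JSON map syntax
SELECT name, address FROM employees;
-- individual elements can be referenced directly
SELECT name, subordinates[0], deductions["Federal Taxes"], address.city
FROM employees;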