Big Data Unit 5
HDFS is a distributed file system that is well suited for storing large files. It is designed to support batch processing
of data but does not provide fast individual record lookups. HBase is built on top of HDFS and is designed to provide
access to single rows of data in large tables. Overall, the key difference between HDFS and HBase is that HDFS
offers high-throughput sequential access to large files, whereas HBase offers low-latency random reads and writes of
individual rows stored on HDFS.
HBase Architecture
The HBase Physical Architecture consists of servers in a Master-Slave relationship as shown below. Typically, the
HBase cluster has one Master node, called HMaster and multiple Region Servers called HRegionServer. Each
Region Server contains multiple Regions – HRegions.
Just like in a Relational Database, data in HBase is stored in Tables and these Tables are stored in Regions. When a
Table becomes too big, the Table is partitioned into multiple Regions. These Regions are assigned to Region Servers
across the cluster. Each Region Server hosts roughly the same number of Regions.
The Data Model in HBase is designed to accommodate semi-structured data that could vary in field size, data type
and columns. Additionally, the layout of the data model makes it easier to partition the data and distribute it across
the cluster. The Data Model in HBase is made of different logical components such as Tables, Rows, Column
Families, Columns, Cells and Versions.
Tables – HBase Tables are more like a logical collection of rows stored in separate partitions called Regions. As
shown above, every Region is then served by exactly one Region Server. The figure above shows a representation of
a Table.
Rows – A row is one instance of data in a table and is identified by a rowkey. Rowkeys are unique in a Table and are
always treated as a byte[].
Column Families – Data in a row are grouped together as Column Families. Each Column Family has one or more
Columns, and these Columns in a family are stored together in a low-level storage file known as an HFile. Column
Families form the basic unit of physical storage to which certain HBase features like compression are applied. Hence
it’s important that proper care be taken when designing the Column Families in a table. The table above shows the
Customer and Sales Column Families. The Customer Column Family is made up of 2 columns – Name and City,
whereas the Sales Column Family is made up of 2 columns – Product and Amount.
Columns – A Column Family is made of one or more columns. A Column is identified by a Column Qualifier that
consists of the Column Family name concatenated with the Column name using a colon – example:
columnfamily:columnname. There can be multiple Columns within a Column Family, and Rows within a table can
have a varied number of Columns.
Cell – A Cell stores data and is essentially a unique combination of rowkey, Column Family and the Column
(Column Qualifier). The data stored in a Cell is called its value and the data type is always treated as byte[].
Version – The data stored in a cell is versioned and versions of data are identified by the timestamp. The number of
versions of data retained in a column family is configurable and this value by default is 3.
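To make these concepts concrete, here is a small HBase shell sketch; the table name customer_sales, the rowkey
row-001 and the cell values are hypothetical, chosen only to mirror the Customer and Sales column families
described above.
# create a table with two column families; keep up to 3 versions in 'Sales'
create 'customer_sales', {NAME => 'Customer'}, {NAME => 'Sales', VERSIONS => 3}
# each put writes one cell, addressed by rowkey + columnfamily:columnname
put 'customer_sales', 'row-001', 'Customer:Name', 'John'
put 'customer_sales', 'row-001', 'Customer:City', 'Pune'
put 'customer_sales', 'row-001', 'Sales:Product', 'Laptop'
put 'customer_sales', 'row-001', 'Sales:Amount', '45000'
# read back a single cell, asking for up to 3 timestamped versions
get 'customer_sales', 'row-001', {COLUMN => 'Sales:Amount', VERSIONS => 3}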
HBase clients
1. REST
HBase ships with a powerful REST server, which supports the complete client and
administrative API. It also provides support for different message formats, offering
many choices for a client application to communicate with the server.
REST Java client
The REST server also comes with a comprehensive Java client API. It is located in the
org.apache.hadoop.hbase.rest.client package.
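A minimal sketch of using this Java client is shown below. It assumes a REST gateway is already running on
localhost port 8080 and reuses the hypothetical customer_sales table from the earlier example; it is an illustration of
the client classes, not a definitive recipe.
import org.apache.hadoop.hbase.client.Get;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.rest.client.Client;
import org.apache.hadoop.hbase.rest.client.Cluster;
import org.apache.hadoop.hbase.rest.client.RemoteHTable;
import org.apache.hadoop.hbase.util.Bytes;

public class RestClientExample {
  public static void main(String[] args) throws Exception {
    // Point the client at the REST gateway (assumed to run on localhost:8080).
    Cluster cluster = new Cluster();
    cluster.add("localhost", 8080);
    Client client = new Client(cluster);

    // RemoteHTable mimics the usual table interface but talks to the REST server.
    RemoteHTable table = new RemoteHTable(client, "customer_sales");
    Get get = new Get(Bytes.toBytes("row-001"));
    get.addColumn(Bytes.toBytes("Customer"), Bytes.toBytes("Name"));
    Result result = table.get(get);
    System.out.println("Value: " + Bytes.toString(
        result.getValue(Bytes.toBytes("Customer"), Bytes.toBytes("Name"))));
    table.close();
  }
}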
2. Thrift
Apache Thrift is written in C++, but provides schema compilers for many programming
languages, including Java, C++, Perl, PHP, Python, Ruby, and more. Once you have
compiled a schema, you can exchange messages transparently between systems implemented
in one or more of those languages.
3. Avro
Apache Avro, like Thrift, provides schema compilers for many programming languages,
including Java, C++, PHP, Python, Ruby, and more. Once you have compiled a schema,
you can exchange messages transparently between systems implemented in one or more
of those languages.
Cassandra
The Cassandra data store is an open source Apache project available at http://cassandra.apache.org. Cassandra
originated at Facebook in 2007 to solve that company’s inbox
search problem, in which they had to deal with large volumes of data in a way that was
difficult to scale with traditional methods.
Main features
• Decentralized
Every node in the cluster has the same role. There is no single point of failure. Data is distributed across the cluster
(so each node contains different data), but there is no master as every node can service any request.
• Supports replication and multi data center replication
Replication strategies are configurable.[18] Cassandra is designed as a distributed system, for deployment of large
numbers of nodes across multiple data centers. Key features of Cassandra’s distributed architecture are specifically
tailored for multiple-data center deployment, for redundancy, for failover and disaster recovery.
• Scalability
Read and write throughput both increase linearly as new machines are added, with no downtime or interruption to
applications.
• Fault-tolerant
Data is automatically replicated to multiple nodes for fault-tolerance. Replication across multiple data centers is
supported. Failed nodes can be replaced with no downtime.
• Tunable consistency
Writes and reads offer a tunable level of consistency, all the way from "writes never fail" to "block for all replicas to
be readable", with the quorum level in the middle.
• MapReduce support
Cassandra has Hadoop integration, with MapReduce support. There is support also for Apache Pig and Apache
Hive.
• Query language
Cassandra introduces CQL (Cassandra Query Language), a SQL-like alternative to the traditional RPC interface.
Language drivers are available for Java (JDBC), Python, Node.js and Go (see the short cqlsh sketch below).
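As a small, hedged illustration of the last two features, the following cqlsh session reads at QUORUM consistency;
the shop keyspace and users table are hypothetical and assume a cluster is already running.
cqlsh> -- require a quorum of replicas to acknowledge subsequent reads and writes
cqlsh> CONSISTENCY QUORUM
cqlsh> SELECT name, city FROM shop.users WHERE user_id = 'u1';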
Cassandra data model
Cassandra is a hybrid between a key-value and a column-oriented NoSQL database. The key-value nature is
represented by a row object, in which the value is generally organized into columns. In short, Cassandra uses the
following terms:
1. Keyspace can be seen as a database schema in the SQL world.
2. Column family resembles a table in the SQL world (though, as described below, this analogy can be misleading).
3. Row has a key and, as its value, a set of Cassandra columns, but without the rigid schema of a relational table.
4. Column is a triplet := (name, value, timestamp).
5. Super column is a tuple := (name, collection of columns).
6. Data Types: Validators & Comparators
7. Indexes
Cassandra data model is illustrated in the following figure
KeySpaces
KeySpaces are the largest container, with an ordered list of ColumnFamilies; a KeySpace is similar to a database in
an RDBMS.
Column
A Column is the most basic element in Cassandra: a simple tuple that contains a name, value and timestamp. All
values are set by the client. That's an important consideration for the timestamp, as it means you'll need clock
synchronization.
SuperColumn
A SuperColumn is a column that stores an associative array of columns. You could think of it as similar to a
HashMap in Java, with an identifying column (name) that stores a list of columns inside (value). The key difference
between a Column and a SuperColumn is that the value of a Column is a string, whereas the value of a SuperColumn
is a map of Columns. Note that SuperColumns have no timestamp, just a name and a value.
ColumnFamily
A ColumnFamily holds a number of Rows, each of which is a sorted map that matches column names to column
values. A row is a set of columns, and a ColumnFamily is similar to the table concept from relational databases. The
column family holds an ordered list of columns which you can reference by column name.
A ColumnFamily can be of two types, Standard or Super. Standard ColumnFamilies contain a map of normal
columns, whereas Super ColumnFamilies contain a map of SuperColumns.
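Today this model is usually exposed through CQL, where a keyspace groups tables (column families). A hedged
sketch, with hypothetical keyspace, table and column names, might look like this:
CREATE KEYSPACE shop
  WITH replication = {'class': 'SimpleStrategy', 'replication_factor': 3};

USE shop;

-- a table (column family); user_id is the row key
CREATE TABLE users (
  user_id text PRIMARY KEY,
  name    text,
  city    text
);

-- each non-key value is stored as a column (name, value, timestamp)
INSERT INTO users (user_id, name, city) VALUES ('u1', 'John', 'Pune');
SELECT * FROM users WHERE user_id = 'u1';
Note that SuperColumns belong to the older Thrift-era model and are not exposed through CQL.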
Cassandra clients
4. Thrift
Thrift is a code generation library for clients in C++, C#, Erlang, Haskell, Java, Objective C/Cocoa, OCaml, Perl,
PHP, Python, Ruby, Smalltalk, and Squeak. Its goal is to provide an easy way to support efficient RPC calls in a wide
variety of popular languages, without requiring the overhead of something like SOAP.
5. Avro
The Apache Avro project is a data serialization and RPC system targeted as the replacement for Thrift in Cassandra.
Avro provides many features similar to those of Thrift and other data serialization and
RPC mechanisms including:
• Robust data structures
• An efficient, small binary format for RPC calls
• Easy integration with dynamically typed languages such as Python, Ruby, Smalltalk,
Perl, PHP, and Objective-C
Avro is the RPC and data serialization mechanism for
Cassandra. It generates code that remote clients can use to interact with the database.
It’s well-supported in the community and has the strength of growing out of the larger
and very well-known Hadoop project. It should serve Cassandra well for the foreseeable
future.
6. Hector
Hector is an open source project written in Java using the MIT license. It was created
by Ran Tavory of Outbrain (previously of Google) and is hosted at GitHub. It was one
of the early Cassandra clients and is used in production at Outbrain. It wraps Thrift
and offers JMX, connection pooling, and failover.
Hector is a well-supported and full-featured Cassandra client, with many users and an
active community. It offers the following:
High-level object-oriented API
Fail over support
Connection pooling
JMX (Java Management eXtensions) support
7. Chirper
Chirper is a port of Twissandra to .NET, written by Chaker Nakhli. It’s available under the Apache 2.0 license, and
the source code is on GitHub.
8. Chiton
Chiton is a Cassandra browser written by Brandon Williams that uses the Python GTK framework.
9. Pelops
Pelops is a free, open source Java client written by Dominic Williams. It is similar to
Hector in that it’s Java-based, but it was started more recently. This has become a very
popular client. Its goals include the following:
To create a simple, easy-to-use client
To completely separate concerns for data processing from lower-level items such
as connection pooling
To act as a close follower to Cassandra so that it’s readily up to date
10. Kundera
Kundera is an object-relational mapping (ORM) implementation for Cassandra written using Java annotations.
11. Fauna
Ryan King of Twitter and Evan Weaver created a Ruby client for the Cassandra database called Fauna.
Pig
Pig is a simple-to-understand data flow language used in the analysis of large data sets. Pig scripts are
automatically converted into MapReduce jobs by the Pig interpreter, so you can analyze the data in a Hadoop cluster
even if you aren't familiar with MapReduce.
Pig is made up of two components: the first is the language itself, which is called PigLatin, and the second is a
runtime environment where PigLatin programs are executed.
The Pig execution environment has two modes:
• Local mode: All scripts are run on a single machine. Hadoop MapReduce and HDFS are not required.
• Hadoop: Also called MapReduce mode, all scripts are run on a given Hadoop cluster.
Pig programs can be run in three different ways, all of them compatible with local and Hadoop mode:
1. Pig Latin Script: Simply a file containing Pig Latin commands, identified by the .pig suffix (for example,
file.pig or myscript.pig). The commands are interpreted by Pig and executed in sequential order.
2. Grunt shell: Grunt is a command interpreter. You can type Pig Latin on the grunt command line and Grunt
will execute the command on your behalf.
3. Embedded: Pig programs can be executed as part of a Java program.
Pig provides an engine for executing data flows in parallel on Hadoop. It includes a language, Pig Latin, for
expressing these data flows. Pig Latin includes operators for many of the traditional data operations (join, sort, filter,
etc.), as well as the ability for users to develop their own functions for reading, processing, and writing data.
• It is a large-scale data processing system.
• Scripts are written in Pig Latin, a dataflow language.
• Developed by Yahoo, and open source.
• Pig runs on Hadoop. It makes use of both the Hadoop Distributed File System, HDFS, and Hadoop’s
processing system, MapReduce.
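A short Pig Latin sketch of such a data flow is shown below; the input path, field names and output path are
hypothetical.
-- load raw page views, keep only 2023 entries, count views per user
views  = LOAD '/data/pageviews' USING PigStorage('\t')
             AS (userid:long, url:chararray, year:int);
recent = FILTER views BY year == 2023;
groups = GROUP recent BY userid;
counts = FOREACH groups GENERATE group AS userid, COUNT(recent) AS views;
sorted = ORDER counts BY views DESC;
STORE sorted INTO '/data/pageview_counts';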
Differences between Pig and MapReduce
Pig is a data flow language; the key focus of Pig is to manage the flow of data from the input source to the output
store. Pig is written specifically for managing the data flow of MapReduce-type jobs. Most, if not all, jobs in Pig are
MapReduce jobs or data movement jobs. Pig allows custom functions to be added for processing, and some default
functions, such as ordering, grouping, distinct, count, etc., are provided.
MapReduce, on the other hand, is a programming model, or framework, for processing large data sets in a
distributed manner using a large number of computers, i.e. nodes.
Pig commands are submitted as MapReduce jobs internally. An advantage Pig has over MapReduce is that the
former is more concise: 200 lines of Java code written for MapReduce can often be reduced to about 10 lines of Pig
code (see the word-count sketch below).
A disadvantage of Pig is that it is a bit slower than hand-written MapReduce, as Pig commands are translated into
MapReduce jobs prior to execution.
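For example, the classic word count, which takes a full Java MapReduce program, can be sketched in a handful of
Pig Latin lines (the paths here are illustrative):
lines  = LOAD '/data/input.txt' AS (line:chararray);
-- split each line into words, one word per record
words  = FOREACH lines GENERATE FLATTEN(TOKENIZE(line)) AS word;
grpd   = GROUP words BY word;
counts = FOREACH grpd GENERATE group AS word, COUNT(words) AS cnt;
STORE counts INTO '/data/wordcount';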
Pig Latin
Pig Latin has a very rich syntax. It supports operators for the following operations:
Loading and storing of data
Streaming data
Filtering data
Grouping and joining data
Sorting data
Combining and splitting data
Pig Latin also supports a wide variety of types, expressions, functions, diagnostic operators, macros, and file system
commands.
DUMP
Dump directs the output of your script to your screen.
Syntax:
DUMP alias;
LOAD: Loads data from the file system.
Syntax
LOAD 'data' [USING function] [AS schema];
'data' is the name of the file or directory, in single quotes. USING and AS are keywords. If the USING clause is
omitted, the default load function PigStorage is used. Schema – a schema defined using the AS keyword, enclosed in
parentheses.
Usage
Use the LOAD operator to load data from the file system.
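A minimal example combining LOAD and DUMP (the file name and schema are assumed for illustration):
-- read a comma-delimited file and print the relation to the screen
A = LOAD 'students.csv' USING PigStorage(',')
        AS (name:chararray, age:int, gpa:float);
DUMP A;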
Grunt
Grunt is Pig’s interactive shell. It enables users to enter Pig Latin interactively and provides a shell for users to
interact with HDFS.
In other words, it is a command interpreter. You can type Pig Latin on the grunt command line and Grunt will
execute the command on your behalf.
To enter Grunt, invoke Pig with no script or command to run. Typing:
$ pig -x local
will result in the prompt:
grunt>
This gives you a Grunt shell to interact with your local filesystem. To exit Grunt you can type quit or enter Ctrl-D.
Pig’s Data Model
This includes Pig’s data types, how it handles concepts such as missing data, and how you can describe your data to
Pig.
Types
Pig’s data types can be divided into two categories: scalar types, which contain a single value, and complex types,
which contain other types.
1. Scalar Types
Pig’s scalar types are simple types that appear in most programming languages.
int
An integer, stored as a four-byte signed value.
long
A long integer, stored as an eight-byte signed value.
float
A floating-point number that uses four bytes to store its value.
double
A double-precision floating-point number that uses eight bytes to store its value.
chararray
A string or character array, expressed as a string literal with single quotes.
bytearray
A blob or array of bytes.
2. Complex Types
Pig has several complex data types such as maps, tuples, and bags. All of these types can contain data of any type,
including other complex types. So it is possible to have a map where the value field is a bag, which contains a tuple
where one of the fields is a map.
Map
A map in Pig is a chararray to data element mapping, where that element can be any Pig type, including a complex
type. The chararray is called a key and is used as an index to find the element, referred to as the value.
Tuple
A tuple is a fixed-length, ordered collection of Pig data elements. Tuples are divided into fields, with each field
containing one data element. These elements can be of any type—they do not all need to be the same type. A tuple is
analogous to a row in SQL, with the fields being SQL columns.
Bag
A bag is an unordered collection of tuples. Because it has no order, it is not possible to reference tuples in a bag by
position. Like tuples, a bag can, but is not required to, have a schema associated with it. In the case of a bag, the
schema describes all tuples within the bag.
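These complex types usually appear in schemas declared with AS. The sketch below, with hypothetical file and field
names, declares a tuple, a bag and a map in one relation:
-- t is a tuple, friends is a bag of tuples, props is a map with chararray keys
A = LOAD 'people'
        AS (t:tuple(name:chararray, age:int),
            friends:bag{f:tuple(fname:chararray)},
            props:map[]);
DESCRIBE A;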
Nulls
Pig includes the concept of a data element being null. Data of any type can be null. It is important to understand that
in Pig the concept of null is the same as in SQL, which is completely different from the concept of null in C, Java,
Python, etc. In Pig a null data element means the value is unknown.
Casts
A cast converts content of one type to another type.
Hive
Hive was originally an internal Facebook project which eventually matured into a full-blown Apache project. It was
created to simplify access to MapReduce (MR) by exposing a SQL-based language for data manipulation. Hive also
maintains metadata in a metastore, which is stored in a relational database; this metadata contains information about
what tables exist, their columns, privileges, and more. Hive is an open source data warehousing solution built on top
of Hadoop, and its particular strength is in offering ad-hoc querying of data, in contrast to the compilation
requirement of Pig and Cascading.
Hive is a natural starting point for more full-featured business intelligence systems which offer a user friendly
interface for non-technical users.
Apache Hive supports analysis of large datasets stored in Hadoop's HDFS as well as easily compatible file systems
like Amazon S3 (Simple Storage Service). Amazon S3 is a scalable, high-speed, low-cost, Web-based service
designed for online backup and archiving of data as well as application programs. Hive provides SQL-like language
called HiveQL while maintaining full support for map/reduce, and to accelerate queries, it provides indexes,
including bitmap indexes. Apache Hive is a data warehouse infrastructure built on top of Hadoop for providing data
summarization, query, as well as analysis.
Advantages of Hive
Fits the low-level interface requirements of Hadoop
Supports external tables and ODBC/JDBC
Has an intelligent optimizer
Supports table-level partitioning to speed up query times
Its metadata store is a big plus in the architecture and makes lookups easy
Data Units
Hive data is organized into:
Databases: Namespaces that separate tables and other data units from naming conflicts.
Tables: Homogeneous units of data which have the same schema. An example of a table could be a page_views
table, where each row could comprise the following columns (schema):
timestamp - which is of INT type and corresponds to a UNIX timestamp of when the page was viewed.
userid - which is of BIGINT type and identifies the user who viewed the page.
page_url - which is of STRING type and captures the location of the page.
referer_url - which is of STRING type and captures the location of the page from where the user arrived at the
current page.
IP - which is of STRING type and captures the IP address from where the page request was made.
Partitions: Each Table can have one or more partition Keys which determines how the data is stored. Partitions -
apart from being storage units - also allow the user to efficiently identify the rows that satisfy a certain criteria. For
example, a date_partition of type STRING and country_partition of type STRING. Each unique value of the
partition keys defines a partition of the Table. For example all "US" data from "2009-12-23" is a partition of the
page_views table. Therefore, if you run analysis on only the "US" data for 2009-12-23, you can run that query only
on the relevant partition of the table thereby speeding up the analysis significantly.
Partition columns are virtual columns; they are not part of the data itself but are derived on load.
Buckets (or Clusters): Data in each partition may in turn be divided into Buckets based on the value of a hash
function of some column of the Table. For example the page_views table may be bucketed by userid, which is one of
the columns, other than the partitions columns, of the page_view table. These can be used to efficiently sample the
data.
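A hedged HiveQL sketch of the page_views table just described, partitioned by date and country and bucketed by
userid (the column names, partition names and bucket count are illustrative):
CREATE TABLE page_views (
  viewtime    INT,      -- unix timestamp of when the page was viewed
  userid      BIGINT,
  page_url    STRING,
  referer_url STRING,
  ip          STRING
)
PARTITIONED BY (dt STRING, country STRING)
CLUSTERED BY (userid) INTO 32 BUCKETS;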
FLOAT and DOUBLE are two floating-point data types. BOOLEAN is used to store true or false.
STRING is used to store character strings. Note that, in Hive, we do not specify a length for STRING as in other
databases; it is more flexible and variable in length.
TIMESTAMP can be an integer, which is interpreted as seconds since the UNIX epoch; a floating-point number, in
which the digits after the decimal point are interpreted as nanoseconds; or a string, which is interpreted according to
the JDBC date string format, i.e. YYYY-MM-DD hh:mm:ss.fffffffff. The time component is interpreted as UTC time.
BINARY is used to hold raw bytes which will not be interpreted by Hive. It is suitable for binary data.
Hive also supports the following collection (complex) data types (a sketch using all three follows the list):
1. STRUCT
2. MAP
3. ARRAY
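A hedged sketch of a table that uses all three collection types; this employees layout mirrors the one assumed by the
query examples later in this section:
CREATE TABLE employees (
  name         STRING,
  salary       FLOAT,
  subordinates ARRAY<STRING>,                        -- list of direct reports
  deductions   MAP<STRING, FLOAT>,                   -- deduction name -> percentage
  address      STRUCT<street:STRING, city:STRING, state:STRING, zip:INT>
);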
Hive File formats
Hive supports all the Hadoop file formats, plus Thrift encoding, as well as supporting pluggable SerDe
(serializer/deserializer) classes to support custom formats.
Among these is MAPFILE, which adds an index to a SEQUENCEFILE for faster retrieval of particular records.
Hive defaults to the following record and field delimiters, all of which are non-printable control characters and all of
which can be customized:
\n – the record (row) delimiter
^A (octal \001) – the field delimiter between columns
^B (octal \002) – the delimiter between items in an ARRAY or STRUCT, and between MAP entries
^C (octal \003) – the delimiter between the key and value within a MAP entry
Let us take an example to understand this. Assume an employee table with the structure below in Hive.
Note that \001 is the octal code for ^A, \002 is ^B, and \003 is ^C. For further explanation, we will use text instead
of the octal codes. Let’s assume one record as shown in the table below.
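In HiveQL these delimiters can also be spelled out explicitly. The following sketch declares an employee-style table
with the default delimiters written in octal form (the column names are illustrative):
CREATE TABLE employee (
  name         STRING,
  salary       FLOAT,
  subordinates ARRAY<STRING>,
  deductions   MAP<STRING, FLOAT>
)
ROW FORMAT DELIMITED
  -- ^A separates columns, ^B separates collection items, ^C separates map keys from values
  FIELDS TERMINATED BY '\001'
  COLLECTION ITEMS TERMINATED BY '\002'
  MAP KEYS TERMINATED BY '\003'
  LINES TERMINATED BY '\n'
STORED AS TEXTFILE;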
HiveQL
Hadoop is an open source framework for the distributed processing of large amounts of data across a cluster. It relies
upon the MapReduce paradigm to reduce complex tasks into smaller parallel tasks that can be executed concurrently
across multiple machines. However, writing MapReduce tasks on top of Hadoop for processing data is not for
everyone, since it requires learning a new framework and a new programming paradigm altogether. What is needed is
an easy-to-use abstraction on top of Hadoop that allows people not familiar with it to use its capabilities easily.
Hive aims to solve this problem by offering an SQL-like interface, called HiveQL, on top of Hadoop. Hive achieves
this by converting queries written in HiveQL into MapReduce tasks that are then run across the Hadoop cluster to
fetch the desired results.
Hive is best suited for batch processing large amounts of data (such as in data warehousing) but is not ideally
suitable as a routine transactional database because of its slow response times (it needs to fetch data from across a
cluster).
A common task for which Hive is used is the processing of logs of web servers. These logs have a regular structure
and hence can be readily converted into a format that Hive can understand and process.
Hive query language (HiveQL) supports SQL features like CREATE tables, DROP tables, SELECT ... FROM ...
WHERE clauses, Joins (inner, left outer, right outer and outer joins), Cartesian products, GROUP BY, SORT BY,
aggregations, union and many useful functions on primitive as well as complex data types. Metadata browsing
features such as listing databases, tables and so on are also provided. HiveQL does have limitations compared with
traditional RDBMS SQL. HiveQL allows creation of new tables in accordance with partitions (each table can have
one or more partitions in Hive) as well as buckets (the data in partitions is further distributed as buckets), and allows
insertion of data into single or multiple tables, but it does not allow deletion or updating of data.
1. SHOW DATABASES
At any time, you can see the databases that already exist as follows:
SHOW DATABASES;
output is
default
financials
human_resources
2. DESCRIBE database
- shows the directory location for the database.
The USE command sets a database as your working database, analogous to changing working directories in a
filesystem.
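For example, using the financials database from the listing above:
-- show the directory location for the database
DESCRIBE DATABASE financials;
-- make financials the working database, then list its tables
USE financials;
SHOW TABLES;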
Managed Tables
The tables we have created so far are called managed tables or sometimes internal tables, because Hive controls the
lifecycle of their data. As we’ve seen, Hive stores the data for these tables in a subdirectory under the directory
defined by hive.metastore.warehouse.dir (e.g., /user/hive/warehouse) by default.
When we drop a managed table, Hive deletes the data in the table.
Managed tables are less convenient for sharing with other tools.
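A minimal sketch (the table name is illustrative): dropping the managed table below removes its metadata and its
data directory under the warehouse path.
-- data lands under hive.metastore.warehouse.dir, e.g. /user/hive/warehouse/managed_logs
CREATE TABLE managed_logs (ts STRING, msg STRING);
-- dropping a managed table deletes the data as well as the metadata
DROP TABLE managed_logs;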
External Tables
CREATE EXTERNAL TABLE IF NOT EXISTS stocks (
exchange STRING,
symbol STRING,
ymd STRING,
price_open FLOAT,
price_high FLOAT,
price_low FLOAT,
price_close FLOAT,
volume INT,
price_adj_close FLOAT
)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
LOCATION '/data/stocks/';
The EXTERNAL keyword tells Hive this table is external, and the LOCATION … clause is required to tell Hive
where it is located. Because it is external, Hive does not assume it owns the data; dropping the table deletes only the
table’s metadata, not the data itself.
Partitioning tables changes how Hive structures the data storage. If we create this table in the mydb database, there
will still be an employees directory for the table:
Once created, the partition keys (country and state, in this case) behave like regular columns.
output is
OK
country=US/state=IL
Time taken: 0.145 seconds
This command will first create the directory for the partition, if it doesn’t already exist,
then copy the data to it.
With OVERWRITE, any previous contents of the partition are replaced. If you drop the keyword OVERWRITE or
replace it with INTO, Hive appends the data rather than replaces it.
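Putting these pieces together, a hedged sketch of a partitioned variant of the employees table (the local file path is
hypothetical):
CREATE TABLE employees (
  name         STRING,
  salary       FLOAT,
  subordinates ARRAY<STRING>,
  deductions   MAP<STRING, FLOAT>,
  address      STRUCT<street:STRING, city:STRING, state:STRING, zip:INT>
)
PARTITIONED BY (country STRING, state STRING);

-- creates the partition directory if needed, then copies the data into it
LOAD DATA LOCAL INPATH '/tmp/employees/us-il.txt'
OVERWRITE INTO TABLE employees
PARTITION (country = 'US', state = 'IL');

-- lists partitions such as country=US/state=IL
SHOW PARTITIONS employees;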
HiveQL queries
SELECT is the projection operator in SQL. The FROM clause identifies from which table, view, or nested query we
select records
Create employees
Load data
output is
When you select columns that are one of the collection types, Hive uses JSON (Java- Script Object Notation) syntax
for the output. First, let’s select the subordinates, an ARRAY, where a comma-separated list surrounded with […] is
used.
The deductions column is a MAP, where the JSON representation for maps is used, namely a comma-separated list
of key:value pairs, surrounded with {…}:
output is
Finally, the address is a STRUCT, which is also written using the JSON map format:
output is
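Hedged examples of such queries against the employees table sketched earlier (the map key shown is illustrative):
-- ARRAY column: printed as a JSON-style list [ ... ]
SELECT name, subordinates FROM employees;
-- MAP column: printed as { "key":value, ... }
SELECT name, deductions FROM employees;
-- STRUCT column: also printed using the JSON map syntax
SELECT name, address FROM employees;
-- individual elements can be referenced directly
SELECT name, subordinates[0], deductions["Federal Taxes"], address.city
FROM employees;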