
THE AGE OF BIG DATA

Every minute:
• 2.8 million social media posts
• 2.5 million website search queries
• 27.2 thousand review posts
• 100 hours of online videos
• 201 million emails sent
• 57 thousand pictures posted
• 50.7 thousand thoughts posted
• Big data is defined as collections of datasets whose volume,
velocity or variety is so large that it is difficult to store,
manage, process and analyze the data using traditional
databases and data processing tools.
• In recent years, there has been exponential growth in both structured and
unstructured data generated by information technology, industrial, healthcare,
Internet of Things (IoT), and other systems.
• According to an estimate by IBM, 2.5 quintillion bytes of data are created
every day.
Below are some key per-minute figures from one such report:
• Facebook users share nearly 4.16 million pieces of content
• Twitter users send nearly 300,000 tweets
• Instagram users like nearly 1.73 million photos
• YouTube users upload 300 hours of new video content
• Apple users download nearly 51,000 apps
• Skype users make nearly 110,000 new calls
• Amazon receives 4300 new visitors
• Uber passengers take 694 rides
• Netflix subscribers stream nearly 77,000 hours of video
• Big Data has the potential to power the next generation of smart
applications that leverage the power of data to make applications
intelligent.
• Applications of big data span a wide range of domains such as
web, retail and marketing, banking and finance, industrial,
healthcare, environmental, Internet of Things and cyber-
physical systems.
Big Data analytics deals with collection, storage, processing and
analysis of this massive scale data. Specialized tools and
frameworks are required for big data analysis when:
(1) The volume of data involved is so large that it is difficult to
store, process and analyze data on a single machine,
(2) The velocity of data is very high and the data needs to be
analyzed in real-time,
(3) There is a variety of data involved, which can be structured,
unstructured or semi-structured, and is collected from multiple
data sources.
• Big data analytics involves several steps, from data cleansing and
data munging (or wrangling) to data processing and visualization.
• Big data analytics life-cycle starts from the collection of data
from multiple data sources.
• Specialized tools and frameworks are required to ingest the
data from different sources into the big data analytics backend.
• The data is stored in specialized storage solutions (such as
distributed file systems and non-relational databases) which are
designed to scale.
Big data analytics is enabled by several technologies such as
cloud computing, distributed and parallel processing frameworks,
non-relational databases and in-memory computing.
Some examples of big data are listed as follows:
• Data generated by social networks including text, images, audio and
video data
• Click-stream data generated by web applications such as e-Commerce to
analyze user behavior
• Machine sensor data collected from sensors embedded in industrial and
energy systems for monitoring their health and detecting failures
• Healthcare data collected in electronic health record (EHR) systems
• Logs generated by web applications
• Stock markets data
• Transactional data generated by banking and financial applications
Characteristics of Big Data
1. Volume
Big data is a form of data whose volume is so large that it would not fit on a
single machine; therefore, specialized tools and frameworks are required to
store, process and analyze such data.
Ex: social media applications process billions of messages every day,
industrial and energy systems can generate terabytes of sensor data
every day, cab aggregation applications can process millions of transactions
in a day, etc.
Though there is no fixed threshold for the volume of data to be considered
big data, the term is typically used for massive-scale data that is difficult to
store, manage and process using traditional databases and data processing
architectures.
2. Velocity
• Velocity of data refers to how fast the data is generated.
• Data generated by certain sources can arrive at very high velocities, for
example, social media data or sensor data.
• Velocity is another important characteristic of big data and the reason for the
exponential growth of data.
• High velocity of data results in a very large volume of data accumulating in
a short span of time.
3. Variety
• Variety refers to the forms of the data.
• Big data comes in different forms such as structured, unstructured or
semi-structured, including text data, image, audio, video and sensor data.
• Big data systems need to be flexible enough to handle such a variety of
data.
• RDBMS deals with structured data.
• Nearly 70-80% of all data is either unstructured or semi-structured.
• Unstructured data – Facebook videos, audio, images, text messages
• Semi-structured data – log files
HADOOP
• 1990 – white paper; after 13 years:
• 2003 – GFS (Google File System)
• 2004 – MapReduce
• 2006-2007 – HDFS (Hadoop Distributed File System)
• 2007-2008 – Hadoop MapReduce

HDFS – store huge data sets
MapReduce – process data stored in HDFS
HADOOP
Hadoop is an open-source framework from the Apache Software Foundation
for storing and processing huge data sets on a cluster of commodity
hardware.

HDFS is a file system specially designed for storing huge data sets on a
cluster of commodity hardware with a streaming access pattern.

Write Once and Read Any number of times (WORA) is called the streaming
access pattern.

Basically, Hadoop accomplishes the following two tasks:

1. Massive data storage
2. Faster processing
HADOOP
Why Hadoop ?

Taking care of hardware failure cannot be made optional in Big Data
analytics; it has to be the rule. In the case of 1000 nodes, we need to
consider, say, 4000 disks, 8000 cores, 25 switches, 1000 NICs and 2000 RAM
modules (16 TB). The mean time between failures could be even less than a
day, since commodity hardware is used. There is a need for a fault-tolerant
store to guarantee reasonable availability.
HADOOP
Hadoop Goals

The main goals of Hadoop are:

1. Scalable: It can scale up from a single server to thousands of servers.
2. Fault tolerant: It is designed with a very high degree of fault tolerance.
3. Economical: It uses commodity hardware instead of high-end hardware.
4. Handles hardware failures: The resiliency of these clusters comes from the
software's ability to detect and handle failures at the application layer.
HADOOP
Hadoop Core Components

Hadoop consists of the following components:

1. Hadoop Common: This package provides file system and OS-level
abstractions. It contains libraries and utilities required by other Hadoop
modules.
2. HDFS: HDFS is a distributed file system that provides a limited interface
for managing the file system.
3. Hadoop MapReduce: MapReduce is the key algorithm that the Hadoop
MapReduce engine uses to distribute work around a cluster.
4. Hadoop Yet Another Resource Negotiator (YARN) (MapReduce 2.0): It is
a resource management platform responsible for managing compute
resources in clusters and using them for scheduling users' applications.
HADOOP
Hadoop Common Package

• This consists of the necessary Java Archive (JAR) files and scripts needed to
start Hadoop.
• Hadoop requires Java Runtime Environment (JRE) 1.6 or a higher version.
• The standard start-up and shut-down scripts need secure shell (SSH) to be
set up between the nodes in the cluster.

• HDFS (storage) and MapReduce (processing) are the two core
components of Apache Hadoop.
• The storage system (HDFS) is not physically separate from the processing
system (MapReduce); they run on the same cluster nodes.
HADOOP
Hadoop Distributed File System (HDFS)

• HDFS is a distributed file system that provides a limited interface for
managing the file system, allowing it to scale and provide high throughput.
• HDFS creates multiple replicas of each data block and distributes them on
computers throughout a cluster to enable reliable and rapid access.
• When a file is loaded into HDFS, it is replicated and fragmented into
“blocks” of data which are stored across the cluster nodes; the cluster
nodes are also called DataNodes.
• The NameNode is responsible for the storage and management of metadata,
so that when MapReduce or another execution framework calls for the data,
the NameNode informs it where the needed data resides.
• The figure below shows NameNode and DataNode replication in the HDFS
architecture.
HADOOP

[Figure: NameNode and DataNode replication in the HDFS architecture]

1. HDFS creates multiple replicas of data blocks for reliability, placing them
on the computer nodes around the cluster.
2. Hadoop’s target is to run on clusters of the order of 10,000 nodes.
3. A file consists of many 64 MB blocks.
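
The interaction with HDFS described above can also be done programmatically. Below is a minimal sketch using the standard Hadoop FileSystem Java API; the NameNode address hdfs://localhost:9000 and the path /demo/file.txt are assumptions for illustration. It writes a small file and then asks the NameNode where the blocks of that file reside.

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.BlockLocation;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsExample {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        // Assumed NameNode address; adjust to your cluster.
        conf.set("fs.defaultFS", "hdfs://localhost:9000");
        FileSystem fs = FileSystem.get(conf);

        // Write a small file; HDFS splits it into blocks and replicates them for us.
        Path file = new Path("/demo/file.txt");
        try (FSDataOutputStream out = fs.create(file, true)) {
            out.writeUTF("hello hdfs");
        }

        // Ask the NameNode which DataNodes hold each block of the file.
        FileStatus status = fs.getFileStatus(file);
        BlockLocation[] blocks = fs.getFileBlockLocations(status, 0, status.getLen());
        for (BlockLocation block : blocks) {
            System.out.println("offset " + block.getOffset()
                    + " length " + block.getLength()
                    + " hosts " + String.join(",", block.getHosts()));
        }
        fs.close();
    }
}
```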
HADOOP
• Struts, Spring and Hibernate are frameworks in Java.
• In the same way, Hadoop is a framework, which is overseen by the
Apache Software Foundation.
• Much open-source software of this kind is hosted by the Apache
Software Foundation.
• You do not have to pay anything; you can simply download it for
free.
• Industry generally prefers working with open-source software
rather than commercial software.
• Hadoop is basically for storing and processing huge data sets; it is
not recommended for small data sets.
• For small data sets you should use local machines to store and
process the data and should not use Hadoop.
• Suppose you want to store 500 TB of data:
  Highly reliable hardware – costly
  Commodity hardware – cheap
• Hard disk internal structure: suppose your hard disk capacity is 500 GB
and it is divided into blocks of size 4 KB.
• Suppose in a 4 KB block I store a file of size 2 KB; then the remaining
2 KB is wasted.
• These block sizes are fixed by the file system, whether on Unix, Linux or
any other OS.
• On top of this OS / hard disk, we are going to install Hadoop / HDFS.
• The default block size of HDFS is 64 MB (one block of size 64 MB).
• Suppose I want to store a file of size 35 MB; the remaining space
= 64 − 35 = 29 MB.
• On a normal hard disk this remaining space within the block is wasted,
but in Hadoop it is not wasted.
• HDFS releases the remaining space for other files.
• If I uninstall Hadoop from the machine, the 64 MB block size
automatically reverts to the underlying 4 KB block size.

• For 1 GB, how many 64 MB blocks do we require? 1024 MB / 64 MB
= 16 blocks.
• For 500 GB we therefore require 500 × 16 = 8000 blocks of size 64 MB.
• The Hadoop administrator should take care of this allocation and
deallocation process.
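
As a quick sanity check on the arithmetic above, here is a tiny sketch (plain Java, no Hadoop APIs) that computes how many 64 MB blocks a file or disk of a given size needs; the method name blocksNeeded is illustrative.

```java
public class BlockCount {
    static final long BLOCK_SIZE_MB = 64; // HDFS default block size used in this chapter

    // Number of 64 MB blocks needed for a given size, rounding the last partial block up.
    static long blocksNeeded(long sizeMb) {
        return (sizeMb + BLOCK_SIZE_MB - 1) / BLOCK_SIZE_MB;
    }

    public static void main(String[] args) {
        System.out.println("1 GB   -> " + blocksNeeded(1024) + " blocks");        // 16
        System.out.println("500 GB -> " + blocksNeeded(500L * 1024) + " blocks"); // 8000
        System.out.println("200 MB -> " + blocksNeeded(200) + " blocks");         // 4
    }
}
```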
• Hadoop 1.x has five services:

  Master services / nodes:
  • NameNode
  • Secondary NameNode
  • JobTracker

  Slave services / nodes:
  • DataNode
  • TaskTracker
• Master services can talk to each other, and slave services can talk to
each other.
• If NameNode is a master node then its corresponding slave node
is DataNode.
• If JobTracker is a master node then its corresponding slave node
is TaskTracker.
• NameNode can talk to DataNode and JobTracker can talk to
TaskTracker, but NameNode can not talk to TaskTracker.
• Suppose there is a client that needs to store and process huge
data.
• Since this client has huge data and needs to process it in less time,
he wants to put it into Hadoop.
• Let us suppose that the client has a file of size 200 MB.
• Hadoop is not meant for such a small size as 200 MB, but for ease of
understanding I am taking it as 200 MB.
• This file is stored on Hadoop split across a number of 64 MB blocks.
• Not even 0.000001 KB of data should be lost.
• For storing 200 MB of data we need 4 blocks: 3 × 64 MB = 192 MB,
plus one block of 8 MB.
• The total = 200 MB.
[Figure: The client asks the NameNode to store file.txt (200 MB). The
NameNode's metadata records the file as four parts – a.txt, b.txt, c.txt of
64 MB each and d.txt of 8 MB (3 × 64 MB blocks + 1 × 8 MB block) – and maps
each part to DataNodes: a.txt – 1, 2, 4; b.txt – 3, 5, 8; c.txt – 5, 6, 7;
d.txt – 7, 9, 10. The cluster consists of ten DataNodes (DN), each also
running a TaskTracker (TT).]
• Basically, the client does not have to worry about splitting the file into
64 MB blocks across the cluster.
• Hadoop takes care of this splitting: the 200 MB file becomes three 64 MB
blocks (a.txt, b.txt, c.txt) and one 8 MB block (d.txt).
• Even though d.txt is only 8 MB, it is stored in a 64 MB block; the
remaining 56 MB is used to store other files, so it is not wasted.
• If the client wants to store his 200 MB of data in HDFS, whom should he
contact? He should go to the NameNode.
• When the client's request reaches the NameNode, the NameNode takes care
of the metadata.
• The NameNode then sends a signal to the client saying: you can now store
the data on DataNodes 1, 3, 5, 7.
• The client then directly approaches DataNode 1 and stores the file
a.txt on it.
• As we know, the DataNodes are commodity hardware (cheap hardware)
with different storage capacities.
• If the data on a DataNode is lost, what will happen? Are we going to
lose the data?
• Definitely we are not going to lose the data.
• By default, HDFS keeps three replications of the same file,
i.e. a.txt.
• That means that for storing 200 MB of data we will occupy 600 MB
of space, as three replications of each file are kept.
• How can the NameNode know which file is stored in which
block?
• For this reason, every 3 seconds each DataNode sends a heartbeat
(and, periodically, a block report) to the NameNode.
• If the NameNode does not receive the heartbeat from a DataNode,
it assumes that the DataNode is dead.
• If the metadata is lost, Hadoop is of no use; for this reason the
metadata is stored on reliable hardware.
• This is called a single point of failure: if the NameNode is lost,
nothing will be accessible.

• Now the data is stored on different DataNodes, and it is time to
process that data.
• To process the data I need to write a program, in any programming
language such as Java, Python, Ruby, etc.
• Say we write a 10 KB program for the processing.
• Which is better:
sending the 10 KB program to the data, or fetching the data to the program?
• Of course, sending the program to the data is advisable; fetching
huge data to the program would create problems.
• Here the JobTracker comes into the picture.
• The JobTracker does not know which file is stored on which
DataNode.
• As we have said, master services talk to each other and slave
services talk to each other.
• The JobTracker gets the information about where the files are stored in
the DataNodes for processing.
• The JobTracker shares this information with the TaskTrackers.
• The JobTracker's job is to assign tasks to the TaskTrackers.
• The TaskTracker knows that the a.txt file is on the local DataNode,
i.e. on system no. 1.
• Applying the program to the local a.txt file is a process, and that
process is called a Map.
• The same process continues for b.txt, c.txt, d.txt.
• The files a.txt, b.txt, c.txt, d.txt are called input splits.
• The number of input splits equals the number of Mappers that will run,
as shown in the sketch below.
• The JobTracker takes care of this.
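
To make the Map step concrete, here is a minimal sketch of a classic word-count Mapper written against the Hadoop MapReduce Java API; the class name WordCountMapper and the word-count logic are illustrative examples, not taken from the slides. One Mapper instance runs per input split, ideally on the DataNode that holds that split.

```java
import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

// One Mapper is run per input split (e.g. per 64 MB block of a.txt, b.txt, ...).
public class WordCountMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
    private static final IntWritable ONE = new IntWritable(1);
    private final Text word = new Text();

    @Override
    protected void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException {
        // key = byte offset of the line within the split, value = the line itself.
        StringTokenizer tokens = new StringTokenizer(value.toString());
        while (tokens.hasMoreTokens()) {
            word.set(tokens.nextToken());
            context.write(word, ONE); // emit (word, 1) for the reducers to sum
        }
    }
}
```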
HADOOP Ecosystem

Fig. Hadoop Ecosystem: on top of the Hadoop Distributed File System (HDFS)
sits the distributed processing layer (MapReduce); around them are data
transfer between Hadoop and RDBMS (Sqoop), scripting (Pig), query (Hive),
machine learning (Mahout), columnar database (HBase), coordination
(Zookeeper), and workflow and scheduling (Oozie).


• Apart from HDFS and MapReduce, the other components of the
Hadoop ecosystem are shown in the figure above.
• The main ecosystem components of Hadoop architecture are as
follows:
1. Apache HBase : Columnar (non-relational) database
2. Apache Hive : Data access and query
3. Apache HCatalog : Metadata Services
4. Apache Pig : Scripting Platform
5. Apache Mahout : Machine Learning Libraries for Data Mining
6. Apache Oozie : Workflow and scheduling services
7. Apache Zookeeper : Cluster coordination
8. Apache Sqoop : Data integration services
1. HBase
• It is an open-source, distributed, column-oriented store that sits on
top of HDFS.
• HBase is based on Google's Bigtable.
• It is organized by columns rather than rows.
• This increases the speed of execution of operations when they need
to be performed on similar values across massive data sets.
• Simply put, HBase is a NoSQL database.
• All data is stored in the form of (key, value) pairs.
• Both key and value are byte-array data.
• Data is stored under a column family (CF) together with a
timestamp (ts) value.
• Example:

Actual data (relational table):

  id    Name      salary
  101   Howard    10000
  102   Pthanil   11000
  103   Robert    10000

HBase data (row → column + cell):

  row 101 : column = cf:name, ts, value = 'Howard'
  row 102 : column = cf:name, ts, value = 'Pthanil'
  row 103 : column = cf:name, ts, value = 'Robert'

Column-oriented view:

  Id      101      102       103
  Name    Howard   Pthanil   Robert
  salary  10000    11000     10000
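
A minimal sketch of how a row like the one above could be written and read with the HBase Java client API. It assumes a running HBase instance and an already-created table named "employee" with a column family "cf" (both names are illustrative, e.g. created beforehand in the HBase shell).

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.Connection;
import org.apache.hadoop.hbase.client.ConnectionFactory;
import org.apache.hadoop.hbase.client.Get;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.client.Table;
import org.apache.hadoop.hbase.util.Bytes;

public class HBaseExample {
    public static void main(String[] args) throws Exception {
        Configuration conf = HBaseConfiguration.create();
        try (Connection connection = ConnectionFactory.createConnection(conf);
             Table table = connection.getTable(TableName.valueOf("employee"))) {

            // Store row 101: column family "cf", qualifiers "name" and "salary".
            Put put = new Put(Bytes.toBytes("101"));
            put.addColumn(Bytes.toBytes("cf"), Bytes.toBytes("name"), Bytes.toBytes("Howard"));
            put.addColumn(Bytes.toBytes("cf"), Bytes.toBytes("salary"), Bytes.toBytes("10000"));
            table.put(put);

            // Read the row back; every cell also carries a timestamp (ts).
            Result result = table.get(new Get(Bytes.toBytes("101")));
            byte[] name = result.getValue(Bytes.toBytes("cf"), Bytes.toBytes("name"));
            System.out.println("cf:name = " + Bytes.toString(name));
        }
    }
}
```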
2. Hive
• Hive provides a warehouse structure over other Hadoop input
sources and SQL-like access to data in HDFS.
• Hive's query language, HiveQL, compiles to MapReduce and also
supports user-defined functions (UDFs).
• Hive's data model is based primarily on three related data
structures: tables, partitions and buckets.
• Tables correspond to HDFS directories that are divided into
partitions, which in turn can be divided into buckets.
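
A minimal sketch of querying Hive from Java over JDBC. It assumes a HiveServer2 endpoint at localhost:10000 and an existing "employee" table with name and salary columns (all of these are illustrative assumptions); behind the scenes, Hive compiles the HiveQL into MapReduce jobs.

```java
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;

public class HiveQueryExample {
    public static void main(String[] args) throws Exception {
        // HiveServer2 JDBC endpoint; adjust host/port/database for your cluster.
        String url = "jdbc:hive2://localhost:10000/default";
        Class.forName("org.apache.hive.jdbc.HiveDriver");

        try (Connection conn = DriverManager.getConnection(url, "", "");
             Statement stmt = conn.createStatement();
             // A HiveQL query; Hive compiles it to one or more MapReduce jobs.
             ResultSet rs = stmt.executeQuery(
                     "SELECT name, salary FROM employee WHERE salary > 10000")) {
            while (rs.next()) {
                System.out.println(rs.getString("name") + " : " + rs.getLong("salary"));
            }
        }
    }
}
```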
3. HCatalog
• HCatalog is a metadata and table storage service for HDFS.
• HCatalog’s goal is to simplify the User’s interaction with HDFS
data and enable sharing tools and execution platforms.
4. Pig
• Pig is a runtime environment that allows users to execute
MapReduce on a Hadoop cluster.
• Pig Latin is a high-level scripting language on the Pig platform.
• Like HiveQL in Hive, Pig Latin is a high-level language that
compiles to MapReduce.

• Pig is more flexible than Hive with respect to possible data formats,
due to its data model.
• Pig's data model is similar to the relational data model, but here
tuples can be nested.
• In Pig, tables are called bags.
• Pig also has a “map” data type, which is useful for representing
semi-structured data such as JSON or XML; see the sketch below.
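
A small sketch of running Pig Latin from Java via the PigServer API in local mode; the input file name employee.csv and its schema are illustrative assumptions. The same two Pig Latin statements could equally be typed into the Grunt shell.

```java
import java.util.Iterator;

import org.apache.pig.ExecType;
import org.apache.pig.PigServer;
import org.apache.pig.data.Tuple;

public class PigExample {
    public static void main(String[] args) throws Exception {
        // Local mode: runs against the local file system, no Hadoop cluster needed.
        PigServer pig = new PigServer(ExecType.LOCAL);

        // Load a comma-separated file into a bag of tuples and filter it.
        pig.registerQuery("emp = LOAD 'employee.csv' USING PigStorage(',') "
                + "AS (id:int, name:chararray, salary:long);");
        pig.registerQuery("high = FILTER emp BY salary > 10000;");

        // Iterate over the result of the 'high' alias.
        Iterator<Tuple> it = pig.openIterator("high");
        while (it.hasNext()) {
            System.out.println(it.next());
        }
        pig.shutdown();
    }
}
```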
5. Sqoop
• Sqoop (“SQL-to-Hadoop”) is a tool which transfers data in both
directions between relational systems and HDFS or other Hadoop
data stores such as Hive or HBase.
• Sqoop can be used to import data from external structured
databases into HDFS or any other related systems such as Hive
or HBase.
• On the other hand, sqoop can also be used to extract data from
Hadoop and export it to external structured databases such as
relational databases and enterprise data warehouses.
6. Oozie
• Oozie is a job coordinator and workflow manager for jobs
executed in Hadoop.
• It is integrated with the rest of the Apache Hadoop stack.
• It supports several types of Hadoop jobs such as Java map-
reduce, Streaming map-reduce, Pig, Hive and Sqoop as well as
system-specific jobs such as Java programs and shell scripts.
• An Oozie workflow is a collection of actions and Hadoop jobs
arranged in a Directed Acyclic Graph (DAG), since tasks are
executed in a sequence and also are subject to certain constraints.
7. Mahout
• Mahout is a scalable machine-learning and data-mining library.
• There are currently four main groups of algorithms in Mahout:
• 1. Recommendations / collaborative filtering
• 2. Classification / categorization
• 3. Clustering
• 4. Frequent item-set mining / parallel frequent pattern mining
• Many machine-learning algorithms are non-scalable, i.e. given the
types of operations they perform, they cannot be executed as a
set of parallel processes.
• But the algorithms in the Mahout library can be executed in a
distributed fashion, and have been written for MapReduce.
8. Zookeeper
• Zookeeper is a distributed service with master and slave nodes
for storing and maintaining configuration information, naming,
providing distributed synchronization and providing group
services in memory on Zookeeper servers.
• Zookeeper allows distributed processes to coordinate with each
other through a shared hierarchical name space of data registers
called znodes.
• HBase depends on Zookeeper and runs a Zookeeper instance by
default.
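
A minimal sketch of using the ZooKeeper Java client to create and read a znode, illustrating the shared hierarchical name space mentioned above. The quorum address localhost:2181, the znode path /demo-config and its data are illustrative assumptions.

```java
import org.apache.zookeeper.CreateMode;
import org.apache.zookeeper.ZooDefs;
import org.apache.zookeeper.ZooKeeper;
import org.apache.zookeeper.data.Stat;

public class ZookeeperExample {
    public static void main(String[] args) throws Exception {
        // Connect to a ZooKeeper ensemble (a single local server is assumed here).
        // Production code would wait for the connection event before issuing requests.
        ZooKeeper zk = new ZooKeeper("localhost:2181", 3000, event ->
                System.out.println("event: " + event.getState()));

        String path = "/demo-config";
        // Create a persistent znode holding a small piece of configuration data.
        if (zk.exists(path, false) == null) {
            zk.create(path, "replication=3".getBytes(),
                    ZooDefs.Ids.OPEN_ACL_UNSAFE, CreateMode.PERSISTENT);
        }

        // Read the data back; Stat carries the znode's version and timestamps.
        Stat stat = new Stat();
        byte[] data = zk.getData(path, false, stat);
        System.out.println(new String(data) + " (version " + stat.getVersion() + ")");
        zk.close();
    }
}
```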
