THE AGE OF BIG DATA
(Infographic: every minute, roughly 2.8 million social media posts, 2.5 million website search queries and 27.2 thousand review posts are made, 100 hours of online video are uploaded, 201 million emails are sent, and 57 thousand pictures and 50.7 thousand thoughts are posted.)
• Big data is defined as collections of datasets whose volume,
velocity or variety is so large that it is difficult to store,
manage, process and analyze the data using traditional
databases and data processing tools.
• In recent years, there has been an exponential growth in both structured and unstructured data generated by information technology, industrial, healthcare, Internet of Things and other systems.
• According to an estimate by IBM, 2.5 quintillion bytes of data are created every day.
Below are some key pieces of data from the report:
• Facebook users share nearly 4.16 million pieces of content
• Twitter users send nearly 300,000 tweets
• Instagram users like nearly 1.73 million photos
• YouTube users upload 300 hours of new video content
• Apple users download nearly 51,000 apps
• Skype users make nearly 110,000 new calls
• Amazon receives 4300 new visitors
• Uber passengers take 694 rides
• Netflix subscribers stream nearly 77,000 hours of video
• Big data has the potential to power the next generation of smart applications, which leverage this data to become intelligent.
• Applications of big data span a wide range of domains such as web, retail and marketing, banking and finance, industrial, healthcare, environmental, Internet of Things and cyber-physical systems.
Big data analytics deals with the collection, storage, processing and analysis of this massive-scale data. Specialized tools and frameworks are required for big data analysis when:
(1) The volume of data involved is so large that it is difficult to
store, process and analyze data on a single machine,
(2) The velocity of data is very high and the data needs to be
analyzed in real-time,
(3) There is a variety of data involved, which can be structured, unstructured or semi-structured, and is collected from multiple data sources.
• Big data analytics involves several steps, starting from data cleansing and data munging (or wrangling) through data processing and visualization.
• Big data analytics life-cycle starts from the collection of data
from multiple data sources.
• Specialized tools and frameworks are required to ingest the
data from different sources into the big data analytics backend.
• The data is stored in specialized storage solutions (such as
distributed file systems and non-relational databases) which are
designed to scale.
Big data analytics is enabled by several technologies such as cloud computing, distributed and parallel processing frameworks, non-relational databases and in-memory computing.
Some examples of big data are listed as follows:
• Data generated by social networks including text, images, audio and
video data
• Click-stream data generated by web applications such as e-Commerce to
analyze user behavior
• Machine sensor data collected from sensors embedded in industrial and
energy systems for monitoring their health and detecting failures
• Healthcare data collected in electronic health record (EHR) systems
• Logs generated by web applications
• Stock market data
• Transactional data generated by banking and financial applications
Characteristics of Big Data
1. Volume
Big data is data whose volume is so large that it would not fit on a single machine; therefore, specialized tools and frameworks are required to store, process and analyze such data.
Examples: social media applications process billions of messages every day, industrial and energy systems can generate terabytes of sensor data every day, and cab aggregation applications can process millions of transactions in a day.
There is no fixed threshold for the volume of data to be considered big data; typically, the term is used for massive-scale data that is difficult to store, manage and process using traditional databases and data processing architectures.
2. Velocity
• Velocity of data refers to how fast the data is generated.
• Data generated by certain sources can arrive at very high velocities, for
example, social media data or sensor data.
• Velocity is another important characteristic of big data and a reason for its exponential growth.
• A high velocity of data results in the accumulated volume becoming very large in a short span of time.
3. Variety
• Variety refers to the forms of the data.
• Big data comes in different forms such as structured, unstructured or semi-structured data, including text, image, audio, video and sensor data.
• Big data systems need to be flexible enough to handle such a variety of data.
• An RDBMS deals only with structured data.
• Nearly 70-80% of all data is either unstructured or semi-structured.
• Unstructured data – Facebook videos, audio, images, text messages
• Semi-structured data – log files (see the parsing sketch below)
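As a small aside, the sketch below shows why log files count as semi-structured: each line carries some structure (a timestamp, a level, a message) that can be pulled out with a pattern, but there is no fixed relational schema. The log format and class name here are made up purely for illustration.

import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Pulls the loosely structured fields out of one (made-up) log line.
public class LogLineParse {
  // e.g. "2024-01-15 10:32:01 ERROR PaymentService Timeout while calling gateway"
  private static final Pattern LOG_LINE =
      Pattern.compile("^(\\S+ \\S+) (\\w+) (\\S+) (.*)$");

  public static void main(String[] args) {
    String line = "2024-01-15 10:32:01 ERROR PaymentService Timeout while calling gateway";
    Matcher m = LOG_LINE.matcher(line);
    if (m.matches()) {
      System.out.println("timestamp = " + m.group(1));
      System.out.println("level     = " + m.group(2));
      System.out.println("service   = " + m.group(3));
      System.out.println("message   = " + m.group(4));
    }
  }
}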
HADOOP
Timeline: 1990 – white paper; after 13 years, 2003 – GFS (Google File System); 2004 – MapReduce.
HDFS is a file system specially designed for storing huge data sets on a cluster of commodity hardware, with a streaming access pattern.
Write Once, Read Any number of times (WORA) is what is called the streaming access pattern.
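As a rough illustration of this streaming access pattern, the sketch below opens a file already stored in HDFS and reads it sequentially as a stream using the org.apache.hadoop.fs.FileSystem API; the file is never modified in place. The NameNode address and the file path are placeholders, and a running HDFS cluster is assumed.

import java.io.BufferedReader;
import java.io.InputStreamReader;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

// Reads an HDFS file line by line as a stream (write once, read any number of times).
public class HdfsStreamRead {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    conf.set("fs.defaultFS", "hdfs://namenode-host:9000");  // placeholder address

    try (FileSystem fs = FileSystem.get(conf);
         FSDataInputStream in = fs.open(new Path("/user/client/file.txt"));
         BufferedReader reader = new BufferedReader(new InputStreamReader(in))) {
      String line;
      while ((line = reader.readLine()) != null) {
        System.out.println(line);   // process each line as it streams in
      }
    }
  }
}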
• Hadoop Common consists of the necessary Java Archive (JAR) files and scripts needed to start Hadoop.
• Hadoop requires Java Runtime Environment (JRE) 1.6 or a higher version.
• The standard start-up and shut-down scripts require secure shell (SSH) to be set up between the nodes in the cluster.
1. HDFS creates multiple replicas of data blocks for reliability, placing them on compute nodes around the cluster.
2. Hadoop's target is to run on clusters of the order of 10,000 nodes.
3. A file consists of many 64 MB blocks (see the configuration sketch below).
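As a minimal sketch, these block-size and replication settings can also be seen through Hadoop's Configuration object (they normally live in hdfs-site.xml). The property names below are the Hadoop 2.x ones and the NameNode address is a placeholder.

import org.apache.hadoop.conf.Configuration;

// Sketch: the 64 MB block size and the default replication factor expressed
// as Hadoop configuration properties. Assumes hadoop-common on the classpath.
public class HdfsConfigSketch {
  public static void main(String[] args) {
    Configuration conf = new Configuration();
    conf.set("fs.defaultFS", "hdfs://namenode-host:9000");  // placeholder address
    conf.setLong("dfs.blocksize", 64L * 1024 * 1024);       // 64 MB blocks, as in these notes
    conf.setInt("dfs.replication", 3);                      // replicas per block (HDFS default is 3)

    System.out.println("Block size  : " + conf.get("dfs.blocksize") + " bytes");
    System.out.println("Replication : " + conf.get("dfs.replication"));
  }
}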
HADOOP
• Struts, Spring and Hibernate are frameworks in Java.
• In the same way, Hadoop is a framework, and it is overseen by the Apache Software Foundation.
• A great deal of open-source software is hosted by none other than the Apache Software Foundation.
• You do not have to pay anything; you can simply download it for free.
• Industry generally prefers working with open-source software rather than commercial software.
• Hadoop is meant for storing and processing huge data sets; it is not recommended for small data sets.
• For small data sets you should store and process the data on local machines and not use Hadoop.
• Suppose you want to store 500 TB of data. There are two options:
Highly reliable hardware – costly
Commodity hardware – cheap
• Hard disk internal structure: a normal hard disk (say, one of 500 GB capacity) uses blocks of size 4 KB, whereas HDFS uses blocks of size 64 MB.
• Suppose I want to store a file of size 35 MB in a 64 MB block; the remaining 64 - 35 = 29 MB of the block stays free.
• On a normal hard disk that remaining space would be wasted, but in Hadoop it is not wasted.
• Hadoop will release the remaining space for other files.
• If Hadoop is uninstalled from the machine, the 64 MB block size will automatically be converted back to the 4 KB block size.
Master Services / Nodes
NameNode
Secondary NameNode
JobTracker
Slave Services / Nodes
DataNode
TaskTracker
• Every master service can talk to every other master service, and every slave service can talk to every other slave service.
• If NameNode is a master node then its corresponding slave node
is DataNode.
• If JobTracker is a master node then its corresponding slave node
is TaskTracker.
• The NameNode can talk to the DataNode and the JobTracker can talk to the TaskTracker, but the NameNode cannot talk to the TaskTracker.
• Suppose there is a client which needs to store and process huge data.
• As this client has huge data and needs to process it in less time, he wants to put it into Hadoop.
• Let us suppose that the client has a file of size 200 MB.
• Hadoop is not meant for such a small size as 200 MB, but for the sake of understanding I am taking it as 200 MB.
• This file is stored on Hadoop, split across a number of 64 MB blocks.
• Not even 0.000001 KB of data should be lost.
• For storing 200 MB of data we need 4 blocks: 3 full 64 MB blocks and 1 block holding 8 MB.
• 3 x 64 = 192 MB + 8 MB
• The total = 200 MB (see the arithmetic sketch below)
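Here is the same arithmetic as a small runnable sketch: a 200 MB file split into 64 MB blocks, plus the raw storage consumed once every block is replicated (the default replication factor of 3 is discussed a little later in these notes).

// Worked block arithmetic for the 200 MB example file.
public class BlockMath {
  public static void main(String[] args) {
    long fileSizeMb = 200;
    long blockSizeMb = 64;
    int replication = 3;

    long fullBlocks = fileSizeMb / blockSizeMb;     // 3 full 64 MB blocks
    long lastBlockMb = fileSizeMb % blockSizeMb;    // 1 block holding the remaining 8 MB
    long totalBlocks = fullBlocks + (lastBlockMb > 0 ? 1 : 0);

    System.out.println(totalBlocks + " blocks: " + fullBlocks + " x " + blockSizeMb
        + " MB + " + lastBlockMb + " MB = "
        + (fullBlocks * blockSizeMb + lastBlockMb) + " MB");
    // With 3 replicas of every block, 200 MB of data occupies 600 MB of raw storage.
    System.out.println("Raw storage with replication: " + (fileSizeMb * replication) + " MB");
  }
}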
Figure: the client's 200 MB file.txt is split into four blocks (a.txt, b.txt and c.txt of 64 MB each, and d.txt of 8 MB) and stored across a cluster of ten slave nodes numbered 1-10, each running a DataNode (DN) and a TaskTracker (TT). The NameNode keeps the metadata mapping each block to the DataNodes that hold its replicas: a.txt – 1, 2, 4; b.txt – 3, 5, 8; c.txt – 5, 6, 7; d.txt – 7, 9, 10.
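To make the figure concrete, here is a toy illustration of the kind of block-to-DataNode mapping the NameNode keeps as metadata. It is only a sketch for intuition, not how the real NameNode stores its metadata, and the class and variable names are made up.

import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Toy model of the metadata in the figure: each block of file.txt is mapped
// to the DataNodes that hold its replicas.
public class NameNodeMetadataToy {
  public static void main(String[] args) {
    Map<String, List<Integer>> blockLocations = new LinkedHashMap<>();
    blockLocations.put("a.txt", Arrays.asList(1, 2, 4));   // 64 MB block
    blockLocations.put("b.txt", Arrays.asList(3, 5, 8));   // 64 MB block
    blockLocations.put("c.txt", Arrays.asList(5, 6, 7));   // 64 MB block
    blockLocations.put("d.txt", Arrays.asList(7, 9, 10));  // 8 MB block

    // A request for b.txt is answered from this mapping.
    System.out.println("b.txt is on DataNodes " + blockLocations.get("b.txt"));
  }
}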
• Basically, the client does not have to worry about splitting the file into 64 MB blocks across the cluster.
• The HDFS client library and the NameNode take care of this splitting and block placement.
• Even though d.txt is only 8 MB, it is stored in a 64 MB block; the remaining 56 MB is used to store other files, so it is not wasted.
• If the client wants to store his 200 MB of data in HDFS, whom should he contact? He should go to the NameNode.
• When the client's request reaches the NameNode, the NameNode takes care of the metadata.
• The NameNode then sends a signal back to the client saying: now you can store the data on DataNodes 1, 3, 5 and 7.
• The client then directly approaches DataNode 1 and stores the file a.txt on it (a client-side sketch follows below).
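Below is a minimal sketch of the client side of this interaction, copying a local file into HDFS with the org.apache.hadoop.fs.FileSystem API: the client contacts the NameNode only for metadata, while the block data itself flows to the DataNodes. The NameNode address and the paths are placeholders, and a running cluster is assumed.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

// Copies a local file (the 200 MB file.txt of the example) into HDFS.
public class HdfsPutExample {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    conf.set("fs.defaultFS", "hdfs://namenode-host:9000");   // placeholder address

    FileSystem fs = FileSystem.get(conf);
    fs.copyFromLocalFile(new Path("/local/data/file.txt"),    // local source
                         new Path("/user/client/file.txt"));  // destination in HDFS
    fs.close();
  }
}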
• As we know, the DataNodes are commodity hardware (cheap hardware) with different storage capacities.
• If the data on a DataNode is lost, what will happen? Are we going to lose the data?
• Definitely we are not going to lose the data.
• By default, HDFS keeps three replications of the same file, i.e. of a.txt.
• That means for storing 200 MB of data we will occupy 600 MB of space, since three replications of each block are kept.
• How can the NameNode know which block is stored on which DataNode?
• For this reason, every 3 seconds each DataNode sends a block report and a heartbeat to the NameNode (a toy sketch of this follows below).
• If the NameNode does not receive the heartbeat from a DataNode, it considers that DataNode dead.
• If the metadata is lost, then Hadoop is of no use; for this reason the metadata is stored on reliable hardware.
• This is called a single point of failure.
• If the NameNode is lost, nothing will be accessible.
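The sketch below is only a toy illustration of the heartbeat idea (it is not actual Hadoop code): a DataNode reports to the NameNode every 3 seconds, and a DataNode that stops reporting is treated as dead.

import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;

// Toy heartbeat loop: prints a "heartbeat + block report" message every
// 3 seconds, mimicking what a DataNode sends to the NameNode.
public class HeartbeatToy {
  public static void main(String[] args) throws InterruptedException {
    ScheduledExecutorService scheduler = Executors.newSingleThreadScheduledExecutor();
    scheduler.scheduleAtFixedRate(
        () -> System.out.println("DataNode 1 -> NameNode: heartbeat + block report"),
        0, 3, TimeUnit.SECONDS);

    TimeUnit.SECONDS.sleep(10);   // let a few heartbeats go out, then stop the toy
    scheduler.shutdown();
    // In HDFS, if these heartbeats stop arriving, the NameNode considers the DataNode dead.
  }
}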
Hadoop ecosystem components (from the figure):
• Scripting (Pig)
• Query (Hive)
• Machine Learning (Mahout)
• Columnar database (HBase)
• Coordination (Zookeeper)
• Data exchange (Sqoop)
• Workflow and Scheduling (Oozie)
• Distributed processing (MapReduce) – see the word-count sketch below
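To give the MapReduce layer some substance, here is a sketch of the classic word-count job written against the org.apache.hadoop.mapreduce API. The input and output paths are passed on the command line and are assumed to be HDFS directories; the output directory must not already exist.

import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCount {

  // Mapper: emits (word, 1) for every word in its input split.
  public static class TokenizerMapper extends Mapper<Object, Text, Text, IntWritable> {
    private static final IntWritable ONE = new IntWritable(1);
    private final Text word = new Text();

    @Override
    public void map(Object key, Text value, Context context)
        throws IOException, InterruptedException {
      StringTokenizer itr = new StringTokenizer(value.toString());
      while (itr.hasMoreTokens()) {
        word.set(itr.nextToken());
        context.write(word, ONE);
      }
    }
  }

  // Reducer: sums the counts emitted for each word.
  public static class IntSumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
    private final IntWritable result = new IntWritable();

    @Override
    public void reduce(Text key, Iterable<IntWritable> values, Context context)
        throws IOException, InterruptedException {
      int sum = 0;
      for (IntWritable val : values) {
        sum += val.get();
      }
      result.set(sum);
      context.write(key, result);
    }
  }

  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    Job job = Job.getInstance(conf, "word count");
    job.setJarByClass(WordCount.class);
    job.setMapperClass(TokenizerMapper.class);
    job.setCombinerClass(IntSumReducer.class);   // local aggregation before the shuffle
    job.setReducerClass(IntSumReducer.class);
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(IntWritable.class);
    FileInputFormat.addInputPath(job, new Path(args[0]));    // HDFS input directory
    FileOutputFormat.setOutputPath(job, new Path(args[1]));  // HDFS output directory
    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}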