Big Data Analytics_Lecture Slides
BTCSE DED 42
Dept of Computer Sc. & Engg.
Big Data
• Big Data is data whose scale, distribution, diversity, and/or timeliness require the
use of new technical architectures and analytics to enable insights that unlock
new sources of business value - McKinsey Global Institute report, 2011.
• Big Data refers to massive volumes of data that also grow exponentially in
size over time.
• The data is so extensive and complex that no conventional data management
method can effectively store or handle it.
• Big data is like data, but enormous.
• The structure of the data dictates how to work with it and what insights it may provide
• Attributes of Big Data: Huge volume of data, Complexity of data types and
structures, Speed of new data creation and growth
Applications of Big Data
Examples:
Discovering consumer shopping habits
Finding new customer leads
Fuel optimization tools for the transportation industry
Live road mapping for autonomous vehicles
Monitoring health conditions through data from wearables
Personalised health plans for cancer patients
Personalised marketing
Predictive inventory ordering
Real-time data monitoring and cybersecurity protocols
Streamlined media streaming
User demand prediction for ridesharing companies
Benefits of Big Data
• Saves cost
• Saves time
• Helps gain a better grasp of market conditions
• Improves a company's online presence
• Boosts customer acquisition and retention
• Solves advertisers' problems and offers marketing insights
• Drives innovation and product development
Challenges with Big Data
• Data privacy: The Big Data we now generate contains a lot of information about our
personal lives, much of which we have a right to keep private.
• Data security: Even if we decide we are happy for someone to have our data for a
purpose, can we trust them to keep it safe?
• Data discrimination: When everything is known, will it become acceptable
to discriminate against people based on data we have on their lives? We
already use credit scoring to decide who can borrow money, and insurance
is heavily data-driven.
• Data quality: Not enough emphasis is placed on quality and contextual relevance.
The trend in technology is to collect more raw data, closer to the end user. The
danger is that raw data has quality issues, so reducing the gap between the end
user and the raw data increases problems with data quality.
Types of Digital Data
• Structured Data
• Unstructured Data
• Semi-Structured Data
• Quasi-structured data
Structured Data
• Data that has been arranged and specified in terms of length and format
• Data containing a defined data type, format, and structure
• Highly ordered, with parameters defining its size
• Examples: payroll databases, transaction data, online analytical processing (OLAP)
data cubes, traditional RDBMSs, CSV files, and simple spreadsheets
Unstructured Data
• Data that has not been arranged and has no inherent structure
• Data that cannot be stored in the form of rows and columns
• Examples: text documents, PDFs, images, and video
Semi-Structured Data
• Data that is semi-organized
• Textual data files with a discernible pattern that enables parsing
• Examples:
• Extensible Markup Language (XML) data files that are self-describing
• The details of an email: the time it was sent, the to and from email addresses, the
internet protocol address of the device from which it was sent, and other bits of
information associated with the email's content. The actual content (the email text) is
unstructured, but some components enable the data to be categorized, as the sketch below shows.
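As a minimal illustration of how the structured parts of an email can be separated from its unstructured body, here is a shell sketch; the filename message.eml is hypothetical, and the header names are just the common ones:

% grep -E '^(From|To|Date|Received):' message.eml   # pull out the structured header fields
% sed '1,/^$/d' message.eml | head                  # the body after the first blank line is free text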
Quasi-structured data
• Text data with erratic data formats that can be formatted with effort,
tools, and time.
• Example: web clickstream data that may contain inconsistencies in
data values and formats
Characteristics of Big Data
Three V’s:
• Volume
• Velocity
• Variety
These can be extended with a fourth and fifth V:
• Veracity
• Value
• Volume: refers to a large amount of data. The magnitude of the data plays a critical role
in determining its worth. Example: in 2016, worldwide mobile traffic was predicted to be
6.2 exabytes (6.2 billion GB) per month, and predictions for 2020 were about 40,000 exabytes of data.
• Velocity: refers to the rapid collection of data. In Big Data, data arrives at a high rate from
machines, networks, social media, mobile phones, and other sources, so there is a large and
constant influx of data. This influences the data's potential, that is, how quickly data is
created and processed. Example: Google receives more than 3.5 billion queries every day,
and the number of Facebook users grows at a rate of around 22% every year.
• Variety: Complexity of data types and structures: Big Data reflects the variety of new
data sources, formats, and structures, including digital traces being left on the web and
other digital repositories for subsequent analysis
• Veracity: It is equivalent to quality. We have all the data, but could we be missing
something? Are the data “clean” and accurate? Do they really have something to offer?
• Value: there is another V to take into account when looking at big data: value. Having
access to big data is no good unless we can turn it into value, and companies are starting
to generate amazing value from their big data.
A Weather Dataset
• Weather sensors collect data every hour at many locations across the globe and
gather a large volume of log data, which is a good candidate for analysis.
• The data is from the National Climatic Data Center, or NCDC.
• The data is stored using a line-oriented ASCII format, in which each line is a
record.
• The format supports a rich set of meteorological elements, many of which are
optional or with variable data lengths. For simplicity, we focus on the basic
elements, such as temperature, which are always present and are of fixed
width.
There is a directory for each year from 1901 to 2001, each containing a gzipped file for
each weather station with its readings for that year. For example, the first entries for
1990 can be inspected as sketched below.
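A minimal sketch of inspecting one year's data from the shell; the directory layout and station filename shown here are hypothetical, but they follow the pattern of one gzipped, line-oriented file per station per year:

% ls raw/1990 | head -2                               # first gzipped station files for 1990
% gunzip -c raw/1990/010010-99999-1990.gz | head -1   # one fixed-width record containing the temperature field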
Because of its scalability, low cost, and flexibility, Hadoop is an ideal platform for Big Data
analytics, and it includes a slew of tools that data scientists need. Apache Hadoop, together
with YARN, converts a significant amount of raw data into an easily consumable feature matrix
and simplifies the development of machine learning algorithms.
Before submitting a job to a cluster, the map and reduce scripts can be tested locally as a Unix pipeline:
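A minimal sketch, assuming the Ruby mapper and reducer from the streaming command below are executable and that input/ncdc/sample.txt is a small NCDC sample file:

% cat input/ncdc/sample.txt | \
    ch02-mr-intro/src/main/ruby/max_temperature_map.rb | \
    sort | \
    ch02-mr-intro/src/main/ruby/max_temperature_reduce.rb
# sort stands in for the MapReduce shuffle, grouping the mapper's output by key before the reducer sees it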
Run the same job on this file using Hadoop Streaming:
% hadoop jar $HADOOP_HOME/share/hadoop/tools/lib/hadoop-streaming-*.jar \
    -input input/ncdc/sample.txt \
    -output output \
    -mapper ch02-mr-intro/src/main/ruby/max_temperature_map.rb \
    -reducer ch02-mr-intro/src/main/ruby/max_temperature_reduce.rb
Streaming mappers and reducers read their input from standard input and write their output to
standard output, one record per line; this is the basis for the communication protocol between
the MapReduce framework and the streaming mapper/reducer.
• Filesystems that manage storage across a network of machines are called distributed
filesystems. Since they are network based, all the complications of network programming kick
in, making distributed filesystems more complex than regular disk filesystems. For example,
one of the biggest challenges is making the filesystem tolerate node failure without suffering
data loss.
• Hadoop comes with a distributed filesystem called HDFS, which stands for Hadoop
Distributed Filesystem. HDFS is Hadoop's flagship filesystem.
Benefits of block abstraction
1. A file can be larger than any single disk in the network
2. Making the unit of abstraction a block rather than a file simplifies the storage
subsystem - because blocks are a fixed size, it is easy to calculate how many
can be stored on a given disk
3. Blocks fit well with replication for providing fault tolerance and availability. If a
block becomes unavailable due to corruption or machine failure, a copy can be
read from another location in a way that is transparent to the client.
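As an illustration of the block abstraction, HDFS's fsck tool can list the blocks that make up a file. A minimal sketch; the path /user/tom/quangle.txt is hypothetical:

% hdfs fsck /user/tom/quangle.txt -files -blocks -locations
# prints each block of the file, its length, and the datanodes holding its replicas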
Basic architecture of HDFS
HDFS Architecture key components
Namenodes and Datanodes
• Namenode (the master): manages the filesystem namespace. It maintains the filesystem tree and
the metadata for all the files and directories in the tree. This information is stored persistently on
the local disk in the form of two files: the namespace image and the edit log.
• Datanodes (workers): workhorses of the filesystem. They store and retrieve blocks when they are
told to (by clients or the namenode), and they report back to the namenode periodically with lists
of blocks that they are storing.
• Without the namenode, the filesystem cannot be used. In fact, if the machine running the
namenode were obliterated, all the files on the filesystem would be lost since there would be no
way of knowing how to reconstruct the files from the blocks on the datanodes. For this reason, it
is important to make the namenode resilient to failure, and Hadoop provides two mechanisms for
this.
• Back up the files that make up the persistent state of the filesystem metadata.
• Run a secondary namenode, which despite its name does not act as a namenode. Its main role
is to periodically merge the namespace image with the edit log to prevent the edit log from
becoming too large; it runs on a separate physical machine because it requires plenty of CPU
and as much memory as the namenode.
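A quick way to see the namenode/datanode relationship on a running cluster is the dfsadmin report (a sketch; it assumes HDFS is up and the hdfs command is on the path):

% hdfs dfsadmin -report
# prints overall capacity plus, for each datanode, its state and the storage it reports to the namenode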
Block Caching
• Normally a datanode reads blocks from disk, but for frequently
accessed files the blocks may be explicitly cached in the datanode’s
memory, in an off-heap block cache.
• By default, a block is cached in only one datanode’s memory,
although the number is configurable on a per-file basis.
• Job schedulers (for MapReduce, Spark, and other frameworks) can
take advantage of cached blocks by running tasks on the datanode
where a block is cached, for increased read performance.
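Cache directives are managed with the cacheadmin tool. A minimal sketch; the pool name and path are hypothetical:

% hdfs cacheadmin -addPool hot-pool
% hdfs cacheadmin -addDirective -path /user/tom/lookup-table -pool hot-pool
% hdfs cacheadmin -listDirectives
# directs the datanodes holding the file's blocks to cache them in off-heap memory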
HDFS Federation
• The namenode keeps a reference to every file and block in the filesystem in
memory, which means that on very large clusters with many files, memory
becomes the limiting factor for scaling
• HDFS federation, introduced in the 2.x release series, allows a cluster to scale by
adding namenodes, each of which manages a portion of the filesystem
namespace. For example, one namenode might manage all the files rooted under
/user, say, and a second namenode might handle files under /share.
• Under federation, each namenode manages a namespace volume, which is made
up of the metadata for the namespace, and a block pool containing all the blocks
for the files in the namespace.
• To access a federated HDFS cluster, clients use client-side mount tables to map
file paths to namenodes. This is managed in configuration using
ViewFileSystem and the viewfs:// URIs.
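A hedged sketch of what federation looks like to a client (the hostnames are hypothetical): each namenode serves an independent slice of the namespace, and in practice clients hide this behind a viewfs:// mount table rather than addressing namenodes directly:

% hadoop fs -ls hdfs://namenode1:8020/user    # namespace volume managed by the first namenode
% hadoop fs -ls hdfs://namenode2:8020/share   # namespace volume managed by the second namenode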
HDFS High Availability
• Hadoop 2 remedied the situation of long recovery time in routine maintenance and
unexpected failure of the namenode, by adding support for HDFS high availability (HA).
• Architectural changes needed:
• The namenodes must use highly available shared storage to share the edit log. When a
standby namenode comes up, it reads up to the end of the shared edit log to
synchronize its state with the active namenode, and then continues to read new
entries as they are written by the active namenode.
• Datanodes must send block reports to both namenodes because the block mappings
are stored in a namenode’s memory, and not on disk.
• Clients must be configured to handle namenode failover, using a mechanism that is
transparent to users.
• The secondary namenode’s role is subsumed by the standby, which takes periodic
checkpoints of the active namenode’s namespace.
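On an HA cluster, the state of the namenode pair can be inspected and a manual failover initiated with the haadmin tool (a sketch; nn1 and nn2 are hypothetical namenode IDs):

% hdfs haadmin -getServiceState nn1    # reports "active" or "standby"
% hdfs haadmin -failover nn1 nn2       # gracefully fails over from nn1 to nn2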
Data Replication
• HDFS is designed to reliably store very large files across machines in a large cluster. It stores each file as a
sequence of blocks; all blocks in a file except the last block are the same size. The blocks of a file are
replicated for fault tolerance. The block size and replication factor are configurable per file. An application
can specify the number of replicas of a file. The replication factor can be specified at file creation time and
can be changed later. Files in HDFS are write-once and have strictly one writer at any time.
• The NameNode makes all decisions regarding replication of blocks. It periodically receives a Heartbeat and
a Blockreport from each of the DataNodes in the cluster. Receipt of a Heartbeat implies that the DataNode
is functioning properly. A Blockreport contains a list of all blocks on a DataNode.
• Large HDFS instances run on a cluster of computers that commonly spread across many racks. For the
common case, when the replication factor is three, HDFS’s placement policy is to put one replica on one
node in the local rack, another on a node in a different (remote) rack, and the last on a different node in
the same remote rack.
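Because the replication factor is a per-file setting, it can also be changed after a file has been written, from the filesystem shell (a sketch; the path is hypothetical):

% hadoop fs -setrep -w 2 /user/tom/quangle.txt
# -w waits until the new target number of replicas has been reached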
Data Disk Failure, Heartbeats and Re-Replication: Each DataNode sends a Heartbeat message to the
NameNode periodically. A network partition can cause a subset of DataNodes to lose connectivity with the
NameNode. The NameNode detects this condition by the absence of a Heartbeat message. The NameNode
marks DataNodes without recent Heartbeats as dead and does not forward any new IO requests to them.
Any data that was registered to a dead DataNode is not available to HDFS any more. The NameNode
constantly tracks which blocks need to be replicated and initiates replication whenever necessary. The
necessity for re-replication may arise due to many reasons: a DataNode may become unavailable, a replica
may become corrupted, a hard disk on a DataNode may fail, or the replication factor of a file may be
increased.
The Command-Line Interface
• There are many other interfaces to HDFS, but the command line is one of the simplest
and, to many developers, the most familiar.
• The command sketched below invokes Hadoop's filesystem shell command fs, which supports a
number of subcommands; in this case, we are running -copyFromLocal. The local file
quangle.txt is copied to the file /user/tom/quangle.txt on the HDFS instance running on
localhost. In fact, we could have omitted the scheme and host of the URI and picked up the
default, hdfs://localhost, as specified in core-site.xml:
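A minimal sketch (the local path input/docs/quangle.txt is illustrative):

% hadoop fs -copyFromLocal input/docs/quangle.txt hdfs://localhost/user/tom/quangle.txt
% hadoop fs -copyFromLocal input/docs/quangle.txt quangle.txt
# the second form relies on the default filesystem from core-site.xml and a relative path,
# which is resolved against the user's HDFS home directory, /user/tom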
• Copy the file back to the local filesystem and check whether it’s the
same:
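For example (assuming md5sum is available locally):

% hadoop fs -copyToLocal quangle.txt quangle.copy.txt
% md5sum input/docs/quangle.txt quangle.copy.txt
# identical digests indicate that the file was unchanged by the round trip through HDFS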
• The MD5 digests are the same, showing that the file survived its trip to HDFS and
is back intact.
• Finally, look at an HDFS file listing: create a directory and see how it is displayed in the
listing:
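A sketch of the listing commands (the directory name books is illustrative):

% hadoop fs -mkdir books
% hadoop fs -ls .
# lists quangle.txt and the new books directory under the user's home directory, /user/tom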