BIG DATA ANALYTICS Notes Unit 1 and 2
NOTES ON BIG DATA AND ANALYTICS (KDS 601)
COURSE OBJECTIVES:
The objectives of this course are
To learn the need for Big Data and the various challenges involved, and to acquire knowledge about different analytical architectures.
To understand the Hadoop architecture and its ecosystems.
To understand the Hadoop ecosystem and acquire knowledge about NoSQL databases.
To acquire knowledge about the NewSQL, MongoDB and Cassandra databases.
To understand the processing of Big Data with advanced architectures like Spark.
UNIT – I
Introduction to Big Data: Types of digital data, history of Big Data innovation, introduction to Big
Data platform, drivers for Big Data, Big Data architecture and characteristics, 5 Vs of Big Data, Big
Data technology components, Big Data importance and applications, Big Data features – security,
compliance, auditing and protection, Big Data privacy and ethics, Big Data Analytics, Challenges of
conventional systems, intelligent data analysis, nature of data, analytic processes and tools, analysis vs
reporting, modern data analytic tools.
UNIT – II
Hadoop: History of Hadoop, Apache Hadoop, the Hadoop Distributed File System, components of
Hadoop, data format, analyzing data with Hadoop, scaling out, Hadoop streaming, Hadoop pipes,
Hadoop ecosystem. Map-Reduce: Map-Reduce framework and basics, how Map Reduce works,
developing a Map Reduce application, unit tests with MRUnit, test data and local tests, anatomy of a
Map Reduce job run, failures, job scheduling, shuffle and sort, task execution, Map Reduce types, input
formats, output formats, Map Reduce features, real-world Map Reduce.
TEXT BOOKS:
Seema Acharya and Subhashini Chellappan, “Big Data and Analytics”, Wiley India Pvt.
Ltd., 2016.
Mike Frampton, “Mastering Apache Spark”, Packt Publishing, 2015.
REFERENCE BOOKS:
Tom White, “Hadoop: The Definitive Guide”, O'Reilly, 4th Edition, 2015.
Mohammed Guller, “Big Data Analytics with Spark”, Apress, 2015
Donald Miner, Adam Shook, “MapReduce Design Patterns”, O'Reilly, 2012
COURSE OUTCOMES:
On successful completion of the course, students will be able to
Demonstrate knowledge of Big Data and Data Analytics, and of the challenges and their solutions in Big Data.
Analyze the Hadoop framework and its ecosystem.
Analyze MapReduce and YARN, and work in a NoSQL environment.
Work in a NewSQL environment with MongoDB and Cassandra.
Apply Big Data processing using MapReduce programming in both the Hadoop and Spark frameworks.
Unit 1
What is Data?
Data is defined as individual facts, such as numbers, words, measurements, observations or
just descriptions of things.
For example, data might include individual prices, weights, addresses, ages, names, temperatures, dates,
or distances.
Characteristics of Data
The following are six key characteristics of data, which are discussed below:
Accuracy
Validity
Reliability
Timeliness
Relevance
Completeness
1. Accuracy
Data should be sufficiently accurate for the intended use and should be captured only
once, although it may have multiple uses.
Data should be captured at the point of activity.
2. Validity
Data should be recorded and used in compliance with relevant requirements, including
the correct application of any rules or definitions.
This will ensure consistency between periods and with similar organizations,
measuring what is intended to be measured.
3. Reliability
Data should reflect stable and consistent data collection processes across collection
points and over time.
Progress toward performance targets should reflect real changes rather than variations
in data collection approaches or methods.
Source data is clearly identified and readily available from manual, automated, or
other systems and records.
4. Timeliness
Data should be captured as quickly as possible after the event or activity and must be
available for the intended use within a reasonable time period.
Data must be available quickly and frequently enough to support information needs
and to influence service or management decisions.
5. Relevance
Data captured should be relevant to the purposes for which it is to be used. This will
require a periodic review of requirements to reflect changing needs.
6. Completeness
Data requirements should be clearly specified based on the information needs of the
organization and data collection processes matched to these requirements.
Types of Digital Data
1. Structured Data:
Structured data refers to any data that resides in a fixed field within a record or
file.
Having a particular Data Model.
Meaningful data.
Data arranged in rows and columns.
Structured data has the advantage of being easily entered, stored, queried and
analysed.
E.g.: relational databases, spreadsheets.
Structured data is often managed using Structured Query Language (SQL); a small query sketch follows this list.
Efficient storage and retrieval: Structured data is typically stored in relational
databases, which are designed to efficiently store and retrieve large amounts of
data. This makes it easy to access and process data quickly.
Enhanced data security: Structured data can be more easily secured than
unstructured or semi-structured data, as access to the data can be controlled
through database security protocols.
Clear data lineage: Structured data typically has a clear lineage or history, making
it easy to track changes and ensure data quality.
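As a small illustration of the point above, the sketch below queries structured, row-and-column data with SQL through Java's standard JDBC API. It is only a sketch: the connection URL, credentials, and the customers table are hypothetical, and a suitable JDBC driver is assumed to be on the classpath.

```java
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;

// Minimal sketch: querying structured (row/column) data with SQL via JDBC.
// The URL, credentials, and the "customers" table are hypothetical.
public class StructuredDataQuery {
    public static void main(String[] args) throws Exception {
        String url = "jdbc:mysql://localhost:3306/sales";   // assumed database
        try (Connection conn = DriverManager.getConnection(url, "user", "password");
             Statement stmt = conn.createStatement();
             // A fixed schema lets the database validate, index, and optimise this query.
             ResultSet rs = stmt.executeQuery(
                     "SELECT name, age FROM customers WHERE age > 30")) {
            while (rs.next()) {
                System.out.println(rs.getString("name") + " - " + rs.getInt("age"));
            }
        }
    }
}
```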
2. Unstructured Data:
Unstructured data cannot readily be classified and fitted into a neat box.
Also called unclassified data.
It does not conform to any data model.
Business rules are not applied.
Indexing is not required.
E.g.: photos and graphic images, videos, streaming instrument data, webpages, PDF
files, PowerPoint presentations, emails, blog entries, wikis and word processing
documents.
3. Semi-structured Data:
Semi-structured data does not conform to a fixed relational schema, but it carries organizational
markers such as tags or key-value pairs that separate elements.
E.g.: XML, JSON, email messages.
Disadvantages of Semi-structured data
The lack of a fixed, rigid schema makes it difficult to store the data.
Interpreting the relationships between data is difficult as there is no separation of
the schema and the data.
Queries are less efficient than those on structured data.
Complexity: Semi-structured data can be more complex to manage and process
than structured data, as it may contain a wide variety of formats, tags, and
metadata. This can make it more difficult to develop and maintain data models and
processing pipelines.
Lack of standardization: Semi-structured data often lacks the standardization and
consistency of structured data, which can make it more difficult to ensure data
quality and accuracy. This can also make it harder to compare and analyze data
across different sources.
Reduced performance: Processing semi-structured data can be more resource-
intensive than processing structured data, as it often requires more complex
parsing and indexing operations. This can lead to reduced performance and longer
processing times.
Limited tooling: While there are many tools and technologies available for
working with structured data, there are fewer options for working with semi-
structured data. This can make it more challenging to find the right tools and
technologies for a particular use case.
Data security: Semi-structured data can be more difficult to secure than structured
data, as it may contain sensitive information in unstructured or less- visible parts
of the data. This can make it more challenging to identify and protect sensitive
information from unauthorized access.
Overall, while semi-structured data offers many advantages in terms of flexibility and
scalability, it also presents some challenges and limitations that need to be carefully
considered when designing and implementing data processing and analysis pipelines.
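To make the parsing overhead concrete, here is a minimal sketch that reads one semi-structured JSON record with the Jackson library (an assumed choice; any JSON parser would do). Because there is no fixed schema, the code has to probe the structure at runtime, which is exactly the extra work described above.

```java
import com.fasterxml.jackson.databind.JsonNode;
import com.fasterxml.jackson.databind.ObjectMapper;

// Minimal sketch: handling a semi-structured JSON record.
// Unlike a relational row, fields such as "phone" may simply be absent,
// so the code must inspect the structure at runtime. The sample record is invented.
public class SemiStructuredExample {
    public static void main(String[] args) throws Exception {
        String json = "{\"name\":\"Asha\",\"email\":\"asha@example.com\","
                    + "\"tags\":[\"premium\",\"newsletter\"]}";
        JsonNode record = new ObjectMapper().readTree(json);

        System.out.println("name : " + record.get("name").asText());
        // Optional field: must be checked, because no schema guarantees it.
        System.out.println("phone: " + (record.has("phone")
                ? record.get("phone").asText() : "<not present>"));
        record.get("tags").forEach(tag -> System.out.println("tag  : " + tag.asText()));
    }
}
```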
Big Data
Big Data is a collection of data that is huge in volume and growing exponentially with time. Its size
and complexity are so large that no traditional data management tool can store or process it
efficiently. Some examples of Big Data follow.
New York Stock Exchange: The New York Stock Exchange is an example of Big
Data that generates about one terabyte of new trade data per day.
Social Media: Statistics show that 500+ terabytes of new data are ingested into
the databases of the social media site Facebook every day. This data is mainly generated
from photo and video uploads, message exchanges, comments, etc.
Jet engine: A single jet engine can generate 10+ terabytes of data in 30 minutes of
flight time. With many thousands of flights per day, data generation reaches
many petabytes.
Characteristics of Big Data: the 5 Vs
1. Volume:
The name Big Data itself is related to an enormous size.
Big Data is a vast ‘volume’ of data generated from many sources daily, such as
business processes, machines, social media platforms, networks, human
interactions, and many more.
2. Variety:
Big Data can be structured, unstructured, and semi-structured that are being
collected from different sources.
In the past, data was collected only from databases and spreadsheets, but these
days data comes in many forms, such as PDFs, emails, audio, social media posts,
photos, videos, etc.
3. Veracity
Veracity means how reliable the data is. It involves the many ways of filtering or translating
the data.
Veracity is about being able to handle and manage data of uncertain quality efficiently, which
is essential when Big Data is used for business development.
4. Value
Value is an essential characteristic of big data. It is not just any data that we process or
store that matters; it is the valuable and reliable data that we store, process, and analyze.
5. Velocity
Velocity plays an important role compared to the other characteristics. Velocity is the speed at
which data is created, often in real time. It covers the rate at which incoming data sets
arrive, their rate of change, and bursts of activity. A primary aspect of Big Data is to
provide the demanded data rapidly.
Big Data velocity deals with the speed at which data flows from sources such as application
logs, business processes, networks, social media sites, sensors, mobile devices,
etc.
Big Data has become a critical asset for organizations, with 93% of companies rating Big Data initiatives as
“extremely important.” Leveraging Big Data analytics helps organizations unlock strategic value and maximize
their potential.
Big Data and Customer-Centric Strategies
Companies use Big Data to identify customer preferences, best customers, and reasons behind product
choices.
Machine Learning and predictive analytics help in designing market strategies tailored to customer needs.
Businesses can continuously improve and update marketing strategies to be more responsive to consumer
demands.
By leveraging Big Data, companies can gain deep insights into customer behavior, optimize operations, and stay
ahead in a competitive market.
The significance of Big Data does not depend on the volume of data a company possesses but on how effectively it
is utilized. Companies that efficiently analyze and leverage their data experience faster growth and greater
success.
1. Cost Savings
Big Data tools like Apache Hadoop and Spark help businesses store and manage vast amounts of data cost-
effectively.
These tools assist in identifying more efficient business processes, reducing operational expenses.
2. Time-Saving
Real-time in-memory analytics enables companies to collect and process data instantly.
Tools like Hadoop facilitate immediate analysis, allowing for quick and informed decision-making.
3. Market Understanding
Big Data analytics helps businesses gain insights into market conditions.
Customer purchasing behavior analysis allows companies to identify high-demand products and adjust
production accordingly.
This ensures a competitive edge over rivals.
Big Data analytics optimizes business operations by aligning with customer expectations.
Helps in refining product lines and designing powerful marketing campaigns.
By leveraging Big Data, businesses can optimize their strategies, improve customer satisfaction, and drive
innovation, ensuring long-term success.
Implementing a Big Data solution comes with several challenges. Below are some of the most common issues
businesses face and their potential solutions.
1. Managing Data Growth and Storage
Challenge:
o Companies collect increasing amounts of data daily, making traditional data storage solutions
insufficient.
o 43% of IT decision-makers worry about infrastructure overload.
Solution:
o Cloud migration: Businesses are shifting to cloud-based storage solutions that scale dynamically.
o Big Data tools: Software like Apache Hadoop and Spark enable efficient data storage and quick
access.
2. Integrating Data from Multiple Sources
Challenge:
o Data comes from various sources (web analytics, social media, CRM, emails, etc.), making
integration difficult.
o Different structures require standardization for effective analysis.
Solution:
o ETL (Extract, Transform, Load) software: Helps map disparate data sources into a common
structure (a minimal sketch follows this list).
o Business intelligence tools: Combine multiple data sets for accurate reporting and insights.
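To illustrate the ETL idea mentioned above, the sketch below extracts records from two hypothetical sources with different layouts, transforms them into one common structure, and "loads" them by printing. The source formats and field orders are invented; real ETL tools perform the same kind of mapping at much larger scale.

```java
import java.util.ArrayList;
import java.util.List;

// Minimal ETL sketch: map two differently structured sources into one common record.
// The source layouts ("name,email" vs "email|name") are hypothetical.
public class SimpleEtl {
    record Customer(String name, String email) {}   // common target structure

    public static void main(String[] args) {
        List<String> crmRows = List.of("Asha,asha@example.com");   // format: name,email
        List<String> webRows = List.of("ravi@example.com|Ravi");   // format: email|name

        List<Customer> unified = new ArrayList<>();
        // Transform: each source gets its own mapping into the common structure.
        for (String row : crmRows) {
            String[] f = row.split(",");
            unified.add(new Customer(f[0], f[1]));
        }
        for (String row : webRows) {
            String[] f = row.split("\\|");
            unified.add(new Customer(f[1], f[0]));
        }
        // Load: here simply printed; a real pipeline would write to a warehouse.
        unified.forEach(c -> System.out.println(c.name() + " <" + c.email() + ">"));
    }
}
```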
3. Ensuring Data Quality
Challenge:
o Incomplete, corrupted, or inaccurate data leads to flawed analytics and incorrect predictions.
o Increasing data volume makes quality control difficult.
Solution:
o Data governance tools: Organize, manage, and secure data while validating sources.
o Data quality software: Cleans up corrupted and incomplete data before processing.
4. Keeping Data Secure
Challenge:
o Sensitive data (company strategies, financial records, customer details) makes businesses targets for
hackers.
o Breaches can lead to financial loss, identity theft, and reputational damage.
Solution:
o Encryption: Ensures data is unusable without a decryption key (a small sketch follows this list).
o Identity & access controls: Restrict data access to authorized users.
o Endpoint protection & real-time monitoring: Prevents malware infections and detects threats
instantly.
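The snippet below is a minimal sketch of the encryption idea using the standard javax.crypto API with AES-GCM. The key is generated in place here and the sample plaintext is invented; a real deployment would use a managed key store and a carefully designed key-management process.

```java
import javax.crypto.Cipher;
import javax.crypto.KeyGenerator;
import javax.crypto.SecretKey;
import javax.crypto.spec.GCMParameterSpec;
import java.nio.charset.StandardCharsets;
import java.security.SecureRandom;

// Minimal sketch: AES-GCM encryption of a sensitive record.
// Without the key, the stored ciphertext is unusable to an attacker.
public class EncryptAtRest {
    public static void main(String[] args) throws Exception {
        KeyGenerator keyGen = KeyGenerator.getInstance("AES");
        keyGen.init(256);                       // 256-bit AES key
        SecretKey key = keyGen.generateKey();   // in practice: fetched from a key store

        byte[] iv = new byte[12];               // fresh random IV per message
        new SecureRandom().nextBytes(iv);

        Cipher cipher = Cipher.getInstance("AES/GCM/NoPadding");
        cipher.init(Cipher.ENCRYPT_MODE, key, new GCMParameterSpec(128, iv));
        byte[] ciphertext = cipher.doFinal(
                "customer: Asha, card: 4111-xxxx".getBytes(StandardCharsets.UTF_8));

        System.out.println("ciphertext bytes: " + ciphertext.length);
    }
}
```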
5. Choosing the Right Big Data Tools
Challenge:
o The wide range of Big Data tools makes it difficult to choose the right one.
o Overlapping functionalities can lead to inefficiencies.
Solution:
o Consulting Big Data professionals: Experts assess business needs and recommend appropriate
tools.
o Enterprise data streaming & ETL solutions: Aggregate data from multiple sources for seamless
processing.
o Dynamic cloud configuration: Ensures the system scales efficiently with minimal maintenance.
6. Scaling Systems and Costs Efficiently
Challenge:
o Without proper planning, companies may store and process irrelevant or excessive data.
o Uncontrolled expansion leads to unnecessary costs.
Solution:
o Goal-oriented planning: Define objectives before starting a data project.
o Schema design: Structure data storage efficiently.
o Data retention policies: Purge outdated and unnecessary data periodically.
7. Shortage of Skilled Professionals
Challenge:
o Many businesses lack trained personnel for handling Big Data.
o Untrained employees can cause workflow disruptions and data errors.
Solution:
o Hiring specialists: Employ full-time data experts or consultants to guide teams.
o Training programs: Upskill existing employees in Big Data management.
o Self-service analytics tools: Enable non-data professionals to work with analytics solutions.
8. Organizational Resistance
Challenge:
o Larger organizations often resist adopting Big Data due to cost concerns or reluctance to change.
o Leaders may not see immediate value in analytics and machine learning.
Solution:
o Pilot projects: Start small and use successful results to demonstrate value.
o Leadership changes: Appoint data-driven leaders to guide transformation.
Conclusion
Big Data presents various challenges, from infrastructure scalability to security concerns. However, with the right
strategies—cloud adoption, data governance, skilled professionals, and efficient planning—businesses can harness
Big Data’s full potential for growth and innovation.
Unit 2
Hadoop is designed to efficiently process large-scale data across distributed clusters. The core design principles of
Hadoop are:
Scalability, as it scales out simply by adding more nodes to the cluster.
Fault tolerance, as data is replicated across nodes so that the failure of one node does not cause data loss.
Data locality, as computation is moved to where the data is stored rather than moving the data.
Cost-effective, as it runs on commodity hardware.
These principles ensure Hadoop’s efficiency, reliability, and scalability in processing large datasets across
distributed environments.
Components of Hadoop
There are three core components of Hadoop as mentioned earlier. They are HDFS,
MapReduce, and YARN. These together form the Hadoop framework architecture.
1) HDFS (Hadoop Distributed File System):
HDFS is the storage unit of Hadoop; data is split into blocks and distributed across the
nodes of the cluster.
Features:
The storage is distributed to handle a large data pool
Distribution increases data security
It is fault-tolerant: replicated copies of a block on other nodes can take over if one block or node fails
2) MapReduce:
The MapReduce framework is the processing unit. All data is distributed and
processed parallelly.
There is a MasterNode that distributes data amongst SlaveNodes. The SlaveNodes do
the processing and send it back to the MasterNode.
Features:
Consists of two phases, Map Phase and Reduce Phase.
Processes big data faster, with multiple nodes working in parallel
3) YARN (Yet Another Resource Negotiator):
It is the resource management unit of the Hadoop framework.
The data stored in HDFS can be processed with the help of YARN using data-processing
engines such as interactive, graph, and batch processing. It can be used for any sort of data analysis.
Features:
It acts as an operating system for the data stored on HDFS, handling resource management
It helps to schedule the tasks to avoid overloading any system
Hadoop Architecture
o The Hadoop architecture is a package of the file system, MapReduce engine and the
HDFS (Hadoop Distributed File System). The MapReduce engine can be
MapReduce/MR1 or YARN/MR2.
o A Hadoop cluster consists of a single master and multiple slave nodes. The master node
includes Job Tracker, Task Tracker, NameNode, and DataNode whereas the slave node
includes DataNode and TaskTracker.
Components of HDFS
a) NameNode
It is a single master server that exists in the HDFS cluster.
As it is a single node, it may become a single point of failure.
It manages the file system namespace by executing operations such as opening,
renaming and closing files.
It simplifies the architecture of the system.
b) DataNode
The HDFS cluster contains multiple DataNodes.
Each DataNode contains multiple data blocks.
These data blocks are used to store data.
It is the responsibility of the DataNode to serve read and write requests from the file
system's clients.
It performs block creation, deletion, and replication upon instruction from the
NameNode.
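As a small illustration of how a client works with the NameNode (for metadata) and the DataNodes (for block data), the sketch below writes and reads a small file through the HDFS Java API. The fs.defaultFS address and the file path are assumptions about a typical cluster.

```java
import java.io.BufferedReader;
import java.io.InputStreamReader;
import java.nio.charset.StandardCharsets;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

// Minimal sketch: writing and reading a file in HDFS.
// The client asks the NameNode where blocks live; the actual bytes are
// streamed to and from DataNodes. The URI and paths are hypothetical.
public class HdfsReadWrite {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        conf.set("fs.defaultFS", "hdfs://namenode-host:9000");   // assumed address

        try (FileSystem fs = FileSystem.get(conf)) {
            Path file = new Path("/user/demo/hello.txt");

            // Write: the NameNode records metadata, DataNodes store the blocks.
            try (FSDataOutputStream out = fs.create(file, true)) {
                out.write("hello hdfs".getBytes(StandardCharsets.UTF_8));
            }

            // Read the file back through the same FileSystem abstraction.
            try (BufferedReader in = new BufferedReader(
                    new InputStreamReader(fs.open(file), StandardCharsets.UTF_8))) {
                System.out.println(in.readLine());
            }
        }
    }
}
```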
c) Job Tracker
The role of the Job Tracker is to accept MapReduce jobs from clients and
process the data by using the NameNode.
In response, NameNode provides metadata to Job Tracker.
d) Task Tracker
It works as a slave node for Job Tracker.
It receives tasks and code from the Job Tracker and applies that code to the file. This
process can also be called a Mapper.
MapReduce Layer
MapReduce processing comes into play when the client application submits a
MapReduce job to the Job Tracker.
In response, the Job Tracker sends the request to the appropriate Task Trackers.
Sometimes the TaskTracker fails or times out. In such a case, that part of the job is
rescheduled.
Hadoop 1 vs Hadoop 2
Components:
Hadoop 1: HDFS, MapReduce (MR1)
Hadoop 2: HDFS, YARN / MapReduce 2 (MR2)
Daemons:
Hadoop 1          Hadoop 2
Namenode          Namenode
Datanode          Datanode
Job Tracker       Resource Manager
Task Tracker      Node Manager
Working:
In Hadoop 1, HDFS is used for storage and, on top of it, MapReduce works as both
resource management and data processing. Because of this double workload on MapReduce,
performance is affected.
In Hadoop 2, HDFS is again used for storage and, on top of HDFS, YARN works as the
resource manager. It allocates the resources and keeps everything running.
Limitations:
In Hadoop 1, the NameNode is a single point of failure: if it goes down, the cluster becomes
unavailable. Hadoop 2 supports high availability through combinations of active-standby nodes.
Thus Hadoop 2 eliminates the problem of a single point of failure.
Ecosystem
Oozie is basically a workflow scheduler. It decides the particular times at which jobs
execute, according to their dependencies.
Pig, Hive and Mahout are data processing tools that work on top of Hadoop.
Sqoop is used to import and export structured data. You can directly import and
export data between HDFS and SQL (relational) databases.
Flume is used to import and export unstructured and streaming data.
Hadoop Ecosystem
The Hadoop Ecosystem is a platform that provides various services to solve big data problems. It includes
Apache projects and various commercial tools and solutions. The four major elements of Hadoop are HDFS, MapReduce, YARN, and Hadoop Common.
Most of the tools in the ecosystem supplement or support these major elements, working collectively to provide
services such as data absorption, analysis, storage, and maintenance.
Components of the Hadoop Ecosystem
Core Components
1. HDFS (Hadoop Distributed File System)
o The primary storage component of Hadoop; it stores data as blocks distributed across the NameNode and DataNodes.
2. YARN (Yet Another Resource Negotiator)
o Manages cluster resources and schedules jobs.
3. MapReduce
o The programming model for distributed, parallel processing of large datasets.
Supporting Components
4. Apache Spark
o In-memory data processing for real-time analytics.
o Handles batch processing, interactive/iterative processing, and graph conversions.
5. PIG & HIVE
o PIG: Developed by Yahoo, uses Pig Latin (query-based language similar to SQL) for structuring
data flow and analysis.
o HIVE: Uses HQL (Hive Query Language) for reading/writing large datasets, supporting real-time
and batch processing.
6. Apache HBase
o A NoSQL database that supports structured, semi-structured, and unstructured data.
o Inspired by Google's BigTable, designed for real-time data retrieval.
7. Mahout & Spark MLlib
o Libraries for machine learning algorithms.
o Provide functionalities like collaborative filtering, clustering, and classification.
8. Solr & Lucene
o Lucene: Java-based search library supporting spell-checking and indexing.
o Solr: Uses Lucene for searching and indexing large datasets.
9. Zookeeper
o Manages coordination, synchronization, inter-component communication, and clustering.
10. Oozie
o Job scheduling framework.
o Supports:
Oozie Workflows (sequential job execution)
Oozie Coordinator Jobs (triggered by data availability or external events)
How MapReduce Works
1. Execution Order
o As the name suggests, the Reducer phase takes place after the Mapper phase is completed.
2. Map Phase
o The Map job is the first step in the process.
o It reads and processes a block of data.
o Produces key-value pairs as intermediate outputs.
3. Intermediate Data Transfer
o The output of a Mapper (key-value pairs) is passed as input to the Reducer.
o The Reducer receives key-value pairs from multiple Map jobs.
4. Reduce Phase
o The Reducer aggregates intermediate key-value pairs into a smaller set of tuples.
o This results in the final output of the MapReduce process.
1. Mapper Class
The first stage of processing is the Mapper class. It takes an Input Split (a chunk of the input data)
and uses a RecordReader to convert every record of that split into a key-value pair for the map function.
2. Reducer Class
The intermediate output generated from the Mapper is fed to the Reducer.
The Reducer processes the data and generates the final output, which is then saved in HDFS.
3. Driver Class
The Driver class configures the job (its name, the Mapper and Reducer classes, and the input and
output paths) and submits it to the cluster, as shown in the sketch below.
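To tie the Mapper, Reducer, and Driver classes together, here is a sketch of the classic word-count job written against the Hadoop MapReduce Java API. The input and output paths come from the command line and are assumptions about how the job would be run.

```java
import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

// Classic word-count sketch: the Mapper emits (word, 1) pairs,
// the Reducer sums the counts per word, and the Driver wires the job together.
public class WordCount {

    public static class TokenizerMapper
            extends Mapper<Object, Text, Text, IntWritable> {
        private static final IntWritable ONE = new IntWritable(1);
        private final Text word = new Text();

        @Override
        protected void map(Object key, Text value, Context context)
                throws IOException, InterruptedException {
            StringTokenizer itr = new StringTokenizer(value.toString());
            while (itr.hasMoreTokens()) {
                word.set(itr.nextToken());
                context.write(word, ONE);          // intermediate key-value pair
            }
        }
    }

    public static class IntSumReducer
            extends Reducer<Text, IntWritable, Text, IntWritable> {
        private final IntWritable result = new IntWritable();

        @Override
        protected void reduce(Text key, Iterable<IntWritable> values, Context context)
                throws IOException, InterruptedException {
            int sum = 0;
            for (IntWritable val : values) {
                sum += val.get();                   // aggregate counts for this word
            }
            result.set(sum);
            context.write(key, result);             // final output written to HDFS
        }
    }

    // Driver: configures the job and submits it to the cluster.
    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "word count");
        job.setJarByClass(WordCount.class);
        job.setMapperClass(TokenizerMapper.class);
        job.setReducerClass(IntSumReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));    // assumed input path
        FileOutputFormat.setOutputPath(job, new Path(args[1]));  // assumed output path
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}
```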
Introduction to YARN
Yet Another Resource Negotiator (YARN) takes Hadoop beyond Java MapReduce programming alone,
making it interactive and allowing other applications like HBase, Spark, etc. to run on it.
Components of YARN
1. Client
o Used for submitting MapReduce jobs.
2. Resource Manager
o Manages the use of resources across the cluster.
3. Node Manager
o Launches and monitors compute containers on machines in the cluster.
4. MapReduce Application Master
o Coordinates the tasks running the MapReduce job.
o Both the Application Master and MapReduce tasks run in containers.
o These containers are scheduled by the Resource Manager and managed by the Node Managers.
However, in Hadoop 2.0, the Resource Manager and Node Manager replaced the JobTracker and
TaskTracker to overcome their shortcomings.
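As a small sketch of what this change means for a client, the snippet below points a MapReduce job at YARN through standard Hadoop configuration keys (mapreduce.framework.name, yarn.resourcemanager.address, fs.defaultFS); the host names shown are assumptions about a typical cluster.

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.mapreduce.Job;

// Sketch: pointing a MapReduce client at YARN (Hadoop 2) instead of the
// classic JobTracker. Host names are hypothetical.
public class YarnClientConfig {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        conf.set("mapreduce.framework.name", "yarn");                // run on YARN
        conf.set("yarn.resourcemanager.address", "rm-host:8032");    // ResourceManager
        conf.set("fs.defaultFS", "hdfs://namenode-host:9000");       // HDFS

        Job job = Job.getInstance(conf, "yarn-submission-sketch");
        // ... set the Mapper, Reducer, and input/output paths as in the word-count driver ...
        System.out.println("Framework: " + conf.get("mapreduce.framework.name"));
    }
}
```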
The Apache YARN framework consists of a master daemon known as the Resource Manager, a slave
daemon called the Node Manager (one per slave node), and an Application Master (one per
application).
The Resource Manager has two main components:
o Scheduler
o Application Manager
a) Scheduler
The scheduler is responsible for allocating the resources to the running application.
The scheduler is a pure scheduler: it performs no monitoring or tracking of the application,
and it gives no guarantees about restarting failed tasks, whether they fail because of
application errors or hardware failures.
b) Application Manager
One Application Master runs per application. It negotiates resources from the Resource
Manager and works with the Node Manager. It manages the application life cycle.
The AM acquires containers from the RM’s Scheduler before contacting the
corresponding NMs to start the application’s individual tasks.