Big Data Analytics - notes
UNIT I: What is big data, why big data, convergence of key trends, unstructured data, industry
examples of big data, web analytics, big data and marketing, fraud and big data, risk and big data,
credit risk management, big data and algorithmic trading, big data and healthcare, big data in
medicine, advertising and big data, big data technologies, introduction to Hadoop, open source
technologies, cloud and big data, mobile business intelligence, Crowd sourcing analytics, inter and
trans firewall analytics.
UNIT II: Introduction to NoSQL, aggregate data models, aggregates, key-value and document data
models, relationships, graph databases, schema less databases, materialized views, distribution
models, sharding, master-slave replication, peer-to-peer replication, sharding and replication,
consistency, relaxing consistency, version stamps, Working with Cassandra, Table creation, loading
and reading data.
UNIT III: Data formats, analyzing data with Hadoop, scaling out, Architecture of Hadoop distributed
file system (HDFS), fault tolerance with data replication, High availability, Data locality, Map Reduce
Architecture, Process flow, Java interface, data flow, Hadoop I/O, data integrity, compression,
serialization. Introduction to Hive, data types and file formats, HiveQL data definition, HiveQL data
manipulation, Logical joins, Window functions, Optimization, Table partitioning, Bucketing, Indexing,
Join strategies.
UNIT IV: Apache spark- Advantages over Hadoop, lazy evaluation, In memory processing, DAG, Spark
context, Spark Session, RDD, Transformations- Narrow and Wide, Actions, Data frames, RDD to Data
frames, Catalyst optimizer, Data Frame Transformations, Working with Dates and Timestamps,
Working with Nulls in Data, Working with Complex Types, Working with JSON, Grouping, Window
Functions, Joins, Data Sources, Broadcast Variables, Accumulators, Deploying Spark- On-Premises
Cluster Deployments, Cluster Managers- Standalone Mode, Spark on YARN, Spark Logs, The Spark UI-
Spark UI History Server, Debugging and Spark First Aid
UNIT V: Spark-Performance Tuning, Stream Processing Fundamentals, Event-Time and Stateful
Processing - Event Time, Stateful Processing, Windows on Event Time- Tumbling Windows, Handling
Late Data with Watermarks, Dropping Duplicates in a Stream, Structured Streaming Basics - Core
1. a Define big data and explain how it differs from traditional data sets. Discuss the convergence of
key trends that have led to the rise of big data.
b Describe the role of unstructured data in big data analytics. Provide an example of how
unstructured data is used in one industry.
OR
2. a Explain how big data technologies like Hadoop have revolutionized web analytics. Provide a
specific example of its application.
b. Discuss the impact of big data in the healthcare sector, particularly in terms of patient care and
medical research.
UNIT-II
3. a Describe the key differences between NoSQL and traditional relational database systems. Why is
NoSQL preferred for big data applications?
b. Explain the concept of aggregates in NoSQL databases. How do they affect data modeling and
querying?
OR
4. a Discuss the architecture and data model of Cassandra. How does it differ from other NoSQL
databases?
b. Describe the process of creating and managing tables in Cassandra. Include an example of table
creation and data manipulation.
UNIT-III
5. a Explain the architecture of the Hadoop Distributed File System (HDFS) and its role in big data
analytics.
b Discuss the MapReduce architecture and its process flow. How does it handle large datasets?
OR
6. a Explain how Hive facilitates big data analytics. Discuss its data types, file formats, and HiveQL.
b Describe the concepts of table partitioning and bucketing in Hive. How do these features
contribute to query optimization?
UNIT-IV
7. a Compare the advantages of Apache Spark over traditional Hadoop MapReduce. Why is Spark
considered more efficient for certain tasks?
b Explain the concept of Resilient Distributed Datasets (RDDs) in Spark. Discuss their transformations
and actions.
OR
8. a Describe how Spark handles data frames and complex data types. Include an example of working
with JSON data in Spark.
b. Discuss the deployment of Spark in different environments. Compare its performance in
Standalone Mode versus Spark on YARN.
UNIT-V
9. a Explain the fundamentals of stream processing in Spark. How does it handle real-time data
analytics?
b. Discuss the concepts of event-time processing and stateful processing in Spark Streaming. Include
an example of tumbling windows.
OR
10. a Describe the core concepts of structured streaming in Spark. How is it used in real-world data
processing scenarios?
b. Explain the techniques involved in performance tuning of Spark applications. How does one
optimize a Spark application for better performance?
ANSWERS:
1a. Definition of Big Data
Big Data refers to extremely large and complex datasets that are difficult to process, analyze, and
manage using traditional data management tools and techniques. Big data is characterized by the
3Vs - Volume, Velocity, and Variety - often extended to the 5Vs by adding Veracity and Value.
1. Volume:
o Refers to the sheer size of data, often measured in terabytes, petabytes, or even
exabytes.
o Example: Social media platforms like Facebook generate billions of posts, images,
and videos daily.
2. Velocity:
o Refers to the speed at which data is generated, collected, and processed, often in real time.
o Example: Stock exchanges and IoT sensors emit continuous streams of data.
3. Variety:
o Refers to the diverse types of data, including structured (e.g., tables), semi-
structured (e.g., JSON files), and unstructured data (e.g., images, videos, emails).
4. Veracity:
o Refers to the quality, accuracy, and trustworthiness of data, which varies widely across sources.
5. Value:
o Refers to the insights and actionable information derived from analyzing big data.
Aspect | Big Data | Traditional Data Sets
Processing Tools | Big data tools (Hadoop, Spark) | Relational databases (SQL, Oracle)
Examples | Social media, IoT, genomics data | Payroll, inventory management data
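To make the Variety characteristic concrete, here is a minimal Python sketch (the field names and values are invented for illustration) contrasting a fixed-column record with a semi-structured JSON record:

```python
import json

# A structured record: fixed columns, like one row of a relational table.
structured_row = ("101", "Alice", "2024-01-15")

# A semi-structured record (JSON): fields can vary per record and
# nested values are allowed -- no fixed schema is enforced.
raw = '{"user": "alice", "posts": 42, "tags": ["data", "hadoop"]}'
record = json.loads(raw)

# Fields are accessed by name and may be absent in other records,
# so a default is supplied instead of assuming the column exists.
tag_count = len(record.get("tags", []))
print(record["user"], tag_count)  # alice 2
```

A relational table would require every row to carry the same columns; here each JSON document can carry a different set of fields.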
The rise of Big Data is attributed to the convergence of several technological, social, and economic
trends:
1. Explosion of data sources: The proliferation of the internet, smartphones, and IoT devices has
drastically increased the amount of data being generated.
o Example: Social media platforms like Twitter and Instagram generate millions of posts every
second.
2. Falling storage costs: Cheaper storage (e.g., cloud storage) has made it possible to store vast
amounts of data economically.
3. Growth of IoT: IoT devices generate a continuous stream of real-time data, contributing
significantly to big data.
4. Rise of social media: Social media platforms have become a major source of user-generated data
in the form of posts, images, and videos.
o Example: Companies analyze social media data for sentiment analysis and brand monitoring.
5. Open-source big data technologies: The development of open-source frameworks like Hadoop,
Apache Spark, and NoSQL databases has made big data processing accessible and affordable.
o Example: Hadoop's distributed file system (HDFS) allows processing of massive datasets.
6. Demand for real-time analytics: Organizations increasingly need to act on data as it arrives.
o Example: Financial markets analyze real-time trading data to make instant decisions.
7. Advances in AI and ML: The rise of AI and ML requires vast amounts of data for training and
testing models, fueling the growth of big data.
1b. Role of Unstructured Data in Big Data Analytics
Unstructured data is data that does not conform to a predefined schema or data model. Its key
characteristics:
1. No Fixed Schema: Does not fit into the rows and columns of a relational table.
2. High Volume: Generated in massive quantities, often from social media, IoT devices, or user-
generated content.
3. Variety: Comes in diverse formats, such as text, images, videos, audio files, and sensor data.
4. Complex Processing: Requires advanced technologies like machine learning, natural language
processing (NLP), and computer vision for analysis.
Role in big data analytics:
1. Enhancing Decision-Making: Analyzing text, images, and logs surfaces insights that structured
data alone cannot provide.
2. Real-Time Insights: Streams of unstructured data, such as clickstreams and sensor feeds, can be
analyzed as they arrive.
3. Predictive Analytics:
o Example: Predicting customer churn by analyzing emails, call logs, and transaction
history.
4. Personalized Experiences:
o Example: Streaming platforms like Netflix use viewing history (unstructured data) to
recommend shows.
5. Trend Discovery:
o Example: Retailers analyze unstructured data like social media mentions to identify
trends and launch new products.
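The retail example above can be sketched as a naive keyword-based sentiment scorer in Python (the sample mentions and word lists are invented; real systems use trained NLP models rather than keyword counting):

```python
# Hypothetical social-media mentions of a product (invented sample data).
mentions = [
    "love the new product, great quality",
    "terrible delivery, very disappointed",
    "great price and great service",
]

POSITIVE = {"love", "great"}
NEGATIVE = {"terrible", "disappointed", "bad"}

def score(text):
    """Naive sentiment score: positive word count minus negative word count."""
    words = [w.strip(",.") for w in text.lower().split()]
    return sum(w in POSITIVE for w in words) - sum(w in NEGATIVE for w in words)

scores = [score(m) for m in mentions]
print(scores)  # [2, -2, 2]
```

Even this crude scoring shows how free-form text can be turned into a signal a retailer could aggregate across millions of mentions.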
In the healthcare industry, unstructured data plays a critical role in improving patient care and
operational efficiency.
Wearable Devices: Data from fitness trackers like heart rate, sleep patterns, and activity
levels.
Challenge: Medical images like X-rays, CT scans, and MRIs are unstructured data, and
analyzing them manually is time-consuming and prone to errors.
Solution:
o Machine Learning and AI: Advanced algorithms analyze medical images to identify
abnormalities such as tumors or fractures with high accuracy.
Impact:
o Faster and more accurate diagnoses.
2a.
How Big Data Technologies Like Hadoop Have Revolutionized Web Analytics
Big Data technologies such as Hadoop have fundamentally transformed web analytics by enabling
the storage, processing, and analysis of vast amounts of data at high speed and low cost. Traditional
data management systems struggled with the scale, speed, and complexity of modern web data, but
Hadoop introduced a scalable, fault-tolerant, and distributed framework that addresses these
challenges.
1. Distributed Processing:
o Hadoop divides large datasets into smaller chunks and processes them across
multiple nodes simultaneously using the MapReduce framework.
o This enables faster analysis of vast amounts of web traffic and user behavior data.
2. Scalability:
o New nodes can be added to a Hadoop cluster to handle growing data volumes
without redesigning the system.
3. Cost-Effectiveness:
o Hadoop runs on inexpensive commodity hardware, making large-scale analytics
affordable.
4. Data Variety:
o Hadoop stores and processes structured, semi-structured, and unstructured web
data such as logs, clickstreams, and media.
5. Real-Time Insights:
o Ecosystem tools like Apache Spark allow near-real-time processing of web data,
enabling organizations to respond to user behavior dynamically.
Impact on web analytics:
1. Deeper Behavior Analysis:
o Complete clickstream and session data can be analyzed rather than small samples.
2. Enhanced Personalization:
o Behavioral data processed in Hadoop drives tailored content and product
recommendations.
3. SEO Optimization:
o Analyzing search engine traffic data, keywords, and conversion rates becomes more
efficient with Hadoop, helping businesses improve their SEO strategies.
4. Fraud and Anomaly Detection:
o Hadoop processes web logs and server data to detect anomalies, such as unusual
login patterns or high-volume traffic spikes, which might indicate security threats.
5. Social Media Integration:
o Hadoop integrates social media data with web analytics, providing insights into
customer sentiment and the impact of campaigns.
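The distributed-processing idea behind MapReduce can be sketched in plain Python (the log lines are invented; in a real Hadoop job each map task would run in parallel on a separate node, with the framework handling the shuffle):

```python
from collections import defaultdict
from itertools import chain

# Hypothetical web-log lines split into two "chunks", the way HDFS
# splits a large file across nodes (sample data is invented).
chunks = [
    ["user1 viewed home", "user2 viewed cart"],
    ["user1 viewed cart", "user3 viewed home"],
]

def map_phase(lines):
    """Map: emit a (page, 1) pair for every page view in one chunk."""
    return [(line.split()[-1], 1) for line in lines]

def reduce_phase(pairs):
    """Reduce: sum the counts per key, as a single reducer would."""
    totals = defaultdict(int)
    for key, count in pairs:
        totals[key] += count
    return dict(totals)

# Each chunk is mapped independently, then all pairs are shuffled
# to the reducer grouped by key.
mapped = chain.from_iterable(map_phase(c) for c in chunks)
page_views = reduce_phase(mapped)
print(page_views)  # {'home': 2, 'cart': 2}
```

The same pattern scales because no map task needs to see any other chunk; only the small intermediate pairs travel across the network.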
Example: Netflix
Netflix, a leader in streaming services, uses Hadoop extensively to optimize its web and app
analytics.
Application of Hadoop:
1. User Behavior Tracking:
o Netflix tracks what users watch, search for, pause, and skip. This generates enormous
amounts of data.
2. Predictive Analytics:
o Netflix predicts trends in viewership using Hadoop. For example, if users in a specific
region are watching a certain genre, it suggests content in that genre to similar users
in other regions.
3. Content Optimization:
o Hadoop analyzes user feedback and reviews to determine which content is popular
or needs improvement.
4. Streaming Quality:
o Netflix uses Hadoop to analyze server logs and network performance, ensuring
seamless streaming by minimizing buffering and latency.
Impact:
o Highly personalized recommendations and reliable streaming, which improve viewer
engagement and retention.
2b.
Big data has significantly transformed the healthcare sector, especially in patient care and medical
research. Here's how it impacts both areas:
1. Patient Care
Personalized Medicine: Big data allows for the analysis of genetic, environmental, and
lifestyle factors, enabling healthcare providers to deliver more tailored treatment plans. By
using patient-specific data, treatments can be optimized for effectiveness, reducing the trial-
and-error approach in prescribing medications.
Predictive Analytics: With access to vast amounts of patient data, healthcare systems can
predict patient outcomes more accurately. This helps in identifying at-risk patients before
they develop severe conditions. For example, predictive models can anticipate complications
in patients with chronic diseases like diabetes or heart disease, prompting early
interventions.
Improved Diagnostics: Big data helps in the identification of patterns within medical images,
lab results, and patient histories that may not be easily spotted by human clinicians. This has
led to more accurate diagnostics, especially in areas like radiology, where machine learning
algorithms can assist in detecting abnormalities like tumors or fractures.
Remote Monitoring: Wearables and home health devices generate real-time data, which can
be integrated into patient records. This allows for continuous monitoring of patients,
enabling prompt responses to changes in their health status, especially for chronic disease
management.
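The remote-monitoring idea can be illustrated with a short Python sketch that flags out-of-range readings in a wearable stream (all values, including the threshold, are invented and not clinically meaningful):

```python
# Hypothetical heart-rate stream from a wearable device (values invented).
readings = [72, 75, 74, 118, 121, 76]

THRESHOLD = 110  # illustrative alert cutoff, not a clinical value

def flag_alerts(stream, threshold):
    """Return the indices of readings that exceed the alert threshold."""
    return [i for i, bpm in enumerate(stream) if bpm > threshold]

alerts = flag_alerts(readings, THRESHOLD)
print(alerts)  # [3, 4] -- the two elevated readings
```

In a real deployment this check would run continuously over the incoming stream and trigger a notification to a clinician rather than just printing indices.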
2. Medical Research
Epidemiological Studies: Big data provides vast amounts of information for studying the
spread of diseases, especially through sources like electronic health records (EHRs), public
health datasets, and social media. This data is critical for understanding disease patterns,
outbreaks, and trends, which can inform public health responses and policies.
Clinical Trials: Traditional clinical trials are often limited by sample size and diversity. Big data
enables the analysis of a broader, more diverse population, which leads to better
understanding of how treatments work across different demographic groups. It also helps in
identifying adverse effects faster, making clinical trials safer.
Genomics and Precision Medicine: Big data allows researchers to analyze large-scale
genomic data, providing insights into the genetic factors that influence diseases. This can
lead to the development of targeted therapies based on individual genetic profiles, thus
improving treatment outcomes.
Challenges
Data Privacy and Security: One of the major concerns with big data in healthcare is ensuring
the privacy and security of sensitive patient information. Robust data protection measures
and strict regulations are necessary to prevent breaches and misuse of data.
Data Integration: Healthcare data often comes from disparate sources, including hospitals,
clinics, insurance providers, and wearable devices. Integrating this data into a cohesive
system that can be easily analyzed remains a significant challenge.
Bias and Inequality: If data used in medical research or patient care is biased or incomplete,
it could result in skewed outcomes, such as underrepresentation of certain demographic
groups. This can exacerbate healthcare inequalities and affect treatment effectiveness.
3a.
1. Relational Database (RDBMS):
RDBMS stands for Relational Database Management System. It is the most widely used type of
database. Data is stored in tables as rows (tuples), so it can be accessed and queried easily. The
relational model was proposed by E.F. Codd.
2. NoSQL:
NoSQL stands for a non-SQL ("not only SQL") database. A NoSQL database does not use tables to
store data the way a relational database does. It is generally used to store and retrieve very large
amounts of data, supports flexible querying, and provides better performance at scale.

RDBMS | NoSQL
Provides only read scalability | Provides both read and write scalability
Difficult to change the schema once it is defined | Enables easy and frequent schema changes

NoSQL is preferred for big data applications because:
Handling Large Volumes of Data: Big data applications often deal with massive amounts of
data that traditional relational databases struggle to handle efficiently. NoSQL databases are
designed to handle large-scale datasets by distributing data across many servers or nodes,
allowing them to scale horizontally.
Flexibility with Unstructured Data: Big data typically includes a mix of structured, semi-
structured, and unstructured data (such as logs, social media content, or sensor data). NoSQL
databases are designed to store and process this diverse data efficiently, without the need
for rigid schemas or complex data modeling.
High Velocity and Real-Time Processing: Many big data applications require real-time or
near-real-time data processing, such as monitoring systems, recommendation engines, and
fraud detection. NoSQL databases can provide the speed and low latency needed for these
use cases.
Distributed and Fault-Tolerant: Big data systems need to be resilient and handle hardware
failures gracefully. NoSQL databases are often designed with built-in replication and fault
tolerance, ensuring high availability and data durability even in the event of server crashes or
network partitions.
Cost-Effectiveness: NoSQL databases can be deployed on commodity hardware or cloud
infrastructure, which helps keep costs lower than traditional RDBMS solutions that require
high-end servers for vertical scaling.
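The schema-flexibility point can be illustrated with a toy in-memory document store in Python (all names and fields are invented; real NoSQL stores add persistence, distribution, and indexing on top of this idea):

```python
# A toy document store: records are keyed by id and need not share a schema.
store = {}

def put(doc_id, doc):
    """Insert or overwrite a document under the given id."""
    store[doc_id] = doc

# Two records with different shapes coexist without any schema change --
# in an RDBMS this would require ALTER TABLE or nullable columns.
put("u1", {"name": "Alice", "email": "a@example.com"})
put("u2", {"name": "Bob", "tags": ["admin"], "last_login": "2024-05-01"})

# Queries must tolerate missing fields, hence the .get() with a default.
admins = [d["name"] for d in store.values() if "admin" in d.get("tags", [])]
print(admins)  # ['Bob']
```

The trade-off is visible even in this sketch: the application, not the database, must decide what to do when a field is absent.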
3b.