Big Data Analytics - notes

The document provides an overview of Big Data Analytics, covering its definition, significance, and key trends leading to its rise, including the role of unstructured data across various industries. It discusses technologies like Hadoop and NoSQL databases, their architectures, and applications in web analytics and healthcare. Additionally, it explores Apache Spark's advantages, stream processing fundamentals, and performance tuning techniques.

Uploaded by

caleb dharmaraju

Big Data Analytics

UNIT I: What is big data, why big data, convergence of key trends, unstructured data, industry
examples of big data, web analytics, big data and marketing, fraud and big data, risk and big data,
credit risk management, big data and algorithmic trading, big data and healthcare, big data in
medicine, advertising and big data, big data technologies, introduction to Hadoop, open source
technologies, cloud and big data, mobile business intelligence, Crowd sourcing analytics, inter and
trans firewall analytics.

UNIT II: Introduction to NoSQL, aggregate data models, aggregates, key-value and document data
models, relationships, graph databases, schemaless databases, materialized views, distribution
models, sharding, master-slave replication, peer-to-peer replication, sharding and replication,
consistency, relaxing consistency, version stamps, Working with Cassandra, Table creation, loading
and reading data.

UNIT III: Data formats, analyzing data with Hadoop, scaling out, Architecture of Hadoop distributed
file system (HDFS), fault tolerance with data replication, High availability, Data locality, MapReduce
Architecture, Process flow, Java interface, data flow, Hadoop I/O, data integrity, compression,
serialization. Introduction to Hive, data types and file formats, HiveQL data definition, HiveQL data
manipulation, Logical joins, Window functions, Optimization, Table partitioning, Bucketing, Indexing,
Join strategies.

UNIT IV: Apache Spark: Advantages over Hadoop, lazy evaluation, In-memory processing, DAG, Spark
Context, Spark Session, RDD, Transformations (Narrow and Wide), Actions, DataFrames, RDD to
DataFrames, Catalyst optimizer, DataFrame Transformations, Working with Dates and Timestamps,
Working with Nulls in Data, Working with Complex Types, Working with JSON, Grouping, Window
Functions, Joins, Data Sources, Broadcast Variables, Accumulators, Deploying Spark: On-Premises
Cluster Deployments, Cluster Managers (Standalone Mode, Spark on YARN), Spark Logs, The Spark UI,
Spark UI History Server, Debugging and Spark First Aid.

UNIT V: Spark Performance Tuning, Stream Processing Fundamentals, Event-Time and Stateful
Processing: Event Time, Stateful Processing, Windows on Event Time, Tumbling Windows, Handling
Late Data with Watermarks, Dropping Duplicates in a Stream, Structured Streaming Basics: Core
Concepts, Structured Streaming in Action, Transformations on Streams, Input and Output.


UNIT-I

1. a Define big data and explain how it differs from traditional data sets. Discuss the convergence of
key trends that have led to the rise of big data.

b Describe the role of unstructured data in big data analytics. Provide an example of how
unstructured data is used in one industry.

OR

2. a Explain how big data technologies like Hadoop have revolutionized web analytics. Provide a
specific example of its application.

b. Discuss the impact of big data in the healthcare sector, particularly in terms of patient care and
medical research.

UNIT-II

3. a Describe the key differences between NoSQL and traditional relational database systems. Why is
NoSQL preferred for big data applications?

b. Explain the concept of aggregates in NoSQL databases. How do they affect data modeling and
querying?

OR

4. a Discuss the architecture and data model of Cassandra. How does it differ from other NoSQL
databases?

b. Describe the process of creating and managing tables in Cassandra. Include an example of table
creation and data manipulation.

UNIT-III

5. a Explain the architecture of the Hadoop Distributed File System (HDFS) and its role in big data
analytics.

b Discuss the MapReduce architecture and its process flow. How does it handle large datasets?

OR

6. a Explain how Hive facilitates big data analytics. Discuss its data types, file formats, and HiveQL.

b Describe the concepts of table partitioning and bucketing in Hive. How do these features
contribute to query optimization?

UNIT-IV

7. a Compare the advantages of Apache Spark over traditional Hadoop MapReduce. Why is Spark
considered more efficient for certain tasks?

b Explain the concept of Resilient Distributed Datasets (RDDs) in Spark. Discuss their transformations
and actions.

OR

8. a Describe how Spark handles data frames and complex data types. Include an example of working
with JSON data in Spark.
b. Discuss the deployment of Spark in different environments. Compare its performance in
Standalone Mode versus Spark on YARN.

UNIT-V

9. a Explain the fundamentals of stream processing in Spark. How does it handle real-time data
analytics?

b. Discuss the concepts of event-time processing and stateful processing in Spark Streaming. Include
an example of tumbling windows.

OR

10. a Describe the core concepts of structured streaming in Spark. How is it used in real-world data
processing scenarios?

b. Explain the techniques involved in performance tuning of Spark applications. How does one
optimize a Spark application for better performance?
ANSWERS:
1a. Definition of Big Data

Big Data refers to extremely large and complex datasets that are difficult to process, analyze, and
manage using traditional data management tools and techniques. Big data was originally characterized
by its Volume, Velocity, and Variety (the 3Vs); Veracity and Value are commonly added to give the 5Vs.

Characteristics of Big Data (The 5Vs Framework)

1. Volume:

o Refers to the sheer size of data, often measured in terabytes, petabytes, or even
exabytes.

o Example: Social media platforms like Facebook generate billions of posts, images,
and videos daily.

2. Velocity:

o Refers to the speed at which data is generated, collected, and processed.

o Example: Real-time data from IoT devices and sensors.

3. Variety:

o Refers to the diverse types of data, including structured (e.g., tables), semi-
structured (e.g., JSON files), and unstructured data (e.g., images, videos, emails).

o Example: Combining video surveillance data with transaction logs.

4. Veracity:

o Refers to the uncertainty or inconsistencies in the data. It highlights the need to
ensure data accuracy and reliability.

o Example: Filtering out fake news from social media data.

5. Value:

o Refers to the insights and actionable information derived from analyzing big data.

o Example: Customer behavior analysis for personalized marketing.

How Big Data Differs from Traditional Data Sets

Aspect | Big Data | Traditional Data Sets

Size | Huge, often terabytes or more | Relatively small, gigabytes or less

Type of Data | Structured, semi-structured, unstructured | Mostly structured

Processing Tools | Big Data tools like Hadoop, Spark | Relational databases queried with SQL (e.g., Oracle)

Speed of Data | Real-time or near-real-time | Batch processing

Scalability | Distributed systems for scalability | Limited by single systems

Examples | Social media, IoT, genomics data | Payroll, inventory management data

Key Trends Leading to the Rise of Big Data

The rise of Big Data is attributed to the convergence of several technological, social, and economic
trends:

1. Explosion of Data Generation

• The proliferation of the internet, smartphones, and IoT devices has drastically increased the
amount of data being generated.

• Example: Social media platforms like Twitter and Instagram generate hundreds of millions of
posts every day.

2. Advancements in Storage Technology

• Falling storage costs (e.g., cloud storage) have made it possible to store vast amounts of data
economically.

• Example: Cloud storage solutions like Amazon S3 and Google Drive.

3. Improved Computing Power

• Advances in computing technology, including distributed computing frameworks (e.g.,
Hadoop, Spark), have enabled the processing of massive datasets.

• Example: Parallel processing allows quicker analysis of big data.

4. Internet of Things (IoT)

• IoT devices generate a continuous stream of real-time data, contributing significantly to big
data.

• Example: Smart sensors in manufacturing or wearable devices like fitness trackers.

5. Growth of Social Media

• Social media platforms have become a major source of user-generated data in the form of
posts, images, and videos.

• Example: Companies analyze social media data for sentiment analysis and brand monitoring.

6. Open Source Technologies

• The development of open-source big data frameworks like Hadoop, Apache Spark, and
NoSQL databases has made big data processing accessible and affordable.

• Example: Hadoop's distributed file system (HDFS) stores massive datasets across clusters of
commodity machines, where frameworks such as MapReduce can process them.

7. Need for Real-Time Insights

• Businesses and organizations demand real-time insights for competitive advantage, driving
the adoption of big data analytics.

• Example: Financial markets analyze real-time trading data to make instant decisions.

8. Artificial Intelligence and Machine Learning

• The rise of AI and ML requires vast amounts of data for training and testing models, fueling
the growth of big data.

• Example: Training neural networks for image recognition.

1b. Role of Unstructured Data in Big Data Analytics


Unstructured data plays a pivotal role in Big Data Analytics because, by commonly cited industry
estimates, it makes up the majority (around 80-90%) of all data generated today. Unlike structured
data (organized in rows and columns), unstructured data lacks a predefined format and is often
challenging to analyze using traditional methods. However, it holds immense value because of the rich
insights it contains, making it a critical component in decision-making and predictive analytics.

Key Characteristics of Unstructured Data

1. Lacks Structure: Unstructured data is not organized in a pre-defined manner (e.g.,
documents, images, videos).

2. High Volume: Generated in massive quantities, often from social media, IoT devices, or user-
generated content.

3. Variety: Comes in diverse formats, such as text, images, videos, audio files, and sensor data.

4. Complex Processing: Requires advanced technologies like machine learning, natural language
processing (NLP), and computer vision for analysis.

Role in Big Data Analytics

1. Enhancing Decision-Making:

o Unstructured data provides valuable insights that cannot be extracted from
structured data alone.

o Example: Sentiment analysis on social media posts helps organizations understand
public opinion.

2. Real-Time Insights:

o Analyzing unstructured data in real-time enables organizations to react quickly to
trends or issues.

o Example: Monitoring live customer reviews to improve product offerings.

3. Predictive Analytics:

o Combining structured and unstructured data enhances predictive models, offering a
more holistic view.

o Example: Predicting customer churn by analyzing emails, call logs, and transaction
history.

4. Personalized Experiences:

o Unstructured data helps businesses understand customer preferences and deliver
tailored services.

o Example: Streaming platforms like Netflix use viewing history (behavioral data that is
usually logged as semi-structured events rather than truly unstructured content) to
recommend shows.

5. New Revenue Streams:

o Organizations can monetize insights derived from unstructured data.

o Example: Retailers analyze unstructured data like social media mentions to identify
trends and launch new products.
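The sentiment-analysis example mentioned in the list above can be illustrated with a very small sketch. It assumes a hand-made word lexicon (the `POSITIVE` and `NEGATIVE` sets and the sample posts are invented for illustration); production systems would normally rely on trained NLP models rather than word counting.

```python
import re

# Hypothetical, hand-made lexicon; a real system would use a trained model
# or a curated sentiment lexicon instead.
POSITIVE = {"great", "love", "excellent", "good", "fast"}
NEGATIVE = {"bad", "slow", "terrible", "hate", "broken"}


def sentiment_score(text: str) -> int:
    """Return (#positive words - #negative words) for one piece of free text."""
    words = re.findall(r"[a-z']+", text.lower())
    return sum(w in POSITIVE for w in words) - sum(w in NEGATIVE for w in words)


def label(text: str) -> str:
    """Map the numeric score to a coarse sentiment label."""
    score = sentiment_score(text)
    if score > 0:
        return "positive"
    if score < 0:
        return "negative"
    return "neutral"


# Invented sample posts standing in for unstructured social media text.
posts = [
    "Love the new phone, the camera is excellent",
    "Delivery was slow and the box arrived broken",
    "Ordered it on Monday",
]
labels = [label(p) for p in posts]
print(labels)
```

The point of the sketch is only that free text has to be turned into features (here, word matches) before it can feed a decision; the scoring rule itself is a placeholder.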

Example: Healthcare Industry

In the healthcare industry, unstructured data plays a critical role in improving patient care and
operational efficiency.

Sources of Unstructured Data in Healthcare:

• Medical Records: Doctors' notes, prescriptions, and discharge summaries.

• Medical Imaging: X-rays, MRIs, CT scans.

• Patient Feedback: Surveys and reviews.

• Wearable Devices: Data from fitness trackers like heart rate, sleep patterns, and activity
levels.

• Research Articles: Clinical trials, medical journals, and research papers.

Use Case: Improving Diagnosis with Medical Imaging

• Challenge: Medical images like X-rays, CT scans, and MRIs are unstructured data, and
analyzing them manually is time-consuming and prone to errors.

• Solution:

o Machine Learning and AI: Advanced algorithms analyze medical images to identify
abnormalities such as tumors or fractures with high accuracy.

o Example: Google's DeepMind uses AI to detect eye diseases by analyzing retinal
scans.

• Impact:

o Faster and more accurate diagnoses.

o Reduced workload for radiologists.

o Improved patient outcomes.

2a. How Big Data Technologies Like Hadoop Have Revolutionized Web Analytics

Big Data technologies such as Hadoop have fundamentally transformed web analytics by enabling
the storage, processing, and analysis of vast amounts of data at high speed and low cost. Traditional
data management systems struggled with the scale, speed, and complexity of modern web data, but
Hadoop introduced a scalable, fault-tolerant, and distributed framework that addresses these
challenges.

Key Features of Hadoop in Web Analytics

1. Distributed Processing:

o Hadoop divides large datasets into smaller chunks and processes them across
multiple nodes simultaneously using the MapReduce framework.

o This enables faster analysis of vast amounts of web traffic and user behavior data.

2. Scalability:

o Hadoop's ability to scale horizontally allows organizations to handle growing volumes
of data without expensive hardware upgrades.

3. Cost-Effectiveness:

o Built on commodity hardware, Hadoop is more economical than traditional
enterprise solutions for processing large datasets.

4. Data Variety:

o Hadoop handles structured, semi-structured, and unstructured data, enabling web
analytics to include diverse data types such as clickstreams, videos, social media
posts, and logs.

5. Real-Time Insights:

o Hadoop MapReduce itself is batch-oriented, but engines that run alongside it in the
Hadoop ecosystem, such as Apache Spark, allow near-real-time processing of web data,
enabling organizations to respond to user behavior dynamically.
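The distributed-processing idea in point 1 (split the data into chunks, process the chunks independently, then combine the results) can be sketched in plain Python. This is a single-process imitation of the map, shuffle, and reduce steps, not Hadoop's Java API; the log format `user_id page` and the sample records are assumptions made for the example.

```python
from collections import defaultdict
from itertools import chain


def split_into_chunks(records, chunk_size):
    """Imitate input splits: each chunk could be handled by a different node."""
    return [records[i:i + chunk_size] for i in range(0, len(records), chunk_size)]


def map_phase(chunk):
    """Map: emit a (page, 1) pair for every log line of the form 'user_id page'."""
    return [(line.split()[1], 1) for line in chunk]


def shuffle(pairs):
    """Shuffle: group all values that share the same key."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(groups):
    """Reduce: sum the grouped values to get one count per page."""
    return {key: sum(values) for key, values in groups.items()}


# Invented clickstream records for illustration.
log = ["u1 /home", "u2 /home", "u1 /cart", "u3 /home", "u2 /checkout", "u3 /cart"]

chunks = split_into_chunks(log, 2)
mapped = chain.from_iterable(map_phase(c) for c in chunks)
page_views = reduce_phase(shuffle(mapped))
print(page_views)
```

In a real cluster each chunk would live on a different machine and the map calls would run in parallel next to the data; the structure of the computation is the same.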

How Hadoop Transformed Web Analytics

1. Tracking User Behavior:

o Hadoop processes massive clickstream data (records of user interactions on a
website) to identify patterns and trends in user behavior.

o Businesses can determine popular pages, navigation paths, and bounce rates to
improve website performance.

2. Enhanced Personalization:

o Hadoop-powered analytics enables personalized user experiences by analyzing
browsing history, preferences, and purchase behavior.

o Example: E-commerce platforms use Hadoop to suggest products based on a user's
previous activity.

3. SEO Optimization:

o Analyzing search engine traffic data, keywords, and conversion rates becomes more
efficient with Hadoop, helping businesses improve their SEO strategies.

4. Fraud Detection and Security:

o Hadoop processes web logs and server data to detect anomalies, such as unusual
login patterns or high-volume traffic spikes, which might indicate security threats.

5. Social Media Integration:

o Hadoop integrates social media data with web analytics, providing insights into
customer sentiment and the impact of campaigns.
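As a small illustration of the fraud-detection point above, the sketch below counts failed logins per IP address in a web log and flags addresses that exceed a threshold. The event format, the threshold `max_failures`, and the sample addresses are all assumptions for the example; real deployments would typically use richer features and statistical or learned baselines.

```python
from collections import Counter


def flag_suspicious_ips(events, max_failures=3):
    """Return, sorted, the IPs with more than `max_failures` failed logins.

    `events` is an iterable of (ip, status) tuples, where status is
    "OK" or "FAIL" (a hypothetical, simplified log format).
    """
    failures = Counter(ip for ip, status in events if status == "FAIL")
    return sorted(ip for ip, count in failures.items() if count > max_failures)


# Invented log events for illustration.
events = (
    [("10.0.0.1", "OK"), ("10.0.0.1", "FAIL")]
    + [("10.0.0.2", "FAIL")] * 5
    + [("10.0.0.3", "OK")]
)

suspicious = flag_suspicious_ips(events)
print(suspicious)
```

The per-IP count is exactly the kind of aggregation that maps well onto a MapReduce job when the logs are too large for one machine.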

Example: Netflix

Netflix, a leader in streaming services, uses Hadoop extensively to optimize its web and app
analytics.

Application of Hadoop:

1. User Behavior Analysis:

o Netflix tracks what users watch, search for, pause, and skip. This generates enormous
amounts of data.

o Hadoop processes this data to understand viewer preferences, allowing Netflix to
provide highly accurate content recommendations.

2. Predictive Analytics:

o Netflix predicts trends in viewership using Hadoop. For example, if users in a specific
region are watching a certain genre, it suggests content in that genre to similar users
in other regions.

3. Content Optimization:

o Hadoop analyzes user feedback and reviews to determine which content is popular
or needs improvement.

4. Streaming Quality:

o Netflix uses Hadoop to analyze server logs and network performance, ensuring
seamless streaming by minimizing buffering and latency.
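To make the idea of recommendations from viewing data concrete, here is a minimal co-occurrence sketch ("viewers who watched X also watched Y"). It is a generic textbook technique chosen for illustration and is not a description of Netflix's actual system; the user names, titles, and the `k` parameter are invented.

```python
from collections import Counter, defaultdict
from itertools import permutations


def build_cooccurrence(histories):
    """Count, for each title, how often every other title appears in the same history."""
    co = defaultdict(Counter)
    for titles in histories.values():
        for a, b in permutations(set(titles), 2):
            co[a][b] += 1
    return co


def recommend(user, histories, co, k=2):
    """Suggest up to k unseen titles, ranked by co-occurrence with what the user watched."""
    seen = set(histories[user])
    scores = Counter()
    for title in seen:
        for other, count in co[title].items():
            if other not in seen:
                scores[other] += count
    # Sort by score (descending), then by title so that ties are deterministic.
    ranked = sorted(scores.items(), key=lambda item: (-item[1], item[0]))
    return [title for title, _ in ranked[:k]]


# Invented viewing histories for illustration.
histories = {
    "ana": ["Drama A", "Thriller B"],
    "ben": ["Drama A", "Thriller B", "Comedy C"],
    "cy": ["Drama A", "Comedy C"],
    "dee": ["Drama A"],
    "eli": ["Drama A", "Thriller B"],
}

co = build_cooccurrence(histories)
print(recommend("dee", histories, co))
print(recommend("ana", histories, co))
```

At scale, building the co-occurrence counts is a batch aggregation over all viewing logs, which is the kind of job a Hadoop cluster is suited to.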

Impact:

• Enhanced user satisfaction through personalized recommendations.

• Higher viewer engagement and retention rates.

• Improved operational efficiency in managing global traffic.

2b.
Big data has significantly transformed the healthcare sector, especially in patient care and medical
research. Here's how it impacts both areas:

1. Patient Care

• Personalized Medicine: Big data allows for the analysis of genetic, environmental, and
lifestyle factors, enabling healthcare providers to deliver more tailored treatment plans. By
using patient-specific data, treatments can be optimized for effectiveness, reducing the trial-
and-error approach in prescribing medications.

• Predictive Analytics: With access to vast amounts of patient data, healthcare systems can
predict patient outcomes more accurately. This helps in identifying at-risk patients before
they develop severe conditions. For example, predictive models can anticipate complications
in patients with chronic diseases like diabetes or heart disease, prompting early
interventions.

• Improved Diagnostics: Big data helps in the identification of patterns within medical images,
lab results, and patient histories that may not be easily spotted by human clinicians. This has
led to more accurate diagnostics, especially in areas like radiology, where machine learning
algorithms can assist in detecting abnormalities like tumors or fractures.

• Streamlined Operations: Big data optimizes healthcare operations, such as scheduling,
patient flow management, and resource allocation, leading to reduced wait times, better
access to care, and improved patient satisfaction.

• Remote Monitoring: Wearables and home health devices generate real-time data, which can
be integrated into patient records. This allows for continuous monitoring of patients,
enabling prompt responses to changes in their health status, especially for chronic disease
management.

2. Medical Research

• Accelerated Drug Discovery: By analyzing large datasets, including genetic information,
clinical trials data, and real-world patient outcomes, researchers can identify potential drug
candidates more efficiently. Big data helps pinpoint promising compounds and predict their
effects, speeding up the development of new medications.

• Epidemiological Studies: Big data provides vast amounts of information for studying the
spread of diseases, especially through sources like electronic health records (EHRs), public
health datasets, and social media. This data is critical for understanding disease patterns,
outbreaks, and trends, which can inform public health responses and policies.

• Clinical Trials: Traditional clinical trials are often limited by sample size and diversity. Big data
enables the analysis of a broader, more diverse population, which leads to better
understanding of how treatments work across different demographic groups. It also helps in
identifying adverse effects faster, making clinical trials safer.

• Genomics and Precision Medicine: Big data allows researchers to analyze large-scale
genomic data, providing insights into the genetic factors that influence diseases. This can
lead to the development of targeted therapies based on individual genetic profiles, thus
improving treatment outcomes.

Challenges

• Data Privacy and Security: One of the major concerns with big data in healthcare is ensuring
the privacy and security of sensitive patient information. Robust data protection measures
and strict regulations are necessary to prevent breaches and misuse of data.

• Data Integration: Healthcare data often comes from disparate sources, including hospitals,
clinics, insurance providers, and wearable devices. Integrating this data into a cohesive
system that can be easily analyzed remains a significant challenge.

• Bias and Inequality: If data used in medical research or patient care is biased or incomplete,
it could result in skewed outcomes, such as underrepresentation of certain demographic
groups. This can exacerbate healthcare inequalities and affect treatment effectiveness.

3a.
1. Relational Database:
RDBMS stands for Relational Database Management System. It is the most widely used type of
database. Data is stored in tables, and each row of a table is a tuple. A database contains a number
of tables, and because the data is organized in tables it can be accessed easily, typically with SQL.
The relational model was proposed by E. F. Codd.

2. NoSQL:
NoSQL stands for non-SQL (often read as "not only SQL"). A NoSQL database does not store data in
the fixed tables of a relational database; it uses models such as key-value, document, column-family,
or graph. It is generally used to store and retrieve large amounts of data. Most NoSQL databases
provide their own query language or API, and they can perform better than relational databases on
large, distributed workloads.

Differences between relational databases and NoSQL:

Relational Database | NoSQL

It is used to handle data coming in at low velocity. | It is used to handle data coming in at high velocity.

It mainly gives read scalability. | It gives both read and write scalability.

It manages structured data. | It manages all types of data.

Data arrives from one or a few locations. | Data arrives from many locations.

It supports complex transactions. | It typically supports simple transactions.

It often has a single point of failure. | It is designed to have no single point of failure.

It handles data in lower volume. | It handles data in high volume.

Transactions are written in one location. | Transactions are written in many locations.

It supports ACID properties. | It usually relaxes ACID guarantees.

It is difficult to change the database once it is defined. | It enables easy and frequent changes to the database.

A schema is mandatory to store the data. | Schema design is not required up front.

It is scaled vertically (a bigger server). | It is scaled horizontally (more servers).
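The schema row of the comparison can be illustrated with a short sketch. It uses Python's built-in `sqlite3` module for the relational side and a plain dictionary of JSON strings as a stand-in for a document store; the `users` table, the field names, and the `put`/`get` helpers are hypothetical and not tied to any particular NoSQL product.

```python
import json
import sqlite3

# Relational side: the schema is declared up front and every row must fit it.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE users (id INTEGER PRIMARY KEY, name TEXT NOT NULL, city TEXT)")
conn.execute("INSERT INTO users (id, name, city) VALUES (?, ?, ?)", (1, "Asha", "Chennai"))

try:
    # 'twitter' is not a declared column, so this insert is rejected
    # until the table is altered.
    conn.execute("INSERT INTO users (id, name, twitter) VALUES (?, ?, ?)", (2, "Ravi", "@ravi"))
    relational_insert_ok = True
except sqlite3.OperationalError:
    relational_insert_ok = False

# Document side (stand-in): each record is a self-describing JSON document,
# so two records may carry different fields, including nested ones.
store = {}


def put(doc):
    """Store one document under its id."""
    store[doc["id"]] = json.dumps(doc)


def get(doc_id):
    """Load one document back as a Python dict."""
    return json.loads(store[doc_id])


put({"id": 1, "name": "Asha", "city": "Chennai"})
put({"id": 2, "name": "Ravi", "twitter": "@ravi", "orders": [{"sku": "B-17", "qty": 2}]})

print(relational_insert_ok)
print(get(2)["orders"])
```

The flexibility has a cost that the table also hints at: without a schema, the application has to cope with documents that are missing fields or use them inconsistently.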

Why is NoSQL Preferred for Big Data Applications?

• Handling Large Volumes of Data: Big data applications often deal with massive amounts of
data that traditional relational databases struggle to handle efficiently. NoSQL databases are
designed to handle large-scale datasets by distributing data across many servers or nodes,
allowing them to scale horizontally.

• Flexibility with Unstructured Data: Big data typically includes a mix of structured, semi-
structured, and unstructured data (such as logs, social media content, or sensor data). NoSQL
databases are designed to store and process this diverse data efficiently, without the need
for rigid schemas or complex data modeling.

• High Velocity and Real-Time Processing: Many big data applications require real-time or
near-real-time data processing, such as monitoring systems, recommendation engines, and
fraud detection. NoSQL databases can provide the speed and low latency needed for these
use cases.

• Distributed and Fault-Tolerant: Big data systems need to be resilient and handle hardware
failures gracefully. NoSQL databases are often designed with built-in replication and fault
tolerance, ensuring high availability and data durability even in the event of server crashes or
network partitions.

• Cost-Effectiveness: NoSQL databases can be deployed on commodity hardware or cloud
infrastructure, which helps keep costs lower than traditional RDBMS solutions that require
high-end servers for vertical scaling.
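Two of the points above, distributing data across nodes and surviving a node failure through replication, can be sketched as follows. This is a simplified hash-modulo placement scheme written for illustration; the node names and the replication factor are assumptions, and real systems such as Cassandra use more elaborate schemes (for example consistent hashing) so that adding a node moves less data.

```python
import hashlib

# Hypothetical cluster of four nodes.
NODES = ["node-a", "node-b", "node-c", "node-d"]


def shard_index(key, num_nodes):
    """Hash the key to a stable position in the range [0, num_nodes)."""
    digest = hashlib.sha256(key.encode("utf-8")).hexdigest()
    return int(digest, 16) % num_nodes


def replicas_for(key, nodes=NODES, replication_factor=2):
    """Place the key on its home node and on the next nodes in list order."""
    start = shard_index(key, len(nodes))
    return [nodes[(start + i) % len(nodes)] for i in range(replication_factor)]


def surviving_replicas(key, failed):
    """Return the replicas of `key` that are still reachable when `failed` nodes are down."""
    return [node for node in replicas_for(key) if node not in failed]


key = "user:42"
print(replicas_for(key))
# With two copies on distinct nodes, losing any single node leaves one copy readable.
print(all(surviving_replicas(key, {node}) for node in NODES))
```

Because placement depends only on the key and the node list, any client can compute where a record lives without asking a central coordinator, which is one reason this style of design scales horizontally.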

3b.
