BIG DATA ANALYTICS
(AUTONOMOUS), SALEM
ACADEMIC YEAR 2025-2026, ODD SEMESTER
Semester V | Course Code CAE06 | Paper Title: Big Data Analytics | Elective | 4 | 3
Preamble: Big Data Analytics is the process of analyzing large and complex data sets to discover patterns, trends, and insights that support better decision-making.
Prerequisite: Basic knowledge of programming (preferably Python, Java, or R), understanding of databases and SQL, fundamentals of statistics and mathematics, and familiarity with data structures and algorithms.
Course Outcomes (COs): Work with big data tools and their analysis techniques.

CO Number | Course Outcome (CO) Statement | Bloom's Taxonomy Knowledge Level
CO1 | Understand the Big Data platform, its use cases, and MapReduce jobs | K1
CO2 | Identify and understand the basics of clustering and decision trees | K2
CO3 | Study Association Rules and Recommendation Systems | K2
CO4 | Learn the concepts of streams | K3
CO5 | Understand the concepts of NoSQL databases | K3
Syllabus

Unit I (15 Hours)
Evolution of Big Data – Best Practices for Big Data Analytics – Big Data Characteristics – Validating – The Promotion of the Value of Big Data – Big Data Use Cases – Characteristics of Big Data Applications – Perception and Quantification of Value – Understanding Big Data Storage – A General Overview of High-Performance Architecture – HDFS – MapReduce and YARN – MapReduce Programming Model.

Unit II (15 Hours)
Advanced Analytical Theory and Methods: Clustering: Overview of Clustering – K-Means – Use Cases – Overview of the Method – Determining the Number of Clusters – Diagnostics – Reasons to Choose and Cautions. Classification: Decision Trees – Overview of a Decision Tree – The General Algorithm – Decision Tree Algorithms – Evaluating a Decision Tree – Decision Trees in R. Naïve Bayes: Bayes' Theorem – Naïve Bayes Classifier.

Unit III (15 Hours)
Advanced Analytical Theory and Methods: Association Rules – Overview – Apriori Algorithm – Evaluation of Candidate Rules – Applications of Association Rules – Finding Association & Finding Similarity. Recommendation System: Collaborative Recommendation – Content-Based Recommendation – Knowledge-Based Recommendation – Hybrid Recommendation Approaches.

Unit IV (15 Hours)
Introduction to Streams Concepts: Stream Data Model and Architecture – Stream Computing – Sampling Data in a Stream – Filtering Streams – Counting Distinct Elements in a Stream – Estimating Moments – Counting Oneness in a Window – Decaying Window. Real-Time Analytics Platform (RTAP) Applications: Case Studies – Real-Time Sentiment Analysis – Stock Market Predictions. Using Graph Analytics for Big Data: Graph Analytics.

Unit V (15 Hours)
NoSQL Databases: Schema-less Models – Increasing Flexibility for Data Manipulation – Key-Value Stores – Document Stores – Tabular Stores – Object Data Stores – Graph Databases – Hive – Sharding – HBase. Applications of Big Data: Analyzing Big Data with Twitter – Big Data for E-Commerce – Big Data for Blogs. Tools and Methods: Review of Basic Data Analytic Methods using R.

Total: 75 Hours
Learning Methods
1 TO 3- Unit I
4 TO 6- Unit II
7 TO 9- Unit III
10 TO 12- Unit IV
13 TO 15- Unit V
16- Unit I
17- Unit II
20- Unit V
Evolution of Big Data
Big Data is nothing but a large amount of data consisting of many varieties of data. It is the concept of gathering useful insights from such voluminous amounts of structured, semi-structured and unstructured data, which can then be used for effective decision-making in the business environment. This data is collected from various sources over a period of time and is too cumbersome to manage with traditional database tools.
Big Data Characteristics
In earlier years, Big Data was defined by the "3Vs", but it is now commonly described by "6Vs", which are also termed the characteristics of Big Data, as follows:
1. Volume:
To determine the value of data, the size of the data plays a very crucial role. Only if the volume of data is very large is it actually considered 'Big Data'. In other words, whether particular data can be considered Big Data or not depends upon its volume.
Hence, while dealing with Big Data it is necessary to consider the characteristic 'Volume'.
Example: In the year 2016, the estimated global mobile traffic was 6.2 exabytes (6.2 billion GB) per month, and by the year 2020 the world was expected to hold almost 40,000 exabytes of data.
2. Velocity:
• Velocity refers to the high speed of accumulation of data.
• In Big Data velocity data flows in from sources like machines, networks,
social media, mobile phones etc.
• There is a massive and continuous flow of data. This determines the
potential of data that how fast the data is generated and processed to
meet the demands.
• Sampling data can help in dealing with issues like 'velocity'.
• Example: More than 3.5 billion searches per day are made on Google. Also, Facebook users are increasing by approximately 22% year over year.
3. Variety:
• It refers to the nature of the data, which may be structured, semi-structured or unstructured.
• It also refers to heterogeneous sources.
• Variety is basically the arrival of data from new sources that are both
inside and outside of an enterprise. It can be structured, semi-structured
and unstructured.
o Structured data: This data is basically an organized data.
It generally refers to data that has defined the length and
format of data.
o Semi-structured data: This data is basically semi-organised data. It is generally a form of data that does not conform to the formal structure of structured data. Log files are examples of this type of data.
o Unstructured data: This data basically refers to
unorganized data. It generally refers to data that doesn’t
fit neatly into the traditional row and column structure of
the relational database. Texts, pictures, videos etc. are the
examples of unstructured data which can’t be stored in
the form of rows and columns.
4. Veracity:
• It refers to inconsistencies and uncertainty in data, that is data which is
available can sometimes get messy and quality and accuracy are difficult
to control.
• Big Data is also variable because of the multitude of data dimensions
resulting from multiple disparate data types and sources.
• Example: Data in bulk can create confusion, whereas a smaller amount of data may convey only half or incomplete information.
5. Value:
• After taking the previous four V's into account, there comes one more V, which stands for Value. Bulk data that has no value is of no good to the company unless it is turned into something useful.
• Data in itself is of no use or importance; it needs to be converted into something valuable in order to extract information. Hence, you can state that Value is the most important of the 6 V's.
6. Variability:
• How fast, and to what extent, is the structure of your data changing?
• How often does the meaning or shape of your data change?
• Example: it is as if you were eating the same ice cream daily but the taste kept changing.
Uses of Big Data
• Big Data enables you to gather information about customers and their experience, and eventually helps you to align your offerings properly.
• Big Data is also useful for companies to anticipate customer demand, roll out new plans, test markets, etc.
• Big Data is very useful in predicting failures beforehand, by analyzing various indicators such as unstructured data, error messages, log entries, engine temperature, etc., and suggesting potential solutions.
• Big Data is also very efficient in maintaining operational functions, along with anticipating future customer demands and current market demands, thus providing proper results.
Big Data Use Cases
Big data use cases span various industries and include applications
like predictive maintenance, fraud detection, customer analytics, supply chain
optimization, and risk management.
1. Tracking Customer Spending Habits and Shopping Behavior: In big retail stores (like Amazon, Walmart, Big Bazaar, etc.) the management team has to keep data on customers' spending habits (which products customers spend on, which brands they prefer, how frequently they spend), their shopping behavior and their most-liked products (so that those products can be kept in the store). Based on which products are searched for or sold the most, the production/collection rate of those products is fixed.
2. Virtual Personal Assistant Tools: Big data analysis helps virtual personal assistant tools (like Siri on Apple devices, Cortana on Windows, Google Assistant on Android) to answer the various questions asked by users. These tools track the user's location, local time, season and other contextual data in order to answer the question.
3. IoT:
• Manufacturing companies install IoT sensors in machines to collect operational data. By analyzing such data, it can be predicted how long a machine will work without any problem and when it will require repair, so that the company can act before the machine develops serious issues or goes completely down. Thus, the cost of replacing the whole machine can be saved.
• In the healthcare field, Big Data is making a significant contribution. Using big data tools, data regarding patient experience is collected and used by doctors to give better treatment. IoT devices can sense the symptoms of a probable coming disease in the human body and help prevent it through advance treatment.
Understanding Big Data Storage
Big data storage refers to systems designed to efficiently store, manage, and retrieve
massive datasets for analysis and decision-making. It addresses the challenges of
storing and processing large volumes, diverse formats, and rapidly changing
data. Big data storage solutions often utilize distributed architectures and
specialized technologies to handle the unique needs of big data.
MapReduce
MapReduce is essentially an algorithm, or a programming model, that works on top of the YARN framework. The major feature of MapReduce is that it performs distributed processing in parallel across a Hadoop cluster, which is what makes Hadoop so fast; when you are dealing with Big Data, serial processing is no longer of any use.
MapReduce mainly has 2 tasks, which are divided phase-wise (a small word-count sketch follows the description of the Reduce task below):
Map Task:
• RecordReader: The purpose of the RecordReader is to break the input into records. It is responsible for providing key-value pairs to the Map() function: the key is the record's locational information and the value is the data associated with it.
• Map: A map is a user-defined function whose job is to process the tuples obtained from the RecordReader. The Map() function may generate no key-value pair at all, or multiple pairs, from these tuples.
• Combiner: The Combiner is used for grouping the data in the Map workflow. It is similar to a local reducer. The intermediate key-value pairs generated in the Map phase are combined with the help of this combiner. Using a combiner is optional.
• Partitioner: The Partitioner is responsible for fetching the key-value pairs generated in the Mapper phase. The partitioner generates the shards corresponding to each reducer. The hash code of each key is fetched by the partitioner, which then takes its modulus with the number of reducers: key.hashCode() % (number of reducers) (see the sketch below).
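The hash-modulo idea behind partitioning can be illustrated with a short Python sketch (this is only an illustration of the idea, not Hadoop's actual Java partitioner):

def partition(key, num_reducers):
    # Route a key to a reducer the way a hash partitioner does:
    # hash code of the key modulo the number of reducers.
    return hash(key) % num_reducers

# Every occurrence of the same key lands on the same reducer,
# so its values can later be aggregated together.
for word in ["big", "data", "big", "hadoop"]:
    print(word, "-> reducer", partition(word, 3))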
Reduce Task
• Shuffle and Sort: The Task of Reducer starts with this step, the process in
which the Mapper generates the intermediate key-value and transfers them
to the Reducer task is known as Shuffling. Using the Shuffling process the
system can sort the data using its key value.
Shuffling begins once some of the map tasks are done, which is why it is a faster process: it does not wait for the completion of all the tasks performed by the Mapper.
• Reduce: The main task of Reduce is to gather the tuples generated by Map and then perform sorting and aggregation on those key-value pairs, depending on their key element.
• OutputFormat: Once all the operations are performed, the key-value pairs
are written into the file with the help of record writer, each record in a new
line, and the key and value in a space-separated manner.
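To make the Map → Shuffle/Sort → Reduce flow concrete, here is a small, self-contained Python word-count sketch. It only simulates the phases in a single process (an assumption made for illustration; a real job would run distributed over a Hadoop cluster):

from collections import defaultdict

def map_phase(line):
    # Map: emit an intermediate (key, value) pair for every word.
    for word in line.split():
        yield (word.lower(), 1)

def shuffle_sort(pairs):
    # Shuffle and sort: group all values that share the same key.
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return sorted(groups.items())

def reduce_phase(key, values):
    # Reduce: aggregate the grouped values for each key.
    return key, sum(values)

lines = ["big data needs big storage", "map reduce processes big data"]
intermediate = [pair for line in lines for pair in map_phase(line)]
for key, values in shuffle_sort(intermediate):
    print(reduce_phase(key, values))   # e.g. ('big', 3), ('data', 2), ...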
YARN
YARN also allows different data processing engines like graph processing, interactive
processing, stream processing as well as batch processing to run and process data stored in
HDFS (Hadoop Distributed File System) thus making the system much more efficient.
Through its various components, it can dynamically allocate various resources and schedule
the application processing. For large volume data processing, it is quite necessary to
manage the available resources properly so that every application can leverage them.
MapReduce Architecture
MapReduce and HDFS are the two major components of Hadoop that make it so powerful and efficient to use. MapReduce is a programming model used for efficient parallel processing over large data sets in a distributed manner. The data is first split and then combined to produce the final result. Libraries for MapReduce have been written in many programming languages, with various optimizations. The purpose of MapReduce in Hadoop is to map each job and then reduce it to equivalent tasks, providing less overhead over the cluster network and reducing the processing power required. The MapReduce task is mainly divided into two phases: the Map phase and the Reduce phase.
Components of MapReduce Architecture:
1. Client: The MapReduce client is the one who brings the Job to the
MapReduce for processing. There can be multiple clients available that
continuously send jobs for processing to the Hadoop MapReduce Manager.
2. Job: The MapReduce job is the actual work the client wants done; it is composed of many smaller tasks that the client wants to process or execute.
3. Hadoop MapReduce Master: It divides the particular job into subsequent
job-parts.
4. Job-Parts: The tasks or sub-jobs obtained after dividing the main job. The results of all the job-parts are combined to produce the final output.
5. Input Data: The data set that is fed to MapReduce for processing.
6. Output Data: The final result obtained after processing.
In MapReduce, we have a client. The client submits a job of a particular size to the Hadoop MapReduce Master. The MapReduce Master then divides this job into further equivalent job-parts. These job-parts are then made available to the Map and Reduce tasks. The Map and Reduce tasks contain the program written for the particular use case the company is solving; the developer writes the logic needed to fulfill that requirement. The input data is fed to the Map task, and the Map generates intermediate key-value pairs as its output. The output of the Map, i.e. these key-value pairs, is then fed to the Reducer, and the final output is stored on HDFS. There can be any number of Map and Reduce tasks made available for processing the data, as the requirement dictates. The algorithms for Map and Reduce are written in an optimized way so that time and space complexity are kept to a minimum.
Let's discuss the MapReduce phases to get a better understanding of its architecture:
The MapReduce task is mainly divided into 2 phases i.e. Map phase and Reduce phase.
1. Map: As the name suggests, its main use is to map the input data into key-value pairs. The input to the Map may itself be a key-value pair, where the key can be the id of some kind of address and the value is the actual data it holds. The Map() function is executed in its memory repository on each of these input key-value pairs and generates the intermediate key-value pairs that work as input for the Reducer or Reduce() function.
2. Reduce: The intermediate key-value pairs that work as input for the Reducer are shuffled, sorted and sent to the Reduce() function. The Reducer aggregates or groups the data based on its key-value pairs as per the reducer algorithm written by the developer.
How Job tracker and the task tracker deal with MapReduce:
1. Job Tracker: The work of Job tracker is to manage all the resources and all
the jobs across the cluster and also to schedule each map on the Task Tracker
running on the same data node since there can be hundreds of data nodes
available in the cluster.
2. Task Tracker: The Task Tracker can be considered as the actual slaves that
are working on the instruction given by the Job Tracker. This Task Tracker is
deployed on each of the nodes available in the cluster that executes the Map
and Reduce task as instructed by Job Tracker.
Unit II
Advanced Analytical Theory and Methods
Applications:
• Fraud Detection:
Advanced analytics can be used to identify fraudulent activities by analyzing
transaction patterns.
• Marketing:
Predictive analytics can help businesses develop targeted marketing campaigns
and improve customer engagement.
• Supply Chain Management:
By analyzing data, businesses can optimize inventory levels, reduce costs, and
improve efficiency.
• Healthcare:
Advanced analytics can be used to improve patient outcomes, predict disease
outbreaks, and personalize treatment plans.
• Financial Services:
Advanced analytics can be used to assess credit risk, detect anomalies, and
improve financial decision-making.
Overview of clustering
Key Concepts:
• Unsupervised Learning:
Clustering is a type of unsupervised learning, meaning it doesn't rely on labeled
data to train the model.
• Similarity and Dissimilarity:
Clustering algorithms rely on measuring the similarity or dissimilarity between
data points, often using distance metrics like Euclidean distance or cosine
similarity.
• Purpose:
Clustering is used for various purposes, including:
• Exploratory Data Analysis: Identifying natural groupings and trends in
data.
• Data Reduction: Simplifying large datasets by grouping similar data
points into clusters, reducing the number of features.
• Anomaly Detection: Identifying data points that are far from any cluster,
potentially indicating outliers or anomalies.
• Types of Clustering Algorithms:
• K-Means: A popular centroid-based algorithm that assigns data points to
clusters based on their distance to cluster centers (centroids).
• Hierarchical Clustering: Builds a hierarchy of clusters, starting with each
point as a separate cluster and iteratively merging the closest clusters.
• Density-Based Spatial Clustering of Applications with Noise
(DBSCAN): Groups data points based on their density, identifying clusters
as dense regions of points.
• Hard vs. Soft Clustering:
• Hard Clustering: Assigns each data point to exactly one cluster.
• Soft Clustering: Allows data points to belong to multiple clusters with
varying degrees of membership.
• Applications:
Clustering finds applications in various fields, including:
• Marketing: Customer segmentation.
K-Means
K-means clustering is an iterative process that minimizes the sum of distances between the data points and their cluster centroids. The k-means clustering algorithm operates by categorizing data points into clusters using a mathematical distance measure, usually Euclidean distance, from the cluster center.
The algorithm works by first randomly picking some central points
called centroids and each data point is then assigned to the closest centroid
forming a cluster. After all the points are assigned to a cluster the centroids
are updated by finding the average position of the points in each cluster.
This process repeats until the centroids stop changing, at which point the final clusters are formed. The goal of clustering is to divide the data points into clusters so that similar data points belong to the same group.
The algorithm will categorize the items into k groups or clusters of similarity.
To calculate that similarity we will use the Euclidean distance as a
measurement. The algorithm works as follows:
1. First we randomly initialize k points called means or cluster
centroids.
2. We categorize each item to its closest mean and we update the
mean's coordinates, which are the averages of the items
categorized in that cluster so far.
3. We repeat the process for a given number of iterations and at the
end, we have our clusters.
The "points" mentioned above are called means because they are the mean
values of the items categorized in them. To initialize these means, we have a
lot of options. An intuitive method is to initialize the means at random items
in the data set. Another method is to initialize the means at random values
between the boundaries of the data set. For example, if a feature x takes values in [0,3], we will initialize the means with values for x in [0,3].
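A minimal NumPy sketch of the k-means loop just described (random initialization at items of the data set, assignment to the closest mean by Euclidean distance, then updating each mean); the small 2-D data set below is made up purely for illustration:

import numpy as np

def kmeans(X, k, iterations=100, seed=0):
    rng = np.random.default_rng(seed)
    # 1. Initialize the k means at randomly chosen items of the data set.
    centroids = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(iterations):
        # 2. Categorize each item to its closest mean (Euclidean distance).
        distances = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        labels = distances.argmin(axis=1)
        # 3. Update each mean to the average of the items assigned to it.
        centroids = np.array([X[labels == j].mean(axis=0) if np.any(labels == j)
                              else centroids[j] for j in range(k)])
    return labels, centroids

X = np.array([[1.0, 2.0], [1.5, 1.8], [1.0, 0.6], [8.0, 8.0], [9.0, 11.0], [8.5, 9.0]])
labels, centroids = kmeans(X, k=2)
print(labels)      # cluster index for each point
print(centroids)   # the two cluster centers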
Use cases
Big data use cases span various industries and include applications
like predictive maintenance, fraud detection, customer analytics, supply
chain optimization, and risk management. These use cases leverage the
analysis of large, complex datasets to gain insights and make informed
decisions.
• Government:
Using big data for emergency response, crime prevention, and smart city initiatives.
We have used the Mall Customer dataset, which can be found online. The dataset has 200 rows and 5 columns, and it has no null values.
Let us extract two columns, namely 'Annual Income (k$)' and 'Spending Score (1-100)', for further processing.
from sklearn.cluster import KMeans

wcss = {}
limit = 10  # try k = 2 .. 10
for k in range(2, limit + 1):
    model = KMeans(n_clusters=k, n_init=10)
    model.fit(dataset_new)        # dataset_new holds the two extracted columns
    wcss[k] = model.inertia_      # within-cluster sum of squares (WCSS)
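The notes then refer to a plot of these WCSS values; a minimal sketch of that elbow plot, assuming matplotlib is available:

import matplotlib.pyplot as plt

plt.plot(list(wcss.keys()), list(wcss.values()), marker="o")
plt.xlabel("Number of clusters k")
plt.ylabel("WCSS (inertia)")
plt.title("Elbow method for choosing k")
plt.show()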
Through the above plot, we can observe that the turning point of this curve
is at the value of k = 5. Therefore, we can say that the 'right' number of
clusters for this data is 5.
Decision Trees
A decision tree is a simple diagram that shows different choices and their possible results, helping you make decisions easily. This section covers what decision trees are, how they work, their advantages and disadvantages, and their applications.
Understanding Decision Tree
A decision tree is a graphical representation of different options for solving a problem, and it shows how different factors are related. It has a hierarchical tree structure that starts with one main question at the top, called a node, which further branches out into different possible outcomes, where:
• Root Node is the starting point that represents the entire dataset.
• Branches: These are the lines that connect nodes. They show the flow from one decision to another.
• Internal Nodes are points where decisions are made based on the input features.
• Leaf Nodes: These are the terminal nodes at the ends of branches that represent the final outcomes.
Decision trees also support decision-making by visualizing outcomes: you can quickly evaluate and compare the "branches" to determine which course of action is best for you.
Now, let's take an example to understand decision trees. Imagine you want to decide whether to drink coffee based on the time of day and how tired you feel. First the tree checks the time of day; if it's morning, it asks whether you are tired. If you're tired, the tree suggests drinking coffee; if not, it says there's no need. Similarly, in the afternoon the tree again asks whether you are tired: if you are, it recommends drinking coffee; if not, it concludes no coffee is needed.
We have mainly two types of decision tree, based on the nature of the target variable: classification trees and regression trees.
Classification trees: They are designed to predict categorical outcomes, meaning they classify data into different classes. For example, they can determine whether an email is "spam" or "not spam" based on various features of the email.
Regression trees: These are used when the target variable is continuous. They predict numerical values rather than categories. For example, a regression tree can estimate the price of a house based on its size, location, and other features (a small classification example follows below).
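As a concrete illustration of a classification tree, here is a minimal scikit-learn sketch on a tiny made-up dataset (the feature values and labels are hypothetical, chosen only to show the idea):

from sklearn.tree import DecisionTreeClassifier, export_text

# Toy data: [contains_link, num_exclamations] -> "spam" / "not spam"
X = [[1, 5], [1, 3], [0, 0], [0, 1], [1, 0], [0, 4]]
y = ["spam", "spam", "not spam", "not spam", "not spam", "spam"]

tree = DecisionTreeClassifier(max_depth=2, random_state=0).fit(X, y)
print(export_text(tree, feature_names=["contains_link", "num_exclamations"]))
print(tree.predict([[1, 2]]))   # classify a new, unseen example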
Decision Tree Algorithms
1. ID3
ID3 selects the splitting attribute using entropy and information gain, defined as:
H(D) = −Σ (p_i · log₂ p_i), summed over the classes i = 1 … n
Information Gain = H(D) − Σ (|D_v| / |D|) · H(D_v), summed over the splits v = 1 … V
ID3 recursively splits the dataset using the feature with the highest
information gain until all examples in a node belong to the same class or no
features remain to split. After the tree is constructed, it prunes branches that do not significantly improve accuracy, to reduce overfitting. Even so, ID3 tends to overfit the training data and cannot directly handle continuous attributes.
These issues are addressed by other algorithms like C4.5 and CART.
2. C4.5
C4.5 uses a modified version of information gain called the gain ratio to reduce the
bias towards features with many values. The gain ratio is computed by dividing
the information gain by the intrinsic information which measures the amount of
data required to describe an attribute’s values:
Gain Ratio = Information Gain / Intrinsic (Split) Information
C4.5 has limitations:
• It can be prone to overfitting, especially on noisy datasets, even when it uses pruning techniques.
• Performance may degrade when dealing with datasets that have
many features.
3. CART (Classification and Regression Trees)
CART is a widely used decision tree algorithm that is used
for classification and regression tasks.
• For classification, CART splits data based on the Gini impurity, which measures the likelihood that a randomly selected data point would be incorrectly classified. The feature that minimizes the Gini impurity is selected for splitting at each node. The formula is:
Gini(D) = 1 − Σ p_i², summed over the classes i = 1 … n
4. CHAID (Chi-square Automatic Interaction Detection)
CHAID selects splits using the chi-squared statistic:
χ² = Σ (O_i − E_i)² / E_i
Where:
• O_i represents the observed frequency
• E_i represents the expected frequency in each category.
5. MARS (Multivariate Adaptive Regression Splines)
MARS is an extension of the CART algorithm. It uses splines to model non-
linear relationships between variables. It constructs a piecewise linear
model where the relationship between the input and output variables is
linear but with variable slopes at different points, known as knots. It
automatically selects and positions these knots based on the data
distribution and the need to capture non-linearities.
Basis Functions: Each basis function in MARS is a simple linear function
defined over a range of the predictor variable. The function is described as:
h(x) = x − t if x > t, and h(x) = t − x if x ≤ t
Where
• x is a predictor variable
• t is the knot.
Knot Function: The knots are the points where the piecewise linear
functions connect. MARS places these knots to best represent the data's
non-linear structure.
6. Conditional Inference Trees
Conditional Inference Trees uses statistical tests to choose splits based on
the relationship between features and the target variable. It uses permutation tests to select the feature that best splits the data while minimizing bias.
The algorithm follows a recursive approach. At each node it evaluates the
statistical significance of potential splits using tests like the Chi-squared
test for categorical features and the F-test for continuous features. The
feature with the strongest relationship to the target is selected for the split.
The process continues until the data cannot be further split or meets
predefined stopping criteria.
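The split criteria used by these algorithms (entropy, information gain and Gini impurity) can be computed directly from class proportions. A minimal Python sketch, assuming the data is given simply as lists of class labels:

import math
from collections import Counter

def entropy(labels):
    # H(D) = -sum(p_i * log2(p_i)) over the class proportions p_i.
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def gini(labels):
    # Gini(D) = 1 - sum(p_i^2).
    n = len(labels)
    return 1 - sum((c / n) ** 2 for c in Counter(labels).values())

def information_gain(parent, subsets):
    # H(D) minus the size-weighted entropy of the child subsets D_v.
    n = len(parent)
    return entropy(parent) - sum(len(s) / n * entropy(s) for s in subsets)

parent = ["yes", "yes", "yes", "no", "no", "no"]
left, right = ["yes", "yes", "yes"], ["no", "no", "no"]   # a perfect split
print(entropy(parent), gini(parent), information_gain(parent, [left, right]))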
Evaluating a Decision Tree
Evaluation Metrics:
Accuracy:
Measures the overall correctness of the model by comparing predicted values with
actual values.
Precision:
Measures the ability of the model to correctly identify positive cases among those
predicted as positive.
Recall:
Measures the ability of the model to correctly identify all positive cases.
F1-score:
The harmonic mean of precision and recall, providing a balance between both metrics.
Cross-validation:
Splits the data into multiple folds and trains the model on different combinations of
these folds to assess its performance on unseen data.
Pruning:
Removes branches or nodes from the decision tree to simplify it and improve its ability
to generalize to new data. This can be achieved using techniques like cost complexity
pruning.
Ensemble methods:
Utilizing multiple decision trees to create a more robust and accurate model, such as
random forests or gradient boosting.
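A minimal scikit-learn sketch of evaluating a decision tree with these metrics and cross-validation; the syllabus does this in R (for example with caret), and the Python workflow below is only an analogous illustration:

from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

X, y = load_breast_cancer(return_X_y=True)
tree = DecisionTreeClassifier(max_depth=3, random_state=0)

# 5-fold cross-validation with several of the metrics described above.
for metric in ["accuracy", "precision", "recall", "f1"]:
    scores = cross_val_score(tree, X, y, cv=5, scoring=metric)
    print(metric, round(scores.mean(), 3))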
R Packages:
Caret: A package that simplifies model training and evaluation, including cross-
validation and hyperparameter tuning.
By employing these evaluation metrics, techniques, and R packages, you can assess
the performance of decision trees in R and ensure that they are providing accurate and
robust predictions.
Naive Bayes
Key Concepts:
• Supervised Learning:
Naive Bayes is a supervised learning algorithm, meaning it learns from labeled
data to make predictions.
• Bayes' Theorem:
The core principle behind Naive Bayes is Bayes' Theorem, which calculates the
probability of an event based on prior knowledge.
• Conditional Independence:
A key assumption in Naive Bayes is that the presence of one feature does not
affect the presence of another feature. This is where the "naive" part of the
name comes in.
• Probabilistic Classifier:
Naive Bayes is a probabilistic classifier, meaning it assigns probabilities to
different classes and predicts the most likely class for a given input.
Bayes theorem
For any two events A and B, Bayes's formula for the Bayes theorem is given by:
P(A|B) = P(B|A) · P(A) / P(B)
Where,
• P(A) and P(B) are the probabilities of events A and B, also, P(B) is
never equal to zero.
• P(A|B) is the probability of event A when event B happens,
• P(B|A) is the probability of event B when A happens.
Bayes Theorem Statement
Bayes's Theorem for n sets of events is defined as,
Let E1, E2,…, En be a set of events associated with the sample space S, in
which all the events E1, E2,…, En have a non-zero probability of occurrence.
All the events E1, E2, …, En form a partition of S. Let A be an event from space S for which we have to find the probability; then, according to Bayes theorem,
P(Ei | A) = P(Ei) · P(A | Ei) / Σ_{k=1..n} P(Ek) · P(A | Ek)
for i = 1, 2, 3, …, n
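As a small numeric illustration of this formula, the sketch below computes the posteriors P(Ei | A) from made-up priors and likelihoods (the numbers are purely hypothetical):

# Hypothetical partition E1, E2, E3 with priors P(Ei) and likelihoods P(A | Ei).
priors = [0.5, 0.3, 0.2]
likelihoods = [0.1, 0.4, 0.7]

evidence = sum(p * l for p, l in zip(priors, likelihoods))            # P(A)
posteriors = [p * l / evidence for p, l in zip(priors, likelihoods)]  # P(Ei | A)
print(posteriors)   # the three posteriors sum to 1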
Objectives:
• Bayes' Theorem:
The foundation of the algorithm, used to calculate the probability of a
hypothesis (or class) given the evidence (or features).
• Conditional Independence:
The core assumption that features are independent of each other given the
class label.
• Probabilistic Classification:
The algorithm predicts the class with the highest probability for a given input.
How it works (a short code sketch follows these steps):
1. Training: The algorithm learns the probability distribution of each feature given each class label from the training data.
2. Prediction: For a new input, the algorithm calculates the probability of each class given the input features, based on the learned probabilities.
3. Classification: The algorithm assigns the input to the class with the highest calculated probability.
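A minimal scikit-learn sketch of this train/predict/classify workflow using Gaussian Naive Bayes; the toy numbers and labels are made up for illustration (word-count features with MultinomialNB would be another common choice):

from sklearn.naive_bayes import GaussianNB

# Toy training data: two numeric features per sample and a class label.
X_train = [[5.0, 1.0], [4.5, 0.8], [1.0, 3.5], [0.8, 4.0]]
y_train = ["spam", "spam", "ham", "ham"]

model = GaussianNB().fit(X_train, y_train)          # 1. Training
print(model.predict_proba([[4.0, 1.2]]))            # 2. Prediction: P(class | input)
print(model.predict([[4.0, 1.2]]))                  # 3. Classification: most likely class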
Advantages:
• Simple and fast: It's easy to implement and computationally efficient, making
it suitable for large datasets.
• Scalable: Can handle large datasets and many features.
• Handles both continuous and categorical data: Can be used with various data
types.
• Not sensitive to irrelevant features: Can ignore irrelevant data and maintain
good performance.
• Low false positive rate: A study on Naive Bayes spam filtering showed that
Naive Bayes can achieve low false positive rates in spam detection.
Disadvantages:
• The "naive" assumption: The assumption of feature independence is often not
true in real-world scenarios. This can lead to suboptimal performance,
especially when features are strongly correlated.
• Limited ability to model complex dependencies: It struggles to model
complex relationships between features.
Applications:
• Spam filtering: Used to classify emails as spam or not spam.
• Text classification: Used to categorize documents into different topics.
• Sentiment analysis: Used to determine the sentiment expressed in text, such
as positive or negative.
• Medical diagnosis: Used to help diagnose patients by predicting the
probability of different diseases.
• Face recognition: Used to identify faces or features like the nose, mouth, and
eyes.