UNIT 5 NOTES DWM
There are two forms of data analysis that can be used to extract models
describing important classes or to predict future data trends. These two
forms are as follows:
1. Classification
2. Prediction
The following decision tree is for the concept buy_computer that indicates
whether a customer at a company is likely to buy a computer or not. Each
internal node represents a test on an attribute. Each leaf node represents
a class.
Input:
Data partition, D, which is a set of training tuples
and their associated class labels.
attribute_list, the set of candidate attributes.
Attribute_selection_method, a procedure to determine the
splitting criterion that best partitions the data
tuples into individual classes. This criterion includes a
splitting_attribute and either a split-point or a splitting
subset.
Output:
A Decision Tree
Method:
create a node N;
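The pseudocode above shows only the first step of tree induction. As a hedged illustration (a sketch that is not the notes' algorithm), the following Python snippet induces a buys_computer-style decision tree with scikit-learn; the training tuples are made up, and the "entropy" criterion plays the role of the attribute selection method (information gain).

# A minimal sketch (assumed, not the notes' algorithm): inducing a decision tree
# for a buys_computer-style concept with scikit-learn. The tuples are invented.
import pandas as pd
from sklearn.tree import DecisionTreeClassifier, export_text

data = pd.DataFrame({
    "age":     ["youth", "youth", "middle_aged", "senior", "senior"],
    "student": ["no",    "yes",   "no",          "yes",    "no"],
    "buys":    ["no",    "yes",   "yes",         "yes",    "no"],
})

X = pd.get_dummies(data[["age", "student"]])  # one-hot encode the candidate attributes
y = data["buys"]

tree = DecisionTreeClassifier(criterion="entropy")  # attribute selection via information gain
tree.fit(X, y)
print(export_text(tree, feature_names=list(X.columns)))  # each root-to-leaf path is a rule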
Bayesian Classification
Bayesian classification in data mining is a statistical technique used to classify data based on
probabilistic reasoning. It is a type of probabilistic classification that uses Bayes' theorem to
predict the probability of a data point belonging to a certain class. The Bayesian
classification is a powerful technique for probabilistic inference and decision-making and is
widely used in various applications such as medical diagnosis, spam classification, fraud
detection, etc.
Bayesian classification in data mining is a statistical approach to data classification that uses
Bayes' theorem to make predictions about a class of a data point based on observed data. It is
a popular data mining and machine learning technique for modelling the probability of
certain outcomes and making predictions based on that probability.
The basic idea behind Bayesian classification in data mining is to assign a class label to a
new data instance based on the probability that it belongs to a particular class, given the
observed data. Bayes' theorem provides a way to compute this probability by multiplying the
prior probability of the class (based on previous knowledge or assumptions) by the likelihood
of the observed data given that class (conditional probability).
Several types of Bayesian classifiers exist, such as naive Bayes, Bayesian network classifiers,
Bayesian logistic regression, etc. Bayesian classification is preferred in many applications
because it allows for the incorporation of new data (just by updating the prior probabilities)
and can update the probabilities of class labels accordingly.
This is important when new data is constantly being collected, or the underlying distribution
may change over time. In contrast, other classification techniques, such as decision trees or
support vector machines, do not easily accommodate new data and may require re-training of
the entire model to incorporate new information. This can be computationally expensive and
time-consuming.
Bayesian classification is a powerful tool for data mining and machine learning and is widely
used in many applications, such as spam filtering, text classification, and medical diagnosis.
Its ability to incorporate prior knowledge and uncertainty makes it well-suited for real-world
problems where data is incomplete or noisy and accurate predictions are critical.
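As a concrete (assumed) illustration of the naive Bayes classifier mentioned above, the sketch below uses scikit-learn on a toy spam-filtering task; the emails and labels are invented purely for illustration.

# A minimal sketch (assumed, not from the notes): naive Bayes spam filtering.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB

emails = ["win money now", "limited offer win money",
          "meeting agenda today", "project status report"]
labels = ["spam", "spam", "ham", "ham"]

vec = CountVectorizer()
X = vec.fit_transform(emails)            # word-count features (the observed evidence E)
model = MultinomialNB().fit(X, labels)   # estimates priors P(H) and likelihoods P(E|H)

print(model.predict(vec.transform(["win a money offer"])))      # likely ['spam']
print(model.predict_proba(vec.transform(["project meeting"])))  # posterior probabilities per class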
Bayes' theorem is used in Bayesian classification in data mining, which is a technique for
predicting the class label of a new instance based on the probabilities of different class labels
and the observed features of the instance. In data mining, Bayes' theorem is used to compute
the probability of a hypothesis (such as a class label or a pattern in the data) given some
observed event (such as a set of features or attributes). It is named after Reverend Thomas
Bayes, an 18th-century British mathematician who first formulated it.
Bayes' theorem states that the probability of a hypothesis H given some observed event E is
proportional to the likelihood of the evidence given the hypothesis, multiplied by the prior
probability of the hypothesis, as shown below -
P(H∣E) = P(E∣H) ∗ P(H) / P(E)
where P(H∣E) is the posterior probability of the hypothesis given the event E,
P(E∣H) is the likelihood or conditional probability of the event given the
hypothesis, P(H) is the prior probability of the hypothesis, and P(E) is the
probability of the event.
Prior probability is a term used in probability theory and statistics that refers to the
probability of a hypothesis or event before any evidence or data is considered. It represents
our prior belief or expectation about the likelihood of a hypothesis or event based on
previous knowledge or assumptions.
For example, suppose we are interested in the probability of a certain disease in a population. Our
prior probability might be based on previous studies or epidemiological data and might be
relatively low if the disease is rare. As we collect data from medical tests or patient
symptoms, we can update our probability estimate using Bayes' theorem to reflect the new
evidence.
The posterior probability is a term used in Bayesian inference to refer to the updated
probability of a hypothesis, given some observed event or data. It is calculated using Bayes'
theorem, which combines the prior probability of the hypothesis with the likelihood of the
event to produce an updated or posterior probability.
The posterior probability is important in Bayesian inference because it reflects the latest
information about the hypothesis based on the observed data. It can be used to make
decisions or predictions and can be updated further as new data becomes available.
Formula Derivation
Bayes' theorem is derived from the definition of conditional probability. The conditional
probability of an event E given a hypothesis H is defined as the joint probability of E and H,
divided by the probability of H, as shown below -
P(E∣H) = P(E∩H) / P(H)
We can rearrange this equation to solve for the joint probability of E and H -
P(E∩H) = P(E∣H) ∗ P(H)
Similarly, we can use the definition of conditional probability to write the conditional
probability of H given E, as shown below -
P(H∣E) = P(H∩E) / P(E)
P(H∩E) = P(E∩H)
We can substitute the expression for P(H∩E) from the first equation into the second equation
to obtain -
P(H∣E) = P(E∣H) ∗ P(H) / P(E)
This is the formula for Bayes' theorem for hypothesis H and event E. It states that the
probability of hypothesis H given event E is proportional to the likelihood of the event given
the hypothesis, multiplied by the prior probability of the hypothesis, and divided by the
probability of the event.
Bayes' theorem or Bayesian classification in data mining has a wide range of applications in
many fields, including statistics, machine learning, artificial intelligence, natural language
processing, medical diagnosis, image and speech recognition, and more. Here are some
examples of its applications -
Spam filtering - Bayes' theorem is commonly used in email spam filtering, where it
helps to identify emails that are likely to be spam based on the text content and
other features.
Medical diagnosis - Bayes' theorem can be used to diagnose medical conditions
based on the observed symptoms, test results, and prior knowledge about the
prevalence and characteristics of the disease.
Risk assessment - Bayes' theorem can be used to assess the risk of events such as
accidents, natural disasters, or financial market fluctuations based on historical data
and other relevant factors.
Natural language processing - Bayes' theorem can be used to classify documents,
sentiment analysis, and topic modeling in natural language processing applications.
Recommendation systems - Bayes' theorem can be used in recommendation
systems like e-commerce websites to suggest products or services to users based on
their previous behavior and preferences.
Fraud detection - Bayes' theorem can be used to detect fraudulent behavior, such as
credit card or insurance fraud, by analyzing patterns of transactions and other data.
Examples
Problem - Suppose a medical test for a certain disease has a false positive rate of 5% and a
false negative rate of 2%. If a person has the disease, there is a 2% chance that the test will
come back negative; if a person does not, there is a 5% chance that the test will come back
positive. Suppose the disease affects 1% of the population. If a person tests positive for the
disease, what is the probability that they have the disease?
Solution - To solve this problem using Bayes' theorem, we can start by defining two events:
D, the person has the disease, and T, the test result is positive.
We are interested in the probability of event D given the event T, which we can write
as P(D∣T). Using Bayes' theorem, we can write -
P(D∣T) = P(T∣D) ∗ P(D) / P(T)
The first term on the right-hand side of the equation is the probability of a positive test result
given that the person has the disease, which we can calculate as -
P(T∣D) = 1 − 0.02 = 0.98
(2% is the false negative rate, which means that if a person has the disease, there is a 2%
chance that the test will come back negative.)
The second term is the prior probability of the person having the disease, which is given
as 1%, i.e., P(D) = 0.01.
The third term is the probability of a positive test result, which we can calculate using the law
of total probability, as shown below -
P(T) = P(T∣D) ∗ P(D) + P(T∣¬D) ∗ P(¬D) = 0.98 ∗ 0.01 + 0.05 ∗ 0.99 = 0.0593
Substituting these values into Bayes' theorem gives -
P(D∣T) = 0.98 ∗ 0.01 / 0.0593 ≈ 0.1652
So the probability that a person has the disease, given that they test positive for it, is
approximately 16.52%. Because the disease is rare (only 1% prevalence), even this fairly
accurate test produces many false alarms: a positive test result is not a guarantee of having
the disease, and further testing or confirmation may be necessary.
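For reference, here is the same calculation as a short Python sketch (the numbers are taken from the example above):

# Reproducing the worked example numerically (numbers from the notes above).
p_d       = 0.01        # prior P(D): disease prevalence
p_t_d     = 1 - 0.02    # P(T|D): 1 minus the false negative rate
p_t_not_d = 0.05        # P(T|not D): the false positive rate

p_t = p_t_d * p_d + p_t_not_d * (1 - p_d)   # law of total probability: P(T) = 0.0593
posterior = p_t_d * p_d / p_t               # Bayes' theorem: P(D|T)

print(f"P(T) = {p_t:.4f}, P(D|T) = {posterior:.4f}")   # P(T) = 0.0593, P(D|T) ≈ 0.165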
IF-THEN Rules
Points to remember −
The IF part of the rule is called rule antecedent or precondition.
The THEN part of the rule is called rule consequent.
The antecedent part (the condition) consists of one or more attribute tests,
and these tests are logically ANDed.
The consequent part consists of class prediction.
Note − Rule R1 (IF age = youth AND student = yes THEN buys_computer = yes) can also be
written as follows −
R1: (age = youth) ^ (student = yes) → (buys_computer = yes)
If the condition holds true for a given tuple, then the antecedent is
satisfied.
Rule Extraction
Points to remember −
One rule is created for each path from the root to the leaf node.
To form a rule antecedent, each splitting criterion is logically ANDed.
The leaf node holds the class prediction, forming the rule consequent.
Some of the sequential covering algorithms are AQ, CN2, and RIPPER. As
per the general strategy, the rules are learned one at a time. Each time
a rule is learned, the tuples covered by that rule are removed and the process
continues for the rest of the tuples. This contrasts with decision tree
induction, where the path to each leaf corresponds to a rule, so the whole
rule set is effectively learned simultaneously.
In the sequential learning approach, rules are learned for one class at a
time, as sketched in the code below. When learning a rule for a class Ci, we want the
rule to cover all the tuples from class Ci only and no tuple from any other
class.
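The following rough Python sketch (an assumed simplification, not the AQ, CN2, or RIPPER algorithms themselves) shows the general loop: learn one single-test rule for the target class, remove the tuples it covers, and repeat.

# A rough sketch of sequential covering. Each rule here is a single
# attribute = value test; real algorithms grow conjunctions of tests.
def learn_one_rule(tuples, target_class, attributes):
    """Greedily pick the attribute = value test whose covered tuples are purest in target_class."""
    best_test, best_acc = None, 0.0
    for attr in attributes:
        for value in {t[attr] for t in tuples}:
            covered = [t for t in tuples if t[attr] == value]
            acc = sum(t["class"] == target_class for t in covered) / len(covered)
            if acc > best_acc:
                best_test, best_acc = (attr, value), acc
    return best_test

def sequential_covering(tuples, target_class, attributes):
    rules, remaining = [], list(tuples)
    while any(t["class"] == target_class for t in remaining):
        rule = learn_one_rule(remaining, target_class, attributes)
        if rule is None:
            break
        rules.append(rule)
        attr, value = rule
        remaining = [t for t in remaining if t[attr] != value]  # remove covered tuples
    return rules

# Tiny made-up example: learn rules for class "yes".
data = [
    {"age": "youth",  "student": "yes", "class": "yes"},
    {"age": "youth",  "student": "no",  "class": "no"},
    {"age": "senior", "student": "no",  "class": "yes"},
]
print(sequential_covering(data, "yes", ["age", "student"]))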
Classification by Backpropagation
Backpropagation is a neural-network learning algorithm. The inputs are first fed forward
through the network to produce an output, and the difference between this output and the
target value gives the error. The error is then propagated backwards through the network,
with each neuron in the network adjusting its weights and biases based on its contribution
to the error. This is done using a gradient descent algorithm, where the weights and biases
are adjusted in the direction that reduces the error.
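As a hedged illustration (an assumed toy example, not from the notes), here is a single sigmoid neuron trained with gradient descent: the forward pass computes the output, the error gradient is propagated back, and the weight and bias updates move in the direction that reduces the error.

# A minimal backpropagation/gradient-descent sketch for one sigmoid neuron.
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

x = np.array([0.5, 0.1])    # input features
t = 1.0                     # target output
w = np.array([0.2, -0.3])   # weights
b = 0.1                     # bias
lr = 0.5                    # learning rate

for _ in range(100):
    y = sigmoid(w @ x + b)      # forward pass
    error = y - t               # output error
    grad = error * y * (1 - y)  # gradient propagated back through the sigmoid
    w -= lr * grad * x          # adjust weights against the gradient
    b -= lr * grad              # adjust bias

print(round(float(sigmoid(w @ x + b)), 3))   # output has moved toward the target 1.0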
Support Vector Machine or SVM is one of the most popular Supervised Learning
algorithms, which is used for Classification as well as Regression problems. However,
primarily, it is used for Classification problems in Machine Learning.
The goal of the SVM algorithm is to create the best line or decision boundary that can
segregate n-dimensional space into classes so that we can easily put the new data point
in the correct category in the future. This best decision boundary is called a hyperplane.
SVM chooses the extreme points/vectors that help in creating the hyperplane. These
extreme cases are called support vectors, and hence the algorithm is termed a Support
Vector Machine. Consider the below diagram, in which two different categories are
classified using a decision boundary or hyperplane:
Types of SVM
SVM can be of two types:
o Linear SVM: Linear SVM is used for linearly separable data, which means that if a
dataset can be classified into two classes by using a single straight line, then such
data is termed linearly separable data, and the classifier used is called a Linear
SVM classifier.
o Non-linear SVM: Non-linear SVM is used for non-linearly separable data, which
means that if a dataset cannot be classified by using a straight line, then such data is
termed non-linear data, and the classifier used is called a Non-linear SVM
classifier.
The dimension of the hyperplane depends on the number of features in the dataset: if there
are 2 features (as shown in the image), the hyperplane is a straight line, and if there are
3 features, the hyperplane is a 2-dimensional plane.
We always create the hyperplane that has the maximum margin, which means the maximum
distance between the hyperplane and the nearest data points of either class.
Support Vectors:
The data points or vectors that are closest to the hyperplane and which affect the
position of the hyperplane are termed support vectors. Since these vectors support the
hyperplane, they are called support vectors.
The working of the SVM algorithm can be understood by using an example. Suppose we
have a dataset that has two tags (green and blue), and the dataset has two features x1
and x2. We want a classifier that can classify the pair (x1, x2) of coordinates as either
green or blue. Consider the below image:
As this is a 2-d space, we can easily separate these two classes by just using a straight
line. But there can be multiple lines that can separate these classes. Consider the
below image:
Hence, the SVM algorithm helps to find the best line or decision boundary; this best
boundary is called a hyperplane. The SVM algorithm finds the closest points of the two
classes to the boundary. These points are called support vectors. The distance between
these vectors and the hyperplane is called the margin, and the goal of SVM is to
maximize this margin. The hyperplane with the maximum margin is called the optimal
hyperplane.
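A small sketch (assumed, not from the notes) of a linear SVM with scikit-learn, which fits exactly this maximum-margin hyperplane and exposes the support vectors; the points are invented:

# Linear SVM on two made-up 2-D classes.
import numpy as np
from sklearn.svm import SVC

X = np.array([[1, 2], [2, 3], [2, 1],     # class 0 ("blue")
              [6, 5], [7, 7], [8, 6]])    # class 1 ("green")
y = np.array([0, 0, 0, 1, 1, 1])

clf = SVC(kernel="linear")   # linear SVM classifier
clf.fit(X, y)

print(clf.support_vectors_)           # the extreme points that define the hyperplane
print(clf.predict([[3, 2], [7, 6]]))  # classify new (x1, x2) points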
Non-Linear SVM:
If data is linearly arranged, then we can separate it by using a straight line, but for non-
linear data, we cannot draw a single straight line. Consider the below image:
So to separate these data points, we need to add one more dimension. For linear data,
we have used two dimensions x and y, so for non-linear data, we will add a third
dimension z. It can be calculated as:
z = x² + y²
By adding the third dimension, the sample space will become as below image:
So now, SVM will divide the datasets into classes in the following way. Consider the below
image:
Since we are in 3-d space, the separating boundary looks like a plane parallel to the x-y
plane. If we convert it back to 2-d space by taking z = 1, it becomes a circle:
hence we get a circumference of radius 1 in the case of non-linear data.
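The sketch below (assumed, not from the notes) reproduces this idea: points on two concentric circles are not linearly separable in 2-d, but become separable once z = x² + y² is added as a third dimension; a kernel SVM performs this kind of lifting implicitly.

# Mapping non-linear data into a higher dimension versus using a kernel.
import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(0)
angles = rng.uniform(0, 2 * np.pi, 40)
inner = np.c_[0.5 * np.cos(angles), 0.5 * np.sin(angles)]   # points near the origin
outer = np.c_[2.0 * np.cos(angles), 2.0 * np.sin(angles)]   # points on a larger circle
X = np.vstack([inner, outer])
y = np.array([0] * 40 + [1] * 40)

z = (X ** 2).sum(axis=1)        # explicit third dimension z = x^2 + y^2
X3 = np.c_[X, z]
print(SVC(kernel="linear").fit(X3, y).score(X3, y))   # separable in 3-d

print(SVC(kernel="rbf").fit(X, y).score(X, y))        # the RBF kernel lifts the data implicitly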
🐢 Lazy Learners
Lazy learners are a type of learning algorithm that delays the generalization process until a query is
made. In simple terms, they do not build a model during training, but instead store the training
data and use it at prediction time.
How It Works:
Lazy learners memorize the training data. When a new instance needs to be classified, they
compare it with the stored instances (usually using a similarity or distance measure) and make a
prediction based on the closest matches.
Examples of Lazy Learners:
K-Nearest Neighbors (K-NN)
Case-Based Reasoning (CBR)
Advantages:
Simple to implement.
Can adapt to changes in data easily (just add new data).
No training phase → fast for updating.
Disadvantages:
Slow classification time (has to search through data for each prediction).
High memory usage (since all data is stored).
Sensitive to irrelevant or redundant features (especially K-NN).
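A brief sketch (assumed, not from the notes) of a lazy learner in practice: K-Nearest Neighbors stores the training tuples and defers all work to prediction time, when it compares the query point against the stored data using a distance measure. The points and labels are made up.

# K-NN as a lazy learner: "training" only stores the data.
import numpy as np
from sklearn.neighbors import KNeighborsClassifier

X_train = np.array([[1, 1], [1, 2], [6, 6], [7, 5]])
y_train = np.array(["A", "A", "B", "B"])

knn = KNeighborsClassifier(n_neighbors=3)
knn.fit(X_train, y_train)               # no model is built; the instances are memorized

print(knn.predict([[2, 1], [6, 5]]))    # distance search happens here -> ['A' 'B']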
In data mining, especially in classification and regression tasks, it’s essential to evaluate how
well a model performs. This is done using accuracy and error metrics that measure the
difference between predicted and actual values.
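For instance, two commonly used metrics can be computed as in this short sketch (assumed, not from the notes): accuracy for classification and mean squared error for numeric prediction; the labels and values are invented.

# Accuracy (classification) and mean squared error (regression/prediction).
from sklearn.metrics import accuracy_score, mean_squared_error

y_true_cls = ["Buy", "Buy", "Don't Buy", "Buy"]
y_pred_cls = ["Buy", "Don't Buy", "Don't Buy", "Buy"]
print(accuracy_score(y_true_cls, y_pred_cls))      # 0.75 -> 3 of 4 predictions correct

y_true_reg = [100.0, 150.0, 200.0]
y_pred_reg = [110.0, 145.0, 190.0]
print(mean_squared_error(y_true_reg, y_pred_reg))  # average squared difference = 75.0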
In rule-based classification, the class part (THEN) specifies the predicted class for instances
satisfying the condition (IF part).
Example
Let's say you are classifying customers as "Buy" or "Don't Buy" a product.
Rule 1:
IF Age > 25 AND Income = High THEN Class = Buy
Rule 2:
IF Age <= 25 AND Student = Yes THEN Class = Buy
Rule 3:
IF Income = Low THEN Class = Don’t Buy
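Expressed as a small Python sketch (the attribute names mirror the example above; note that rule order matters when rules overlap):

# The three rules above as an ordered rule list.
def classify(customer):
    if customer["age"] > 25 and customer["income"] == "High":   # Rule 1
        return "Buy"
    if customer["age"] <= 25 and customer["student"] == "Yes":  # Rule 2
        return "Buy"
    if customer["income"] == "Low":                             # Rule 3
        return "Don't Buy"
    return "Unknown"   # no rule fires -> fall back to a default class

print(classify({"age": 30, "income": "High", "student": "No"}))   # Buy
print(classify({"age": 22, "income": "Low",  "student": "Yes"}))  # Buy (Rule 2 fires before Rule 3)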
Advantages
Disadvantages