UNIT 5 NOTES DWM

The document discusses two primary forms of data analysis: classification and prediction, which are used to extract models for understanding data classes and predicting future trends. It explains decision tree induction, detailing the structure and benefits of decision trees, and introduces Bayesian classification, emphasizing its probabilistic approach and applications in various fields. Additionally, it covers rule-based classification and backpropagation for training neural networks, highlighting the importance of these techniques in data mining and machine learning.


Classification and Prediction

There are two forms of data analysis that can be used to extract models
describing important classes or to predict future data trends. These two
forms are as follows:

1. Classification
2. Prediction

We use classification and prediction to extract a model that represents
the data classes and can be used to predict future data trends. This kind
of analysis gives us a good understanding of the data at a large scale.
Classification models predict categorical class labels, and prediction
models predict continuous-valued functions. For example, we can
build a classification model to categorize bank loan applications as
either safe or risky or a prediction model to predict the expenditures
in dollars of potential customers on computer equipment given their
income and occupation.
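As a small, hedged illustration of this distinction (not part of the original notes), the sketch below uses scikit-learn (an assumed library choice) to fit a classifier that outputs a categorical label and a regressor that outputs a continuous value; the feature values and labels are invented.

# Classification predicts a categorical class label; prediction/regression
# predicts a continuous value. Data and feature meanings are made up.
from sklearn.tree import DecisionTreeClassifier, DecisionTreeRegressor

# Each row: [income_in_thousands, years_employed] (hypothetical features)
X = [[30, 1], [80, 10], [45, 3], [120, 15], [25, 1], [60, 7]]

# Classification model: loan application is "risky" or "safe"
y_class = ["risky", "safe", "risky", "safe", "risky", "safe"]
clf = DecisionTreeClassifier().fit(X, y_class)
print(clf.predict([[50, 4]]))       # outputs a class label ("risky" or "safe")

# Prediction model: expected expenditure in dollars (continuous value)
y_spend = [300, 1500, 600, 2200, 250, 1100]
reg = DecisionTreeRegressor().fit(X, y_spend)
print(reg.predict([[50, 4]]))       # outputs a continuous dollar value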

Decision Tree Induction


A decision tree is a structure that includes a root node, branches, and leaf
nodes. Each internal node denotes a test on an attribute, each branch
denotes the outcome of a test, and each leaf node holds a class label. The
topmost node in the tree is the root node.

The following decision tree is for the concept buy_computer that indicates
whether a customer at a company is likely to buy a computer or not. Each
internal node represents a test on an attribute. Each leaf node represents
a class.

The benefits of having a decision tree are as follows –

 It does not require any domain knowledge.


 It is easy to comprehend.
 The learning and classification steps of a decision tree are simple and fast.

Decision Tree Induction Algorithm

In 1980, the machine learning researcher J. Ross Quinlan developed a decision
tree algorithm known as ID3 (Iterative Dichotomiser). Later, he presented
C4.5, the successor of ID3. Both ID3 and C4.5 adopt a greedy approach:
there is no backtracking, and the trees are constructed in a top-down,
recursive, divide-and-conquer manner.

Generating a decision tree from the training tuples of data partition D

Algorithm: Generate_decision_tree

Input:
Data partition D, which is a set of training tuples
and their associated class labels.
attribute_list, the set of candidate attributes.
Attribute_selection_method, a procedure to determine the
splitting criterion that best partitions the data
tuples into individual classes. This criterion includes a
splitting_attribute and possibly either a split-point or a
splitting subset.

Output:
A decision tree

Method:
create a node N;

if tuples in D are all of the same class C then
    return N as a leaf node labeled with class C;

if attribute_list is empty then
    return N as a leaf node labeled with
    the majority class in D; // majority voting

apply Attribute_selection_method(D, attribute_list)
    to find the best splitting_criterion;
label node N with splitting_criterion;

if splitting_attribute is discrete-valued and
    multiway splits allowed then // not restricted to binary trees
    attribute_list = attribute_list - splitting_attribute; // remove splitting attribute

for each outcome j of splitting_criterion
    // partition the tuples and grow subtrees for each partition
    let Dj be the set of data tuples in D satisfying outcome j; // a partition
    if Dj is empty then
        attach a leaf labeled with the majority class in D to node N;
    else
        attach the node returned by Generate_decision_tree(Dj, attribute_list) to node N;
end for
return N;
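For readers who prefer running code, here is a minimal Python sketch of the same top-down, recursive, divide-and-conquer induction for discrete-valued attributes, using information gain (as in ID3) as the attribute-selection method. The data layout (a list of dictionaries plus a class key) and the tiny dataset are assumptions made for illustration, not part of the original algorithm text.

import math
from collections import Counter

def entropy(rows, class_key):
    counts = Counter(r[class_key] for r in rows)
    total = len(rows)
    return -sum((c / total) * math.log2(c / total) for c in counts.values())

def info_gain(rows, attr, class_key):
    total = len(rows)
    remainder = 0.0
    for value in set(r[attr] for r in rows):
        subset = [r for r in rows if r[attr] == value]
        remainder += (len(subset) / total) * entropy(subset, class_key)
    return entropy(rows, class_key) - remainder

def generate_decision_tree(rows, attributes, class_key):
    classes = [r[class_key] for r in rows]
    if len(set(classes)) == 1:               # all tuples in the same class C
        return classes[0]                    # leaf labeled with class C
    if not attributes:                       # attribute list is empty
        return Counter(classes).most_common(1)[0][0]    # majority voting
    # attribute selection method: pick the attribute with the highest information gain
    best = max(attributes, key=lambda a: info_gain(rows, a, class_key))
    node = {best: {}}
    remaining = [a for a in attributes if a != best]     # remove splitting attribute
    for value in set(r[best] for r in rows):             # each outcome of the split
        subset = [r for r in rows if r[best] == value]   # partition Dj
        node[best][value] = generate_decision_tree(subset, remaining, class_key)
    return node

# Tiny made-up training set for the buy_computer concept
data = [
    {"age": "youth",  "student": "yes", "buys_computer": "yes"},
    {"age": "youth",  "student": "no",  "buys_computer": "no"},
    {"age": "senior", "student": "no",  "buys_computer": "yes"},
    {"age": "senior", "student": "yes", "buys_computer": "yes"},
]
print(generate_decision_tree(data, ["age", "student"], "buys_computer"))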

Bayesian Classification

Bayesian classification in data mining is a statistical technique used to classify data based on
probabilistic reasoning. It is a type of probabilistic classification that uses Bayes' theorem to
predict the probability of a data point belonging to a certain class. The Bayesian
classification is a powerful technique for probabilistic inference and decision-making and is
widely used in various applications such as medical diagnosis, spam classification, fraud
detection, etc.

Introduction to Bayesian Classification in Data Mining

Bayesian classification in data mining is a statistical approach to data classification that uses
Bayes' theorem to make predictions about a class of a data point based on observed data. It is
a popular data mining and machine learning technique for modelling the probability of
certain outcomes and making predictions based on that probability.

The basic idea behind Bayesian classification in data mining is to assign a class label to a
new data instance based on the probability that it belongs to a particular class, given the
observed data. Bayes' theorem provides a way to compute this probability by multiplying the
prior probability of the class (based on previous knowledge or assumptions) by the likelihood
of the observed data given that class (conditional probability).

Several types of Bayesian classifiers exist, such as naive Bayes, Bayesian network classifiers,
Bayesian logistic regression, etc. Bayesian classification is preferred in many applications
because it allows for the incorporation of new data (just by updating the prior probabilities)
and can update the probabilities of class labels accordingly.

This is important when new data is constantly being collected, or the underlying distribution
may change over time. In contrast, other classification techniques, such as decision trees or
support vector machines, do not easily accommodate new data and may require re-training of
the entire model to incorporate new information. This can be computationally expensive and
time-consuming.

Bayesian classification is a powerful tool for data mining and machine learning and is widely
used in many applications, such as spam filtering, text classification, and medical diagnosis.
Its ability to incorporate prior knowledge and uncertainty makes it well-suited for real-world
problems where data is incomplete or noisy and accurate predictions are critical.
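As a hedged, minimal sketch of a naive Bayes classifier in practice (one of the Bayesian classifiers mentioned above), scikit-learn's GaussianNB can be used; the numbers below are invented for illustration.

from sklearn.naive_bayes import GaussianNB

# Hypothetical features: [age, annual_income]; labels: buys_computer yes/no
X = [[25, 40000], [47, 90000], [35, 60000], [52, 110000], [23, 30000]]
y = ["no", "yes", "no", "yes", "no"]

model = GaussianNB().fit(X, y)
print(model.predict([[40, 80000]]))        # predicted class label
print(model.predict_proba([[40, 80000]]))  # posterior probability of each class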

Bayes’ Theorem in Data Mining

Bayes' theorem is used in Bayesian classification in data mining, which is a technique for
predicting the class label of a new instance based on the probabilities of different class labels
and the observed features of the instance. In data mining, Bayes' theorem is used to compute
the probability of a hypothesis (such as a class label or a pattern in the data) given some
observed event (such as a set of features or attributes). It is named after Reverend Thomas
Bayes, an 18th-century British mathematician who first formulated it.

Bayes' theorem states that the probability of a hypothesis H given some observed event E is
proportional to the likelihood of the evidence given the hypothesis, multiplied by the prior
probability of the hypothesis, as shown below -

P(H∣E) = P(E∣H) ∗ P(H) / P(E)

where P(H∣E) is the posterior probability of the hypothesis given the event
E, P(E∣H) is the likelihood or conditional probability of the event given the
hypothesis, P(H) is the prior probability of the hypothesis, and P(E) is the
probability of the event.

What is Prior Probability?

Prior probability is a term used in probability theory and statistics that refers to the
probability of a hypothesis or event before any evidence or data is considered. It represents
our prior belief or expectation about the likelihood of a hypothesis or event, based on
previous knowledge or assumptions.

For example, suppose we are interested in the probability of a certain disease in a population. Our
prior probability might be based on previous studies or epidemiological data and might be
relatively low if the disease is rare. As we collect data from medical tests or patient
symptoms, we can update our probability estimate using Bayes' theorem to reflect the new
evidence.

What is Posterior Probability?

The posterior probability is a term used in Bayesian inference to refer to the updated
probability of a hypothesis, given some observed event or data. It is calculated using Bayes'
theorem, which combines the prior probability of the hypothesis with the likelihood of the
event to produce an updated or posterior probability.

The posterior probability is important in Bayesian inference because it reflects the latest
information about the hypothesis based on the observed data. It can be used to make
decisions or predictions and updated further as new data becomes available.
Formula Derivation

Bayes' theorem is derived from the definition of conditional probability. The conditional
probability of an event E given a hypothesis H is defined as the joint probability of E and H,
divided by the probability of H, as shown below -

P(E∣H) = P(E∩H) / P(H)

We can rearrange this equation to solve for the joint probability of E and H -

P(E∩H) = P(E∣H) ∗ P(H)

Similarly, we can use the definition of conditional probability to write the conditional
probability of H given E, as shown below -

P(H∣E) = P(H∩E) / P(E)

Based on the commutative property of joint probability, we can write -

P(H∩E) = P(E∩H)

We can substitute the expression for P(H∩E) from the first equation into the second equation
to obtain -

P(H∣E) = P(E∣H) ∗ P(H) / P(E)

This is the formula for Bayes' theorem for hypothesis H and event E. It states that the
probability of hypothesis H given event E is proportional to the likelihood of the event given
the hypothesis, multiplied by the prior probability of the hypothesis, and divided by the
probability of the event.

Applications of Bayes’ Theorem

Bayes' theorem or Bayesian classification in data mining has a wide range of applications in
many fields, including statistics, machine learning, artificial intelligence, natural language
processing, medical diagnosis, image and speech recognition, and more. Here are some
examples of its applications -

 Spam filtering - Bayes' theorem is commonly used in email spam filtering, where it
helps to identify emails that are likely to be spam based on the text content and
other features.
 Medical diagnosis - Bayes' theorem can be used to diagnose medical conditions
based on the observed symptoms, test results, and prior knowledge about the
prevalence and characteristics of the disease.
 Risk assessment - Bayes' theorem can be used to assess the risk of events such as
accidents, natural disasters, or financial market fluctuations based on historical data
and other relevant factors.
 Natural language processing - Bayes' theorem can be used to classify documents,
sentiment analysis, and topic modeling in natural language processing applications.
 Recommendation systems - Bayes' theorem can be used in recommendation
systems like e-commerce websites to suggest products or services to users based on
their previous behavior and preferences.
 Fraud detection - Bayes' theorem can be used to detect fraudulent behavior, such as
credit card or insurance fraud, by analyzing patterns of transactions and other data.

Examples

Problem - Suppose a medical test for a certain disease has a false positive rate of 5% and a
false negative rate of 2%. If a person has the disease, there is a 2% chance that the test will
come back negative; if a person does not, there is a 5% chance that the test will come back
positive. Suppose the disease affects 1% of the population. If a person tests positive for the
disease, what is the probability that they have the disease?

Solution - To solve this problem using Bayes' theorem, we can start by defining some events:

 D - the event that a person has the disease


 ~D - the event that a person does not have the disease
 T - the event that a person tests positive for the disease
 ~T - the event that a person tests negative for the disease

We are interested in the probability of event D given the event T, which we can write
as P(D∣T). Using Bayes' theorem, we can write -

P(D∣T) = P(T∣D) ∗ P(D) / P(T)

The first term on the right-hand side of the equation is the probability of a positive test result
given that the person has the disease, which we can calculate as -

P(T∣D) = 1 − 0.02 = 0.98

(the false negative rate is 2%, which means that if a person has the disease, there is a
2% chance that the test will come back negative)

The second term is the prior probability of the person having the disease, which is given
as 1% -

P(D) = 0.01 (prior probability of the disease in the given population)

The third term is the probability of a positive test result, which we can calculate using the law
of total probability, as shown below -

 P(T) = P(T∣D) ∗ P(D) + P(T∣~D) ∗ P(~D) (it is the sum of the probabilities of both
scenarios in which a person tests positive, whether or not they have the disease)
 P(T) = 0.98 ∗ 0.01 + 0.05 ∗ 0.99 = 0.0593
Substituting these values into the first equation, we get -

 P(D∣T) = 0.98 ∗ 0.01 / 0.0593 ≈ 0.165

So the probability that a person has the disease, given that they test positive for it, is
approximately 16.5%. This shows that even with a fairly accurate test, a positive result
is not a guarantee of having the disease when the disease itself is rare, and further
testing or confirmation may be necessary.
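The arithmetic above can be checked with a few lines of Python; this is simply the worked example re-expressed in code.

# Bayes' theorem applied to the medical-test example above.
p_disease = 0.01                   # P(D): prior probability of the disease
p_pos_given_disease = 1 - 0.02     # P(T|D): false negative rate is 2%
p_pos_given_no_disease = 0.05      # P(T|~D): false positive rate is 5%

# Law of total probability: P(T)
p_pos = p_pos_given_disease * p_disease + p_pos_given_no_disease * (1 - p_disease)

# Posterior probability: P(D|T)
p_disease_given_pos = p_pos_given_disease * p_disease / p_pos
print(round(p_pos, 4), round(p_disease_given_pos, 3))   # 0.0593 0.165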

Rule Based Classification

IF-THEN Rules

Rule-based classifier makes use of a set of IF-THEN rules for classification.


We can express a rule in the following form −

IF condition THEN conclusion

Let us consider a rule R1,

R1: IF age = youth AND student = yes THEN buy_computer = yes

Points to remember −
 The IF part of the rule is called the rule antecedent or precondition.
 The THEN part of the rule is called the rule consequent.
 The antecedent part (the condition) consists of one or more attribute tests,
and these tests are logically ANDed.
 The consequent part consists of the class prediction.
Note − We can also write rule R1 as follows −
R1: (age = youth) ∧ (student = yes) ⇒ (buys_computer = yes)

If the condition holds true for a given tuple, then the antecedent is
satisfied.

Rule Extraction

Here we will learn how to build a rule-based classifier by extracting IF-


THEN rules from a decision tree.

Points to remember −

To extract a rule from a decision tree −

 One rule is created for each path from the root to the leaf node.
 To form a rule antecedent, each splitting criterion is logically ANDed.
 The leaf node holds the class prediction, forming the rule consequent.
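As a hedged illustration, scikit-learn's export_text can print every root-to-leaf path of a learned tree, and each printed path reads directly as an IF-THEN rule; the small encoded dataset below is invented.

from sklearn.tree import DecisionTreeClassifier, export_text

# Encoded features: age (0 = youth, 1 = middle_aged, 2 = senior), student (0 = no, 1 = yes)
X = [[0, 1], [0, 0], [1, 0], [2, 1], [2, 0]]
y = ["yes", "no", "yes", "yes", "no"]          # buys_computer

tree = DecisionTreeClassifier().fit(X, y)
print(export_text(tree, feature_names=["age", "student"]))
# Each root-to-leaf path in the printout (nested "|---" tests ending in "class: ...")
# is one rule: the ANDed tests form the antecedent, the leaf class the consequent.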

Rule Induction Using Sequential Covering Algorithm

The Sequential Covering Algorithm can be used to extract IF-THEN rules from
the training data. We do not need to generate a decision tree first. In
this algorithm, each rule for a given class covers many of the tuples of
that class.

Some sequential covering algorithms are AQ, CN2, and RIPPER. As per the
general strategy, the rules are learned one at a time. Each time a rule is
learned, the tuples covered by the rule are removed, and the process
continues with the remaining tuples.

Note − Decision tree induction can be considered as learning a set of rules
simultaneously, because the path to each leaf in a decision tree corresponds
to a rule.

In the sequential learning algorithm, rules are learned for one class at a
time. When learning a rule for a class Ci, we want the rule to cover only
tuples from class Ci and no tuples from any other class.

Classification by Backpropagation

Classification by backpropagation is a type of supervised learning algorithm that is
used to train a neural network to classify data into different classes. The
backpropagation algorithm is based on the idea of adjusting the weights and biases
of a network in order to minimize the error between the predicted output and the
actual output.

The backpropagation algorithm works by taking a set of training examples and
feeding them through the neural network. The output of the network is compared to
the desired output, and the error is calculated using a cost function such as mean
squared error.

The error is then propagated backwards through the network, with each neuron in
the network adjusting its weights and biases based on its contribution to the error.
This is done using a gradient descent algorithm, where the weights and biases are
adjusted in the direction that reduces the error.

The backpropagation algorithm is an iterative process that continues until the error
is minimized or until a predetermined number of iterations is reached. The final set
of weights and biases is then used to classify new data.
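Below is a hedged, minimal NumPy sketch of these steps: a small one-hidden-layer network, mean squared error, and gradient descent via backpropagation. The XOR-style dataset, layer sizes, learning rate, and iteration count are all assumptions chosen only to illustrate the mechanics.

import numpy as np

# Tiny 2-4-1 network trained by backpropagation on a made-up XOR-style dataset.
rng = np.random.default_rng(0)
X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=float)
y = np.array([[0], [1], [1], [0]], dtype=float)

W1, b1 = rng.normal(size=(2, 4)), np.zeros((1, 4))   # input -> hidden weights/biases
W2, b2 = rng.normal(size=(4, 1)), np.zeros((1, 1))   # hidden -> output weights/biases
sigmoid = lambda z: 1.0 / (1.0 + np.exp(-z))
lr = 0.5                                             # learning rate

for _ in range(20000):
    # Forward pass: feed the training examples through the network
    h = sigmoid(X @ W1 + b1)
    out = sigmoid(h @ W2 + b2)

    # Error between predicted and actual output (mean squared error gradient)
    err = out - y

    # Backward pass: propagate the error, layer by layer
    d_out = err * out * (1 - out)             # contribution of the output neuron
    d_h = (d_out @ W2.T) * h * (1 - h)        # contribution of each hidden neuron

    # Gradient descent: adjust weights and biases in the direction that reduces error
    W2 -= lr * (h.T @ d_out);  b2 -= lr * d_out.sum(axis=0, keepdims=True)
    W1 -= lr * (X.T @ d_h);    b1 -= lr * d_h.sum(axis=0, keepdims=True)

print(out.round(2).ravel())   # predictions should move toward [0, 1, 1, 0]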


Support Vector Machine Algorithm

Support Vector Machine or SVM is one of the most popular Supervised Learning
algorithms, which is used for Classification as well as Regression problems. However,
primarily, it is used for Classification problems in Machine Learning.

The goal of the SVM algorithm is to create the best line or decision boundary that can
segregate n-dimensional space into classes so that we can easily put the new data point
in the correct category in the future. This best decision boundary is called a hyperplane.

SVM chooses the extreme points/vectors that help in creating the hyperplane. These
extreme cases are called support vectors, and hence the algorithm is termed Support
Vector Machine. Consider the below diagram, in which two different categories are
classified using a decision boundary or hyperplane:

Types of SVM
SVM can be of two types:

o Linear SVM: Linear SVM is used for linearly separable data, which means that if a
dataset can be classified into two classes by using a single straight line, then such
data is termed linearly separable data, and the classifier used is called a Linear
SVM classifier.
o Non-linear SVM: Non-Linear SVM is used for non-linearly separable data, which
means that if a dataset cannot be classified by using a straight line, then such data is
termed non-linear data, and the classifier used is called a Non-linear SVM
classifier.

Hyperplane and Support Vectors in the SVM algorithm:


Hyperplane: There can be multiple lines/decision boundaries to segregate the classes in
n-dimensional space, but we need to find out the best decision boundary that helps to
classify the data points. This best boundary is known as the hyperplane of SVM.

The dimensions of the hyperplane depend on the number of features present in the dataset:
if there are 2 features, the hyperplane is a straight line, and if there are 3 features, the
hyperplane is a 2-dimensional plane.

We always create the hyperplane that has the maximum margin, i.e., the maximum distance
between the hyperplane and the nearest data points of each class.

Support Vectors:

The data points or vectors that are closest to the hyperplane and that affect the position
of the hyperplane are termed support vectors. Since these vectors support the hyperplane,
the name support vector is used.

How does SVM work?


Linear SVM:

The working of the SVM algorithm can be understood by using an example. Suppose we
have a dataset that has two tags (green and blue), and the dataset has two features x1
and x2. We want a classifier that can classify the pair(x1, x2) of coordinates in either
green or blue. Consider the below image:
Since this is a 2-d space, we can easily separate these two classes by using a straight
line. But there can be multiple lines that can separate these classes. Consider the
below image:

Hence, the SVM algorithm helps to find the best line or decision boundary; this best
boundary or region is called a hyperplane. The SVM algorithm finds the closest points of
the lines from both classes. These points are called support vectors. The distance
between the vectors and the hyperplane is called the margin, and the goal of SVM is to
maximize this margin. The hyperplane with the maximum margin is called the optimal
hyperplane.

Non-Linear SVM:

If data is linearly arranged, then we can separate it by using a straight line, but for non-
linear data, we cannot draw a single straight line. Consider the below image:
So to separate these data points, we need to add one more dimension. For linear data,
we have used two dimensions x and y, so for non-linear data, we will add a third
dimension z. It can be calculated as:

z = x² + y²

By adding the third dimension, the sample space will become as below image:

So now, SVM will divide the datasets into classes in the following way. Consider the below

image:

Since we are in 3-d space, the separating boundary looks like a plane parallel to the
x-axis. If we convert it back to 2-d space with z = 1, the boundary becomes a circle of
radius 1 for the non-linear data.
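A hedged sketch of this idea in code: add the third feature z = x² + y² by hand and fit a linear SVM in the lifted 3-d space (scikit-learn and NumPy assumed; a kernel SVM would do this implicitly).

import numpy as np
from sklearn.svm import SVC

# Made-up circular data: class 0 near the origin, class 1 on a ring of radius 2.
rng = np.random.default_rng(1)
inner = rng.normal(scale=0.4, size=(50, 2))
angles = rng.uniform(0, 2 * np.pi, 50)
outer = np.c_[2 * np.cos(angles), 2 * np.sin(angles)]
X = np.vstack([inner, outer])
y = np.array([0] * 50 + [1] * 50)

Z = np.c_[X, (X ** 2).sum(axis=1)]       # add the third dimension z = x^2 + y^2
clf = SVC(kernel="linear").fit(Z, y)     # a straight (flat) boundary now suffices
print(clf.score(Z, y))                   # close to 1.0: separable in the lifted space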

How to calculate the distance from a point to a line?

 In our case, the hyperplane is w1*x1 + w2*x2 + b = 0,
 thus w = (w1, w2) and x = (x1, x2).
 The distance from a point x to this line is |w1*x1 + w2*x2 + b| / sqrt(w1² + w2²),
that is, |w·x + b| / ||w||.
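A short sketch of this computation with invented numbers:

import numpy as np

# Distance from a point x to the hyperplane w.x + b = 0 is |w.x + b| / ||w||.
# The weights, bias, and point below are made-up values for illustration.
w = np.array([3.0, 4.0])     # (w1, w2)
b = -12.0
x = np.array([2.0, 1.0])     # (x1, x2)

distance = abs(w @ x + b) / np.linalg.norm(w)
print(distance)              # |3*2 + 4*1 - 12| / 5 = 2 / 5 = 0.4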

🐢 Lazy Learners
Lazy learners are a type of learning algorithm that delays the generalization process until a query is
made. In simple terms, they do not build a model during training, but instead store the training
data and use it at prediction time.

Key Characteristics of Lazy Learners:

 No model is built in advance


 High query-time cost (slower prediction)
 Low training time
 Use instance-based learning – they compare new data with stored instances

How It Works:

Lazy learners memorize the training data. When a new instance needs to be classified, they
compare it with the stored instances (usually using a similarity or distance measure) and make a
prediction based on the closest matches.
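A hedged k-nearest-neighbours sketch of this behaviour (scikit-learn assumed, data invented): fit() mostly just stores the training instances, and the distance comparisons happen at prediction time.

from sklearn.neighbors import KNeighborsClassifier

X_train = [[1.0, 2.0], [1.5, 1.8], [5.0, 8.0], [6.0, 9.0], [1.2, 0.5]]
y_train = ["A", "A", "B", "B", "A"]

knn = KNeighborsClassifier(n_neighbors=3)   # compare against the 3 closest stored instances
knn.fit(X_train, y_train)                   # lazy: essentially just stores the data
print(knn.predict([[1.1, 1.9]]))            # majority vote of the 3 nearest -> ['A']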
Examples of Lazy Learners:

1. K-Nearest Neighbors (K-NN):


o Stores all training samples.
o At prediction, finds the k nearest neighbors and assigns the class based on majority
voting.
2. Case-Based Reasoning (CBR):
o Solves new problems based on the solutions of similar past problems.
3. Locally Weighted Regression (LWR):
o Performs regression based on local data near the query point.

Advantages:

 Simple to implement.
 Can adapt to changes in data easily (just add new data).
 No training phase → fast for updating.

Disadvantages:

 Slow classification time (has to search through data for each prediction).
 High memory usage (since all data is stored).
 Sensitive to irrelevant or redundant features (especially K-NN).

📌 Lazy Learner vs Eager Learner


Feature           | Lazy Learner | Eager Learner
Model Building    | No           | Yes
Training Time     | Fast         | Slower
Prediction Time   | Slow         | Fast
Examples          | K-NN, CBR    | Decision Trees, Naive Bayes

Prediction Accuracy and Error Measures

In data mining, especially in classification and regression tasks, it’s essential to evaluate how
well a model performs. This is done using accuracy and error metrics that measure the
difference between predicted and actual values.
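As a hedged illustration, the common measures can be computed with scikit-learn's metrics module; the true and predicted values below are invented.

from sklearn.metrics import (accuracy_score, confusion_matrix,
                             mean_absolute_error, mean_squared_error)

# Classification: compare predicted class labels with the actual labels
y_true = ["yes", "no", "yes", "yes", "no"]
y_pred = ["yes", "no", "no", "yes", "no"]
print(accuracy_score(y_true, y_pred))       # 0.8: four of five labels are correct
print(confusion_matrix(y_true, y_pred))     # counts of actual vs. predicted classes

# Prediction / regression: compare predicted continuous values with actual values
v_true = [100.0, 150.0, 200.0]
v_pred = [110.0, 140.0, 195.0]
print(mean_absolute_error(v_true, v_pred))  # 8.33...
print(mean_squared_error(v_true, v_pred))   # 75.0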

What is Rule-Based Classification?

Rule-based classification uses rules of the form:

IF <condition> THEN <class>


 The condition part (IF) is a conjunction of attribute tests (e.g., Age > 30 AND Income =
High).

 The class part (THEN) specifies the predicted class for instances satisfying the condition

Process of Rule-Based Classification

1. Rule Generation – Extract rules from the training data.


2. Rule Pruning – Eliminate overly specific or redundant rules.
3. Rule Ordering – Order the rules by priority or accuracy.
4. Classification – Apply rules to classify new instances.

Example

Let’s say you are classifying customers as “Buy” or “Don’t Buy” a product.

Rule 1:
IF Age > 25 AND Income = High THEN Class = Buy

Rule 2:
IF Age <= 25 AND Student = Yes THEN Class = Buy

Rule 3:
IF Income = Low THEN Class = Don’t Buy
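A minimal hedged sketch of applying these rules in order (the default class used when no rule fires is an assumption for illustration):

def classify(customer):
    if customer["age"] > 25 and customer["income"] == "High":     # Rule 1
        return "Buy"
    if customer["age"] <= 25 and customer["student"] == "Yes":    # Rule 2
        return "Buy"
    if customer["income"] == "Low":                               # Rule 3
        return "Don't Buy"
    return "Don't Buy"   # assumed default class when no rule applies

print(classify({"age": 30, "income": "High",   "student": "No"}))   # Buy (Rule 1)
print(classify({"age": 22, "income": "Medium", "student": "Yes"}))  # Buy (Rule 2)
print(classify({"age": 40, "income": "Low",    "student": "No"}))   # Don't Buy (Rule 3)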

Rule Generation Algorithms

Some popular algorithms that generate rule-based classifiers:

 RIPPER (Repeated Incremental Pruning to Produce Error Reduction)


 CN2
 PART (uses partial decision trees to generate rules)
 Decision Trees (can be converted into rules)

Advantages

 Easy to understand and interpret.


 Good for domains where explanation is important.
 Flexible – can handle both numeric and categorical data.

Disadvantages

 Might generate too many rules (overfitting).


 Rule conflicts need resolution (when multiple rules apply).
 Not always the most accurate compared to other classifiers like SVM or Random Forests.
