
Classification

NOTE: THIS PRESENTATION SHOULD BE CONSIDERED AS
SUPPORTING MATERIAL ONLY. FOR DETAILED STUDY,
STUDENTS MUST REFER TO THE TEXT BOOKS AND REFERENCE
BOOKS MENTIONED IN THE SYLLABUS.



Classification vs. Prediction
• Classification
– predicts categorical class labels
– classifies data (constructs a model) based on the training
set and the values (class labels) in a classifying attribute
and uses it in classifying new data
• Prediction
– models continuous-valued functions, i.e., predicts unknown
or missing values
• Typical applications
– Credit approval
– Target marketing
– Medical diagnosis
– Fraud detection



Classification—A Two-Step Process
• Model construction: describing a set of predetermined classes
– Each tuple/sample is assumed to belong to a predefined class, as
determined by the class label attribute
– The set of tuples used for model construction is training set
– The model is represented as classification rules, decision trees, or
mathematical formulae
• Model usage: for classifying future or unknown objects
– Estimate the accuracy of the model
• The known label of test sample is compared with the classified
result from the model
• Accuracy rate is the percentage of test set samples that are
correctly classified by the model
– If the accuracy is acceptable, use the model to classify data tuples
whose class labels are not known



Process (1): Model Construction

The training data are fed to a classification algorithm, which produces the classifier (model):

NAME | RANK           | YEARS | TENURED
Mike | Assistant Prof | 3     | no
Mary | Assistant Prof | 7     | yes
Bill | Professor      | 2     | yes
Jim  | Associate Prof | 7     | yes
Dave | Assistant Prof | 6     | no
Anne | Associate Prof | 3     | no

Resulting classifier (model):
IF rank = ‘professor’ OR years > 6 THEN tenured = ‘yes’
Process (2): Using the Model in Prediction

The classifier is applied to the testing data to estimate accuracy, and then to unseen data:

NAME    | RANK           | YEARS | TENURED
Tom     | Assistant Prof | 2     | no
Merlisa | Associate Prof | 7     | no
George  | Professor      | 5     | yes
Joseph  | Assistant Prof | 7     | yes

Unseen data: (Jeff, Professor, 4) → Tenured? yes (since rank = ‘professor’)
Classification Examples
• Teachers classify students’ grades as A, B, C, D
or F.
• Identify mushrooms as poisonous or edible.
• Predict when a river will flood.
• Identify individuals who are credit risks.
• Speech recognition
• Pattern recognition



Classification Ex: Grading
• If x >= 90 then grade = A.
• If 80 <= x < 90 then grade = B.
• If 70 <= x < 80 then grade = C.
• If 60 <= x < 70 then grade = D.
• If x < 60 then grade = F.

(The slide draws these rules as a decision tree, testing x against the
thresholds 90, 80, 70, and 60 in turn.)
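A minimal sketch of these rules as code (hypothetical Python, not from the slides): the classifier is just a chain of threshold tests mirroring the decision tree.

    def grade(x: float) -> str:
        # Each test mirrors one branch of the grading decision tree.
        if x >= 90: return "A"
        if x >= 80: return "B"
        if x >= 70: return "C"
        if x >= 60: return "D"
        return "F"

    print([grade(v) for v in (95, 85, 75, 65, 55)])  # ['A', 'B', 'C', 'D', 'F']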
Classification problem
The classification problem can be expressed as: Use the training database given and predict
the class label of a previously unseen instance

Database (each tuple with two alternative class-label assignments, Output1 and Output2)
Name Gender Height Output1 Output2
Kristina F 1.6m Short Medium
Jim M 2m Tall Medium
Maggie F 1.9m Medium Tall
Martha F 1.88m Medium Tall
Stephanie F 1.7m Short Medium
Bob M 1.85m Medium Medium
Kathy F 1.6m Short Medium
Dave M 1.7m Short Medium
Worth M 2.2m Tall Tall
Steven M 2.1m Tall Tall
Debbie F 1.8m Medium Medium
Todd M 1.95m Medium Medium
Kim F 1.9m Medium Tall
Amy F 1.8m Medium Medium
Wynette F 1.75m Medium Medium



Bayesian Classification: Why?
• A statistical classifier: performs probabilistic prediction, i.e.,
predicts class membership probabilities
• Foundation: Based on Bayes’ Theorem.
• Performance: A simple Bayesian classifier, naïve Bayesian
classifier, has comparable performance with decision tree and
selected neural network classifiers
• Incremental: Each training example can incrementally
increase/decrease the probability that a hypothesis is correct —
prior knowledge can be combined with observed data
• Standard: Even when Bayesian methods are computationally
intractable, they can provide a standard of optimal decision
making against which other methods can be measured



Bayesian Theorem: Basics
• Let X be a data sample (“evidence”): class label is unknown
• Let H be a hypothesis that X belongs to class C
• Classification is to determine P(H|X), the probability that the
hypothesis holds given the observed data sample X
• P(H) (prior probability), the initial probability
– E.g., X will buy computer, regardless of age, income, …
• P(X): probability that sample data is observed
• P(X|H) (posterior probability): the probability of observing the
sample X, given that the hypothesis holds
– E.g., Given that X will buy computer, the prob. that X is 31..40,
medium income
Bayesian Theorem
• Given training data X, the posterior probability of a hypothesis H,
P(H|X), follows Bayes' theorem:

  P(H|X) = \frac{P(X|H) \, P(H)}{P(X)}

• Informally, this can be written as
  posterior = likelihood × prior / evidence
• Predict that X belongs to class Ci iff the probability P(Ci|X) is the
highest among all the P(Ck|X) for all the k classes
• Practical difficulty: requires initial knowledge of many
probabilities, and significant computational cost
Naïve Bayesian Classifier: Training Dataset

Classes:
  C1: buys_computer = ‘yes’
  C2: buys_computer = ‘no’

Data sample to classify:
  X = (age <= 30, income = medium, student = yes, credit_rating = fair)

age   | income | student | credit_rating | buys_computer
<=30  | high   | no      | fair          | no
<=30  | high   | no      | excellent     | no
31…40 | high   | no      | fair          | yes
>40   | medium | no      | fair          | yes
>40   | low    | yes     | fair          | yes
>40   | low    | yes     | excellent     | no
31…40 | low    | yes     | excellent     | yes
<=30  | medium | no      | fair          | no
<=30  | low    | yes     | fair          | yes
>40   | medium | yes     | fair          | yes
<=30  | medium | yes     | excellent     | yes
31…40 | medium | no      | excellent     | yes
31…40 | high   | yes     | fair          | yes
>40   | medium | no      | excellent     | no
Naïve Bayesian Classifier: An Example
• P(Ci):
  P(buys_computer = “yes”) = 9/14 = 0.643
  P(buys_computer = “no”) = 5/14 = 0.357

• Compute P(X|Ci) for each class:
  P(age = “<=30” | buys_computer = “yes”) = 2/9 = 0.222
  P(age = “<=30” | buys_computer = “no”) = 3/5 = 0.600
  P(income = “medium” | buys_computer = “yes”) = 4/9 = 0.444
  P(income = “medium” | buys_computer = “no”) = 2/5 = 0.400
  P(student = “yes” | buys_computer = “yes”) = 6/9 = 0.667
  P(student = “yes” | buys_computer = “no”) = 1/5 = 0.200
  P(credit_rating = “fair” | buys_computer = “yes”) = 6/9 = 0.667
  P(credit_rating = “fair” | buys_computer = “no”) = 2/5 = 0.400

• For X = (age <= 30, income = medium, student = yes, credit_rating = fair):
  P(X|buys_computer = “yes”) = 0.222 × 0.444 × 0.667 × 0.667 = 0.044
  P(X|buys_computer = “no”) = 0.600 × 0.400 × 0.200 × 0.400 = 0.019

• P(X|Ci) × P(Ci):
  P(X|buys_computer = “yes”) × P(buys_computer = “yes”) = 0.044 × 0.643 = 0.028
  P(X|buys_computer = “no”) × P(buys_computer = “no”) = 0.019 × 0.357 = 0.007

Therefore, X belongs to class (“buys_computer = yes”)
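The same hand computation can be scripted. A minimal sketch, assuming the 14-tuple training set above (the tuple layout and variable names are illustrative, not from the slides):

    from collections import Counter, defaultdict

    # Each tuple: (age, income, student, credit_rating, buys_computer)
    data = [
        ("<=30", "high", "no", "fair", "no"),
        ("<=30", "high", "no", "excellent", "no"),
        ("31..40", "high", "no", "fair", "yes"),
        (">40", "medium", "no", "fair", "yes"),
        (">40", "low", "yes", "fair", "yes"),
        (">40", "low", "yes", "excellent", "no"),
        ("31..40", "low", "yes", "excellent", "yes"),
        ("<=30", "medium", "no", "fair", "no"),
        ("<=30", "low", "yes", "fair", "yes"),
        (">40", "medium", "yes", "fair", "yes"),
        ("<=30", "medium", "yes", "excellent", "yes"),
        ("31..40", "medium", "no", "excellent", "yes"),
        ("31..40", "high", "yes", "fair", "yes"),
        (">40", "medium", "no", "excellent", "no"),
    ]

    class_counts = Counter(row[-1] for row in data)        # yes: 9, no: 5
    cond = defaultdict(Counter)                            # (attribute, value) -> class counts
    for *attrs, label in data:
        for i, v in enumerate(attrs):
            cond[(i, v)][label] += 1

    def score(x, label):
        """P(X|Ci) * P(Ci) under the naive independence assumption."""
        s = class_counts[label] / len(data)                 # prior P(Ci)
        for i, v in enumerate(x):
            s *= cond[(i, v)][label] / class_counts[label]  # P(x_i | Ci)
        return s

    x = ("<=30", "medium", "yes", "fair")
    print({c: round(score(x, c), 3) for c in ("yes", "no")})  # {'yes': 0.028, 'no': 0.007}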



Naïve Bayesian Classifier: Comments
• Advantages
– Easy to implement
– Good results obtained in most of the cases
• Disadvantages
– Assumption: class conditional independence, therefore loss of
accuracy
– Practically, dependencies exist among variables
• E.g., in hospital patient data: profile (age, family history, etc.),
symptoms (fever, cough, etc.), and disease (lung cancer, diabetes, etc.)
• Dependencies among these cannot be modeled by the naïve Bayesian
classifier



Decision Tree Induction( ID3): Training Dataset
age income student credit_rating buys_computer
<=30 high no fair no
<=30 high no excellent no
31…40 high no fair yes
>40 medium no fair yes
>40 low yes fair yes
>40 low yes excellent no
31…40 low yes excellent yes
<=30 medium no fair no
<=30 low yes fair yes
>40 medium yes fair yes
<=30 medium yes excellent yes
31…40 medium no excellent yes
31…40 high yes fair yes
>40 medium no excellent no
Output: A Decision Tree for “buys_computer”

age?
├─ <=30   → student?
│            ├─ no  → no
│            └─ yes → yes
├─ 31..40 → yes
└─ >40    → credit_rating?
             ├─ excellent → no
             └─ fair      → yes


Attribute Selection Measure:
Information Gain (ID3)
• Select the attribute with the highest information gain
• Let p_i be the probability that an arbitrary tuple in D
belongs to class Ci, estimated by |C_{i,D}| / |D|
• Expected information (entropy) needed to classify a tuple in D:

  Info(D) = -\sum_{i=1}^{m} p_i \log_2(p_i)

• Information needed (after using A to split D into v partitions)
to classify D:

  Info_A(D) = \sum_{j=1}^{v} \frac{|D_j|}{|D|} \times Info(D_j)

• Information gained by branching on attribute A:

  Gain(A) = Info(D) - Info_A(D)
Attribute Selection: Information Gain

• Class P: buys_computer = “yes” (9 tuples)
• Class N: buys_computer = “no” (5 tuples)

  Info(D) = I(9,5) = -\frac{9}{14}\log_2(\frac{9}{14}) - \frac{5}{14}\log_2(\frac{5}{14}) = 0.940

For attribute age:

age   | p_i | n_i | I(p_i, n_i)
<=30  | 2   | 3   | 0.971
31…40 | 4   | 0   | 0
>40   | 3   | 2   | 0.971

  Info_age(D) = \frac{5}{14} I(2,3) + \frac{4}{14} I(4,0) + \frac{5}{14} I(3,2) = 0.694

Here \frac{5}{14} I(2,3) means that “age <= 30” covers 5 of the 14 samples,
with 2 yes’es and 3 no’s. Hence

  Gain(age) = Info(D) - Info_age(D) = 0.246

Similarly,

  Gain(income) = 0.029
  Gain(student) = 0.151
  Gain(credit_rating) = 0.048

Since age has the highest information gain, it is selected as the splitting attribute.
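These numbers are easy to verify in code. A minimal sketch (the helper name info is illustrative):

    from math import log2

    def info(*counts):
        """Expected information I(p, n, ...) = -sum p_i * log2(p_i)."""
        total = sum(counts)
        return -sum(c / total * log2(c / total) for c in counts if c > 0)

    info_D = info(9, 5)                                          # 0.940
    info_age = (5/14)*info(2, 3) + (4/14)*info(4, 0) + (5/14)*info(3, 2)
    print(round(info_age, 3))                                    # 0.694
    print(round(info_D - info_age, 3))                           # Gain(age) = 0.246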
C4.5

• This approach uses gain ratio instead of gain.
• Decision tree algorithm C4.5 improves ID3 in the following ways:
– Handling both continuous and discrete attributes: to handle continuous
attributes, C4.5 creates a threshold and then splits the list into those whose
attribute value is above the threshold and those that are less than or equal to it.
– Handling training data with missing attribute values: C4.5 allows attribute values
to be marked as ? for missing. Missing attribute values are simply not used in
gain and entropy calculations.
– Pruning trees after creation: C4.5 goes back through the tree once it has been
created and attempts to remove branches that do not help by replacing them
with leaf nodes.
– Splitting: C4.5 splits on the attribute with the largest gain ratio, among
attributes whose information gain is at least as large as the average gain.
– Rule generation: the learned tree can be converted into a set of if-then rules.



Gain Ratio for Attribute Selection (C4.5)
• The information gain measure is biased towards attributes with a
large number of values
• C4.5 (a successor of ID3) uses gain ratio to overcome the
problem (a normalization of information gain):

  SplitInfo_A(D) = -\sum_{j=1}^{v} \frac{|D_j|}{|D|} \times \log_2(\frac{|D_j|}{|D|})

– GainRatio(A) = Gain(A) / SplitInfo_A(D)
• Ex. income splits D into 4 low, 6 medium, and 4 high tuples, so

  SplitInfo_income(D) = -\frac{4}{14}\log_2(\frac{4}{14}) - \frac{6}{14}\log_2(\frac{6}{14}) - \frac{4}{14}\log_2(\frac{4}{14}) = 1.557

– gain_ratio(income) = 0.029 / 1.557 = 0.019
• The attribute with the maximum gain ratio is selected as the
splitting attribute
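A minimal sketch of the gain-ratio computation, in the style of the info helper from the previous sketch (names illustrative):

    from math import log2

    def split_info(*sizes):
        """SplitInfo_A(D) over partition sizes |D_1|, ..., |D_v|."""
        total = sum(sizes)
        return -sum(s / total * log2(s / total) for s in sizes)

    gain_income = 0.029                        # from the information-gain slide
    si_income = split_info(4, 6, 4)            # income: 4 low, 6 medium, 4 high -> 1.557
    print(round(gain_income / si_income, 3))   # gain_ratio(income) = 0.019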
Algorithm for Decision Tree Induction
• Basic algorithm (a greedy algorithm)
– Tree is constructed in a top-down recursive divide-and-conquer
manner
– At start, all the training examples are at the root
– Attributes are categorical (if continuous-valued, they are
discretized in advance)
– Examples are partitioned recursively based on selected
attributes
– Test attributes are selected on the basis of a heuristic or
statistical measure (e.g., information gain)
• Conditions for stopping partitioning
– All samples for a given node belong to the same class
– There are no remaining attributes for further partitioning
– There are no samples left
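A minimal sketch of this greedy, top-down loop using information gain (structure and names are illustrative; rows are assumed to be dicts keyed by attribute):

    from collections import Counter
    from math import log2

    def entropy(labels):
        n = len(labels)
        return -sum(c / n * log2(c / n) for c in Counter(labels).values())

    def info_gain(rows, labels, attr):
        n = len(labels)
        rem = 0.0
        for value, size in Counter(row[attr] for row in rows).items():
            subset = [l for row, l in zip(rows, labels) if row[attr] == value]
            rem += size / n * entropy(subset)
        return entropy(labels) - rem

    def build_tree(rows, labels, attrs):
        if len(set(labels)) == 1:                   # all samples in one class
            return labels[0]
        if not attrs:                               # no attributes left: majority class
            return Counter(labels).most_common(1)[0][0]
        best = max(attrs, key=lambda a: info_gain(rows, labels, a))
        node = {}
        for value in {row[best] for row in rows}:   # partition recursively
            sub = [(r, l) for r, l in zip(rows, labels) if r[best] == value]
            srows, slabels = zip(*sub)
            node[(best, value)] = build_tree(list(srows), list(slabels),
                                             [a for a in attrs if a != best])
        return node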
Gini Index (CART, IBM IntelligentMiner)
• If a data set D contains examples from n classes, the gini index,
gini(D), is defined as

  gini(D) = 1 - \sum_{j=1}^{n} p_j^2

where p_j is the relative frequency of class j in D
• If a data set D is split on A into two subsets D1 and D2, the gini
index gini_A(D) is defined as

  gini_A(D) = \frac{|D_1|}{|D|} gini(D_1) + \frac{|D_2|}{|D|} gini(D_2)

• Reduction in impurity:

  \Delta gini(A) = gini(D) - gini_A(D)

• The attribute that provides the smallest gini_split(D) (or the largest
reduction in impurity) is chosen to split the node
Computation of Gini Index
• Ex. D has 9 tuples with buys_computer = “yes” and 5 with “no”:

  gini(D) = 1 - (\frac{9}{14})^2 - (\frac{5}{14})^2 = 0.459

• Suppose the attribute income partitions D into 10 tuples in D1: {low,
medium} and 4 tuples in D2: {high}:

  gini_{income \in \{low, medium\}}(D) = \frac{10}{14} Gini(D_1) + \frac{4}{14} Gini(D_2) = 0.443

Similarly, Gini_{low,high} is 0.458 and Gini_{medium,high} is 0.450, so the
best binary split on income is {low, medium} (and {high}).
• Evaluating age, we obtain {youth, senior} or {middle_aged} as the best split,
with a Gini index of 0.357
• The attributes student and credit_rating are both binary, with Gini index
values of 0.367 and 0.429.
• Therefore, the attribute age with splitting subset {youth, senior} gives the
minimum Gini index overall and is used as the splitting criterion.
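A minimal sketch reproducing the income computation (the yes/no counts per partition are read off the 14-tuple training set):

    def gini(*counts):
        total = sum(counts)
        return 1 - sum((c / total) ** 2 for c in counts)

    g_D = gini(9, 5)                                     # 0.459
    # D1 = income in {low, medium}: 7 yes, 3 no; D2 = {high}: 2 yes, 2 no
    g_split = (10/14) * gini(7, 3) + (4/14) * gini(2, 2)
    print(round(g_D, 3), round(g_split, 3))              # 0.459 0.443
    print(round(g_D - g_split, 3))                       # reduction in impurity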
Comparing Attribute Selection Measures

• The three measures, in general, return good results but


– Information gain:
• biased towards multivalued attributes
– Gain ratio:
• tends to prefer unbalanced splits in which one partition is
much smaller than the others
– Gini index:
• biased to multivalued attributes
• has difficulty when number of classes is large

Overfitting and Tree Pruning
• Overfitting: An induced tree may overfit the training data
– Too many branches, some may reflect anomalies due to
noise or outliers
– Poor accuracy for unseen samples
• Two approaches to avoid overfitting
– Prepruning: Halt tree construction early; do not split a node
if this would result in the goodness measure falling below a
threshold
• Difficult to choose an appropriate threshold
– Postpruning: Remove branches from a “fully grown” tree—
get a sequence of progressively pruned trees
• Use a set of data different from the training data to
decide which is the “best pruned tree”
Regression models
• Regression models can be used to approximate the given data.
• In (simple)linear regression, the data are modeled to fit a straight
line.
• For example, a random variable, y (called a response variable), can
be modeled as a linear function of another random variable, x
(called a predictor variable), with the equation
• y=wx+b
• where the variance of y is assumed to be constant. In the context of
data mining, x and y are numeric database attributes.
• The coefficients, w and b (called regression coefficients), specify the
slope of the line and the y-intercept, respectively.
• These coefficients can be solved for by the method of least squares,
which minimizes the error between the actual data points and the
estimate of the line.
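A minimal least-squares sketch for the line y = wx + b (function name and sample numbers are illustrative, not from the slides):

    def least_squares(xs, ys):
        n = len(xs)
        mx, my = sum(xs) / n, sum(ys) / n
        # slope w = sum (x - mx)(y - my) / sum (x - mx)^2 ; intercept b = my - w*mx
        w = (sum((x - mx) * (y - my) for x, y in zip(xs, ys))
             / sum((x - mx) ** 2 for x in xs))
        return w, my - w * mx

    # e.g., x = years of experience, y = salary in $1000s (made-up numbers)
    w, b = least_squares([3, 8, 9, 13, 3], [30, 57, 64, 72, 36])
    print(round(w, 2), round(b, 2))     # fitted slope and y-intercept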



Multiple linear regression
Multiple linear regression is an extension of (simple) linear
regression that allows a response variable, y, to be modeled as a
linear function of two or more predictor variables, e.g.

  y = b + w_1 x_1 + w_2 x_2

for two predictor variables x_1 and x_2.



Classifier Model Evaluation Metrics:
Confusion Matrix
Confusion Matrix:

Actual class \ Predicted class | C1                   | ¬C1
C1                             | True Positives (TP)  | False Negatives (FN)
¬C1                            | False Positives (FP) | True Negatives (TN)

Example of Confusion Matrix:

Actual class \ Predicted class | buy_computer = yes | buy_computer = no | Total
buy_computer = yes             | 6954               | 46                | 7000
buy_computer = no              | 412                | 2588              | 3000
Total                          | 7366               | 2634              | 10000

• Given m classes, an entry CM_{i,j} in a confusion matrix indicates the
number of tuples in class i that were labeled by the classifier as class j
• May have extra rows/columns to provide totals
Classifier Evaluation Metrics: Accuracy,
Error Rate, Sensitivity and Specificity

A \ P | C  | ¬C | Total
C     | TP | FN | P
¬C    | FP | TN | N
Total | P’ | N’ | All

• Classifier accuracy (recognition rate): percentage of test set tuples
that are correctly classified
  Accuracy = (TP + TN) / All
• Error rate: 1 - accuracy, or
  Error rate = (FP + FN) / All
• Sensitivity: true positive recognition rate
  Sensitivity = TP / P
• Specificity: true negative recognition rate
  Specificity = TN / N
Classifier Evaluation Metrics:
Precision and Recall, and F-measures
• Precision (exactness): what % of tuples that the classifier labeled as
positive are actually positive?
  Precision = TP / (TP + FP)
• Recall (completeness): what % of positive tuples did the classifier
label as positive?
  Recall = TP / (TP + FN) = TP / P
• F measure (F1 or F-score): harmonic mean of precision and recall:
  F1 = 2 × Precision × Recall / (Precision + Recall)
• F_β: weighted measure of precision and recall:
  F_β = (1 + β²) × Precision × Recall / (β² × Precision + Recall)
Classifier Evaluation Metrics: Example

Actual class \ Predicted class | cancer = yes | cancer = no | Total | Recognition (%)
cancer = yes                   | 90           | 210         | 300   | 30.00 (sensitivity)
cancer = no                    | 140          | 9560        | 9700  | 98.56 (specificity)
Total                          | 230          | 9770        | 10000 | 96.50 (accuracy)

– Precision = 90/230 = 39.13%    Recall = 90/300 = 30.00%
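A minimal sketch computing these metrics from the cancer confusion matrix (variable names are illustrative):

    TP, FN = 90, 210          # actual cancer = yes row
    FP, TN = 140, 9560        # actual cancer = no row
    P, N = TP + FN, FP + TN   # 300 positives, 9700 negatives

    accuracy    = (TP + TN) / (P + N)     # 0.9650
    sensitivity = TP / P                  # 0.3000 (also the recall)
    specificity = TN / N                  # 0.9856
    precision   = TP / (TP + FP)          # 0.3913
    f1 = 2 * precision * sensitivity / (precision + sensitivity)
    print(accuracy, sensitivity, specificity, precision, round(f1, 2))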
Holdout
• In this method, the given data are randomly partitioned into
two independent sets, a training set and a test set.
• Typically, two-thirds of the data are allocated to the training
set, and the remaining one-third is allocated to the test set.
• The training set is used to derive the model. The model’s
accuracy is then estimated with the test set .

Random subsampling
• Random subsampling is a variation of the
holdout method in which the holdout method
is repeated k times.
• The overall accuracy estimate is taken as the
average of the accuracies obtained from each
iteration.



Cross-validation
• In k-fold cross-validation, the initial data are randomly partitioned into k
mutually exclusive subsets or “folds,” D1, D2, …, Dk, each of
approximately equal size.
• Training and testing is performed k times.
• In iteration i, partition Di is reserved as the test set, and the remaining
partitions are collectively used to train the model.
• That is, in the first iteration, subsets D2, …, Dk collectively serve as the
training set to obtain a first model, which is tested on D1;
• the second iteration is trained on subsets D1, D3, …, Dk and tested on
D2; and so on.
• Unlike the holdout and random subsampling methods, here each sample is
used the same number of times for training and once for testing.
• For classification, the accuracy estimate is the overall number of correct
classifications from the k iterations, divided by the total number of tuples
in the initial data.
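A minimal sketch of the k-fold partitioning (random.shuffle stands in for the random partitioning; names are illustrative):

    import random

    def k_fold_indices(n, k, seed=0):
        idx = list(range(n))
        random.Random(seed).shuffle(idx)
        folds = [idx[i::k] for i in range(k)]    # k mutually exclusive folds
        for i in range(k):                       # iteration i: D_i is the test set
            train = [j for f in folds[:i] + folds[i+1:] for j in f]
            yield train, folds[i]

    for train, test in k_fold_indices(n=10, k=5):
        print(sorted(test))    # each sample appears in exactly one test fold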



Bootstrap
• The bootstrap method samples the given
training tuples uniformly with replacement.
• That is, each time a tuple is selected, it is equally
likely to be selected again and re-added to the
training set.
• For instance, imagine a machine that randomly
selects tuples for our training set.
• In sampling with replacement, the machine is
allowed to select the same tuple more than once.
.632 bootstrap

• Suppose we are given a data set of d tuples.


• The data set is sampled d times, with replacement,
resulting in a bootstrap sample or training set of d samples.
• It is very likely that some of the original data tuples will
occur more than once in this sample.
• The data tuples that did not make it into the training set
end up forming the test set.
• Suppose we were to try this out several times. As it turns
out, on average, 63.2% of the original data tuples will end
up in the bootstrap sample, and the remaining 36.8% will
form the test set (hence, the name, .632 bootstrap).
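The 63.2% figure is easy to check empirically. A minimal sketch (d and the number of trials are arbitrary choices):

    import random

    d, trials = 1000, 100
    rng = random.Random(0)
    fractions = []
    for _ in range(trials):
        sample = {rng.randrange(d) for _ in range(d)}    # d draws with replacement
        fractions.append(len(sample) / d)                # fraction of distinct tuples drawn
    print(round(sum(fractions) / trials, 3))  # ~0.632; the remaining ~36.8% form the test set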



Comparing Classifier Performance using ROC
(Receiver Operating Characteristic) Curves
• Receiver operating characteristic curves are a useful visual tool for comparing two
classification models.
• An ROC curve for a given model shows the trade-off between the true positive rate
(TPR) and the false positive rate (FPR).
• Given a test set and a model, TPR is the proportion of positive (or “yes”) tuples that
are correctly labeled by the model; FPR is the proportion of negative (or “no”)
tuples that are mislabeled as positive.
• Given that TP, FP, P, and N are the number of true positive, false positive, positive,
and negative tuples, respectively, we know that TPR = TP/P , which is sensitivity.
• Furthermore, FPR = FP/N , which is 1 - specificity.
• For a two-class problem, an ROC curve allows us to visualize the trade-off between
the rate at which the model can accurately recognize positive cases versus the rate
at which it mistakenly identifies negative cases as positive for different portions of
the test set.
• Any increase in TPR occurs at the cost of an increase in FPR. The area under the
ROC curve is a measure of the accuracy of the model.
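A minimal sketch of turning scored test tuples into ROC points, assuming the classifier returns a probability per tuple (names are illustrative):

    def roc_points(scored):          # scored: list of (probability, actual_label) pairs
        scored = sorted(scored, reverse=True)            # most confident positives first
        P = sum(1 for _, y in scored if y == 1)
        N = len(scored) - P
        tp = fp = 0
        points = [(0.0, 0.0)]
        for _, y in scored:          # lower the threshold one tuple at a time
            if y == 1:
                tp += 1
            else:
                fp += 1
            points.append((fp / N, tp / P))    # (FPR, TPR) = (1 - specificity, sensitivity)
        return points

    print(roc_points([(0.9, 1), (0.8, 1), (0.7, 0), (0.6, 1), (0.5, 0)]))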
Techniques to Improve Classification
Accuracy
• An ensemble for classification is a composite
model, made up of a combination of
classifiers.
• The individual classifiers vote, and a class label
prediction is returned by the ensemble based
on the collection of votes.
• Ensembles tend to be more accurate than
their component classifiers.



Bagging (bootstrap aggregation)

• The bagging algorithm creates an ensemble of classification models for a
learning scheme, where each model gives an equally weighted prediction.
• Input:
– D, a set of d training tuples;
– k, the number of models in the ensemble;
– a classification learning scheme (decision tree algorithm, naïve Bayesian,
etc.).
• Output: The ensemble—a composite model, M.
• Method:
– (1) for i = 1 to k do // create k models:
– (2) create bootstrap sample, Di, by sampling D with replacement;
– (3) use Di and the learning scheme to derive a model, Mi;
– (4) endfor
• To use the ensemble to classify a tuple, X:
– let each of the k models classify X and return the majority vote,
as in the sketch below.
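A minimal sketch of this method; learn stands in for the chosen learning scheme and is an assumption, not something the slide specifies:

    import random
    from collections import Counter

    def bagging(D, k, learn, seed=0):
        rng = random.Random(seed)
        models = []
        for _ in range(k):                          # (1) for i = 1 to k
            Di = [rng.choice(D) for _ in D]         # (2) bootstrap sample of D
            models.append(learn(Di))                # (3) derive model Mi
        def ensemble(x):                            # classify X: majority vote
            return Counter(m(x) for m in models).most_common(1)[0][0]
        return ensemble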



Boosting
• In boosting, weights are also assigned to each
training tuple.
• A series of k classifiers is iteratively learned.
• After a classifier, Mi, is learned, the weights are
updated to allow the subsequent classifier, Mi+1,
to “pay more attention” to the training tuples
that were misclassified by Mi.
• The final boosted classifier, M, combines the votes
of each individual classifier, where the weight of
each classifier’s vote is a function of its accuracy.
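AdaBoost is one concrete boosting algorithm. A minimal AdaBoost-style sketch of the weight update just described, assuming labels in {-1, +1} and a weighted base learner learn (both assumptions, not from the slide):

    import math

    def boost(D, labels, k, learn):
        n = len(D)
        w = [1 / n] * n                             # equal tuple weights to start
        models, alphas = [], []
        for _ in range(k):
            m = learn(D, labels, w)                 # train on the weighted tuples
            err = sum(wi for wi, x, y in zip(w, D, labels) if m(x) != y)
            if err == 0 or err >= 0.5:
                break
            alpha = 0.5 * math.log((1 - err) / err)  # vote weight grows with accuracy
            w = [wi * math.exp(alpha if m(x) != y else -alpha)  # misclassified tuples
                 for wi, x, y in zip(w, D, labels)]             # gain weight
            s = sum(w)
            w = [wi / s for wi in w]                # renormalize to a distribution
            models.append(m)
            alphas.append(alpha)
        def ensemble(x):                            # accuracy-weighted vote
            return 1 if sum(a * m(x) for a, m in zip(alphas, models)) >= 0 else -1
        return ensemble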
Random Forests
• We now present another ensemble method called random
forests.
• Imagine that each of the classifiers in the ensemble is a
decision tree classifier so that the collection of classifiers is
a “forest.”
• The individual decision trees are generated using a random
selection of attributes at each node to determine the split.
• More formally, each tree depends on the values of a
random vector sampled independently and with the same
distribution for all trees in the forest.
• During classification, each tree votes and the most popular
class is returned.
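A minimal sketch of the per-node random attribute selection, reusing the hypothetical info_gain helper from the ID3 sketch earlier; f is the number of attributes considered at each split:

    import random
    from collections import Counter

    def build_random_tree(rows, labels, attrs, f, rng):
        if len(set(labels)) == 1:
            return labels[0]
        if not attrs:
            return Counter(labels).most_common(1)[0][0]
        cand = rng.sample(attrs, min(f, len(attrs)))    # random subset at this node
        best = max(cand, key=lambda a: info_gain(rows, labels, a))
        node = {}
        for value in {row[best] for row in rows}:
            sub = [(r, l) for r, l in zip(rows, labels) if r[best] == value]
            srows, slabels = zip(*sub)
            node[(best, value)] = build_random_tree(list(srows), list(slabels),
                                                    [a for a in attrs if a != best],
                                                    f, rng)
        return node

    # A forest is k such trees, each grown on a bootstrap sample; during
    # classification, each tree votes and the most popular class wins.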

