Classification: Basic Concepts
or missing values
Typical applications
Credit/loan approval:
Classification—A Two-Step Process
Model construction: describing a set of predetermined classes
Each tuple/sample is assumed to belong to a predefined class, as
mathematical formulae
Model usage: for classifying future or unknown objects
Estimate accuracy of the model
Note: If the test set is used to select models, it is called validation (test) set
Process (1): Model Construction
[Figure: Training Data → Classification Algorithms → Classifier; Testing Data and Unseen Data, e.g. (Jeff, Professor, 4) → Tenured?]

NAME     RANK            YEARS  TENURED
Tom      Assistant Prof  2      no
Merlisa  Associate Prof  7      no
George   Professor       5      yes
Joseph   Assistant Prof  7      yes
Chapter 8. Classification: Basic Concepts
Decision Tree Induction: An Example
Training data set: Buys_computer
The data set follows an example of Quinlan’s ID3 (Playing Tennis)

age     income  student  credit_rating  buys_computer
<=30    high    no       fair           no
<=30    high    no       excellent      no
31…40   high    no       fair           yes
>40     medium  no       fair           yes
>40     low     yes      fair           yes
>40     low     yes      excellent      no
31…40   low     yes      excellent      yes
<=30    medium  no       fair           no
<=30    low     yes      fair           yes
>40     medium  yes      fair           yes
<=30    medium  yes      excellent      yes
31…40   medium  no       excellent      yes
31…40   high    yes      fair           yes
>40     medium  no       excellent      no

Resulting tree:
[Figure: decision tree with root test "age?" and branches <=30, 31…40, >40; leaves labelled no/yes]
Algorithm for Decision Tree Induction
Basic algorithm (a greedy algorithm)
Tree is constructed in a top-down recursive divide-and-
conquer manner
At start, all the training examples are at the root
discretized in advance)
Examples are partitioned recursively based on selected
attributes
Test attributes are selected on the basis of a heuristic or
m=2
Attribute Selection Measure:
Information Gain (ID3/C4.5)
Select the attribute with the highest information gain
Let p_i be the probability that an arbitrary tuple in D belongs to
class C_i, estimated by |C_{i,D}|/|D|
Expected information (entropy) needed to classify a tuple in D:
$$Info(D) = -\sum_{i=1}^{m} p_i \log_2(p_i)$$
Information needed (after using A to split D into v partitions) to classify D:
$$Info_A(D) = \sum_{j=1}^{v} \frac{|D_j|}{|D|} \times Info(D_j)$$
Information gained by branching on attribute A:
$$Gain(A) = Info(D) - Info_A(D)$$
Attribute Selection: Information Gain
Class P: buys_computer = “yes”
Class N: buys_computer = “no”
$$Info(D) = I(9,5) = -\frac{9}{14}\log_2\left(\frac{9}{14}\right) - \frac{5}{14}\log_2\left(\frac{5}{14}\right) = 0.940$$

age     p_i  n_i  I(p_i, n_i)
<=30    2    3    0.971
31…40   4    0    0
>40     3    2    0.971

$$Info_{age}(D) = \frac{5}{14} I(2,3) + \frac{4}{14} I(4,0) + \frac{5}{14} I(3,2) = 0.694$$
$\frac{5}{14} I(2,3)$ means “age <=30” has 5 out of 14 samples, with 2 yes’es and 3 no’s. Hence
$$Gain(age) = Info(D) - Info_{age}(D) = 0.246$$
Similarly,
$$Gain(income) = 0.029$$
$$Gain(student) = 0.151$$
$$Gain(credit\_rating) = 0.048$$
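The information-gain figures for the buys_computer data can be checked with a short script. This is a minimal sketch, not a full ID3 implementation: the helper names (`info`, `info_after_split`, `gain`) are illustrative, and the age range is written as the string "31...40".

```python
from collections import Counter
from math import log2

# Training tuples: (age, income, student, credit_rating, buys_computer)
DATA = [
    ("<=30", "high", "no", "fair", "no"),
    ("<=30", "high", "no", "excellent", "no"),
    ("31...40", "high", "no", "fair", "yes"),
    (">40", "medium", "no", "fair", "yes"),
    (">40", "low", "yes", "fair", "yes"),
    (">40", "low", "yes", "excellent", "no"),
    ("31...40", "low", "yes", "excellent", "yes"),
    ("<=30", "medium", "no", "fair", "no"),
    ("<=30", "low", "yes", "fair", "yes"),
    (">40", "medium", "yes", "fair", "yes"),
    ("<=30", "medium", "yes", "excellent", "yes"),
    ("31...40", "medium", "no", "excellent", "yes"),
    ("31...40", "high", "yes", "fair", "yes"),
    (">40", "medium", "no", "excellent", "no"),
]
ATTRIBUTES = {"age": 0, "income": 1, "student": 2, "credit_rating": 3}


def info(rows):
    """Entropy Info(D) of the class label (last field of each row)."""
    counts = Counter(row[-1] for row in rows)
    total = len(rows)
    return -sum(c / total * log2(c / total) for c in counts.values())


def info_after_split(rows, col):
    """Info_A(D): weighted entropy after partitioning rows on column col."""
    partitions = {}
    for row in rows:
        partitions.setdefault(row[col], []).append(row)
    return sum(len(p) / len(rows) * info(p) for p in partitions.values())


def gain(rows, col):
    """Gain(A) = Info(D) - Info_A(D)."""
    return info(rows) - info_after_split(rows, col)


gains = {name: gain(DATA, col) for name, col in ATTRIBUTES.items()}
best = max(gains, key=gains.get)  # attribute chosen for the root split
```

Computed without intermediate rounding, the gains come out slightly higher in the third decimal for age and student (about 0.247 and 0.152) than the slide values; the ranking is unchanged and age is selected.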
Computing Information-Gain for
Continuous-Valued Attributes
Let attribute A be a continuous-valued attribute
Must determine the best split point for A
Sort the values of A in increasing order
Typically, the midpoint between each pair of adjacent values
is considered as a possible split point
(a_i + a_{i+1})/2 is the midpoint between the values of a_i and a_{i+1}
The point with the minimum expected information
requirement for A is selected as the split-point for A
Split:
D1 is the set of tuples in D satisfying A ≤ split-point, and D2 is
the set of tuples in D satisfying A > split-point
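The split-point search can be sketched as follows. This is an illustrative, unoptimised version (it rescans the data for every candidate); the function names and the toy age data are assumptions made for the example, not part of the slides.

```python
from collections import Counter
from math import log2


def entropy(labels):
    """Info(D) for a list of class labels."""
    total = len(labels)
    return -sum(c / total * log2(c / total) for c in Counter(labels).values())


def best_split_point(values, labels):
    """Return (split_point, expected_info) minimising Info_A(D) over midpoints."""
    pairs = sorted(zip(values, labels))
    n = len(pairs)
    best = None
    for i in range(n - 1):
        a, b = pairs[i][0], pairs[i + 1][0]
        if a == b:
            continue  # identical adjacent values give no new midpoint
        split = (a + b) / 2
        left = [lab for v, lab in pairs if v <= split]   # D1: A <= split-point
        right = [lab for v, lab in pairs if v > split]   # D2: A > split-point
        expected = len(left) / n * entropy(left) + len(right) / n * entropy(right)
        if best is None or expected < best[1]:
            best = (split, expected)
    return best


# Hypothetical toy data: ages with a yes/no class label.
ages = [22, 25, 30, 35, 41, 52]
buys = ["no", "no", "no", "yes", "yes", "yes"]
split, expected_info = best_split_point(ages, buys)
```

On this toy data the midpoint 32.5 separates the two classes perfectly, so its expected information requirement is zero.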
Gain Ratio for Attribute Selection (C4.5)
Information gain measure is biased towards attributes with a
large number of values
C4.5 (a successor of ID3) uses gain ratio to overcome the
problem (normalization to information gain)
$$SplitInfo_A(D) = -\sum_{j=1}^{v} \frac{|D_j|}{|D|} \times \log_2\left(\frac{|D_j|}{|D|}\right)$$
GainRatio(A) = Gain(A)/SplitInfo_A(D)
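As a quick numeric check of the gain-ratio formula, the sketch below applies it to the income attribute of the buys_computer training data, which splits the 14 tuples into partitions of size 4 (high), 6 (medium) and 4 (low), and reuses Gain(income) = 0.029 from the information-gain slide. The function names are illustrative.

```python
from math import log2


def split_info(partition_sizes):
    """SplitInfo_A(D) computed from the sizes |D_j| of the partitions."""
    total = sum(partition_sizes)
    return -sum(s / total * log2(s / total) for s in partition_sizes if s > 0)


def gain_ratio(gain, partition_sizes):
    """GainRatio(A) = Gain(A) / SplitInfo_A(D); undefined if SplitInfo is 0."""
    return gain / split_info(partition_sizes)


income_sizes = [4, 6, 4]  # high, medium, low
income_split_info = split_info(income_sizes)
income_gain_ratio = gain_ratio(0.029, income_sizes)
```

The split information is about 1.557, giving a gain ratio of about 0.019 for income. An attribute that puts every tuple in one partition has a split information of 0, so a real implementation needs to guard that division.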
Ex.
noise or outliers
Poor accuracy for unseen samples
Classification in Large Databases
Classification—a classical problem extensively studied by
statisticians and machine learning researchers
Scalability: Classifying data sets with millions of examples and
hundreds of attributes with reasonable speed
Why is decision tree induction popular?
relatively faster learning speed (than other classification
methods)
convertible to simple and easy to understand classification
rules
can use SQL queries for accessing databases
Scalability Framework for RainForest
Rainforest: Training Set and Its AVC Sets
Presentation of Classification Results
Bayes’ Theorem:
$$P(H|X) = \frac{P(X|H)\,P(H)}{P(X)}$$
Let X be a data sample (“evidence”): class label is unknown
Let H be a hypothesis that X belongs to class C
Classification is to determine P(H|X), (i.e., posteriori probability): the
probability that the hypothesis holds given the observed data sample X
P(H) (prior probability): the initial probability
E.g., X will buy computer, regardless of age, income, …
medium income
Prediction Based on Bayes’ Theorem
Given training data X, posteriori probability of a hypothesis H,
P(H|X), follows the Bayes’ theorem
$$P(H|X) = \frac{P(X|H)\,P(H)}{P(X)}$$
Informally, this can be viewed as
posteriori = likelihood x prior/evidence
Predicts X belongs to Ci iff the probability P(Ci|X) is the highest
among all the P(Ck|X) for all the k classes
Practical difficulty: It requires initial knowledge of many
probabilities, involving significant computational cost
Classification Is to Derive the Maximum Posteriori
Let D be a training set of tuples and their associated class labels,
and each tuple is represented by an n-D attribute vector X = (x1,
x2, …, xn)
Suppose there are m classes C1, C2, …, Cm.
Classification is to derive the maximum posteriori, i.e., the
maximal P(Ci|X)
This can be derived from Bayes’ theorem
$$P(C_i|X) = \frac{P(X|C_i)\,P(C_i)}{P(X)}$$
Naïve Bayes Classifier
A simplified assumption: attributes are conditionally
independent (i.e., no dependence relation between attributes):
$$P(X|C_i) = \prod_{k=1}^{n} P(x_k|C_i) = P(x_1|C_i) \times P(x_2|C_i) \times \cdots \times P(x_n|C_i)$$
This greatly reduces the computation cost: only counts the class distribution
If Ak is categorical, P(xk|Ci) is the # of tuples in Ci having value xk
for Ak divided by |Ci, D| (# of tuples of Ci in D)
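If Ak is continuous-valued, P(xk|Ci) is usually computed based on a
Gaussian distribution with a mean μ and standard deviation σ
$$g(x, \mu, \sigma) = \frac{1}{\sqrt{2\pi}\,\sigma}\, e^{-\frac{(x-\mu)^2}{2\sigma^2}}$$
and P(xk|Ci) is
$$P(x_k|C_i) = g(x_k, \mu_{C_i}, \sigma_{C_i})$$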
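A minimal sketch of the naïve Bayes computation on the buys_computer training data is given below. It scores each class by P(X|Ci) × P(Ci) using plain relative frequencies (no smoothing), and includes the Gaussian density for continuous attributes only as a helper. The query tuple `x` and all function names are illustrative assumptions.

```python
from collections import Counter
from math import exp, pi, sqrt

# Rows: age income student credit_rating buys_computer
DATA = [line.split() for line in """
<=30 high no fair no
<=30 high no excellent no
31...40 high no fair yes
>40 medium no fair yes
>40 low yes fair yes
>40 low yes excellent no
31...40 low yes excellent yes
<=30 medium no fair no
<=30 low yes fair yes
>40 medium yes fair yes
<=30 medium yes excellent yes
31...40 medium no excellent yes
31...40 high yes fair yes
>40 medium no excellent no
""".strip().splitlines()]


def gaussian(x, mu, sigma):
    """g(x, mu, sigma): density used for a continuous-valued attribute."""
    return exp(-((x - mu) ** 2) / (2 * sigma ** 2)) / (sqrt(2 * pi) * sigma)


def naive_bayes_scores(rows, x):
    """Return {class: P(X|Ci) * P(Ci)} for a categorical attribute vector x."""
    class_counts = Counter(row[-1] for row in rows)
    scores = {}
    for label, n_label in class_counts.items():
        score = n_label / len(rows)  # prior P(Ci)
        for k, value in enumerate(x):
            matches = sum(1 for row in rows if row[-1] == label and row[k] == value)
            score *= matches / n_label  # P(x_k | Ci)
        scores[label] = score
    return scores


# Hypothetical query: a young student with medium income and fair credit.
x = ["<=30", "medium", "yes", "fair"]
scores = naive_bayes_scores(DATA, x)
prediction = max(scores, key=scores.get)
```

For this query the scores are roughly 0.028 for "yes" and 0.007 for "no", so the tuple is assigned to buys_computer = "yes". P(X) is omitted because it is the same for every class and does not affect the comparison.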
“uncorrected” counterparts
Naïve Bayes Classifier: Comments
Advantages
Easy to implement
Disadvantages
Assumption: class conditional independence, therefore loss of
accuracy
Practically, dependencies exist among variables
Bayes Classifier
How to deal with these dependencies? Bayesian Belief Networks
(Chapter 9)
Classification: Basic Concepts
Decision Tree Induction
Bayes Classification Methods
Rule-Based Classification
Model Evaluation and Selection
Techniques to Improve Classification Accuracy:
Ensemble Methods
Summary
Using IF-THEN Rules for Classification
Represent the knowledge in the form of IF-THEN rules
R: IF age = youth AND student = yes THEN buys_computer = yes
Rule antecedent/precondition vs. rule consequent
Each time a rule is learned, the tuples covered by the rules are
removed
Repeat the process on the remaining tuples until termination
[Figure: positive examples, with the subsets covered by Rule 1, Rule 2, and Rule 3]
Rule Generation
To generate a rule
while(true)
find the best predicate p
if foil-gain(p) > threshold then add p to current rule
else break
[Figure: a rule grown by adding predicates, from A3=1 to A3=1&&A1=2 to A3=1&&A1=2&&A8=5, over positive and negative examples]
How to Learn-One-Rule?
Start with the most general rule possible: condition = empty
Adding new attributes by adopting a greedy depth-first strategy
Picks the one that most improves the rule quality
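The greedy rule-growing loop can be sketched as below. The slides do not give the FOIL-gain formula, so the usual definition is assumed here: pos' × (log2(pos'/(pos'+neg')) − log2(pos/(pos+neg))), where (pos, neg) are the tuples covered before adding the test and (pos', neg') after. The rule representation (a list of column/value tests) and the toy data are illustrative.

```python
from math import log2


def foil_gain(pos, neg, pos_new, neg_new):
    """Assumed FOIL gain of extending a rule covering (pos, neg) to (pos_new, neg_new)."""
    if pos_new == 0:
        return float("-inf")  # the extended rule covers no positive tuples
    return pos_new * (log2(pos_new / (pos_new + neg_new)) - log2(pos / (pos + neg)))


def learn_one_rule(rows, target, threshold=0.0):
    """Start from the empty condition and greedily add (column, value) tests."""
    rule, covered = [], rows
    while True:
        pos = sum(1 for r in covered if r[-1] == target)
        neg = len(covered) - pos
        if pos == 0:
            break
        best = None
        for col in range(len(rows[0]) - 1):
            for value in {r[col] for r in covered}:
                if (col, value) in rule:
                    continue  # test already part of the rule
                subset = [r for r in covered if r[col] == value]
                p = sum(1 for r in subset if r[-1] == target)
                g = foil_gain(pos, neg, p, len(subset) - p)
                if best is None or g > best[0]:
                    best = (g, col, value, subset)
        if best is None or best[0] <= threshold:
            break  # no predicate improves the rule enough
        rule.append((best[1], best[2]))
        covered = best[3]
    return rule, covered


# Hypothetical toy data: (age, student, buys_computer)
rows = [("youth", "yes", "yes"), ("youth", "no", "no"),
        ("senior", "yes", "yes"), ("senior", "no", "no")]
rule, covered = learn_one_rule(rows, target="yes")
```

On the toy data the loop adds the single test student = yes and then stops, because no further predicate has positive gain.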
Classifier Evaluation Metrics: Confusion
Matrix
Confusion Matrix:
Actual class \ Predicted class   C1                     ¬C1
C1                               True Positives (TP)    False Negatives (FN)
¬C1                              False Positives (FP)   True Negatives (TN)
Classifier Evaluation Metrics:
Precision and Recall, and F-measures
Precision: exactness – what % of tuples that the classifier
labeled as positive are actually positive
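The metrics derived from the confusion matrix can be computed directly from the four counts. The sketch below uses the standard definitions of accuracy, precision, recall (sensitivity), specificity and the Fβ measure; the counts in the example are hypothetical.

```python
def evaluation_metrics(tp, fn, fp, tn, beta=1.0):
    """Standard metrics from confusion-matrix counts (assumes no zero denominators)."""
    precision = tp / (tp + fp)      # of tuples labeled positive, fraction truly positive
    recall = tp / (tp + fn)         # of truly positive tuples, fraction found (sensitivity)
    specificity = tn / (tn + fp)    # of truly negative tuples, fraction found
    accuracy = (tp + tn) / (tp + fn + fp + tn)
    f_beta = (1 + beta ** 2) * precision * recall / (beta ** 2 * precision + recall)
    return {
        "accuracy": accuracy,
        "precision": precision,
        "recall": recall,
        "specificity": specificity,
        "f_beta": f_beta,
    }


# Hypothetical counts for a rare positive class.
metrics = evaluation_metrics(tp=90, fn=210, fp=140, tn=9560)
```

With these counts accuracy is 96.5% while recall is only 30%, which illustrates why accuracy alone can be misleading when the positive class is rare.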
Classifier Evaluation Metrics: Example
Evaluating Classifier Accuracy:
Holdout & Cross-Validation Methods
Holdout method
Given data is randomly partitioned into two independent sets
obtained
Cross-validation (k-fold, where k = 10 is most popular)
Randomly partition the data into k mutually exclusive subsets,
data
*Stratified cross-validation*: folds are stratified so that class
dist. in each fold is approx. the same as that in the initial data
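One simple way to build stratified folds is to shuffle the indices of each class separately and deal them round-robin into k folds. The sketch below does that; the function name, the seed parameter and the toy labels are assumptions for illustration.

```python
import random
from collections import defaultdict


def stratified_k_fold(labels, k, seed=0):
    """Return k lists of indices whose class proportions approximate the full data."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for index, label in enumerate(labels):
        by_class[label].append(index)
    folds = [[] for _ in range(k)]
    for indices in by_class.values():
        rng.shuffle(indices)
        for position, index in enumerate(indices):
            folds[position % k].append(index)  # deal each class round-robin
    return folds


# Hypothetical labels: 20 positive and 10 negative tuples, 5 folds.
labels = ["yes"] * 20 + ["no"] * 10
folds = stratified_k_fold(labels, k=5)

# Each fold serves once as the test set; the other k-1 folds form the training set.
splits = []
for i, test_idx in enumerate(folds):
    train_idx = [j for f, fold in enumerate(folds) if f != i for j in fold]
    splits.append((train_idx, test_idx))
```

Every fold here holds 4 "yes" and 2 "no" tuples, matching the 2:1 class ratio of the full data.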
Evaluating Classifier Accuracy: Bootstrap
Bootstrap
Works well with small data sets
Samples the given training tuples uniformly with replacement
i.e., each time a tuple is selected, it is equally likely to be selected
again and re-added to the training set
Several bootstrap methods exist; a common one is the .632 bootstrap
A data set with d tuples is sampled d times, with replacement, resulting in
a training set of d samples. The data tuples that did not make it into the
training set end up forming the test set. About 63.2% of the original data
end up in the bootstrap, and the remaining 36.8% form the test set (since
(1 − 1/d)^d ≈ e^(−1) = 0.368)
Repeat the sampling procedure k times, overall accuracy of the model:
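The 63.2% / 36.8% proportions of a single bootstrap round can be checked by simulation. This sketch only draws the sample and measures the left-out fraction; it does not train a model or compute the combined accuracy. The function name and the choice of d are illustrative.

```python
import random
from math import exp


def bootstrap_split(d, rng):
    """Sample d indices with replacement; indices never drawn form the test set."""
    train = [rng.randrange(d) for _ in range(d)]
    chosen = set(train)
    test = [i for i in range(d) if i not in chosen]
    return train, test


rng = random.Random(42)
d = 10_000
train, test = bootstrap_split(d, rng)
test_fraction = len(test) / d  # expected to be close to exp(-1), about 0.368
```

For large d the fraction of tuples left out of the training sample settles near e^(−1), which is where the name ".632 bootstrap" comes from.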
Estimating Confidence Intervals:
Classifier Models M1 vs. M2
Suppose we have 2 classifiers, M1 and M2, which one is better?
Use 10-fold cross-validation to obtain the mean error rates of M1 and M2
These mean error rates are just estimates of error on the true
population of future data cases
What if the difference between the 2 error rates is just
attributed to chance?
Use a test of statistical significance
Obtain confidence limits for our error estimates
Estimating Confidence Intervals:
Null Hypothesis
Perform 10-fold cross-validation
Assume samples follow a t distribution with k–1 degrees of
freedom (here, k=10)
Use t-test (or Student’s t-test)
Null Hypothesis: M1 & M2 are the same
If we can reject null hypothesis, then
we conclude that the difference between M1 & M2 is
statistically significant
Choose the model with the lower error rate
Estimating Confidence Intervals: t-test
where k1 & k2 are # of cross-validation samples used for M1 & M2, resp.
Estimating Confidence Intervals:
Table for t-distribution
Symmetric
Significance level, e.g., sig = 0.05 or 5%, means M1 & M2 are significantly different for 95% of population
Confidence limit, z = sig/2
Estimating Confidence Intervals:
Statistical Significance
Are M1 & M2 significantly different?
Compute t. Select significance level (e.g. sig = 5%)
Consult table for t-distribution: Find t value corresponding to
k-1 degrees of freedom (here, 9)
t-distribution is symmetric: typically upper % points of
distribution shown → look up value for confidence limit
z=sig/2 (here, 0.025)
If t > z or t < -z, then t value lies in rejection region:
Reject null hypothesis that mean error rates of M1 & M2 are same
Conclude: statistically significant difference between M1 & M2
Otherwise, conclude that any difference is chance
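The decision procedure can be sketched for the case where both models are evaluated on the same 10 folds. The formula for t is not reproduced in the text above, so the standard paired t statistic is assumed here (mean of the per-fold differences divided by its standard error, using the sample variance). The per-fold error rates are hypothetical, and 2.262 is the tabulated upper 2.5% point of the t distribution with 9 degrees of freedom.

```python
from math import sqrt
from statistics import mean, variance


def paired_t_statistic(err1, err2):
    """Assumed paired t statistic over per-fold error rates of two models."""
    diffs = [a - b for a, b in zip(err1, err2)]
    k = len(diffs)
    return mean(diffs) / sqrt(variance(diffs) / k)  # undefined if all diffs are equal


# Hypothetical per-fold error rates from one 10-fold cross-validation run.
err_m1 = [0.10, 0.12, 0.11, 0.13, 0.09, 0.12, 0.10, 0.11, 0.12, 0.10]
err_m2 = [0.14, 0.15, 0.13, 0.16, 0.12, 0.15, 0.14, 0.13, 0.15, 0.13]

t = paired_t_statistic(err_m1, err_m2)
T_CRITICAL = 2.262  # sig = 5%, so look up 0.025 with k - 1 = 9 degrees of freedom
reject_null = abs(t) > T_CRITICAL
```

Here t is far below −2.262, so the null hypothesis is rejected and M1, which has the lower mean error rate, would be chosen.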
Model Selection: ROC Curves
ROC (Receiver Operating Characteristics) curves: for visual comparison of classification models
Originated from signal detection theory
Shows the trade-off between the true positive rate and the false positive rate
The area under the ROC curve is a measure of the accuracy of the model
Rank the test tuples in decreasing order: the one that is most likely to belong to the positive class appears at the top of the list
The closer to the diagonal line (i.e., the closer the area is to 0.5), the less accurate is the model
Vertical axis represents the true positive rate
Horizontal axis rep. the false positive rate
The plot also shows a diagonal line
A model with perfect accuracy will have an area of 1.0
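A minimal sketch of how ROC points and the area under the curve can be computed from ranked test tuples follows. It assumes each tuple has a score (estimated probability of the positive class), that there are no tied scores, and that both classes are present; the scores and labels are hypothetical.

```python
def roc_points(scores, labels):
    """Return (FPR, TPR) points obtained by ranking tuples by decreasing score."""
    ranked = sorted(zip(scores, labels), key=lambda pair: -pair[0])
    positives = sum(labels)
    negatives = len(labels) - positives
    tp = fp = 0
    points = [(0.0, 0.0)]
    for _, label in ranked:
        if label:
            tp += 1  # a positive tuple moves the curve up
        else:
            fp += 1  # a negative tuple moves the curve right
        points.append((fp / negatives, tp / positives))
    return points


def auc(points):
    """Area under the curve by the trapezoidal rule."""
    return sum((x2 - x1) * (y1 + y2) / 2
               for (x1, y1), (x2, y2) in zip(points, points[1:]))


# Hypothetical test tuples: classifier scores and true labels (1 = positive).
scores = [0.90, 0.80, 0.70, 0.60, 0.55, 0.54, 0.53, 0.51, 0.50, 0.40]
labels = [1, 1, 0, 1, 1, 0, 0, 0, 1, 0]
points = roc_points(scores, labels)
area = auc(points)
```

The area for this ranking is 0.76; a ranking that places all positives above all negatives gives 1.0, and a random ranking is expected to give about 0.5.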
Issues Affecting Model Selection
Accuracy
classifier accuracy: predicting class label
Speed
time to construct the model (training time)
time to use the model (classification/prediction time)
Robustness: handling noise and missing values
Scalability: efficiency in disk-resident databases
Interpretability
understanding and insight provided by the model
Other measures, e.g., goodness of rules, such as decision tree
size or compactness of classification rules
Ensemble Methods
Problem Statement:
Mining complex knowledge from complex data, such as
that from enterprise data mining solutions, has been one
of the most challenging problems in the knowledge
discovery process.
In such situations, it is observed that no single
learning method is able to provide the most informative
knowledge on all kinds of datasets.
The word ‘ensemble’ literally means group.
Ensemble methods involve a group of predictive models
to achieve better accuracy and model stability.
Ensemble methods are known to impart supreme
Bias and Variance
Bias means ‘how much, on average, the predicted values
differ from the actual value’.
Variance means ‘how different the predictions of the model will be
at the same point if different samples are taken from the same
population’.
If you build a small tree, you get a model with low variance
and high bias. How do you balance the trade-off
between bias and variance?
Increasing the complexity of the model reduces prediction error
through lower bias.
As the model gets more complex, it ends up over-fitting, i.e.,
the model starts suffering from high variance.
A successful model is one that maintains a balance between these
two types of errors. This is known as managing the
bias-variance trade-off.
Ensemble learning is one way to carry out this trade-off analysis.
Ensemble Methods: Increasing the Accuracy
Ensemble methods
Use a combination of models to increase accuracy
classifiers
Boosting: weighted vote with a collection of classifiers
Bagging: Bootstrap Aggregation
Analogy: Diagnosis based on multiple doctors’ majority vote
Training
Given a set D of d tuples, at each iteration i, a training set Di of d tuples is
sampled with replacement from D (i.e., bootstrap)
A classifier model Mi is learned for each training set Di
Classification: classify an unknown sample X
Each classifier Mi returns its class prediction
The bagged classifier M* counts the votes and assigns the class with the
most votes to X
Prediction: can be applied to the prediction of continuous values by taking
the average value of each prediction for a given test tuple
Accuracy
Often significantly better than a single classifier derived from D
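The training and voting steps of bagging can be sketched as follows. The base learner here is a one-dimensional threshold classifier chosen only to keep the example short; it, the function names, and the toy data are assumptions, and any classifier could take its place.

```python
import random
from collections import Counter


def train_stump(sample):
    """Learn a 1-D threshold classifier (a deliberately simple base learner)."""
    best = None
    for threshold, _ in sample:
        for below, above in (("no", "yes"), ("yes", "no")):
            errors = sum(1 for x, y in sample
                         if (below if x <= threshold else above) != y)
            if best is None or errors < best[0]:
                best = (errors, threshold, below, above)
    _, threshold, below, above = best
    return lambda x: below if x <= threshold else above


def bagging(data, k, rng):
    """Train k models Mi, each on a bootstrap sample Di of len(data) tuples."""
    models = []
    for _ in range(k):
        sample = [rng.choice(data) for _ in data]  # sampled with replacement
        models.append(train_stump(sample))
    return models


def bagged_predict(models, x):
    """M*: count the votes and return the class with the most votes."""
    votes = Counter(model(x) for model in models)
    return votes.most_common(1)[0][0]


# Hypothetical data: the class is "yes" exactly when x > 5.
rng = random.Random(0)
data = [(x, "yes" if x > 5 else "no") for x in range(11)]
models = bagging(data, k=25, rng=rng)
```

For numeric prediction, the voting step would be replaced by averaging the individual predictions.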
Boosting
Analogy: Consult several doctors, based on a combination of
weighted diagnoses—weight assigned based on the previous
diagnosis accuracy
How does boosting work?
Weights are assigned to each training tuple
A series of k classifiers is iteratively learned
After a classifier Mi is learned, the weights are updated to
allow the subsequent classifier, Mi+1, to pay more attention to
the training tuples that were misclassified by Mi
The final M* combines the votes of each individual classifier,
where the weight of each classifier's vote is a function of its
accuracy
Boosting algorithm can be extended for numeric prediction
Comparing with bagging: Boosting tends to have greater accuracy,
but it also risks overfitting the model to misclassified data
Pseudocode of Bagging Algorithm
Note: Bagging works especially well for unstable algorithms, where small
changes in the training data might result in significant changes in the output
classifiers.
• However, this algorithm would yield poor ensembles for stable algorithms.
• Stable algorithms include K-nearest neighbor, whereas decision trees, rule
learners and neural networks are considered to be unstable
AdaBoost (Freund and Schapire, 1997)
Given a set of d class-labeled tuples, (X1, y1), …, (Xd, yd)
Initially, all the weights of tuples are set the same (1/d)
Generate k classifiers in k rounds. At round i,
Tuples from D are sampled (with replacement) to form a training set Di
of the same size
Each tuple’s chance of being selected is based on its weight
A classification model Mi is derived from Di
Its error rate is calculated using Di as a test set
If a tuple is misclassified, its weight is increased; otherwise, it is decreased
Error rate: err(Xj) is the misclassification error of tuple Xj. Classifier Mi error
rate is the sum of the weights of the misclassified tuples:
$$error(M_i) = \sum_{j=1}^{d} w_j \times err(X_j)$$
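The weighted error rate and one round of weight updating can be sketched as below. The exact update rule is not shown in the text above, so one common formulation is assumed: the weights of correctly classified tuples are multiplied by error/(1 − error) and all weights are then renormalised, which raises the relative weight of misclassified tuples. Function names and the toy round are illustrative.

```python
def weighted_error(weights, misclassified):
    """error(Mi): sum of the weights of the misclassified tuples."""
    return sum(w for w, wrong in zip(weights, misclassified) if wrong)


def update_weights(weights, misclassified):
    """Assumed update: shrink correct tuples by error/(1 - error), then renormalise.

    Requires 0 < error < 1; boosting rounds normally also require error < 0.5.
    """
    error = weighted_error(weights, misclassified)
    factor = error / (1 - error)
    scaled = [w if wrong else w * factor
              for w, wrong in zip(weights, misclassified)]
    total = sum(scaled)
    return [w / total for w in scaled]


# Hypothetical round: 5 tuples with equal initial weights 1/d, one misclassified.
d = 5
weights = [1 / d] * d
misclassified = [True, False, False, False, False]
error = weighted_error(weights, misclassified)
new_weights = update_weights(weights, misclassified)
```

After this round the misclassified tuple carries weight 0.5 and each correctly classified tuple 0.125, so the next classifier is more likely to sample, and focus on, the hard tuple.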
returned
Two Methods to construct Random Forest:
Forest-RI (random input selection): Randomly select, at each node, F
attributes as candidates for the split at the node. The CART methodology
is used to grow the trees to maximum size
Forest-RC (random linear combinations): Creates new attributes (or
class
Threshold-moving: moves the decision threshold, t, so that
the rare class tuples are easier to classify, and hence, less
chance of costly false negative errors
Ensemble techniques: Ensemble multiple classifiers
introduced above
Still difficult for class imbalance problem on multiclass tasks
Summary (I)
Classification is a form of data analysis that extracts models
describing important data classes.
Effective and scalable methods have been developed for decision
tree induction, Naive Bayesian classification, rule-based
classification, and many other classification methods.
Evaluation metrics include: accuracy, sensitivity, specificity,
precision, recall, F measure, and Fβ measure.
Stratified k-fold cross-validation is recommended for accuracy
estimation. Bagging and boosting can be used to increase overall
accuracy by learning and combining a series of individual models.
Summary (II)
Significance tests and ROC curves are useful for model selection.
There have been numerous comparisons of the different
classification methods; the matter remains a research topic
No single method has been found to be superior over all others
for all data sets
Issues such as accuracy, training time, robustness, scalability,
and interpretability must be considered and can involve trade-
offs, further complicating the quest for an overall superior
method