LECTURE 10
Classification
Basic Concepts
Decision Trees
Catching tax-evasion
[Figure: the tax-evasion training data (Tid, Refund, Marital Status, Taxable Income, Cheat) is used to learn a model (induction); the model is then applied to a test set whose class labels are unknown (deduction).]

Test Set:
Tid  Attrib1  Attrib2  Attrib3  Class
11   No       Small    55K      ?
12   Yes      Medium   80K      ?
13   Yes      Large    110K     ?
14   No       Small    95K      ?
15   No       Large    67K      ?
Evaluation of classification models
• Counts of test records that are correctly (or incorrectly) predicted by the classification model
• Confusion matrix:

                     Predicted Class = 1   Predicted Class = 0
Actual Class = 1            f11                   f10
Actual Class = 0            f01                   f00
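To make the counts concrete, here is a minimal Python sketch (not from the lecture; the function name is illustrative) that tallies the four confusion-matrix entries f11, f10, f01, f00 from lists of actual and predicted binary labels:

from collections import Counter

def confusion_matrix(actual, predicted):
    """Tally (actual, predicted) pairs for binary labels 1/0."""
    counts = Counter(zip(actual, predicted))
    return {"f11": counts[(1, 1)],   # actual 1, predicted 1
            "f10": counts[(1, 0)],   # actual 1, predicted 0
            "f01": counts[(0, 1)],   # actual 0, predicted 1
            "f00": counts[(0, 0)]}   # actual 0, predicted 0

# Example with 6 test records
print(confusion_matrix([1, 1, 0, 0, 1, 0], [1, 0, 0, 1, 1, 0]))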
[Figure: training data (Tid, Refund, Marital Status, Taxable Income, Cheat) and the corresponding decision tree model; the internal nodes test the splitting attributes and the leaves carry the class labels.]
Another Example of Decision Tree

Training data (Tid, Refund, Marital Status, Taxable Income, Cheat):
1   Yes  Single    125K  No
2   No   Married   100K  No
3   No   Single    70K   No
4   Yes  Married   120K  No
5   No   Divorced  95K   Yes
6   No   Married   60K   No
7   Yes  Divorced  220K  No
8   No   Single    85K   Yes
9   No   Married   75K   No
10  No   Single    90K   Yes

Alternative tree for the same data:
MarSt = Married → NO
MarSt = Single or Divorced, Refund = Yes → NO
MarSt = Single or Divorced, Refund = No, TaxInc < 80K → NO
MarSt = Single or Divorced, Refund = No, TaxInc > 80K → YES

There could be more than one tree that fits the same data!
Decision Tree Classification Task
[Figure: a tree induction algorithm learns a decision tree (the model) from the labeled Training Set; the model is then applied to the unlabeled Test Set to predict its class values.]
Apply Model to Test Data

Test record: Refund = No, Marital Status = Married, Taxable Income = 80K, Cheat = ?

Decision tree: Refund = Yes → NO; Refund = No → MarSt; MarSt = Married → NO; MarSt = Single or Divorced → TaxInc; TaxInc < 80K → NO; TaxInc > 80K → YES.

Start from the root of the tree:
• Refund = No → follow the No branch to the MarSt node
• MarSt = Married → follow the Married branch to the leaf labeled NO
• Assign Cheat to “No”
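To make the walk-through concrete, here is a small Python sketch (not part of the slides; the attribute names and dictionary encoding are illustrative) that hard-codes this decision tree as nested conditionals and classifies the test record:

def classify(record):
    """Decision tree from the slide: Refund -> MarSt -> TaxInc."""
    if record["Refund"] == "Yes":
        return "No"                       # leaf NO
    if record["MaritalStatus"] == "Married":
        return "No"                       # leaf NO
    if record["TaxableIncome"] < 80:      # income in thousands (K)
        return "No"                       # leaf NO
    return "Yes"                          # leaf YES

# Test record from the slide: Refund = No, Married, Taxable Income = 80K
print(classify({"Refund": "No", "MaritalStatus": "Married", "TaxableIncome": 80}))  # -> "No"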
Tree Induction
• Finding the best decision tree is NP-hard
• Greedy strategy:
• Split the records based on an attribute test that optimizes a certain criterion
• Many algorithms:
• Hunt’s Algorithm (one of the earliest)
• CART
• ID3, C4.5
• SLIQ, SPRINT
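For reference, a minimal scikit-learn sketch of greedy (CART-style) tree induction on the tax data; the numeric encoding of Refund/Married and the max_depth value are illustrative assumptions, not part of the lecture:

from sklearn.tree import DecisionTreeClassifier

# Toy encoding of (Refund, Married, Taxable Income in K) -> Cheat
X = [[1, 0, 125], [0, 1, 100], [0, 0, 70], [1, 1, 120], [0, 0, 95],
     [0, 1, 60], [1, 0, 220], [0, 0, 85], [0, 1, 75], [0, 0, 90]]
y = ["No", "No", "No", "No", "Yes", "No", "No", "Yes", "No", "Yes"]

# Greedy induction with the Gini impurity criterion
clf = DecisionTreeClassifier(criterion="gini", max_depth=3).fit(X, y)
print(clf.predict([[0, 1, 80]]))  # test record: Refund = No, Married, 80K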
General Structure of Hunt’s Algorithm
• Let Dt be the set of training records that reach a node t
• General procedure:
• If Dt contains records that all belong to the same class yt, then t is a leaf node labeled as yt
• If Dt contains records from more than one class, use an attribute test to split Dt into smaller subsets, and apply the procedure recursively to each subset
• If Dt contains records with identical attribute values (but different classes), then t is a leaf node labeled with the majority class yt
• If Dt is an empty set, then t is a leaf node labeled by the default class yd

(Training data: the tax-evasion table, Tid 1–10, shown earlier.)
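A compact Python sketch of this recursive scheme (illustrative only: the attribute to split on is picked naively here, whereas a real implementation chooses the attribute that optimizes an impurity criterion):

from collections import Counter

def hunt(records, labels, attributes, default="No"):
    """Recursive skeleton of Hunt's algorithm on dict-encoded records."""
    if not records:                       # empty set -> default class
        return default
    if len(set(labels)) == 1:             # all records in one class -> leaf
        return labels[0]
    majority = Counter(labels).most_common(1)[0][0]
    if not attributes:                    # nothing left to split on -> majority leaf
        return majority
    attr = attributes[0]                  # naive choice (see note above)
    tree = {attr: {}}
    for value in set(r[attr] for r in records):
        pairs = [(r, l) for r, l in zip(records, labels) if r[attr] == value]
        remaining = [a for a in attributes if a != attr]
        tree[attr][value] = hunt([r for r, _ in pairs], [l for _, l in pairs],
                                 remaining, majority)
    return tree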
Hunt’s Algorithm

[Figure: Hunt’s algorithm applied step by step to the tax-evasion training data. The tree starts as a single leaf labeled Don’t Cheat, is first split on Refund (Refund = Yes → Don’t Cheat), the Refund = No branch is then split on Marital Status (Married → Don’t Cheat), and the Single/Divorced branch is finally split on Taxable Income (< 80K → Don’t Cheat, ≥ 80K → Cheat).]
Splitting Based on Nominal Attributes
• Binary split: divide the values into two subsets, e.g. CarType → {Sports, Luxury} vs {Family}, OR {Family, Luxury} vs {Sports}
Splitting Based on Ordinal Attributes
• Multi-way split: use as many partitions as distinct values, e.g. Size → Small / Medium / Large
• What about this split? Size → {Small, Large} vs {Medium}
Splitting Based on Continuous Attributes
• Different ways of handling
• Discretization to form an ordinal categorical attribute
• Static – discretize once at the beginning
• Dynamic – ranges can be found by equal interval bucketing,
equal frequency bucketing (percentiles), or clustering.
[Figure: two ways to split on Taxable Income — a binary test “Taxable Income > 80K?” (Yes / No), or a multi-way split into ordinal ranges such as < 10K, …, > 80K.]

How to determine the best split?
• A node with class counts C0: 5, C1: 5 is non-homogeneous, with a high degree of impurity
• A node with class counts C0: 9, C1: 1 is homogeneous, with a low degree of impurity
• Ideas?
Measuring Node Impurity
• p(i|t): fraction of records associated with node t belonging to class i

Entropy(t) = − Σ_{i=1..c} p(i|t) log p(i|t)
• Used in ID3 and C4.5

Gini(t) = 1 − Σ_{i=1..c} p(i|t)²
• Used in CART, SLIQ, SPRINT
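As a quick numerical check (not from the slides), a small Python sketch of both impurity measures applied to the class-count examples above:

import math

def entropy(counts):
    """Entropy(t) = -sum p(i|t) * log2 p(i|t), over classes with nonzero counts."""
    n = sum(counts)
    return -sum((c / n) * math.log2(c / n) for c in counts if c > 0)

def gini(counts):
    """Gini(t) = 1 - sum p(i|t)^2."""
    n = sum(counts)
    return 1.0 - sum((c / n) ** 2 for c in counts)

print(entropy([5, 5]), gini([5, 5]))   # impure node: 1.0, 0.5
print(entropy([9, 1]), gini([9, 1]))   # nearly pure node: ~0.47, 0.18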
[Table: Gini index for candidate split positions on the sorted Taxable Income values; the computed Gini values are 0.420, 0.400, 0.375, 0.343, 0.417, 0.400, 0.300, 0.343, 0.375, 0.400, 0.420, and the split position with the lowest Gini (0.300) is chosen.]
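A possible sketch of how such a table is produced (illustrative; it reuses the Taxable Income values and Cheat labels from the training table): sort the values, evaluate candidate cut points at the midpoints, and keep the split with the lowest weighted Gini.

def gini(counts):
    n = sum(counts)
    return 1.0 - sum((c / n) ** 2 for c in counts) if n else 0.0

def best_split(values, labels):
    """Return (weighted Gini, cut point) for the best binary split on a continuous attribute."""
    pairs = sorted(zip(values, labels))
    best = (float("inf"), None)
    for i in range(1, len(pairs)):
        cut = (pairs[i - 1][0] + pairs[i][0]) / 2          # candidate midpoint
        left = [l for v, l in pairs if v <= cut]
        right = [l for v, l in pairs if v > cut]
        def class_counts(part):
            return [part.count("Yes"), part.count("No")]
        w = (len(left) * gini(class_counts(left)) + len(right) * gini(class_counts(right))) / len(pairs)
        best = min(best, (w, cut))
    return best

income = [125, 100, 70, 120, 95, 60, 220, 85, 75, 90]
cheat = ["No", "No", "No", "No", "Yes", "No", "No", "Yes", "No", "Yes"]
print(best_split(income, cheat))   # cut near 97K with weighted Gini = 0.300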
Splitting based on impurity
• Impurity measures favor attributes with a large number of values
• The gain ratio corrects for this by dividing the gain by the split information:

GainRATIO_split = GAIN_split / SplitINFO
SplitINFO = − Σ_{i=1..k} (n_i / n) log(n_i / n)

• The parent node p is split into k partitions
• n_i is the number of records in partition i
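A small sketch (illustrative) showing how SplitINFO grows with the number of partitions, which is what lowers the gain ratio of splits with many values:

import math

def split_info(partition_sizes):
    """SplitINFO = -sum (n_i/n) * log2(n_i/n) over the k partitions."""
    n = sum(partition_sizes)
    return -sum((s / n) * math.log2(s / n) for s in partition_sizes if s > 0)

# Splitting 10 records two ways vs. ten ways:
print(split_info([5, 5]))      # 1.0
print(split_info([1] * 10))    # ~3.32 -> larger denominator, smaller GainRATIO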
Decision Boundary

[Figure: a 2D dataset split by the test x < 0.43? and then by tests on y; each leaf is pure (class counts such as 4:0 or 0:4), and the resulting decision regions are axis-parallel rectangles.]
• The border line between two neighboring regions of different classes is known as the decision boundary
• The decision boundary is parallel to the axes because each test condition involves a single attribute at a time
Expressiveness
• Decision trees provide an expressive representation for learning discrete-valued functions
• But they do not generalize well to certain types of Boolean functions
• Example: the parity function:
• Class = 1 if there is an even number of Boolean attributes with truth value = True
• Class = 0 if there is an odd number of Boolean attributes with truth value = True
• For accurate modeling, the tree must be a complete tree (its size is exponential in the number of attributes)
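A tiny illustration of why a complete tree is needed (it assumes scikit-learn is available; the choice of d = 4 is arbitrary): every pure leaf for the parity function must fix all d attributes, so an unpruned tree ends up with one leaf per input combination.

from itertools import product
from sklearn.tree import DecisionTreeClassifier

d = 4
X = [list(x) for x in product([0, 1], repeat=d)]   # all 2^d Boolean combinations
y = [1 if sum(x) % 2 == 0 else 0 for x in X]       # parity labels

tree = DecisionTreeClassifier().fit(X, y)
print(tree.get_n_leaves(), "leaves for", len(X), "inputs")  # expect 2^d = 16 leaves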
[Figure: an oblique decision boundary of the form x + y < 1 separating Class = + from Class = −; such a boundary requires a test on more than one attribute at a time.]

Evaluation
Underfitting and Overfitting (Example)
Circular points: 0.5 ≤ sqrt(x1² + x2²) ≤ 1
Triangular points: sqrt(x1² + x2²) < 0.5 or sqrt(x1² + x2²) > 1
Underfitting and Overfitting
[Figure: training and test error as a function of model complexity (number of nodes), with an underfitting region on the left and an overfitting region on the right.]

Underfitting: when the model is too simple, both training and test errors are large.
Overfitting: when the model is too complex, it models the details of the training set and fails on the test set.
Overfitting due to Noise
[Figure: noise points distort the decision boundary, so the overfitted tree misclassifies nearby test examples.]

Overfitting due to Insufficient Examples
• Lack of data points in the lower half of the diagram makes it difficult to correctly predict the class labels in that region
• An insufficient number of training records in the region causes the decision tree to predict the test examples using other training records that are irrelevant to the classification task
Notes on Overfitting
• Overfitting results in decision trees that are more complex than necessary
• Pessimistic approach to estimating the generalization error:
• For each leaf node: e′(t) = e(t) + 0.5
• Total errors: e′(T) = e(T) + N × 0.5 (N: number of leaf nodes)
• This penalizes large trees
• Example: a tree with 30 leaf nodes and 10 errors on training (out of 1000 instances)
• Training error = 10/1000 = 1%
• Generalization error estimate = (10 + 30 × 0.5)/1000 = 2.5%
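A one-line check of this estimate (plain arithmetic using only the numbers on the slide):

training_errors, leaves, n = 10, 30, 1000
print(training_errors / n)                    # 0.01  -> 1% training error
print((training_errors + 0.5 * leaves) / n)   # 0.025 -> 2.5% pessimistic estimate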
Metrics for Performance Evaluation

                     PREDICTED Class=Yes   PREDICTED Class=No
ACTUAL Class=Yes          a (TP)                b (FN)
ACTUAL Class=No           c (FP)                d (TN)

a: TP (true positive), b: FN (false negative), c: FP (false positive), d: TN (true negative)
Metrics for Performance Evaluation…
Using the same confusion matrix (a = TP, b = FN, c = FP, d = TN):

Weighted Accuracy = (w1·a + w4·d) / (w1·a + w2·b + w3·c + w4·d)
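A minimal sketch of the weighted-accuracy computation from confusion-matrix counts (the counts and weights below are made up for illustration):

def weighted_accuracy(a, b, c, d, w1=1, w2=1, w3=1, w4=1):
    """(w1*a + w4*d) / (w1*a + w2*b + w3*c + w4*d); with unit weights this
    reduces to plain accuracy (a + d) / (a + b + c + d)."""
    return (w1 * a + w4 * d) / (w1 * a + w2 * b + w3 * c + w4 * d)

print(weighted_accuracy(40, 10, 5, 45))         # plain accuracy: 0.85
print(weighted_accuracy(40, 10, 5, 45, w2=5))   # heavier penalty on false negatives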
Computing Cost of Classification

Cost matrix C(i|j): the cost of predicting class i when the actual class is j.

                 PREDICTED +   PREDICTED −
ACTUAL +             -1            100
ACTUAL −              1              0

Cost vs. accuracy:

                     PREDICTED Class=Yes   PREDICTED Class=No
ACTUAL Class=Yes            a                     b
ACTUAL Class=No             c                     d

N = a + b + c + d
Accuracy = (a + d) / N
Learning Curve
[Figure: accuracy as a function of the number of training instances.]
• Requires a sampling schedule for creating the learning curve
ROC (Receiver Operating Characteristic)
• At threshold t: TP = 0.5, FN = 0.5, FP = 0.12, TN = 0.88
ROC Curve

Points (TP rate, FP rate):
• (0,0): declare everything to be the negative class
• (1,1): declare everything to be the positive class
• (1,0): ideal
• Diagonal line: random guessing
• Below the diagonal line: the prediction is the opposite of the true class

                 PREDICTED Yes   PREDICTED No
ACTUAL Yes          a (TP)          b (FN)
ACTUAL No           c (FP)          d (TN)
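A short sketch of how the ROC points are traced by sweeping the decision threshold (the scores and labels below are made up for illustration; the function returns (FPR, TPR) pairs):

def roc_points(scores, labels):
    """Return (FPR, TPR) pairs, one per distinct threshold, highest threshold first."""
    pos = sum(1 for l in labels if l == 1)
    neg = len(labels) - pos
    points = [(0.0, 0.0)]                      # declare everything negative
    for t in sorted(set(scores), reverse=True):
        tp = sum(1 for s, l in zip(scores, labels) if s >= t and l == 1)
        fp = sum(1 for s, l in zip(scores, labels) if s >= t and l == 0)
        points.append((fp / neg, tp / pos))
    return points                              # ends at (1, 1): everything positive

scores = [0.9, 0.8, 0.7, 0.6, 0.55, 0.5, 0.4, 0.3]
labels = [1,   1,   0,   1,   1,    0,   0,   0  ]
print(roc_points(scores, labels))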
Using ROC for Model Comparison
• No model consistently outperforms the other:
• M1 is better for small FPR
• M2 is better for large FPR