ML 05 Decision Trees
Machine Learning
[Figure: the classification workflow. A training set of labeled records (Tid, Attrib1, Attrib2, Attrib3, Class) is used to learn a model; the model is then applied to a test set whose class labels are unknown (e.g. records 11 and 15).]
Intuition behind a decision tree
[Figure: the training data (Tid, Refund, Marital Status, Taxable Income, Cheat) and a decision tree induced from it. Internal nodes test the splitting attributes; leaves predict the Cheat label. The learned tree is then applied to test records with unknown labels.]
Apply Model to Test Data
Test record: Refund = No, Marital Status = Married, Taxable Income = 80K, Cheat = ?

Start from the root of the tree and, at each internal node, follow the branch that matches the record:

    Refund?
    ├─ Yes → NO
    └─ No → MarSt?
            ├─ Single, Divorced → TaxInc?
            │                     ├─ < 80K → NO
            │                     └─ > 80K → YES
            └─ Married → NO

Refund = No sends the record to the MarSt test, and Marital Status = Married leads straight to the leaf NO, so we assign Cheat = "No".
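The same walk can be written as a few lines of code. A minimal sketch in plain Python, with the tree hard-coded to match the figure above; the dictionary keys refund, marital_status and taxable_income are illustrative choices, not the lecture's notation:

```python
def classify(record):
    """Walk the decision tree from the figure above for one record."""
    if record["refund"] == "Yes":
        return "No"                      # leaf: NO
    # Refund = No -> test marital status
    if record["marital_status"] == "Married":
        return "No"                      # leaf: NO
    # Single or Divorced -> test taxable income
    return "Yes" if record["taxable_income"] > 80_000 else "No"

# The test record from the slide: Refund = No, Married, 80K
print(classify({"refund": "No", "marital_status": "Married", "taxable_income": 80_000}))
# -> "No"  (Cheat is assigned "No")
```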
Learning a Decision Tree Classifier
[Figure: the induction/deduction loop. A tree-induction algorithm learns a decision tree model from the labeled training set; the model is then applied to the test set (records with unknown class, e.g. Tid 11 and 15) to predict their labels.]
A Decision Tree (seen earlier)
Splitting attributes: Refund, MarSt, TaxInc.

Training data:

Tid  Refund  Marital Status  Taxable Income  Cheat
1    Yes     Single          125K            No
2    No      Married         100K            No
3    No      Single          70K             No
4    Yes     Married         120K            No
5    No      Divorced        95K             Yes
6    No      Married         60K             No
7    Yes     Divorced        220K            No
8    No      Single          85K             Yes
9    No      Married         75K             No
10   No      Single          90K             Yes

Besides the Refund-rooted tree used above, the same data is also fit by a tree that splits on MarSt first:

    MarSt?
    ├─ Married → NO
    └─ Single, Divorced → Refund?
                          ├─ Yes → NO
                          └─ No → TaxInc?
                                  ├─ < 80K → NO
                                  └─ > 80K → YES

There can be more than one tree that fits the same data!
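As a quick sanity check, the same toy table can be fed to an off-the-shelf learner. A sketch using scikit-learn; the numeric encoding of Refund and Marital Status is my own assumption (a one-hot encoding would be more principled), and the library may well produce yet another tree that fits this data:

```python
from sklearn.tree import DecisionTreeClassifier, export_text

# Toy training set from the slide, encoded numerically:
# Refund: Yes=1 / No=0; Marital Status: Single=0, Married=1, Divorced=2; income in K.
X = [[1, 0, 125], [0, 1, 100], [0, 0, 70], [1, 1, 120], [0, 2, 95],
     [0, 1, 60], [1, 2, 220], [0, 0, 85], [0, 1, 75], [0, 0, 90]]
y = ["No", "No", "No", "No", "Yes", "No", "No", "Yes", "No", "Yes"]

clf = DecisionTreeClassifier(criterion="gini", random_state=0).fit(X, y)
print(export_text(clf, feature_names=["Refund", "MarSt", "TaxInc"]))
print(clf.predict([[0, 1, 80]]))   # the test record: Refund = No, Married, 80K
```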
Challenge in learning a decision tree
● Exponentially many trees are consistent with a given training set, and finding an optimal one is computationally infeasible, so practical learners grow a tree greedily
● Many algorithms:
– Hunt’s Algorithm (one of the earliest)
– CART
– ID3, C4.5
– SLIQ, SPRINT
General Structure of Hunt’s Algorithm
● Let Dt be the set of training records that reach a node t
– If Dt contains records that all belong to the same class yt, then t is a leaf node labeled yt
– If Dt contains records from more than one class, choose an attribute test to split Dt into smaller subsets, and apply the procedure recursively to each child node
Hunt’s Algorithm
[Figure: Hunt’s algorithm on the Refund / Marital Status / Taxable Income data. The tree grows step by step: a single leaf (Don’t Cheat), then a split on Refund, then on Marital Status, then on Taxable Income, until each leaf is pure (Cheat / Don’t Cheat).]
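A minimal recursive sketch of this general structure in Python. The test-selection step is left abstract: choose_best_test is a hypothetical helper that a real implementation would base on one of the impurity measures introduced below.

```python
from collections import Counter

def hunt(records, labels, choose_best_test, depth=0, max_depth=5):
    """Grow a tree for the records that reach this node (Hunt's general structure).
    choose_best_test(records, labels) is an assumed helper returning a boolean
    test function that splits the records, or None if no useful split exists."""
    classes = Counter(labels)
    majority = classes.most_common(1)[0][0]
    # Case 1: all records belong to the same class -> leaf labeled with that class
    if len(classes) == 1 or depth == max_depth:
        return {"leaf": majority}
    # Case 2: records from several classes -> split on an attribute test and recurse
    test = choose_best_test(records, labels)
    if test is None:
        return {"leaf": majority}
    yes = [(r, l) for r, l in zip(records, labels) if test(r)]
    no = [(r, l) for r, l in zip(records, labels) if not test(r)]
    if not yes or not no:                  # degenerate split: stop here
        return {"leaf": majority}
    return {"test": test,
            "yes": hunt([r for r, _ in yes], [l for _, l in yes], choose_best_test, depth + 1, max_depth),
            "no":  hunt([r for r, _ in no],  [l for _, l in no],  choose_best_test, depth + 1, max_depth)}
```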
Tree Induction
● Greedy strategy
– Split the records based on an attribute test that optimizes a certain criterion
● Issues
– Determine how to split the records
◆ How to specify the attribute test condition?
◆ How to determine the best split?
Splitting Based on Ordinal Attributes
● What about this split? Size: {Small, Large} vs. {Medium}
– Grouping Small with Large while leaving Medium out violates the ordering of the attribute values
Splitting Based on Continuous Attributes
● Binary split: a two-way test such as Taxable Income > 80K? (Yes / No)
● Multi-way split: partition the attribute range into intervals, e.g. < 10K, ..., > 80K
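For a binary split on a continuous attribute, the usual trick is to sort the observed values and consider a threshold between each pair of adjacent distinct values. A small sketch; using midpoints as candidate thresholds is a common convention, not something the slide mandates:

```python
def candidate_thresholds(values):
    """Midpoints between adjacent distinct sorted values of a continuous attribute."""
    distinct = sorted(set(values))
    return [(a + b) / 2 for a, b in zip(distinct, distinct[1:])]

incomes = [125, 100, 70, 120, 95, 60, 220, 85, 75, 90]   # Taxable Income column (in K)
print(candidate_thresholds(incomes))
# 65.0, 72.5, 80.0, ..., 97.5, ..., 172.5 -- each defines a test "TaxInc <= t?"
# (the slide's table also adds positions below the minimum and above the maximum)
```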
How to determine the best split?
● Greedy approach:
– Nodes with a homogeneous class distribution are preferred
● Need a measure of node impurity:
– C0: 5, C1: 5 → non-homogeneous, high degree of impurity
– C0: 9, C1: 1 → homogeneous, low degree of impurity
Measures of Node Impurity
● Gini Index
● Entropy
● Misclassification error
How to Find the Best Split
Before splitting, the node has N00 records of class C0 and N01 of class C1, with impurity M0.
Candidate splits on attributes A and B each produce two children (Yes / No) with impurities M1, M2 and M3, M4; let M12 and M34 be the weighted average impurities of the children of A and B respectively.
Gain = M0 − M12 for split A vs. M0 − M34 for split B: choose the split with the larger gain, i.e. the larger reduction in impurity.
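In code, this comparison is a one-liner once we have an impurity function. A small sketch: impurity is any of the measures listed above (Gini, entropy, misclassification error), whose concrete formulas and implementations follow; each node is represented by its per-class record counts.

```python
def weighted_impurity(children_counts, impurity):
    """Weighted average impurity of the children; each child is a list of class counts."""
    n = sum(sum(c) for c in children_counts)
    return sum(sum(c) / n * impurity(c) for c in children_counts)

def gain(parent_counts, children_counts, impurity):
    """Reduction in impurity achieved by a split (M0 minus weighted child impurity)."""
    return impurity(parent_counts) - weighted_impurity(children_counts, impurity)
```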
Measure of Impurity: GINI Index
$\mathrm{GINI}(t) = 1 - \sum_{j} [\,p(j \mid t)\,]^{2}$

where p(j | t) is the relative frequency of class j at node t.

Example: Gini index of a split on the continuous attribute Taxable Income. Sort the incomes (60, 70, 75, 85, 90, 95, 100, 120, 125, 220; overall 7 No, 3 Yes), place candidate split positions between adjacent values, and count the records of each class on either side:

Split value      55    65    72    80    87    92    97    110   122   172   230
No  (≤ / >)      0/7   1/6   2/5   3/4   3/4   3/4   3/4   4/3   5/2   6/1   7/0
Yes (≤ / >)      0/3   0/3   0/3   0/3   1/2   2/1   3/0   3/0   3/0   3/0   3/0
Gini             0.420 0.400 0.375 0.343 0.417 0.400 0.300 0.343 0.375 0.400 0.420

The lowest weighted Gini (0.300) is obtained at the split Taxable Income ≤ 97K.
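A short sketch of the Gini computation and the threshold evaluation behind the table above, in plain Python; counts are per-class record counts at a node:

```python
def gini(counts):
    """Gini index of a node given its per-class record counts."""
    n = sum(counts)
    return 1.0 - sum((c / n) ** 2 for c in counts) if n else 0.0

def gini_of_split(income, labels, threshold):
    """Weighted Gini of the binary split 'Taxable Income <= threshold'."""
    left = [l for v, l in zip(income, labels) if v <= threshold]
    right = [l for v, l in zip(income, labels) if v > threshold]
    def counts(side):
        return [side.count("No"), side.count("Yes")]
    n = len(labels)
    return len(left) / n * gini(counts(left)) + len(right) / n * gini(counts(right))

income = [60, 70, 75, 85, 90, 95, 100, 120, 125, 220]
labels = ["No", "No", "No", "Yes", "Yes", "Yes", "No", "No", "No", "No"]
print(round(gini_of_split(income, labels, 97), 3))   # -> 0.3, the best split in the table
```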
Alternative Splitting Criteria based on INFO
● Entropy at node t:

$\mathrm{Entropy}(t) = -\sum_{j} p(j \mid t)\,\log_{2} p(j \mid t)$

● Information Gain:

$\mathrm{GAIN}_{\mathrm{split}} = \mathrm{Entropy}(p) - \sum_{i=1}^{k} \frac{n_i}{n}\,\mathrm{Entropy}(i)$

where the parent node p is split into k partitions and n_i is the number of records in partition i.
– Measures the reduction in entropy achieved by the split. Choose the split that achieves the largest reduction (i.e. maximizes GAIN).
– Used in ID3 and C4.5.
– Disadvantage: tends to prefer splits that result in a large number of partitions, each being small but pure.
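A sketch of entropy and information gain under the same conventions as the Gini snippet above (class counts per node, children given as lists of counts):

```python
from math import log2

def entropy(counts):
    """Entropy of a node given its per-class record counts."""
    n = sum(counts)
    return -sum((c / n) * log2(c / n) for c in counts if c) if n else 0.0

def information_gain(parent_counts, children_counts):
    """Entropy(parent) minus the weighted entropy of the child partitions."""
    n = sum(parent_counts)
    weighted = sum(sum(c) / n * entropy(c) for c in children_counts)
    return entropy(parent_counts) - weighted

# The 7-No / 3-Yes node split at Taxable Income <= 97, evaluated with entropy:
print(round(information_gain([7, 3], [[3, 3], [4, 0]]), 3))
```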
Splitting Based on INFO...
● Gain Ratio:

$\mathrm{GainRATIO}_{\mathrm{split}} = \frac{\mathrm{GAIN}_{\mathrm{split}}}{\mathrm{SplitINFO}}$, where $\mathrm{SplitINFO} = -\sum_{i=1}^{k} \frac{n_i}{n} \log \frac{n_i}{n}$

The parent node is split into k partitions and n_i is the number of records in partition i. Dividing by SplitINFO penalizes splits into a large number of small partitions, counteracting the disadvantage of Information Gain noted above. Used in C4.5.
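Continuing the sketch, a self-contained gain-ratio computation; log base 2 is assumed for SplitINFO, matching the entropy definition:

```python
from math import log2

def _entropy(counts):
    n = sum(counts)
    return -sum((c / n) * log2(c / n) for c in counts if c) if n else 0.0

def gain_ratio(parent_counts, children_counts):
    """Information gain divided by SplitINFO, the entropy of the partition sizes."""
    n = sum(parent_counts)
    sizes = [sum(c) for c in children_counts]
    info_gain = _entropy(parent_counts) - sum(s / n * _entropy(c)
                                              for s, c in zip(sizes, children_counts))
    split_info = -sum(s / n * log2(s / n) for s in sizes if s)
    return info_gain / split_info if split_info else 0.0

# Same 7-No / 3-Yes parent and <= 97K split as before:
print(round(gain_ratio([7, 3], [[3, 3], [4, 0]]), 3))
```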
Splitting Criteria based on Classification Error
$\mathrm{Error}(t) = 1 - \max_{i} P(i \mid t)$
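For completeness, a sketch of the misclassification error, evaluated on the two example nodes used earlier (5/5 and 9/1 class counts):

```python
def classification_error(counts):
    """Misclassification error: 1 minus the relative frequency of the majority class."""
    n = sum(counts)
    return 1.0 - max(counts) / n if n else 0.0

print(classification_error([5, 5]))   # 0.5  (non-homogeneous node)
print(classification_error([9, 1]))   # 0.1  (nearly homogeneous node)
```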
Decision Tree Based Classification
● Advantages:
– Extremely fast at classifying unknown records
– Easy to interpret, especially for small-sized trees
– Accuracy is comparable to other classification techniques for many simple data sets
DT classification: points to note
[Figure: a two-dimensional data set (attributes x and y in [0, 1], two classes) together with the decision tree that classifies it using tests such as x < 0.43 and y < 0.33; each leaf corresponds to a rectangular region containing points of a single class.]
• The border between two neighboring regions of different classes is known as the decision boundary
• The decision boundary is parallel to the axes because each test condition involves a single attribute at a time
Oblique Decision Trees
● Test conditions may involve more than one attribute, e.g. x + y < 1, with one class on each side of the line
● The resulting decision boundaries are no longer axis-parallel, which makes the tree more expressive but finding good test conditions more expensive

Underfitting and Overfitting
Example data set: circular points satisfy 0.5 ≤ sqrt(x1² + x2²) ≤ 1; triangular points satisfy sqrt(x1² + x2²) > 1 or sqrt(x1² + x2²) < 0.5.
Underfitting: when the DT is too simple, both training and test errors are large.
Overfitting: the DT has grown too large, and is now fitting the noise in the dataset.
Overfitting
● Overfitting results in decision trees that are more complex than necessary
● Training error no longer provides a good estimate of how well the tree will perform on previously unseen records
Overfitting due to Insufficient Examples
[Figure: the circular/triangular data set with very few training points in the lower half of the plot.]
Lack of data points in the lower half of the diagram makes it difficult to predict the class labels in that region correctly.
– An insufficient number of training records in the region causes the decision tree to classify the test examples using other training records that are irrelevant to the classification task.
Occam’s Razor
● Given two models with similar generalization errors, prefer the simpler one: a complex model has a greater chance of having been fitted accidentally by errors in the data.

● Post-pruning
– Grow the decision tree to its entirety
– Trim the nodes of the decision tree in a bottom-up fashion
– If the generalization error improves after trimming, replace the sub-tree with a leaf node
– The class label of the leaf node is determined from the majority class of the instances in the sub-tree
– MDL (minimum description length) can be used to estimate generalization error for post-pruning
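A minimal sketch of the bottom-up idea, using a held-out validation set as a stand-in for the slide's "generalization error" (reduced-error pruning). The dict-based tree format matches the Hunt's-algorithm sketch earlier and is an assumption, not the lecture's notation; labeling the pruned leaf from the validation majority is a further simplification.

```python
from collections import Counter

def misclassified(tree, records, labels):
    """Count how many of the given records the (sub)tree classifies incorrectly."""
    wrong = 0
    for r, l in zip(records, labels):
        node = tree
        while "leaf" not in node:
            node = node["yes"] if node["test"](r) else node["no"]
        wrong += node["leaf"] != l
    return wrong

def prune(tree, records, labels):
    """Bottom-up: prune the children first, then replace this subtree with a
    majority-class leaf if that does not increase error on the records reaching it."""
    if "leaf" in tree or not records:
        return tree
    yes_ids = {i for i, r in enumerate(records) if tree["test"](r)}
    def subset(ids):
        return [records[i] for i in ids], [labels[i] for i in ids]
    yes_r, yes_l = subset(sorted(yes_ids))
    no_r, no_l = subset([i for i in range(len(records)) if i not in yes_ids])
    tree["yes"] = prune(tree["yes"], yes_r, yes_l)
    tree["no"] = prune(tree["no"], no_r, no_l)
    leaf = {"leaf": Counter(labels).most_common(1)[0][0]}
    return leaf if misclassified(leaf, records, labels) <= misclassified(tree, records, labels) else tree
```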
Other Issues
● Data Fragmentation
● Search Strategy
● Expressiveness
● Tree Replication
Data Fragmentation
● The number of records becomes smaller as we traverse down the tree, and the leaf nodes may end up with too few records to support a statistically significant decision

Search Strategy
● The induction procedure presented so far is a greedy, top-down, recursive partitioning strategy
● Other strategies?
– Bottom-up
– Bi-directional
Expressiveness
● A decision tree can represent any function of discrete-valued attributes, but because each test condition involves only a single attribute, concepts such as the oblique boundary x + y < 1 require very large trees

Tree Replication
[Figure: a tree over attributes Q, R and S with 0/1 leaves, in which the same subtree (rooted at Q) appears on more than one branch; replicated subtrees make the tree larger than necessary.]