Decision Tree

[Figure: the decision tree classification task. A model is learned from a
labeled Training Set (Tid, Attrib1, Attrib2, Attrib3, Class; e.g. Tid 3: No,
Small, 70K, No; Tid 6: No, Medium, 60K, No) and then applied to an unlabeled
Test Set (e.g. Tid 11: No, Small, 55K, ?; Tid 15: No, Large, 67K, ?).]
Classification Techniques
• Decision Tree-based Methods
• Rule-based Methods
• Memory-based reasoning
• Neural Networks
• Naïve Bayes and Bayesian Belief Networks
• Support Vector Machines
Example of a Decision Tree

[Figure: a decision tree learned from the training set (Tid, Refund, Marital
Status, Taxable Income, Cheat). The splitting attributes are Refund at the
root (Yes/No), then Marital Status (Single, Divorced vs. Married), then
Taxable Income (< 80K vs. >= 80K), with leaves labeled Cheat = Yes or No. The
induced model is then applied to the test set.]
Apply Model to Test Data

Test record: Refund = No, Marital Status = Married, Taxable Income = 80K,
Cheat = ?

Start from the root of the tree and follow the branch that matches the record
at each node:

  Refund? -- Yes --> NO
          -- No  --> MarSt? -- Married --> NO
                            -- Single, Divorced --> TaxInc? -- < 80K  --> NO
                                                            -- >= 80K --> YES
Apply Model to Test Data

The record takes the Refund = No branch to MarSt, then the Married branch to
the NO leaf, so the traversal ends: assign Cheat = "No" to the test record.
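The traversal just described can be written as plain conditionals. A minimal sketch, assuming dictionary records with field names of my own choosing:

```python
# Sketch of the example tree as nested conditionals (field names are mine).
def classify(record):
    # Root: Refund?
    if record["Refund"] == "Yes":
        return "No"                       # Refund = Yes leaf: NO
    # MarSt?
    if record["MaritalStatus"] == "Married":
        return "No"                       # Married leaf: NO
    # TaxInc? (Single, Divorced branch)
    if record["TaxableIncome"] < 80_000:
        return "No"
    return "Yes"

test = {"Refund": "No", "MaritalStatus": "Married", "TaxableIncome": 80_000}
print(classify(test))  # No
```

The record never reaches the TaxInc node, since the Married branch already ends in a leaf.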
Decision Tree Classification Task

[Figure: a tree induction algorithm learns a decision tree model from the
Training Set (Tid, Attrib1, Attrib2, Attrib3, Class; e.g. Tid 1: Yes, Large,
125K, No; Tid 2: No, Medium, 100K, No; Tid 3: No, Small, 70K, No; Tid 6: No,
Medium, 60K, No), and the model is then applied to the unlabeled Test Set
(e.g. Tid 11: No, Small, 55K, ?; Tid 15: No, Large, 67K, ?).]
Decision Tree Induction
• Many Algorithms:
– Hunt’s Algorithm (one of the earliest)
– CART
– ID3, C4.5
– SLIQ, SPRINT
General Structure of Hunt's Algorithm

• Let Dt be the set of training records that reach a node t
• General Procedure:
  – If Dt contains records that all belong to the same class yt, then t is a
    leaf node labeled as yt
  – If Dt contains records that belong to more than one class, use an
    attribute test to split the records into smaller subsets; recursively
    apply the procedure to each subset

Training set:

  Tid  Refund  Marital Status  Taxable Income  Cheat
   1   Yes     Single          125K            No
   2   No      Married         100K            No
   3   No      Single          70K             No
   4   Yes     Married         120K            No
   5   No      Divorced        95K             Yes
   6   No      Married         60K             No
   7   Yes     Divorced        220K            No
   8   No      Single          85K             Yes

[Figure: the tree grows from a single "Don't Cheat" leaf into a split on
Refund (Yes: Don't Cheat; No: further splits).]
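Hunt's general procedure can be sketched as a short recursive function. This is my own illustration, not the slides' pseudocode; for simplicity it splits on attributes in a fixed order rather than choosing a best split:

```python
# Sketch of Hunt's recursive procedure. Records are (attributes_dict, label)
# pairs; categorical attributes are split exhaustively, one per level.
def hunt(records, attributes):
    labels = [y for _, y in records]
    # Case 1: all records at this node share one class -> leaf
    if len(set(labels)) == 1:
        return labels[0]
    # Case 2: no attributes left -> majority-class leaf
    if not attributes:
        return max(set(labels), key=labels.count)
    # Otherwise split on the next attribute and recurse on each subset
    attr = attributes[0]
    tree = {"attr": attr, "children": {}}
    for v in {x[attr] for x, _ in records}:
        subset = [(x, y) for x, y in records if x[attr] == v]
        tree["children"][v] = hunt(subset, attributes[1:])
    return tree

data = [({"Refund": "Yes"}, "No"),
        ({"Refund": "No"}, "Yes"),
        ({"Refund": "No"}, "Yes")]
tree = hunt(data, ["Refund"])
print(tree["children"])  # both branches become pure leaves
```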
Tree Induction
• Greedy strategy.
– Split the records based on an attribute test that
optimizes certain criterion.
• Issues
– Determine how to split the records
• How to specify the attribute test condition?
• How to determine the best split?
– Determine when to stop splitting
How to Specify Test Condition?
• Depends on attribute types
– Nominal
– Ordinal
– Continuous
[Figure: test conditions for a continuous attribute: a binary split ("Taxable
Income > 80K?", Yes/No) versus a multi-way split ("Taxable Income?" with
ranges such as < 10K, ..., > 80K).]
How to Determine the Best Split

Before splitting: 10 records of class 0 and 10 records of class 1.

[Figure: candidate splits; a non-homogeneous class distribution means a high
degree of impurity, a homogeneous one a low degree of impurity.]
Measures of Node Impurity
• Gini Index
• Entropy
• Misclassification error
How to Find the Best Split
C0 N00
Before Splitting: M0
C1 N01
A? B?
Yes No Yes No
M1 M2 M3 M4
M12 M34
Gain = M0 – M12 vs M0 – M34
Measure of Impurity: GINI

• Gini Index for a given node t:

  GINI(t) = 1 – Σ_j [p(j | t)]²

  where p(j | t) is the relative frequency of class j at node t.

  C1: 0          C1: 1          C1: 2          C1: 3
  C2: 6          C2: 5          C2: 4          C2: 3
  Gini = 0.000   Gini = 0.278   Gini = 0.444   Gini = 0.500
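As a quick check, the Gini values above can be recomputed directly from the formula (the function name is mine):

```python
# GINI(t) = 1 - sum_j p(j|t)^2, computed from raw class counts at a node.
def gini(counts):
    n = sum(counts)
    if n == 0:
        return 0.0
    return 1.0 - sum((c / n) ** 2 for c in counts)

for counts in [(0, 6), (1, 5), (2, 4), (3, 3)]:
    print(counts, round(gini(counts), 3))
# (0, 6) 0.0
# (1, 5) 0.278
# (2, 4) 0.444
# (3, 3) 0.5
```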
Examples for Computing GINI

GINI(t) = 1 – Σ_j [p(j | t)]²

• For each distinct value, gather counts for each class in the dataset
• Use the count matrix to make decisions
• Multi-way split, or two-way split (find the best partition of values)
• For a continuous attribute (e.g. Taxable Income > 80K?), computing the Gini
  index from scratch for every candidate split value is computationally
  inefficient: it repeats work
Continuous Attributes: Computing Gini Index...

• Sort the attribute values, then sweep the candidate split positions while
  updating the class counts incrementally. Counts of class No (≤ split,
  > split) at each position:
  (0,7) (1,6) (2,5) (3,4) (3,4) (3,4) (3,4) (4,3) (5,2) (6,1) (7,0)
• Gini at each position:
  0.420  0.400  0.375  0.343  0.417  0.400  0.300  0.343  0.375  0.400  0.420
• Choose the position with the lowest Gini index (0.300)
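The sweep can be sketched as follows. The 10-value income/label sample is my own illustration, chosen so that the best weighted Gini comes out to 0.300 as in the table above (a direct rescan per candidate, for clarity, rather than the incremental update):

```python
# Weighted Gini of a binary split of `values`/`labels` at `threshold`.
def split_gini(values, labels, threshold):
    left  = [y for v, y in zip(values, labels) if v <= threshold]
    right = [y for v, y in zip(values, labels) if v > threshold]
    total = len(labels)
    def g(side):
        counts = [side.count("No"), side.count("Yes")]
        n = sum(counts)
        return 1.0 - sum((c / n) ** 2 for c in counts) if n else 0.0
    return len(left) / total * g(left) + len(right) / total * g(right)

values = [60, 70, 75, 85, 90, 95, 100, 120, 125, 220]   # sorted incomes (K)
labels = ["No", "No", "No", "Yes", "Yes", "Yes", "No", "No", "No", "No"]
# Candidate thresholds: midpoints between adjacent sorted values.
cands = [(a + b) / 2 for a, b in zip(values, values[1:])]
best = min(cands, key=lambda t: split_gini(values, labels, t))
print(best, round(split_gini(values, labels, best), 3))  # 97.5 0.3
```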
Alternative Splitting Criteria based on INFO

• Entropy at a given node t:

  Entropy(t) = – Σ_j p(j | t) log p(j | t)

• Information Gain (the parent node p is split into k partitions; n_i is the
  number of records in partition i, out of n at the parent):

  GAIN_split = Entropy(p) – Σ_{i=1..k} (n_i / n) Entropy(i)

• Gain Ratio:

  GainRATIO_split = GAIN_split / SplitINFO

  SplitINFO = – Σ_{i=1..k} (n_i / n) log (n_i / n)
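The three INFO-based quantities in code (a sketch; the slides leave the logarithm base unspecified, so base 2 is assumed here):

```python
import math

# Entropy of a node from raw class counts.
def entropy(counts):
    n = sum(counts)
    return -sum((c / n) * math.log2(c / n) for c in counts if c)

# Information gain of splitting a parent into partitions (lists of counts).
def gain(parent_counts, partitions):
    n = sum(parent_counts)
    return entropy(parent_counts) - sum(
        sum(p) / n * entropy(p) for p in partitions)

# Gain ratio: gain normalized by the split information.
def gain_ratio(parent_counts, partitions):
    n = sum(parent_counts)
    split_info = -sum(sum(p) / n * math.log2(sum(p) / n) for p in partitions)
    return gain(parent_counts, partitions) / split_info

# A 50/50 parent split perfectly into two pure halves gains 1 bit:
print(gain([5, 5], [[5, 0], [0, 5]]))        # 1.0
print(gain_ratio([5, 5], [[5, 0], [0, 5]]))  # 1.0
```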
Example: computing the Gini index of a binary split

        N1  N2
  C1     3   4
  C2     0   3

  Gini(N1) = 1 – (3/3)² – (0/3)² = 0
  Gini(N2) = 1 – (4/7)² – (3/7)² = 0.489
  Gini(Children) = 3/10 × 0 + 7/10 × 0.489 = 0.342

  Gini improves!
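Redoing the arithmetic confirms the example; the last digit of the weighted value differs slightly because the slide rounds Gini(N2) to 0.489 before weighting:

```python
# Gini of each child node, then the split's weighted average.
g_n1 = 1 - (3/3)**2 - (0/3)**2             # pure node: 0.0
g_n2 = 1 - (4/7)**2 - (3/7)**2             # ~0.4898
children = 3/10 * g_n1 + 7/10 * g_n2
print(round(g_n2, 3), round(children, 3))  # 0.49 0.343
```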
Tree Induction
• Greedy strategy.
– Split the records based on an attribute test that
optimizes certain criterion.
• Issues
– Determine how to split the records
• How to specify the attribute test condition?
• How to determine the best split?
– Determine when to stop splitting
Stopping Criteria for Tree Induction
• Stop expanding a node when all the records
belong to the same class
© Prentice Hall 48
Classification Problem
• Given a database D = {t1, t2, …, tn} and a set of classes C = {C1, …, Cm},
  the Classification Problem is to define a mapping f: D → C where each ti is
  assigned to one class.
• The mapping actually divides D into equivalence classes.
• Prediction is similar, but may be viewed as having an infinite number of
  classes.
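The mapping f: D → C in miniature (a toy example of my own): a classifier is just a total function from records to predefined classes, and applying it partitions D into equivalence classes, one per label:

```python
# A hypothetical classification rule f and the partition of D it induces.
def f(t):
    return "Tall" if t["height"] >= 1.9 else "NotTall"

D = [{"name": "a", "height": 2.0}, {"name": "b", "height": 1.6}]
partition = {}
for t in D:
    partition.setdefault(f(t), []).append(t["name"])
print(partition)  # {'Tall': ['a'], 'NotTall': ['b']}
```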
Classification Examples
• Teachers classify students’ grades as A, B, C, D,
or F.
• Identify mushrooms as poisonous or edible.
• Predict when a river will flood.
• Identify individuals with credit risks.
• Speech recognition
• Pattern recognition
Classification Ex: Grading

• If x >= 90 then grade = A.
• If 80 <= x < 90 then grade = B.

[Figure: a decision tree on the score x; the root tests x < 90 vs. x >= 90
(leaf A), and further splits below 90 lead to leaves such as D and F.]
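The grading rules as code; only the A and B branches are fully visible in these slides, so the lower splits are left out:

```python
# The two grading rules shown above; branches below 80 are omitted because
# they are not fully recoverable from this extraction.
def grade(x):
    if x >= 90:
        return "A"
    if 80 <= x < 90:
        return "B"
    return None  # lower splits (C/D/F) omitted

print(grade(95), grade(85))  # A B
```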
Classification Ex: Letter Recognition
[Figure: sample glyphs to be recognized, labeled Letter A through Letter F.]
Classification Techniques
• Approach:
1. Create a specific model by evaluating training data (or using domain
   experts' knowledge).
2. Apply the model to new data.
• Classes must be predefined
• Most common techniques use DTs, NNs,
or are based on distances or statistical
methods.
Defining Classes
[Figure: two ways of defining classes: distance based and partitioning based.]
Issues in Classification
• Missing Data
– Ignore
– Replace with assumed value
• Measuring Performance
– Classification accuracy on test data
– Confusion matrix
– OC Curve
Height Example Data
Name Gender Height Output1 Output2
Kristina F 1.6m Short Medium
Jim M 2m Tall Medium
Maggie F 1.9m Medium Tall
Martha F 1.88m Medium Tall
Stephanie F 1.7m Short Medium
Bob M 1.85m Medium Medium
Kathy F 1.6m Short Medium
Dave M 1.7m Short Medium
Worth M 2.2m Tall Tall
Steven M 2.1m Tall Tall
Debbie F 1.8m Medium Medium
Todd M 1.95m Medium Medium
Kim F 1.9m Medium Tall
Amy F 1.8m Medium Medium
Wynette F 1.75m Medium Medium
Classification Performance
Confusion Matrix Example
                 Assignment
  Actual       Short  Medium  Tall
  Short            0       4     0
  Medium           0       5     3
  Tall             0       1     2
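Accuracy follows from the confusion matrix as the diagonal over the total (a sketch; the matrix is copied from the slide):

```python
# Rows: actual membership; columns: assignment (Short, Medium, Tall).
M = [[0, 4, 0],
     [0, 5, 3],
     [0, 1, 2]]
correct = sum(M[i][i] for i in range(3))   # diagonal entries
total = sum(sum(row) for row in M)
print(correct, total, round(correct / total, 3))  # 7 15 0.467
```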
Classification Using Decision Trees
• Partitioning based: Divide search space into
rectangular regions.
• Tuple placed into class based on the region
within which it falls.
• DT approaches differ in how the tree is built (DT Induction).
• Internal nodes are associated with an attribute and arcs with values for
  that attribute.
• Algorithms: ID3, C4.5, CART
Decision Tree

Given:
– D = {t1, …, tn} where ti = <ti1, …, tih>
– Database schema contains {A1, A2, …, Ah}
– Classes C = {C1, …, Cm}

A Decision or Classification Tree is a tree associated with D such that
– Each internal node is labeled with an attribute, Ai
– Each arc is labeled with a predicate that can be applied to the attribute
  at its parent
– Each leaf node is labeled with a class, Cj
DT Induction
DT Splits Area
[Figure: the rectangular split regions over the attributes Gender (M/F) and
Height.]
Comparing DTs
[Figure: two decision trees for the same data, one balanced and one deep.]
DT Issues
• Choosing Splitting Attributes
• Ordering of Splitting Attributes
• Splits
• Tree Structure
• Stopping Criteria
• Training Data
• Pruning
Decision Tree Induction is often based on
Information Theory
Information
DT Induction
• When all the marbles in the bowl are mixed
up, little information is given.
• When the marbles in the bowl are all from
one class and those in the other two classes
are on either side, more information is given.
Information/Entropy

• Given probabilities p1, p2, …, ps whose sum is 1, Entropy is defined as:

  H(p1, p2, …, ps) = Σ_i p_i log(1/p_i)
Entropy
ID3

• Creates the tree using information theory concepts and tries to reduce the
  expected number of comparisons.
• ID3 chooses the split attribute with the highest information gain:

  Gain(D, S) = H(D) – Σ_i P(D_i) H(D_i)

  where the split S partitions D into subsets D_i.
ID3 Example (Output1)
• Starting state entropy:
4/15 log(15/4) + 8/15 log(15/8) + 3/15 log(15/3) = 0.4384
• Gain using gender:
– Female: 3/9 log(9/3)+6/9 log(9/6)=0.2764
– Male: 1/6 log(6/1) + 2/6 log(6/2) + 3/6 log(6/3) = 0.4392
– Weighted sum: (9/15)(0.2764) + (6/15)(0.4392) =
0.34152
– Gain: 0.4384 – 0.34152 = 0.09688
• Gain using height:
0.4384 – (2/15)(0.301) = 0.3983
• Choose height as first splitting attribute
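Recomputing these figures shows the slides use base-10 logarithms; the tiny last-digit differences (0.4385 vs. 0.4384, 0.09691 vs. 0.09688) come from the slide truncating intermediate values:

```python
import math

# Entropy with base-10 logs, H = sum p_i * log10(1/p_i), from class counts.
def H(counts):
    n = sum(counts)
    return sum(c / n * math.log10(n / c) for c in counts if c)

start = H([4, 8, 3])                 # Short/Medium/Tall counts under Output1
female, male = H([3, 6]), H([1, 2, 3])
gain_gender = start - (9/15 * female + 6/15 * male)
print(round(start, 4), round(gain_gender, 5))  # 0.4385 0.09691
```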
C4.5

• ID3 favors attributes with a large number of divisions
• Improved version of ID3:
  – Missing Data
  – Continuous Data
  – Pruning
  – Rules
  – GainRatio:

    GainRatio(D, S) = Gain(D, S) / H(|D1|/|D|, …, |Ds|/|D|)
CART

• Creates a binary tree
• Uses entropy
• Formula to choose split point, s, for node t:

  Φ(s/t) = 2 P_L P_R Σ_{j=1..m} | P(Cj | t_L) – P(Cj | t_R) |

  where P_L and P_R are the fractions of records sent to the left and right
  subtrees, and P(Cj | t_L), P(Cj | t_R) are the probabilities of class Cj in
  each subtree.
CART Example
• At the start, there are six choices for split
point (right branch on equality):
– P(Gender)=2(6/15)(9/15)(2/15 + 4/15 + 3/15)=0.224
– P(1.6) = 0
– P(1.7) = 2(2/15)(13/15)(0 + 8/15 + 3/15) = 0.169
– P(1.8) = 2(5/15)(10/15)(4/15 + 6/15 + 3/15) = 0.385
– P(1.9) = 2(9/15)(6/15)(4/15 + 2/15 + 3/15) = 0.256
– P(2.0) = 2(12/15)(3/15)(4/15 + 8/15 + 3/15) = 0.32
• Split at 1.8
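The Φ measure can be written as a function of per-side class counts. A sketch; the example count vectors below are my own illustration, not the slide's exact tallies, so the printed value does not correspond to a line above:

```python
# Phi(s/t) = 2 * P_L * P_R * sum_j |P(Cj|t_L) - P(Cj|t_R)|,
# computed from class-count vectors for the left and right subtrees.
def phi(left_counts, right_counts):
    nL, nR = sum(left_counts), sum(right_counts)
    n = nL + nR
    return 2 * (nL / n) * (nR / n) * sum(
        abs(l / nL - r / nR) for l, r in zip(left_counts, right_counts))

# e.g. left = (1 short, 2 medium, 3 tall), right = (3, 6, 0):
print(round(phi((1, 2, 3), (3, 6, 0)), 3))  # 0.48
```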