
CS 60050

Machine Learning

Decision Tree Classifier

Slides taken from course materials of Tan, Steinbach, Kumar


Illustrating Classification Task

Training Set (Tid, Attrib1, Attrib2, Attrib3, Class):
 1   Yes  Large   125K  No
 2   No   Medium  100K  No
 3   No   Small    70K  No
 4   Yes  Medium  120K  No
 5   No   Large    95K  Yes
 6   No   Medium   60K  No
 7   Yes  Large   220K  No
 8   No   Small    85K  Yes
 9   No   Medium   75K  No
 10  No   Small    90K  Yes

Test Set (Tid, Attrib1, Attrib2, Attrib3, Class):
 11  No   Small    55K  ?
 12  Yes  Medium   80K  ?
 13  Yes  Large   110K  ?
 14  No   Small    95K  ?
 15  No   Large    67K  ?

[Figure: a learning algorithm is applied to the Training Set to learn a model
(Induction); the learned Model is then applied to the Test Set to predict the
unknown class labels (Deduction).]
Intuition behind a decision tree

● Ask a series of questions about a given record


– Each question is about one of the attributes
– Answer to one question decides what question to ask
next (or if a next question is needed)
– Continue asking questions until we can infer the class
of the given record
Example of a Decision Tree

Training Data (Tid, Refund, Marital Status, Taxable Income, Cheat):
 1   Yes  Single    125K  No
 2   No   Married   100K  No
 3   No   Single     70K  No
 4   Yes  Married   120K  No
 5   No   Divorced   95K  Yes
 6   No   Married    60K  No
 7   Yes  Divorced  220K  No
 8   No   Single     85K  Yes
 9   No   Married    75K  No
 10  No   Single     90K  Yes

Model: Decision Tree (splitting attributes shown at the internal nodes)

 Refund?
 ├─ Yes → NO
 └─ No  → MarSt?
          ├─ Single, Divorced → TaxInc?
          │                     ├─ < 80K → NO
          │                     └─ > 80K → YES
          └─ Married → NO


Structure of a decision tree

● Decision tree: hierarchical structure


– One root node: no incoming edge, zero or more
outgoing edges
– Internal nodes: exactly one incoming edge, two or
more outgoing edges
– Leaf or terminal nodes: exactly one incoming edge, no
outgoing edge
● Each leaf node is assigned a class label
● Each non-leaf node contains a test condition on
one of the attributes
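
To make the structure concrete, here is a minimal sketch (not from the slides) of how such a hierarchical structure could be represented in Python; the names TreeNode, attribute, children and label are illustrative choices, not part of any standard API:

from dataclasses import dataclass, field
from typing import Optional

@dataclass
class TreeNode:
    """Illustrative decision-tree node.

    A leaf node carries a class label; an internal node carries a test on one
    attribute and a mapping from test outcomes to child nodes.
    """
    attribute: Optional[str] = None               # attribute tested here (None for a leaf)
    children: dict = field(default_factory=dict)  # test outcome -> child TreeNode
    label: Optional[str] = None                   # class label (set only for leaf nodes)

    def is_leaf(self) -> bool:
        return not self.children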
Applying a Decision Tree Classifier

[Figure: the same Training Set and Test Set as before. A Tree Induction
algorithm learns a Decision Tree from the Training Set (Induction); the
Decision Tree is then applied to the Test Set to predict the unknown class
labels (Deduction).]
Apply Model to Test Data

Test record: Refund = No, Marital Status = Married, Taxable Income = 80K, Cheat = ?

Once a decision tree has been constructed (learned), it is easy to apply it to
test data. Start from the root of the tree and follow the branch that matches
the test record at each node:

 Refund?
 ├─ Yes → NO
 └─ No  → MarSt?
          ├─ Single, Divorced → TaxInc?
          │                     ├─ < 80K → NO
          │                     └─ > 80K → YES
          └─ Married → NO

 1. Refund = No, so follow the “No” branch to the MarSt node.
 2. Marital Status = Married, so follow the “Married” branch.
 3. The branch ends at a leaf labeled NO: assign Cheat to “No”.
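
As an illustration, the walkthrough above can be written out as a few lines of Python. This is a hand-coded sketch of the Refund / MarSt / TaxInc tree from the slides; the dictionary keys are assumptions made for this example:

def classify(record):
    """Walk the Refund / MarSt / TaxInc tree from the slides for one record.

    `record` is assumed to be a dict with keys 'Refund', 'MarSt', 'TaxInc'
    (key names chosen for this sketch).
    """
    if record["Refund"] == "Yes":
        return "No"                                # Refund = Yes leaf: NO
    if record["MarSt"] == "Married":
        return "No"                                # Married branch leaf: NO
    # Single or Divorced: test Taxable Income
    return "No" if record["TaxInc"] < 80 else "Yes"

# The test record from the slides: Refund = No, Married, income 80K.
print(classify({"Refund": "No", "MarSt": "Married", "TaxInc": 80}))  # -> No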
Learning a Decision Tree Classifier

[Figure: the same Training Set / Test Set picture as before. A Tree Induction
algorithm learns a Decision Tree from the Training Set; the tree is then
applied to the Test Set.]

How to learn a decision tree?
A Decision Tree (seen earlier)

 Refund?
 ├─ Yes → NO
 └─ No  → MarSt?
          ├─ Single, Divorced → TaxInc?
          │                     ├─ < 80K → NO
          │                     └─ > 80K → YES
          └─ Married → NO

(Training data: the ten-record Refund / Marital Status / Taxable Income / Cheat
table shown earlier.)


Another Decision Tree on same dataset

 MarSt?
 ├─ Married → NO
 └─ Single, Divorced → Refund?
                       ├─ Yes → NO
                       └─ No  → TaxInc?
                                ├─ < 80K → NO
                                └─ > 80K → YES

There could be more than one tree that fits the same data!
Challenge in learning decision tree

● Exponentially many decision trees can be constructed from a given set of
  attributes
  – Some of the trees are more ‘accurate’ or better classifiers than the others
  – Finding the optimal tree is computationally infeasible
● Efficient algorithms are available to learn a reasonably accurate (although
  potentially suboptimal) decision tree in reasonable time
  – They employ a greedy strategy
  – Locally optimal choices are made about which attribute to use next to
    partition the data
Decision Tree Induction

● Many Algorithms:
– Hunt’s Algorithm (one of the earliest)
– CART
– ID3, C4.5
– SLIQ, SPRINT
General Structure of Hunt’s Algorithm

● Let Dt be the set of training records that reach a node t
● General Procedure:
  – If Dt contains records that all belong to the same class yt, then t is a
    leaf node labeled as yt
  – If Dt is an empty set, then t is a leaf node labeled by the default
    class yd
  – If Dt contains records that belong to more than one class, use an
    attribute test to split the data into smaller subsets. Recursively apply
    the procedure to each subset.

[The slide shows the ten-record Refund / Marital Status / Taxable Income /
Cheat training set as the data Dt reaching the root node.]
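
A minimal Python sketch of this general procedure is shown below. It assumes records are (attribute-dict, label) pairs and stubs out the attribute-selection step, which the following slides cover (Gini, entropy, etc.); all names here are illustrative:

from collections import Counter

def choose_attribute(records, attributes):
    # Placeholder: a real implementation would pick the attribute whose split
    # minimizes impurity (Gini, entropy, ...); here we just take the first one.
    return attributes[0]

def hunt(records, attributes, default_class):
    """Minimal sketch of Hunt's general procedure (not an optimized implementation).

    `records` is a list of (attribute_dict, class_label) pairs;
    `attributes` is the list of attribute names still available for testing.
    """
    if not records:                                   # empty D_t: leaf with default class
        return {"label": default_class}
    labels = [y for _, y in records]
    if len(set(labels)) == 1:                         # all records in one class y_t
        return {"label": labels[0]}
    if not attributes:                                # no test left: majority class
        return {"label": Counter(labels).most_common(1)[0][0]}

    attr = choose_attribute(records, attributes)
    node = {"attribute": attr, "children": {}}
    majority = Counter(labels).most_common(1)[0][0]
    remaining = [a for a in attributes if a != attr]
    for value in {x[attr] for x, _ in records}:       # one child per observed value
        subset = [(x, y) for x, y in records if x[attr] == value]
        node["children"][value] = hunt(subset, remaining, majority)
    return node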
Hunt’s Algorithm

(Applied to the ten-record Refund / Marital Status / Taxable Income / Cheat
training set shown earlier.)

Step 1: Start with a single leaf node predicting “Don’t Cheat”. The default
class is “Don’t Cheat” since it is the majority class in the dataset.

Step 2: Split on Refund. For now, assume that “Refund” has been decided to be
the best attribute for splitting in some way (to be discussed soon).
 Refund?
 ├─ Yes → Don’t Cheat
 └─ No  → Don’t Cheat

Step 3: The “Refund = No” subset still contains both classes, so split it on
Marital Status.
 Refund?
 ├─ Yes → Don’t Cheat
 └─ No  → Marital Status?
          ├─ Single, Divorced → Cheat
          └─ Married → Don’t Cheat

Step 4: The “Single, Divorced” subset still contains both classes, so split it
on Taxable Income.
 Refund?
 ├─ Yes → Don’t Cheat
 └─ No  → Marital Status?
          ├─ Single, Divorced → Taxable Income?
          │                     ├─ < 80K  → Don’t Cheat
          │                     └─ >= 80K → Cheat
          └─ Married → Don’t Cheat
Tree Induction

● Greedy strategy
  – Split the records based on an attribute test that optimizes a certain
    criterion

● Issues
  – Determine how to split the records
    • How to specify the attribute test condition?
    • How to determine the best split?
  – Determine when to stop splitting


How to Specify Test Condition?

● Depends on attribute types


– Nominal: two or more distinct values (special
case: binary) E.g., marital status: {single,
divorced, married}
– Ordinal: two or more distinct values that have
an ordering. E.g. shirt size: {S, M, L, XL}
– Continuous: continuous range of values
● Depends on number of ways to split
– 2-way split
– Multi-way split
Splitting Based on Nominal Attributes

● Multi-way split: Use as many partitions as distinct values.
   CarType → {Family}, {Sports}, {Luxury}

● Binary split: Divides values into two subsets. Need to find the optimal
  partitioning.
   CarType → {Sports, Luxury} vs {Family}   OR   CarType → {Family, Luxury} vs {Sports}
Splitting Based on Ordinal Attributes

● Multi-way split: Use as many partitions as distinct values.
   Size → {Small}, {Medium}, {Large}

● Binary split: Divides values into two subsets. Need to find the optimal
  partitioning.
   Size → {Small, Medium} vs {Large}   OR   Size → {Small} vs {Medium, Large}

● What about this split?   Size → {Small, Large} vs {Medium}
  (This grouping does not respect the ordering of the values.)
Splitting Based on Continuous Attributes

● Different ways of handling
  – Discretization to form an ordinal categorical attribute
    • Static – discretize once at the beginning
    • Dynamic – ranges can be found by equal-interval bucketing,
      equal-frequency bucketing (percentiles), or clustering
  – Binary Decision: (A < v) or (A ≥ v)
    • consider all possible splits and find the best cut
    • can be more compute intensive
Splitting Based on Continuous Attributes

 (i)  Binary split:    Taxable Income > 80K?  → Yes / No
 (ii) Multi-way split: Taxable Income? → < 10K, [10K, 25K), [25K, 50K), [50K, 80K), > 80K


Tree Induction

● Greedy strategy
  – Split the records based on an attribute test that optimizes a certain
    criterion

● Issues
  – Determine how to split the records
    • How to specify the attribute test condition?
    • How to determine the best split?
  – Determine when to stop splitting


What is meant by “determine best split”

Before Splitting: 10 records of class 0, 10 records of class 1

Three candidate test conditions:

 Own Car?     Yes: C0 = 6, C1 = 4    No: C0 = 4, C1 = 6
 Car Type?    Family: C0 = 1, C1 = 3   Sports: C0 = 8, C1 = 0   Luxury: C0 = 1, C1 = 7
 Student ID?  c1 … c10: C0 = 1, C1 = 0 each   c11 … c20: C0 = 0, C1 = 1 each

Which test condition is the best?


How to determine the Best Split

● Greedy approach:
  – Nodes with homogeneous class distribution are preferred
● Need a measure of node impurity:

   C0: 5, C1: 5  →  Non-homogeneous, high degree of impurity
   C0: 9, C1: 1  →  Homogeneous, low degree of impurity
Measures of Node Impurity

● Gini Index

● Entropy

● Misclassification error
How to Find the Best Split

Before Splitting: class counts C0 = N00, C1 = N01, with impurity M0.

Candidate attribute A splits the records into nodes N1 and N2; candidate
attribute B splits them into nodes N3 and N4:

 A?  Yes → Node N1 (C0 = N10, C1 = N11), impurity M1
     No  → Node N2 (C0 = N20, C1 = N21), impurity M2
     Weighted impurity of children: M12

 B?  Yes → Node N3 (C0 = N30, C1 = N31), impurity M3
     No  → Node N4 (C0 = N40, C1 = N41), impurity M4
     Weighted impurity of children: M34

Gain = M0 – M12 vs M0 – M34: choose the split with the larger gain.
Measures of Node Impurity

● Gini Index

● Entropy

● Misclassification error
Measure of Impurity: GINI Index

● Gini Index for a given node t:

   GINI(t) = 1 - \sum_j [p(j|t)]^2

  p(j | t) is the relative frequency of class j at node t


Examples for computing GINI

GINI(t) = 1 - \sum_j [p(j|t)]^2

 C1 = 0, C2 = 6:  P(C1) = 0/6 = 0,  P(C2) = 6/6 = 1
                  Gini = 1 – P(C1)^2 – P(C2)^2 = 1 – 0 – 1 = 0

 C1 = 1, C2 = 5:  P(C1) = 1/6,  P(C2) = 5/6
                  Gini = 1 – (1/6)^2 – (5/6)^2 = 0.278

 C1 = 2, C2 = 4:  P(C1) = 2/6,  P(C2) = 4/6
                  Gini = 1 – (2/6)^2 – (4/6)^2 = 0.444
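
These computations can be checked with a small helper; gini here takes the per-class record counts of a node (an illustrative function, not from the slides):

def gini(counts):
    """Gini index of a node, given its per-class record counts."""
    n = sum(counts)
    return 1.0 - sum((c / n) ** 2 for c in counts)

print(gini([0, 6]))            # 0.0
print(round(gini([1, 5]), 3))  # 0.278
print(round(gini([2, 4]), 3))  # 0.444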
Measure of Impurity: GINI Index

● Gini Index for a given node t:

   GINI(t) = 1 - \sum_j [p(j|t)]^2

  p(j | t) is the relative frequency of class j at node t

  – Maximum (1 - 1/nc) when records are equally distributed among all classes,
    implying least interesting information [nc: number of classes]
  – Minimum (0.0) when all records belong to one class, implying most
    interesting information

   C1 = 0, C2 = 6: Gini = 0.000     C1 = 1, C2 = 5: Gini = 0.278
   C1 = 2, C2 = 4: Gini = 0.444     C1 = 3, C2 = 3: Gini = 0.500
Splitting Based on GINI

● Used in CART, SLIQ, SPRINT.
● When a node p is split into k partitions (children), the quality of the
  split is computed as

   GINI_split = \sum_{i=1}^{k} (n_i / n) GINI(i)

  where n_i = number of records at child i, and n = number of records at
  node p.
Binary Attributes: Computing GINI Index

● Splits into two partitions
● Effect of weighing partitions:
  – Larger and purer partitions are sought for

 Parent: C1 = 6, C2 = 6, Gini = 0.500

 Split on B?  Yes → Node N1: C1 = 5, C2 = 2
              No  → Node N2: C1 = 1, C2 = 4

 Gini(N1) = 1 – (5/7)^2 – (2/7)^2 = 0.408
 Gini(N2) = 1 – (1/5)^2 – (4/5)^2 = 0.320
 Gini(Children) = 7/12 × 0.408 + 5/12 × 0.320 = 0.371
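
A small sketch reproducing this computation; gini_split takes one per-class count list per child and returns the weighted Gini of the split (names are illustrative):

def gini(counts):
    n = sum(counts)
    return 1.0 - sum((c / n) ** 2 for c in counts)

def gini_split(children):
    """Weighted Gini of a split; `children` is a list of per-class count lists."""
    n = sum(sum(c) for c in children)
    return sum(sum(c) / n * gini(c) for c in children)

# Split B from the slide: N1 = (C1=5, C2=2), N2 = (C1=1, C2=4).
print(round(gini_split([[5, 2], [1, 4]]), 3))  # 0.371  (parent Gini was 0.500)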
Categorical Attributes: Computing Gini Index

● For each distinct value, gather counts for each class in the dataset
● Use the count matrix to make decisions

 Multi-way split (Family / Sports / Luxury):
   C1: 1 / 2 / 1,  C2: 4 / 1 / 1            Gini = 0.393

 Two-way split {Sports, Luxury} vs {Family} (find best partition of values):
   C1: 3 / 1,  C2: 2 / 4                    Gini = 0.400

 Two-way split {Family, Luxury} vs {Sports}:
   C1: 2 / 2,  C2: 1 / 5                    Gini = 0.419
Continuous Attributes: Computing Gini Index

● Use binary decisions based on one value, e.g. Taxable Income > 80K? (Yes/No)
● Several choices for the splitting value
  – Number of possible splitting values = number of distinct values
● Each splitting value v has a count matrix associated with it
  – Class counts in each of the partitions, A < v and A ≥ v
● Simple method to choose the best v
  – For each v, scan the database to gather the count matrix and compute its
    Gini index
  – Computationally inefficient! Repetition of work.

[The slide shows the ten-record Refund / Marital Status / Taxable Income /
Cheat training set next to the binary test “Taxable Income > 80K?”.]
Continuous Attributes: Computing Gini Index...

● For efficient computation: for each attribute,
  – Sort the attribute on values
  – Linearly scan these values, each time updating the count matrix and
    computing the Gini index
  – Choose the split position that has the least Gini index

 Cheat:                 No   No   No   Yes  Yes  Yes  No   No   No   No
 Sorted Taxable Income: 60   70   75   85   90   95   100  120  125  220

 Candidate split positions v (midpoints):
   55   65   72   80   87   92   97   110  122  172  230

 Class counts in the two partitions (≤ v / > v) for each position:
   Yes: 0/3  0/3  0/3  0/3  1/2  2/1  3/0  3/0  3/0  3/0  3/0
   No:  0/7  1/6  2/5  3/4  3/4  3/4  3/4  4/3  5/2  6/1  7/0

 Gini: 0.420 0.400 0.375 0.343 0.417 0.400 0.300 0.343 0.375 0.400 0.420

 The best split position is v = 97, with Gini = 0.300.
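
A rough Python sketch of this sort-and-scan procedure, applied to the Taxable Income column of the earlier training set. It only tries midpoints between adjacent distinct values (the end positions in the table above split off an empty partition), and all names are illustrative:

def best_split(values, labels):
    """Scan candidate cut points for a continuous attribute (sketch).

    Sorts the records by attribute value, tries the midpoint between each pair
    of adjacent distinct values, and returns the cut with the lowest weighted
    Gini index.
    """
    def gini(counts):
        n = sum(counts)
        return 1.0 - sum((c / n) ** 2 for c in counts) if n else 0.0

    pairs = sorted(zip(values, labels))
    classes = sorted(set(labels))
    best = (None, float("inf"))
    for i in range(len(pairs) - 1):
        if pairs[i][0] == pairs[i + 1][0]:
            continue                                  # no cut between equal values
        cut = (pairs[i][0] + pairs[i + 1][0]) / 2
        left = [sum(1 for v, y in pairs[: i + 1] if y == c) for c in classes]
        right = [sum(1 for v, y in pairs[i + 1:] if y == c) for c in classes]
        n = len(pairs)
        w = (i + 1) / n * gini(left) + (n - i - 1) / n * gini(right)
        if w < best[1]:
            best = (cut, w)
    return best

income = [125, 100, 70, 120, 95, 60, 220, 85, 75, 90]
cheat  = ["No", "No", "No", "No", "Yes", "No", "No", "Yes", "No", "Yes"]
print(best_split(income, cheat))  # cut near 97 with weighted Gini 0.300, as in the table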
Measures of Node Impurity

● Gini Index

● Entropy

● Misclassification error
Alternative Splitting Criteria based on INFO

● Entropy at a given node t:

   Entropy(t) = - \sum_j p(j|t) \log_2 p(j|t)

  p(j | t) is the relative frequency of class j at node t

● Measures homogeneity of a node


Examples for computing Entropy

Entropy(t) = - \sum_j p(j|t) \log_2 p(j|t)

 C1 = 0, C2 = 6:  P(C1) = 0/6 = 0,  P(C2) = 6/6 = 1
                  Entropy = – 0 log2 0 – 1 log2 1 = – 0 – 0 = 0

 C1 = 1, C2 = 5:  P(C1) = 1/6,  P(C2) = 5/6
                  Entropy = – (1/6) log2 (1/6) – (5/6) log2 (5/6) = 0.65

 C1 = 2, C2 = 4:  P(C1) = 2/6,  P(C2) = 4/6
                  Entropy = – (2/6) log2 (2/6) – (4/6) log2 (4/6) = 0.92
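
These values can be verified with a short helper (illustrative, not from the slides); the convention 0 log 0 = 0 is applied by skipping empty classes:

from math import log2

def entropy(counts):
    """Entropy of a node, given its per-class record counts."""
    n = sum(counts)
    return sum(-(c / n) * log2(c / n) for c in counts if c > 0)

print(entropy([0, 6]))            # 0.0
print(round(entropy([1, 5]), 2))  # 0.65
print(round(entropy([2, 4]), 2))  # 0.92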
Alternative Splitting Criteria based on INFO

● Entropy at a given node t:

   Entropy(t) = - \sum_j p(j|t) \log_2 p(j|t)

  p(j | t) is the relative frequency of class j at node t

● Measures homogeneity of a node
  • Maximum (log nc) when records are equally distributed among all classes,
    implying least information
  • Minimum (0.0) when all records belong to one class, implying most
    information
Splitting Based on INFO...

● Information Gain:

   GAIN_split = Entropy(p) - \sum_{i=1}^{k} (n_i / n) Entropy(i)

  Parent node p is split into k partitions;
  n_i is the number of records in partition i

  – Measures Reduction in Entropy achieved because of the split. Choose the
    split that achieves most reduction (maximizes GAIN)
  – Used in ID3 and C4.5
  – Disadvantage: Tends to prefer splits that result in large number of
    partitions, each being small but pure.
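
A small sketch of the GAIN_split computation, using per-class counts for the parent and for each child; the (10, 10) → (8, 2), (2, 8) split is a made-up example, not data from the slides:

from math import log2

def entropy(counts):
    n = sum(counts)
    return sum(-(c / n) * log2(c / n) for c in counts if c > 0)

def information_gain(parent_counts, children_counts):
    """GAIN_split = Entropy(parent) - sum_i (n_i / n) * Entropy(child_i)."""
    n = sum(parent_counts)
    weighted = sum(sum(c) / n * entropy(c) for c in children_counts)
    return entropy(parent_counts) - weighted

# Hypothetical split of a (10, 10) parent into (8, 2) and (2, 8) children.
print(round(information_gain([10, 10], [[8, 2], [2, 8]]), 3))  # ~0.278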
Splitting Based on INFO...

● Gain Ratio:

   GainRATIO_split = GAIN_split / SplitINFO

   SplitINFO = - \sum_{i=1}^{k} (n_i / n) \log_2 (n_i / n)

  Parent node p is split into k partitions;
  n_i is the number of records in partition i

  – Adjusts Information Gain by the entropy of the partitioning (SplitINFO).
    Higher entropy partitioning (large number of small partitions) is
    penalized!
  – Used in C4.5
  – Designed to overcome the disadvantage of Information Gain
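
A minimal sketch of SplitINFO and GainRATIO (base-2 logarithm assumed, as in the entropy formula). The numbers below are made up to show how the same gain is penalized more heavily as the number of partitions grows:

from math import log2

def split_info(children_sizes):
    """SplitINFO = -sum_i (n_i / n) * log2(n_i / n)."""
    n = sum(children_sizes)
    return sum(-(ni / n) * log2(ni / n) for ni in children_sizes if ni > 0)

def gain_ratio(gain, children_sizes):
    return gain / split_info(children_sizes)

# The same gain of 0.278 looks less attractive once spread over 4 equal
# partitions than over 2, because SplitINFO grows with the number of parts.
print(round(gain_ratio(0.278, [10, 10]), 3))      # SplitINFO = 1.0 -> 0.278
print(round(gain_ratio(0.278, [5, 5, 5, 5]), 3))  # SplitINFO = 2.0 -> 0.139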
Measures of Node Impurity

● Gini Index

● Entropy

● Misclassification error
Splitting Criteria based on Classification Error

● Classification error at a node t:

   Error(t) = 1 - \max_i P(i|t)

  P(i | t) is the relative frequency of class i at node t

● Measures misclassification error made by a node


Examples for Computing Error

Error(t) = 1 - \max_i P(i|t)

 C1 = 0, C2 = 6:  P(C1) = 0/6 = 0,  P(C2) = 6/6 = 1
                  Error = 1 – max(0, 1) = 1 – 1 = 0

 C1 = 1, C2 = 5:  P(C1) = 1/6,  P(C2) = 5/6
                  Error = 1 – max(1/6, 5/6) = 1 – 5/6 = 1/6

 C1 = 2, C2 = 4:  P(C1) = 2/6,  P(C2) = 4/6
                  Error = 1 – max(2/6, 4/6) = 1 – 4/6 = 1/3
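
The same three node examples, checked with a small helper (illustrative):

def classification_error(counts):
    """Error(t) = 1 - max_i P(i|t), from per-class record counts."""
    n = sum(counts)
    return 1.0 - max(c / n for c in counts)

print(classification_error([0, 6]))            # 0.0
print(round(classification_error([1, 5]), 3))  # 0.167  (= 1/6)
print(round(classification_error([2, 4]), 3))  # 0.333  (= 1/3)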
Splitting Criteria based on Classification Error

● Classification error at a node t:

   Error(t) = 1 - \max_i P(i|t)

● Measures misclassification error made by a node
  • Maximum (1 - 1/nc) when records are equally distributed among all classes,
    implying least interesting information
  • Minimum (0.0) when all records belong to one class, implying most
    interesting information
Comparison among Splitting Criteria

For a 2-class problem:

[Figure: Gini index, Entropy and Misclassification error plotted as functions
of the fraction p of records in one class; all three peak at p = 0.5 and are
zero at p = 0 and p = 1.]


Tree Induction

● Greedy strategy
  – Split the records based on an attribute test that optimizes a certain
    criterion

● Issues
  – Determine how to split the records
    • How to specify the attribute test condition?
    • How to determine the best split?
  – Determine when to stop splitting


Stopping Criteria for Tree Induction

● Stop expanding a node when all the records belong to the same class

● Stop expanding a node when all the records have similar attribute values
  (if different class values, then usually assign the majority class)

● Early termination, usually to prevent overfitting (to be discussed later)
DT classification: points to note

● Finding an optimal DT is NP-complete, but efficient and fast heuristic
  methods are available

● Advantages:
  – Extremely fast at classifying unknown records
  – Easy to interpret, especially for small-sized trees
  – Accuracy is comparable to other classification techniques for many simple
    data sets
DT classification: points to note

● In what we discussed till now, the test condition always involved a single
  attribute
  – Decision boundaries are ‘rectilinear’, i.e., parallel to the ‘coordinate
    axes’ of the feature space
  – Limits the expressiveness of DTs
● Oblique DTs – allow test conditions that involve more than one attribute
  (e.g., x + y < 1)
  – Better expressiveness
  – But finding a good tree is computationally more expensive
Decision Boundary

[Figure: two classes of points in the unit square (x, y ∈ [0, 1]), classified
by a tree that first tests x < 0.43 and then tests y (y < 0.33 on one branch,
y < 0.47 on the other); each leaf region contains points of a single class.]

• The border line between two neighboring regions of different classes is
  known as the decision boundary
• The decision boundary is parallel to the axes because each test condition
  involves a single attribute at a time
Oblique Decision Trees

[Figure: a dataset separated by the single oblique test condition x + y < 1
into a “Class = +” region and a second class region.]

• Test condition may involve multiple attributes
• More expressive representation
• Finding optimal test condition is computationally expensive
Example: C4.5

● Simple depth-first construction.


● Uses Information Gain
● Sorts Continuous Attributes at each node.
● Needs entire data to fit in memory.
● Unsuitable for Large Datasets.
– Needs out-of-core sorting.

● You can download the software from:


http://www.cse.unsw.edu.au/~quinlan/c4.5r8.tar.gz
Practical issues of Decision Tree classifier
Underfitting and Overfitting (Example)

500 circular and 500 triangular data points.

Circular points:   0.5 ≤ sqrt(x1^2 + x2^2) ≤ 1
Triangular points: sqrt(x1^2 + x2^2) < 0.5 or sqrt(x1^2 + x2^2) > 1
Underfitting and Overfitting

[Figure: training and test error curves; the test error starts rising again
once the tree grows too large (overfitting).]

Underfitting: when the DT is too simple, both training and test errors are large
Overfitting: the DT has grown too large, and is now fitting the noise in the dataset
Overfitting

● Overfitting
results in decision trees that are more
complex than necessary

● Training
error no longer provides a good estimate
of how well the tree will perform on previously
unseen records
Overfitting due to Noise

Decision boundary is distorted by noise point


Overfitting due to Insufficient Examples

Lack of data points in the lower half of the diagram makes it difficult
to predict correctly the class labels of that region
- Insufficient number of training records in the region causes the
decision tree to predict the test examples using other training
records that are irrelevant to the classification task
Occam’s Razor

● Given two models of similar generalization errors,


one should prefer the simpler model over the
more complex model

● For complex models, there is a greater chance that the model was fitted
  accidentally by errors in the data

● Therefore, one should include model complexity


when evaluating a model
Minimum Description Length (MDL)

[Figure: a sender knows both the attribute values X and the class labels y of
the records; the receiver knows only X. The sender can either transmit the
labels directly or transmit a decision tree (model) plus the records it
misclassifies.]

● Cost(Model, Data) = Cost(Data | Model) + Cost(Model)
  – Cost is the number of bits needed for encoding.
  – Search for the least costly model.
● Cost(Data | Model) encodes the misclassification errors.
● Cost(Model) uses node encoding (number of children) plus splitting condition
  encoding.
How to Address Overfitting

● Pre-Pruning (Early Stopping Rule)
  – Stop the algorithm before it becomes a fully-grown tree
  – Typical stopping conditions for a node:
    • Stop if all instances belong to the same class
    • Stop if all the attribute values are the same
  – More restrictive conditions:
    • Stop if the number of instances is less than some user-specified
      threshold
    • Stop if expanding the current node does not improve impurity measures
      (e.g., Gini or information gain)
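
As a concrete (though library-specific) illustration, scikit-learn's DecisionTreeClassifier exposes pre-pruning controls that correspond to these stopping rules. A minimal sketch, assuming scikit-learn is installed; the dataset and threshold values are arbitrary choices for the example:

from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

tree = DecisionTreeClassifier(
    max_depth=3,                 # early stopping: do not grow past depth 3
    min_samples_split=10,        # stop if a node has fewer than 10 records
    min_impurity_decrease=0.01,  # stop if the best split barely reduces impurity
    random_state=0,
).fit(X_tr, y_tr)

print(tree.get_depth(), tree.score(X_te, y_te))  # depth of the pruned tree, test accuracy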
How to Address Overfitting…

● Post-pruning
– Grow decision tree to its entirety
– Trim the nodes of the decision tree in a bottom-up
fashion
– If generalization error improves after trimming, replace
sub-tree by a leaf node.
– Class label of leaf node is determined from majority
class of instances in the sub-tree
– Can use MDL for post-pruning
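
For comparison, scikit-learn post-prunes with cost-complexity pruning (the ccp_alpha parameter) rather than MDL, but the workflow matches the slide: grow the tree to its entirety, then trim sub-trees and keep the version that generalizes best on held-out data. A minimal sketch under those assumptions:

from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = load_breast_cancer(return_X_y=True)
X_tr, X_val, y_tr, y_val = train_test_split(X, y, random_state=0)

full = DecisionTreeClassifier(random_state=0).fit(X_tr, y_tr)
path = full.cost_complexity_pruning_path(X_tr, y_tr)   # candidate pruning strengths

# Refit with each alpha and keep the tree that does best on the validation split.
best = max(
    (DecisionTreeClassifier(random_state=0, ccp_alpha=a).fit(X_tr, y_tr)
     for a in path.ccp_alphas),
    key=lambda t: t.score(X_val, y_val),
)
print(full.get_n_leaves(), "->", best.get_n_leaves())   # the pruned tree is smaller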
Other Issues

● Data Fragmentation
● Search Strategy
● Expressiveness
● Tree Replication
Data Fragmentation

● Number of instances gets smaller as you traverse


down the tree

● Number of instances at the leaf nodes could be


too small to make any statistically significant
decision
Search Strategy

● Finding an optimal decision tree is NP-hard

● The algorithm presented so far uses a greedy, top-down, recursive
  partitioning strategy to induce a reasonable solution

● Other strategies?
  – Bottom-up
  – Bi-directional
Expressiveness

● Decision trees provide an expressive representation for learning
  discrete-valued functions
  – But they do not generalize well to certain types of Boolean functions
    • Example: parity function:
      – Class = 1 if there is an even number of Boolean attributes with truth
        value = True
      – Class = 0 if there is an odd number of Boolean attributes with truth
        value = True
    • For accurate modeling, must have a complete tree

● Not expressive enough for modeling continuous variables
  – Particularly when the test condition involves only a single attribute at
    a time
Tree Replication

[Figure: a tree rooted at P with children Q and R; the same subtree (rooted
at S) appears under both branches.]

• Same subtree appears in multiple branches
