
Lecture 4: Decision Trees

What is a decision tree?


Constructing decision trees
Entropy and information gain
Issues when using real data
Note: part of this lecture based on notes from Roni Rosenfeld
(CMU)

Classification problem example

Day Outlook Temperature Humidity Wind PlayTennis

D1 Sunny Hot High Weak No
D2 Sunny Hot High Strong No
D3 Overcast Hot High Weak Yes
D4 Rain Mild High Weak Yes
D5 Rain Cool Normal Weak Yes
D6 Rain Cool Normal Strong No
D7 Overcast Cool Normal Strong Yes
D8 Sunny Mild High Weak No
D9 Sunny Cool Normal Weak Yes
D10 Rain Mild Normal Weak Yes
D11 Sunny Mild Normal Strong Yes
D12 Overcast Mild High Strong Yes
D13 Overcast Hot Normal Weak Yes
D14 Rain Mild High Strong No

Discover a “rule” for the PlayTennis predicate!

Decision trees

Outlook
  Sunny:    Humidity
              High:   No
              Normal: Yes
  Overcast: Yes
  Rain:     Wind
              Strong: No
              Weak:   Yes

A decision tree consists of:
– a set of nodes, where each node tests the value of an attribute and branches on all possible values
– a set of leaves, where each leaf gives a class value

Using decision trees for classification


Suppose we get a new instance:
Outlook =Sunny, Temperature=Hot,Humidity=High,Wind=Strong
How do we classify it?
Outlook
  Sunny:    Humidity
              High:   No
              Normal: Yes
  Overcast: Yes
  Rain:     Wind
              Strong: No
              Weak:   Yes

At every node, test the corresponding attribute
Send the instance down the appropriate branch of the tree
If at a leaf, output the corresponding classification
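As an illustration (my own sketch, not part of the lecture), the tree above can be hand-coded as nested Python dicts and the classification walk written in a few lines; the dictionary layout is an arbitrary choice:

```python
# The PlayTennis tree as nested dicts: an internal node maps an attribute
# name to {value: subtree}; a leaf is simply a class label string.
tree = {
    "Outlook": {
        "Sunny":    {"Humidity": {"High": "No", "Normal": "Yes"}},
        "Overcast": "Yes",
        "Rain":     {"Wind": {"Strong": "No", "Weak": "Yes"}},
    }
}

def classify(node, instance):
    """Walk down the tree: test the attribute at each node, follow the branch
    for the instance's value, and return the class label at the leaf."""
    while isinstance(node, dict):
        attribute, branches = next(iter(node.items()))
        node = branches[instance[attribute]]
    return node

# The new instance from the slide:
x = {"Outlook": "Sunny", "Temperature": "Hot", "Humidity": "High", "Wind": "Strong"}
print(classify(tree, x))  # -> "No"
```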
Real example: the “hepatitis” task

[Figure: a decision tree learned for the hepatitis data. Internal nodes test attributes such as liver_firm = yes, spiders = no, spleen_palpable = no, age < 40.00, bilirubin < 1.40 and albumin < 2.90; each leaf predicts live or die and is annotated with the number of training instances it covers.]

Good things about decision trees


Provide a general representation of classification rules
Easy to understand!
Fast learning algorithms (e.g. C4.5, CART)
Robust to noise (attribute and classification noise, missing
values)
Good accuracy
Decision trees are widely used in large, realistic classification
problems, e.g.:
Star classification
Medical diagnosis
Industrial applications
Often incorporated in data mining software (e.g. SGI Mineset).
Decision trees as logical representations
Each decision tree has an equivalent representation in propositional
logic. For example:
Outlook
  Sunny:    Humidity
              High:   No
              Normal: Yes
  Overcast: Yes
  Rain:     Wind
              Strong: No
              Weak:   Yes

corresponds to:

(Outlook = Sunny ∧ Humidity = Normal)
∨ (Outlook = Overcast)
∨ (Outlook = Rain ∧ Wind = Weak)

What is easy/hard for decision trees to represent?

How would we represent:
– ∧, ∨, XOR
– (A ∧ B) ∨ (C ∧ D)
– M of N

Natural to represent disjunctions; hard to represent functions like parity and XOR (these need exponential-size trees).
Sometimes duplication occurs (the same subtree appears on several paths).
When would one use a decision tree?
Data is represented as attribute-value pairs
Target function is discrete valued
Disjunctive hypothesis may be required
Possibly noisy training data, missing values
Need to construct a classifier fast
Need an understandable classifier
Existing applications include:
Equipment/medical diagnosis
Learning to fly
Scene analysis and image segmentation
The standard algorithm was developed in the '80s; commercially available packages (e.g. C4.5) now exist. Quite successful in practice.

Top-down induction of decision trees

Given a set of labeled training instances:


1. If all the training instances have the same class, create a leaf
with that class label and exit.
2. Pick the best attribute to split the data on
3. Add a node that tests the attribute
4. Split the training set according to the value of the attribute
5. Recurse on each subset of the training data

This is the ID3 algorithm (Quinlan, 1983) and is at the core of C4.5
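A minimal sketch of these five steps in Python (my own illustration, not the lecture's code). It assumes discrete attributes and uses information gain, defined on the following slides, to pick the attribute in step 2; examples are dicts mapping attribute names to values:

```python
import math
from collections import Counter

def entropy(labels):
    """Entropy (in bits) of a list of class labels."""
    counts = Counter(labels)
    n = len(labels)
    return -sum(c / n * math.log2(c / n) for c in counts.values())

def information_gain(examples, attribute, target):
    """Expected reduction in entropy from splitting on `attribute`."""
    remainder = 0.0
    for value in set(e[attribute] for e in examples):
        subset = [e[target] for e in examples if e[attribute] == value]
        remainder += len(subset) / len(examples) * entropy(subset)
    return entropy([e[target] for e in examples]) - remainder

def id3(examples, attributes, target):
    """Top-down induction: returns a nested-dict tree, or a class label for a leaf."""
    labels = [e[target] for e in examples]
    if len(set(labels)) == 1:                     # 1. all instances have the same class
        return labels[0]
    if not attributes:                            # no attributes left: majority-class leaf
        return Counter(labels).most_common(1)[0][0]
    best = max(attributes,                        # 2. pick the best attribute
               key=lambda a: information_gain(examples, a, target))
    tree = {best: {}}                             # 3. add a node that tests it
    for value in set(e[best] for e in examples):
        subset = [e for e in examples if e[best] == value]     # 4. split
        rest = [a for a in attributes if a != best]
        tree[best][value] = id3(subset, rest, target)          # 5. recurse
    return tree
```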

Which attribute is best?

The attribute should provide information about the class label.


Suppose we have 30 positive examples and 10 negative ones, and we are considering two attributes that would give the following splits of the instances:

[30+, 10-]  split on A1:   t → [20+, 10-]   f → [10+, 0-]
[30+, 10-]  split on A2:   t → [15+, 7-]    f → [15+, 3-]

Intuitively, we would like an attribute that separates the training instances as well as possible.
We need a mathematical measure for the purity of a set of instances.

Information = Reduction in uncertainty

Imagine:
1. You are about to observe the outcome of a dice roll
2. You are about to observe the outcome of a coin flip
Which one has more uncertainty?
Now suppose:
1. You observe the outcome of the dice roll
2. You observe the outcome of the coin flip
In both cases, now there is no more uncertainty.
Which one provides more information?

Definition of information

Let E be an event that occurs with probability P(E). If we are told that E has occurred with certainty, then we received

I(E) = log2 (1 / P(E)) = - log2 P(E)

bits of information.

You can also think of information as the amount of “surprise” in the outcome (e.g., if P(E) ≈ 1, then I(E) ≈ 0: no surprise).

Example: the result of a fair coin flip provides log2 2 = 1 bit of information.
Example: the result of a fair dice roll provides log2 6 ≈ 2.58 bits of information.

Information is additive
Suppose you have k independent fair coin tosses. How much information do they give?

I(k fair coin tosses) = log2 2^k = k bits

A cute example:
– Consider a random word drawn from a vocabulary of 100,000 words:
  I(word) = log2 100,000 ≈ 16.6 bits
– Now consider a 1000-word document drawn from the same source:
  I(document) ≈ 1000 × 16.6 ≈ 16,600 bits
– Now consider a gray-scale image with 16 grey levels: each pixel alone provides log2 16 = 4 bits, so an image of a few hundred thousand pixels provides over a million bits.
⇒ A picture is worth (more than) a thousand words!
Entropy

Suppose we have an information source S which emits symbols from an alphabet {s_1, ..., s_k} with probabilities {p_1, ..., p_k}. Each emission is independent of the others.

What is the average amount of information when observing the output of S?

H(S) = Σ_i p_i I(s_i) = Σ_i p_i log2 (1 / p_i) = - Σ_i p_i log2 p_i

Call this the entropy of S.

Note though that this depends only on the probability distribution and not on the actual alphabet (so we can really write H(P)).
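A small sketch in Python (mine, not from the lecture) that computes this quantity from a probability distribution:

```python
import math

def entropy(probs):
    """Entropy in bits of a discrete distribution given as a list of probabilities."""
    return sum(p * math.log2(1 / p) for p in probs if p > 0)

print(entropy([0.5, 0.5]))   # fair coin: 1.0 bit
print(entropy([1/6] * 6))    # fair dice roll: ~2.585 bits
print(entropy([1.0]))        # certain outcome: 0.0 bits
```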

Interpretations of entropy

H(S) can be interpreted as:
– the average amount of information per symbol
– the average amount of surprise when observing a symbol
– the uncertainty the observer has before seeing the symbol
– the average number of bits needed to communicate the symbol
Entropy and coding theory
Suppose I will get data from a 4-value alphabet, say {a, b, c, d}, and I want to send it over a channel. I know the probability of each item.

Suppose all values are equally likely. Then I can encode them in two bits each, so on every transmission I need 2 bits.

Suppose now the distribution is skewed, e.g. P(a) = 1/2, P(b) = 1/4, P(c) = P(d) = 1/8. Then I can encode a = 0, b = 10, c = 110, d = 111. What is the expected length of the message that I will have to send over time?

(1/2)·1 + (1/4)·2 + (1/8)·3 + (1/8)·3 = 1.75 bits/symbol = H(P)

Shannon: there are codes that will communicate the symbols with efficiency arbitrarily close to H(P) bits/symbol. There are no codes that will do it with efficiency less than H(P) bits/symbol.
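A quick check of that arithmetic (a sketch; the skewed distribution and code above are illustrative choices, not necessarily the slide's original numbers):

```python
import math

probs   = [1/2, 1/4, 1/8, 1/8]   # illustrative skewed distribution
lengths = [1, 2, 3, 3]           # codeword lengths for 0, 10, 110, 111

expected_length = sum(p * l for p, l in zip(probs, lengths))
entropy = -sum(p * math.log2(p) for p in probs)
print(expected_length, entropy)  # both are 1.75 bits/symbol
```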

Properties of entropy

For a distribution P = (p_1, ..., p_k):
– Non-negative: H(P) ≥ 0, with equality if and only if some p_i = 1 (no uncertainty)
– H(P) ≤ log2 k, with equality if and only if the distribution is uniform (p_i = 1/k for all i)
– The further P is from uniform, the lower the entropy
Entropy applied to concept learning
Consider:
– S, a sample of training examples
– p+, the proportion of positive examples in S
– p-, the proportion of negative examples in S

Entropy measures the impurity of S:

H(S) = - p+ log2 p+ - p- log2 p-

[Figure: Entropy(S) plotted against p+; it is 0 at p+ = 0, rises to a maximum of 1.0 at p+ = 0.5, and falls back to 0 at p+ = 1.]
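As a worked example (my own, using the PlayTennis table from earlier, which contains 9 positive and 5 negative examples):

```python
import math

# PlayTennis sample: 9 Yes, 5 No out of 14 examples
p_pos, p_neg = 9/14, 5/14
H = -p_pos * math.log2(p_pos) - p_neg * math.log2(p_neg)
print(round(H, 3))   # ~0.94 bits
```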

Conditional entropy

Suppose I am trying to predict output Y and I have input X, e.g.:

X = HasKids   Y = OwnsDumboVideo
Yes           Yes
Yes           Yes
Yes           Yes
Yes           Yes
No            No
No            No
Yes           No
Yes           No

From the table we can estimate, for example, P(Y = Yes) = 0.5 and P(Y = No) = 0.5 (based on the data in the table), so H(Y) = 1 bit.

Specific conditional entropy, H(Y | X = v): what if we look only at the instances for which X has value v? H(Y | X = v) is the entropy of Y among only the instances in which X has value v.
Conditional entropy

Conditional entropy, H(Y | X), is the average of the specific conditional entropies over the values of X:

H(Y | X) = Σ_v P(X = v) · H(Y | X = v)

Alternative interpretation: the expected number of bits needed to transmit Y if both the emitter and the receiver know the value of X.

In our example:
H(Y | X = Yes) = H(4/6, 2/6) ≈ 0.918,  H(Y | X = No) = H(0, 1) = 0
H(Y | X) = (6/8) · 0.918 + (2/8) · 0 ≈ 0.69 bits
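A small sketch (mine) that computes H(Y | X) directly from the (x, y) pairs of the table above:

```python
import math
from collections import Counter

def entropy(labels):
    n = len(labels)
    return -sum(c / n * math.log2(c / n) for c in Counter(labels).values())

def conditional_entropy(pairs):
    """H(Y | X) for a list of (x, y) pairs: weighted average of per-value entropies."""
    n = len(pairs)
    result = 0.0
    for v in set(x for x, _ in pairs):
        ys = [y for x, y in pairs if x == v]
        result += len(ys) / n * entropy(ys)
    return result

# (HasKids, OwnsDumboVideo) pairs from the table
data = [("Yes", "Yes")] * 4 + [("No", "No")] * 2 + [("Yes", "No")] * 2
print(round(conditional_entropy(data), 3))   # ~0.689 bits
```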

Information gain

Suppose I have to transmit Y. How many bits on average would it save me if both I and the receiver knew X?

IG(Y | X) = H(Y) - H(Y | X)

This is called the information gain.

Alternative interpretation: how much reduction in the entropy of Y do I get if I know X?
Information gain to determine best attribute

For a set of instances S and an attribute A:

IG(S, A) = H(S) - Σ_{v ∈ Values(A)} (|S_v| / |S|) · H(S_v)

where S_v is the subset of instances of S in which A has value v.

IG(S, A) = expected reduction in entropy due to sorting on attribute A.

[30+, 10-]  split on A1:   t → [20+, 10-]   f → [10+, 0-]
[30+, 10-]  split on A2:   t → [15+, 7-]    f → [15+, 3-]

Check that in this case, A1 wins.
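Checking this numerically (my own sketch, not part of the slides):

```python
import math

def entropy(pos, neg):
    """Entropy (bits) of a set with `pos` positive and `neg` negative examples."""
    total = pos + neg
    return -sum(c / total * math.log2(c / total) for c in (pos, neg) if c)

def info_gain(parent, children):
    """parent and children are (pos, neg) counts; the children partition the parent."""
    n = sum(parent)
    remainder = sum((p + q) / n * entropy(p, q) for p, q in children)
    return entropy(*parent) - remainder

print(round(info_gain((30, 10), [(20, 10), (10, 0)]), 3))  # A1: ~0.123
print(round(info_gain((30, 10), [(15, 7), (15, 3)]), 3))   # A2: ~0.022
# A1 gives the larger reduction in entropy, so A1 wins.
```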

Going back to our example...

Day Outlook Temperature Humidity Wind PlayTennis

D1 Sunny Hot High Weak No
D2 Sunny Hot High Strong No
D3 Overcast Hot High Weak Yes
D4 Rain Mild High Weak Yes
D5 Rain Cool Normal Weak Yes
D6 Rain Cool Normal Strong No
D7 Overcast Cool Normal Strong Yes
D8 Sunny Mild High Weak No
D9 Sunny Cool Normal Weak Yes
D10 Rain Mild Normal Weak Yes
D11 Sunny Mild Normal Strong Yes
D12 Overcast Mild High Strong Yes
D13 Overcast Hot Normal Weak Yes
D14 Rain Mild High Strong No

Which attribute will have the highest information gain?

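The answer can be checked numerically. The sketch below (my own code, not the lecture's) computes the information gain of each attribute over the 14 examples; Outlook comes out highest:

```python
import math
from collections import Counter

# Rows: (Outlook, Temperature, Humidity, Wind, PlayTennis)
data = [
    ("Sunny", "Hot", "High", "Weak", "No"),        ("Sunny", "Hot", "High", "Strong", "No"),
    ("Overcast", "Hot", "High", "Weak", "Yes"),    ("Rain", "Mild", "High", "Weak", "Yes"),
    ("Rain", "Cool", "Normal", "Weak", "Yes"),     ("Rain", "Cool", "Normal", "Strong", "No"),
    ("Overcast", "Cool", "Normal", "Strong", "Yes"), ("Sunny", "Mild", "High", "Weak", "No"),
    ("Sunny", "Cool", "Normal", "Weak", "Yes"),    ("Rain", "Mild", "Normal", "Weak", "Yes"),
    ("Sunny", "Mild", "Normal", "Strong", "Yes"),  ("Overcast", "Mild", "High", "Strong", "Yes"),
    ("Overcast", "Hot", "Normal", "Weak", "Yes"),  ("Rain", "Mild", "High", "Strong", "No"),
]
attrs = ["Outlook", "Temperature", "Humidity", "Wind"]

def entropy(labels):
    n = len(labels)
    return -sum(c / n * math.log2(c / n) for c in Counter(labels).values())

def gain(column):
    labels = [row[-1] for row in data]
    remainder = 0.0
    for v in set(row[column] for row in data):
        subset = [row[-1] for row in data if row[column] == v]
        remainder += len(subset) / len(data) * entropy(subset)
    return entropy(labels) - remainder

for i, a in enumerate(attrs):
    print(a, round(gain(i), 3))
# prints: Outlook ~0.247, Temperature ~0.029, Humidity ~0.152, Wind ~0.048
# => Outlook has the highest information gain and becomes the root of the tree.
```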
Problem: Attributes with multiple values

If an attribute splits the data perfectly, it will always be preferred


by information gain.
E.g. a unique ID for each data point!
But that has very poor generalization performance!
Possible solutions:
– Using better criteria (based on information)
– Ensuring that all attributes have the same number of values


A better criterion: Gain ratio

For a set of instances S and an attribute A with possible values v_1, ..., v_k:

GainRatio(S, A) = IG(S, A) / SplitInformation(S, A)

where

SplitInformation(S, A) = - Σ_i (|S_i| / |S|) log2 (|S_i| / |S|)

and S_i is the subset of S for which A has value v_i.

For an attribute A that splits the data into many partitions of roughly equal size, SplitInformation(S, A) will be high, so its gain ratio is pushed down.

Problem: the ratio can actually become too high when SplitInformation is very small.
Solution: first use information gain, then use gain ratio only for attributes with information gain above average.
Other such metrics are also used.
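A sketch (my own) of the denominator and why a many-valued attribute is penalized:

```python
import math

def split_information(subset_sizes):
    """SplitInformation computed from the sizes of the partitions an attribute induces."""
    n = sum(subset_sizes)
    return -sum(s / n * math.log2(s / n) for s in subset_sizes if s)

def gain_ratio(information_gain, subset_sizes):
    return information_gain / split_information(subset_sizes)

# A unique-ID attribute splits 14 examples into 14 singletons: a huge denominator
print(round(split_information([1] * 14), 3))    # ~3.807
# A balanced Boolean attribute:
print(split_information([7, 7]))                # 1.0
# The unique ID's "perfect" gain of 0.940 bits gets divided by ~3.8:
print(round(gain_ratio(0.940, [1] * 14), 3))    # ~0.247
```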
Ensuring the same number of values

If an attribute A has k possible values v_1, ..., v_k, replace it by k Boolean attributes A_1, ..., A_k, where:

A_i = 1 if A = v_i, and 0 otherwise

This is called 1-of-k encoding.
It is used more generally to encode learning data (e.g. in neural networks).
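For example (a tiny sketch of my own):

```python
def one_hot(value, possible_values):
    """1-of-k encoding: one Boolean attribute per possible value."""
    return [1 if value == v else 0 for v in possible_values]

print(one_hot("Overcast", ["Sunny", "Overcast", "Rain"]))  # [0, 1, 0]
```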

Decision tree construction as search

State space: all possible trees


Actions: which attribute to test
Goal: tree consistent with the training data
Depth-first search, no backtracking
Heuristic: information gain (or gain ratio)
Can get stuck in a local minimum, but is fairly robust (because
of the heuristic)

Inductive bias of decision tree construction

The hypothesis space is complete! We can represent any


Boolean function of the attributes
So there is no representational bias
Outputs a single hypothesis: the “shortest” tree, as guided by the information gain heuristic
Because there is no backtracking, it is subject to local minima
But because the search choices are statistically based, it is
robust to noise in the data
Algorithmic bias: prefer shorter (smaller) trees; prefer trees that
place attributes with high information gain close to the root


Using decision trees for real data

Lots of issues to deal with!


How to test real-valued attributes
How to estimate classifier error
How to deal with noise in the data
How to deal with missing attributes
How to incorporate attribute costs

Example: CRX data, UCI Repository
| This file concerns credit card applications. All attribute names
| and values have been changed to meaningless symbols to protect
| confidentiality of the data.
+, -. | classes

A1: b,a.
A2: continuous.
A3: continuous.
A4: u, y, l, t.
A5: g, p, gg.
A6: c, d, cc, i, j, k, m, r, q, w, x, e, aa, ff.
A7: v, h, bb, j, n, z, dd, ff, o.
A8: continuous.
A9: t,f.
A10: t,f.
A11: continuous.
A12: t,f.
A13: g, p, s.
A14: continuous.
A15: continuous.


Attributes with continuous values

Example:
Temperature: 40 48 60 72 80 90

PlayTennis: No No Yes Yes Yes No

A decision tree needs to perform tests on these attributes as well

What kind of test do we want?


Value of the attribute less than a cut point!

What cut points should we consider?


We need to consider only cut points where the class label changes!
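A sketch (my own) of how those candidate cut points could be enumerated for the Temperature example, using the midpoint between consecutive sorted values whenever the class label changes (the midpoint is one common choice):

```python
temps  = [40, 48, 60, 72, 80, 90]
labels = ["No", "No", "Yes", "Yes", "Yes", "No"]

# Sort by attribute value, then propose a cut point wherever the label changes.
pairs = sorted(zip(temps, labels))
cuts = [(a + b) / 2 for (a, la), (b, lb) in zip(pairs, pairs[1:]) if la != lb]
print(cuts)   # [54.0, 85.0]  ->  candidate tests: Temperature < 54, Temperature < 85
```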

