ML Lecture04x2
Decision trees
[Figure: two example decision trees. Left: the PlayTennis tree rooted at Outlook, with Yes/No leaves. Right: a hepatitis-prognosis tree testing liver_firm = yes, spiders = no, spleen_palpable = no, age < 40.00, bilirubin < 1.40 and albumin < 2.90, with leaves giving live/die counts such as live (65) / die (3).]
The PlayTennis tree corresponds to:
(Outlook=Sunny ∧ Humidity=Normal) ∨ (Outlook=Overcast) ∨ (Outlook=Rain ∧ Wind=Weak)
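As a small illustration (a sketch of my own, not code from the lecture), the same tree can be written as nested if/else tests, with each path to a Yes leaf corresponding to one conjunct of the disjunction:

    # Hypothetical transcription of the PlayTennis tree into nested tests.
    def play_tennis(outlook: str, humidity: str, wind: str) -> str:
        if outlook == "Sunny":
            return "Yes" if humidity == "Normal" else "No"   # Outlook=Sunny ∧ Humidity=Normal
        elif outlook == "Overcast":
            return "Yes"                                      # Outlook=Overcast
        else:  # outlook == "Rain"
            return "Yes" if wind == "Weak" else "No"          # Outlook=Rain ∧ Wind=Weak

    print(play_tennis("Sunny", "Normal", "Strong"))  # Yes
    print(play_tennis("Rain", "High", "Strong"))     # No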
When would one use a decision tree?
Data is represented as attribute-value pairs
Target function is discrete valued
Disjunctive hypothesis may be required
Possibly noisy training data, missing values
Need to construct a classifier fast
Need an understandable classifier
Existing applications include:
Equipment/medical diagnosis
Learning to fly
Scene analysis and image segmentation
Standard algorithms were developed in the ’80s; commercially available packages (e.g., C4.5) now exist. Quite successful in practice
The standard recursive tree-growing procedure is the ID3 algorithm (Quinlan, 1983), which is at the core of C4.5
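A minimal sketch of that recursive procedure, assuming the usual formulation (the dict-based data representation, the (attribute, branches) tree encoding and the choose_attribute hook are my own; how to choose the split attribute is exactly the question taken up next):

    from collections import Counter

    def id3(examples, attributes, target, choose_attribute):
        """Grow a decision tree recursively (standard ID3 outline).

        examples: list of dicts mapping attribute name -> value (target included)
        attributes: attribute names still available for splitting
        target: name of the class attribute
        choose_attribute(examples, attributes, target): picks the split attribute
        """
        labels = [ex[target] for ex in examples]
        if len(set(labels)) == 1:            # all examples agree: make a leaf
            return labels[0]
        if not attributes:                   # no tests left: majority-label leaf
            return Counter(labels).most_common(1)[0][0]
        best = choose_attribute(examples, attributes, target)
        branches = {}
        for value in set(ex[best] for ex in examples):
            subset = [ex for ex in examples if ex[best] == value]
            branches[value] = id3(subset, [a for a in attributes if a != best],
                                  target, choose_attribute)
        return (best, branches)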
Which attribute is best?
[Figure: two candidate attribute splits (t/f branches) of the same set of training examples.]
Imagine:
1. You are about to observe the outcome of a dice roll
2. You are about to observe the outcome of a coin flip
Which one has more uncertainty?
Now suppose:
1. You observe the outcome of the dice roll
2. You observe the outcome of the coin flip
In both cases, now there is no more uncertainty.
Which one provides more information?
Definition of information
The information provided by observing an outcome of probability p is defined as I = \log_2(1/p) = -\log_2 p bits (the less likely the outcome, the more information it carries).
Example: the result of a fair coin flip provides \log_2 2 = 1 bit of
information
Example: the result of a fair dice roll provides \log_2 6 \approx 2.58 bits of
information.
Information is additive
Suppose you have k independent fair coin tosses. How much
information do they give?
I(k fair coin tosses) = \log_2 2^k = k bits
A cute example:
A word drawn uniformly from a vocabulary of N words provides \log_2 N bits.
Now consider a 1000 word document drawn from the same
source: I(document) = 1000 \log_2 N bits.
Now consider a gray-scale image with 16 grey levels: each pixel provides
\log_2 16 = 4 bits, so I(picture) = 4 \times (number of pixels) bits, which
exceeds I(document) for any reasonably sized image.
⇒ A picture is worth (more than) a thousand words!
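To make the comparison concrete, here is the arithmetic with assumed sizes (a 100,000-word vocabulary and a 640 x 480 image; both numbers are illustrative assumptions, not the lecture's):

    from math import log2

    vocab_size = 100_000                      # assumed vocabulary size (illustrative)
    document_bits = 1000 * log2(vocab_size)   # a 1000-word document

    pixels = 640 * 480                        # assumed image size (illustrative)
    picture_bits = pixels * log2(16)          # 16 grey levels -> 4 bits per pixel

    print(f"document: {document_bits:,.0f} bits")  # ~16,610 bits
    print(f"picture:  {picture_bits:,.0f} bits")   # 1,228,800 bits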
Entropy
Suppose a random variable X takes values x_1, \ldots, x_n with probabilities p_1, \ldots, p_n. The entropy of X is the expected information obtained by observing its value:
H(X) = \sum_{i=1}^{n} p_i \log_2(1/p_i) = -\sum_{i=1}^{n} p_i \log_2 p_i
Note though that this depends only on the probability distribution
and not on the actual alphabet (so we can really write H(p_1, \ldots, p_n))
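The definition translates directly into a small helper (my own code, not from the lecture):

    from math import log2

    def entropy(probs):
        """H(p_1, ..., p_n) = sum_i p_i * log2(1/p_i), with 0 log 0 taken as 0."""
        return sum(p * log2(1 / p) for p in probs if p > 0)

    print(entropy([0.5, 0.5]))   # 1.0    (fair coin)
    print(entropy([1/6] * 6))    # ~2.585 (fair die)
    print(entropy([1.0, 0.0]))   # 0.0    (no uncertainty)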
Interpretations of entropy
H(X) is the average amount of information obtained by observing X; equivalently, it is the average uncertainty about X before it is observed, and (as the next slide shows) the average number of bits needed to communicate the value of X.
Entropy and coding theory
Suppose I will get data from a 4-value alphabet and I want to
send it over a channel. I know that the probability of item i
is p_i.
Suppose all values are equally likely. Then I can encode them in 2 bits per symbol.
Shannon: there are codes that will communicate the symbols with
efficiency arbitrarily close to H(p_1, \ldots, p_4) bits/symbol. There are no codes
that will do it with efficiency less than H(p_1, \ldots, p_4) bits/symbol.
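For a concrete case (an illustrative distribution of my own, not from the slide): with symbol probabilities 1/2, 1/4, 1/8, 1/8 the entropy is 1.75 bits/symbol, and the prefix code 0, 10, 110, 111 achieves exactly that average length, beating the 2 bits/symbol of the fixed-length encoding:

    from math import log2

    probs = [1/2, 1/4, 1/8, 1/8]     # assumed symbol distribution (illustrative)
    code_lengths = [1, 2, 3, 3]      # prefix code 0, 10, 110, 111

    H = -sum(p * log2(p) for p in probs)
    avg_len = sum(p * l for p, l in zip(probs, code_lengths))

    print(H)        # 1.75 bits/symbol (Shannon's lower bound)
    print(avg_len)  # 1.75 bits/symbol achieved by this code (vs. 2 fixed-length)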
Properties of entropy
Non-negative: H(p_1, \ldots, p_n) \ge 0, with equality if and only if some p_i = 1 (no uncertainty at all)
Bounded above: H(p_1, \ldots, p_n) \le \log_2 n, with equality if and only if all the p_i are equal to 1/n (maximum uncertainty)
Entropy applied to concept learning
Consider:
S - a sample of training examples
p_+ is the proportion of positive examples in S
p_- is the proportion of negative examples in S
Entropy measures the impurity of S:
Entropy(S) = -p_+ \log_2 p_+ - p_- \log_2 p_-
[Plot: Entropy(S) as a function of p_+; it is 0 when p_+ is 0 or 1 and reaches its maximum of 1.0 at p_+ = 0.5.]
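For instance, for a made-up sample (not one from the lecture) containing 9 positive and 5 negative examples: p_+ = 9/14, p_- = 5/14, so Entropy(S) = -(9/14)\log_2(9/14) - (5/14)\log_2(5/14) \approx 0.940 bits, close to the maximum of 1 bit because the two classes are nearly balanced.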
Conditional entropy
Suppose I am trying to predict an output Y and I have an input X, e.g. X = HasKids, Y = OwnsDumboVideo:
HasKids   OwnsDumboVideo
Yes       Yes
Yes       Yes
Yes       Yes
Yes       Yes
No        No
No        No
Yes       No
From the table, we can estimate probabilities such as P(Y = Yes), P(Y = Yes | X = Yes) and P(Y = Yes | X = No) (based on the data in the table).
What if we look only at the instances for which X has a particular value v?
The specific conditional entropy H(Y | X = v) is the entropy of Y among only the instances in which X has value v.
Conditional entropy
Conditional entropy, H(Y | X), is the average conditional entropy
of Y given specific values for X:
H(Y | X) = \sum_{v} P(X = v) H(Y | X = v)
Information gain
Suppose I have to transmit Y. How many bits on average would
it save me if both me and the receiver knew X?
IG(Y | X) = H(Y) - H(Y | X)
This is called information gain
Alternative interpretation: how much reduction in entropy do I get if I
know X?
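Putting the last few definitions together in code (a sketch of my own; the tiny dataset is a stand-in loosely in the spirit of the HasKids / OwnsDumboVideo table, not the slide's exact data):

    from collections import Counter
    from math import log2

    def entropy(labels):
        """H(Y) for a list of observed labels, in bits."""
        n = len(labels)
        return -sum((c / n) * log2(c / n) for c in Counter(labels).values())

    def conditional_entropy(xs, ys):
        """H(Y | X): average entropy of Y within each value of X."""
        n = len(xs)
        h = 0.0
        for v in set(xs):
            y_given_v = [y for x, y in zip(xs, ys) if x == v]
            h += len(y_given_v) / n * entropy(y_given_v)
        return h

    def information_gain(xs, ys):
        """IG(Y | X) = H(Y) - H(Y | X)."""
        return entropy(ys) - conditional_entropy(xs, ys)

    # Stand-in data: X = HasKids, Y = OwnsDumboVideo (illustrative only).
    has_kids   = ["Yes", "Yes", "Yes", "Yes", "No", "No", "Yes", "No"]
    owns_video = ["Yes", "Yes", "Yes", "No",  "No", "No", "Yes", "No"]

    print(information_gain(has_kids, owns_video))   # about 0.55 bits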
Information gain to determine best attribute
[Figure: candidate attribute splits (t/f branches) compared by their information gain.]
Problem: Attributes with multiple values
Information gain tends to favour attributes with many values, because splitting on them produces many small, nearly pure partitions. A common fix is the gain ratio, which divides the gain by the split information of the partition:
GainRatio(S, A) = Gain(S, A) / SplitInformation(S, A), where
SplitInformation(S, A) = -\sum_{i} (|S_i| / |S|) \log_2 (|S_i| / |S|)
So for an attribute that splits the data into many partitions mostly
uniformly, SplitInformation will be high
Problem: It can actually become too high!
Solution: First use information gain, then use gain ratio only for
attributes with information gain above average
Other such metrics are also used.
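A sketch under the usual C4.5-style definitions (helper names are mine): the split information is just the entropy of the partition sizes, and the gain ratio divides the previously computed gain by it.

    from collections import Counter
    from math import log2

    def split_information(xs):
        """Entropy of the partition induced by the attribute's values."""
        n = len(xs)
        return -sum((c / n) * log2(c / n) for c in Counter(xs).values())

    def gain_ratio(gain, xs):
        """GainRatio(S, A) = Gain(S, A) / SplitInformation(S, A)."""
        si = split_information(xs)
        return gain / si if si > 0 else 0.0

    # Many distinct values -> large split information -> smaller gain ratio:
    print(split_information(["a"] * 4 + ["b"] * 4))   # 1.0 (two equal parts)
    print(split_information(list("abcdefgh")))        # 3.0 (eight singletons)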
Ensuring the same number of values
If an attribute A has n possible values v_1, \ldots, v_n, replace it
by n Boolean attributes A_1, \ldots, A_n, where:
A_i = 1 if A = v_i
A_i = 0 otherwise
This is called 1-of-n encoding
Used more generally to encode learning data (e.g. in neural
networks)
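A one-line sketch of the encoding (illustration of my own):

    def one_of_n(value, possible_values):
        """Replace one n-valued attribute by n Boolean indicator attributes."""
        return [1 if value == v else 0 for v in possible_values]

    print(one_of_n("Overcast", ["Sunny", "Overcast", "Rain"]))  # [0, 1, 0]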
Inductive bias of decision tree construction
Example: CRX data, UCI Repository
| This file concerns credit card applications. All attribute names
| and values have been changed to meaningless symbols to protect
| confidentiality of the data.
+, -. | classes
A1: b,a.
A2: continuous.
A3: continuous.
A4: u, y, l, t.
A5: g, p, gg.
A6: c, d, cc, i, j, k, m, r, q, w, x, e, aa, ff.
A7: v, h, bb, j, n, z, dd, ff, o.
A8: continuous.
A9: t,f.
A10: t,f.
A11: continuous.
A12: t,f.
A13: g, p, s.
A14: continuous.
A15: continuous.
Example (continuous-valued attribute):
Temperature: 40 48 60 72 80 90
To use such an attribute in a tree, candidate thresholds are placed between adjacent values and each resulting Boolean test (e.g. Temperature < t) is evaluated like any other attribute.
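A sketch of choosing such a threshold by information gain, scoring the midpoints between consecutive sorted values (the class labels below are made up for illustration):

    from collections import Counter
    from math import log2

    def entropy(labels):
        n = len(labels)
        return -sum((c / n) * log2(c / n) for c in Counter(labels).values())

    def best_threshold(values, labels):
        """Return the midpoint threshold with the highest information gain."""
        pairs = sorted(zip(values, labels))
        base = entropy(labels)
        best = None
        for (v1, _), (v2, _) in zip(pairs, pairs[1:]):
            t = (v1 + v2) / 2
            left = [y for x, y in pairs if x < t]
            right = [y for x, y in pairs if x >= t]
            remainder = (len(left) * entropy(left)
                         + len(right) * entropy(right)) / len(pairs)
            if best is None or base - remainder > best[1]:
                best = (t, base - remainder)
        return best

    temperature = [40, 48, 60, 72, 80, 90]
    labels = ["No", "No", "Yes", "Yes", "Yes", "No"]   # made-up labels
    print(best_threshold(temperature, labels))          # (54.0, ~0.459)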