
Lecture 4: Decision Trees

What is a decision tree?


Constructing decision trees
Entropy and information gain
Issues when using real data
Note: part of this lecture based on notes from Roni Rosenfeld
(CMU)

Classification problem example

Day Outlook Temperature Humidity Wind PlayTennis

D1 Sunny Hot High Weak No
D2 Sunny Hot High Strong No
D3 Overcast Hot High Weak Yes
D4 Rain Mild High Weak Yes
D5 Rain Cool Normal Weak Yes
D6 Rain Cool Normal Strong No
D7 Overcast Cool Normal Strong Yes
D8 Sunny Mild High Weak No
D9 Sunny Cool Normal Weak Yes
D10 Rain Mild Normal Weak Yes
D11 Sunny Mild Normal Strong Yes
D12 Overcast Mild High Strong Yes
D13 Overcast Hot Normal Weak Yes
D14 Rain Mild High Strong No

Discover a “rule” for the PlayTennis predicate!

Decision trees

Outlook
  Sunny:    Humidity
              High:   No
              Normal: Yes
  Overcast: Yes
  Rain:     Wind
              Strong: No
              Weak:   Yes

A decision tree consists of:
– a set of nodes, where each node tests the value of an attribute and branches on all possible values
– a set of leaves, where each leaf gives a class value

Using decision trees for classification


Suppose we get a new instance:
Outlook =Sunny, Temperature=Hot,Humidity=High,Wind=Strong
How do we classify it?
Outlook
  Sunny:    Humidity
              High:   No
              Normal: Yes
  Overcast: Yes
  Rain:     Wind
              Strong: No
              Weak:   Yes

At every node, test the corresponding attribute
Send the instance down the appropriate branch of the tree
If at a leaf, output the corresponding classification
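As an illustration (my own sketch, not part of the lecture), the tree above can be hand-coded as nested Python dicts and the classification walk written in a few lines; the dictionary layout is an arbitrary choice:

```python
# The PlayTennis tree as nested dicts: an internal node maps an attribute
# name to {value: subtree}; a leaf is simply a class label string.
tree = {
    "Outlook": {
        "Sunny":    {"Humidity": {"High": "No", "Normal": "Yes"}},
        "Overcast": "Yes",
        "Rain":     {"Wind": {"Strong": "No", "Weak": "Yes"}},
    }
}

def classify(node, instance):
    """Walk down the tree: test the attribute at each node, follow the branch
    for the instance's value, and return the class label at the leaf."""
    while isinstance(node, dict):
        attribute, branches = next(iter(node.items()))
        node = branches[instance[attribute]]
    return node

# The new instance from the slide:
x = {"Outlook": "Sunny", "Temperature": "Hot", "Humidity": "High", "Wind": "Strong"}
print(classify(tree, x))  # -> "No"
```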
Real example: the “hepatitis” task

[Figure: a decision tree learned for the hepatitis data. Internal nodes test attributes such as liver_firm = yes, spiders = no, spleen_palpable = no, age < 40.00, bilirubin < 1.40 and albumin < 2.90; each leaf predicts live or die and is annotated with the number of training instances it covers.]

Good things about decision trees


Provide a general representation of classification rules
Easy to understand!
Fast learning algorithms (e.g. C4.5, CART)
Robust to noise (attribute and classification noise, missing
values)
Good accuracy
Decision trees are widely used in large, realistic classification
problems, e.g.:
Star classification
Medical diagnosis
Industrial applications
Often incorporated in data mining software (e.g. SGI Mineset).
Decision trees as logical representations
Each decision tree has an equivalent representation in propositional
logic. For example:
Outlook
  Sunny:    Humidity
              High:   No
              Normal: Yes
  Overcast: Yes
  Rain:     Wind
              Strong: No
              Weak:   Yes

corresponds to:

(Outlook = Sunny ∧ Humidity = Normal)
∨ (Outlook = Overcast)
∨ (Outlook = Rain ∧ Wind = Weak)

What is easy/hard for decision trees to represent?

How would we represent:
– ∧, ∨, XOR
– (A ∧ B) ∨ (C ∧ D)
– M of N

Natural to represent disjunctions; hard to represent functions like parity and XOR (these need exponential-size trees).
Sometimes duplication occurs (the same subtree appears on several paths).
When would one use a decision tree?
Data is represented as attribute-value pairs
Target function is discrete valued
Disjunctive hypothesis may be required
Possibly noisy training data, missing values
Need to construct a classifier fast
Need an understandable classifier
Existing applications include:
Equipment/medical diagnosis
Learning to fly
Scene analysis and image segmentation
The standard algorithm was developed in the '80s; commercially available packages (e.g. C4.5) now exist. Quite successful in practice.

Top-down induction of decision trees

Given a set of labeled training instances:


1. If all the training instances have the same class, create a leaf
with that class label and exit.
2. Pick the best attribute to split the data on
3. Add a node that tests the attribute
4. Split the training set according to the value of the attribute
5. Recurse on each subset of the training data

This is the ID3 algorithm (Quinlan, 1983) and is at the core of C4.5
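A minimal sketch of these five steps in Python (my own illustration, not the lecture's code). It assumes discrete attributes and uses information gain, defined on the following slides, to pick the attribute in step 2; examples are dicts mapping attribute names to values:

```python
import math
from collections import Counter

def entropy(labels):
    """Entropy (in bits) of a list of class labels."""
    counts = Counter(labels)
    n = len(labels)
    return -sum(c / n * math.log2(c / n) for c in counts.values())

def information_gain(examples, attribute, target):
    """Expected reduction in entropy from splitting on `attribute`."""
    remainder = 0.0
    for value in set(e[attribute] for e in examples):
        subset = [e[target] for e in examples if e[attribute] == value]
        remainder += len(subset) / len(examples) * entropy(subset)
    return entropy([e[target] for e in examples]) - remainder

def id3(examples, attributes, target):
    """Top-down induction: returns a nested-dict tree, or a class label for a leaf."""
    labels = [e[target] for e in examples]
    if len(set(labels)) == 1:                     # 1. all instances have the same class
        return labels[0]
    if not attributes:                            # no attributes left: majority-class leaf
        return Counter(labels).most_common(1)[0][0]
    best = max(attributes,                        # 2. pick the best attribute
               key=lambda a: information_gain(examples, a, target))
    tree = {best: {}}                             # 3. add a node that tests it
    for value in set(e[best] for e in examples):
        subset = [e for e in examples if e[best] == value]     # 4. split
        rest = [a for a in attributes if a != best]
        tree[best][value] = id3(subset, rest, target)          # 5. recurse
    return tree
```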

Which attribute is best?

The attribute should provide information about the class label.


Suppose we have 30 positive examples and 10 negative ones, and we are considering two attributes that would give the following splits of the instances:

[30+, 10-]  split on A1:   t → [20+, 10-]   f → [10+, 0-]
[30+, 10-]  split on A2:   t → [15+, 7-]    f → [15+, 3-]

Intuitively, we would like an attribute that separates the training instances as well as possible.
We need a mathematical measure for the purity of a set of instances.

Information = Reduction in uncertainty

Imagine:
1. You are about to observe the outcome of a dice roll
2. You are about to observe the outcome of a coin flip
Which one has more uncertainty?
Now suppose:
1. You observe the outcome of the dice roll
2. You observe the outcome of the coin flip
In both cases, now there is no more uncertainty.
Which one provides more information?

Definition of information

Let E be an event that occurs with probability P(E). If we are told that E has occurred with certainty, then we received

I(E) = log2 (1 / P(E)) = - log2 P(E)

bits of information.

You can also think of information as the amount of “surprise” in the outcome (e.g., if P(E) ≈ 1, then I(E) ≈ 0: no surprise).

Example: the result of a fair coin flip provides log2 2 = 1 bit of information.
Example: the result of a fair dice roll provides log2 6 ≈ 2.58 bits of information.

Information is additive
Suppose you have k independent fair coin tosses. How much information do they give?

I(k fair coin tosses) = log2 2^k = k bits

A cute example:
– Consider a random word drawn from a vocabulary of 100,000 words:
  I(word) = log2 100,000 ≈ 16.6 bits
– Now consider a 1000-word document drawn from the same source:
  I(document) ≈ 1000 × 16.6 ≈ 16,600 bits
– Now consider a gray-scale image with 16 grey levels: each pixel alone provides log2 16 = 4 bits, so an image of a few hundred thousand pixels provides over a million bits.
⇒ A picture is worth (more than) a thousand words!
Entropy

Suppose we have an information source S which emits symbols from an alphabet {s_1, ..., s_k} with probabilities {p_1, ..., p_k}. Each emission is independent of the others.

What is the average amount of information when observing the output of S?

H(S) = Σ_i p_i I(s_i) = Σ_i p_i log2 (1 / p_i) = - Σ_i p_i log2 p_i

Call this the entropy of S.

Note though that this depends only on the probability distribution and not on the actual alphabet (so we can really write H(P)).
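A small sketch in Python (mine, not from the lecture) that computes this quantity from a probability distribution:

```python
import math

def entropy(probs):
    """Entropy in bits of a discrete distribution given as a list of probabilities."""
    return sum(p * math.log2(1 / p) for p in probs if p > 0)

print(entropy([0.5, 0.5]))   # fair coin: 1.0 bit
print(entropy([1/6] * 6))    # fair dice roll: ~2.585 bits
print(entropy([1.0]))        # certain outcome: 0.0 bits
```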

Interpretations of entropy

H(S) can be interpreted as:
– the average amount of information per symbol
– the average amount of surprise when observing a symbol
– the uncertainty the observer has before seeing the symbol
– the average number of bits needed to communicate the symbol
Entropy and coding theory
Suppose I will get data from a 4-value alphabet, say {a, b, c, d}, and I want to send it over a channel. I know the probability of each item.

Suppose all values are equally likely. Then I can encode them in two bits each, so on every transmission I need 2 bits.

Suppose now the distribution is skewed, e.g. P(a) = 1/2, P(b) = 1/4, P(c) = P(d) = 1/8. Then I can encode a = 0, b = 10, c = 110, d = 111. What is the expected length of the message that I will have to send over time?

(1/2)·1 + (1/4)·2 + (1/8)·3 + (1/8)·3 = 1.75 bits/symbol = H(P)

Shannon: there are codes that will communicate the symbols with efficiency arbitrarily close to H(P) bits/symbol. There are no codes that will do it with efficiency less than H(P) bits/symbol.
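A quick check of that arithmetic (a sketch; the skewed distribution and code above are illustrative choices, not necessarily the slide's original numbers):

```python
import math

probs   = [1/2, 1/4, 1/8, 1/8]   # illustrative skewed distribution
lengths = [1, 2, 3, 3]           # codeword lengths for 0, 10, 110, 111

expected_length = sum(p * l for p, l in zip(probs, lengths))
entropy = -sum(p * math.log2(p) for p in probs)
print(expected_length, entropy)  # both are 1.75 bits/symbol
```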

Properties of entropy

For a distribution P = (p_1, ..., p_k):
– Non-negative: H(P) ≥ 0, with equality if and only if some p_i = 1 (no uncertainty)
– H(P) ≤ log2 k, with equality if and only if the distribution is uniform (p_i = 1/k for all i)
– The further P is from uniform, the lower the entropy
Entropy applied to concept learning
Consider:
– S, a sample of training examples
– p+, the proportion of positive examples in S
– p-, the proportion of negative examples in S

Entropy measures the impurity of S:

H(S) = - p+ log2 p+ - p- log2 p-

[Figure: Entropy(S) plotted against p+; it is 0 at p+ = 0, rises to a maximum of 1.0 at p+ = 0.5, and falls back to 0 at p+ = 1.]
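As a worked example (my own, using the PlayTennis table from earlier, which contains 9 positive and 5 negative examples):

```python
import math

# PlayTennis sample: 9 Yes, 5 No out of 14 examples
p_pos, p_neg = 9/14, 5/14
H = -p_pos * math.log2(p_pos) - p_neg * math.log2(p_neg)
print(round(H, 3))   # ~0.94 bits
```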

Conditional entropy

Suppose I am trying to predict output Y and I have input X, e.g.:

X = HasKids   Y = OwnsDumboVideo
Yes           Yes
Yes           Yes
Yes           Yes
Yes           Yes
No            No
No            No
Yes           No
Yes           No

From the table we can estimate, for example, P(Y = Yes) = 0.5 and P(Y = No) = 0.5 (based on the data in the table), so H(Y) = 1 bit.

Specific conditional entropy, H(Y | X = v): what if we look only at the instances for which X has value v? H(Y | X = v) is the entropy of Y among only the instances in which X has value v.
Conditional entropy

Conditional entropy, H(Y | X), is the average of the specific conditional entropies over the values of X:

H(Y | X) = Σ_v P(X = v) · H(Y | X = v)

Alternative interpretation: the expected number of bits needed to transmit Y if both the emitter and the receiver know the value of X.

In our example:
H(Y | X = Yes) = H(4/6, 2/6) ≈ 0.918,  H(Y | X = No) = H(0, 1) = 0
H(Y | X) = (6/8) · 0.918 + (2/8) · 0 ≈ 0.69 bits
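A small sketch (mine) that computes H(Y | X) directly from the (x, y) pairs of the table above:

```python
import math
from collections import Counter

def entropy(labels):
    n = len(labels)
    return -sum(c / n * math.log2(c / n) for c in Counter(labels).values())

def conditional_entropy(pairs):
    """H(Y | X) for a list of (x, y) pairs: weighted average of per-value entropies."""
    n = len(pairs)
    result = 0.0
    for v in set(x for x, _ in pairs):
        ys = [y for x, y in pairs if x == v]
        result += len(ys) / n * entropy(ys)
    return result

# (HasKids, OwnsDumboVideo) pairs from the table
data = [("Yes", "Yes")] * 4 + [("No", "No")] * 2 + [("Yes", "No")] * 2
print(round(conditional_entropy(data), 3))   # ~0.689 bits
```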

Information gain

Suppose I have to transmit Y. How many bits on average would it save me if both I and the receiver knew X?

IG(Y | X) = H(Y) - H(Y | X)

This is called the information gain.

Alternative interpretation: how much reduction in the entropy of Y do I get if I know X?
Information gain to determine best attribute

For a set of instances S and an attribute A:

IG(S, A) = H(S) - Σ_{v ∈ Values(A)} (|S_v| / |S|) · H(S_v)

where S_v is the subset of instances of S in which A has value v.

IG(S, A) = expected reduction in entropy due to sorting on attribute A.

[30+, 10-]  split on A1:   t → [20+, 10-]   f → [10+, 0-]
[30+, 10-]  split on A2:   t → [15+, 7-]    f → [15+, 3-]

Check that in this case, A1 wins.
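Checking this numerically (my own sketch, not part of the slides):

```python
import math

def entropy(pos, neg):
    """Entropy (bits) of a set with `pos` positive and `neg` negative examples."""
    total = pos + neg
    return -sum(c / total * math.log2(c / total) for c in (pos, neg) if c)

def info_gain(parent, children):
    """parent and children are (pos, neg) counts; the children partition the parent."""
    n = sum(parent)
    remainder = sum((p + q) / n * entropy(p, q) for p, q in children)
    return entropy(*parent) - remainder

print(round(info_gain((30, 10), [(20, 10), (10, 0)]), 3))  # A1: ~0.123
print(round(info_gain((30, 10), [(15, 7), (15, 3)]), 3))   # A2: ~0.022
# A1 gives the larger reduction in entropy, so A1 wins.
```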

Going back to our example...

Day Outlook Temperature Humidity Wind PlayTennis

D1 Sunny Hot High Weak No
D2 Sunny Hot High Strong No
D3 Overcast Hot High Weak Yes
D4 Rain Mild High Weak Yes
D5 Rain Cool Normal Weak Yes
D6 Rain Cool Normal Strong No
D7 Overcast Cool Normal Strong Yes
D8 Sunny Mild High Weak No
D9 Sunny Cool Normal Weak Yes
D10 Rain Mild Normal Weak Yes
D11 Sunny Mild Normal Strong Yes
D12 Overcast Mild High Strong Yes
D13 Overcast Hot Normal Weak Yes
D14 Rain Mild High Strong No

Which attribute will have the highest information gain?

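The answer can be checked numerically. The sketch below (my own code, not the lecture's) computes the information gain of each attribute over the 14 examples; Outlook comes out highest:

```python
import math
from collections import Counter

# Rows: (Outlook, Temperature, Humidity, Wind, PlayTennis)
data = [
    ("Sunny", "Hot", "High", "Weak", "No"),        ("Sunny", "Hot", "High", "Strong", "No"),
    ("Overcast", "Hot", "High", "Weak", "Yes"),    ("Rain", "Mild", "High", "Weak", "Yes"),
    ("Rain", "Cool", "Normal", "Weak", "Yes"),     ("Rain", "Cool", "Normal", "Strong", "No"),
    ("Overcast", "Cool", "Normal", "Strong", "Yes"), ("Sunny", "Mild", "High", "Weak", "No"),
    ("Sunny", "Cool", "Normal", "Weak", "Yes"),    ("Rain", "Mild", "Normal", "Weak", "Yes"),
    ("Sunny", "Mild", "Normal", "Strong", "Yes"),  ("Overcast", "Mild", "High", "Strong", "Yes"),
    ("Overcast", "Hot", "Normal", "Weak", "Yes"),  ("Rain", "Mild", "High", "Strong", "No"),
]
attrs = ["Outlook", "Temperature", "Humidity", "Wind"]

def entropy(labels):
    n = len(labels)
    return -sum(c / n * math.log2(c / n) for c in Counter(labels).values())

def gain(column):
    labels = [row[-1] for row in data]
    remainder = 0.0
    for v in set(row[column] for row in data):
        subset = [row[-1] for row in data if row[column] == v]
        remainder += len(subset) / len(data) * entropy(subset)
    return entropy(labels) - remainder

for i, a in enumerate(attrs):
    print(a, round(gain(i), 3))
# prints: Outlook ~0.247, Temperature ~0.029, Humidity ~0.152, Wind ~0.048
# => Outlook has the highest information gain and becomes the root of the tree.
```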
Problem: Attributes with multiple values

If an attribute splits the data perfectly, it will always be preferred


by information gain.
E.g. a unique ID for each data point!
But that has very poor generalization performance!
Possible solutions:
– Using better criteria (based on information)
– Ensuring that all attributes have the same number of values


A better criterion: Gain ratio

For a set of instances S and an attribute A with possible values v_1, ..., v_k:

GainRatio(S, A) = IG(S, A) / SplitInformation(S, A)

where

SplitInformation(S, A) = - Σ_i (|S_i| / |S|) log2 (|S_i| / |S|)

and S_i is the subset of S for which A has value v_i.

For an attribute A that splits the data into many partitions of roughly equal size, SplitInformation(S, A) will be high, so its gain ratio is pushed down.

Problem: the ratio can actually become too high when SplitInformation is very small.
Solution: first use information gain, then use gain ratio only for attributes with information gain above average.
Other such metrics are also used.
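A sketch (my own) of the denominator and why a many-valued attribute is penalized:

```python
import math

def split_information(subset_sizes):
    """SplitInformation computed from the sizes of the partitions an attribute induces."""
    n = sum(subset_sizes)
    return -sum(s / n * math.log2(s / n) for s in subset_sizes if s)

def gain_ratio(information_gain, subset_sizes):
    return information_gain / split_information(subset_sizes)

# A unique-ID attribute splits 14 examples into 14 singletons: a huge denominator
print(round(split_information([1] * 14), 3))    # ~3.807
# A balanced Boolean attribute:
print(split_information([7, 7]))                # 1.0
# The unique ID's "perfect" gain of 0.940 bits gets divided by ~3.8:
print(round(gain_ratio(0.940, [1] * 14), 3))    # ~0.247
```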
Ensuring the same number of values

If an attribute A has k possible values v_1, ..., v_k, replace it by k Boolean attributes A_1, ..., A_k, where:

A_i = 1 if A = v_i, and 0 otherwise

This is called 1-of-k encoding.
It is used more generally to encode learning data (e.g. in neural networks).
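For example (a tiny sketch of my own):

```python
def one_hot(value, possible_values):
    """1-of-k encoding: one Boolean attribute per possible value."""
    return [1 if value == v else 0 for v in possible_values]

print(one_hot("Overcast", ["Sunny", "Overcast", "Rain"]))  # [0, 1, 0]
```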

Decision tree construction as search

State space: all possible trees


Actions: which attribute to test
Goal: tree consistent with the training data
Depth-first search, no backtracking
Heuristic: information gain (or gain ratio)
Can get stuck in a local minimum, but is fairly robust (because
of the heuristic)

Inductive bias of decision tree construction

The hypothesis space is complete! We can represent any


Boolean function of the attributes
So there is no representational bias
Outputs a single hypothesis: the “shortest” tree, as guided by the information gain heuristic
Because there is no backtracking, it is subject to local minima
But because the search choices are statistically based, it is
robust to noise in the data
Algorithmic bias: prefer shorter (smaller) trees; prefer trees that
place attributes with high information gain close to the root


Using decision trees for real data

Lots of issues to deal with!


How to test real-valued attributes
How to estimate classifier error
How to deal with noise in the data
How to deal with missing attributes
How to incorporate attribute costs

Example: CRX data, UCI Repository
| This file concerns credit card applications. All attribute names
| and values have been changed to meaningless symbols to protect
| confidentiality of the data.
+, -. | classes

A1: b,a.
A2: continuous.
A3: continuous.
A4: u, y, l, t.
A5: g, p, gg.
A6: c, d, cc, i, j, k, m, r, q, w, x, e, aa, ff.
A7: v, h, bb, j, n, z, dd, ff, o.
A8: continuous.
A9: t,f.
A10: t,f.
A11: continuous.
A12: t,f.
A13: g, p, s.
A14: continuous.
A15: continuous.


Attributes with continuous values

Example:
Temperature: 40 48 60 72 80 90

PlayTennis: No No Yes Yes Yes No

A decision tree needs to perform tests on these attributes as well

What kind of test do we want?


Value of the attribute less than a cut point!

What cut points should we consider?


We need to consider only cut points where the class label changes!
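A sketch (my own) of how those candidate cut points could be enumerated for the Temperature example, using the midpoint between consecutive sorted values whenever the class label changes (the midpoint is one common choice):

```python
temps  = [40, 48, 60, 72, 80, 90]
labels = ["No", "No", "Yes", "Yes", "Yes", "No"]

# Sort by attribute value, then propose a cut point wherever the label changes.
pairs = sorted(zip(temps, labels))
cuts = [(a + b) / 2 for (a, la), (b, lb) in zip(pairs, pairs[1:]) if la != lb]
print(cuts)   # [54.0, 85.0]  ->  candidate tests: Temperature < 54, Temperature < 85
```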

