Lecture 11 Slides - After

The document outlines decision trees and random forests, discussing their structures, terminology, and applications in regression and classification tasks. It explains the process of constructing decision trees using criteria like Gini impurity and the concept of ensemble methods, particularly bagging, to improve model accuracy. Additionally, it highlights the use of random forests in optimizing experimental designs by predicting outcomes based on limited experimental data.

Decision trees,
Random forests

Outline

▪ Review
▪ Decision trees
▪ Random forest
Introduction · Linear regression · Logistic regression
Feature engineering · Data statistics · Naive Bayes
KNN · Clustering · Dimensionality reduction
Neural networks · Convolutional neural networks · Decision trees
Announcements
Last topic of the course: Reinforcement learning

Quiz: 13.12; topics: dimensionality reduction, clustering, neural networks, convolutional neural networks
Last time - CNN

Example 3×3 convolution kernel (edge detection):

-1 -1 -1
-1  8 -1
-1 -1 -1
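The kernel above is the classic Laplacian-style edge-detection filter. A minimal sketch of applying it to a toy grayscale image with scipy (the image here is a made-up example, not from the slides):

```python
import numpy as np
from scipy.signal import convolve2d

# The 3x3 edge-detection kernel from the slide.
kernel = np.array([[-1, -1, -1],
                   [-1,  8, -1],
                   [-1, -1, -1]])

# Hypothetical grayscale image: a bright square on a dark background.
image = np.zeros((8, 8))
image[2:6, 2:6] = 1.0

# Convolve; strong responses appear along the edges of the square.
response = convolve2d(image, kernel, mode="same")
print(response)
```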
Decision trees
Brief recap - data statistics
Empirical distribution
Brief recap - data statistics
Variance, covariance, correlation
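As a quick reminder of these quantities, a minimal numpy sketch on made-up data (values chosen purely for illustration):

```python
import numpy as np

# Two hypothetical feature columns, one row per sample.
x = np.array([23., 17., 43., 68., 32., 20.])
y = np.array([1.2, 0.8, 2.5, 3.1, 1.9, 1.0])

print(np.var(x))          # empirical variance of x
print(np.cov(x, y))       # 2x2 covariance matrix of (x, y)
print(np.corrcoef(x, y))  # 2x2 correlation matrix (entries in [-1, 1])
```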
Decision Trees
Introduction
What’s a decision tree?

Student attends lecture?
  True  -> Student gets high grade
  False -> Student gets average grade

Split data based on the answer to a question.


Decision Trees
Terminology

▪ Each node checks a single feature
▪ Branches are feature values
▪ Leaves indicate the prediction of the tree

(Diagram: a tree starting at the root node, branching through internal nodes, and ending in leaf nodes.)
Decision tree for regression
Called regression tree
Regression tree example
Regression tree example
Regression tree
Loss function: each leaf predicts the mean of the training targets that fall into it, and the tree is scored by the average squared error of these predictions.

Regression tree
Optimising the loss function - greedy algorithm: at each node, try every feature and threshold, and keep the split that most reduces the squared loss (a sketch follows below).

Regression tree
Stopping criteria, e.g. a maximum tree depth or a minimum number of samples per leaf.
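A minimal sketch of the greedy split search for a regression tree on a single feature, under the squared-loss criterion named above (function and variable names are my own, not from the slides):

```python
import numpy as np

def best_split_1d(x, y):
    """Greedy search for the best threshold on one feature for a regression
    tree: minimise the summed squared error when each side of the split
    predicts the mean of its targets."""
    x_sorted = np.sort(x)
    # Candidate thresholds: midpoints between consecutive sorted values.
    thresholds = np.unique((x_sorted[:-1] + x_sorted[1:]) / 2.0)
    best_t, best_loss = None, np.inf
    for t in thresholds:
        left, right = y[x < t], y[x >= t]
        if len(left) == 0 or len(right) == 0:
            continue
        loss = ((left - left.mean()) ** 2).sum() + ((right - right.mean()) ** 2).sum()
        if loss < best_loss:
            best_t, best_loss = t, loss
    return best_t, best_loss

# Tiny example: the target roughly jumps at x = 3.5.
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0])
y = np.array([1.1, 0.9, 1.0, 4.2, 3.9, 4.1])
print(best_split_1d(x, y))  # threshold near 3.5
```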
Decision Trees for classification
Called classification trees

Data: Age (continuous feature), Car type (categorical feature), Risk (class label)

Age  Car type  Risk
23   family    high
17   sports    high
43   sports    high
68   family    low
32   family    low
20   family    high

Example classification tree:
Age < 27.5
  true  -> High
  false -> Car type = sports?
             true  -> High
             false -> Low
Classification trees
Performance metric and loss function
Classification trees
Greedy approach: choose a feature and a split sequentially based on a metric, such as Gini impurity.

Gini impurity of a leaf: G = 1 − Σ_k p_k², where p_k is the fraction of samples of class k in the leaf.

Gini impurity of a node (split): the average of the leaf impurities, weighted by the fraction of samples that reach each leaf.

Note: criteria other than the Gini index (such as entropy) are also used to choose the split.
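A minimal sketch of these two quantities in Python (helper names are my own); the example counts at the bottom correspond to the splits worked out on the following slides:

```python
import numpy as np

def gini_leaf(counts):
    """Gini impurity of a leaf: 1 - sum_k p_k^2, where p_k is the
    fraction of samples of class k in the leaf."""
    counts = np.asarray(counts, dtype=float)
    p = counts / counts.sum()
    return 1.0 - np.sum(p ** 2)

def gini_split(*leaf_counts):
    """Weighted Gini impurity of a split: average of the leaf impurities,
    weighted by the number of samples that end up in each leaf."""
    sizes = np.array([sum(c) for c in leaf_counts], dtype=float)
    impurities = np.array([gini_leaf(c) for c in leaf_counts])
    return float(np.sum(sizes / sizes.sum() * impurities))

# Split "Car type = sports?": true leaf 2 high / 0 low, false leaf 2 high / 2 low
print(gini_split([2, 0], [2, 2]))   # ~0.333
# Split "Age < 27.5": true leaf 3 high / 0 low, false leaf 1 high / 2 low
print(gini_split([3, 0], [1, 2]))   # ~0.222 (lower -> better split)
```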
Classification Trees
Example - constructing the tree
We will apply Gini impurity to construct the classification tree
Data: Age (continuous feature), Car type (categorical feature), Risk (class label)

Age  Car type  Risk
23   family    high
17   sports    high
43   sports    high
68   family    low
32   family    low
20   family    high
Classification trees
Criteria for choosing a feature and a split

Candidate split: Car type = sports?
  true:  2 high, 0 low
  false: 2 high, 2 low

Data: Age (continuous feature), Car type (categorical feature), Risk (class label)

Age  Car type  Risk
23   family    high
17   sports    high
43   sports    high
68   family    low
32   family    low
20   family    high
Decision Trees
Splitting continuous features

Candidate split: Age < 18.5
  true:  1 high, 0 low
  false: 3 high, 2 low

Original data              Reordered by Age
tid  Age  Risk             tid  Age  Risk
0    23   high             1    17   high
1    17   high             5    20   high
2    43   high             0    23   high
3    68   low              4    32   low
4    32   low              2    43   high
5    20   high             3    68   low
Decision Trees
Splitting continuous features

Candidate split: Age < 21.5
  true:  2 high, 0 low
  false: 2 high, 2 low

Original data              Reordered by Age
tid  Age  Risk             tid  Age  Risk
0    23   high             1    17   high
1    17   high             5    20   high
2    43   high             0    23   high
3    68   low              4    32   low
4    32   low              2    43   high
5    20   high             3    68   low
Decision Trees
Splitting continuous features

Candidate split: Age < 27.5
  true:  3 high, 0 low
  false: 1 high, 2 low

Original data              Reordered by Age
tid  Age  Risk             tid  Age  Risk
0    23   high             1    17   high
1    17   high             5    20   high
2    43   high             0    23   high
3    68   low              4    32   low
4    32   low              2    43   high
5    20   high             3    68   low

(Handwritten annotation: Gini impurity of this split = (3/6)·0 + (3/6)·(4/9) = 2/9 ≈ 0.22.)
Decision Trees
Splitting continuous features

Candidate split: Age < 37.5
  true:  3 high, 1 low
  false: 1 high, 1 low

Original data              Reordered by Age
tid  Age  Risk             tid  Age  Risk
0    23   high             1    17   high
1    17   high             5    20   high
2    43   high             0    23   high
3    68   low              4    32   low
4    32   low              2    43   high
5    20   high             3    68   low

(Handwritten annotation: Gini impurity of this split = (4/6)·0.375 + (2/6)·0.5 ≈ 0.42.)
Decision Trees
Splitting continuous features

Candidate split: Age < 55.5
  true:  4 high, 1 low
  false: 0 high, 1 low

Original data              Reordered by Age
tid  Age  Risk             tid  Age  Risk
0    23   high             1    17   high
1    17   high             5    20   high
2    43   high             0    23   high
3    68   low              4    32   low
4    32   low              2    43   high
5    20   high             3    68   low

(Handwritten annotation: Gini impurity of this split = (5/6)·0.32 + (1/6)·0 ≈ 0.27.)
Decision Trees for classification
Comparing the two best candidate splits (counts from the previous slides)

Car type = sports?
  true:  2 high, 0 low
  false: 2 high, 2 low

Age < 27.5
  true:  3 high, 0 low
  false: 1 high, 2 low

Data: Age (continuous feature), Car type (categorical feature), Risk (class label)

Age  Car type  Risk
23   family    high
17   sports    high
43   sports    high
68   family    low
32   family    low
20   family    high
Classification tree
Example - Final tree

Age < 27.5
  true  -> leaf: 3 high, 0 low  -> predict high risk
  false -> Car type = sports?
             true  -> leaf: 1 high, 0 low  -> predict high risk
             false -> leaf: 0 high, 2 low  -> predict low risk

Data: Age (continuous feature), Car type (categorical feature), Risk (class label)

Age  Car type  Risk
23   family    high
17   sports    high
43   sports    high
68   family    low
32   family    low
20   family    high
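The whole split search above can be reproduced in a few lines; the sketch below (variable and function names are my own) scores the categorical split and all the Age midpoints by weighted Gini impurity, and Age < 27.5 indeed comes out lowest:

```python
import numpy as np

ages   = np.array([23, 17, 43, 68, 32, 20])
sports = np.array([0, 1, 1, 0, 0, 0])    # 1 = sports car, 0 = family car
high   = np.array([1, 1, 1, 0, 0, 1])    # 1 = high risk, 0 = low risk

def weighted_gini(mask):
    """Weighted Gini impurity of splitting the data by a boolean mask."""
    total = 0.0
    for side in (mask, ~mask):
        if side.sum() == 0:
            continue
        p = high[side].mean()                       # fraction of "high" on this side
        total += side.mean() * (1.0 - p**2 - (1.0 - p)**2)
    return total

# Categorical split
print("Car type = sports:", weighted_gini(sports == 1))        # ~0.333

# Continuous splits: midpoints between consecutive sorted ages
for t in (np.sort(ages)[:-1] + np.sort(ages)[1:]) / 2:
    print(f"Age < {t}:", weighted_gini(ages < t))
# Age < 27.5 gives the lowest impurity (~0.222), so it is chosen as the root split.
```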
Classification tree example
penguin data set
Decision Trees
Penguins example

Culmen length ≤ 43.25  ->  Adelie | Gentoo
Decision Trees
Penguins example

Culmen length ≤ 43.25
 ├─ Culmen depth ≤ 15.3  -> Gentoo | Adelie
 └─ Culmen depth ≤ 16.55 -> Gentoo | Chinstrap


Decision Trees
Penguins example
Python output for sklearn.tree

(Figure: the same tree as rendered by sklearn; class order in the output: [Adelie, Chinstrap, Gentoo].)

Culmen length ≤ 43.25
 ├─ Culmen depth ≤ 15.3  -> Gentoo | Adelie
 └─ Culmen depth ≤ 16.55 -> Gentoo | Chinstrap
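A hedged sketch of how such a tree could be produced with sklearn.tree; the file name and column names here are assumptions (the public penguins data often uses bill_length_mm / bill_depth_mm for the same measurements):

```python
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.tree import DecisionTreeClassifier, plot_tree

# Assumed file and column names; adapt to however the penguin data is loaded.
df = pd.read_csv("penguins.csv").dropna()
X = df[["culmen_length_mm", "culmen_depth_mm"]]
y = df["species"]                      # Adelie / Chinstrap / Gentoo

# Depth-2 tree, split with Gini impurity (the default criterion).
clf = DecisionTreeClassifier(max_depth=2, criterion="gini")
clf.fit(X, y)

plot_tree(clf, feature_names=X.columns, class_names=clf.classes_, filled=True)
plt.show()
```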


Summary - decision tree construction
▪ We select the most discriminative feature and threshold
  • Discriminative power is measured by a criterion:
    • Regression tree: average squared loss
    • Classification tree: Gini impurity
  • We create a node based on this feature

▪ We repeat for each new branch until a stopping criterion is met

▪ To ensure the tree depth is not too high (avoid overfitting), “pruning” is done (see the sketch below).

More examples on StatQuest: https://www.youtube.com/watch?v=_L39rN6gz7Y
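For reference, a minimal sketch of the sklearn parameters that limit tree growth or prune the tree (the specific values are arbitrary illustrations, not recommendations):

```python
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier

# A few common ways to keep a tree from growing too deep and overfitting:
clf = DecisionTreeClassifier(
    max_depth=4,            # stop growing below this depth
    min_samples_leaf=5,     # require at least 5 samples in every leaf
    ccp_alpha=0.01,         # minimal cost-complexity pruning strength
)

# Toy data just to make the sketch runnable.
X, y = make_classification(n_samples=200, random_state=0)
clf.fit(X, y)
print(clf.get_depth())
```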


Random forests
Recall - Decision Trees
▪ Regression trees
▪ Classification trees

For the “split” in each node we need to determine:
• The feature
• The threshold
using some criterion.

(Diagram: a tree with a root node, internal nodes, and leaf nodes.)
Decision Trees
Characteristics of decision tree induction

Pros:
• Easy to interpret and explain
• Minimal data preparation
• Automatic feature selection

Cons:
• Tend to overfit if deep
• Don’t always perform so well
Fundamental trade-off in learning

Goal of supervised ML models: generalise well on new data, based on the patterns learned from known data.

Hence, we want the test error e_test to be small.
Ensemble methods
Overview
Based on the hypothesis that combining multiple models can reduce one type of error without significantly increasing the other.

Idea:
▪ Take a collection of predictors (decision trees, for example)
▪ Combine their results to make a single predictor

Types:
▪ Bagging: train predictors in parallel on different samples of the data, then combine outputs through voting or averaging
▪ Stacking: combine model outputs using a second-stage predictor such as linear regression
▪ Boosting: train learners on the filtered output of other learners

Optional: https://towardsdatascience.com/ensemble-methods-bagging-boosting-and-stacking-c9214a10a205
Ensemble methods
Bagging
Bagging = Bootstrap aggregating

See: https://en.wikipedia.org/wiki/Bootstrap_aggregating
Ensemble methods
Intuition on why ensemble methods could work
Suppose we have an ensemble of 25 classifiers
→ Each classifier has error rate ϵ = 0.35
→ Assume the classifiers are independent

Probability that the ensemble classifier makes a wrong prediction
= probability that a majority (at least 13) of the classifiers is wrong:

P(wrong prediction) = Σ_{i=13}^{25} (25 choose i) ϵ^i (1 − ϵ)^{25−i} ≈ 0.06
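This sum is easy to check numerically; a quick sketch using only the standard library (the 25 / 0.35 / 13 numbers are from the slide):

```python
from math import comb

eps, n = 0.35, 25
# Probability that at least 13 of the 25 independent classifiers are wrong.
p_wrong = sum(comb(n, i) * eps**i * (1 - eps)**(n - i) for i in range(13, n + 1))
print(round(p_wrong, 3))   # ~0.06
```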
Ensemble methods
Bootstrap

Method to generate multiple datasets with good statistical properties from an original dataset:
take the original data and, by random sampling with replacement, build Bootstrap sample 1, Bootstrap sample 2, …
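A minimal numpy sketch of this procedure: draw as many indices as there are samples, with replacement, and repeat K times (the toy data here is arbitrary):

```python
import numpy as np

rng = np.random.default_rng(0)
data = np.arange(10)           # stand-in for the original dataset (10 samples)

K = 3                          # number of bootstrap samples to generate
bootstrap_samples = [
    data[rng.integers(0, len(data), size=len(data))]   # sample indices with replacement
    for _ in range(K)
]
for k, sample in enumerate(bootstrap_samples, 1):
    print(f"Bootstrap sample {k}:", sample)
```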


Random forest
Randomization to reduce correlation

Two randomization strategies are used to select the data on which each classifier is trained.

Sampling data:
→ Select a subset of the data → each tree is trained on different data

Sampling features:
→ Select a subset of the features → corresponding nodes in different trees (usually) don’t use the same feature to split

These strategies keep the trees weakly correlated with each other, so when we average their predictions the errors tend to cancel out and we get a better overall prediction.
Random forest algorithm
Construction

1. Draw K bootstrap samples (K = the number of classifiers we train) of the same size as the original dataset, sampling with replacement (bootstrapping)

2. While constructing each decision tree, select a random set of m features out of the p available features to infer each split
Random forest
Prediction

▪ Regression: predict the average of the individual trees’ predictions
▪ Classification: predict the majority vote of the individual trees
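These construction and prediction steps roughly correspond to sklearn's random forest, where n_estimators plays the role of K and max_features the role of m; a minimal sketch on synthetic data:

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=500, n_features=10, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# K = n_estimators bootstrap-trained trees; m = max_features features tried per split.
forest = RandomForestClassifier(n_estimators=100, max_features="sqrt",
                                bootstrap=True, random_state=0)
forest.fit(X_train, y_train)

# Classification: each tree votes, the forest predicts the majority class.
print(forest.score(X_test, y_test))
```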


Case study - application in physical experiment design
Section 3.4 of ML4Engineers book

Goal: optimize shock break-out time by varying 5 parameters (laser energy, disc thickness, etc)

Challenge: experiments are costly (human operators, potential breaks, time)

Machine learning approach: run a few experiments (~100) and use a random forest to predict the outcome for other possible sets of parameters
