Lecture 16
Javed Iqbal
Decision Tree Models for Regression and Classification
Machine learning is an AI technique that teaches computers to learn from experience. Machine learning algorithms use computational methods to “learn” information directly from data, without relying on a predetermined equation as a model.
Regression Tree:
Consider modelling baseball players' salary, a quantitative variable; the resulting tree model is therefore a regression tree. There are 322 observations (players), with 20 variables measured on each. Many variables may affect salary (in thousands of dollars, expressed in logs), but suppose we consider only two predictors (X1: years of experience, and X2: hits made in the previous season). The resulting decision tree is represented as follows:
Overall, the tree stratifies or segments players into three regions of predictor space:
$R_1 = \{X \mid \text{Years} < 4.5\}$, $R_2 = \{X \mid \text{Years} \geq 4.5,\ \text{Hits} < 117.5\}$,
$R_3 = \{X \mid \text{Years} \geq 4.5,\ \text{Hits} \geq 117.5\}$
Here we have two internal nodes and three leaves (terminal nodes). Within a given region, the predicted salary is the average salary of the training players in that region.
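As a quick illustration, the following sketch fits such a regression tree with scikit-learn (a minimal example; the (Years, Hits) values and log-salaries below are made up, since the lecture's actual data set has 322 players):

import numpy as np
from sklearn.tree import DecisionTreeRegressor, export_text

# Hypothetical (Years, Hits) pairs with log(salary in $1000s) as response
X = np.array([[2, 50], [3, 120], [5, 80], [6, 150], [10, 90], [12, 170]])
y = np.array([4.5, 4.8, 5.5, 6.4, 5.6, 6.7])

# Limit the tree to two levels of splits, as in the lecture's tree
tree = DecisionTreeRegressor(max_depth=2).fit(X, y)
print(export_text(tree, feature_names=["Years", "Hits"]))

# The prediction for a new player is the mean response of the
# training players in the leaf (region) where that player falls
print(tree.predict([[5, 100]]))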
Interpretation of the tree outcome: Years is the most important predictor of salary, so players with fewer years of experience (in particular, less than 4.5 years) have lower salaries. Given that a player has low experience, the number of hits plays no role in determining salary. But among players with more than 4.5 years of experience, the number of hits also becomes important, with higher salaries paid to players with a high number of hits.
In general, the goal of the tree algorithm is to find regions (boxes) $R_1, R_2, \ldots, R_J$ that minimize the residual sum of squares (RSS), given by:

$$\sum_{j=1}^{J} \sum_{i \in R_j} \left( y_i - \hat{y}_{R_j} \right)^2$$

where $\hat{y}_{R_j}$ is the mean response for the training observations within the $j$th box.
Which variable to split on first, and where the cut is to be made (e.g., $\text{Years} < 4.5$), is determined by the decision tree algorithm so that the residual sum of squares is minimized. Initially, the residual sum of squares equals the sum of squared deviations of all observations $y$ from the grand mean $\bar{y}$. For each predictor we consider all possible cut points; the cut that provides the greatest reduction in the residual sum of squares is selected by the algorithm. After a cut is made, we keep segmenting the predictor space to look for further reductions in the residual sum of squares.
The algorithm stops when a stopping criterion is reached, e.g., when only a certain number of cases remain in each region of the predictor space.
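The greedy search just described can be sketched in a few lines of Python (an illustration under our own naming; the lecture does not prescribe an implementation):

import numpy as np

def best_split(X, y):
    """One step of recursive binary splitting: scan every predictor and
    every candidate cut point and return the cut that most reduces the
    residual sum of squares (RSS)."""
    best_j, best_s, best_rss = None, None, np.sum((y - y.mean()) ** 2)  # RSS with no split
    for j in range(X.shape[1]):          # each predictor in turn
        for s in np.unique(X[:, j]):     # each possible cut point for that predictor
            left, right = y[X[:, j] < s], y[X[:, j] >= s]
            if len(left) == 0 or len(right) == 0:
                continue                 # a cut must leave cases on both sides
            rss = np.sum((left - left.mean()) ** 2) + np.sum((right - right.mean()) ** 2)
            if rss < best_rss:           # keep the cut with the greatest RSS reduction
                best_j, best_s, best_rss = j, s, rss
    return best_j, best_s, best_rss

A full tree grower would call best_split recursively on the two resulting half-spaces until the stopping criterion above is reached.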
Ex1: Prepare a decision tree from the given rectangular predictor space (right panel), or vice versa: given the decision tree (left panel), prepare a rectangular predictor space for the two predictors X1 and X2.
Ex2: Prepare a decision tree from the given rectangular predictor space for the two predictors X1 and X2.
Ex3: Prepare a rectangular predictor space corresponding to the decision tree given below for
the two predictors X1 and X2.
Classification Tree:
Used for qualitative dependent variables. The tree-building procedure is similar to that of the regression tree.
Here the prediction for a test case is given by the label of the most commonly occurring class (i.e., by majority vote) in the region in which the test case falls.
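A minimal sketch of this majority-vote rule using scikit-learn (the toy predictors and labels below are made up for illustration):

import numpy as np
from sklearn.tree import DecisionTreeClassifier

# Hypothetical predictors (age, owns_home coded 0/1) and a binary response
X = np.array([[25, 0], [30, 1], [45, 0], [50, 1], [60, 0], [65, 1]])
y = np.array(["no", "no", "yes", "yes", "yes", "no"])

clf = DecisionTreeClassifier(max_depth=2).fit(X, y)

# predict() labels a test case with the most commonly occurring class
# among the training cases in the leaf where the test case falls
print(clf.predict([[40, 1]]))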
We keep segmenting the predictor space to minimize node impurity as measured by, e.g., the Gini index G. (Here K is the number of classes of the response variable, e.g., K = 2 for binary classification.)
$$G = \sum_{k=1}^{K} \hat{p}_{mk} \left( 1 - \hat{p}_{mk} \right)$$
Here $\hat{p}_{mk}$ is the proportion of training observations in the $m$th region that are from the $k$th class. G will be small when the region contains mostly one class of cases, so that $\hat{p}_{mk}$ is close to either zero or one. When the region contains equal numbers of cases of the two classes, $\hat{p}_{mk}$ will be close to 0.5 and G will be large.
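The behaviour of G can be checked with a small helper function (our own sketch, not from the lecture):

import numpy as np

def gini(labels):
    """Gini index of one region: G = sum over classes of p_mk * (1 - p_mk)."""
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()   # class proportions p_mk
    return np.sum(p * (1 - p))

print(gini(["yes"] * 9 + ["no"] * 1))  # nearly pure region: G = 0.18
print(gini(["yes"] * 5 + ["no"] * 5))  # 50/50 region: G = 0.50, the maximum for K = 2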
Ex4: Consider predicting the default status (yes or no) of a sample of clients from the following classification tree. The predictors are whether the client owns a home, marital status (married or single/divorced), and annual income (in dollars).
a) Predict whether or not a married client who has income of $100,000 and who does not own
a house will default.
b) Predict whether a single client who has income of $60,000 and who owns a house will
default.
c) Predict whether a single client who has an income of $70,000 and who does not own a
house will default.
d) What are the two most important predictors in predicting default status?
e) What is the number of internal nodes and leaves in this problem?
[Ans: a) will not default, b) will not default, c) will not default, d) home ownership and marital status, e) 3 internal nodes and 4 leaves]
Ex5: Consider the regression tree grown on the response variable ‘pollution level’ and seven predictors, as shown in the tree plot, for different test locations. The predictors include the ‘number of industries’ in the location, the ‘population’ (in thousands) of that location, the average number of ‘wet days’ per year, the average ‘temperature’ in Fahrenheit, and the ‘wind speed’ in km/hour. The number in each leaf node is the average pollution level in that region of predictor space.
a) Is this a regression or a classification tree?
b) What are the two most important variables determining the pollution level in an area?
c) Predict the average pollution level in a test location which has 500 industries, a population of 200 thousand, an average of 150 wet days, an average temperature of 50 Fahrenheit, and a wind speed of 8 km/hour.
[Answer: a) regression b) number of industries and population in the test area c) 33.88]
Ex6: Consider the classification tree grown to predict whether a client will buy a computer (yes or no). The predictors are age (with three levels: youth, middle-aged, and senior), whether the client is a student, and the credit rating of the individual (fair or excellent). The resulting classification tree is as follows: