Chapter 1. Elements in Predictive Analytics
Related Readings …
Learning Objectives
Idea of predictive modeling, linear regression and its limitations, KNN
regression, bias-variance trade-off, cross-validation, principal component
analysis
1 Introduction
Predictive analytics is a field in machine learning. In many daily life problems, we are given a
series of n observations, each of which looks something like this:
(x1, x2, …, xp) → y
where
x = (x1, x2, …, xp), and each xk is called a predictor / independent variable / attribute
/ descriptor;
y is called the response / dependent variable / output / target.
Each of the xk may (or may not) contain some “explanatory power” in predicting y. Given the n
observations, we want to build a model that can predict y based on x.
In this course, we are going to discuss different classes of models y ≈ f(x), and the meaning of being
the "best". We will implement models using Excel and R. For some models, f is a simple
function. However, in some models f exists only as a graphical representation of an algorithm.
The process of building f can sometimes be expressed as a problem of parameter estimation. But
since sometimes there are not really parameters, we use the term "model training" rather than
"model fitting" to describe the process of building f.
The n observations can be written as x_i = (x_{i1}, x_{i2}, …, x_{ip}) with response y_i, for i = 1, 2, …, n.
In many cases we do not use all observations for building f but withhold some for model
validation. The subset of observations used for model training is called the training data set,
and the observations withheld are called the test data set or validation data set. We will
elaborate on this when we discuss model validation in Section 4.
Before we start our discussion of the first model for f, let us have a quick recap of the simplest
predictive model: linear regression.
2 Linear Regression
The linear regression model takes f(x) = β0 + β1x1 + β2x2 + … + βpxp, so the problem of training f is the same as estimating the p + 1 unknown coefficients β0, β1, …, βp.
The quality of the fit can be assessed by using the mean square error (MSE)
MSE = (1/n) Σ_{i=1}^n (y_i − ŷ_i)²,  where ŷ_i = f(x_i) = β0 + β1 x_{i1} + β2 x_{i2} + … + βp x_{ip}.
When p < n, there is a unique solution that minimizes the MSE. You may recall the least-squares
solution β̂ = (X′X)⁻¹X′y, where the definition of X here is slightly different from that on page 2:
X = [ 1  x11  x12  …  x1p
      1  x21  x22  …  x2p
      ⁝
      1  xn1  xn2  …  xnp ].
Example 2.1
Consider the linear regression problem with data
(1, 2, 3, 4, 1) → 13.4
(2, 2, 1, 5, 0) → 6.1
(3, 3, 2, 1, 2) → 16.7
Here, p = 5 and n = 3. Show that two solutions for β̂ such that the MSE is zero are
(0, 1.155403, 0.55734, 2.674513, 0, 3.10637)
and
(1, 1.155403, 0.128767, 2.531655, 0, 3.392083).
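A quick way to check the claim is to evaluate the fitted values directly in R. The sketch below simply copies the data and the two printed coefficient vectors; the residual MSEs come out essentially zero (up to the rounding of the printed coefficients).

# Design matrix with an intercept column: each row is (1, x1, ..., x5)
X <- rbind(c(1, 1, 2, 3, 4, 1),
           c(1, 2, 2, 1, 5, 0),
           c(1, 3, 3, 2, 1, 2))
y <- c(13.4, 6.1, 16.7)
b1 <- c(0, 1.155403, 0.55734,  2.674513, 0, 3.10637)
b2 <- c(1, 1.155403, 0.128767, 2.531655, 0, 3.392083)
mean((y - X %*% b1)^2)   # ~0
mean((y - X %*% b2)^2)   # ~0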
The linear regression model has three major limitations:
1) Why is f linear? Is there strong evidence to support linearity? As the number of data points
increases, it becomes easier to reject the linearity assumption.
2) We have arrived at the era of big data. Typically, for each observation, the number of
predictors is huge; that is, p ≫ n. Just think of using genes to predict cancer. There can well
be more than 10000 genes that are "potentially" related to lung cancer. However, in a
clinical trial, the number of patients is limited. In such cases there will be infinitely many
solutions of β that make MSE = 0.
3) One subtle problem is that many of the p predictors in x have very low, or even no, predictive
power because they are collected in a routine manner, or are strongly correlated with some
other predictors in the same x, such that including one of them is sufficient. The problem of variable
selection becomes extremely difficult in such a setting.
In view of these three limitations, the least-squares solution is not useful under a big data setting.
(Good news?) Also, if y is a qualitative variable, then a linear regression simply does not make
sense.
In this course we will deal with the problems above. We will discuss
1) Non-linear regression: K-nearest neighbor (KNN) regression (chapter 1), classification and
regression tree (CART) (chapter 3)
2) Dimensionality reduction: principal component analysis (PCA) (chapter 1)
3) Subset selection, penalized least squares regression including ridge and Lasso regression, and
partial least squares (chapter 4)
We will also discuss separately clustering, a method of exploratory data analysis (EDA), in
chapter 2.
Why do we need so many models or procedures? A modeling procedure that can fit many
different possible true functional forms for f is said to be flexible. It provides a better fit to data,
but may be complex or may involve a lot of parameters, and is in general harder to interpret. A
model that provides too much freedom (that is, one that is too flexible) may follow the noise too
closely rather than the essential features of the predictors, leading to overfitting. As a result, we should
choose a model that is flexible enough to handle the data, but not overly flexible.
3 KNN Regression
KNN regression stands for K-nearest neighbors regression. It is a very simple procedure for non-
linear regression.
KNN
Predictions are made for a new x by searching through the entire training data set for the K most
similar cases (neighbors) and summarizing the output variables for those K cases.
Algorithm:
1) Calculate the distance between xi and x for every i.
2) Find the K data points which are closest to x.
3) Compute the mean of the y’s from the K data points in 2) for a regression problem, or find
the mode from the K data points in 2) for a classification problem.
See the two illustrations in the first two tabs of the companion Excel worksheet for a regression
problem and a classification problem.
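The algorithm is short enough to code directly. Below is a minimal R sketch for a single query point; the matrix train_x, the output vector train_y and the query x0 are hypothetical placeholders.

# KNN prediction for one query point x0 using Euclidean distance
knn_predict <- function(train_x, train_y, x0, K, classify = FALSE) {
  d  <- sqrt(rowSums((train_x - matrix(x0, nrow(train_x), length(x0), byrow = TRUE))^2))
  nb <- order(d)[1:K]                                   # indices of the K nearest neighbours
  if (classify) names(which.max(table(train_y[nb])))    # mode for classification
  else mean(train_y[nb])                                # mean for regression
}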
Remarks:
(a) There are various types of distances. The Euclidean distance between two points
a = (a1, …, ap) and b = (b1, …, bp), defined as
‖a − b‖₂ = sqrt( Σ_{i=1}^p (a_i − b_i)² ),
is the most commonly used when all predictors are similar in type. The city-block distance
‖a − b‖₁ = Σ_{i=1}^p |a_i − b_i|
is a good distance when the predictors are not all of a similar type.
(b) Ties can happen when some of the distances computed are the same. All the tied data points
can be included as the "K nearest neighbors" (this is equivalent to expanding K). Alternatively, the
tied data points at the boundary (the farthest ones among the neighbors) can be selected at random.
Ties can also happen when computing a mode, and they can be resolved by a random selection.
(c) K is called a tuning parameter. The variance of the predicted value depends on
the value of K. A small value of K provides a flexible fit, but the prediction changes
frequently as x varies. A large value of K provides a smoother and less variable fit, but
such a smoothing effect may introduce a large bias in the prediction. In the next section we will
discuss how the optimal value of a tuning parameter can be obtained.
Example 3.1
Consider the following data from a survey with two attributes to classify if a paper tissue is good
or not.
x1 (acid durability in s) x2 (strength in kg per m2) y
7 7 Bad
7 4 Bad
3 4 Good
2 3 Good
A factory produces a new paper tissue with x1 = 3 and x2 = 6. Using KNN with K = 3,
determine the classification of this new tissue.
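One way to check the answer in R is the knn() function in the class package; a sketch (not required for the course):

library(class)
train <- matrix(c(7, 7,
                  7, 4,
                  3, 4,
                  2, 3), ncol = 2, byrow = TRUE)
y <- factor(c("Bad", "Bad", "Good", "Good"))
knn(train, test = c(3, 6), cl = y, k = 3)   # the 3 nearest neighbours vote: Good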
Quality of Fit
In a linear regression model, the mean square error MSE = (1/n) Σ_{i=1}^n (y_i − ŷ_i)² is used to measure the
quality of the fit. For a KNN regression, the same definition can be used to measure the quality of
fit. For a KNN classification, the error rate
(1/n) Σ_{i=1}^n I(y_i ≠ ŷ_i)
can be used. The smaller the MSE or the error rate, the better is the fit.
Note, however, that the quality of fit is not a measure of the predictive power of the model.
4 The Bias-Variance Trade-Off and Cross-Validation
Now let us consider a regression problem. In the previous section, we mentioned the mean
square error MSE = (1/n) Σ_{i=1}^n (y_i − ŷ_i)² as a measure of quality of fit. This MSE is computed using
the training data set that was used to build or fit the model and should more accurately be
referred to as the training MSE. In doing a prediction, we do not care so much about how well
the model works on the training data but about its application to a previously unseen data point x0:
what we really care about is x0 and its predicted value f̂(x0). The predictor x0 is associated with
a true y0. The error that we are concerned with is the test MSE [y0 − f̂(x0)]², which cannot be
computed because y0 is unknown. The optimal model is one that minimizes the test MSE, rather
than the training MSE.
To analyze the test MSE, we assume for a moment that the outputs (y1, y2, …, yn) and also y0 are
treated as random. It is rather like the outputs have not been realized (observations have not been
made). Since fˆ is computed from the y’s, it is also random.
We consider the expected value of [y0 − f̂(x0)]², that is, E[y0 − f̂(x0)]².
It can be shown that the above can be broken down into three components:
E[y0 − f̂(x0)]² = Var[f̂(x0)] + (Bias[f̂(x0)])² + Var(ε).
Proof (Optional): Recall that the bias of the estimator f̂(x0) is the expected difference between the
quantity that it is estimating (y0) and itself. So,
Bias[f̂(x0)] = E[y0 − f̂(x0)].
Now, by Var(U) = E(U²) − (EU)²,
E[y0 − f̂(x0)]² = Var[y0 − f̂(x0)] + E²[y0 − f̂(x0)] = Var[y0 − f̂(x0)] + (Bias[f̂(x0)])².
Finally,
Var[y0 − f̂(x0)] = Var[f(x0) + ε0 − f̂(x0)] = Var[ε0 − f̂(x0)] = Var[f̂(x0)] + Var(ε),
where the last equality follows from the fact that ε0 is the irreducible error associated with the
pair (x0, y0). It is independent of any pair (xi, yi) and is hence uncorrelated with f̂.
The bias-variance tradeoff states that the test MSE is related to three components:
(a) The variance of the prediction
(b) The bias of the prediction
(c) The variance of the irreducible error
To lower the test MSE, one should control both the variance and the (absolute) bias.
The variance of the prediction refers to the amount by which fˆ would change if we estimate it
using a different training data set (different y’s). Ideally fˆ should not vary too much between
training sets.
The bias of the prediction refers to the error that is introduced by using fˆ to approximate the true
value y. In general, a more flexible model leads to a bias that is closer to zero. But it would also
result in a greater variance of the prediction.
One extreme is to fit a flat horizontal line y c to the data set: no matter what value of x0, the
prediction is c. The variance of the prediction is the lowest possible (it is 0!), but the model is
inaccurate, leading to large bias (in absolute sense).
Because y0 is unknown, the test MSE cannot be computed. However, we can estimate it by a
method called cross-validation (also known as rotation estimation).
The main idea about cross-validation is to split the known data set into two subsets: a training
data set on which the model is built, and a validation data set (aka hold-out set) against which the
model is tested.
Algorithm:
1) Construct k rounds. In each round, split the known data set into a training data set and a
validation data set.
2) In each round, train the model on the training data set and compute the MSE of its predictions
on the validation data set. This gives k test MSEs.
Finally, take the average of the k test MSEs to arrive at a final estimate of the test MSE.
There are many ways to construct the k rounds above. Here we name three.
1) Holdout method (k = 1)
The n observations are randomly split into two parts, usually of unequal sizes. Say, if n = 205,
then the training data set can be any 155 randomly drawn observations, and the validation
data set is the remaining 50 observations. Only a single model is fitted, so computationally
this method is not cumbersome. However, the test MSE obtained depends very heavily on
the random splitting and hence is highly variable. Also, since the model is trained on only part of
the data and tends to fit worse, the resulting test MSE reflects more on whether the validation set
selected looks like the training data set. That is, the resulting estimate is easily overshadowed by
the validation set error. Generally, the resulting estimate overstates the test MSE.
The textbook uses the name “Validation set approach” for this method because there is
actually no “cross-validation” for this approach: the data in the training set does not serve as
the validation set for any other round.
Round 1 1 2 3 4 5 6 7 8 9 10 11 12
Round 2 5 6 7 8 9 10 11 12 1 2 3 4
Round 3 9 10 11 12 1 2 3 4 5 6 7 8
However, another way to do the estimation is to use the 3 subsamples (2, 3, 4, 5), (6, 7, 8, 9)
and (10, 11, 12, 1).
Example 4.1
Consider the data set in the tab CV in the companion Excel worksheet. Estimate the test MSE for
a linear regression model using
(a) the holdout method, with the first 15 data points as the training data set, and repeat by using
the last 15 data points as the training data set;
(b) LOOCV;
(c) the 5-fold cross-validation.
Repeat for a quadratic regression model. Which model is better, and what is the final model?
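A sketch of the 5-fold calculation in part (c), written in R. The data frame df, with columns x and y, stands for the data in the CV tab (hypothetical names), so this only illustrates the mechanics.

k <- 5
n <- nrow(df)
folds <- sample(rep(1:k, length.out = n))   # random fold labels
cv_mse <- function(formula) {
  mse <- numeric(k)
  for (j in 1:k) {
    fit  <- lm(formula, data = df[folds != j, ])       # train on k-1 folds
    pred <- predict(fit, newdata = df[folds == j, ])   # predict the held-out fold
    mse[j] <- mean((df$y[folds == j] - pred)^2)
  }
  mean(mse)                                            # estimated test MSE
}
cv_mse(y ~ x)            # linear model
cv_mse(y ~ x + I(x^2))   # quadratic model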
In general, the k-fold CV method has not only a computational advantage over LOOCV but also
better statistical properties: it yields more accurate estimates. In k-fold CV, we train the model on
less data than what is available. This introduces bias into the estimate; LOOCV has less bias in
this sense. However, when we perform LOOCV, we are in effect averaging the outputs of n
fitted models, each of which is trained on an almost identical set of observations; therefore, these
outputs are strongly positively correlated with each other. For the k-fold CV method, the overlap
between the training sets of the different models is smaller. Since the mean of many highly correlated
quantities has higher variance than does the mean of many quantities that are not as highly
correlated, the test MSE resulting from LOOCV tends to have higher variance. Empirically, k = 5
and k = 10 are shown to yield test MSE estimates that suffer neither from excessively high bias
nor from very high variance.
Shortcut formula (for linear regression): the LOOCV estimate of the test MSE can be computed from a
single fit on the full data as (1/n) Σ_{i=1}^n [(y_i − ŷ_i)/(1 − h_ii)]², where h_ii is the i-th diagonal
entry of the hat matrix H = X(X′X)⁻¹X′.
Proof (Optional): Let X[i] and y[i] be similar to X and y but with the ith row of data deleted. Let
xi be the ith row of X and let β̂[i] = (X[i]′X[i])⁻¹X[i]′y[i] be the estimate of β in the ith round of
training. Then the prediction error in the ith round is e[i] = yi − xi β̂[i].
Example 4.2
Repeat (b) for Example 4.1 using the shortcut formula.
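With the shortcut formula, the LOOCV estimate needs only one fit of the full model. A hedged R sketch, again with the hypothetical df from Example 4.1:

fit <- lm(y ~ x, data = df)
h <- hatvalues(fit)                       # leverages h_ii from the hat matrix
mean(((df$y - fitted(fit)) / (1 - h))^2)  # LOOCV estimate of the test MSE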
Model Selection
Example 4.1 illustrates model selection: we have selected between two polynomial regression
models, of degree 1 and degree 2, using the test MSE. The general procedure for selecting a
model or the best tuning parameter is as follows: for each candidate model (or each candidate value of
the tuning parameter), estimate the test MSE by cross-validation, and pick the candidate with the
smallest estimated test MSE. The chosen model is then re-trained on the full data set.
Classification Problem
For a classification problem, yi is not numerical but categorical. The corresponding objective to be
minimized is the error rate
(1/n) Σ_{i=1}^n I(y_i ≠ ŷ_i).
The same procedure for model selection can be performed using the above measure and cross-
validation.
5 Data Pre-processing
Data pre-processing generally refers to the addition, deletion or transformation of the training
data set. Many techniques in predictive modeling are sensitive to the scale of the data. For
example, consider a KNN regression with two predictors x1 and x2. If, in the training data set, the x1's
are of the order of tens while the x2's are of the order of thousands, then the variation of the
Euclidean distance will be dominated by the distance in x2. See the illustration in the tab
“Standardize”.
Standardization
The most common data transformation is standardization. Given a training data set, we compute
the sample mean and sample standard deviation for each predictor xk by
x̄_k = (1/n) Σ_{i=1}^n x_{ik},   s_k = sqrt( (1/(n−1)) Σ_{i=1}^n (x_{ik} − x̄_k)² ).
Then the data points are transformed by using x_{ik} ← (x_{ik} − x̄_k) / s_k.
In practice, if the predictors are similar in units and order of magnitudes, there is not a need to
standardize them. However, standardization may still improve numerical stability in many of the
calculations.
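In R the transformation is one call to scale(), which uses exactly the x̄_k and s_k above; a sketch, where X stands for a numeric matrix of predictors:

X_std <- scale(X, center = TRUE, scale = TRUE)
colMeans(X_std)        # approximately 0 for every predictor
apply(X_std, 2, sd)    # 1 for every predictor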
Skewness
An un-skewed distribution is one that is roughly symmetric. The skewness of a data series can be
computed by using
Skewness = [ (1/(n−1)) Σ_{i=1}^n (x_i − x̄)³ ] / (s²)^{3/2},  where s² is the sample variance.
A right-skewed distribution is one that has a positive skewness. Such a distribution has the bulk of its
data points at the lower values, with a long tail stretching to the right, so the peak of the frequency
curve appears shifted to the left.
A simple way to resolve skewness is to take the log or square root of the data (if every data point is
positive). A more sophisticated method is to use the Box-Cox transformation
x_i ← (x_i^λ − 1) / λ  for λ ≠ 0,
where λ can be estimated by maximum likelihood estimation.
Some of the techniques in predictive analytics (e.g. conducting statistical inference on PCA, not
covered in this course) require the assumption of normality. Removing skewness is one way to
achieve normality.
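A small R sketch of the skewness formula above and of the effect of a log transformation; the right-skewed sample x is simulated, so the numbers are illustrative only:

skewness <- function(x) {
  v <- sum((x - mean(x))^2) / (length(x) - 1)        # sample variance
  sum((x - mean(x))^3) / (length(x) - 1) / v^(3/2)
}
set.seed(1)
x <- rlnorm(200)       # a right-skewed (lognormal) sample
skewness(x)            # clearly positive
skewness(log(x))       # much closer to 0 after the log transformation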
Missing Data
Sometimes some predictors have no values for a given sample. There are many reasons why the
predictors are missing. The predictors may be unavailable at the time when the model is built,
and it is also possible that some predictors are more prone to non-response (e.g. attributes such as
income). It can even happen in economics and finance that governments and corporations choose
not to report critical statistics because they do not look good.
It is important to understand whether there is any mechanism that creates the missing data, especially if
such a mechanism is related to the output y (informative missingness). For a large data set, the
omission of a small amount of data with missing predictors (at random) is not a problem.
However, if the size of the data set is small, such omission can cause significant loss of
information and it is better to impute the missing values. One method to do so is the KNN
regression mentioned in Section 3.
Principal Component Analysis
To visualize a data set with 2 predictors we can use a single scatterplot. However, with p
predictors, even if we are just looking at pairwise relations, we can plot a total of pC2 = p(p − 1)/2
scatterplots, which gives something like this for p = 5:
Such a graph is called a scatterplot matrix, and can potentially be very large. Say p equals 10;
then 10C2 = 45, and it is hard to analyze 45 plots by sight. A better method is needed for large p.
One way to do so is to find a low-dimensional representation of the data that captures as much of
the information as possible. PCA is a tool to achieve this.
Consider a random vector X = (X1, X2, …, Xp)′ (which is assumed to be centered, i.e. to have mean 0) with
covariance matrix Σ. Let φ_k = (φ_{k1}, φ_{k2}, …, φ_{kp}) for k = 1, 2, …, p. Consider building a new
random vector Z = (Z1, Z2, …, Zp)′ using the linear transformation
Z1 = φ11 X1 + φ12 X2 + … + φ1p Xp
Z2 = φ21 X1 + φ22 X2 + … + φ2p Xp
⁝
Zp = φp1 X1 + φp2 X2 + … + φpp Xp
(that is, Z = ΦX, where Φ is the p × p matrix whose k-th row is φ_k). It can be shown that
Var(Z_k) = φ_k Σ φ_k′ and Cov(Z_j, Z_k) = φ_j Σ φ_k′.
Principal Components
The principal components of X are those uncorrelated linear combinations Z1, Z2, …, Zp whose
variances are as large as possible.
The transformation is defined in such a way that the first component has the largest possible
variance (that is, accounts for as much of the variability in X as possible), and each succeeding
component in turn has the highest possible variance under the constraint that it is uncorrelated to
the preceding components. (The geometric interpretation is: the resulting vectors form an
orthogonal basis set).
Finding the principal components amounts to finding the matrix Φ (whose rows are called the loadings)
based on the covariance matrix Σ, together with the variance of each principal component Zi.
It turns out that we can find Φ and the variance of each principal component by the following
mathematical procedure, the proof of which can be found in any elementary text on multivariate
statistics:
Algorithm:
1) Compute all eigenvalues of Σ. By construction, all eigenvalues are non-negative real numbers.
2) Arrange the eigenvalues in 1) in the order λ1 ≥ λ2 ≥ … ≥ λp. Then Var(Zi) = λi.
3) Find the eigenvector associated with each of the eigenvalues.
4) Normalize the eigenvectors in 3) to unit length to give the loadings φ1, φ2, …, φp.
The total variance is preserved:
Σ_{i=1}^p Var(X_i) = Σ_{i=1}^p Var(Z_i) = Σ_{i=1}^p λ_i.
As such, the proportion of total population variance due to the kth principal component is
λ_k / (λ_1 + λ_2 + … + λ_p).
A typical computer output for PCA looks something like this (note the layout of the loadings in the table):
predictor   PC1    PC2    …    PCp
x1          φ11    φ21    …    φp1
x2          φ12    φ22    …    φp2
⁝           ⁝      ⁝           ⁝
xp          φ1p    φ2p    …    φpp
variance    λ1     λ2     …    λp
Example 5.1
Consider the three random variables X1, X2, X3, with covariance matrix
Σ = [ 1    2    0
      2    6    1.5
      0    1.5  5  ].
Find the three principal components and also the variances of each of them. Also conduct a
principal component analysis on the correlation matrix of the three random variables.
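The algorithm above can be carried out with eigen() in R. Using the covariance matrix as printed above, a sketch for checking Example 5.1:

Sigma <- matrix(c(1, 2,   0,
                  2, 6,   1.5,
                  0, 1.5, 5), nrow = 3, byrow = TRUE)
e <- eigen(Sigma)
e$values                      # variances of PC1, PC2, PC3 (in decreasing order)
e$vectors                     # loadings (columns, already of unit length)
e$values / sum(e$values)      # proportion of variance explained by each PC
cov2cor(Sigma)                # use this matrix for the PCA based on the correlation matrix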
A PCA based on the covariance matrix will give more weight to the Xk's that have a higher
variance. If the Xi's are measured in different units or have very different orders of magnitude, then it
is customary to first standardize each Xk before computing the covariance matrix. This is equivalent to
considering the correlation matrix of the original set of random variables.
In most cases we do not have the population covariance (or correlation) matrix of the set of
random variables X, but a set of n independent observations x1, x2, … xn. Our goal is to estimate
the principal components from the observations. This can be done by applying PCA on either the
sample covariance matrix or the sample correlation matrix.
Example 5.2
Consider the covariance matrix produced by the daily percentage change in 1-yr, 2-yr, 3-yr, 4-yr,
5-yr, 7-yr, 10-yr and 30-yr swap rate data of the US market from 3 July 2000 to 12 Aug 2011.
Estimate the 8 principal components from the data. How much variance can be explained by the
first 3 principal components? What do these 3 PCs represent in terms of yield curve movement?
Applications of PCA
From the point of view of exploratory data analysis, the transformation z = Φx maps
each observation x_k (with p predictors) into a new observation z_k (again with p components), as
follows:
The variances of the components of z are in decreasing order: the first component (the zk1's)
explains the largest variability in the original data, the second component (the zk2's), which is
uncorrelated with the first component, explains the next largest variability, and so on. Hence one
way to simplify the original data set is to retain only the first m (< p) columns of the z matrix and
discard the rest (whose variability may be contributed by noise):
z1 = (z11, z12, …, z1m | z1,m+1, …, z1p)
z2 = (z21, z22, …, z2m | z2,m+1, …, z2p)
⁝
zn = (zn1, zn2, …, znm | zn,m+1, …, znp)
(keep the columns to the left of the bar, discard those to the right).
Then we can analyze the new data by plotting a scatterplot matrix of lower dimension and even
interpret each of the m components to shed light on the essential features of x. Now the question
is: how can we choose the optimal m?
It turns out that there is no universally accepted scientific way to do so. All existing ways (of which
I give you three) are ad hoc in nature!
1) We can let m be the least number of PCs that explains a certain percentage (say 80%) of the
total variance of the data.
2) We can draw a graph (called scree plot) of the proportion of variance explained vs the PCs.
We eyeball the plot and look for a point at which the proportion of variance explained drops
off and cut that at m. This is often referred to as an elbow.
3) We look at the first few PCs in order to find interesting interpretations of the data and keep
those that can explain the variability of X. If no such interpretations are found, then further PCs are
unlikely to yield anything interesting. This is very subjective and depends on your ability to
come up with stories (as illustrated in Example 5.2).
After we have decided m, we can further plot two types of graphs. The first one is a scatterplot
matrix for the retained PC scores
z1 = (z11, z12, …, z1m)
z2 = (z21, z22, …, z2m)
⁝
zn = (zn1, zn2, …, znm)
Such a scatterplot matrix is called a PC scores plot. While each PC is uncorrelated with any
other PC, there may still be patterns in the PC scores that warrant further investigation.
We can also take any two PCs and plot, for all p predictors, the direction they are pointing to.
This is known as a PC loading plot. For example, we can pick PC1 and PC2:
predictor   PC1    PC2    …    PCm
x1          φ11    φ21    …    φm1
x2          φ12    φ22    …    φm2
⁝           ⁝      ⁝           ⁝
xp          φ1p    φ2p    …    φmp
We draw the p vectors (φ11, φ21), (φ12, φ22), …, (φ1p, φ2p). The point (φ11, φ21) shows how much
weight x1 has on PC1 and PC2, the point (φ12, φ22) shows how much weight x2 has on PC1 and
PC2, and so on.
The angles between the vectors tell us how the different predictors xi correlate with one another (as
reflected by the two PCs):
When the vectors that correspond to xi and xj form a small angle, the two predictors strongly correlate
with each other.
When the vectors are perpendicular, the corresponding xi and xj are not correlated.
When the vectors diverge and form a large angle (close to 180°), the corresponding xi and
xj are strongly negatively correlated.
Example 5.3
Consider the US violent crime rates data in year 1973 in the companion Excel worksheet (the
USArrests data set in R). Perform a principal component analysis on the standardized data
and create
(a) a scatterplot for the scores of PC1 and PC2;
(b) a PC loading plot for PC1 and PC2.
A biplot for two particular PCs simply overlays a PC scores plot with a PC loading plot. See
Figure 10.1 of the text for the biplot of PC1 and PC2 in Example 5.3. The programming language
R can generate biplots easily.
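The whole analysis of Example 5.3, including the biplot, takes a few lines in R; USArrests is a built-in data set, and scale. = TRUE standardizes the predictors first:

pr <- prcomp(USArrests, scale. = TRUE)
pr$rotation                        # loadings
head(pr$x)                         # PC scores
pve <- pr$sdev^2 / sum(pr$sdev^2)  # proportion of variance explained (data for a scree plot)
plot(pve, type = "b", xlab = "PC", ylab = "Proportion of variance explained")
biplot(pr, scale = 0)              # overlays the PC scores plot and the PC loading plot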
This is not related to real-life applications but to the geometry of random vectors and change of
basis in linear algebra. To understand what follows, you need some basics of linear algebra.
Consider PC1. The loading vector φ1 defines a direction in R^p along which the data vary the most.
If we project the n data points x1, x2, …, xn on this direction, the projected values are z11, z21, …,
zn1. The following is an illustration with p = 2:
We have Z1 = 0.839(population) + 0.544(ad) for the mean-centered data. The vector 0.839i +
0.544j is the direction of the green line. PC2 points in the direction that is perpendicular to PC1;
φ2 points in this direction. When the n data points are projected on this direction, the projected
values are z12, z22, …, zn2. We have Z2 = 0.544(population) − 0.839(ad) for the mean-centered
data.
An alternative interpretation is that PCs provide low-dimensional linear surfaces that are closest
to the observations. PC1 is the line in Rp that is closest to the n observations. PC1 and PC2
together span the plane that is closest to the n observations. PC1, PC2 and PC3 span a 3-d
hyperplane that is closest to the n observations, and so on. So,
x_{ij} ≈ Σ_{k=1}^m z_{ik} φ_{kj}.
Together the m PC scores and the m loading vectors can give a good approximation to the data.
Chapter 2. Clustering
Related Readings …
Learning Objectives
Similarity measures, hierarchical method, K-means method, difference
between supervised and unsupervised machine learning
To understand the nature of clustering, look at the class of students in this course. By defining a
measure of similarity, we can form different clusters.
1) If students in the same cohort are treated as similar, then all year 4 students form one cluster,
and all year 5 students form another cluster.
2) If students in the same major program are treated as similar, then students in the IBBA
program form one cluster, and students from the actuarial science program form another
cluster.
3) If students living in the same political district (e.g. Eastern, Southern, North) are treated as
similar, then students in the class naturally form at most 18 groups.
There are many other ways of grouping, e.g. by secondary school, by gender, by age.
Clustering has huge applications in pattern recognition, search engine results grouping, crime
analysis and medical imaging. In business, clustering is widely used in market research.
1 Similarity Measures
In Chapter 1 we mentioned the Euclidean distance and the city-block distance between two data
points. We now make the discussion more general. How can we define the distance between a
data point and a cluster of data points, and the distance between two clusters of data points?
Suppose that there are m points in cluster A and n points in cluster B. By selecting one point from
A and one point from B, we can calculate mn distances. These distances are called inter-cluster
dissimilarities. Based on these, we can define the “linkage” between A and B. Alternatively, we
can find the centroids of the two clusters and find the linkage by computing the distance between
the two centroids.
Linkage Description
Average Mean inter-cluster dissimilarity
Complete Maximal inter-cluster dissimilarity
Single Minimal inter-cluster dissimilarity
Centroid Distance between the two centroids
2 Hierarchical Method
In this section we discuss hierarchical agglomerative clustering (HAC). This method builds the
hierarchy from the individual observations by progressively merging them into clusters, until all
of them finally merge into a single cluster:
Algorithm:
1) Treat each of the n observations as its own cluster. All clusters are at level 0.
2) For k = n, n − 1, …, 2:
(i) Calculate all kC2 pairwise inter-cluster dissimilarities.
(ii) Fuse the two clusters with the smallest inter-cluster dissimilarity.
(iii) The dissimilarity in (ii) indicates the level at which the fusion should be placed.
The form of the dendrogram depends on the choices of distance and linkage function. Average
and complete linkage are generally preferred over single linkage because single linkage can
result in extended, trailing clusters in which single observations are fused one at a time. This is
called the chaining phenomenon, where clusters formed may be forced together due to single
elements being close to each other, even though many of the observations in each cluster may be
very distant from each other. (Sometimes this can be an advantage, though.) Average and complete
linkage give more balanced dendrograms.
Centroid linkage is seldom used because inversion can occur. See Example 2.3.
To illustrate, consider five observations a, b, c, d, e with the following matrix of pairwise distances:
a b c d e
a 0
b 8.5 0
c 10.5 15 0
d 15.5 17 14 0
e 11.5 10.5 19.5 21.5 0
Example 2.1
Show that the dendrogram resulting from the use of single linkage is:
Example 2.2
Show that the dendrogram resulting from the use of average linkage is:
Example 2.3
Construct a dendrogram for the three data points a = (1.1, 1), b = (5, 1) and c = (3, 1 + 2√3)
using Euclidean distance and centroid linkage.
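Examples 2.1 and 2.2 can be checked in R by feeding the distance table above into hclust(); a sketch:

m <- matrix(0, 5, 5, dimnames = list(letters[1:5], letters[1:5]))
m[lower.tri(m)] <- c(8.5, 10.5, 15.5, 11.5,   # distances b-a, c-a, d-a, e-a
                     15, 17, 10.5,            # distances c-b, d-b, e-b
                     14, 19.5,                # distances d-c, e-c
                     21.5)                    # distance  e-d
d <- as.dist(m)
plot(hclust(d, method = "single"))    # dendrogram for Example 2.1
plot(hclust(d, method = "average"))   # dendrogram for Example 2.2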
Recall the USArrests data in Example 5.3 of Chapter 1. The dendrograms for complete,
average and single linkages for the standardized data are generated by R (it is hard to create a
dendrogram using Excel) and are given below. Standardization is necessary because the unit of
the predictor “Urban population” is very different from those of the other 3 predictors.
Interpreting a Dendrogram
In a dendrogram, we can tell the proximity of two observations based on the height at which the
branches containing those two observations are first fused. (Note that when the tree is grown to the
right, the proximity of two observations along the vertical axis is not related to similarity.)
We can draw a cut to the dendrogram to control the number of clusters obtained. Using the
previous dendrogram as illustration:
Limitations of HAC
1) A dendrogram can give any number of clusters required. However, the clusters may not be
optimal in some sense. The term "hierarchical" in HAC refers to the fact that clusters
obtained by cutting the dendrogram at a given height (e.g. the cut at h = 12 above) are
necessarily nested within the clusters obtained by cutting the same dendrogram at a greater
height (e.g. the cut at h = 18). This may not be reasonable. Consider observations coming
from male students from Canada, the USA and Mexico. They either pass a test or fail a
test. If we focus on 3 clusters, the most reasonable way is to split them by country. However,
if we focus on 2 clusters, the most reasonable way is to split by test result. If we use HAC
and at the 3-cluster level we indeed arrive at a split based on country, then it is not possible
to arrive at a split by test result at the upper level.
2) HAC is not robust: if we delete a single observation from the data set and perform HAC on
the resulting data set, the result may look very different.
3 K-Means Method
The K-means method was developed by MacQueen in 1967. Suppose that we want to split n
observations into K groups, where K is known. The simplest version of the K-means method is
the following:
Algorithm:
1) Partition the n observations into K preliminary clusters by assigning a number from 1, 2, …,
K randomly to each observation. (Random partition)
2) Iterate until the cluster assignments stop changing:
(i) For each of the K clusters, compute the centroid.
(ii) Assign each observation to the cluster whose centroid is closest (as measured by ‖·‖₂).
Alternative to 1):
1) Randomly specify K observations as centroids. (Forgy initialization)
The final assignment of observations to clusters depends upon the initial randomization. That is
why the algorithm should be run with different initial settings and the results of the different final
clustering be recorded. If more than one result is obtained, we should compute a score for each
result and find the “best” grouping that is achieved. Such a score is based on the following:
SSE = Σ_{i=1}^K Σ_{x ∈ Ci} ‖x − μi‖₂² = Σ_{i=1}^K SSEi,  where μi is the centroid of cluster Ci.
SSEi is a measure of dispersion within cluster Ci. The smaller the SSE, the better is the clustering
{C1, C2, …, CK}.
Remark: It is not hard to prove (as you will see on the next page) that
SSE = Σ_{i=1}^K (1/(2|Ci|)) Σ_{x, y ∈ Ci} ‖x − y‖₂²,  where |Ci| is the number of points in cluster Ci.
Noting that some terms in the double sum are zero because they correspond to the cases where x = y,
and that the remaining terms form pairs because ‖a − b‖₂ = ‖b − a‖₂, the quantity
SSEi = (1/(2|Ci|)) Σ_{x, y ∈ Ci} ‖x − y‖₂²
can be interpreted as the average of the squared distances between the observations in the i-th
cluster. So, the SSE defined above is half of the sum of the total within-cluster variation, as
measured by the sum of pairwise squared distances between observations within each cluster,
divided by the number of points in the cluster.
To see this, write
Σ_{x, y ∈ Ci} ‖x − y‖₂²
= Σ_{x, y ∈ Ci} ‖(x − μi) − (y − μi)‖₂²
= Σ_{x, y ∈ Ci} [ ‖x − μi‖₂² + ‖y − μi‖₂² − 2(x − μi)·(y − μi) ]
= Σ_{x, y ∈ Ci} [ ‖x − μi‖₂² + ‖y − μi‖₂² ] − 2 Σ_{x, y ∈ Ci} (x − μi)·(y − μi)
= 2|Ci| Σ_{x ∈ Ci} ‖x − μi‖₂² − 2 [ Σ_{x ∈ Ci} (x − μi) ] · [ Σ_{y ∈ Ci} (y − μi) ]
= 2|Ci| Σ_{x ∈ Ci} ‖x − μi‖₂² − 2(0)·(0),
where the last equality follows from the fact that μi is the center (mean) of Ci. Hence,
Σ_{x ∈ Ci} ‖x − μi‖₂² = (1/(2|Ci|)) Σ_{x, y ∈ Ci} ‖x − y‖₂².
Consider A = (5, 3), B = (1, 1), C = (1, 2) and D = (−3, −2). Let us implement the K-means
method with K = 2, using the initial clusters {A, B} and {C, D}. What is the resulting SSE?
Example 3.1
Repeat the illustration with the initial clusters {A, C} and {B, D}.
For an example that is of a larger scale, see the illustration in the companion Excel worksheet.
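The small two-cluster illustration above can also be reproduced with kmeans() by supplying the centroids of the initial clusters as starting centers; a sketch (algorithm = "Lloyd" mimics the batch updates described above):

pts <- rbind(A = c(5, 3), B = c(1, 1), C = c(1, 2), D = c(-3, -2))
init <- rbind(colMeans(pts[c("A", "B"), ]),   # centroid of {A, B}
              colMeans(pts[c("C", "D"), ]))   # centroid of {C, D}
km <- kmeans(pts, centers = init, algorithm = "Lloyd")
km$cluster          # final assignment of A, B, C, D
sum(km$withinss)    # the resulting SSE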
The K-means method suffers from problems similar to those of HAC: it is sensitive to the scaling of the
data, is easily distorted by outliers, and is not robust to the addition or deletion of a few observations.
There are other disadvantages of the K-means method that are not present in HAC.
1) Initialization is a problem because the final clustering heavily depends on the initialization. This
can be partly resolved by using a better algorithm to find a more reasonable initialization
(e.g. the k-means++ method) rather than a pure randomization.
2) It is hard to find the global minimum of the SSE. You may easily get stuck in a local
minimum.
3) It is very difficult to set the value of K. Even if the population is known to consist of K
groups, the sampling method may be such that data from the rarest group do not appear in the
sample. Forcing the data into K groups will lead to unreasonable results. In practice,
computer scientists use a few values of K to run the algorithm.
4) If some clusters have only a few points but other clusters have many more points, the K-
means method automatically gives more weight to the larger clusters because it gives equal
weight to each point. This translates into results in which the centroids of smaller clusters end up
far away from their true centers, because these centroids are "attracted" to the larger clusters so
that the larger clusters can be further split to minimize the SSE.
5) Similar to linear regression, we are minimizing the SSE. But why the SSE? If the data looks like
the following, good luck even if you correctly set K = 2!
4 Supervised and Unsupervised Learning
Supervised learning: This refers to learning from a training set of data with both x and y.
Learning consists of inferring the relation between input x and output y.
It is called "supervised" because the process of an algorithm learning
from the training data set can be thought of as a teacher supervising the
learning process. We know the correct answers (y), and the algorithm
iteratively makes predictions so that more and more correct answers
can be obtained.
Unsupervised learning: This refers to learning from a training set of data with only x. The
learning consists of revealing the underlying distribution of the data in
order to learn more about the data. This is called unsupervised because
there are no correct answers.
This table summarizes the classification of what we have studied in Chapter 1 and Chapter 2:
              Supervised                                  Unsupervised
Discrete      Classification (e.g. KNN classification)    Clustering
Continuous    Regression (e.g. linear regression)         Dimensionality reduction (e.g. PCA)
There is another type called semi-supervised machine learning, where only some of the data has
y. For example, you may have taken a lot of photos in the F6 athletic meet. Some of the photos
may be clearly labelled by tags such as “teacher”, “class 6A”, “Chris Wong”. But some photos
are unlabeled. How can you perform clustering in such a case? In this course we are not going to
touch on this.
Chapter 3. Classification and Regression Trees
Related Readings …
Learning Objectives
CART, C4.5 model, Gini index, entropy, sum of square error and binary
recursive splitting, cost-complexity pruning, bagging, random forest,
boosting
A decision tree is a tree where each node represents a predictor, each branch represents a
decision / rule, and each leaf represents a prediction (which can be numerical or categorical).
Here is a hypothetical example. We have the number of home runs scored by a set of baseball players,
and our goal is to use two predictors (weight and height) to predict the number of home runs. A
decision tree splits the height-weight space and assigns a number to each partition.
[Figure: a decision tree with root split "height ≥ 1.8 m", followed by splits on "weight ≥ 90 kg" and
"weight ≥ 100 kg", giving leaf predictions y = 0.5, 1, 1.5 and 2; alongside, the corresponding partition of
the height-weight space into four rectangles.]
We start from the root node, and use the split "height ≥ 1.8 m?" to partition the height-weight
space into two subspaces. In each of the two subspaces, we use weight to further branch out into
two leaves (terminal nodes), where a final prediction on the number of home runs is made.
Decision trees are common in business, social science, medical and many other daily life
applications because their working principle resembles that of the human mind and is highly
interpretable (while the height-weight space partition on the right looks alien to most people).
The construction of the tree, however, is far from simple. Given a set of observations of the form
(height, weight) → number of home runs,
how can you build a tree?
This chapter is an introduction to classification and regression trees (CART) and popular
ensemble methods to improve their prediction accuracy. CART is the most well-known
algorithm in predictive analytics. It works for both classification and regression problems, is easy
to interpret, and is computationally efficient. It can deal with both continuous and categorical
predictors, and even works for missing values in the input data.
1 Classification Trees
To illustrate the construction of a classification tree, let us walk through the following famous
example. We have a data set with 14 observations, 4 predictors, and binary outcome:
To start with, we can pick any of the 4 predictors as the first split, as follows:
Temperature:  Hot 2 Yes / 2 No,   Mild 4 Yes / 2 No,   Cool 3 Yes / 1 No
Humidity:     High 3 Yes / 4 No,  Normal 6 Yes / 1 No
Outlook:      Sunny 2 Yes / 3 No, Overcast 4 Yes / 0 No, Rain 3 Yes / 2 No
Windy:        True 3 Yes / 3 No,  False 6 Yes / 2 No
A predictor that splits the observations so that each successor node is as pure as possible should
be chosen.
Outlook gives a node that is 100% pure. It seems to be a good choice. But in most real-life cases
things are murkier. Breiman et al. (1984) came up with the idea of Gini index, which uses the
total variance across different classes to measure the purity of a node:
G = Σ_{i=1}^k p_i (1 − p_i) = 1 − Σ_{i=1}^k p_i².
The smaller G is, the more pure a node is. (In the two-class case where k = 2, p1 + p2 = 1, G = 0
for p1 = 0 or 1, and G attains a maximum of 0.5 at p1 = p2 = 1/2.)
Before the split, there are 9 Yes's and 5 No's. The Gini index is G = 1 − (9/14)² − (5/14)² = 0.45918.
After the split, there are multiple new nodes. We compute the Gini index for each of the nodes
and aggregate them into a total Gini index by weighting each node's index by the proportion of
observations in that node.
Let us calculate the total Gini index after each of the splits.
Temperature:
G = (4/14)[1 − (2/4)² − (2/4)²] + (4/14)[1 − (3/4)² − (1/4)²] + (6/14)[1 − (4/6)² − (2/6)²] = 0.440476
So G decreases by a bit.
Humidity:
G = (7/14)[1 − (3/7)² − (4/7)²] + (7/14)[1 − (6/7)² − (1/7)²] = 0.367347
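The two calculations above can be scripted so that every candidate split is handled the same way; a small R sketch using the Yes/No counts from the split tables above:

gini <- function(counts) 1 - sum((counts / sum(counts))^2)
total_gini <- function(split) {                 # split: list of per-node (Yes, No) counts
  sizes <- sapply(split, sum)
  sum(sizes / sum(sizes) * sapply(split, gini)) # weighted by node proportions
}
total_gini(list(c(2, 2), c(3, 1), c(4, 2)))     # Temperature: 0.440476
total_gini(list(c(3, 4), c(6, 1)))              # Humidity:    0.367347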
Example 1.1
Calculate the total Gini index for the split using the predictors Outlook and Windy. Hence decide
which predictor should be used for the initial splitting. Continue the process and build the whole
classification tree.
What is the prediction for “Cool, Sunny, with normal Humidity and Windy”?
Another way to determine the split is to make use of the concept of entropy. The more pure the
node, the higher the degree of order it possesses. Maximum order is achieved if all outputs in a node
are the same, giving zero entropy. Minimum order is achieved if the different kinds of outputs are
equally distributed in a node, giving the largest value of entropy. The application of entropy in
computer science was explored by Quinlan (1993), giving the "C4.5 tree". The cross entropy
(aka deviance) for a node is given by (in units of bits)
D = −Σ_{i=1}^k p_i log₂ p_i = −(1/ln 2) Σ_{i=1}^k p_i ln p_i  (and note the convention 0 log₂ 0 = 0).
(Once again note that in the two-class case where k = 2, p1 + p2 = 1, D = 0 for p1 = 0 or 1,
and D attains a maximum of 1 at p1 = p2 = 1/2. Also, this is NOT the natural logarithm!) A smaller D is
preferable when we consider a split.
Let us use cross entropy to rework the golf case. Before the split,
D = −(9/14) log₂(9/14) − (5/14) log₂(5/14) = 0.94029.
Example 1.2
Use entropy to construct the classification tree for the golf illustration.
(There is yet another way to determine impurity, called the "misclassification error rate" E = 1 − max_i p_i.
The larger the proportion of the most common outcome in a node, the smaller is E. However, E is not very
sensitive for the purpose of growing a tree.)
Continuous Predictors
The golf example features categorical predictors which can only take on finitely many
possibilities. In most of the real-life examples, there is a mix of categorical and continuous
predictors. For example, you may have the actual temperature (in F) for a predictor:
x y x y
64 Yes 72 Yes
65 No 75 Yes
68 Yes 75 Yes
69 Yes 80 No
70 Yes 81 No
71 No 83 No
72 No
For such a predictor, it is customary to do a binary split*: one branch is x < t, while the other
branch is x ≥ t. To determine the best split, we look at split points of the form t = (x_i + x_{i+1})/2.
Let us consider t = 71.5, which is the midpoint of 71 and 72. The split gives
x < 71.5: 4 Yes, 2 No        x ≥ 71.5: 3 Yes, 4 No
* Do not treat x as categorical with values 64, 65 etc. Such a high-branching predictor will smash any dataset and seriously overfit.
Actually, tree splitting is also biased towards categorical predictors with a large number of possible values, because they lead to a
large number of nodes which are very pure simply because the number of observations in each of these nodes is small.
Before the split, the Gini index is G = 1 − (7/13)² − (6/13)² = 0.497041.
After the split,
G = (6/13)[1 − (4/6)² − (2/6)²] + (7/13)[1 − (3/7)² − (4/7)²] = 0.468864.
We can repeat the calculation for other potential split points. See the Excel illustration:
Now we can state the general algorithm for building a classification tree:
Algorithm:
1) Go through the list of every predictor:
(i) For categorical predictors, calculate the total Gini index / cross entropy for the split.
(ii) For continuous predictors, calculate all total Gini indexes / cross entropies based on
every possible split points (mid points of observed values). Pick the minimum as the
final Gini index / entropy for the split.
(iii) Use the predictor (and the split point) with the minimum Gini index or cross entropy.
2) Repeat for the new nodes formed until we get the tree we desire. Stop at nodes where
(i) all observations lead to the same output (no need to split),
(ii) all predictors are exhausted, or all remaining observations have the same remaining
predictor values so that no predictors can split the observations (no way to split).
Alternatively, stop when some stopping criteria are reached.
Remarks:
1) Classification trees do predictor selection: A predictor that does no splitting has no use.
2) For a categorical predictor, once it is used for a split for a node, it will not appear in
branches under that node. The split in a node will completely exhaust its information content.
3) For a continuous predictor, because the split is binary, it can happen that the same predictor
appears more than once along a path (recursive binary splitting). For example, in the tree
below, MaxHR appears at depth 3, as well as at depths 6 and 7:
A classification tree can be very large in size. A fully grown tree can lead to overfitting (high
variance, low bias) and it is sometimes desirable to limit the tree size. One way to do so is to
define stopping criteria to conduct a pre-pruning, some of which are shown below:
controlling the minimum sample size for a node split (say, if the number of observations under a
node is less than 20, then splitting stops, or if the number of observations under a node is less
than 5% of the total sample size)
controlling the maximum depth of tree
controlling the maximum number of terminal nodes
all possible splits lead to very small decrease in Gini index or entropy (a threshold has to be
set ahead)
For example, if we limit the maximum depth of the tree in the golf case to 1, then we will end up
with only one split. The nodes for Sunny and Rain are not pure. The prediction will be based on
the most commonly occurring class of outputs of training observations:
[Figure: the depth-1 tree splitting on Outlook, with the majority class shown at each leaf.]
In Section 3, we will also investigate a method that prunes a large tree back to a smaller scale.
2 Regression Trees
In this case, the response is continuous. For a continuous predictor x, we perform a binary
split at a split point t by considering the objective function
SSE = Σ_{i: x_i < t} (y_i − ŷ_{R1})² + Σ_{i: x_i ≥ t} (y_i − ŷ_{R2})²,
where R1 = {X : X < t} and R2 = {X : X ≥ t}, and ŷ_{Rj} is the mean response of all the training
observations in Rj. This is similar to the K-means clustering objective function. The
determination of the best split point t that gives the smallest SSE is based on exhaustion.
When we decide the predictor for the first splitting, we again go through the list of all predictors.
We pick the one (and the associated split point for continuous predictor) that minimizes the SSE.
Then we continue, looking for the best predictor (and the best split point) that minimizes the SSE
within each of the resulting regions formed from the previous splitting just like building a
classification tree. The only difference is that the prediction in each terminal node is the average
of the output of observations falling into that node.
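The exhaustive search over split points described above can be written in a few lines of R; a sketch for one continuous predictor x and response y:

best_split <- function(x, y) {
  u  <- sort(unique(x))
  ts <- (u[-1] + head(u, -1)) / 2            # midpoints of consecutive observed values
  sse <- sapply(ts, function(t) {
    sum((y[x < t]  - mean(y[x < t]))^2) +    # SSE in R1
    sum((y[x >= t] - mean(y[x >= t]))^2)     # SSE in R2
  })
  c(split = ts[which.min(sse)], sse = min(sse))
}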
Example 2.1
Consider the following data set with 15 observations in the companion Excel worksheet.
Determine the first 3 splits if the stopping criterion is a maximum depth of 2. What are the
predicted values for the 4 terminal nodes?
Limitations of CART
1) Form of f
The (X1, X2) space for the regression tree in Example 2.1 looks like this:
[Figure: left, the (X1, X2) space split into 4 rectangles at the split points 4.005, 8.705 and 11.81 from
Example 2.1; right, a partition with non-rectangular regions that no tree can produce.]
Notice that the space is split into 4 rectangles. In general, such splitting can only result in series
of rectangles. It is impossible to get things like the top right! What does this mean practically?
Consider the population shown in the figure on the left:
[Figure: two panels of the (X1, X2) space; the boundary between the regions y = 0 and y = 1 is an
oblique straight line, which a tree can only approximate with many axis-parallel splits.]
A linear model fits data from such population well. A regression tree can only approximate the
relation by using lots of splits. (This problem, though, can be solved by first using PCA to rotate
the data!) Then let us consider the following population:
[Figure: the (X1, X2) space divided into axis-parallel rectangular regions labelled y = 0, y = 1 and y = 2.]
A regression tree works well with only a few splits for data coming from such population. A
linear model would fit very poorly.
If the underlying population can be well approximated by a linear model, linear regression will
outperform a non-parametric regression tree. However, if there is a high non-linearity and
complex relationship between x and y, a tree will outperform a linear model.
2) Greedy splitting
The technique of predictor selection and splitting described in the previous two sections is a
greedy algorithm: it checks for the best split at the current step and moves
forward until one of the specified stopping conditions is reached. In other words, it looks for
a local optimum at each step instead of finding a global optimum. For an analogy, consider the
case of driving.
3) Premature stopping
Consider the classical XOR classification problem, where we have two predictors x1 and x2, with
4 observations:
x1 x2 y
0 0 0
0 1 1
1 0 1
1 1 0
The first split can be based on either of the predictors, and the resulting total G (or D) is the same
as that before the split. If we assign a stopping rule that stops when the decrease in G (or D) is
less than a threshold, then the first split cannot be performed. Note, however, that the second
split gives two completely pure nodes.
To conclude: A stopping condition may be met too early. It is better to grow the tree fully, then
remove nodes.
4) Sensitivity to Data
Consider the building of a classification tree with the following 16 observations using cross
entropy:
x1 x2 y x1 x2 y
0.15 0.83 0 0.10 0.29 1
0.09 0.55 0 0.08 0.15 1
0.29 0.35 0 0.23 0.16 1
0.38 0.70 0 0.70 0.19 1
0.52 0.48 0 0.62 0.47 1
0.57 0.73 0 0.91 0.27 1
0.73 0.75 0 0.65 0.90 1
0.47 0.06 0 0.75 0.36 1
[Figure: left, the classification tree built from the data above, with root split at x1 ≈ 0.6; right, the tree
built after the change described below, with root split at x2 ≈ 0.33; the leaves predict 1 or 0.]
However, if you change the last x2 from 0.36 to 0.32, the classification tree will become the one
shown on the right, which is totally different from the one on the left. The same happens for
regression trees. Trees can be very non-robust! Recall that in chapter 1 (p.8) we mentioned that
the variance of the prediction refers to the amount by which fˆ would change if we estimate it
using a different training data set. Ideally fˆ should not vary too much between training sets.
You can treat the change from 0.36 to 0.32 as a replacement of one single data point. In this
sense, the variance of the prediction can be huge for trees!
3 Post-Pruning
In this section we discuss one method to limit the size of a tree. We treat the tree size (as
measured by the number of terminal nodes) as a tuning parameter. Here is some notation: T0 is the
fully grown tree, T ⊆ T0 is a subtree, |T| is the number of terminal nodes of T, and α ≥ 0 is a
complexity parameter.
We consider the cost complexity pruning (aka weakest link pruning) procedure: for each α, find the
subtree T ⊆ T0 that minimizes
Cα(T) = Σ_{m=1}^{|T|} Σ_{i: x_i ∈ R_m} (y_i − ŷ_{R_m})² + α|T|,
where the collection of rectangular sets {R1, R2, …, R|T|} corresponds to the terminal nodes of T.
The complexity parameter α controls the trade-off between the subtree's size and its goodness of
fit to the training data. Small values of α result in bigger trees. If α = 0, then C0(T) is just the
usual SSE for an unpruned tree, and hence the tree that minimizes C0(T) can be chosen to be the
fully grown tree T0. For any fixed α > 0, it can be shown that there is a unique smallest subtree Tα
that minimizes Cα(T). It can be shown that the subtrees Tα are nested: as α increases, branches are
pruned off in a predictable, nested fashion.
While the above is a very complicated procedure, the R programming language has a function
called prune.tree in the R library tree that returns the best pruned tree by entering either the
value of α or the number of terminal nodes of the pruned tree. So in practice, for any T0, the
series of {α, Tα} can be obtained very easily.
After we have arrived at the series of best pruned trees {Tα}α≥0, we can use cross-validation as
illustrated in Section 4 of Chapter 1 to estimate the test MSE for each of these trees and pick the
final best one that minimizes the test MSE. In more detail, the process is as follows:
Algorithm:
1) Divide the training dataset into K folds.
2) For each k = 1, 2, …, K:
(i) Use recursive binary splitting to grow a large tree on all but the k-th fold of the dataset.
(ii) Obtain the series of best pruned subtrees of this large tree as a function of α.
(iii) Evaluate the mean squared prediction error on the data in the left-out k-th fold, as a
function of α.
3) For every α, calculate the mean of the K MSEs computed in 2). This is an estimate of the test
MSE for each α; pick the α (or tree size) that minimizes this estimate.
The R function cv.tree in library tree performs the above and reports the series of test
MSEs based on the input K (whose default value is 10).
The whole process also applies to classification trees. The objective to be minimized can be based
on G, D or the misclassification rate E. R can perform cross-validation based on either D or E.
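In R the whole pruning workflow is only a few calls in the tree library; a sketch, where df and the formula y ~ . are placeholders for your own data:

library(tree)
fit <- tree(y ~ ., data = df)           # grow a large tree
cv  <- cv.tree(fit, K = 10)             # 10-fold CV over the alpha / tree-size sequence
best_size <- cv$size[which.min(cv$dev)] # size with the smallest estimated test error
pruned <- prune.tree(fit, best = best_size)
plot(pruned); text(pruned)
# for a classification tree, use cv.tree(fit, FUN = prune.misclass) instead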
4 Ensemble Methods
Ensemble methods refer to methods that combine the predictions of many models. They
are generally applicable to most of the models in machine learning. In this section we use CART
to illustrate how three ensemble methods can be applied to improve prediction accuracy.
Bagging
Before we discuss this method, you need to know the meaning of bootstrap sampling: given a
set of n observations, we create a new random sample of size n by drawing n observations from
the original observations with replacement.
For example, for n = 10, we can draw three bootstrap samples as follows:
Original data 1 2 3 4 5 6 7 8 9 10
Round 1 5 6 9 4 2 6 7 6 3 2
Round 2 7 10 9 4 8 1 8 8 9 10
Round 3 7 1 6 4 5 8 5 1 2 3
Algorithm:
For i = 1 to B do
1) Generate a bootstrap sample of the original n observations.
2) Train an unpruned tree y = f̂*i(x) based on the bootstrap sample.
End
A prediction on x is obtained by averaging the results from the B trees for regression problems:
ŷ = (1/B) Σ_{i=1}^B f̂*i(x).
The averaging reduces the variance of the final prediction. For classification problems, the final
prediction is the majority vote of the B classifications.
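A direct translation of the bagging algorithm into R, again using the tree library; df (with response y) and the one-row data frame x0 are placeholders:

library(tree)
B <- 100
preds <- replicate(B, {
  idx <- sample(nrow(df), replace = TRUE)     # bootstrap sample of the rows
  fit <- tree(y ~ ., data = df[idx, ])        # tree on the resample (tree() still applies
                                              # its own default stopping rules)
  predict(fit, newdata = x0)
})
mean(preds)   # bagged prediction for regression; take the majority vote for classification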
Remarks:
2) When bagging is used, we need not use cross-validation to estimate the test MSE. Each
original observation has a probability of (1 − 1/n)^n of not being included in a particular
bootstrap sample. So for each bootstrap sample, on average the proportion of the original data
that is not included is (1 − 1/n)^n. For large n, this is approximately e^(−1) ≈ 0.3679. So for each
tree, on average about 37% of the original observations were not used to train the tree. These
unused observations are called out-of-bag (OOB) samples and can be used to
compute the test MSE.
3) Although the B bootstrap samples are drawn independently, if there is a strong predictor in
the data set, along with a number of other moderately strong predictors, then most of the
bagged trees will use this strong predictor in the top split and hence the bagged trees will
look similar and the resulting predictions will be highly correlated.
Random Forests
In bagging, a randomness is introduced by using bootstrap sampling. Random forests also use
bootstrap sampling, but add another layer of randomness: random input vectors.
The random input vector method means that at each node, the best split is chosen from a random
sample of m predictors instead of the full set of p predictors. The set of m predictors varies from
node to node. Typically, we choose m ≈ √p. The random input vector method decorrelates the bagged
trees so that the variance can be further reduced.
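In practice both bagging and random forests are available from the randomForest package, with bagging simply being the special case m = p; a sketch with placeholder data df:

library(randomForest)
p <- ncol(df) - 1                                               # number of predictors
bag <- randomForest(y ~ ., data = df, mtry = p)                 # bagging: all p predictors per split
rf  <- randomForest(y ~ ., data = df, mtry = floor(sqrt(p)))    # random forest: m ~ sqrt(p)
bag; rf    # the printout includes the OOB estimate of the error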
Boosting
In this method, instead of growing a series of trees based on different bootstrap samples, we
build trees in a sequential manner: The new tree depends on all the previously grown trees.
Boosting (as its name suggests) involves building B trees, each of which has only a few splits.
Given the current model, we fit a tree to the current residuals, rather than to the outcome. We then
add this new tree into the fitted function in order to update the residuals. By only fitting small
trees in each step, we boost up the performance slowly so that overfitting does not occur.
Algorithm:
1) Set f̂(x) = 0 and r_i = y_i for all i.
2) For b = 1, 2, …, B:
(i) Fit a tree f̂^b(x) with d splits to the data set {(x_i, r_i)}.
(ii) Update the fitted function by adding a shrunken version of the new tree: f̂(x) ← f̂(x) + λ f̂^b(x).
(iii) Update the residuals: r_i ← r_i − λ f̂^b(x_i).
3) Output the boosted model f̂(x) = Σ_{b=1}^B λ f̂^b(x), where λ > 0 is a small shrinkage parameter.
Classification trees can also be boosted in a slightly more complicated way, and the details are
omitted. You can refer to section 14.5 of Applied Predictive Modeling (algorithm 14.2) for the
adaptive boosting method.
Chapter 4. Variable Selection and Shrinkage Methods
Related Readings …
Learning Objectives
Regularization methods including ridge and LASSO regression, principal
component regression and partial least squares regression
1 Subset Selection
In this section we review the classical predictor selection problem. In data science this is usually
called "feature selection" or "feature engineering". Suppose there are p predictors x1, x2, …, xp.
How can we select the predictor(s) to be included in a linear regression model? Traditionally
there are four methods:
Best subset selection
Forward stepwise selection
Backward elimination (called backward stepwise selection in the text)
Stepwise selection
When we rank models with a different number of predictors, we cannot use R² or the training MSE
because, for a series of nested models, R² increases (and the training MSE decreases) as predictors are added.
Remarks:
1) The AIC (Akaike information criterion) is based on the likelihood function of the model. In
its general form, AIC = 2k − 2l for a model with k parameters and maximized log-likelihood l.
The smaller the AIC, the better is the model.
It can be shown that under the assumptions of the Gauss-Markov theorem,
AIC = constant + (1/(n σ̂²)) (SSE + 2 i σ̂²)
for models with i predictors. Here σ̂² is the estimated variance of the error term for the full
model.
2) Mallow's Cp is defined as
C_p = \frac{1}{n}\left(\text{SSE} + 2i\hat{\sigma}^2\right).
Minimizing AIC is the same as minimizing Cp. Also, it can be shown that Cp is an unbiased
estimator of the test MSE.
3) The BIC (Bayesian information criterion) is based on a Bayesian argument. In its general
form, BIC = (ln n)k − 2l for a model with k parameters and maximized log-likelihood l. The
smaller the BIC, the better the model.
It can be shown that for a linear regression model with i predictors,
\text{BIC} = \text{constant} + \frac{1}{n}\left[\text{SSE} + (\ln n)\, i\, \hat{\sigma}^2\right].
Since ln n > 2 for n > 7 (which nearly always holds), BIC generally penalizes models with
more predictors more heavily than AIC or Cp does.
4) The adjusted R2 is defined as
\text{Adjusted } R^2 = 1 - \frac{\text{SSE}/(n - i - 1)}{\text{TSS}/(n - 1)}, \qquad \text{where } \text{TSS} = \sum (y_i - \bar{y})^2.
Though popular, there is no rigorous statistical theory that supports using it to rank models.
5) To perform best subset selection, a total of 2^p models have to be considered! Say p = 20.
Then 2^p = 1,048,576, which is computationally infeasible. (See the R sketch after these remarks.)
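As a hedged illustration, best subset selection and the criteria above can be computed in R with the leaps package; the data frame dat and the choice nvmax = 10 are assumptions made for the sketch.

library(leaps)

fit  <- regsubsets(y ~ ., data = dat, nvmax = 10)   # exhaustive (best subset) search
summ <- summary(fit)
which.min(summ$cp)               # model size minimizing Mallow's Cp
which.min(summ$bic)              # model size minimizing BIC
which.max(summ$adjr2)            # model size maximizing adjusted R^2
coef(fit, which.min(summ$bic))   # coefficients of the BIC-best model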
Remarks:
1) Forward stepwise selection involves fitting
1 + \sum_{i=0}^{p-1}(p - i) = 1 + \frac{p(p+1)}{2}
models. For p = 20, we have to fit only 211 models.
2) However, there is no guarantee that the best possible model out of all 2^p models will be
found by this greedy algorithm, simply because it is “too greedy”. Say p = 3, and suppose the
best one-predictor model is the one with x1, while the best two-predictor model is the one with
x2 and x3. Then forward stepwise selection will fail to find the best two-predictor model,
because every model on its path must contain x1.
3) If p > n, then forward stepwise selection has to stop at the model with n − 1 predictors.
? Example 1.1
Consider the credit data set in Section 3.3 of the textbook (also presented in the companion Excel
worksheet). Perform forward stepwise selection based on BIC and show that the model with the
four predictors cards, income, student and limit is better than the model favored by forward
stepwise selection.
Remarks:
1) Backward elimination again involves fitting 1 + p(p + 1)/2 models.
2) Backward elimination is again a greedy algorithm. It may not yield the best model.
3) If p > n, then backward elimination does not work, because the full model cannot be fitted.
(An R sketch of both forward and backward selection follows these remarks.)
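The same regsubsets interface from the leaps package covers the stepwise variants; the sketch below is illustrative only, with the same assumed data frame dat.

library(leaps)

fwd <- regsubsets(y ~ ., data = dat, nvmax = 10, method = "forward")
bwd <- regsubsets(y ~ ., data = dat, nvmax = 10, method = "backward")
which.min(summary(fwd)$bic)   # best model size along the forward path
which.min(summary(bwd)$bic)   # best model size along the backward path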
Remark:
Such a bidirectional search attempts to explore a larger space of possible models than forward
stepwise selection or backward elimination alone. However, it is known empirically that such a
procedure tends to overfit.
2 Shrinkage Methods
The automatic subset selection methods described in Section 1 are known to perform poorly, both
in terms of selecting the true model and in terms of estimating the regression coefficients. While
subset selection remains the dominant method in medical research, it has largely been abandoned
in statistics and data science.
In classical regression, the regression coefficients are unconstrained. They can explode and are
hence susceptible to very high variance (recall the polynomial regression example on p.8). Also,
it is impossible to obtain an estimated coefficient that is exactly zero. Shrinkage is an approach
that involves fitting a model with all p predictors, but with the estimated coefficients regularized:
they are shrunken towards zero relative to the OLS estimates. Depending on the form of the
shrinkage penalty, some of the coefficients may be estimated to be exactly 0.
Ridge Regression
Ridge regression is the most commonly used method of regularization. Let the predictors x and
the output y be mean-centered. We introduce the penalized SSE as the objective function for
minimization:
\text{PSSE} = \sum_{i=1}^{n}\left(y_i - \beta_1 x_{i1} - \cdots - \beta_p x_{ip}\right)^2 + \lambda \sum_{j=1}^{p}\beta_j^2 = \text{SSE} + \lambda \lVert \boldsymbol{\beta} \rVert_2^2,
where λ ≥ 0 is a tuning parameter. The first part seeks regression coefficients that fit the data
well, while the second part is called a shrinkage penalty. The value of λ controls how much
penalty is added when the regression coefficients are non-zero. As λ grows, the impact of the
shrinkage penalty grows, and the regression coefficients approach zero.
Remarks:
1) (Out of Exam SRM syllabus) Parameter estimate:
Let the design matrix be
\mathbf{X} = \begin{pmatrix} x_{11} & x_{12} & \cdots & x_{1p} \\ x_{21} & x_{22} & \cdots & x_{2p} \\ \vdots & \vdots & & \vdots \\ x_{n1} & x_{n2} & \cdots & x_{np} \end{pmatrix}.
The OLS estimate is β̂ = (X′X)⁻¹X′y, which does not always exist. The ridge regression estimate is
\hat{\boldsymbol{\beta}}(\lambda) = (\mathbf{X}'\mathbf{X} + \lambda \mathbf{I}_{p\times p})^{-1}\mathbf{X}'\mathbf{y} = \mathbf{W}\mathbf{X}'\mathbf{y}, \qquad \text{where } \mathbf{W} = (\mathbf{X}'\mathbf{X} + \lambda \mathbf{I}_{p\times p})^{-1}.
(This is similar to the OLS estimate, but with the addition of a “ridge” of λ's down the diagonal.)
2) Bias-variance tradeoff:
It can be shown that (out of Exam SRM syllabus)
\text{Bias}[\hat{\boldsymbol{\beta}}(\lambda)] = -\lambda \mathbf{W}\boldsymbol{\beta} \quad \text{and} \quad \text{Var}[\hat{\boldsymbol{\beta}}(\lambda)] = \sigma^2 \mathbf{W}\mathbf{X}'\mathbf{X}\mathbf{W}.
Based on these, we can prove that the bias increases (in absolute value) with λ, and that
Var[β̂_j(λ)] ≤ Var(β̂_j) for λ ≥ 0. So, ridge regression reduces variance at the expense of
introducing bias.
Actually, a more surprising result is that there always exists a λ > 0 such that MSE[β̂(λ)] <
MSE(β̂)! We can always do better than OLS by shrinking towards zero.
3) Standardization: Alternatively, one can include an intercept and minimize
\text{PSSE} = \sum_{i=1}^{n}\left(y_i - \beta_0 - \beta_1 x_{i1} - \cdots - \beta_p x_{ip}\right)^2 + \lambda \sum_{j=1}^{p}\beta_j^2
based on the raw data, which gives the same solution for β1, β2, …, βp. Note that we do not
shrink the intercept term because it simply measures the mean level of the response.
As in PCA and clustering, the scaling of the predictors affects the final estimates: multiplying
x_j by a constant c does not simply scale β̂_j by 1/c, because of the shrinkage penalty. Put
another way, the estimation problem is not scale-invariant. Therefore, it is best to standardize
the predictors and center y before performing ridge regression.
? Example 2.2
Consider again the credit data set in Section 3.3 of the textbook. Perform ridge regressions for
λ = 0, 1, 50 and 100 using the Real Stat Excel add-in function RidgeCoeff. Do this for both the
raw data and the standardized data. Compare the coefficients, and also the L2 norm of β̂(λ) for
the case of standardized data.
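An R alternative to the Excel add-in is glmnet with alpha = 0, shown here only as a sketch. The predictor matrix X and response y are assumed names, and note that glmnet scales its penalty by 1/(2n), so its lambda values are not directly comparable to the λ = 0, 1, 50, 100 used in the example.

library(glmnet)

ridge <- glmnet(X, y, alpha = 0, lambda = c(100, 50, 1, 0))  # alpha = 0: ridge penalty
b     <- as.matrix(coef(ridge))    # coefficients, one column per lambda (original scale;
                                   # glmnet standardizes the predictors internally by default)
sqrt(colSums(b[-1, ]^2))           # L2 norm of the slope coefficients for each lambda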
4) Geometric interpretation:
An equivalent formulation to minimizing the PSSE is to minimize the usual SSE subject to the
constraint
\sum_{j=1}^{p}\beta_j^2 \le s.
(Here s is decreasing in λ. If λ = 0, then s = ∞.) The equivalence follows from the method of
Lagrange multipliers.
[Figure: contours of the SSE around the OLS estimate in the (β1, β2) plane, together with the circular ridge constraint region β1² + β2² ≤ s of radius s^{1/2}.]
When p = 3, the ellipses become ellipsoids and the circular constraint region becomes a sphere.
Lasso Regression
The full name of the Lasso is “Least Absolute Shrinkage and Selection Operator”. We introduce
the following PSSE:
\text{PSSE} = \sum_{i=1}^{n}\left(y_i - \beta_1 x_{i1} - \cdots - \beta_p x_{ip}\right)^2 + \lambda \sum_{j=1}^{p}\lvert\beta_j\rvert = \text{SSE} + \lambda \lVert \boldsymbol{\beta} \rVert_1,
where λ ≥ 0 is a tuning parameter. This type of penalty was first introduced in 1986 in
geophysics, and was rediscovered and popularized in 1996 by Tibshirani.
Remarks:
1) Again, x and y in the formulation above are mean-centered. Alternatively, one can consider
\text{PSSE} = \sum_{i=1}^{n}\left(y_i - \beta_0 - \beta_1 x_{i1} - \cdots - \beta_p x_{ip}\right)^2 + \lambda \sum_{j=1}^{p}\lvert\beta_j\rvert.
? Example 2.1 (continued)
Let the predictors be standardized and the design matrix be X = I_{p×p}. Show that
\hat{\beta}_j(\lambda) = \begin{cases} y_j - \lambda/2 & \text{for } y_j > \lambda/2 \\ 0 & \text{for } -\lambda/2 \le y_j \le \lambda/2 \\ y_j + \lambda/2 & \text{for } y_j < -\lambda/2 \end{cases}.
Plot the OLS, ridge and Lasso regression estimates of the coefficient of x_j on the same graph (as
a function of y_j) and compare the results. This example illustrates the soft thresholding behaviour
of the Lasso.
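A small R sketch of this comparison is given below. The ridge curve y_j/(1 + λ) follows from the earlier formula β̂(λ) = (X′X + λI)⁻¹X′y with X = I; the grid of y_j values and the choice λ = 1 are arbitrary illustrations.

ols_est   <- function(yj)         yj
ridge_est <- function(yj, lambda) yj / (1 + lambda)                       # smooth shrinkage
lasso_est <- function(yj, lambda) sign(yj) * pmax(abs(yj) - lambda/2, 0)  # soft thresholding

yj <- seq(-3, 3, by = 0.01)
plot(yj,  ols_est(yj),      type = "l", ylab = "coefficient estimate")
lines(yj, ridge_est(yj, 1), lty = 2)   # shrunk towards 0, but never exactly 0
lines(yj, lasso_est(yj, 1), lty = 3)   # exactly 0 whenever |yj| <= lambda/2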
3) Geometric interpretation: As with ridge regression, minimizing the Lasso PSSE is equivalent
to minimizing the SSE subject to the constraint Σ_{j=1}^p |β_j| ≤ s.
[Figure: contours of the SSE around the OLS estimate in the (β1, β2) plane, together with the diamond-shaped Lasso constraint region |β1| + |β2| ≤ s.]
It turns out that the Lasso constraint region has corners on each of the two axes, and hence the
elliptical SSE contour will often touch the constraint region at an axis. When this occurs, one of
the β's equals zero. In higher dimensions, the diamond becomes a polytope and many of the
coefficients may equal zero simultaneously. This means the solution is sparse: the Lasso
conducts feature selection.
4) Lasso regression has a close connection to Bayesian linear regression. With the prior
distribution
β_j ~ Lap(0, b), where b = 2σ²/λ, for j = 1, 2, …, p,
and β1, β2, …, βp being mutually independent, it can be shown that the Lasso regression
estimate maximizes the posterior density
f(β | X, y) ∝ f(y | X, β) f(β),
and is hence the posterior mode for β. The proof of this result is similar to the one shown in
the appendix for ridge regression. In this case, though, the Lasso regression estimate is not
the posterior mean for β.
Note: the Laplace distribution Lap(μ, b) has density
f(x) = \frac{1}{2b}\exp\left(-\frac{\lvert x - \mu\rvert}{b}\right) \quad \text{for real } x.
? Example 2.3
Consider again the credit data set in Section 3.3 of the textbook. Perform Lasso regressions for
various values of λ on the standardized data, using the VBA user-defined function
Lasso_Regression. (Note: this function does not report standard errors.) Compare the
coefficients β̂(λ) with the ridge regression coefficients.
We can use cross-validation to select the optimal value of λ in ridge or Lasso regression. The test
MSE is estimated for a range of values of λ, and the value of λ that gives the smallest estimated
test MSE is the optimal λ. A final model based on the optimal λ is then fitted using all observations.
(Out of Exam SRM syllabus) For ridge regression, there is again a shortcut formula that computes
the LOOCV estimate of the test MSE by fitting only one full model. Let H = XWX′. Then
\text{CV}_{(n)} = \frac{1}{n}\sum_{i=1}^{n}\left(\frac{y_i - \hat{y}_i}{1 - h_i}\right)^2 \approx \frac{1}{n}\sum_{i=1}^{n}\left(\frac{y_i - \hat{y}_i}{1 - \operatorname{tr}(\mathbf{H})/n}\right)^2.
The approximation (called the generalized cross-validation error, or GCV error) is very often used
because h_i = H_{ii} is hard to compute for large p, but there are formulas that give tr(H) without
computing H.
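A base-R sketch of this shortcut is shown below, for illustration only; X is the centered or standardized predictor matrix and y the centered response, both assumed names.

ridge_loocv <- function(X, y, lambda) {
  W    <- solve(crossprod(X) + lambda * diag(ncol(X)))  # W = (X'X + lambda I)^(-1)
  H    <- X %*% W %*% t(X)                              # hat matrix H = XWX'
  yhat <- drop(H %*% y)                                 # fitted values
  h    <- diag(H)
  n    <- length(y)
  c(loocv = mean(((y - yhat) / (1 - h))^2),             # exact LOOCV shortcut
    gcv   = mean(((y - yhat) / (1 - sum(h) / n))^2))    # GCV approximation
}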
In PCR, constraining the form of the regression coefficients in this way has the potential to bias
the coefficient estimates unless M = p. However, when p is much greater than n, using an M that
is much less than p can significantly reduce the variance and avoid overfitting, at the expense of
increased bias.
Remarks:
1) PCR is not a feature selection method, because each of the M components is a linear
combination of all p original predictors. Hence, unlike Lasso regression, PCR does not yield
a more interpretable model.
2) Methods to determine M were discussed in Chapter 1. An alternative is to treat M as a tuning
parameter and choose it using cross-validation.
3) Whether PCR works well depends on whether the variability of the original predictors can be
summarized by a small number of PCs, and whether the directions in which the original
predictors show the most variation are also the directions associated with y.
4) Apart from reducing dimensionality, the most important use of PCR is to resolve the problem
of multi-collinearity in multiple regression. When multi-collinearity occurs, the OLS
estimates are unbiased, but suffer from large standard errors. By transforming the predictors,
PCR eliminates the problem of multi-collinearity because the columns of z’s are uncorrelated.
5) PCR has a close relation with ridge regression. In fact, one can even think of ridge
regression as a continuous version of PCR. The details are, however, beyond the scope of
this course.
? Example 3.1
Consider the data set in the companion worksheet. Significant multi-collinearity exists in the
three predictors. Conduct a PCR analysis.
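A hedged R version of such an analysis uses the pls package; the data frame dat with response y is an assumed name, and ncomp = 2 is only an illustration.

library(pls)

pcr_fit <- pcr(y ~ ., data = dat, scale = TRUE, validation = "CV")
summary(pcr_fit)                            # variance explained and CV error by number of PCs
validationplot(pcr_fit, val.type = "MSEP")  # choose M from the CV curve
predict(pcr_fit, newdata = dat, ncomp = 2)  # predictions using M = 2 components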
Partial least squares regression (PLSR) is a technique that originated in economics but was
popularized by applications in computational chemistry. It is a regression method that takes into
account the latent structure in both x and y: the latent structure corresponding to the most
variation in y is extracted, and is explained by the latent structure of x that explains it best.
PLSR can be thought of as a supervised version of PCR. In Remark 3 above, we noted that the
loading vectors φ1, φ2, …, φM are determined in an unsupervised way: the PCs explain the
observed variability in the predictor variables without considering the response variable at all.
PLSR, on the other hand, takes the response variable into account, and therefore often leads to
models that can fit the response with fewer components.
In PLSR, the loading factors are obtained such that the covariances of y and the resulting linear
combinations of x (instead of the variance of the linear combination of x) are maximized.
The textbook does not give the detailed procedure for fitting PLSR because the modeling
assumption of PLSR is quite complicated. Here I follow the textbook’s heuristic description of
the “algorithm” for determining the “factor loadings”. The actual implementation is different and
is much quicker than this heuristic algorithm. Both the predictors and the response are assumed
to be mean-centered, or standardized.
They are uncorrelated with the first and second series of z scores.
4) Continue until the Mth loading and series of z scores are obtained.
Finally, run an OLS regression on the original response, with the M series of z scores as
regressors.
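For comparison, the pls package fits PLSR with an almost identical interface to the PCR sketch above. This is again a sketch with assumed names, with M chosen by cross-validation in line with Remark 3 below.

library(pls)

plsr_fit <- plsr(y ~ ., data = dat, scale = TRUE, validation = "CV")
summary(plsr_fit)                            # variance explained and CV error by number of components
selectNcomp(plsr_fit, method = "onesigma")   # one heuristic for choosing M from the CV error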
Remarks:
1) Just like PCR, PLSR deals with multi-collinearity.
2) PLSR can model several response variables at the same time (observation i has not one
response yi, but a vector of q responses yi (yi1, yi2, …, yiq)).
3) M is determined by cross-validation.
4) In practice, the optimal number of components M used in PLSR can be much smaller than the
optimal M used in PCR for the same data set. However, the overall performance of PLSR is
no better than that of ridge regression or PCR. The textbook suggests that this is because PLS
reduces bias but also has the “potential” to increase variance.
Afterword
I believe that there are three kinds of people who work with predictive models: scrupulous data
scientists, unscrupulous data scientists, and novices.
Both scrupulous and unscrupulous data scientists are experts in modeling. They know the bag of
tricks that can twist a model until it yields and confesses whatever result is desired. Scrupulous
ones let the data speak for themselves; unscrupulous ones build the desired answer into the
model's output. Novices are not yet experienced. They do not know how to use models
and sometimes pick models wrongly. The following comes from a presentation with the
interesting title “How to Choose the Wrong Model” by Scott L. Zeger. Anyone who analyzes
data at work should bear the following in mind:
1) Forget about this important fact: There is not one best model because there isn’t one model
that works well in every possible aspect; models are useful but they are not the truth.
2) Turn your scientific problem over to a computer that, knowing nothing about your science or
your question, is very good at optimizing AIC, BIC or Cross-validated test MSE.
3) Turn your scientific problem over to your neighborhood statistician, who, knowing nothing
about your science or your question, is very good at optimizing ABC.
4) Choose a model that presumes the answer to your scientific question and ignores what the
data say.
[Note: Unscrupulous data scientists!]
5) Choose a model that no one can understand: not the people who did the study, not their
readers, not even yourself.
6) Use (e.g.) the Cox proportional hazards “model” because it has been used in 14,154 articles
cited in PubMed; how could so many people be wrong?
[Notes: The Cox PH model is a popular model in biostatistics that deals with the modeling of
mortality rate of patients. PubMed is a free search engine accessing the MEDLINE database
of life sciences and biomedical topics.]
Appendix: Bayesian Interpretation of Ridge Regression
The prior distribution of β has density
f(\boldsymbol{\beta}) = \prod_{i=1}^{p}\frac{1}{\sqrt{2\pi\sigma^2/\lambda}}\exp\left(-\frac{\lambda\beta_i^2}{2\sigma^2}\right), \qquad \boldsymbol{\beta}\in\mathbb{R}^p.
Because the n irreducible errors ε1, ε2, …, εn are i.i.d. N(0, σ²), the model distribution is
f(\mathbf{y}\mid\mathbf{X},\boldsymbol{\beta}) = \prod_{j=1}^{n}\frac{1}{\sqrt{2\pi\sigma^2}}\exp\left(-\frac{(y_j - \beta_1 x_{j1} - \beta_2 x_{j2} - \cdots - \beta_p x_{jp})^2}{2\sigma^2}\right).
Multiplying the two densities gives
f(\mathbf{y}\mid\mathbf{X},\boldsymbol{\beta})\, f(\boldsymbol{\beta}) = \frac{\lambda^{p/2}}{(2\pi\sigma^2)^{(n+p)/2}}\exp\left[-\frac{1}{2\sigma^2}\,\text{PSSE}(\boldsymbol{\beta})\right],
so the posterior density f(β | X, y) is proportional to exp[−PSSE(β)/(2σ²)].
To maximize the posterior density, β should be such that PSSE(β) attains its minimum. This
means that the posterior mode is the ridge regression estimate. Since the posterior distribution of
β is multivariate normal, the posterior mean equals the posterior mode.