CHAPTER 1 Elements in Predictive Analytics
C.Y. Ng

Related Readings …

An Introduction to Statistical Learning: with Applications in R
  Sections 2.1 – 2.2, 3.5, 5.1, 7.1, 10.2

Applied Predictive Modeling
  Sections 3.2 – 3.5, 4.1 – 4.4, 5.1 – 5.2

Learning Objectives
Idea of predictive modeling, linear regression and its limitations, KNN
regression, bias-variance trade-off, cross-validation, principal component
analysis

1 Introduction

Predictive analytics is a field in machine learning. In many daily life problems, we are given a
series of n observations, each of which looks something like this:
(x1, x2, …, xp) → y
where
x = (x1, x2, …, xp), and each xk is called a predictor / independent variable / attribute
/ descriptor,
y is a response / dependent variable / output / target.
Each of the xk may (or may not) contain some “explanatory power” in predicting y. Given the n
observations, we want to build a model that can predict y based on x.

Depending on the nature of y, there are two kinds of models that can be built:

1) If y is quantitative (continuous), then the model is called a regression;
2) If y is qualitative (categorical), then the model is called a classification.
Let us consider some real-life examples in business.


Regression model (Investments)


In investment performance analysis, one aims to predict the return on a stock given some
features of the underlying company (e.g. P/E ratio, revenue, volatility, correlation with the
general market). You may have heard of the famous CAPM and the Fama–French 5-factor
model for predicting a stock’s return.

Classification model (Banking and Insurance)


Credit scoring is a system that predicts the credit rating of a mortgagor based on his profile
(e.g. income, field of employment, spending, risk appetite, credit history).
Similar systems can be found in the underwriting of automobile insurance. Given the
driving history and indicators (e.g. gender, age, income), a driver’s risk level can be
predicted and the corresponding premium can be obtained from a table.

In its most general form, a predictive model can mathematically be written as

y = f(x) + ε,

where f is some fixed but unknown function, and ε is an error term that reflects the variation of y
that is not captured by x. The goal of predictive analytics is to find the “best” f.

In this course, we are going to discuss different classes of models for f, and the meaning of being
the “best”. We will implement models using Excel and R. For some models, f is a simple
function. However, in some models f exists only as a graphical representation of an algorithm.
The process of building f can sometimes be expressed as a problem of parameter estimation. But
since there are not always parameters to estimate, we use the term “model training” rather than
“model fitting” to describe the process of building f.

The n observations are usually expressed in the vector form

\[
X = \begin{pmatrix} \mathbf{x}_1' \\ \mathbf{x}_2' \\ \vdots \\ \mathbf{x}_n' \end{pmatrix}
  = \begin{pmatrix}
      x_{11} & x_{12} & \cdots & x_{1p} \\
      x_{21} & x_{22} & \cdots & x_{2p} \\
      \vdots & \vdots &        & \vdots \\
      x_{n1} & x_{n2} & \cdots & x_{np}
    \end{pmatrix}
\quad\text{and}\quad
\mathbf{y} = \begin{pmatrix} y_1 \\ y_2 \\ \vdots \\ y_n \end{pmatrix}.
\]
In many cases we do not use all observations for building f but withhold some for model
validation. The subset of observations used for model training is called the training data set,
and the observations withheld are called the test data set or validation data set. We will
elaborate on this when we discuss model validation in Section 4.

Before we start our discussion of the first model for f, let us have a quick recap of the simplest
predictive model: linear regression.


2 Linear Regression: A Recap

Recall from elementary statistics the following setting of linear regression:

Training data set: n observations of the form (x1, x2, …, xp) → y.

The form of f: a linear function f(x) = β0 + β1x1 + β2x2 + … + βpxp.

The problem of training f is the same as estimating the p + 1 unknown coefficients β0, β1, …, βp.

The quality of the fit can be assessed by using the mean square error (MSE)
\[
\mathrm{MSE} = \frac{1}{n}\sum_{i=1}^{n}(y_i - \hat{y}_i)^2,
\quad\text{where } \hat{y}_i = f(\mathbf{x}_i) = \beta_0 + \beta_1 x_{i1} + \beta_2 x_{i2} + \cdots + \beta_p x_{ip}.
\]

When p < n, there is a unique solution that minimizes the MSE. You may recall the least-squares
solution β̂ = (X′X)⁻¹X′y, where the definition of X here is slightly different from the one given in Section 1:
\[
X = \begin{pmatrix}
      1 & x_{11} & x_{12} & \cdots & x_{1p} \\
      1 & x_{21} & x_{22} & \cdots & x_{2p} \\
      \vdots & \vdots & \vdots &  & \vdots \\
      1 & x_{n1} & x_{n2} & \cdots & x_{np}
    \end{pmatrix}.
\]
The fitted values are ŷ = Xβ̂ = X(X′X)⁻¹X′y = Hy.

However, when p ≥ n, there can be infinitely many solutions for β̂.

Example 2.1
Consider the linear regression problem with data
(1, 2, 3, 4, 1) → 13.4
(2, 2, 1, 5, 0) → 6.1
(3, 3, 2, 1, 2) → 16.7
Here, p = 5 and n = 3. Show that two solutions for β̂ such that the MSE is zero are
(0, 1.155403, 0.55734, 2.674513, 0, 3.10637)
and
(1, 1.155403, 0.128767, 2.531655, 0, 3.392083).

Use these two betas to predict y for x = (1, 1, 1, 0, 2).


There are three limitations of the setting of linear regression.

1) Why is f linear? Is there strong evidence to support linearity? As the number of data points
increases, it becomes easier to reject the linearity assumption.

Even if one can live with the linear assumption…

2) We have arrived at the era of big data. Typically, for each observation, the number of
predictors is huge; that is, p ≫ n. Just think of using genes to predict cancer. There can well
be more than 10,000 genes that are “potentially” related to lung cancer. However, in a
clinical trial, the number of patients is limited. In such cases there will be infinitely many
solutions of β that make MSE = 0.

3) One subtle problem is that many of the p predictors in x have very low, or even no, predictive
power, because they are collected in a routine manner, or are strongly correlated with some
other predictors in the same x (so that including one of them is sufficient). The problem of
variable selection becomes extremely difficult in such a setting.

In view of these three limitations, the least-square solution is not useful under a big data setting.
(Good news?) Also, if y is a qualitative variable, then a linear regression simply does not make
sense.

What Are You Going to Learn?

In this course we will deal with the problems above. We will discuss

1) Non-linear regression: K-nearest neighbor (KNN) regression (chapter 1), classification and
regression tree (CART) (chapter 3)
2) Dimensionality reduction: principal component analysis (PCA) (chapter 1)
3) Subset selection, penalized least squares regression including ridge and Lasso regression,
partial least squares (chapter 4)

We will also discuss separately clustering, a method of exploratory data analysis (EDA), in
chapter 2.

Why do we need so many models or procedures? A modeling procedure that can fit many
different possible true functional forms for f is said to be flexible. It provides a better fit to the data,
but may be complex or may involve a lot of parameters, and is in general harder to interpret. A
model that provides too much freedom (that is, one that is too flexible) may follow the noise too
closely rather than the essential features of the predictors, leading to overfitting. As a result, we
should choose a model that is flexible enough to handle the data, but not overly flexible.


3 KNN Regression

KNN regression stands for K-nearest neighbors regression. It is a very simple procedure for non-
linear regression.

KNN
Predictions are made for a new x by searching through the entire training data set for the K most
similar cases (neighbors) and summarizing the output variables for those K cases.

Consider the following data set:


x1 → y1
x2 → y2
⁝
xn → yn

For a given x, to compute its ŷ , we

Algorithm:
1) Calculate the distance between xi and x for every i.
2) Find the K data points which are closest to x.
3) Compute the mean of the y’s from the K data points in 2) for a regression problem, or find
the mode from the K data points in 2) for a classification problem.

See the two illustrations in the first two tabs of the companion Excel worksheet for a regression
problem and a classification problem.

Remarks:

(a) There are various types of distances. The Euclidean distance between two points
a = (a1, …, ap) and b = (b1, …, bp), defined as
\[
\|\mathbf{a}-\mathbf{b}\|_2 = \sqrt{\sum_{i=1}^{p}(a_i-b_i)^2},
\]
is the most commonly used when all predictors are similar in type. The city-block distance
\[
\|\mathbf{a}-\mathbf{b}\|_1 = \sum_{i=1}^{p}|a_i-b_i|
\]
is a good distance if the predictors are not similar in type (such as age, gender, height).

(b) Ties can happen when some of the distances computed are the same. All of the tied data points
can be included as the “K nearest neighbors” (this is equivalent to expanding K). Alternatively,
the required number of neighbors can be selected at random from the tied data points that are
farthest away. Ties can also happen when computing a mode, and they can be resolved by a
random selection.

FINA 3295 | Predictive Analytics 5


Chapter 1. Elements in Predictive Analytics
C.Y. Ng
Section 3. KNN Regression

(c) K is called a tuning parameter (微調参數). The variance of the predicted value depends on
the value of K. A small value of K provides a flexible fit, but the prediction changes
frequently as x varies. A large value of K provides a smoother and less variable fit, but
such a smoothing effect may introduce a large bias into the prediction. In the next section we will
discuss how the optimal value of a tuning parameter can be obtained.

Example 3.1
Consider the following data from a survey with two attributes to classify whether a paper tissue is
good or not.

x1 (acid durability in s)    x2 (strength in kg/m²)    y
7                            7                         Bad
7                            4                         Bad
3                            4                         Good
2                            3                         Good

A factory produces a new paper tissue with x1 = 3 and x2 = 6. Using KNN with K = 3,
determine the classification of this new tissue.
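As a quick illustration of the three-step algorithm above, here is a minimal base-R sketch applied to the tissue data (the course's companion illustrations use Excel; the function knn_predict and its argument names below are purely illustrative):

```r
# Minimal KNN prediction following the algorithm in this section (illustrative).
knn_predict <- function(X_train, y_train, x_new, K = 3, classify = FALSE) {
  d <- sqrt(rowSums(sweep(X_train, 2, x_new)^2))   # 1) Euclidean distance to x_new
  nbrs <- order(d)[1:K]                            # 2) the K closest data points
  if (classify) {
    names(which.max(table(y_train[nbrs])))         # 3) mode of the neighbours' y
  } else {
    mean(y_train[nbrs])                            # 3) mean of the neighbours' y
  }
}

# Example 3.1 data: (acid durability, strength) -> Good / Bad
X <- matrix(c(7, 7,  7, 4,  3, 4,  2, 3), ncol = 2, byrow = TRUE)
y <- c("Bad", "Bad", "Good", "Good")
knn_predict(X, y, x_new = c(3, 6), K = 3, classify = TRUE)
```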

Quality of Fit
In a linear regression model, the mean square error
\[
\mathrm{MSE} = \frac{1}{n}\sum_{i=1}^{n}(y_i - \hat{y}_i)^2
\]
is used to measure the quality of the fit. For a KNN regression, the same definition can be used to
measure the quality of fit. For a KNN classification, the error rate
\[
\frac{1}{n}\sum_{i=1}^{n} I(y_i \neq \hat{y}_i)
\]
can be used. The smaller the MSE or the error rate, the better the fit.

Note, however, that the quality of fit is not a measure of the predictive power of the model.


4 The Bias-Variance Tradeoff

Training MSE and Test MSE in Regression Problem

Now let us consider a regression problem. In the previous section, we mentioned the mean
square error MSE = (1/n) Σᵢ (yi − ŷi)² as a measure of quality of fit. This MSE is computed using
the training data set that was used to build or fit the model, and should more accurately be
referred to as the training MSE. In doing a prediction, we do not care so much about how well
the model works on the training data but about its performance on a previously unseen data point x0.

To state it more clearly, suppose the training data set is

x1 → y1, x2 → y2, …, xn → yn.

We use the training data set to build a function or algorithm f̂. Based on this we can compute
ŷi = f̂(xi) for i = 1, 2, …, n. If the yi and ŷi are close, then the training MSE will be small. However,
what we really care about is x0 and its predicted value f̂(x0). The predictor x0 is associated with
a true y0. The error that we are concerned with is the test MSE [y0 − f̂(x0)]², which cannot be
computed because y0 is unknown. The optimal model is one that minimizes the test MSE, rather
than the training MSE.

Mathematical Analysis of Test MSE

To analyze the test MSE, we assume for a moment that the outputs (y1, y2, …, yn) and also y0 are
treated as random. It is rather like the outputs have not been realized (observations have not been
made). Since fˆ is computed from the y’s, it is also random.


We consider the expected value of [y0 − f̂(x0)]², that is, E{[y0 − f̂(x0)]²}.
It can be shown that this quantity can be broken down into three components:

The Bias-Variance Tradeoff

\[
E\big\{[y_0 - \hat{f}(\mathbf{x}_0)]^2\big\}
= \operatorname{Var}[\hat{f}(\mathbf{x}_0)] + \big(\operatorname{Bias}[\hat{f}(\mathbf{x}_0)]\big)^2 + \operatorname{Var}(\varepsilon)
\]

Proof (Optional): Recall that the bias of the estimator f̂(x0) is the expected difference between the
quantity that it is estimating (y0) and the estimator itself. So,
\[
\operatorname{Bias}[\hat{f}(\mathbf{x}_0)] = E[y_0 - \hat{f}(\mathbf{x}_0)].
\]
Now, by Var(U) = E(U²) − (EU)²,
\[
E\big\{[y_0 - \hat{f}(\mathbf{x}_0)]^2\big\}
= \operatorname{Var}[y_0 - \hat{f}(\mathbf{x}_0)] + E^2[y_0 - \hat{f}(\mathbf{x}_0)]
= \operatorname{Var}[y_0 - \hat{f}(\mathbf{x}_0)] + \big(\operatorname{Bias}[\hat{f}(\mathbf{x}_0)]\big)^2 .
\]

Finally,
\[
\operatorname{Var}[y_0 - \hat{f}(\mathbf{x}_0)]
= \operatorname{Var}[f(\mathbf{x}_0) + \varepsilon_0 - \hat{f}(\mathbf{x}_0)]
= \operatorname{Var}[\varepsilon_0 - \hat{f}(\mathbf{x}_0)]
= \operatorname{Var}[\hat{f}(\mathbf{x}_0)] + \operatorname{Var}(\varepsilon),
\]
where the last equality follows from the fact that ε0 is the irreducible error associated with the
pair x0 → y0. It is independent of any pair xi → yi and is hence uncorrelated with f̂.

Interpretation of the bias-variance tradeoff

The bias-variance tradeoff states that the test MSE is related to three components:
(a) The variance of the prediction
(b) The bias of the prediction
(c) The variance of the irreducible error
To lower the test MSE, one should control both the variance and the (absolute) bias.

The variance of the prediction refers to the amount by which f̂ would change if we estimated it
using a different training data set (different y’s). Ideally f̂ should not vary too much between
training sets.

The bias of the prediction refers to the error that is introduced by using f̂ to approximate the true
value y. In general, a more flexible model leads to a bias that is closer to zero. But it would also
result in a greater variance of the prediction.

Illustrating the Bias-Variance Tradeoff in Polynomial Regression

The bias-variance tradeoff can be illustrated by considering a polynomial regression on n points
in the xy plane: (x1, y1), (x2, y2), …, (xn, yn).

One extreme is to fit a flat horizontal line y = c to the data set: no matter what the value of x0 is, the
prediction is c. The variance of the prediction is the lowest possible (it is 0!), but the model is
inaccurate, leading to a large bias (in the absolute sense).

Another extreme is to fit a polynomial of degree n − 1 of the form

y = b0 + b1x + b2x² + … + b_{n−1}x^{n−1}.

The n coefficients can be obtained from the n data points. The bias is zero because there is
always a polynomial that passes through the n points. However, the estimates of the bi’s are highly
sensitive to the data set. For an illustration, open the tab “Fit poly” of the companion Excel
worksheet and press F9 to see how sensitive the bi’s are when the 4th and the 8th data points
wobble. Compare this with the bi’s for a simple linear regression equation. Also look at the
predicted values for x0 = 0.7 and x0 = 1.07.


Statistical Estimation of Test MSE

Because y0 is unknown, the test MSE cannot be computed. However, we can estimate it by a
method called cross-validation (also known as rotation estimation 交叉驗證 / 循環估計).

The main idea about cross-validation is to split the known data set into two subsets: a training
data set on which the model is built, and a validation data set (aka hold-out set) against which the
model is tested.

Algorithm:

For each of the k rounds of the cross-validation, we do the following:


1) Train the model using the training data set.
2) For each of the j observations in the validation set (say, xm), calculate the predicted value
f̂(xm) using the model in 1).
3) Estimate the test MSE using (1/j) Σₘ [ym − f̂(xm)]², summing over the j validation observations.

Finally, take average of the k test MSEs to arrive at a final estimate of the test MSE.

There are many ways to construct the k rounds above. Here we name three.

1) Holdout method (k = 1)
The n observations are randomly split into two subsets of unequal sizes. Say, if n = 205,
then the training data set can be any 155 randomly drawn observations, and the validation
data set is the remaining 50 observations. Only a single model is fitted, so computationally
this method is not cumbersome. However, the test MSE obtained depends very heavily on
the random splitting and hence is highly variable. Also, since the model is fitted on only part
of the data, it tends to fit less well, and the resulting test MSE reflects in part how closely the
selected validation set resembles the training data set. Generally, the resulting estimate
overstates the test MSE.

The textbook uses the name “Validation set approach” for this method because there is
actually no “cross-validation” for this approach: the data in the training set does not serve as
the validation set for any other round.

2) Leave-one-out cross-validation, LOOCV (留一驗證, k = n)
In the i-th round, the i-th observation is held out as the validation set, and the remaining
n − 1 observations form the training data set. A total of n different models have to be built, and
hence it is computationally very intensive. However, the final test MSE estimate is not
subject to any randomness in the set splits. This is an advantage over the holdout method.


3) k-fold cross-validation (typically k = 5 or k = 10)
In k-fold cross-validation, the original sample is first randomly partitioned into k (roughly)
equal-sized subsamples. In the i-th round, all observations in the i-th subsample are held out,
and all the observations in the remaining k − 1 subsamples belong to the training set.
k-fold cross-validation is not as computationally intensive as LOOCV. However, the
resulting estimate of the test MSE depends on the partition into subsamples. For a toy
example, suppose there are 12 observations altogether and a 3-fold cross-validation is to be
conducted. One way to do the estimation is to use the 3 subsamples (1, 2, 3, 4), (5, 6, 7, 8)
and (9, 10, 11, 12) and do the three rounds of cross-validation as follows:

            Training set                 Validation set
Round 1     1  2  3  4  5  6  7  8       9  10  11  12
Round 2     5  6  7  8  9  10 11 12      1  2  3  4
Round 3     9  10 11 12 1  2  3  4       5  6  7  8

However, another way to do the estimation is to use the 3 subsamples (2, 3, 4, 5), (6, 7, 8, 9)
and (10, 11, 12, 1).

Example 4.1
Consider the data set in the tab CV in the companion Excel worksheet. Estimate the test MSE for
a linear regression model using
(a) the holdout method, with the first 15 data points as the training data set, and repeat using
the last 15 data points as the training data set;
(b) LOOCV;
(c) 5-fold cross-validation.
Repeat for a quadratic regression model. Which model is better, and what is the final model?
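If the data were loaded into R, parts (b) and (c) could be sketched with boot::cv.glm (this is not the Excel approach of the companion worksheet; the data frame dat below is a made-up stand-in for the CV-tab data):

```r
library(boot)   # provides cv.glm()

set.seed(1)
dat <- data.frame(x = runif(30, 0, 10))        # stand-in data; replace with the CV-tab data
dat$y <- 1 + 2 * dat$x + rnorm(30, sd = 2)

fit1 <- glm(y ~ x, data = dat)                 # linear model
fit2 <- glm(y ~ x + I(x^2), data = dat)        # quadratic model

cv.glm(dat, fit1)$delta[1]          # (b) LOOCV estimate of the test MSE
cv.glm(dat, fit1, K = 5)$delta[1]   # (c) 5-fold CV estimate
cv.glm(dat, fit2)$delta[1]          # repeat for the quadratic model
cv.glm(dat, fit2, K = 5)$delta[1]
```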

Comparing LOOCV and k-fold Cross-validation

In general, the k-fold CV method has not only a computational advantage over LOOCV but also a
better statistical property: it yields more accurate estimates. In k-fold CV, we train the model on
less data than is available. This introduces bias into the estimate. LOOCV has less bias in
this sense. However, when we perform LOOCV, we are in effect averaging the outputs of n
fitted models, each of which is trained on an almost identical set of observations; therefore, these
outputs are strongly positively correlated with each other. For the k-fold CV method, the overlap
between the training sets of the different models is smaller. Since the mean of many highly
correlated quantities has higher variance than does the mean of many quantities that are not as
highly correlated, the test MSE resulting from LOOCV tends to have higher variance. Empirically,
k = 5 and k = 10 are shown to yield test MSE estimates that suffer neither from excessively high
bias nor from very high variance.

Shortcut Formula for LOOCV for Linear Regression

The LOOCV estimation above is time-consuming because it involves fitting n models. It
happens that for linear regression models, there is a shortcut formula for computing the test MSE
that involves fitting only a single model with all observations serving as the training data set:
\[
\mathrm{CV}_{(n)} = \frac{1}{n}\sum_{i=1}^{n}\left(\frac{y_i - \hat{y}_i}{1 - h_i}\right)^2,
\]
where hi is the i-th diagonal element of the influence matrix (aka projection matrix or hat matrix) in
linear regression: H = X(X′X)⁻¹X′.

Proof (Optional): Let X[i] and y[i] be the same as X and y but with the i-th row of data deleted. Let
xi′ be the i-th row of X and let β[i] = (X[i]′X[i])⁻¹X[i]′y[i] be the estimate of β in the i-th round of
training. Then the prediction error in the i-th round is e[i] = yi − xi′β[i] (so the test MSE of that
round is e[i]²).

Now X[i]′X[i] = X′X − xixi′ and hi = xi′(X′X)⁻¹xi. By the Woodbury identity
\[
(A + BC)^{-1} = A^{-1} - A^{-1}B(I + CA^{-1}B)^{-1}CA^{-1},
\]
we have
\[
(X_{[i]}'X_{[i]})^{-1} = (X'X)^{-1} + \frac{(X'X)^{-1}\mathbf{x}_i\mathbf{x}_i'(X'X)^{-1}}{1 - h_i}.
\]
Also, X[i]′y[i] = X′y − xi yi. Therefore,
\[
\begin{aligned}
\boldsymbol\beta_{[i]}
&= \left[(X'X)^{-1} + \frac{(X'X)^{-1}\mathbf{x}_i\mathbf{x}_i'(X'X)^{-1}}{1 - h_i}\right](X'\mathbf{y} - \mathbf{x}_i y_i) \\
&= (X'X)^{-1}X'\mathbf{y} - (X'X)^{-1}\mathbf{x}_i y_i
   + \frac{(X'X)^{-1}\mathbf{x}_i}{1 - h_i}\,\mathbf{x}_i'(X'X)^{-1}(X'\mathbf{y} - \mathbf{x}_i y_i) \\
&= \hat{\boldsymbol\beta} - (X'X)^{-1}\mathbf{x}_i y_i
   + \frac{(X'X)^{-1}\mathbf{x}_i}{1 - h_i}\,[\mathbf{x}_i'\hat{\boldsymbol\beta} - h_i y_i] \\
&= \hat{\boldsymbol\beta} - \frac{(X'X)^{-1}\mathbf{x}_i}{1 - h_i}\,[(1 - h_i)y_i - \mathbf{x}_i'\hat{\boldsymbol\beta} + h_i y_i]
 = \hat{\boldsymbol\beta} - \frac{(X'X)^{-1}\mathbf{x}_i e_i}{1 - h_i},
\end{aligned}
\]
where ei = yi − xi′β̂ = yi − ŷi is the i-th residual from the full fit. Hence,
\[
e_{[i]} = y_i - \mathbf{x}_i'\left(\hat{\boldsymbol\beta} - \frac{(X'X)^{-1}\mathbf{x}_i e_i}{1 - h_i}\right)
        = y_i - \mathbf{x}_i'\hat{\boldsymbol\beta} + \frac{h_i e_i}{1 - h_i}
        = e_i + \frac{h_i e_i}{1 - h_i}
        = \frac{e_i}{1 - h_i}.
\]
The LOOCV estimate is
\[
\mathrm{CV}_{(n)} = \frac{1}{n}\sum_{i=1}^{n} e_{[i]}^2 = \frac{1}{n}\sum_{i=1}^{n}\left(\frac{e_i}{1 - h_i}\right)^2.
\]


Example 4.2
Repeat (b) for Example 4.1 using the shortcut formula.

Model Selection

Example 4.1 illustrates model selection: we have selected between two polynomial regression
models with degree 1 and degree 2 using the test MSE. The general procedure for selecting a
model or the best tuning parameter is as follows:

1) Define a set of candidate models (which can be indexed by a tuning parameter).
2) For each candidate model, estimate the test MSE using cross-validation
   (resample the data → build the model → predict the holdouts).
3) Aggregate the results into a performance profile and determine the final parameter.
4) Using the final tuning parameter, refit the model using the entire set of observations.

Classification Problem

For a classification problem, yi is not numerical but categorical. The corresponding objective to be
minimized is the error rate
\[
\frac{1}{n}\sum_{i=1}^{n} I(y_i \neq \hat{y}_i).
\]
The same procedure for model selection can be performed using the above measure and cross-
validation.


5 Data Pre-processing

Data pre-processing generally refers to the addition, deletion or transformation of the training
data set. Many techniques in predictive modeling are sensitive to the scale of the data. For
example, consider a KNN regression with two predictors x1 and x2. If, in the training data set, the
x1’s are of the order of tens while the x2’s are of the order of thousands, then the variation of the
Euclidean distance will be dominated by the distance in x2. See the illustration in the tab
“Standardize”.

Standardization

The most common data transformation is standardization. Given a training data set, we compute
the sample mean and sample standard deviation of each predictor xk by
\[
\bar{x}_k = \frac{1}{n}\sum_{i=1}^{n} x_{ik},
\qquad
s_k = \sqrt{\frac{1}{n-1}\sum_{i=1}^{n}(x_{ik} - \bar{x}_k)^2}.
\]
Then the data points are transformed using
\[
x_{ik} \leftarrow \frac{x_{ik} - \bar{x}_k}{s_k}.
\]
In practice, if the predictors are similar in units and order of magnitude, there is no need to
standardize them. However, standardization may still improve numerical stability in many of the
calculations.
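In R, this is exactly what scale() does (a sketch; the matrix X below is a made-up stand-in for data such as the “Standardize” tab):

```r
# Standardize each predictor: subtract its sample mean, divide by its sample sd.
set.seed(1)
X <- cbind(x1 = rnorm(50, mean = 20, sd = 5),       # stand-in: one predictor of order tens,
           x2 = rnorm(50, mean = 3000, sd = 800))   # one of order thousands

X_std <- scale(X, center = TRUE, scale = TRUE)
round(colMeans(X_std), 10)   # every column now has mean 0
apply(X_std, 2, sd)          # ... and standard deviation 1
```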

Skewness

An un-skewed distribution is one that is roughly symmetric. The skewness of a data series can be
computed using
\[
\text{Skewness} = \frac{1}{(s^2)^{3/2}}\cdot\frac{1}{n-1}\sum_{i=1}^{n}(x_i - \bar{x})^3 .
\]
A right-skewed distribution is one that has a positive skewness. Such a distribution has most of its
data points concentrated on the left, with a long tail stretching to the right, so the peak of the
frequency curve sits to the left.

A simple way to resolve skewness is to take the log or square root of the data (if every data point is
positive). A more sophisticated method is to use the Box–Cox transformation
\[
x_i^{(\lambda)} = \frac{x_i^{\lambda} - 1}{\lambda} \quad\text{for } \lambda \neq 0,
\]
where λ can be estimated by maximum likelihood estimation.

Some of the techniques in predictive analytics (e.g. conducting statistical inference on PCA, not
covered in this course) require the assumption of normality. Removing skewness is one way to
achieve normality.
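The skewness formula and the effect of a log transform are easy to try in R (a sketch with made-up lognormal data, not data from the course):

```r
# Sample skewness as defined above, and the effect of a log transform.
skewness <- function(x) {
  sum((x - mean(x))^3) / (length(x) - 1) / var(x)^(3/2)
}
set.seed(1)
x <- rlnorm(200)    # made-up, strongly right-skewed data
skewness(x)         # large and positive
skewness(log(x))    # close to 0: log(x) is (exactly) normal here
```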


Missing Data

Sometimes some predictors have no values for a given sample. There are many reasons why
predictor values are missing. They may be unavailable at the time when the model is built,
and it is also possible that some predictors are more prone to non-response (e.g. attributes such as
income). It can even happen in economics and finance that governments and corporations choose
not to report critical statistics because they do not look good.

It is important to understand whether there is any mechanism that creates the missing data, especially
if such a mechanism is related to the output y (informative missingness). For a large data set, the
omission of a small amount of data with missing predictors (at random) is not a problem.
However, if the size of the data set is small, such omission can cause a significant loss of
information and it is better to impute the missing values. One method to do so is the KNN
regression mentioned in Section 3.

Reducing the Number of Predictors: Principal Component Analysis

Principal component analysis (PCA, 主成份分析) is an exploratory data analysis technique in
multivariate statistics invented in 1901 by Karl Pearson.

To visualize a data set with 2 predictors we can use a single scatterplot. However, with p
predictors, even if we are just looking at pairwise relations, we can plot a total of
pC2 = p(p − 1)/2 scatterplots, which gives something like this for p = 5:

[Figure: a 5 × 5 scatterplot matrix of the pairwise relations among the predictors.]


Such a graph is called a scatterplot matrix, and can potentially be very large. Say p equals 10;
then 10C2 = 45 and it is hard to analyze 45 plots by sight. A better method is needed for large p.
One way to do so is to find a low-dimensional representation of the data that captures as much of
the information as possible. PCA is a tool to achieve this.

Transformation involved in PCA

Consider a random vector X = (X1, X2, …, Xp)′ (which is assumed to be centered, i.e. mean 0) with
covariance matrix Σ. Let φk = (φk1, φk2, …, φkp)′ for k = 1, 2, …, p. Consider building a new
random vector Z = (Z1, Z2, …, Zp)′ using the linear transformation
\[
\begin{aligned}
Z_1 &= \varphi_{11}X_1 + \varphi_{12}X_2 + \cdots + \varphi_{1p}X_p \\
Z_2 &= \varphi_{21}X_1 + \varphi_{22}X_2 + \cdots + \varphi_{2p}X_p \\
    &\;\;\vdots \\
Z_p &= \varphi_{p1}X_1 + \varphi_{p2}X_2 + \cdots + \varphi_{pp}X_p
\end{aligned}
\]
(that is, Z = ΦX), where Φ is the p × p matrix whose k-th row is φk′:
\[
\Phi = \begin{pmatrix} \boldsymbol\varphi_1' \\ \boldsymbol\varphi_2' \\ \vdots \\ \boldsymbol\varphi_p' \end{pmatrix}
     = \begin{pmatrix}
         \varphi_{11} & \varphi_{12} & \cdots & \varphi_{1p} \\
         \varphi_{21} & \varphi_{22} & \cdots & \varphi_{2p} \\
         \vdots       & \vdots       &        & \vdots       \\
         \varphi_{p1} & \varphi_{p2} & \cdots & \varphi_{pp}
       \end{pmatrix}.
\]
It can be shown that
\[
\operatorname{Var}(Z_k) = \boldsymbol\varphi_k' \Sigma \boldsymbol\varphi_k
\quad\text{and}\quad
\operatorname{Cov}(Z_i, Z_j) = \boldsymbol\varphi_i' \Sigma \boldsymbol\varphi_j .
\]

Principal Components

The principal components of X are those uncorrelated linear combinations Z1, Z2, …, Zp whose
variances are as large as possible.
The transformation is defined in such a way that the first component has the largest possible
variance (that is, accounts for as much of the variability in X as possible), and each succeeding
component in turn has the highest possible variance under the constraint that it is uncorrelated to
the preceding components. (The geometric interpretation is: the resulting vectors form an
orthogonal basis set).

First principal component:


The first principal component is the linear combination with maximum variance. That is, it
maximizes Var(Z1). Since the variance can be made arbitrarily large by increasing every
element of φ1 in proportion, we restrict φ1 to have unit length:
\[
\|\boldsymbol\varphi_1\|_2 = 1, \quad\text{i.e.}\quad \sum_{i=1}^{p}\varphi_{1i}^2 = 1 .
\]

Second principal component:


The second principal component is the linear combination with maximum variance that is
uncorrelated with Z1 (i.e., φ1′Σφ2 = 0). Again we impose the restriction ‖φ2‖2 = 1.


Third principal component:


The third principal component is the linear combination with maximum variance that is
uncorrelated with Z1 and Z2 (i.e., φ1′Σφ3 = φ2′Σφ3 = 0). Again we impose the restriction ‖φ3‖2 = 1.

and so on, until we reach the pth principal component.

Finding the principal components amounts to finding the matrix Φ (called the loading matrix) from the
covariance matrix Σ, together with the variance of each principal component Zi.

Analytical Calculation of PCA

It turns out that we can find  and the variance of each principal component by the following
mathematical procedure, the proof of which can be found in any elementary text on multivariate
statistics:

Algorithm:
1) Compute all eigenvalues of Σ. By construction, all eigenvalues are positive real numbers.
2) Arrange the eigenvalues in 1) in the order λ1 ≥ λ2 ≥ … ≥ λp. Then Var(Zi) = λi.
3) Find the eigenvector associated with each of the eigenvalues.
4) Normalize the eigenvectors in 3) to unit length to give the loadings φ1, φ2, …, φp.

An important property of the λi’s is that
\[
\sum_{i=1}^{p}\operatorname{Var}(X_i) = \sum_{i=1}^{p}\operatorname{Var}(Z_i) = \sum_{i=1}^{p}\lambda_i .
\]
As such, the proportion of the total population variance due to the k-th principal component is
\[
\frac{\lambda_k}{\lambda_1 + \lambda_2 + \cdots + \lambda_p} .
\]

A typical computer output for PCA looks something like this (note that the whole loading matrix Φ
appears in the table):

predictor    PC1    PC2    …    PCp
x1           φ11    φ21    …    φp1
x2           φ12    φ22    …    φp2
⁝            ⁝      ⁝           ⁝
xp           φ1p    φ2p    …    φpp
variance     λ1     λ2     …    λp

Most software has an option to select how many PCs are reported.


Remark: The loadings in each φk are unique only up to a sign flip: if φ1 is a normalized
eigenvector, then −φ1 is also a normalized eigenvector. Different computer software may
therefore report different φk’s (with opposite signs) for the same data.

Example 5.1
Consider the three random variables X1, X2, X3 with covariance matrix
\[
\Sigma = \begin{pmatrix} 1 & 2 & 0 \\ 2 & 6 & 1.5 \\ 0 & 1.5 & 5 \end{pmatrix}.
\]
Find the three principal components and also their variances. Also conduct a
principal component analysis on the correlation matrix of the three random variables.

A PCA based on the covariance matrix will give more weight to the Xk’s that have higher
variances. If the Xk’s have very different magnitudes or units, then it is customary to first standardize
each Xk before looking at the covariance matrix. This is equivalent to working with the
correlation matrix of the original set of random variables.
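The analytical algorithm is a one-liner in R via eigen(), shown here on the Example 5.1 covariance matrix (a sketch; eigen() returns unit-length eigenvectors sorted by eigenvalue, and signs may be flipped, as noted in the remark above):

```r
# PCA from a covariance matrix: eigenvalues = PC variances, eigenvectors = loadings.
Sigma <- matrix(c(1,   2,   0,
                  2,   6,   1.5,
                  0,   1.5, 5), nrow = 3, byrow = TRUE)
eg <- eigen(Sigma)
eg$values                     # lambda_1 >= lambda_2 >= lambda_3 = Var(Z_1), Var(Z_2), Var(Z_3)
eg$vectors                    # columns are the loading vectors phi_1, phi_2, phi_3
eg$values / sum(eg$values)    # proportion of total variance explained by each PC

eigen(cov2cor(Sigma))         # the PCA based on the correlation matrix instead
```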

Estimating Principal Components from Observations

In most cases we do not have the population covariance (or correlation) matrix of the set of
random variables X, but a set of n independent observations x1, x2, … xn. Our goal is to estimate
the principal components from the observations. This can be done by applying PCA on either the
sample covariance matrix or the sample correlation matrix.

Example 5.2
Consider the covariance matrix produced by the daily percentage change in 1-yr, 2-yr, 3-yr, 4-yr,
5-yr, 7-yr, 10-yr and 30-yr swap rate data of the US market from 3 July 2000 to 12 Aug 2011.

Estimate the 8 principal components from the data. How much variance can be explained by the
first 3 principal components? What do these 3 PCs represent in terms of yield curve movement?


Applications of PCA

From the point of view of exploratory data analysis, the transformation z = Φx maps
each observation xk (with p predictors) into a new observation zk (again with p components), as
follows:
\[
\begin{pmatrix} \mathbf{x}_1' \\ \mathbf{x}_2' \\ \vdots \\ \mathbf{x}_n' \end{pmatrix}
= \begin{pmatrix}
    x_{11} & x_{12} & \cdots & x_{1p} \\
    x_{21} & x_{22} & \cdots & x_{2p} \\
    \vdots & \vdots &        & \vdots \\
    x_{n1} & x_{n2} & \cdots & x_{np}
  \end{pmatrix}
\;\longrightarrow\;
\begin{pmatrix} \mathbf{z}_1' \\ \mathbf{z}_2' \\ \vdots \\ \mathbf{z}_n' \end{pmatrix}
= \begin{pmatrix}
    z_{11} & z_{12} & \cdots & z_{1p} \\
    z_{21} & z_{22} & \cdots & z_{2p} \\
    \vdots & \vdots &        & \vdots \\
    z_{n1} & z_{n2} & \cdots & z_{np}
  \end{pmatrix}.
\]

The variances of the components of z are decreasing in order: the first component (the zk1’s)
explains the largest variability in the original data; the second component (the zk2’s), which is
uncorrelated with the first component, explains the next largest variability; and so on. Hence one
way to simplify the original data set is to retain only the first m (< p) columns of the z matrix and
discard the rest (whose variability may be contributed mainly by noise):
\[
\begin{pmatrix} \mathbf{z}_1' \\ \mathbf{z}_2' \\ \vdots \\ \mathbf{z}_n' \end{pmatrix}
= \left(\begin{array}{cccc|ccc}
    z_{11} & z_{12} & \cdots & z_{1m} & z_{1,m+1} & \cdots & z_{1p} \\
    z_{21} & z_{22} & \cdots & z_{2m} & z_{2,m+1} & \cdots & z_{2p} \\
    \vdots & \vdots &        & \vdots & \vdots    &        & \vdots \\
    z_{n1} & z_{n2} & \cdots & z_{nm} & z_{n,m+1} & \cdots & z_{np}
  \end{array}\right)
\]
(keep the columns to the left of the bar and discard the rest). Then we can analyze the new data by
plotting a scatterplot matrix of lower dimension, and even interpret each of the m components to
shed light on the essential features of x. Now the question is: how can we choose the optimal m?

It turns out that there is no universally accepted scientific way to do so. All existing ways (of which
I give you three) are ad hoc in nature!

1) We can let m be the least number of PCs that explains a certain percentage (say 80%) of the
total variance of the data.

2) We can draw a graph (called scree plot) of the proportion of variance explained vs the PCs.
We eyeball the plot and look for a point at which the proportion of variance explained drops
off and cut that at m. This is often referred to as an elbow.

3) We look at the first few PCs in order to find interesting interpretation in the data and keep
those that can explain the variability of X. If no such things are found, then further PCs are
unlikely to yield anything interesting. This is very subjective and depends on your ability to
come up with stories (as illustrated in Example 5.2).


PC scores plot and PC loading plot

After we have decided on m, we can further plot two types of graphs. The first one is a scatterplot
matrix for the retained PC scores
\[
\begin{pmatrix} \mathbf{z}_1' \\ \mathbf{z}_2' \\ \vdots \\ \mathbf{z}_n' \end{pmatrix}
= \begin{pmatrix}
    z_{11} & z_{12} & \cdots & z_{1m} \\
    z_{21} & z_{22} & \cdots & z_{2m} \\
    \vdots & \vdots &        & \vdots \\
    z_{n1} & z_{n2} & \cdots & z_{nm}
  \end{pmatrix}.
\]

Such a scatterplot matrix is called a PC scores plot. While each PC is uncorrelated with every
other PC, there may still be patterns in the PC scores that warrant further investigation.

We can also take any two PCs and plot (for all p predictors) the direction they are pointing to.
This is known as a PC loading plot. For example, we can pick PC1 and PC2,
predictor    PC1    PC2    …    PCm
x1           φ11    φ21    …    φm1
x2           φ12    φ22    …    φm2
⁝            ⁝      ⁝           ⁝
xp           φ1p    φ2p    …    φmp

We draw the p vectors (φ11, φ21), (φ12, φ22), …, (φ1p, φ2p). The point (φ11, φ21) shows how much
weight x1 has on PC1 and PC2, the point (φ12, φ22) shows how much weight x2 has on PC1 and
PC2, and so on.

The angles between the vectors tell us how the different predictors xi correlate with one another (as
reflected by the two PCs):
• When the vectors that correspond to xi and xj form a small angle, the two predictors strongly
  correlate with each other.
• When the vectors are perpendicular, the corresponding xi and xj are not correlated.
• When the vectors diverge and form a large angle (close to 180°), the corresponding xi and
  xj are strongly negatively correlated.

Example 5.3
Consider the US violent crime rates data in year 1973 in the companion Excel worksheet (the
USArrests data set in R). Perform a principal component analysis on the standardized data
and create
(a) a scatterplot for the scores of PC1 and PC2;
(b) a PC loading plot for PC1 and PC2.


A biplot for two particular PCs simply overlays a PC scores plot on a PC loading plot. See
Figure 10.1 of the text for the biplot of PC1 and PC2 in Example 5.3. The programming language
R can generate biplots easily.
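For instance, Example 5.3 can be reproduced with prcomp() on the built-in USArrests data (a sketch; the plots will not be styled exactly like Figure 10.1):

```r
# PCA on the standardized USArrests data and the two plots of Example 5.3.
pca <- prcomp(USArrests, scale. = TRUE)   # scale. = TRUE standardizes each predictor

pca$rotation                    # loading matrix (columns PC1, PC2, ...)
head(pca$x[, 1:2])              # PC scores on PC1 and PC2
pca$sdev^2 / sum(pca$sdev^2)    # proportion of variance explained (useful for a scree plot)

plot(pca$x[, 1:2])              # (a) scatterplot of the PC1 and PC2 scores
biplot(pca, scale = 0)          # overlays the scores plot with the loading plot
```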

Some Geometric Interpretations of PCs

This is not related to real-life applications but to the geometry of random vectors and change of
basis in linear algebra. To understand what follows, you need some basics of linear algebra.

Consider PC1. The loading vector φ1 defines a direction in R^p along which the data vary the most.
If we project the n data points x1, x2, …, xn onto this direction, the projected values are z11, z21, …,
zn1. The following is an illustration with p = 2 (population and advertising spending):

We have Z1 = 0.839 (population) + 0.544 (ad) for the mean-centered data. The vector 0.839i +
0.544j is the direction of the green line. PC2 points in the direction that is perpendicular to PC1;
φ2 points in this direction. When the n data points are projected onto this direction, the projected
values are z12, z22, …, zn2. We have Z2 = 0.544 (population) − 0.839 (ad) for the mean-centered
data.

An alternative interpretation is that PCs provide low-dimensional linear surfaces that are closest
to the observations. PC1 is the line in R^p that is closest to the n observations. PC1 and PC2
together span the plane that is closest to the n observations. PC1, PC2 and PC3 span the
three-dimensional hyperplane that is closest to the n observations, and so on. So,
\[
x_{ij} \approx \sum_{k=1}^{m} \varphi_{kj}\, z_{ik} .
\]

Together the m PC scores and the m loading vectors can give a good approximation to the data.


CHAPTER 2 Clustering

Related Readings …

An Introduction to Statistical Learning: with Applications in R
  Section 10.3

Applied Multivariate Statistical Analysis (6th edition)
  Sections 12.2 – 12.4

Learning Objectives
Similarity measures, hierarchical method, K-means method, difference
between supervised and unsupervised machine learning

In this chapter we study clustering ( 聚類分析). Clustering is different from classification.


Classification pertains to a known number of groups (e.g. two groups in the KNN regression
example in the Excel worksheet of Chapter 1) and the operation of assigning a new observation
to one of these groups. Clustering pertains to grouping a set of objects in a way such that objects
in the same group are more similar to each other than to those in other groups.

To understand the nature of clustering, look at the class of students in this course. By defining a
measure of similarity, we can form different clusters.
1) If students in the same cohort are treated as similar, then all year 4 students form one cluster,
and all year 5 students form another cluster.
2) If students in the same major program are treated as similar, then students in the IBBA
program form one cluster, and students from the actuarial science program form another
cluster.
3) If students living in the same political district (e.g. Eastern, Southern, North) are treated as
similar, then students in the class naturally form at most 18 groups.
There are many other ways of grouping, e.g. by secondary school, by gender, by age.

Clustering has huge application in pattern recognition, search engine results grouping, crime
analysis and medical imaging. In business, clustering is widely used in market research.

FINA 3295 | Predictive Analytics 21


Chapter 2. Clustering
C.Y. Ng
Section 1. Similarity Measures

1 Similarity Measures

In Chapter 1 we mentioned the Euclidean distance and the city-block distance between two data
points. We now make the discussion more general. How can we define the distance between a
data point and a cluster of data points, and the distance between two clusters of data points?

Suppose that there are m points in cluster A and n points in cluster B. By selecting one point from
A and one point from B, we can calculate mn distances. These distances are called inter-cluster
dissimilarities. Based on these, we can define the “linkage” between A and B. Alternatively, we
can find the centroids of the two clusters and find the linkage by computing the distance between
the two centroids.

Linkage Description
Average Mean inter-cluster dissimilarity
Complete Maximal inter-cluster dissimilarity
Single Minimal inter-cluster dissimilarity
Centroid Distance between the two centroids

[Figure: the four linkages illustrated for two clusters — average linkage as the average of the pairwise
distances (15 of them in the picture), single linkage as the minimal distance (red), complete linkage as
the maximal distance (blue), and centroid linkage as the distance between the two centroids (triangles).]

Example 1.1 [SRM Sample #1]
You are given the following four pairs of observations:
x1 = (−1, 0), x2 = (1, 1), x3 = (2, −1), x4 = (5, 10).
A hierarchical clustering algorithm is used with complete linkage and Euclidean distance.
Calculate the inter-cluster dissimilarity between {x1, x2} and {x4}.
(A) 2.2   (B) 3.2   (C) 9.9   (D) 10.8   (E) 11.7

Repeat for {x2, x3} and {x4}, with average linkage and city-block distance.
Repeat for {x1, x2} and {x3, x4}, with centroid linkage and Euclidean distance.
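These inter-cluster dissimilarities are easy to check in R with dist() (a sketch, using the coordinates as given above):

```r
# Pairwise distances, then apply the chosen linkage by hand.
pts <- rbind(x1 = c(-1, 0), x2 = c(1, 1), x3 = c(2, -1), x4 = c(5, 10))

d_euc <- as.matrix(dist(pts, method = "euclidean"))
max(d_euc[c("x1", "x2"), "x4"])     # complete linkage between {x1, x2} and {x4}

d_cb <- as.matrix(dist(pts, method = "manhattan"))   # city-block distance
mean(d_cb[c("x2", "x3"), "x4"])     # average linkage between {x2, x3} and {x4}

# Centroid linkage between {x1, x2} and {x3, x4}: distance between the two centroids.
sqrt(sum((colMeans(pts[c("x1", "x2"), ]) - colMeans(pts[c("x3", "x4"), ]))^2))
```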


2 Hierarchical Method

In this section we discuss hierarchical agglomerative clustering (HAC). This method builds the
hierarchy from the individual observations by progressively merging them into clusters, until all
of them finally merge into a single cluster:

The result is called a dendrogram (樹狀圖).

Algorithm:
1) Treat each of the n observations as its own cluster. All clusters are at level 0.
2) For k = n, n − 1, …, 2:
   (i) Calculate all kC2 pairwise inter-cluster dissimilarities.
   (ii) Fuse the two clusters with the smallest inter-cluster dissimilarity.
   (iii) The dissimilarity in (ii) indicates the level at which the fusion should be placed.

The form of the dendrogram depends on the choice of distance and linkage function. Average
and complete linkage are generally preferred over single linkage because single linkage can
result in extended, trailing clusters in which single observations are fused one at a time. This is
called the chaining phenomenon, where clusters may be forced together because single
elements happen to be close to each other, even though many of the observations in the two
clusters are very distant from each other. (Sometimes this can be an advantage, though.) Average
and complete linkage give more balanced dendrograms.

Centroid linkage is seldom used because inversion can occur. See Example 2.3.


Let us consider the following 5 observations with distance matrix

a b c d e
a 0
b 8.5 0
c 10.5 15 0
d 15.5 17 14 0
e 11.5 10.5 19.5 21.5 0

Complete Linkage Illustration

1) We start with 5 clusters {a}, {b}, {c}, {d}, {e}.
2) (i) The 5C2 = 10 dissimilarities are given in the distance matrix.
(ii) 8.5 is the shortest dissimilarity. So we fuse {a} and {b} into {a, b}.
(iii) 8.5 is the height at which the fusion occurs.
3) (i) The 4C2 = 6 dissimilarities are given below:
a, b c d e
a, b 0
c 15 0
d 17 14 0
e 11.5 19.5 21.5 0
(ii) 11.5 is the shortest dissimilarity. So we fuse {a, b} and {e} into {a, b, e}.
(iii) 11.5 is the height at which the fusion occurs.
4) (i) The 3C2 = 3 dissimilarities are given below:
a , b, e c d
a , b, e 0
c 19.5 0
d 21.5 14 0
(ii) 14 is the shortest dissimilarity. So we fuse {c} and {d} into {c, d}.
(iii) 14 is the height at which the fusion occurs.
5) (i) The 2C2 = 1 dissimilarity is given below:
a , b, e c , d
a , b, e 0
c, d 21.5 0
(ii) We fuse {a, b, e} and {c, d} at height 21.5 into {a, b, c, d, e}. End of algorithm.
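The whole construction can be reproduced in R by feeding the distance matrix to hclust() (a sketch; only the lower triangle of the matrix is needed):

```r
# Hierarchical clustering of the 5 observations from their distance matrix.
d <- matrix(0, 5, 5, dimnames = list(letters[1:5], letters[1:5]))
d[lower.tri(d)] <- c(8.5, 10.5, 15.5, 11.5,   # distances from a to b, c, d, e
                     15, 17, 10.5,            # from b to c, d, e
                     14, 19.5,                # from c to d, e
                     21.5)                    # from d to e

hc <- hclust(as.dist(d), method = "complete") # also try "single" or "average"
hc$height                                     # fusion heights: 8.5, 11.5, 14, 21.5
plot(hc)                                      # the dendrogram
```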


Example 2.1
Show that the dendrogram resulting from the use of single linkage is:


Example 2.2
Show that the dendrogram resulting from the use of average linkage is:


Example 2.3
Construct a dendrogram for the three data points a = (1.1, 1), b = (5, 1) and c = (3, 1 + 2√3)
using Euclidean distance and centroid linkage.

Recall the USArrests data in Example 5.3 of Chapter 1. The dendrograms for complete,
average and single linkages for the standardized data are generated by R (it is hard to create a
dendrogram using Excel) and are given below. Standardization is necessary because the unit of
the predictor “Urban population” is very different from those of the other 3 predictors.

[Figures: dendrograms for the standardized USArrests data under complete, average and single linkage.]

Interpreting a Dendrogram

In a dendrogram, we can tell the proximity of two observations from the height at which the
branches containing those two observations first fuse. (Note that when the tree is grown to the right,
the proximity of two observations along the vertical axis is not related to similarity.)

We can cut the dendrogram to control the number of clusters obtained. Using the previous
dendrogram (the complete-linkage example) as an illustration:
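In R this is done with cutree(), continuing the hclust sketch from the previous section (hc is the complete-linkage fit on the 5-observation distance matrix):

```r
# Cut the dendrogram at a chosen height, or ask for a chosen number of clusters.
cutree(hc, h = 18)   # cut at height 18: two clusters, {a, b, e} and {c, d}
cutree(hc, h = 12)   # cut at height 12: three clusters, {a, b, e}, {c} and {d}
cutree(hc, k = 4)    # alternatively, request 4 clusters directly
```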

Limitations of HAC

1) A dendrogram can give any number of clusters required. However, the clusters may not be
optimal in some sense. The term “hierarchical” in HAC refers to the fact that clusters
obtained by cutting the dendrogram at a given height (e.g. the cut at h = 12 above) are
necessarily nested within the clusters obtained by cutting the same dendrogram at a greater
height (e.g. the cut at h = 18). This may not be reasonable. Consider observations on male
students from Canada, the USA and Mexico, each of whom either passes or fails a test. If we
want 3 clusters, the most reasonable way is to split them by country. However, if we want
2 clusters, the most reasonable way is to split by test result. If we use HAC and at the
3-cluster level we indeed arrive at a split based on country, then it is not possible to arrive
at a split by test result at the upper level.

2) HAC is easily distorted by the presence of outliers. Suppose that in a group of 20
observations, 4 are truly singletons, and the remaining 16 observations split naturally
into 2 groups of 8. Then if the dendrogram is cut at a height that gives 5 clusters, one of
the singletons must be fused with one of the 2 big groups. This would greatly distort the
resulting cluster that absorbs the singleton, and later merges in the algorithm may give
unreasonable results.

3) HAC is not robust: If we delete a single observation from the data set and perform HAC on
the resulting data set, the result may look very different.


3 K-Means Method

K-means method was developed by MacQueen in 1967. Suppose that we want to split n
observations into K groups where K is known. The simplest version of the K-means method is
the following:

Algorithm:
1) Partition the n observations into K preliminary clusters by assigning a number from 1, 2, …,
K randomly to each observation. (Random partition)
2) Iterate until the cluster assignments stop changing:
(i) For each of the K clusters, compute the centroid.
(ii) Assign each observation to the cluster whose centroid is closest (as measured by ‖·‖2).
Alternative to 1):
1) Randomly specify K observations as centroids. (Forgy initialization)

The final assignment of observations to clusters depends on the initial randomization. That is
why the algorithm should be run with several different initial settings and the resulting final
clusterings recorded. If more than one result is obtained, we should compute a score for each
result and keep the “best” grouping achieved. Such a score is based on the following:
\[
\mathrm{SSE} = \sum_{i=1}^{K}\sum_{\mathbf{x}\in C_i}\|\mathbf{x}-\boldsymbol\mu_i\|_2^2 = \sum_{i=1}^{K}\mathrm{SSE}_i,
\quad\text{where } \boldsymbol\mu_i \text{ is the centroid of cluster } C_i .
\]
SSEi is a measure of dispersion within cluster Ci. The smaller the SSE, the better the clustering
{C1, C2, …, CK}.

Remark: It is not hard to prove (as you will see below) that
\[
\mathrm{SSE} = \sum_{i=1}^{K}\frac{1}{2|C_i|}\sum_{\mathbf{x},\mathbf{y}\in C_i}\|\mathbf{x}-\mathbf{y}\|_2^2,
\quad\text{where } |C_i| \text{ is the number of points in cluster } C_i .
\]
Noting that in the sum Σ_{x,y∈Ci} ‖x − y‖²₂ there are altogether |Ci|² terms, |Ci| of which must be
zero because they correspond to the cases where x = y, and that the remaining terms form pairs
because ‖a − b‖2 = ‖b − a‖2, the quantity
\[
\mathrm{SSE}_i = \frac{1}{2|C_i|}\sum_{\mathbf{x},\mathbf{y}\in C_i}\|\mathbf{x}-\mathbf{y}\|_2^2
\]
can be interpreted as the average of the squared distances between the observations in the i-th
cluster. So, the SSE defined above is half of the sum of the total within-cluster variation, as
measured by the sum of pairwise squared distances between observations within each cluster,
divided by the number of points in the cluster.

Proof of the SSE relation (Optional): We note that
\[
\begin{aligned}
\sum_{\mathbf{x},\mathbf{y}\in C_i}\|\mathbf{x}-\mathbf{y}\|_2^2
&= \sum_{\mathbf{x},\mathbf{y}\in C_i}\|(\mathbf{x}-\boldsymbol\mu_i)-(\mathbf{y}-\boldsymbol\mu_i)\|_2^2 \\
&= \sum_{\mathbf{x},\mathbf{y}\in C_i}\left[\|\mathbf{x}-\boldsymbol\mu_i\|_2^2+\|\mathbf{y}-\boldsymbol\mu_i\|_2^2
   -2(\mathbf{x}-\boldsymbol\mu_i)\cdot(\mathbf{y}-\boldsymbol\mu_i)\right] \\
&= \sum_{\mathbf{x},\mathbf{y}\in C_i}\left[\|\mathbf{x}-\boldsymbol\mu_i\|_2^2+\|\mathbf{y}-\boldsymbol\mu_i\|_2^2\right]
   -2\sum_{\mathbf{x},\mathbf{y}\in C_i}(\mathbf{x}-\boldsymbol\mu_i)\cdot(\mathbf{y}-\boldsymbol\mu_i) \\
&= 2|C_i|\sum_{\mathbf{x}\in C_i}\|\mathbf{x}-\boldsymbol\mu_i\|_2^2
   -2\sum_{\mathbf{x}\in C_i}(\mathbf{x}-\boldsymbol\mu_i)\cdot\sum_{\mathbf{y}\in C_i}(\mathbf{y}-\boldsymbol\mu_i) \\
&= 2|C_i|\sum_{\mathbf{x}\in C_i}\|\mathbf{x}-\boldsymbol\mu_i\|_2^2-2(\mathbf{0})\cdot(\mathbf{0}),
\end{aligned}
\]
where the last equality follows from the fact that μi is the center (mean) of Ci. Hence,
\[
\sum_{\mathbf{x}\in C_i}\|\mathbf{x}-\boldsymbol\mu_i\|_2^2
= \frac{1}{2|C_i|}\sum_{\mathbf{x},\mathbf{y}\in C_i}\|\mathbf{x}-\mathbf{y}\|_2^2 .
\]

Illustration of K-Means Method

Consider A = (5, 3), B = (−1, 1), C = (1, −2) and D = (−3, −2). Let us implement the K-means
method with K = 2, using the initial clusters {A, B} and {C, D}. What is the resulting SSE?

2) (i) The centroids are (2, 2) and (−1, −2).
   (ii)
   Distance from    A         B         C         D
   (2, 2)           3.1623    3.1623    4.1231    6.4031
   (−1, −2)         7.8103    3         2         2
   Hence, B should be assigned to {C, D}.
3) (i) The centroid for {A} is (5, 3), while the centroid for {B, C, D} is (−1, −1).
   (ii)
   Distance from    A         B         C         D
   (5, 3)           0         6.3246    6.4031    9.4340
   (−1, −1)         7.2111    2         2.2361    2.2361
   There is no more reassignment. End of algorithm.
The SSE is 0 + 2² + 2.2361² × 2 = 14.


Example 3.1
Repeat the illustration with initial clusters {A, C} and {B, D}.

For an example that is of a larger scale, see the illustration in the companion Excel worksheet.
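The same clustering can be obtained with R's kmeans() (a sketch using the four points above; the nstart argument runs several random initializations, which relates to the initialization issue discussed below):

```r
# K-means on the four points A, B, C, D with K = 2.
pts <- rbind(A = c(5, 3), B = c(-1, 1), C = c(1, -2), D = c(-3, -2))
km <- kmeans(pts, centers = 2, nstart = 20)

km$cluster          # final assignment: {A} versus {B, C, D}
km$centers          # the two centroids
km$tot.withinss     # the SSE defined above (14 for this assignment)
```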

Limitations of K-Means Method

The K-means method suffers from similar problems to HAC. It is sensitive to the scaling of the
data, is easily distorted by outliers, and is not robust to the addition or deletion of a few observations.
There are other disadvantages of the K-means method that are not present in HAC.

1) Initialization is a problem because the final clustering heavily depends on the initialization. This
can be partly resolved by using a better algorithm to find a more reasonable initialization
(e.g. the k-means++ method) rather than a purely random one.

2) It is hard to find a global minimum of the SSE. You may get stuck in a local minimum
easily.

3) It is very difficult to set the value of K. Even if the population is known to consist of K
groups, the sampling method may be such that data from the rarest group do not appear in the
sample. Forcing the data into K groups will lead to unreasonable results. In practice,
computer scientists use a few values of K to run the algorithm.

4) If some clusters have only a few points while other clusters have many more points, the K-
means method automatically gives more weight to the larger clusters because it gives equal
weight to each point. The centroids of smaller clusters can then end up far away from their
true centers, because these centroids are “attracted” to the larger clusters so that the larger
clusters can be split further to minimize the SSE.


5) Similar to linear regression, we are minimizing the SSE. But why SSE? If the data look like the
following, good luck even if you correctly set K = 2!

Single-linkage HAC would work here, though, due to the chaining phenomenon.

4 Supervised Versus Unsupervised Learning

There are two main kinds of predictive models or machine learning.

Supervised learning: This refers to learning from a training set of data with both x and y.
Learning consists of inferring the relation between the input x and the output y.
It is called “supervised” because the process of an algorithm learning
from the training data set can be thought of as a teacher supervising the
learning process. We know the correct answers (y), and the algorithm
iteratively makes predictions so that more and more correct answers
are obtained.

When conducting supervised learning, the main considerations are 1) the
complexity, and hence the interpretability, of the model, and 2) the
bias-variance tradeoff.

Unsupervised learning: This refers to learning from a training set of data with only x. The
learning consists of revealing the underlying distribution of the data in
order to learn more about the data. This is called unsupervised because
there are no correct answers.


This table summarizes the classification of what we have studied in Chapter 1 and Chapter 2:

              Supervised                       Unsupervised
Discrete      Classification                   Clustering
              (e.g. KNN classification)
Continuous    Regression                       Dimensionality reduction
              (e.g. linear regression)         (e.g. PCA)

There is another type called semi-supervised machine learning, where only some of the data has
y. For example, you may have taken a lot of photos in the F6 athletic meet. Some of the photos
may be clearly labelled by tags such as “teacher”, “class 6A”, “Chris Wong”. But some photos
are unlabeled. How can you perform clustering in such a case? In this course we are not going to
touch on this.


CHAPTER 3 Classification and Regression Trees

Related Readings …

An Introduction to Statistical Learning: with Applications in R


Sections 8.1 – 8.2

Applied Predictive Modeling


Sections 8.1 – 8.2, 8.4 – 8.6, 14.1, 14.3 – 14.4

Learning Objectives
CART, C4.5 model, Gini index, entropy, sum of square error and binary
recursive splitting, cost-complexity pruning, bagging, random forest,
boosting

A decision tree is a tree in which each internal node represents a predictor, each branch represents a
decision / rule, and each leaf represents a prediction (which can be numerical or categorical).

Here is a hypothetical example. We have the number of home runs scored by a set of baseball players,
and our goal is to use two predictors (weight and height) to predict the number of home runs. A
decision tree splits the height-weight space and assigns a number to each partition.

[Figure: a decision tree with a root split on height at 1.8 m, followed by splits on weight at 90 kg
and 100 kg, giving four leaf predictions (y = 0.5, 2, 1.5 and 1); shown next to the corresponding
partition of the height-weight plane into four rectangles.]

We start from the root node and use a split on height at 1.8 m to partition the height-weight
space into two subspaces. In each of the two subspaces, we use weight to branch out further into
two leaves (terminal nodes), where a final prediction of the number of home runs is made.


Decision trees are common in business, social science, medical and many other daily-life
applications, because their working principle resembles that of the human mind and they are highly
interpretable (while the height-weight space partition on the right of the figure looks alien to most people).
The construction of the tree, however, is far from simple. Given a set of observations of the form
(height, weight) → number of home runs
how can you build a tree?

This chapter is an introduction to classification and regression trees (CART) and to popular
ensemble methods that improve their prediction accuracy. CART is the most well-known
algorithm in predictive analytics. It works for both classification and regression problems, is easy
to interpret, and is computationally efficient. It can deal with both continuous and categorical
predictors, and it even works with missing values in the input data.

1 Classification Trees

To illustrate the construction of a classification tree, let us walk through the following famous
example. We have a data set with 14 observations, 4 predictors, and binary outcome:

Day Temperature Outlook Humidity Windy Golf?


July 5 Hot Sunny High No No
July 6 Hot Sunny High Yes No
July 7 Hot Overcast High No Yes
July 9 Cool Rain Normal No Yes
July 10 Cool Overcast Normal Yes Yes
July 12 Mild Sunny High No No
July 14 Cool Sunny Normal No Yes
July 15 Mild Rain Normal No Yes
July 20 Mild Sunny Normal Yes Yes
July 21 Mild Overcast High Yes Yes
July 22 Hot Overcast Normal No Yes
July 23 Mild Rain High Yes No
July 26 Cool Rain Normal Yes No
July 30 Mild Rain High No Yes

With these 4 categorical predictors, there are 3 × 3 × 2 × 2 = 36 combinations. We want to build
a classification tree to do prediction. For example, if tomorrow will be Cool and Sunny, with Normal
humidity and Windy, should your company play golf?

To start with, we can pick any of the 4 predictors as the first split, as follows:


Temperature:  Hot 2 Yes / 2 No    Cool 3 Yes / 1 No       Mild 4 Yes / 2 No
Outlook:      Sunny 2 Yes / 3 No  Overcast 4 Yes / 0 No   Rain 3 Yes / 2 No
Humidity:     High 3 Yes / 4 No   Normal 6 Yes / 1 No
Windy:        No 6 Yes / 2 No     Yes 3 Yes / 3 No

Which split should be chosen? Common sense tells us that

A predictor that splits the observations so that each successor node is as pure as possible should
be chosen.

Outlook gives a node that is 100% pure. It seems to be a good choice. But in most real-life cases
things are murkier. Breiman et al. (1984) came up with the idea of Gini index, which uses the
total variance across different classes to measure the purity of a node:
G = \sum_{i=1}^{k} p_i (1 - p_i) = 1 - \sum_{i=1}^{k} p_i^2.

The smaller G is, the purer the node. (In the two-class case where k = 2, we have p1 + p2 = 1;
G = 0 for p1 = 0 or 1, and G attains its maximum of 0.5 at p1 = p2 = 1/2.)
9  5
Before the split, there are 9 Yes’s and 5 No’s. The Gini index is 1        0.45918.
 14   14 
After the split, there are multiple new nodes. We compute the Gini index for each of the nodes
and aggregate out the total Gini index by weighting them by the number of total observations in
the nodes.

Let us calculate the total Gini index after each of the splits.

Temperature:

G = (4/14)[1 − (2/4)² − (2/4)²] + (4/14)[1 − (3/4)² − (1/4)²] + (6/14)[1 − (4/6)² − (2/6)²] = 0.440476

So G decreases by a bit.
Humidity:

G = (7/14)[1 − (3/7)² − (4/7)²] + (7/14)[1 − (6/7)² − (1/7)²] = 0.367347

This is much better than Temperature.
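
The weighted-Gini computation above is easy to script. The following is a minimal R sketch (the
helper names and the count vectors are chosen here for illustration only); the same helper can be
applied to the Outlook and Windy splits in Example 1.1 below.

  # Gini index of a single node, given its class counts
  gini <- function(counts) {
    p <- counts / sum(counts)
    1 - sum(p^2)
  }

  # total Gini index of a split: weight each child node by its share of observations
  total_gini <- function(children) {
    n <- sum(sapply(children, sum))
    sum(sapply(children, function(cnt) sum(cnt) / n * gini(cnt)))
  }

  # golf data, (Yes, No) counts in each child node
  total_gini(list(Hot = c(2, 2), Cool = c(3, 1), Mild = c(4, 2)))   # 0.440476 (Temperature)
  total_gini(list(High = c(3, 4), Normal = c(6, 1)))                # 0.367347 (Humidity)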


? Example 1. 1
Calculate the total Gini index for the split using the predictors Outlook and Windy. Hence decide
which predictor should be used for the initial splitting. Continue the process and build the whole
classification tree.

What is the prediction for “Cool, Sunny, with normal Humidity and Windy”?

Another way to determine the split is to make use of the concept of entropy. The purer a node is,
the higher the degree of order it possesses. Maximum order is achieved if all outputs in a node
are the same, giving zero entropy. Minimum order is achieved if the different kinds of outputs are
equally represented in a node, giving the largest entropy. The application of entropy in
computer science was explored by Quinlan (1993), giving the "C4.5 tree". The cross entropy
(aka deviance) of a node is given by (in units of bits)

D = -\sum_{i=1}^{k} p_i \log_2 p_i = -\frac{1}{\ln 2}\sum_{i=1}^{k} p_i \ln p_i    (with the convention 0 log₂ 0 = 0).

(Once again, in the two-class case where k = 2, p1 + p2 = 1; D = 0 for p1 = 0 or 1, and D attains
its maximum of 1 at p1 = p2 = 1/2. Also, note that this is NOT the natural logarithm!) A smaller D is
preferable when we consider a split.

Let us use cross entropy to rework the golf case. Before the split,

D = −(9/14) log₂(9/14) − (5/14) log₂(5/14) = 0.94029.


? Example 1. 2
Use entropy to construct the classification tree for the golf illustration.

(There is yet another way to measure impurity, called the "misclassification error rate", E = 1 − max_i p_i.
The larger the proportion of the most common outcome in a node, the smaller E is. However, E is not very
sensitive for the purpose of growing a tree.)

Continuous Predictors

The golf example features categorical predictors, which can only take on finitely many values.
In most real-life examples there is a mix of categorical and continuous predictors. For example,
you may have the actual temperature (in °F) as a predictor:
x y x y
64 Yes 72 Yes
65 No 75 Yes
68 Yes 75 Yes
69 Yes 80 No
70 Yes 81 No
71 No 83 No
72 No

For such a predictor, it is customary to do a binary split*: one branch is x < t, while the other
branch is x ≥ t. To determine the best split, we look at split points of the form t = (x_i + x_{i+1}) / 2.
Let us consider t = 71.5, which is the midpoint of 71 and 72. The split gives

x < 71.5: 4 Yes, 2 No          x ≥ 71.5: 3 Yes, 4 No

* Do not treat x as categorical with values 64, 65, etc. Such a high-branching predictor will smash any dataset and seriously overfit.
In fact, tree splitting is also biased towards categorical predictors with a large number of possible values, because they lead to a
large number of nodes that are very pure simply because the number of observations in each of these nodes is small.


Before the split, the Gini index is G = 1 − (7/13)² − (6/13)² = 0.497041.

After the split, the total Gini index is

G = (6/13)[1 − (4/6)² − (2/6)²] + (7/13)[1 − (3/7)² − (4/7)²] = 0.46886.

We can repeat the calculation for other potential split points. See the Excel illustration:

Split point G Split point G


64.5 0.46154 71.5 0.46886
66.5 0.49650 73.5 0.47308
68.5 0.48718 77.5 0.32308
69.5 0.45727 80.5 0.39161
70.5 0.41154 82 0.44872

So the split point that should be chosen is 77.5.
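
The scan over candidate midpoints can be automated. Below is a minimal R sketch using the 13
temperature observations above (the function names are mine, for illustration only):

  x <- c(64, 65, 68, 69, 70, 71, 72, 72, 75, 75, 80, 81, 83)
  y <- c("Y", "N", "Y", "Y", "Y", "N", "N", "Y", "Y", "Y", "N", "N", "N")

  gini_node <- function(lab) { p <- table(lab) / length(lab); 1 - sum(p^2) }

  # total Gini index of the binary split x < t versus x >= t
  total_gini_at <- function(t) {
    left <- y[x < t]; right <- y[x >= t]
    length(left) / length(y) * gini_node(left) +
      length(right) / length(y) * gini_node(right)
  }

  cuts <- sort(unique(x))
  mids <- (head(cuts, -1) + tail(cuts, -1)) / 2       # midpoints of adjacent observed values
  data.frame(split = mids, G = sapply(mids, total_gini_at))
  mids[which.min(sapply(mids, total_gini_at))]        # 77.5, as in the table above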

Now we can state the general algorithm for building a classification tree:

Algorithm:
1) Go through the list of every predictor:
(i) For a categorical predictor, calculate the total Gini index / cross entropy of the split.
(ii) For a continuous predictor, calculate the total Gini index / cross entropy for every
possible split point (the midpoints of adjacent observed values) and take the minimum as the
final Gini index / entropy of the split.
(iii) Use the predictor (and the split point) with the minimum Gini index or cross entropy.
2) Repeat for the new nodes formed, until we get the tree we desire. Stop at nodes where
(i) all observations lead to the same output (no need to split), or
(ii) all predictors are exhausted, or all remaining observations have the same remaining
predictor values, so that no predictor can split the observations (no way to split).
Alternatively, stop when some stopping criteria are reached.

Remarks:
1) Classification trees perform predictor selection: a predictor that is never used for a split plays no role in the model.
2) For a categorical predictor, once it is used for a split at a node, it will not appear again in the
branches under that node: the split at that node completely exhausts its information content.
3) For a continuous predictor, because the split is binary, the same predictor can appear more than
once along a path (recursive binary splitting). For example, in the tree below, MaxHR appears at
depth 3, as well as at depths 6 and 7:


Controlling the Size of a Tree by Pre-Pruning (預先修剪)

A classification tree can be very large in size. A fully grown tree can lead to overfitting (high
variance, low bias), and it is sometimes desirable to limit the tree size. One way to do so is to
define stopping criteria and conduct pre-pruning; some common criteria are shown below:
 controlling the minimum number of observations required for a node split (say, splitting stops if
the number of observations under a node is less than 20, or if it is less than 5% of the total
sample size)
 controlling the maximum depth of the tree
 controlling the maximum number of terminal nodes
 stopping when all possible splits lead to a very small decrease in the Gini index or entropy (a
threshold has to be set in advance)
For example, if we limit the maximum depth of the tree in the golf case to 1, then we will end up
with only one split. The nodes for Sunny and Rain are not pure. The prediction will be based on
the most commonly occurring class of outputs of training observations:

Outlook split (class counts → majority-vote prediction):
  Sunny: 2 Yes / 3 No → No
  Overcast: 4 Yes / 0 No → Yes
  Rain: 3 Yes / 2 No → Yes


We can see that 2 + 2 = 4 observations will be misclassified. The misclassification rate is 4/14 =
28.6%. (If you unfortunately end up with 2 Yes and 2 No for Sunny, then there is no majority
vote. In practice this rarely happens.)
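
In R, pre-pruning thresholds of this kind are typically passed as control parameters when the tree
is grown. The sketch below uses the rpart package as one possible implementation; the data frame
golf and the chosen threshold values are assumptions for illustration:

  library(rpart)
  fit <- rpart(Golf ~ Temperature + Outlook + Humidity + Windy,
               data = golf, method = "class",
               parms = list(split = "gini"),
               control = rpart.control(minsplit = 20,  # minimum observations to attempt a split
                                       maxdepth = 1,   # depth-1 tree, as in the text
                                       cp = 0.01))     # minimum relative improvement to keep a split
  fit                                            # print the fitted tree
  predict(fit, newdata = golf, type = "class")   # majority-vote prediction in each leaf

(Note that rpart grows binary splits, so a categorical predictor such as Outlook would be split into
two groups of levels rather than the three branches pictured above.)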

In Section 3, we will also investigate a method that prunes a large tree back to a smaller scale.

? Example 1. 3 [SRM Sample #9]


A classification tree is being constructed to predict if an insurance policy will lapse. A random
sample of 100 policies contains 30 that lapsed. You are considering the two splits:
Split 1: One node has 20 observations with 12 lapses and one node has 80 observations with
18 lapses.
Split 2: One node has 10 observations with 8 lapses and one node has 90 observations with 22
lapses.
Determine which of the following statements is/are true.
I. Split 1 is preferred based on the total Gini index.
II. Split 1 is preferred based on the total entropy.
III. Split 1 is preferred based on having fewer classification errors.

(A) I only.
(B) II only.
(C) III only.
(D) I, II and III.
(E) The correct answer is not given by (A), (B), (C) and (D).


2 Regression Trees

In this case, the response is continuous. For a continuous predictor x, we perform a binary
split at a split point t by considering the objective function

SSE = \sum_{i:\,x_i < t} (y_i - \hat{y}_{R_1})^2 + \sum_{i:\,x_i \ge t} (y_i - \hat{y}_{R_2})^2,

where R1 = {X : X < t}, R2 = {X : X ≥ t}, and ŷ_{R_j} is the mean response of all training
observations in R_j. This is similar to the K-means clustering objective function. The best split
point t, i.e. the one that gives the smallest SSE, is determined by exhaustive search.

For a categorical predictor, we consider the value of

\sum_{t} \sum_{i:\,x_i = t} (y_i - \hat{y}_{R_t})^2,

where the outer sum runs over the levels t of the predictor, R_t = {X : X = t}, and ŷ_{R_t} is the
mean response of all training observations in R_t.

When we decide on the predictor for the first split, we again go through the list of all predictors
and pick the one (and, for a continuous predictor, the associated split point) that minimizes the SSE.
Then we continue, looking for the best predictor (and the best split point) that minimizes the SSE
within each of the regions formed by the previous split, just like building a
classification tree. The only difference is that the prediction in each terminal node is the average
of the outputs of the observations falling into that node.
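
To make the binary-splitting search concrete, here is a small R sketch for a single continuous
predictor; the toy data and the function name are made up for illustration:

  # exhaustive search for the split point of one continuous predictor
  # that minimizes the SSE of a single binary split
  best_split <- function(x, y) {
    cuts <- sort(unique(x))
    mids <- (head(cuts, -1) + tail(cuts, -1)) / 2
    sse <- sapply(mids, function(t) {
      left <- y[x < t]; right <- y[x >= t]
      sum((left - mean(left))^2) + sum((right - mean(right))^2)
    })
    list(split = mids[which.min(sse)], SSE = min(sse))
  }

  set.seed(1)
  x <- runif(15, 0, 10)
  y <- ifelse(x < 4, 1, 3) + rnorm(15, sd = 0.3)   # a step-shaped relation
  best_split(x, y)                                  # the chosen split point should be near 4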

? Example 2. 1
Consider the following data set with 15 observations in the companion Excel worksheet.
Determine the first 3 splits if the stopping criterion is a maximum depth of 2. What are the
predicted values for the 4 terminal nodes?


Limitations of CART

1) Form of f

The (X1, X2) space for the regression tree in Example 2.1 looks like this:

[Figure: left – the (X1, X2) plane partitioned into 4 rectangles by the splits of Example 2.1
(split values 4.005, 8.705 and 11.81); right – a partition into non-rectangular regions that no
sequence of tree splits can produce.]
Notice that the space is split into 4 rectangles. In general, such splitting can only result in a
collection of rectangles: it is impossible to get something like the partition on the top right!
What does this mean practically? Consider the population shown in the figure on the left:

[Figure: left – a population whose response changes across an oblique boundary in the (X1, X2)
plane (regions labelled y0 and y1); right – the many axis-parallel splits a tree needs to
approximate that boundary.]

A linear model fits data from such a population well. A regression tree can only approximate the
relation by using lots of splits. (This problem, though, can be solved by first using PCA to rotate
the data!) Now let us consider the following population:
[Figure: a population in which the (X1, X2) plane is divided into three axis-parallel rectangular
regions, labelled y0, y1 and y2.]

A regression tree works well with only a few splits for data coming from such a population. A
linear model would fit very poorly.


If the underlying population can be well approximated by a linear model, linear regression will
outperform a non-parametric regression tree. However, if there is a highly non-linear and
complex relationship between x and y, a tree will outperform a linear model.

2) CART is a Greedy Algorithm

The technique of predictor selection and splitting described in the previous two sections is a
greedy algorithm (貪婪演算法): it chooses the best split available at the current step and moves
forward until one of the specified stopping conditions is reached. In other words, it looks for a
local optimum at each step instead of searching for a global optimum. As an analogy, consider
driving.

There are 2 lanes:

 a lane with cars moving at 80 km h⁻¹
 a lane with trucks moving at 30 km h⁻¹
At this moment, you have two choices:
 take a left and quickly overtake the 2 green cars ahead of you;
 keep moving in the present lane.
The first choice is a local optimum because you can immediately overtake the green cars ahead
and start moving at a speed above 80 km h⁻¹. However, it is not a global optimum, because you
will very soon be blocked by the trucks. The same happens in tree building.

3) The Disaster of Early Stopping (Optional)

Consider the classical XOR classification problem, where we have two predictors x1 and x2, with
4 observations:
x1 x2 y
0 0 0
0 1 1
1 0 1
1 1 0


The first split can be based on either of the predictors, and the resulting total G (or D) is the same
as before the split. If we set a stopping rule that stops when the decrease in G (or D) is
less than a threshold, then the first split will not be performed. Note, however, that the second
split would give two completely pure nodes.

To conclude: a stopping condition may be met too early. It is better to grow the tree fully and then
remove nodes.

4) Sensitivity to Data

Consider the building of a classification tree with the following 16 observations using cross
entropy:

x1 x2 y x1 x2 y
0.15 0.83 0 0.10 0.29 1
0.09 0.55 0 0.08 0.15 1
0.29 0.35 0 0.23 0.16 1
0.38 0.70 0 0.70 0.19 1
0.52 0.48 0 0.62 0.47 1
0.57 0.73 0 0.91 0.27 1
0.73 0.75 0 0.65 0.90 1
0.47 0.06 0 0.75 0.36 1

The resulting classification tree is shown on the left:

[Figure: two classification trees side by side. The left tree, grown on the original data, first
splits on x1 at 0.6 and then on x2 (at 0.32 and 0.61), with further splits on x1; the right tree,
grown after the perturbation described below, first splits on x2 at 0.33 and has a completely
different structure.]
However, if you change the last x2 from 0.36 to 0.32, the classification tree becomes the one
shown on the right, which is totally different from the one on the left. The same happens for
regression trees. Trees can be very non-robust! Recall that in Chapter 1 (p.8) we mentioned that
the variance of the prediction refers to the amount by which f̂ would change if we estimated it
using a different training data set. Ideally, f̂ should not vary too much between training sets.
You can treat the change from 0.36 to 0.32 as the replacement of one single data point. In this
sense, the variance of the prediction can be huge for trees!


3 Post-Pruning

In this section we discuss one method to limit the size of a tree. We treat the tree size (as
measured by the number of terminal nodes) as a tuning parameter. Here is some notation:

T0 : a very large tree grown on the entire dataset
(the splitting process only stops when some minimum node size, say 5, is reached)
T : a subtree T ⊂ T0, that is, a tree that can be obtained by pruning T0, i.e. by collapsing any
number of its internal nodes
|T| : the number of terminal nodes in T
α : a positive number that serves as the tuning parameter, called the complexity parameter

We consider the cost complexity pruning (aka weakest link pruning) procedure:

For the regression problem, define the cost-complexity criterion:


C_\alpha(T) = \sum_{m=1}^{|T|} \sum_{x_i \in R_m} (y_i - \hat{y}_{R_m})^2 + \alpha\,|T|,

where the collection of rectangular sets {R1, R2, …, R|T|} corresponds to the terminal nodes of T.

The complexity parameter controls the trade-off between the subtree's size and its goodness of
fit to the training data. Small values of α result in bigger trees. If α = 0, then Cα(T) is just the
usual SSE of an unpruned tree, and the tree that minimizes C0(T) can be chosen to be the
fully grown tree T0. For any fixed α > 0, it can be shown that there is a unique smallest subtree Tα
that minimizes Cα(T). It can also be shown that the subtrees Tα are nested: as α increases,
branches are collapsed in a nested and predictable sequence.

While the above is a very complicated procedure, the R programming language has a function
called prune.tree in the R library tree that returns the best pruned tree when given either the
value of α or the number of terminal nodes of the pruned tree. So in practice, for any T0, the
series of pairs {α, Tα} can be obtained very easily.


After we have arrived at the series of best pruned trees {Tα}α≥0, we can use cross-validation as
illustrated in Section 4 of Chapter 1 to estimate the test MSE of each of these trees and pick the
final best one, namely the one that minimizes the estimated test MSE. In more detail, the process is as follows:

Algorithm:
1) Divide the training dataset into K folds.
2) For each k = 1, 2, …, K:
(i) Use recursive binary splitting to grow a large tree on all but the k-th fold of the dataset.
(ii) Obtain the series of best pruned subtrees of this large tree as a function of α.
(iii) Evaluate the mean squared prediction error on the data in the left-out k-th fold, as a
function of α.
3) For every α, average the K MSEs computed in 2). This gives an estimate of the test
MSE for each α.

The R function cv.tree in library tree performs the above and reports the series of test
MSEs based on the input K (whose default value is 10).

The whole process also applies to classification tree. The objective to be minimized can be based
on G, D or the misclassification rate E. R can perform cross-validation based on either D or E.
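
A minimal usage sketch of these two functions is given below; the tree object fit0 and the data
frame train are assumptions for illustration:

  library(tree)
  fit0 <- tree(y ~ ., data = train)                 # grow a large tree T0
  cvres <- cv.tree(fit0, K = 10)                    # CV error along the pruning sequence
  best_size <- cvres$size[which.min(cvres$dev)]     # number of leaves with the smallest CV error
  pruned <- prune.tree(fit0, best = best_size)      # best pruned subtree of that size
  plot(pruned); text(pruned)
  # for a classification tree, cv.tree(fit0, FUN = prune.misclass) cross-validates on E instead of D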

The following is the best pruned tree based on the T0 on p.41.


4 Ensemble Methods

Ensemble methods refer to methods that combine the predictions of many models. They
are generally applicable to most models in machine learning. In this section we use CART
to illustrate how three ensemble methods can be applied to improve prediction accuracy.

Bagging (Bootstrap aggregation)

Bagging was one of the earliest ensemble techniques.

Before we discuss this method, you need to know the meaning of bootstrap sampling: Given a
set of n observations, we create a new random sample of size n by drawing n observations from
the original observations with replacement.

For example, for n  10, we can draw three bootstrap sample as follows:

Original data 1 2 3 4 5 6 7 8 9 10
Round 1 5 6 9 4 2 6 7 6 3 2
Round 2 7 10 9 4 8 1 8 8 9 10
Round 3 7 1 6 4 5 8 5 1 2 3

Bagging involves iterating two stages B times.

Algorithm:
For i = 1 to B do
1) Generate a bootstrap sample of the original n observations.
2) Train an unpruned tree y = f̂*i(x) based on the bootstrap sample.
End

For regression problems, a prediction at x is obtained by averaging the results from the B trees:

\hat{y} = \frac{1}{B}\sum_{i=1}^{B} \hat{f}^{*i}(x).

The averaging reduces the variance of the final prediction. For classification problems, the final
prediction is the majority vote of the B classifications.

Remarks:

1) As B increases, the variance of the prediction decreases. In practice, the accuracy
improvement decreases exponentially as B increases; most of the improvement can usually be
obtained with B ≤ 10. If the test MSE is still considered not small enough at B = 50, a more
powerful variance reduction method should be used.


2) When bagging is used, we need not use cross-validation to estimate the test MSE. Each
original observation has a probability of (1 − 1/n)^n of not being included in a particular
bootstrap sample. So for each bootstrap sample, the expected proportion of the original data
that is not included is (1 − 1/n)^n. For large n, this is approximately e⁻¹ ≈ 0.3679. Hence, for each
tree, on average about 37% of the original observations were not used to train it. These unused
observations are called out-of-bag (OOB) samples and can be used to estimate the test MSE.

3) Although the B bootstrap samples are drawn independently, if there is one strong predictor in
the data set, along with a number of other moderately strong predictors, then most of the
bagged trees will use this strong predictor in the top split. The bagged trees will therefore
look similar and the resulting predictions will be highly correlated.

Random Forests

In bagging, randomness is introduced through bootstrap sampling. Random forests also use
bootstrap sampling, but add another layer of randomness: random input vectors.

The random-vector method means that at each node, the best split is chosen from a random
sample of m predictors instead of the full set of p predictors. The set of m candidate predictors varies from
node to node. Typically we choose m ≈ √p. The random-vector method decorrelates the bagged
trees so that the variance can be further reduced.
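
Both bagging and random forests can be fitted with, for example, the randomForest package in R;
setting mtry = p reproduces bagging. The data frame train and the response name y are
assumptions for illustration:

  library(randomForest)
  p <- ncol(train) - 1                                                           # number of predictors
  bag <- randomForest(y ~ ., data = train, mtry = p, ntree = 500)                # bagging
  rf  <- randomForest(y ~ ., data = train, mtry = floor(sqrt(p)), ntree = 500)   # random forest
  bag$mse[500]      # OOB estimate of the test MSE of the bagged model (regression case)
  importance(rf)    # variable importance measures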

Boosting

In this method, instead of growing a series of trees based on different bootstrap samples, we
build the trees in a sequential manner: each new tree depends on all the previously grown trees.
Boosting (as its name suggests) involves building B trees, each of which has only a few splits.
Given the current model, we fit a tree to the current residuals rather than to the outcome. We then
add this new tree to the fitted function in order to update the residuals. By fitting only small
trees in each step, we boost the performance slowly so that overfitting does not occur.

There are 3 tuning parameters for boosting:

B : The number of trees.

d : The maximum depth of each tree. Often d = 1, and the resulting tree, which has one split, is
called a stump.
λ : The shrinkage parameter, a positive constant, with typical values 0.01 or 0.001. A very small λ
can require a very large B to achieve good performance.


The following is the process of boosting regression trees.

Algorithm:
1) Set f̂(x) = 0 and r_i = y_i for all i.
2) For b = 1, 2, …, B:
(i) Fit a tree y = f̂^b(x) with d splits to the data set {x_i → r_i}.

(ii) Update f̂ by adding a shrunken version of the new tree: f̂(x) ← f̂(x) + λ f̂^b(x).

(iii) Update the residuals: r_i ← r_i − λ f̂^b(x_i).
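
The loop can be written out directly. Below is a sketch using rpart stumps; the data frame train
with response y, and the default parameter values, are assumptions for illustration:

  library(rpart)
  boost_trees <- function(train, B = 1000, d = 1, lambda = 0.01) {
    r <- train$y                            # current residuals
    trees <- vector("list", B)
    for (b in 1:B) {
      dat <- train; dat$y <- r              # fit the next small tree to the residuals
      trees[[b]] <- rpart(y ~ ., data = dat,
                          control = rpart.control(maxdepth = d, cp = 0, minsplit = 2))
      r <- r - lambda * predict(trees[[b]], newdata = dat)
    }
    trees
  }
  # the boosted prediction is the sum of the shrunken trees
  predict_boost <- function(trees, newdata, lambda = 0.01) {
    rowSums(sapply(trees, function(tr) lambda * predict(tr, newdata = newdata)))
  }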

Classification trees can also be boosted in a slightly more complicated way, and the details are
omitted. You can refer to section 14.5 of Applied Predictive Modeling (algorithm 14.2) for the
adaptive boosting method.


CHAPTER 4 Regularization in Linear Regression

Related Readings …

An Introduction to Statistical Learning: with Applications in R


Sections 6.1 – 6.3

Applied Predictive Modeling


Sections 6.3 – 6.4

Learning Objectives
Regularization methods including ridge and LASSO regression, principal
component regression and partial least squares regression

In machine learning, regularization is the process of adding constraints to objective functions in


order to solve an ill-posed problem or to prevent overfitting.

1 Review of Predictor Selection

In this section we review the classical predictor selection problem. In data science this is usually
called “feature selection” or “feature engineering”. Suppose there are p predictors x1, x2, … xp.
How can we select the predictor(s) to be included in a linear regression model? Traditionally
there are four methods:
 Best subset selection
 Forward stepwise selection
 Backward elimination (called backward stepwise selection in the text)
 Stepwise selection
When we rank models with different numbers of predictors, we cannot use R² or the training MSE because,
for a series of nested models, R² increases and the training MSE decreases as more predictors are added.


Algorithm (Best subset selection):

1) Let M0 denote the null model, which contains only the intercept term but no predictors.
2) For i = 1, 2, …, p:
(a) Fit all pCi models that contain exactly i predictors.
(b) Pick the best among these models, and call it Mi. Here the best is defined as having
the smallest SSE or, equivalently, the largest R².
3) Select a single best model from among M0, M1, …, Mp. Here the best can be defined as
having the smallest estimated test MSE, Mallow's Cp, AIC or BIC, or the largest adjusted R².

Remarks:
1) The AIC (Akaike information criterion) is based on the likelihood function of the model. In
its general form, AIC = 2k − 2l for a model with k parameters and maximized log-likelihood l.
The smaller the AIC, the better the model.
It can be shown that, under the assumptions of the Gauss-Markov theorem,

AIC = constant + \frac{1}{n\hat{\sigma}^2}\left(SSE + 2i\hat{\sigma}^2\right)

for models with i predictors. Here \hat{\sigma}^2 is the estimated variance of the error term for the full
model.

2) Mallow's Cp is defined as C_p = \frac{1}{n}\left(SSE + 2i\hat{\sigma}^2\right). Minimizing AIC is the same as minimizing
Cp. Also, it can be shown that Cp is an unbiased estimator of the test MSE.

3) The BIC (Bayesian information criterion) is based on a Bayesian argument. In its general
form, BIC = (ln n)k − 2l for a model with k parameters and maximized log-likelihood l. The
smaller the BIC, the better the model.
It can be shown that for a linear regression model with i predictors,

BIC = constant + \frac{1}{n}\left[SSE + (\ln n)\,i\hat{\sigma}^2\right].

Since ln n > 2 for n > 7 (which nearly always holds), BIC generally penalizes models with
more predictors more heavily than AIC or Cp.

4) Adjusted R² is defined as 1 − \frac{SSE/(n-i-1)}{TSS/(n-1)}, where TSS = \sum (y_i - \bar{y})^2. Though popular,
there is no rigorous statistical theory that supports its use to rank models.

5) To perform best subset selection, a total of 2^p models has to be considered! Say p = 20.
Then 2^p = 1,048,576, which is undoable.


Algorithm (Forward stepwise selection):

1) Let M0 denote the null model, which contains only the intercept term but no predictors.
2) For i = 0, 1, …, p − 1:
(a) Fit all p − i models that augment the predictors in Mi with one more predictor.
(b) Pick the best among these p − i models, and call it Mi+1. Here the best is defined as
having the smallest SSE or, equivalently, the largest R².
3) Select a single best model from among M0, M1, …, Mp. Here the best can be defined as
having the smallest estimated test MSE, Mallow's Cp, AIC or BIC, or the largest adjusted R².

Remarks:
1) Forward stepwise selection involves fitting 1 + \sum_{i=0}^{p-1}(p-i) = 1 + \frac{p(p+1)}{2} models. For p =
20, we only have to fit 211 models.
2) However, there is no guarantee that the best possible model out of all 2^p models can be
obtained using this greedy algorithm, simply because it is "too greedy". Say p = 3, the best
one-predictor model is the one with x1, and the best two-predictor model is the one with x2 and x3.
Then forward stepwise selection will fail to find the best two-predictor model.
3) If p ≥ n, then forward stepwise selection has to stop at Mn−1.

? Example 1. 1
Consider the credit data set in Section 3.3 of the textbook (also presented in the companion Excel
worksheet). Perform forward stepwise selection based on BIC and show that the model with the
four predictors cards, income, student and limit is better than the model favored by forward
stepwise selection.


Algorithm (Backward elimination):

1) Let Mp denote the full model, which contains the intercept term and all p predictors.
2) For k = p, p − 1, …, 1:
(a) Fit all k models that contain all but one of the predictors in Mk.
(b) Pick the best among these k models, and call it Mk−1. Here the best is defined as
having the smallest SSE or, equivalently, the largest R².
3) Select a single best model from among M0, M1, …, Mp. Here the best can be defined as
having the smallest estimated test MSE, Mallow's Cp, AIC or BIC, or the largest adjusted R².

Remarks:
1) Backward elimination again involves fitting 1 + p(p + 1)/2 models.
2) Backward elimination is again a greedy algorithm. It may not yield the best model.
3) If p > n, then backward elimination does not work (the full model cannot be fitted).

Algorithm (Stepwise selection):

1) Let M0 denote the null model, which contains only the intercept term but no predictors.
2) For i = 0, 1, …, p − 1:
(a) Fit all models that augment the predictors in Mi with one more predictor that has not
been dropped previously.
(b) Pick the best among these models, and call it Mi+1. Here the best is defined as having
the smallest SSE or, equivalently, the largest R².
(c) For i > 0, see whether the SSE increases appreciably when any of the other predictors is
dropped. Drop those for which the SSE only increases slightly.
Alternatively, compute the t-statistic for each predictor and drop those with insignificant
t-statistics.
3) Select a single best model from among M0, M1, …, Mp. Here the best can be defined as
having the smallest estimated test MSE, Mallow's Cp, AIC or BIC, or the largest adjusted R².

Remark:
Such a bidirectional search is an attempt to search a larger space of possible models than
forward stepwise selection and backward elimination do. However, it is known empirically that this
procedure tends to overfit.
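
All four procedures are available in R through, for example, the leaps package; the data frame
credit and the response Balance below are assumptions for illustration:

  library(leaps)
  # exhaustive (best subset), forward or backward search
  best <- regsubsets(Balance ~ ., data = credit, nvmax = 10, method = "exhaustive")
  fwd  <- regsubsets(Balance ~ ., data = credit, nvmax = 10, method = "forward")
  bwd  <- regsubsets(Balance ~ ., data = credit, nvmax = 10, method = "backward")
  s <- summary(best)
  which.min(s$bic)                    # model size favoured by BIC
  which.min(s$cp)                     # model size favoured by Mallow's Cp
  which.max(s$adjr2)                  # model size favoured by adjusted R-squared
  coef(best, id = which.min(s$bic))   # coefficients of the BIC-favoured model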


2 Shrinkage Methods

The automatic subset selection methods described in Section 1 are known to perform poorly, both in terms
of selecting the true model and in terms of estimating the regression coefficients. While they remain
dominant in medical research, they have largely been abandoned in statistics and data science.

In classical regression, the regression coefficients are unconstrained. They can explode and are
hence susceptible to very high variance (recall the polynomial regression example on p.8). Also,
it is impossible to obtain a regression coefficient that is exactly zero. Shrinkage is an approach that involves
fitting a model with all p predictors, but with the estimated coefficients regularized: they are
shrunken towards zero relative to the OLS estimates. Depending on the form of the shrinkage
penalty, some of the coefficients may be estimated to be exactly 0.

Ridge regression / L2 regularization

This is the most commonly used method of regularization. Let the predictors x and the output y
be mean-centered. We introduce the penalized SSE as the objective function for minimization:

PSSE = \sum_{i=1}^{n} (y_i - \beta_1 x_{i1} - \cdots - \beta_p x_{ip})^2 + \lambda \sum_{j=1}^{p} \beta_j^2 = SSE + \lambda \lVert\boldsymbol\beta\rVert_2^2,

where λ ≥ 0 is a tuning parameter. The first part seeks regression coefficients that fit the data well, while
the second part is called a shrinkage penalty. The value of λ controls how much penalty is
added to the regression coefficients when they are non-zero. As λ grows, the impact of the
shrinkage penalty grows, and the regression coefficients approach zero.
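
In R, ridge regression is available through, for example, the glmnet package, which standardizes
the predictors by default; the model-matrix construction and the λ values below are assumptions
for illustration:

  library(glmnet)
  x <- model.matrix(Balance ~ ., data = credit)[, -1]   # drop the intercept column
  y <- credit$Balance
  grid <- 10^seq(4, -2, length.out = 100)               # a grid of lambda values
  ridge <- glmnet(x, y, alpha = 0, lambda = grid)       # alpha = 0 gives the ridge penalty
  coef(ridge, s = 50)                                    # coefficients at lambda = 50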

Remarks:
1) (Out of Exam SRM syllabus) Parameter estimate:
Let the design matrix be

X = \begin{pmatrix} x_{11} & x_{12} & \cdots & x_{1p} \\ x_{21} & x_{22} & \cdots & x_{2p} \\ \vdots & \vdots & & \vdots \\ x_{n1} & x_{n2} & \cdots & x_{np} \end{pmatrix}.

The OLS estimate is \hat{\boldsymbol\beta} = (X'X)^{-1}X'y, which does not always exist. The ridge regression estimate is

\hat{\boldsymbol\beta}(\lambda) = (X'X + \lambda I_{p \times p})^{-1}X'y = WX'y.

(This is similar to the OLS estimate, but with a "ridge" added down the diagonal.)

The inclusion of λ makes W well defined even if X'X is not invertible (when multi-collinearity
exists or p > n). This was the original motivation for ridge regression.
In practice, it is hard to compute W directly when p is huge (p ≥ 10,000 is not uncommon).
However, a matrix technique called singular value decomposition can
compute \hat{\boldsymbol\beta}(\lambda) extremely efficiently without any matrix inversion.


? Example 2. 1 (Textbook p.225)


Let the predictors be standardized and the design matrix be X = Ip×p. What are the OLS and ridge
regression estimates?

2) Bias-variance tradeoff:
It can be shown that (out of Exam SRM syllabus)
Bias[\hat{\boldsymbol\beta}(\lambda)] = -\lambda W\boldsymbol\beta and Var[\hat{\boldsymbol\beta}(\lambda)] = \sigma^2 WX'XW.
Based on these, we can prove that the bias increases (in absolute value) with λ, and
Var[β̂i(λ)] ≤ Var(β̂i) for λ ≥ 0. So, ridge regression reduces variance at the expense of
introducing bias.
Actually, a more surprising result is that there always exists a λ > 0 such that MSE[β̂(λ)] <
MSE(β̂)! We can always do better than OLS by shrinking towards zero.

3) Scaling and standardization:

Why do we mean-center x and y? By doing so we can get rid of the intercept term β0 in the
regression; it can easily be added back after obtaining the result for the centered version:

\hat\beta_0 = \bar{y} - \sum_{i=1}^{p} \hat\beta_i(\lambda)\,\bar{x}_i.

Or you can simply consider the penalized SSE in the form

PSSE = \sum_{i=1}^{n} (y_i - \beta_0 - \beta_1 x_{i1} - \cdots - \beta_p x_{ip})^2 + \lambda \sum_{j=1}^{p} \beta_j^2

based on the raw data, which gives the same solution for β1, β2, …, βp. Note that we do not
shrink the intercept term because it is just a measure of the mean level of the data.
As in PCA and clustering, the scaling of the predictors plays a role in the final estimates:
multiplying xj by a constant c does not simply scale β̂j by 1/c, because of the shrinkage
penalty. Put another way, the estimation problem is not scale-invariant. Therefore, it is best
to standardize the predictors and center y before performing ridge regression.

? Example 2. 2
Consider again the credit data set in Section 3.3 of the textbook. Perform ridge regressions for 
 0, 1, 50 and 100 using the Real Stat Excel add-in function RidgeCoeff. Do that for both the
raw data and also the standardized data. Compare the coefficients and also the L2 norm for ()
for the case of standardized data.


Further Remarks (Advanced!):

4) Geometric interpretation:
An equivalent formulation to minimizing the PSSE is to minimize the usual SSE subject to the
constraint \sum_{j=1}^{p} \beta_j^2 \le s. (Here s is decreasing in λ; if λ = 0, then s = ∞.) The equivalence is due
to the method of Lagrange multipliers. To understand what this means geometrically, let p = 2. The SSE is a
function of β1 and β2, and it can be shown (using the properties of conic sections) that the locus
of the equation SSE = c is an ellipse. The OLS estimate is the common center of the family of
ellipses SSE = t, where t is a free parameter indexing the family. With the constraint
β1² + β2² ≤ s, the solution must lie within a circle of radius s^{1/2} centered at the origin. The optimal
solution is the point at which the circle and an ellipse of the family touch each other.

2

OLS estimate

Ridge regression estimate

1
s1/2

When p  3, the ellipses become ellipsoids, and the blue circle becomes a sphere.

5) Connection to Bayesian linear regression:

Ridge regression has a close connection to Bayesian linear regression, which treats the
parameter vector β = (β1, β2, …, βp) as random, keeping X and y fixed. With the prior
distribution
β ~ Np(0, τ² Ip×p), where τ² = σ²/λ,
it can be shown that the ridge regression estimate maximizes the posterior density
f(β | X, y) ∝ f(y | X, β) f(β),
and is hence the posterior mode of β. In fact, the ridge regression estimate is also the
posterior mean of β, because the posterior distribution is multivariate normal. The proof of
this can be found in the appendix of this chapter (it is a textbook exercise).


Lasso regression / L1 regularization

The full name of Lasso is "Least Absolute Shrinkage and Selection Operator". We introduce the
following PSSE:

PSSE = \sum_{i=1}^{n} (y_i - \beta_1 x_{i1} - \cdots - \beta_p x_{ip})^2 + \lambda \sum_{j=1}^{p} |\beta_j| = SSE + \lambda \lVert\boldsymbol\beta\rVert_1,

where λ ≥ 0 is a tuning parameter. This method was first introduced in 1986 in physics, and
rediscovered and popularized in 1996 by Tibshirani.

Remarks:
1) Again, x and y in the formulation above are mean-centered. Alternatively, one can consider

PSSE = \sum_{i=1}^{n} (y_i - \beta_0 - \beta_1 x_{i1} - \cdots - \beta_p x_{ip})^2 + \lambda \sum_{j=1}^{p} |\beta_j|.

The scaling of the predictors matters, and it is customary to standardize them.

2) There is no closed-form solution for the Lasso estimates except in toy examples.

? Example 2. 1 (continued)
Let the predictors be standardized and the design matrix be X = Ip×p. Show that

\hat\beta_j(\lambda) = \begin{cases} y_j - \lambda/2 & \text{for } y_j > \lambda/2, \\ 0 & \text{for } -\lambda/2 \le y_j \le \lambda/2, \\ y_j + \lambda/2 & \text{for } y_j < -\lambda/2. \end{cases}

Plot the OLS, ridge and Lasso regression estimates of the coefficient of xj on the same graph (as
a function of yj) and compare the results. This example illustrates the soft thresholding of the Lasso.


3) (Advanced) An equivalent formulation to minimizing the PSSE is to minimize the SSE subject to
the constraint \sum_{j=1}^{p} |\beta_j| \le s. Let p = 2; then the constraint becomes |β1| + |β2| ≤ s, a diamond.
Geometrically,

[Figure: in the (β1, β2) plane, the family of SSE ellipses centered at the OLS estimate and the
diamond |β1| + |β2| ≤ s; the Lasso regression estimate is the point where the smallest attainable
ellipse touches the diamond, often at a corner on one of the axes.]

It turns out that the Lasso constraint region has corners on each of the two axes, and hence the ellipse
will often touch the constraint region at an axis. When this occurs, one of the β's
equals zero. In higher dimensions, the diamond becomes a polytope and many of the
coefficients may equal zero simultaneously. This means the solution is sparse, and the Lasso
conducts feature selection.

4) Lasso regression also has a close connection to Bayesian linear regression. With the prior
distribution
βi ~ Lap(0, b), where b = 2σ²/λ, for i = 1, 2, …, p,
and β1, β2, …, βp mutually independent, it can be shown that the Lasso regression
estimate maximizes the posterior density
f(β | X, y) ∝ f(y | X, β) f(β),
and is hence the posterior mode of β. The proof of this result is similar to the one shown in
the appendix for ridge regression. In this case, though, the Lasso regression estimate is
not the posterior mean of β.

Note: the Laplace distribution Lap(μ, b) has density f(x) = \frac{1}{2b}\exp\left(-\frac{|x-\mu|}{b}\right) for real x.


? Example 2. 3
Consider again the credit data set in Section 3.3 of the textbook. Perform Lasso regressions for
various values of λ using the VBA user-defined function Lasso_Regression (note: this
function does not report standard errors) on the standardized data. Compare the coefficients
β̂(λ) with the ridge regression coefficients.

Determining the Shrinkage Parameter

We can use cross-validation to select the optimal value of λ in ridge or Lasso regression. The test
MSE is estimated for a range of values of λ, and the value of λ that gives the smallest estimated test MSE
is the optimal λ. A final model based on the optimal λ is then fitted using all observations.

(Out of Exam SRM syllabus) For ridge regression, there is again a shortcut formula to compute
the LOOCV estimate of the test MSE by fitting only one full model. Let H = XWX'. Then

CV_{(n)} = \frac{1}{n}\sum_{i=1}^{n}\left(\frac{y_i - \hat{y}_i}{1 - h_i}\right)^2 \approx \frac{1}{n}\sum_{i=1}^{n}\left(\frac{y_i - \hat{y}_i}{1 - \mathrm{tr}(H)/n}\right)^2.

The approximation (called the generalized cross-validation error, aka GCV error) is very often used
because hi = Hii is hard to compute for large p, while there are formulas that give tr(H) without
computing H.
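
In practice, the search over λ is usually done by cross-validation directly; with the glmnet
package, for instance (x and y as constructed in the earlier ridge sketch, both assumed):

  library(glmnet)
  cv_ridge <- cv.glmnet(x, y, alpha = 0, nfolds = 10)   # 10-fold CV over a grid of lambda
  cv_lasso <- cv.glmnet(x, y, alpha = 1, nfolds = 10)
  cv_ridge$lambda.min                                   # lambda with the smallest CV error
  coef(cv_lasso, s = "lambda.min")                      # coefficients refitted at the optimal lambda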


3 Principal Component and Partial Least Squares Regression

Principal Component Regression

In principal component regression (PCR), the standardized predictors undergo the
transformation defined by the loading vectors
φm = (φm1, φm2, …, φmp), for m = 1, 2, …, p,
to give p series of z scores. For the n observations, the scores are
z1 = (z11, z12, …, z1M, …, z1p),
z2 = (z21, z22, …, z2M, …, z2p),
…
zn = (zn1, zn2, …, znM, …, znp),
where (recall) z_{im} = \sum_{j=1}^{p} \varphi_{mj} x_{ij}. We use the first M PCs to act as regressors on y in

y_i = \bar{y} + \theta_1 z_{i1} + \theta_2 z_{i2} + \cdots + \theta_M z_{iM} + \varepsilon_i, for i = 1, 2, …, n.

We can write each z_{im} in terms of the original (standardized) predictors and rearrange the above as

y_i = \bar{y} + \sum_{m=1}^{M}\theta_m z_{im} + \varepsilon_i = \bar{y} + \sum_{m=1}^{M}\theta_m \sum_{j=1}^{p}\varphi_{mj} x_{ij} + \varepsilon_i = \bar{y} + \sum_{j=1}^{p}\beta_j x_{ij} + \varepsilon_i,

where \beta_j = \sum_{m=1}^{M}\theta_m \varphi_{mj}. This set of p constraints is the regularization in PCR. Such a constraint on the
form of the coefficients has the potential to bias the coefficient estimates unless M = p. However,
when p is much greater than n, using an M that is much smaller than p can significantly reduce
the variance and avoid overfitting, at the expense of increased bias.

Remarks:
1) PCR is not a feature selection method, because each of the M components is a linear
combination of all p original predictors. Hence PCR does not give as interpretable a model
as Lasso regression.
2) The methods to determine M were discussed in Chapter 1. An alternative method is to treat M
as a tuning parameter and choose it using cross-validation.
3) Whether PCR works well depends on whether the variability of the original predictors can
be summarized by a small number of PCs, and whether the directions in which the original
predictors show the most variation are the directions that are associated with y.
4) Apart from reducing dimensionality, the most important use of PCR is to resolve the problem
of multi-collinearity in multiple regression. When multi-collinearity occurs, the OLS
estimates are unbiased but suffer from large standard errors. By transforming the predictors,
PCR eliminates the problem of multi-collinearity because the columns of z scores are uncorrelated.


5) PCR has a close relationship with ridge regression. In fact, one can even think of ridge
regression as a continuous version of PCR. The details are, however, out of the
scope of this course.
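
In R, PCR is available through, for example, the pls package; the data frame dat and the number
of components retained below are assumptions for illustration:

  library(pls)
  set.seed(1)
  fit_pcr <- pcr(y ~ ., data = dat, scale = TRUE, validation = "CV")  # cross-validated PCR fit
  validationplot(fit_pcr, val.type = "MSEP")   # CV error against the number of components
  summary(fit_pcr)                              # variance explained and CV results
  predict(fit_pcr, newdata = dat, ncomp = 2)    # predictions using the first 2 components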

? Example 3. 1
Consider the data set in the companion worksheet. Significant multi-collinearity exists in the
three predictors. Conduct a PCR analysis.

Partial Least Squares Regression

Partial least squares regression (PLSR) is a technique that originated in economics but was
popularized by applications in computational chemistry. It is a regression method that
takes into account the latent structure in both x and y: the latent structure corresponding to the
most variation in y is extracted and explained by the latent structure of x that explains it best.

PLSR can be thought of as a supervised version of PCR. In Remark 3 above, we noted that
the loading vectors φ1, φ2, …, φM are determined in an unsupervised way:
the PCs explain the observed variability in the predictor variables without considering the
response variable at all. PLSR, on the other hand, takes the response variable into account, and
therefore often leads to models that are able to fit the response variable with fewer components.
In PLSR, the loading vectors are obtained such that the covariance of y with the resulting linear
combination of x (instead of the variance of the linear combination of x) is maximized.

The textbook does not give the detailed procedure for fitting PLSR because the modeling
assumptions of PLSR are quite complicated. Here I follow the textbook's heuristic description of
the "algorithm" for determining the "factor loadings". The actual implementation is different and
is much quicker than this heuristic algorithm. Both the predictors and the response are assumed
to be mean-centered, or standardized.


Algorithm (for illustration of the underlying principle of PLSR only):

1) To find the first loading vector φ1 = (φ11, φ12, …, φ1p), set
φ1j = the slope of the simple regression of y on the j-th predictor, for j = 1, 2, …, p.
The first series of z scores of all n observations can then be computed from
z_{i1} = \sum_{j=1}^{p} \varphi_{1j} x_{ij}.

2) To find the second loading vector φ2,
(i) Run p + 1 regressions, each with z1 as the only regressor: the first predictor on z1, the
second predictor on z1, …, and finally y on z1. The residuals can be interpreted as the
remaining information that has not been explained by the first PLS direction.
(ii) Repeat Step 1) with the predictors and the response replaced by the residuals in (i) to
obtain φ2 = (φ21, φ22, …, φ2p).
The second series of z scores of all n observations is computed from
z_{i2} = \sum_{j=1}^{p} \varphi_{2j} x_{ij}.
They are uncorrelated with the first series of z scores, as in PCA.

3) To find the third loading vector φ3,
(i) Run p + 1 regressions, each with z1 and z2 as the regressors: the first predictor on
(z1, z2), the second predictor on (z1, z2), …, and finally y on (z1, z2). The residuals can be
interpreted as the remaining information that has not been explained by the first and
second PLS directions.
(ii) Repeat Step 1) with the predictors and the response replaced by the residuals in (i) to
obtain φ3 = (φ31, φ32, …, φ3p).
The third series of z scores of all n observations is computed from
z_{i3} = \sum_{j=1}^{p} \varphi_{3j} x_{ij}.
They are uncorrelated with the first and second series of z scores.

4) Continue until the M-th loading vector and series of z scores are obtained.

Finally, run an OLS regression of the original response on the M series of z scores.

Remarks:
1) Just like PCR, PLSR deals with multi-collinearity.
2) PLSR can model several response variables at the same time (observation i then has not one
response yi, but a vector of q responses yi = (yi1, yi2, …, yiq)).
3) M is determined by cross-validation.


4) In practice, the optimal number of components M used in PLSR can be much smaller than the optimal M
used in PCR for the same data set. However, the overall performance of PLSR is no
better than that of ridge regression or PCR. The textbook mentions that the reason is that, while PLS
reduces bias, it also has the "potential" to increase variance.
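
Using the same pls package mentioned for PCR, a PLSR fit differs only in the fitting function;
again, the data frame dat is an assumption for illustration:

  library(pls)
  fit_pls <- plsr(y ~ ., data = dat, scale = TRUE, validation = "CV")
  validationplot(fit_pls, val.type = "MSEP")    # choose M where the CV error levels off
  predict(fit_pls, newdata = dat, ncomp = 2)    # predictions using the first 2 PLS components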

Afterword

I believe that there are three kinds of people who work with predictive models: scrupulous data
scientists, unscrupulous data scientists, and novices.

Both scrupulous and unscrupulous data scientists are experts in modeling. They know the bag of
tricks to twist models to the point where they yield and confess whatever results are desired.
Scrupulous ones let the data speak for themselves; unscrupulous ones put the assumed results
into the models' output. Novices are not yet experienced: they do not know how to use models
and sometimes pick models wrongly. The following comes from a presentation with the
interesting title "How to Choose the Wrong Model" by Scott L. Zeger. Anyone who analyzes
data at work should bear the following in mind:

Some Ways to Choose the Wrong Model:

1) Forget about this important fact: There is not one best model because there isn’t one model
that works well in every possible aspect; models are useful but they are not the truth.

2) Turn your scientific problem over to a computer that, knowing nothing about your science or
your question, is very good at optimizing AIC, BIC or Cross-validated test MSE.

3) Turn your scientific problem over to your neighborhood statistician, who, knowing nothing
about your science or your question, is very good at optimizing ABC.

4) Choose a model that presumes the answer to your scientific question and ignores what the
data say.
[Note: Unscrupulous data scientists!]

5) Choose a model that no one can understand: not the people who did the study, not their
readers (not even yourself).

6) Use (e.g.) the Cox proportional hazards "model" because it has been used in 14,154 articles
cited in PubMed; how could so many people be wrong?
[Notes: The Cox PH model is a popular model in biostatistics that deals with the modeling of
mortality rate of patients. PubMed is a free search engine accessing the MEDLINE database
of life sciences and biomedical topics.]


Appendix: Derivation of the posterior distribution in Bayesian regression

The prior distribution of β has density

f(\boldsymbol\beta) = \prod_{i=1}^{p} \frac{1}{\sqrt{2\pi}\,\tau} \exp\left(-\frac{\beta_i^2}{2\tau^2}\right), \quad \boldsymbol\beta \in \mathbb{R}^p.

Because the n irreducible errors ε1, ε2, …, εn are iid N(0, σ²), the model distribution is

f(\mathbf{y} \mid \mathbf{X}, \boldsymbol\beta) = \prod_{j=1}^{n} \frac{1}{\sqrt{2\pi}\,\sigma} \exp\left(-\frac{(y_j - \beta_1 x_{j1} - \beta_2 x_{j2} - \cdots - \beta_p x_{jp})^2}{2\sigma^2}\right).

The posterior density of β is proportional to

f(\mathbf{y} \mid \mathbf{X}, \boldsymbol\beta) f(\boldsymbol\beta)
= \frac{1}{(2\pi)^{(n+p)/2}\sigma^n\tau^p} \exp\left[-\frac{1}{2\sigma^2}\sum_{j=1}^{n}(y_j - \beta_1 x_{j1} - \cdots - \beta_p x_{jp})^2 - \frac{1}{2\tau^2}\sum_{i=1}^{p}\beta_i^2\right]
= \frac{1}{(2\pi)^{(n+p)/2}\sigma^n\tau^p} \exp\left[-\frac{1}{2\sigma^2}\,\mathrm{PSSE}(\boldsymbol\beta)\right],

where the last equality uses τ² = σ²/λ, so that 1/(2τ²) = λ/(2σ²).

To maximize the posterior density, β should be such that PSSE(β) attains its minimum. This
means that β is the ridge regression estimate. Since the posterior distribution is multivariate
normal in β, the posterior mean equals the posterior mode.
