Chapter 7 Handout: Machine Learning


Machine Learning

Session 17-18

Prof. Dr. Ijaz Hussain

Statistics, QAU

May 27, 2024

Moving Beyond Linearity: Background

In Chapter 6 we saw that we can improve upon least squares using ridge regression, the lasso, principal components regression, and other techniques.
In that setting, the improvement is obtained by reducing the complexity of the linear model, and hence the variance of the estimates.
But we are still using a linear model, which can only be improved so far!
In this chapter we relax the linearity assumption while still attempting to maintain as much interpretability as possible.
We do this by examining very simple extensions of linear models like polynomial regression and step functions, as well as more sophisticated approaches such as splines, local regression, and generalized additive models.

Possible Non-Linear Methods
Polynomial regression extends the linear model by adding extra predictors, obtained by raising each of the original predictors to a power.
Step functions cut the range of a variable into K distinct regions in order to produce a qualitative variable. This has the effect of fitting a piecewise constant function.
Regression splines are more flexible than polynomials and step functions, and in fact are an extension of the two. They involve dividing the range of X into K distinct regions. Within each region, a polynomial function is fitted to the data.
Smoothing splines are similar to regression splines, but arise in a slightly different situation. Smoothing splines result from minimizing a residual sum of squares criterion subject to a smoothness penalty.
Local regression is similar to splines, but differs in an important way. The regions are allowed to overlap, and indeed they do so in a very smooth way.
Generalized additive models allow us to extend the above methods to deal with multiple predictors.
Polynomial Regression

The standard way to extend linear regression to a non-linear setting is to replace the standard linear model

y_i = β_0 + β_1 x_i + ϵ_i

with a polynomial function

y_i = β_0 + β_1 x_i + β_2 x_i^2 + β_3 x_i^3 + ⋯ + β_d x_i^d + ϵ_i

For a large enough degree d, a polynomial regression allows us to produce an extremely non-linear curve.
Notice that the coefficients in the above equation can easily be estimated using OLS, because this is just a standard linear model with predictors x_i, x_i^2, x_i^3, ..., x_i^d.
Generally speaking, it is unusual to use d greater than 3 or 4, because for large values of d the polynomial curve can become overly flexible and can take on some very strange shapes.
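
Since the polynomial model is linear in its coefficients, it can be fitted by ordinary least squares on the powers of x. Below is a minimal sketch in Python (NumPy only); the simulated data and the choice d = 3 are illustrative assumptions, not part of the slides.

# A minimal sketch of degree-d polynomial regression fitted by ordinary
# least squares. The data (x, y) are simulated purely for illustration.
import numpy as np

rng = np.random.default_rng(0)
x = rng.uniform(-2, 2, size=200)
y = np.sin(2 * x) + rng.normal(scale=0.3, size=x.shape)   # unknown "true" curve

d = 3                                                     # polynomial degree (assumed)
X = np.column_stack([x**j for j in range(d + 1)])         # columns 1, x, x^2, ..., x^d
beta, *_ = np.linalg.lstsq(X, y, rcond=None)              # OLS estimates beta_0, ..., beta_d

x_grid = np.linspace(-2, 2, 100)
y_hat = np.column_stack([x_grid**j for j in range(d + 1)]) @ beta   # fitted curve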
Step Functions
Using polynomial functions of the features as predictors in a linear model imposes a global structure on the non-linear function of X.
We can instead use step functions in order to avoid imposing such a global structure.
Here we break the range of X into bins, and fit a different constant in each bin.
This amounts to converting a continuous variable into an ordered categorical variable.
In greater detail, we create cutpoints c_1, c_2, ..., c_K in the range of X and then construct K + 1 new variables

C_0(X) = I(X < c_1), C_1(X) = I(c_1 ≤ X < c_2), C_2(X) = I(c_2 ≤ X < c_3), ..., C_{K−1}(X) = I(c_{K−1} ≤ X < c_K), C_K(X) = I(c_K ≤ X),

where I(·) is an indicator function that returns 1 if the condition is true and 0 otherwise. These are sometimes called dummy variables.
We then use least squares to fit a linear model using C_1(X), C_2(X), ..., C_K(X) as predictors.
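
As a rough illustration (not from the slides), the dummy variables C_1(X), ..., C_K(X) can be built with np.digitize and passed to ordinary least squares; the cutpoints and simulated data below are arbitrary assumptions for the sketch.

# A minimal sketch of piecewise-constant (step function) regression.
import numpy as np

rng = np.random.default_rng(1)
x = rng.uniform(0, 10, size=300)
y = np.where(x < 5, 1.0, 3.0) + rng.normal(scale=0.5, size=x.shape)

cuts = np.array([2.5, 5.0, 7.5])              # cutpoints c_1, ..., c_K (K = 3, assumed)
bins = np.digitize(x, cuts)                   # region index 0, 1, ..., K for each x_i

# Intercept plus the K dummy columns C_1(x), ..., C_K(x); C_0 is absorbed by the intercept.
X = np.column_stack([np.ones_like(x)] +
                    [(bins == k).astype(float) for k in range(1, len(cuts) + 1)])
beta, *_ = np.linalg.lstsq(X, y, rcond=None)  # beta_0 estimates the mean of y where x < c_1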
Step Functions...

For a given value of X, at most one of C_1, C_2, ..., C_K can be non-zero. The resulting model becomes

y_i = β_0 + β_1 C_1(x_i) + β_2 C_2(x_i) + ⋯ + β_K C_K(x_i) + ϵ_i

Note that when X < c_1, all of the predictors in the model are zero, so β_0 can be interpreted as the mean value of Y for X < c_1.
By comparison, the above equation predicts a response of β_0 + β_j for c_j ≤ X < c_{j+1}, so β_j represents the average increase in the response for X in c_j ≤ X < c_{j+1} relative to X < c_1.
Unfortunately, unless there are natural breakpoints in the predictors, piecewise-constant functions can miss the action.

Basis Functions
Polynomial and piecewise-constant regression models are in fact special cases of a basis function approach.
The idea is to have at hand a family of functions or transformations that can be applied to a variable X, i.e., b_1(X), b_2(X), ..., b_K(X).
Instead of fitting a linear model in X, we fit the model

y_i = β_0 + β_1 b_1(x_i) + β_2 b_2(x_i) + β_3 b_3(x_i) + ⋯ + β_K b_K(x_i) + ϵ_i.

Note that the basis functions b_1(·), b_2(·), ..., b_K(·) are fixed and known. (In other words, we choose the functions ahead of time.)
For polynomial regression, the basis functions are b_j(x_i) = x_i^j, and for piecewise constant functions they are b_j(x_i) = I(c_j ≤ x_i < c_{j+1}).
We can think of the above model as a standard linear model with predictors b_1(x_i), b_2(x_i), ..., b_K(x_i).
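
To make the point concrete, here is a small sketch (an illustration, not from the slides) in which the basis functions are simply a list of fixed, known Python functions; swapping the list between powers and indicators reproduces polynomial and piecewise-constant regression from the same least-squares fit. The data and cutpoints are assumptions.

# A minimal sketch of the basis function approach: choose fixed functions
# b_1, ..., b_K ahead of time, build the design matrix, and fit by OLS.
import numpy as np

rng = np.random.default_rng(2)
x = rng.uniform(0, 10, size=300)
y = 0.5 * x + np.sin(x) + rng.normal(scale=0.3, size=x.shape)

# Two possible choices of basis: cubic polynomial, or piecewise-constant.
poly_basis = [lambda t, j=j: t**j for j in (1, 2, 3)]
cuts = [2.5, 5.0, 7.5]
step_basis = [lambda t, c0=c0, c1=c1: ((t >= c0) & (t < c1)).astype(float)
              for c0, c1 in zip(cuts, cuts[1:] + [np.inf])]

def fit_ols(basis, x, y):
    """Build the design matrix [1, b_1(x), ..., b_K(x)] and return the OLS coefficients."""
    X = np.column_stack([np.ones_like(x)] + [b(x) for b in basis])
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    return beta

beta_poly = fit_ols(poly_basis, x, y)   # polynomial regression as a special case
beta_step = fit_ols(step_basis, x, y)   # piecewise-constant regression as a special case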
Basis Functions...

Hence, we can use least squares to estimate the unknown regression coefficients in the above model.
Importantly, this means that all of the inference tools for linear models, such as standard errors for the coefficient estimates and F-statistics for the model's overall significance, are available in this setting.
Thus far we have considered the use of polynomial functions and piecewise constant functions for our basis functions; however, many alternatives are possible.
For instance, we can use wavelets or Fourier series to construct basis functions.
In the next section, we investigate a very common choice for a basis function: regression splines.
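
Because the fit is an ordinary linear model, the usual inference output comes for free. A brief sketch using statsmodels (a tooling assumption; any OLS routine that reports standard errors and an F-statistic would serve), applied to a cubic polynomial basis on simulated data:

# A minimal sketch: standard linear-model inference (standard errors,
# F-statistic) applies directly to a basis-expanded design matrix.
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(3)
x = rng.uniform(-2, 2, size=200)
y = 1.0 + 0.5 * x - 0.8 * x**2 + rng.normal(scale=0.4, size=x.shape)

X = sm.add_constant(np.column_stack([x, x**2, x**3]))  # basis: x, x^2, x^3
res = sm.OLS(y, X).fit()

print(res.bse)        # standard errors of the coefficient estimates
print(res.fvalue)     # F-statistic for the model's overall significance
print(res.summary())  # full table, exactly as for any linear model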

Piecewise Polynomials
Instead of fitting a high-degree polynomial over the entire range of X, piecewise polynomial regression involves fitting separate low-degree polynomials over different regions of X.
For example, a piecewise cubic polynomial works by fitting a cubic regression model of the form

y_i = β_0 + β_1 x_i + β_2 x_i^2 + β_3 x_i^3 + ϵ_i

where the coefficients β_0, β_1, β_2, and β_3 differ in different parts of the range of X.
The points where the coefficients change are called knots.
For example, a piecewise cubic with no knots is just a standard cubic polynomial.
A piecewise cubic polynomial with a single knot at a point c takes the form

y_i = β_{01} + β_{11} x_i + β_{21} x_i^2 + β_{31} x_i^3 + ϵ_i   if x_i < c
y_i = β_{02} + β_{12} x_i + β_{22} x_i^2 + β_{32} x_i^3 + ϵ_i   if x_i ≥ c.
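
The unconstrained version of this fit is simply two separate cubic regressions, one on each side of the knot. A minimal sketch (the knot location and simulated data are illustrative assumptions):

# A minimal sketch of an (unconstrained) piecewise cubic polynomial with a
# single knot at c: fit a separate cubic by least squares on each side.
import numpy as np

rng = np.random.default_rng(4)
x = rng.uniform(20, 80, size=400)                 # an age-like predictor (assumed)
y = 0.002 * (x - 50)**3 + rng.normal(scale=5.0, size=x.shape)

c = 50.0                                          # the knot (assumed)
left, right = x < c, x >= c

# np.polyfit returns coefficients from highest degree to lowest.
coef_left = np.polyfit(x[left], y[left], deg=3)   # beta_31, beta_21, beta_11, beta_01
coef_right = np.polyfit(x[right], y[right], deg=3)

def predict(x_new):
    """Evaluate whichever cubic applies to each point."""
    x_new = np.asarray(x_new, dtype=float)
    return np.where(x_new < c, np.polyval(coef_left, x_new), np.polyval(coef_right, x_new))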
Piecewise Polynomials

In other words, we fit two different polynomial functions to the data, one on the subset of the observations with x_i < c, and one on the subset of the observations with x_i ≥ c.
The first polynomial function has coefficients β_{01}, β_{11}, β_{21}, and β_{31}, and the second has coefficients β_{02}, β_{12}, β_{22}, and β_{32}.
Each of these polynomial functions can be fit using least squares applied to simple functions of the original predictor.
Using more knots leads to a more flexible piecewise polynomial.
In general, if we place K different knots throughout the range of X, then we will end up fitting K + 1 different cubic polynomials.
Note that we do not need to use a cubic polynomial. For example, we can instead fit piecewise linear functions.
In fact, our piecewise constant functions of Section 7.2 are piecewise polynomials of degree 0!

Various piecewise polynomials

[Figure 7.3: four panels fitted to the same data with a single knot at age=50, comparing an unconstrained piecewise cubic (top left), a continuous piecewise cubic (top right), a cubic spline with continuous first and second derivatives (bottom left), and a linear spline (bottom right).]
Constraints and Splines
The top left panel of the above figure looks wrong because the fitted curve is just too flexible.
To remedy this problem, we can fit a piecewise polynomial under the constraint that the fitted curve must be continuous.
In other words, there cannot be a jump when age=50.
The top right plot in the figure shows the resulting fit. This looks better than the top left plot, but the V-shaped join looks unnatural.
In the lower left plot, we have added two additional constraints: now both the first and second derivatives of the piecewise polynomials are continuous at age=50.
In other words, we are requiring that the piecewise polynomial be not only continuous when age=50, but also very smooth.
Each constraint that we impose on the piecewise cubic polynomials effectively frees up one degree of freedom, by reducing the complexity of the resulting piecewise polynomial fit.
Constraints and Splines...
So in the top left plot, we are using eight degrees of freedom, but in the bottom left plot we imposed three constraints (continuity, continuity of the first derivative, and continuity of the second derivative) and so are left with five degrees of freedom.
The curve in the bottom left plot is called a cubic spline.
In general, a cubic spline with K knots uses a total of 4 + K degrees of freedom (a short count is given below).
In Figure 7.3, the lower right plot is a linear spline, which is continuous at age=50.
The general definition of a degree-d spline is that it is a piecewise degree-d polynomial, with continuity in derivatives up to degree d − 1 at each knot.
Therefore, a linear spline is obtained by fitting a line in each region of the predictor space defined by the knots, requiring continuity at each knot.
In Figure 7.3, there is a single knot at age=50. Of course, we could add more knots, and impose continuity at each.
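
As a quick check of this count (a sketch of the arithmetic implied by the constraint argument above, not text from the slides): an unconstrained piecewise cubic with K knots has 4 coefficients on each of the K + 1 regions, and each knot imposes 3 constraints (continuity of the function, its first derivative, and its second derivative), so

4(K + 1) − 3K = K + 4

free parameters remain, i.e. 4 + K degrees of freedom.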
The Spline Basis Representation
The regression splines that we just saw in the previous section may have seemed somewhat complex: how can we fit a piecewise degree-d polynomial under the constraint that it (and possibly its first d − 1 derivatives) be continuous?
It turns out that we can use the basis model to represent a regression spline.
A cubic spline with K knots can be modeled as

y_i = β_0 + β_1 b_1(x_i) + β_2 b_2(x_i) + β_3 b_3(x_i) + ⋯ + β_{K+3} b_{K+3}(x_i) + ϵ_i

for an appropriate choice of basis functions b_1(·), b_2(·), ..., b_{K+3}(·). This model can then be fitted using least squares.
Just as there were several ways to represent polynomials, there are also many equivalent ways to represent cubic splines using different choices of basis functions in the above model.
The most direct way to represent a cubic spline using the above model is to start off with a basis for a cubic polynomial, namely x, x^2, and x^3, and then add one truncated power basis function per knot.
The Spline Basis Representation...
A truncated power basis function is defined as

h(x, ξ) = (x − ξ)_+^3 = (x − ξ)^3 if x > ξ, and 0 otherwise,

where ξ is the knot.
One can show that adding a term of the form β_4 h(x, ξ) to the model for a cubic polynomial will lead to a discontinuity in only the third derivative at ξ; the function will remain continuous, with continuous first and second derivatives, at each of the knots.
In other words, in order to fit a cubic spline to a data set with K knots, we perform least squares regression with an intercept and 3 + K predictors, of the form X, X^2, X^3, h(X, ξ_1), h(X, ξ_2), ..., h(X, ξ_K), where ξ_1, ..., ξ_K are the knots.
This amounts to estimating a total of K + 4 regression coefficients; for this reason, fitting a cubic spline with K knots uses K + 4 degrees of freedom.
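
Putting the pieces together, a cubic spline can be fitted by ordinary least squares on the truncated power basis. A minimal sketch (knot locations and simulated data are illustrative assumptions; in practice a library spline basis such as B-splines is usually preferred for numerical stability):

# A minimal sketch of fitting a cubic spline with K knots via the truncated
# power basis: intercept plus X, X^2, X^3, h(X, xi_1), ..., h(X, xi_K),
# i.e. K + 4 coefficients in total.
import numpy as np

def truncated_power_basis(x, knots):
    """Design matrix [1, x, x^2, x^3, (x - xi_1)_+^3, ..., (x - xi_K)_+^3]."""
    cols = [np.ones_like(x), x, x**2, x**3]
    cols += [np.clip(x - xi, 0.0, None) ** 3 for xi in knots]
    return np.column_stack(cols)

rng = np.random.default_rng(5)
x = rng.uniform(20, 80, size=400)
y = np.sin(x / 8.0) + rng.normal(scale=0.3, size=x.shape)

knots = [35.0, 50.0, 65.0]                            # K = 3 knots (assumed) -> 3 + 4 = 7 coefficients
X = truncated_power_basis(x, knots)
beta, *_ = np.linalg.lstsq(X, y, rcond=None)

x_grid = np.linspace(20, 80, 200)
y_hat = truncated_power_basis(x_grid, knots) @ beta   # fit is continuous up to the second derivative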
References

The material used in these slides is borrowed from the following books. These slides may be used only for academic purposes.
James, G., Witten, D., Hastie, T., & Tibshirani, R. (2013). An Introduction to Statistical Learning: with Applications in R. New York: Springer.
Hastie, T., Tibshirani, R., & Friedman, J. H. (2009). The Elements of Statistical Learning: Data Mining, Inference, and Prediction (2nd ed.). New York: Springer.

