
IAML: Linear Regression

Nigel Goddard
School of Informatics

Semester 1

Overview

- The linear model
- Fitting the linear model to data
- Probabilistic interpretation of the error function
- Examples of regression problems
- Dealing with multiple outputs
- Generalized linear regression
- Radial basis function (RBF) models

The Regression Problem

- Classification and regression problems:
  - Classification: the target of prediction is discrete
  - Regression: the target of prediction is continuous
- Training data: a set D of pairs (x_i, y_i) for i = 1, ..., n, where x_i ∈ R^D and y_i ∈ R
- Today: linear regression, i.e., the relationship between x and y is linear.
- Although this is simple (and limited), it is:
  - More powerful than you would expect
  - The basis for more complex nonlinear methods
  - A topic that teaches a lot about regression and classification

Examples of regression problems

- Robot inverse dynamics: predicting what torques are needed to drive a robot arm along a given trajectory
- Electricity load forecasting: generating hourly forecasts two days in advance
- Predicting staffing requirements at help desks based on historical data and product and sales information
- Predicting the time to failure of equipment based on utilization and environmental conditions

The Linear Model

- Linear model:

    f(x; w) = w_0 + w_1 x_1 + ... + w_D x_D
            = φ(x) w

  where φ(x) = (1, x_1, ..., x_D) = (1, x^T)

  and w = (w_0, w_1, ..., w_D)^T                    (1)

- The maths of fitting linear models to data is easy. We use the notation φ(x) to make generalisation easy later.

Toy example: Data

[Scatter plot of the toy dataset: x from -3 to 3 on the horizontal axis, y from about -2 to 4 on the vertical axis.]

Toy example: Data

[The same toy dataset shown again in two side-by-side panels, each with x from -3 to 3 and y from about -2 to 4.]

With two features


[Figure 3.1 from Hastie, Tibshirani, and Friedman: linear least squares fitting with X ∈ R^2. We seek the linear function of X that minimizes the sum of squared residuals from Y.]

Instead of a line, a plane. With more features, a hyperplane.

Figure: Hastie, Tibshirani, and Friedman

With more features

CPU Performance Data Set


- Predict PRP: published relative performance
- MYCT: machine cycle time in nanoseconds (integer)
- MMIN: minimum main memory in kilobytes (integer)
- MMAX: maximum main memory in kilobytes (integer)
- CACH: cache memory in kilobytes (integer)
- CHMIN: minimum channels in units (integer)
- CHMAX: maximum channels in units (integer)

With more features

PRP = -56.1
      + 0.049 MYCT
      + 0.015 MMIN
      + 0.006 MMAX
      + 0.630 CACH
      - 0.270 CHMIN
      + 1.46  CHMAX

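For concreteness, here is a minimal sketch (in Python) of using this fitted model to predict PRP for one machine; the feature values below are made up purely for illustration.

    # Coefficients of the fitted linear model from the slide above.
    coef = {"MYCT": 0.049, "MMIN": 0.015, "MMAX": 0.006,
            "CACH": 0.630, "CHMIN": -0.270, "CHMAX": 1.46}
    intercept = -56.1

    # Hypothetical machine (feature values invented for illustration).
    machine = {"MYCT": 125, "MMIN": 256, "MMAX": 6000,
               "CACH": 16, "CHMIN": 4, "CHMAX": 32}

    # The prediction is the intercept plus a weighted sum of the features.
    prp = intercept + sum(coef[name] * machine[name] for name in coef)
    print(round(prp, 1))
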
In matrix notation

- The design matrix is n × (D + 1):

    Φ = [ 1  x_11  x_12  ...  x_1D
          1  x_21  x_22  ...  x_2D
          .   .     .          .
          1  x_n1  x_n2  ...  x_nD ]

- x_ij is the jth component of the training input x_i
- Let y = (y_1, ..., y_n)^T
- Then ŷ = Φw is ...?

Linear Algebra: The 1-Slide Version
What is matrix multiplication?

    A = [ a_11  a_12  a_13        b = [ b_1
          a_21  a_22  a_23              b_2
          a_31  a_32  a_33 ]            b_3 ]

First consider matrix times vector, i.e., Ab. Two answers:

1. Ab is a linear combination of the columns of A:

    Ab = b_1 (a_11, a_21, a_31)^T + b_2 (a_12, a_22, a_32)^T + b_3 (a_13, a_23, a_33)^T

2. Ab is a vector. Each element of the vector is the dot product between one row of A and b:

    Ab = [ (a_11, a_12, a_13) · b
           (a_21, a_22, a_23) · b
           (a_31, a_32, a_33) · b ]

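A tiny NumPy check of the two views of Ab (the numbers are arbitrary):

    import numpy as np

    A = np.array([[1.0, 2.0, 3.0],
                  [4.0, 5.0, 6.0],
                  [7.0, 8.0, 9.0]])
    b = np.array([1.0, 0.5, -1.0])

    # View 1: a linear combination of the columns of A.
    combo = b[0] * A[:, 0] + b[1] * A[:, 1] + b[2] * A[:, 2]

    # View 2: one dot product per row of A.
    dots = np.array([A[i, :] @ b for i in range(3)])

    print(A @ b)      # both views agree with the built-in product
    print(combo)
    print(dots)
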
Linear model (part 2)

In matrix notation:

- The design matrix is n × (D + 1):

    Φ = [ 1  x_11  x_12  ...  x_1D
          1  x_21  x_22  ...  x_2D
          .   .     .          .
          1  x_n1  x_n2  ...  x_nD ]

- x_ij is the jth component of the training input x_i
- Let y = (y_1, ..., y_n)^T
- Then ŷ = Φw is the vector of the model's predicted values on the training inputs.

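As a quick sanity check on the shapes, here is a minimal NumPy sketch of building the design matrix and computing ŷ = Φw; the data and weights are invented for illustration.

    import numpy as np

    # Toy training inputs: n = 4 examples, D = 2 features (values invented for illustration).
    X = np.array([[0.5, 1.0],
                  [1.5, -0.5],
                  [-1.0, 2.0],
                  [2.0, 0.0]])

    # Design matrix: prepend a column of ones for the intercept term w_0.
    Phi = np.hstack([np.ones((X.shape[0], 1)), X])   # shape (n, D + 1)

    # Some weight vector w = (w_0, w_1, w_2)^T.
    w = np.array([0.2, 1.0, -0.5])

    # Predicted values on the training inputs: one entry of y_hat per row of Phi.
    y_hat = Phi @ w
    print(y_hat)                                     # shape (n,)
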
Solving for Model Parameters

This looks like what we have seen in linear algebra:

    y = Φw

We know y and Φ but not w.

So why not take w = Φ^{-1} y? (You can't, but why?)

Solving for Model Parameters

This looks like what we have seen in linear algebra:

    y = Φw

We know y and Φ but not w.

So why not take w = Φ^{-1} y? (You can't, but why?)

Three reasons:

- Φ is not square. It is n × (D + 1).
- The system is overconstrained (n equations for D + 1 parameters); in other words,
- the data has noise.

Loss function

Want a loss function O(w) that:

- we minimize with respect to w
- makes ŷ look like y at the minimum
- (recall that ŷ depends on w: ŷ = Φw)

Fitting a linear model to data

[Figure 3.1 from Hastie, Tibshirani, and Friedman: linear least squares fitting with X ∈ R^2; we seek the linear function of X that minimizes the sum of squared residuals from Y. The residuals are drawn as black sticks from the data points to the fitted plane.]

- A common choice: squared error (it makes the maths easy)

    O(w) = Σ_{i=1}^n (y_i - w^T x_i)^2

- In the picture: this is the sum of the squared lengths of the black sticks.
- (Each one is called a residual, i.e., each y_i - w^T x_i.)

Fitting a linear model to data

- Given a dataset D of pairs (x_i, y_i) for i = 1, ..., n
- Squared error makes the maths easy:

    E(w) = Σ_{i=1}^n (y_i - f(x_i; w))^2 = (y - Φw)^T (y - Φw)

- We want to minimize this with respect to w.
- The error surface is a parabolic bowl.

[Figure: Tom Mitchell. Surface plot of E[w] over (w_0, w_1): a parabolic bowl.]

How do we do this?

The Solution

- Answer: to minimize O(w) = Σ_{i=1}^n (y_i - w^T x_i)^2, set the partial derivatives to 0.
- This has an analytical solution:

    ŵ = (Φ^T Φ)^{-1} Φ^T y

- (Φ^T Φ)^{-1} Φ^T is the pseudo-inverse of Φ
- First check: does this make sense? Do the matrix dimensions line up?
- Then: why is this called a pseudo-inverse?
- Finally: what happens if there are no features?

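A minimal NumPy sketch of this least-squares solution on toy data (generated here purely for illustration). Note that in practice np.linalg.lstsq is usually preferred over forming (Φ^T Φ)^{-1} explicitly, for numerical stability.

    import numpy as np

    rng = np.random.default_rng(0)

    # Toy data: y = 2 + 3x plus Gaussian noise (invented for illustration).
    n = 50
    x = rng.uniform(-3, 3, size=n)
    y = 2.0 + 3.0 * x + rng.normal(0.0, 0.5, size=n)

    # Design matrix with a bias column.
    Phi = np.column_stack([np.ones(n), x])

    # Normal-equations solution: w_hat = (Phi^T Phi)^{-1} Phi^T y.
    w_hat = np.linalg.solve(Phi.T @ Phi, Phi.T @ y)

    # Equivalent, numerically safer route via least squares.
    w_lstsq, *_ = np.linalg.lstsq(Phi, y, rcond=None)

    print(w_hat)      # roughly [2, 3]
    print(w_lstsq)
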
Probabilistic interpretation of O(w)

- Assume that y = w^T x + ε, where ε ~ N(0, σ^2)
- (This is an exact linear relationship plus Gaussian noise.)
- This implies that y | x_i ~ N(w^T x_i, σ^2), i.e.,

    -log p(y_i | x_i) = log √(2π) + log σ + (y_i - w^T x_i)^2 / (2σ^2)

- So minimising O(w) is equivalent to maximising the likelihood!
- We can view w^T x as E[y | x].
- The squared residuals allow estimation of σ^2:

    σ̂^2 = (1/n) Σ_{i=1}^n (y_i - w^T x_i)^2

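Continuing the same kind of toy setup (data invented for illustration), the noise variance can be estimated from the mean squared residual exactly as in the formula above.

    import numpy as np

    rng = np.random.default_rng(1)

    # Toy data with known noise standard deviation 0.5, so sigma^2 = 0.25.
    n = 200
    x = rng.uniform(-3, 3, size=n)
    y = 2.0 + 3.0 * x + rng.normal(0.0, 0.5, size=n)

    Phi = np.column_stack([np.ones(n), x])
    w_hat, *_ = np.linalg.lstsq(Phi, y, rcond=None)

    # Maximum-likelihood estimate of the noise variance: the mean squared residual.
    residuals = y - Phi @ w_hat
    sigma2_hat = np.mean(residuals ** 2)
    print(sigma2_hat)   # should come out close to 0.25
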
Fitting this into the general structure for learning algorithms:

- Define the task: regression
- Decide on the model structure: the linear regression model
- Decide on the score function: squared error (likelihood)
- Decide on the optimization/search method to optimize the score function: calculus (analytic solution)

Sensitivity to Outliers

- Linear regression is sensitive to outliers.
- Example: suppose y = 0.5x + ε, where ε ~ N(0, √0.25), and then add a point at (2.5, 3), as in the code sketch below:

[Plot: the data over x from 0 to 5 and y from 0.0 to 3.0, with the added point at (2.5, 3) marked.]

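A minimal sketch of this experiment, assuming the noise standard deviation is √0.25 = 0.5 and generating the data here purely for illustration; it compares the fitted coefficients with and without the added point.

    import numpy as np

    rng = np.random.default_rng(2)

    # Clean data: y = 0.5x + noise with standard deviation 0.5 (invented for illustration).
    x = rng.uniform(0, 5, size=20)
    y = 0.5 * x + rng.normal(0.0, 0.5, size=20)

    def fit_line(x, y):
        """Least-squares fit of y = w0 + w1*x; returns (w0, w1)."""
        Phi = np.column_stack([np.ones_like(x), x])
        w, *_ = np.linalg.lstsq(Phi, y, rcond=None)
        return w

    print("without outlier:", fit_line(x, y))

    # Add the single point (2.5, 3) and refit; the coefficients shift towards it.
    x_out = np.append(x, 2.5)
    y_out = np.append(y, 3.0)
    print("with outlier:   ", fit_line(x_out, y_out))
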
Diagnostics

Graphical diagnostics can be useful for checking:

- Is the relationship obviously nonlinear? Look for structure in the residuals.
- Are there obvious outliers?

The goal isn't to find all problems. You can't. The goal is to find obvious, embarrassing problems.

Example: plot the residuals against the fitted values. Stats packages will do this for you.

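A minimal matplotlib sketch of such a residuals-vs-fitted plot; the toy data is invented for illustration and given a deliberate nonlinearity so that the residual plot shows visible structure.

    import numpy as np
    import matplotlib.pyplot as plt

    rng = np.random.default_rng(3)

    # Toy data with a quadratic term that the straight-line model will miss.
    x = rng.uniform(-3, 3, size=100)
    y = 1.0 + 2.0 * x + 0.5 * x**2 + rng.normal(0.0, 0.5, size=100)

    Phi = np.column_stack([np.ones_like(x), x])        # straight-line model only
    w, *_ = np.linalg.lstsq(Phi, y, rcond=None)

    fitted = Phi @ w
    residuals = y - fitted

    # Residuals vs fitted values: curvature here suggests a missing nonlinear term.
    plt.scatter(fitted, residuals)
    plt.axhline(0.0, color="grey", linestyle="--")
    plt.xlabel("fitted values")
    plt.ylabel("residuals")
    plt.show()
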
Dealing with multiple outputs

- Suppose there are q different targets for each input x.
- We introduce a different weight vector w_i for each target dimension and do the regression separately for each one.
- This is called multiple regression.

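A minimal sketch, assuming the q target columns are stacked into an n × q matrix Y (data invented for illustration): one least-squares solve then fits all q regressions at once, one weight column per target.

    import numpy as np

    rng = np.random.default_rng(4)

    # Toy data: n = 100 inputs with D = 3 features and q = 2 targets.
    n, D, q = 100, 3, 2
    X = rng.normal(size=(n, D))
    W_true = rng.normal(size=(D + 1, q))              # one weight column per target
    Phi = np.hstack([np.ones((n, 1)), X])
    Y = Phi @ W_true + rng.normal(0.0, 0.1, size=(n, q))

    # Each column of W_hat is the weight vector for one target dimension.
    W_hat, *_ = np.linalg.lstsq(Phi, Y, rcond=None)
    print(W_hat.shape)    # (D + 1, q)
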
Basis expansion

- We can easily transform the original attributes x non-linearly into φ(x) and do linear regression on them.

[Figure: three panels over x ∈ [-1, 1] showing example basis functions: polynomials, Gaussians, and sigmoids.]

Figure credit: Chris Bishop, PRML

- The design matrix is now n × m:

    Φ = [ φ_1(x_1)  φ_2(x_1)  ...  φ_m(x_1)
          φ_1(x_2)  φ_2(x_2)  ...  φ_m(x_2)
             .         .              .
          φ_1(x_n)  φ_2(x_n)  ...  φ_m(x_n) ]

- Let y = (y_1, ..., y_n)^T
- Minimize E(w) = |y - Φw|^2. As before we have an analytical solution:

    ŵ = (Φ^T Φ)^{-1} Φ^T y

- (Φ^T Φ)^{-1} Φ^T is the pseudo-inverse of Φ

Example: polynomial regression

    φ(x) = (1, x, x^2, ..., x^M)^T

[Figure: four panels showing fits of polynomial order M = 0, 1, 3, and 9 to the same data, t plotted against x on [0, 1].]

Figure credit: Chris Bishop, PRML

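A minimal sketch of polynomial regression via this basis expansion, loosely in the spirit of the figure; the noisy sinusoidal data is invented for illustration.

    import numpy as np

    rng = np.random.default_rng(5)

    # Toy data: a noisy sine wave on [0, 1].
    n = 15
    x = rng.uniform(0, 1, size=n)
    t = np.sin(2 * np.pi * x) + rng.normal(0.0, 0.2, size=n)

    def poly_design_matrix(x, M):
        """Design matrix with columns 1, x, x^2, ..., x^M."""
        return np.column_stack([x ** m for m in range(M + 1)])

    for M in (0, 1, 3, 9):
        Phi = poly_design_matrix(x, M)
        w, *_ = np.linalg.lstsq(Phi, t, rcond=None)
        train_err = np.mean((t - Phi @ w) ** 2)
        print(f"M = {M}: training MSE = {train_err:.4f}")

The training error keeps falling as M grows, which is exactly why higher-order fits need to be checked on held-out data.
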
More about the features

- Transforming the features can be important.
- Example: suppose I want to predict CPU performance.
- Maybe one of the features is the manufacturer:

    x_1 = 1 if Intel
          2 if AMD
          3 if Apple
          4 if Motorola

- Let's use this as a feature. Will this work?

More about the features

- Transforming the features can be important.
- Example: suppose I want to predict CPU performance.
- Maybe one of the features is the manufacturer:

    x_1 = 1 if Intel
          2 if AMD
          3 if Apple
          4 if Motorola

- Let's use this as a feature. Will this work?
- Not the way you want. Do you really believe AMD is double Intel?

More about the features

- Instead, convert this into 0/1 indicator features:

    x_1 = 1 if Intel, 0 otherwise
    x_2 = 1 if AMD, 0 otherwise
    ...

- Note this is a consequence of linearity. We saw something similar with text in the first week.
- Other good transformations: log, square, square root

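A minimal sketch of this 0/1 (one-hot) encoding, assuming the manufacturer arrives as a string column; the names and values are invented for illustration.

    # Hypothetical manufacturer column for five machines.
    manufacturers = ["Intel", "AMD", "Apple", "Intel", "Motorola"]
    categories = ["Intel", "AMD", "Apple", "Motorola"]

    # One indicator feature per manufacturer: 1 if it matches, 0 otherwise.
    one_hot = [[1 if m == c else 0 for c in categories] for m in manufacturers]

    for m, row in zip(manufacturers, one_hot):
        print(m, row)
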
Radial basis function (RBF) models

- Set φ_i(x) = exp(-½ |x - c_i|^2 / α^2).
- We need to position these "basis functions" at previously chosen centres c_i with a given width α. There are many ways to do this, but choosing a subset of the datapoints as centres is one method that is quite effective.
- Finding the weights is the same as ever: the pseudo-inverse solution.

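A minimal sketch of RBF regression, assuming the centres are chosen as a random subset of the training points, the width α is fixed by hand, and a bias column is added; the 1-D data is invented for illustration.

    import numpy as np

    rng = np.random.default_rng(6)

    # Toy 1-D data on [0, 7].
    n = 80
    x = rng.uniform(0, 7, size=n)
    y = np.sin(x) + rng.normal(0.0, 0.1, size=n)

    def rbf_design_matrix(x, centres, alpha):
        """Columns are phi_i(x) = exp(-0.5 * |x - c_i|^2 / alpha^2), plus a bias column."""
        diffs = x[:, None] - centres[None, :]
        Phi = np.exp(-0.5 * diffs**2 / alpha**2)
        return np.hstack([np.ones((len(x), 1)), Phi])

    # Choose a handful of training points as centres and a fixed width.
    centres = np.sort(rng.choice(x, size=8, replace=False))
    alpha = 1.0

    Phi = rbf_design_matrix(x, centres, alpha)
    w, *_ = np.linalg.lstsq(Phi, y, rcond=None)   # the usual pseudo-inverse solution

    print("training RMSE:", np.sqrt(np.mean((y - Phi @ w) ** 2)))
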
RBF example

[Scatter plot of the RBF example data: x from 0 to 7, y from about -0.5 to 2.0.]

RBF example

[The same RBF example data shown again: x from 0 to 7, y from about -0.5 to 2.0.]

An RBF feature

[Two panels. Left ("Original data"): y against x, with x from 0 to 7. Right ("RBF feature, c_1 = 3, α_1 = 1"): y against the value of the RBF feature centred at 3, which ranges from about 0.0 to 0.4.]

Another RBF feature

Notice how the feature functions "specialize" in input space.

[Two panels. Left ("Original data"): y against x, with x from 0 to 7. Right ("RBF feature, c_2 = 6, α_2 = 1"): y against the value of the RBF feature centred at 6, which ranges from about 0.0 to 0.4.]

RBF example

Run the RBF model with both basis functions above and plot the residuals

    y_i - φ(x_i)^T w

[Two panels. Left ("Original data"): y against x. Right ("Residuals"): the residuals of the RBF model against x, both over x from 0 to 7.]

RBF: Ay, there’s the rub

- So why not use RBFs for everything?
- Short answer: you might need too many basis functions.
- This is especially true in high dimensions (we'll say more later).
- Too many basis functions means you probably overfit.
- Extreme example: centre one on each training point.
- Also: notice that we haven't yet seen in the course how to learn the RBF parameters, i.e., the mean and standard deviation of each kernel.
- Main point of presenting RBFs now: to set up later methods like support vector machines.

Summary

- Linear regression is often useful out of the box.
- It is more useful than it would seem because "linear" means linear in the parameters. You can do a nonlinear transform of the data first, e.g., polynomial or RBF features. This point will come up again.
- The maximum likelihood solution is computationally efficient (pseudo-inverse).
- There is a danger of overfitting, especially with many features or basis functions.
