Regression 1
Mauricio A. Álvarez
1 / 63
Contents
A regression model
Linear regression
Gradient descent
Regularisation
2 / 63
Scalars and vectors
3 / 63
Vectors
x = [x_1, x_2, x_3, x_4, x_5]

Or as a column vector,

x = \begin{bmatrix} x_1 \\ x_2 \\ x_3 \\ x_4 \\ x_5 \end{bmatrix}
❑ In this module, ALL vectors will be column vectors by default. So, when
you see a vector, e.g. x, y or z, always think of it as having a column-wise
shape.
4 / 63
Matrices
❑ A matrix is a rectangular array of scalars arranged in rows and
columns.
❑ The entries of a matrix X are of the form xij , where the first
subindex i indicates the row of the element and the second subindex j
indicates the column.
5 / 63
Matrix transpose
6 / 63
Matrix multiplication
c_{ij} = \sum_{k} a_{ik} b_{kj}.
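❑ A minimal NumPy sketch of this definition, using two made-up matrices A and B:

```python
import numpy as np

# Made-up matrices: A is 2 x 2, B is 2 x 3, so C = AB is 2 x 3
A = np.array([[1., 2.], [3., 4.]])
B = np.array([[5., 6., 7.], [8., 9., 10.]])

C = A @ B  # matrix product

# Check one entry against the definition c_ij = sum_k a_ik * b_kj
i, j = 1, 2
print(C[i, j], sum(A[i, k] * B[k, j] for k in range(A.shape[1])))
```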
7 / 63
Transpose of a product
❑ Let w be a vector of dimensions d × 1. Let X be a matrix with
dimensions n × d .
(Xw)⊤ = w⊤ X⊤ .
(ABCD)⊤ = ((AB)(CD))⊤
= (CD)⊤ (AB)⊤
= D⊤ C⊤ B⊤ A⊤ .
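❑ A quick NumPy check of these identities, using a made-up w of size d × 1 (d = 4) and X of size n × d (n = 6):

```python
import numpy as np

rng = np.random.default_rng(0)
w = rng.normal(size=(4, 1))   # d x 1, with d = 4
X = rng.normal(size=(6, 4))   # n x d, with n = 6

# (Xw)^T equals w^T X^T
print(np.allclose((X @ w).T, w.T @ X.T))   # True
```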
8 / 63
From a scalar operation to a vector operation
9 / 63
Example
10 / 63
Answer (I)
\left(y_1 - \sum_{j=1}^{d} x_{1j} w_j\right)\left(y_1 - \sum_{j=1}^{d} x_{1j} w_j\right) + \cdots + \left(y_n - \sum_{j=1}^{d} x_{nj} w_j\right)\left(y_n - \sum_{j=1}^{d} x_{nj} w_j\right).
11 / 63
Answer (II)
❑ The product of vectors v^\top v gives the same result as the required
sum,

v^\top v =
\left[\left(y_1 - \sum_{j=1}^{d} x_{1j} w_j\right) \;\cdots\; \left(y_n - \sum_{j=1}^{d} x_{nj} w_j\right)\right]
\begin{bmatrix}
\left(y_1 - \sum_{j=1}^{d} x_{1j} w_j\right) \\
\vdots \\
\left(y_n - \sum_{j=1}^{d} x_{nj} w_j\right)
\end{bmatrix}
= \sum_{i=1}^{n} \left(y_i - \sum_{j=1}^{d} x_{ij} w_j\right)^2.
12 / 63
Answer (III)
❑ For a fixed i, the entries x_{i1}, . . . , x_{id} can be grouped into a row vector x_i^\top.
❑ Each element of v can then be written as y_n - x_n^\top w.
13 / 63
Answer (IV)
❑ Finally,

\sum_{i=1}^{n} \left(y_i - \sum_{j=1}^{d} x_{ij} w_j\right)^2 = v^\top v = (y - Xw)^\top (y - Xw).
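❑ The identity can be checked numerically; the sketch below uses made-up values of n, d, X, y and w:

```python
import numpy as np

rng = np.random.default_rng(0)
n, d = 6, 3
X = rng.normal(size=(n, d))
y = rng.normal(size=n)
w = rng.normal(size=d)

# Double-sum version
loop_sum = sum((y[i] - sum(X[i, j] * w[j] for j in range(d)))**2 for i in range(n))

# Vectorised version (y - Xw)^T (y - Xw)
r = y - X @ w
vec_sum = r @ r

print(np.isclose(loop_sum, vec_sum))   # True
```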
14 / 63
Two common types of products
❑ Inner product. The inner product between two vectors results in a
scalar.
❑ Let x and y be vectors of dimension m × 1. The inner product is given
as
x^\top y = \sum_{i=1}^{m} x_i y_i.
15 / 63
Differentiating a function in a vector/matrix form (I)
16 / 63
Differentiating a function in a vector/matrix form (II)
17 / 63
Identity matrix and the inverse of a matrix
❑ The identity matrix of size N is a square matrix with ones on the main
diagonal and zeros elsewhere, e.g.,
I_3 = \begin{bmatrix} 1 & 0 & 0 \\ 0 & 1 & 0 \\ 0 & 0 & 1 \end{bmatrix}
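❑ A short NumPy sketch of the identity matrix and of the inverse, using a made-up invertible matrix A:

```python
import numpy as np

I3 = np.eye(3)                        # 3 x 3 identity matrix

A = np.array([[2., 1.], [1., 3.]])    # a made-up invertible matrix
A_inv = np.linalg.inv(A)

# A matrix times its inverse gives the identity (up to rounding error)
print(np.allclose(A @ A_inv, np.eye(2)))   # True
```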
18 / 63
Contents
A regression model
Linear regression
Gradient descent
Regularisation
19 / 63
Olympic 100m Data
[Figure: Olympic 100 m winning times; vertical axis in seconds.]
21 / 63
Model
f (x , w) = w0 + w1 x ,
22 / 63
Data and model
[Figure: Olympic 100 m data together with the fitted linear model; vertical axis in seconds.]
23 / 63
Contents
A regression model
Linear regression
Gradient descent
Regularisation
24 / 63
Linear model
❑ A simple model for regression consists of using a linear combination of
the attributes to predict the output
f (x, w) = w0 + w1 x1 + . . . + wD xD ,
f (x, w) = w⊤ x.
❑ Notice that x0 = 1, so the bias w0 is absorbed into the inner product w⊤ x.
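❑ A minimal sketch of this prediction with made-up weights and a made-up input, where x0 = 1 is prepended for the bias:

```python
import numpy as np

# Made-up weights and input with D = 2, and x0 = 1 prepended for the bias
w = np.array([0.5, 2.0, -1.0])   # [w0, w1, w2]
x = np.array([1.0, 3.0, 4.0])    # [x0 = 1, x1, x2]

# f(x, w) = w^T x = w0 + w1*x1 + w2*x2
print(w @ x)                     # 2.5
```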
25 / 63
Parenthesis: Gaussian pdf
p(y) = \frac{1}{\sqrt{2\pi\sigma^2}} \exp\left\{ -\frac{(y - \mu)^2}{2\sigma^2} \right\}.
❑ A Gaussian pdf requires two parameters µ and σ 2 , the mean and the
variance of the RV Y .
26 / 63
Parenthesis: Gaussian pdf
The mean of the three Gaussians is µ = 2 and the variances are σ 2 = 0.5
(solid), and σ 2 = 2 (dashed).
27 / 63
Gaussian regression model (I)
❑ We use a Gaussian regression model to relate the inputs and outputs
y = f (x, w) + ϵ,
where ϵ ∼ N (0, σ 2 ).
❑ It assumes that each output yi we observe can be explained as the
prediction of an underlying model, f (xi , w), plus a noise term ϵi .
❑ For a fixed x and a fixed w, f (x, w) is a constant, so
y = constant + ϵ,
28 / 63
Gaussian regression model (II)
❑ This means that
y ∼ N (constant , σ 2 ),
y ∼ N (f (x, w), σ 2 ).
❑ If we knew the value of w, then given a new input x∗ we could predict the
output as f (x∗ , w).
❑ σ 2 tells us the noise variance.
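❑ A minimal simulation of this model, assuming made-up values for w, σ² and a single input x; each draw of ϵ gives a different observed y around the constant f(x, w):

```python
import numpy as np

rng = np.random.default_rng(1)

# Made-up weights, noise variance and a single input (x0 = 1 for the bias)
w = np.array([0.5, 2.0])
sigma2 = 0.25
x = np.array([1.0, 3.0])

f = w @ x                                            # the "constant" f(x, w)
y = f + rng.normal(0.0, np.sqrt(sigma2), size=5)     # y ~ N(f(x, w), sigma^2)

print(f, y)
```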
29 / 63
Gaussian regression model (III)
[Figure: the regression function f(x, w); at a particular input, the Gaussian p(y | x, w, σ²) is centred at f(x, w) with width roughly 2σ.]
30 / 63
How do we estimate w? (I)
❑ We start with a training dataset (x1 , y1 ), · · · , (xN , yN ).
31 / 63
How do we estimate w? (II)
❑ Putting both assumptions together, we get
p(y \mid X, w, \sigma^2) = \prod_{n=1}^{N} p(y_n \mid x_n, w, \sigma^2) = \prod_{n=1}^{N} \mathcal{N}(y_n \mid w^\top x_n, \sigma^2),
32 / 63
How do we estimate w? (III)
❑ When we look at a Gaussian pdf, like
p(y) = \frac{1}{\sqrt{2\pi\sigma^2}} \exp\left\{ -\frac{(y - \mu)^2}{2\sigma^2} \right\},
we assume that both µ and σ 2 are given. In this case, the pdf follows all
the properties we reviewed before.
❑ The same is true for
p(y \mid X, w, \sigma^2) = \prod_{n=1}^{N} p(y_n \mid x_n, w, \sigma^2) = \prod_{n=1}^{N} \mathcal{N}(y_n \mid w^\top x_n, \sigma^2).
33 / 63
How do we estimate w? (IV)
❑ With yn and xn given but with unknown values for w and σ 2 , each
p(yn |xn , w, σ 2 ) is not a pdf anymore.
❑ In that case, the function

p(y \mid X, w, \sigma^2) = \prod_{n=1}^{N} \mathcal{N}(y_n \mid w^\top x_n, \sigma^2),

can instead be read as a function of the unknown parameters,

g(w, \sigma^2) = p(y \mid X, w, \sigma^2),

known as the likelihood function.
34 / 63
How do we estimate w? (V)
❑ Given y, X, we use the ML criterion to find the parameters w and σ 2 that
maximise
p(y \mid X, w, \sigma^2) = \frac{1}{(2\pi\sigma^2)^{N/2}} \exp\left\{ -\frac{1}{2\sigma^2} \sum_{n=1}^{N} \left(y_n - w^\top x_n\right)^2 \right\}.
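❑ As a sketch, the log of this expression can be evaluated directly; the function below implements the log-likelihood LL(w, σ²) on a made-up dataset:

```python
import numpy as np

def log_likelihood(w, sigma2, X, y):
    """Gaussian log-likelihood LL(w, sigma^2) of the linear regression model."""
    N = len(y)
    resid = y - X @ w
    return (-0.5 * N * np.log(2 * np.pi)
            - 0.5 * N * np.log(sigma2)
            - 0.5 * resid @ resid / sigma2)

# Made-up dataset with a column of ones for the bias term
rng = np.random.default_rng(0)
X = np.column_stack([np.ones(20), rng.normal(size=20)])
y = X @ np.array([1.0, -2.0]) + rng.normal(0.0, 0.3, size=20)

print(log_likelihood(np.array([1.0, -2.0]), 0.09, X, y))
```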
35 / 63
Connection with the sum of squared errors
❑ If we multiply LL(w, σ 2 ) by minus one, we get
E(w, \sigma^2) = -\log p(y \mid X, w, \sigma^2) \propto \sum_{n=1}^{N} \left(y_n - w^\top x_n\right)^2.
❑ The ML criterion for this model has a close connection with the
sum-of-squared errors used in non-probabilistic formulations of linear
regression.
36 / 63
Normal equation (I)
❑ Let us find an estimate for w.
❑ From what we saw before,
LL(w, \sigma^2) = -\frac{N}{2}\log(2\pi) - \frac{N}{2}\log\sigma^2 - \frac{1}{2\sigma^2}\sum_{n=1}^{N}\left(y_n - w^\top x_n\right)^2.

❑ In matrix form,

LL(w, \sigma^2) = -\frac{N}{2}\log(2\pi) - \frac{N}{2}\log\sigma^2 - \frac{1}{2\sigma^2}(y - Xw)^\top(y - Xw),

where

(y - Xw)^\top(y - Xw) = y^\top y - w^\top X^\top y - y^\top Xw + w^\top X^\top Xw.
37 / 63
Normal equation (II)
38 / 63
Normal equation (III)
❑ Putting these terms together, we get

\frac{d}{dw} LL(w, \sigma^2) = \frac{1}{2\sigma^2} X^\top y + \frac{1}{2\sigma^2} X^\top y - \frac{1}{2\sigma^2} 2 X^\top X w = \frac{1}{\sigma^2} X^\top y - \frac{1}{\sigma^2} X^\top X w.

❑ Setting this derivative to zero leads to the normal equation

w_* = \left(X^\top X\right)^{-1} X^\top y.

❑ Maximising LL(w, \sigma^2) with respect to \sigma^2 gives

\sigma_*^2 = \frac{1}{N}(y - Xw_*)^\top (y - Xw_*).
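❑ A minimal NumPy sketch of these two estimates on a made-up dataset (the true weights and noise level below are arbitrary choices):

```python
import numpy as np

rng = np.random.default_rng(2)

# Made-up dataset: n = 50 points, two attributes plus a column of ones
n = 50
X = np.column_stack([np.ones(n), rng.normal(size=(n, 2))])
w_true = np.array([0.5, 2.0, -1.0])
y = X @ w_true + rng.normal(0.0, 0.3, size=n)

# Normal equation: solve X^T X w = X^T y (preferred over forming the inverse explicitly)
w_star = np.linalg.solve(X.T @ X, X.T @ y)

# ML estimate of the noise variance
resid = y - X @ w_star
sigma2_star = resid @ resid / n

print(w_star, sigma2_star)
```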
40 / 63
Basis functions
❑ The model that is linear in x only allows linear relationships between x
and y .
❑ A more flexible option is to first transform the input through a set of basis functions,

f(x, w) = w^\top \phi(x) = \sum_{i=0}^{M} w_i \phi_i(x),

where \phi_i(x) are basis functions, we have M + 1 parameters in the
vector w, and \phi(x) = [1, \phi_1(x), \cdots, \phi_M(x)]^\top.
41 / 63
Examples of basis functions
[Figure: examples of basis functions plotted over −1 ≤ x ≤ 1.]

Polynomial: \phi_i(x) = x^i.

Exponential: \phi_i(x) = \exp\left\{ -\frac{(x - \mu_i)^2}{2s^2} \right\}.
42 / 63
Transforming the input using the basis functions
❑ As an example, let us use polynomial basis functions to predict y , the
winning time in seconds in the Olympic 100 m competition.
43 / 63
Normal equations with a design matrix
❑ Given X, we first compute a new design matrix Φ,
\Phi = \begin{bmatrix} \phi(x_1)^\top \\ \phi(x_2)^\top \\ \vdots \\ \phi(x_N)^\top \end{bmatrix}
     = \begin{bmatrix}
         \phi_0(x_1) & \phi_1(x_1) & \cdots & \phi_M(x_1) \\
         \phi_0(x_2) & \phi_1(x_2) & \cdots & \phi_M(x_2) \\
         \vdots      &             &        & \vdots      \\
         \phi_0(x_N) & \phi_1(x_N) & \cdots & \phi_M(x_N)
       \end{bmatrix}
❑ We can now use (y, Φ) and write the Gaussian linear regression
problem as

p(y \mid X, w, \sigma^2) = \prod_{n=1}^{N} \mathcal{N}(y_n \mid w^\top \phi_n, \sigma^2),

where \phi_n = \phi(x_n).
❑ Using the ML criterion, we arrive at the following normal equation

w_* = \left(\Phi^\top \Phi\right)^{-1} \Phi^\top y.
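❑ A sketch of this route with polynomial basis functions; the inputs and targets below are made up (think of x as a rescaled year and y as a time in seconds):

```python
import numpy as np

def design_matrix(x, M):
    """Polynomial design matrix Phi with columns phi_i(x) = x**i, i = 0, ..., M."""
    return np.column_stack([x**i for i in range(M + 1)])

# Made-up 1-D inputs and targets
rng = np.random.default_rng(3)
x = np.linspace(0.0, 1.0, 30)
y = 11.0 - 1.2 * x + 0.3 * np.sin(4 * x) + rng.normal(0.0, 0.05, size=30)

M = 5
Phi = design_matrix(x, M)

# Normal equation with the design matrix: w* = (Phi^T Phi)^{-1} Phi^T y
w_star = np.linalg.solve(Phi.T @ Phi, Phi.T @ y)
print(w_star)
```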
44 / 63
Olympic 100 m data with M = 5

[Figure: Olympic 100 m data with a fitted polynomial of order M = 5; vertical axis in seconds.]
45 / 63
Alternative to find w
46 / 63
Contents
A regression model
Linear regression
Gradient descent
Regularisation
47 / 63
General problem
w_{k+1} = w_k + \eta d_k,
48 / 63
Gradient descent
w_{k+1} = w_k - \eta g_k.
49 / 63
Step size
❑ The main issue in gradient descent is how to set the step size.
❑ If it is too small, convergence will be very slow. If it is too large, the
method can fail to converge at all.
[Figure: two panels, (a) and (b), showing gradient descent iterates for different step sizes.]
50 / 63
Alternatives to choose the step size η
❑ Line search methods may use search directions other than the steepest
descent direction.
51 / 63
Gradient descent for linear regression (I)
w_{k+1} = w_k - \eta \left. \frac{d}{dw} E(w) \right|_{w = w_k}.
52 / 63
Gradient descent for linear regression (II)
❑ Computing the gradient for E (w), we get
\frac{d}{dw} E(w) = \frac{2}{N} \sum_{n=1}^{N} \left(w^\top x_n - y_n\right) x_n = \frac{2}{N} X^\top (Xw - y).

❑ The update is therefore

w_{k+1} = w_k - \eta \, \frac{2}{N} X^\top (X w_k - y).
❑ The computation of the gradient involves using the whole dataset (X, y)
at every step.
❑ If the attributes are in very different scales, it may take a long time to
converge.
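❑ A minimal implementation of this update, assuming E(w) = (1/N)(y − Xw)⊤(y − Xw); the dataset, step size and number of steps are made-up choices, and the result is compared with the normal-equation solution:

```python
import numpy as np

def gradient_descent(X, y, eta=0.1, n_steps=500):
    """Batch gradient descent for E(w) = (1/N) * (y - Xw)^T (y - Xw)."""
    N, d = X.shape
    w = np.zeros(d)
    for _ in range(n_steps):
        grad = (2.0 / N) * X.T @ (X @ w - y)   # d E(w) / dw
        w = w - eta * grad                      # w_{k+1} = w_k - eta * g_k
    return w

# Made-up dataset; the result should be close to the normal-equation solution
rng = np.random.default_rng(4)
X = np.column_stack([np.ones(100), rng.normal(size=(100, 2))])
y = X @ np.array([0.5, 2.0, -1.0]) + rng.normal(0.0, 0.3, size=100)

print(gradient_descent(X, y))
print(np.linalg.solve(X.T @ X, X.T @ y))
```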
54 / 63
Contents
A regression model
Linear regression
Gradient descent
Regularisation
55 / 63
Online learning and large datasets
❑ There are settings, though, where only a subset of the data can be
used.
56 / 63
Stochastic gradient descent (I)
❑ The word stochastic refers to the fact that the value for gk will depend
on the subset of the instances chosen for computation.
57 / 63
Stochastic gradient descent (II)
g_k = \frac{1}{|S|} \sum_{i \in S} g_{k,i},
58 / 63
Step size in SGD
❑ Choosing the value of η is particularly important in SGD since there is
no easy way to compute it.
\eta_k = \frac{1}{k}, \qquad \eta_k = \frac{1}{(\tau_0 + k)^{\kappa}},
59 / 63
Contents
A regression model
Linear regression
Gradient descent
Regularisation
60 / 63
What is regularisation?
❑ It refers to a technique used for preventing overfitting in a predictive
model.
❑ It works by adding a penalty term R(w) to the error function, so that we
minimise h(w) = E(w) + λ R(w), where λ > 0 controls the strength of the penalty.
❑ In the expression for h(w), we can use the negative LL(w) instead of
E (w).
61 / 63
Different types of regularisation
R(w) = \alpha \lVert w \rVert_1 + \frac{1}{2}(1 - \alpha)\lVert w \rVert_2^2,

where \lVert w \rVert_1 = \sum_{m=1}^{p} |w_m| and \lVert w \rVert_2^2 = \sum_{m=1}^{p} w_m^2.
❑ If α = 1, we get ℓ1 regularisation.
❑ If α = 0, we get ℓ2 regularisation.
62 / 63
Ridge regression or ℓ2 regularisation
❑ In ridge regression, α = 0,
h(w) = \frac{1}{N} \sum_{n=1}^{N} \left(y_n - w^\top x_n\right)^2 + \frac{\lambda}{2} w^\top w,
❑ Notice that we can also use an iterative procedure to optimise h(w),
either through batch gradient descent, SGD or mini-batch SGD.
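❑ A minimal sketch of optimising h(w) by batch gradient descent on a made-up dataset; note that, exactly as in the formula above, the penalty is applied to every entry of w, including the bias:

```python
import numpy as np

def ridge_gd(X, y, lam=0.1, eta=0.1, n_steps=1000):
    """Batch gradient descent on h(w) = (1/N)*(y - Xw)^T (y - Xw) + (lam/2)*w^T w."""
    N, d = X.shape
    w = np.zeros(d)
    for _ in range(n_steps):
        grad = (2.0 / N) * X.T @ (X @ w - y) + lam * w   # gradient of h(w)
        w = w - eta * grad
    return w

# Made-up dataset; with lam = 0 this reduces to unregularised linear regression
rng = np.random.default_rng(5)
X = np.column_stack([np.ones(80), rng.normal(size=(80, 3))])
y = X @ np.array([0.5, 2.0, -1.0, 0.0]) + rng.normal(0.0, 0.3, size=80)

print(ridge_gd(X, y, lam=0.1))
print(ridge_gd(X, y, lam=0.0))
```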
63 / 63