
Linear regression

Mauricio A. Álvarez

Machine Learning and Adaptive Intelligence


The University of Sheffield

1 / 63
Contents

Review of vector/matrix notation and linear algebra

A regression model

Linear regression

Gradient descent

Stochastic Gradient Descent

Regularisation

2 / 63
Scalar and vectors

❑ A scalar is just a numeric value like 0.9 or −18.7.

❑ Scalars are usually denoted as lower case letters like x or a.

❑ A vector is an ordered list of scalar values. Sometimes we refer to these scalar values of the vector as attributes or entries of the vector.

❑ Vectors are usually denoted by bold lowercase letters like x or y.

3 / 63
Vectors

❑ A vector can appear sometimes written as a row vector, e.g.

x = [x1 , x2 , x3 , x4 , x5 ]

Or as a column vector

x = \begin{bmatrix} x_1 \\ x_2 \\ x_3 \\ x_4 \\ x_5 \end{bmatrix}

❑ In this module, ALL vectors will be column vectors by default. So, when
you see a vector, e.g. x, y, z always think this vector has a column-wise
shape.

4 / 63
Matrices
❑ A matrix is a rectangular array of scalars arranged in rows and
columns.

❑ Matrices are usually denoted by bold uppercase letters, e.g. X or Y.

❑ The following matrix has three rows and two columns


 
X = \begin{bmatrix} x_{11} & x_{12} \\ x_{21} & x_{22} \\ x_{31} & x_{32} \end{bmatrix}

❑ The entries in the matrix above are of the form xij , where the first
subindex i indicates the row of the element and the second subindex j
indicates the column.

5 / 63
Matrix transpose

❑ Let X be a matrix with elements xij .

❑ The transpose of a matrix X is a new matrix X⊤ with elements xji .


 
X = \begin{bmatrix} 4.1 & -5.6 \\ -2.6 & 7.9 \\ 3.5 & 1.8 \end{bmatrix}, \qquad X^\top = \begin{bmatrix} 4.1 & -2.6 & 3.5 \\ -5.6 & 7.9 & 1.8 \end{bmatrix}

6 / 63
Matrix multiplication

❑ Let A be a matrix with entries aik of dimensions p × q .

❑ Let B be a matrix with entries bkj of dimensions t × s.

❑ Matrix multiplication of the form AB is only possible if q = t .

❑ If this is the case, the matrix C = AB has dimensions p × s with entries

c_{ij} = \sum_{k} a_{ik} b_{kj}.

7 / 63
Transpose of a product
❑ Let w be a vector of dimensions d × 1. Let X be a matrix with
dimensions n × d .

❑ The transpose of the product Xw, (Xw)⊤ is

(Xw)⊤ = w⊤ X⊤ .

❑ We can apply this result to a product of several matrices

(ABCD)⊤ = ((AB)(CD))⊤
= (CD)⊤ (AB)⊤
= D⊤ C⊤ B⊤ A⊤ .
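As a quick numerical check, here is a minimal numpy sketch (the shapes are arbitrary choices for illustration):

import numpy as np

rng = np.random.default_rng(0)
A = rng.normal(size=(2, 3))
B = rng.normal(size=(3, 4))
C = rng.normal(size=(4, 5))
D = rng.normal(size=(5, 6))

# (ABCD)^T should equal D^T C^T B^T A^T
lhs = (A @ B @ C @ D).T
rhs = D.T @ C.T @ B.T @ A.T
print(np.allclose(lhs, rhs))  # True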

8 / 63
From a scalar operation to a vector operation

❑ It is usually desirable to transform a scalar operation into a vector operation.

❑ When coding scalar operations, we need to make use of loops, which can be expensive.

❑ In contrast, vector operations are handled efficiently by low-level routines already included in modules like numpy, as the sketch below illustrates.
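A minimal sketch of the difference (array sizes are arbitrary; both versions compute the same inner product):

import numpy as np

x = np.random.randn(100_000)
w = np.random.randn(100_000)

# Scalar version: an explicit Python loop over the entries
total = 0.0
for xi, wi in zip(x, w):
    total += xi * wi

# Vector version: a single call to an optimised low-level routine
total_vec = x @ w

print(np.isclose(total, total_vec))  # True, but the loop is far slower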

9 / 63
Example

Write the following scalar operation into a vector/matrix form


\sum_{i=1}^{n} \left( y_i - \sum_{j=1}^{d} x_{ij} w_j \right)^2.

10 / 63
Answer (I)

❑ The sum above can be written as

\sum_{i=1}^{n} \left( y_i - \sum_{j=1}^{d} x_{ij} w_j \right)^2 = \left( y_1 - \sum_{j=1}^{d} x_{1j} w_j \right)\left( y_1 - \sum_{j=1}^{d} x_{1j} w_j \right) + \cdots + \left( y_n - \sum_{j=1}^{d} x_{nj} w_j \right)\left( y_n - \sum_{j=1}^{d} x_{nj} w_j \right).

❑ Let us define a vector v of dimensions n × 1 with entries given as

\left( y_i - \sum_{j=1}^{d} x_{ij} w_j \right).

11 / 63
Answer (II)

❑ The product of vectors v⊤ v gives the same result as the required sum,

v^\top v = \begin{bmatrix} \left( y_1 - \sum_{j=1}^{d} x_{1j} w_j \right) & \cdots & \left( y_n - \sum_{j=1}^{d} x_{nj} w_j \right) \end{bmatrix} \begin{bmatrix} \left( y_1 - \sum_{j=1}^{d} x_{1j} w_j \right) \\ \vdots \\ \left( y_n - \sum_{j=1}^{d} x_{nj} w_j \right) \end{bmatrix} = \sum_{i=1}^{n} \left( y_i - \sum_{j=1}^{d} x_{ij} w_j \right)^2.

❑ How do we express the elements in v with vectors and matrices?

12 / 63
Answer (III)
❑ For a fixed i, the values x_{i1}, . . . , x_{id} can be grouped into a row vector x_i^⊤.

❑ The internal sums in the entries of v can then be written as

\sum_{j=1}^{d} x_{ij} w_j = x_i^\top w = \begin{bmatrix} x_{i1} & x_{i2} & \cdots & x_{id} \end{bmatrix} \begin{bmatrix} w_1 \\ w_2 \\ \vdots \\ w_d \end{bmatrix}

❑ We can now write v as

v = \begin{bmatrix} y_1 - x_1^\top w \\ \vdots \\ y_n - x_n^\top w \end{bmatrix} = \begin{bmatrix} y_1 \\ \vdots \\ y_n \end{bmatrix} - \begin{bmatrix} x_1^\top w \\ \vdots \\ x_n^\top w \end{bmatrix} = \begin{bmatrix} y_1 \\ \vdots \\ y_n \end{bmatrix} - \begin{bmatrix} x_1^\top \\ \vdots \\ x_n^\top \end{bmatrix} w

❑ We can group the scalars y1 , . . . , yn into a vector y.


❑ We can group the row vectors x_1^⊤, . . . , x_n^⊤ into a matrix X.

13 / 63
Answer (IV)

❑ It means that v = y − Xw.

❑ Finally
\sum_{i=1}^{n} \left( y_i - \sum_{j=1}^{d} x_{ij} w_j \right)^2 = v^\top v = (y - Xw)^\top (y - Xw).
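A minimal numpy sketch confirming this identity on randomly generated data (the variable names are ours):

import numpy as np

rng = np.random.default_rng(1)
n, d = 100, 5
X = rng.normal(size=(n, d))
w = rng.normal(size=d)
y = rng.normal(size=n)

# Scalar form: double sum over i and j
sse_loop = sum((y[i] - sum(X[i, j] * w[j] for j in range(d)))**2 for i in range(n))

# Vector/matrix form: (y - Xw)^T (y - Xw)
v = y - X @ w
sse_vec = v @ v

print(np.isclose(sse_loop, sse_vec))  # True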

14 / 63
Two common types of products
❑ Inner product. The inner product between two vectors results in a
scalar.
❑ Let x and y be vectors of dimension m × 1. The inner product is given as

x^\top y = \sum_{i=1}^{m} x_i y_i.

❑ Outer product. The outer product between two vectors results in a


matrix.
❑ Let x be a vector of dimension m × 1 and y a vector of dimension p × 1. The outer product is given as

xy^\top = \begin{bmatrix} x_1 y_1 & \cdots & x_1 y_p \\ x_2 y_1 & \cdots & x_2 y_p \\ \vdots & \ddots & \vdots \\ x_m y_1 & \cdots & x_m y_p \end{bmatrix}.
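In numpy, a minimal sketch with small concrete vectors:

import numpy as np

x = np.array([1.0, 2.0, 3.0])
y = np.array([4.0, 5.0, 6.0])

inner = x @ y           # scalar: 1*4 + 2*5 + 3*6 = 32.0
outer = np.outer(x, y)  # 3 x 3 matrix with entries x_i * y_j

print(inner)            # 32.0
print(outer.shape)      # (3, 3)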

15 / 63
Differentiating a function in a vector/matrix form (I)

❑ We will see cases in which a function f (w) depends on some parameters grouped in a vector w.

❑ We would like to find the vector of parameters w that maximise f (w).

❑ For example, suppose f (w) is defined as

f(w) = \sum_{i=1}^{d} w_i x_i.

❑ We can group the scalars x1 , . . . , xd into x. Likewise for w.

❑ According to what we saw before, we can write f (w) as f (w) = x⊤ w.

16 / 63
Differentiating a function in a vector/matrix form (II)

❑ For a fixed x, we are interested in computing the gradient of f (w) with respect to w

\frac{df(w)}{dw} = \begin{bmatrix} \frac{\partial f(w)}{\partial w_1} \\ \vdots \\ \frac{\partial f(w)}{\partial w_d} \end{bmatrix} = \begin{bmatrix} x_1 \\ \vdots \\ x_d \end{bmatrix} = x.

❑ Some useful identities when differentiating with respect to a vector:

  f (w)        df (w)/dw
  w⊤ x         x
  x⊤ w         x
  w⊤ w         2w
  w⊤ Cw        2Cw   (assuming C is symmetric)
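A minimal numpy sketch checking the last identity by finite differences (all names are ours; C is symmetrised so that the 2Cw form applies):

import numpy as np

rng = np.random.default_rng(2)
d = 4
w = rng.normal(size=d)
C = rng.normal(size=(d, d))
C = 0.5 * (C + C.T)  # make C symmetric

def f(v):
    return v @ C @ v  # f(w) = w^T C w

# Central finite-difference approximation of the gradient
eps = 1e-6
grad_fd = np.array([(f(w + eps * e) - f(w - eps * e)) / (2 * eps) for e in np.eye(d)])

print(np.allclose(grad_fd, 2 * C @ w, atol=1e-5))  # True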

17 / 63
Identity matrix and the inverse of a matrix

❑ The identity matrix of size N is a square matrix with ones on the main
diagonal and zeros elsewhere, e.g.,
 
I_3 = \begin{bmatrix} 1 & 0 & 0 \\ 0 & 1 & 0 \\ 0 & 0 & 1 \end{bmatrix}

❑ The inverse matrix of a matrix A of dimensions d × d, denoted as A^{-1}, satisfies

A A^{-1} = A^{-1} A = I_d

18 / 63
Contents

Review of vector/matrix notation and linear algebra

A regression model

Linear regression

Gradient descent

Stochastic Gradient Descent

Regularisation

19 / 63
Olympic 100m Data

Image from Wikimedia Commons http://bit.ly/191adDC.


20 / 63
Dataset

Male 100 m

[Figure: scatter plot of the winning time in seconds (vertical axis, roughly 10.0–12.0) against the year of the competition (horizontal axis, 1900–2000).]

21 / 63
Model

❑ We will use a linear model f (x , w) to predict y , where y is the time in seconds and x the year of the competition.

❑ The linear model is given as

f (x , w) = w0 + w1 x ,

where w0 is the intercept and w1 is the slope.

❑ We use w to refer both to w0 and w1 .

22 / 63
Data and model

Male 100 m

[Figure: the same scatter plot of winning time in seconds against year, with the fitted straight line f (x , w) = w0 + w1 x overlaid.]

23 / 63
Contents

Review of vector/matrix notation and linear algebra

A regression model

Linear regression

Gradient descent

Stochastic Gradient Descent

Regularisation

24 / 63
Linear model
❑ A simple model for regression consists in using a linear combination of
the attributes to predict the output

f (x, w) = w0 + w1 x1 + . . . + wD xD ,

where w0 , w1 , · · · , wD are the parameters of the regression model.

❑ The term w0 is the bias term or intercept, e.g. f (0, w) = w0 .

❑ The expression above can be written in a vectorial form

f (x, w) = w⊤ x.

where we have defined w = [w0 , w1 , · · · , wD ]⊤ and x = [1, x1 , · · · , xD ]⊤ .

❑ Notice that x0 = 1.

25 / 63
Parenthesis: Gaussian pdf

❑ The Gaussian pdf has the form

p(y) = \frac{1}{\sqrt{2\pi\sigma^2}} \exp\left\{ -\frac{(y - \mu)^2}{2\sigma^2} \right\}.

❑ A Gaussian pdf requires two parameters µ and σ 2 , the mean and the
variance of the RV Y .

❑ We denote the Gaussian pdf as p(y |µ, σ 2 ) = N (y |µ, σ 2 ) or y ∼ N (µ, σ 2 ).
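A minimal numpy sketch of this density (the function name is ours; it can be cross-checked against scipy.stats.norm.pdf):

import numpy as np

def gaussian_pdf(y, mu, sigma2):
    """Evaluate N(y | mu, sigma2) elementwise."""
    return np.exp(-(y - mu)**2 / (2 * sigma2)) / np.sqrt(2 * np.pi * sigma2)

y = np.linspace(-2.0, 6.0, 5)
print(gaussian_pdf(y, mu=2.0, sigma2=0.5))
print(gaussian_pdf(y, mu=2.0, sigma2=2.0))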

26 / 63
Parenthesis: Gaussian pdf

The Gaussians shown have mean µ = 2 and variances σ 2 = 0.5 (solid) and σ 2 = 2 (dashed).

[Figure: plot of the two Gaussian pdfs described above.]

27 / 63
Gaussian regression model (I)
❑ We use a Gaussian regression model to relate the inputs and outputs

y = f (x, w) + ϵ,

where ϵ ∼ N (0, σ 2 ).
❑ It assumes that each output yi that we observe can be explained as the
prediction of an underlying model, f (xi , w) plus a noise term ϵi .
❑ For a fixed x and a fixed w, f (x, w) is a constant, then

y = constant + ϵ,

where ϵ is a continuous RV.


❑ What is the pdf for y ? (we are adding a constant to a Gaussian RV)
– E {y } = E {constant + ϵ} = constant
– var{y } = var{constant} + var{ϵ} = σ 2 .

28 / 63
Gaussian regression model (II)
❑ This means that

y ∼ N (constant , σ 2 ),

where we said constant was f (x, w), this is,

y ∼ N (f (x, w), σ 2 ).

❑ Because we assumed that x and w are given, we can also write

p(y |x, w, σ 2 ) = N (y |f (x, w), σ 2 ).

❑ If we knew the value for w, once we have a new x∗ , we can predict the
output as f (x∗ , w).
❑ σ 2 tells us the noise variance.

29 / 63
Gaussian regression model (III)

[Figure: the regression function f (x, w) plotted against x; at a given input, the conditional density p(y | x, w, σ 2 ) is a Gaussian centred at f (x, w) with a width of roughly 2σ.]

30 / 63
How do we estimate w? (I)
❑ We start with a training dataset (x1 , y1 ), · · · , (xN , yN ).

❑ We assume that the random variables Y1 , · · · , YN are independent,

p(y_1, \cdots, y_N \,|\, x_1, \cdots, x_N) = p(y_1|x_1) \cdots p(y_N|x_N) = \prod_{n=1}^{N} p(y_n|x_n).

❑ We also assume that the RVs Y1 , · · · , YN follow an identical distribution, Gaussian in this case

p(yn |xn , w, σ 2 ) = N (yn |f (xn , w), σ 2 ) = N (yn |w⊤ xn , σ 2 ).

❑ Both assumptions go by the name of the iid assumption, independent and identically distributed.

31 / 63
How do we estimate w? (II)
❑ Putting both assumptions together, we get

p(y|X, w, \sigma^2) = \prod_{n=1}^{N} p(y_n | x_n, w, \sigma^2) = \prod_{n=1}^{N} \mathcal{N}(y_n | w^\top x_n, \sigma^2),

where y = [y1 , · · · , yN ]⊤ ∈ RN ×1 and X = [x1 , · · · , xN ]⊤ ∈ RN ×(D+1) .

❑ The expression above can then be written as

p(y|X, w, \sigma^2) = \prod_{n=1}^{N} \mathcal{N}(y_n | w^\top x_n, \sigma^2)
                    = \prod_{n=1}^{N} \frac{1}{\sqrt{2\pi\sigma^2}} \exp\left\{ -\frac{(y_n - w^\top x_n)^2}{2\sigma^2} \right\}
                    = \frac{1}{(2\pi\sigma^2)^{N/2}} \exp\left\{ -\frac{1}{2\sigma^2} \sum_{n=1}^{N} (y_n - w^\top x_n)^2 \right\}.

32 / 63
How do we estimate w? (III)
❑ When we look at a Gaussian pdf, like

p(y) = \frac{1}{\sqrt{2\pi\sigma^2}} \exp\left\{ -\frac{(y - \mu)^2}{2\sigma^2} \right\},

we assume that both µ and σ 2 are given. In this case, the pdf follows all
the properties we reviewed before.
❑ The same is true for

p(y|X, w, \sigma^2) = \prod_{n=1}^{N} p(y_n | x_n, w, \sigma^2) = \prod_{n=1}^{N} \mathcal{N}(y_n | w^\top x_n, \sigma^2).

❑ Given w⊤ xn and σ 2 , each p(yn |xn , w, σ 2 ) is a pdf.

❑ A different approach would be to say: I have some data for {y_n}_{n=1}^{N} and {x_n}_{n=1}^{N}, but
– “I don’t know what w is (therefore I don’t know what w⊤ xn is)”
– “I don’t know what σ 2 is”.

33 / 63
How do we estimate w? (IV)
❑ With yn and xn given but with unknown values for w and σ 2 , each
p(yn |xn , w, σ 2 ) is not a pdf anymore.
❑ In that case, the function

p(y|X, w, \sigma^2) = \prod_{n=1}^{N} \mathcal{N}(y_n | w^\top x_n, \sigma^2),

receives the name of a likelihood function.


❑ We can think of a likelihood function as a function of the parameters w
and σ 2 ,

g (w, σ 2 ) = p(y|X, w, σ 2 ),

❑ And subsequently, we can use multivariate calculus to find the values of w, σ 2 that maximise g (w, σ 2 ).
❑ In statistics, this is known as the maximum-likelihood (ML) criterion to
estimate parameters.

34 / 63
How do we estimate w? (V)
❑ Given y, X, we use the ML criterion to find the parameters w and σ 2 that maximise

p(y|X, w, \sigma^2) = \frac{1}{(2\pi\sigma^2)^{N/2}} \exp\left\{ -\frac{1}{2\sigma^2} \sum_{n=1}^{N} (y_n - w^\top x_n)^2 \right\}.

❑ In practice, we prefer to maximise the log of the likelihood p(y|X, w, σ 2 ),

LL(w, \sigma^2) = \log p(y|X, w, \sigma^2) = -\frac{N}{2} \log(2\pi) - \frac{N}{2} \log \sigma^2 - \frac{1}{2\sigma^2} \sum_{n=1}^{N} (y_n - w^\top x_n)^2.

❑ Consistency of the ML criterion: if the data was really generated according to the probability model we specified, the correct parameters will be recovered in the limit as N → ∞.
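A minimal numpy sketch of LL(w, σ 2 ) as written above, evaluated on synthetic data (the function and variable names are ours):

import numpy as np

def log_likelihood(w, sigma2, X, y):
    """LL(w, sigma2) = -N/2 log(2 pi) - N/2 log(sigma2) - 1/(2 sigma2) * sum_n (y_n - w^T x_n)^2."""
    N = len(y)
    resid = y - X @ w
    return (-0.5 * N * np.log(2 * np.pi)
            - 0.5 * N * np.log(sigma2)
            - 0.5 * resid @ resid / sigma2)

rng = np.random.default_rng(3)
X = np.column_stack([np.ones(50), rng.normal(size=50)])  # x_n = [1, x_n1]^T
w_true = np.array([1.0, -2.0])
y = X @ w_true + rng.normal(scale=0.3, size=50)

print(log_likelihood(w_true, sigma2=0.3**2, X=X, y=y))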

35 / 63
Connection with the sum of squared errors
❑ If we multiply LL(w, σ 2 ) by minus one, we get

E(w, \sigma^2) = -\log p(y|X, w, \sigma^2) \propto \sum_{n=1}^{N} (y_n - w^\top x_n)^2.

❑ The ML criterion for this model has a close connection with the
sum-of-squared errors used in non-probabilistic formulations of linear
regression.

❑ Maximising the log-likelihood function is equivalent to minimising the sum-of-squared errors.

❑ Notice that the log is a monotonic function, meaning that if we find w, σ 2 that maximise g (w, σ 2 ), those will also maximise log(g (w, σ 2 )).

36 / 63
Normal equation (I)
❑ Let us find an estimate for w.
❑ From what we saw before,

LL(w, \sigma^2) = -\frac{N}{2} \log(2\pi) - \frac{N}{2} \log \sigma^2 - \frac{1}{2\sigma^2} \sum_{n=1}^{N} (y_n - w^\top x_n)^2.

❑ Using what we reviewed in the section on vector/matrix notation, it can be shown that this expression can be written in a vectorial form as

LL(w, \sigma^2) = -\frac{N}{2} \log(2\pi) - \frac{N}{2} \log \sigma^2 - \frac{1}{2\sigma^2} (y - Xw)^\top (y - Xw)

❑ Let us focus on the term (y − Xw)⊤ (y − Xw),

(y − Xw)⊤ (y − Xw) = y⊤ y − w⊤ X⊤ y − y⊤ Xw + w⊤ X⊤ Xw

37 / 63
Normal equation (II)

❑ We can find the w that maximises LL(w, σ 2 ) by taking the gradient dLL(w, σ 2 )/dw, equating it to zero and solving for w.

❑ Taking the gradient of each term in LL(w, σ 2 ) wrt w, we get

\frac{d}{dw}\left[ -\frac{N}{2}\log(2\pi) \right] = 0, \quad \frac{d}{dw}\left[ -\frac{N}{2}\log\sigma^2 \right] = 0, \quad \frac{d}{dw}\left[ -\frac{1}{2\sigma^2} y^\top y \right] = 0,

\frac{d}{dw}\left[ \frac{1}{2\sigma^2} w^\top X^\top y \right] = \frac{1}{2\sigma^2} X^\top y,

\frac{d}{dw}\left[ \frac{1}{2\sigma^2} y^\top X w \right] = \frac{1}{2\sigma^2} X^\top y,

\frac{d}{dw}\left[ -\frac{1}{2\sigma^2} w^\top X^\top X w \right] = -\frac{1}{2\sigma^2} 2 X^\top X w

38 / 63
Normal equation (III)
❑ Putting these terms together, we get

\frac{d}{dw} LL(w, \sigma^2) = \frac{1}{2\sigma^2} X^\top y + \frac{1}{2\sigma^2} X^\top y - \frac{1}{2\sigma^2} 2 X^\top X w = \frac{1}{\sigma^2} X^\top y - \frac{1}{\sigma^2} X^\top X w

❑ Now, equating to zero and solving for w, we get

\frac{1}{\sigma^2} X^\top y - \frac{1}{\sigma^2} X^\top X w = 0
\quad\Longrightarrow\quad X^\top X w = X^\top y
\quad\Longrightarrow\quad w_* = \left( X^\top X \right)^{-1} X^\top y.

❑ The expression for w∗ is known as the normal equation.

❑ The solution for w∗ exists if we can compute (X^\top X)^{-1}.

❑ The inverse can be computed as long as X^\top X is non-singular (i.e., its determinant is different from zero, or equivalently it has full rank).
39 / 63
Solving for σ∗2

❑ Following a similar procedure, it can be shown that the ML solution for σ∗2 is given as

\sigma_*^2 = \frac{1}{N} (y - Xw_*)^\top (y - Xw_*).
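A minimal numpy sketch of the two ML estimates above on synthetic data, using np.linalg.solve instead of forming the inverse explicitly (all names are ours):

import numpy as np

rng = np.random.default_rng(4)
N, D = 200, 3
X = np.column_stack([np.ones(N), rng.normal(size=(N, D))])  # first column of ones
w_true = rng.normal(size=D + 1)
y = X @ w_true + rng.normal(scale=0.5, size=N)

# Normal equation: solve X^T X w = X^T y
w_star = np.linalg.solve(X.T @ X, X.T @ y)

# ML estimate of the noise variance
resid = y - X @ w_star
sigma2_star = resid @ resid / N

print(w_star)       # close to w_true
print(sigma2_star)  # close to 0.5**2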

40 / 63
Basis functions
❑ The model that is linear in x only allows linear relationships between x
and y .

❑ We can extend the model to describe non-linear relationships between the inputs and the output by using basis functions, non-linear mappings from inputs to outputs.

❑ However, we keep the linear relationship of y wrt w for tractability.

❑ The predictive model follows as

f(x, w) = w_0 + \sum_{i=1}^{M} w_i \phi_i(x) = w^\top \phi(x),

where ϕi (x) are basis functions and we have M + 1 parameters for the vector w and ϕ(x) = [1, ϕ1 (x), · · · , ϕM (x)]⊤ .

41 / 63
Examples of basis functions

[Figure: three panels showing examples of polynomial, exponential and sigmoidal basis functions on the interval [−1, 1].]

Polynomial: \phi_i(x) = x^i.

Exponential: \phi_i(x) = \exp\left\{ -\frac{(x - \mu_i)^2}{2s^2} \right\}.

Sigmoidal: \phi_i(x) = \sigma\!\left( \frac{x - \mu_i}{s} \right), \quad \sigma(a) = 1/(1 + \exp(-a)).
42 / 63
Transforming the input using the basis functions
❑ As an example, let us use polynomial basis functions to predict y , the time in seconds in the 100 m Olympic competition.

❑ For each x (year of the competition), we now compute the vector of polynomial basis functions

\phi(x) = \begin{bmatrix} 1 \\ x \\ x^2 \\ x^3 \\ \vdots \\ x^M \end{bmatrix}

❑ We have converted the unidimensional input feature x into a higher dimensional feature representation ϕ(x) ∈ R^{M+1}.

43 / 63
Normal equations with a design matrix
❑ Given X, we first compute a new design matrix Φ,

\Phi = \begin{bmatrix} \phi(x_1)^\top \\ \phi(x_2)^\top \\ \vdots \\ \phi(x_N)^\top \end{bmatrix} = \begin{bmatrix} \phi_0(x_1) & \phi_1(x_1) & \cdots & \phi_M(x_1) \\ \phi_0(x_2) & \phi_1(x_2) & \cdots & \phi_M(x_2) \\ \vdots & & & \vdots \\ \phi_0(x_N) & \phi_1(x_N) & \cdots & \phi_M(x_N) \end{bmatrix}

❑ We can now use (y, Φ) and write the Gaussian linear regression problem

p(y|X, w, \sigma^2) = \prod_{n=1}^{N} \mathcal{N}(y_n | w^\top \phi_n, \sigma^2),

where ϕn = ϕ(xn ).
❑ Using the ML criterion, we arrive at the following normal equation

w_* = \left( \Phi^\top \Phi \right)^{-1} \Phi^\top y.
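A minimal numpy sketch of building a polynomial design matrix and solving this normal equation (synthetic, already-rescaled inputs; all names are ours — with raw years one would rescale x first, see the note on feature scaling later):

import numpy as np

def poly_design_matrix(x, M):
    """Design matrix with columns [1, x, x^2, ..., x^M] for a 1-D input array x."""
    return np.column_stack([x**m for m in range(M + 1)])

rng = np.random.default_rng(5)
x = rng.uniform(-1, 1, size=50)
y = np.sin(np.pi * x) + 0.1 * rng.normal(size=50)

Phi = poly_design_matrix(x, M=5)
w_star = np.linalg.solve(Phi.T @ Phi, Phi.T @ y)

y_pred = Phi @ w_star
print(np.mean((y - y_pred)**2))  # training mean squared error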

44 / 63
Olympic 100 m data with M = 5

[Figure: the winning-time data in seconds against year, with the order-5 polynomial fit (M = 5) overlaid.]

45 / 63
Alternative to find w

❑ For solving the normal equation, we need to invert X⊤ X.

❑ This inversion has a computational complexity between O((D + 1)^{2.4}) and O((D + 1)^3), depending on the implementation.

❑ The normal equation is linear regarding the number of instances in the training data, O(N ).

❑ It can handle a large training set as long as it fits in memory.

❑ Alternatively, we can use iterative optimisation in cases with a large number of features and too many instances to fit in memory.

46 / 63
Contents

Review of vector/matrix notation and linear algebra

A regression model

Linear regression

Gradient descent

Stochastic Gradient Descent

Regularisation

47 / 63
General problem

❑ We are given a function h(w), where w ∈ Rp .

❑ Aim: to find a value for w that minimises h(w).

❑ Use an iterative procedure

wk +1 = wk + η dk ,

where dk is known as the search direction and it is such that

h(wk +1 ) < h(wk ).

❑ The parameter η is known as the step size or learning rate.

48 / 63
Gradient descent

❑ Perhaps, the simplest algorithm for unconstrained optimisation.

❑ It assumes that dk = −gk , where gk = g(wk ) is the gradient of h evaluated at wk .

❑ Also known as steepest descent.

❑ It can be written like

wk +1 = wk − η gk .

49 / 63
Step size
❑ The main issue in gradient descent is how to set the step size.
❑ If it is too small, convergence will be very slow. If it is too large, the
method can fail to converge at all.

[Figure: two panels (a) and (b) showing gradient descent iterates on the contours of h(w1 , w2 ).]

Figure: The function to optimise is h(w_1, w_2) = 0.5(w_1^2 - w_2)^2 + 0.5(w_1 - 1)^2. The minimum is at (1, 1). In (a) η = 0.1. In (b) η = 0.6.

50 / 63
Alternatives to choose the step size η

❑ Line search methods (there are different alternatives).

❑ Line search methods may use search directions other than the steepest
descent direction.

❑ Conjugate gradient (method of choice for quadratic objectives g (w) = w⊤ Aw).

❑ Use a Newton search direction.

51 / 63
Gradient descent for linear regression (I)

❑ For simplicity, let us assume that the objective function h(w) corresponds to the mean squared error

E(w) = \frac{1}{N} \sum_{n=1}^{N} (y_n - w^\top x_n)^2.

❑ We could also minimise the negative LL(w) instead.

❑ We write the update equation as

w_{k+1} = w_k - \eta \left. \frac{d E(w)}{dw} \right|_{w = w_k}.

52 / 63
Gradient descent for linear regression (II)
❑ Computing the gradient for E (w), we get

\frac{d}{dw} E(w) = \frac{2}{N} \sum_{n=1}^{N} (w^\top x_n - y_n)\, x_n = \frac{2}{N} X^\top (Xw - y).

❑ The update equation follows as

w_{k+1} = w_k - \eta \frac{2}{N} X^\top (X w_k - y).

❑ The computation of the gradient involves using the whole dataset (X, y)
at every step.

❑ For this reason, this algorithm is known as batch gradient descent.
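A minimal numpy sketch of batch gradient descent with this update rule (synthetic data; the step size, iteration count and names are arbitrary choices, and the features are assumed to be on a similar scale):

import numpy as np

def batch_gradient_descent(X, y, eta=0.1, n_iters=1000):
    """Minimise E(w) = (1/N) ||y - Xw||^2 with w_{k+1} = w_k - eta * (2/N) X^T (X w_k - y)."""
    N, D = X.shape
    w = np.zeros(D)
    for _ in range(n_iters):
        grad = (2.0 / N) * X.T @ (X @ w - y)
        w = w - eta * grad
    return w

rng = np.random.default_rng(6)
X = np.column_stack([np.ones(100), rng.normal(size=(100, 2))])
w_true = np.array([0.5, -1.0, 2.0])
y = X @ w_true + 0.1 * rng.normal(size=100)

print(batch_gradient_descent(X, y))  # close to w_true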


53 / 63
Gradient descent and feature scaling

❑ Always normalise the features if using gradient descent.

❑ Gradient descent converges faster if all features have a similar scale.

❑ If the attributes are in very different scales, it may take a long time to
converge.
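A minimal sketch of standardising the features (zero mean, unit standard deviation) before running gradient descent; X_raw is a placeholder for the raw feature matrix, without the column of ones:

import numpy as np

X_raw = np.array([[1900.0, 0.12],
                  [1950.0, 0.47],
                  [2000.0, 0.88]])  # two illustrative features on very different scales

mu = X_raw.mean(axis=0)
sd = X_raw.std(axis=0)
X_scaled = (X_raw - mu) / sd  # each column now has mean 0 and standard deviation 1

print(X_scaled.mean(axis=0), X_scaled.std(axis=0))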

54 / 63
Contents

Review of vector/matrix notation and linear algebra

A regression model

Linear regression

Gradient descent

Stochastic Gradient Descent

Regularisation

55 / 63
Online learning and large datasets

❑ Traditionally in machine learning, the gradient gk is computed using the whole dataset D = {x_n, y_n}_{n=1}^{N}.

❑ There are settings, though, where only a subset of the data can be
used.

❑ Online learning: the instances (xn , yn ) appear one at a time.

❑ Large datasets: computing the exact value for gk would be expensive, if not impossible.

56 / 63
Stochastic gradient descent (I)

❑ In stochastic gradient descent (SGD), the gradient gk is computed using a subset of the instances available.

❑ The word stochastic refers to the fact that the value for gk will depend
on the subset of the instances chosen for computation.

57 / 63
Stochastic gradient descent (II)

❑ In the stochastic setting, a better estimate can be found if the gradient is computed using

g_k = \frac{1}{|S|} \sum_{i \in S} g_{k,i},

where S ⊆ D, |S| is the cardinality of S, and g_{k,i} is the gradient at iteration k computed using the instance (x_i, y_i).

❑ This setting is called mini-batch gradient descent.
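A minimal numpy sketch of mini-batch SGD for the mean squared error, using a decaying step size of the form η_k = 1/(τ_0 + k)^κ discussed on the next slide (the batch size, τ_0, κ and all names are arbitrary choices):

import numpy as np

def minibatch_sgd(X, y, batch_size=10, n_iters=2000, tau0=10.0, kappa=0.6):
    """Minimise E(w) = (1/N) ||y - Xw||^2 using mini-batches and eta_k = 1 / (tau0 + k)**kappa."""
    rng = np.random.default_rng(7)
    N, D = X.shape
    w = np.zeros(D)
    for k in range(1, n_iters + 1):
        idx = rng.choice(N, size=batch_size, replace=False)  # random mini-batch S
        Xb, yb = X[idx], y[idx]
        grad = (2.0 / batch_size) * Xb.T @ (Xb @ w - yb)     # gradient estimate g_k
        w = w - grad / (tau0 + k)**kappa
    return w

rng = np.random.default_rng(8)
X = np.column_stack([np.ones(500), rng.normal(size=(500, 2))])
w_true = np.array([0.5, -1.0, 2.0])
y = X @ w_true + 0.1 * rng.normal(size=500)

print(minibatch_sgd(X, y))  # roughly w_true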

58 / 63
Step size in SGD
❑ Choosing the value of η is particularly important in SGD since there is
no easy way to compute it.

❑ Usually the value of η will depend on the iteration k , ηk .

❑ It should follow the Robbins-Monro conditions

\sum_{k=1}^{\infty} \eta_k = \infty, \qquad \sum_{k=1}^{\infty} \eta_k^2 < \infty.

❑ Various formulas for ηk can be used

\eta_k = \frac{1}{k}, \qquad \eta_k = \frac{1}{(\tau_0 + k)^{\kappa}},

where τ0 slows down early iterations and κ ∈ (0.5, 1].

59 / 63
Contents

Review of vector/matrix notation and linear algebra

A regression model

Linear regression

Gradient descent

Stochastic Gradient Descent

Regularisation

60 / 63
What is regularisation?
❑ It refers to a technique used for preventing overfitting in a predictive
model.

❑ It consists in adding a term (a regulariser) to the objective function that encourages simpler solutions.

❑ With regularisation, the objective function for linear regression would be

h(w) = E(w) + λR(w),

where R(w) is the regularisation term and λ the regularisation parameter.

❑ In the expression for h(w), we can use the negative LL(w) instead of
E (w).

❑ If λ = 0, we get h(w) = E (w).

61 / 63
Different types of regularisation

❑ The objective function for linear regression would be

h(w) = E (w) + λR (w),

where R (w) follows as

R(w) = \alpha \lVert w \rVert_1 + \frac{1}{2}(1 - \alpha) \lVert w \rVert_2^2,

where \lVert w \rVert_1 = \sum_{m=1}^{p} |w_m| and \lVert w \rVert_2^2 = \sum_{m=1}^{p} w_m^2.

❑ If α = 1, we get ℓ1 regularisation.

❑ If α = 0, we get ℓ2 regularisation.

❑ If 0 < α < 1, we get the elastic net regularisation.

62 / 63
Ridge regression or ℓ2 regularisation

❑ In ridge regression, α = 0,

h(w) = \frac{1}{N} \sum_{n=1}^{N} (y_n - w^\top x_n)^2 + \frac{\lambda}{2} w^\top w,

❑ It can be shown that an optimal solution for w∗ is given as

w_* = \left( X^\top X + \frac{\lambda N}{2} I \right)^{-1} X^\top y.

❑ Notice that we can also use an iterative procedure for optimising h(w), either through batch gradient descent, SGD or mini-batch SGD.
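A minimal numpy sketch of the closed-form ridge solution above, on synthetic data (λ is chosen arbitrarily and all names are ours):

import numpy as np

def ridge_solution(X, y, lam):
    """w* = (X^T X + (lam * N / 2) I)^{-1} X^T y."""
    N, D = X.shape
    A = X.T @ X + 0.5 * lam * N * np.eye(D)
    return np.linalg.solve(A, X.T @ y)

rng = np.random.default_rng(9)
X = np.column_stack([np.ones(100), rng.normal(size=(100, 3))])
y = X @ np.array([1.0, 0.5, -0.5, 2.0]) + 0.2 * rng.normal(size=100)

print(ridge_solution(X, y, lam=0.0))  # lam = 0 recovers the ordinary normal equation
print(ridge_solution(X, y, lam=1.0))  # weights shrunk towards zero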

63 / 63
