
Linear regression

Mauricio A. Álvarez

Machine Learning and Adaptive Intelligence


The University of Sheffield

1 / 63
Contents

Review of vector/matrix notation and linear algebra

A regression model

Linear regression

Gradient descent

Stochastic Gradient Descent

Regularisation

2 / 63
Scalar and vectors

❑ A scalar is just a numeric value like 0.9 or −18.7.

❑ Scalars are usually denoted as lower case letters like x or a.

❑ A vector is an ordered list of scalar values. Sometimes we refer to these scalar values of the vector as attributes or entries of the vector.

❑ Vectors are usually denoted by bold lowercase letters like x or y.

3 / 63
Vectors

❑ A vector can appear sometimes written as a row vector, e.g.

x = [x1 , x2 , x3 , x4 , x5 ]

Or as a column vector

x = \begin{bmatrix} x_1 \\ x_2 \\ x_3 \\ x_4 \\ x_5 \end{bmatrix}

❑ In this module, ALL vectors will be column vectors by default. So, when
you see a vector, e.g. x, y, z always think this vector has a column-wise
shape.

4 / 63
Matrices
❑ A matrix is a rectangular array of scalars arranged in rows and
columns.

❑ Matrices are usually denoted by bold uppercase letters, e.g. X or Y.

❑ The following matrix has three rows and two columns


 
X = \begin{bmatrix} x_{11} & x_{12} \\ x_{21} & x_{22} \\ x_{31} & x_{32} \end{bmatrix}

❑ The entries in the matrix above are of the form xij , where the first
subindex i indicates the row of the element and the second subindex j
indicates the column.

5 / 63
Matrix transpose

❑ Let X be a matrix with elements xij .

❑ The transpose of a matrix X is a new matrix X⊤ with elements xji .


 
X = \begin{bmatrix} 4.1 & -5.6 \\ -2.6 & 7.9 \\ 3.5 & 1.8 \end{bmatrix}, \qquad X^\top = \begin{bmatrix} 4.1 & -2.6 & 3.5 \\ -5.6 & 7.9 & 1.8 \end{bmatrix}

6 / 63
Matrix multiplication

❑ Let A be a matrix with entries aik of dimensions p × q .

❑ Let B be a matrix with entries bkj of dimensions t × s.

❑ Matrix multiplication of the form AB is only possible if q = t .

❑ If this is the case, the matrix C = AB has dimensions p × s with entries

c_{ij} = \sum_{k} a_{ik} b_{kj}.

7 / 63
Transpose of a product
❑ Let w be a vector of dimensions d × 1. Let X be a matrix with
dimensions n × d .

❑ The transpose of the product Xw, (Xw)⊤ is

(Xw)⊤ = w⊤ X⊤ .

❑ We can apply this result to a product of several matrices

(ABCD)⊤ = ((AB)(CD))⊤
= (CD)⊤ (AB)⊤
= D⊤ C⊤ B⊤ A⊤ .
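As a quick numerical check, here is a minimal numpy sketch (the shapes are arbitrary choices for illustration):

import numpy as np

rng = np.random.default_rng(0)
A = rng.normal(size=(2, 3))
B = rng.normal(size=(3, 4))
C = rng.normal(size=(4, 5))
D = rng.normal(size=(5, 6))

# (ABCD)^T should equal D^T C^T B^T A^T
lhs = (A @ B @ C @ D).T
rhs = D.T @ C.T @ B.T @ A.T
print(np.allclose(lhs, rhs))  # True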

8 / 63
From a scalar operation to a vector operation

❑ It is usually desirable to transform a scalar operation into a vector operation.

❑ When coding scalar operations, we need to make use of loops, which can be expensive.

❑ In contrast, vector operations are handled efficiently by low-level routines already included in modules like numpy, as the sketch below illustrates.
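A minimal sketch of the difference (array sizes are arbitrary; both versions compute the same inner product):

import numpy as np

x = np.random.randn(100_000)
w = np.random.randn(100_000)

# Scalar version: an explicit Python loop over the entries
total = 0.0
for xi, wi in zip(x, w):
    total += xi * wi

# Vector version: a single call to an optimised low-level routine
total_vec = x @ w

print(np.isclose(total, total_vec))  # True, but the loop is far slower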

9 / 63
Example

Write the following scalar operation into a vector/matrix form


\sum_{i=1}^{n} \left( y_i - \sum_{j=1}^{d} x_{ij} w_j \right)^2.

10 / 63
Answer (I)

❑ The sum above can be written as

\sum_{i=1}^{n} \left( y_i - \sum_{j=1}^{d} x_{ij} w_j \right)^2 = \left( y_1 - \sum_{j=1}^{d} x_{1j} w_j \right)\left( y_1 - \sum_{j=1}^{d} x_{1j} w_j \right) + \cdots + \left( y_n - \sum_{j=1}^{d} x_{nj} w_j \right)\left( y_n - \sum_{j=1}^{d} x_{nj} w_j \right).

❑ Let us define a vector v of dimensions n × 1 with entries given as

\left( y_i - \sum_{j=1}^{d} x_{ij} w_j \right).

11 / 63
Answer (II)

❑ The product of vectors v⊤ v gives the same result as the required sum,

v^\top v = \begin{bmatrix} \left( y_1 - \sum_{j=1}^{d} x_{1j} w_j \right) & \cdots & \left( y_n - \sum_{j=1}^{d} x_{nj} w_j \right) \end{bmatrix} \begin{bmatrix} \left( y_1 - \sum_{j=1}^{d} x_{1j} w_j \right) \\ \vdots \\ \left( y_n - \sum_{j=1}^{d} x_{nj} w_j \right) \end{bmatrix} = \sum_{i=1}^{n} \left( y_i - \sum_{j=1}^{d} x_{ij} w_j \right)^2.

❑ How do we express the elements in v with vectors and matrices?

12 / 63
Answer (III)
❑ For a fixed i, the values x_{i1}, . . . , x_{id} can be grouped into a row vector x_i^⊤.

❑ The internal sums in the entries of v can then be written as

\sum_{j=1}^{d} x_{ij} w_j = x_i^\top w = \begin{bmatrix} x_{i1} & x_{i2} & \cdots & x_{id} \end{bmatrix} \begin{bmatrix} w_1 \\ w_2 \\ \vdots \\ w_d \end{bmatrix}

❑ We can now write v as

v = \begin{bmatrix} y_1 - x_1^\top w \\ \vdots \\ y_n - x_n^\top w \end{bmatrix} = \begin{bmatrix} y_1 \\ \vdots \\ y_n \end{bmatrix} - \begin{bmatrix} x_1^\top w \\ \vdots \\ x_n^\top w \end{bmatrix} = \begin{bmatrix} y_1 \\ \vdots \\ y_n \end{bmatrix} - \begin{bmatrix} x_1^\top \\ \vdots \\ x_n^\top \end{bmatrix} w

❑ We can group the scalars y1 , . . . , yn into a vector y.


❑ We can group the row vectors x_1^⊤, . . . , x_n^⊤ into a matrix X.

13 / 63
Answer (IV)

❑ It means that v = y − Xw.

❑ Finally
\sum_{i=1}^{n} \left( y_i - \sum_{j=1}^{d} x_{ij} w_j \right)^2 = v^\top v = (y - Xw)^\top (y - Xw).
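A minimal numpy sketch confirming this identity on randomly generated data (the variable names are ours):

import numpy as np

rng = np.random.default_rng(1)
n, d = 100, 5
X = rng.normal(size=(n, d))
w = rng.normal(size=d)
y = rng.normal(size=n)

# Scalar form: double sum over i and j
sse_loop = sum((y[i] - sum(X[i, j] * w[j] for j in range(d)))**2 for i in range(n))

# Vector/matrix form: (y - Xw)^T (y - Xw)
v = y - X @ w
sse_vec = v @ v

print(np.isclose(sse_loop, sse_vec))  # True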

14 / 63
Two common types of products
❑ Inner product. The inner product between two vectors results in a
scalar.
❑ Let x and y be vectors of dimension m × 1. The inner product is given as

x^\top y = \sum_{i=1}^{m} x_i y_i.

❑ Outer product. The outer product between two vectors results in a


matrix.
❑ Let x be a vector of dimension m × 1 and y a vector of dimension p × 1. The outer product is given as

xy^\top = \begin{bmatrix} x_1 y_1 & \cdots & x_1 y_p \\ x_2 y_1 & \cdots & x_2 y_p \\ \vdots & \ddots & \vdots \\ x_m y_1 & \cdots & x_m y_p \end{bmatrix}.
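In numpy, a minimal sketch with small concrete vectors:

import numpy as np

x = np.array([1.0, 2.0, 3.0])
y = np.array([4.0, 5.0, 6.0])

inner = x @ y           # scalar: 1*4 + 2*5 + 3*6 = 32.0
outer = np.outer(x, y)  # 3 x 3 matrix with entries x_i * y_j

print(inner)            # 32.0
print(outer.shape)      # (3, 3)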

15 / 63
Differentiating a function in a vector/matrix form (I)

❑ We will see cases in which a function f (w) depends on some parameters grouped in a vector w.

❑ We would like to find the vector of parameters w that maximise f (w).

❑ For example, suppose f (w) is defined as

f(w) = \sum_{i=1}^{d} w_i x_i.

❑ We can group the scalars x1 , . . . , xd into x. Likewise for w.

❑ According to what we saw before, we can write f (w) as f (w) = x⊤ w.

16 / 63
Differentiating a function in a vector/matrix form (II)

❑ For a fixed x, we are interested in computing the gradient of f (w) with respect to w

\frac{df(w)}{dw} = \begin{bmatrix} \frac{\partial f(w)}{\partial w_1} \\ \vdots \\ \frac{\partial f(w)}{\partial w_d} \end{bmatrix} = \begin{bmatrix} x_1 \\ \vdots \\ x_d \end{bmatrix} = x.

❑ Some useful identities when differentiating with respect to a vector:

  f (w)        df (w)/dw
  w⊤ x         x
  x⊤ w         x
  w⊤ w         2w
  w⊤ Cw        2Cw   (assuming C is symmetric)
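A minimal numpy sketch checking the last identity by finite differences (all names are ours; C is symmetrised so that the 2Cw form applies):

import numpy as np

rng = np.random.default_rng(2)
d = 4
w = rng.normal(size=d)
C = rng.normal(size=(d, d))
C = 0.5 * (C + C.T)  # make C symmetric

def f(v):
    return v @ C @ v  # f(w) = w^T C w

# Central finite-difference approximation of the gradient
eps = 1e-6
grad_fd = np.array([(f(w + eps * e) - f(w - eps * e)) / (2 * eps) for e in np.eye(d)])

print(np.allclose(grad_fd, 2 * C @ w, atol=1e-5))  # True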

17 / 63
Identity matrix and the inverse of a matrix

❑ The identity matrix of size N is a square matrix with ones on the main
diagonal and zeros elsewhere, e.g.,
 
I_3 = \begin{bmatrix} 1 & 0 & 0 \\ 0 & 1 & 0 \\ 0 & 0 & 1 \end{bmatrix}

❑ The inverse matrix of a matrix A of dimensions d × d, denoted as A^{-1}, satisfies

A A^{-1} = A^{-1} A = I_d

18 / 63
Contents

Review of vector/matrix notation and linear algebra

A regression model

Linear regression

Gradient descent

Stochastic Gradient Descent

Regularisation

19 / 63
Olympic 100m Data

Image from Wikimedia Commons http://bit.ly/191adDC.


20 / 63
Dataset

Male 100 m

[Figure: scatter plot of the winning time in seconds (vertical axis, roughly 10.0–12.0) against the year of the competition (horizontal axis, 1900–2000).]

21 / 63
Model

❑ We will use a linear model f (x , w) to predict y , where y is the time in seconds and x the year of the competition.

❑ The linear model is given as

f (x , w) = w0 + w1 x ,

where w0 is the intercept and w1 is the slope.

❑ We use w to refer both to w0 and w1 .

22 / 63
Data and model

Male 100 m

[Figure: the same scatter plot of winning time in seconds against year, with the fitted straight line f (x , w) = w0 + w1 x overlaid.]

23 / 63
Contents

Review of vector/matrix notation and linear algebra

A regression model

Linear regression

Gradient descent

Stochastic Gradient Descent

Regularisation

24 / 63
Linear model
❑ A simple model for regression consists in using a linear combination of
the attributes to predict the output

f (x, w) = w0 + w1 x1 + . . . + wD xD ,

where w0 , w1 , · · · , wD are the parameters of the regression model.

❑ The term w0 is the bias term or intercept, e.g. f (0, w) = w0 .

❑ The expression above can be written in a vectorial form

f (x, w) = w⊤ x.

where we have defined w = [w0 , w1 , · · · , wD ]⊤ and x = [1, x1 , · · · , xD ]⊤ .

❑ Notice that x0 = 1.

25 / 63
Parenthesis: Gaussian pdf

❑ The Gaussian pdf has the form

p(y) = \frac{1}{\sqrt{2\pi\sigma^2}} \exp\left\{ -\frac{(y - \mu)^2}{2\sigma^2} \right\}.

❑ A Gaussian pdf requires two parameters µ and σ 2 , the mean and the
variance of the RV Y .

❑ We denote the Gaussian pdf as p(y |µ, σ 2 ) = N (y |µ, σ 2 ) or y ∼ N (µ, σ 2 ).
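A minimal numpy sketch of this density (the function name is ours; it can be cross-checked against scipy.stats.norm.pdf):

import numpy as np

def gaussian_pdf(y, mu, sigma2):
    """Evaluate N(y | mu, sigma2) elementwise."""
    return np.exp(-(y - mu)**2 / (2 * sigma2)) / np.sqrt(2 * np.pi * sigma2)

y = np.linspace(-2.0, 6.0, 5)
print(gaussian_pdf(y, mu=2.0, sigma2=0.5))
print(gaussian_pdf(y, mu=2.0, sigma2=2.0))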

26 / 63
Parenthesis: Gaussian pdf

The Gaussians shown have mean µ = 2 and variances σ 2 = 0.5 (solid) and σ 2 = 2 (dashed).

[Figure: plot of the two Gaussian pdfs described above.]

27 / 63
Gaussian regression model (I)
❑ We use a Gaussian regression model to relate the inputs and outputs

y = f (x, w) + ϵ,

where ϵ ∼ N (0, σ 2 ).
❑ It assumes that each output yi that we observe can be explained as the
prediction of an underlying model, f (xi , w) plus a noise term ϵi .
❑ For a fixed x and a fixed w, f (x, w) is a constant, then

y = constant + ϵ,

where ϵ is a continuous RV.


❑ What is the pdf for y ? (we are adding a constant to a Gaussian RV)
– E {y } = E {constant + ϵ} = constant
– var{y } = var{constant} + var{ϵ} = σ 2 .

28 / 63
Gaussian regression model (II)
❑ This means that

y ∼ N (constant , σ 2 ),

where we said constant was f (x, w), this is,

y ∼ N (f (x, w), σ 2 ).

❑ Because we assumed that x and w are given, we can also write

p(y |x, w, σ 2 ) = N (y |f (x, w), σ 2 ).

❑ If we knew the value for w, once we have a new x∗ , we can predict the
output as f (x∗ , w).
❑ σ 2 tells us the noise variance.

29 / 63
Gaussian regression model (III)

[Figure: the regression function f (x, w) plotted against x; at a given input, the conditional density p(y | x, w, σ 2 ) is a Gaussian centred at f (x, w) with a width of roughly 2σ.]

30 / 63
How do we estimate w? (I)
❑ We start with a training dataset (x1 , y1 ), · · · , (xN , yN ).

❑ We assume that the random variables Y1 , · · · , YN are independent,

p(y_1, \cdots, y_N \,|\, x_1, \cdots, x_N) = p(y_1|x_1) \cdots p(y_N|x_N) = \prod_{n=1}^{N} p(y_n|x_n).

❑ We also assume that the RVs Y1 , · · · , YN follow an identical distribution, Gaussian in this case

p(yn |xn , w, σ 2 ) = N (yn |f (xn , w), σ 2 ) = N (yn |w⊤ xn , σ 2 ).

❑ Both assumptions go by the name of the iid assumption, independent and identically distributed.

31 / 63
How do we estimate w? (II)
❑ Putting both assumptions together, we get

p(y|X, w, \sigma^2) = \prod_{n=1}^{N} p(y_n | x_n, w, \sigma^2) = \prod_{n=1}^{N} \mathcal{N}(y_n | w^\top x_n, \sigma^2),

where y = [y1 , · · · , yN ]⊤ ∈ RN ×1 and X = [x1 , · · · , xN ]⊤ ∈ RN ×(D+1) .

❑ The expression above can then be written as

p(y|X, w, \sigma^2) = \prod_{n=1}^{N} \mathcal{N}(y_n | w^\top x_n, \sigma^2)
                    = \prod_{n=1}^{N} \frac{1}{\sqrt{2\pi\sigma^2}} \exp\left\{ -\frac{(y_n - w^\top x_n)^2}{2\sigma^2} \right\}
                    = \frac{1}{(2\pi\sigma^2)^{N/2}} \exp\left\{ -\frac{1}{2\sigma^2} \sum_{n=1}^{N} (y_n - w^\top x_n)^2 \right\}.

32 / 63
How do we estimate w? (III)
❑ When we look at a Gaussian pdf, like

p(y) = \frac{1}{\sqrt{2\pi\sigma^2}} \exp\left\{ -\frac{(y - \mu)^2}{2\sigma^2} \right\},

we assume that both µ and σ 2 are given. In this case, the pdf follows all
the properties we reviewed before.
❑ The same is true for

p(y|X, w, \sigma^2) = \prod_{n=1}^{N} p(y_n | x_n, w, \sigma^2) = \prod_{n=1}^{N} \mathcal{N}(y_n | w^\top x_n, \sigma^2).

❑ Given w⊤ xn and σ 2 , each p(yn |xn , w, σ 2 ) is a pdf.

❑ A different approach would be to say: I have some data for {y_n}_{n=1}^{N} and {x_n}_{n=1}^{N}, but
– “I don’t know what w is (therefore I don’t know what w⊤ xn is)”
– “I don’t know what σ 2 is”.

33 / 63
How do we estimate w? (IV)
❑ With yn and xn given but with unknown values for w and σ 2 , each
p(yn |xn , w, σ 2 ) is not a pdf anymore.
❑ In that case, the function

p(y|X, w, \sigma^2) = \prod_{n=1}^{N} \mathcal{N}(y_n | w^\top x_n, \sigma^2),

receives the name of a likelihood function.


❑ We can think of a likelihood function as a function of the parameters w
and σ 2 ,

g (w, σ 2 ) = p(y|X, w, σ 2 ),

❑ And subsequently, we can use multivariate calculus to find the values of w, σ 2 that maximise g (w, σ 2 ).
❑ In statistics, this is known as the maximum-likelihood (ML) criterion to
estimate parameters.

34 / 63
How do we estimate w? (V)
❑ Given y, X, we use the ML criterion to find the parameters w and σ 2 that maximise

p(y|X, w, \sigma^2) = \frac{1}{(2\pi\sigma^2)^{N/2}} \exp\left\{ -\frac{1}{2\sigma^2} \sum_{n=1}^{N} (y_n - w^\top x_n)^2 \right\}.

❑ In practice, we prefer to maximise the log of the likelihood p(y|X, w, σ 2 ),

LL(w, \sigma^2) = \log p(y|X, w, \sigma^2) = -\frac{N}{2} \log(2\pi) - \frac{N}{2} \log \sigma^2 - \frac{1}{2\sigma^2} \sum_{n=1}^{N} (y_n - w^\top x_n)^2.

❑ Consistency of the ML criterion: if the data was really generated according to the probability model we specified, the correct parameters will be recovered in the limit as N → ∞.
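A minimal numpy sketch of LL(w, σ 2 ) as written above, evaluated on synthetic data (the function and variable names are ours):

import numpy as np

def log_likelihood(w, sigma2, X, y):
    """LL(w, sigma2) = -N/2 log(2 pi) - N/2 log(sigma2) - 1/(2 sigma2) * sum_n (y_n - w^T x_n)^2."""
    N = len(y)
    resid = y - X @ w
    return (-0.5 * N * np.log(2 * np.pi)
            - 0.5 * N * np.log(sigma2)
            - 0.5 * resid @ resid / sigma2)

rng = np.random.default_rng(3)
X = np.column_stack([np.ones(50), rng.normal(size=50)])  # x_n = [1, x_n1]^T
w_true = np.array([1.0, -2.0])
y = X @ w_true + rng.normal(scale=0.3, size=50)

print(log_likelihood(w_true, sigma2=0.3**2, X=X, y=y))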

35 / 63
Connection with the sum of squared errors
❑ If we multiply LL(w, σ 2 ) by minus one, we get

E(w, \sigma^2) = -\log p(y|X, w, \sigma^2) \propto \sum_{n=1}^{N} (y_n - w^\top x_n)^2.

❑ The ML criterion for this model has a close connection with the
sum-of-squared errors used in non-probabilistic formulations of linear
regression.

❑ Maximising the log-likelihood function is equivalent to minimising the sum-of-squared errors.

❑ Notice that the log is a monotonic function, meaning that if we find w, σ 2 that maximise g (w, σ 2 ), those will also maximise log(g (w, σ 2 )).

36 / 63
Normal equation (I)
❑ Let us find an estimate for w.
❑ From what we saw before,

LL(w, \sigma^2) = -\frac{N}{2} \log(2\pi) - \frac{N}{2} \log \sigma^2 - \frac{1}{2\sigma^2} \sum_{n=1}^{N} (y_n - w^\top x_n)^2.

❑ Using what we reviewed in the section on vector/matrix notation, it can be shown that this expression can be written in a vectorial form as

LL(w, \sigma^2) = -\frac{N}{2} \log(2\pi) - \frac{N}{2} \log \sigma^2 - \frac{1}{2\sigma^2} (y - Xw)^\top (y - Xw)

❑ Let us focus on the term (y − Xw)⊤ (y − Xw),

(y − Xw)⊤ (y − Xw) = y⊤ y − w⊤ X⊤ y − y⊤ Xw + w⊤ X⊤ Xw

37 / 63
Normal equation (II)

❑ We can find the w that maximises LL(w, σ 2 ) by taking the gradient dLL(w, σ 2 )/dw, equating it to zero and solving for w.

❑ Taking the gradient of each term in LL(w, σ 2 ) wrt w, we get

\frac{d}{dw}\left[ -\frac{N}{2}\log(2\pi) \right] = 0, \quad \frac{d}{dw}\left[ -\frac{N}{2}\log\sigma^2 \right] = 0, \quad \frac{d}{dw}\left[ -\frac{1}{2\sigma^2} y^\top y \right] = 0,

\frac{d}{dw}\left[ \frac{1}{2\sigma^2} w^\top X^\top y \right] = \frac{1}{2\sigma^2} X^\top y,

\frac{d}{dw}\left[ \frac{1}{2\sigma^2} y^\top X w \right] = \frac{1}{2\sigma^2} X^\top y,

\frac{d}{dw}\left[ -\frac{1}{2\sigma^2} w^\top X^\top X w \right] = -\frac{1}{2\sigma^2} 2 X^\top X w

38 / 63
Normal equation (III)
❑ Putting these terms together, we get

\frac{d}{dw} LL(w, \sigma^2) = \frac{1}{2\sigma^2} X^\top y + \frac{1}{2\sigma^2} X^\top y - \frac{1}{2\sigma^2} 2 X^\top X w = \frac{1}{\sigma^2} X^\top y - \frac{1}{\sigma^2} X^\top X w

❑ Now, equating to zero and solving for w, we get

\frac{1}{\sigma^2} X^\top y - \frac{1}{\sigma^2} X^\top X w = 0
\quad\Longrightarrow\quad X^\top X w = X^\top y
\quad\Longrightarrow\quad w_* = \left( X^\top X \right)^{-1} X^\top y.

❑ The expression for w∗ is known as the normal equation.

❑ The solution for w∗ exists if we can compute (X^\top X)^{-1}.

❑ The inverse can be computed as long as X^\top X is non-singular (i.e., its determinant is different from zero, or equivalently it has full rank).
39 / 63
Solving for σ∗2

❑ Following a similar procedure, it can be shown that the ML solution for σ∗2 is given as

\sigma_*^2 = \frac{1}{N} (y - Xw_*)^\top (y - Xw_*).
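A minimal numpy sketch of the two ML estimates above on synthetic data, using np.linalg.solve instead of forming the inverse explicitly (all names are ours):

import numpy as np

rng = np.random.default_rng(4)
N, D = 200, 3
X = np.column_stack([np.ones(N), rng.normal(size=(N, D))])  # first column of ones
w_true = rng.normal(size=D + 1)
y = X @ w_true + rng.normal(scale=0.5, size=N)

# Normal equation: solve X^T X w = X^T y
w_star = np.linalg.solve(X.T @ X, X.T @ y)

# ML estimate of the noise variance
resid = y - X @ w_star
sigma2_star = resid @ resid / N

print(w_star)       # close to w_true
print(sigma2_star)  # close to 0.5**2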

40 / 63
Basis functions
❑ The model that is linear in x only allows linear relationships between x
and y .

❑ We can extend the model to describe non-linear relationships between the inputs and the output by using basis functions, non-linear mappings from inputs to outputs.

❑ However, we keep the linear relationship of y wrt w for tractability.

❑ The predictive model follows as

f(x, w) = w_0 + \sum_{i=1}^{M} w_i \phi_i(x) = w^\top \phi(x),

where ϕi (x) are basis functions and we have M + 1 parameters for the vector w and ϕ(x) = [1, ϕ1 (x), · · · , ϕM (x)]⊤ .

41 / 63
Examples of basis functions

[Figure: three panels showing examples of polynomial, exponential and sigmoidal basis functions on the interval [−1, 1].]

Polynomial: \phi_i(x) = x^i.

Exponential: \phi_i(x) = \exp\left\{ -\frac{(x - \mu_i)^2}{2s^2} \right\}.

Sigmoidal: \phi_i(x) = \sigma\!\left( \frac{x - \mu_i}{s} \right), \quad \sigma(a) = 1/(1 + \exp(-a)).
42 / 63
Transforming the input using the basis functions
❑ As an example, let us use polynomial basis functions to predict y , the time in seconds in the 100 m Olympic competition.

❑ For each x (year of the competition), we now compute the vector of polynomial basis functions

\phi(x) = \begin{bmatrix} 1 \\ x \\ x^2 \\ x^3 \\ \vdots \\ x^M \end{bmatrix}

❑ We have converted the unidimensional input feature x into a higher dimensional feature representation ϕ(x) ∈ R^{M+1}.

43 / 63
Normal equations with a design matrix
❑ Given X, we first compute a new design matrix Φ,

\Phi = \begin{bmatrix} \phi(x_1)^\top \\ \phi(x_2)^\top \\ \vdots \\ \phi(x_N)^\top \end{bmatrix} = \begin{bmatrix} \phi_0(x_1) & \phi_1(x_1) & \cdots & \phi_M(x_1) \\ \phi_0(x_2) & \phi_1(x_2) & \cdots & \phi_M(x_2) \\ \vdots & & & \vdots \\ \phi_0(x_N) & \phi_1(x_N) & \cdots & \phi_M(x_N) \end{bmatrix}

❑ We can now use (y, Φ) and write the Gaussian linear regression problem

p(y|X, w, \sigma^2) = \prod_{n=1}^{N} \mathcal{N}(y_n | w^\top \phi_n, \sigma^2),

where ϕn = ϕ(xn ).
❑ Using the ML criterion, we arrive at the following normal equation

w_* = \left( \Phi^\top \Phi \right)^{-1} \Phi^\top y.
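A minimal numpy sketch of building a polynomial design matrix and solving this normal equation (synthetic, already-rescaled inputs; all names are ours — with raw years one would rescale x first, see the note on feature scaling later):

import numpy as np

def poly_design_matrix(x, M):
    """Design matrix with columns [1, x, x^2, ..., x^M] for a 1-D input array x."""
    return np.column_stack([x**m for m in range(M + 1)])

rng = np.random.default_rng(5)
x = rng.uniform(-1, 1, size=50)
y = np.sin(np.pi * x) + 0.1 * rng.normal(size=50)

Phi = poly_design_matrix(x, M=5)
w_star = np.linalg.solve(Phi.T @ Phi, Phi.T @ y)

y_pred = Phi @ w_star
print(np.mean((y - y_pred)**2))  # training mean squared error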

44 / 63
Olympic 100 m data with M = 5

[Figure: the winning-time data in seconds against year, with the order-5 polynomial fit (M = 5) overlaid.]

45 / 63
Alternative to find w

❑ For solving the normal equation, we need to invert X⊤ X.

❑ This inversion has a computational complexity between O((D + 1)^{2.4}) and O((D + 1)^3), depending on the implementation.

❑ The normal equation is linear regarding the number of instances in the training data, O(N ).

❑ It can handle a large training set as long as it fits in memory.

❑ Alternatively, we can use iterative optimisation in cases with a large number of features and too many instances to fit in memory.

46 / 63
Contents

Review of vector/matrix notation and linear algebra

A regression model

Linear regression

Gradient descent

Stochastic Gradient Descent

Regularisation

47 / 63
General problem

❑ We are given a function h(w), where w ∈ Rp .

❑ Aim: to find a value for w that minimises h(w).

❑ Use an iterative procedure

wk +1 = wk + η dk ,

where dk is known as the search direction and it is such that

h(wk +1 ) < h(wk ).

❑ The parameter η is known as the step size or learning rate.

48 / 63
Gradient descent

❑ Perhaps, the simplest algorithm for unconstrained optimisation.

❑ It assumes that dk = −gk , where gk = g(wk ) is the gradient of h evaluated at wk .

❑ Also known as steepest descent.

❑ It can be written like

wk +1 = wk − η gk .

49 / 63
Step size
❑ The main issue in gradient descent is how to set the step size.
❑ If it is too small, convergence will be very slow. If it is too large, the
method can fail to converge at all.

[Figure: two panels (a) and (b) showing gradient descent iterates on the contours of h(w1 , w2 ).]

Figure: The function to optimise is h(w_1, w_2) = 0.5(w_1^2 - w_2)^2 + 0.5(w_1 - 1)^2. The minimum is at (1, 1). In (a) η = 0.1. In (b) η = 0.6.

50 / 63
Alternatives to choose the step size η

❑ Line search methods (there are different alternatives).

❑ Line search methods may use search directions other than the steepest
descent direction.

❑ Conjugate gradient (method of choice for quadratic objectives g (w) = w⊤ Aw).

❑ Use a Newton search direction.

51 / 63
Gradient descent for linear regression (I)

❑ For simplicity, let us assume that the objective function h(w) corresponds to the mean squared error

E(w) = \frac{1}{N} \sum_{n=1}^{N} (y_n - w^\top x_n)^2.

❑ We could also minimise the negative LL(w) instead.

❑ We write the update equation as

w_{k+1} = w_k - \eta \left. \frac{d E(w)}{dw} \right|_{w = w_k}.

52 / 63
Gradient descent for linear regression (II)
❑ Computing the gradient for E (w), we get

\frac{d}{dw} E(w) = \frac{2}{N} \sum_{n=1}^{N} (w^\top x_n - y_n)\, x_n = \frac{2}{N} X^\top (Xw - y).

❑ The update equation follows as

w_{k+1} = w_k - \eta \frac{2}{N} X^\top (X w_k - y).

❑ The computation of the gradient involves using the whole dataset (X, y)
at every step.

❑ For this reason, this algorithm is known as batch gradient descent.
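A minimal numpy sketch of batch gradient descent with this update rule (synthetic data; the step size, iteration count and names are arbitrary choices, and the features are assumed to be on a similar scale):

import numpy as np

def batch_gradient_descent(X, y, eta=0.1, n_iters=1000):
    """Minimise E(w) = (1/N) ||y - Xw||^2 with w_{k+1} = w_k - eta * (2/N) X^T (X w_k - y)."""
    N, D = X.shape
    w = np.zeros(D)
    for _ in range(n_iters):
        grad = (2.0 / N) * X.T @ (X @ w - y)
        w = w - eta * grad
    return w

rng = np.random.default_rng(6)
X = np.column_stack([np.ones(100), rng.normal(size=(100, 2))])
w_true = np.array([0.5, -1.0, 2.0])
y = X @ w_true + 0.1 * rng.normal(size=100)

print(batch_gradient_descent(X, y))  # close to w_true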


53 / 63
Gradient descent and feature scaling

❑ Always normalise the features if using gradient descent.

❑ Gradient descent converges faster if all features have a similar scale.

❑ If the attributes are in very different scales, it may take a long time to
converge.
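A minimal sketch of standardising the features (zero mean, unit standard deviation) before running gradient descent; X_raw is a placeholder for the raw feature matrix, without the column of ones:

import numpy as np

X_raw = np.array([[1900.0, 0.12],
                  [1950.0, 0.47],
                  [2000.0, 0.88]])  # two illustrative features on very different scales

mu = X_raw.mean(axis=0)
sd = X_raw.std(axis=0)
X_scaled = (X_raw - mu) / sd  # each column now has mean 0 and standard deviation 1

print(X_scaled.mean(axis=0), X_scaled.std(axis=0))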

54 / 63
Contents

Review of vector/matrix notation and linear algebra

A regression model

Linear regression

Gradient descent

Stochastic Gradient Descent

Regularisation

55 / 63
Online learning and large datasets

❑ Traditionally in machine learning, the gradient gk is computed using the whole dataset D = {x_n, y_n}_{n=1}^{N}.

❑ There are settings, though, where only a subset of the data can be
used.

❑ Online learning: the instances (xn , yn ) appear one at a time.

❑ Large datasets: computing the exact value for gk would be expensive, if not impossible.

56 / 63
Stochastic gradient descent (I)

❑ In stochastic gradient descent (SGD), the gradient gk is computed using a subset of the instances available.

❑ The word stochastic refers to the fact that the value for gk will depend
on the subset of the instances chosen for computation.

57 / 63
Stochastic gradient descent (II)

❑ In the stochastic setting, a better estimate can be found if the gradient is computed using

g_k = \frac{1}{|S|} \sum_{i \in S} g_{k,i},

where S ⊆ D, |S| is the cardinality of S, and g_{k,i} is the gradient at iteration k computed using the instance (x_i, y_i).

❑ This setting is called mini-batch gradient descent.
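A minimal numpy sketch of mini-batch SGD for the mean squared error, using a decaying step size of the form η_k = 1/(τ_0 + k)^κ discussed on the next slide (the batch size, τ_0, κ and all names are arbitrary choices):

import numpy as np

def minibatch_sgd(X, y, batch_size=10, n_iters=2000, tau0=10.0, kappa=0.6):
    """Minimise E(w) = (1/N) ||y - Xw||^2 using mini-batches and eta_k = 1 / (tau0 + k)**kappa."""
    rng = np.random.default_rng(7)
    N, D = X.shape
    w = np.zeros(D)
    for k in range(1, n_iters + 1):
        idx = rng.choice(N, size=batch_size, replace=False)  # random mini-batch S
        Xb, yb = X[idx], y[idx]
        grad = (2.0 / batch_size) * Xb.T @ (Xb @ w - yb)     # gradient estimate g_k
        w = w - grad / (tau0 + k)**kappa
    return w

rng = np.random.default_rng(8)
X = np.column_stack([np.ones(500), rng.normal(size=(500, 2))])
w_true = np.array([0.5, -1.0, 2.0])
y = X @ w_true + 0.1 * rng.normal(size=500)

print(minibatch_sgd(X, y))  # roughly w_true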

58 / 63
Step size in SGD
❑ Choosing the value of η is particularly important in SGD since there is
no easy way to compute it.

❑ Usually the value of η will depend on the iteration k , ηk .

❑ It should follow the Robbins-Monro conditions

\sum_{k=1}^{\infty} \eta_k = \infty, \qquad \sum_{k=1}^{\infty} \eta_k^2 < \infty.

❑ Various formulas for ηk can be used

\eta_k = \frac{1}{k}, \qquad \eta_k = \frac{1}{(\tau_0 + k)^{\kappa}},

where τ0 slows down early iterations and κ ∈ (0.5, 1].

59 / 63
Contents

Review of vector/matrix notation and linear algebra

A regression model

Linear regression

Gradient descent

Stochastic Gradient Descent

Regularisation

60 / 63
What is regularisation?
❑ It refers to a technique used for preventing overfitting in a predictive
model.

❑ It consists in adding a term (a regulariser) to the objective function that encourages simpler solutions.

❑ With regularisation, the objective function for linear regression would be

h(w) = E(w) + λR(w),

where R(w) is the regularisation term and λ the regularisation parameter.

❑ In the expression for h(w), we can use the negative LL(w) instead of
E (w).

❑ If λ = 0, we get h(w) = E (w).

61 / 63
Different types of regularisation

❑ The objective function for linear regression would be

h(w) = E (w) + λR (w),

where R (w) follows as

R(w) = \alpha \lVert w \rVert_1 + \frac{1}{2}(1 - \alpha) \lVert w \rVert_2^2,

where \lVert w \rVert_1 = \sum_{m=1}^{p} |w_m| and \lVert w \rVert_2^2 = \sum_{m=1}^{p} w_m^2.

❑ If α = 1, we get ℓ1 regularisation.

❑ If α = 0, we get ℓ2 regularisation.

❑ If 0 < α < 1, we get the elastic net regularisation.

62 / 63
Ridge regression or ℓ2 regularisation

❑ In ridge regression, α = 0,

h(w) = \frac{1}{N} \sum_{n=1}^{N} (y_n - w^\top x_n)^2 + \frac{\lambda}{2} w^\top w,

❑ It can be shown that an optimal solution for w∗ is given as

w_* = \left( X^\top X + \frac{\lambda N}{2} I \right)^{-1} X^\top y.

❑ Notice that we can also use an iterative procedure for optimising h(w), either through batch gradient descent, SGD or mini-batch SGD.
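A minimal numpy sketch of the closed-form ridge solution above, on synthetic data (λ is chosen arbitrarily and all names are ours):

import numpy as np

def ridge_solution(X, y, lam):
    """w* = (X^T X + (lam * N / 2) I)^{-1} X^T y."""
    N, D = X.shape
    A = X.T @ X + 0.5 * lam * N * np.eye(D)
    return np.linalg.solve(A, X.T @ y)

rng = np.random.default_rng(9)
X = np.column_stack([np.ones(100), rng.normal(size=(100, 3))])
y = X @ np.array([1.0, 0.5, -0.5, 2.0]) + 0.2 * rng.normal(size=100)

print(ridge_solution(X, y, lam=0.0))  # lam = 0 recovers the ordinary normal equation
print(ridge_solution(X, y, lam=1.0))  # weights shrunk towards zero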

63 / 63
