
3EBX0 - Machine Learning for Science

Lecture notes addendum


Alessandro Corbetta
February 19, 2025

1 Purpose of this addendum


Machine Learning is a field in extremely rapid evolution. University courses must evolve quickly to keep serving their purpose of providing strong and relevant foundations. This course, 3EBX0, is no exception.
This addendum is a first step towards a complete upgrade of the original 3EBX0 lecture notes by prof.dr. Vianney Koelman (responsible teacher of this course for the academic years 2018-2020). It focuses on providing stronger foundations on the topics of Stochastic Gradient Descent (ch. 13) and Backpropagation (ch. 14), which are now part of the graded assignments. The backpropagation algorithm is a cornerstone of contemporary machine learning: it enables scalable computation of gradients and is thus essential for training.
This addendum strives to keep the same notation as the lecture notes, with one exception: input vectors x are row vectors. Hence, when interpreting linear layers as matrix multiplications, the vector-matrix products that emerge are left products (i.e. xW) rather than the usual right products (i.e. Wx). The dimensions of the weight matrices W are, of course, chosen so that these products are well-defined. This convention simplifies the calculations underlying the backpropagation algorithm. Right products appear in the lecture notes, e.g. in Equations 6 and 7; the same theory could be developed with right products, at the cost of some additional technical complication.
Sketches representing the matrix dimensions are often interleaved between formulas to help keep a visual intuition of the operations at play.
This addendum is structured as follows: Sect. 2 includes a brief introduction to supervised learning, which introduces the notation and makes this addendum self-standing. Sect. 3 details the iterative process of training neural networks via stochastic gradient descent. Sect. 4 summarizes the notation. Sect. 5 presents a complete derivation of the backpropagation algorithm; its last subsection provides a synthesis, listing all the steps needed when coding the algorithm.


2 Supervised learning via feedforward networks


In supervised learning, we aim at approximating a function of the vector x,

f̃ = f̃(x), (1)

via a trainable model, here a neural network: fw = fw(x). Typically, the function f̃ is extremely
complex. Examples can be
• x = (a, b, c), sides of a triangle, f̃(x) = area of the triangle;
• x = molecule composition, f̃(x) = effectiveness of the molecule.
Besides, f˜ is generally only known in terms of its behavior on an annotated dataset of input-output
pairs. This dataset includes inputs
{x(1) , x(2) , . . .}

and corresponding ground truth output predictions (also called annotations)

{ỹ (1) = f˜(x(1) ), ỹ (2) = f˜(x(2) ), . . . }.

A trainable model fw depends on free parameters, or weights, w = (w1, w2, . . . , wn). These free parameters have to be chosen to ensure that the approximation fw(x) ≈ f̃(x) is good. We consider the approximation good when the values fw(x) are very close to the ground truth f̃(x) on a test dataset.

Figure 1: Sketch of a neural network. Information flows from bottom to top, from the input x0 to the output xl, through the layers ℓ1, . . . , ℓl. Linear layers are represented by straight connections; each connection carries a trainable weight. ReLU layers are represented by wiggly connections.

2.1 Feedforward networks


In this course we focus on a class of trainable models called feedforward neural networks. These networks are built as stacks of layers performing simple operations. We shall now define how such networks work, i.e. how the layers operate. We consider a generic datapoint and related annotation, say (x(i), ỹ(i)) for some index i. As we are going to define how the input x(i) is transformed by the layers of the network, we shall rename it:

x0 = x(i) . (2)

We will use the subscript (here 0) to denote the outputs of each layer of the network: the input is indexed by 0, the output of the first layer by 1, and so on across the layers (cf. Figure 1). Note that we are not writing explicitly the dependency on the datapoint (i), to avoid an excessively heavy notation.

In general, the input will be a vector. For simplicity, we do not consider the case in which the input data is a matrix (e.g. a gray-scale image) or a higher-dimensional tensor (e.g. a color image, or a color movie). We consider inputs as row vectors:

x0 ∈ R^(1×N0), N0 : input dimension. (3)

The number N0 denotes the dimension of the input, also called the number of features.
This notation enables us to easily treat input batches, i.e. blocks of b input vectors that are addressed at once for prediction or during training. Batches allow, respectively, multiple predictions in parallel and training via so-called minibatch stochastic gradient descent. In case of batches, we store the b inputs across b rows:

x0 ∈ R^(b×N0), b : batch dimension. (4)

In the following, we restrict to b = 1 (i.e. no batching), bearing in mind that our calculations will be general enough to accommodate also the case b > 1. We can sketch the input x0 as a b × N0 rectangle, where the number of features N0 and the batch size b are reported on the horizontal and vertical axes, respectively.
Let

ℓ1, ℓ2, . . . , ℓl−1, ℓl (5)

be the layers of the neural network. The output of the network is built by applying one layer after the other (cf. Figure 1):

x0 → x1 = ℓ1(x0) → x2 = ℓ2(x1) → . . . → xl = ℓl(xl−1). (6)

The output of the last layer, xl, coincides with the output of the network, y. In the language of mathematical analysis, considering each layer as a function, we identify this stacking operation as a function composition. In mathematical lingo, we could also write

y = ℓl(ℓl−1(. . . ℓ2(ℓ1(x0)))). (7)
We build the network by juxtaposing linear multiplier layers and ReLU (Rectified Linear Unit) activation functions. By definition, the ReLU activation is

ReLU(x) = 0 if x < 0, x if x ≥ 0. (8)

A graph of the ReLU activation is in Figure 2 (left). In other words, the layers

ℓ1, ℓ3, ℓ5, . . . (9)

are multipliers, i.e. perform linear operations, whereas

ℓ2, ℓ4, ℓ6, . . . (10)

perform ReLUs.
In formulas, for the multiplier layers we can write

xj = ℓj(xj−1) = xj−1 Wj, j = 1, 3, 5, . . . , Wj ∈ R^(Nj−1×Nj), (11)

where Nj is the number of neurons at layer j. To have a visual intuition of this matrix operation, and especially of the sizes involved: the (b × Nj) output xj is the product of the (b × Nj−1) input xj−1 with the (Nj−1 × Nj) weight matrix Wj.
For the ReLU layers it holds, instead,

xj = ℓj(xj−1) = ReLU(xj−1), j = 2, 4, 6, . . . . (12)

Note that we use the symbol xj to denote the layer outputs regardless of whether they come from linear layers or from non-linear activations. In the literature, the symbols a and z are often used to distinguish these two kinds of outputs. Our choice of notation will pay off by simplifying the formulas obtained while deriving the backpropagation algorithm.
The weight array w is thus the collection of the elements of the weight matrices:

w → {W1 , W3 , W5 , . . .} . (13)
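To make these operations concrete, here is a minimal NumPy sketch of a forward pass through a [multiplier, ReLU, multiplier] stack; the layer widths and the random weights are illustrative assumptions, not part of the course material.

```python
import numpy as np

rng = np.random.default_rng(0)

def relu(x):
    return np.maximum(0.0, x)       # Eq. (8), applied component-wise

N0, N1, N3 = 4, 3, 2                # hypothetical layer widths
W1 = rng.normal(size=(N0, N1))      # weights of multiplier layer 1
W3 = rng.normal(size=(N1, N3))      # weights of multiplier layer 3

x0 = rng.normal(size=(1, N0))       # one row-vector input (b = 1)
x1 = x0 @ W1                        # left product x W, Eq. (11)
x2 = relu(x1)                       # ReLU layer, Eq. (12)
x3 = x2 @ W3                        # network output y = x_l
```

Note that the same code handles batches unchanged: stacking b inputs as the rows of x0 propagates all b predictions in parallel.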

Figure 2: (Left) Graph of the Rectified Linear Unit function (ReLU). (Right) Graph of the derivative of the ReLU function: the Heaviside function.

2.2 Error metrics


Identifying good or optimal choices for the weights w entails the minimization of a suitable error function, or Loss, E. This function measures the mismatch between the model predictions and the corresponding ground truth values. Heuristically, we write

E = E(predictions, ground truth; w), (14)

to stress the dependency of the loss value on


• the network predictions;
• the ground truth for the same input;
• the weights of the network.
Let S be the size of the dataset. Examples of loss functions are:

• the Mean Square Error (MSE)

E(predictions, ground truth; w) = 1/(2S) Σi |ỹ(i) − fw(x(i))|², (15)

• the Root Mean Square Error (RMSE)

E(predictions, ground truth; w) = sqrt( (1/S) Σi |ỹ(i) − fw(x(i))|² ), (16)

• the Mean Relative Error (MRE)

E(predictions, ground truth; w) = (1/S) Σi |ỹ(i) − fw(x(i))| / |ỹ(i)|, (17)

• the Mean Square Relative Error (MSRE; also called Mean Square Percentage Error, MSPE, when multiplied by 100)

E(predictions, ground truth; w) = 1/(2S) Σi |ỹ(i) − fw(x(i))|² / |ỹ(i)|², (18)

• the Root Mean Square Relative Error (RMSRE; also called Root Mean Square Percentage Error, RMSPE, when multiplied by 100)

E(predictions, ground truth; w) = sqrt( (1/S) Σi |ỹ(i) − fw(x(i))|² / |ỹ(i)|² ). (19)

In the course, we focus on physical systems. Hence, we shall use relative error functions: they enable approximating functions regardless of the scale of the inputs/outputs.
To train, we shall consider a dataset on which input-output pairs are known. This is called the training dataset.
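As an illustration, the MSE (15) and the RMSRE (19) can be sketched in NumPy as follows; the array names are our own, with each entry of y_true an annotation ỹ(i) and each entry of y_pred a prediction fw(x(i)).

```python
import numpy as np

def mse(y_true, y_pred):
    # Eq. (15): squared mismatch, summed and divided by 2S
    return np.sum(np.abs(y_true - y_pred) ** 2) / (2 * y_true.size)

def rmsre(y_true, y_pred):
    # Eq. (19): relative squared mismatch, averaged, then square-rooted
    return np.sqrt(np.mean(np.abs(y_true - y_pred) ** 2 / np.abs(y_true) ** 2))

y_true = np.array([1.0, 2.0, 4.0])   # hypothetical annotations
y_pred = np.array([1.1, 1.9, 4.2])   # hypothetical model predictions
print(mse(y_true, y_pred), rmsre(y_true, y_pred))
```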

3 Training with stochastic gradient descent


Typically, we do not seek to find optimal parameters for a neural network in one single operation: due to the non-linear nature of neural networks, this would be impossible. Conversely, we aim at iteratively improving our choice of the weights starting from an initial guess, w0. In particular, we shall move from w0 in a sequence of steps that progressively allow lower loss values on the training set:

w0 → w1 → . . . → wk (20)
E(w0) > E(w1) > . . . > E(wk). (21)
To decide how to update the weights from step s to step s + 1, our optimal choice is the direction in which the loss decreases the most. Basic calculus tells us that this direction in weight space is opposite to the gradient of the loss with respect to the weights:

−(∂E/∂w)ᵀ, (22)

where the ᵀ indicates a transposition operation.
Let the vector ∆ws, defined as

∆ws = ws+1 − ws, (23)

denote the weight update from step s to step s + 1. Performing such an update following the gradient implies

∆ws = −µ (∂E/∂w)ᵀ, (24)
where µ is a rescaling constant usually dubbed learning rate.
As the free parameters are contributed by the linear layers, we consider the gradient in its components:

(∂E/∂w)ᵀ → { (∂E/∂W1)ᵀ, (∂E/∂W3)ᵀ, (∂E/∂W5)ᵀ, . . . }. (25)
Finally, we shall not compute gradients by evaluating the loss on the entire training set. The loss landscape in parameter space includes many local minima, in which the gradient vanishes; these would stop our weight updates very quickly. We rather strive to add randomness to the weight update process to “bypass” local minima and, hopefully, reach a global minimum of the loss.
The most common strategy to achieve this randomness is the Stochastic Gradient Descent algorithm. The underlying idea is to evaluate the loss E and update the weights using one training sample at a time. The intuition is that, by looking at subsets of the training set, local minima will appear and disappear; this allows our iterative weight update to go around them. Global minima, on the other hand, will be robustly present.
Once we have used all our training data to update the weights, we say that we have concluded one training epoch. After one epoch, we proceed with updating the weights using again all the training data in the next epoch, and so on.
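In pseudo-code, one epoch of stochastic gradient descent then reads as below; grad_E is a hypothetical helper returning (∂E/∂w)ᵀ for a single training pair, whose computation is the subject of Sect. 5.

```python
def sgd_epoch(w, inputs, targets, grad_E, mu=0.01):
    """One epoch: visit every training pair once, updating w per Eq. (24)."""
    for x, y in zip(inputs, targets):
        w = w - mu * grad_E(w, x, y)   # step opposite to the loss gradient
    return w
```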

4 Notation
• Input data points are row vectors, x0 ∈ R^(1×N0) (or batches of row vectors, x0 ∈ R^(b×N0)).
• Linear layers entail multiplications on the left. This is chosen for convenience, as it allows easy generalization to batched operations.
• If the matrix A has shape N × M, i.e. A ∈ R^(N×M), then the derivative ∂E/∂A has the transposed shape M × N.
• Taking derivatives of matrices with respect to matrices gives rise to higher-order tensors (matrices with three or four dimensions). The notation adopted allows us to never confront these beasts directly: they will always be “higher-dimensional” identity matrices that trivially contract with our two-dimensional matrices, outputting two-dimensional matrices.

• The “ : ” indicates the contraction operation of two matrices (or higher-dimensional matrices). This entails performing the dot product twice in a row (see the sketch after this list).
• Technical steps involving contractions or four-dimensional tensors are grayed out.
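For concreteness, a small sketch of the “ : ” contraction between two matrices of equal shape, i.e. the sum over both indices of the element-wise products:

```python
import numpy as np

A = np.arange(6.0).reshape(2, 3)
B = np.ones((2, 3))
# "A : B": contract both indices, i.e. the dot product performed twice
assert np.isclose(np.tensordot(A, B, axes=2), np.sum(A * B))
```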

5 Backpropagation for stochastic gradient descent


The structure of feedforward networks enables scalable computation of the gradient (∂E/∂w)ᵀ via the backpropagation algorithm.
Through backpropagation we can recursively generate the quantities ∂E/∂Wj starting from the last layer and proceeding backwards. For simplicity, we consider the MSE loss, although the same calculations hold also for the other losses.
The derivation reported below applies to stochastic gradient descent, gradient descent, and minibatch stochastic gradient descent. Technically, the only distinction is the value of b: respectively 1, the full dataset size, or the batch size. In any of these cases it is assumed that a forward pass of the network has been performed on the datapoint, dataset or batch of interest. This means that the quantities x0, x1, . . . , xl are known.
The backpropagation algorithm uses a foundational notion in calculus, namely the chain rule. The chain rule provides a method to compute derivatives of composed functions; a neural network, composing layers, is itself a composed function. As a small reminder, given two (composable) functions s = s(x) and t = t(s), the derivative of the composed function t(s(x)) with respect to x reads:

∂t(s(x))/∂x = (∂s/∂x)(∂t/∂s). (26)

When it is understood that t depends on x via s, the right-hand side of the equation is always just short-handed to ∂t/∂x. The chain rule holds for scalar, vector and matrix functions. We shall use it for vector and matrix functions, where its elegant structure holds unchanged, but attention needs to be paid to the order of the matrix multiplications that emerge. The next derivation aims at including all the relevant operations.
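As a quick numeric sanity check of Eq. (26) in the scalar case, take the hypothetical functions s(x) = x² and t(s) = sin(s):

```python
import numpy as np

x = 1.3
analytic = 2 * x * np.cos(x ** 2)   # (ds/dx)(dt/ds), Eq. (26)

h = 1e-6                            # central finite difference of t(s(x))
numeric = (np.sin((x + h) ** 2) - np.sin((x - h) ** 2)) / (2 * h)
assert np.isclose(analytic, numeric)
```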
First, we shall calculate the derivative of the loss with respect to the output of each layer. We start from the last one, the network output, and then proceed backwards. Denoting by y the ground truth annotation (ỹ in Sect. 2) and by xl the network output, we compute:

∂E/∂xl = −(y − xl)ᵀ ∈ R^(Nl×b). (27)

For the MSE case this is nothing but the difference between the output and the ground truth, transposed to comply with the shape convention of Sect. 4.
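In NumPy, with y and xl stored as b × Nl arrays, this first backward quantity is one line (a sketch; the variable name dE_dxl is ours):

```python
dE_dxl = -(y - xl).T    # Eq. (27): shape (N_l, b), per the convention of Sect. 4
```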

We can compute the derivative of the loss with respect to the previous layers' outputs, xl−j, proceeding backwards and making use of the chain rule. Let us start with the output of the second-to-last layer, xl−1:

∂E/∂xl−1 = (∂xl/∂xl−1) (∂E/∂xl). (28)

The second factor in this matrix multiplication has been computed previously, whereas the first factor depends on whether the l-th layer is a linear operation (a) or a ReLU activation (b).
• case a. The l-th layer is a multiplier, i.e. xl = xl−1 Wl. It holds

(∂xl/∂xl−1) (∂E/∂xl) = (∂(xl−1 Wl)/∂xl−1) (∂E/∂xl) = (Id : Wl) (∂E/∂xl) = Wl (∂E/∂xl). (29)

Thus, the gradient ∂E/∂xl−1 reads

∂E/∂xl−1 = Wl (∂E/∂xl). (30)

In terms of sizes: the (Nl−1 × b) matrix ∂E/∂xl−1 is the product of the (Nl−1 × Nl) matrix Wl with the (Nl × b) matrix ∂E/∂xl.
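In NumPy, the backward step through a multiplier layer (case a) is then a single left multiplication by Wl (a sketch with our variable names):

```python
dE_dxprev = Wl @ dE_dxl    # Eq. (30): (N_{l-1} x N_l) times (N_l x b) -> (N_{l-1} x b)
```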

• case b. The l-th layer is a ReLU activation, i.e. xl = ReLU(xl−1). The derivative of the ReLU function is the Heaviside function, defined as

dReLU(x)/dx = Heaviside(x) = 0 if x < 0, 1 if x > 0. (31)

The graph of the Heaviside function is reported in Figure 2 (right). It holds

(∂xl/∂xl−1) (∂E/∂xl) = (∂ReLU(xl−1)/∂xl−1) (∂E/∂xl) = [Heaviside(xl−1)] : (∂E/∂xl) = Heaviside(xl−1)ᵀ ⊗ (∂E/∂xl), (32)

where Heaviside(xl−1) is the Heaviside step function applied component-wise and ⊗ indicates element-wise multiplication. Note that layers l and l − 1 have the same number of neurons: Nl−1 = Nl.
Thus, the gradient ∂E/∂xl−1 reads

∂E/∂xl−1 = Heaviside(xl−1)ᵀ ⊗ (∂E/∂xl). (33)

In terms of sizes: the (Nl−1 × b) matrix ∂E/∂xl−1 is the element-wise product of the (Nl−1 × b) matrix Heaviside(xl−1)ᵀ with the (Nl−1 × b) matrix ∂E/∂xl.
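The backward step through a ReLU layer (case b) becomes an element-wise product in NumPy (a sketch with our variable names; note the transpose aligning the shapes):

```python
heaviside = (xprev > 0).astype(float)   # Heaviside(x_{l-1}), Eq. (31), component-wise
dE_dxprev = heaviside.T * dE_dxl        # Eq. (33): element-wise product, shape (N_{l-1}, b)
```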


Note that we can recursively compute the derivatives with respect to the other activations moving backwards in the network as:

∂E/∂xl−2 = (∂xl−1/∂xl−2) (∂E/∂xl−1), (34)
∂E/∂xl−3 = (∂xl−2/∂xl−3) (∂E/∂xl−2), (35)
. . . (36)

The second factor of the right-hand sides is always known from the previous iteration, whilst the first
factor has to be computed as explained in cases (a) and (b).
Finally, we can compute the gradients with respect to the weights. Let us consider the weights associated to layer j, and let us apply the chain rule again:

∂E/∂Wj = (∂xj/∂Wj) (∂E/∂xj). (37)

Note that the second factor has been computed previously; for simplicity we rename it K = ∂E/∂xj. Thus, we get

(∂xj/∂Wj) K = ∂(xj K)/∂Wj (38)
= ∂(xj−1 Wj K)/∂Wj     ((xj−1 Wj K) is a scalar; we could jump to the last step) (39)
= ∂(K xj−1 : Wj)/∂Wj (40)
= K xj−1 : ∂Wj/∂Wj (41)
= K xj−1. (42)

In conclusion, we get

∂E/∂Wj = (∂E/∂xj) xj−1. (43)

In terms of sizes: the (Nj × Nj−1) matrix ∂E/∂Wj is the product of the (Nj × b) matrix ∂E/∂xj with the (b × Nj−1) matrix xj−1.
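In NumPy, the weight gradient of Eq. (43) is again a single matrix product (a sketch with our variable names):

```python
dE_dWj = dE_dxj @ xprev   # Eq. (43): (N_j x b) times (b x N_{j-1}) -> (N_j x N_{j-1})
```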

5.1 Implementation and training


Perform a forward pass and store all the intermediate output values x1, x2, . . . , xl. Then:
1. Compute the derivative of the loss with respect to the output:

∂E/∂xl = −(y − xl)ᵀ. (44)

2. If layer ℓl is a multiplier, compute the derivative of the loss with respect to its weights:

∂E/∂Wl = (∂E/∂xl) xl−1. (45)

Note that the factor ∂E/∂xl is known from step 1, while xl−1 is known from the forward pass.
3. Compute the derivative of the loss with respect to the output of layer l − 1, xl−1:

∂E/∂xl−1 = (∂xl/∂xl−1) (∂E/∂xl). (46)

The factor ∂xl/∂xl−1 needs to be computed; the procedure differs depending on whether ℓl is a multiplier (case a. above) or a ReLU layer (case b. above). In particular, the previous formula gets specialized as follows:
case a. (linear multiplier layer)

∂E/∂xl−1 = Wl (∂E/∂xl). (47)

case b. (ReLU layer)

∂E/∂xl−1 = Heaviside(xl−1)ᵀ ⊗ (∂E/∂xl). (48)
4. You have obtained ∂E/∂xl−1. You can now go back to step 2, considering layer ℓl−1 instead. As you get to step 4 again, continue with layer ℓl−2, and so forth, until you compute the derivative with respect to the linear layer closest to the input.

Figure 3: Backpropagation in action. Once we do a forward pass, we can calculate the gradients moving backwards: starting from ∂E/∂xl = −(y − xl)ᵀ at the output, we alternate weight gradients (e.g. ∂E/∂Wl = (∂E/∂xl) xl−1) with backward steps through multipliers (∂E/∂xl−1 = Wl ∂E/∂xl) and through ReLUs (∂E/∂xl−2 = Heaviside(xl−2)ᵀ ⊗ ∂E/∂xl−1). The sequence of operations goes from the output layer down to the input.

Through this process you have obtained and stored the derivatives

∂E/∂Wl, ∂E/∂Wl−2, ∂E/∂Wl−4, . . . (49)

which you can use to update the weights. Consider that the weights are themselves matrices. It holds

Wl ← Wl − µ (∂E/∂Wl)ᵀ (50)
Wl−2 ← Wl−2 − µ (∂E/∂Wl−2)ᵀ (51)
Wl−4 ← Wl−4 − µ (∂E/∂Wl−4)ᵀ (52)
. . . (53)

You now have the updated weights. You can proceed to the next datapoint or the next batch, until you reach the conclusion of the epoch.
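Putting the steps together, here is a self-contained sketch of one full training step (forward pass, backward pass, weight updates) for a [multiplier, ReLU, multiplier] network with the MSE loss. The sizes, the initial weights and the function name are our assumptions, not prescribed by the notes.

```python
import numpy as np

def train_step(W1, W3, x0, y, mu=0.1):
    # forward pass, storing the intermediate outputs
    x1 = x0 @ W1                     # multiplier, Eq. (11)
    x2 = np.maximum(0.0, x1)         # ReLU, Eq. (12)
    x3 = x2 @ W3                     # network output x_l
    # backward pass
    dE_dx3 = -(y - x3).T             # step 1, Eq. (44)
    dE_dW3 = dE_dx3 @ x2             # step 2, Eq. (45)
    dE_dx2 = W3 @ dE_dx3             # step 3, case a, Eq. (47)
    dE_dx1 = (x1 > 0).T * dE_dx2     # step 3, case b, Eq. (48)
    dE_dW1 = dE_dx1 @ x0             # step 2 again, for the first multiplier
    # weight updates, Eqs. (50)-(52); note the transposes
    return W1 - mu * dE_dW1.T, W3 - mu * dE_dW3.T

rng = np.random.default_rng(0)
W1 = rng.normal(size=(2, 3))         # hypothetical initial weights
W3 = rng.normal(size=(3, 1))
x0 = np.array([[0.5, -0.25]])        # one row-vector input
y = np.array([[1.0]])                # its ground truth annotation
W1, W3 = train_step(W1, W3, x0, y)
```

The same function also performs minibatch updates: stacking b inputs as rows of x0 (and the b annotations as rows of y) leaves every line unchanged.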

5.2 Exercise
Consider the network shown in Figure 4, consisting of a multiplier layer, followed by a ReLU layer, followed by another multiplier layer. Let the weights of the network be

Wl−2 = [ 0.1   0    0.5
         0.7  −0.4  0.2 ],    Wl = [ 1  1  0.5 ]ᵀ. (54)

(a) Implement a forward pass of the network with the input xl−3 = [ 0.5  −0.25 ]. Show that the output of the network is xl = 0.2 and store the intermediate values of the network.

Figure 4: Exercise network. The first multiplier carries the weights Wl−2 = [ W1 W2 W3 ; W4 W5 W6 ], the last multiplier carries the weights Wl = [ W7 W8 W9 ]ᵀ.

(b) Assuming that the ground truth is y = 1, compute the derivative of the loss with respect to the output of the network (∂E/∂xl).

(c) Use the result of exercise (b) to compute the derivative of the loss with respect to the weights of the last layer (∂E/∂Wl) and the derivative of the loss with respect to the output of the second layer (∂E/∂xl−1).

(d) Compute the derivative of the loss with respect to the output of the ReLU layer (∂E/∂xl−2).
 
(e) Show that

∂E/∂Wl−2 = [  0    0
             −0.4  0.2
             −0.2  0.1 ]

and compute the updated weights Wl−2 and Wl assuming that the learning rate is µ = 0.1.
 
Bonus (hard). Implement the entire backpropagation with batch size b = 2, input

xl−3 = [ 0.5  −0.25
         0.8   0.4 ],

and ground truth y = [ 1  1 ]ᵀ.
 
Hint. If implemented correctly, this should give

∂E/∂Wl−2 = [ −0.32  −0.16
             −0.4    0.2
             −0.36   0.02 ]    and    ∂E/∂Wl = [ −0.144  −0.08  −0.352 ].
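As a cross-check, a short script along the lines of Sect. 5.1 reproduces the hinted gradients (a sketch; the variable names are ours, the values come from Eq. (54)):

```python
import numpy as np

W_lm2 = np.array([[0.1, 0.0, 0.5],
                  [0.7, -0.4, 0.2]])       # W_{l-2}, Eq. (54)
W_l = np.array([[1.0], [1.0], [0.5]])      # W_l, Eq. (54)
x_lm3 = np.array([[0.5, -0.25],
                  [0.8, 0.4]])             # batched input, b = 2
y = np.array([[1.0], [1.0]])               # ground truth

x_lm2 = x_lm3 @ W_lm2                      # first multiplier
x_lm1 = np.maximum(0.0, x_lm2)             # ReLU
x_l = x_lm1 @ W_l                          # output: [[0.2], [0.6]]

dE_dxl = -(y - x_l).T                      # Eq. (44)
dE_dWl = dE_dxl @ x_lm1                    # should match the hinted row vector
dE_dxlm1 = W_l @ dE_dxl                    # Eq. (47)
dE_dxlm2 = (x_lm2 > 0).T * dE_dxlm1        # Eq. (48)
dE_dWlm2 = dE_dxlm2 @ x_lm3                # should match the hinted 3 x 2 matrix
print(dE_dWl, dE_dWlm2, sep="\n")
```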

