3EBX0 Lecture Notes Addendum
Alessandro Corbetta
We aim at approximating an unknown target function
$\tilde f = \tilde f(x)$ (1)
via a trainable model, here a neural network: $f_w = f_w(x)$. Typically, the function $\tilde f$ is extremely complex. Examples are:
• $x = (a, b, c)$, the sides of a triangle, $\tilde f(x)$ = area of the triangle;
• $x$ = molecule composition, $\tilde f(x)$ = effectiveness of the molecule.
Besides, f˜ is generally only known in terms of its behavior on an annotated dataset of input-output
pairs. This dataset includes inputs
$\{x^{(1)}, x^{(2)}, \ldots\}$
and corresponding ground truth output predictions (also called annotations) $\{\tilde y^{(1)}, \tilde y^{(2)}, \ldots\}$.
Figure 1: Sketch of a neural network. Information flows from bottom to top. Linear layers are represented by straight connections. Each connection carries a trainable weight. ReLU layers are represented by wiggly connections.
Given a generic datapoint $x^{(i)}$, we denote the corresponding network input as
$x_0 = x^{(i)}$. (2)
We will use the subscript (here 0) to denote the outputs of each layer of the network. Thus, we index
the input by 0, and so on across the layers (cf. Figure 1). Note that we do not write the dependency on the datapoint $(i)$ explicitly, to avoid an excessively heavy notation.
In general, the input will be a vector. For simplicity, we do not consider the case in which the
input data is a matrix (e.g. a gray-scale image) or a higher dimensional tensor (e.g. a color image, or
a color movie). We consider inputs as row vectors:
$x_0 \in \mathbb{R}^{1\times N_0}$, $\quad N_0$: input dimension. (3)
The number N0 denotes the dimension of the input, also called the number of features.
This notation makes it easy to treat input batches, that is, blocks of $b$ input vectors that are processed at once for prediction or during training. These allow, respectively, multiple predictions in parallel and training via the so-called minibatch stochastic gradient descent. In case of batches, we store the $b$ inputs across $b$ rows:
$x_0 \in \mathbb{R}^{b\times N_0}$, $\quad b$: batch dimension. (4)
In the following, we restrict to $b = 1$ (i.e. no batching), bearing in mind that our calculations will be general enough to accommodate the case $b > 1$ as well. We can sketch the input $x_0$ as a rectangle as follows:
[sketch: a rectangle of width $N_0$ and height $b$, labelled $x_0$],
where the number of features $N_0$ and the batch size $b$ are reported on the horizontal and vertical axes, respectively.
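For concreteness, a minimal NumPy sketch of these two shapes (the values and the choice $N_0 = 3$ are arbitrary, for illustration only):

import numpy as np

x0_single = np.array([[3.0, 4.0, 5.0]])   # one input as a row vector, shape (1, N0) with N0 = 3
x0_batch = np.ones((8, 3))                # a batch of b = 8 inputs stored across the rows, shape (b, N0)
print(x0_single.shape, x0_batch.shape)    # (1, 3) (8, 3)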
Let
$\ell_1, \ell_2, \ldots, \ell_{l-1}, \ell_l$ (5)
be the layers of the neural network. In other terms, the output is built by applying one layer after the other (cf. Figure 1):
$x_0 \longrightarrow x_1 = \ell_1(x_0) \longrightarrow x_2 = \ell_2(x_1) \longrightarrow \ldots \longrightarrow x_l = \ell_l(x_{l-1})$. (6)
The output of the last layer xl coincides with the output of the network y. In the language of
mathematical analysis, considering each layer as a function, we identify this stacking operation as a
function composition. In mathematical lingo, we could also write
$y = \ell_l(\ell_{l-1}(\ldots \ell_2(\ell_1(x_0))))$. (7)
We build the network by alternating linear multiplier layers and ReLU (Rectified Linear Unit) activation functions. By definition, ReLU activations are such that
$\mathrm{ReLU}(x) = \begin{cases} 0 & x < 0, \\ x & x \ge 0. \end{cases}$ (8)
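As a side remark, the element-wise ReLU of Eq. (8) can be written in one line, e.g. in NumPy (a sketch for illustration, not code from the notes):

import numpy as np

def relu(x):
    # returns 0 where x < 0 and x where x >= 0, element-wise
    return np.maximum(x, 0.0)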
A graph of the ReLU activation is in Figure 2 (left). In other words, the layers
$\ell_1, \ell_3, \ell_5, \ldots$ (9)
are multipliers, i.e. they perform linear operations, whereas
$\ell_2, \ell_4, \ell_6, \ldots$ (10)
perform ReLUs.
In formulas, for the multiplier layers we can write
$x_j = \ell_j(x_{j-1}) = x_{j-1} W_j$, $\quad j = 1, 3, 5, \ldots$, $\quad W_j \in \mathbb{R}^{N_{j-1}\times N_j}$, (11)
where $N_j$ is the number of neurons at layer $j$. To have a visual intuition of this matrix operation, and especially of the sizes involved, we illustrate the multiplication as
[sketch: the $b\times N_j$ block $x_j$ equals the $b\times N_{j-1}$ block $x_{j-1}$ times the $N_{j-1}\times N_j$ block $W_j$].
For the ReLU layers it holds, instead,
$x_j = \ell_j(x_{j-1}) = \mathrm{ReLU}(x_{j-1})$, $\quad j = 2, 4, 6, \ldots$ (12)
Note that we use the symbol $x_j$ to denote the layer outputs regardless of whether they come from linear layers or non-linear activations. In the literature, the symbols $a$ and $z$ are often used to distinguish these two kinds of outputs. Our choice of notation will pay off by simplifying the formulas obtained while deriving the backpropagation algorithm.
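To make the alternation of multiplier and ReLU layers concrete, here is a minimal NumPy sketch of the forward pass (6), storing every intermediate output $x_0, x_1, \ldots, x_l$. The layer sizes, the helper names and the absence of a ReLU after the last multiplier (as in the exercise of Section 5.2) are our own choices for illustration:

import numpy as np

def relu(x):
    return np.maximum(x, 0.0)

# weight matrices of the multiplier layers: W_j has shape (N_{j-1}, N_j)
weights = [np.random.randn(4, 5), np.random.randn(5, 3), np.random.randn(3, 1)]

def forward(x0, weights):
    # alternate multiplier and ReLU layers; return the list [x0, x1, ..., xl]
    xs = [x0]
    for j, W in enumerate(weights):
        xs.append(xs[-1] @ W)          # linear multiplier layer, Eq. (11)
        if j < len(weights) - 1:       # ReLU layer, Eq. (12); none after the last multiplier
            xs.append(relu(xs[-1]))
    return xs

x0 = np.random.rand(1, 4)              # a single input (b = 1, N0 = 4)
xs = forward(x0, weights)              # xs[-1] is the network output x_l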
The weight array w is thus the collection of the elements of the weight matrices:
$w \to \{W_1, W_3, W_5, \ldots\}$. (13)
Figure 2: (Left) Graph of the Rectified Linear Unit function (ReLU). (Right) Graph of the derivative of the ReLU function: the Heaviside function.
To train the network we need an error (loss) function $E$ quantifying the mismatch between predictions and ground truth annotations. One example is
• the Mean Square Relative Error (MSRE)
$E(\text{predictions}, \text{ground truth}; w) = \dfrac{1}{2S}\displaystyle\sum_i \dfrac{|\tilde y^{(i)} - f_w(x^{(i)})|^2}{|\tilde y^{(i)}|^2}$. (18)
In the course, we focus on physical systems. Hence, we shall retain relative error functions: they enable us to approximate functions regardless of the scale of the input/output.
To train, we consider a dataset for which input-output pairs are known. This is called the training dataset.
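As an illustration, the MSRE of Eq. (18) could be computed as follows (a sketch: it assumes $S$ is the number of datapoints, with ground truths and predictions stored as NumPy arrays of shape $(S, N_l)$; the function name is ours):

import numpy as np

def msre(y_true, y_pred):
    # Eq. (18): mean over the S datapoints of the squared error relative to |y_true|^2
    sq_rel_err = np.sum((y_true - y_pred) ** 2, axis=1) / np.sum(y_true ** 2, axis=1)
    return 0.5 * np.mean(sq_rel_err)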
4 Notation
• Input data points are row vectors, $x_0 \in \mathbb{R}^{1\times N_0}$ (or batches of row vectors, $x_0 \in \mathbb{R}^{b\times N_0}$).
• Linear layers entail multiplications on the left. This is chosen for convenience, as it allows an easy generalization to batched operations.
• If the matrix $A$ has shape $N \times M$, i.e. $A \in \mathbb{R}^{N\times M}$, then the derivative $\partial E/\partial A$ has the transposed shape $M \times N$.
• Differentiating matrices with respect to matrices gives rise to higher-order tensors (arrays with three or four dimensions). The notation adopted allows us to never confront these beasts directly: they will always be “higher-dimensional” identity matrices, which trivially contract with our two-dimensional matrices, yielding two-dimensional matrices.
• The “ : ” indicates the contraction operation between two matrices (or higher-dimensional tensors). This entails performing the dot product twice in a row.
• Technical steps involving contractions or four-dimensional tensors are grayed out.
5 The backpropagation algorithm
Through backpropagation we can recursively generate the quantities $\frac{\partial E}{\partial W_j}$, starting from the last layer and proceeding backwards. For simplicity, we consider the MSE loss, although the same calculations hold for the other losses as well.
The derivation reported below applies to stochastic gradient descent, gradient descent, or minibatch stochastic gradient descent. Technically, the only distinction is the value of $b$: respectively 1, the full dataset size, or the batch size. In any of these cases it is assumed that a forward pass of the network has been performed on the datapoint, dataset or batch of interest. This means that the quantities $x_0, x_1, \ldots, x_l$ are known.
The backpropagation algorithm uses a foundational notion in calculus, namely the chain rule. The chain rule provides a method to compute derivatives of composed functions. A neural network, composing layers, is itself a composed function. As a small reminder, given two (composable) functions $s = s(x)$ and $t = t(s)$, the derivative of the composed function $t(s(x))$ with respect to $x$ reads:
$\dfrac{\partial t(s(x))}{\partial x} = \dfrac{\partial s}{\partial x}\,\dfrac{\partial t}{\partial s}$. (26)
When it is understood that $t$ depends on $x$ via $s$, the right-hand side of the equation is always just short-handed to $\frac{\partial t}{\partial x}$. The chain rule holds for scalar, vector and matrix functions. We shall use it for vector and matrix functions, where its elegant structure holds unchanged, but attention needs to be paid to the order of the matrix multiplications that emerge. The next derivation aims at including all the relevant operations.
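As a concrete scalar example (not from the notes): take $s(x) = 3x$ and $t(s) = s^2$, so that $t(s(x)) = 9x^2$. The chain rule of Eq. (26) indeed gives
$\dfrac{\partial t(s(x))}{\partial x} = \dfrac{\partial s}{\partial x}\,\dfrac{\partial t}{\partial s} = 3 \cdot 2s = 6\,(3x) = 18x$,
which matches the direct derivative of $9x^2$.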
First, we shall calculate the derivative of the loss with respect to the output of each layer. We start from the last one, the network output, and then proceed backwards. Given the output $x_l$ and the ground truth $\tilde y$, we compute:
$\dfrac{\partial E}{\partial x_l} = -(\tilde y - x_l)^T \in \mathbb{R}^{N_l\times b}$. (27)
[sketch: $\partial E/\partial x_l$ is an $N_l\times b$ block]
For the MSE case this is nothing but the difference between the output and the ground truth.
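To see where Eq. (27) comes from, assume (as a sketch, analogous to Eq. (18) but without the normalisation by $|\tilde y^{(i)}|^2$) that for a single datapoint ($b = 1$, $S = 1$) the MSE loss reads $E = \tfrac{1}{2}|\tilde y - x_l|^2 = \tfrac{1}{2}\sum_k (\tilde y_k - (x_l)_k)^2$. Then, component-wise,
$\dfrac{\partial E}{\partial (x_l)_k} = -(\tilde y_k - (x_l)_k)$, $\quad k = 1, \ldots, N_l$,
which, collected in the transposed-shape convention of Section 4, is exactly $\frac{\partial E}{\partial x_l} = -(\tilde y - x_l)^T \in \mathbb{R}^{N_l\times 1}$.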
We can compute the derivative of the loss with respect to the previous layers' outputs, $x_{l-j}$, proceeding backwards and making use of the chain rule. Let us start with the output of the second-to-last layer, $x_{l-1}$:
$\dfrac{\partial E}{\partial x_{l-1}} = \dfrac{\partial x_l}{\partial x_{l-1}}\,\dfrac{\partial E}{\partial x_l}$. (28)
The second factor in this matrix multiplication has been computed previously, whereas the first factor depends on whether the $l$-th layer is a linear operation (a) or a ReLU activation (b).
• case a. The $l$-th layer is a multiplier, i.e. $x_l = x_{l-1} W_l$. It holds
$\dfrac{\partial x_l}{\partial x_{l-1}}\dfrac{\partial E}{\partial x_l} = \dfrac{\partial (x_{l-1} W_l)}{\partial x_{l-1}}\dfrac{\partial E}{\partial x_l} = (\mathrm{Id} : W_l)\,\dfrac{\partial E}{\partial x_l} = W_l\,\dfrac{\partial E}{\partial x_l}$. (29)
Thus, the gradient $\frac{\partial E}{\partial x_{l-1}}$ reads
$\dfrac{\partial E}{\partial x_{l-1}} = W_l\,\dfrac{\partial E}{\partial x_l}$. (30)
[sketch: the $N_{l-1}\times b$ block $\partial E/\partial x_{l-1}$ equals the $N_{l-1}\times N_l$ block $W_l$ times the $N_l\times b$ block $\partial E/\partial x_l$]
• case b. The $l$-th layer is a ReLU activation, i.e. $x_l = \mathrm{ReLU}(x_{l-1})$. In this case the gradient reads
$\dfrac{\partial E}{\partial x_{l-1}} = \mathrm{Heaviside}(x_{l-1})^T \otimes \dfrac{\partial E}{\partial x_l}$, (33)
where $\otimes$ denotes the element-wise (Hadamard) product.
[sketch: the $N_{l-1}\times b$ block $\partial E/\partial x_{l-1}$ equals the $N_{l-1}\times b$ block $\mathrm{Heaviside}(x_{l-1})^T$ element-wise times the $N_{l-1}\times b$ block $\partial E/\partial x_l$]
The same procedure applies, recursively, at every layer: the second factor of the right-hand side is always known from the previous iteration, whilst the first factor has to be computed as explained in cases (a) and (b).
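As a sketch, the two cases translate into two short NumPy routines, following the notes' shape conventions ($x_{l-1}$ stored as a $(b, N_{l-1})$ array and $\partial E/\partial x_l$ as an $(N_l, b)$ array); the function names are ours:

import numpy as np

def backward_linear(W, grad_out):
    # case a, Eq. (30): dE/dx_{l-1} = W_l dE/dx_l
    # W: (N_{l-1}, N_l), grad_out: (N_l, b)  ->  output: (N_{l-1}, b)
    return W @ grad_out

def backward_relu(x_prev, grad_out):
    # case b, Eq. (33): dE/dx_{l-1} = Heaviside(x_{l-1})^T (element-wise) dE/dx_l
    # x_prev: (b, N_{l-1}), grad_out: (N_{l-1}, b); the Heaviside value at 0 is a convention
    return np.heaviside(x_prev, 0.0).T * grad_out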
Finally, we can compute the gradients with respect to the weights. Let us consider the weights associated with layer $j$, and let us apply the chain rule again:
$\dfrac{\partial E}{\partial W_j} = \dfrac{\partial x_j}{\partial W_j}\,\dfrac{\partial E}{\partial x_j}$. (37)
Note that the second factor has been computed previously; for simplicity we rename it $K = \frac{\partial E}{\partial x_j}$. Thus, we get
$\dfrac{\partial x_j}{\partial W_j} K = \dfrac{\partial\, x_j K}{\partial W_j}$ (38)
$= \dfrac{\partial\, x_{j-1} W_j K}{\partial W_j}$   ($x_{j-1} W_j K$ is a scalar; we could jump to the last step) (39)
$= \dfrac{\partial\, K x_{j-1} : W_j}{\partial W_j}$ (40)
$= K x_{j-1} : \dfrac{\partial W_j}{\partial W_j}$ (41)
$= K x_{j-1}$. (42)
In conclusion, we get
$\dfrac{\partial E}{\partial W_j} = \dfrac{\partial E}{\partial x_j}\, x_{j-1}$. (43)
[sketch: the $N_j\times N_{j-1}$ block $\partial E/\partial W_j$ equals the $N_j\times b$ block $\partial E/\partial x_j$ times the $b\times N_{j-1}$ block $x_{j-1}$]
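In code, Eq. (43) is a single matrix product (same shape conventions and hypothetical names as in the sketches above):

def weight_gradient(grad_x, x_prev):
    # Eq. (43): dE/dW_j = dE/dx_j x_{j-1}
    # grad_x: (N_j, b), x_prev: (b, N_{j-1})  ->  output: (N_j, N_{j-1}), the transposed shape of W_j
    return grad_x @ x_prev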
In summary, the backpropagation algorithm proceeds as follows.
1. Compute the derivative of the loss with respect to the network output:
$\dfrac{\partial E}{\partial x_l} = -(\tilde y - x_l)^T$. (44)
2. If the layer under consideration, $\ell_l$, is a multiplier, compute the derivative of the loss with respect to its weights:
$\dfrac{\partial E}{\partial W_l} = \dfrac{\partial E}{\partial x_l}\, x_{l-1}$. (45)
Note that the factor $\frac{\partial E}{\partial x_l}$ is known from step 1, while $x_{l-1}$ is known from the forward pass.
3. Compute the derivative of the loss with respect to the output of layer $l-1$, $x_{l-1}$:
$\dfrac{\partial E}{\partial x_{l-1}} = \dfrac{\partial x_l}{\partial x_{l-1}}\,\dfrac{\partial E}{\partial x_l}$. (46)
The factor $\frac{\partial x_l}{\partial x_{l-1}}$ needs to be computed. The procedure is different depending on whether $\ell_l$ is a multiplier (case a. above) or a ReLU layer (case b. above). In particular, the previous formula gets specialized as follows:
case a. (linear multiplier layer)
$\dfrac{\partial E}{\partial x_{l-1}} = W_l\,\dfrac{\partial E}{\partial x_l}$. (47)
case b. (ReLU layer)
$\dfrac{\partial E}{\partial x_{l-1}} = \mathrm{Heaviside}(x_{l-1})^T \otimes \dfrac{\partial E}{\partial x_l}$. (48)
4. You have obtained $\frac{\partial E}{\partial x_{l-1}}$. You can now go back to step 2, considering layer $\ell_{l-1}$ instead. When you reach step 4 again, continue with layer $\ell_{l-2}$, and so forth, until you have computed the derivative with respect to the linear layer closest to the input.
[sketch: the backward pass annotated on a network sketch: $\frac{\partial E}{\partial x_l} = -(\tilde y - x_l)^T$; $\frac{\partial E}{\partial W_l} = \frac{\partial E}{\partial x_l}\,x_{l-1}$; $\frac{\partial E}{\partial x_{l-1}} = W_l\,\frac{\partial E}{\partial x_l}$; $\frac{\partial E}{\partial x_{l-2}} = \mathrm{Heaviside}(x_{l-2})^T \otimes \frac{\partial E}{\partial x_{l-1}}$; $\frac{\partial E}{\partial x_{l-3}} = \frac{\partial x_{l-2}}{\partial x_{l-3}}\,\frac{\partial E}{\partial x_{l-2}}$; $\frac{\partial E}{\partial W_{l-2}} = \frac{\partial E}{\partial x_{l-2}}\,x_{l-3}$]
Through this process you have obtained and stored the derivatives
$\dfrac{\partial E}{\partial W_l}, \dfrac{\partial E}{\partial W_{l-2}}, \dfrac{\partial E}{\partial W_{l-4}}, \ldots$ (49)
which you can use to update the weights. Consider that the weights are themselves matrices. It holds
$W_l \leftarrow W_l - \mu \left(\dfrac{\partial E}{\partial W_l}\right)^T$ (50)
$W_{l-2} \leftarrow W_{l-2} - \mu \left(\dfrac{\partial E}{\partial W_{l-2}}\right)^T$ (51)
$W_{l-4} \leftarrow W_{l-4} - \mu \left(\dfrac{\partial E}{\partial W_{l-4}}\right)^T$ (52)
$\ldots$ (53)
where $\mu$ is the learning rate. You now have the updated weights. You can proceed to the next datapoint or the next batch, until you reach the conclusion of the epoch.
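Putting the pieces together, one weight update could look as follows. This is only a sketch: it reuses the hypothetical routines forward, backward_linear, backward_relu and weight_gradient introduced above, and mu denotes the learning rate $\mu$:

def sgd_step(x0, y_true, weights, mu):
    # weights is the list of multiplier matrices [W_1, W_3, W_5, ...]
    xs = forward(x0, weights)                     # forward pass: x_0, ..., x_l
    grad = -(y_true - xs[-1]).T                   # step 1, Eq. (27): dE/dx_l, shape (N_l, b)
    grads_W = [None] * len(weights)
    for k in range(len(xs) - 1, 0, -1):
        if k % 2 == 1:                            # xs[k] is the output of a multiplier layer
            idx = (k - 1) // 2
            grads_W[idx] = weight_gradient(grad, xs[k - 1])   # Eqs. (43)/(45)
            grad = backward_linear(weights[idx], grad)        # Eqs. (30)/(47)
        else:                                     # xs[k] is the output of a ReLU layer
            grad = backward_relu(xs[k - 1], grad)             # Eqs. (33)/(48)
    for idx, gW in enumerate(grads_W):
        weights[idx] -= mu * gW.T                 # Eqs. (50)-(52): W <- W - mu (dE/dW)^T
    return xs[-1]

# example usage, with the shapes of the forward-pass sketch above:
# sgd_step(np.random.rand(1, 4), np.random.rand(1, 1), weights, mu=0.01)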
5.2 Exercise
Consider the network shown in Figure 4, consisting of a multiplier layer, followed by a ReLU layer,
followed by another multiplier layer. Let the weights of the network be
$W_{l-2} = \begin{pmatrix} 0.1 & 0 & 0.5 \\ 0.7 & -0.4 & 0.2 \end{pmatrix}, \qquad W_l = \begin{pmatrix} 1 \\ 1 \\ 0.5 \end{pmatrix}$. (54)
(a) Implement a forward pass of the network with the input $x_{l-3} = \begin{pmatrix} 0.5 & -0.25 \end{pmatrix}$. Show that the output of the network is $x_l = 0.2$ and store the intermediate values of the network.
[Figure 4: sketch of the exercise network, from the input $x_{l-3}$ through the multiplier $W_{l-2}$, the ReLU, and the multiplier $W_l$, up to the output $x_l$. The entries of $W_{l-2}$ are labelled $W_1, \ldots, W_6$ and those of $W_l$ are labelled $W_7, W_8, W_9$. The sketch also reports, for reference, $\frac{\partial E}{\partial W_{l-2}} = \begin{pmatrix} 0 & 0 \\ -0.4 & 0.2 \\ -0.2 & 0.1 \end{pmatrix}$.]
(b) Assuming that the ground truth is $\tilde y = 1$, compute the derivative of the loss with respect to the output of the network ($\frac{\partial E}{\partial x_l}$).
(c) Use the result of exercise (b) to compute the derivative of the loss with respect to the weights of the last layer ($\frac{\partial E}{\partial W_l}$) and the derivative of the loss with respect to the output of the second layer ($\frac{\partial E}{\partial x_{l-1}}$).
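For those who want to check their results numerically, here is a small NumPy sketch of parts (a)-(c) (the variable names are ours):

import numpy as np

W_lm2 = np.array([[0.1, 0.0, 0.5],
                  [0.7, -0.4, 0.2]])     # W_{l-2}, shape (2, 3)
W_l = np.array([[1.0], [1.0], [0.5]])    # W_l, shape (3, 1)
x_lm3 = np.array([[0.5, -0.25]])         # input x_{l-3}, shape (1, 2)
y_tilde = np.array([[1.0]])              # ground truth

# (a) forward pass, storing the intermediate values
x_lm2 = x_lm3 @ W_lm2                    # multiplier layer
x_lm1 = np.maximum(x_lm2, 0.0)           # ReLU layer
x_l = x_lm1 @ W_l                        # multiplier layer
assert abs(x_l[0, 0] - 0.2) < 1e-12      # the stated output

# (b) derivative of the loss with respect to the network output, Eq. (27)
grad_xl = -(y_tilde - x_l).T             # shape (1, 1)

# (c) derivatives with respect to W_l and x_{l-1}, Eqs. (45) and (47)
grad_Wl = grad_xl @ x_lm1                # shape (1, 3)
grad_xlm1 = W_l @ grad_xl                # shape (3, 1)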