3EBX0 Lecture Notes Addendum
Alessandro Corbetta
We aim at approximating an unknown target function
$\tilde f = \tilde f(x)$ (1)
via a trainable model, here a neural network: $f_w = f_w(x)$. Typically, the function $\tilde f$ is extremely complex. Examples are:
• $x = (a, b, c)$, the sides of a triangle, $\tilde f(x)$ = area of the triangle;
• $x$ = molecule composition, $\tilde f(x)$ = effectiveness of the molecule.
Besides, f˜ is generally only known in terms of its behavior on an annotated dataset of input-output
pairs. This dataset includes inputs
$\{x^{(1)}, x^{(2)}, \ldots\}$
and corresponding ground truth output predictions (also called annotations) $\{\tilde y^{(1)}, \tilde y^{(2)}, \ldots\}$.
Figure 1: Sketch of a neural network. Information flows from bottom to top. Linear layers are represented by straight connections. Each connection carries a trainable weight. ReLU layers are represented by wiggly connections.
Given a generic datapoint $x^{(i)}$, we denote the corresponding network input as
$x_0 = x^{(i)}$. (2)
We will use the subscript (here 0) to denote the outputs of each layer of the network. Thus, we index
the input by 0, and so on across the layers (cf. Figure 1). Note that we do not write the dependency on the datapoint $(i)$ explicitly, to avoid an excessively heavy notation.
In general, the input will be a vector. For simplicity, we do not consider the case in which the
input data is a matrix (e.g. a gray-scale image) or a higher dimensional tensor (e.g. a color image, or
a color movie). We consider inputs as row vectors:
$x_0 \in \mathbb{R}^{1\times N_0}$, $\quad N_0$: input dimension. (3)
The number N0 denotes the dimension of the input, also called the number of features.
This notation makes it easy to treat input batches, that is, blocks of $b$ input vectors that are processed at once for prediction or during training. These allow, respectively, multiple predictions in parallel and training via the so-called minibatch stochastic gradient descent. In case of batches, we store the $b$ inputs across $b$ rows:
$x_0 \in \mathbb{R}^{b\times N_0}$, $\quad b$: batch dimension. (4)
In the following, we restrict to $b = 1$ (i.e. no batching), bearing in mind that our calculations will be general enough to accommodate the case $b > 1$ as well. We can sketch the input $x_0$ as a rectangle as follows:
[sketch: a rectangle of width $N_0$ and height $b$, labelled $x_0$],
where the number of features $N_0$ and the batch size $b$ are reported on the horizontal and vertical axes, respectively.
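For concreteness, a minimal NumPy sketch of these two shapes (the values and the choice $N_0 = 3$ are arbitrary, for illustration only):

import numpy as np

x0_single = np.array([[3.0, 4.0, 5.0]])   # one input as a row vector, shape (1, N0) with N0 = 3
x0_batch = np.ones((8, 3))                # a batch of b = 8 inputs stored across the rows, shape (b, N0)
print(x0_single.shape, x0_batch.shape)    # (1, 3) (8, 3)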
Let
$\ell_1, \ell_2, \ldots, \ell_{l-1}, \ell_l$ (5)
be the layers of the neural network. In other terms, the output is built by applying one layer after the other (cf. Figure 1):
$x_0 \longrightarrow x_1 = \ell_1(x_0) \longrightarrow x_2 = \ell_2(x_1) \longrightarrow \ldots \longrightarrow x_l = \ell_l(x_{l-1})$. (6)
The output of the last layer xl coincides with the output of the network y. In the language of
mathematical analysis, considering each layer as a function, we identify this stacking operation as a
function composition. In mathematical lingo, we could also write
$y = \ell_l(\ell_{l-1}(\ldots \ell_2(\ell_1(x_0))))$. (7)
We build the network by alternating linear multiplier layers and ReLU (Rectified Linear Unit) activation functions. By definition, ReLU activations are such that
$\mathrm{ReLU}(x) = \begin{cases} 0 & x < 0, \\ x & x \ge 0. \end{cases}$ (8)
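As a side remark, the element-wise ReLU of Eq. (8) can be written in one line, e.g. in NumPy (a sketch for illustration, not code from the notes):

import numpy as np

def relu(x):
    # returns 0 where x < 0 and x where x >= 0, element-wise
    return np.maximum(x, 0.0)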
A graph of the ReLU activation is in Figure 2 (left). In other words, the layers
$\ell_1, \ell_3, \ell_5, \ldots$ (9)
are multipliers, i.e. they perform linear operations, whereas
$\ell_2, \ell_4, \ell_6, \ldots$ (10)
perform ReLUs.
In formulas, for the multiplier layers we can write
$x_j = \ell_j(x_{j-1}) = x_{j-1} W_j$, $\quad j = 1, 3, 5, \ldots$, $\quad W_j \in \mathbb{R}^{N_{j-1}\times N_j}$, (11)
where $N_j$ is the number of neurons at layer $j$. To have a visual intuition of this matrix operation, and especially of the sizes involved, we illustrate the multiplication as
[sketch: the $b\times N_j$ block $x_j$ equals the $b\times N_{j-1}$ block $x_{j-1}$ times the $N_{j-1}\times N_j$ block $W_j$].
For the ReLU layers it holds, instead,
$x_j = \ell_j(x_{j-1}) = \mathrm{ReLU}(x_{j-1})$, $\quad j = 2, 4, 6, \ldots$ (12)
Note that we use the symbol $x_j$ to denote the layer outputs regardless of whether they come from linear layers or non-linear activations. In the literature, the symbols $a$ and $z$ are often used to distinguish these two kinds of outputs. Our choice of notation will pay off by simplifying the formulas obtained while deriving the backpropagation algorithm.
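To make the alternation of multiplier and ReLU layers concrete, here is a minimal NumPy sketch of the forward pass (6), storing every intermediate output $x_0, x_1, \ldots, x_l$. The layer sizes, the helper names and the absence of a ReLU after the last multiplier (as in the exercise of Section 5.2) are our own choices for illustration:

import numpy as np

def relu(x):
    return np.maximum(x, 0.0)

# weight matrices of the multiplier layers: W_j has shape (N_{j-1}, N_j)
weights = [np.random.randn(4, 5), np.random.randn(5, 3), np.random.randn(3, 1)]

def forward(x0, weights):
    # alternate multiplier and ReLU layers; return the list [x0, x1, ..., xl]
    xs = [x0]
    for j, W in enumerate(weights):
        xs.append(xs[-1] @ W)          # linear multiplier layer, Eq. (11)
        if j < len(weights) - 1:       # ReLU layer, Eq. (12); none after the last multiplier
            xs.append(relu(xs[-1]))
    return xs

x0 = np.random.rand(1, 4)              # a single input (b = 1, N0 = 4)
xs = forward(x0, weights)              # xs[-1] is the network output x_l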
The weight array w is thus the collection of the elements of the weight matrices:
$w \to \{W_1, W_3, W_5, \ldots\}$. (13)
Figure 2: (Left) Graph of the Rectified Linear Unit function (ReLU). (Right) Graph of the derivative of the ReLU function: the Heaviside function.
To train the network we need an error (loss) function $E$ quantifying the mismatch between predictions and ground truth annotations. One example is
• the Mean Square Relative Error (MSRE)
$E(\text{predictions}, \text{ground truth}; w) = \dfrac{1}{2S}\displaystyle\sum_i \dfrac{|\tilde y^{(i)} - f_w(x^{(i)})|^2}{|\tilde y^{(i)}|^2}$. (18)
In the course, we focus on physical systems. Hence, we shall retain relative error functions: they enable us to approximate functions regardless of the scale of the input/output.
To train, we consider a dataset for which input-output pairs are known. This is called the training dataset.
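As an illustration, the MSRE of Eq. (18) could be computed as follows (a sketch: it assumes $S$ is the number of datapoints, with ground truths and predictions stored as NumPy arrays of shape $(S, N_l)$; the function name is ours):

import numpy as np

def msre(y_true, y_pred):
    # Eq. (18): mean over the S datapoints of the squared error relative to |y_true|^2
    sq_rel_err = np.sum((y_true - y_pred) ** 2, axis=1) / np.sum(y_true ** 2, axis=1)
    return 0.5 * np.mean(sq_rel_err)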
4 Notation
• Input data points are row vectors, $x_0 \in \mathbb{R}^{1\times N_0}$ (or batches of row vectors, $x_0 \in \mathbb{R}^{b\times N_0}$).
• Linear layers entail multiplications on the left. This is chosen for convenience, as it allows an easy generalization to batched operations.
• If the matrix $A$ has shape $N \times M$, i.e. $A \in \mathbb{R}^{N\times M}$, then the derivative $\partial E/\partial A$ has the transposed shape $M \times N$.
• Differentiating matrices with respect to matrices gives rise to higher-order tensors (arrays with three or four dimensions). The notation adopted allows us to never confront these beasts directly: they will always be “higher-dimensional” identity matrices, which trivially contract with our two-dimensional matrices, yielding two-dimensional matrices.
• The “ : ” indicates the contraction operation between two matrices (or higher-dimensional tensors). This entails performing the dot product twice in a row.
• Technical steps involving contractions or four-dimensional tensors are grayed out.
5 The backpropagation algorithm
Through backpropagation we can recursively generate the quantities $\frac{\partial E}{\partial W_j}$, starting from the last layer and proceeding backwards. For simplicity, we consider the MSE loss, although the same calculations hold for the other losses as well.
The derivation reported below applies to stochastic gradient descent, gradient descent, or minibatch stochastic gradient descent. Technically, the only distinction is the value of $b$: respectively 1, the full dataset size, or the batch size. In any of these cases it is assumed that a forward pass of the network has been performed on the datapoint, dataset or batch of interest. This means that the quantities $x_0, x_1, \ldots, x_l$ are known.
The backpropagation algorithm uses a foundational notion in calculus, namely the chain rule. The chain rule provides a method to compute derivatives of composed functions. A neural network, composing layers, is itself a composed function. As a small reminder, given two (composable) functions $s = s(x)$ and $t = t(s)$, the derivative of the composed function $t(s(x))$ with respect to $x$ reads:
$\dfrac{\partial t(s(x))}{\partial x} = \dfrac{\partial s}{\partial x}\,\dfrac{\partial t}{\partial s}$. (26)
When it is understood that $t$ depends on $x$ via $s$, the right-hand side of the equation is always just short-handed to $\frac{\partial t}{\partial x}$. The chain rule holds for scalar, vector and matrix functions. We shall use it for vector and matrix functions, where its elegant structure holds unchanged, but attention needs to be paid to the order of the matrix multiplications that emerge. The next derivation aims at including all the relevant operations.
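As a concrete scalar example (not from the notes): take $s(x) = 3x$ and $t(s) = s^2$, so that $t(s(x)) = 9x^2$. The chain rule of Eq. (26) indeed gives
$\dfrac{\partial t(s(x))}{\partial x} = \dfrac{\partial s}{\partial x}\,\dfrac{\partial t}{\partial s} = 3 \cdot 2s = 6\,(3x) = 18x$,
which matches the direct derivative of $9x^2$.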
First, we shall calculate the derivative of the loss with respect to the output of each layer. We start from the last one, the network output, and then proceed backwards. Given the output $x_l$ and the ground truth $\tilde y$, we compute:
$\dfrac{\partial E}{\partial x_l} = -(\tilde y - x_l)^T \in \mathbb{R}^{N_l\times b}$. (27)
[sketch: $\partial E/\partial x_l$ is an $N_l\times b$ block]
For the MSE case this is nothing but the difference between the output and the ground truth.
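To see where Eq. (27) comes from, assume (as a sketch, analogous to Eq. (18) but without the normalisation by $|\tilde y^{(i)}|^2$) that for a single datapoint ($b = 1$, $S = 1$) the MSE loss reads $E = \tfrac{1}{2}|\tilde y - x_l|^2 = \tfrac{1}{2}\sum_k (\tilde y_k - (x_l)_k)^2$. Then, component-wise,
$\dfrac{\partial E}{\partial (x_l)_k} = -(\tilde y_k - (x_l)_k)$, $\quad k = 1, \ldots, N_l$,
which, collected in the transposed-shape convention of Section 4, is exactly $\frac{\partial E}{\partial x_l} = -(\tilde y - x_l)^T \in \mathbb{R}^{N_l\times 1}$.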
We can compute the derivative of the loss with respect to the previous layers' outputs, $x_{l-j}$, proceeding backwards and making use of the chain rule. Let us start with the output of the second-to-last layer, $x_{l-1}$:
$\dfrac{\partial E}{\partial x_{l-1}} = \dfrac{\partial x_l}{\partial x_{l-1}}\,\dfrac{\partial E}{\partial x_l}$. (28)
The second factor in this matrix multiplication has been computed previously, whereas the first factor depends on whether the $l$-th layer is a linear operation (a) or a ReLU activation (b).
• case a. The $l$-th layer is a multiplier, i.e. $x_l = x_{l-1} W_l$. It holds
$\dfrac{\partial x_l}{\partial x_{l-1}}\dfrac{\partial E}{\partial x_l} = \dfrac{\partial (x_{l-1} W_l)}{\partial x_{l-1}}\dfrac{\partial E}{\partial x_l} = (\mathrm{Id} : W_l)\,\dfrac{\partial E}{\partial x_l} = W_l\,\dfrac{\partial E}{\partial x_l}$. (29)
Thus, the gradient $\frac{\partial E}{\partial x_{l-1}}$ reads
$\dfrac{\partial E}{\partial x_{l-1}} = W_l\,\dfrac{\partial E}{\partial x_l}$. (30)
[sketch: the $N_{l-1}\times b$ block $\partial E/\partial x_{l-1}$ equals the $N_{l-1}\times N_l$ block $W_l$ times the $N_l\times b$ block $\partial E/\partial x_l$]
• case b. The $l$-th layer is a ReLU activation, i.e. $x_l = \mathrm{ReLU}(x_{l-1})$. In this case the gradient reads
$\dfrac{\partial E}{\partial x_{l-1}} = \mathrm{Heaviside}(x_{l-1})^T \otimes \dfrac{\partial E}{\partial x_l}$, (33)
where $\otimes$ denotes the element-wise (Hadamard) product.
[sketch: the $N_{l-1}\times b$ block $\partial E/\partial x_{l-1}$ equals the $N_{l-1}\times b$ block $\mathrm{Heaviside}(x_{l-1})^T$ element-wise times the $N_{l-1}\times b$ block $\partial E/\partial x_l$]
The same procedure applies, recursively, at every layer: the second factor of the right-hand side is always known from the previous iteration, whilst the first factor has to be computed as explained in cases (a) and (b).
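As a sketch, the two cases translate into two short NumPy routines, following the notes' shape conventions ($x_{l-1}$ stored as a $(b, N_{l-1})$ array and $\partial E/\partial x_l$ as an $(N_l, b)$ array); the function names are ours:

import numpy as np

def backward_linear(W, grad_out):
    # case a, Eq. (30): dE/dx_{l-1} = W_l dE/dx_l
    # W: (N_{l-1}, N_l), grad_out: (N_l, b)  ->  output: (N_{l-1}, b)
    return W @ grad_out

def backward_relu(x_prev, grad_out):
    # case b, Eq. (33): dE/dx_{l-1} = Heaviside(x_{l-1})^T (element-wise) dE/dx_l
    # x_prev: (b, N_{l-1}), grad_out: (N_{l-1}, b); the Heaviside value at 0 is a convention
    return np.heaviside(x_prev, 0.0).T * grad_out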
Finally, we can compute the gradients with respect to the weights. Let us consider the weights associated with layer $j$, and let us apply the chain rule again:
$\dfrac{\partial E}{\partial W_j} = \dfrac{\partial x_j}{\partial W_j}\,\dfrac{\partial E}{\partial x_j}$. (37)
Note that the second factor has been computed previously; for simplicity we rename it $K = \frac{\partial E}{\partial x_j}$. Thus, we get
$\dfrac{\partial x_j}{\partial W_j} K = \dfrac{\partial\, x_j K}{\partial W_j}$ (38)
$= \dfrac{\partial\, x_{j-1} W_j K}{\partial W_j}$   ($x_{j-1} W_j K$ is a scalar; we could jump to the last step) (39)
$= \dfrac{\partial\, K x_{j-1} : W_j}{\partial W_j}$ (40)
$= K x_{j-1} : \dfrac{\partial W_j}{\partial W_j}$ (41)
$= K x_{j-1}$. (42)
In conclusion, we get
$\dfrac{\partial E}{\partial W_j} = \dfrac{\partial E}{\partial x_j}\, x_{j-1}$. (43)
[sketch: the $N_j\times N_{j-1}$ block $\partial E/\partial W_j$ equals the $N_j\times b$ block $\partial E/\partial x_j$ times the $b\times N_{j-1}$ block $x_{j-1}$]
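In code, Eq. (43) is a single matrix product (same shape conventions and hypothetical names as in the sketches above):

def weight_gradient(grad_x, x_prev):
    # Eq. (43): dE/dW_j = dE/dx_j x_{j-1}
    # grad_x: (N_j, b), x_prev: (b, N_{j-1})  ->  output: (N_j, N_{j-1}), the transposed shape of W_j
    return grad_x @ x_prev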
In summary, the backpropagation algorithm proceeds as follows.
1. Compute the derivative of the loss with respect to the network output:
$\dfrac{\partial E}{\partial x_l} = -(\tilde y - x_l)^T$. (44)
2. If the layer under consideration, $\ell_l$, is a multiplier, compute the derivative of the loss with respect to its weights:
$\dfrac{\partial E}{\partial W_l} = \dfrac{\partial E}{\partial x_l}\, x_{l-1}$. (45)
Note that the factor $\frac{\partial E}{\partial x_l}$ is known from step 1, while $x_{l-1}$ is known from the forward pass.
3. Compute the derivative of the loss with respect to the output of layer $l-1$, $x_{l-1}$:
$\dfrac{\partial E}{\partial x_{l-1}} = \dfrac{\partial x_l}{\partial x_{l-1}}\,\dfrac{\partial E}{\partial x_l}$. (46)
The factor $\frac{\partial x_l}{\partial x_{l-1}}$ needs to be computed. The procedure is different depending on whether $\ell_l$ is a multiplier (case a. above) or a ReLU layer (case b. above). In particular, the previous formula gets specialized as follows:
case a. (linear multiplier layer)
$\dfrac{\partial E}{\partial x_{l-1}} = W_l\,\dfrac{\partial E}{\partial x_l}$. (47)
case b. (ReLU layer)
$\dfrac{\partial E}{\partial x_{l-1}} = \mathrm{Heaviside}(x_{l-1})^T \otimes \dfrac{\partial E}{\partial x_l}$. (48)
4. You have obtained $\frac{\partial E}{\partial x_{l-1}}$. You can now go back to step 2, considering layer $\ell_{l-1}$ instead. When you reach step 4 again, continue with layer $\ell_{l-2}$, and so forth, until you have computed the derivative with respect to the linear layer closest to the input.
[sketch: the backward pass annotated on a network sketch: $\frac{\partial E}{\partial x_l} = -(\tilde y - x_l)^T$; $\frac{\partial E}{\partial W_l} = \frac{\partial E}{\partial x_l}\,x_{l-1}$; $\frac{\partial E}{\partial x_{l-1}} = W_l\,\frac{\partial E}{\partial x_l}$; $\frac{\partial E}{\partial x_{l-2}} = \mathrm{Heaviside}(x_{l-2})^T \otimes \frac{\partial E}{\partial x_{l-1}}$; $\frac{\partial E}{\partial x_{l-3}} = \frac{\partial x_{l-2}}{\partial x_{l-3}}\,\frac{\partial E}{\partial x_{l-2}}$; $\frac{\partial E}{\partial W_{l-2}} = \frac{\partial E}{\partial x_{l-2}}\,x_{l-3}$]
Through this process you have obtained and stored the derivatives
$\dfrac{\partial E}{\partial W_l}, \dfrac{\partial E}{\partial W_{l-2}}, \dfrac{\partial E}{\partial W_{l-4}}, \ldots$ (49)
which you can use to update the weights. Consider that the weights are themselves matrices. It holds
$W_l \leftarrow W_l - \mu \left(\dfrac{\partial E}{\partial W_l}\right)^T$ (50)
$W_{l-2} \leftarrow W_{l-2} - \mu \left(\dfrac{\partial E}{\partial W_{l-2}}\right)^T$ (51)
$W_{l-4} \leftarrow W_{l-4} - \mu \left(\dfrac{\partial E}{\partial W_{l-4}}\right)^T$ (52)
$\ldots$ (53)
where $\mu$ is the learning rate. You now have the updated weights. You can proceed to the next datapoint or the next batch, until you reach the conclusion of the epoch.
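Putting the pieces together, one weight update could look as follows. This is only a sketch: it reuses the hypothetical routines forward, backward_linear, backward_relu and weight_gradient introduced above, and mu denotes the learning rate $\mu$:

def sgd_step(x0, y_true, weights, mu):
    # weights is the list of multiplier matrices [W_1, W_3, W_5, ...]
    xs = forward(x0, weights)                     # forward pass: x_0, ..., x_l
    grad = -(y_true - xs[-1]).T                   # step 1, Eq. (27): dE/dx_l, shape (N_l, b)
    grads_W = [None] * len(weights)
    for k in range(len(xs) - 1, 0, -1):
        if k % 2 == 1:                            # xs[k] is the output of a multiplier layer
            idx = (k - 1) // 2
            grads_W[idx] = weight_gradient(grad, xs[k - 1])   # Eqs. (43)/(45)
            grad = backward_linear(weights[idx], grad)        # Eqs. (30)/(47)
        else:                                     # xs[k] is the output of a ReLU layer
            grad = backward_relu(xs[k - 1], grad)             # Eqs. (33)/(48)
    for idx, gW in enumerate(grads_W):
        weights[idx] -= mu * gW.T                 # Eqs. (50)-(52): W <- W - mu (dE/dW)^T
    return xs[-1]

# example usage, with the shapes of the forward-pass sketch above:
# sgd_step(np.random.rand(1, 4), np.random.rand(1, 1), weights, mu=0.01)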
5.2 Exercise
Consider the network shown in Figure 4, consisting of a multiplier layer, followed by a ReLU layer,
followed by another multiplier layer. Let the weights of the network be
$W_{l-2} = \begin{pmatrix} 0.1 & 0 & 0.5 \\ 0.7 & -0.4 & 0.2 \end{pmatrix}, \qquad W_l = \begin{pmatrix} 1 \\ 1 \\ 0.5 \end{pmatrix}$. (54)
(a) Implement a forward pass of the network with the input $x_{l-3} = \begin{pmatrix} 0.5 & -0.25 \end{pmatrix}$. Show that the output of the network is $x_l = 0.2$ and store the intermediate values of the network.
[Figure 4: sketch of the exercise network, from the input $x_{l-3}$ through the multiplier $W_{l-2}$, the ReLU, and the multiplier $W_l$, up to the output $x_l$. The entries of $W_{l-2}$ are labelled $W_1, \ldots, W_6$ and those of $W_l$ are labelled $W_7, W_8, W_9$. The sketch also reports, for reference, $\frac{\partial E}{\partial W_{l-2}} = \begin{pmatrix} 0 & 0 \\ -0.4 & 0.2 \\ -0.2 & 0.1 \end{pmatrix}$.]
(b) Assuming that the ground truth is $\tilde y = 1$, compute the derivative of the loss with respect to the output of the network ($\frac{\partial E}{\partial x_l}$).
(c) Use the result of exercise (b) to compute the derivative of the loss with respect to the weights of the last layer ($\frac{\partial E}{\partial W_l}$) and the derivative of the loss with respect to the output of the second layer ($\frac{\partial E}{\partial x_{l-1}}$).
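For those who want to check their results numerically, here is a small NumPy sketch of parts (a)-(c) (the variable names are ours):

import numpy as np

W_lm2 = np.array([[0.1, 0.0, 0.5],
                  [0.7, -0.4, 0.2]])     # W_{l-2}, shape (2, 3)
W_l = np.array([[1.0], [1.0], [0.5]])    # W_l, shape (3, 1)
x_lm3 = np.array([[0.5, -0.25]])         # input x_{l-3}, shape (1, 2)
y_tilde = np.array([[1.0]])              # ground truth

# (a) forward pass, storing the intermediate values
x_lm2 = x_lm3 @ W_lm2                    # multiplier layer
x_lm1 = np.maximum(x_lm2, 0.0)           # ReLU layer
x_l = x_lm1 @ W_l                        # multiplier layer
assert abs(x_l[0, 0] - 0.2) < 1e-12      # the stated output

# (b) derivative of the loss with respect to the network output, Eq. (27)
grad_xl = -(y_tilde - x_l).T             # shape (1, 1)

# (c) derivatives with respect to W_l and x_{l-1}, Eqs. (45) and (47)
grad_Wl = grad_xl @ x_lm1                # shape (1, 3)
grad_xlm1 = W_l @ grad_xl                # shape (3, 1)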