TUM I2DL Matrix Derivatives
May 2, 2023
1 Affine layer
$$Y = XW + b \tag{1}$$
This is the matrix analogue of the familiar scalar line $y = ax + b$.
1.1 What is X?
For the affine layer, as phrased in (1), each input instance is flattened to be a row vector inside X. Let us
take a batch of 2 images from the MNIST dataset.
$$
X = \left[\;
\begin{bmatrix}
x^{1}_{11} & \dots & x^{1}_{18} \\
\vdots & \ddots & \vdots \\
x^{1}_{81} & \dots & x^{1}_{88}
\end{bmatrix},\;
\begin{bmatrix}
x^{2}_{11} & \dots & x^{2}_{18} \\
\vdots & \ddots & \vdots \\
x^{2}_{81} & \dots & x^{2}_{88}
\end{bmatrix}
\;\right]
\;\rightarrow\;
\begin{bmatrix}
x^{1}_{11} & \dots & x^{1}_{18} & x^{1}_{21} & \dots & x^{1}_{88} \\
x^{2}_{11} & \dots & x^{2}_{18} & x^{2}_{21} & \dots & x^{2}_{88}
\end{bmatrix}
\tag{2}
$$
Here, the batch shape is 2 × 1 × 8 × 8 (2 images, 1 channel, 8 × 8 pixels), so the flattened X has shape 2 × 64.
Question: What if we had 3-channel RGB images?
Answer: The images are flattened in the same way, row by row and channel by channel. The actual order does not matter, but it is important that it remains consistent across all input instances, so that the weights correspond to the correct entries.
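As a sketch of how this flattening might look in NumPy (the 2 × 1 × 8 × 8 shape is taken from the example above; the array names and random values are purely illustrative):

```python
import numpy as np

# A hypothetical batch of 2 grayscale 8x8 images, shape (N, C, H, W) = (2, 1, 8, 8).
batch = np.random.randn(2, 1, 8, 8)

# Flatten each instance into one row: X has shape (2, 64).
X = batch.reshape(batch.shape[0], -1)
print(X.shape)  # (2, 64)

# For 3-channel RGB images the call is identical; only the row length changes.
rgb_batch = np.random.randn(2, 3, 8, 8)
X_rgb = rgb_batch.reshape(rgb_batch.shape[0], -1)
print(X_rgb.shape)  # (2, 192)
```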
1.2 What is W?
1.3 Notes
• Note: It is not a linear function, even though we often treat it as if it were one. Why not? It does not follow the rules of linearity, namely
  f(x + y) = f(x) + f(y)
  or
  f(ax) = af(x),
  because the bias term b is added only once, regardless of the input. (A quick numeric check follows these notes.)
• The layer can equivalently be written with column vectors, y = Wx + b, which calculates the exact same thing but results in a column vector and not a row vector: the weight vectors of W are now rows and the inputs x are now columns. It is just a matter of how we construct our inputs and weights.
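Here is a minimal numeric check of the linearity point above, assuming an affine map x ↦ xW + b with arbitrary random W and b (none of these values come from the course material):

```python
import numpy as np

rng = np.random.default_rng(0)
W = rng.standard_normal((4, 3))
b = rng.standard_normal(3)

def f(x):
    # Affine layer for a single row-vector input x.
    return x @ W + b

x, y = rng.standard_normal(4), rng.standard_normal(4)

# f(x + y) != f(x) + f(y): b is added once on the left, twice on the right.
print(np.allclose(f(x + y), f(x) + f(y)))  # False
# f(a * x) != a * f(x) for the same reason (b is not scaled on the left).
print(np.allclose(f(2 * x), 2 * f(x)))     # False
# With b = 0, both properties hold, i.e. x -> xW alone is linear.
```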
2 Derivatives
Figure 2: A neural network computational graph. Note: although we always deal with batches of inputs, in the sketch the input layer represents only one input instance (e.g. one flattened image). Each colour represents a different weight column vector in W. Also, each neuron in the input layer (as for any neuron in the network) will collect the gradients flowing along the colourful edges attached to it.
• Gradient:
$$
f : \mathbb{R}^{n \times m} \to \mathbb{R},\; x \in \mathbb{R}^{n \times m},\qquad
\frac{\partial f}{\partial x} =
\begin{bmatrix}
\frac{\partial f}{\partial x_{1,1}} & \dots & \frac{\partial f}{\partial x_{1,m}} \\
\vdots & \ddots & \vdots \\
\frac{\partial f}{\partial x_{n,1}} & \dots & \frac{\partial f}{\partial x_{n,m}}
\end{bmatrix}
\in \mathbb{R}^{n \times m}
\tag{6}
$$
• Jacobian:
$$
f : \mathbb{R}^{n} \to \mathbb{R}^{m},\;
f\!\left(\begin{bmatrix} x_1 \\ \vdots \\ x_n \end{bmatrix}\right) =
\begin{bmatrix} f_1 \\ \vdots \\ f_m \end{bmatrix},\qquad
\frac{\partial f}{\partial x} =
\begin{bmatrix}
\frac{\partial f_1}{\partial x_1} & \dots & \frac{\partial f_1}{\partial x_n} \\
\vdots & \ddots & \vdots \\
\frac{\partial f_m}{\partial x_1} & \dots & \frac{\partial f_m}{\partial x_n}
\end{bmatrix}
\in \mathbb{R}^{m \times n}
\tag{7}
$$
• Note that if x were a row vector and so was the function's image (result), then this Jacobian matrix would be transposed. (A small numeric sketch of these shapes follows this list.)
• An ugly Jacobian (a tensor, i.e. a multidimensional matrix):
$$
f : \mathbb{R}^{n \times m} \to \mathbb{R}^{n},\;
f\!\left(\begin{bmatrix}
w_{11} & \dots & w_{1m} \\
\vdots & \ddots & \vdots \\
w_{n1} & \dots & w_{nm}
\end{bmatrix}\right) =
\begin{bmatrix} f_1 \\ \vdots \\ f_n \end{bmatrix},\qquad
\frac{\partial f}{\partial w} =
\begin{bmatrix}
\begin{bmatrix}
\frac{\partial f_1}{\partial w_{11}} & \dots & \frac{\partial f_1}{\partial w_{1m}} \\
\vdots & \ddots & \vdots \\
\frac{\partial f_1}{\partial w_{n1}} & \dots & \frac{\partial f_1}{\partial w_{nm}}
\end{bmatrix},
\;\dots,\;
\begin{bmatrix}
\frac{\partial f_n}{\partial w_{11}} & \dots & \frac{\partial f_n}{\partial w_{1m}} \\
\vdots & \ddots & \vdots \\
\frac{\partial f_n}{\partial w_{n1}} & \dots & \frac{\partial f_n}{\partial w_{nm}}
\end{bmatrix}
\end{bmatrix}
\tag{8}
$$
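To make the shapes in (6) and (7) concrete, here is a small finite-difference sketch; the test functions (a sum of squares and a matrix-vector product) are illustrative choices of mine, not anything from the course material:

```python
import numpy as np

def num_grad(f, x, eps=1e-6):
    """Finite-difference gradient of a scalar-valued f at x (same shape as x)."""
    g = np.zeros_like(x)
    it = np.nditer(x, flags=['multi_index'])
    for _ in it:
        idx = it.multi_index
        x_p, x_m = x.copy(), x.copy()
        x_p[idx] += eps
        x_m[idx] -= eps
        g[idx] = (f(x_p) - f(x_m)) / (2 * eps)
    return g

# Gradient as in (6): f maps an n x m matrix to a scalar, so df/dx has shape n x m.
X = np.random.randn(2, 3)
f = lambda M: np.sum(M ** 2)               # analytic gradient is 2X
print(np.allclose(num_grad(f, X), 2 * X))  # True (up to numerical error)

# Jacobian as in (7): f(x) = A x maps R^3 to R^4, so the Jacobian is A, shape 4 x 3.
A = np.random.randn(4, 3)
x = np.random.randn(3)
J = np.stack([num_grad(lambda v, i=i: (A @ v)[i], x) for i in range(4)])
print(J.shape, np.allclose(J, A))          # (4, 3) True
```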
According to the chain rule, the derivative of the loss value L with respect to the weight matrix W of our current affine layer would be:
$$
\frac{\partial L}{\partial W} =
\frac{\partial L}{\partial \sigma(Y)} \oplus \frac{\partial \sigma(Y)}{\partial Y} \oplus \frac{\partial Y}{\partial W}
\tag{10}
$$
In this case, $\frac{\partial L}{\partial \sigma(Y)} \frac{\partial \sigma(Y)}{\partial Y}$ is what we call dout, or the upstream gradient, and we assume it has already been calculated before reaching our current scope (according to the relevant functions, of course). It is sent to our current scope, combined as part of the chain rule, and then sent up the stream to the next layer.
Also, in scalar derivatives ⊕ represents a simple multiplication. In multidimensional derivatives, however, ⊕ represents some unknown operation, which we need to figure out.
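As a toy sketch of this flow in the scalar case, where ⊕ really is plain multiplication (the tiny network and its numbers below are made up for illustration, not the course code):

```python
import numpy as np

# Forward pass: x -> y = w*x + b -> s = sigmoid(y) -> L = s**2
w, b, x = 2.0, 0.5, 1.5
y = w * x + b
s = 1.0 / (1.0 + np.exp(-y))
L = s ** 2

# Backward pass: each step multiplies its local derivative into the upstream gradient.
dL_ds = 2 * s                   # upstream gradient arriving at the sigmoid
dL_dy = dL_ds * s * (1 - s)     # this is the "dout" arriving at the affine layer
dL_dw = dL_dy * x               # gradient handed to the weight
dL_dx = dL_dy * w               # gradient sent further upstream

# Finite-difference check of dL/dw.
eps = 1e-6
def loss(w_):
    s_ = 1.0 / (1.0 + np.exp(-(w_ * x + b)))
    return s_ ** 2
print(np.isclose((loss(w + eps) - loss(w - eps)) / (2 * eps), dL_dw))  # True
```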
3 Stanford Article
Given a loss function Loss(Y) = L, we want to calculate $\frac{\partial L}{\partial X}$ or $\frac{\partial L}{\partial W}$.
As seen in (6), the derivative of a scalar by a matrix is a gradient/Jacobian matrix that has the same shape as the input. Moreover, we saw in (9) that the final derivative of the loss value L by any entry of any matrix in the whole neural network is just a scalar. For a better understanding, we can look at the computational graph of the network in Figure 2 and see that each neuron collects and sums the upstream derivatives (from the loss back to it) of every computation it took part in during the forward pass.
$$
\frac{\partial L}{\partial Y} =
\begin{bmatrix}
\frac{\partial L}{\partial y_{1,1}} & \frac{\partial L}{\partial y_{1,2}} & \frac{\partial L}{\partial y_{1,3}} \\
\frac{\partial L}{\partial y_{2,1}} & \frac{\partial L}{\partial y_{2,2}} & \frac{\partial L}{\partial y_{2,3}}
\end{bmatrix}
\tag{13}
$$
So, from (6) we know that the gradient with respect to Y will have the same shape as Y, because L is a scalar, and it is calculated as a part of the chain rule. This is the abstraction notion discussed above.
Let's differentiate with respect to W. Eventually, after the chain rule, the derivative with respect to W will have the same shape as W:
$$
\frac{\partial L}{\partial W} =
\begin{bmatrix}
\frac{\partial L}{\partial w_{1,1}} & \frac{\partial L}{\partial w_{1,2}} & \frac{\partial L}{\partial w_{1,3}} \\
\frac{\partial L}{\partial w_{2,1}} & \frac{\partial L}{\partial w_{2,2}} & \frac{\partial L}{\partial w_{2,3}}
\end{bmatrix}
\tag{14}
$$
Now, this is important. We do not (!!) want to calculate the Jacobians. For a better explanation why, refer to the attached article. We have also learned that each entry of $\frac{\partial L}{\partial W}$ is a scalar that is computed as in (9).
So let's divide and conquer. It is always a better practice, because it is hard to wrap our minds around something bigger than scalars.
$$
\frac{\partial L}{\partial w_{11}} = \sum_{i=1}^{2} \sum_{j=1}^{3} \frac{\partial L}{\partial y_{ij}} \frac{\partial y_{ij}}{\partial w_{11}}
\tag{15}
$$
For better visualization, we can look at this as a "dot product": an elementwise multiplication followed by a summation of all cells (not what we know as np.dot(); that naming is confusing). Remember: when deriving a function, we differentiate with respect to the input variable (at least one of them):
$$
\frac{\partial L}{\partial w_{11}} =
\frac{\partial L}{\partial Y} \cdot \frac{\partial Y}{\partial w_{11}} =
\begin{bmatrix}
\frac{\partial L}{\partial y_{1,1}} & \frac{\partial L}{\partial y_{1,2}} & \frac{\partial L}{\partial y_{1,3}} \\
\frac{\partial L}{\partial y_{2,1}} & \frac{\partial L}{\partial y_{2,2}} & \frac{\partial L}{\partial y_{2,3}}
\end{bmatrix}
\cdot
\begin{bmatrix}
\frac{\partial y_{1,1}}{\partial w_{1,1}} & \frac{\partial y_{1,2}}{\partial w_{1,1}} & \frac{\partial y_{1,3}}{\partial w_{1,1}} \\
\frac{\partial y_{2,1}}{\partial w_{1,1}} & \frac{\partial y_{2,2}}{\partial w_{1,1}} & \frac{\partial y_{2,3}}{\partial w_{1,1}}
\end{bmatrix}
\tag{16}
$$
$$
\frac{\partial L}{\partial W} = X^{T} \cdot \frac{\partial L}{\partial Y} =
\begin{bmatrix}
x_{1,1} & x_{2,1} \\
x_{1,2} & x_{2,2}
\end{bmatrix}
\cdot
\begin{bmatrix}
\frac{\partial L}{\partial y_{1,1}} & \frac{\partial L}{\partial y_{1,2}} & \frac{\partial L}{\partial y_{1,3}} \\
\frac{\partial L}{\partial y_{2,1}} & \frac{\partial L}{\partial y_{2,2}} & \frac{\partial L}{\partial y_{2,3}}
\end{bmatrix}
\tag{20}
$$
We could, of course, do the exact same thing in order to derive X, and we will see that:
$$
\frac{\partial L}{\partial X} = \frac{\partial L}{\partial Y} \cdot W^{T}
\tag{21}
$$
Note: This is only true because L is a scalar. If we just looked at Y = XW, then $\frac{\partial Y}{\partial W}$ would be a Jacobian.
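Putting (15), (16), (20) and (21) together as a NumPy sketch (the shapes N = 2, D = 2, M = 3 match the matrices above; dout stands for ∂L/∂Y and is filled with random values here):

```python
import numpy as np

rng = np.random.default_rng(0)
N, D, M = 2, 2, 3
X = rng.standard_normal((N, D))
W = rng.standard_normal((D, M))
dout = rng.standard_normal((N, M))   # upstream gradient dL/dY, same shape as Y

# (20): dL/dW = X^T . dL/dY   -> shape (D, M), same as W
dW = X.T @ dout
# (21): dL/dX = dL/dY . W^T   -> shape (N, D), same as X
dX = dout @ W.T

# Entry-wise check of (15)/(16) for w_{1,1}: elementwise product of dL/dY with
# dY/dw_{1,1}, summed over all cells. Since y_{i,j} = sum_k x_{i,k} w_{k,j},
# dy_{i,j}/dw_{1,1} equals x_{i,1} when j == 1 and 0 otherwise.
dY_dw11 = np.zeros((N, M))
dY_dw11[:, 0] = X[:, 0]
print(np.isclose(np.sum(dout * dY_dw11), dW[0, 0]))  # True
```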
3.1 What about the bias b term in the affine layer?
We could, of course, do the trick of merging it into X and W, as we saw in the lecture.
If not:
1.
$$
Y = XW + b, \qquad X \in \mathbb{R}^{N \times D},\; W \in \mathbb{R}^{D \times M},\; b \in \mathbb{R}^{1 \times M},\; XW \in \mathbb{R}^{N \times M}
$$
That means that each $b_i$ in b corresponds to one feature in a row of XW. But to add a 1 × M vector to an N × M matrix like that is, strictly speaking, impossible mathematically, right?
• NumPy uses broadcasting to duplicate the row vector b into a matrix $B_{N \times M}$, simply copying b N times and stacking the copies along the rows (the 0-axis). But that is just the programming implementation.
• Mathematically speaking, it is not Y = XW + b but Y = XW + 1_N b, where 1_N is an N × 1 column vector of ones, such that
$$
1_N b =
\begin{bmatrix} 1 \\ \vdots \\ 1 \end{bmatrix}
\begin{bmatrix} b_1 & \dots & b_M \end{bmatrix} =
\begin{bmatrix}
b_1 & \dots & b_M \\
\vdots & \ddots & \vdots \\
b_1 & \dots & b_M
\end{bmatrix}_{N \times M}
$$
That gives us the broadcast that Python does by itself, and allows us to actually add those matrices together.
Now, one can simply follow the exact same paradigm that we've shown above to solve for b, or we can just look at 1_N b as another XW and do the exact same thing as we did for XW, where 1_N plays the role of X and b plays the role of W.
2. We see that the derivative $\frac{\partial L}{\partial b}$ is
$$
\frac{\partial L}{\partial b} = (1_N)^{T} \cdot \frac{\partial L}{\partial Y} =
\left[\, \sum_{i=1}^{N} \frac{\partial L}{\partial y_{i,1}},\; \dots,\; \sum_{i=1}^{N} \frac{\partial L}{\partial y_{i,M}} \,\right]
$$
which in NumPy translates into np.sum(dout, axis=0).
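A short NumPy sketch of both points, with arbitrary shapes and random values for b and dout:

```python
import numpy as np

N, M = 4, 3
rng = np.random.default_rng(0)
XW = rng.standard_normal((N, M))
b = rng.standard_normal(M)
ones = np.ones((N, 1))

# Broadcasting XW + b is the same as explicitly adding 1_N b.
print(np.allclose(XW + b, XW + ones @ b.reshape(1, M)))  # True

# dL/db = (1_N)^T . dL/dY, i.e. a column-wise sum of the upstream gradient.
dout = rng.standard_normal((N, M))
db = ones.T @ dout                            # shape (1, M)
print(np.allclose(db, np.sum(dout, axis=0)))  # True
```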
4 Exercise
Let Y = Affine(X), followed by a sigmoid layer and a loss L.
1. Show a solution to compute the gradient of the Sigmoid layer w.r.t. the upstream gradient.
2. Replace the sigmoid layer with $f(x) = x^2$. For the following input X and weight matrix W:
$$
X = \begin{bmatrix} 0 & 1 \\ 2 & 3 \end{bmatrix}_{2 \times 2}, \qquad
W = \begin{bmatrix} 1 & 1 \\ 1 & 1 \end{bmatrix}_{2 \times 2},
$$
$$
Y = XW, \qquad \mathrm{Loss}(X) = \sum_{i}^{N} x_i \quad \text{(the sum of all entries of its input)}
$$
Find:
(a) the loss value
(b) $\frac{\partial L}{\partial f}$
(c) $\frac{\partial L}{\partial y}$
(d) $\frac{\partial L}{\partial X}$
(e) $\frac{\partial L}{\partial W}$
Solutions:
(a) $52$
(b) $\begin{bmatrix} 1 & 1 \\ 1 & 1 \end{bmatrix}$
(c) $\begin{bmatrix} 2 & 2 \\ 10 & 10 \end{bmatrix}$
(d) $\begin{bmatrix} 4 & 4 \\ 20 & 20 \end{bmatrix}$
(e) $\begin{bmatrix} 20 & 20 \\ 32 & 32 \end{bmatrix}$
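The solutions can be checked with a few lines of NumPy, following the formulas derived in Section 3. This is a verification sketch (assuming no bias term, which matches the given solutions), not the official exercise code:

```python
import numpy as np

X = np.array([[0., 1.], [2., 3.]])
W = np.array([[1., 1.], [1., 1.]])

Y = X @ W                 # affine layer (no bias in this exercise)
f = Y ** 2                # exercise 2 replaces the sigmoid with f(x) = x^2
L = np.sum(f)             # (a) 52.0

dL_df = np.ones_like(f)   # (b) [[1, 1], [1, 1]]
dL_dY = dL_df * 2 * Y     # (c) [[2, 2], [10, 10]]
dL_dX = dL_dY @ W.T       # (d) [[4, 4], [20, 20]]
dL_dW = X.T @ dL_dY       # (e) [[20, 20], [32, 32]]
print(L, dL_df, dL_dY, dL_dX, dL_dW, sep="\n")

# Exercise 1: with the sigmoid kept in place, its backward pass would be
# dL_dY = dout * s * (1 - s), where s = sigmoid(Y) and dout is the upstream gradient.
```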