
TUM - I2DL - Matrix derivatives

Dan Halperin - Tutor

May 2, 2023

1 Affine layer

y = XW + b (1)

where $X \in \mathbb{R}^{N \times D}$, $W \in \mathbb{R}^{D \times M}$, $b \in \mathbb{R}^{1 \times M}$.


A well-known special case is the 1-dimensional line equation:

y = ax + b

1.1 What is X?

• In the affine layer context, the matrix X is considered to be the input.


• In neural networks, we almost always refer to it as a batch of input elements (e.g. images).
• In some deep learning applications (e.g. "style transfer"), it is also trained by backpropagation.
• Besides being the input of each layer of the network, X is also the output of the previous layer.
Let us take, for example, one input instance (image) from the MNIST handwritten digits dataset. Each
grayscale image in this dataset is a 1 × 8 × 8 tensor: 1 for the channel, 8 for the height and 8 for the width.

Figure 1: MNIST handwritten 8 × 8 image of the digit 0

For the affine layer, as phrased in (1), each input instance is flattened to be a row vector inside X. Let us
take a batch of 2 images from the MNIST dataset.

  
$$
X = \begin{pmatrix} x^1_{11} & \dots & x^1_{18} \\ \vdots & \ddots & \vdots \\ x^1_{81} & \dots & x^1_{88} \end{pmatrix},\;
\begin{pmatrix} x^2_{11} & \dots & x^2_{18} \\ \vdots & \ddots & \vdots \\ x^2_{81} & \dots & x^2_{88} \end{pmatrix}
\;\rightarrow\;
\begin{pmatrix} x^1_{11}, & \dots, & x^1_{18}, & x^1_{21}, & \dots, & x^1_{88} \\ x^2_{11}, & \dots, & x^2_{18}, & x^2_{21}, & \dots, & x^2_{88} \end{pmatrix}
\tag{2}
$$
Here, the batch shape is 2 × 1 × 8 × 8
Question: What if we had 3-channel RGB images?
Answer: The images are flattened in the same way, row by row and channel by channel. The actual order doesn't
matter, but it is important that it remains consistent across all input instances, so that the weights
correspond to the correct entries. A small NumPy sketch of this flattening follows below.

1.2 What is W?

• The coefficient matrix.


• In a learning model, its entries are the learnable weights, and they are modified during the backpropagation
step.
• If in X each row represents one input inside the batch, then in W each column holds the weights connecting
all input neurons (cells in the input vector) to one neuron in the next layer, as can be seen in Figure 2.

1.3 Notes

• Note: It is not a linear function, although we often treat it as an approximation of one. Why not? Because
of the bias b, it doesn't follow the rules of linearity, where
f (x + y) = f (x) + f (y)
or
f (ax) = af (x)

• Another common notation of an affine layer is
$$ y = W^T x + b^T = (x^T W + b)^T \tag{3} $$
which calculates exactly the same thing, but results in a column vector and not a row. The weight vectors
(the rows of $W^T$) are now row vectors and the input $x$ is now a column vector. It is just a matter of how
we construct our inputs and weights; a small numerical check of this equivalence follows below.
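As a quick numerical sanity check (a sketch of ours, with arbitrary small shapes), the row convention (1) and the column convention (3) compute the same numbers, just transposed:

import numpy as np

rng = np.random.default_rng(0)
N, D, M = 2, 4, 3
X = rng.normal(size=(N, D))   # batch of row-vector inputs
W = rng.normal(size=(D, M))   # columns are the weight vectors
b = rng.normal(size=(1, M))   # row-vector bias

Y_row = X @ W + b             # equation (1): row convention
Y_col = W.T @ X.T + b.T       # column convention, as in (3)

print(np.allclose(Y_row, Y_col.T))  # True: same result, transposed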

2 Derivatives

Figure 2: A neural network computational graph. Note: Although we always deal with batches of inputs,
in the sketch the input layer represents only one input instance (e.g. one flattened image). Each colour
represents a different weight column vector in $W$. Also, each neuron in the input layer (as is true for any
neuron in the network) will collect the gradients flowing along the colourful edges that are attached to it.

2.1 What is a gradient

It all depends on the function!



• Derivative:
$$ f: \mathbb{R} \to \mathbb{R},\quad x \in \mathbb{R},\quad \frac{\partial f}{\partial x} = a \in \mathbb{R} \tag{4} $$

• Gradient:
$$ f: \mathbb{R}^n \to \mathbb{R},\quad x \in \mathbb{R}^n,\quad \frac{\partial f}{\partial x} = \begin{pmatrix} \frac{\partial f}{\partial x_1} \\ \vdots \\ \frac{\partial f}{\partial x_n} \end{pmatrix} = \begin{pmatrix} a_1 \\ \vdots \\ a_n \end{pmatrix} \in \mathbb{R}^n \tag{5} $$

• Gradient:
$$ f: \mathbb{R}^{n \times m} \to \mathbb{R},\quad x \in \mathbb{R}^{n \times m},\quad \frac{\partial f}{\partial x} = \begin{pmatrix} \frac{\partial f}{\partial x_{1,1}} & \dots & \frac{\partial f}{\partial x_{1,m}} \\ \vdots & \ddots & \vdots \\ \frac{\partial f}{\partial x_{n,1}} & \dots & \frac{\partial f}{\partial x_{n,m}} \end{pmatrix} \in \mathbb{R}^{n \times m} \tag{6} $$

• Jacobian:
$$ f: \mathbb{R}^n \to \mathbb{R}^m,\quad f\!\begin{pmatrix} x_1 \\ \vdots \\ x_n \end{pmatrix} = \begin{pmatrix} f_1 \\ \vdots \\ f_m \end{pmatrix},\quad \frac{\partial f}{\partial x} = \begin{pmatrix} \frac{\partial f_1}{\partial x_1} & \dots & \frac{\partial f_1}{\partial x_n} \\ \vdots & \ddots & \vdots \\ \frac{\partial f_m}{\partial x_1} & \dots & \frac{\partial f_m}{\partial x_n} \end{pmatrix} \in \mathbb{R}^{m \times n} \tag{7} $$

• Note that if $x$ were a row vector, and so was the function's image (the result), then this Jacobian matrix
would be transposed.

• An ugly Jacobian (a tensor, i.e. a multidimensional matrix):
$$ f: \mathbb{R}^{n \times m} \to \mathbb{R}^n,\quad f\!\begin{pmatrix} w_{11} & \dots & w_{1m} \\ \vdots & \ddots & \vdots \\ w_{n1} & \dots & w_{nm} \end{pmatrix} = \begin{pmatrix} f_1 \\ \vdots \\ f_n \end{pmatrix},\quad \frac{\partial f}{\partial w} = \begin{pmatrix} \begin{pmatrix} \frac{\partial f_1}{\partial w_{11}} & \dots & \frac{\partial f_1}{\partial w_{1m}} \\ \vdots & \ddots & \vdots \\ \frac{\partial f_1}{\partial w_{n1}} & \dots & \frac{\partial f_1}{\partial w_{nm}} \end{pmatrix} \\ \vdots \\ \begin{pmatrix} \frac{\partial f_n}{\partial w_{11}} & \dots & \frac{\partial f_n}{\partial w_{1m}} \\ \vdots & \ddots & \vdots \\ \frac{\partial f_n}{\partial w_{n1}} & \dots & \frac{\partial f_n}{\partial w_{nm}} \end{pmatrix} \end{pmatrix} \tag{8} $$

• What is a gradient? It is the derivative of a scalar-valued differentiable function with respect to a vector
or matrix input.
• Super important: Neural networks in general can take any shape of input, but they all end in a loss
function that gives a scalar $L \in \mathbb{R}$. That means:
– In the backpropagation step, the derivative of the loss with respect to a learnable weight $w_{u,v}$ is
calculated as a scalar derivative:
$$ \frac{\partial L}{\partial w_{u,v}} = \sum_i \sum_j \frac{\partial L}{\partial y_{i,j}} \frac{\partial y_{i,j}}{\partial w_{u,v}} \tag{9} $$
where $i, j$ run over the rows and columns of $\frac{\partial L}{\partial Y}$ in some function that uses $w$, such as
$Y = f(X) = XW + b$.
– In the neural network backpropagation algorithm, we look at only one layer at a time, as an abstraction.
We do not try to think of the entire network at once, but proceed step by step. Example:
Toy-Network:
∗ Affine()
∗ ReLU()
∗ Affine()
∗ Sigmoid()
∗ Loss()
When we think about how to differentiate the current Affine() layer, we observe it as if it were a
function with its own scope.

Figure 3: Scope of a function

According to the chain rule, the derivative of the loss value $L$ with respect to the weight matrix $W$ of our
current affine layer would be:
$$ \frac{\partial L}{\partial W} = \frac{\partial L}{\partial \sigma(Y)} \oplus \frac{\partial \sigma(Y)}{\partial Y} \oplus \frac{\partial Y}{\partial W} \tag{10} $$
In this case, $\frac{\partial L}{\partial \sigma(Y)} \frac{\partial \sigma(Y)}{\partial Y}$ is what we call dout, or the upstream gradient, and we assume it has already
been calculated (inside its own scopes, according to the relevant functions, of course). It is then handed to
our current scope, combined as part of the chain rule, and sent further upstream to the next layer.
Also, for scalar derivatives $\oplus$ is just a simple multiplication. For multidimensional derivatives, however,
$\oplus$ represents some unknown operation which we need to figure out.
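Since everything above reduces to derivatives of a scalar $L$, here is a small finite-difference sketch (ours, not from the handout; the function g is an arbitrary example) showing that the gradient of a scalar-valued function has exactly the shape of its matrix input, as stated in (6):

import numpy as np

def g(x):
    # scalar-valued function of a matrix input, g: R^{n x m} -> R
    return np.sum(x ** 2)

def numerical_gradient(f, x, eps=1e-6):
    # central differences, one entry at a time; the result has the shape of x
    grad = np.zeros_like(x)
    for idx in np.ndindex(*x.shape):
        x_plus, x_minus = x.copy(), x.copy()
        x_plus[idx] += eps
        x_minus[idx] -= eps
        grad[idx] = (f(x_plus) - f(x_minus)) / (2 * eps)
    return grad

x = np.array([[1.0, 2.0], [3.0, 4.0]])
grad = numerical_gradient(g, x)
print(grad.shape)                # (2, 2), same shape as x, cf. (6)
print(np.allclose(grad, 2 * x))  # True: analytic gradient of sum(x^2) is 2x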

3 Stanford Article

• Original article by Stanford (cs231n): Link


Let’s follow their example:
   
$$ X = \begin{pmatrix} x_{1,1} & x_{1,2} \\ x_{2,1} & x_{2,2} \end{pmatrix}_{2 \times 2} \qquad W = \begin{pmatrix} w_{1,1} & w_{1,2} & w_{1,3} \\ w_{2,1} & w_{2,2} & w_{2,3} \end{pmatrix}_{2 \times 3} \tag{11} $$

$$ Y = XW = \begin{pmatrix} x_{1,1}w_{1,1}+x_{1,2}w_{2,1} & x_{1,1}w_{1,2}+x_{1,2}w_{2,2} & x_{1,1}w_{1,3}+x_{1,2}w_{2,3} \\ x_{2,1}w_{1,1}+x_{2,2}w_{2,1} & x_{2,1}w_{1,2}+x_{2,2}w_{2,2} & x_{2,1}w_{1,3}+x_{2,2}w_{2,3} \end{pmatrix} \tag{12} $$

Given a loss function $\mathrm{Loss}(Y) = L$, we want to calculate $\frac{\partial L}{\partial X}$ or $\frac{\partial L}{\partial W}$.
As seen in (6), the derivative of a scalar by a matrix is a gradient / Jacobian matrix with the same shape
as the input. Moreover, we saw in (9) that the final derivative of the loss value $L$ by any entry of any matrix
in the whole neural network is just a scalar. For a better understanding we can look at the computational
graph of the network in Figure 2 and see clearly that each neuron collects and sums the upstream derivatives
(flowing from the loss back to it) of every computation it took part in during the forward pass.

$$ \frac{\partial L}{\partial Y} = \begin{pmatrix} \frac{\partial L}{\partial y_{1,1}} & \frac{\partial L}{\partial y_{1,2}} & \frac{\partial L}{\partial y_{1,3}} \\ \frac{\partial L}{\partial y_{2,1}} & \frac{\partial L}{\partial y_{2,2}} & \frac{\partial L}{\partial y_{2,3}} \end{pmatrix} \tag{13} $$

So, from (6) we know that the gradient of $Y$ will have the same shape as $Y$, because $L$ is a scalar, and it is
calculated as a part of the chain rule. This is the abstraction notion discussed above.
Let's differentiate with respect to $W$. Eventually, after the chain rule, this derivative will have the same shape as $W$:
$$ \frac{\partial L}{\partial W} = \begin{pmatrix} \frac{\partial L}{\partial w_{1,1}} & \frac{\partial L}{\partial w_{1,2}} & \frac{\partial L}{\partial w_{1,3}} \\ \frac{\partial L}{\partial w_{2,1}} & \frac{\partial L}{\partial w_{2,2}} & \frac{\partial L}{\partial w_{2,3}} \end{pmatrix} \tag{14} $$

Now, this is important. We do not (!!) want to calculate the Jacobians. For a better explanation of why,
refer to the attached article. We have also learned that each entry of $\frac{\partial L}{\partial W}$ is a scalar that is computed as in
(9).
So let's divide and conquer. It is always the better practice, because it is hard to wrap our minds around
anything bigger than scalars.

$$ \frac{\partial L}{\partial w_{1,1}} = \sum_{i=1}^{2}\sum_{j=1}^{3} \frac{\partial L}{\partial y_{i,j}} \frac{\partial y_{i,j}}{\partial w_{1,1}} \tag{15} $$

For better visualization, we can look at this as a dot product, meaning elementwise multiplication followed
by summation of all cells (not what we know as np.dot(); this is confusing). Remember: when differentiating
a function, we differentiate with respect to (at least one of) its input variables:

$$ \frac{\partial L}{\partial w_{1,1}} = \frac{\partial L}{\partial Y} \cdot \frac{\partial Y}{\partial w_{1,1}} = \begin{pmatrix} \frac{\partial L}{\partial y_{1,1}} & \frac{\partial L}{\partial y_{1,2}} & \frac{\partial L}{\partial y_{1,3}} \\ \frac{\partial L}{\partial y_{2,1}} & \frac{\partial L}{\partial y_{2,2}} & \frac{\partial L}{\partial y_{2,3}} \end{pmatrix} \cdot \begin{pmatrix} \frac{\partial y_{1,1}}{\partial w_{1,1}} & \frac{\partial y_{1,2}}{\partial w_{1,1}} & \frac{\partial y_{1,3}}{\partial w_{1,1}} \\ \frac{\partial y_{2,1}}{\partial w_{1,1}} & \frac{\partial y_{2,2}}{\partial w_{1,1}} & \frac{\partial y_{2,3}}{\partial w_{1,1}} \end{pmatrix} \tag{16} $$

If we go back to (12), we get:

$$ \frac{\partial L}{\partial w_{1,1}} = \begin{pmatrix} \frac{\partial L}{\partial y_{1,1}} & \frac{\partial L}{\partial y_{1,2}} & \frac{\partial L}{\partial y_{1,3}} \\ \frac{\partial L}{\partial y_{2,1}} & \frac{\partial L}{\partial y_{2,2}} & \frac{\partial L}{\partial y_{2,3}} \end{pmatrix} \cdot \begin{pmatrix} x_{1,1} & 0 & 0 \\ x_{2,1} & 0 & 0 \end{pmatrix} \tag{17} $$

Now let's perform this dot product, and we get:

$$ \frac{\partial L}{\partial w_{1,1}} = \frac{\partial L}{\partial y_{1,1}} x_{1,1} + \frac{\partial L}{\partial y_{2,1}} x_{2,1} \tag{18} $$
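Before generalizing to every entry, here is a tiny NumPy check of this elementwise-multiply-then-sum step (the numbers in dL_dY are made up; note it is np.sum(a * b), not np.dot):

import numpy as np

dL_dY = np.array([[0.1, 0.2, 0.3],
                  [0.4, 0.5, 0.6]])      # upstream gradient, shape (2, 3)
x11, x21 = 1.0, 2.0
dY_dw11 = np.array([[x11, 0.0, 0.0],
                    [x21, 0.0, 0.0]])    # dY/dw_{1,1}, as in (17)

# "dot product" in the sense used here: elementwise product, then sum of all cells
dL_dw11 = np.sum(dL_dY * dY_dw11)
print(np.isclose(dL_dw11, dL_dY[0, 0] * x11 + dL_dY[1, 0] * x21))  # True, cf. (18)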

We can do that for every entry $w_{i,j}$ in $W$, and we get:

$$ \frac{\partial L}{\partial W} = \begin{pmatrix} \frac{\partial L}{\partial y_{1,1}}x_{1,1}+\frac{\partial L}{\partial y_{2,1}}x_{2,1} & \frac{\partial L}{\partial y_{1,2}}x_{1,1}+\frac{\partial L}{\partial y_{2,2}}x_{2,1} & \frac{\partial L}{\partial y_{1,3}}x_{1,1}+\frac{\partial L}{\partial y_{2,3}}x_{2,1} \\ \frac{\partial L}{\partial y_{1,1}}x_{1,2}+\frac{\partial L}{\partial y_{2,1}}x_{2,2} & \frac{\partial L}{\partial y_{1,2}}x_{1,2}+\frac{\partial L}{\partial y_{2,2}}x_{2,2} & \frac{\partial L}{\partial y_{1,3}}x_{1,2}+\frac{\partial L}{\partial y_{2,3}}x_{2,2} \end{pmatrix} \tag{19} $$

From this matrix, with a little experience, we can derive

$$ \frac{\partial L}{\partial W} = \begin{pmatrix} x_{1,1} & x_{2,1} \\ x_{1,2} & x_{2,2} \end{pmatrix} \cdot \begin{pmatrix} \frac{\partial L}{\partial y_{1,1}} & \frac{\partial L}{\partial y_{1,2}} & \frac{\partial L}{\partial y_{1,3}} \\ \frac{\partial L}{\partial y_{2,1}} & \frac{\partial L}{\partial y_{2,2}} & \frac{\partial L}{\partial y_{2,3}} \end{pmatrix} = X^T \cdot \frac{\partial L}{\partial Y} \tag{20} $$

We could, of course, do exactly the same thing in order to differentiate with respect to $X$, and we would see that:
$$ \frac{\partial L}{\partial X} = \frac{\partial L}{\partial Y} \cdot W^T \tag{21} $$

Note: This is only true because $L$ is a scalar. If we just looked at $Y = XW$, then $\frac{\partial Y}{\partial W}$ would be a Jacobian.
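A short numerical check of (20) and (21) (a sketch of ours; the loss L = sum(G * XW) is an arbitrary choice made so that the upstream gradient dL/dY is exactly the matrix G):

import numpy as np

rng = np.random.default_rng(1)
X = rng.normal(size=(2, 2))
W = rng.normal(size=(2, 3))
G = rng.normal(size=(2, 3))          # plays the role of dL/dY

def loss(X, W):
    # arbitrary scalar loss whose gradient w.r.t. Y = XW is exactly G
    return np.sum(G * (X @ W))

dW = X.T @ G                         # closed form (20)
dX = G @ W.T                         # closed form (21)

# finite-difference check of one entry of each gradient
eps = 1e-6
Wp = W.copy(); Wp[0, 0] += eps
Xp = X.copy(); Xp[0, 0] += eps
print(np.isclose((loss(X, Wp) - loss(X, W)) / eps, dW[0, 0]))  # True
print(np.isclose((loss(Xp, W) - loss(X, W)) / eps, dX[0, 0]))  # True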

3.1 What about the bias b term in the affine layer?

We could, of course, do the trick of merging it into $X$ and $W$, as we saw in the lecture.
If not:
1.
$$ Y = XW + b $$
where $X$ is $N \times D$, $W$ is $D \times M$, $b$ is $1 \times M$, and $XW$ is $N \times M$.
That means that each $b_i$ in $b$ corresponds to one feature (column) in a row of $XW$; but adding a
$1 \times M$ vector to an $N \times M$ matrix like that is, strictly speaking, not defined mathematically, right?
• NumPy uses broadcasting to duplicate the row vector $b$ into a matrix $B_{N \times M}$, by simply
copying $b$ $N$ times and stacking the copies along the rows (the 0-axis). But that's just the
programming implementation.
• Mathematically speaking, it's not $Y = XW + b$ but $Y = XW + \mathbf{1}_N b$, where $\mathbf{1}_N$ is a column
vector of ones, such that
$$ \mathbf{1}_N b = \begin{pmatrix} 1 \\ \vdots \\ 1 \end{pmatrix} \begin{pmatrix} b_1 & \dots & b_M \end{pmatrix} = \begin{pmatrix} b_1 & \dots & b_M \\ \vdots & \ddots & \vdots \\ b_1 & \dots & b_M \end{pmatrix}_{N \times M} $$
That gives us the broadcast that Python does by itself, and allows us to actually sum those
matrices together.
Now, one can simply follow the exact same paradigm that we've shown above to solve for $b$, or we
can just look at $\mathbf{1}_N b$ as another $XW$ and do exactly the same thing as we did for $XW$, where $\mathbf{1}_N$
plays the role of $X$ and $b$ plays the role of $W$.
2. We see that the derivative $\frac{\partial L}{\partial b}$ is
$$ \frac{\partial L}{\partial b} = (\mathbf{1}_N)^T \cdot \frac{\partial L}{\partial Y} = \left[ \sum_{i=1}^{N} \frac{\partial L}{\partial y_{i,1}}, \; \dots, \; \sum_{i=1}^{N} \frac{\partial L}{\partial y_{i,M}} \right] $$
which in NumPy translates into:
np.sum(dout, axis=0)
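Putting the three results together, here is a minimal NumPy sketch of an affine layer's forward and backward pass (the function names affine_forward / affine_backward and the cache convention are ours, not something fixed by the handout):

import numpy as np

def affine_forward(X, W, b):
    # X: (N, D), W: (D, M), b: (M,); broadcasting adds b to every row of XW
    Y = X @ W + b
    cache = (X, W)
    return Y, cache

def affine_backward(dout, cache):
    # dout = dL/dY, shape (N, M)
    X, W = cache
    dX = dout @ W.T            # equation (21)
    dW = X.T @ dout            # equation (20)
    db = np.sum(dout, axis=0)  # sums over the rows, as derived for 1_N b
    return dX, dW, db

# usage sketch
X = np.random.randn(2, 2)
W = np.random.randn(2, 3)
b = np.random.randn(3)
Y, cache = affine_forward(X, W, b)
dX, dW, db = affine_backward(np.ones_like(Y), cache)
print(dX.shape, dW.shape, db.shape)  # (2, 2) (2, 3) (3,)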

4 Exercise

Given a simple neural network as above:


Toy-Network:
• Affine()
• Sigmoid()
• Loss()
Given again that:
$$ X = \begin{pmatrix} x_{1,1} & x_{1,2} \\ x_{2,1} & x_{2,2} \end{pmatrix}_{2 \times 2} \qquad W = \begin{pmatrix} w_{1,1} & w_{1,2} & w_{1,3} \\ w_{2,1} & w_{2,2} & w_{2,3} \end{pmatrix}_{2 \times 3} \tag{22} $$

$$ Y = XW = \begin{pmatrix} x_{1,1}w_{1,1}+x_{1,2}w_{2,1} & x_{1,1}w_{1,2}+x_{1,2}w_{2,2} & x_{1,1}w_{1,3}+x_{1,2}w_{2,3} \\ x_{2,1}w_{1,1}+x_{2,2}w_{2,1} & x_{2,1}w_{1,2}+x_{2,2}w_{2,2} & x_{2,1}w_{1,3}+x_{2,2}w_{2,3} \end{pmatrix} \tag{23} $$

and $Y = \mathrm{Affine}(X)$.

1. Show a solution to compute the gradient of the Sigmoid layer, w.r.t. the upstream gradient.
2. Replace the Sigmoid layer with $f(x) = x^2$. For the following input $X$ and weight matrix $W$:
$$ X = \begin{pmatrix} 0 & 1 \\ 2 & 3 \end{pmatrix}_{2 \times 2} \qquad W = \begin{pmatrix} 1 & 1 \\ 1 & 1 \end{pmatrix}_{2 \times 2} $$
$$ y = XW, \qquad \mathrm{Loss}(X) = \sum_{i}^{N} x_i $$

Find:
(a) the loss value
(b) $\frac{\partial L}{\partial f}$
(c) $\frac{\partial L}{\partial y}$
(d) $\frac{\partial L}{\partial X}$
(e) $\frac{\partial L}{\partial W}$

Solutions:
(a) $52$
(b) $\begin{pmatrix} 1 & 1 \\ 1 & 1 \end{pmatrix}$
(c) $\begin{pmatrix} 2 & 2 \\ 10 & 10 \end{pmatrix}$
(d) $\begin{pmatrix} 4 & 4 \\ 20 & 20 \end{pmatrix}$
(e) $\begin{pmatrix} 20 & 20 \\ 32 & 32 \end{pmatrix}$
