TUM I2DL Matrix Derivatives
May 2, 2023
1 Affine layer
$$Y = XW + b \tag{1}$$
This is the matrix analogue of the familiar scalar line $y = ax + b$.
1.1 What is X?
For the affine layer, as phrased in (1), each input instance is flattened to be a row vector inside X. Let us
take a batch of 2 images from the MNIST dataset.
$$
X = \left[\;
\begin{bmatrix}
x^{1}_{11} & \dots & x^{1}_{18} \\
\vdots & \ddots & \vdots \\
x^{1}_{81} & \dots & x^{1}_{88}
\end{bmatrix},\;
\begin{bmatrix}
x^{2}_{11} & \dots & x^{2}_{18} \\
\vdots & \ddots & \vdots \\
x^{2}_{81} & \dots & x^{2}_{88}
\end{bmatrix}
\;\right]
\;\rightarrow\;
\begin{bmatrix}
x^{1}_{11} & \dots & x^{1}_{18} & x^{1}_{21} & \dots & x^{1}_{88} \\
x^{2}_{11} & \dots & x^{2}_{18} & x^{2}_{21} & \dots & x^{2}_{88}
\end{bmatrix}
\tag{2}
$$
Here, the batch shape is 2 × 1 × 8 × 8 (2 images, 1 channel, 8 × 8 pixels), so the flattened X has shape 2 × 64.
Question: What if we had 3-channel RGB images?
Answer: The images are flattened in the same way, row by row and channel by channel. The actual order does not matter, but it is important that it remains consistent across all input instances, so that the weights correspond to the correct entries.
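As a sketch of how this flattening might look in NumPy (the 2 × 1 × 8 × 8 shape is taken from the example above; the array names and random values are purely illustrative):

```python
import numpy as np

# A hypothetical batch of 2 grayscale 8x8 images, shape (N, C, H, W) = (2, 1, 8, 8).
batch = np.random.randn(2, 1, 8, 8)

# Flatten each instance into one row: X has shape (2, 64).
X = batch.reshape(batch.shape[0], -1)
print(X.shape)  # (2, 64)

# For 3-channel RGB images the call is identical; only the row length changes.
rgb_batch = np.random.randn(2, 3, 8, 8)
X_rgb = rgb_batch.reshape(rgb_batch.shape[0], -1)
print(X_rgb.shape)  # (2, 192)
```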
1.2 What is W?
1.3 Notes
• Note: It is not a linear function, even though we often treat it as if it were one. Why not? It does not follow the rules of linearity, namely
  f(x + y) = f(x) + f(y)
  or
  f(ax) = af(x),
  because the bias term b is added only once, regardless of the input. (A quick numeric check follows these notes.)
• The layer can equivalently be written with column vectors, y = Wx + b, which calculates the exact same thing but results in a column vector and not a row vector: the weight vectors of W are now rows and the inputs x are now columns. It is just a matter of how we construct our inputs and weights.
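Here is a minimal numeric check of the linearity point above, assuming an affine map x ↦ xW + b with arbitrary random W and b (none of these values come from the course material):

```python
import numpy as np

rng = np.random.default_rng(0)
W = rng.standard_normal((4, 3))
b = rng.standard_normal(3)

def f(x):
    # Affine layer for a single row-vector input x.
    return x @ W + b

x, y = rng.standard_normal(4), rng.standard_normal(4)

# f(x + y) != f(x) + f(y): b is added once on the left, twice on the right.
print(np.allclose(f(x + y), f(x) + f(y)))  # False
# f(a * x) != a * f(x) for the same reason (b is not scaled on the left).
print(np.allclose(f(2 * x), 2 * f(x)))     # False
# With b = 0, both properties hold, i.e. x -> xW alone is linear.
```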
2 Derivatives
Figure 2: A neural network computational graph. Note: although we always deal with batches of inputs, in the sketch the input layer represents only one input instance (e.g. one flattened image). Each colour represents a different weight column vector in W. Also, each neuron in the input layer (as for any neuron in the network) will collect the gradients flowing along the colourful edges attached to it.
• Gradient:
$$
f : \mathbb{R}^{n \times m} \to \mathbb{R},\; x \in \mathbb{R}^{n \times m},\qquad
\frac{\partial f}{\partial x} =
\begin{bmatrix}
\frac{\partial f}{\partial x_{1,1}} & \dots & \frac{\partial f}{\partial x_{1,m}} \\
\vdots & \ddots & \vdots \\
\frac{\partial f}{\partial x_{n,1}} & \dots & \frac{\partial f}{\partial x_{n,m}}
\end{bmatrix}
\in \mathbb{R}^{n \times m}
\tag{6}
$$
• Jacobian:
$$
f : \mathbb{R}^{n} \to \mathbb{R}^{m},\;
f\!\left(\begin{bmatrix} x_1 \\ \vdots \\ x_n \end{bmatrix}\right) =
\begin{bmatrix} f_1 \\ \vdots \\ f_m \end{bmatrix},\qquad
\frac{\partial f}{\partial x} =
\begin{bmatrix}
\frac{\partial f_1}{\partial x_1} & \dots & \frac{\partial f_1}{\partial x_n} \\
\vdots & \ddots & \vdots \\
\frac{\partial f_m}{\partial x_1} & \dots & \frac{\partial f_m}{\partial x_n}
\end{bmatrix}
\in \mathbb{R}^{m \times n}
\tag{7}
$$
• Note that if x were a row vector and so was the function's image (result), then this Jacobian matrix would be transposed. (A small numeric sketch of these shapes follows this list.)
• An ugly Jacobian (a tensor, i.e. a multidimensional matrix):
$$
f : \mathbb{R}^{n \times m} \to \mathbb{R}^{n},\;
f\!\left(\begin{bmatrix}
w_{11} & \dots & w_{1m} \\
\vdots & \ddots & \vdots \\
w_{n1} & \dots & w_{nm}
\end{bmatrix}\right) =
\begin{bmatrix} f_1 \\ \vdots \\ f_n \end{bmatrix},\qquad
\frac{\partial f}{\partial w} =
\begin{bmatrix}
\begin{bmatrix}
\frac{\partial f_1}{\partial w_{11}} & \dots & \frac{\partial f_1}{\partial w_{1m}} \\
\vdots & \ddots & \vdots \\
\frac{\partial f_1}{\partial w_{n1}} & \dots & \frac{\partial f_1}{\partial w_{nm}}
\end{bmatrix},
\;\dots,\;
\begin{bmatrix}
\frac{\partial f_n}{\partial w_{11}} & \dots & \frac{\partial f_n}{\partial w_{1m}} \\
\vdots & \ddots & \vdots \\
\frac{\partial f_n}{\partial w_{n1}} & \dots & \frac{\partial f_n}{\partial w_{nm}}
\end{bmatrix}
\end{bmatrix}
\tag{8}
$$
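To make the shapes in (6) and (7) concrete, here is a small finite-difference sketch; the test functions (a sum of squares and a matrix-vector product) are illustrative choices of mine, not anything from the course material:

```python
import numpy as np

def num_grad(f, x, eps=1e-6):
    """Finite-difference gradient of a scalar-valued f at x (same shape as x)."""
    g = np.zeros_like(x)
    it = np.nditer(x, flags=['multi_index'])
    for _ in it:
        idx = it.multi_index
        x_p, x_m = x.copy(), x.copy()
        x_p[idx] += eps
        x_m[idx] -= eps
        g[idx] = (f(x_p) - f(x_m)) / (2 * eps)
    return g

# Gradient as in (6): f maps an n x m matrix to a scalar, so df/dx has shape n x m.
X = np.random.randn(2, 3)
f = lambda M: np.sum(M ** 2)               # analytic gradient is 2X
print(np.allclose(num_grad(f, X), 2 * X))  # True (up to numerical error)

# Jacobian as in (7): f(x) = A x maps R^3 to R^4, so the Jacobian is A, shape 4 x 3.
A = np.random.randn(4, 3)
x = np.random.randn(3)
J = np.stack([num_grad(lambda v, i=i: (A @ v)[i], x) for i in range(4)])
print(J.shape, np.allclose(J, A))          # (4, 3) True
```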
According to the chain rule, the derivative of the loss value L with respect to the weight matrix W of our current affine layer would be:
$$
\frac{\partial L}{\partial W} =
\frac{\partial L}{\partial \sigma(Y)} \oplus \frac{\partial \sigma(Y)}{\partial Y} \oplus \frac{\partial Y}{\partial W}
\tag{10}
$$
In this case, $\frac{\partial L}{\partial \sigma(Y)} \frac{\partial \sigma(Y)}{\partial Y}$ is what we call dout, or the upstream gradient, and we assume it has already been calculated before reaching our current scope (according to the relevant functions, of course). It is sent to our current scope, combined as part of the chain rule, and then sent up the stream to the next layer.
Also, in scalar derivatives ⊕ represents a simple multiplication. In multidimensional derivatives, however, ⊕ represents some unknown operation, which we need to figure out.
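As a toy sketch of this flow in the scalar case, where ⊕ really is plain multiplication (the tiny network and its numbers below are made up for illustration, not the course code):

```python
import numpy as np

# Forward pass: x -> y = w*x + b -> s = sigmoid(y) -> L = s**2
w, b, x = 2.0, 0.5, 1.5
y = w * x + b
s = 1.0 / (1.0 + np.exp(-y))
L = s ** 2

# Backward pass: each step multiplies its local derivative into the upstream gradient.
dL_ds = 2 * s                   # upstream gradient arriving at the sigmoid
dL_dy = dL_ds * s * (1 - s)     # this is the "dout" arriving at the affine layer
dL_dw = dL_dy * x               # gradient handed to the weight
dL_dx = dL_dy * w               # gradient sent further upstream

# Finite-difference check of dL/dw.
eps = 1e-6
def loss(w_):
    s_ = 1.0 / (1.0 + np.exp(-(w_ * x + b)))
    return s_ ** 2
print(np.isclose((loss(w + eps) - loss(w - eps)) / (2 * eps), dL_dw))  # True
```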
3 Stanford Article
Given a loss function Loss(Y) = L, we want to calculate $\frac{\partial L}{\partial X}$ or $\frac{\partial L}{\partial W}$.
As seen in (6), the derivative of a scalar by a matrix is a gradient/Jacobian matrix that has the same shape as the input. Moreover, we saw in (9) that the final derivative of the loss value L by any entry of any matrix in the whole neural network is just a scalar. For a better understanding, we can look at the computational graph of the network in Figure 2 and see that each neuron collects and sums the upstream derivatives (from the loss back to it) of every computation it took part in during the forward pass.
$$
\frac{\partial L}{\partial Y} =
\begin{bmatrix}
\frac{\partial L}{\partial y_{1,1}} & \frac{\partial L}{\partial y_{1,2}} & \frac{\partial L}{\partial y_{1,3}} \\
\frac{\partial L}{\partial y_{2,1}} & \frac{\partial L}{\partial y_{2,2}} & \frac{\partial L}{\partial y_{2,3}}
\end{bmatrix}
\tag{13}
$$
So, from (6) we know that the gradient with respect to Y will have the same shape as Y, because L is a scalar, and it is calculated as a part of the chain rule. This is the abstraction notion discussed above.
Let's differentiate with respect to W. Eventually, after the chain rule, the derivative with respect to W will have the same shape as W:
$$
\frac{\partial L}{\partial W} =
\begin{bmatrix}
\frac{\partial L}{\partial w_{1,1}} & \frac{\partial L}{\partial w_{1,2}} & \frac{\partial L}{\partial w_{1,3}} \\
\frac{\partial L}{\partial w_{2,1}} & \frac{\partial L}{\partial w_{2,2}} & \frac{\partial L}{\partial w_{2,3}}
\end{bmatrix}
\tag{14}
$$
Now, this is important. We do not (!!) want to calculate the Jacobians. For a better explanation why, refer to the attached article. We have also learned that each entry of $\frac{\partial L}{\partial W}$ is a scalar that is computed as in (9).
So let's divide and conquer. It is always a better practice, because it is hard to wrap our minds around something bigger than scalars.
$$
\frac{\partial L}{\partial w_{11}} = \sum_{i=1}^{2} \sum_{j=1}^{3} \frac{\partial L}{\partial y_{ij}} \frac{\partial y_{ij}}{\partial w_{11}}
\tag{15}
$$
For better visualization, we can look at this as a "dot product": an elementwise multiplication followed by a summation of all cells (not what we know as np.dot(); that naming is confusing). Remember: when deriving a function, we differentiate with respect to the input variable (at least one of them):
$$
\frac{\partial L}{\partial w_{11}} =
\frac{\partial L}{\partial Y} \cdot \frac{\partial Y}{\partial w_{11}} =
\begin{bmatrix}
\frac{\partial L}{\partial y_{1,1}} & \frac{\partial L}{\partial y_{1,2}} & \frac{\partial L}{\partial y_{1,3}} \\
\frac{\partial L}{\partial y_{2,1}} & \frac{\partial L}{\partial y_{2,2}} & \frac{\partial L}{\partial y_{2,3}}
\end{bmatrix}
\cdot
\begin{bmatrix}
\frac{\partial y_{1,1}}{\partial w_{1,1}} & \frac{\partial y_{1,2}}{\partial w_{1,1}} & \frac{\partial y_{1,3}}{\partial w_{1,1}} \\
\frac{\partial y_{2,1}}{\partial w_{1,1}} & \frac{\partial y_{2,2}}{\partial w_{1,1}} & \frac{\partial y_{2,3}}{\partial w_{1,1}}
\end{bmatrix}
\tag{16}
$$
$$
\frac{\partial L}{\partial W} = X^{T} \cdot \frac{\partial L}{\partial Y} =
\begin{bmatrix}
x_{1,1} & x_{2,1} \\
x_{1,2} & x_{2,2}
\end{bmatrix}
\cdot
\begin{bmatrix}
\frac{\partial L}{\partial y_{1,1}} & \frac{\partial L}{\partial y_{1,2}} & \frac{\partial L}{\partial y_{1,3}} \\
\frac{\partial L}{\partial y_{2,1}} & \frac{\partial L}{\partial y_{2,2}} & \frac{\partial L}{\partial y_{2,3}}
\end{bmatrix}
\tag{20}
$$
We could, of course, do the exact same thing in order to derive X, and we will see that:
$$
\frac{\partial L}{\partial X} = \frac{\partial L}{\partial Y} \cdot W^{T}
\tag{21}
$$
Note: This is only true because L is a scalar. If we just looked at Y = XW, then $\frac{\partial Y}{\partial W}$ would be a Jacobian.
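Putting (15), (16), (20) and (21) together as a NumPy sketch (the shapes N = 2, D = 2, M = 3 match the matrices above; dout stands for ∂L/∂Y and is filled with random values here):

```python
import numpy as np

rng = np.random.default_rng(0)
N, D, M = 2, 2, 3
X = rng.standard_normal((N, D))
W = rng.standard_normal((D, M))
dout = rng.standard_normal((N, M))   # upstream gradient dL/dY, same shape as Y

# (20): dL/dW = X^T . dL/dY   -> shape (D, M), same as W
dW = X.T @ dout
# (21): dL/dX = dL/dY . W^T   -> shape (N, D), same as X
dX = dout @ W.T

# Entry-wise check of (15)/(16) for w_{1,1}: elementwise product of dL/dY with
# dY/dw_{1,1}, summed over all cells. Since y_{i,j} = sum_k x_{i,k} w_{k,j},
# dy_{i,j}/dw_{1,1} equals x_{i,1} when j == 1 and 0 otherwise.
dY_dw11 = np.zeros((N, M))
dY_dw11[:, 0] = X[:, 0]
print(np.isclose(np.sum(dout * dY_dw11), dW[0, 0]))  # True
```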
3.1 What about the bias b term in the affine layer?
We could, of course, do the trick of merging it into X and W, as we saw in the lecture.
If not:
1.
$$
Y = XW + b, \qquad X \in \mathbb{R}^{N \times D},\; W \in \mathbb{R}^{D \times M},\; b \in \mathbb{R}^{1 \times M},\; XW \in \mathbb{R}^{N \times M}
$$
That means that each $b_i$ in b corresponds to one feature in a row of XW. But to add a 1 × M vector to an N × M matrix like that is, strictly speaking, impossible mathematically, right?
• NumPy uses broadcasting to duplicate the row vector b into a matrix $B_{N \times M}$, simply copying b N times and stacking the copies along the rows (the 0-axis). But that is just the programming implementation.
• Mathematically speaking, it is not Y = XW + b but Y = XW + 1_N b, where 1_N is an N × 1 column vector of ones, such that
$$
1_N b =
\begin{bmatrix} 1 \\ \vdots \\ 1 \end{bmatrix}
\begin{bmatrix} b_1 & \dots & b_M \end{bmatrix} =
\begin{bmatrix}
b_1 & \dots & b_M \\
\vdots & \ddots & \vdots \\
b_1 & \dots & b_M
\end{bmatrix}_{N \times M}
$$
That gives us the broadcast that Python does by itself, and allows us to actually add those matrices together.
Now, one can simply follow the exact same paradigm that we've shown above to solve for b, or we can just look at 1_N b as another XW and do the exact same thing as we did for XW, where 1_N plays the role of X and b plays the role of W.
2. We see that the derivative $\frac{\partial L}{\partial b}$ is
$$
\frac{\partial L}{\partial b} = (1_N)^{T} \cdot \frac{\partial L}{\partial Y} =
\left[\, \sum_{i=1}^{N} \frac{\partial L}{\partial y_{i,1}},\; \dots,\; \sum_{i=1}^{N} \frac{\partial L}{\partial y_{i,M}} \,\right]
$$
which in NumPy translates into np.sum(dout, axis=0).
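A short NumPy sketch of both points, with arbitrary shapes and random values for b and dout:

```python
import numpy as np

N, M = 4, 3
rng = np.random.default_rng(0)
XW = rng.standard_normal((N, M))
b = rng.standard_normal(M)
ones = np.ones((N, 1))

# Broadcasting XW + b is the same as explicitly adding 1_N b.
print(np.allclose(XW + b, XW + ones @ b.reshape(1, M)))  # True

# dL/db = (1_N)^T . dL/dY, i.e. a column-wise sum of the upstream gradient.
dout = rng.standard_normal((N, M))
db = ones.T @ dout                            # shape (1, M)
print(np.allclose(db, np.sum(dout, axis=0)))  # True
```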
4 Exercise
Let Y = Affine(X), followed by a sigmoid layer and a loss L.
1. Show a solution to compute the gradient of the Sigmoid layer w.r.t. the upstream gradient.
2. Replace the sigmoid layer with $f(x) = x^2$. For the following input X and weight matrix W:
$$
X = \begin{bmatrix} 0 & 1 \\ 2 & 3 \end{bmatrix}_{2 \times 2}, \qquad
W = \begin{bmatrix} 1 & 1 \\ 1 & 1 \end{bmatrix}_{2 \times 2},
$$
$$
Y = XW, \qquad \mathrm{Loss}(X) = \sum_{i}^{N} x_i \quad \text{(the sum of all entries of its input)}
$$
Find:
(a) the loss value
(b) $\frac{\partial L}{\partial f}$
(c) $\frac{\partial L}{\partial y}$
(d) $\frac{\partial L}{\partial X}$
(e) $\frac{\partial L}{\partial W}$
Solutions:
(a) $52$
(b) $\begin{bmatrix} 1 & 1 \\ 1 & 1 \end{bmatrix}$
(c) $\begin{bmatrix} 2 & 2 \\ 10 & 10 \end{bmatrix}$
(d) $\begin{bmatrix} 4 & 4 \\ 20 & 20 \end{bmatrix}$
(e) $\begin{bmatrix} 20 & 20 \\ 32 & 32 \end{bmatrix}$
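The solutions can be checked with a few lines of NumPy, following the formulas derived in Section 3. This is a verification sketch (assuming no bias term, which matches the given solutions), not the official exercise code:

```python
import numpy as np

X = np.array([[0., 1.], [2., 3.]])
W = np.array([[1., 1.], [1., 1.]])

Y = X @ W                 # affine layer (no bias in this exercise)
f = Y ** 2                # exercise 2 replaces the sigmoid with f(x) = x^2
L = np.sum(f)             # (a) 52.0

dL_df = np.ones_like(f)   # (b) [[1, 1], [1, 1]]
dL_dY = dL_df * 2 * Y     # (c) [[2, 2], [10, 10]]
dL_dX = dL_dY @ W.T       # (d) [[4, 4], [20, 20]]
dL_dW = X.T @ dL_dY       # (e) [[20, 20], [32, 32]]
print(L, dL_df, dL_dY, dL_dX, dL_dW, sep="\n")

# Exercise 1: with the sigmoid kept in place, its backward pass would be
# dL_dY = dout * s * (1 - s), where s = sigmoid(Y) and dout is the upstream gradient.
```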