Matrix Calculus (For Machine Learning and Beyond) : Lecturers: Alan Edelman and Steven G. Johnson
Contents

Introduction
4 Finite-Difference Approximations
  4.1 Why compute derivatives approximately instead of exactly?
  4.2 Finite-Difference Approximations: Easy Version
  4.3 Example: Matrix squaring
  4.4 Accuracy of Finite Differences
  4.5 Order of accuracy
  4.6 Roundoff error
  4.7 Other finite-difference methods
9.3.2 Reverse mode
9.4 Further reading
10 Calculus of Variations
  10.1 Functionals: Mapping functions to scalars
  10.2 Inner products of functions
  10.3 Example: Minimizing arc length
  10.4 Euler–Lagrange equations
13 Derivatives of Eigenproblems
  13.1 Differentiating on the Unit Sphere
    13.1.1 Special Case: A Circle
    13.1.2 On the Sphere
  13.2 Differentiating on Orthogonal Matrices
    13.2.1 Differentiating the Symmetric Eigendecomposition
Introduction
These notes are based on the class as it was run for the second time in January 2023, taught by Professors Alan
Edelman and Steven G. Johnson at MIT. The previous version of this course, run in January 2022, can be found
on OCW here.
Both Professors Edelman and Johnson use he/him pronouns and are in the Department of Mathematics at MIT;
Prof. Edelman is also a Professor in the MIT Computer Science and Artificial Intelligence Laboratory (CSAIL)
running the Julia lab, while Prof. Johnson is also a Professor in the Department of Physics.
Here is a description of the course:
We all know that typical calculus course sequences begin with univariate calculus and then vector calculus. Modern applications such as machine learning and large-scale optimization require the next big step, “matrix calculus” and calculus on arbitrary vector spaces.
This class covers a coherent approach to matrix calculus showing techniques that allow you to think of
a matrix holistically (not just as an array of scalars), generalize and compute derivatives of important
matrix factorizations and many other complicated-looking operations, and understand how differen-
tiation formulas must be re-imagined in large-scale computing. We will discuss “reverse” (“adjoint”,
“backpropagation”) differentiation and how modern automatic differentiation is more computer science
than calculus (it is neither symbolic formulas nor finite differences).
The class involved numerous example numerical computations using the Julia language, which you can install
on your own computer following these instructions. The material for this class is also located on GitHub at
https://github.com/mitmath/matrixcalc.
1 Overview and Motivation
Firstly, where does matrix calculus fit into the MIT course catalog? Well, there are 18.01 (Single-Variable Calculus)
and 18.02 (Vector Calculus), which students are required to take at MIT. But it seems as though this sequence of material is cut off arbitrarily: after scalars and vectors, why not matrices and beyond? After all, this is how the sequence of types is portrayed in many computer programming languages, including Julia! Why should calculus stop with vectors?
In the last decade, linear algebra has taken on larger and larger importance in numerous areas, such as machine
learning, statistics, engineering, etc. In this sense, linear algebra has gradually taken over a much larger part of
today’s tools for lots of areas of study—now everybody needs linear algebra. So it makes sense that we would
want to do calculus on these higher-order arrays, and it won’t be a simple/obvious generalization (for instance, d(A²)/dA ≠ 2A when A is a matrix rather than a scalar).
More generally, the subjects of differentiation and sensitivity analysis are much deeper than one might suspect
from the simple rules learned in first- or second-semester calculus. Differentiating functions whose inputs and/or
outputs are in more complicated vector spaces (e.g. matrices, functions, or more) is one part of this subject. Another
topic is the efficient evaluation of derivatives of functions involving very complicated calculations, from neural
networks to huge engineering simulations—this leads to the topic of “adjoint” or “reverse-mode” differentiation,
also known as “backpropagation.” Automatic differentiation (AD) of computer programs by compilers is another
surprising topic, in which the computer does something very different from the typical human process of first writing
out an explicit symbolic formula and then passing the chain rule through it. These are only a few examples: the key
point is that differentiation is more complicated than you may realize, and that these complexities are increasingly
relevant for a wide variety of applications.
Let’s quickly talk about some of these applications.
1.1 Applications
Applications: Machine learning
Machine learning has numerous buzzwords associated with it, including but not limited to: parameter optimization, stochastic gradient descent, automatic differentiation, and backpropagation. Such a collage of topics shows only a fraction of how matrix calculus applies to machine learning. It is recommended that you look into some of these topics yourself if you are interested.
Large physical simulations, such as engineering-design problems, are increasingly characterized by huge numbers
of parameters, and the derivatives of simulation outputs with respect to these parameters is crucial in order to
evaluate sensitivity to uncertainties as well as to apply large-scale optimization.
For example, the shape of an airplane wing might be characterized by thousands of parameters, and if you can
compute the derivative of the drag force (from a large fluid-flow simulation) with respect to these parameters then
you could optimize the wing shape to minimize the drag for a given lift or other constraints.
An extreme version of such parameterization is known as “topology optimization,” in which the material at “every
point” in space is potentially a degree of freedom, and optimizing over these parameters can discover not only an optimal shape but an optimal topology (how materials are connected in space, e.g. how many holes are present).
For example, topology optimization has been applied in mechanical engineering to design the cross sections of
airplane wings, artificial hips, and more into a complicated lattice of metal struts (e.g. minimizing weight for a
given strength).
Besides engineering design, complicated differentiation problems arise in fitting unknown parameters of a
model to experimental data, and also in evaluating uncertainties in the outputs of models with imprecise pa-
rameters/inputs. This is closely related to regression problems in statistics, as discussed below, except that here
the model might be a giant set of differential equations with some unknown parameters.
In multivariate statistics, models are often framed in terms of matrix inputs and outputs (or even more complicated
objects such as tensors). For example, a “simple” linear multivariate matrix model might be Y(X) = XB + U, where B is an unknown matrix of coefficients (to be determined by some form of fit/regression) and U is an unknown matrix of random noise (that prevents the model from exactly fitting the data). Regression then involves minimizing some function of the error U(B) = Y − XB between the model XB and data Y; for example, a matrix norm ∥U∥²F = tr(UᵀU), a determinant det(UᵀU), or more complicated functions. Estimating the best-fit coefficients B,
analyzing uncertainties, and many other statistical analyses require differentiating such functions with respect to
B or other parameters. A recent review article on this topic is Liu et al. (2022): “Matrix differential calculus with
applications in the multivariate linear model and its diagnostics” (https://doi.org/10.1016/j.sctalk.2023.100274).
Typical differential calculus classes are based on symbolic calculus, with students essentially learning to do what
Mathematica or Wolfram Alpha can do. Even if you are using a computer to take derivatives symbolically, to use such tools effectively you need to understand what is going on under the hood. Similarly, some numerics will show up in a small portion of this class (such as approximating a derivative by a difference quotient), but today’s automatic differentiation is neither of those two things: it is closer to the computer-science topic of compiler technology than to mathematics. However, the underlying mathematics of automatic differentiation is interesting, and we will learn about it in this class!
Even approximate computer differentiation is more complicated than you might expect. For single-variable
functions f (x), derivatives are defined as the limit of a difference [f (x + δx) − f (x)]/δx as δx → 0. A crude
“finite-difference” approximation is simply to approximate f ′ (x) by this formula for a small δx, but this turns out
to raise many interesting issues involving balancing truncation and roundoff errors, higher-order approximations,
and numerical extrapolation.
• and df = f ′ (x)dx.
This last one will be the preferred form of the above for this class. One can think of dx and dy as “really small numbers.”
In mathematics, they are called infinitesimals, defined rigorously via taking limits. Note that here we do not want
to divide by dx. While this is completely fine to do with scalars, once we get to vectors and matrices you can’t
always divide!
The numerics of such derivatives are simple enough to play around with. For instance, consider the function
f(x) = x² and the point (x₀, f(x₀)) = (3, 9). Then, we have the following numerical values near (3, 9):
f (3.0001) = 9.00060001
f (3.00001) = 9.0000600001
f (3.000001) = 9.000006000001
f (3.0000001) = 9.00000060000001.
Here, the bolded digits on the left are ∆x and the bolded digits on the right are ∆y. Notice that ∆y ≈ 6∆x, i.e. the slope is f′(3) = 6. Hence, we have that df = 6 dx at x = 3. Therefore, the linearization of x² at x = 3 is the function f(x) − f(3) ≈ 6(x − 3).
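This is easy to reproduce numerically. Here is a minimal Julia sketch (any sufficiently small Δx works):

```julia
f(x) = x^2

# slope Δy/Δx of f near x = 3, for shrinking finite differences Δx:
for Δx in (1e-4, 1e-5, 1e-6, 1e-7)
    println((f(3 + Δx) - f(3)) / Δx)   # ≈ 6 = f′(3), up to O(Δx) error
end
```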
We now leave the world of scalar calculus and enter the world of vector/matrix calculus! Professor Edelman
invites us to think about matrices holistically—not just as a table of numbers.
The notion of linearizing your function will conceptually carry over as we define the derivative of functions which
take in/spit out more than one number. Of course, this means that the derivative will have a different “shape” than
a single number. Here is a table on the shape of the first derivative. The inputs of the function are given on the
left hand side of the table, and the outputs of the function are given across the top.
You will ultimately learn how to do any of these in great detail eventually in this class! The purpose of this
table is to plant the notion of differentials as linearization. Let’s look at an example.
Example 1
Let f(x) = xᵀx, where x is a 2 × 1 matrix and the output is thus a 1 × 1 matrix (a scalar). Confirm that 2x₀ᵀ dx is indeed the differential of f at x₀ = [3, 4]ᵀ.

Take, say, dx = [.001, .002]ᵀ. Then
f(x₀ + dx) − f(x₀) = (3.001² + 4.002²) − (3² + 4²) = .022005.
Then, notice that 2x₀ᵀ dx = 2 [3 4] dx = .022. Hence, the two agree up to higher-order terms.
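The same check takes a few lines of Julia (a small sketch, with the same x₀ and dx as above):

```julia
f(x) = x' * x            # f(x) = xᵀx, a scalar

x0 = [3.0, 4.0]
dx = [0.001, 0.002]      # a small change in x

f(x0 + dx) - f(x0)       # 0.022005…: the exact change in f
2 * (x0' * dx)           # 0.022: the differential 2x₀ᵀdx
```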
As we will see right now, the 2xT0 dx didn’t come from nowhere!
By the differential dA of a matrix A, we mean a small (unconstrained) change in the matrix A. Later, constraints may be placed on the allowed perturbations.
Notice, however, that (by our table) the differential of a matrix is a matrix! So, generally speaking, the products in our differentiation rules will not commute.
If x is a vector, then by the differential product rule we have
d(xᵀx) = (dx)ᵀx + xᵀ(dx) = 2xᵀ dx.
Remark 3. The way the product rule works for vectors as matrices is that transposes “go for the ride.” See the
next example below.
Example 4
By the product rule, we have
1. d(uᵀv) = (du)ᵀv + uᵀ(dv) = vᵀ du + uᵀ dv, since dot products commute.
Remark 5. The way to prove these sorts of statements can be seen in Section 2.
δf = f(x + δx) − f(x) = f′(x) δx + o(δx),
where f′(x) δx is the linear term and o(δx) denotes the higher-order terms.
Figure 1: The essence of a derivative is linearization: predicting a small change δf in the output f (x) from a small
change δx in the input x, to first order in δx.
We can more precisely express these higher-order terms using asymptotic “little-o” notation “o(δx)”, which denotes
any function whose magnitude shrinks much faster than |δx| as δx → 0, so that for sufficiently small δx it is
negligible compared to the linear f′(x) δx term. (Variants of this notation are commonly used in computer science, and there is a formal definition that we omit here.¹) Examples of such higher-order terms include (δx)², (δx)³, (δx)^1.001, and δx/log(δx).
Remark 6. Here, δx is not an infinitesimal but rather a small number. Note that our symbol “δ” (a Greek
lowercase “delta”) is not the same as the symbol “∂” commonly used to denote partial derivatives.
This notion of a derivative may remind you of the first two terms in a Taylor series f (x+δx) = f (x)+f ′ (x) δx+· · ·
(though in fact it is much more basic than Taylor series!), and the notation will generalize nicely to higher dimensions
¹Briefly, a function g(δx) is o(δx) if lim_{δx→0} ∥g(δx)∥/∥δx∥ = 0. We will return to this subject in Section 5.2.
and other vector spaces. In differential notation, we can express the same idea as:
df = f(x + dx) − f(x) = f′(x) dx.
In this notation we implicitly drop the o(δx) term that vanishes in the limit as δx becomes infinitesimally small.
We will use this as the more generalized definition of a derivative. In this formulation, we avoid dividing by dx,
because soon we will allow x (and hence dx) to be something other than a number—if dx is a vector, we won’t be
able to divide by it!
As above, you should think of the differential notation dx as representing an arbitrary small change in x, where
we are implicitly dropping any o(dx) terms, i.e. terms that decay faster than linearly as dx → 0. Often, we will
omit the square brackets and write simply f ′ (x)dx instead of f ′ (x)[dx], but this should be understood as the linear
operator f ′ (x) acting on dx—don’t write dx f ′ (x), which will generally be nonsense!
This definition will allow us to extend differentiation to arbitrary vector spaces of inputs x and outputs f (x).
(More technically, we will require vector spaces with a norm ∥x∥, called “Banach spaces,” in order to precisely
define the o(δx) terms that are dropped. We will come back to the subject of Banach spaces later.)
Recall that a linear operator L is a map satisfying L[v₁ + v₂] = L[v₁] + L[v₂] and L[αv] = αL[v] for scalars α ∈ R.
Remark : In this course, f ′ is a map that takes in an x and spits out a linear operator f ′ (x) (the derivative
of f at x). Furthermore, f ′ (x) is a linear map that takes in an input direction v and gives an output vector
f ′ (x)[v] (which we will later interpret as a directional derivative, see Sec. 2.2.1). When the direction v is an
infinitesimal dx, the output f ′ (x)[dx] = df is the differential of f (the corresponding infinitesimal change in
f ).
Notation 9 (Derivative operators and notations)
There are multiple notations for derivatives in common use, along with multiple related concepts of derivative,
differentiation, and differentials. In the table below, we summarize several of these notations, and put boxes
around the notations adopted for this course:
name: derivative
notations: f′ (the notation adopted for this course); also df/dx, Df, fₓ, ∂ₓf, . . .
remark: the linear operator f′(x) that maps a small change dx in the input to a small change df = f′(x)[dx] in the output
ᵃInformally, one can think of the vector space of infinitesimals dx as living in the same space as x (understood as a small change
in a vector, but still a vector nonetheless). Formally, one can define a distinct “vector space of infinitesimals” in various ways,
e.g. as a cotangent space in differential geometry, though we won’t go into more detail here.
Some examples of linear operators include
• Multiplication by scalars α, i.e. Lv = αv. Also multiplication of column vectors v by matrices A, i.e. Lv = Av.
• Some functions like f(x) = x² are obviously nonlinear. But what about f(x) = x + 1? This may look linear
if you plot it, but it is not a linear operation, because f (2x) = 2x + 1 ̸= 2f (x)—such functions, which are
linear plus a nonzero constant, are known as affine.
• There are also many other examples of linear operations that are not so convenient or easy to write down as
matrix–vector products. For example, if A is a 3 × 3 matrix, then L[A] = AB + CA is a linear operator given
3 × 3 matrices B, C. The transpose f (x) = xT of a column vector x is linear, but is not given by any matrix
multiplied by x. Or, if we consider vector spaces of functions, then the calculus operations of differentiation
and integration are linear operators too!
There is an equivalent way to interpret this linear-operator viewpoint of a derivative, which you may have seen
before in multivariable calculus: as a directional derivative.
If we have a function f (x) of arbitrary vectors x, then the directional derivative at x in a direction (vector) v
is defined as:
∂/∂α f(x + αv) |_{α=0} = lim_{δα→0} [f(x + δα v) − f(x)] / δα     (1)
where α is a scalar. This transforms derivatives back into single-variable calculus from arbitrary vector spaces. It
measures the rate of change of f in the direction v from x. But it turns out that this has a very simple relationship
to our linear operator f′(x) from above, because (dropping higher-order terms due to the limit δα → 0):
f(x + dα v) − f(x) = f′(x)[dα v] = f′(x)[v] dα,
where we have factored out the scalar dα in the last step thanks to f′(x) being a linear operator. Comparing with
above, we immediately find that the directional derivative is:
∂/∂α f(x + αv) |_{α=0} = f′(x)[v].     (2)
It is exactly equivalent to our f ′ (x) from before! (We can also see this as an instance of the chain rule from Sec. 2.5.)
One lesson from this viewpoint is that it is perfectly reasonable to input an arbitrary non-infinitesimal vector v
into f ′ (x)[v]: the result is not a df , but is simply a directional derivative.
Therefore, since dx is a column vector (in an arbitrary direction, representing an arbitrary small change in x),
the linear operator f ′ (x) that produces a scalar df must be a row vector (a “1-row matrix”, or more formally
something called a covector or “dual” vector or “linear form”)! We call this row vector the transpose of the gradient
Figure 2: For a real-valued f (x), the gradient ∇f is defined so that it corresponds to the “uphill” direction at a
point x, which is perpendicular to the contours of f . Although this may not point exactly towards the nearest
local maximum of f (unless the contours are circular), “going uphill” is nevertheless the starting point for many
computational-optimization algorithms to search for a maximum.
(∇f)ᵀ, so that df is the dot (“inner”) product of dx with the gradient. So we have that

df = ∇f · dx = (∇f)ᵀ dx,   where dx = [dx₁; dx₂; ⋮; dxₙ] and (∇f)ᵀ = f′(x).
Some authors view the gradient as a row vector (equating it with f ′ or the Jacobian), but treating it as a “column
vector” (the transpose of f ′ ), as we do in this course, is a common and useful choice. As a column vector, the
gradient can be viewed as the “uphill” (steepest-ascent) direction in the x space, as depicted in Fig. 2. Furthermore,
it is also easier to generalize to scalar functions of other vector spaces. In any case, for this class, we will always
define ∇f to have the same “shape” as x, so that df is a dot product (“inner product”) of dx with the gradient.
This is perfectly consistent with the viewpoint of the gradient that you may remember from multivariable
calculus, in which the gradient was a vector of components
∇f = [∂f/∂x₁; ∂f/∂x₂; ⋮; ∂f/∂xₙ],

or, equivalently,

df = f(x + dx) − f(x) = ∇f · dx = (∂f/∂x₁) dx₁ + (∂f/∂x₂) dx₂ + ⋯ + (∂f/∂xₙ) dxₙ.
While a component-wise viewpoint may sometimes be convenient, we want to encourage you to view the vector x as a
whole, not simply a collection of components, and to learn that it is often more convenient and elegant to differentiate
expressions without taking the derivative component-by-component, a new approach that will generalize better to more complicated input/output vector spaces.
Let’s look at an example to see how we compute this differential.
Example 10
Consider f(x) = xᵀAx, where x ∈ Rⁿ and A is a square n × n matrix, so that f(x) ∈ R. Compute df, f′(x), and ∇f.

df = f(x + dx) − f(x)
   = (x + dx)ᵀA(x + dx) − xᵀAx
   = xᵀAx + (dx)ᵀAx + xᵀA dx + (dx)ᵀA dx − xᵀAx     [the (dx)ᵀA dx term is higher order]
   = (dx)ᵀAx + xᵀA dx
   = xᵀ(A + Aᵀ) dx  ⟹  f′(x) = (∇f)ᵀ = xᵀ(A + Aᵀ)  and  ∇f = (A + Aᵀ)x.
Here, we dropped terms with more than one dx factor as these are asymptotically negligible. Another trick
was to combine (dx)ᵀAx and xᵀA dx by realizing that these are scalars and hence equal to their own transpose: (dx)ᵀAx = ((dx)ᵀAx)ᵀ = xᵀAᵀ dx. Hence, we have found that f′(x) = xᵀ(A + Aᵀ) = (∇f)ᵀ, or equivalently ∇f = [xᵀ(A + Aᵀ)]ᵀ = (A + Aᵀ)x.
It is, of course, also possible to compute the same gradient component-by-component, the way you probably
learned to do in multivariable calculus. First, you would need to write f (x) explicitly in terms of the components
of x, as f(x) = xᵀAx = Σᵢ,ⱼ xᵢAᵢⱼxⱼ. Then, you would compute ∂f/∂xₖ for each k, taking care that x appears twice in the f summation. However, this approach is awkward, error-prone, labor-intensive, and quickly becomes
worse as we move on to more complicated functions. It is much better, we feel, to get used to treating vectors and
matrices as a whole, not as mere collections of numbers.
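A quick numerical sanity check of Example 10 in Julia (a sketch with arbitrary random inputs):

```julia
A = randn(5, 5)           # an arbitrary (non-symmetric) square matrix
f(x) = x' * A * x

x  = randn(5)
dx = randn(5) * 1e-8      # a small random perturbation

f(x + dx) - f(x)          # finite-difference estimate of df
((A + A') * x)' * dx      # ∇f ⋅ dx with ∇f = (A + Aᵀ)x: should agree closely
```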
df = f′(x) dx = J dx,

where df has m components, dx has n components, and the linear operator f′(x) is the m × n Jacobian matrix J. The matrix J has entries Jᵢⱼ = ∂fᵢ/∂xⱼ (corresponding to the i-th row and the j-th column of J).
So now, suppose that f : R² → R². Let’s understand how we would compute the differential of f:

df = [∂f₁/∂x₁  ∂f₁/∂x₂; ∂f₂/∂x₁  ∂f₂/∂x₂] [dx₁; dx₂] = [(∂f₁/∂x₁)dx₁ + (∂f₁/∂x₂)dx₂; (∂f₂/∂x₁)dx₁ + (∂f₂/∂x₂)dx₂].
Example 11
Consider the function f (x) = Ax where A is a constant m × n matrix. Then, by applying the distributive law
for matrix–vector products, we have
df = f(x + dx) − f(x) = A(x + dx) − Ax = A dx.
Therefore, f′(x) = A.
Notice then that the linear operator A is its own Jacobian matrix!
Let’s now consider some derivative rules.
• Sum Rule: Given f(x) = g(x) + h(x), we get that
df = f(x + dx) − f(x) = [g(x + dx) + h(x + dx)] − [g(x) + h(x)] = dg + dh = g′(x)dx + h′(x)dx.
Hence, f ′ = g ′ + h′ as we should expect. This is the linear operator f ′ (x)[v] = g ′ (x)[v] + h′ (x)[v], and note
that we can sum linear operators (like g ′ and h′ ) just like we can sum matrices! In this way, linear operators
form a vector space.
• Product Rule: Suppose f (x) = g(x)h(x). Then,
df = f(x + dx) − f(x)
   = g(x + dx)h(x + dx) − g(x)h(x)
   = (g(x) + g′(x)dx)(h(x) + h′(x)dx) − g(x)h(x)     [writing dg = g′(x)dx, dh = h′(x)dx]
   = gh + (dg)h + g(dh) + (dg)(dh) − gh
   = dg h + g dh,
where the dg dh term is higher-order and hence dropped in infinitesimal notation. Note, as usual, that dg
and h may not commute now as they may no longer be scalars!
Let’s look at some short examples of how we can apply the product rule nicely.
Example 12
Let f(x) = Ax (mapping Rⁿ → Rᵐ) where A is a constant m × n matrix. Then, since A is constant (dA = 0),
df = d(Ax) = (dA)x + A dx = A dx, so f′(x) = A once again.
Example 13
Let f(x) = xᵀAx (mapping Rⁿ → R). Then, by the product rule,
df = (dx)ᵀAx + xᵀ d(Ax) = (dx)ᵀAx + xᵀA dx = xᵀ(A + Aᵀ) dx,
and hence f ′ (x) = xT (A + AT ). (In the common case where A is symmetric, this simplifies to f ′ (x) = 2xT A.)
Note that here we have applied Example 12 in computing d(Ax) = Adx. Furthermore, f is a scalar valued
function, so we may also obtain the gradient ∇f = (A + AT )x as before (which simplifies to 2Ax if A is
symmetric).
Example 14
Let x .∗ y denote the element-wise product of vectors (also called the Hadamard product), where for convenience below we also
define diag(x) as the m×m diagonal matrix with x on the diagonal. Then, given A ∈ Rm,n , define f : Rn → Rm
via
f (x) = A(x .∗ x).
(c) d(x .∗ y) = (dx) .∗ y + x .∗ (dy). So if we take y to be a constant and define g(x) = y .∗ x, its Jacobian
matrix is diag(y).
(e) Notice that the directional derivative (Sec. 2.2.1) of f at x in the direction v is simply given by f′(x)[v] = 2A(x .∗ v). One could also check numerically, for some arbitrary A, x, v, that f(x + 10⁻⁸v) − f(x) ≈ 10⁻⁸ · 2A(x .∗ v), as sketched below.
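Here is a minimal sketch of that numerical check in Julia (with arbitrary random A, x, v):

```julia
A = randn(4, 3)
x = randn(3)
v = randn(3)

f(x) = A * (x .* x)       # f(x) = A(x .∗ x)

δ = 1e-8
(f(x + δ*v) - f(x)) / δ   # finite-difference directional derivative
2A * (x .* v)             # the analytical f′(x)[v]: should agree to ~8 digits
```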
• Chain Rule: Let f(x) = g(h(x)). Then,
df = g(h(x + dx)) − g(h(x)) = g′(h(x))[dh] = g′(h(x))[h′(x)[dx]],
so f′(x) is the composition of the two linear operators g′(h(x)) and h′(x), in that order.
Example 15
Let x ∈ Rn , h(x) ∈ Rp , and g(h(x)) ∈ Rm . Then, let f (x) = g(h(x)) mapping from Rn to Rm . The chain rule
then states that
f ′ (x) = g ′ (h(x))h′ (x),
which makes sense as g ′ is an m × p matrix and h′ is a p × n matrix, so that the product gives an m × n matrix
f ′ ! However, notice that this is not the same as h′ (x)g ′ (h(x)) as you cannot (if n ̸= m) multiply a p × n and an
m × p matrix together, and even if n = m you will get the wrong answer since they probably won’t commute.
Not only does the order of the multiplication matter, but the associativity of matrix multiplication matters
practically. Let’s consider a function
f(x) = a(b(c(x))),
whose derivative, by the chain rule, is
f′ = a′b′c′ = (a′b′)c′ = a′(b′c′)
by associativity (omitting the function arguments for brevity). The left-hand side is multiplication from left to
right, and the right-hand side is multiplication from right to left.
But who cares? Well it turns out that associativity is deeply important. So important that the two orderings
have names: multiplying left-to-right is called “reverse mode” and multiplying right-to-left is called “forward mode”
in the field of automatic differentiation (AD). Reverse-mode differentiation is also known as an “adjoint method” or
“backpropagation” in some contexts, which we will explore in more detail later. Why does this matter? Let’s think
about the computational cost of matrix multiplication.
If you multiply an m × q matrix by a q × p matrix, you normally do it by computing mp dot products of length q (or
some equivalent re-ordering of these operations). To do a dot product of length q requires q multiplications and
q − 1 additions of scalars. Overall, this is approximately 2mpq scalar operations in total. In computer science, you
would write that this is “Θ(mpq)”: the computational effort is asymptotically proportional to mpq for large m, p, q.
Figure 3: Matrix multiplication is associative—that is, (AB)C = A(BC) for all A, B, C—but multiplying left-to-
right can be much more efficient than right-to-left if the leftmost matrix has only one (or few) rows, as shown here.
Correspondingly, the order in which you carry out the chain rule has dramatic consequences for the computational
effort required. Left-to-right is known as “reverse mode” or “backpropagation”, and is best suited to situations
where there are many fewer outputs than inputs.
So why does the order of the chain rule matter? Consider the following two examples.
Example 16
Suppose you have a lot of inputs n ≫ 1, and only one output m = 1, with lots of intermediate values, i.e.
q = p = n. Then reverse mode (left-to-right) will cost Θ(n²) scalar operations while forward mode (right-to-left) would cost Θ(n³)! This is a huge cost difference, depicted schematically in Fig. 3.
Conversely, suppose you have a lot of outputs m ≫ 1 and only one input n = 1, with lots of intermediate values q = p = m. Then reverse mode would cost Θ(m³) operations but forward mode would be only Θ(m²)!
Moral: If you have a lot of inputs and few outputs (the usual case in machine learning and optimization),
compute the chain rule left-to-right (reverse mode). If you have a lot of outputs and few inputs, compute the
chain rule right-to-left (forward mode). We return to this in Sec. 8.4.
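You can see the asymmetry directly in Julia by timing the two parenthesizations of a row vector times two matrices (a rough sketch; exact timings will vary):

```julia
n = 2000
u = randn(n)'        # a 1 × n row vector: one output, n inputs
B = randn(n, n)
C = randn(n, n)

@time (u * B) * C    # left-to-right (“reverse mode”): two Θ(n²) vector–matrix products
@time u * (B * C)    # right-to-left (“forward mode”): one Θ(n³) matrix–matrix product
```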
Example 17
Let A be an n × n matrix. You could have the following matrix-valued functions of A. For example:
• f(A) = A³,
• f(A) = A⁻¹ if A is invertible,
• or U, where U is the resulting matrix after applying Gaussian elimination to A!
You could also have scalar outputs. For example:
• f(A) = det A,
• f(A) = trace A,
• or f(A) = σ₁(A), the largest singular value of A.
Example 18
Let f(A) = A³ where A is a square matrix. Compute df.

By the product rule,
df = dA A² + A dA A + A² dA = f′(A)[dA].
Notice that this is not equal to 3A² dA (unless dA and A commute, which won’t generally be true since dA represents an arbitrary small change in A). The right-hand side is a linear operator f′(A) acting on dA, but it is not so easy to interpret it as simply a single “Jacobian” matrix multiplying dA!
Example 19
Let f(A) = A⁻¹ where A is a square invertible matrix. Compute df = d(A⁻¹).
Here, we use a slight trick. Notice that AA⁻¹ = I, the identity matrix. Thus, we can compute the differential using the product rule (noting that dI = 0, since changing A does not change I):
0 = d(AA⁻¹) = dA A⁻¹ + A d(A⁻¹)  ⟹  d(A⁻¹) = −A⁻¹ dA A⁻¹.
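As a quick sanity check of this formula, here is a Julia sketch (random A, small random dA):

```julia
using LinearAlgebra

A  = randn(4, 4)
dA = randn(4, 4) * 1e-8

inv(A + dA) - inv(A)      # finite difference of A⁻¹
-inv(A) * dA * inv(A)     # the differential −A⁻¹ dA A⁻¹: should agree closely
```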
3 Jacobians of Matrix Functions
When we have a function that has matrices as inputs and/or outputs, we have already seen in the previous lectures
that we can still define the derivative as a linear operator by a formula for f ′ mapping a small change in input to
the corresponding small change in output. However, when you first learned linear algebra, probably most linear
operations were represented by matrices multiplying vectors, and it may take a while to get used to thinking of linear
operations more generally. In this chapter, we discuss how it is still possible to represent f ′ by a Jacobian matrix
even for matrix inputs/outputs, and how the most common technique to do this involves matrix “vectorization”
and a new type of matrix operation, a Kronecker product. This gives us another way to think about our f ′
linear operators that is occasionally convenient, but at the same time it is important to become comfortable with
other ways of writing down linear operators too—sometimes, the explicit Jacobian-matrix approach can obscure
key structure, and it is often computationally inefficient as well.
For this section of the notes, we refer to the linked Pluto Notebook for computational demonstrations of this
material in Julia, illustrating multiple views of the derivative of the square A² of 2 × 2 matrices A.
Example 20
An even simpler example is the matrix-square function
f(A) = A²,
for which the product rule gives df = dA A + A dA = f′(A)[dA].
You can also work this out explicitly from df = f(A + dA) − f(A) = (A + dA)² − A², dropping the (dA)² term.
In all of these examples, f ′ (A) is described by a simple formula for f ′ (A)[dA] that relates an arbitrary change
dA in A to the change df = f (A+dA)−f (A) in f , to first order. If the differential is distracting you, realize that we
can plug any matrix X we want into this formula, not just an “infinitesimal” change dA, e.g. in our matrix-square
example we have
f ′ (A)[X] = XA + AX
for an arbitrary X (a directional derivative, from Sec. 2.2.1). This is linear in X: if we scale or add inputs, it scales
or adds outputs, respectively:
f ′ (A)[X + Y ] = (X + Y )A + A(X + Y ) = XA + Y A + AX + AY = XA + AX + Y A + AY
= f ′ (A)[X] + f ′ (A)[Y ] .
This is a perfectly good way to define a linear operation! We are not expressing it here in the familiar form
f ′ (A)[X] = (some matrix?) × (X vector?), and that’s okay! A formula like XA + AX is easy to write down, easy
to understand, and easy to compute with.
But sometimes you still may want to think of f ′ as a single “Jacobian” matrix, using the most familiar language
of linear algebra, and it is possible to do that! If you took a sufficiently abstract linear-algebra course, you may
have learned that any linear operator can be represented by a matrix once you choose a basis for the input and
output vector spaces. Here, however, we will be much more concrete, because there is a conventional “Cartesian”
basis for matrices A called “vectorization”, and in this basis linear operators like AX + XA are particularly easy
to represent in matrix form once we introduce a new type of matrix product that has widespread applications in
“multidimensional” linear algebra.
Let us explicitly write out the matrix-square function
f(A) = A²
for the simple case of 2 × 2 matrices, which are described by only four scalars, so that we can look at every term in the derivative explicitly. In particular:
Example 21
For a 2 × 2 matrix
A = [p r; q s],
the matrix-square function is
f(A) = A² = [p r; q s][p r; q s] = [p² + qr   pr + rs; pq + qs   qr + s²].
Written out explicitly in terms of the matrix entries (p, q, r, s) in this way, it is natural to think of our function
as mapping 4 scalar inputs to 4 scalar outputs. That is, we can think of f as equivalent to a “vectorized” function
f̃ : R⁴ → R⁴ given by
f̃([p; q; r; s]) = [p² + qr; pq + qs; pr + rs; qr + s²].
Converting a matrix into a column vector in this way is called vectorization, and is commonly denoted by the
operation “vec”:

vec A = vec [p r; q s] = [A₁₁; A₂₁; A₁₂; A₂₂] = [p; q; r; s],

vec f(A) = vec [p² + qr   pr + rs; pq + qs   qr + s²] = [p² + qr; pq + qs; pr + rs; qr + s²].
More generally,
Definition 22
The vectorization vec A ∈ Rᵐⁿ of any m × n matrix A is defined by simply stacking the columns of A, from left to right, into a column vector. That is, if we denote the n columns of A by the m-component vectors a⃗₁, a⃗₂, . . . , a⃗ₙ ∈ Rᵐ, then

vec A = vec [a⃗₁ a⃗₂ ⋯ a⃗ₙ] = [a⃗₁; a⃗₂; ⋮; a⃗ₙ] ∈ Rᵐⁿ.
Problem 23
The vector vec A corresponds to the coefficients you get when you express the m × n matrix A in a basis of
matrices. What is that basis?
Vectorization turns unfamiliar things (like matrix functions and derivatives thereof) into familiar things (like vector functions and Jacobians or gradients thereof). In that way, it can be a very attractive tool, almost too attractive—why do “matrix calculus” if you can turn everything back into ordinary multivariable calculus? Vectorization has its drawbacks, however: conceptually, it can obscure the underlying mathematical structure (e.g. f̃ above doesn’t look much like a matrix square A²), and computationally this loss of structure can sometimes lead to severe inefficiencies (e.g. forming huge m² × m² Jacobian matrices as discussed below). Overall, we believe
that the primary way to study matrix functions like this should be to view them as having matrix inputs (A) and
matrix outputs (A2 ), and that one should likewise generally view the derivatives as linear operators on matrices,
not vectorized versions thereof. However, it is still useful to be familiar with the vectorization viewpoint in order
to have the benefit of an alternative perspective.
To understand Jacobians of functions (from matrices to matrices), let’s begin by considering a basic question:
Question 24. What is the size of the Jacobian of the matrix-square function?
Well, if we view the matrix-squaring function via its vectorized equivalent f̃, mapping R⁴ → R⁴ (4-component column vectors to 4-component column vectors), the Jacobian would be a 4 × 4 matrix (formed from the derivatives of each output component with respect to each input component). Now let’s think about a more general square matrix A: an m × m matrix. If we wanted to find the Jacobian of f(A) = A², we could do so by the same process and (symbolically) obtain an m² × m² matrix (since there are m² inputs, the entries of A, and m² outputs, the entries of A²). Explicit computation of these m⁴ partial derivatives is rather tedious even for small m, but is a task that symbolic computational tools in e.g. Julia or Mathematica can handle. In fact, as seen in the Notebook, Julia spits out the Jacobian quite easily. For the m = 2 case that we wrote out explicitly above, you can either take the derivative of f̃ by hand or use Julia’s symbolic tools to obtain the Jacobian:

f̃′ = [2p r q 0; q p+s 0 q; r 0 p+s r; 0 r q 2s].
For example, the first row of f̃′ consists of the partial derivatives of p² + qr (the first output) with respect to the 4 inputs p, q, r, and s. Here, we have labeled the rows by the (row, column) indices (j_out, k_out) of the entries in the “output” matrix d(A²), and have labeled the columns by the indices (j_in, k_in) of the entries in the “input” matrix A. Although we have written the Jacobian f̃′ as a “2d” matrix, you can therefore also imagine it to be a “4d” matrix indexed by j_out, k_out, j_in, k_in.
However, the matrix-calculus approach of viewing the derivative f ′ (A) as a linear transformation on matrices
(as we derived above),
f ′ (A)[X] = XA + AX ,
seems to be much more revealing than writing out an explicit component-by-component “vectorized” Jacobian f˜′ ,
and gives a formula for any m × m matrix without laboriously requiring us to take m⁴ partial derivatives one-by-one.
If we really want to pursue the vectorization perspective, we need a way to recapture some of the structure that is
obscured by tedious componentwise differentiation. A key tool to bridge the gap between the two perspectives is a
type of matrix operation that you may not be familiar with: Kronecker products (denoted ⊗).
The key tool that turns linear operations on matrices back into “2d” matrices for the vectorized inputs/outputs is the Kronecker product A ⊗ B. Although
they don’t often appear in introductory linear-algebra courses, Kronecker products show up in a wide variety of
mathematical applications where multidimensional data arises, such as multivariate statistics and data science or
multidimensional scientific/engineering problems.
Definition 25
If A is an m × n matrix with entries aᵢⱼ and B is a p × q matrix, then their Kronecker product A ⊗ B is the mp × nq matrix defined by

A ⊗ B = [a₁₁B ⋯ a₁ₙB; ⋮ ⋱ ⋮; aₘ₁B ⋯ aₘₙB].
For example, take 2 × 2 matrices A = [p r; q s] and B = [a c; b d]. Then A ⊗ B is a 4 × 4 matrix containing all possible products of entries of A with entries of B. Note that A ⊗ B ≠ B ⊗ A (but the two are related by a re-ordering of the entries):

A ⊗ B = [pB rB; qB sB] = [pa pc ra rc; pb pd rb rd; qa qc sa sc; qb qd sb sd]
≠ B ⊗ A = [aA cA; bA dA] = [ap ar cp cr; aq as cq cs; bp br dp dr; bq bs dq ds].

See the Notebook for more examples of Kronecker products of matrices (including some with pictures rather than numbers!).
Above, we saw that f(A) = A² at A = [p r; q s] could be thought of as an equivalent function f̃(vec A) mapping column vectors of 4 inputs to 4 outputs (R⁴ → R⁴), with a 4 × 4 Jacobian that we (or the computer) laboriously
computed as 16 element-by-element partial derivatives. It turns out that this result can be obtained much more
elegantly once we have a better understanding of Kronecker products. We will find that the 4 × 4 “vectorized”
Jacobian is simply
f̃′ = I₂ ⊗ A + Aᵀ ⊗ I₂,

where I₂ is the 2 × 2 identity matrix. That is, the matrix linear operator f′(A)[dA] = dA A + A dA is equivalent, after vectorization, to:

vec f′(A)[dA] = (I₂ ⊗ A + Aᵀ ⊗ I₂) vec dA = [2p r q 0; q p+s 0 q; r 0 p+s r; 0 r q 2s] [dp; dq; dr; ds],

where the 4 × 4 matrix is exactly the Jacobian f̃′ computed above.
In order to understand why this is the case, however, we must first build up some understanding of the algebra of
Kronecker products. To start with, a good exercise is to convince yourself of a few simpler properties of Kronecker
products:
Problem 26
From the definition of the Kronecker product, derive the following identities:
1. (A ⊗ B)ᵀ = Aᵀ ⊗ Bᵀ.
4. A ⊗ B is orthogonal (its transpose is its inverse) if A and B are orthogonal. (From properties 1 & 3.)
In order to convert linear operations like AX + XA into Kronecker products via vectorization, the key identity is:
Proposition 27
Given (compatibly sized) matrices A, B, C, we have
(A ⊗ B) vec(C) = vec(BCAᵀ).
We can thus view A ⊗ B as a vectorized equivalent of the linear operation C 7→ BCAT . We are tempted to
introduce a parallel notation (A ⊗ B)[C] = BCAᵀ for the “non-vectorized” version of this operation, although
this notation is not standard.
One possible mnemonic for this identity is that the B is just to the left of the C while the A “circles around”
to the right and gets transposed.
Where does this identity come from? We can break it into simpler pieces by first considering the cases where
either A or B is an identity matrix I (of the appropriate size). To start with, suppose that A = I, so that
BCAᵀ = BC. What is vec(BC)? If we let c⃗₁, c⃗₂, . . . denote the columns of C, then recall that BC simply multiplies B on the left with each of the columns of C:

BC = B [c⃗₁ c⃗₂ ⋯] = [Bc⃗₁ Bc⃗₂ ⋯]  ⟹  vec(BC) = [Bc⃗₁; Bc⃗₂; ⋮].
Now, how can we get this vec(BC) vector as something multiplying vec C? It should be immediately apparent that
vec(BC) = [Bc⃗₁; Bc⃗₂; ⋮] = [B 0 ⋯; 0 B ⋯; ⋮ ⋮ ⋱] [c⃗₁; c⃗₂; ⋮],
but this block-diagonal matrix is exactly the Kronecker product I ⊗ B! Hence, we have derived that
(I ⊗ B) vec(C) = vec(BC).
What about the Aᵀ term? This is a little trickier, but again let’s simplify to the case where B = I, in which case BCAᵀ = CAᵀ. To vectorize this, we need to look at the columns of CAᵀ. What is the first column of CAᵀ? It is a linear combination of the columns of C whose coefficients are given by the first column of Aᵀ (= first row of A):
column 1 of CAᵀ = Σⱼ a₁ⱼ c⃗ⱼ.
Similarly for column 2, etc., and we then “stack” these columns to get vec(CAᵀ). But this is exactly the formula for multiplying a matrix A by a vector, if the “elements” of the vector were the columns c⃗ⱼ. Written out explicitly, this becomes:

vec(CAᵀ) = [Σⱼ a₁ⱼ c⃗ⱼ; Σⱼ a₂ⱼ c⃗ⱼ; ⋮] = [a₁₁I a₁₂I ⋯; a₂₁I a₂₂I ⋯; ⋮ ⋮ ⋱] [c⃗₁; c⃗₂; ⋮] = (A ⊗ I) vec C,
The full identity (A ⊗ B) vec(C) = vec(BCAᵀ) can then be obtained by straightforwardly combining these two derivations: replace CAᵀ with BCAᵀ in the second derivation, which replaces c⃗ⱼ with Bc⃗ⱼ and hence I with B.
So now we want to use Prop. 27 to calculate the Jacobian of f(A) = A² in terms of the Kronecker product. Let dA be our C in Prop. 27. We can now immediately see that

vec(A dA + dA A) = (I ⊗ A + Aᵀ ⊗ I) vec(dA),

and the matrix I ⊗ A + Aᵀ ⊗ I is precisely the Jacobian f̃′(vec A),
where I is the identity matrix of the same size as A. We can also write this in our “non-vectorized” linear-operator
notation:
A dA + dA A = (I ⊗ A + Aᵀ ⊗ I)[dA],
which exactly matches our laboriously computed Jacobian f˜′ from earlier!
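This identity is easy to verify numerically in Julia with the built-in kron function (a sketch; since everything here is linear, it holds up to roundoff even for a non-small dA):

```julia
using LinearAlgebra

m  = 4
A  = randn(m, m)
dA = randn(m, m)
Id = Matrix(I, m, m)      # the m × m identity matrix

vec(A*dA + dA*A) ≈ (kron(Id, A) + kron(A', Id)) * vec(dA)   # true
```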
Example 28
For the matrix-cube function A³, where A is an m × m square matrix, compute the m² × m² Jacobian of the vectorized function vec(A³).

Let’s use the same trick for the matrix-cube function. Sure, we could laboriously compute the Jacobian via element-by-element partial derivatives (which is done nicely by symbolic computing in the notebook), but it’s much easier and more elegant to use Kronecker products. Recall that our “non-vectorized” matrix-calculus derivative is the linear operator:

(A³)′[dA] = dA A² + A dA A + A² dA,

so applying Prop. 27 to each term gives

vec((A³)′[dA]) = ((Aᵀ)² ⊗ I + Aᵀ ⊗ A + I ⊗ A²) vec(dA),

i.e. the m² × m² Jacobian is (Aᵀ)² ⊗ I + Aᵀ ⊗ A + I ⊗ A².
You could go on to find the Jacobians of A⁴, A⁵, and so on, or any linear combination of matrix powers. Indeed,
you could imagine applying a similar process to the Taylor series of any (analytic) matrix function f (A), but it
starts to become awkward. Later on (and in homework), we will discuss more elegant ways to differentiate other
matrix functions, not as vectorized Jacobians but as linear operators on matrices.
One must be cautious about using Kronecker products as a computational tool, rather than as more of a conceptual
tool, because they can easily cause the computational cost of matrix problems to explode far beyond what is
necessary.
Suppose that A, B, and C are all m × m matrices. The cost of multiplying two m × m matrices (by the usual methods) scales proportionally to m³, what the computer scientists call Θ(m³) “complexity.” Hence, the cost of the linear operation C ↦ BCAᵀ scales as ∼ m³ (two m × m multiplications). However, if we instead compute the same answer via vec(BCAᵀ) = (A ⊗ B) vec C, then we must:

1. Form the m² × m² matrix A ⊗ B. This requires m⁴ multiplications (all entries of A by all entries of B), and ∼ m⁴ memory storage. (Compare to ∼ m² memory to store A or B. If m is 1000, this is a million times more storage, terabytes instead of megabytes!)

2. Multiply A ⊗ B by the m²-component vector vec C, which requires ∼ m⁴ scalar operations—far more than the ∼ m³ operations to compute BCAᵀ directly.
Another application of Kronecker products is to solve
AX + XB = C
for an unknown matrix X, given A, B, C, where all of these are m × m matrices. This is called a “Sylvester equation.” These are linear equations in our unknown X, and we can convert them to an ordinary system of m² linear equations by Kronecker products:

(I ⊗ A + Bᵀ ⊗ I) vec X = vec C,
which you can then solve for the m² unknowns vec X using Gaussian elimination. But the cost of solving an m² × m² system of equations by Gaussian elimination is ∼ (m²)³ = m⁶. It turns out, however, that there are clever algorithms to solve AX + XB = C in only ∼ m³ operations (with ∼ m² memory)—for m = 1000, this saves a factor of ∼ m³ = 10⁹ (a billion) in computational effort.
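For small m you can compare the two approaches in Julia; the sketch below uses the standard-library sylvester function (which solves AX + XB + C = 0, hence the sign flip) as the fast ∼ m³ method:

```julia
using LinearAlgebra

m = 50
A, B, C = randn(m, m), randn(m, m), randn(m, m)
Id = Matrix(I, m, m)

# brute force: solve the m² × m² Kronecker system (~m⁶ work, ~m⁴ memory):
X1 = reshape((kron(Id, A) + kron(B', Id)) \ vec(C), m, m)

# clever ~m³ algorithm from the standard library:
X2 = sylvester(A, B, -C)     # solves AX + XB + (−C) = 0, i.e. AX + XB = C

A*X1 + X1*B ≈ C, X1 ≈ X2     # (true, true)
```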
(Kronecker products can be a more practical computational tool for sparse matrices: matrices that are mostly
zero, e.g. having only a few nonzero entries per row. That’s because the Kronecker product of two sparse matrices
is also sparse, avoiding the huge storage requirements for Kronecker products of non-sparse “dense” matrices. This
can be a convenient way to assemble large sparse systems of equations for things like multidimensional PDEs.)
4 Finite-Difference Approximations
In this section, we will be referring to this Julia notebook for calculations that are not included here.
Recall that the derivative f′(x) came from dropping higher-order terms from a small but finite difference:
f(x + δx) − f(x) = f′(x)[δx] + o(δx).
So, we can just compare the finite difference f(x + δx) − f(x) to our (directional) derivative operator f′(x)[δx] (i.e. the derivative in the direction δx). f(x + δx) − f(x) is also called a forward-difference approximation.
The antonym of a forward difference is a backward difference approximation f (x) − f (x − δx) ≈ f ′ (x)δx. If you
just want to compute a derivative, there is not much practical distinction between forward and backward differences.
²In some Julia AD software, this is done by defining a “ChainRule”, and in Python autograd/JAX it is done by defining a
The distinction becomes more important when discretizing (approximating) differential equations. We’ll look at
other possibilities below.
Remark 29. Note that this definition of forward and backward difference is not the same as forward- and backward-
mode differentiation—these are unrelated concepts.
If x is a scalar, we can also divide both sides by δx to get an approximation for f ′ (x) instead of for df :
f′(x) ≈ [f(x + δx) − f(x)]/δx + (higher-order corrections).
This is a more common way to write the forward-difference approximation, but it only works for scalar x, whereas
in this class we want to think of x as perhaps belonging to some other vector space.
Finite-difference approximations come in many forms, but they are generally a last resort in cases where it’s
too much effort to work out an analytical derivative and AD fails. But they are also useful to check your analytical
derivatives and to quickly explore.
For example, recall from Sec. 2 that the matrix-square function f(A) = A² has the differential
df = A dA + dA A,
i.e. f′(A) is the linear operator f′(A)[δA] = A δA + δA A. This is not equal to 2A δA because in general A and
δA do not commute. So let’s check this difference against a finite difference. We’ll try it for a random input A and
a random small perturbation δA.
Using a random matrix A, let dA = A · 10⁻⁸. Then, you can compare f(A + dA) − f(A) to A dA + dA A. If the matrix you chose was really random, you would get that the approximation minus the exact answer from the product rule has entries with order of magnitude around 10⁻¹⁶! However, compared to 2A dA, you’d obtain entries of order 10⁻⁸.
To be more quantitative, we might compute the norm ∥approx − exact∥, which we want to be small. But small compared to what? The natural answer is: small compared to the correct answer. This is called the relative error (or “fractional error”) and is computed via

relative error = ∥approx − exact∥ / ∥exact∥.
Here, ∥·∥ is a norm, like the length of a vector. This allows us to understand the size of the error in the finite
difference approximation, i.e. it allows us to answer how accurate this approximation is (recall Sec. 4.1).
So, as above, you can compute that the relative error between the approximation and the exact answer is about 10⁻⁸, whereas the relative error between 2A dA and the exact answer is about 10⁰ (i.e., order 1). This shows that our exact
answer is likely correct! Getting a good match up between a random input and small displacement isn’t a proof of
correctness of course, but it is always a good thing to check. This kind of randomized comparison will almost always
catch major bugs where you have calculated the symbolic derivative incorrectly, like in our 2AdA example.
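Here is a short Julia sketch of this randomized check (cf. the accompanying notebook; exact magnitudes will vary from run to run):

```julia
using LinearAlgebra: norm

f(A)      = A^2
f′(A, δA) = A*δA + δA*A                        # the exact derivative operator

A  = randn(4, 4)
δA = randn(4, 4) * 1e-8
approx = f(A + δA) - f(A)                      # finite difference

norm(approx - f′(A, δA)) / norm(f′(A, δA))     # ≈ 1e-8: small, formula looks right
norm(approx - 2A*δA)     / norm(2A*δA)         # ≈ 1e0: the wrong formula 2A δA
```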
Figure 4: Forward-difference accuracy for f (A) = A2 , showing the relative error in δf = f (A + δA) − f (A) versus
the linearization f ′ (A)δA, as a function of the magnitude ∥δA∥. A is a 4 × 4 matrix with unit-variance Gaussian
random entries, and δA is similarly a unit-variance Gaussian random perturbation scaled by a factor s ranging from 1 to 10⁻¹⁶.
Definition 30
Note that the norm of a matrix that we are using, computed by norm(A) in Julia, is just the direct analogue
of the familiar Euclidean norm for the case of vectors. It is simply the square root of the sum of the squares of the matrix entries:

∥A∥ := √(Σᵢ,ⱼ |Aᵢⱼ|²) = √(tr(AᵀA)).
4.5 Order of accuracy
The truncation error is the inaccuracy arising from the fact that the input perturbation δx is not infinitesimal: we are computing a difference, not a derivative. If the truncation error in the derivative scales proportionally to ∥δx∥ⁿ, we call the approximation n-th-order accurate. For forward differences, the order is n = 1. Why?
For any f(x) with a nonzero second derivative (think of the Taylor series), we have
f(x + δx) − f(x) − f′(x)δx = (terms proportional to ∥δx∥²) + o(∥δx∥²).
That is, the terms we dropped in our forward-difference approximation are proportional to ∥δx∥². But that means that the relative error is linear:

relative error = ∥f(x + δx) − f(x) − f′(x)δx∥ / ∥f′(x)δx∥
               = [(terms proportional to ∥δx∥²) + o(∥δx∥²)] / (something proportional to ∥δx∥)
               = (terms proportional to ∥δx∥) + o(∥δx∥).
This is first-order accuracy. Truncation error in a finite-difference approximation is the inherent error in the
formula for non-infinitesimal δx. Does that mean we should just make δx as small as we possibly can?
Higher-dimensional inputs x pose a fundamental computational challenge for finite-difference techniques, be-
cause if you want to know what happens for every possible direction δx then you need many finite differences: one
for each dimension of δx. For example, suppose x ∈ Rn and f (x) ∈ R, so that you are computing ∇f ∈ Rn ; if you
want to know the whole gradient, you need n separate finite differences. The net result is that finite differences
in higher dimensions are expensive, quickly becoming impractical for high-dimensional optimization (e.g. neural
networks) where n might be huge. On the other hand, if you are just using finite differences as a check for bugs in
your code, it is usually sufficient to compare f (x + δx) − f (x) to f ′ (x)[δx] in a few random directions, i.e. for a few
random small δx.
5 Derivatives in General Vector Spaces
Matrix calculus requires us to generalize concepts of derivative and gradient further, to functions whose inputs
and/or outputs are not simply scalars or column vectors. To achieve this, we extend the notion of the ordinary
vector dot product and ordinary Euclidean vector “length” to general inner products and norms on vector
spaces. Our first example will consider familiar matrices from this point of view.
Recall from linear algebra that we can call any set V a “vector space” if its elements can be added/subtracted
x±y and multiplied by scalars αx (subject to some basic arithmetic axioms, e.g. the distributive law). For example,
the set of m × n matrices themselves form a vector space, or even the set of continuous functions u(x) (mapping
R → R)—the key fact is that we can add/subtract/scale them and get elements of the same set. It turns out to be
extraordinarily useful to extend differentiation to such spaces, e.g. for functions that map matrices to matrices or
functions to numbers. Doing so crucially relies on our input/output vector spaces V having a norm and, ideally,
an inner product.
Therefore, f′(x) is a linear operator taking the vector dx in and giving a scalar value out. Another way to view
this is that f ′ (x) is the row vector3 (∇f )T . Under this viewpoint, it follows that df is the dot product (or “inner
product”):
df = ∇f · dx
We can generalize this to any vector space V with inner products! Given x ∈ V , and a scalar-valued function
f , we obtain the linear operator f ′ (x)[dx] ∈ R, called a “linear form.” In order to define the gradient ∇f , we need
an inner product for V , the vector-space generalization of the familiar dot product!
Given x, y ∈ V, the inner product ⟨x, y⟩ is a map taking in two vectors and giving out a scalar ⟨x, y⟩ ∈ R. This is also commonly denoted x · y or ⟨x | y⟩. More technically, an inner product is a map that is
1. Symmetric: i.e. ⟨x, y⟩ = ⟨y, x⟩ (or conjugate-symmetric,⁴ ⟨x, y⟩ = conj(⟨y, x⟩), if we were using complex numbers),
³Such “dual” vectors are not to be confused with the dual numbers used in automatic differentiation (Sec. 8).
⁴Some authors distinguish the “dot product” from an “inner product” for complex vector spaces, saying that a dot product has no complex conjugation, x · y = y · x (in which case x · x need not be real and does not equal ∥x∥²), whereas the inner product must be conjugate-symmetric, via ⟨x, y⟩ = x̄ · y. Another source of confusion for complex vector spaces is that some fields of mathematics define ⟨x, y⟩ = x · ȳ, i.e. they conjugate the right argument instead of the left (so that it is linear in the left argument and conjugate-linear in the right argument). Aren’t you glad we’re sticking with real numbers?
Definition 31 (Hilbert Space)
A (complete) vector space with an inner product is called a Hilbert space. (The technical requirement of
“completeness” essentially means that you can take limits in the space, and is important for rigorous proofs.a )
ᵃCompleteness means that any Cauchy sequence of points in the vector space—any sequence of points that gets closer and
closer together—has a limit lying within the vector space. This criterion usually holds in practice for vector spaces over real or
complex scalars, but can get trickier when talking about vector spaces of functions, since e.g. the limit of a sequence of continuous
functions can be a discontinuous function.
Once we have a Hilbert space, we can define the gradient for scalar-valued functions. Given x ∈ V a Hilbert
space, and f (x) scalar, then we have the linear form f ′ (x)[dx] ∈ R. Then, under these assumptions, there is a
theorem known as the “Riesz representation theorem” stating that any linear form (including f ′ ) must be an inner
product with something:
f′(x)[dx] = ⟨(some vector), dx⟩ = df,
and that “some vector” is what we call the gradient ∇f at x.
That is, the gradient ∇f is defined as the thing you take the inner product of dx with to get df . Note that ∇f
always has the “same shape” as x.
The first few examples we look at involve the usual Hilbert space V = Rn with different inner products.
Example 32
Given V = Rⁿ with n-component column vectors, we have the familiar Euclidean dot product ⟨x, y⟩ = xᵀy. This leads to the usual ∇f.
Example 33
We can have different inner products on Rⁿ. For instance, given positive weights w₁, . . . , wₙ > 0,

⟨x, y⟩W = w₁x₁y₁ + w₂x₂y₂ + ⋯ + wₙxₙyₙ = xᵀ W y,   where W = diag(w₁, . . . , wₙ).
We can also consider the space of m × n matrices, V = Rᵐˣⁿ. There is, of course, a vector-space isomorphism from V ∋ A ↦ vec(A) ∈ Rᵐⁿ. Thus, in this space we have the analogue of the familiar (“Frobenius”) Euclidean inner product, which is convenient to rewrite in terms of matrix operations via the trace:
Definition 34 (Frobenius inner product)
The Frobenius inner product of two m × n matrices A and B is:

⟨A, B⟩F = Σᵢ,ⱼ AᵢⱼBᵢⱼ = vec(A)ᵀ vec(B) = tr(AᵀB).

Given this inner product, we also have the corresponding Frobenius norm:

∥A∥F = √⟨A, A⟩F = √(tr(AᵀA)) = ∥vec A∥ = √(Σᵢ,ⱼ |Aᵢⱼ|²).
Using this, we can now define the gradient of scalar functions with matrix inputs! This will be our default
matrix inner product (hence defining our default matrix gradient) in these notes (sometimes dropping the F
subscript).
Example 35
Consider the function
f(A) = ∥A∥_F = √(tr(AᵀA)).
What is df ?
Firstly, by the familiar scalar-differentiation chain and power rules we have that
df = d(tr AᵀA) / ( 2√(tr(AᵀA)) ).
Hence,
df = tr(d(AᵀA)) / (2∥A∥_F)
   = tr(dAᵀ A + Aᵀ dA) / (2∥A∥_F)
   = [tr(dAᵀ A) + tr(Aᵀ dA)] / (2∥A∥_F)
   = tr(Aᵀ dA) / ∥A∥_F = ⟨ A/∥A∥_F , dA ⟩.
Here, we used the fact that tr B = tr B T , and in the last step we connected df with a Frobenius inner product. In
other words,
∇f = ∇∥A∥_F = A / ∥A∥_F .
Note that one obtains exactly the same result for column vectors x, i.e. ∇∥x∥ = x/∥x∥ (and in fact this is equivalent
via x = vec A).
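As a quick numerical sanity check of this gradient formula, one can compare against a small random perturbation (a minimal sketch; the matrix size and perturbation scale are our own choices, not from the notes):

using LinearAlgebra
A  = randn(5, 5)
dA = 1e-8 * randn(5, 5)
norm(A + dA) - norm(A)    # df: for matrices, norm() defaults to the Frobenius norm
dot(A / norm(A), dA)      # ⟨∇f, dA⟩: the two should agree to about 8 digits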
Let’s consider another simple example:
Example 36
Fix some constant vectors x ∈ Rᵐ, y ∈ Rⁿ, and consider the function f : Rᵐ×ⁿ → R given by
f (A) = xT Ay.
What is ∇f ?
We have that
df = xᵀ dA y = tr(xᵀ dA y) = tr(y xᵀ dA) = ⟨ xyᵀ , dA ⟩ ,
so ∇f = xyᵀ.
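The same kind of finite-difference check works here (again a sketch, with arbitrary sizes of our choosing):

using LinearAlgebra
x, y = randn(4), randn(6)
A  = randn(4, 6)
dA = 1e-8 * randn(4, 6)
x' * (A + dA) * y - x' * A * y   # df
dot(x * y', dA)                  # ⟨xyᵀ, dA⟩, which should match df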
More generally, for any scalar-valued function f (A), from the definition of Frobenius inner product it follows
that:
df = f(A + dA) − f(A) = ⟨∇f, dA⟩ = Σᵢⱼ (∇f)ᵢⱼ dAᵢⱼ ,
and hence the components of the gradient are exactly the elementwise derivatives
(∇f)ᵢⱼ = ∂f / ∂Aᵢⱼ ,
similar to the component-wise definition of the gradient vector from multivariable calculus! But for non-trivial
matrix-input functions f (A) it can be extremely awkward to take the derivative with respect to each entry of
A individually. Using the “holistic” matrix inner-product definition, we will soon be able to compute even more
complicated matrix-valued gradients, including ∇(det A)!
For example, given any inner product ⟨u, v⟩, there is a corresponding norm ∥u∥ = √⟨u, u⟩. (Thus, every Hilbert space is also a Banach space.5 )
To define derivatives, we technically need both the input and the output to be Banach spaces. To see this,
recall our formalism
f (x + δx) − f (x) = f ′ (x)[δx] + o(δx) .
| {z } | {z }
linear smaller
To precisely define the sense in which the o(δx) terms are “smaller” or “higher-order,” we need norms. In particular,
the “little-o” notation o(δx) denotes any function such that
lim_{δx→0} ∥o(δx)∥ / ∥δx∥ = 0 ,
i.e. which goes to zero faster than linearly in δx. This requires both the input δx and the output (the function) to
have norms. This extension of differentiation to arbitrary normed/Banach spaces is sometimes called the Fréchet
derivative.
5 Proving the triangle inequality for an arbitrary inner product is not so obvious; one uses a result called the Cauchy–Schwarz
inequality.
6 Nonlinear Root-Finding, Optimization,
and Adjoint Differentiation
The next part is based on these slides. Today, we want to talk about why we are computing derivatives in the first
place. In particular, we will drill down on this a little bit and then talk about computation of derivatives.
6.1 Newton’s Method
For instance, suppose we have a scalar function f : R → R and we want to solve f (x) = 0 for a root x. Of
course, we could solve such an equation explicitly in simple cases, such as when f is linear or quadratic, but if
the function is something more arbitrary like f (x) = x³ − sin(cos x), you might not be able to obtain closed-form
solutions. However, there is a nice way to obtain the solution approximately to any accuracy you want, as long as
you know approximately where the root is. The method we are talking about is known as Newton’s method, which
is really a linear-algebra technique. It takes in the function and a guess for the root, approximates it by a straight
line (whose root is easy to find), which is then an approximate root that we can use as a new guess. In particular,
the method (depicted in Fig. 5) is as follows:
• Linearize f (x) near some x using the approximation
f(x + δx) ≈ f(x) + f′(x) δx,
and solve for the root of this linearization: δx = −f(x)/f′(x);
• and then use this to update the value of x we linearized near—i.e., letting the new x be
xnew = x + δx = x − f(x)/f′(x).
Once you are close to the root, Newton’s method converges amazingly quickly. As discussed below, it asymptotically
doubles the number of correct digits on every step!
One may ask what happens when f ′ (x) is not invertible, for instance here if f ′ (x) = 0. If this happens, then
Newton’s method may break down! See here for examples of when Newton’s method breaks down.
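To make this concrete, here is a minimal sketch of the scalar Newton iteration for the example function above (the names newton, f, and f′ are ours, not from the notes):

f(x)  = x^3 - sin(cos(x))
f′(x) = 3x^2 + cos(cos(x)) * sin(x)   # by the chain rule

function newton(f, f′, x; steps = 6)
    for i = 1:steps
        x -= f(x) / f′(x)             # one Newton step
    end
    return x
end

xroot = newton(f, f′, 1.0)            # starting guess x = 1
xroot, f(xroot)                       # f(xroot) should be ≈ 0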
We can generalize Newton’s method to multidimensional functions! Let f : Rn → Rn be a function which takes in
a vector and spits out a vector of the same size n. We can then apply a Newton approach in higher dimensions:
• Linearize f (x) near some x using the first-derivative approximation
f(x + δx) ≈ f(x) + f′(x) δx,
where f′(x) is now the n × n Jacobian matrix;
Figure 5: Single step of the scalar Newton’s method to solve f (x) = 0 for an example nonlinear function f (x) =
2 cos(x) − x + x2 /10. Given a starting guess (x = 2.3 in this example), we use f (x) and f ′ (x) to form a linear
(affine) approximation of f , and then our next step xnew is the root of this approximation. As long as the initial
guess is not too far from the root, Newton’s method converges extremely rapidly to the exact root (black dot).
• and then use this to update the value of x we linearized near—i.e., letting the new x be the root of this linear approximation:
xnew = x − f′(x)⁻¹ f(x).
That’s it! Once we have the Jacobian, we can just solve a linear system on each step. This again converges
amazingly fast, doubling the number of digits of accuracy in each step. (This is known as “quadratic convergence.”)
However, there is a caveat: we need some starting guess for x, and the guess needs to be sufficiently close to the
root for the algorithm to make reliable progress. (If you start with an initial x far from a root, Newton’s method
can fail to converge and/or it can jump around in intricate and surprising ways—google “Newton fractal” for some
fascinating examples.) This is a widely used and very practical application of Jacobians and derivatives!
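A sketch of the same idea in n dimensions, using the ForwardDiff.jl package (introduced in Sec. 8.1) to supply the Jacobian; the example system F and all names here are our own illustration:

using LinearAlgebra, ForwardDiff

F(x) = [x[1]^2 + x[2] - 3, x[1] - x[2]^3]    # an example system of 2 equations

function newton(F, x; steps = 10)
    for i = 1:steps
        J = ForwardDiff.jacobian(F, x)       # n × n Jacobian at the current x
        x = x - J \ F(x)                     # solve the linear system for the Newton step
    end
    return x
end

xroot = newton(F, [1.0, 1.0])                # F(xroot) ≈ [0, 0]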
6.2 Optimization
6.2.1 Nonlinear Optimization
A perhaps even more famous application of large-scale differentiation is to nonlinear optimization. Suppose we
have a scalar-valued function f : Rn → R, and suppose we want to minimize (or maximize) f . For instance, in
machine learning, we could have a big neural network (NN) with a vector x of a million parameters, and one tries to
minimize a “loss” function f that compares the NN output to the desired results on “training” data. The most basic
idea in optimization is to go “downhill” (see diagram) to make f as small as possible. If we can take the gradient
of this function f , to go “downhill” we consider −∇f , the direction of steepest descent, as depicted in Fig. 6.
Then, even if we have a million parameters, we can evolve all of them simultaneously in the downhill direction. It turns out that calculating all million derivatives costs about the same as evaluating the function at a single point, using reverse-mode (“adjoint”) differentiation as discussed below. In practice, however, there are some complications:
Figure 6: A steepest-descent algorithm minimizes a function f (x) by taking successive “downhill” steps in the
direction −∇f . (In the example shown here, we are minimizing a quadratic function in two dimensions x ∈ R2 ,
performing an exact 1d minimization in the downhill direction for each step.) Steepest-descent algorithms can
sometimes “zig-zag” along narrow valleys, slowing convergence (which can be counteracted in more sophisticated
algorithms by “momentum” terms, second-derivative information, and so on).
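For a concrete (if simplistic) feel for this, here is a tiny fixed-step steepest-descent iteration on a 2d quadratic; this toy example is ours, and practical algorithms choose step sizes adaptively (e.g. the exact 1d minimization in the figure):

∇f(x) = [2x[1], 10x[2]]                  # gradient of f(x) = x₁² + 5x₂²

function descend(∇f, x; η = 0.09, steps = 100)
    for i = 1:steps
        x -= η * ∇f(x)                   # step in the downhill direction −∇f
    end
    return x
end

descend(∇f, [9.0, 9.0])                  # → ≈ [0, 0], the minimum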
• We may also need to consider constraints, for instance minimizing f (x) subject to gk (x) ≤ 0 or hk (x) =
0, known as inequality/equality constraints. Points x satisfying the constraints are called “feasible”. One
typically uses a combination of ∇f and ∇gk to approximate (e.g. linearize) the problem and make progress
towards the best feasible point.
• If you just go straight downhill, you might “zig-zag” along narrow valleys, making convergence very slow. There
are a few options to combat this, such as “momentum” terms and conjugate gradients. Even fancier than
these techniques, one might estimate second-derivative “Hessian matrices” from a sequence of ∇f values—a
famous version of this is known as the BFGS algorithm—and use the Hessian to take approximate Newton
steps (for the root ∇f = 0). (We’ll return to Hessians in a later lecture.)
• Ultimately, there are a lot of techniques and a zoo of competing algorithms that you might need to experiment
with to find the best approach for a given problem. (There are many books on optimization algorithms, and
even a whole book can only cover a small slice of what is out there!)
Some parting advice: Often the main trick is less about the choice of algorithms than it is about finding the
right mathematical formulation of your problem—e.g. what function, what constraints, and what parameters should
you be considering—to match your problem to a good algorithm. However, if you have many (≫ 10) parameters,
try hard to use an analytical gradient (not finite differences), computed efficiently in reverse mode.
There are many, many applications of optimization besides machine learning (fitting models to data). It is inter-
esting to also consider engineering/physical optimization. (For instance, suppose you want to make an airplane
wing that is as strong as possible.) The general outline of such problems is typically:
1. You start with some design parameters p, e.g. describing the geometry, materials, forces, or other degrees of
freedom.
2. These p are then used in some physical model(s), such as solid mechanics, chemical reactions, heat transport,
electromagnetism, acoustics, etc. For example, you might have a linear model of the form A(p)x = b(p) for
some matrix A (typically very large and sparse).
3. The solution of the physical model is a solution x(p). For example, this could be the mechanical stresses,
chemical concentrations, temperatures, electromagnetic fields, etc.
4. The physical solution x(p) is the input into some design objective f (x(p)) that you want to improve/optimize.
For instance, strength, speed, power, efficiency, etc.
5. To maximize/minimize f (x(p)), one uses the gradient ∇p f , computed using reverse-mode/“adjoint” methods,
to update the parameters p and improve the design.
As a fun example, researchers have even applied “topology optimization” to design a chair, optimizing every voxel
of the design—the parameters p represent the material present (or not) in every voxel, so that the optimization
discovers not just an optimal shape but an optimal topology (how materials are connected in space, how many
holes there are, and so forth)—to support a given weight with minimal material. To see it in action, watch this
chair-optimization video. (People have applied such techniques to much more practical problems as well, from
airplane wings to optical communications.)
6.3 Reverse-Mode “Adjoint” Differentiation
Consider computing the gradient ∇g of a scalar function g(p) = f (x(p)), where x solves A(p)x = b (e.g. a parameterized physical model as in the previous section) and f (x) is a scalar-valued
function of x (e.g. an optimization objective depending on our physics solution). For example, this could arise in
an optimization problem
min_p g(p)  ⟺  min_p f(x) subject to A(p)x = b,
for which the gradient ∇g would be helpful to search for a local minimum. The chain rule for g corresponds to the
following conceptual chain of dependencies:
dg = f′(x)[dx]                       (dg ← dx)
   = f′(x)[d(A⁻¹) b]                 (dx ← d(A⁻¹))
   = −f′(x)A⁻¹ dA A⁻¹b               (d(A⁻¹) ← dA)
   = −vᵀ A′(p)[dp] A⁻¹b              (dA ← dp) .
Here, we are defining the row vector v T = f ′ (x)A−1 , and we have used the differential of a matrix inverse d(A−1 ) =
−A−1 dA A−1 from Sec. 7.3.
Grouping the terms left-to-right, we first solve the “adjoint” (transposed) equation AT v = f ′ (x)T = ∇x f for v,
and then we obtain dg = −v T dA x. Because the derivative A′ (p) of a matrix with respect to a vector is awkward
to write explicitly, it is convenient to examine this object one parameter at a time. For any given parameter pk ,
∂g/∂pk = −v T (∂A/∂pk )x (and in many applications ∂A/∂pk is very sparse); here, “dividing by” ∂pk works because
this is a scalar factor that commutes with the other linear operations. That is, it takes only two solves to get both
g and ∇g: one for solving Ax = b to find g(p) = f (x), and another with AT for v, after which all of the derivatives
∂g/∂pk are just some cheap dot products.
Note that you should not use right-to-left “forward-mode” derivatives with lots of parameters, because
∂g/∂pₖ = −f′(x) A⁻¹ (∂A/∂pₖ) x
represents one solve per parameter pk ! As discussed in Sec. 8.4, right-to-left (a.k.a. forward mode) is better
when there is one (or few) input parameters pk and many outputs, while left-to-right “adjoint” differentiation
(a.k.a. reverse mode) is better when there is one (or few) output values and many input parameters. (In Sec. 8.1,
we will discuss using dual numbers for differentiation, and this also corresponds to forward mode.)
Another possibility that might come to mind is to use finite differences (as in Sec. 4), but you should not use
this if you have lots of parameters! Finite differences would involve a calculation of something like
∂g/∂pₖ ≈ [g(p + ϵeₖ) − g(p)] / ϵ ,
where ek is a unit vector in the k-th direction and ϵ is a small number. This, however, requires one solve for
each parameter pk , just like forward-mode differentiation. (It becomes even more expensive if you use fancier
higher-order finite-difference approximations in order to obtain higher accuracy.)
6.3.1 Nonlinear equations
You can also apply adjoint/reverse differentiation to nonlinear equations. For instance, consider the gradient of the
scalar function g(p) = f (x(p)), where x(p) ∈ Rn solves some system of n equations h(p, x) = 0 ∈ Rn . By the chain
rule,
h(p, x) = 0 ⟹ (∂h/∂p) dp + (∂h/∂x) dx = 0 ⟹ dx = −(∂h/∂x)⁻¹ (∂h/∂p) dp .
(This is an instance of the Implicit Function Theorem: as long as ∂h/∂x is nonsingular, we can locally define a function x(p) from an implicit equation h = 0, here by linearization.) Hence,
dg = f′(x) dx = −vᵀ (∂h/∂p) dp ,  where vᵀ ≡ f′(x) (∂h/∂x)⁻¹ .
Associating left-to-right again leads to a single “adjoint” equation: (∂h/∂x)T v = f ′ (x)T = ∇x f . In other words,
it again only takes two solves to get both g and ∇g—one nonlinear “forward” solve for x and one linear “adjoint”
solve for v! Thereafter, all derivatives ∂g/∂pk are cheap dot products. (Note that the linear “adjoint” solve involves
the transposed Jacobian ∂h/∂x. Except for the transpose, this is very similar to the cost of a single Newton step
to solve h = 0 for x. So the adjoint problem should be cheaper than the forward problem.)
If you use automatic differentiation (AD) systems, why do you need to learn this stuff? Doesn’t the AD do
everything for you? In practice, however, it is often helpful to understand adjoint methods even if you use automatic
differentiation. Firstly, it helps you understand when to use forward- vs. reverse-mode automatic differentiation.
Secondly, many physical models call large software packages written over the decades in various languages that
cannot be differentiated automatically by AD. You can typically correct this by just supplying a “vector–Jacobian
product” y T dx for this physics, or even just part of the physics, and then AD will differentiate the rest and apply
the chain rule for you. Lastly, often models involve approximate calculations (e.g. for the iterative solution of
linear or nonlinear equations, numerical integration, and so forth), but AD tools often don’t “know” this and spend
extra effort trying to differentiate the error in your approximation; in such cases, manually written derivative rules
can sometimes be much more efficient. (For example, suppose your model involves solving a nonlinear system
h(x, p) = 0 by an iterative approach like Newton’s method. Naive AD will be very inefficient because it will
attempt to differentiate through all your Newton steps. Assuming that you converge your Newton solver to enough
accuracy that the error is negligible, it is much more efficient to perform differentiation via the implicit-function
theorem as described above, leading to a single linear adjoint solve.)
To finish off this section of the notes, we conclude with an example of how to use this “adjoint method” to compute
a derivative efficiently. Before working through the example, we first state the problem and highly recommend
trying it out before reading the solution.
Problem 38
Suppose that A(p) takes a vector p ∈ Rⁿ⁻¹ and returns the n × n tridiagonal real-symmetric matrix A(p) with
(A)ₖₖ = aₖ ,  (A)ₖ,ₖ₊₁ = (A)ₖ₊₁,ₖ = pₖ ,  and all other entries zero,
for a constant vector a ∈ Rⁿ, and define the scalar function
g(p) = (cᵀ A(p)⁻¹ b)²
for some constant vectors b, c ∈ Rⁿ (assuming we choose p and a so that A is invertible). Note that, in practice, A(p)⁻¹b is not computed by explicitly inverting the matrix A—instead, it can be computed in Θ(n) (i.e., roughly proportional to n) arithmetic operations using Gaussian elimination that takes advantage of the “sparsity” of A (the pattern of zero entries), a “tridiagonal solve.”
(a) Write down a formula for computing ∂g/∂p1 (in terms of matrix–vector products and matrix inverses).
(Hint: once you know dg in terms of dA, you can get ∂g/∂p1 by “dividing” both sides by ∂p1 , so that dA
becomes ∂A/∂p1 .)
(b) Outline a sequence of steps to compute both g and ∇g (with respect to p) using only two tridiagonal
solves x = A−1 b and an “adjoint” solve v = A−1 (something), plus Θ(n) (i.e., roughly proportional to n)
additional arithmetic operations.
(c) Write a program implementing your ∇g procedure (in Julia, Python, Matlab, or any language you want)
from the previous part. (You don’t need to use a fancy tridiagonal solve if you don’t know how to do this
in your language; you can solve A−1 (vector) inefficiently if needed using your favorite matrix libraries.)
Implement a finite-difference test: Choose a, b, c, p at random, and check that ∇g · δp ≈ g(p + δp) − g(p)
(to a few digits) for a randomly chosen small δp.
Problem 38(a) Solution: From the chain rule and the formula for the differential of a matrix inverse, we
have dg = −2(cT A−1 b)cT A−1 dA A−1 b (noting that cT A−1 b is a scalar so we can commute it as needed). Hence
∂g/∂p₁ = vᵀ (∂A/∂p₁) x = v₁x₂ + v₂x₁ ,
where vᵀ ≡ −2(cᵀA⁻¹b) cᵀA⁻¹, x ≡ A⁻¹b, and ∂A/∂p₁ is the matrix whose only nonzero entries are the 1’s in positions (1, 2) and (2, 1); we have simplified the result in terms of x and v for the next part.
Problem 38(b) Solution: Using the notation from the previous part, exploiting the fact that Aᵀ = A, we can choose v = A⁻¹[−2(cᵀx)c] , which is a single tridiagonal solve. Given x and v, the results of our two Θ(n)
tridiagonal solves, we can compute each component of the gradient similar to above by ∂g/∂pk = vk xk+1 + vk+1 xk
for k = 1, . . . , n − 1, which costs Θ(1) arithmetic per k and hence Θ(n) arithmetic to obtain all of ∇g.
Problem 38(c) Solution: See the Julia solution notebook (Problem 1) from our IAP 2023 course (which calls
the function f rather than g).
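For concreteness, here is a minimal sketch of such a program, using SymTridiagonal from the LinearAlgebra standard library for the Θ(n) solves (the names and details here are ours and differ from the course notebook):

using LinearAlgebra

A(a, p) = SymTridiagonal(a, p)          # tridiagonal: T \ b costs only Θ(n)

g(a, b, c, p) = (c' * (A(a, p) \ b))^2

function ∇g(a, b, c, p)
    T = A(a, p)
    x = T \ b                           # forward solve
    v = T \ (-2 * (c' * x) * c)         # adjoint solve (using Aᵀ = A)
    return [v[k] * x[k+1] + v[k+1] * x[k] for k = 1:length(p)]
end

# finite-difference test:
n = 6
a, b, c, p = randn(n), randn(n), randn(n), randn(n - 1)
δp = 1e-8 * randn(n - 1)
∇g(a, b, c, p)' * δp, g(a, b, c, p + δp) - g(a, b, c, p)   # should agree to a few digits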
7 Derivative of Matrix Determinant and Inverse
Theorem 39
Given a square matrix A, we have
d(det A) = tr(adj(A) dA) ⟹ ∇(det A) = adj(A)ᵀ = cofactor(A) ,
where adj is the “adjugate”. (You may not have heard of the matrix adjugate, but this formula tells us that it is simply adj(A) = det(A)A⁻¹, or cofactor(A) = adj(Aᵀ).) Furthermore, when A is invertible,
d(det A) = det(A) tr(A⁻¹ dA) .
You may remember that each entry (i, j) of the cofactor matrix is (−1)^{i+j} times the determinant obtained by deleting row i and column j from A. Here are some 2 × 2 calculations to obtain some intuition about these functions:
M = ( a  c ; b  d )    (4)
⟹ cofactor(M) = ( d  −b ; −c  a )    (5)
adj(M) = ( d  −c ; −b  a )    (6)
M⁻¹ = [1/(ad − bc)] ( d  −c ; −b  a ) .    (7)
Numerically, as is done in the notebook, you can construct a random n × n matrix A (say, 9 × 9), consider e.g. dA = .00001A, and see numerically that
det(A + dA) − det(A) ≈ tr(adj(A) dA) = det(A) tr(A⁻¹ dA) .
To see why, recall the cofactor expansion of the determinant along any row i: det A = Σⱼ Aᵢⱼ Cᵢⱼ, where C = cofactor(A). Thus,
∂(det A)/∂Aᵢⱼ = Cᵢⱼ ⟹ ∇(det A) = C ,
the cofactor matrix. (In computing these partial derivatives, it’s important to remember that the cofactor Cij
contains no elements of A from row i or column j. So, for example, Ai1 only appears explicitly in the first term,
and not hidden in any of the C terms in this expansion.)
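A sketch of this numerical check (the notebook’s exact code may differ):

using LinearAlgebra
A  = randn(9, 9)
dA = 0.00001 * A
det(A + dA) - det(A)      # d(det A), approximately
det(A) * tr(A \ dA)       # det(A) tr(A⁻¹ dA): should agree to several digits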
There is also a fancier proof of the theorem using linearization near the identity. Firstly, note that it is easy to see from the properties of determinants that
det(I + dA) = 1 + tr(dA) + (higher-order terms),
and thus
d(det A) = det(A + dA) − det A = det(A)[det(I + A⁻¹ dA) − 1] = det(A) tr(A⁻¹ dA) .
7.2 Applications
7.2.1 Characteristic Polynomial
We now use this as an application to find the derivative of a characteristic polynomial evaluated at x. Let p(x) =
det(xI − A), a scalar function of x. Recall that through factorization, p(x) may be written in terms of eigenvalues
λi . So we may ask: what is the derivative of p(x), the characteristic polynomial at x? Using freshman calculus, we
could simply compute
(d/dx) ∏ᵢ (x − λᵢ) = Σᵢ ∏_{j≠i} (x − λⱼ) = [∏ᵢ (x − λᵢ)] { Σᵢ (x − λᵢ)⁻¹ } ,
as long as x ≠ λᵢ.
This is a perfectly good simple proof, but with our new technology we have a new proof:
p′(x) dx = d(det(xI − A)) = det(xI − A) tr[(xI − A)⁻¹ d(xI − A)] = det(xI − A) tr[(xI − A)⁻¹] dx .
Note that here we used that d(xI − A) = dx I when A is constant and tr(A dx) = tr(A)dx since dx is a scalar.
We may again check this computationally as we do in the notebook.
Relatedly, we can compute the logarithmic derivative of the determinant:
d(log(det A)) = d(det A) / det A = det(A⁻¹) d(det A) = tr(A⁻¹ dA) .
The logarithmic derivative shows up a lot in applied mathematics. Note that here we use that 1/det A = det(A⁻¹), as
1 = det(I) = det(AA⁻¹) = det(A) det(A⁻¹) .
For instance, recall Newton’s method to find roots f (x) = 0 of single-variable real-valued functions f (x) by taking a sequence of steps x → x + δx. The key formula in Newton’s method is δx = f′(x)⁻¹f(x), but this is the same as 1/(log f(x))′. So, derivatives of log determinants show up in finding roots of determinants, i.e. for
f (x) = det M (x). When M (x) = A − xI, roots of the determinant are eigenvalues of A. For more general functions
M (x), solving det M (x) = 0 is therefore called a nonlinear eigenproblem.
7.3 Jacobian of the Inverse
Lastly, we compute the derivative (as both a linear operator and an explicit Jacobian matrix) of the inverse of a
matrix. There is a neat trick to obtain this derivative, simply from the property A⁻¹A = I of the inverse. By the product rule, this implies that
0 = dI = d(A⁻¹A) = d(A⁻¹) A + A⁻¹ dA
⟹ d(A⁻¹) = −A⁻¹ dA A⁻¹ .
Here, the second line defines a perfectly good linear operator for the derivative (A⁻¹)′, but if we want we can
rewrite this as an explicit Jacobian matrix by using Kronecker products acting on the “vectorized” matrices as we
did in Sec. 3:
vec d(A⁻¹) = vec(−A⁻¹ dA A⁻¹) = −(A⁻ᵀ ⊗ A⁻¹) vec(dA) ,
where −(A⁻ᵀ ⊗ A⁻¹) is the Jacobian and A⁻ᵀ denotes (A⁻¹)ᵀ = (Aᵀ)⁻¹. One can check this formula numerically, as is done in the notebook.
In practice, however, you will probably find that the operator expression −A⁻¹ dA A⁻¹ is more useful than the explicit Jacobian matrix for taking derivatives involving matrix inverses. For example, if you have a matrix-valued function A(t) of a scalar parameter t ∈ R, you immediately obtain
d(A⁻¹)/dt = −A⁻¹ (dA/dt) A⁻¹ .
A more sophisticated application is discussed in Sec. 6.3.
8 Forward and Reverse-Mode Automatic Differentiation
The first time that Professor Edelman had heard about automatic differentiation (AD), it was easy for him to
imagine what it was . . . but what he imagined was wrong! In his head, he thought it was straightforward symbolic
differentiation applied to code—sort of like executing Mathematica or Maple, or even just automatically doing what
he learned to do in his calculus class. For instance, just plugging in functions and their domains from something
like the following first-year calculus table:
Derivative Domain
(sin x)′ = cos x −∞ < x < ∞
(cos x)′ = − sin x −∞ < x < ∞
(tan x)′ = sec2 x x ̸= π2 + πn, n ∈ Z
(cot x)′ = − csc2 x x ̸= πn, n ∈ Z
(sec x)′ = tan x sec x x ̸= π2 + πn, n ∈ Z
(csc x)′ = − cot x csc x x ̸= πn, n ∈ Z
And in any case, if it wasn’t just like executing Mathematica or Maple, then it must be finite differences, like one
learns in a numerical computing class (or as we did in Sec. 4).
It turns out that it is definitely not finite differences—AD algorithms are generally exact (in exact arithmetic,
neglecting roundoff errors), not approximate. But it also doesn’t look much like conventional symbolic algebra:
the computer doesn’t really construct a big “unrolled” symbolic expression and then differentiate it, the way you
might imagine doing by hand or via computer-algebra software. For example, imagine a computer program that
computes det A for an n × n matrix—writing down the “whole” symbolic expression isn’t possible until the program
runs and n is known (e.g. input by the user), and in any case a naive symbolic expression would require n! terms.
Thus, AD systems have to deal with computer-programming constructs like loops, recursion, and problem sizes n
that are unknown until the program runs, while at the same time avoiding constructing symbolic expressions whose
size becomes prohibitively large. (See Sec. 8.1.1 for an example that looks very different from the formulas you
differentiate in first-year calculus.) Design of AD systems often ends up being more about compilers than about
calculus!
We start with a simple example, an algorithm for the square-root function, where a practical method of automatic differentiation came as both a mathematical surprise and a computing wonder for Professor Edelman. In particular, we consider the “Babylonian” algorithm to compute √x, known for millennia (and later revealed as a special case of Newton’s method applied to t² − x = 0): simply repeat t ← (t + x/t)/2 until t converges to √x to any desired accuracy. Each iteration has one addition and two divisions. For illustration purposes, 10 iterations suffice. Here is a short program in Julia that implements this algorithm, starting with a guess of 1 and then performing N steps (defaulting to N = 10):
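(The listing below is a minimal reconstruction consistent with this description; the notebook’s original may differ cosmetically.)

function Babylonian(x; N = 10)
    t = (1 + x) / 2            # the first iteration, starting from the guess t = 1
    for i = 2:N
        t = (t + x / t) / 2    # t ← (t + x/t)/2
    end
    return t
end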
If we run this function to compute the square root of x = 4, we will see that it converges very quickly: for only
N = 3 steps, it obtains the correct answer (2) to nearly 3 decimal places, and well before N = 10 steps it has
converged to 2 within the limits of the accuracy of computer arithmetic (about 16 digits). In fact, it roughly
doubles the number of correct digits on every step:
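For example, with the implementation above (values shown to a few digits):

[Babylonian(4, N = N) for N = 1:4]   # ≈ [2.5, 2.05, 2.00061, 2.00000009]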
The basic idea of carrying the chain rule through a computer program is very simple: replace every number with
two numbers, one which keeps track of the value and one which tracks the derivative of that value. The values are
computed the same way as before, and the derivatives are computed by carrying out the chain rule for elementary
operations like + and /.
In Julia, we can implement this idea by defining a new type of number, which we’ll call D, that encapsulates a
value val and a derivative deriv.
julia> struct D <: Number
val::Float64
deriv::Float64
end
(A detailed explanation of Julia syntax can be found elsewhere, but hopefully you can follow the basic ideas even
if you don’t understand every punctuation mark.) A quantity x = D(a,b) of this new type has two components
x.val = a and x.deriv = b, which we will use to represent values and derivatives, respectively. The Babylonian
code only uses two arithmetic operations, + and /, so we just need to overload the built-in (“Base”) definitions of
these in Julia to include new rules for our D type:
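Rules along the following lines suffice for the Babylonian code (a sketch; the notes’ notebook may differ in details):

Base.:+(x::D, y::D) = D(x.val + y.val, x.deriv + y.deriv)                               # sum rule
Base.:/(x::D, y::D) = D(x.val / y.val, (y.val * x.deriv - x.val * y.deriv) / y.val^2)   # quotient rule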
If you look closely, you’ll see that the values are just added and divided in the ordinary way, while the derivatives
are computed using the sum rule (adding the derivatives of the inputs) and the quotient rule, respectively. We also
need one other technical trick: we need to define “conversion” and “promotion” rules that tell Julia how to combine
D values with ordinary real numbers, as in expressions like x + 1 or x/2:
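Minimal versions of these rules look like:

Base.convert(::Type{D}, r::Real) = D(r, 0)          # a real constant has derivative 0
Base.promote_rule(::Type{D}, ::Type{<:Real}) = D    # combine D with reals by converting to D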
This just says that an ordinary real number r is combined with a D value by first converting r to D(r,0): the value
is r and the derivative is 0 (the derivative of any constant is zero).
Given these definitions, we can now plug a D value into our unmodified Babylonian function, and it will
“magically” compute the derivative of the square root. Let’s try it for x = 49 = 72 :
julia> x = 49
49
julia> Babylonian(D(x,1))
D(7.0, 0.07142857142857142)
We can see that it correctly returned a value of 7.0 and a derivative of 0.07142857142857142, which indeed matches the square root √49 and its derivative 0.5/√49:
julia> (√x, 0.5/√x)
(7.0, 0.07142857142857142)
Why did we input D(x,1)? Where did the 1 come from? That’s simply the fact that the derivative of the input x
with respect to itself is (x)′ = 1, so this is the starting point for the chain rule.
In practice, all this (and more) has already been implemented in the ForwardDiff.jl package in Julia (and in
many similar software packages in a variety of languages). That package hides the implementation details under
the hood and explicitly provides a function to compute the derivative. For example:
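A typical usage sketch:

julia> using ForwardDiff

julia> ForwardDiff.derivative(Babylonian, 49)
0.07142857142857142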
Essentially, however, this is the same as our little D implementation, but implemented with greater generality and
sophistication (e.g. chain rules for more operations, support for more numeric types, partial derivatives with respect
to multiple variables, etc.): just as we did, ForwardDiff augments every value with a second number that tracks
the derivative, and propagates both quantities through the calculation.
We could have also implemented the same idea specifically for the Babylonian algorithm, by writing a new function dBabylonian that tracks both the variable t and its derivative t′ = dt/dx through the course of the calculation:
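A sketch of such a function, obtained by differentiating each step of Babylonian with respect to x by hand:

function dBabylonian(x; N = 10)
    t  = (1 + x) / 2
    t′ = 1 / 2                    # derivative of the first iterate (1 + x)/2
    for i = 2:N
        t, t′ = (t + x / t) / 2, (t′ + (t - x * t′) / t^2) / 2   # update t and t′ = dt/dx together
    end
    return t′
end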
julia> dBabylonian(49)
0.07142857142857142
There is a pleasing algebraic way to think about our new number type D(a, b) instead of the “value & derivative”
viewpoint above. Remember how a complex number a + bi is formed from two real numbers (a, b) by defining a
special new quantity i (the imaginary unit) that satisfies i2 = −1, and all the other complex-arithmetic rules follow
from this? Similarly, we can think of D(a, b) as a + bϵ, where ϵ is a new “infinitesimal unit” quantity that satisfies
ϵ2 = 0. This viewpoint is called a dual number.
Given the elementary rule ϵ2 = 0, the other algebraic rules for dual numbers immediately follow:
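(a + bϵ) ± (c + dϵ) = (a ± c) + (b ± d)ϵ ,
(a + bϵ)(c + dϵ) = ac + (ad + bc)ϵ + bd ϵ² = ac + (ad + bc)ϵ ,
(a + bϵ)/(c + dϵ) = a/c + [(bc − ad)/c²] ϵ   (for c ≠ 0),
which are exactly the sum/difference, product, and quotient rules for the “derivative” coefficient of ϵ.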
It’s also nice to add a “pretty printing” rule to make Julia display dual numbers as a + bϵ rather than as D(a,b):
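One-line sketches of the printing rule, and of the multiplication rule used in the next paragraph:

Base.show(io::IO, x::D) = print(io, x.val, " + ", x.deriv, "ϵ")             # display as a + bϵ
Base.:*(x::D, y::D) = D(x.val * y.val, x.val * y.deriv + x.deriv * y.val)   # product rule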
Once we implement the multiplication rule for dual numbers in Julia, then ϵ2 = 0 follows from the special case
a = c = 0 and b = d = 1:
julia> ϵ = D(0,1)
0.0 + 1.0ϵ
julia> ϵ * ϵ
0.0 + 0.0ϵ
julia> ϵ^2
0.0 + 0.0ϵ
(We didn’t define a rule for powers D(a, b)ⁿ, so how did it compute ϵ²? The answer is that Julia implements xⁿ via repeated multiplication by default, so it sufficed to define the * rule.) Now, we can compute the derivative of
the Babylonian algorithm at x = 49 as above by:
julia> Babylonian(x + ϵ)
7.0 + 0.07142857142857142ϵ
with the “infinitesimal part” being the derivative 0.5/√49 = 0.0714 · · · .
A nice thing about this dual-number viewpoint is that it corresponds directly to our notion of a derivative as linearization:
f(x + ϵ) = f(x) + f′(x)ϵ + (higher-order terms) ,
where the higher-order terms vanish identically because ϵ² = 0. It is instructive to contrast this with what happens if we instead run the Babylonian iteration symbolically, producing an explicit formula for each iterate as a function of x: one iteration gives (1 + x)/2, and two iterations give
(x² + 6x + 1) / (4(x + 1)) .
Continuing this process by hand is quite tedious, but fortunately the computer can do it for us (as shown in the accompanying Julia notebook). Three Babylonian iterations yield
(x⁴ + 28x³ + 70x² + 28x + 1) / (8(x³ + 7x² + 7x + 1)),
and five iterations produce the enormous expression:
x16 +496x15 +35960x14 +906192x13 +10518300x12 +64512240x11 +225792840x10 +471435600x9 +601080390x8 +471435600x7 +225792840x6 +64512240x5 +10518300x4 +906192x3 +35960x2 +496x+1
32(x15 +155x14 +6293x13 +105183x12 +876525x11 +4032015x10 +10855425x9 +17678835x8 +17678835x7 +10855425x6 +4032015x5 +876525x4 +105183x3 +6293x2 +155x+1) .
Notice how quickly these grow—in fact, the degree of the polynomials doubles on every iteration! Now, if we take
the symbolic derivatives of these functions using our ordinary calculus rules, and simplify (with the help of the
computer), the derivative of one iteration is 12 , of two iterations is
x2 + 2x + 5
,
4 (x2 + 2x + 1)
of three iterations is
x6 + 14x5 + 147x4 + 340x3 + 375x2 + 126x + 21
,
8 (x6 + 14x5 + 63x4 + 100x3 + 63x2 + 14x + 1)
of four iterations is
x14 +70x13 +3199x12 +52364x11 +438945x10 +2014506x9 +5430215x8 +8836200x7 +8842635x6 +5425210x5 +2017509x4 +437580x3 +52819x2 +3094x+85
16(x14 +70x13 +1771x12 +20540x11 +126009x10 +440986x9 +920795x8 +1173960x7 +920795x6 +440986x5 +126009x4 +20540x3 +1771x2 +70x+1) ,
and of five iterations is a monstrosity you can only read by zooming in:
x30 +310x29 +59799x28 +4851004x27 +215176549x26 +5809257090x25 +102632077611x24 +1246240871640x23 +10776333438765x22 +68124037776390x21 +321156247784955x20 +1146261110726340x19 +3133113888931089x18 +6614351291211874x17 +10850143060249839x16 +13883516068991952x15 +13883516369532147x14 +10850142795067314x13 +6614351497464949x12 +3133113747810564x11 +1146261195398655x10 +321156203432790x9 +68124057936465x8 +10776325550040x7 +1246243501215x6 +102631341330x5 +5809427001x4 +215145084x3 +4855499x2 +59334x+341
32(x30 +310x29 +36611x28 +2161196x27 +73961629x26 +1603620018x25 +23367042639x24 +238538538360x23 +1758637118685x22 +9579944198310x21 +39232152623175x20 +122387258419860x19 +293729420641881x18 +546274556891506x17 +791156255418003x16 +894836006026128x15 +791156255418003x14 +546274556891506x13 +293729420641881x12 +122387258419860x11 +39232152623175x10 +9579944198310x9 +1758637118685x8 +238538538360x7 +23367042639x6 +1603620018x5 +73961629x4 +2161196x3 +36611x2 +310x+1) .
This is a terrible way to compute derivatives! (However, more sophisticated approaches to efficient symbolic differen-
tiation exist, such as the “D∗ ” algorithm, that avoid explicit giant formulas by exploiting repeated subexpressions.)
To be clear, the dual number approach (absent rounding errors) computes an answer exactly as if it evaluated
these crazy expressions at some particular x, but the words “as if” are very important here. As you can see, we do
not form these expressions, let alone evaluate them. We merely compute results that are equal to the values we
would have gotten if we had.
8.3 Automatic Differentiation via Computational Graphs
Let’s now get into automatic differentiation via computational graphs. For this section, we consider the following
simple motivating example.
Example 40
Define the following functions:
a(x, y) = sin x
b(x, y) = a(x, y) / y
z(x, y) = b(x, y) + x .
Compute ∂z/∂x and ∂z/∂y.
There are a few ways to solve this problem. Firstly, of course, one can compute this symbolically, noting that
z(x, y) = b(x, y) + x = (1/y) a(x, y) + x = (sin x)/y + x ,
which implies
∂z/∂x = (cos x)/y + 1   and   ∂z/∂y = −(sin x)/y² .
However, one can also use a Computational Graph (see Figure 7 below), where the edge from node A to node B is labelled with ∂B/∂A.
[Computational graph: edges x → a(x, y) → b(x, y) → z(x, y) with labels cos x, 1/y, and 1; a direct edge x → z with label 1; and an edge y → b(x, y) with label −a(x, y)/y².]
Figure 7: A computational graph corresponding to example 40, representing the computation of an output z(x, y)
from two inputs x, y, with intermediate quantities a(x, y) and b(x, y). The nodes are labelled by values, and edges
are labelled with the derivatives of the values with respect to the preceding values.
Now how do we use this directed acyclic graph (DAG) to find the derivatives? Well one view (called the
“forward view”) is given by following the paths from the inputs to the outputs and (left) multiplying as you go,
adding together multiple paths. For instance, following this procedure for paths from x to z(x, y), we have
∂z/∂x = 1 · (1/y) · cos x + 1 = (cos x)/y + 1 .
Similarly, for paths from y to z(x, y), we have
∂z/∂y = 1 · (−a(x, y)/y²) = −(sin x)/y² ,
and if you have numerical derivatives on the edges, this algorithm works. Alternatively, you could take a reverse view and follow the paths backwards (multiplying right to left), and obtain the same result. Note that there is nothing magic about these being scalars here: you could imagine that these functions are of the types we have been studying in this class and do the same computations! The only thing that matters here fundamentally is associativity. However, when considering vector- or matrix-valued functions, the order in which you multiply the edge weights is vitally important (matrix products are generally not commutative).
The graph-theoretic way of thinking about this is to consider “path products”. A path product is the product
of edge weights as you traverse a path. In this way, we are interested in the sum of path products from inputs
to outputs to compute derivatives using computational graphs. Clearly, we don’t particularly care which order we
traverse the paths as long as the order we take the product in is correct. In this way, forward and reverse-mode
automatic differentiation is not so mysterious.
Let’s take a closer view of the implementation of forward-mode automatic differentiation. Suppose we are at a
node A during the process of computing the derivative of a computational graph, as shown in the figure below:
[Diagram: nodes B₁, B₂, B₃ each with an edge into node A, and an edge from A to f(A) labelled ∂f(A)/∂A.]
Suppose we know the path product P of all the edges up to and including the one from B₂ to A. Then what is the new path product as we move to the right from A? It is f′(A) · P ! So we need a data structure that maps in the following way:
(value, path product) ↦ (f(value), f′ · path product).
In some sense, this is another way to look at the dual numbers: taking in our path products and spitting out values. In any case, we overload our program, which can easily calculate f(value), to tack on f′ · (path product). One might ask how our program starts: this is how the program works in the “middle,” but what should our starting value be? Well, the only thing it can be for this method to work is (x, 1). Then, at every step you apply the map listed above.
What happens when a node z = f(a, b) is influenced by two previous nodes, say (a, p) and (b, q), with edge labels ∂z/∂a and ∂z/∂b? What should the new output value be? We want to add the two path products together, obtaining
( f(a, b), (∂z/∂a) p + (∂z/∂b) q ).
This of course generalizes if we have many different nodes on the left side of the graph.
If we come up with such a data structure for all of the simple computations (addition/subtraction, multiplication,
and division), and if this is all we need for our computer program, then we are set! Here is how we define the
structure for addition/subtraction, multiplication, and division.
Addition/Subtraction: for z = a₁ ± a₂ the edge weights are ∂z/∂a₁ = 1 and ∂z/∂a₂ = ±1, giving
((a₁, p), (a₂, q)) ↦ (a₁ ± a₂, p ± q).
Multiplication: for z = a₁a₂ the edge weights are ∂z/∂a₁ = a₂ and ∂z/∂a₂ = a₁, giving
((a₁, p), (a₂, q)) ↦ (a₁a₂, a₂p + a₁q).
Division: for z = a₁/a₂ the edge weights are ∂z/∂a₁ = 1/a₂ and ∂z/∂a₂ = −a₁/a₂², giving
((a₁, p), (a₂, q)) ↦ (a₁/a₂, p/a₂ − (a₁/a₂²) q).
When we do reverse mode, we have arrows going the other direction, which we will understand in this section of the notes. In forward mode it was all about “what do we depend on,” i.e. computing the derivative on the right-hand side of the above diagram using the functions in the nodes on the left. In reverse mode, the question is really “what are we influenced by?” or “what do we influence later?”
When going “backwards,” we need to know what nodes a given node influences. For instance, given a node A, we want to know the nodes Bᵢ that are influenced by, or depend on, node A. So now our diagram looks like this:
[Diagram: everything starts at the final node (z, 1) on the right; a node (a, ∂z/∂a) is influenced by nodes (b₁, ∂z/∂b₁), (b₂, ∂z/∂b₂), (b₃, ∂z/∂b₃), and we work back to the input node (x, ∂z/∂x) on the left.]
So now, we eventually have a final node (z, 1) (far on the right-hand side) where everything starts. This time, all of our multiplications take place from right to left as we are in reverse mode. Our goal is to be able to calculate the node (x, ∂z/∂x). So if we know how to fill in the ∂z/∂a term, we will be able to go from right to left in these computational graphs (i.e., in reverse mode). In fact, the formula for getting ∂z/∂a is given by
∂z/∂a = Σᵢ₌₁ˢ (∂bᵢ/∂a)(∂z/∂bᵢ) ,
where the bᵢ come from the nodes that are influenced by the node A. This is again just another chain rule like
from calculus, but you can also view this as multiplying the sums of all the weights in the graph influenced by A.
Why can reverse mode be more efficient than forward mode? One reason is that it can save data and reuse it: the partial path products from each node to the output can be computed once and then shared among all of the inputs that influence that node, rather than recomputed per input.
[Figure: an example graph with input x, intermediate nodes p, c, q, d, and output z.]
There are also important practical trade-offs between the two modes:
• Reverse-mode differentiation proceeds in the opposite direction to ordinary computation. This makes reverse-
mode AD much more complicated to implement, and adds a lot of storage overhead to the function compu-
tation. First you evaluate the function from inputs to outputs, but you (or the AD system) keep a record (a
“tape”) of all the intermediate steps of the computation; then, you run the computation in reverse (“play the
tape backwards”) to backpropagate the derivatives.
As a result of these practical advantages, even for the case of many (n > 1) inputs and a single (m = 1) output,
practitioners tell us that they’ve found forward mode to be more efficient until n becomes sufficiently large (perhaps
even until n > 100, depending on the function being differentiated and the AD implementation). (You may also be
interested in the blog post Engineering Trade-offs in AD by Chris Rackauckas, which is mainly about reverse-mode
implementations.)
If n = m, where neither approach has a scaling advantage, one typically prefers the lower overhead and simplicity
of forward-mode differentiation. This case arises in computing explicit Jacobian matrices for nonlinear root-finding
(Sec. 6.1), or Hessian matrices of second derivatives (Sec. 12), for which one often uses forward mode. . . or even a
combination of forward and reverse modes, as discussed below.
Of course, forward and reverse are not the only options. The chain rule is associative, so there are many possible
orderings (e.g. starting from both ends and meeting in the middle, or vice versa). A difficult6 problem that may
often require hybrid schemes is to compute Jacobians (or Hessians) in a minimal number of operations, exploiting
any problem-specific structure (e.g. sparsity: many entries may be zero). Discussion of this and other AD topics
can be found, in vastly greater detail than in these notes, in the book Evaluating Derivatives (2nd ed.) by Griewank
and Walther (2008).
Often, a combination of forward- and reverse-mode differentiation is advantageous when computing second deriva-
tives, which arise in many practical applications.
Hessian computation: For example, let us consider a function f (x) : Rn → R mapping n inputs x to a
single scalar. The first derivative f ′ (x) = (∇f )T is best computed by reverse mode if n ≫ 1 (many inputs). Now,
however, consider the second derivative, which is the derivative of g(x) = ∇f , mapping n inputs x to n outputs
∇f . It should be clear that g ′ (x) is therefore an n × n Jacobian matrix, called the Hessian of f , which we will
discuss much more generally in Sec. 12. Since g(x) has the same number of inputs and outputs, neither forward nor
reverse mode has an inherent scaling advantage, so typically forward mode is chosen for g ′ thanks to its practical
simplicity, while still computing ∇f in reverse-mode. That is, we compute ∇f by reverse mode, but then compute
g ′ = (∇f )′ by applying forward-mode differentiation to the ∇f algorithm. This is called a forward-over-reverse
algorithm.
An even more clear-cut application of forward-over-reverse differentiation is to Hessian–vector products. In
many applications, it turns out that what is required is only the product (∇f )′ v of the Hessian (∇f )′ with an
arbitrary vector v. In this case, one can completely avoid computing (or storing) the Hessian matrix explicitly,
and incur computational cost proportional only to that of a single function evaluation f (x). The trick is to recall
(from Sec. 2.2.1) that, for any function g, the linear operation g′(x)[v] is a directional derivative, equivalent to a single-variable derivative ∂/∂α [g(x + αv)] evaluated at α = 0. Here, we simply apply that rule to the function g(x) = ∇f , and obtain the following formula for a Hessian–vector product:
(∇f)′ v = ∂/∂α [ ∇f |_{x+αv} ] at α = 0.
6 In fact, extraordinarily difficult: “NP-complete” (Naumann, 2006).
Computationally, the inner evaluation of the gradient ∇f at an arbitrary point x+αv can be accomplished efficiently
by a reverse/adjoint/backpropagation algorithm. In contrast, the outer derivative with respect to a single input α
is best performed by forward-mode differentiation.7 Since the Hessian matrix is symmetric (as discussed in great
generality by Sec. 12), the same algorithm works for vector–Hessian products v T (∇f )′ = [(∇f )′ v]T , a fact that
we employ in the next example.
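With the ForwardDiff.jl and Zygote.jl packages used in the example below, this Hessian–vector product can be sketched in one line (hvp is our name; this is illustrative, not the notes’ code):

using ForwardDiff, Zygote

hvp(f, x, v) = ForwardDiff.derivative(α -> Zygote.gradient(f, x + α * v)[1], 0.0)

hvp(x -> sum(abs2, x), randn(3), [1.0, 0.0, 0.0])   # Hessian is 2I, so this returns ≈ [2, 0, 0]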
Scalar-valued functions of gradients: There is another common circumstance in which one often combines
forward and reverse differentiation, but which can appear somewhat more subtle, and that is in differentiating a
scalar-valued function of a gradient of another scalar-valued function. Consider the following example:
Example 41
Let f (x) : Rn 7→ R be a scalar-valued function of n ≫ 1 inputs with gradient ∇f |x = f ′ (x)T , and let
g(z) : Rn 7→ R be another such function with gradient ∇g|z = g ′ (z)T . Now, consider the scalar-valued
function h(x) = g( ∇f |x ) : Rn 7→ R and compute ∇h|x = h′ (x)T .
Denote z = ∇f |x . By the chain rule, h′ (x) = g ′ (z)(∇f )′ (x), but we want to avoid explicitly computing
the large n × n Hessian matrix (∇f )′ . Instead, as discussed above, we use the fact that such a vector–Hessian
product is equivalent (by symmetry of the Hessian) to the transpose of a Hessian–vector product multiplying
the Hessian (∇f )′ with the vector ∇g = g ′ (z)T , which is equivalent to a directional derivative:
∇h|ₓ = h′(x)ᵀ = ∂/∂α [ ∇f |_{x+α ∇g|_z} ] at α = 0,
involving differentiation with respect to a single scalar α ∈ R. As for any Hessian–vector product, therefore,
we can evaluate h and ∇h by:
1. Evaluate h(x): evaluate z = ∇f|ₓ by reverse mode, and plug it into g(z).
2. Evaluate ∇h: evaluate v = ∇g|_z by reverse mode, and then compute the single-variable directional derivative ∂/∂α [∇f|_{x+αv}] at α = 0 by forward mode.
This is a “forward-over-reverse” algorithm, where forward mode is used efficiently for the single-input derivative with respect to α ∈ R, combined with reverse mode to differentiate with respect to x, z ∈ Rⁿ.
Example Julia code implementing the above “forward-over-reverse” process for just such an h(x) = g(∇f) function is given below. Here, the forward-mode differentiation with respect to α is implemented by the ForwardDiff.jl package discussed in Sec. 8.1, while the reverse-mode differentiation with respect to x or z is performed by the Zygote.jl package. First, let’s import the packages and define simple example functions f(x) = 1/∥x∥ and g(z) = (Σₖ zₖ)³, along with the computation of h via Zygote:
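A sketch along those lines (the notebook’s exact code may differ):

using ForwardDiff, Zygote, LinearAlgebra

f(x) = 1 / norm(x)
g(z) = sum(z)^3

∇f(x) = Zygote.gradient(f, x)[1]    # reverse mode
h(x) = g(∇f(x))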
7 One could also interchange the ∂/∂α and ∇ₓ derivatives and employ reverse-over-forward mode, but this appears to be less efficient in practice: because forward mode has less overhead than reverse mode, and since the outer differentiation operator here has to differentiate a larger computation than the inner one, keeping forward mode on the outside works best. Yet another alternative: using the identity (∇f)′v = ∇(vᵀ∇f), one can apply reverse-over-reverse mode to take the gradient of vᵀ∇f, but this has even more computational overhead.
Now, we’ll compute ∇h by forward-over-reverse:
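A sketch of this step, with reverse mode (Zygote) inside and forward mode (ForwardDiff) outside:

function ∇h(x)
    v = Zygote.gradient(g, ∇f(x))[1]                        # ∇g at z = ∇f(x), by reverse mode
    return ForwardDiff.derivative(α -> ∇f(x + α * v), 0.0)  # directional derivative, by forward mode
end

x = randn(5)   # random test input for the checks below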
We can now plug in some random numbers and compare to a finite-difference check:
julia> h(x)
-0.005284687528953334
julia> ∇h(x)
5-element Vector{Float64}:
-0.006779692698531759
0.007176439898271982
-0.006610264199241697
-0.0012162087082746558
0.007663756720005014
The finite-difference check matches to about 7 significant digits, which is as much as we can hope for—the forward-
over-reverse code works!
Problem 42
A common variation on the above procedure, which often appears in machine learning, involves a function
f (x, p) ∈ R that maps input “data” x ∈ Rn and “parameters” p ∈ RN to a scalar. Let ∇x f and ∇p f denote
the gradients with respect to x and p.
Now, suppose we have a function g(z) : Rn 7→ R as before, and define h(x, p) = g( ∇x f |x,p ). We want to
compute ∇p h = (∂h/∂p)T , which will involve “mixed” derivatives of f with respect to both x and p.
Show that you can compute ∇p h by:
∇ₚh|_{x,p} = ∂/∂α [ ∇ₚf |_{x+α ∇g|_z, p} ] at α = 0.
9 Differentiating ODE solutions
In this lecture, we will consider the problem of differentiating the solution of ordinary differential equations (ODEs)
with respect to parameters that appear in the equations and/or initial conditions. This is an important topic in a
surprising number of practical applications, such as evaluating the effect of uncertainties, fitting experimental data,
or machine learning (which is increasingly combining ODE models with neural networks). As in previous lectures,
we will find that there are crucial practical distinctions between “forward” and “reverse” (“adjoint”) techniques for
computing these derivatives, depending upon the number of parameters and desired outputs.
Although a basic familiarity with the concept of an ODE will be helpful to readers of this lecture, we will begin
with a short review in order to establish our notation and terminology.
The video lecture on this topic for IAP 2023 was given by Dr. Frank Schäfer (MIT). These notes follow the
same basic approach, but differ in some minor notational details.
In general, however, ODEs du/dt = f(u, t) rarely have closed-form solutions, and in many practical applications one must resort to approximate numerical solutions. Fortunately, if you supply
a computer program that can compute f (u, t), there are mature and sophisticated software libraries10 which can
compute u(t) from u(0) for any desired set of times t, to any desired level of accuracy (for example, to 8 significant
digits).
For example, the most basic numerical ODE method computes the solution at a sequence of times tₙ = n∆t for n = 0, 1, 2, . . . simply by approximating du/dt = f(u, t) using the finite difference [u(tₙ₊₁) − u(tₙ)]/∆t ≈ f(u(tₙ), tₙ), giving us the “explicit” timestep algorithm:
u(tₙ₊₁) ≈ u(tₙ) + ∆t f(u(tₙ), tₙ).
Using this technique, known as “Euler’s method,” we can march the solution forward in time: starting from our
initial condition u0 , we compute u(t1 ) = u(∆t), then u(t2 ) = u(2∆t) from u(∆t), and so forth. Of course, this
might be rather inaccurate unless we make ∆t very small, necessitating many timesteps to reach a given time t, and
there can arise other subtleties like “instabilities” where the error may accumulate exponentially rapidly with each
timestep. It turns out that Euler’s method is mostly obsolete: there are much more sophisticated algorithms that
robustly produce accurate solutions with far less computational cost. However, they all resemble Euler’s method
in the conceptual sense: they use evaluations of f and u at a few nearby times t to “extrapolate” u at a subsequent
time somehow, and thus march the solution forwards through time.
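As a concrete sketch (a toy implementation of the scheme above; the function name euler and the test problem are ours):

function euler(f, u0, Δt, T)
    u, t = u0, 0.0
    while t < T
        u += Δt * f(u, t)    # u(tₙ₊₁) ≈ u(tₙ) + Δt f(u(tₙ), tₙ)
        t += Δt
    end
    return u
end

euler((u, t) -> -u, 1.0, 0.001, 1.0)   # ≈ exp(−1) ≈ 0.3679 for du/dt = −u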
Relying on a computer to obtain numerical solutions to ODEs is practically essential, but it can also make
ODEs a lot more fun to work with. If you ever took a class on ODEs, you may remember a lot of tedious labor
(tricky integrals, polynomial roots, systems of equations, integrating factors, etc.) to obtain solutions by hand.
Instead, we can focus here on simply setting up the correct ODEs and integrals and trust the computer to do the
rest.
Now suppose that our ODE and/or its initial condition depend on a vector p ∈ R^N of parameters, so that du/dt = f(u, p, t) with u(p, 0) = u₀(p), and we would like the derivative ∂u/∂p of the solution, where of course ∂u/∂p (which can be thought of as an n × N Jacobian matrix) depends on p and t. This kind of question is commonplace. For example, it is important in:
• Uncertainty quantification (UQ): if you have some uncertainty in the parameters of your ODE (for example,
you have a chemical reaction in which the reaction rates are only known experimentally ± some measurement
errors), the derivative ∂u/∂p tells you (to first order, at least) how sensitive your answer is to each of these
10 For a modern and full-featured example, see the DifferentialEquations.jl suite of ODE solvers in the Julia language.
[Figure: solutions u(p, t) and u(p + δp, t) of du/dt = pu with u(p, 0) = 1, plotted versus time t.]
uncertainties.
• Optimization and fitting: often, you want to choose the parameters p to maximize or minimize some objective
(or “loss” in machine learning). For example, if your ODE models some chemical reaction with unknown
reaction rates or other parameters p, you might want to fit the parameters p to minimize the difference
between u(p, t) and some experimentally observed concentrations.
In the latter case of optimization, you have a scalar objective function of the solution, since to minimize or
maximize something you need a real number (and u might be a vector). For example, this could take on one of the
following two forms:
1. A real-valued function g(u(p, T), T) ∈ R that depends on the solution u(p, T) at a particular time T. For example, if you have an experimental solution u∗(t) that you are trying to match at t = T, you might minimize g(u(p, T), T) = ∥u(p, T) − u∗(T)∥².
2. A real-valued function G(p) = ∫₀ᵀ g(u(p, t), t) dt that depends on an average (here scaled by T) over many times t ∈ (0, T) of our time-dependent g. In the example of fitting experimental data u∗(t), minimizing G(p) = ∫₀ᵀ ∥u(p, t) − u∗(t)∥² dt corresponds to a least-square fit to minimize the error averaged over a time T (e.g. the duration of your experiment).
More generally, you can give more weight to certain times than others by including a non-negative weight function w(t) in the integral:
G_w(p) = ∫₀^∞ ∥u(p, t) − u∗(t)∥² w(t) dt ,  with integrand g(u(p, t), t) ≡ ∥u(p, t) − u∗(t)∥² w(t).
The two cases above are simply the choices w(t) = δ(t − T) (a Dirac delta function) and w(t) = 1 for t ≤ T, 0 otherwise (a step function), respectively. As discussed in Problem 43, you can let w(t) be a sum of delta functions to represent data at a sequence of discrete times.
In both cases, since these are scalar-valued functions, for optimization/fitting one would like to know the gradient
∇p g or ∇p G, such that, as usual,
g(u(p + dp, t), t) − g(u(p, t), t) = (∇ₚg)ᵀ dp ,
so that ±∇ₚg is the steepest ascent/descent direction for maximization/minimization of g, respectively. It is worth emphasizing that gradients (which we only define for scalar-valued functions) have the same shape as their inputs p, so ∇ₚg is a vector of length N (the number of parameters) that depends on p and t.
These are “just derivatives,” but probably you can see the difficulty: if we don’t have a formula (explicit solution)
for u(p, t), only some numerical software that can crank out numbers for u(p, t) given any parameters p and t, how
do we apply differentiation rules to find ∂u/∂p or ∇p g? Of course, we could use finite differences as in Sec. 4—just
crank through numerical solutions for p and p + δp and subtract them—but that will be quite slow if we want
to differentiate with respect to many parameters (N ≫ 1), not to mention giving potentially poor accuracy. In
fact, people often have huge numbers of parameters inside an ODE that they want to differentiate. Nowadays, our
right-hand-side function f (u, p, t) can even contain a neural network (this is called a “neural ODE”) with thousands
or millions (N ) of parameters p, and we need all N of these derivatives ∇p g or ∇p G to minimize the “loss” function
g or G. So, not only do we need to find a way to differentiate our ODE solutions (or scalar functions thereof), but
these derivatives must be obtained efficiently. It turns out that there are two ways to do this, and both of them
hinge on the fact that the derivative is obtained by solving another ODE :
• Forward mode: ∂u/∂p turns out to solve another ODE that we can integrate with the same numerical solvers for u. This gives us all of the derivatives we could want, but the drawback is that the ODE for ∂u/∂p is larger by a factor of N than the original ODE for u, so it is only practical for small N (few parameters).
• Reverse (“adjoint”) mode: for scalar objectives, it turns out that ∇p g or ∇p G can be computed by solving
a different ODE for an “adjoint” solution v(p, t) of the same size as u, and then computing some simple
integrals involving u (the “forward” solution) and v. This has the advantage of giving us all N derivatives
with only about twice the cost of solving for u, regardless of the number N of parameters. The disadvantage
is that, since it turns out that v must be integrated “backwards” in time (starting from an “initial” condition
at t = T and working back to t = 0) and depends on u, it is necessary to store u(p, t) for all t ∈ [0, T ] (rather
than marching u forwards in time and discarding values from previous times when they are no longer needed),
which can require a vast amount of computer memory for large ODE systems integrated over long times.
We will now consider each of these approaches in more detail.
To obtain ∂u/∂p in forward mode, we simply differentiate both sides of our ODE du/dt = f(u, p, t) with respect to p:
∂/∂p (du/dt) = ∂/∂t (∂u/∂p) ,
∂/∂p [f(u, p, t)] = (∂f/∂u)(∂u/∂p) + ∂f/∂p ,
where we have used the familiar rule (from multivariable calculus) of interchanging the order of partial derivatives—a property that we will re-derive explicitly for our generalized linear-operator derivatives in our lecture on Hessians and second derivatives. Equating the right-hand sides of the two lines, we see that we have an ODE
∂/∂t (∂u/∂p) = (∂f/∂u)(∂u/∂p) + ∂f/∂p
for the derivative ∂u/∂p, whose initial condition is obtained simply by differentiating the initial condition u(p, 0) = u₀(p) for u:
∂u/∂p |_{t=0} = ∂u₀/∂p .
We can therefore plug this into any ODE solver technique (usually numerical methods, unless we are extremely lucky and can solve this ODE analytically for a particular f) to find ∂u/∂p at any desired time t. Simple, right?
∂u
The only thing that might seem a little weird here is the shape of the solution: ∂p is a linear operator, but
how can the solution of an ODE be a linear operator? It turns out that there is nothing wrong with this, but it is
helpful to think about a few examples:
• If u, p ∈ R are scalars (that is, we have a single scalar ODE with a single scalar parameter), then ∂u/∂p is just a (time-dependent) number, and our ODE for ∂u/∂p is an ordinary scalar ODE with an ordinary scalar initial condition.

• If u ∈ Rⁿ (a “system” of n ODEs) and p ∈ R is a scalar, then ∂u/∂p ∈ Rⁿ is another column vector and our ODE for ∂u/∂p is another system of n ODEs. So, we solve two ODEs of the same size n to obtain u and ∂u/∂p.

• If u ∈ Rⁿ (a “system” of n ODEs) and p ∈ Rᴺ is a vector of N parameters, then ∂u/∂p ∈ Rⁿˣᴺ is an n × N Jacobian matrix. Our ODE for ∂u/∂p is effectively a system of nN ODEs for all the components of this matrix, with a matrix ∂u₀/∂p of nN initial conditions! Solving this “matrix ODE” with numerical methods poses no conceptual difficulty, but will generally require about N times the computational work of solving for u, simply because there are N times as many unknowns. This could be expensive if N is large!
This reflects our general observation of forward-mode differentiation: it is expensive when the number N of “input”
parameters being differentiated is large. However, forward mode is straightforward and, especially for N ≲ 100 or
so, is often the first method to try when differentiating ODE solutions. Given ∂u/∂p, one can then straightforwardly differentiate scalar objectives by the chain rule:

∇p g |ₜ₌T = [ (∂u/∂p)ᵀ (∂g/∂u)ᵀ ]ₜ₌T    (a Jacobianᵀ–vector product),

∇p G = ∫₀ᵀ ∇p g dt .
The left-hand side ∇p G is the gradient of a scalar function of N parameters, and hence the gradient is a vector of N
components. Correspondingly, the right-hand side is an integral of an N -component gradient ∇p g as well, and the
integral of a vector-valued function can be viewed as simply the elementwise integral (the vector of integrals of each
component).
For large N ≫ 1 and scalar objectives g or G (etc.), we can in principle compute derivatives much more efficiently,
with about the same cost as computing u, by applying a “reverse-mode” or “adjoint” approach. In other lectures,
we’ve obtained analogous reverse-mode methods simply by evaluating the chain rule left-to-right (outputs-to-inputs)
instead of right-to-left. Conceptually, the process for ODEs is similar,¹¹ but algebraically the derivation is rather trickier and less direct. The key thing is that, if possible, we want to avoid computing ∂u/∂p explicitly, since this could be a prohibitively large Jacobian matrix if we have many parameters (p is large), especially if we have many equations (u is large).
In particular, let’s start with our forward-mode sensitivity analysis, and consider the derivative G′ = (∇p G)ᵀ
where G is the integral of a time-varying objective g(u, p, t) (which we allow to depend explicitly on p for generality).
By the chain rule,
G′ = ∫₀ᵀ [ ∂g/∂p + (∂g/∂u)(∂u/∂p) ] dt ,

which involves our unwanted factor ∂u/∂p. To get rid of this, we’re going to use a “weird trick” (much like Lagrange
multipliers) of adding zero to this equation:
G′ = ∫₀ᵀ [ ∂g/∂p + (∂g/∂u)(∂u/∂p) + vᵀ( ∂/∂t(∂u/∂p) − (∂f/∂u)(∂u/∂p) − ∂f/∂p ) ] dt ,

where the parenthesized expression multiplying vᵀ is = 0
for some function v(t) of the same shape as u that multiplies our “forward-mode” equation for ∂u/∂p. (If u ∈ Rⁿ then v ∈ Rⁿ; more generally, for other vector spaces, read vᵀ as an inner product with v.) The new term vᵀ(· · · ) is zero because the parenthesized expression is precisely the ODE satisfied by ∂u/∂p, as obtained in our forward-mode analysis above, regardless of v(t). This is important because it allows us the freedom to choose v(t) to cancel the unwanted ∂u/∂p term. In particular, if we first integrate by parts on the vᵀ ∂/∂t(∂u/∂p) term to change it to −(∂v/∂t)ᵀ (∂u/∂p) plus a boundary term, then re-group the terms, we find:
G′ = vᵀ (∂u/∂p) |₀ᵀ + ∫₀ᵀ [ ∂g/∂p − vᵀ(∂f/∂p) + ( ∂g/∂u − vᵀ(∂f/∂u) − (∂v/∂t)ᵀ )(∂u/∂p) ] dt ,

where the last parenthesized coefficient of ∂u/∂p is the term we want to be zero!
What initial condition should we choose for v(t)? Well, we can use this choice to get rid of the boundary term we
obtained above from integration by parts:
vᵀ (∂u/∂p) |₀ᵀ = v(T)ᵀ (∂u/∂p)|ₜ₌T − v(0)ᵀ (∂u₀/∂p) ,

where (∂u/∂p)|ₜ₌T is unknown but ∂u₀/∂p is known.
Here, the unknown (∂u/∂p)|ₜ₌T term is a problem—to compute that, we would be forced to go back to integrating our big ∂u/∂p ODE from forward mode. The other term is okay: since the initial condition u₀ is always given, we should know its dependence on p explicitly (and we will simply have ∂u₀/∂p = 0 in the common case where the initial conditions
11 This “left-to-right” picture can be made very explicit if we imagine discretizing the ODE into a recurrence, e.g. via Euler’s method
for an arbitrarily small ∆t, as described in the MIT course notes Adjoint methods and sensitivity analysis for recurrence relations by
S. G. Johnson (2011).
don’t depend on p). To eliminate the unknown (∂u/∂p)|ₜ₌T term, therefore, we make the choice

v(T) = 0 .
Instead of an initial condition, our adjoint ODE has a final condition. That’s no problem for a numerical solver:
it just means that the adjoint ODE is integrated backwards in time, starting from t = T and working down
to t = 0. Once we have solved the adjoint ODE for v(t), we can plug it into our equation for G′ to obtain our
gradient by a simple integral:
T " T T #
Z T
′ T ∂u0 ∂g ∂f
∇p G = (G ) = − v(0) + − v dt .
∂p 0 ∂p ∂p
(If you want to be fancy, you can compute this ∫₀ᵀ integral simultaneously with v itself, by augmenting the adjoint ODE with
an additional set of unknowns and equations representing the G′ integrand. But that’s mainly just a computational
convenience and doesn’t change anything fundamental about the process.)
The only remaining annoyance is that the adjoint ODE depends on u(p, t) for all t ∈ [0, T ]. Normally, if we are
solving the “forward” ODE for u(p, t) numerically, we can “march” the solution u forwards in time and only store the
solution at a few of the most recent timesteps. Since the adjoint ODE starts at t = T , however, we can only start
integrating v after we have completed the calculation of u. This requires us to save essentially all of our previously
computed u(p, t) values, so that we can evaluate u at arbitrary times t ∈ [0, T ] during the integration of v (and
G′ ). This can require a lot of computer memory if u is large (e.g. it could represent millions of grid points from a
spatially discretized PDE, such as in a heat-diffusion problem) and many timesteps t were required. To ameliorate
this challenge, a variety of strategies have been employed, typically centered around “checkpointing” techniques in
which u is only saved at a subset of times t, and its value at other times is obtained during the v integration by
re-computing u as needed (numerically integrating the ODE starting at the closest “checkpoint” time). A detailed
discussion of such techniques lies outside the scope of these notes, however.
9.3 Example
Let us illustrate the above techniques with a simple example. Suppose that we are integrating the scalar ODE
∂u/∂t = f(u, p, t) = p₁ + p₂u + p₃u² = pᵀ (1, u, u²)ᵀ
for an initial condition u(p, 0) = u0 = 0 and three parameters p ∈ R3 . (This is probably simple enough to solve in
closed form, but we won’t bother with that here.) We will also consider the scalar function
G(p) = ∫₀ᵀ [u(p, t) − u∗(t)]² dt ,

with integrand g(u, p, t) = [u(p, t) − u∗(t)]²,
that (for example) we may want to minimize for some given u∗ (t) (e.g. experimental data or some given formula
like u∗ = t3 ), so we are hoping to compute ∇p G.
9.3.1 Forward mode
The Jacobian matrix ∂u/∂p = ( ∂u/∂p₁  ∂u/∂p₂  ∂u/∂p₃ ) is simply a row vector, and satisfies our “forward-mode” ODE:

∂/∂t (∂u/∂p) = (∂f/∂u)(∂u/∂p) + ∂f/∂p = (p₂ + 2p₃u) (∂u/∂p) + ( 1  u  u² )
for the initial condition ∂u/∂p |ₜ₌₀ = ∂u₀/∂p = 0. This is an inhomogeneous system of three coupled linear ODEs, which
might look more conventional if we simply transpose both sides:
∂/∂t (∂u/∂p)ᵀ = (p₂ + 2p₃u) (∂u/∂p)ᵀ + ( 1, u, u² )ᵀ .
The fact that this depends on our “forward” solution u(p, t) makes it not so easy to solve by hand, but a computer can solve it numerically with no difficulty. On a computer, we would probably solve for u and ∂u/∂p simultaneously by combining the two ODEs into a single ODE with 4 components:

∂/∂t [ u ; (∂u/∂p)ᵀ ] = [ p₁ + p₂u + p₃u² ; (p₂ + 2p₃u)(∂u/∂p)ᵀ + (1, u, u²)ᵀ ] .
Given ∂u/∂p, we can then plug this into the chain rule for G:

∇p G = ∫₀ᵀ 2 [u(p, t) − u∗(t)] (∂u/∂p)ᵀ dt .
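To make this concrete, here is a minimal computational sketch (our own illustration, assuming the DifferentialEquations.jl package, hypothetical parameter values, and the example target u∗(t) = t³ suggested above): we append the three sensitivities and the running integral for ∇p G to the state and integrate everything at once.

using DifferentialEquations

u_star(t) = t^3                  # example target data u*(t), as suggested above

function forward!(dw, w, p, t)
    u, dudp = w[1], w[2:4]                               # u and (∂u/∂p)ᵀ
    dw[1] = p[1] + p[2]*u + p[3]*u^2                     # ∂u/∂t = f(u, p, t)
    dw[2:4] = (p[2] + 2p[3]*u) .* dudp .+ [1.0, u, u^2]  # sensitivity ODE
    dw[5:7] = 2 * (u - u_star(t)) .* dudp                # integrand of ∇p G
end

p = [1.0, 0.5, -0.1]             # hypothetical parameter values
T = 1.0
sol = solve(ODEProblem(forward!, zeros(7), (0.0, T), p), Tsit5())
∇G = sol.u[end][5:7]             # ∇p G = ∫₀ᵀ 2(u - u*) (∂u/∂p)ᵀ dt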
9.3.2 Reverse mode

In reverse mode, we have an adjoint solution v(t) ∈ R (the same shape as u) which solves our adjoint equation

dv/dt = (∂g/∂u)ᵀ − (∂f/∂u)ᵀ v = 2 [u(p, t) − u∗(t)] − (p₂ + 2p₃u) v
with a final condition v(T ) = 0. Again, a computer can solve this numerically without difficulty (given the numerical
“forward” solution u) to find v(t) for t ∈ [0, T ]. Finally, our gradient is the integrated product:
∇p G = − ∫₀ᵀ (1, u, u²)ᵀ v dt .
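Continuing the sketch above (same hypothetical setup), the adjoint computation solves for v backwards from v(T) = 0 while accumulating the gradient integral; integrating the accumulator from t = T down to t = 0 flips its sign, which conveniently absorbs the minus sign in the formula for ∇p G.

fwd = solve(ODEProblem((u, p, t) -> p[1] + p[2]*u + p[3]*u^2, 0.0, (0.0, T), p),
            Tsit5())             # forward solution, interpolable as fwd(t)

function adjoint!(dw, w, p, t)
    v, u = w[1], fwd(t)
    dw[1] = 2 * (u - u_star(t)) - (p[2] + 2p[3]*u) * v   # dv/dt from the adjoint ODE
    dw[2:4] = [1.0, u, u^2] .* v                         # integrand (∂f/∂p)ᵀ v
end

adj = solve(ODEProblem(adjoint!, zeros(4), (T, 0.0), p), Tsit5())  # backwards in time
∇G_adj = adj.u[end][2:4]         # equals -∫₀ᵀ (1,u,u²)ᵀ v dt, since integrating
                                 # from T down to 0 flips the accumulated sign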
Problem 43
Suppose that G(p) takes the form of a sum of K terms:
G(p) = Σₖ₌₁ᴷ gₖ(p, u(p, tₖ))
for times tk ∈ (0, T ) and functions gk (p, u). For example, this could arise in least-square fitting of experimental
data u∗ (tk ) at K discrete times, with gk (u(p, tk )) = ∥u∗ (tk ) − u(p, tk )∥2 measuring the squared difference
between u(p, tk ) and the measured data at time tk .
1. Show that such a G(p) can be expressed as a special case of our formulation in this chapter, by defining
our function g(u, t) using a sum of Dirac delta functions δ(t − tk ).
2. Explain how this affects the adjoint solution v(t): in particular, how the introduction of delta-function
terms on the right-hand side of dv/dt causes v(t) to have a sequence of discontinuous jumps. (In several
popular numerical ODE solvers, such discontinuities can be incorporated via discrete-time “callbacks”.)
3. Explain how these delta functions may also introduce a summation into the computation of ∇p G, but
only if gk depends explicitly on p (not just via u).
10 Calculus of Variations
In this lecture, we will apply our derivative machinery to a new type of input: neither scalars, nor column vectors,
nor matrices, but rather the inputs will be functions u(x), which form a perfectly good vector space (and can
even have norms and inner products).12 It turns out that there are lots of amazing applications for differentiating
with respect to functions, and the resulting techniques are sometimes called the “calculus of variations” and/or “Fréchet” derivatives.
Example 44
For example, consider functions u(x) that map x ∈ [0, 1] → u(x) ∈ R. We may then define the function f :
f(u) = ∫₀¹ sin(u(x)) dx .
Such a function, mapping an input function u to an output number, is sometimes called a “functional.” What
is f ′ or ∇f in this case?
Recall that, given any function f, we always define the derivative as a linear operator f′(u) via the equation

df = f(u + du) − f(u) = f′(u)[du] ,

where now du denotes an arbitrary “small-valued” function du(x) that represents a small change in u(x), as depicted
in Fig. 12 for the analogous case of a non-infinitesimal δu(x). Here, we may compute this via linearization of the
integrand:
df = f(u + du) − f(u)
   = ∫₀¹ [ sin(u(x) + du(x)) − sin(u(x)) ] dx
   = ∫₀¹ cos(u(x)) du(x) dx = f′(u)[du] ,
where in the last step we took du(x) to be arbitrarily small13 so that we could linearize sin(u + du) to first-order
in du(x). That’s it, we have our derivative f ′ (u) as a perfectly good linear operation acting on du!
¹²Technically, one must be careful in defining the vector space of functions, inserting annoying caveats about functions that differ only at isolated points, and so forth. In this lecture, we will mostly ignore such technicalities—we will implicitly assume that our functions are integrable, differentiable, etcetera, as needed. The subject of functional analysis exists to treat such matters with more care.
13 Technically, it only needs to be small “almost everywhere” since jumps that occur only at isolated points don’t affect the integral.
[Figure 12: plot of an example function u(x) together with a perturbed function u(x) + δu(x) (vertical axis: function value).]
Recall that the gradient ∇f is defined as whatever we take the inner product of du with to obtain df . Therefore,
we obtain the gradient as follows:
df = f′(u)[du] = ∫₀¹ cos(u(x)) du(x) dx = ⟨∇f, du⟩ ⟹ ∇f = cos(u(x)) .
The two infinitesimals du and dx may seem a bit disconcerting, but if this is confusing you can just think of the
du(x) as a small non-infinitesimal function δu(x) (as in Fig. 12) for which we are dropping higher-order terms.
The gradient ∇f is just another function, cos(u(x))! As usual, ∇f has the same “shape” as u.
Remark 45. It might be instructive here to compare the gradient of an integral, above, with a discretized version
where the integral is replaced by a sum. If we have
f(u) = Σₖ₌₁ⁿ sin(uₖ) Δx ,

where Δx = 1/n, for a vector u ∈ Rⁿ, related to our previous u(x) by uₖ = u(kΔx), which can be thought of as a “rectangle rule” (or Riemann sum, or Euler) approximation for the integral. Then,

∇ᵤf = ( cos(u₁), cos(u₂), … , cos(uₙ) )ᵀ Δx .
Why does this discrete version have a ∆x multiplying the gradient, whereas our continuous version did not? The
reason is that in the continuous version we effectively included the dx in the definition of the inner product ⟨u, v⟩
(which was an integral). In the discrete case, the ordinary inner product (used to define the conventional gradient) is just a sum without a Δx. However, if we define a weighted discrete inner product ⟨u, v⟩ = Σₖ₌₁ⁿ uₖvₖ Δx, then, according to Sec. 5, this changes the definition of the gradient, and in fact will remove the Δx term to correspond to the continuous version.
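As a quick numerical check of this remark (a sketch in base Julia, with an arbitrary test vector of our own choosing):

n = 1000; Δx = 1/n
u = [sin(3k*Δx) for k in 1:n]      # arbitrary test vector with uₖ = u(kΔx)
f(v) = sum(sin, v) * Δx            # f(u) = Σₖ sin(uₖ) Δx
δu = 1e-6 * randn(n)               # small discrete perturbation
f(u + δu) - f(u)                   # finite-difference df ...
(cos.(u) * Δx)' * δu               # ... ≈ ⟨∇f, δu⟩ with ∇f = cos.(u) Δx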
Example 46
Let u be a differentiable function on [0, 1] and consider the functional
f(u) = ∫₀¹ √(1 + u′(x)²) dx .
Geometrically, you learned in first-year calculus that this is simply the length of the curve u(x) from x = 0
to x = 1. To differentiate this, first notice that ordinary single-variable calculus gives us the linearization
d√(1 + v²) = √(1 + (v + dv)²) − √(1 + v²) = (√(1 + v²))′ dv = [ v / √(1 + v²) ] dv .
Therefore,
df = f(u + du) − f(u)
   = ∫₀¹ [ √(1 + ((u + du)′)²) − √(1 + u′²) ] dx
   = ∫₀¹ [ u′ / √(1 + u′²) ] du′ dx .
However, this is a linear operator on du′ and not (directly) on du. Abstractly, this is fine, because du′ is itself
a linear operation on du, so we have f ′ (u)[du] as the composition of two linear operations. However, it is more
revealing to rewrite it explicitly in terms of du, for example in order to define ∇f . To accomplish this, we can
apply integration by parts to obtain
f′(u)[du] = ∫₀¹ [ u′/√(1 + u′²) ] du′ dx = [ u′/√(1 + u′²) ] du |₀¹ − ∫₀¹ ( u′/√(1 + u′²) )′ du dx .
Notice that up until now we did not need to utilize the “boundary conditions” u(0) = u(1) = 0 for this calculation.
However, if we want to restrict ourselves to such functions u(x), then our perturbation du cannot change the
endpoint values, i.e. we must have du(0) = du(1) = 0. (Geometrically, suppose that we want to find the u that
minimizes arc length between (0, 0) and (1, 0), so that we need to fix the endpoints.) This implies that the boundary
term in the above equation is zero. Hence, we have that
df = − ∫₀¹ ( u′/√(1 + u′²) )′ du dx = ⟨∇f, du⟩ ,

so that ∇f = −( u′/√(1 + u′²) )′.
Furthermore, note that the u that minimizes the functional f has the property that ∇f |u = 0. Therefore, for
a u that minimizes the functional f (the shortest curve), we must have the following result:
0 = ∇f = −( u′/√(1 + u′²) )′
       = −[ u″√(1 + u′²) − u′ (u′u″/√(1 + u′²)) ] / (1 + u′²)
       = −[ u″(1 + u′²) − u″u′² ] / (1 + u′²)^(3/2)
       = −u″ / (1 + u′²)^(3/2) .
Hence, ∇f = 0 =⇒ u′′ (x) = 0 =⇒ u(x) = ax + b for constants a, b; and for these boundary conditions a = b = 0.
In other words, u is the horizontal straight line segment!
Thus, we have recovered the familiar result that straight line segments in R2 are the shortest curves between
two points!
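As in Remark 45, we can verify this numerically (a sketch; the test function u = x(1 − x) and the perturbation are our own choices, both vanishing at the endpoints):

n = 200; h = 1/n
x = collect(0:h:1)                          # grid including endpoints
u = x .* (1 .- x)                           # u(0) = u(1) = 0, with u″ = -2
arclen(v) = sum(sqrt.(1 .+ (diff(v) ./ h).^2)) * h
δu = 1e-6 .* sin.(2π .* x)                  # perturbation vanishing at the endpoints
arclen(u .+ δu) - arclen(u)                 # finite-difference df ...
∇f = 2 ./ (1 .+ (1 .- 2 .* x).^2).^(3/2)    # ∇f = -u″/(1+u′²)^(3/2) for this u
sum(∇f .* δu) * h                           # ... ≈ ⟨∇f, δu⟩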
Remark 47. Notice that the expression u″/(1 + u′²)^(3/2) is the formula from multivariable calculus for the curvature of the curve defined by y = u(x). It is not a coincidence that the gradient of arc length is the (negative) curvature, and the minimum arc length occurs for zero gradient = zero curvature.
Example 48
Let f(u) = ∫ₐᵇ F(u, u′, x) dx, where u is a differentiable function on [a, b]. Suppose the endpoints of u are fixed (i.e. its values at x = a and x = b are constants). Let us calculate df and ∇f.
We find:
df = f(u + du) − f(u)
   = ∫ₐᵇ [ (∂F/∂u) du + (∂F/∂u′) du′ ] dx
   = (∂F/∂u′) du |ₐᵇ + ∫ₐᵇ [ ∂F/∂u − (∂F/∂u′)′ ] du dx ,

with the boundary term (∂F/∂u′) du |ₐᵇ = 0,
where we used the fact that du = 0 at a or b if the endpoints u(a) and u(b) are fixed. Hence,
∇f = ∂F/∂u − ( ∂F/∂u′ )′ ,
which equals zero at an extremum. Notice that ∇f = 0 yields a second-order differential equation in u, known as the
Euler–Lagrange equations!
Remark 49. The notation ∂F/∂u′ is a notoriously confusing aspect of the calculus of variations—what does it
mean to take the derivative “with respect to u′ ” while holding u fixed? A more explicit, albeit more verbose, way
of expressing this is to think of F (u, v, x) as a function of three unrelated arguments, for which we only substitute
v = u′ after differentiating with respect to the second argument v:
∂F/∂u′ = (∂F/∂v) |ᵥ₌ᵤ′ .
There are many wonderful applications of this idea. For example, search online for information about the
“brachistochrone problem” (animated here) and/or the “principle of least action”. Another example is a catenary
curve, which minimizes the potential energy of a hanging cable. A classic textbook on the topic is Calculus of
Variations by Gelfand and Fomin.
11 Derivatives of Random Functions
These notes are from a guest lecture by Gaurav Arya in IAP 2023.
11.1 Introduction
In this class, we’ve learned how to take derivatives of all sorts of crazy functions. Recall one of our first examples:
f(A) = A² , (8)
where A is a matrix. To differentiate this function, we had to go back to the drawing board, and ask:
Question 50. If we perturb the input slightly, how does the output change?
We called δf and δA differentials in the limit where δA became arbitrarily small, writing

δf = f(A + δA) − f(A) = A(δA) + (δA)A + (δA)² .

We then had to ask which of these terms could be neglected. We decided that (δA)² should be neglected, justifying this by the fact that (δA)² is “higher-order”. We were left with the derivative operator δA ↦ A(δA) + (δA)A: the best possible linear approximation to f in a neighbourhood
of A. At a high level, the main challenge here was dealing with complicated input and output spaces: f was
matrix-valued, and also matrix-accepting. We had to ask ourselves: in this case, what should the notion of a
derivative even mean?
In this lecture, we will face a similar challenge, but with an even weirder type of function. This time, the output
of our function will be random. Now, we need to revisit the same questions. If the output is random, how can we
describe its response to a change in the input? And how can we form a useful notion of derivative?
Specifically, we will consider a function p ↦ X(p), where X(p) is a random variable. (To keep things simple, we’ll take p ∈ R and X(p) ∈ R in this chapter, though
of course they could be generalized to other vector spaces as in the other chapters. For now, the randomness is
complicated enough to deal with.)
The idea is that we can only sample from X(p), according to some distribution of numbers with probabilities
that depend upon p. One simple example would be sampling real numbers uniformly (equal probabilities) from
the interval [0, p]. As a more complicated example, suppose X(p) follows the exponential distribution with scale p,
corresponding to randomly sampled real numbers x ≥ 0 whose probability decreases proportional to e−x/p . This
can be denoted X(p) ∼ Exp(p), and implemented in Julia by:
julia> using Distributions

julia> sample_X(p) = rand(Exponential(p))
sample_X (generic function with 1 method)
julia> sample_X(10.0)
1.7849785709142214
julia> sample_X(10.0)
4.435847397169775
julia> sample_X(10.0)
0.6823343897949835
If our program gives a different output each time, what could a useful notion of derivative be? Before we try
to answer this, let’s ask why we might want to take a derivative. The answer is that we may be very interested
in statistical properties of random functions, i.e. values that can be expressed using averages. Even if a function
is stochastic, its average (“expected value”), assuming the average exists, can be a deterministic function of its
parameters that has a conventional derivative.
So, why not take the average first, and then take the ordinary derivative of this average? This simple approach
works for very basic stochastic functions (e.g. the exponential distribution above has expected value p, with deriva-
tive 1), but runs into practical difficulties for more complicated distributions (as are commonly implemented by
large computer programs working with random numbers).
Remark 52. It is often much easier to produce an “unbiased estimate” X(p) of a statistical quantity than to compute
it exactly. (Here, an unbiased estimate means that X(p) averages out to our statistical quantity of interest.)
For example, in deep learning, the “variational autoencoder” (VAE) is a very common architecture that is
inherently stochastic. It is easy to get a stochastic unbiased estimate of the loss function by running a random
simulation X(p): the loss function L(p) is then the “average” value of X(p), denoted by the expected value E[X(p)].
However, computing the loss L(p) exactly would require integrating over all possible outcomes, which usually is
impractical. Now, to train the VAE, we also need to differentiate L(p), i.e. differentiate E[X(p)] with respect to p!
Perhaps more intuitive examples can be found in the physical sciences, where randomness may be baked into your
model of a physical process. In this case, it’s hard to get around the fact that you need to deal with stochasticity!
For example, you may have two particles that interact with an average rate of r. But in reality, the times when
these interactions actually occur follow a stochastic process. (In fact, the time until the first interaction might be
exponentially distributed, with scale 1/r.) And if you want to (e.g.) fit the parameters of your stochastic model to
real-world data, it’s once again very useful to have derivatives.
If we can’t compute our statistical quantity of interest exactly, it seems unreasonable to assume we can compute
its derivative exactly. However, we could hope to stochastically estimate its derivative. That is, if X(p) represents
the full program that produces an unbiased estimate of our statistical quantity, here’s one property we’d definitely
like our notion of derivative to have: we should be able to construct from it an unbiased gradient estimator14 X ′ (p)
14 For more discussion of these concepts, see (e.g.) the review article “Monte Carlo gradient estimation in machine learning” (2020)
satisfying
E[X′(p)] = E[X(p)]′ = ∂E[X(p)]/∂p .    (11)
Of course, there are infinitely many such estimators. For example, given any estimator X ′ (p) we can add any
other random variable that has zero average without changing the expectation value. But in practice there are two
additional considerations: (1) we want X ′ (p) to be easy to compute/sample (about as easy as X(p)), and (2) we
want the variance (the “spread”) of X ′ (p) to be small enough that we don’t need too many samples to estimate its
average accurately (hopefully no worse than estimating E[X(p)]).
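For concreteness, the naive first attempt referred to below is a finite difference of independent samples; here is a minimal sketch defining the sample_δX helper used in the following discussion:

sample_δX(p, δp) = sample_X(p + δp) - sample_X(p)   # two *independent* draws
sample_δX(10.0, 1e-5)                               # typically O(1), not O(δp)!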
We can observe something a bit worrying: even for a very tiny δp (we chose δp = 10−5 ), δX(p) is still fairly large:
essentially as large as the original random variables. This is not good news if we want to construct a derivative
from δX(p): we would rather see its magnitude getting smaller and smaller with δp, like in the non-stochastic case.
Computationally, this will make it very difficult to determine E[X(p)]′ by averaging sample_δX(p, δp) / δp over
many samples: we’ll need a huge number of samples because the variance, the “spread” of random values, is huge
for small δp.
Let’s try a different approach. It is natural to think of X(p) for all p as forming a family of random variables,
all defined on the same probability space. A probability space, with some simplification, is a sample space Ω, with
a probability distribution P defined on the sample space. From this point of view, each X(p) can be expressed as
a function Ω → R. To sample from a particular X(p), we can imagine drawing a random ω from Ω according to
P, and then plugging this into X(p), i.e. computing X(p)(ω). (Computationally, this is how most distributions are
actually implemented: you start with a primitive pseudo-random number generator for a very simple distribution,15
e.g. drawing values ω uniformly from Ω = [0, 1), and then you build other distributions on top of this by transforming
ω somehow.) Intuitively, all of the “randomness” resides in the probability space, and crucially P does not depend
on p: as p varies, X(p) just becomes a different deterministic map on Ω.
The crux here is that all the X(p) functions now depend on a shared source of randomness: the random draw
of ω. This means that X(p) and X(p + δp) have a nontrivial joint distribution: what does it look like?
For concreteness, let’s study our exponential random variable X(p) ∼ Exp(p) from above. Using the “inversion
sampling” parameterization, it is possible to choose Ω to be [0, 1) and P to be the uniform distribution over Ω;
for any distribution, we can construct X(p) to be a corresponding nondecreasing function over Ω (given by the
inverse of X(p)’s cumulative probability distribution). Applied to X(p) ∼ Exp(p), the inversion method gives
X(p)(ω) = −p log (1 − ω). This is implemented below, and is a theoretically equivalent way of sampling X(p)
compared with the opaque rand(Exponential(p)) function we used above:
julia> sample_X2(p) = -p * log(1 - rand())   # inversion method: ω = rand() ~ Unif[0,1)
sample_X2 (generic function with 1 method)

julia> sample_X2(10.0)
8.380816941818618
julia> sample_X2(10.0)
2.073939134369733
julia> sample_X2(10.0)
29.94586208847568
julia> sample_X2(10.0)
23.91658360124792
Okay, so what does our joint distribution look like? As shown in Figure 13, we can plot X(p) and X(p + δp) as
15 Most computer hardware cannot generate numbers that are actually random, only numbers that seem random, called “pseudo-
random” numbers. The design of these random-seeming numeric sequences is a subtle subject, steeped in number theory, with a long
history of mistakes. A famous ironic quotation in this field is (Robert Coveyou, 1970): “Random number generation is too important
to be left to chance.”
Figure 13: For X(p) ∼ Exp(p) parameterized via the inversion method, we can write X(p), X(p + δp), and δX(p)
as functions from Ω = [0, 1] → R, defined on a probability space with P = Unif(0, 1).
functions over Ω. To sample the two of them jointly, we use the same choice of ω: thus, δX(p) can be formed by
subtracting the two functions pointwise at each Ω. Ultimately, δX(p) is itself a random variable over the same
probability space, sampled in the same way: we pick a random ω according to P, and evaluate δX(p)(ω), using the
function δX(p) depicted above. Our first approach with independent samples is depicted in red in Figure 13, while
our second approach is in blue. We can now see the flaw of the independent-samples approach: the O(1)-sized
“noise” from the independent samples washes out the O(δp)-sized “signal”.
What about our second question (Question 51): how can we actually take the limit of δp → 0 and compute the derivative? The idea is to differentiate δX(p) at each fixed sample ω ∈ Ω. In probability theory terms, we take the limit of random variables δX(p)/δp as δp → 0:

X′(p) = lim_{δp→0} δX(p)/δp .    (13)

For our exponential example, X(p)(ω) = −p log(1 − ω), so δX(p)(ω) = −δp log(1 − ω) and

X′(p)(ω) = lim_{δp→0} [ −δp log(1 − ω) ] / δp = −log(1 − ω) .    (14)
Once again, X ′ (p) is a random variable over the same probability space. The claim is that X ′ (p) is the notion of
derivative we were looking for! Indeed, X ′ (p) is itself in fact a valid gradient estimator:
E[X′(p)] = E[ lim_{δp→0} δX(p)/δp ] ≟ lim_{δp→0} E[δX(p)]/δp = ∂E[X(p)]/∂p .    (15)
Rigorously, one needs to justify the interchange of limit and expectation in the above. In this chapter, however, we
will be content with a crude empirical justification:
julia> X′(p) = -log(1 - rand())   # gradient estimator from Eq. (14)
X′ (generic function with 1 method)

julia> mean(X′(10.0) for i in 1:10000)
1.011689946421105
So X ′ (p) does indeed average to 1, which makes sense since the expectation of Exp(p) is p, which has derivative
1 for any choice of p. However, the crux is that this notion of derivative also works for more complicated random
variables that can be formed via composition of simple ones such as an exponential random variable. In fact, it
turns out to obey the same chain rule as usual!
Let’s demonstrate this. Using the dual numbers introduced in Chapter 8, we can differentiate the expectation
of the square of a sample from an exponential distribution without having an analytic expression for this quantity.
(The expression for X ′ we derived is already implemented as a dual-number rule in Julia by the ForwardDiff.jl
package.) The primal and dual values of the outputted dual number are samples from the joint distribution of
(X(p), X ′ (p)).
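The original demo is not shown here; the following is a small sketch of the same computation under stated assumptions (ForwardDiff.jl for the dual numbers, with the explicit inversion formula written out in place of the library’s rule). Analytically E[X(p)²] = 2p², so the exact derivative at p = 10 is 4p = 40:

using ForwardDiff, Statistics

sample_Xsq(p) = (-p * log(1 - rand()))^2        # one sample of X(p)², reparameterized
mean(ForwardDiff.derivative(sample_Xsq, 10.0) for _ in 1:10^6)   # ≈ 4p = 40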
Using the “reparameterization trick” to form a gradient estimator, as we have done here, is a fairly old idea. It
is also called the “pathwise” gradient estimator. Recently, it has become very popular in machine learning due
to its use in VAEs [e.g. Kingma & Welling (2013): https://arxiv.org/abs/1312.6114], and lots of resources can be
found online on it. Since composition simply works by the usual chain rule, it also works in reverse mode, and can
differentiate functions far more complicated than the one above!
Now, let us consider a random variable where this approach breaks down: a Bernoulli variable X(p) ~ Ber(p), which is 1 with probability p and 0 otherwise.

julia> p = 0.5
0.5

julia> sample_X(p) = rand(Bernoulli(p))   # redefine sample_X for X(p) ~ Ber(p)
sample_X (generic function with 1 method)
Figure 14: For X(p) ∼ Ber(p) parameterized via the inversion method, plots of X(p), X(p + δp), and δX(p) as
functions Ω : [0, 1] → R.
julia> sample_X(p)
false

julia> sample_X(p)
true
The parameterization of a Bernoulli variable is shown in Figure 14: using the inversion method once again, it looks like a step function: for ω < 1 − p, X(p)(ω) = 0, while for ω ≥ 1 − p, X(p)(ω) = 1.
Now, what happens when we perturb p? Let’s imagine perturbing p by a positive amount δp. As shown in
Figure 14, something qualitatively very different has happened here. At nearly every ω except a small region of probability δp, the output does not change. Thus, the quantity X′(p) we defined in the previous subsection (which, strictly speaking, was defined by an “almost-sure” limit that neglects regions of probability 0) is 0 at every ω: after all, for every ω, there exists a small enough δp such that δX(p)(ω) = 0.
However, there is certainly an important derivative contribution to consider here. The expectation of a Bernoulli is p, so we would expect the derivative to be 1: but E[X′(p)] = E[0] = 0. What has gone wrong is that, although δX(p) is nonzero only with tiny probability δp, the value of δX(p) on this region of tiny probability is 1, which is large. In particular, it does not approach 0 as δp approaches 0. Thus, to develop a notion of derivative of X(p), we need to somehow capture these large jumps with “infinitesimal” probability.
A recent (2022) publication (https://arxiv.org/abs/2210.08572) by the author of this chapter (Gaurav Arya), to-
gether with Frank Schäfer, Moritz Schauer, and Chris Rackauckas, worked to extend the above ideas to develop a no-
tion of “stochastic derivative” for discrete randomness, implemented by a software package called StochasticAD.jl
that performs automatic differentiation of such stochastic processes. It generalizes the idea of dual numbers to
stochastic triples, which include a third component to capture exactly these large jumps. For example, the stochas-
tic triple of a Bernoulli variable might look like:
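(The class output is not reproduced here; the following is a sketch from memory of the StochasticAD.jl interface—the printed form is approximate, and the choice Bernoulli(2p) is our hypothetical example whose probability has derivative 2 in p.)

julia> using StochasticAD, Distributions

julia> stochastic_triple(p -> rand(Bernoulli(2p)), 0.25)   # hypothetical example
StochasticTriple of Int64:
0 + 0ε + (1 with probability 2.0ε)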
Here, δp is denoted by ϵ, imagined to be an “infinitesimal unit”, so that the above triple indicates a flip from 0 to
1 with probability that has derivative 2.
However, many aspects of these problems are still difficult, and there are a lot of improvements awaiting future
developments! If you’re interested in reading more, you may be interested in the paper and our package linked
above, as well as the 2020 review article by Mohamed et al. (https://arxiv.org/abs/1906.10652), which is a great
survey of the field of gradient estimation in general.
At the end of class, we considered a differentiable random walk example with StochasticAD.jl. Here it is!
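(What follows is a small sketch in the spirit of that demo, not the original code, assuming StochasticAD.jl’s derivative_estimate and a hypothetical 100-step Bernoulli walk.)

using StochasticAD, Distributions, Statistics

function walk(p)                   # 100 Bernoulli(p) steps
    X = 0
    for i in 1:100
        X += rand(Bernoulli(p))    # one discrete random step per iteration
    end
    return X                       # E[walk(p)] = 100p, so dE/dp = 100
end

mean(derivative_estimate(walk, 0.3) for _ in 1:1000)   # ≈ 100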
12 Second Derivatives, Bilinear Maps, and Hessian Matrices
In this chapter, we apply the principles of this course to second derivatives, which are conceptually just derivatives
of derivatives but turn out to have many interesting ramifications. We begin with a (probably) familiar case of
scalar-valued functions from multi-variable calculus, in which the second derivative is simply a matrix called the
Hessian. Subsequently, however, we will show that similar principles can be applied to more complicated input and
output spaces, generalizing to a notion of f ′′ as a symmetric bilinear map.
For a scalar-valued function f(x) of x ∈ Rⁿ, the second derivative is the Jacobian of the gradient ∇f:

(∇f)′ = [ ∂²f/∂x₁∂x₁  ⋯  ∂²f/∂xₙ∂x₁
              ⋮        ⋱       ⋮
          ∂²f/∂x₁∂xₙ  ⋯  ∂²f/∂xₙ∂xₙ ] = H .
This matrix, denoted here by H, is known as the Hessian of f , which has entries:
Hᵢⱼ = ∂²f/∂xⱼ∂xᵢ = ∂²f/∂xᵢ∂xⱼ = Hⱼᵢ .
The fact that you can take partial derivatives in either order is a familiar fact from multivariable calculus (sometimes
called the “symmetry of mixed derivatives” or “equality of mixed partials”), and means that the Hessian is a
symmetric matrix H = H T . (We will later see that such symmetries arise very generally from the construction of
second derivatives.)
Example 53
For x ∈ R² and the function f(x) = sin(x₁) + x₁²x₂³, its gradient is

∇f = ( cos(x₁) + 2x₁x₂³ , 3x₁²x₂² )ᵀ ,

and the corresponding Hessian (the Jacobian of ∇f) is

H = [ −sin(x₁) + 2x₂³   6x₁x₂²
       6x₁x₂²           6x₁²x₂ ] ,

which is symmetric, as expected.
If we think of the Hessian as the Jacobian of ∇f , this tells us that H dx predicts the change in ∇f to first order:
d(∇f ) = ∇f |x+dx − ∇f |x = H dx .
Note that ∇f |x+dx means ∇f evaluated at x + dx, which is very different from df = (∇f )T dx, where we act
f ′ (x) = (∇f )T on dx.
Instead of thinking of H of predicting the first-order change in ∇f , however, we can also think of it as predicting
the second -order change in f , a quadratic approximation (which could be viewed as the first three terms in a
multidimensional Taylor series):
f(x + δx) = f(x) + (∇f)ᵀ δx + ½ δxᵀ H δx + o(∥δx∥²) ,
2
where both ∇f and H are evaluated at x, and we have switched from an infinitesimal dx to a finite change δx so
that we emphasize the viewpoint of an approximation where terms higher than second-order in ∥δx∥ are dropped.
You can derive this in a variety of ways, e.g. by taking the derivative of both sides with respect to δx to reproduce
∇f |x+δx = ∇f |x + H δx + o(δx): a quadratic approximation for f corresponds to a linear approximation for ∇f .
Related to this equation, another useful (and arguably more fundamental) relation that we can derive (and will derive much more generally below) is:

f(x + dx + dx′) − f(x + dx) − f(x + dx′) + f(x) = dx′ᵀ H dx ,

where dx and dx′ are two independent “infinitesimal” directions and we have dropped terms of higher than second
order. This formula is very suggestive, because it uses H to map two vectors into a scalar, which we will generalize
below into the idea of a bilinear map f ′′ (x)[dx, dx′ ]. This formula is also obviously symmetric with respect to
interchange of dx and dx′ — f ′′ (x)[dx, dx′ ] = f ′′ (x)[dx′ , dx] — which will lead us once again to the symmetry
H = H T below.
Remark 54. Consider the Hessian matrix versus other Jacobian matrices. The Hessian matrix expresses the
second derivative of a scalar-valued multivariate function, and is always square and symmetric. A Jacobian matrix,
in general, expresses the first derivative of a vector-valued multivariate function, may be non-square, and is rarely
symmetric. (However, the Hessian matrix is the Jacobian of the ∇f function!)
Recall that the first derivative f′(x) is defined as the linear operator in df = f(x + dx) − f(x) = f′(x)[dx], implicitly dropping higher-order terms. If we similarly consider the second derivative f′′ as simply the same process applied to f′ instead of f, we obtain the following formula, which is easy to write down but will take some thought to interpret:
df ′ = f ′ (x + dx′ ) − f ′ (x) = f ′′ (x)[dx′ ].
(Notation: dx′ is not some kind of derivative of dx; the prime simply denotes a different arbitrary small change
in x.) What kind of “thing” is df ′ ? Let’s consider a simple concrete example:
Example 55
Consider the following function f(x) : R² ↦ R², mapping two-component vectors x ∈ R² to two-component vectors f(x) ∈ R²:

f(x) = ( x₁² sin(x₂) , 5x₁ − x₂³ )ᵀ .
Its first derivative is described by a 2 × 2 Jacobian matrix:

f′(x) = [ 2x₁ sin(x₂)   x₁² cos(x₂)
          5             −3x₂²       ]
that maps a small change dx in the input vector x to the corresponding small change df = f ′ (x)dx in the
output vector f .
What is df ′ = f ′′ (x)[dx′ ]? It must take a small change dx′ = (dx′1 , dx′2 ) in x and return the first-order
change df ′ = f ′ (x + dx′ ) − f ′ (x) in our Jacobian matrix f ′ . If we simply take the differential of each entry of
our Jacobian (a function from vectors x to matrices f′), we find:

df′ = [ 2 dx₁′ sin(x₂) + 2x₁ cos(x₂) dx₂′    2x₁ dx₁′ cos(x₂) − x₁² sin(x₂) dx₂′
        0                                    −6x₂ dx₂′                           ] = f′′(x)[dx′] .
Notice that this df′, once it acts on a second vector dx (giving the quantity we will call f′′(x)[dx′, dx] = df′ dx below), yields a result that is the “same shape” as f(x) (a 2-component vector). Moreover, it doesn’t change if we swap dx and dx′: f′′(x)[dx′, dx] = f′′(x)[dx, dx′], a key symmetry of the second derivative that we will discuss further below.
df′ is an (infinitesimal) object of the same “shape” as f′(x), not f(x). Here, f′ is a linear operator, so its change df′ must also be an (infinitesimal) linear operator (a “small change” in a linear operator) that we can therefore act on an arbitrary dx (or δx), in the form:

df′[dx] = (f′′(x)[dx′])[dx] = f′′(x)[dx′, dx] ,

where we combine the two brackets for brevity. This final result f′′(x)[dx′, dx] is the same type of object (vector)
as the original output f (x). This implies that f ′′ (x) is a bilinear map: acting on two vectors, and linear in either
vector taken individually. (We will see shortly that the ordering of dx and dx′ doesn’t matter: f ′′ (x)[dx′ , dx] =
f ′′ (x)[dx, dx′ ].)
More precisely, we have the following.
Definition 56 (Bilinear Map)
Let U, V, W be vector spaces, not necessarily the same. Then, a bilinear map is a function B : U × V → W, mapping u ∈ U and v ∈ V to B[u, v] ∈ W, such that we have linearity in both arguments:

B[u, αv₁ + βv₂] = αB[u, v₁] + βB[u, v₂] ,
B[αu₁ + βu₂, v] = αB[u₁, v] + βB[u₂, v] ,

for any scalars α, β.
Note that in general, even if U = V (the two inputs u, v are the “same type” of vector) we may have B[u, v] ̸=
B[v, u], but in the case of f ′′ we have something very special that happens. In particular, we can show that f ′′ (x)
is a symmetric bilinear map, meaning

f′′(x)[dx, dx′] = f′′(x)[dx′, dx]

for any dx and dx′. Why? Because, applying the definition of f′′ as giving the change in f′ from dx′, and then the definition of f′ as giving the change in f from dx, we can re-order terms to obtain:

f′′(x)[dx′, dx] = f′(x + dx′)[dx] − f′(x)[dx]
               = [ f(x + dx′ + dx) − f(x + dx′) ] − [ f(x + dx) − f(x) ]
               = f(x + dx + dx′) − f(x + dx) − f(x + dx′) + f(x)
               = f′′(x)[dx, dx′] ,

where the middle formula for f′′ emphasizes its symmetry in a natural way. (The basic reason why this works is that the “+” operation is always commutative for any vector space. A geometric interpretation is depicted in Fig. 15.)
Example 57
Let’s review the familiar example from multivariable calculus, f : Rn → R. That is, f (x) is a scalar-valued
function of a column vector x ∈ Rn . What is f ′′ ?
Recall that

f′(x) = (∇f)ᵀ ⟹ f′(x)[dx] = scalar df = (∇f)ᵀ dx .

Similarly,

f′′(x)[dx′, dx] = dx′ᵀ H dx ,

where H must be exactly the n × n Hessian matrix introduced in Sec. 12.1, since an expression like dx′ᵀH dx is the most general possible bilinear form mapping two vectors to a scalar. Moreover, since we now know
Figure 15: Geometric interpretation of f′′(x)[dx, dx′]: To first order, a function f maps parallelograms to parallelograms. To second order, however, it “opens” parallelograms: the deviation of point B (the image of A) from point C (the completion of the parallelogram) is the second derivative f′′(x)[dx, dx′]. The symmetry of f′′ as a bilinear form can be traced back geometrically to the mirror symmetry of the input parallelogram across its diagonal from x to point A.
for all dx and dx′ . This implies that H = H T : the Hessian matrix is symmetric. As discussed in Sec. 12.1, we
already knew this from multi-variable calculus. Now, however, this “equality of mixed partial derivatives” is simply
a special case of f ′′ being a symmetric bilinear map.
As an example, let’s consider a special case of the general formula above:
Example 58
Let f(x) = xᵀAx for x ∈ Rⁿ and A an n × n matrix. As above, f(x) ∈ R (scalar outputs). Compute f′′.

From our earlier lectures, df = dxᵀAx + xᵀA dx = xᵀ(A + Aᵀ) dx, so

f′ = (∇f)ᵀ = xᵀ(A + Aᵀ) .

This implies that ∇f = (A + Aᵀ)x, a linear function of x. Hence, the Jacobian of ∇f is the Hessian f′′ = H = A + Aᵀ. Furthermore, note that this implies

f′′(x)[dx, dx′] = dx′ᵀ(A + Aᵀ) dx .
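As a quick numerical sanity check of this example (a sketch with random data of our own choosing; since f is exactly quadratic, the four-point identity below holds to machine precision even for non-infinitesimal perturbations):

using LinearAlgebra
n = 4
A, x = randn(n, n), randn(n)
f(x) = x' * A * x
dx, dx′ = randn(n), randn(n)           # need not be small: f is exactly quadratic
lhs = f(x + dx + dx′) - f(x + dx) - f(x + dx′) + f(x)   # = f′′(x)[dx, dx′]
rhs = dx′' * (A + A') * dx
lhs ≈ rhs                              # true (up to roundoff)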
Example 59
Let f (A) = det A for A an n × n matrix. Express f ′′ (A) as a rule for f ′′ (A)[dA, dA′ ] in terms of dA and dA′ .
Recall from our earlier lectures that f′(A)[dA] = d(det A) = det(A) tr(A⁻¹ dA). Now, we want to compute the change d′(df) = d′(f′(A)[dA]) = f′(A + dA′)[dA] − f′(A)[dA] in this formula, i.e. the differential (denoted d′) where we change A by dA′ while treating dA as a constant. Using d′(det A) = det(A) tr(A⁻¹ dA′) and d′(A⁻¹) = −A⁻¹ dA′ A⁻¹, we find:

f′′(A)[dA, dA′] = det(A) [ tr(A⁻¹ dA′) tr(A⁻¹ dA) − tr(A⁻¹ dA′ A⁻¹ dA) ] = f′′(A)[dA′, dA] ,
where the last line (symmetry) can be derived explicitly by the cyclic property of the trace (although of course it
must be true for any f ′′ ). Although f ′′ here is a perfectly good bilinear form acting on matrices dA, dA′ , it is not
very natural to express f ′′ as a “Hessian matrix.”
If we really wanted to express f′′ in terms of an explicit Hessian matrix, we could use the “vectorization” approach of Sec. 3. Let us consider, for example, the term tr(A⁻¹ dA′ A⁻¹ dA) using Kronecker products (Sec. 3).
In general, for matrices X, Y, B, C,

tr(Xᵀ B Y C) = (vec X)ᵀ (Cᵀ ⊗ B) vec Y ,

recalling that (vec X)ᵀ vec Y = tr(XᵀY) is the Frobenius inner product (Sec. 5). Thus,

tr(A⁻¹ dA′ A⁻¹ dA) = ( vec(dA′ᵀ) )ᵀ (A⁻ᵀ ⊗ A⁻¹) vec(dA) .
This is still not quite in the form we want for a Hessian matrix, however, because it involves vec (dA′T ) rather
than vec (dA′ ) (the two vectors are related by a permutation matrix, sometimes called a “commutation” matrix).
Completing this calculation would be a nice exercise in mastery of Kronecker products, but getting an explicit
Hessian seems like a lot of algebra for a result of dubious utility!
Now, just as we did for the simple case of Hessian matrices in Sec. 12.1 above, we can use f ′′ to form a quadratic
approximation of f (x). In particular, one can show that
f(x + δx) = f(x) + f′(x)[δx] + ½ f′′(x)[δx, δx] + o(∥δx∥²) .
Note that the ½ factor is just as in the Taylor series. To derive this, simply plug the quadratic approximation into

f′′(x)[dx, dx′] = f(x + dx + dx′) − f(x + dx) − f(x + dx′) + f(x)

and check that the right-hand side reproduces f′′(x). (Note how dx and dx′ appear symmetrically in this formula, which reflects the symmetry of f′′.)
When searching for a local minimum (or maximum) of a complicated function f (x), a common procedure is to
approximate f (x + δx) by a simpler “model” function for small δx, and then to optimize this model to obtain a
potential optimization step. For example, approximating f (x + δx) ≈ f (x) + f ′ (x)[δx] (an affine model, colloquially
called “linear”) leads to gradient descent and related algorithms. A better approximation for f (x + δx) will often
lead to faster-converging algorithms, and so a natural idea is to exploit the second derivative f ′′ to make a quadratic
model, as above, and accelerate optimization.
For unconstrained optimization, minimizing f (x) corresponds to finding a root of the derivative f ′ = 0 (i.e.,
∇f = 0), and a quadratic approximation for f yields a first-order (affine) approximation f ′ (x + δx) ≈ f ′ (x) +
f ′′ (x)[δx] for the derivative f ′ . In Rn , this is δ(∇f ) ≈ Hδx. So, minimizing a quadratic model is effectively a
Newton step δx ≈ −H −1 ∇f to find a root of ∇f via first-order approximation. Thus, optimization via quadratic
approximations is often viewed as a form of Newton algorithm. As discussed below, it is also common to employ
approximate Hessians in optimization, resulting in “quasi-Newton” algorithms.
More complicated versions of this idea arise in optimization with constraints, e.g. minimizing an objective func-
tion f (x) subject to one or more nonlinear inequality constraints ck (x) ≤ 0. In such cases, there are a variety of
methods that take both first and second derivatives into account, such as “sequential quadratic programming” 17
(SQP) algorithms that solve a sequence of “QP” approximations involving quadratic objectives with affine con-
straints (see e.g. the book Numerical Optimization by Nocedal and Wright, 2006).
There are many technical details, beyond the scope of this course, that must be resolved in order to translate
such high-level ideas into practical algorithms. For example, a quadratic model is only valid for small enough δx, so
there must be some mechanism to limit the step size. One possibility is “backtracking line search”: take a Newton
step x + δx and, if needed, progressively “backtrack” to x + δx/10, x + δx/100, . . . until a sufficiently decreased value
of the objective is found. Another commonplace idea is a “trust region”: optimize the model with the constraint
that δx is sufficiently small, e.g. ∥δx∥ ≤ s (a spherical trust region), along with some rules to adaptively enlarge
or shrink the trust-region size (s) depending on how well the model predicts δf . There are many variants of
Newton/SQP-like algorithms depending on the choices made for these and other details.
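As a concrete illustration of these ideas, here is a minimal sketch of our own (not a production algorithm) of a damped Newton step with the simple backtracking rule just described, assuming the gradient and Hessian are supplied as functions:

using LinearAlgebra

function newton_step(f, ∇f, H, x)
    δx = -(H(x) \ ∇f(x))              # Newton direction: solve H δx = -∇f
    α = 1.0
    while f(x + α .* δx) > f(x) && α > 1e-8
        α /= 10                        # backtrack: try δx/10, δx/100, ...
    end
    return x + α .* δx
end

# usage on a convex quadratic, where one exact Newton step reaches the minimum:
A = [4.0 1.0; 1.0 3.0]; b = [1.0, 2.0]
f(x) = x' * A * x / 2 - b' * x
newton_step(f, x -> A*x - b, x -> A, zeros(2))   # ≈ A \ b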
In general, finding f ′′ or the Hessian is often computationally expensive in higher dimensions. If f (x) : Rn → R,
then the Hessian, H, is an n × n matrix, which can be huge if n is large—even storing H may be prohibitive, much
16 Much of machine learning uses only variations on gradient descent, without incorporating Hessian information except implicitly
via “momentum” terms. Partly this can be explained by the fact that optimization problems in ML are typically solved only to low
accuracy, often have nonsmooth/stochastic aspects, rarely involve nonlinear constraints, and are often very high-dimensional. This is
only a small corner of the wider universe of computational optimization!
17 The term “programming” in optimization theory does not refer to software engineering, but is rather an anachronistic term for
optimization problems. For example, “linear programming” (LP) refers to optimizing affine objectives and affine constraints, while
“quadratic programming” (QP) refers to optimizing convex quadratic objectives with affine constraints.
less computing it. When using automatic differentiation (AD), Hessians are often computed by a combination of
forward and reverse modes (Sec. 8.4.1), but AD does not circumvent the fundamental scaling difficulty for large n.
Instead of computing H explicitly, however, one can instead approximate the Hessian in various ways; in the
context of optimization, approximate Hessians are found in “quasi-Newton” methods such as the famous “BFGS”
algorithm and its variants. One can also derive efficient methods to compute Hessian–vector products Hv without
computing H explicitly, e.g. for use in Newton–Krylov methods. (Such a product Hv is equivalent to a directional
derivative of f ′ , which is efficiently computed by “forward-over-reverse” AD as in Sec. 8.4.1.)
Generalizing the rules you may recall from single- and multi-variable calculus, we can use the second derivative
to determine whether an extremum is a minimum, maximum, or saddle point. Firstly, an extremum of a scalar
function f is a point x0 such that f ′ (x0 ) = 0. That is,
f′(x₀)[δx] = 0 for all δx .

The definition of a local minimum x₀ is that f(x₀ + δx) > f(x₀) for any δx ≠ 0 with ∥δx∥ sufficiently small. To achieve this at a point where f′ = 0, it is enough to have f′′ be a positive-definite quadratic form:

f′′(x₀)[δx, δx] > 0 for all δx ≠ 0 .
For example, for inputs x ∈ Rn , so that f ′′ is a real-symmetric n×n Hessian matrix, f ′′ (x0 ) = H(x0 ) = H(x0 )T ,
this corresponds to the usual criteria for a positive-definite matrix:
f ′′ (x0 )[δx, δx] = δxT H(x0 )δx > 0 for all δx ̸= 0 ⇐⇒ H(x0 ) positive-definite ⇐⇒ all eigenvalues of H(x0 ) > 0.
In first-year calculus, one often focuses in particular on the 2-dimensional case, where H is a 2 × 2 matrix. In
the 2 × 2 case, there is a simple way to check the signs of the two eigenvalues of H, in order to check whether an
extremum is a minimum or maximum: the eigenvalues are both positive if and only if det(H) > 0 and tr(H) > 0,
since det(H) = λ1 λ2 and tr(H) = λ1 + λ2 . In higher dimensions, however, one needs more complicated techniques
to compute eigenvalues and/or check positive-definiteness, e.g. as discussed in MIT courses 18.06 (Linear Algebra)
and/or 18.335 (Introduction to Numerical Methods). (In practice, one typically checks positive-definiteness by
performing a form of Gaussian elimination, called a Cholesky factorization, and checking that the diagonal “pivot”
elements are > 0, rather than by computing eigenvalues which are much more expensive.)
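In Julia, for example, this Cholesky-based test is what the standard-library isposdef function performs (a small sketch):

using LinearAlgebra
H = [2.0 1.0; 1.0 3.0]
isposdef(H)                                       # true: tries a Cholesky factorization
issuccess(cholesky(Symmetric(H); check=false))    # the same test, made explicit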
Similarly, a point x0 where ∇f = 0 is a local maximum if f ′′ is negative-definite, or equivalently if the eigenvalues
of the Hessian are all negative. Additionally, x0 is a saddle point if f ′′ is indefinite, i.e. the eigenvalues include both
positive and negative values. However, cases where some eigenvalues are zero are more complicated to analyze;
e.g. if the eigenvalues are all ≥ 0 but some are = 0, then whether the point is a minimum depends upon higher
derivatives.
12.5 Further Reading
All of this formalism about “bilinear forms” and so forth may seem like a foray into abstraction for the sake of
abstraction. Can’t we always reduce things to ordinary matrices by choosing a basis (“vectorizing” our inputs
and outputs)? However, we often don’t want to do this for the same reason that we often prefer to express first
derivatives as linear operators rather than as explicit Jacobian matrices. Writing linear or bilinear operators as
explicit matrices, e.g. vec(A dA + dA A) = (I ⊗ A + AT ⊗ I) vec(dA) as in Sec. 3, often disguises the underlying
structure of the operator and introduces a lot of algebraic complexity for no purpose, as well as being potentially
computationally costly (e.g. exchanging small matrices A for large ones I ⊗ A).
As we discussed in this chapter, an important generalization of quadratic operations to arbitrary vector spaces
come in the form of bilinear maps and bilinear forms, and there are many textbooks and other sources discussing
these ideas and variations thereof. For example, we saw that the second derivative can be seen as a symmetric
bilinear form. This is closely related to a quadratic form Q[x], which is what we get by plugging the same vector
twice into a symmetric bilinear form B[x, y] = B[y, x], i.e. Q[x] = B[x, x]. (At first glance, it may seem like Q
carries “less information” than B, but in fact this is not the case. It is easy to see that one can recover B from Q
via B[x, y] = (Q[x + y] − Q[x − y])/4, called a “polarization identity.”) For example, the f ′′ (x)[δx, δx]/2 term that
appears in quadratic approximations of f (x + δx) is a quadratic form. The most familiar multivariate version of
f ′′ (x) is the Hessian matrix when x is a column vector and f (x) is a scalar, and Khan Academy has an elementary
introduction to quadratic approximation.
Positive-definite Hessian matrices, or more generally definite quadratic forms f ′′ , appear at extrema (f ′ = 0) of
scalar-valued functions f (x) that are local minima. There are a lot more formal treatments of the same idea, and
conversely Khan Academy has the simple 2-variable version where you can check the sign of the 2 × 2 eigenvalues
just by looking at the determinant and a single entry (or the trace). There’s a nice stackexchange discussion on
why an ill-conditioned Hessian tends to make steepest descent converge slowly. Some Toronto course notes on the
topic may also be useful.
Lastly, see for example these Stanford notes on sequential quadratic optimization using trust regions (Section
2.2), as well as the 18.335 notes on BFGS quasi-Newton methods. The fact that a quadratic optimization problem in
a sphere has strong duality, and hence is efficiently solvable, is discussed in Section 5.2.4 of the Convex Optimization
book. There has been a lot of work on automatic Hessian computation, but for large-scale problems you may only
be able to compute Hessian–vector products efficiently in general, which are equivalent to a directional derivative
of the gradient and can be used (for example) for Newton–Krylov methods.
The Hessian matrix is also known as the “curvature matrix”, especially in optimization. If we have a scalar function f(x) of n variables, its “graph” is the set of points (x, f(x)) in Rⁿ⁺¹; we call the last dimension the “vertical” dimension. At a “critical point” x (where ∇f = 0), vᵀHv is the ordinary curvature, sometimes taught in first-year calculus, of the curve obtained by intersecting the graph with the plane spanned by the direction v from x and the vertical (the “normal section”).
x and the vertical (the “normal section”). The determinant of H, sometimes known as the Hessian determinant,
yields the Gaussian curvature.
A closely related idea is the derivative of the unit normal. For a graph as in the preceding paragraph, we may assume that f(x) = xᵀHx/2 to second order. It is easy to see that at any point x the tangents have the form (dx, f′(x)[dx]) = (dx, xᵀH dx) and the normal is then (Hx, 1), up to sign. Near x = 0 this is a unit normal to second order, and its derivative is (H dx, 0). Projecting onto the horizontal, we see that the Hessian is the derivative of the unit normal. This is called the “shape operator” in differential geometry.
13 Derivatives of Eigenproblems
13.1 Differentiating on the Unit Sphere

Suppose x is constrained to lie on the unit sphere, xᵀx = 1. Differentiating this constraint gives dxᵀx + xᵀdx = 2xᵀdx = 0, i.e. xᵀdx = 0. In other words, at the point x on the sphere (a radius, if you will), the linearization of the constraint of moving along the sphere requires dx ⊥ x. This is our first example where we have seen the infinitesimal perturbation dx being constrained. See Figure 16.
Figure 16: Differentials on a sphere (xT x = 1): the differential dx is constrained to be perpendicular to x.
13.1.1 Special Case: A Circle

Let us simply consider the unit circle in the plane, where x = (cos θ, sin θ) for some θ ∈ [0, 2π). Then

dx = (−sin θ, cos θ) dθ ,

which is indeed perpendicular to x. Here, we can think of x as “extrinsic” coordinates, in that it is a vector in R²; on the other hand, θ is an “intrinsic” coordinate, as every point on the circle is specified by one θ.
13.1.2 On the Sphere

You may remember that the rank-1 matrix xxᵀ, for any unit vector xᵀx = 1, is a projection matrix (meaning
that it is equal to its square and it is symmetric) which projects vectors onto their components in the direction of
x. Correspondingly, I − xxT is also a projection matrix, but onto the directions perpendicular to x: geometrically,
the matrix removes components in the x direction. In particular, if xT dx = 0, then (I − xxT )dx = dx. It follows
that if xT dx = 0 and A is a symmetric matrix, we have
d(½ xᵀAx) = (Ax)ᵀ dx
          = xᵀA dx
          = xᵀA(I − xxᵀ) dx
          = ((I − xxᵀ)Ax)ᵀ dx .
Theorem 60
Given f : Sⁿ → R with unconstrained differential df = g(x)ᵀ dx, then on the sphere (where xᵀdx = 0) we have

df = g(x)ᵀ dx = ((I − xxᵀ) g(x))ᵀ dx ,

so the gradient on the sphere is the projection (I − xxᵀ) g(x).

The proof of this is precisely the same as we did before for f(x) = ½ xᵀAx.
13.2 Differentiating on Orthogonal Matrices

Definition 61
A matrix M is anti-symmetric if M = −M T . Note that all anti-symmetric matrices thus have zeroes on their
diagonals.
Theorem 62
Given Q is an orthogonal matrix, we have that QT dQ is anti-symmetric.
Proof. The constraint of being orthogonal implies that QᵀQ = I. Differentiating this equation, we obtain

dQᵀ Q + Qᵀ dQ = 0 ⟹ Qᵀ dQ = −(Qᵀ dQ)ᵀ ,

so Qᵀ dQ is anti-symmetric. ∎
When n = 3, airplane pilots know about “roll, pitch, and yaw”, which are the three parameters for the orthogonal matrices when n = 3. In general (viewing n × n matrices as points in Rⁿ²), the orthogonal group has dimension n(n − 1)/2.
There are a few ways to see this.
• Firstly, orthogonality QT Q = I imposes n(n + 1)/2 constraints, leaving n(n − 1)/2 free parameters.
• When we do QR decomposition, the R “eats” up n(n + 1)/2 of the parameters, again leaving n(n − 1)/2 for
Q.
• Lastly, if we think about the symmetric eigenvalue problem where S = QΛQᵀ, S has n(n + 1)/2 parameters and Λ has n, so Q has n(n − 1)/2.
13.2.1 Differentiating the Symmetric Eigendecomposition

Let S be a symmetric matrix, Λ be diagonal containing eigenvalues of S, and Q be orthogonal with column vectors as eigenvectors of S such that S = QΛQᵀ. [For simplicity, let’s assume that the eigenvalues are “simple” (multiplicity 1); repeated eigenvalues turn out to greatly complicate the analysis of perturbations because of the ambiguity in their eigenvector basis.] Then, we have

dS = dQ ΛQᵀ + Q dΛ Qᵀ + QΛ dQᵀ ⟹ Qᵀ dS Q = (Qᵀ dQ)Λ − Λ(Qᵀ dQ) + dΛ ,

where we used QᵀQ = I and dQᵀ Q = −Qᵀ dQ (Theorem 62).
As an exercise, one may check that the left and right hand sides of the above are both symmetric. This may
be easier if one looks at the diagonal entries on their own, as there (QT dS Q)ii = qiT dS qi . Since qi is the ith
eigenvector, this implies qiT dS qi = dλi . (In physics, this is sometimes called the “Hellman–Feynman” theorem, or
non-degenerate first-order eigenvalue-perturbation theory.)
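A quick numerical check of this formula (a sketch with a random symmetric matrix and a small random symmetric perturbation of our own choosing):

using LinearAlgebra
n = 5
S  = Symmetric(randn(n, n))
dS = Symmetric(1e-6 * randn(n, n))                # small symmetric perturbation
λ, Q = eigen(S)
λ2, _ = eigen(S + dS)
pred = [Q[:, i]' * dS * Q[:, i] for i in 1:n]     # dλᵢ = qᵢᵀ dS qᵢ
maximum(abs.(λ2 .- λ .- pred))                    # O(‖dS‖²): second-order small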
Sometimes we think of a curve of matrices S(t) depending on a parameter such as time. If we ask for dλᵢ/dt, the above implies that it equals qᵢᵀ (dS(t)/dt) qᵢ. So how can we get the gradient ∇λᵢ for one of the eigenvalues? Well, firstly, note that

dλᵢ = qᵢᵀ dS qᵢ = tr( (qᵢqᵢᵀ)ᵀ dS ) ⟹ ∇λᵢ = qᵢqᵢᵀ .
What about the eigenvectors? Those come from the off-diagonal entries, where for i ̸= j,
(QT dS Q)ij = (QT dQ)ij (λj − λi),
so (QT dQ)ij = (QT dS Q)ij / (λj − λi). Together with the zero diagonal of the anti-symmetric matrix QT dQ, this
determines dQ = Q (QT dQ). (Here is where simple eigenvalues are essential: a repeated eigenvalue λi = λj would
make this division impossible.)
In particular, along a straight-line curve S(t) = S(0) + t S′ with constant symmetric S′, write E = QT S′ Q at
t = 0. Differentiating dλi/dt = qiT S′ qi once more and substituting (QT dQ/dt)ij = Eij /(λj − λi) gives
d²Λ/dt² = 2 diag(E QT dQ/dt),
or, entry by entry,
d²λi/dt² = 2 Σk̸=i Eik² / (λi − λk),
the classic formula of second-order eigenvalue-perturbation theory.
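And a numerical sketch of our own (names are ours) of the second-order formula: for S(t) = S + tΔ, compare the predicted d²λi/dt² with a centered second difference of the ith eigenvalue.

using LinearAlgebra, Random

rng = MersenneTwister(3)
n = 5
B = randn(rng, n, n); S = (B + B') / 2
C = randn(rng, n, n); Δ = (C + C') / 2      # perturbation direction, S(t) = S + t*Δ

λ, Q = eigen(Symmetric(S))
E = Q' * Δ * Q
i = 3
pred = 2 * sum(E[i, k]^2 / (λ[i] - λ[k]) for k in 1:n if k != i)

h = 1e-4
λp = eigen(Symmetric(S + h * Δ)).values[i]
λm = eigen(Symmetric(S - h * Δ)).values[i]
println(((λp - 2λ[i] + λm) / h^2, pred))    # should agree to several digits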
14 Where We Go From Here
There are many topics that we did not have time to cover, even in 16 hours of lectures. If you came into this class
thinking that taking derivatives is easy and you already learned everything there is to know about it in first-year
calculus, hopefully we’ve convinced you that it is an enormously rich subject that is impossible to exhaust in a
single course. Some of the things it might have been nice to include are:
• When automatic differentiation (AD) hits something it cannot handle, you may have to write a custom
Jacobian–vector product (a “Jvp,” “frule,” or “pushforward”) in forward mode, and/or a custom row vector–
Jacobian product (a “vJp,” “rrule,” “pullback,” or “JacobianT -vector product”) in reverse mode. In Julia with
Zygote AD, this is done using the ChainRules packages. In Python with JAX, this is done with jax.custom_jvp
and/or jax.custom_vjp, respectively. In principle, this is straightforward, but the APIs can take some getting
used to because of the generality that they support. (A minimal Julia sketch appears at the end of this section.)
• For functions f (z) with complex arguments z (i.e. complex vector spaces), you cannot take “ordinary” complex
derivatives whenever the function involves the conjugate z̄, for example in |z|, Re(z), and Im(z). This must
occur if f (z) is purely real-valued and not constant, as in optimization problems involving complex-number
calculations. One option is to write z = x + iy and treat f (z) as a two-argument function f (x, y) with real
derivatives, but this can be awkward if your problem is “naturally” expressed in terms of complex variables
(for instance, the Fourier frequency domain). A common alternative is the “CR calculus” (or “Wirtinger
calculus”), in which you write
df = (∂f /∂z) dz + (∂f /∂z̄) dz̄,
as if z and z̄ were independent variables. This can be extended to gradients, Jacobians, steepest descent, and
Newton iterations, for example. (A short worked example appears at the end of this section.) A nice review
of this concept can be found in these UCSD course notes by K. Kreutz-Delgado.
• Many, many more derivative results for matrix functions and factorizations can be found in the literature,
some of them quite tricky to derive. For example, a number of references are listed in this GitHub issue for
the ChainRules package.
• Another important generalization of differential calculus is to derivatives on curved manifolds and differential
geometry, leading to the exterior derivative.
• When differentiating eigenvalues λ of matrices A(x), a complication arises at eigenvalue crossings (where the
multiplicity k > 1). Here, the eigenvalues and eigenvectors usually cease to be differentiable. More generally,
this problem arises for any implicit function with a repeated root. In this case, one option is to use an expanded
definition of sensitivity analysis called a generalized gradient (a k×k matrix-valued linear operator G(x)[dx]
whose eigenvalues are the perturbations dλ). See for example Cox (1995), Seyranian et al. (1994), and
Stechlinski (2022). Physicists call a related idea “degenerate perturbation theory.” A recent formulation
of similar ideas is called the lexicographic directional derivative; see for example Nesterov (2005) and
Barton et al. (2017).
Sometimes, optimization problems involving eigenvalues can be reformulated to avoid this difficulty by using
SDP constraints. See for example Men et al. (2014).
For a defective matrix the situation is worse: even the generalized derivatives blow up, because dλ can be
proportional to (e.g.) the square root of the perturbation ∥dA∥ (for an eigenvalue with algebraic multiplicity
2 and geometric multiplicity 1).
• Famous generalizations of differentiation are the “distributional” and “weak” derivatives, which allow one,
for example, to obtain Dirac delta “functions” by differentiating discontinuities. This requires changing not
only the definition of “derivative,” but also the definition of “function,” as reviewed at an elementary level in
these MIT course notes.
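To make the first bullet above concrete, here is a minimal Julia sketch of our own (the function sqr_halfnorm and its rule are hypothetical illustrations, not taken from any package's documentation): we pretend AD cannot see inside sqr_halfnorm and supply its reverse-mode rule by hand through ChainRulesCore, which Zygote then uses.

using Zygote, ChainRulesCore

sqr_halfnorm(x) = 0.5 * sum(abs2, x)     # f(x) = ||x||^2 / 2; pretend this is opaque to AD

function ChainRulesCore.rrule(::typeof(sqr_halfnorm), x)
    y = sqr_halfnorm(x)
    # pullback (vJp): for a scalar output, ȳ maps to ȳ * ∇f = ȳ * x
    sqr_halfnorm_pullback(ȳ) = (NoTangent(), ȳ * x)
    return y, sqr_halfnorm_pullback
end

println(Zygote.gradient(sqr_halfnorm, [1.0, 2.0, 3.0])[1])   # [1.0, 2.0, 3.0], via our rule

The analogous JAX versions are written with jax.custom_vjp (or jax.custom_jvp for a forward-mode rule).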
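And for the CR-calculus bullet, a quick worked example of our own: take f (z) = |z|² = z z̄, which is real-valued and not complex-differentiable away from z = 0. The Wirtinger rules give ∂f /∂z = z̄ and ∂f /∂z̄ = z, so
df = z̄ dz + z dz̄ = 2 Re(z̄ dz),
which, writing z = x + iy, recovers the familiar real gradient (2x, 2y) of x² + y².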