
Matrix Calculus lecture notes:

How can we use so many derivatives?


… a couple of applications
… and the “adjoint method”

Matrix Calculus, IAP 2023


Profs. Steven G. Johnson & Alan Edelman, MIT

Newton’s method: Nonlinear equations via linearization
(scalar in → scalar out)

18.01: solving f(x) = 0:

1. Linearize: f(x+δx) ≈ f(x) + f′(x)δx
2. Solve the linear equation: f(x) + f′(x)δx = 0 ⇒ δx = –f(x)/f′(x)
3. Update x: x ⟵ x – f(x)/f′(x)

[Figure courtesy of Ralf Pfeifer on Wikimedia. License: CC BY-SA. Some rights reserved. This content is excluded from our Creative Commons license. For more information, see https://ocw.mit.edu/help/faq-fair-use.]
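As a concrete illustration (not part of the original slides), here is a minimal Julia sketch of this scalar iteration; the function name, tolerances, and the √2 example are our own illustrative choices.

```julia
# Minimal sketch of 1D Newton's method: x ← x − f(x)/f′(x).
# Tolerance, iteration cap, and the example below are illustrative.
function newton1d(f, f′, x; tol=1e-12, maxiter=50)
    for _ in 1:maxiter
        abs(f(x)) < tol && break
        x -= f(x) / f′(x)        # linearize, solve, update
    end
    return x
end

newton1d(x -> x^2 - 2, x -> 2x, 1.0)   # ≈ 1.4142135623730951 (√2)
```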
Multidimensional Newton’s method: Real world is nonlinear!
(vector in → vector out)

18.06: solving f(x) = 0, where x ∈ ℝⁿ (input = vector) and f and 0 ∈ ℝⁿ (output = vector)

1. Linearize: f(x+δx) ≈ f(x) + f′(x)δx, where f′(x) is the Jacobian matrix
2. Solve the linear equation: f(x) + f′(x)δx = 0 ⇒ δx = –f′(x)⁻¹f(x) (inverse Jacobian times f)
3. Update x: x ⟵ x – f′(x)⁻¹f(x)

That’s it! Once we have the Jacobian, we just solve a linear system on each step.

Converges amazingly fast: doubles the number of correct digits (squares the error) on each step (“quadratic convergence”)!

Caveat: needs a starting guess close enough to the root (google “Newton fractal”…)
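A hedged Julia sketch of the multidimensional iteration, assuming we can supply the Jacobian analytically; the `newton` name, tolerances, and the toy circle/line system are illustrative, not from the lecture.

```julia
using LinearAlgebra

# Sketch of multidimensional Newton: solve the linear system J(x) δx = −f(x)
# on each step, then update x. Constants and the example are illustrative.
function newton(f, J, x; tol=1e-10, maxiter=100)
    for _ in 1:maxiter
        fx = f(x)
        norm(fx) < tol && break
        x = x - J(x) \ fx      # δx = −J(x)⁻¹ f(x), via a linear solve, not an explicit inverse
    end
    return x
end

# Toy example: intersect the unit circle x₁² + x₂² = 1 with the line x₁ = x₂.
f(x) = [x[1]^2 + x[2]^2 - 1, x[1] - x[2]]
J(x) = [2x[1] 2x[2]; 1.0 -1.0]            # analytical Jacobian
newton(f, J, [1.0, 0.3])                  # ≈ [0.7071…, 0.7071…]
```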
Nonlinear optimization: min f(x) (or maximize), x ∈ ℝⁿ

–∇f points downhill (steepest descent). Even if we have n = 10⁶ parameters x, we can evolve them all simultaneously in the downhill direction.

Reverse-mode / adjoint / left-to-right / backpropagation: computing ∇f costs about the same as evaluating f(x) once.

Makes large-scale optimization practical: training neural nets, optimizing the shape of an airplane wing, portfolio optimization…

[Figure: contours of f(x), with –∇f arrows pointing downhill. This image is in the public domain.]
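To make the cost claim tangible, here is a small illustration (our own, not from the slides) using the reverse-mode AD package Zygote.jl on an arbitrary toy objective; the objective and step size are arbitrary choices.

```julia
using Zygote   # reverse-mode AD ("backpropagation") in Julia

# Toy smooth objective with a million parameters (illustrative only).
loss(x) = sum(abs2, x .- 1) + 0.1 * sum(abs2, x[2:end] .- x[1:end-1])

x = randn(10^6)
∇f = Zygote.gradient(loss, x)[1]   # cost: roughly a small constant multiple of one evaluation of loss
x .-= 0.1 .* ∇f                    # one steepest-descent step for all 10⁶ parameters at once
```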
Nonlinear optimization: Lots of complications

● How far do we “step” in the –∇f direction?
  ○ Line search: minα f(x–α∇f) — backtrack if not improved (sketched in code below)
  ○ and/or limit the step size to a trust region, grown/shrunk as needed
  ○ Details are tricky to get right
● Constraints: min f(x) subject to gₖ(x) ≤ 0
  ○ Algorithms still need the gradients ∇gₖ!
● Faster convergence by “remembering” previous steps
  ○ Steepest descent tends to “zig-zag” in narrow valleys
  ○ “Momentum” terms & conjugate gradients — simple “memory”
  ○ Fancier: estimate the second-derivative “Hessian matrix” from the sequence of ∇f changes: the BFGS algorithm
● Lots of refinements & competing algorithms…
  ○ try out multiple (pre-packaged) algorithms on your problem!

[Figure: slow convergence of steepest descent, zig-zagging downhill in a narrow valley. This image is in the public domain.]
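Here is a minimal sketch (our own, not the lecture's algorithm) of steepest descent with a backtracking line search; the halving factor, tolerances, and the narrow-valley example are arbitrary illustrative choices.

```julia
using LinearAlgebra

# Steepest descent with backtracking line search: step in the −∇f direction,
# halving α until f actually decreases. All constants here are illustrative.
function steepest_descent(f, ∇f, x; α0=1.0, tol=1e-8, maxiter=1000)
    for _ in 1:maxiter
        g = ∇f(x)
        norm(g) < tol && break
        fx, α = f(x), α0
        while f(x .- α .* g) ≥ fx && α > 1e-12
            α /= 2                       # backtrack if the step did not improve f
        end
        x = x .- α .* g
    end
    return x
end

# Toy usage on a narrow quadratic valley (watch it zig-zag):
f_valley(x) = x[1]^2 + 25x[2]^2
∇f_valley(x) = [2x[1], 50x[2]]
steepest_descent(f_valley, ∇f_valley, [1.0, 1.0])   # → approximately [0, 0]
```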
Some parting advice:

Often, the main trick is finding the right mathematical formulation of your problem — i.e. what function, what constraints, what parameters? — which lets you exploit the best algorithms.

…but if you have many (> 10) parameters, always use an analytical gradient (not finite differences!) … computed efficiently in reverse mode.
Engineering/physical optimization

● Design parameters p: geometry, materials, forces, unknowns…
● Physical model(s): solid mechanics, chemical reactions, heat transport, electromagnetism, acoustics, fluid flow… (e.g. a linear model: A(p)x = b(p))
● Model solution x(p): forces, displacements, concentrations, temperatures, electric/magnetic fields, pressures, velocities, …
● Design objective f(x(p)): strength, speed, power, efficiency, dissipation loss, match to experiment, …

Maximize/minimize f(x(p)) using the gradient ∇ₚf
… ∇ₚf computed by reverse-mode/“adjoint” methods
Example: “Topology optimization” of a chair
…optimizing every voxel to support weight with minimal material
(either voxel “density” or a “level-set” function)

[© Source unknown. All rights reserved. This content is excluded from our Creative Commons license. For more information, see https://ocw.mit.edu/help/faq-fair-use.]
[© Joris Laarman. All rights reserved. This content is excluded from our Creative Commons license. For more information, see https://ocw.mit.edu/help/faq-fair-use.]
Adjoint differentiation
(yet another example of left-to-right/reverse-mode differentiation)

Example: gradient of a scalar f(x(p)) where A(p)x = b, i.e. f(A(p)⁻¹b)

● df = f′(x) dx = f′(x) d(A⁻¹) b = – f′(x) A⁻¹ dA A⁻¹ b
  [here f′(x) is a row vector, dA is a matrix, A⁻¹b = x, and f′(x)A⁻¹ is the “adjoint” solution vᵀ]
● “Adjoint method”: just multiply left-to-right! df = – (f′(x) A⁻¹) dA x
  ○ i.e. solve the “adjoint equation” Aᵀv = f′(x)ᵀ for v (“adjoint” meaning “transpose”)
  ○ … then df = –vᵀ dA x
  ○ For any given parameter pₖ, ∂f/∂pₖ = –vᵀ (∂A/∂pₖ) x (& usually ∂A/∂pₖ is very sparse)
● i.e. it takes only two solves to get both f and ∇f 🤓
  ○ Solve Ax = b once to get f(x), then solve one more time with Aᵀ for v
  ○ … then all derivatives ∂f/∂pₖ are just some cheap dot products
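To make the two solves concrete, here is a minimal Julia sketch under our own naming assumptions (none of these names appear in the lecture): `gradf(x)` returns the column vector ∇f(x) = f′(x)ᵀ, `A(p)` builds the matrix, and `dA(p, k)` returns ∂A/∂pₖ.

```julia
using LinearAlgebra

# Adjoint (reverse-mode) gradient of f(x(p)) where A(p) x = b:
# one forward solve, one adjoint solve, then cheap products per parameter.
function adjoint_gradient(gradf, A, b, dA, p)
    Ap = A(p)                 # in practice, factor Ap once and reuse the factorization
    x = Ap \ b                # forward solve:  A x = b
    v = Ap' \ gradf(x)        # adjoint solve:  Aᵀ v = ∇f(x)
    ∇pf = [-dot(v, dA(p, k) * x) for k in eachindex(p)]   # ∂f/∂pₖ = −vᵀ (∂A/∂pₖ) x
    return x, ∇pf
end
```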
Don’t use right-to-left “forward-mode” derivatives with lots of parameters!

∂f/∂pₖ = – f′(x) (A⁻¹ (∂A/∂pₖ x)) = one solve per parameter pₖ! ☹
  [f′(x) is a row vector; each A⁻¹(∂A/∂pₖ x) is a linear solve with a different right-hand side]

Right-to-left (a.k.a. forward mode) is better with 1 input & many outputs.
Left-to-right (a.k.a. backward mode, adjoint, backpropagation) is better with 1 output & many inputs.

(Note: Using dual numbers is forward mode. Most AD uses the term “forward” if it is forward mode, e.g. ForwardDiff.jl in Julia is forward mode; jax.jacfwd in Python is forward mode.)
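For contrast, the same gradient computed in the right-to-left ordering (a sketch reusing the hypothetical `gradf`, `A`, `dA` helpers from the previous sketch): the linear solve now sits inside the loop over parameters.

```julia
using LinearAlgebra

# Right-to-left ("forward") ordering: one linear solve per parameter,
# so the cost grows with length(p). Sketch for comparison only.
function forward_gradient(gradf, A, b, dA, p)
    Ap = A(p)
    x = Ap \ b
    return [-dot(gradf(x), Ap \ (dA(p, k) * x)) for k in eachindex(p)]   # a solve per pₖ ☹
end
```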
Don’t use finite differences with lots of parameters!

∂f/∂pₖ ≈ [ f(p + ε eₖ) – f(p) ] / ε   (eₖ = unit vector, ε = small number)

… requires one solve x(p + ε eₖ) for each parameter pₖ ☹

… even worse if you use fancier finite-difference approximations
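For completeness, here is the naive forward-difference gradient as a sketch (the helper name and ε are our own), which makes the one-solve-per-parameter cost explicit:

```julia
# Naive forward-difference gradient: every component needs a fresh
# evaluation of f, i.e. a fresh model solve. The step ε is a delicate choice.
function fd_gradient(f, p; ε=1e-6)
    f0 = f(p)
    g = similar(p)
    for k in eachindex(p)
        pk = copy(p)
        pk[k] += ε                  # p + ε eₖ
        g[k] = (f(pk) - f0) / ε     # one extra solve per parameter ☹
    end
    return g
end
```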
Adjoint differentiation with nonlinear equations

Example: gradient of a scalar f(x(p)), where x(p) ∈ ℝⁿ solves g(p,x) = 0 ∈ ℝⁿ

● g(p,x) = 0 ⟹ dg = ∂g/∂p dp + ∂g/∂x dx = 0 ⟹ dx = –(∂g/∂x)⁻¹ ∂g/∂p dp
  [a.k.a. the “implicit-function theorem”; ∂g/∂x is the Jacobian matrix, whose inverse is also used in a Newton solver for x!]
● df = f′(x) dx = – ( f′(x) (∂g/∂x)⁻¹ ) ∂g/∂p dp, where f′(x)(∂g/∂x)⁻¹ = vᵀ is the “adjoint” solution
  ⟹ adjoint equation: (∂g/∂x)ᵀv = f′(x)ᵀ
● i.e. it takes only two solves to get both f and ∇f 🤓
  ○ one nonlinear solve for x, and one linear solve for v!
  ○ … then all derivatives ∂f/∂pₖ = –vᵀ ∂g/∂pₖ are just some cheap dot products
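A hedged sketch of this recipe, assuming user-supplied Jacobian functions `gx(p, x) = ∂g/∂x` (n×n) and `gp(p, x) = ∂g/∂p` (n × number of parameters), a gradient function `gradf(x) = ∇f(x)`, and a nonlinear solver `solve_x(p)` (e.g. the Newton sketch above); all of these names are our own.

```julia
# Adjoint gradient of f(x(p)) where x solves g(p, x) = 0:
# one nonlinear solve for x, one linear (adjoint) solve for v.
function nonlinear_adjoint_gradient(gradf, gx, gp, solve_x, p)
    x = solve_x(p)                    # nonlinear solve for x(p), e.g. by Newton's method
    v = gx(p, x)' \ gradf(x)          # adjoint solve: (∂g/∂x)ᵀ v = ∇f(x)
    ∇pf = -(gp(p, x)' * v)            # ∂f/∂pₖ = −vᵀ ∂g/∂pₖ: cheap products
    return x, ∇pf
end
```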
You need to understand adjoint methods even if you use AD

● Helps you understand when to use forward vs. reverse mode!
● Many physical models call large software packages written over decades in various languages, and cannot be differentiated automatically by AD
  ○ You often just need to supply a “vector–Jacobian product” yᵀdx for the physics, or even just part of the physics, and then AD will differentiate the rest and apply the chain rule for you (see the sketch below)
● Often models involve approximate calculations, but AD tools don’t know this & spend extra effort trying to differentiate the error in your approximation
  ○ If you solve for x by an iterative method (e.g. Newton), it is inefficient for AD to backpropagate through the iteration … instead, you want to take the derivative of the underlying equation g(p,x) = 0
  ○ For discretized physics (e.g. finite-element methods), it is often more efficient (and sufficiently accurate) to apply the adjoint method to the continuous physics (“differentiate-then-discretize”)
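As an illustration of supplying just a vector–Jacobian product, here is a hedged sketch using Julia's ChainRulesCore package to give reverse-mode AD the adjoint rule for a linear solve; the `linsolve` wrapper and this particular rule are our own example, not from the lecture. With such a rule defined, an AD tool that consumes ChainRules rules (e.g. Zygote) can differentiate the surrounding code without ever looking inside the solver.

```julia
using ChainRulesCore, LinearAlgebra

linsolve(A, b) = A \ b     # stand-in for "the physics": x solves A x = b

# Custom reverse-mode rule: given the output cotangent x̄ (think ∇ₓf), return
# the vector–Jacobian products with respect to A and b via one adjoint solve.
function ChainRulesCore.rrule(::typeof(linsolve), A, b)
    x = A \ b
    function linsolve_pullback(x̄)
        v = A' \ unthunk(x̄)             # adjoint solve: Aᵀ v = x̄
        return NoTangent(), -v * x', v   # (no tangent for linsolve itself, Ā = −v xᵀ, b̄ = v)
    end
    return x, linsolve_pullback
end
```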
MIT OpenCourseWare
https://ocw.mit.edu

18.S096 Matrix Calculus for Machine Learning and Beyond


Independent Activities Period (IAP) 2023

For information about citing these materials or our Terms of Use, visit: https://ocw.mit.edu/terms.
