Lecture 12: October 4
This lecture covers the second half of Newton's method. Four topics are covered: handling equality constraints, examples, convergence, and variants.

12.1 Handling Equality Constraints

We consider the equality-constrained optimization problem
$$
\begin{aligned}
\min_{x} \quad & f(x) \\
\text{s.t.} \quad & h(x) = 0.
\end{aligned}
\qquad (12.1)
$$
Here we assume that f(x) : R^d → R is strictly convex and at least twice differentiable, and that h(x) : R^d → R^k is differentiable. We will also use the following convenient notation: g(x) = ∇f(x) is the gradient of f, H(x) = ∇²f(x) is the Hessian of f, and J(x) ∈ R^{k×d} is the Jacobian of h.
The following is the key theorem; we provide an intuitive explanation immediately afterwards.

Theorem 12.1 If x* is a local minimizer of (12.1), then
$$ g(x^*) = J(x^*)^T \lambda $$
for some λ ∈ R^k.
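One standard way to see where this condition comes from (a derivation not spelled out in these notes) is through the Lagrangian of (12.1):
$$
\mathcal{L}(x, \lambda) = f(x) - \lambda^T h(x).
$$
Setting its partial derivatives to zero at x* gives
$$
\nabla_x \mathcal{L}(x^*, \lambda) = g(x^*) - J(x^*)^T \lambda = 0,
\qquad
\nabla_\lambda \mathcal{L}(x^*, \lambda) = -h(x^*) = 0,
$$
which recovers exactly the condition of Theorem 12.1 together with feasibility.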
Consider the plot on page 4 of the slides, where the contour plot represents f(x) and the green curve represents the surface h(x) = 0. By looking at the plot (and with a little imagination), we can see that a local minimizer x* must occur at a point where the contour of f is tangent (parallel) to the green curve h(x) = 0. Otherwise, we would be able to move x* a little bit along h(x) = 0 and find a better minimizer. This intuition can be interpreted in 3 different ways:
In the unconstrained case (from previous lectures), Newton's method can be summarized as
$$ x^+ = x + \Delta x_{nt} \qquad (12.2) $$
where Δx_nt = −H(x)^{-1} g(x) is called the Newton step. A nice interpretation of the Newton step is that it minimizes the second-order approximation of f at x:
$$
\Delta x_{nt} = \arg\min_{\Delta x \in \mathbb{R}^d} \hat f(x + \Delta x)
= \arg\min_{\Delta x \in \mathbb{R}^d} \; f(x) + g(x)^T \Delta x + \tfrac{1}{2} \Delta x^T H(x) \Delta x. \qquad (12.3)
$$
In the equality-constrained case, we define the Newton step analogously by minimizing the same second-order approximation subject to the constraint:
$$
\begin{aligned}
\arg\min_{\Delta x \in \mathbb{R}^d} \quad & f(x) + g(x)^T \Delta x + \tfrac{1}{2} \Delta x^T H(x) \Delta x \\
\text{s.t.} \quad & h(x + \Delta x) = 0.
\end{aligned}
\qquad (12.4)
$$
After linearizing the constraint around x, the solution can be obtained from the following linear (KKT) system:
$$
\begin{bmatrix} H(x) & J(x)^T \\ J(x) & 0 \end{bmatrix}
\begin{bmatrix} \Delta x_{nt} \\ \lambda \end{bmatrix}
=
\begin{bmatrix} -g(x) \\ -h(x) \end{bmatrix}.
\qquad (12.5)
$$
To explain why, let's expand the matrix form into two equations:
$$ H(x) \Delta x_{nt} + J(x)^T \lambda = -g(x), \qquad (12.6) $$
$$ J(x) \Delta x_{nt} = -h(x). \qquad (12.7) $$
Now we can see that Equation 12.6 comes directly from Theorem 12.1 (applied to the quadratic approximation in (12.4)), whereas Equation 12.7 comes from Newton's update for solving h(x) = 0, i.e., the linearization h(x) + J(x) Δx_nt = 0 (for a square invertible J this reads Δx_nt = x^+ − x = −J(x)^{-1} h(x)).
The good thing about Equation 12.5 is that it enables us to solve for Δx_nt much more easily than Equation 12.4. Assuming that f is strictly convex, H has full rank; assuming independent constraints, J has full row rank. Thus the matrix on the LHS is invertible, and the Newton step can be calculated by direct matrix inversion and then plugged into Equation 12.2 for the update rule. In cases where H is expensive or unreliable to invert directly (e.g., when H is large and sparse), we can still use Gaussian elimination to solve the linear system without explicit inversion. The bundle adjustment problem discussed later is one such example.
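As a concrete illustration (a minimal sketch, not from the notes; the quadratic objective and linear constraint below are assumptions chosen for simplicity), the KKT system (12.5) can be assembled and solved with a generic linear solver:

import numpy as np

def constrained_newton_step(H, J, g, h):
    """Solve the KKT system (12.5) for the Newton step and the multiplier.

    H : (d, d) Hessian of f at x        J : (k, d) Jacobian of h at x
    g : (d,)   gradient of f at x       h : (k,)   constraint value h(x)
    """
    d, k = H.shape[0], J.shape[0]
    # Assemble the block matrix [[H, J^T], [J, 0]] and the right-hand side.
    KKT = np.block([[H, J.T], [J, np.zeros((k, k))]])
    rhs = np.concatenate([-g, -h])
    # np.linalg.solve uses an LU factorization (Gaussian elimination),
    # so the matrix is never inverted explicitly.
    sol = np.linalg.solve(KKT, rhs)
    return sol[:d], sol[d:]

# Toy example: f(x) = 1/2 x^T Q x with constraint A x = b (both made up).
Q = np.array([[3.0, 1.0], [1.0, 2.0]])
A = np.array([[1.0, 1.0]])
b = np.array([1.0])
x = np.array([2.0, -1.0])                      # current feasible iterate
dx, lam = constrained_newton_step(Q, A, Q @ x, A @ x - b)
print(x + dx)

Since the objective is exactly quadratic here, a single Newton step lands on the constrained minimizer; for a general f the step is plugged into Equation 12.2 and iterated.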
12.2 Examples
When h(x) = Ax, where A ∈ R^{k×d}, and the current iterate is feasible (h(x) = 0), the linear system in Equation 12.5 becomes
$$
\begin{bmatrix} H(x_k) & A^T \\ A & 0 \end{bmatrix}
\begin{bmatrix} \Delta x \\ \lambda \end{bmatrix}
=
\begin{bmatrix} -g \\ 0 \end{bmatrix}.
$$
Suppose that H(x) is invertible (e.g., f(x) is strictly convex) and A has full row rank; then the linear system is solvable. Hence
$$
\begin{bmatrix} \Delta x \\ \lambda \end{bmatrix}
=
\begin{bmatrix} H(x_k) & A^T \\ A & 0 \end{bmatrix}^{-1}
\begin{bmatrix} -g \\ 0 \end{bmatrix}.
$$
We'll now derive the MLE for an exponential family using Newton's method. Let {x_i}_{i=1}^N be i.i.d. samples with a density in an exponential family, i.e.
$$ p(x_i \mid \theta) = \exp(\theta^T x_i - A(\theta)), $$
where
$$ A(\theta) = \ln \int_{\mathcal{X}} \exp(\theta^T x)\, dx. $$
Define the negative log-likelihood L(θ) = −∑_{i=1}^N (θ^T x_i − A(θ)). Let's calculate its gradient, its Hessian, and the Newton step.
$$
\begin{aligned}
dL(\theta) &= -N \cdot d\Big( \frac{1}{N} \sum_i \big( x_i^T \theta - A(\theta) \big) \Big) \\
&= -N \Big( \frac{1}{N} \sum_i x_i^T - \nabla A(\theta)^T \Big) d\theta \\
&= N \big( -\bar{x}^T + \nabla A(\theta)^T \big) d\theta \\
\Rightarrow \quad \nabla L(\theta) &= N \big( \nabla A(\theta) - \bar{x} \big) \\
\Rightarrow \quad \nabla^2 L(\theta) &= N \, \nabla^2 A(\theta),
\end{aligned}
$$
where x̄ = (1/N) ∑_i x_i is the sample mean.
Now we further work out ∇A(θ) and ∇²A(θ):
$$
\begin{aligned}
A(\theta) &= \ln \int_{\mathcal{X}} \exp(\theta^T x)\, dx, \\
\nabla A(\theta) &= \frac{1}{\int_{\mathcal{X}} \exp(\theta^T x)\, dx} \int_{\mathcal{X}} x \exp(\theta^T x)\, dx \\
&= \int_{\mathcal{X}} x\, p(x \mid \theta)\, dx \\
&= \mathbb{E}[x \mid \theta], \\
\nabla^2 A(\theta) &= \nabla_\theta \int_{\mathcal{X}} x \exp(\theta^T x - A(\theta))\, dx \\
&= \int_{\mathcal{X}} x \exp(\theta^T x - A(\theta)) \big( x^T - \nabla A(\theta)^T \big)\, dx \\
&= \int_{\mathcal{X}} p(x \mid \theta) \big( x x^T - x\, \mathbb{E}[x \mid \theta]^T \big)\, dx \\
&= \mathrm{Var}[x \mid \theta].
\end{aligned}
$$
Putting these together, the Newton step for the MLE is
$$
\Delta\theta_{nt} = -\nabla^2 L(\theta)^{-1} \nabla L(\theta) = \mathrm{Var}[x \mid \theta]^{-1} \big( \bar{x} - \mathbb{E}[x \mid \theta] \big).
$$
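To make this concrete, here is a minimal sketch (not from the notes) for a hypothetical one-dimensional case: the Poisson distribution in natural-parameter form, where A(θ) = e^θ, so E[x | θ] = Var[x | θ] = e^θ and the Newton update has a closed form.

import numpy as np

rng = np.random.default_rng(0)
x = rng.poisson(lam=4.0, size=1000)      # synthetic i.i.d. samples (assumption)
x_bar = x.mean()

# Poisson in natural parameters: A(theta) = exp(theta), so
# grad A = E[x|theta] = exp(theta) and Hessian of A = Var[x|theta] = exp(theta).
theta = 0.0
for it in range(10):
    mean = np.exp(theta)                 # E[x | theta]
    var = np.exp(theta)                  # Var[x | theta]
    step = (x_bar - mean) / var          # Newton step Var^{-1} (x_bar - E[x])
    theta += step
    print(it, theta, abs(step))
    if abs(step) < 1e-12:
        break

print("Newton MLE:", theta, "closed form:", np.log(x_bar))

After the first couple of iterations the step sizes shrink very rapidly, and the iterate matches the closed-form MLE θ* = ln x̄ within a handful of iterations.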
In this example, we visit a case where the Hessian is large and sparse and therefore cannot be reliably inverted directly. On page 8 of the slides, we consider a bundle adjustment problem as follows. Suppose we have a robot wandering in a room in which K landmarks are present. At each time step t, 3 types of (noisy) sensor readings are available to the robot:
The problem for the robot is therefore to infer the following 3 sets of variables from the above noisy sensors:
where θ_t is further encoded as u_t = (cos θ_t, sin θ_t) for convenience. Consider the following assumptions:
• w_t = θ_{t+1} − θ_t + ε_{w,t}.
12.3 Convergence

Assuming a strictly convex f whose Hessian is Lipschitz continuous, the convergence of Newton's method is "quadratic" in the number of iterations. That is,
• Given a target error ε, the number of iterations needed to achieve that error is k = O(ln ln ε^{-1}).
• Given k iterations, the error shrinks as quickly as ε = O(e^{-2^k}).
More specifically, the convergence of Newton's method consists of two phases. During the "damped phase", the optimization takes a number of iterations that is O(1), i.e., bounded by a constant that does not depend on the target error ε; during the "quadratic phase", the optimization takes O(ln ln ε^{-1}) iterations, which is essentially less than 6. More details about the convergence analysis of Newton's method can be found in Boyd and Vandenberghe's book on page 488.
As impressive as this fast convergence is, it is important to keep in mind that the result is stated in terms of the number of iterations. Although few iterations are needed for convergence, each iteration needs to compute and invert (or factorize) the Hessian, which typically costs O(n^3), where n is the dimensionality. For high-dimensional problems it can be prohibitively expensive to run even one iteration. Therefore, first-order methods are still attractive in many settings.
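As an illustration of the quadratic phase (a minimal sketch, not from the notes; the test function below is an arbitrary strictly convex choice), one can watch the error collapse at least quadratically once the iterates enter the quadratic region:

import numpy as np

# Strictly convex test function f(x) = sum_i (exp(x_i) - x_i),
# whose unique minimizer is x = 0.
def grad(x):
    return np.exp(x) - 1.0

def hess(x):
    return np.diag(np.exp(x))

x = np.array([1.0, -0.8, 0.5])
for it in range(8):
    dx = np.linalg.solve(hess(x), -grad(x))   # Newton step
    x = x + dx
    err = np.linalg.norm(x)                   # distance to the minimizer 0
    print(f"iter {it}: error = {err:.3e}")    # shrinks at least quadratically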
12.4 Variants
12.4.1 Trust region

Although Newton's method has a fast convergence rate, it may also diverge quickly when the Hessian matrix is ill-conditioned. One way to avoid "bad" steps is to introduce a trust region. For example, we may instead solve
$$ (H(x) + tI)\, \Delta x = -g(x) $$
to find the unconstrained Newton step, where I is the identity matrix. The intuition is that when t is large, the method essentially reduces to gradient descent, since the update step is approximately −(tI)^{-1} g = −g/t, a small step along the negative gradient. On the contrary, when t is small, we recover Newton's method. So the strategy is to use a large t initially, to move along the gradient, and a small t later on, to speed up convergence. Another example of a trust-region-style modification is
$$ (H(x) + t(I \circ H(x)))\, \Delta x = -g(x). $$
The advantage of this version is that it makes the algorithm invariant to scaling.
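A minimal sketch of this damping idea (the adaptive rule for t below is one common heuristic, an assumption rather than the rule used in the lecture):

import numpy as np

def damped_newton(grad, hess, x, t=1.0, iters=50, tol=1e-10):
    """Newton iterations with a damping term (H + t*I), adapting t on the fly."""
    for _ in range(iters):
        g, H = grad(x), hess(x)
        if np.linalg.norm(g) < tol:
            break
        dx = np.linalg.solve(H + t * np.eye(len(x)), -g)
        # Heuristic (assumption): shrink t if the step reduced the gradient norm,
        # otherwise grow t and retry from the same point.
        if np.linalg.norm(grad(x + dx)) < np.linalg.norm(g):
            x, t = x + dx, 0.5 * t          # accept step, trust Newton more
        else:
            t = 2.0 * t                     # reject step, behave more like gradient descent
    return x

# Reusing grad/hess from the previous sketch:
# x_star = damped_newton(grad, hess, np.array([1.0, -0.8, 0.5]))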
12.4.2 Quasi-Newton
Quasi-Newton methods use only gradient information to successively build an estimate of the Hessian matrix (or of its inverse). For example, finite differences of gradients at nearby points may be used to approximate the Hessian. An example of a quasi-Newton method is L-BFGS, which can often obtain a "good enough" estimate while storing only a small amount of past gradient information.
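For instance (a minimal usage sketch, not from the notes; the test function is the same arbitrary one as in the earlier sketch), SciPy's L-BFGS implementation can be called as follows:

import numpy as np
from scipy.optimize import minimize

def f(x):
    return np.sum(np.exp(x) - x)     # same strictly convex test function

def grad(x):
    return np.exp(x) - 1.0

x0 = np.array([1.0, -0.8, 0.5])
res = minimize(f, x0, jac=grad, method="L-BFGS-B")
print(res.x, res.nit)   # minimizer (near 0) and the number of iterations used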
12.4.3 Gauss-Newton

The Gauss-Newton method addresses the case f(θ) = ∑_{i=1}^N ½ (y_i − f(x_i, θ))², where the minimization is with respect to θ. In this case we approximate f(x_i, θ + Δθ) ≈ f(x_i, θ) + ∇f(x_i, θ)^T Δθ, so that
$$
\sum_{i=1}^N \frac{1}{2} \big( y_i - f(x_i, \theta + \Delta\theta) \big)^2
\approx \sum_{i=1}^N \frac{1}{2} \big( r_i(\theta) - \nabla f(x_i, \theta)^T \Delta\theta \big)^2
= \sum_i \frac{1}{2} r_i(\theta)^2 + P^T \Delta\theta + \frac{1}{2} \Delta\theta^T Q \, \Delta\theta,
$$
where r_i(θ) = y_i − f(x_i, θ), P = −∑_{i=1}^N (y_i − f(x_i, θ)) ∇f(x_i, θ), and Q = ∑_{i=1}^N ∇f(x_i, θ) ∇f(x_i, θ)^T. Therefore the unconstrained Newton step is
$$
\Delta\theta = \Big( \sum_{i=1}^N \nabla f(x_i, \theta) \nabla f(x_i, \theta)^T \Big)^{-1} \Big( \sum_{i=1}^N (y_i - f(x_i, \theta)) \nabla f(x_i, \theta) \Big).
$$
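A minimal Gauss-Newton sketch for a hypothetical nonlinear least-squares problem (the exponential-decay model and the synthetic data below are assumptions for illustration):

import numpy as np

# Hypothetical model: f(x, theta) = theta0 * exp(-theta1 * x).
def model(x, theta):
    return theta[0] * np.exp(-theta[1] * x)

def model_grad(x, theta):
    # Gradient of the model w.r.t. theta, one row per data point (N x 2).
    e = np.exp(-theta[1] * x)
    return np.stack([e, -theta[0] * x * e], axis=1)

rng = np.random.default_rng(0)
x = np.linspace(0.0, 4.0, 50)
y = model(x, np.array([2.0, 0.7])) + 0.05 * rng.standard_normal(50)

theta = np.array([1.0, 0.1])                  # initial guess
for _ in range(20):
    r = y - model(x, theta)                   # residuals r_i(theta)
    J = model_grad(x, theta)                  # stacked gradients of the model
    # Gauss-Newton step: Q^{-1} * sum_i r_i * grad f_i, with Q = J^T J.
    dtheta = np.linalg.solve(J.T @ J, J.T @ r)
    theta = theta + dtheta
    if np.linalg.norm(dtheta) < 1e-10:
        break

print(theta)    # close to the parameters used to generate the data, [2.0, 0.7]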