
10-725: Optimization Fall 2012

Lecture 12: October 4


Lecturer: Geoff Gordon/Ryan Tibshirani Scribes: Huan-Kai Peng, Hao-Chih Lee

Note: LaTeX template courtesy of UC Berkeley EECS dept.


Disclaimer: These notes have not been subjected to the usual scrutiny reserved for formal publications.
They may be distributed outside this class only with the permission of the Instructor.

This lecture covers the second half of Newton’s method. Four topics are covered: handling equality constraints, examples, convergence, and variants.

12.1 Newton’s method under equality constraints

The problem we consider in this section is of the form

min_x f(x)
s.t. h(x) = 0.    (12.1)

Here we assume that f(x) : Rd → R is strictly convex and at least twice differentiable, and that h(x) : Rd → Rk is differentiable. We will also use the following notation:

• g(x) : Rd → Rd denotes the gradient of f (x).

• H(x) : Rd → Rd×d denotes the Hessian of f (x).

• J(x) : Rd → Rk×d denotes the Jacobian of h(x).

The following is the key theorem; we provide an intuitive explanation immediately afterward.

Theorem 12.1 The minimizer x∗ of problem (12.1) satisfies

g(x∗ ) = J(x∗ )T λ

for some λ ∈ Rk .

12.1.1 Intuition in a 1-D case

Consider the plot on page 4 of the slides, where the contour plot represents f(x) and the green curve represents the surface h(x) = 0. By looking at the plot (with a little imagination), we can see that the local minimizer x∗ must occur where a contour of f is parallel to the green curve h(x) = 0. Otherwise, we could move x∗ a little along h(x) = 0 and find a better point. This intuition can be interpreted in 3 different ways:

1 The contour of f (x) is tangent to h(x) at x∗ .


2 g(x∗ ) is normal to h(x) at x∗ .

3 g(x∗ ) is parallel to J(x∗ ).

In any of these interpretations, we have

g(x∗ ) = λJ(x∗ )T

for some scalar λ, which is the single-constraint (k = 1) version of Theorem 12.1.

12.1.2 Newton’s method under equality constraints

In the unconstrained case (from previous lectures), Newton’s method can be summarized as:

x+ = x + ∆xnt    (12.2)

where ∆xnt = −H(x)−1 g(x) is called the Newton step. A nice interpretation of the Newton step is that it minimizes the second-order approximation of f at x:

∆xnt = arg min∆x∈Rd fˆ(x + ∆x) = arg min∆x∈Rd f(x) + g(x)T ∆x + (1/2)∆xT H(x)∆x    (12.3)

Setting the derivative of fˆ with respect to ∆x to zero, we obtain ∆xnt = −H(x)−1 g(x).


In the equality-constrained case, the Newton step needs to be calculated differently. In particular, Equation 12.3 needs to incorporate the (linearized) constraint; that is, ∆xnt equals

arg min∆x∈Rd f(x) + g(x)T ∆x + (1/2)∆xT H(x)∆x
s.t. h(x) + J(x)∆x = 0.    (12.4)

To solve for ∆xnt , we claim that it suffices to solve the linear system

[ H(x)   J(x)T ] [ ∆xnt ]   [ −g(x) ]
[ J(x)     0   ] [   λ  ] = [ −h(x) ].    (12.5)

To explain why, let’s expand the matrix form into two equations:

H(x)∆xnt + J(x)T λ = −g(x) (12.6)

J(x)∆xnt = −h(x). (12.7)

Now we can see that Equation 12.6 comes directly from Theorem 12.1, whereas Equation 12.7 comes from Newton’s update for solving h(x) = 0: linearizing the constraint, h(x + ∆xnt ) ≈ h(x) + J(x)∆xnt = 0, which gives J(x)∆xnt = −h(x).
The good thing about Equation 12.5 is that it enables us to compute ∆xnt much more easily than solving Equation 12.4 directly. Assuming that f is strictly convex, H has full rank; assuming independent constraints, J has full row rank. Thus the matrix on the left-hand side is invertible, and the Newton step could in principle be computed by direct matrix inversion and plugged into Equation 12.2 for the update rule. In cases where explicitly inverting the matrix is unstable or wasteful (e.g., when H is large and sparse), we can instead use Gaussian elimination to solve the linear system without forming an inverse. The later bundle adjustment problem is one such example.
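As a concrete illustration, the block system in Equation 12.5 can be solved with one call to a linear solver. The sketch below uses numpy; the toy quadratic objective and the single linear constraint are our own choices for illustration, not from the lecture.

```python
import numpy as np

def newton_step_eq(H, g, J, h):
    """Solve the KKT system (12.5) for the constrained Newton step:

        [ H  J^T ] [ dx ]   [ -g ]
        [ J   0  ] [ la ] = [ -h ]
    """
    d, k = H.shape[0], J.shape[0]
    KKT = np.block([[H, J.T], [J, np.zeros((k, k))]])
    rhs = np.concatenate([-g, -h])
    sol = np.linalg.solve(KKT, rhs)   # Gaussian elimination, no explicit inverse
    return sol[:d], sol[d:]           # Newton step and multiplier estimate

# Toy problem: minimize (1/2) x^T x  subject to  x1 + x2 = 1, starting at x = 0.
H = np.eye(2)
x = np.zeros(2)
g = H @ x                        # gradient of (1/2) x^T x at x
J = np.array([[1.0, 1.0]])       # Jacobian of h(x) = x1 + x2 - 1
h = np.array([J[0] @ x - 1.0])   # constraint residual at x
dx, lam = newton_step_eq(H, g, J, h)
# For a quadratic objective, one step lands on the solution x = (0.5, 0.5).
```

Because the objective here is exactly quadratic, the quadratic model in Equation 12.4 is exact and a single Newton step reaches the constrained minimizer.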

12.2 Examples

12.2.1 Newton’s step in linear equality constraints

When h(x) = Ax with A ∈ Rk×d , and assuming the current iterate is feasible (h(x) = 0), the linear system in Equation 12.5 becomes:

[ H(x)  AT ] [ ∆x ]   [ −g ]
[  A    0  ] [  λ ] = [  0 ].

Suppose that H(x) is invertible (e.g., f(x) is strictly convex) and A has full row rank; then the linear system is solvable. Hence

[ ∆x ]   [ H(x)  AT ]−1 [ −g ]
[  λ ] = [  A    0  ]   [  0 ].

12.2.2 Exponential Family

We’ll derive the MLE for an exponential family using Newton’s method. Let {xi }N
i=1 be i.i.d. samples with a density in an exponential family, i.e.,

p(xi |θ) = exp(θT xi − A(θ))

where

A(θ) = ln ∫X exp(θT x) dx.

To estimate θ̂mle , we define the negative log-likelihood

L(θ) = − ln ∏N
i=1 p(xi |θ) = − Σi (θT xi − A(θ)).

Let’s calculate its gradient, Hessian matrix, and the Newton step. Writing x̄ = (1/N) Σi xi ,

dL(θ) = −N · d( (1/N) Σi θT xi − A(θ) )
      = −N ( (1/N) Σi xiT − A′(θ) ) dθ
      = N ( −x̄T + A′(θ) ) dθ
⇒ ∇L(θ) = N (∇A(θ) − x̄)
⇒ ∇2 L(θ) = N ∇2 A(θ).

Now we further work out ∇A(θ) and ∇2 A(θ). Starting from A(θ) = ln ∫X exp(θT x) dx,

A′(θ) = ( 1 / ∫X exp(θT x) dx ) ∫X x · exp(θT x) dx
      = ∫X x p(x|θ) dx
      = E[x|θ]

A″(θ) = ( ∫X x · exp(θT x − A(θ)) dx )′
      = ∫X x · exp(θT x − A(θ)) (xT − A′(θ)) dx
      = ∫X p(x|θ) (x xT − x · E[x|θ]T ) dx
      = E[x xT |θ] − E[x|θ] E[x|θ]T
      = var[x|θ].

Therefore the Newton step is

∆θ = −H −1 g = var[x|θ]−1 (x̄ − E[x|θ]).
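To make the update concrete, here is a minimal sketch for a one-parameter case. We assume a Poisson family written in natural form, so A(θ) = e^θ, E[x|θ] = A′(θ) = e^θ, and var[x|θ] = A″(θ) = e^θ; the function name and sample mean are our own illustrative choices.

```python
import numpy as np

# One-parameter Poisson family in natural form: p(x|theta) ∝ exp(theta*x - A(theta)),
# with A(theta) = exp(theta), so A'(theta) = E[x|theta] and A''(theta) = var[x|theta].
def poisson_mle_newton(xbar, theta=0.0, iters=20):
    for _ in range(iters):
        mean = np.exp(theta)   # E[x|theta] = A'(theta)
        var = np.exp(theta)    # var[x|theta] = A''(theta)
        theta = theta + (xbar - mean) / var   # Newton step var^{-1} (xbar - E[x|theta])
    return theta

xbar = 3.0
theta_hat = poisson_mle_newton(xbar)
# The MLE satisfies the moment condition A'(theta) = xbar, i.e. theta = log(xbar).
```

The fixed point of the update is exactly the moment-matching condition ∇A(θ) = x̄ derived above.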

12.2.3 Bundle adjustment for autonomous robots

In this example, we visit a case where the Hessian is sparse, so that forming its inverse explicitly (which is generally dense) is wasteful and numerically unreliable. On page 8 of the slides, we consider a bundle adjustment problem as follows. Suppose we have a robot wandering in a room in which K landmarks are present. At each time step t, 3 types of (noisy) sensor readings are available to the robot:

• vt : forward and sideway displacements of the robot, 2-dimensional.

• wt : angular displacement of the robot, 1-dimensional.

• dt (k): distance from the robot to the k-th landmark, 1-dimensional.

The problem for the robot is therefore to infer the following 3 sets of variables from the above noisy sensors:

• xt : the true location of the robot at each time step t, 2-dimensional.

• θt : the true orientation of the robot at each time step, 1-dimensional.

• yk : the true location of the k-th landmark, 2-dimensional.

where θt is further encoded using ut = (cos θt , sin θt ) for convenience. Consider the following assumptions:

• vt = R(ut )(xt+1 − xt ) + εv,t .

• wt = θt+1 − θt + εw,t .

• dt (k) = R(ut )(yk − xt ) + εk,t .



where R(ut ) is the rotation matrix

R(ut ) = [  cos θt   sin θt ]
         [ −sin θt   cos θt ]    (12.8)

and the ε’s are the noise resulting from the sensors. The optimization on page 9 of the slides can then be interpreted as finding the true robot locations/orientations and the true landmark locations by minimizing the sum of squared noise.
We now proceed by observing that the Hessian of the objective is sparse. Note that in the objective, variables do not interact with one another across non-adjacent timesteps. That is, for any two variables more than 1 timestep apart, the corresponding second-derivative entry of the Hessian is zero. The Hessian is therefore extremely sparse, and explicitly inverting it would be both expensive and unstable. This is a good case where we want to calculate the Newton step by solving a linear system instead of inverting the Hessian.

12.3 Convergence results

Assuming a strictly convex f whose Hessian is Lipschitz continuous, the convergence of Newton’s method is “quadratic” in the number of iterations. That is,

• Given a target error ε, the number of iterations needed to achieve it is k = O(ln ln(1/ε)).

• Given k iterations, the error shrinks as quickly as ε = O(e^(−2^k)).

More specifically, the convergence of Newton’s method consists of two phases. During the “damped phase”, the optimization takes O(1) iterations; during the “quadratic phase”, the optimization takes O(ln ln(1/ε)) iterations, which in practice is essentially at most 6. More details about the convergence analysis of Newton’s method can be found in Boyd’s book on page 488.

As impressive as this quick convergence is, it is important to keep in mind that the convergence result is in terms of the number of iterations. Although few iterations are needed for convergence, each iteration needs to form the Hessian and solve a linear system with it, which typically costs O(n3) where n is the dimensionality. For high-dimensional problems, even a single iteration can be prohibitively expensive. Therefore, first-order methods are still attractive in many settings.
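The quadratic phase is easy to see on a toy 1-D problem. The sketch below is our own example: it minimizes f(x) = e^x − 2x, whose minimizer is x∗ = ln 2, and tracks the error after each pure Newton step.

```python
import math

# Minimize f(x) = exp(x) - 2x; f'(x) = exp(x) - 2, f''(x) = exp(x), x* = ln 2.
x, errors = 0.0, []
for _ in range(6):
    x = x - (math.exp(x) - 2.0) / math.exp(x)   # pure Newton step: x - f'(x)/f''(x)
    errors.append(abs(x - math.log(2.0)))
# Once near x*, each error is roughly the square of the previous one,
# so machine precision is reached in a handful of iterations.
```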

12.4 Variants

12.4.1 Trust region

Although Newton’s method has a fast convergence rate, it may also diverge quickly when the Hessian matrix is ill-conditioned. One way to avoid “bad” steps is to introduce a trust region. For example, we may instead solve

(H(x) + tI)∆x = −g(x)

to find the unconstrained Newton step. Here I is the identity matrix. The intuition is that when t is large, the method reduces to (a scaled) gradient descent, since the update step is almost −(1/t)g, i.e., a short step along −g. On the contrary, if t is small, we recover Newton’s method. So the strategy is to use a large t initially, moving along the gradient, and a small t later on to speed up convergence. Another example of a trust-region-style modification is

(H(x) + t(I ◦ H))∆x = −g(x).

The advantage of this version is that it makes the algorithm invariant to scaling of the variables.
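A minimal sketch of the first (Levenberg-style) variant, with a hypothetical ill-conditioned Hessian of our own choosing:

```python
import numpy as np

def damped_newton_step(H, g, t):
    """Solve (H + t*I) dx = -g.

    t = 0 gives the pure Newton step; large t gives dx ≈ -(1/t) g,
    a short step in the gradient-descent direction.
    """
    d = H.shape[0]
    return np.linalg.solve(H + t * np.eye(d), -g)

H = np.array([[1.0, 0.0], [0.0, 100.0]])   # ill-conditioned toy Hessian
g = np.array([1.0, 1.0])
pure = damped_newton_step(H, g, 0.0)       # Newton step (-1, -0.01)
damped = damped_newton_step(H, g, 1e3)     # nearly -(1/t) g
```

Note how the pure step is heavily distorted along the stiff coordinate, while the damped step keeps the gradient direction at a much smaller length.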
12-6 Lecture 12: October 4

12.4.2 Quasi-Newton

Quasi-Newton methods use only gradient information to successively estimate the Hessian matrix. For example, finite differences of gradients evaluated at nearby points may be used to approximate the Hessian. An example of a quasi-Newton method is L-BFGS, which can often obtain a “good enough” estimate from only a small number of recent gradients.
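In practice one rarely implements L-BFGS by hand. For illustration, the sketch below calls the L-BFGS-B implementation in scipy.optimize on the Rosenbrock function (using the rosen/rosen_der helpers that scipy provides), supplying only the objective and its gradient:

```python
import numpy as np
from scipy.optimize import minimize, rosen, rosen_der

# L-BFGS-B builds a curvature estimate from a short history of gradient
# differences, so no Hessian is ever formed or inverted.
x0 = np.zeros(5)
res = minimize(rosen, x0, jac=rosen_der, method="L-BFGS-B")
# res.x approaches the minimizer (1, 1, 1, 1, 1).
```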

12.4.3 Gauss-Newton

The Gauss-Newton method handles the nonlinear least-squares case f(θ) = ΣN
i=1 (1/2)(yi − f(xi , θ))2, where the minimization is with respect to θ. In this case we approximate f(xi , θ + ∆θ) ≈ f(xi , θ) + ∇f(xi , θ)T ∆θ, so that

ΣN
i=1 (1/2)(yi − f(xi , θ + ∆θ))2 ≈ ΣN
i=1 (1/2)(ri (θ) − ∇f(xi , θ)T ∆θ)2
                                = Σi (1/2) ri (θ)2 + P T ∆θ + (1/2) ∆θT Q∆θ,

where ri (θ) = yi − f(xi , θ), P = − ΣN
i=1 (yi − f(xi , θ))∇f(xi , θ), and Q = ΣN
i=1 ∇f(xi , θ)∇f(xi , θ)T. Note that Q is the Gauss-Newton approximation to the Hessian: the term involving second derivatives of f(xi , ·) is dropped. Minimizing this quadratic model gives the unconstrained Gauss-Newton step

∆θ = ( ΣN
i=1 ∇f(xi , θ)∇f(xi , θ)T )−1 ( ΣN
i=1 (yi − f(xi , θ))∇f(xi , θ) ).
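A minimal sketch for a scalar parameter, using a hypothetical model m(x, θ) = exp(θx) of our own choosing so that ∇m can be written by hand:

```python
import numpy as np

# Gauss-Newton for sum_i (1/2)(y_i - m(x_i, theta))^2 with model m(x, theta) = exp(theta*x).
xs = np.array([0.0, 0.5, 1.0, 1.5])
ys = np.exp(0.7 * xs)          # noiseless data generated with theta = 0.7

theta = 0.0
for _ in range(20):
    m = np.exp(theta * xs)     # model values at the current theta
    grad_m = xs * m            # dm/dtheta for each sample
    r = ys - m                 # residuals r_i(theta)
    # Step Q^{-1} * sum_i r_i * grad_i; Q = sum_i grad_i^2 (scalars here)
    theta = theta + (grad_m @ r) / (grad_m @ grad_m)
# On this noiseless problem theta converges to the generating value 0.7.
```

Because the residuals vanish at the solution, the dropped second-derivative term is zero there and Gauss-Newton converges as fast as full Newton on this example.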

