Chapter 2 Optimization and Solving Nonlinear Equations
This chapter deals with an important problem in mathematics and statistics: finding values of x that satisfy
f(x) = 0. Such values are called the roots of the equation and are also known as the zeros of f(x).
A question that should be raised is the following: Is there a (real) root of f (x) = 0? One answer is
provided by the intermediate value theorem.
Intermediate value theorem. If f (x) is continuous on an interval [a, b], and f (a) and f (b) have opposite signs,
i.e., f (a)f (b) < 0, then there exists a point ξ ∈ (a, b) such that f (ξ) = 0.
The intermediate value theorem guarantees that a root exists under those conditions. However, it does
not tell us the precise value of the root ξ.
The bisection method assumes that we know two values a and b such that f(a)f(b) < 0, and
works by repeatedly narrowing the interval between a and b until it closes in on a root.
As an example, consider the function f(x) = 5x^5 − 4x^4 + 3x^3 − 2x^2 + x − 1. We first define it in R and look at its graph.
> f=function(x){5*x^5-4*x^4+3*x^3-2*x^2+x-1}
> x=seq(-50, 50, length=500)
> plot(x, f(x))
Next we use the bisection method to find the zero between 0 and 1.
> f(0)
[1] -1
> f(1)
[1] 2
f=function(x){5*x^5-4*x^4+3*x^3-2*x^2+x-1}
bisection=function(a,b,n){              # n bisection steps on the bracketing interval [a, b]
xa=a
xb=b
for(i in 1:n){
if(f(xa)*f((xa+xb)/2)<0) xb=(xa+xb)/2   # sign change in the left half: move the right end
else xa=(xa+xb)/2                       # otherwise: move the left end
}
list(left=xa,right=xb, midpoint=(xa+xb)/2)
}
> bisection(0,1,15)
$left
[1] 0.7897034
$right
[1] 0.7897339
$midpoint
[1] 0.7897186
> gd=function(x){(1+x)/x-log(x)}
> x=seq(1, 6, length=50)
> plot(x, gd(x)) # It seems that c is between 3 and 4
> gd(3)
[1] 0.2347210
> gd(6)
[1] -0.6250928
> bisection=function(a,b,n){            # same routine as before, now bracketing a root of gd
xa=a
xb=b
for(i in 1:n){
if(gd(xa)*gd((xa+xb)/2)<0) xb=(xa+xb)/2
else xa=(xa+xb)/2
}
list(left=xa,right=xb, midpoint=(xa+xb)/2)
}
> bisection(3,6,30)
$left
[1] 3.591121
$right
[1] 3.591121
$midpoint
[1] 3.591121
(1) Generate 50 random numbers from a Cauchy distribution with location parameter θ = 1.
data = rcauchy(50, 1)
[Figure: plots of the log-likelihood l(data, θ) and its derivative ld(data, θ) against θ.]
(2) Treat the data you get from step (1) as sample observations from a Cauchy distribution with an
unknown θ. Plot the log-likelihood function of θ,
$$l(\theta) = -n\ln\pi - \sum_{i=1}^{n}\ln\{1 + (x_i - \theta)^2\}, \qquad \theta \in \mathbb{R}.$$
l=function(x,t){                     # log-likelihood of a Cauchy(t) sample x, evaluated at t
s=0
n=length(x)
for(j in 1:n) s=s + log(1+(x[j]-t)^2)
l=-n*log(pi)-s
l
}
theta=seq(-50, 50,length=500)
plot(theta, l(data,theta), type="l",main="Log-likelihood function",
xlab=expression(theta), ylab=expression(l(data, theta)))
(3) The plot of the log-likelihood function suggests roughly where its maximum lies. Use the
bisection method to find the maximum likelihood estimator of θ. The maximizer solves l'(θ) = 0, where
$$l'(\theta) = \sum_{i=1}^{n} \frac{2(x_i - \theta)}{1 + (x_i - \theta)^2},$$
so we apply the bisection method to this derivative.
ld=function(x,t){                    # score function l'(t) for a Cauchy(t) sample x
s=0
n=length(x)
for(j in 1:n) s=s + 2*(x[j]-t)/(1+(x[j]-t)^2)
s
}
theta=seq(-10, 10,length=500)
plot(theta, ld(data,theta), type="l",main="Derivative",
xlab=expression(theta), ylab=expression(ld(data, theta)))
gd=function(t){ld(data, t)}          # make gd the score function, so the gd-based bisection above applies
bisection(-10,10,30)
$left
[1] 0.9758892
$right
[1] 0.9758892
$midpoint
[1] 0.9758892
Hence, the maximum likelihood estimate is θ̂ = 0.9758892.
The secant method begins by finding two points on the curve of f(x), (x0, f(x0)) and (x1, f(x1)), hopefully
near the root r we seek. The straight line that passes through these two points is
$$\frac{y - f(x_0)}{f(x_1) - f(x_0)} = \frac{x - x_0}{x_1 - x_0}.$$
If x2 denotes the point where this line crosses the x-axis, so that (x2, 0) lies on the line, then
$$\frac{0 - f(x_0)}{f(x_1) - f(x_0)} = \frac{x_2 - x_0}{x_1 - x_0}.$$
Solving for x2, and repeating the construction with the two most recent points, gives the secant iteration
$$x_{n+1} = x_n - f(x_n)\,\frac{x_n - x_{n-1}}{f(x_n) - f(x_{n-1})}, \qquad n = 1, 2, \ldots$$
Under the assumptions that the sequence {xn, n = 1, 2, ...} converges to r, f(x) is differentiable near r,
and f'(r) ≠ 0, taking the limit on both sides and noting that (xn − xn−1)/(f(xn) − f(xn−1)) → 1/f'(r), we obtain
$$r = r - \frac{f(r)}{f'(r)},$$
which gives f(r) = 0.
Example. Find the zeros of f (x) = x3 − 2x2 + x − 1 using the secant method.
f=function(x){x^3-2*x^2+x-1}
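The definition of the secant function is not shown on this page. A minimal sketch of a secant iteration along the lines suggested by the call below (start from a and b, apply the secant update n times, and return the last two iterates) is:
# Sketch only: the original secant function is not reproduced in this extract.
secant=function(a,b,n){
xa=a
xb=b
for(i in 1:n){
xc=xb-f(xb)*(xb-xa)/(f(xb)-f(xa))    # secant update using the two latest iterates
xa=xb
xb=xc
}
list("x(n)"=xa,"x(n+1)"=xb)
}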
> secant(0,5,12)
$"x(n)"
[1] 1.754878
$"x(n+1)"
[1] 1.754878
$"x(n+1)"
[1] 1.754878
$"x(n+1)"
[1] NaN
The above code breaks down for large enough values of n (it returns NaN once the iterates coincide and
the update becomes 0/0). The following improved function, h, fixes the problem: its if statement breaks
out of the loop once the values of xa and xb are equal.
g=function(x,y){y-(f(y)/(f(x)-f(y)))*(x-y)}   # one secant update from the points x and y
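The definition of h itself is not shown on this page. A sketch consistent with the description above, using g for the update and stopping once the two iterates are equal, is:
# Sketch only: the original h is not reproduced in this extract.
h=function(a,b,n){
xa=a
xb=b
for(i in 1:n){
if(xa==xb) break                     # iterates have met: stop before dividing 0 by 0
xc=g(xa,xb)
xa=xb
xb=xc
}
list("x(n)"=xa,"x(n+1)"=xb)
}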
> h(-10,50,500)
$"x(n)"
[1] 1.754878
$"x(n+1)"
[1] 1.754878
Newton’s method or the Newton-Raphson method is a procedure or algorithm for approximating the zeros
of a function f (or, equivalently, the roots of an equation f (x) = 0). It consists of the following three steps:
Step 1. Make a reasonable initial guess as to the location of a solution, denoted by x0.
Step 2. Calculate
$$x_1 = x_0 - \frac{f(x_0)}{f'(x_0)}.$$
Step 3. Repeat the calculation with the latest iterate in place of x0, so that in general xn is obtained from xn−1.
Under the assumptions that the sequence x0, x1, ..., xn, ... converges to r, and that f(x) is differentiable
near r with f'(r) ≠ 0, by taking the limit on both sides of
$$x_n = x_{n-1} - \frac{f(x_{n-1})}{f'(x_{n-1})},$$
we obtain
$$r = r - \frac{f(r)}{f'(r)},$$
which results in f(r) = 0.
This method requires that the first approximation is sufficiently close to the root r.
A comparison between the secant method and Newton's method: the secant method is obtained from
Newton's method by approximating the derivative of f(x) at xn using the two points xn and xn−1,
$$f'(x_n) \approx \frac{f(x_n) - f(x_{n-1})}{x_n - x_{n-1}}.$$
Geometrically, Newton’s method uses the tangent line and the secant method approximates the tangent line
by a secant line.
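The nr function whose output appears below is not reproduced on this page, and the equation it is solving is not specified here. A generic sketch of a Newton–Raphson iteration in R (the name newton and its arguments are illustrative, not the text's nr) is:
# Sketch only: fp is the derivative of f, x0 the starting point, tol the
# stopping tolerance, and nmax a cap on the number of iterations.
newton=function(f, fp, x0, tol=1e-8, nmax=100){
x=x0
for(i in 1:nmax){
xnew=x-f(x)/fp(x)                    # Newton update
if(abs(xnew-x)<tol) break            # stop when successive iterates agree
x=xnew
}
list(x1=xnew, check=f(xnew))         # root estimate and the value of f there
}
# Illustrative call for the cubic of the previous example:
# newton(function(x) x^3-2*x^2+x-1, function(x) 3*x^2-4*x+1, x0=2)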
> nr(20,0,20,0.3)
$x1
[1] -0.0004985045
$check
[1] 2.174953e-16
Example (Weibull distribution). Consider the Weibull distribution with distribution function
$$F(x) = \begin{cases} 1 - \exp\{-(\beta x)^{\lambda}\}, & \text{if } x \ge 0, \\ 0, & \text{elsewhere}, \end{cases}$$
where β > 0 and λ > 0.
(1) Generate 50 random numbers from a Weibull distribution with β = 1 and λ = 1.8.
(2) Add three more numbers to the above group. Treat these 53 observations as your data from a Weibull
distribution with an unknown λ, but keep β = 1 fixed. Plot the log-likelihood function of λ.
With β = 1, the likelihood and log-likelihood functions of λ are
$$L(\lambda) = \lambda^n \left(\prod_{k=1}^{n} x_k\right)^{\lambda - 1} \exp\left(-\sum_{k=1}^{n} x_k^{\lambda}\right),$$
and
$$l(\lambda) = n \ln \lambda + (\lambda - 1)\sum_{k=1}^{n} \ln x_k - \sum_{k=1}^{n} x_k^{\lambda},$$
respectively.
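The code below assumes the data are stored in a vector mydata. A sketch of how it might be generated for steps (1) and (2) (in R's rweibull parameterization the scale is 1/β = 1; the three appended values are placeholders, since the text does not specify them):
# Sketch only: the seed and the three appended values are illustrative.
set.seed(1)
mydata = c(rweibull(50, shape=1.8, scale=1), c(2.5, 3.0, 3.5))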
> loglike=function(t){               # log-likelihood l(lambda) evaluated at lambda = t
x=mydata
s=0
for(i in 1:length(x)) s=s-x[i]^t+(t-1)*log(x[i])
loglike=53*log(t)+s                  # n = 53 observations
loglike
}
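The plotting call itself is not shown; a sketch in the style of the earlier Cauchy example (the range of λ values is illustrative) is:
# Sketch only: plot l(lambda) over an illustrative range.
lambda=seq(0.2, 4, length=500)
plot(lambda, loglike(lambda), type="l", main="Log-likelihood function",
xlab=expression(lambda), ylab=expression(l(lambda)))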
It can be seen from the plot of the log-likelihood function that l(λ) is concave.
To find the maximum likelihood estimator of λ, we need to solve the equation l'(λ) = 0 for stationary
points. The first and second derivatives of l(λ) are
$$l'(\lambda) = \frac{n}{\lambda} + \sum_{k=1}^{n} \ln x_k - \sum_{k=1}^{n} x_k^{\lambda} \ln x_k,$$
and
$$l''(\lambda) = -\frac{n}{\lambda^2} - \sum_{k=1}^{n} x_k^{\lambda} (\ln x_k)^2.$$
Since l''(λ) < 0 for every λ > 0, l(λ) is strictly concave, and the stationary point is the unique maximum of l(λ).
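The Newton iteration used to solve l'(λ) = 0 is not reproduced on this page. A sketch based on the derivatives above (the function names ld1 and ld2 and the starting value are illustrative) is:
# Sketch only: Newton-Raphson for l'(lambda) = 0 using the derivatives above.
ld1=function(t){ x=mydata; length(x)/t + sum(log(x)) - sum(x^t*log(x)) }    # l'(t)
ld2=function(t){ x=mydata; -length(x)/t^2 - sum(x^t*(log(x))^2) }           # l''(t)
lambda=1                             # illustrative starting value
for(i in 1:20) lambda=lambda-ld1(lambda)/ld2(lambda)
lambda                               # maximum likelihood estimate of lambda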
Suppose that we can bring an equation f(x) = 0 into the form x = g(x), which usually can be done in several
ways. Whenever r = g(r), r is said to be a fixed point of the function g(x).
Calculate
$$x_1 = g(x_0), \quad x_2 = g(x_1), \quad \ldots, \quad x_{n+1} = g(x_n), \qquad n = 0, 1, 2, \ldots$$
Let us take a look at x = (x^2 + 2)/3, one way of rewriting the equation x^2 − 3x + 2 = 0, whose roots are 1 and 2.
> g=function(x){(x^2+2)/3}
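The fixed function called below is not defined in the extract shown. A minimal sketch (apply g to the starting value x0 a total of n times and return the last iterate) is:
# Sketch only: the original fixed function is not reproduced in this extract.
fixed=function(x0,n){
x=x0
for(i in 1:n) x=g(x)
x
}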
> fixed(0.1, 20)
[1] 0.9999037 # It’s close to 1, one of the roots.
> fixed(3, 20)
[1] Inf # A problem with the initial point?
> fixed(-4, 20)
[1] Inf
Theorem. If |g'(x)| ≤ k < 1 on an interval (a, b), and the sequence {x0, x1, ..., xn, ...} belongs to (a, b), then
the sequence has a limit r, and r is the only root of x = g(x) in the interval (a, b).
Proof. By the mean value theorem, |xn+1 − xn| = |g(xn) − g(xn−1)| ≤ k|xn − xn−1|, so |xn+1 − xn| ≤ k^n |x1 − x0|,
and for m > n, |xm − xn| ≤ (k^n/(1 − k))|x1 − x0| → 0 as n → ∞. Thus, by Cauchy's criterion, the sequence
{xn, n = 0, 1, 2, ...} converges. Say the limit is r. By taking the limit on both sides of the equation
$$x_{n+1} = g(x_n),$$
we obtain lim xn+1 = lim g(xn) as n → ∞, or
$$r = g(r),$$
which means that r is a root of the equation x = g(x). To see that r is the only root in (a, b), suppose that
s ∈ (a, b) were another root. By the mean value theorem, r − s = g(r) − g(s) = g'(c)(r − s) for some c between r and s.
Then g'(c) = 1, and this gives a contradiction. □
Notice that Newton’s method is a special case of the fixed-point iteration, with
$$g(x) = x - \frac{f(x)}{f'(x)},$$
and
$$g'(x) = 1 - \frac{\{f'(x)\}^2 - f(x)f''(x)}{\{f'(x)\}^2} = \frac{f(x)f''(x)}{\{f'(x)\}^2}.$$
Applying the above theorem to this particular case, we obtain the following.
Corollary. Assume that the function f(x) is continuous on the interval [a, b] and twice differentiable in
(a, b), with
$$\left|\frac{f(x)f''(x)}{\{f'(x)\}^2}\right| \le k < 1, \qquad x \in (a, b).$$
If the sequence {x0, x1, x2, ...} is generated by Newton's method,
$$x_{n+1} = x_n - \frac{f(x_n)}{f'(x_n)}, \qquad n = 0, 1, 2, \ldots,$$
with xn ∈ (a, b) for n = 0, 1, 2, ..., then the sequence has a limit r, and r is the only root of f(x) = 0 in the
interval [a, b].
This corollary indicates that the choice of initial point x0 is very important for Newton's method. A good
starting point x0 should satisfy
$$\left|\frac{f(x_0)f''(x_0)}{\{f'(x_0)\}^2}\right| \le k < 1.$$
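As an illustration (this check is not from the text), the condition can be evaluated in R at a candidate starting point for the cubic f(x) = x^3 − 2x^2 + x − 1 used earlier:
# Illustrative check of the corollary's condition at x0 = 2.
f  =function(x) x^3-2*x^2+x-1
fp =function(x) 3*x^2-4*x+1          # f'(x)
fpp=function(x) 6*x-4                # f''(x)
x0=2
abs(f(x0)*fpp(x0)/fp(x0)^2)          # 0.32 < 1, so x0 = 2 is a reasonable start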
Consider a fixed-point iteration for solving the equation x = g(x) with the procedure
$$x_{n+1} = g(x_n), \qquad n = 0, 1, 2, \ldots$$
Let r be the root of the equation, and define the nth step error by
$$e_n = r - x_n, \qquad n = 0, 1, 2, \ldots$$
Then
$$e_{n+1} = r - x_{n+1} = g(r) - g(x_n) = g'(c_n)(r - x_n) = g'(c_n)\,e_n,$$
where the third equality follows from the mean value theorem, with cn lying between xn and r.
This means the error at the (n + 1)th step is linearly related to the error at the nth step.
For Newton’s method, it can be shown that the error at the (n + 1)th step is quadratically related to
the error at the nth step.
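A short sketch of why this holds (a standard Taylor-expansion argument, not reproduced from the text): expanding f about xn and evaluating at the root r,
$$0 = f(r) = f(x_n) + f'(x_n)(r - x_n) + \tfrac{1}{2} f''(\xi_n)(r - x_n)^2$$
for some ξn between xn and r, and substituting this into xn+1 = xn − f(xn)/f'(xn) gives
$$e_{n+1} = r - x_{n+1} = -\frac{f''(\xi_n)}{2 f'(x_n)}\, e_n^2.$$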
Newton’s method can be applied for solving a system of nonlinear equations. This is particularly useful
when we try to find maximum likelihood estimators of several parameters.
Let F(x) be a vector-valued function of a vector argument x, assuming that both vectors contain m
components. To apply Newton's method to the problem of approximating a solution of
$$F(x) = 0,$$
we mimic the scalar case and write the iteration
$$x_{n+1} = x_n - \frac{F(x_n)}{F'(x_n)}, \qquad n = 0, 1, 2, \ldots$$
Two questions arise in this procedure immediately: first, what is meant by F'(xn)? And second, what
is meant by the division F(xn)/F'(xn)?
The answer to the first question is that F'(x) denotes the m × m matrix of partial derivatives whose (i, j)
entry is ∂Fi(x)/∂xj. This matrix is known as the Jacobian matrix for the system and is typically denoted by J(x).
For the division of two matrices, we use multiplication by an inverse. Thus, Newton's method takes the
form
$$x_{n+1} = x_n - (J(x_n))^{-1} F(x_n), \qquad n = 0, 1, 2, \ldots$$
When implementing this scheme, rather than actually computing the inverse of the Jacobian matrix, we
define
$$v_n = -(J(x_n))^{-1} F(x_n),$$
and then solve the linear system of equations
$$J(x_n)\, v_n = -F(x_n)$$
for vn. Once vn is known, the next iterate is computed according to the rule
$$x_{n+1} = x_n + v_n, \qquad n = 0, 1, 2, \ldots$$
Example 1. Use Newton's method to solve the system
$$x_1^3 - 2x_2 + 1 = 0,$$
$$x_1 + 2x_2^3 - 3 = 0.$$
Here
$$x = \begin{pmatrix} x_1 \\ x_2 \end{pmatrix}, \qquad F(x) = \begin{pmatrix} x_1^3 - 2x_2 + 1 \\ x_1 + 2x_2^3 - 3 \end{pmatrix},$$
and the Jacobian matrix is
$$J(x) = \begin{pmatrix} 3x_1^2 & -2 \\ 1 & 6x_2^2 \end{pmatrix}.$$
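The R code for this example is not reproduced here. A minimal sketch of the iteration (the starting point is illustrative) is:
# Sketch only: Newton's method for the 2x2 system above.
F=function(x) c(x[1]^3-2*x[2]+1, x[1]+2*x[2]^3-3)
J=function(x) matrix(c(3*x[1]^2, 1, -2, 6*x[2]^2), ncol=2)   # Jacobian, filled by column
x=c(2, 2)                            # illustrative starting point
for(i in 1:20){
v=solve(J(x), -F(x))                 # solve J(x) v = -F(x) for the Newton step
x=x+v
}
x                                    # approximate solution (x1 = x2 = 1 solves the system)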
Sometimes we may need to check whether the Jacobian matrix is invertible before solving for the Newton
step. For this purpose the code can be improved by checking the determinant of J before solving, as is done
in the NNL function of Example 2 below.
Example 2 (Logistic regression model). Let Y denote a binary response variable. The regression model is
$$E(Y) = \pi(x) = \frac{\exp(\beta_0 + \beta_1 x)}{1 + \exp(\beta_0 + \beta_1 x)}.$$
For independent observations (xi, Yi), i = 1, ..., n,
$$E(Y_i) = \pi_i = \frac{\exp(\beta_0 + \beta_1 x_i)}{1 + \exp(\beta_0 + \beta_1 x_i)}, \qquad i = 1, \ldots, n,$$
and the log-likelihood function is
$$\ell(\beta_0, \beta_1) = \sum_{i=1}^{n} \left[ y_i(\beta_0 + \beta_1 x_i) - \ln\{1 + \exp(\beta_0 + \beta_1 x_i)\} \right].$$
However, no closed-form solution exists for the values of β0 and β1 that maximize the log-likelihood function
ℓ(β0, β1), so we need to maximize ℓ(β0, β1) numerically.
A data set from Kutner et al. (2005), Applied Linear Statistical Models, page 566 (x = months of experience,
y = task success):
x=c(14,29,6,25,18,4,18,12,22,6,30,11,30,5,20,13,9,32,24,13,19,4,28,22,8) # months
y=c(0,0,0,1,1,0,0,0,1,0,1,0,1,0,1,0,0,1,0,1,0,0,1,1,1) # success
We start by defining the partial derivatives of ℓ(β0, β1), which are our target functions:
$$\frac{\partial \ell}{\partial \beta_0} = \sum_{i=1}^{n} (y_i - \pi_i), \qquad \frac{\partial \ell}{\partial \beta_1} = \sum_{i=1}^{n} x_i (y_i - \pi_i).$$
F1=function(b){
F1=0
for(i in 1:length(x)) F1=F1+y[i]-exp(b[1]+b[2]*x[i])/(1+exp(b[1]+b[2]*x[i]))
F1
}
F2=function(b){
F2=0
for(i in 1:length(x)) F2=F2+x[i]*y[i]-x[i]*exp(b[1]+b[2]*x[i])/(1+exp(b[1]+b[2]*x[i]))
F2
}
F=function(b){
F=matrix(0,nrow=2)
F[1]=F1(b)
F[2]=F2(b)
F
}
Alternatively, the two sums can be accumulated in a single loop inside one function:
F=function(b){
F=matrix(0,nrow=2)
s1=0
s2=0
for(i in 1:length(x)){
s1 = s1 +y[i]-((exp(b[1]+b[2]*x[i]))*(1+exp(b[1]+b[2]*x[i]))^(-1))
s2 = s2 +x[i]*y[i]-(x[i]*(exp(b[1]+b[2]*x[i]))*(1+exp(b[1]+b[2]*x[i]))^(-1))}
F[1]=s1
F[2]=s2
F}
The Jacobian matrix of (F1, F2), i.e., the matrix of second-order partial derivatives of ℓ(β0, β1), is computed by:
J=function(b){
j=matrix(0,ncol=2,nrow=2) # The format of J is 2 by 2
s11=0
s12=0
s22=0
for(i in 1:length(x)){
s11 = s11-exp(b[1]+b[2]*x[i])*(1+exp(b[1]+b[2]*x[i]))^(-2)
s12 = s12 -x[i]*exp(b[1]+b[2]*x[i])*(1+exp(b[1]+b[2]*x[i]))^(-2)
s22 = s22 -(x[i]^(2))*exp(b[1]+b[2]*x[i])*(1+exp(b[1]+b[2]*x[i]))^(-2)
}
j[1,1]=s11
j[1,2]=s12
j[2,1]=s12
j[2,2]=s22
j
}
NNL = function(initial,n){
b=initial
v=matrix(0,ncol=length(b))
for (i in 1:n){
d=det(J(b))                           # check that J(b0,b1) is invertible
if(identical(all.equal(d,0),TRUE))
{cat('Jacobian has no inverse. Try a different initial point.','\n')
break}
else
v=solve(J(b),-F(b))                   # solve J(b) v = -F(b) for the Newton step
b=b+v}
cat(' b0=',b[1],'\n','b1=',b[2],'\n') # report the final estimates
}
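The function might then be invoked with an illustrative starting point, for example:
# Illustrative call; the starting point c(0, 0) and the 20 iterations are assumptions, not from the text.
NNL(c(0,0), 20)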
Thus, the maximum likelihood estimates of β0 and β1 are -3.059696 and 0.1614859, respectively.