
Slides for the MFAI (Aug-Dec 2024) Lectures

slides for lectures from Sep 25 - Nov 15, 2024

C R Subramanian

CSE Dept, SECS, Indian Institute of Technology, Bhubaneswar.


I Example Motivation :
I Given : P = {(~x_i, y_i) : ~x_i ∈ R^d, y_i ∈ R}_{i=1,...,n} ;
I Find : a function y = f(~x) = ~a · ~x, ~a ∈ R^d, minimising the
I total sum of squared errors E(~a) = Σ_{i=1}^n (y_i − ~a · ~x_i)^2.
I
I Want to find an ~a ∈ R^d which minimises E(~a).
I
I When d = 1, E(a) becomes a continuous function of one
variable a.
I The minimiser a^* and the minimum value E(a^*) can be
computed in O(n) time (see the sketch below).
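A minimal numerical sketch of the d = 1 case (the function names below are illustrative, not from the slides): setting E'(a) = 0 gives the closed form a^* = (Σ_i x_i y_i)/(Σ_i x_i^2), computable in a single O(n) pass, assuming at least one x_i is nonzero.

# Sketch: closed-form minimiser of E(a) = sum_i (y_i - a*x_i)^2 for d = 1.
def least_squares_1d(xs, ys):
    sxy = sum(x * y for x, y in zip(xs, ys))   # sum_i x_i * y_i
    sxx = sum(x * x for x in xs)               # sum_i x_i^2
    a_star = sxy / sxx                         # E'(a) = 0  =>  a* = sxy / sxx
    e_min = sum((y - a_star * x) ** 2 for x, y in zip(xs, ys))
    return a_star, e_min

# Example: points lying roughly along y = 2x.
print(least_squares_1d([1.0, 2.0, 3.0], [2.1, 3.9, 6.2]))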
Limits and Continuity

I f : O → R is a function. O is an open set. Let a ∈ O.


I limit of f (x) as x approaches a is L if
I ∀ε > 0 ∃δ > 0 such that 0 < |x − a| < δ ⇒ |f (x) − L| ≤ ε.
I Denoted by : Ltx→a f (x) = L.
I Left limit : Ltx→a− f (x) = L. (−δ < x − a < 0)
I Right limit : Ltx→a+ f (x) = L. (0 < x − a < δ).
I
I L exists if and only if left- and right- limits exist and equal L.
I Example : f (x) = [x] does not have a limit when a is an
integer ; both left- and right- limits of f exist at integers.
I f (x) = 1/x has limits everywhere except at x = 0 ; neither the
left nor the right limit exists at x = 0.
Limits and Continuity
I f is continuous at a if f (a) is defined and Ltx→a f (x) = f (a).
I x, x^2, x^3, sin(x), cos(x), e^x, |x| - continuous everywhere.
I f (x) = [x] continuous everywhere except at integers
I f (x) = x^{−1} continuous everywhere except at x = 0.
I f and g are continuous at a. Then, f + g , f − g , f · g are
continuous at a. g(a) ≠ 0 ⇒ f /g cont. at a.
I
I f is continuous at a, g is continuous at f (a) ⇒
h(x) = g (f (x)) is continuous at a.
I sin(e^{x^2}), e^{sin(x^2)} and e^{sin(x)^2} are continuous everywhere.
I
I f is cont. over [a, b] with f (a) < f (b). Then,
∀c ∈ (f (a), f (b)) ∃x ∈ (a, b) such that f (x) = c.
I f is continuous over [a, b] implies f is bounded over [a, b].
I f is continuous over [a, b] implies f achieves its min and max.
Differentiability

I f is differentiable at a if Lt_{x→a} (f(x) − f(a))/(x − a) exists.
I
I The limit is the derivative of f at a, denoted by
I f'(a), f^{(1)}(a), or df(a)/dx.
I x, x^2, x^3, e^x, sin(x), cos(x) - differentiable at every x ∈ R.
I |x| is differentiable everywhere except at x = 0.
I
I f is differentiable at a ⇒ f is continuous at a.
I Converse need not be true : |x| and x = 0, for example.
I Left-derivative : same except we focus on x < a.
I Right-derivative : same except we focus on x > a.
I For |x|, f'_L(0) = −1 and f'_R(0) = +1.
Differentiability

I Algebra :
I f and g are defined over R.
I f 0 (a) and g 0 (a) exist for a ∈ R.
I (f ± g)'(a) = f'(a) ± g'(a).
I (f · g)'(a) = f(a) · g'(a) + f'(a) · g(a).
I (f /g)'(a) = (g(a) · f'(a) − f(a) · g'(a)) / g(a)^2, provided g(a) ≠ 0.
I
I Chain Rule :
I Suppose Range(f) ⊆ Domain(g) ; f'(a), g'(f(a)) exist.
I (g(f))'(a) exists and equals g'(f(a)) · f'(a).
I Familiar version :
I y = f(x), z = g(y), z = g(f(x)) ⇒ dz/dx = (dz/dy) · (dy/dx).
Differentiability

I For x ∈ R, B(x, δ) := {y ∈ R : 0 ≤ |y − x| < δ}.


I f is twice-differentiable at a if
I (i) for some δ > 0, f'(x) exists for every x ∈ B(a, δ)
I (ii) the derivative of f'(x) (= Lt_{x→a} (f'(x) − f'(a))/(x − a)) exists at a.
I The second derivative is denoted by f''(a), f^{(2)}(a) or d^2 f(a)/dx^2.
I
I Generally, for k ≥ 1, f is k-times differentiable at a if
I (i) for some δ > 0, f^{(k−1)}(x) exists for every x ∈ B(a, δ)
I (ii) f^{(k−1)}(x) is differentiable at a.
I The k-th derivative is denoted by f^{(k)}(a) or d^k f(a)/dx^k.
Differentiability

I x, x^2, x^3, e^x, sin(x), cos(x) - k-times differentiable for every
k ≥ 1 and everywhere.
I f(x) = log_e x - f^{(k)}(x) exists for every k ≥ 1 and every x > 0.
I
I a is a local minimum / local maximum of f if
I f (a) ≤ f (x) / f (a) ≥ f (x)
I for every x ∈ B(a, δ) for some δ > 0.
I
I f : O → R, O is open.
I a ∈ O is a global minimum / global maximum of f over O if
I f (a) ≤ f (x) / f (a) ≥ f (x) for every x ∈ O.
I Every global optimum is also a local optimum.
Differentiability and optima

I If a is a local optimum for f , then f'(a) = 0.


I
I Necessary but not sufficient.
I Example : f(x) = x^3 for x < 0 and f(x) = x^2 for x ≥ 0.
I f'(0) = 0 but 0 is neither a local minimum nor a local
maximum for f.
I
I a is a saddle point if f'(a) = 0 but a is not a local optimum.
I f'(a) = 0 - a is a critical point.
Differentiability and optima

I f'(a) = 0 and f''(a) > 0 ⇒ a is a local minimum for f.
I
I sufficient but not necessary.
I Eg : f(x) = −x^3 for x ≤ 0 and f(x) = x^3 for x > 0.
I 0 is the global minimum for f. But, f'(0) = f''(0) = 0.
I
I g'(a) = 0 and g''(a) < 0 ⇒ a is a local maximum for g.
I sufficient but not necessary.
I Eg : g(x) = −f(x).
I 0 is the global maximum for g. But, g'(0) = g''(0) = 0.
I
Taylor’s Approximation Formula

I f'' exists and is continuous over (a − δ, a + δ) for some δ > 0.
I
I Taylor's first-order approximation formula :
I f(x) = f(a) + f'(a)(x − a) + E_1(x), ∀x ∈ B(a, δ),
I where E_1(x) = ∫_a^x (x − t) f''(t) dt → 0 as x → a.
I
I E_1(x) = f''(c)(x − a)^2 / 2 for some c ∈ (a, x).
I f(a + h) = f(a) + h f'(a) + o(h) as h → 0.
I f(a + h) ≈ f(a) + h f'(a) as h → 0.
I
I differentiability ⇐⇒ local linearizability.
Taylor’s Approximation Formula

I f''' exists and is continuous over (a − δ, a + δ) for some δ > 0.
I
I Taylor's second-order approximation formula :
I f(x) = f(a) + f'(a)(x − a) + f''(a)(x − a)^2/2 + E_2(x), ∀x ∈ B(a, δ),
I where E_2(x) = (1/2) · ∫_a^x (x − t)^2 f'''(t) dt → 0 as x → a.
I
I E_2(x) = f'''(c)(x − a)^3 / 6 for some c ∈ (a, x).
I f(a + h) = f(a) + h f'(a) + h^2 f''(a)/2 + o(h^2) as h → 0.
I f(a + h) ≈ f(a) + h f'(a) + h^2 f''(a)/2 as h → 0.
Taylor’s Approximation Formula

I f^{(n+1)} exists and is continuous over (a − δ, a + δ) for some δ > 0.
I
I Taylor's n-th order approximation formula :
I f(x) = Σ_{j=0}^{n} f^{(j)}(a)(x − a)^j / j! + E_n(x), ∀x ∈ B(a, δ),
I where E_n(x) = (1/n!) · ∫_a^x (x − t)^n f^{(n+1)}(t) dt → 0 as x → a.
I f^{(0)}(a) = f(a).
I
I E_n(x) = f^{(n+1)}(c)(x − a)^{n+1} / (n+1)! for some c ∈ (a, x).
I f(a + h) = Σ_{j=0}^{n} f^{(j)}(a) h^j / j! + o(h^n) as h → 0.
I f(a + h) ≈ Σ_{j=0}^{n} f^{(j)}(a) h^j / j! as h → 0.
Lectures on 16/10/2024

I Taylor’s Formula - illustrations


I e^x is infinitely differentiable over R.
I e^x = 1 + x + x^2/2 + . . . + x^n/n! + o(x^n), x → 0 (see the sketch below).
I
I log(1 + x) is infinitely differentiable for every x > −1.
I log(1 + x) = x − x^2/2 + x^3/3 + . . . + (−1)^{n−1} x^n/n + o(x^n), x → 0.
I
I cos x is infinitely differentiable for every x ∈ R.
I cos x = 1 − x^2/2! + x^4/4! − x^6/6! + . . . + (−1)^n x^{2n}/(2n)! + o(x^{2n}), x → 0.
I
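A minimal sketch (illustrative, not from the slides) comparing the degree-n Taylor polynomial of e^x around a = 0 with the true value; the error shrinks rapidly as n grows, consistent with the o(x^n) remainder.

import math

# Degree-n Taylor polynomial of e^x around 0: sum_{j=0}^{n} x^j / j!.
def exp_taylor(x, n):
    return sum(x ** j / math.factorial(j) for j in range(n + 1))

x = 0.5
for n in (1, 2, 4, 8):
    approx = exp_taylor(x, n)
    print(n, approx, abs(math.exp(x) - approx))   # error decreases with n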
Taylor’s series

I f is infinitely differentiable over (a − δ, a + δ) for some δ > 0.


I Taylor series expansion for f(x) :
I f(x) = f(a) + f'(a)(x − a) + . . . + f^{(n)}(a)(x − a)^n/n! + . . ..
I
I f(x) = Σ_{j=0}^{∞} f^{(j)}(a)(x − a)^j / j!, ∀x ∈ B(a, δ).
I
I f(a + h) = Σ_{j=0}^{∞} f^{(j)}(a) h^j / j!, ∀h ∈ (−δ, δ).
I
I infinite differentiability is necessary but not sufficient for the series to converge to f(x).
Optimisation :

I Problem : Minimise (or Maximise) f (x) subject to x ∈ Ω.


I Given : oracle access to computing f(x), f'(x) and f''(x)
I and oracle access to testing “x ∈ Ω ?” :
I Goal : Find an x ∈ Ω optimising f(x).
I
I A General Optimisation Algorithm :
1. Start with an initial guess x.
2. while x is not an optimal solution do
3. Determine a search direction p ;
4. x ← x + p. endwhile
5. Return x.
Optimisation :

I Repeatedly check for local optimality ;


I Check if f'(x) = 0 and if f''(x) ≠ 0.
I Calls for finding zeroes of f'(x).
I Search direction p is guided by the optimality check.
I In special cases like Linear Programs or Semi-Definite
I Programs, other direct and efficient approaches are available.
I
I Checking global optimality is a much harder problem.
Newton’s Method for finding zeroes :

I Given oracle access to computing f and f',
Goal : To compute an x^* satisfying f(x^*) = 0.
I Newton’s Method for finding roots :
1. Start with an initial guess x.
2. while f(x) ≠ 0 and f'(x) ≠ 0 do
3. p ← −f(x)/f'(x) ; x ← x + p. endwhile
4. Return x.
I
I One can replace the test f(x) ≠ 0 by |f(x)| > ε, for a small ε > 0
(a minimal sketch follows).
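A minimal sketch of the root-finding loop above, assuming oracle access to f and f' as Python callables (the function names and the tolerance eps are illustrative):

def newton_root(f, fprime, x, eps=1e-12, max_iter=100):
    # Iterate x <- x - f(x)/f'(x) until |f(x)| <= eps (or f'(x) vanishes).
    for _ in range(max_iter):
        fx = f(x)
        if abs(fx) <= eps:
            break
        d = fprime(x)
        if d == 0:
            break                      # derivative vanished; the step is undefined
        x = x - fx / d
    return x

# Example: root of f(x) = x^2 - 2 (i.e. sqrt(2)), starting from x = 1.
print(newton_root(lambda x: x * x - 2, lambda x: 2 * x, 1.0))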
Newton’s Method - Analysis :
I Analysis of Newton’s Method :
I
I x0 = initial guess ; xk = guess after k iterations ;
I x_{k+1} = x_k − f(x_k)/f'(x_k) ; e_k = x_k − x^* ;
I 0 = f(x_k) − e_k f'(x_k) + f''(η_k) e_k^2/2
⇒ e_k = f(x_k)/f'(x_k) + f''(η_k) e_k^2/(2 f'(x_k)).
I e_{k+1} = e_k − f(x_k)/f'(x_k) = e_k^2 · f''(η_k)/(2 f'(x_k)) → e_k^2 · f''(x^*)/(2 f'(x^*)), as x_k → x^* ;
I
I ∀ large k, e_k ≈ (e_0 · f''(x^*)/(2 f'(x^*)))^{2^k} · 2 f'(x^*)/f''(x^*).
I If {x_k} → x^*, the convergence rate is quadratic with rate
constant f''(x^*)/(2 f'(x^*)), that is, Lt_{k→∞} |e_{k+1}|/|e_k|^2 = f''(x^*)/(2 f'(x^*)).
I Works fine if x0 is reasonably close to x ∗ and rate constant is
not too big.
Unconstrained Optimisation in 1D :

I Given : oracle access to computing f'(x) and f''(x) :
I
I Optimising f ⇐⇒ repeatedly finding roots of f'(x) = 0.
I Optimising strictly convex f ⇐⇒ finding a root of f'(x) = 0.
I
I By applying Newton's Method for finding roots, one
I can find approximations to a root of f'(x) = 0 with
I quadratic convergence rate and rate constant f'''(x^*)/(2 f''(x^*)),
I where x^* is a root of f'(x) = 0.
Gradient-Descent Method :

I Assumption : |f''(x)| ≤ L for x ∈ [a, b].
I
I Given oracle access to computing f and f',
Goal : To compute an x^* satisfying f'(x^*) = 0.
I
1. Start with an initial guess x. Define γ ← 1/L.
2. while f'(x) ≠ 0 do x ← x − γ f'(x) endwhile
3. Return x.
I A minimal sketch of this scheme follows.
I
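A minimal sketch of the 1D scheme above, assuming |f''| ≤ L on the region visited and oracle access to f' as a Python callable (function names and the stopping tolerance are illustrative):

def gradient_descent_1d(fprime, x, L, eps=1e-8, max_iter=10000):
    gamma = 1.0 / L                        # fixed step size from the bound |f''| <= L
    for _ in range(max_iter):
        g = fprime(x)
        if abs(g) <= eps:                  # practical replacement for f'(x) = 0
            break
        x = x - gamma * g
    return x

# Example: f(x) = (x - 3)^2, f'(x) = 2(x - 3), f''(x) = 2, so L = 2.
print(gradient_descent_1d(lambda x: 2 * (x - 3), x=0.0, L=2.0))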
Gradient-Descent - Analysis :

I x_k = value of x after k iterations ; x_{k+1} = x_k − γ f'(x_k).
I
I f(x_{k+1}) ≤ f(x_k) − γ f'(x_k)^2 + L γ^2 f'(x_k)^2/2 = f(x_k) − f'(x_k)^2/(2L).
I f(x_{k+1}) < f(x_k) for each k. {f(x_k)}_k is a decreasing sequence
converging to a limit a.
I {x_k}_k converges to a limit x^* satisfying f(x^*) = Lt_k f(x_k).
I Lt_k f'(x_k)^2 ≤ 2L · Lt_k (f(x_k) − f(x_{k+1})) = 0 ⇒ f'(x^*) = 0.
I A local optimum or a saddle point can be approached
arbitrarily closely.
Scalar and Vector functions

I f : Rn → Rm , n, m ≥ 1.
I m = 1 - real-valued or scalar functions/fields.
I m > 1 - vector-valued or vector functions/fields.
I n = 1 and m > 1 - trajectories (say, of a projectile in 3-space).
I
I ~x ∈ R^d. ||~x||_2 = √(x_1^2 + . . . + x_d^2) - the L2-norm of ~x.
I ~x, ~y ∈ R^d. d_2(~x, ~y) = ||~x − ~y||_2 - the L2-distance.
I
I f : R^n → R^m. ~a ∈ R^n, ~l ∈ R^m.
I Lt_{~x→~a} f(~x) = ~l if, ∀ε > 0, ∃δ > 0
I satisfying d_2(f(~x), ~l) ≤ ε whenever 0 < d_2(~x, ~a) ≤ δ.
I f is continuous at ~a if Lt_{~x→~a} f(~x) = f(~a).
Scalar and Vector functions

I Suppose f(~x) → ~f and g(~x) → ~g when ~x → ~a.
I
I Then, f(~x) ± g(~x) → ~f ± ~g as ~x → ~a.
I α f(~x) → α ~f as ~x → ~a for every α ∈ R.
I ||f(~x)||_2 → ||~f||_2 as ~x → ~a.
I f(~x) · g(~x) → ~f · ~g as ~x → ~a.
I
I f : Rn → Rm defined by f (~x ) = (f1 (~x ), . . . , fm (~x )) for each x.
I f is continuous at ~a if and only if each fi is continuous at ~a.
I
Scalar and Vector functions
I Suppose f : R^n → R^m and g : R^m → R^p. Define
h = g ∘ f : R^n → R^p by h(~x) = g(f(~x)) for each ~x ∈ R^n.
I Suppose also that f is continuous at ~a ∈ Rn and g is
continuous at f (~a) ∈ Rm . Then, h is continuous at ~a.
I
I Let f_1, f_2, f_3 : R^2 → R be defined by
I f_1(x, y) = sin(x^2 y) ; f_2(x, y) = log_e(x^2 + y^2) ;
I f_3(x, y) = e^{x+y}/(x + y) ;
I f_1 is continuous everywhere ;
I f_2 is continuous everywhere except at (0, 0).
I f_3 is continuous everywhere except on the line x + y = 0.
I
I f(x, y) = 2xy/(x^2 + y^2) for (x, y) ≠ (0, 0) and f(0, 0) = 0.
I f is continuous as a function of x alone and as a function of y
alone but not as a function of x and y both.
Differentiability of Scalar functions

I f : Rn → R may have different derivatives along different


directions at a point ~a.
I Focus on specific directions ~y .
I
I ~a, ~y ∈ R^n. The derivative of f at ~a along ~y is defined as
I Lt_{h→0} (f(~a + h~y) − f(~a))/h. Denoted by f'(~a, ~y) or df(~a)/d~y.
I f'(~a, ~0) = 0 always for any ~a.
I When ||~y||_2 = 1, f'(~a, ~y) is the directional derivative of f at ~a.
I When ~y = e_i along the x_i axis, f'(~a, e_i) = ∂f(~a)/∂x_i.
I The gradient of f at ~a is the vector
I ∇f(~a) = (∂f(~a)/∂x_1, . . . , ∂f(~a)/∂x_n).
Differentiability of Scalar functions

I Existence of directional derivatives f 0 (~a, ~y ) for each ~y does


not guarantee f is continuous at ~a.
I Example : f(x, y) = xy^2/(x^2 + y^4) for x ≠ 0 and f(0, y) = 0 for all y.
I f'((0, 0), ~y) exists for each ~y ∈ R^2.
I Along the parabola x = y^2, f(x, y) = 1/2 and so f is not
continuous at (0, 0).
I
I f is differentiable at ~a if, for some r > 0, there exist a linear
transformation (LT) T_~a : R^n → R and a scalar function E_~a(~y) such that
I f(~a + ~y) = f(~a) + T_~a(~y) + ||~y||_2 · E_~a(~y) holds true for all
||~y|| < r and E_~a(~y) → 0 as ||~y|| → 0.
I T_~a is the Total Derivative of f at ~a, denoted also by f'(~a).
Differentiability of Scalar functions

I f is differentiable at ~a =⇒ T_~a(~y) = f'(~a, ~y) for each ~y.
I
I Also, T_~a(~y) = ∇f(~a) · ~y = Σ_{i=1}^n (∂f(~a)/∂x_i) · y_i for each ~y.
I f is differentiable at ~a =⇒ f is continuous at ~a.
I
I f is differentiable at ~a =⇒ Taylor's first order formula :
I f(~a + ~y) = f(~a) + ∇f(~a) · ~y + ||~y|| E_~a(~y), ||~y|| < r.
I
I When ||~y|| = 1, f'(~a, ~y) = ||∇f(~a)|| · cos(θ) where
I θ = angle between ∇f(~a) and ~y.
I f'(~a, ~y) = component of ∇f(~a) in the direction of ~y.
Sufficiency for Differentiability and Chain Rule
I f is a scalar function over Rn and ~a ∈ Rn .
I If all first-order partial derivatives exist at all points in an
open neighborhood around ~a and they are continuous at ~a,
then f is differentiable at ~a.
I
I Chain Rule : r : O → S, f : S → R, O ⊆ R, S ⊆ Rn .
I Suppose r'(t) exists and f'(r(t)) exists. Then, for g = f ∘ r,
I g'(t) exists and g'(t) = ∇f(r(t)) · r'(t).
I
I Write r(t) = (r_1(t), . . . , r_n(t)).
I r'(t) = (r_1'(t), . . . , r_n'(t)).
I ∇f(r(t)) = (∂f(r(t))/∂r_1, . . . , ∂f(r(t))/∂r_n).
I
I g'(t) = Σ_{i=1}^n (∂f(r(t))/∂r_i) · (dr_i(t)/dt).
Higher-order derivatives for Scalar functions

I f : O → R, O ⊆ Rn , O is open.
I Suppose f'(~x) exists for every ~x ∈ B(~a, r).
I The derivative of f' at ~a, if it exists, is the second derivative f''(~a).
I Our Focus : second-order partial derivatives ∂^2 f(~a)/∂x_i ∂x_j.
I The Hessian (denoted by ∇^2 f(~a)) is the matrix (∂^2 f(~a)/∂x_i ∂x_j)_{i,j}.
I The Hessian is symmetric if the second-order partial derivatives are continuous.
Taylor’s approximation

I f : O → R, O ⊆ Rn , ~a ∈ O.
I second-order pds are continuous.
I f(~a + ~p) = f(~a) + ~p^T · ∇f(~a) + ~p^T · ∇^2 f(~a) · ~p / 2 + . . ..
I
I f(~a + ~p) = f(~a) + ~p^T · ∇f(~a) + ~p^T · ∇^2 f(~η) · ~p / 2
I for some ~η ∈ L(~a, ~a + ~p).
I
I f(~a + ~p) = f(~a) + Σ_{i=1}^n p_i ∂f(~a)/∂x_i + (1/2) Σ_{i,j=1}^n p_i p_j ∂^2 f(~η)/∂x_i ∂x_j.
I
I f(~a + ~p) = f(~a) + Σ_{i=1}^n p_i ∂f(~a)/∂x_i + o(||~p||) as ~p → ~0.
I
I Linear approximation : f(~a + ~p) ≈ f(~a) + ~p^T · ∇f(~a).
Example (from Griva, Nash and Sofer)

I Consider f (x, y ) = x 3 + 5x 2 y + 7xy 2 + 2y 3 . Let ~a = (−2, 3).


I ∇f(~a) = (3x^2 + 10xy + 7y^2, 5x^2 + 14xy + 6y^2)|_{(−2,3)} = (15, −10).
I ∇^2 f(~a) = [[6x + 10y, 10x + 14y], [10x + 14y, 14x + 12y]]|_{(−2,3)} = [[18, 22], [22, 8]].
I Let ~p = (0.1, 0.2).
I f(~a + ~p) = f(−1.9, 3.2) ≈ f(~a) + ~p^T · ∇f(~a) + ~p^T · ∇^2 f(~a) · ~p / 2.
I f(−1.9, 3.2) ≈ −20 − 0.5 + 0.69 = −19.81.
I Actual f(−1.9, 3.2) = −19.755 (verified in the sketch below).
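A minimal sketch (illustrative) that re-derives the numbers above by evaluating f, the gradient, the Hessian, and the quadratic approximation at ~a = (−2, 3) with ~p = (0.1, 0.2):

def f(x, y):
    return x**3 + 5*x**2*y + 7*x*y**2 + 2*y**3

def grad(x, y):
    return (3*x**2 + 10*x*y + 7*y**2, 5*x**2 + 14*x*y + 6*y**2)

def hess(x, y):
    return ((6*x + 10*y, 10*x + 14*y), (10*x + 14*y, 14*x + 12*y))

a, p = (-2.0, 3.0), (0.1, 0.2)
g, H = grad(*a), hess(*a)
lin = p[0]*g[0] + p[1]*g[1]                                              # p^T grad f(a) = -0.5
quad = sum(p[i]*H[i][j]*p[j] for i in range(2) for j in range(2)) / 2.0  # p^T Hess p / 2 = 0.69
print(f(*a) + lin + quad)              # quadratic approximation: -19.81
print(f(a[0] + p[0], a[1] + p[1]))     # actual value: -19.755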
Unconstrained minimisation of scalar functions

I f is a scalar function.
I ~a is a local minimum for f ⇒ ∇f (~a)T · p~ ≥ 0 for all p~.
I ~a is a local minimum for f ⇒ ∇f (~a) = ~0.
I Necessary but not sufficient.
I ~a is a local minimum for f ⇒ ∇2 f (~a) is positive semi-definite.
I
I Sufficiency :
I ∇f (~a) = ~0 and ∇2 f (~a) is positive definite ⇒ ~a is a local
minimum.
I A symmetric matrix B is positive semi-definite (B ⪰ 0) if
x^T B x ≥ 0 for all x ∈ R^n.
I A symmetric matrix B is positive definite (B ≻ 0) if
x^T B x > 0 for all x ≠ ~0.
Unconstrained Minimization : Newton’s Method

I f : Rn → R, a scalar function.
I Given oracle access to computing ∇f and ∇2 f ,
Goal : To compute a local minimizer ~x ∗ of f .
I Newton’s Method for Minimizing :
1. Start with an initial guess ~x .
2. while ∇f(~x) ≠ ~0 and ∇^2 f(~x) ≻ 0 do
3. ~p ← −(∇^2 f(~x))^{−1} · ∇f(~x) ; ~x ← ~x + ~p. endwhile
4. Return ~x.
I
I In practice, one replaces ∇f(~x) ≠ ~0 by ||∇f(~x)|| > ε, for a small ε > 0
(a minimal sketch follows).
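A minimal multivariate sketch, assuming oracle access to ∇f and ∇²f as NumPy callables (names, tolerance and iteration cap are illustrative); it solves the Newton linear system rather than forming the inverse explicitly:

import numpy as np

def newton_minimize(grad, hess, x, eps=1e-10, max_iter=100):
    for _ in range(max_iter):
        g = grad(x)
        if np.linalg.norm(g) <= eps:       # practical test for grad f(x) = 0
            break
        H = hess(x)
        # Assumes H is positive definite; p = -H^{-1} g via a linear solve.
        p = -np.linalg.solve(H, g)
        x = x + p
    return x

# Example: f(x) = (x0 - 1)^2 + 10*(x1 + 2)^2, minimized at (1, -2).
grad = lambda x: np.array([2*(x[0] - 1), 20*(x[1] + 2)])
hess = lambda x: np.array([[2.0, 0.0], [0.0, 20.0]])
print(newton_minimize(grad, hess, np.zeros(2)))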
Unconstrained Minimization : Newton’s Method

I Obtained by minimizing the RHS of the quadratic approximation :
I f(~x) ≈ f(~x_k) + ∇f(~x_k)^T (~x − ~x_k) + (~x − ~x_k)^T ∇^2 f(~x_k) (~x − ~x_k) / 2.
I ∇^2 f is Lipschitz continuous on O, that is,
||∇^2 f(~x) − ∇^2 f(~y)|| ≤ L ||~x − ~y||, ∀~x, ~y ∈ O.
I ~x^* - minimizer of f and ∇^2 f(~x^*) ≻ 0.
I If ||~x_0 − ~x^*|| is “sufficiently small”,
then {~x_k}_k converges quadratically to ~x^*.
I
Unconstrained Minimization : Gradient-Descent Method :

I Descent along the direction of steepest descent, namely −∇f(~x).
I
I Assumption : ||∇^2 f(~x)|| ≤ L for ~x ∈ O, for some L > 0.
I Given oracle access to computing ∇f(·) and f(·),
Goal : To compute an ~x^* satisfying ∇f(~x^*) = ~0.
I
1. Start with an initial guess ~x. Define γ ← 1/L.
2. while ∇f(~x) ≠ ~0 do ~x ← ~x − γ ∇f(~x) endwhile
3. Return ~x.
I
I In practice, one replaces ∇f(~x) ≠ ~0 by ||∇f(~x)|| > ε, for a small ε > 0.
Minimization of Scalar functions : Grad-Des. - Analysis :

I ~x_k = value of ~x after k iterations ; ~x_{k+1} = ~x_k − γ ∇f(~x_k).
I
I f(~x_{k+1}) ≤ f(~x_k) − γ ||∇f(~x_k)||^2 + γ^2 ||∇^2 f(~x_k)|| · ||∇f(~x_k)||^2 / 2
I = f(~x_k) − ||∇f(~x_k)||^2 / (2L).
I f(~x_{k+1}) < f(~x_k) for each k. {f(~x_k)}_k is a decreasing
sequence converging to a limit a.
I As in the 1D-case, {~x_k}_k converges to a limit ~x^* satisfying
f(~x^*) = Lt_k f(~x_k).
I Lt_k ||∇f(~x_k)||^2 ≤ 2L · Lt_k (f(~x_k) − f(~x_{k+1})) = 0
⇒ ∇f(~x^*) = ~0.
I A local optimum or a saddle point can be approached
arbitrarily closely.
Gradient Descent with Backtracking Line Search :

I Presumes a priori knowledge of L, which may not be available.
I
I x_0 ← initial guess of ~x^* ; n ← 0 ;
I while ∇f(~x_n) ≠ ~0 do
I    γ_n ← initial estimate of the step size γ ;
I    while f(~x_n − γ_n ∇f(~x_n)) > f(~x_n) − γ_n ||∇f(~x_n)||^2 / 2 do
I       γ_n ← γ_n / 2 endwhile
I    ~x_{n+1} ← ~x_n − γ_n ∇f(~x_n) ; n ← n + 1. endwhile
I Return ~x_n.
I
I In practice, one replaces ∇f(~x_n) ≠ ~0 by ||∇f(~x_n)|| > ε.
I Takes care of narrow, deep valleys and chooses γ adaptively
(a minimal sketch follows).
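A minimal sketch of the backtracking loop above, assuming oracle access to f and ∇f as NumPy callables (the initial step-size guess and the tolerance are illustrative):

import numpy as np

def gd_backtracking(f, grad, x, gamma0=1.0, eps=1e-8, max_iter=1000):
    for _ in range(max_iter):
        g = grad(x)
        gn2 = np.dot(g, g)
        if np.sqrt(gn2) <= eps:            # practical test for grad f(x) = 0
            break
        gamma = gamma0                     # initial estimate of the step size
        # Halve gamma until the sufficient-decrease condition holds.
        while f(x - gamma * g) > f(x) - gamma * gn2 / 2:
            gamma /= 2
        x = x - gamma * g
    return x

# Example: a narrow quadratic valley f(x) = x0^2 + 100*x1^2.
f = lambda x: x[0]**2 + 100*x[1]**2
grad = lambda x: np.array([2*x[0], 200*x[1]])
print(gd_backtracking(f, grad, np.array([1.0, 1.0])))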
Descent with Exact Line Search

I p~ is any descent direction, that is, ∇f (~x )T p~ < 0.


I Solve P : min_{α>0} f(~x + α ~p) for its optimum solution α^*.
I Replace the current solution ~x by ~x ← ~x + α^* ~p.
I Solving P exactly is often possible, since it involves only the single variable α.
I Tries to update the current solution to the best possible one
in the direction of descent ~p (see the quadratic-case sketch below).
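A minimal sketch of exact line search in the quadratic case f(~x) = ~x^T Q ~x / 2 − b^T ~x (an assumption made for illustration; Q symmetric positive definite), where min_α f(~x + α~p) has the closed form α^* = −∇f(~x)^T ~p / (~p^T Q ~p):

import numpy as np

def exact_line_search_quadratic(Q, b, x, p):
    # f(x) = 0.5 x^T Q x - b^T x ;  grad f(x) = Q x - b.
    g = Q @ x - b
    alpha = -(g @ p) / (p @ (Q @ p))       # closed-form minimiser of f(x + alpha*p)
    return x + alpha * p

# One steepest-descent step (p = -grad f) with exact line search.
Q = np.array([[2.0, 0.0], [0.0, 10.0]])
b = np.array([1.0, 1.0])
x = np.array([5.0, 5.0])
p = -(Q @ x - b)                           # descent direction
print(exact_line_search_quadratic(Q, b, x, p))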
Newton’s method (NM) vs Gradient descent (GD)

I GD guarantees convergence (to a stationary point) while NM can
fail if the Hessian is not positive definite.
I NM provides a quadratic rate of convergence if ~x_0 is
“reasonably close” to a local minimum.
I NM is computationally expensive (computing Hessian and its
inverse) and also suffers from numerical instabilities.
I Where applicable, NM converges much faster than GD if we
start within a suitable neighborhood.
I For GD, choose step size small in regions of greater variability
of the gradient and large in regions of small variability.
Gaussian Smoothing

I Tries to find a convex approximation of f by employing
Gaussian smoothing.
I f is replaced by g where g(x) is a weighted average of the values
of f in a neighborhood of x (see the sketch below).
I The weights are chosen by employing a Gaussian distribution.
I Has the effect of smoothing out sudden dips or ascents in the
value of f .
I Often helps find a global minimum (as against a local one),
even for non-convex f.
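A minimal Monte-Carlo sketch of one way to realise this idea (an assumption for illustration, not the lecture's prescribed procedure): g(x) is estimated as the average of f over Gaussian perturbations of x, which damps sudden dips and ascents.

import math
import random

def gaussian_smooth(f, x, sigma=0.5, samples=2000):
    # Estimate g(x) = E[ f(x + sigma*Z) ], Z ~ N(0, 1), by sampling.
    return sum(f(x + sigma * random.gauss(0.0, 1.0)) for _ in range(samples)) / samples

# Example: a wiggly non-convex function; smoothing damps the small dips.
f = lambda x: x**2 + 0.5 * math.sin(20 * x)
print(f(0.3), gaussian_smooth(f, 0.3))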
Stochastic Gradient Descent

I A very efficient and useful tool for f's of the form
I f(~x) = Σ_{i=1}^n f_i(~x).
I
I Example : Least-squares : Minimize
I f(~x) = Σ_{i=1}^n (y_i − ~a_i^T ~x)^2 = Σ_{i=1}^n f_i(~x).
I
I Shortens the work to compute ∇f(~x) by computing a
stochastic approximation to it : O(d) work per step instead of O(nd).
I Idea : Choose i ∈ {1, . . . , n} uniformly at random and
compute r(~x) = n ∇f_i(~x).
I E[r(~x)] = Σ_{i=1}^n n ∇f_i(~x) / n = ∇f(~x).
I r(~x) is an unbiased estimator of ∇f(~x) (see the sketch below).
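A minimal sketch of SGD for the least-squares objective above, assuming the data rows ~a_i and targets y_i are given as NumPy arrays (step size, iteration count and seed are illustrative):

import numpy as np

def sgd_least_squares(A, y, steps=20000, gamma=0.01, seed=0):
    # Minimise f(x) = sum_i (y_i - a_i^T x)^2 by sampling one term per step.
    rng = np.random.default_rng(seed)
    n, d = A.shape
    x = np.zeros(d)
    for _ in range(steps):
        i = rng.integers(n)                          # uniform random index
        r = -2.0 * (y[i] - A[i] @ x) * A[i] * n      # r(x) = n * grad f_i(x), unbiased for grad f(x)
        x = x - gamma * r / n                        # O(d) work per step, vs O(nd) for the full gradient
    return x

# Example: recover x_true = (1, -2) from noiseless data.
A = np.array([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [2.0, -1.0]])
y = A @ np.array([1.0, -2.0])
print(sgd_least_squares(A, y))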
Convex sets
I S ⊆ R d is convex if,
I λx + (1 − λ)y ∈ S, ∀x, y ∈ S and 0 ≤ λ ≤ 1.
I for every x, y ∈ S, the unique line segment L(~x , ~y ) is also in S.
I
I Examples : circles, d-boxes (a_i ≤ x_i ≤ b_i ∀i), d-spheres
(d_2(x, c)^2 ≤ r^2), hyperplanes (a^T · x = b), half-spaces
(a^T · x ≤ b), feasible solutions of a system of linear
constraints ({~x : Ax ≤ b}), etc.
I
I S_i (i ∈ I) are convex ⇒ ∩_{i∈I} S_i is convex.
I
I A convex combination of x_1, . . . , x_k is any vector y satisfying
I y = Σ_i λ_i x_i where λ_i ≥ 0, ∀i and Σ_i λ_i = 1.
I S is convex ⇒ all convex combinations of any finite subset of
S are also in S.
Convex functions

I S ⊆ Rd is convex. f : S → R - scalar function.


I
I f is convex over S if, ∀x, y ∈ S and 0 ≤ λ ≤ 1,
I f (λx + (1 − λ)y ) ≤ λf (x) + (1 − λ)f (y ).
I for every x, y ∈ S, the graph of f between x and y lies entirely
on or below the line segment joining (x, f(x)) and (y, f(y)).
I
I f is strictly convex if ≤ is replaced with <.
I f is concave if ≤ is replaced with ≥.
I f is strictly concave if ≥ is replaced with >.
I f is convex over S if and only if −f is concave over S.
Convex functions

I Examples : a^T x + b, x^T Q x for pos-def. Q, etc. over R^d.
I Examples of f : R → R : x^2, x^4, x^6, . . . , e^{ax}, a ∈ R, over R.
I
I f is convex ⇔ f (y ) ≥ f (x) + ∇f (x) · (y − x), ∀x, y ∈ S.
I f is convex ⇔ ∇2 f (x) is pos-semi-def. for every x ∈ S.
I ∇^2 f(x) pos-def. ∀ x ∈ S ⇒ f is strictly convex (the converse need not hold, e.g., f(x) = x^4).
I Analogous statements hold true for concave functions.
I
I f , g convex over S, α ≥ 0 ⇒ f + g and αf are convex over S.
Minima and maxima of convex functions
I f : S → R - convex function. S ⊆ Rd - convex.
I
I x ∈ S is a local minimum for f if, for some r > 0,
I f (x) ≤ f (y ) for every y ∈ S satisfying ||x − y ||2 ≤ r .
I
I x is a local minimum of f over S if and only if
I (A) : ∇f (x) · (y − x) ≥ 0 for every y ∈ S.
I follows from f'(x, y − x) = ∇f(x) · (y − x).
I For S = Rd , (A) ⇔ ∇f (x) = 0.
I
I Every local minimum of f is a global minimum of f over S.
I follows by considering y ∈ S arbitrarily close to x.
I
