
Optimization in Machine Learning

Mathematical Concepts:
Differentiation and Derivatives
Learning goals
Definition of smoothness
Uni- & multivariate differentiation
Gradient, partial derivatives
Jacobian matrix
Hessian matrix
Lipschitz continuity
UNIVARIATE DIFFERENTIABILITY
Definition: A function f : S ⊆ R → R is said to be differentiable at an inner point x ∈ S if the following limit exists:

f'(x) := \lim_{h \to 0} \frac{f(x + h) - f(x)}{h}
Intuitively: f can be approximated locally by a linear function with slope m = f ′(x).

Left: Function is differentiable everywhere. Right: Not differentiable at the red point.
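To make the limit concrete, the following minimal sketch (assuming Python; the test function f(x) = x² and the step sizes are illustrative choices, not from the slides) evaluates the difference quotient for shrinking h:

```python
# Difference quotient from the definition above, evaluated for shrinking h.
# f(x) = x^2 and the test point x0 are illustrative choices.
def f(x):
    return x ** 2

x0 = 1.5
for h in (1e-1, 1e-3, 1e-6):
    quotient = (f(x0 + h) - f(x0)) / h
    print(f"h = {h:.0e}: difference quotient = {quotient:.6f}")
# The quotients approach f'(1.5) = 2 * 1.5 = 3 as h -> 0.
```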



SMOOTH VS. NON-SMOOTH
Smoothness of a function f : S → R is measured by the number
of its continuous derivatives
C k is class of k -times continuously differentiable functions
(f ∈ C k means f (k ) exists and is continuous)
In this lecture, we call f “smooth” if at least f ∈ C 1

f1 is smooth, f2 is continuous but not differentiable, and f3 is non-continuous.



MULTIVARIATE DIFFERENTIABILITY
Definition: f : S ⊆ Rd → R is differentiable in x ∈ S if there exists a vector ∇f (x) ∈ Rd (equivalently, a continuous linear map h ↦ ∇f (x)T · h) with

\lim_{h \to 0} \frac{f(x + h) - f(x) - \nabla f(x)^T \cdot h}{\|h\|} = 0

Geometrically: The function can be locally approximated by a tangent hyperplane.


Source: https://github.com/jermwatt/machine_learning_refined.



GRADIENT
Linear approximation is given by the gradient:

\nabla f = \frac{\partial f}{\partial x_1} e_1 + \cdots + \frac{\partial f}{\partial x_d} e_d = \left( \frac{\partial f}{\partial x_1}, \frac{\partial f}{\partial x_2}, \ldots, \frac{\partial f}{\partial x_d} \right)^T
Elements of the gradient are called partial derivatives.
To compute ∂ f /∂ xj , regard f as a function of xj only (all other variables held fixed)

Example: f (x) = x1²/2 + x1 x2 + x2² ⇒ ∇f (x) = (x1 + x2 , x1 + 2x2 )T
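As a sanity check, the analytic gradient of this example can be compared against central finite differences. A small sketch (assuming Python/NumPy; the test point is arbitrary):

```python
import numpy as np

# Analytic gradient of f(x) = x1^2/2 + x1*x2 + x2^2 vs. central differences.
def f(x):
    return 0.5 * x[0] ** 2 + x[0] * x[1] + x[1] ** 2

def grad_f(x):
    return np.array([x[0] + x[1], x[0] + 2 * x[1]])

def numerical_grad(f, x, eps=1e-6):
    g = np.zeros_like(x, dtype=float)
    for j in range(len(x)):
        e = np.zeros_like(x, dtype=float)
        e[j] = eps
        g[j] = (f(x + e) - f(x - e)) / (2 * eps)  # partial derivative w.r.t. x_j
    return g

x = np.array([1.0, -2.0])
print(grad_f(x))             # analytic gradient: [-1. -3.]
print(numerical_grad(f, x))  # finite-difference approximation agrees
```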



DIRECTIONAL DERIVATIVE
The directional derivative tells how fast f : S → R is changing w.r.t.
an arbitrary direction v :

D_v f(x) := \lim_{h \to 0} \frac{f(x + hv) - f(x)}{h} = \nabla f(x)^T \cdot v.
Example: The directional derivative for v = (1, 1) is:

D_v f(x) = \nabla f(x)^T \cdot \begin{pmatrix} 1 \\ 1 \end{pmatrix} = \frac{\partial f}{\partial x_1} + \frac{\partial f}{\partial x_2}
NB: Some people require that ||v || = 1. Then, we can identify Dv f (x) with the instantaneous rate of change in direction v – and in our example we would have to divide by √2.
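A short sketch of this identity (assuming Python/NumPy, reusing the example f from the gradient slide at an arbitrary test point) compares the limit definition with the dot-product form:

```python
import numpy as np

# Directional derivative of f(x) = x1^2/2 + x1*x2 + x2^2 along v = (1, 1),
# once via the limit definition and once via the dot product with the gradient.
def f(x):
    return 0.5 * x[0] ** 2 + x[0] * x[1] + x[1] ** 2

def grad_f(x):
    return np.array([x[0] + x[1], x[0] + 2 * x[1]])

x = np.array([1.0, -2.0])
v = np.array([1.0, 1.0])
h = 1e-6
limit_version = (f(x + h * v) - f(x)) / h
dot_version = grad_f(x) @ v   # equals df/dx1 + df/dx2 for v = (1, 1)
print(limit_version, dot_version)
# Dividing by ||v|| = sqrt(2) would give the rate of change per unit length.
```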



PROPERTIES OF THE GRADIENT
Orthogonal to level curves/surfaces of a function
Points in direction of greatest increase of f

Proof: Let v be a vector with ∥v ∥ = 1 and θ the angle between v and ∇f (x).

D_v f(x) = \nabla f(x)^T v = \|\nabla f(x)\| \, \|v\| \cos(\theta) = \|\nabla f(x)\| \cos(\theta)

by the cosine formula for dot products and ∥v ∥ = 1. cos(θ) is maximal if θ = 0, hence if v and ∇f (x) point in the same direction.
(Alternative proof: Apply Cauchy-Schwarz to ∇f (x)T v and look for equality.)
Analogous: Negative gradient −∇f (x) points in direction of greatest decrease
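The claim can also be illustrated numerically: sampling unit directions and comparing their directional derivatives shows the maximum is attained along ∇f(x)/∥∇f(x)∥. A sketch (assuming Python/NumPy, reusing the earlier example gradient and an arbitrary test point):

```python
import numpy as np

# Sample unit directions v and check that the directional derivative
# grad_f(x)^T v is largest when v points along the gradient (theta = 0).
def grad_f(x):
    return np.array([x[0] + x[1], x[0] + 2 * x[1]])  # example from the gradient slide

x = np.array([1.0, -2.0])
g = grad_f(x)
angles = np.linspace(0, 2 * np.pi, 360, endpoint=False)
dirs = np.stack([np.cos(angles), np.sin(angles)], axis=1)  # unit vectors
dvals = dirs @ g                                           # directional derivatives
best = dirs[np.argmax(dvals)]
print(best, g / np.linalg.norm(g))  # best sampled direction vs. normalized gradient
```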



PROPERTIES OF THE GRADIENT

The length of each arrow corresponds to the norm of the gradient at that point.



JACOBIAN MATRIX
For a vector-valued function f = (f1 , . . . , fm )T , fj : S → R, the Jacobian matrix Jf : S → Rm×d generalizes the gradient by placing all ∇fj in its rows:

J_f(x) = \begin{pmatrix} \nabla f_1(x)^T \\ \vdots \\ \nabla f_m(x)^T \end{pmatrix} = \begin{pmatrix} \frac{\partial f_1(x)}{\partial x_1} & \cdots & \frac{\partial f_1(x)}{\partial x_d} \\ \vdots & \ddots & \vdots \\ \frac{\partial f_m(x)}{\partial x_1} & \cdots & \frac{\partial f_m(x)}{\partial x_d} \end{pmatrix}

The Jacobian gives the best local linear approximation of how f distorts volumes.

Source: Wikipedia
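A numerical Jacobian can be assembled column by column from partial derivatives of the components. The sketch below (assuming Python/NumPy; the polar-to-Cartesian map f(r, φ) = (r cos φ, r sin φ) is an illustrative choice, not from the slides) uses central differences:

```python
import numpy as np

# Numerical Jacobian of a vector-valued map f: R^2 -> R^2, one row per
# component gradient; the polar-to-Cartesian map is an illustrative choice.
def f(x):
    r, phi = x
    return np.array([r * np.cos(phi), r * np.sin(phi)])

def numerical_jacobian(f, x, eps=1e-6):
    m, d = len(f(x)), len(x)
    J = np.zeros((m, d))
    for j in range(d):
        e = np.zeros(d)
        e[j] = eps
        J[:, j] = (f(x + e) - f(x - e)) / (2 * eps)  # column j: partials w.r.t. x_j
    return J

x0 = np.array([2.0, np.pi / 4])
print(numerical_jacobian(f, x0))
```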



JACOBIAN DETERMINANT
Let f ∈ C 1 and x0 ∈ S .

Inverse function theorem: Let y0 = f (x0 ). If det(Jf (x0 )) ̸= 0, then


1. f is invertible in a neighborhood of x0,
2. f⁻¹ ∈ C¹ with Jf⁻¹(y0) = Jf(x0)⁻¹.

|det(Jf (x0 ))|: factor by which f expands/shrinks volumes near x0


If det(Jf (x0 )) > 0, f preserves orientation near x0
If det(Jf (x0 )) < 0, f reverses orientation near x0
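Continuing the polar-to-Cartesian example from the Jacobian sketch above (an illustrative assumption, not from the slides): det(Jf(r, φ)) = r, so the map expands areas by the factor r, is locally invertible for r ≠ 0, and preserves orientation for r > 0:

```python
import numpy as np

# Jacobian of the polar-to-Cartesian map f(r, phi) = (r*cos(phi), r*sin(phi)).
def jacobian_polar(r, phi):
    return np.array([[np.cos(phi), -r * np.sin(phi)],
                     [np.sin(phi),  r * np.cos(phi)]])

r, phi = 2.0, np.pi / 4
J = jacobian_polar(r, phi)
print(np.linalg.det(J))  # ~2.0 = r: area expansion factor, nonzero => locally invertible
print(np.linalg.inv(J))  # by the inverse function theorem, the Jacobian of f^{-1} at f(r, phi)
```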



HESSIAN MATRIX
For a real-valued function f : S → R, the Hessian matrix H : S → Rd×d contains all of its second derivatives (if they exist):

H(x) = \nabla^2 f(x) = \left( \frac{\partial^2 f(x)}{\partial x_i \, \partial x_j} \right)_{i,j = 1, \ldots, d}

Note: Hessian of f is Jacobian of ∇f


Example: Let f (x) = sin(x1 ) · cos(2x2 ). Then:

H(x) = \begin{pmatrix} -\cos(2x_2) \cdot \sin(x_1) & -2\cos(x_1) \cdot \sin(2x_2) \\ -2\cos(x_1) \cdot \sin(2x_2) & -4\cos(2x_2) \cdot \sin(x_1) \end{pmatrix}

If f ∈ C 2 , then H is symmetric
Many local properties (geometry, convexity, critical points) are
encoded by the Hessian and its spectrum (→ later)
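The closed-form Hessian above can be verified by finite differences of the gradient, using the fact that the Hessian is the Jacobian of ∇f. A sketch (assuming Python/NumPy; the test point is arbitrary):

```python
import numpy as np

# Numerical Hessian of f(x) = sin(x1) * cos(2*x2) via central differences of the
# analytic gradient, checked against the closed form on this slide.
def grad_f(x):
    return np.array([np.cos(x[0]) * np.cos(2 * x[1]),
                     -2 * np.sin(x[0]) * np.sin(2 * x[1])])

def hessian(x):
    return np.array([[-np.sin(x[0]) * np.cos(2 * x[1]), -2 * np.cos(x[0]) * np.sin(2 * x[1])],
                     [-2 * np.cos(x[0]) * np.sin(2 * x[1]), -4 * np.sin(x[0]) * np.cos(2 * x[1])]])

def numerical_hessian(x, eps=1e-6):
    d = len(x)
    H = np.zeros((d, d))
    for j in range(d):
        e = np.zeros(d)
        e[j] = eps
        H[:, j] = (grad_f(x + e) - grad_f(x - e)) / (2 * eps)  # Hessian = Jacobian of grad f
    return H

x = np.array([0.3, 0.7])
print(np.allclose(hessian(x), numerical_hessian(x)))  # True: symmetric and matches
```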



LOCAL CURVATURE BY HESSIAN
Eigenvector corresponding to largest (resp. smallest) eigenvalue of
Hessian points in direction of largest (resp. smallest) curvature

Example (previous slide): For a = (−π/2, 0)T , we have

H(a) = \begin{pmatrix} 1 & 0 \\ 0 & 4 \end{pmatrix}

and thus λ1 = 4, λ2 = 1, v1 = (0, 1)T , and v2 = (1, 0)T .
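A sketch of this eigenanalysis (assuming Python/NumPy): evaluate the Hessian from the previous slide at a = (−π/2, 0) and read off the eigenvector of the largest eigenvalue:

```python
import numpy as np

# Eigendecomposition of the Hessian of sin(x1)*cos(2*x2) at a = (-pi/2, 0);
# the eigenvector of the largest eigenvalue gives the direction of greatest curvature.
a = np.array([-np.pi / 2, 0.0])
H = np.array([[-np.sin(a[0]) * np.cos(2 * a[1]), -2 * np.cos(a[0]) * np.sin(2 * a[1])],
              [-2 * np.cos(a[0]) * np.sin(2 * a[1]), -4 * np.sin(a[0]) * np.cos(2 * a[1])]])
eigvals, eigvecs = np.linalg.eigh(H)   # eigh: symmetric matrices, eigenvalues ascending
print(eigvals)                         # [1. 4.]
print(eigvecs[:, np.argmax(eigvals)])  # [0. 1.]: direction of largest curvature
```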



LIPSCHITZ CONTINUITY
Function h : S → Rm is Lipschitz continuous if slopes are bounded:

∥h(x) − h(y)∥ ≤ L∥x − y∥ for each x, y ∈ S and some L > 0

Examples (d = m = 1): sin(x ), |x |



Not examples: 1/x (but locally Lipschitz continuous), √x
If m = d and h differentiable:
h Lipschitz continuous with constant L ⇐⇒ Jh ≼ L · Id
Note: A ≼ B :⇐⇒ B − A is positive semidefinite, i.e., vT (B − A)v ≥ 0 ∀v ̸= 0

Proof of “⇒” for d = m = 1:

h'(x) = \lim_{\epsilon \to 0} \frac{h(x + \epsilon) - h(x)}{\epsilon} \le \lim_{\epsilon \to 0} \underbrace{\frac{|h(x + \epsilon) - h(x)|}{|\epsilon|}}_{\le L} \le \lim_{\epsilon \to 0} L = L

[Proof of “⇐” by mean value theorem: Show that λmax (Jh ) ≤ L.]
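A quick numerical illustration (assuming Python/NumPy; the sample points are arbitrary): the slopes of sin(x) stay bounded by L = 1, while the slopes of the square root near 0 grow without bound, so no global Lipschitz constant exists:

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.uniform(-10, 10, size=1000)
y = rng.uniform(-10, 10, size=1000)

# Empirical slopes |h(x) - h(y)| / |x - y| for h = sin: bounded by L = 1.
slopes_sin = np.abs(np.sin(x) - np.sin(y)) / np.abs(x - y)
print(slopes_sin.max())   # <= 1

# For h(t) = sqrt(t), the slopes near 0 are unbounded, so no global L exists.
t = np.array([1e-2, 1e-4, 1e-8])
print(np.abs(np.sqrt(t) - np.sqrt(0.0)) / np.abs(t - 0.0))   # 10, 100, 10000
```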



LIPSCHITZ GRADIENTS
Let f ∈ C 2 . Since ∇2 f is Jacobian of h = ∇f (m = d ):

∇f Lipschitz continuous with constant L ⇐⇒ ∇2 f ≼ L · Id


Equivalently, eigenvalues of ∇2 f are bounded by L
Interpretation: Curvature in any direction is bounded by L
Lipschitz gradients occur frequently in machine learning
=⇒ Fairly weak assumption
Important for analysis of gradient descent optimization
=⇒ Descent lemma (later)
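A sketch of this equivalence for a quadratic (assuming Python/NumPy; f(x) = x1²/2 + x1 x2 + x2² is reused from the gradient slide, so ∇f(x) = Hx with a constant Hessian H): the largest eigenvalue of H serves as a Lipschitz constant for ∇f:

```python
import numpy as np

# For the quadratic f(x) = x1^2/2 + x1*x2 + x2^2, grad f(x) = H x with constant
# Hessian H, so the largest eigenvalue of H is a Lipschitz constant of grad f.
H = np.array([[1.0, 1.0],
              [1.0, 2.0]])           # Hessian of the example f
L = np.linalg.eigvalsh(H).max()      # largest eigenvalue, ~2.618

rng = np.random.default_rng(0)
for _ in range(5):
    x, y = rng.normal(size=2), rng.normal(size=2)
    lhs = np.linalg.norm(H @ x - H @ y)   # ||grad f(x) - grad f(y)||
    rhs = L * np.linalg.norm(x - y)
    print(lhs <= rhs + 1e-12)             # True in every trial
```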

