
Optimization in Machine Learning

Mathematical Concepts:
Differentiation and Derivatives
Learning goals
Definition of smoothness
Uni- & multivariate differentiation
Gradient, partial derivatives
Jacobian matrix
Hessian matrix
Lipschitz continuity
UNIVARIATE DIFFERENTIABILITY
Definition: A function f : S ⊆ R → R is said to be differentiable at an inner point x ∈ S if the following limit exists:

f'(x) := \lim_{h \to 0} \frac{f(x + h) - f(x)}{h}
Intuitively: f can be approximated locally by a linear function with slope m = f ′(x).

Left: Function is differentiable everywhere. Right: Not differentiable at the red point.
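To make the limit concrete, the following minimal sketch (assuming Python; the test function f(x) = x² and the step sizes are illustrative choices, not from the slides) evaluates the difference quotient for shrinking h:

```python
# Difference quotient from the definition above, evaluated for shrinking h.
# f(x) = x^2 and the test point x0 are illustrative choices.
def f(x):
    return x ** 2

x0 = 1.5
for h in (1e-1, 1e-3, 1e-6):
    quotient = (f(x0 + h) - f(x0)) / h
    print(f"h = {h:.0e}: difference quotient = {quotient:.6f}")
# The quotients approach f'(1.5) = 2 * 1.5 = 3 as h -> 0.
```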



SMOOTH VS. NON-SMOOTH
Smoothness of a function f : S → R is measured by the number
of its continuous derivatives
C k is class of k -times continuously differentiable functions
(f ∈ C k means f (k ) exists and is continuous)
In this lecture, we call f “smooth” if at least f ∈ C 1

f1 is smooth, f2 is continuous but not differentiable, and f3 is non-continuous.



MULTIVARIATE DIFFERENTIABILITY
Definition: f : S ⊆ Rd → R is differentiable in x ∈ S if there exists a vector ∇f (x) ∈ Rd (equivalently, a continuous linear map h ↦ ∇f (x)T · h) with

\lim_{h \to 0} \frac{f(x + h) - f(x) - \nabla f(x)^T \cdot h}{\|h\|} = 0

Geometrically: The function can be locally approximated by a tangent hyperplane.


Source: https://github.com/jermwatt/machine_learning_refined.



GRADIENT
Linear approximation is given by the gradient:

\nabla f = \frac{\partial f}{\partial x_1} e_1 + \cdots + \frac{\partial f}{\partial x_d} e_d = \left( \frac{\partial f}{\partial x_1}, \frac{\partial f}{\partial x_2}, \ldots, \frac{\partial f}{\partial x_d} \right)^T
Elements of the gradient are called partial derivatives.
To compute ∂ f /∂ xj , regard f as a function of xj only (all other variables held fixed)

Example: f (x) = x1²/2 + x1 x2 + x2² ⇒ ∇f (x) = (x1 + x2 , x1 + 2x2 )T
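As a sanity check, the analytic gradient of this example can be compared against central finite differences. A small sketch (assuming Python/NumPy; the test point is arbitrary):

```python
import numpy as np

# Analytic gradient of f(x) = x1^2/2 + x1*x2 + x2^2 vs. central differences.
def f(x):
    return 0.5 * x[0] ** 2 + x[0] * x[1] + x[1] ** 2

def grad_f(x):
    return np.array([x[0] + x[1], x[0] + 2 * x[1]])

def numerical_grad(f, x, eps=1e-6):
    g = np.zeros_like(x, dtype=float)
    for j in range(len(x)):
        e = np.zeros_like(x, dtype=float)
        e[j] = eps
        g[j] = (f(x + e) - f(x - e)) / (2 * eps)  # partial derivative w.r.t. x_j
    return g

x = np.array([1.0, -2.0])
print(grad_f(x))             # analytic gradient: [-1. -3.]
print(numerical_grad(f, x))  # finite-difference approximation agrees
```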



DIRECTIONAL DERIVATIVE
The directional derivative tells how fast f : S → R is changing w.r.t.
an arbitrary direction v :

D_v f(x) := \lim_{h \to 0} \frac{f(x + hv) - f(x)}{h} = \nabla f(x)^T \cdot v.
Example: The directional derivative for v = (1, 1) is:

D_v f(x) = \nabla f(x)^T \cdot \begin{pmatrix} 1 \\ 1 \end{pmatrix} = \frac{\partial f}{\partial x_1} + \frac{\partial f}{\partial x_2}
NB: Some people require that ||v || = 1. Then, we can identify Dv f (x) with the instantaneous rate of change in direction v – and in our example we would have to divide by √2.
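A short sketch of this identity (assuming Python/NumPy, reusing the example f from the gradient slide at an arbitrary test point) compares the limit definition with the dot-product form:

```python
import numpy as np

# Directional derivative of f(x) = x1^2/2 + x1*x2 + x2^2 along v = (1, 1),
# once via the limit definition and once via the dot product with the gradient.
def f(x):
    return 0.5 * x[0] ** 2 + x[0] * x[1] + x[1] ** 2

def grad_f(x):
    return np.array([x[0] + x[1], x[0] + 2 * x[1]])

x = np.array([1.0, -2.0])
v = np.array([1.0, 1.0])
h = 1e-6
limit_version = (f(x + h * v) - f(x)) / h
dot_version = grad_f(x) @ v   # equals df/dx1 + df/dx2 for v = (1, 1)
print(limit_version, dot_version)
# Dividing by ||v|| = sqrt(2) would give the rate of change per unit length.
```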



PROPERTIES OF THE GRADIENT
Orthogonal to level curves/surfaces of a function
Points in direction of greatest increase of f

Proof: Let v be a vector with ∥v ∥ = 1 and θ the angle between v and ∇f (x).

D_v f(x) = \nabla f(x)^T v = \|\nabla f(x)\| \, \|v\| \cos(\theta) = \|\nabla f(x)\| \cos(\theta)

by the cosine formula for dot products and ∥v ∥ = 1. cos(θ) is maximal if θ = 0, hence if v and ∇f (x) point in the same direction.
(Alternative proof: Apply Cauchy-Schwarz to ∇f (x)T v and look for equality.)
Analogous: Negative gradient −∇f (x) points in direction of greatest decrease
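The claim can also be illustrated numerically: sampling unit directions and comparing their directional derivatives shows the maximum is attained along ∇f(x)/∥∇f(x)∥. A sketch (assuming Python/NumPy, reusing the earlier example gradient and an arbitrary test point):

```python
import numpy as np

# Sample unit directions v and check that the directional derivative
# grad_f(x)^T v is largest when v points along the gradient (theta = 0).
def grad_f(x):
    return np.array([x[0] + x[1], x[0] + 2 * x[1]])  # example from the gradient slide

x = np.array([1.0, -2.0])
g = grad_f(x)
angles = np.linspace(0, 2 * np.pi, 360, endpoint=False)
dirs = np.stack([np.cos(angles), np.sin(angles)], axis=1)  # unit vectors
dvals = dirs @ g                                           # directional derivatives
best = dirs[np.argmax(dvals)]
print(best, g / np.linalg.norm(g))  # best sampled direction vs. normalized gradient
```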



PROPERTIES OF THE GRADIENT

The length of each arrow corresponds to the norm of the gradient at that point.



JACOBIAN MATRIX
For a vector-valued function f = (f1 , . . . , fm )T , fj : S → R, the Jacobian matrix Jf : S → Rm×d generalizes the gradient by placing all ∇fj in its rows:

J_f(x) = \begin{pmatrix} \nabla f_1(x)^T \\ \vdots \\ \nabla f_m(x)^T \end{pmatrix} = \begin{pmatrix} \frac{\partial f_1(x)}{\partial x_1} & \cdots & \frac{\partial f_1(x)}{\partial x_d} \\ \vdots & \ddots & \vdots \\ \frac{\partial f_m(x)}{\partial x_1} & \cdots & \frac{\partial f_m(x)}{\partial x_d} \end{pmatrix}

The Jacobian gives the best local linear approximation of how f distorts volumes.

Source: Wikipedia
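A numerical Jacobian can be assembled column by column from partial derivatives of the components. The sketch below (assuming Python/NumPy; the polar-to-Cartesian map f(r, φ) = (r cos φ, r sin φ) is an illustrative choice, not from the slides) uses central differences:

```python
import numpy as np

# Numerical Jacobian of a vector-valued map f: R^2 -> R^2, one row per
# component gradient; the polar-to-Cartesian map is an illustrative choice.
def f(x):
    r, phi = x
    return np.array([r * np.cos(phi), r * np.sin(phi)])

def numerical_jacobian(f, x, eps=1e-6):
    m, d = len(f(x)), len(x)
    J = np.zeros((m, d))
    for j in range(d):
        e = np.zeros(d)
        e[j] = eps
        J[:, j] = (f(x + e) - f(x - e)) / (2 * eps)  # column j: partials w.r.t. x_j
    return J

x0 = np.array([2.0, np.pi / 4])
print(numerical_jacobian(f, x0))
```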



JACOBIAN DETERMINANT
Let f ∈ C 1 and x0 ∈ S .

Inverse function theorem: Let y0 = f (x0 ). If det(Jf (x0 )) ̸= 0, then


1. f is invertible in a neighborhood of x0,
2. f⁻¹ ∈ C¹ with Jf⁻¹(y0) = Jf(x0)⁻¹.

|det(Jf (x0 ))|: factor by which f expands/shrinks volumes near x0


If det(Jf (x0 )) > 0, f preserves orientation near x0
If det(Jf (x0 )) < 0, f reverses orientation near x0
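Continuing the polar-to-Cartesian example from the Jacobian sketch above (an illustrative assumption, not from the slides): det(Jf(r, φ)) = r, so the map expands areas by the factor r, is locally invertible for r ≠ 0, and preserves orientation for r > 0:

```python
import numpy as np

# Jacobian of the polar-to-Cartesian map f(r, phi) = (r*cos(phi), r*sin(phi)).
def jacobian_polar(r, phi):
    return np.array([[np.cos(phi), -r * np.sin(phi)],
                     [np.sin(phi),  r * np.cos(phi)]])

r, phi = 2.0, np.pi / 4
J = jacobian_polar(r, phi)
print(np.linalg.det(J))  # ~2.0 = r: area expansion factor, nonzero => locally invertible
print(np.linalg.inv(J))  # by the inverse function theorem, the Jacobian of f^{-1} at f(r, phi)
```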



HESSIAN MATRIX
For a real-valued function f : S → R, the Hessian matrix H : S → Rd×d contains all of its second derivatives (if they exist):

H(x) = \nabla^2 f(x) = \left( \frac{\partial^2 f(x)}{\partial x_i \, \partial x_j} \right)_{i,j = 1, \ldots, d}

Note: Hessian of f is Jacobian of ∇f


Example: Let f (x) = sin(x1 ) · cos(2x2 ). Then:

H(x) = \begin{pmatrix} -\cos(2x_2) \cdot \sin(x_1) & -2\cos(x_1) \cdot \sin(2x_2) \\ -2\cos(x_1) \cdot \sin(2x_2) & -4\cos(2x_2) \cdot \sin(x_1) \end{pmatrix}

If f ∈ C 2 , then H is symmetric
Many local properties (geometry, convexity, critical points) are
encoded by the Hessian and its spectrum (→ later)
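The closed-form Hessian above can be verified by finite differences of the gradient, using the fact that the Hessian is the Jacobian of ∇f. A sketch (assuming Python/NumPy; the test point is arbitrary):

```python
import numpy as np

# Numerical Hessian of f(x) = sin(x1) * cos(2*x2) via central differences of the
# analytic gradient, checked against the closed form on this slide.
def grad_f(x):
    return np.array([np.cos(x[0]) * np.cos(2 * x[1]),
                     -2 * np.sin(x[0]) * np.sin(2 * x[1])])

def hessian(x):
    return np.array([[-np.sin(x[0]) * np.cos(2 * x[1]), -2 * np.cos(x[0]) * np.sin(2 * x[1])],
                     [-2 * np.cos(x[0]) * np.sin(2 * x[1]), -4 * np.sin(x[0]) * np.cos(2 * x[1])]])

def numerical_hessian(x, eps=1e-6):
    d = len(x)
    H = np.zeros((d, d))
    for j in range(d):
        e = np.zeros(d)
        e[j] = eps
        H[:, j] = (grad_f(x + e) - grad_f(x - e)) / (2 * eps)  # Hessian = Jacobian of grad f
    return H

x = np.array([0.3, 0.7])
print(np.allclose(hessian(x), numerical_hessian(x)))  # True: symmetric and matches
```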



LOCAL CURVATURE BY HESSIAN
Eigenvector corresponding to largest (resp. smallest) eigenvalue of
Hessian points in direction of largest (resp. smallest) curvature

Example (previous slide): For a = (−π/2, 0)T , we have

H(a) = \begin{pmatrix} 1 & 0 \\ 0 & 4 \end{pmatrix}

and thus λ1 = 4, λ2 = 1, v1 = (0, 1)T , and v2 = (1, 0)T .
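A sketch of this eigenanalysis (assuming Python/NumPy): evaluate the Hessian from the previous slide at a = (−π/2, 0) and read off the eigenvector of the largest eigenvalue:

```python
import numpy as np

# Eigendecomposition of the Hessian of sin(x1)*cos(2*x2) at a = (-pi/2, 0);
# the eigenvector of the largest eigenvalue gives the direction of greatest curvature.
a = np.array([-np.pi / 2, 0.0])
H = np.array([[-np.sin(a[0]) * np.cos(2 * a[1]), -2 * np.cos(a[0]) * np.sin(2 * a[1])],
              [-2 * np.cos(a[0]) * np.sin(2 * a[1]), -4 * np.sin(a[0]) * np.cos(2 * a[1])]])
eigvals, eigvecs = np.linalg.eigh(H)   # eigh: symmetric matrices, eigenvalues ascending
print(eigvals)                         # [1. 4.]
print(eigvecs[:, np.argmax(eigvals)])  # [0. 1.]: direction of largest curvature
```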



LIPSCHITZ CONTINUITY
Function h : S → Rm is Lipschitz continuous if slopes are bounded:

∥h(x) − h(y)∥ ≤ L∥x − y∥ for each x, y ∈ S and some L > 0

Examples (d = m = 1): sin(x ), |x |



Not examples: 1/x (but locally Lipschitz continuous), √x
If m = d and h differentiable:
h Lipschitz continuous with constant L ⇐⇒ Jh ≼ L · Id
Note: A ≼ B :⇐⇒ B − A is positive semidefinite, i.e., vT (B − A)v ≥ 0 ∀v ̸= 0

Proof of “⇒” for d = m = 1:

h'(x) = \lim_{\epsilon \to 0} \frac{h(x + \epsilon) - h(x)}{\epsilon} \le \lim_{\epsilon \to 0} \underbrace{\frac{|h(x + \epsilon) - h(x)|}{|\epsilon|}}_{\le L} \le \lim_{\epsilon \to 0} L = L

[Proof of “⇐” by mean value theorem: Show that λmax (Jh ) ≤ L.]
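A quick numerical illustration (assuming Python/NumPy; the sample points are arbitrary): the slopes of sin(x) stay bounded by L = 1, while the slopes of the square root near 0 grow without bound, so no global Lipschitz constant exists:

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.uniform(-10, 10, size=1000)
y = rng.uniform(-10, 10, size=1000)

# Empirical slopes |h(x) - h(y)| / |x - y| for h = sin: bounded by L = 1.
slopes_sin = np.abs(np.sin(x) - np.sin(y)) / np.abs(x - y)
print(slopes_sin.max())   # <= 1

# For h(t) = sqrt(t), the slopes near 0 are unbounded, so no global L exists.
t = np.array([1e-2, 1e-4, 1e-8])
print(np.abs(np.sqrt(t) - np.sqrt(0.0)) / np.abs(t - 0.0))   # 10, 100, 10000
```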



LIPSCHITZ GRADIENTS
Let f ∈ C 2 . Since ∇2 f is Jacobian of h = ∇f (m = d ):

∇f Lipschitz continuous with constant L ⇐⇒ ∇2 f ≼ L · Id


Equivalently, eigenvalues of ∇2 f are bounded by L
Interpretation: Curvature in any direction is bounded by L
Lipschitz gradients occur frequently in machine learning
=⇒ Fairly weak assumption
Important for analysis of gradient descent optimization
=⇒ Descent lemma (later)
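A sketch of this equivalence for a quadratic (assuming Python/NumPy; f(x) = x1²/2 + x1 x2 + x2² is reused from the gradient slide, so ∇f(x) = Hx with a constant Hessian H): the largest eigenvalue of H serves as a Lipschitz constant for ∇f:

```python
import numpy as np

# For the quadratic f(x) = x1^2/2 + x1*x2 + x2^2, grad f(x) = H x with constant
# Hessian H, so the largest eigenvalue of H is a Lipschitz constant of grad f.
H = np.array([[1.0, 1.0],
              [1.0, 2.0]])           # Hessian of the example f
L = np.linalg.eigvalsh(H).max()      # largest eigenvalue, ~2.618

rng = np.random.default_rng(0)
for _ in range(5):
    x, y = rng.normal(size=2), rng.normal(size=2)
    lhs = np.linalg.norm(H @ x - H @ y)   # ||grad f(x) - grad f(y)||
    rhs = L * np.linalg.norm(x - y)
    print(lhs <= rhs + 1e-12)             # True in every trial
```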

