Lec 1 - Maths in ML I
MATHEMATICS IN MACHINE LEARNING - I
ROHAN PILLAI
ASSISTANT PROFESSOR
DEPARTMENT OF ELECTRICAL ENGINEERING, DTU
Table of Contents
• Vectors
• Matrices
• Extrema
• Derivatives
• Stationary Points
• Example
❑ References
Mathematics in Data Science vs Machine Learning
1. LINEAR ALGEBRA
Tensors
Vectors
Vectors…
Vector norms: A norm ||x|| of a vector is informally a measure of the "length" of the vector.
Common norms:
o L1 (Manhattan): ||x||_1 = Σ_i |x_i|
o L2 (Euclidean): ||x||_2 = √(Σ_i x_i²)

Matrices
Frobenius norm: ||A||_F = √(Σ_{i,j} A_{ij}²)
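A quick NumPy check of these norms (an illustrative sketch; the vector and matrix values below are made up):

```python
import numpy as np

x = np.array([3.0, -4.0])
A = np.array([[1.0, 2.0],
              [3.0, 4.0]])

l1 = np.abs(x).sum()           # L1 (Manhattan): sum of absolute values -> 7.0
l2 = np.sqrt((x ** 2).sum())   # L2 (Euclidean): sqrt of sum of squares -> 5.0
fro = np.sqrt((A ** 2).sum())  # Frobenius norm of the matrix -> sqrt(30)

# np.linalg.norm computes the same quantities
print(l1, np.linalg.norm(x, 1))
print(l2, np.linalg.norm(x, 2))
print(fro, np.linalg.norm(A, 'fro'))
```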
Vectors…
Geometric interpretation of a dot product: the projection of one vector onto the other
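As a concrete sketch (the vectors are arbitrary, not from the slides): the scalar projection of x onto y is (x·y)/||y||, and the projection vector points along y.

```python
import numpy as np

x = np.array([2.0, 3.0])
y = np.array([4.0, 0.0])

dot = x @ y                                   # dot product -> 8.0
scalar_proj = dot / np.linalg.norm(y)         # length of x's "shadow" on y -> 2.0
vector_proj = (dot / (y @ y)) * y             # projection of x along y -> [2., 0.]

print(dot, scalar_proj, vector_proj)
```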
Matrices
An N×M matrix has N rows and M columns, i.e. it is a 2-D array of numbers. Here is an
example matrix, A, with 3 rows and 2 columns:

    [1 2]
A = [3 4]
    [5 6]
Linear transformations can be represented as matrices.
Multiplying a matrix with a vector has the geometric meaning of applying a linear
transformation to the vector
Multiplying two matrices has the geometric meaning of applying one linear
transformation after another
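A small NumPy sketch of this idea (the rotation and scaling matrices are made-up examples): multiplying by a matrix applies the transformation, and the matrix product encodes the composition.

```python
import numpy as np

theta = np.pi / 2
R = np.array([[np.cos(theta), -np.sin(theta)],
              [np.sin(theta),  np.cos(theta)]])   # rotate by 90 degrees
S = np.array([[2.0, 0.0],
              [0.0, 2.0]])                        # scale by 2

v = np.array([1.0, 0.0])

print(R @ v)        # one transformation applied to v: ~[0., 1.]
print(S @ (R @ v))  # rotate, then scale: ~[0., 2.]
print((S @ R) @ v)  # same result: the matrix product S @ R is the composition
```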
Matrices…
Special Matrices
Diagonal:
[a 0 0]
[0 b 0]
[0 0 c]

Upper triangular:
[a b c]
[0 d e]
[0 0 f]

Lower triangular:
[a 0 0]
[b c 0]
[d e f]

Symmetric: A = Aᵀ
Transpose of a Matrix
Example:

[a]ᵀ
[b]  = (a  b)

[a b]ᵀ   [a c]
[c d]  = [b d]
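A minimal NumPy illustration of the same operation (the values are arbitrary):

```python
import numpy as np

A = np.array([[1, 2],
              [3, 4]])
print(A.T)            # [[1 3], [2 4]] -- rows become columns

v = np.array([[1],
              [2]])   # a 2x1 column vector
print(v.T)            # [[1 2]] -- a 1x2 row vector
```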
Inverse of a Matrix
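The slide body is a figure in the original; as a reminder, the inverse A⁻¹ of a square matrix satisfies A·A⁻¹ = I. A minimal NumPy sketch (the matrix is a made-up example):

```python
import numpy as np

A = np.array([[4.0, 7.0],
              [2.0, 6.0]])     # invertible, det(A) = 10

A_inv = np.linalg.inv(A)
print(A_inv)                   # [[ 0.6 -0.7], [-0.2  0.4]]
print(A @ A_inv)               # the identity matrix, up to float error
```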
Matrix Differentiation
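The worked derivations on the following slides are figures in the original. As one standard identity of this kind (an assumption about the topic, not the slides' own example), ∇ₓ(xᵀAx) = (A + Aᵀ)x can be checked numerically:

```python
import numpy as np

# Check the identity  grad_x (x^T A x) = (A + A^T) x  by central differences
rng = np.random.default_rng(0)
A = rng.normal(size=(3, 3))
x = rng.normal(size=3)

f = lambda v: v @ A @ v

eps = 1e-6
numeric_grad = np.array([(f(x + eps * e) - f(x - eps * e)) / (2 * eps)
                         for e in np.eye(3)])
analytic_grad = (A + A.T) @ x

print(np.allclose(numeric_grad, analytic_grad, atol=1e-5))   # True
```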
Matrix Differentiation…
2. CALCULUS REVISITED
Convex Sets
A set C ⊆ ℝᵈ is convex if:
∀ x, y ∈ C
∀ λ ∈ [0,1]
z = λ·x + (1 − λ)·y ∈ C
Convex Functions
A function f : ℝᵈ → ℝ is convex if:
∀ x, y ∈ C
∀ λ ∈ [0,1]
z = λ·x + (1 − λ)·y
f(z) ≤ λ·f(x) + (1 − λ)·f(y)
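A quick numerical sanity check of this inequality for f(x) = x², a convex function (a made-up example, not from the slides):

```python
import numpy as np

f = lambda t: t ** 2          # a convex function
x, y = -1.5, 2.0

for lam in np.linspace(0.0, 1.0, 5):
    z = lam * x + (1 - lam) * y
    lhs = f(z)
    rhs = lam * f(x) + (1 - lam) * f(y)
    print(f"lambda={lam:.2f}: f(z)={lhs:6.3f} <= {rhs:6.3f} : {lhs <= rhs + 1e-12}")
```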
Extrema
Since we always seek the “best” values of a function, usually we are looking for the
maxima or the minima of a function
Global extrema: a point which achieves the
best value of the function (max/min) among
all the possible points
Local extrema: a point which achieves the
best value of the function only in a small
region surrounding that point
Most machine learning algorithms love to find the global extrema
o E.g. we saw that SVM wanted to find the model with max margin
Sometimes this is difficult, so we settle for local extrema (e.g. deep nets)
Derivatives
Sum Rule: (f(x) + g(x))′ = f′(x) + g′(x)
Scaling Rule: (a · f(x))′ = a · f′(x), if a is not a function of x
Product Rule: (f(x) · g(x))′ = f′(x) · g(x) + g′(x) · f(x)
Quotient Rule: (f(x) / g(x))′ = (f′(x) · g(x) − g′(x) · f(x)) / g(x)²
Chain Rule: (f(g(x)))′ ≝ (f ∘ g)′(x) = f′(g(x)) · g′(x)
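A short SymPy check of the product and chain rules on concrete functions (an illustrative sketch; the functions are arbitrary):

```python
import sympy as sp

x = sp.symbols('x')
f = sp.sin(x)
g = x ** 2 + 1

# Product rule: (f*g)' == f'*g + g'*f
lhs = sp.diff(f * g, x)
rhs = sp.diff(f, x) * g + sp.diff(g, x) * f
print(sp.simplify(lhs - rhs))        # 0

# Chain rule: (f(g(x)))' == f'(g(x)) * g'(x)
lhs = sp.diff(f.subs(x, g), x)
rhs = sp.diff(f, x).subs(x, g) * sp.diff(g, x)
print(sp.simplify(lhs - rhs))        # 0
```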
Gradient
The Hessian can be seen as the gradient (Jacobian) of the gradient (which is a vector)
For f : ℝⁿ → ℝᵐ, the Jacobian is an m × n matrix with entries J_{i,j} = ∂f_i(x) / ∂x_j
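A small finite-difference sketch of a Jacobian for f : ℝ² → ℝ² (the function is a made-up example):

```python
import numpy as np

def f(v):
    # f : R^2 -> R^2, a made-up example
    return np.array([v[0] ** 2 * v[1],
                     5 * v[0] + np.sin(v[1])])

def jacobian(f, x, eps=1e-6):
    # J[i, j] = d f_i / d x_j, estimated with central differences
    m, n = f(x).size, x.size
    J = np.zeros((m, n))
    for j in range(n):
        e = np.zeros(n)
        e[j] = eps
        J[:, j] = (f(x + e) - f(x - e)) / (2 * eps)
    return J

x0 = np.array([1.0, 2.0])
print(jacobian(f, x0))
# analytic value at (1, 2): [[2*x*y, x^2], [5, cos(y)]] = [[4, 1], [5, -0.416...]]
```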
Stationary Points
These are places where the derivative vanishes, i.e. is 0
These can be local/global extrema or saddle points
Just as the sign of the derivative tells us if the function is increasing or decreasing at a given
point, the 2nd derivative tells us if the derivative is increasing or decreasing
If f′′(x) < 0 and f′(x) = 0 then the derivative moves from +ve to −ve around this point – local/global max!
If f′′(x) > 0 and f′(x) = 0 then the derivative moves from −ve to +ve around this point – local/global min!
If f′′(x) = 0 and f′(x) = 0 then this may be an extremum/saddle – higher derivatives e.g. f′′′(x) needed
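A small SymPy sketch of this 1-D second-derivative test (the function is a made-up example):

```python
import sympy as sp

x = sp.symbols('x')
f = x ** 3 - 3 * x                        # a made-up example function

for p in sp.solve(sp.diff(f, x), x):      # points where f'(x) = 0
    second = sp.diff(f, x, 2).subs(x, p)  # value of f''(x) at the point
    kind = "max" if second < 0 else "min" if second > 0 else "inconclusive"
    print(p, kind)
# -1 max  (f''(-1) = -6 < 0)
#  1 min  (f''( 1) =  6 > 0)
```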
Stationary Points in higher dimensions
These are places where the gradient vanishes, i.e. is a zero vector!
We can still find out if a stationary point is a saddle or an extremum using the 2nd derivative test,
just as in 1D
A bit more complicated to visualize, but the Hessian tells us how the surface of the
function is curved at a point
If 𝛻𝑓(𝐱) = 𝟎 and 𝛻²𝑓(𝐱) is a PD matrix, then 𝐱 is a local/global min
If 𝛻𝑓(𝐱) = 𝟎 and 𝛻²𝑓(𝐱) is an ND matrix, then 𝐱 is a local/global max
If neither of these is true, then either 𝐱 is a saddle point or the test fails; higher-order
derivatives are needed to verify
Whether the point is a saddle or the test has failed depends on the eigenvalues of 𝛻²𝑓(𝐱)
Recall that if a matrix satisfies 𝐱ᵀ𝐴𝐱 > 0 for all non-zero 𝐱 ∈ ℝᵈ then it is called positive definite (PD)
If a square d × d symmetric matrix A satisfies 𝐱ᵀ𝐴𝐱 < 0 for all non-zero 𝐱 ∈ ℝᵈ then it is negative definite (ND)
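A minimal NumPy sketch of this higher-dimensional test via Hessian eigenvalues (the function f(x, y) = x² − y² is a made-up example):

```python
import numpy as np

# 2nd derivative test for f(x, y) = x^2 - y^2 at its stationary point (0, 0)
# (a made-up example: the gradient (2x, -2y) vanishes there)
H = np.array([[2.0,  0.0],
              [0.0, -2.0]])           # Hessian of f (constant for this f)

eig = np.linalg.eigvalsh(H)           # eigenvalues of the symmetric Hessian
if np.all(eig > 0):
    print("PD -> local/global min")
elif np.all(eig < 0):
    print("ND -> local/global max")
elif np.any(eig > 0) and np.any(eig < 0):
    print("indefinite -> saddle point")   # this branch fires for x^2 - y^2
else:
    print("test inconclusive -> need higher-order derivatives")
```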
Example – Function Values
Example – Gradients
In this discrete example, we can calculate the gradient at a point (x₀, y₀) as
o 𝛻f(x₀, y₀) = (Δf/Δx, Δf/Δy), where
o Δf/Δx = (f(x₀+1, y₀) − f(x₀−1, y₀)) / 2
o Δf/Δy = (f(x₀, y₀+1) − f(x₀, y₀−1)) / 2
o We can visualize these gradients using simple arrows as well
Gradients converge toward a local max; gradients diverge away from a local min
At saddle points, both can happen along different axes
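A minimal NumPy version of this finite-difference gradient on a grid of function values (the grid and function are a made-up example, not the slide's data):

```python
import numpy as np

# f(x, y) = x^2 - y^2 sampled on an integer grid (a made-up example)
xs, ys = np.arange(-3, 4), np.arange(-3, 4)
F = np.array([[x ** 2 - y ** 2 for x in xs] for y in ys])   # F[j, i] = f(xs[i], ys[j])

def discrete_grad(F, i, j):
    # central differences with unit spacing, as on the slide
    dfdx = (F[j, i + 1] - F[j, i - 1]) / 2
    dfdy = (F[j + 1, i] - F[j - 1, i]) / 2
    return np.array([dfdx, dfdy])

# gradient at (x0, y0) = (1, 2); analytic value is (2*x0, -2*y0) = (2, -4)
i, j = 1 - xs[0], 2 - ys[0]          # grid indices for x0 = 1, y0 = 2
print(discrete_grad(F, i, j))        # [ 2. -4.]
```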
Example – Hessians
References
https://github.com/purushottamkar/ml19-20w/tree/master/lecture_slides
https://www.analyticsvidhya.com/blog/2019/10/mathematics-behind-machine-learning/
https://tminka.github.io/papers/matrix/minka-matrix.pdf