Lec 1 - Maths in ML I

The document provides an overview of key mathematics concepts used in machine learning, including: 1) Linear algebra topics like tensors, vectors, matrices, and matrix operations. 2) Calculus topics like convex sets and functions, extrema, derivatives, and the rules for computing derivatives. 3) How these linear algebra and calculus concepts are applied in machine learning algorithms and modeling.



MATHEMATICS IN
MACHINE LEARNING -I
ROHAN PILLAI
ASSISTANT PROFESSOR
DEPARTMENT OF ELECTRICAL ENGINEERING, DTU
Table of Contents

❑ Mathematics in Data Science vs Machine Learning

❑ LINEAR ALGEBRA
• Tensors
• Vectors
• Matrices
• Special Matrices
• Transpose of a Matrix
• Inverse of a Matrix
• Matrix Differentiation

❑ CALCULUS REVISITED
• Convex Sets
• Convex Functions
• Extrema
• Derivatives
• Rules of Derivatives
• Gradient
• Hessian
• Jacobian
• Stationary Points
• Stationary Points in Higher Dimensions
• Example

❑ References
Mathematics in Data Science vs Machine Learning

How can we learn mathematics without getting bogged down in the theory?

Mathematics in data science and machine learning is not about crunching numbers, but about understanding what is happening, why it is happening, and how we can play around with different things to obtain the results we want.


1. LINEAR ALGEBRA

Tensors

 A tensor is an array of numbers that may be:
❖ a scalar (0th Order Tensor)
❖ a vector (1st Order Tensor)
❖ a matrix (2nd Order Tensor)
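A minimal NumPy sketch (an added illustration, not from the original slides) of the three tensor orders listed above:

```python
import numpy as np

scalar = np.array(3.0)               # 0th order tensor: a single number
vector = np.array([1, 5, 3, 4, 2])   # 1st order tensor: an ordered list of numbers
matrix = np.array([[1, 2],
                   [3, 4],
                   [5, 6]])          # 2nd order tensor: a 2-D array of numbers

# ndim reports the order (number of axes) of each tensor
print(scalar.ndim, vector.ndim, matrix.ndim)  # 0 1 2
```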

Vectors

 A vector in ℝⁿ is an ordered set of n real numbers.
o e.g. v = (1, 5, 3, 4, 2) is in ℝ⁵
o A column vector is written vertically, e.g. v = (1, 5, 3, 4, 2)ᵀ, a 5 × 1 array
o A row vector is written horizontally, e.g. (1 5 3 4 2), a 1 × 5 array
 Vector operations: scalar multiplication and vector addition
 Any point on the plane containing the vectors u and v is some linear combination au + bv

Vectors…

 Vector norms: a norm ||x|| is informally a measure of the "length" of the vector.

 Common norms:
o L1 (Manhattan): ||x||₁ = Σᵢ |xᵢ|
o L2 (Euclidean): ||x||₂ = √(Σᵢ xᵢ²)

 For matrices, the Frobenius norm plays a similar role: ||A||_F = √(Σᵢ,ⱼ Aᵢⱼ²)
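A short NumPy sketch (added for illustration) computing the norms above, with np.linalg.norm as a cross-check:

```python
import numpy as np

x = np.array([1.0, -5.0, 3.0, 4.0, -2.0])
A = np.array([[1.0, 2.0], [3.0, 4.0]])

l1 = np.sum(np.abs(x))          # L1 (Manhattan) norm
l2 = np.sqrt(np.sum(x ** 2))    # L2 (Euclidean) norm
fro = np.sqrt(np.sum(A ** 2))   # Frobenius norm of a matrix

# np.linalg.norm computes the same quantities
assert np.isclose(l1, np.linalg.norm(x, 1))
assert np.isclose(l2, np.linalg.norm(x, 2))
assert np.isclose(fro, np.linalg.norm(A, 'fro'))
```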

Vectors…

 Vector dot (inner) product: u•v = Σᵢ uᵢvᵢ = ||u||₂ ||v||₂ cos θ

 Geometric interpretation of a dot product: the projection of one vector upon the other

❑ If u•v = 0, ||u||₂ ≠ 0, ||v||₂ ≠ 0 → u and v are orthogonal

❑ If u•v = 0, ||u||₂ = 1, ||v||₂ = 1 → u and v are orthonormal
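A small NumPy sketch (added illustration) of the dot product and the orthogonality/orthonormality checks above:

```python
import numpy as np

u = np.array([1.0, 0.0, 1.0])
v = np.array([0.0, 2.0, 0.0])

dot = u @ v                       # same as np.dot(u, v): sum of element-wise products
cos_theta = dot / (np.linalg.norm(u) * np.linalg.norm(v))

print(dot)        # 0.0 -> u and v are orthogonal
print(cos_theta)  # 0.0 -> the angle between them is 90 degrees

# Normalising both vectors to unit length turns an orthogonal pair into an orthonormal pair
u_hat, v_hat = u / np.linalg.norm(u), v / np.linalg.norm(v)
print(u_hat @ v_hat, np.linalg.norm(u_hat), np.linalg.norm(v_hat))  # 0.0 1.0 1.0
```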

Matrices

 An N×M matrix has N rows and M columns, i.e. it is a 2-D array of numbers. Here is an example matrix A with 3 rows and 2 columns:

      1 2
A =   3 4
      5 6
 Linear transformations can be represented as matrices.
 Multiplying a matrix with a vector has the geometric meaning of applying a linear
transformation to the vector
 Multiplying two matrices has the geometric meaning of applying one linear
transformation after another
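To make the linear-transformation reading concrete, here is a small NumPy sketch (an added illustration; the rotation/scaling choice is arbitrary) that applies a 90° rotation to a vector, then composes two transformations by multiplying their matrices:

```python
import numpy as np

# 2-D rotation by 90 degrees counter-clockwise, represented as a matrix
R = np.array([[0.0, -1.0],
              [1.0,  0.0]])
S = np.diag([2.0, 2.0])   # uniform scaling by 2

v = np.array([1.0, 0.0])

print(R @ v)          # [0. 1.]  -> the vector rotated by 90 degrees
print(S @ (R @ v))    # [0. 2.]  -> rotate first, then scale
print((S @ R) @ v)    # [0. 2.]  -> composing the matrices gives the same transformation
```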

Matrices…

 Matrix Multiplication: C = AB
▪ Cᵢⱼ = Σₖ Aᵢₖ Bₖⱼ
▪ Sizes must match (columns of A = rows of B)

 Hadamard Product: C = A ⊙ B
▪ Element-wise multiplication
▪ A, B, C must all be of the same size
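A quick NumPy sketch (added illustration) contrasting the matrix product with the Hadamard product:

```python
import numpy as np

A = np.array([[1, 2],
              [3, 4]])
B = np.array([[5, 6],
              [7, 8]])

C_matmul = A @ B     # matrix product: C[i, j] = sum_k A[i, k] * B[k, j]
C_hadamard = A * B   # Hadamard (element-wise) product: C[i, j] = A[i, j] * B[i, j]

print(C_matmul)      # [[19 22], [43 50]]
print(C_hadamard)    # [[ 5 12], [21 32]]
```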

Special Matrices

Diagonal matrix, e.g.:
  a 0 0
  0 b 0
  0 0 c

Upper triangular matrix, e.g.:
  a b c
  0 d e
  0 0 f

Lower triangular matrix, e.g.:
  a 0 0
  b c 0
  d e f

Symmetric matrix: A = Aᵀ

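A brief NumPy sketch (added illustration) constructing the special matrices above:

```python
import numpy as np

D = np.diag([1.0, 2.0, 3.0])    # diagonal matrix
M = np.arange(1.0, 10.0).reshape(3, 3)
U = np.triu(M)                  # upper triangular part of M
L = np.tril(M)                  # lower triangular part of M
S = M + M.T                     # a simple way to build a symmetric matrix

print(np.allclose(S, S.T))      # True: S equals its own transpose
```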
Transpose of a Matrix

 Transpose of a Matrix: we can think of it as
o "flipping" the rows and columns
OR
o "reflecting" the vector/matrix along a line (the main diagonal)

 Example:
  [a]ᵀ              [a b]ᵀ   [a c]
  [b]  = (a b)      [c d]  = [b d]

Inverse of a matrix

 Inverse of a square matrix A, denoted by A⁻¹, is the unique matrix s.t.
o AA⁻¹ = A⁻¹A = I (identity matrix)
 If A⁻¹ and B⁻¹ exist, then
o (AB)⁻¹ = B⁻¹A⁻¹
o (Aᵀ)⁻¹ = (A⁻¹)ᵀ
 For orthogonal matrices: A⁻¹ = Aᵀ
 For diagonal matrices: D⁻¹ = diag{d₁⁻¹, …, dₙ⁻¹}

 Pseudo-inverse of a matrix A ∈ ℝ^(m×n) is defined as A⁺ ∈ ℝ^(n×m), where
o A⁺ = (AᵀA)⁻¹Aᵀ
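A NumPy sketch (added illustration) checking the inverse and the pseudo-inverse formula A⁺ = (AᵀA)⁻¹Aᵀ on a full-column-rank matrix:

```python
import numpy as np

A = np.array([[2.0, 1.0],
              [1.0, 3.0]])
A_inv = np.linalg.inv(A)
print(np.allclose(A @ A_inv, np.eye(2)))   # True: A A^{-1} = I

# Pseudo-inverse of a tall, full-column-rank matrix: A+ = (A^T A)^{-1} A^T
B = np.array([[1.0, 0.0],
              [0.0, 1.0],
              [1.0, 1.0]])
B_pinv_formula = np.linalg.inv(B.T @ B) @ B.T
print(np.allclose(B_pinv_formula, np.linalg.pinv(B)))  # True for full-rank B
```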

Matrix Differentiation
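A few standard matrix-differentiation identities, listed here as a hedged reference (these are textbook results and may differ from the exact examples worked on the lecture slides); here x, a ∈ ℝⁿ and A, X are square matrices:

o ∂(aᵀx)/∂x = a
o ∂(xᵀAx)/∂x = (A + Aᵀ)x, which reduces to 2Ax when A = Aᵀ
o ∂(||x||₂²)/∂x = 2x
o ∂ tr(AX)/∂X = Aᵀ
o ∂ log det(X)/∂X = (X⁻¹)ᵀ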


2. CALCULUS REVISITED

Convex Sets

A set C ⊆ ℝᵈ is convex if
∀ x, y ∈ C, ∀ λ ∈ [0, 1],
z = λ·x + (1 − λ)·y ∈ C
i.e. the line segment joining any two points of C lies entirely inside C.

Convex Functions

A function f : ℝᵈ → ℝ is convex if
∀ x, y ∈ C, ∀ λ ∈ [0, 1], with z = λ·x + (1 − λ)·y,
f(z) ≤ λ·f(x) + (1 − λ)·f(y)
i.e. the function never lies above the chord joining (x, f(x)) and (y, f(y)).
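A small numerical sketch (added illustration; the function choices are arbitrary) that tests the defining inequality for f(x) = x², which is convex, and sin x, which is not:

```python
import numpy as np

def violates_convexity(f, x, y, lambdas):
    """Return True if f(lam*x + (1-lam)*y) > lam*f(x) + (1-lam)*f(y) for some lam."""
    for lam in lambdas:
        z = lam * x + (1 - lam) * y
        if f(z) > lam * f(x) + (1 - lam) * f(y) + 1e-12:
            return True
    return False

lambdas = np.linspace(0.0, 1.0, 101)
print(violates_convexity(lambda t: t ** 2, -2.0, 3.0, lambdas))   # False: x^2 is convex
print(violates_convexity(np.sin, 0.0, 3 * np.pi, lambdas))        # True: sin is not convex
```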

Extrema

 Since we always seek the “best” values of a function, usually we are looking for the
maxima or the minima of a function
 Global extrema: a point which achieves the
best value of the function (max/min) among
all the possible points
 Local extrema: a point which achieves the
best value of the function only in a small
region surrounding that point
 Most machine learning algorithms love to find the global extrema
o E.g. we saw that SVM wanted to find the model with max margin
 Sometimes it is difficult so we settle for local extrema (e.g. deepnets)

Derivatives

 Derivatives measure how much one variable (y = f(x)) changes when there is a small change in the other variable (x)
o f′(x) = dy/dx
 It is zero when the function is flat (horizontal), such as at the minimum or maximum of f(x)
 It is positive when f(x) is strictly increasing and negative when f(x) is strictly decreasing
 Double derivatives:
o Double derivative f″(x) OR y″ = d²y/dx²
o It is positive for convex functions (having a single minimum) and negative for concave functions (having a single maximum)
 In higher dimensions (functions of many variables / vectors), we use the idea of partial derivatives
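A finite-difference sketch (added illustration; the helper names are made up for this example) of the first and second derivatives of f(x) = x² − 4x:

```python
def derivative(f, x, h=1e-5):
    """Central finite-difference approximation of f'(x)."""
    return (f(x + h) - f(x - h)) / (2 * h)

def second_derivative(f, x, h=1e-4):
    """Finite-difference approximation of f''(x)."""
    return (f(x + h) - 2 * f(x) + f(x - h)) / h ** 2

f = lambda x: x ** 2 - 4 * x

print(derivative(f, 2.0))         # ~0.0  -> flat at the minimum x = 2
print(derivative(f, 5.0))         # ~6.0  -> positive: f is increasing at x = 5
print(second_derivative(f, 2.0))  # ~2.0  -> positive: f is convex
```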
Rules of Derivatives


 Sum Rule: (f(x) + g(x))′ = f′(x) + g′(x)

 Scaling Rule: (a · f(x))′ = a · f′(x), if a is not a function of x

 Product Rule: (f(x) · g(x))′ = f′(x) · g(x) + g′(x) · f(x)

 Quotient Rule: (f(x) / g(x))′ = (f′(x) · g(x) − g′(x) · f(x)) / g(x)²

 Chain Rule: (f(g(x)))′ ≝ (f∘g)′(x) = f′(g(x)) · g′(x)
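As an added, optional check (illustration only), SymPy can confirm the product and chain rules symbolically:

```python
import sympy as sp

x = sp.symbols('x')
f = sp.sin(x)
g = x ** 2 + 1

# Product rule: (f*g)' == f'*g + g'*f
lhs = sp.diff(f * g, x)
rhs = sp.diff(f, x) * g + sp.diff(g, x) * f
print(sp.simplify(lhs - rhs) == 0)   # True

# Chain rule: (f(g(x)))' == f'(g(x)) * g'(x)
lhs = sp.diff(f.subs(x, g), x)
rhs = sp.cos(g) * sp.diff(g, x)      # f'(u) = cos(u), evaluated at u = g(x)
print(sp.simplify(lhs - rhs) == 0)   # True
```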

Gradient

 Gradient ∶ Multivariate generalisation of the derivative


 For multivariate functions with 𝑑-dim inputs, the
gradient simply records how much the function would
change if we move a little bit along each one of the 𝑑
axes!
 ∇f(x) = (∂f/∂x₁, ∂f/∂x₂, …, ∂f/∂x_d)ᵀ

 The gradient also has the distinction of offering the direction of steepest ascent, i.e. if we want the maximum increase in function value, we must move a little bit along the gradient. Similarly, moving a little bit in the direction opposite to the gradient gives the maximum decrease in function value, i.e. the negative gradient offers the steepest descent
 At minima or maxima, the gradient is a zero vector
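A finite-difference gradient sketch (added illustration, using f(x, y) = x² + 3y² as a toy function), including one steepest-descent step:

```python
import numpy as np

def numerical_gradient(f, x, h=1e-5):
    """Central finite-difference approximation of the gradient of f at x."""
    grad = np.zeros_like(x)
    for i in range(len(x)):
        e = np.zeros_like(x)
        e[i] = h
        grad[i] = (f(x + e) - f(x - e)) / (2 * h)
    return grad

f = lambda v: v[0] ** 2 + 3 * v[1] ** 2

x = np.array([1.0, 2.0])
g = numerical_gradient(f, x)
print(g)                          # ~[2. 12.]: the analytic gradient is (2x, 6y)

step = 0.1
print(f(x), f(x - step * g))      # moving against the gradient decreases f
print(numerical_gradient(f, np.array([0.0, 0.0])))  # ~[0. 0.] at the minimum
```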
Hessian

 Hessian : Gradient of the gradient


 The 2nd derivative of f : ℝⁿ → ℝ is an n × n matrix called the Hessian:

            ∂²f/∂x₁²      ∂²f/∂x₁∂x₂   …   ∂²f/∂x₁∂xₙ
 ∇²f(x) =   ∂²f/∂x₂∂x₁    ∂²f/∂x₂²     …   ∂²f/∂x₂∂xₙ
                ⋮              ⋮       ⋱        ⋮
            ∂²f/∂xₙ∂x₁    ∂²f/∂xₙ∂x₂   …   ∂²f/∂xₙ²
 Note that the Hessian matrix is Symmetric
 It is the equivalent of 2nd derivative in scalar calculus and
has similar uses
 All rules of derivatives (chain, product etc) apply here as
well
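A finite-difference Hessian sketch for f(x, y) = x² + 3y² (added illustration); note the symmetry of the result:

```python
import numpy as np

def numerical_hessian(f, x, h=1e-4):
    """Finite-difference approximation of the Hessian of f at x."""
    n = len(x)
    H = np.zeros((n, n))
    for i in range(n):
        for j in range(n):
            ei, ej = np.zeros(n), np.zeros(n)
            ei[i], ej[j] = h, h
            H[i, j] = (f(x + ei + ej) - f(x + ei - ej)
                       - f(x - ei + ej) + f(x - ei - ej)) / (4 * h ** 2)
    return H

f = lambda v: v[0] ** 2 + 3 * v[1] ** 2
H = numerical_hessian(f, np.array([1.0, 2.0]))
print(np.round(H, 3))        # ~[[2. 0.], [0. 6.]]
print(np.allclose(H, H.T))   # True: the Hessian is symmetric
```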
Jacobian

 Jacobian is the equivalent of the gradient for vector valued functions

 Hessian can be seen as the gradient (Jacobian) of the gradient (which is a vector)

 For f : ℝⁿ → ℝᵐ, the Jacobian is an m × n matrix with entries Jᵢⱼ = ∂fᵢ/∂xⱼ:

        ∂f₁/∂x₁   ∂f₁/∂x₂   …   ∂f₁/∂xₙ
 J =    ∂f₂/∂x₁   ∂f₂/∂x₂   …   ∂f₂/∂xₙ
           ⋮         ⋮      ⋱      ⋮
        ∂fₘ/∂x₁   ∂fₘ/∂x₂   …   ∂fₘ/∂xₙ
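A finite-difference Jacobian sketch (added illustration) for a toy vector-valued map f : ℝ² → ℝ³:

```python
import numpy as np

def numerical_jacobian(f, x, h=1e-5):
    """Central finite-difference approximation of the Jacobian of f at x."""
    fx = f(x)
    J = np.zeros((len(fx), len(x)))
    for j in range(len(x)):
        e = np.zeros_like(x)
        e[j] = h
        J[:, j] = (f(x + e) - f(x - e)) / (2 * h)
    return J

# f maps (x, y) to (x*y, x + y, x**2); its analytic Jacobian is [[y, x], [1, 1], [2x, 0]]
f = lambda v: np.array([v[0] * v[1], v[0] + v[1], v[0] ** 2])
print(np.round(numerical_jacobian(f, np.array([2.0, 3.0])), 3))
# ~[[3. 2.], [1. 1.], [4. 0.]]
```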

Stationary Points
 These are places where the derivative vanishes, i.e. is 0
 These can be local/global extrema or saddle points
 The derivative being zero is the function's way of telling us that at that point, the function looks flat
 Saddle points can be tedious in ML
 We can find out if a stationary point is a saddle or an extremum using the 2nd derivative:
o If f′(x) = 0 and f″(x) > 0, the derivative moves from -ve to +ve around this point – local/global min!
o If f′(x) = 0 and f″(x) < 0, the derivative moves from +ve to -ve around this point – local/global max!
o If f′(x) = 0 and f″(x) = 0, this may be an extremum or a saddle – higher derivatives (e.g. f‴(x)) are needed
 Just as the sign of the derivative tells us if the function is increasing or decreasing at a given point, the 2nd derivative tells us if the derivative is increasing or decreasing
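A small sketch (added illustration; the helper name is made up) applying this 2nd-derivative test to f(x) = x³ − 3x, whose stationary points are x = ±1:

```python
def classify_stationary_point(f1, f2, x, tol=1e-8):
    """Classify a stationary point from the values of f'(x) and f''(x)."""
    assert abs(f1(x)) < tol, "not a stationary point"
    if f2(x) > tol:
        return "local/global min"
    if f2(x) < -tol:
        return "local/global max"
    return "inconclusive: need higher-order derivatives"

# f(x) = x^3 - 3x, so f'(x) = 3x^2 - 3 and f''(x) = 6x
f1 = lambda x: 3 * x ** 2 - 3
f2 = lambda x: 6 * x

print(classify_stationary_point(f1, f2, 1.0))    # local/global min (f'' = 6 > 0)
print(classify_stationary_point(f1, f2, -1.0))   # local/global max (f'' = -6 < 0)
```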

Stationary Points in higher dimensions
 These are places where the gradient vanishes, i.e. is a zero vector!
 We can still find out if a stationary point is a saddle or an extremum using the 2nd derivative test, just as in 1D
 A bit more complicated to visualize, but the Hessian tells us how the surface of the function is curved at a point
 Recall that if a square d × d symmetric matrix A satisfies xᵀAx > 0 for all non-zero x ∈ ℝᵈ, it is called positive definite (PD); if xᵀAx < 0 for all non-zero x ∈ ℝᵈ, it is negative definite (ND)
 If ∇f(x) = 0 and ∇²f(x) is a PD matrix, then x is a local/global min
 If ∇f(x) = 0 and ∇²f(x) is an ND matrix, then x is a local/global max
 If neither of these is true, then either x is a saddle point or the test fails, and higher-order derivatives are needed to verify
 Whether the point is a saddle or the test has failed depends on the eigenvalues of ∇²f(x)
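A sketch (added illustration) classifying a stationary point from the eigenvalues of the Hessian, following the PD/ND test above (all positive → PD → min, all negative → ND → max, mixed signs → saddle):

```python
import numpy as np

def classify_stationary_point_nd(grad, hess, tol=1e-8):
    """Classify a stationary point of a multivariate function from its gradient and Hessian."""
    assert np.allclose(grad, 0.0, atol=tol), "not a stationary point"
    eigvals = np.linalg.eigvalsh(hess)     # Hessian is symmetric, so eigvalsh applies
    if np.all(eigvals > tol):
        return "local/global min (Hessian is PD)"
    if np.all(eigvals < -tol):
        return "local/global max (Hessian is ND)"
    if np.any(eigvals > tol) and np.any(eigvals < -tol):
        return "saddle point"
    return "test fails: need higher-order derivatives"

# f(x, y) = x^2 - y^2 has a saddle at the origin: Hessian = diag(2, -2)
print(classify_stationary_point_nd(np.zeros(2), np.diag([2.0, -2.0])))  # saddle point
print(classify_stationary_point_nd(np.zeros(2), np.diag([2.0, 6.0])))   # min
```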

Example – Function Values

 In this discrete example, we can calculate the gradient at a point (x₀, y₀) as
o ∇f(x₀, y₀) = (Δf/Δx, Δf/Δy), where
o Δf/Δx = [f(x₀+1, y₀) − f(x₀−1, y₀)] / 2
o Δf/Δy = [f(x₀, y₀+1) − f(x₀, y₀−1)] / 2

Example – Gradients

(Figure: the same grid with a local max, a local min and a saddle point marked)

 In this discrete example, we can calculate the gradient at a point (x₀, y₀) as
o ∇f(x₀, y₀) = (Δf/Δx, Δf/Δy), where
o Δf/Δx = [f(x₀+1, y₀) − f(x₀−1, y₀)] / 2
o Δf/Δy = [f(x₀, y₀+1) − f(x₀, y₀−1)] / 2

Example – Gradients

 In this discrete example, we can calculate the gradient at a point (x₀, y₀) as
o ∇f(x₀, y₀) = (Δf/Δx, Δf/Δy), where
o Δf/Δx = [f(x₀+1, y₀) − f(x₀−1, y₀)] / 2
o Δf/Δy = [f(x₀, y₀+1) − f(x₀, y₀−1)] / 2
o We can visualize these gradients using simple arrows as well

(Figure: arrow plot of the gradients – gradients converge toward a local max, diverge away from a local min, and at saddle points both can happen along different axes)

Example – Hessians

 In this discrete example, we can calculate the Hessian at (x₀, y₀) as
o ∇²f(x₀, y₀) = [Δ²f/Δx², Δ²f/ΔxΔy; Δ²f/ΔxΔy, Δ²f/Δy²], where
o Δ²f/Δx² = f(x₀+1, y₀) + f(x₀−1, y₀) − 2f(x₀, y₀)
o Δ²f/Δy² = f(x₀, y₀+1) + f(x₀, y₀−1) − 2f(x₀, y₀)
o Δ²f/ΔxΔy = (f_xy + f_yx) / 2, where
❑ f_xy = [Δf/Δx(x₀, y₀+1) − Δf/Δx(x₀, y₀−1)] / 2
❑ f_yx = [Δf/Δy(x₀+1, y₀) − Δf/Δy(x₀−1, y₀)] / 2

 Example Hessians from the grid:
o ∇²f = [−2, −0.25; −0.25, −2], which is ND, i.e. a local max
o ∇²f = [2, 0.125; 0.125, 2], which is PD, i.e. a local min
o ∇²f = [−2, −0.125; −0.125, 2], which is neither PD nor ND (it is a saddle point)
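A sketch (added illustration) implementing these discrete-difference formulas on a grid of function values; the surface f(x, y) = x² − y² is a made-up stand-in for the grid shown on the original slides:

```python
import numpy as np

def discrete_gradient(F, x0, y0):
    """Gradient on a grid F[x, y] using the central-difference formulas above."""
    dfdx = (F[x0 + 1, y0] - F[x0 - 1, y0]) / 2
    dfdy = (F[x0, y0 + 1] - F[x0, y0 - 1]) / 2
    return np.array([dfdx, dfdy])

def discrete_hessian(F, x0, y0):
    """Hessian on a grid F[x, y] using the second-difference formulas above."""
    d2x = F[x0 + 1, y0] + F[x0 - 1, y0] - 2 * F[x0, y0]
    d2y = F[x0, y0 + 1] + F[x0, y0 - 1] - 2 * F[x0, y0]
    fxy = (discrete_gradient(F, x0, y0 + 1)[0] - discrete_gradient(F, x0, y0 - 1)[0]) / 2
    fyx = (discrete_gradient(F, x0 + 1, y0)[1] - discrete_gradient(F, x0 - 1, y0)[1]) / 2
    dxy = (fxy + fyx) / 2
    return np.array([[d2x, dxy], [dxy, d2y]])

# Sample the (hypothetical) surface f(x, y) = x^2 - y^2 on a 5x5 grid centred at the origin
xs = np.arange(-2, 3)
F = np.array([[x ** 2 - y ** 2 for y in xs] for x in xs])

print(discrete_gradient(F, 2, 2))   # ~[0. 0.]: the centre is a stationary point
print(discrete_hessian(F, 2, 2))    # [[2. 0.], [0. -2.]]: mixed signs -> saddle point
```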

References

 https://github.com/purushottamkar/ml19-20w/tree/master/lecture_slides
 https://www.analyticsvidhya.com/blog/2019/10/mathematics-behind-machine-learning/
 https://tminka.github.io/papers/matrix/minka-matrix.pdf

