FAI 4 Mathematical Concepts II
2
Overview of This Course
11, 12, 13. Computer Vision I, II, III
14. Natural Language Processing
Deep Learning Applications
Gradient Descent Algorithm (1/4)
• Compute f’(x)
• If f’(x) > 0: Decrease x a bit
• If f’(x) < 0: Increase x a bit
• Repeat
4
Gradient Descent Algorithm (2/4)
• This suggests a procedure for finding a minimum:
  • Start at any x (e.g. x = 0)
  • Compute f’(x)
  • If f’(x) > 0: Decrease x a bit
  • If f’(x) < 0: Increase x a bit
  • Repeat
• In practice, the update is: x := x − lr · f’(x)
  • lr is the learning rate. It should be a positive value
  • If lr is too large: no convergence
  • If lr is too small: very slow convergence
5
Gradient Descent Algorithm (3/4)
Example with lr = 0.2:
  f(x) = x² − x
  f’(x) = 2x − 1
  argmin_x f(x) = 0.5

Loop: initialize x, compute f’(x), update x := x − lr · f’(x), and repeat.

  x = 0        f’(x) = −1
  x = 0.2      f’(x) = −0.6
  x = 0.392
  …
  x = 0.493    f’(x) = −0.014    STOP?
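For reference, a minimal Python sketch of this loop for f(x) = x² − x; the stopping threshold below is an arbitrary assumption, not from the slides.

# Minimal 1-D gradient descent sketch (illustrative)
def f_prime(x):
    return 2 * x - 1              # derivative of f(x) = x**2 - x

x = 0.0                           # initialize x
lr = 0.2                          # learning rate
while abs(f_prime(x)) > 0.02:     # stop when the derivative is close to 0 (arbitrary threshold)
    x = x - lr * f_prime(x)       # update: x := x - lr * f'(x)
print(x)                          # close to 0.5, the argmin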
6
Gradient Descent Algorithm (4/4)
• Gradient descent works well even when we have functions of millions of variables
• This is why it is so useful for Machine Learning and Neural Networks
• Other methods are not practical in such settings
• Convergence will depend on the choice of a good learning rate (see the sketch below)
• In experiments, a good deal of time is often spent finding an optimal learning rate
• Too large a learning rate: no convergence (i.e. the system learns nothing)
• Too small a learning rate: slow convergence (i.e. the system takes a long time to learn)
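A small illustrative sketch of these regimes on f(x) = x² − x; the learning-rate values and the step count are assumptions.

# Effect of the learning rate on gradient descent for f(x) = x**2 - x (illustrative)
def step(x, lr):
    return x - lr * (2 * x - 1)   # one gradient descent update

for lr in (1.5, 0.001, 0.2):
    x = 0.0
    for _ in range(20):
        x = step(x, lr)
    print(lr, x)                  # lr=1.5 diverges, lr=0.001 barely moves, lr=0.2 ends near 0.5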
7
Minimizing a Function of
Several Variables
8
Functions of Several Variables
• A function of several variables is just that: a function which has several variables

  f: ℝ³ → ℝ
  f(x, y, z) = (x − y)² + z² − z

  f(0, 0, 0) = 0
  f(1, 2, 3) = 7
  f(−1, 2, 2) = 11
  f(0, 1, 1) = ?
  f(2, 2, 0) = ?

• Like before, we want to find its minimum:
  argmin_{x,y,z} f(x, y, z) = (0, 0, 0.5)
9
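A tiny illustrative check of the example values above in Python:

# Evaluate the example function f(x, y, z) = (x - y)**2 + z**2 - z (illustrative)
def f(x, y, z):
    return (x - y) ** 2 + z ** 2 - z

print(f(0, 0, 0))    # 0
print(f(1, 2, 3))    # 7
print(f(-1, 2, 2))   # 11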
Parameterized Functions (1/5)
• By fixing one of the variables, we can obtain a function with one less variable
Function of 3 variables: f(x, y, z) = (x − y)² + z² − z
Fixing z: z = 2
f(x, y, 2) = (x − y)² + 4 − 2
10
Parameterized Functions (2/5)
• By fixing one of the variables, we can obtain a function with one less variable
Function of 3 variables: f(x, y, z) = (x − y)² + z² − z
Fixing y: y = 2
f(x, 2, z) = (x − 2)² + z² − z
11
Parameterized Functions (3/5)
• By fixing one of the variables, we can obtain a function with one less variable
Function of 3 variables: f(x, y, z) = (x − y)² + z² − z
Fixing y and z: y = 2, z = 3
f(x, 2, 3) = (x − 2)² + 9 − 3
12
Parameterized Functions (4/5)
• Therefore, in this case, variables y and z can be used to describe a "family" of functions. We say they parameterize the family.
13
Parameterized Functions (5/5)
• In such a case, we will say that f is a function parameterized by y
and z.
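A minimal Python sketch of this idea, fixing z = 2 in the running example (the helper name is illustrative):

# Fixing z turns f(x, y, z) into a function of x and y only (illustrative)
def f(x, y, z):
    return (x - y) ** 2 + z ** 2 - z

def f_z_fixed(x, y):
    return f(x, y, 2)        # z fixed to 2: (x - y)**2 + 4 - 2

print(f_z_fixed(1, 1))       # 2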
15
Partial Derivatives (2/4)
• What is the equivalent of our “high school” derivatives when we have several
variables?
• Partial derivatives are computed by choosing one variable and fixing the others
Compute the derivative after fixing the other variables:
  f′_{y,z}(x) = 2(x − y)      (y and z fixed, derivative with respect to x)
  f′_{x,z}(y) = 2(y − x)      (x and z fixed, derivative with respect to y)
  f′_{x,y}(z) = 2z − 1        (x and y fixed, derivative with respect to z)
17
Partial Derivatives (4/4)
f(x, y, z) = (x − y)² + z² − z

Fix the other variables, then compute the derivative:
  f′_{y,z}(x) = 2(x − y)      f′_{x,z}(y) = 2(y − x)      f′_{x,y}(z) = 2z − 1

In practice, we use this notation for partial derivatives:
  ∂f/∂x = 2(x − y)      ∂f/∂y = 2(y − x)      ∂f/∂z = 2z − 1
18
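A small sketch that checks one of these partial derivatives numerically, by fixing y and z and nudging x (the step size h is an assumption):

# Finite-difference check of df/dx for f(x, y, z) = (x - y)**2 + z**2 - z (illustrative)
def f(x, y, z):
    return (x - y) ** 2 + z ** 2 - z

def df_dx(x, y, z, h=1e-6):
    # y and z stay fixed; only x is nudged
    return (f(x + h, y, z) - f(x - h, y, z)) / (2 * h)

print(df_dx(1.0, 2.0, 3.0))  # close to the analytic value 2 * (1 - 2) = -2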
Compute the Partial Derivatives
f(x, y, z) = (x − y)² + z² − z

  ∂f/∂x =            ∂f/∂y =            ∂f/∂z =
19
Vectors (1/3)
• What are vectors?
• You probably have used vectors in physics classes to represent forces and velocities
20
Vectors (2/3)
• For now, we only need to know the following about vectors:
21
Vectors (3/3)
• We will usually denote a vector by a letter with an arrow on it: x⃗
• We denote the i-th component of x⃗ by xᵢ
• If x⃗ = [1, 2.2, −1, 4], then x₀ = 1, x₁ = 2.2, x₂ = −1, x₃ = 4
22
Vectors and Numpy
• In Python, Numpy arrays are a
convenient way to represent
vectors
  x⃗ = [1, 5, −2, 0.5]
• x = np.array([1, 5, -2, 0.5])
• x[0] == x₀ == 1
• x[1] == x₁ == 5
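A runnable version of the example above (assuming numpy is installed):

# Vectors as numpy arrays (illustrative)
import numpy as np

x = np.array([1, 5, -2, 0.5])
print(x[0])   # 1.0, the component x0
print(x[1])   # 5.0, the component x1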
23
Vectors and Multivariate Functions
• For now, we have represented the variables of a multivariate function with the letters x, y, z, as in: f(x, y, z) = (x − y)² + z² − z
• In practice, we can have any number of variables. So it is more convenient to use x₀ (instead of x), x₁ (instead of y), x₂ (instead of z), x₃ … xₙ (if we need more than 3 variables):
  f(x₀, x₁, x₂) = (x₀ − x₁)² + x₂² − x₂
• We can also use a vectorial notation to represent all of the variables as one vector variable:
  x⃗ = [x₀, x₁, x₂]        f(x⃗) = (x₀ − x₁)² + x₂² − x₂
• So, keep in mind that the 3 following expressions actually refer to the same function:
  f(x, y, z) = (x − y)² + z² − z
  f(x₀, x₁, x₂) = (x₀ − x₁)² + x₂² − x₂
  f(x⃗) = (x₀ − x₁)² + x₂² − x₂
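A minimal sketch of the same function written with a single vector argument (illustrative):

# The running example with a vector argument x = [x0, x1, x2] (illustrative)
import numpy as np

def f_vec(x):
    return (x[0] - x[1]) ** 2 + x[2] ** 2 - x[2]

print(f_vec(np.array([1, 2, 3])))   # 7, the same value as f(1, 2, 3)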
24
Gradient (1/2)
f(x, y, z) = (x − y)² + z² − z
• For example:
  grad f(x, y, z) = [2(x − y), 2(y − x), 2z − 1]
25
Gradient (2/2)
grad f(x, y, z) = [∂f/∂x, ∂f/∂y, ∂f/∂z]
• In this case, the function has 3 variables. Therefore the gradient is a
vector of size 3
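A small sketch of this gradient as a numpy vector (illustrative):

# Gradient of f(x, y, z) = (x - y)**2 + z**2 - z as a vector of size 3 (illustrative)
import numpy as np

def grad_f(x, y, z):
    return np.array([2 * (x - y), 2 * (y - x), 2 * z - 1])

print(grad_f(0, 1, 0))   # [-2  2 -1]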
27
Contour Plot
Gradient Descent
• Because the gradient points in the direction of steepest increase (i.e. away from the minimum), we can use the same idea as in the case of one variable
  One variable: x := x − lr · f’(x)        Several variables: x⃗ := x⃗ − lr · grad f(x⃗)
28
Gradient Descent Algorithm
x⃗ = [x₀, x₁, … xₙ]

• Initialize x⃗
• Compute grad f(x⃗) = [∂f/∂x₀, … , ∂f/∂xₙ]
• If |grad f(x⃗)| < err: stop (x⃗ should be close to the minimum)
• Otherwise, update x⃗ := x⃗ − lr · grad f(x⃗) and go back to computing the gradient

Example: f(x⃗) = (x₀ − x₁)² + x₂² − x₂, grad f(x⃗) = [2(x₀ − x₁), 2(x₁ − x₀), 2x₂ − 1], lr = 0.2

  x⃗ = [0, 1, 0]              grad f(x⃗) = [−2, 2, −1]
  x⃗ = [0.4, 0.6, 0.2]        grad f(x⃗) = [−0.4, 0.4, −0.6]
  x⃗ = [0.41, 0.43, 0.51]     grad f(x⃗) = [−0.04, 0.04, 0.01]
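A minimal Python sketch of this loop for the running example; the stopping threshold err and the starting point are assumptions.

# Multivariate gradient descent for f(x) = (x0 - x1)**2 + x2**2 - x2 (illustrative)
import numpy as np

def grad_f(x):
    return np.array([2 * (x[0] - x[1]), 2 * (x[1] - x[0]), 2 * x[2] - 1])

x = np.array([0.0, 1.0, 0.0])    # initialize x
lr = 0.2                         # learning rate
err = 0.01                       # stop when the gradient norm is small (assumed threshold)
while np.linalg.norm(grad_f(x)) >= err:
    x = x - lr * grad_f(x)       # update: x := x - lr * grad f(x)
print(x)                         # close to [0.5, 0.5, 0.5]: a minimum has x0 == x1 and x2 == 0.5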
29
What is the Equivalent of the Second
Derivative for Multivariate Functions?
  ∂²f/∂x²      ∂²f/∂x∂y      ∂²f/∂x∂z
31
Gradient Descent with Momentum
32
Stochastic Gradient Descent (1/2)
33
Stochastic Gradient Descent (2/2)
• What happens if the gradient is noisy?
  • That is, we can only compute a value that is equal to the true gradient "on average"
• But you have to decrease your learning rate over time to stabilize, e.g. lr = lr₀ / (t + 1) (see the sketch below)
• Convergence will be slower
• Very interesting, because a noisy gradient can be millions of times faster to compute than a "true" gradient
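A rough illustrative sketch of this idea on the 1-D example, with an assumed Gaussian noise model:

# Gradient descent with a noisy gradient and a decaying learning rate (illustrative)
import random

def noisy_grad(x):
    # equal to the true gradient 2*x - 1 only "on average"
    return (2 * x - 1) + random.gauss(0, 0.3)

x = 0.0
lr0 = 0.2
for t in range(10000):
    lr = lr0 / (t + 1)           # decrease the learning rate over time
    x = x - lr * noisy_grad(x)
print(x)                         # roughly 0.5; the exact value varies slightly from run to run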
34
Optimization Libraries
• You can also minimize a function by using a specialized library (see the example below)
• Which is one reason why Gradient Descent and its variants are still the main tool for large-scale Machine Learning (in particular, Deep Learning)
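As one concrete possibility (the slides do not name a specific library), scipy's general-purpose optimizer applied to the running example:

# Minimizing the running example with scipy instead of hand-written gradient descent (illustrative)
from scipy.optimize import minimize

def f(v):
    return (v[0] - v[1]) ** 2 + v[2] ** 2 - v[2]

result = minimize(f, x0=[0.0, 1.0, 0.0])
print(result.x)   # a point with x0 == x1 and x2 == 0.5 (up to numerical precision)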
35
Google Colab Notebook
36
Report
37
Exercise 1
• Compute the partial derivatives of:
f(x, y, z) = xyz − z² − y²
38
Exercise 2
x⃗ = [1.5, −2.0, 5]
y⃗ = [2, 2, 10, 10]
z⃗ = [3, −3, 0]
• Dimensions of x⃗, y⃗, z⃗?
39