FAI 4 Mathematical Concepts II

This document provides an overview of mathematical concepts for an artificial intelligence course, including:
• Gradient descent algorithms, which find the minimum of a function by iteratively updating parameters based on the gradient.
• Partial derivatives, which apply when a function has multiple variables and are obtained by fixing all variables but one, giving the derivative of a single-variable parameterized function.
• Parameterized functions, where some variables describe a "family" of functions, with each value of the parameters corresponding to a distinct single-variable function.


Mathematical Concepts (2/2)

Functions, Minimization, Gradient


Fundamentals of Artificial Intelligence
Instructor: Chenhui Chu
Email: chu@i.kyoto-u.ac.jp

Teaching Assistant: Youyuan Lin


E-mail: youyuan@nlp.ist.i.kyoto-u.ac.jp
Schedule
• 1. Overview of AI and this Course (10/2)
• 2. Introduction to Python (10/16)
• 3, 4. Mathematics Concepts I, II (10/23, 10/30)
• 5, 6. Regression I, II (11/6, 11/13)
• 7. Classification (11/20)
• 8. Introduction to Neural Networks (11/27)
• 9. Neural Networks Architecture and Backpropagation (12/4)
• 10. Fully Connected Layers (12/11, 12/18)
• 11, 12, 13. Computer Vision I, II, III (12/25, 1/4, 1/15)
• 14. Natural Language Processing (1/22)

2
Overview of This Course
• Fundamentals of Machine Learning: 2. Python; 3, 4. Mathematics Concepts I, II
• Basic Supervised Machine Learning: 5. Simple linear regression; 6. Multiple linear regression; 7. Classification
• Deep Learning: 8. Neural network introduction; 9. Architecture and backpropagation; 10. Feedforward neural networks
• Deep Learning Applications: 11, 12, 13. Computer Vision I, II, III; 14. Natural language processing
3


Gradient Descent Algorithm (1/4)
• This suggests some procedure for finding a minimum:
• Start at any x (e.g., x = 0)

• Compute f’(x)
• If f’(x) > 0: Decrease x a bit
• If f’(x) < 0: Increase x a bit
• Repeat
4
Gradient Descent Algorithm (2/4)
• This suggests some procedure for finding a minimum:
• Start at any x (e.g., x = 0)
• Compute f′(x)
• If f′(x) > 0: Decrease x a bit
• If f′(x) < 0: Increase x a bit
• Repeat

In practice, we do this:
x := x − lr ⋅ f′(x)
lr: learning rate. It should be a positive value; if too large: no convergence, if too small: very slow convergence.
5
Gradient Descent Algorithm (3/4)
lr = 0.2

Algorithm:
1. Initialize x (e.g., x = 0)
2. Compute f′(x)
3. Is x good enough? If yes: done. If no: update x := x − lr ⋅ f′(x) and go back to step 2.

lr: learning rate. It should be a positive value; if too large: no convergence, if too small: very slow convergence.

Example: f(x) = x² − x, f′(x) = 2x − 1, argmin f(x) = 0.5

x = 0, f′(x) = −1
x = 0.2, f′(x) = −0.6
x = 0.32, f′(x) = −0.36
x = 0.392, …
x = 0.493, f′(x) = −0.014 → STOP?
6
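As an illustration (not part of the original slides), here is a minimal Python sketch of this loop for f(x) = x² − x with lr = 0.2; the stopping threshold err is an assumed value, since the slide only asks "STOP?":

def f_prime(x):
    return 2 * x - 1              # derivative of f(x) = x**2 - x

lr = 0.2                          # learning rate, as on the slide
err = 0.02                        # stopping threshold (assumption, not from the slide)
x = 0.0                           # start at x = 0

while abs(f_prime(x)) >= err:
    x = x - lr * f_prime(x)       # update: x := x - lr * f'(x)

print(x)                          # about 0.49, close to the true minimum 0.5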
Gradient Descent Algorithm (4/4)
• Gradient descent works well even when we have functions of millions of variables
• This is why it is so useful for Machine Learning and Neural Networks
• Other methods will not be practical in such settings
• Convergence will depend on the choice of a good learning rate
• In experiments, a good deal of time is often spent finding an optimal learning rate
• Too large a learning rate: no convergence (i.e., the system learns nothing)
• Too small a learning rate: slow convergence (i.e., the system takes a long time to learn)
7
Minimizing a Function of
Several Variables

8
Functions of Several Variables
• A function of several variables is just that: a function which has several variables

f: ℝ³ → ℝ
f(x, y, z) = (x − y)² + z² − z

f(0, 0, 0) = 0
f(1, 2, 3) = 7
f(−1, 2, 2) = 11
f(0, 1, 1) = ?
f(2, 2, 0) = ?

• Like before, we want to find its minimum:

argmin_{x,y,z} f(x, y, z) = (0, 0, 0.5)

9
Parameterized Functions (1/5)
• By fixing one of the variables, we can obtain a function with one less variable

Function of 3 variables: f(x, y, z) = (x − y)² + z² − z

Fixing z: z = 2

f(x, y, 2) = (x − y)² + 4 − 2

Function of 2 variables: f(x, y) = (x − y)² + 2

10
Parameterized Functions (2/5)
• By fixing one of the variables, we can obtain a function with one less variable

Function of 3 variables: f(x, y, z) = (x − y)² + z² − z

Fixing y: y = 2

f(x, 2, z) = (x − 2)² + z² − z

Function of 2 variables: f(x, z) = (x − 2)² + z² − z

11
Parameterized Functions (3/5)
• By fixing one of the variables, we can obtain a function with one less variable

Function of 3 variables: f(x, y, z) = (x − y)² + z² − z

Fixing y and z: y = 2, z = 3

f(x, 2, 3) = (x − 2)² + 9 − 3

Function of 1 variable: f(x) = (x − 2)² + 6

12
Parameterized Functions (4/5)
• Therefore, in this case, the variables y and z can be used to describe a "family" of functions. We say that they parameterize the function.

Function of 3 variables: f(x, y, z) = (x − y)² + z² − z

Fixing y and z: y = 2, z = 3 (for each value of y and z, we have one function of one variable)

f(x, 2, 3) = (x − 2)² + 9 − 3

Function of 1 variable: f(x) = (x − 2)² + 6

13
Parameterized Functions (5/5)
• In such a case, we will say that f is a function parameterized by y and z.

• And we write the parameters separately, as subscripts:

Function of 3 variables: f_{y,z}(x) = (x − y)² + z² − z

Fixing y and z: y = 2, z = 3 (for each value of y and z, we have one function of one variable)

f_{2,3}(x) = (x − 2)² + 9 − 3

Function of 1 variable: f_{2,3}(x) = (x − 2)² + 6

Other examples: f_{0,0}(x) = x², f_{0,2}(x) = x² + 2
14
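As a small illustration in Python (not part of the slides), fixing the parameters y and z can be expressed as a function that returns a single-variable function:

# f(x, y, z) = (x - y)**2 + z**2 - z, seen as a family f_{y,z}(x) parameterized by y and z
def make_f(y, z):
    def f_yz(x):
        return (x - y) ** 2 + z ** 2 - z
    return f_yz

f_2_3 = make_f(2, 3)     # fix y = 2, z = 3
print(f_2_3(2))          # (2 - 2)**2 + 9 - 3 = 6
f_0_0 = make_f(0, 0)     # f_{0,0}(x) = x**2
print(f_0_0(3))          # 9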
Partial Derivatives (1/4)
• What is the equivalent of our “high school” derivatives when we
have several variables?

• One part of the answer is partial derivatives


• Partial derivatives are computed by choosing one variable and
fixing the others

• In other words, we see the function of several variables as a


parameterized function of one variable

15
Partial Derivatives (2/4)
• What is the equivalent of our “high school” derivatives when we have several
variables?

• One part of the answer is partial derivatives

• Partial derivatives are computed by choosing one variable and fixing the others

• In other words, we see the function of several variables as a parameterized


function of one variable

f(x, y, z) = (x − y)² + z² − z

• Indeed, if we choose y, and fix x and z, we can see f(x, y, z) as a function of one variable and compute its derivative

∂f/∂x = 2(x − y)    ∂f/∂y = 2(y − x)    ∂f/∂z = 2z − 1
16
Partial Derivatives (3/4)
f(x, y, z) = (x − y)² + z² − z

Fix other variables:
f_{y,z}(x) = (x − y)² + z² − z    f_{x,z}(y) = (x − y)² + z² − z    f_{x,y}(z) = (x − y)² + z² − z

Compute derivative:
f′_{y,z}(x) = 2(x − y)    f′_{x,z}(y) = 2(y − x)    f′_{x,y}(z) = 2z − 1
17
Partial Derivatives (4/4)
f(x, y, z) = (x − y)² + z² − z

Fix other variables:
f_{y,z}(x) = (x − y)² + z² − z    f_{x,z}(y) = (x − y)² + z² − z    f_{x,y}(z) = (x − y)² + z² − z

Compute derivative:
f′_{y,z}(x) = 2(x − y)    f′_{x,z}(y) = 2(y − x)    f′_{x,y}(z) = 2z − 1

In practice, we use this notation for partial derivatives:
∂f/∂x = 2(x − y)    ∂f/∂y = 2(y − x)    ∂f/∂z = 2z − 1
18
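As a quick check (not from the slides), each partial derivative can also be approximated numerically by moving only one variable while keeping the others fixed; the step size h and the test point are arbitrary choices:

def f(x, y, z):
    return (x - y) ** 2 + z ** 2 - z

h = 1e-6  # small step for the finite-difference approximation (assumed value)
x, y, z = 1.0, 2.0, 3.0

df_dx = (f(x + h, y, z) - f(x, y, z)) / h   # should be close to 2*(x - y) = -2
df_dy = (f(x, y + h, z) - f(x, y, z)) / h   # should be close to 2*(y - x) = 2
df_dz = (f(x, y, z + h) - f(x, y, z)) / h   # should be close to 2*z - 1 = 5
print(df_dx, df_dy, df_dz)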
Compute the Partial Derivatives

f(x, y, z) = (x − y)² + z² − z

∂f/∂x = ?    ∂f/∂y = ?    ∂f/∂z = ?

19
Vectors (1/3)
• What are vectors?

• You probably have used vectors in physics classes to represent force and speed

• 3-dimensional vectors: [2.3, 4.5, -1]

• In machine learning, we also use them a lot

• Except that they can have more than 3 dimensions

• 5-dimensional vector: [-1, 3, 4.1, 5.2, 4]

• We often denote the set of all n-dimensional vectors by ℝⁿ

[1.2, 1.4, 1, −1, −1] ∈ ℝ⁵

20
Vectors (2/3)
• For now, we only need to know the following about vectors:

• An n-dimensional vector is a list of n numbers

• We can add 2 vectors (if they have the same dimension)


[2.1, 3.4, 1.1, 3.2] + [−1, 2.1, 3.1, −2] = [1.1, 5.5, 4.2, 1.2]
[2.1, 3.4] + [−1, 2.1, 3.1, −2] = undefined (the dimensions do not match)
• We can multiply a vector by a number
0.5×[2, 3, −1, −2] = [1, 1.5, −0.5, −1]

21
Vectors (3/3)


• We will usually denote a vector by a letter with an arrow on it: x⃗

• We denote the i-th component of x⃗ by xi

• If x⃗ = [1, 2.2, −1, 4]

• Then we have x0 = 1, x1 = 2.2, x2 = −1, x3 = 4

22
Vectors and Numpy
• In Python, Numpy arrays are a
convenient way to represent
vectors

𝑥 = [1, 5, −2, 0.5]
• x = np.array([1, 5, -2, 0.5])

• x[0] == x0 == 1

• x[1] == x1 == 5

• Vector operations: x+0.5*y

23
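A short runnable NumPy example of these operations (the vector y below is invented for illustration):

import numpy as np

x = np.array([1, 5, -2, 0.5])
y = np.array([2, 0, 1, -1])        # a second vector, invented for this example

print(x[0], x[1])                  # components: 1.0 5.0
print(x + y)                       # vector addition
print(0.5 * x)                     # multiplication by a number: [ 0.5  2.5 -1.   0.25]
print(x + 0.5 * y)                 # combined operation, as on the slide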
Vectors and Multivariate Functions
• For now, we have represented the variables of a multivariate function with the letters x, y, z, as in: f(x, y, z) = (x − y)² + z² − z

• In practice, we can have any number of variables. So it is more convenient to use x₀ (instead of x), x₁ (instead of y), x₂ (instead of z), x₃ ... xₙ (if we need more than 3 variables):

f(x₀, x₁, x₂) = (x₀ − x₁)² + x₂² − x₂

• We can also use a vectorial notation to represent all of the variables as one vector variable:

x⃗ = [x₀, x₁, x₂]    f(x⃗) = (x₀ − x₁)² + x₂² − x₂

• So, keep in mind that the 3 following expressions actually refer to the same function:

f(x, y, z) = (x − y)² + z² − z
f(x₀, x₁, x₂) = (x₀ − x₁)² + x₂² − x₂
f(x⃗) = (x₀ − x₁)² + x₂² − x₂
24
Gradient (1/2)

• The partial derivatives become the components of a vector we call the gradient:

grad f(x, y, z) = [∂f/∂x, ∂f/∂y, ∂f/∂z]

• For example:

f(x, y, z) = (x − y)² + z² − z
grad f(x, y, z) = [2(x − y), 2(y − x), 2z − 1]

25
Gradient (2/2)
grad f(x, y, z) = [∂f/∂x, ∂f/∂y, ∂f/∂z]

• In this case, the function has 3 variables. Therefore the gradient is a vector of size 3

• If the function has n variables, the gradient is a vector of size n

• More precisely, the gradient of f is itself a function that returns a vector:

f: ℝⁿ → ℝ    grad f: ℝⁿ → ℝⁿ

f(x₀, x₁, ..., xₙ)    grad f(x₀, x₁, ..., xₙ) = [g₀, ..., gₙ]
26
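For example (an illustration, not from the slides), the gradient of the function used earlier can itself be written as a Python function that takes a vector and returns a vector:

import numpy as np

def f(x):
    # f(x0, x1, x2) = (x0 - x1)**2 + x2**2 - x2
    return (x[0] - x[1]) ** 2 + x[2] ** 2 - x[2]

def grad_f(x):
    # grad f = [2(x0 - x1), 2(x1 - x0), 2*x2 - 1]
    return np.array([2 * (x[0] - x[1]), 2 * (x[1] - x[0]), 2 * x[2] - 1])

print(grad_f(np.array([0.0, 1.0, 0.0])))   # [-2.  2. -1.]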
Interpreting the Gradient
• At a given point, the gradient is the direction in which the value of the function increases fastest

• Therefore, in general, it points in the direction opposite to the minimum

f(x, y) = 4(x − 2)² + 4(y + 1)² − 0.1xy
grad f(x, y) = [8(x − 2) − 0.1y, 8(y + 1) − 0.1x]

grad f(0, 0) = [−16, 8]
grad f(2, −1) = [0.1, −0.2]

(Figures: 3D plot and contour plot of f)
27
Gradient Descent
• Because we know that the gradient points in a direction opposite to the minimum, we can use the same idea as in the case of one variable

One variable: x := x − lr ⋅ f′(x)
Multiple variables: x⃗ := x⃗ − lr ⋅ grad f(x⃗)

28
Gradient Descent Algorithm

Example: f(x⃗) = (x₀ − x₁)² + x₂² − x₂
grad f(x⃗) = [2(x₀ − x₁), 2(x₁ − x₀), 2x₂ − 1]
lr = 0.2

Algorithm:
1. Initialize x⃗ = [x₀, x₁, ..., xₙ]
2. Compute grad f(x⃗) = [∂f/∂x₀, ..., ∂f/∂xₙ]
3. If |grad f(x⃗)| < err: stop; x⃗ should be close to the minimum. Otherwise: update x⃗ := x⃗ − lr ⋅ grad f(x⃗) and go back to step 2.

Example trace:
x⃗ = [0, 1, 0], grad f(x⃗) = [−2, 2, −1]
x⃗ = [0.4, 0.6, 0.2], grad f(x⃗) = [−0.4, 0.4, −0.6]
…
x⃗ = [0.41, 0.43, 0.51], grad f(x⃗) = [−0.04, 0.04, 0.01]
29
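A minimal NumPy sketch of this loop for the example function (not from the slides; the stopping threshold err is an assumed value):

import numpy as np

def grad_f(x):
    # gradient of f(x) = (x[0] - x[1])**2 + x[2]**2 - x[2]
    return np.array([2 * (x[0] - x[1]), 2 * (x[1] - x[0]), 2 * x[2] - 1])

lr = 0.2                       # learning rate, as on the slide
err = 0.05                     # stopping threshold (assumed)
x = np.array([0.0, 1.0, 0.0])  # initial point

while np.linalg.norm(grad_f(x)) >= err:
    x = x - lr * grad_f(x)     # update: x := x - lr * grad f(x)

print(x)                       # close to a minimizer, where x0 = x1 and x2 = 0.5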
What is the Equivalent of the Second Derivative for Multivariate Functions?

• It is the Hessian matrix:

∂²f/∂x²    ∂²f/∂x∂y   ∂²f/∂x∂z
∂²f/∂x∂y   ∂²f/∂y²    ∂²f/∂y∂z
∂²f/∂x∂z   ∂²f/∂y∂z   ∂²f/∂z²

• But thankfully, we will not need to use it

• But for your information, this would be the equivalent of the "High School" minimization when we have several variables. To minimize f(x, y, z):
1. Compute the gradient of f(x, y, z)
2. Compute the Hessian of f(x, y, z)
3. Find x, y, z such that grad f(x, y, z) = 0
4. If the Hessian of f at (x, y, z) is positive definite, then (x, y, z) is a local minimum of f
30
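For illustration only (the slides say the Hessian will not be needed later), step 4 can be checked numerically: for the earlier example f(x, y) = 4(x − 2)² + 4(y + 1)² − 0.1xy the Hessian is constant, and all of its eigenvalues are positive:

import numpy as np

# Hessian of f(x, y) = 4(x - 2)**2 + 4(y + 1)**2 - 0.1*x*y (constant in this case)
H = np.array([[ 8.0, -0.1],
              [-0.1,  8.0]])

eigenvalues = np.linalg.eigvalsh(H)   # eigenvalues of a symmetric matrix
print(eigenvalues)                    # [7.9 8.1]: all positive, so H is positive definite
                                      # and the point where the gradient is zero is a local minimum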
Gradient Descent Algorithm
• You can see that, in the case of the gradient descent, the algorithm is the
same for univariate functions and multivariate functions

• It is a simple algorithm, but it scales very well

• There exist many variations of it:

• Gradient Descent with Momentum

• Stochastic Gradient Descent

• Adagrad, Adadelta, Adam, …

31
Gradient Descent with Momentum

• Compute a “gradient with momentum” at each iteration:

gm_t = 0.6 ⋅ grad f(x⃗) + 0.4 ⋅ gm_{t−1}

• Update x⃗:

x⃗ := x⃗ − lr ⋅ gm_t

32
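A minimal NumPy sketch of this update; the 0.6/0.4 weights come from the slide, while grad_f, the learning rate, the starting point, the initialization of the momentum term, and the number of iterations are assumptions for illustration:

import numpy as np

def grad_f(x):
    # example gradient from the earlier slides: f(x) = (x[0] - x[1])**2 + x[2]**2 - x[2]
    return np.array([2 * (x[0] - x[1]), 2 * (x[1] - x[0]), 2 * x[2] - 1])

lr = 0.2
x = np.array([0.0, 1.0, 0.0])
gm = np.zeros_like(x)                  # gm_{-1} = 0 (assumed initialization)

for t in range(50):                    # fixed number of iterations, for simplicity
    gm = 0.6 * grad_f(x) + 0.4 * gm    # gradient with momentum
    x = x - lr * gm                    # update x with the momentum term
print(x)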
Stochastic Gradient Descent (1/2)

• What happens if the gradient is noisy?


• That is, we can only compute a value that is equal to
the true gradient “on average”?

• A bit like if you are drunk and trying to get home

33
Stochastic Gradient Descent (2/2)
• What happens if the gradient is noisy?

• That is, we can only compute a value that is equal to the true gradient “on
average”?

• A bit like if you are drunk and trying to get home

• It turns out it works.

• But you have to decrease your learning rate over time to stabilize:

lr = lr₀ / (t + 1)

• Convergence will be slower

• Very interesting, because a noisy gradient can be millions of times faster to compute than a “true” gradient

34
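A toy sketch of this idea (not from the slides): the true gradient of f(x) = x² − x is replaced by a noisy estimate, and the learning rate is decayed as lr₀ / (t + 1); the noise level, lr₀, and the number of steps are arbitrary choices:

import random

def noisy_grad(x):
    # true gradient 2x - 1 plus zero-mean noise, so it is only correct "on average"
    return (2 * x - 1) + random.gauss(0, 0.5)

lr0 = 0.2       # initial learning rate (assumed)
x = 0.0

for t in range(1000):
    lr = lr0 / (t + 1)          # decrease the learning rate over time
    x = x - lr * noisy_grad(x)  # stochastic gradient descent step

print(x)  # should end up in the neighbourhood of the true minimum 0.5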
Optimization Libraries
• You can also minimize a function by using a specialized library

• It gives you access to more sophisticated minimization algorithms

• However, these more sophisticated algorithms do not scale as well as Gradient Descent

• This is one reason why Gradient Descent and its variants are still the main tool for large-scale Machine Learning (in particular, Deep Learning)

35
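For example (a sketch assuming SciPy is available; the slides do not name a specific library), scipy.optimize.minimize can minimize the earlier three-variable function directly:

import numpy as np
from scipy.optimize import minimize

def f(x):
    # f(x0, x1, x2) = (x0 - x1)**2 + x2**2 - x2
    return (x[0] - x[1]) ** 2 + x[2] ** 2 - x[2]

result = minimize(f, x0=np.array([0.0, 1.0, 0.0]), method="BFGS")  # starting point as in earlier slides
print(result.x)   # a minimizer: x0 = x1 and x2 close to 0.5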
Google Colab Notebook

• Let us check gradient descent in practice with Google


Colab notebook
https://shorturl.at/lxUX1

36
Report

• Submit Exercises 1 and 2 as a PDF via PandA

• Submission due: next lecture

• Name the pdf file as student id_name.

37
Exercise 1
• Compute the partial derivatives of:

f(x, y, z) = xyz − z² − y²

f(x, y, z) = e^(xyz) − log(z)

38
Exercise 2

x⃗ = [1.5, −2.0, 5]
y⃗ = [2, 2, 10, 10]
z⃗ = [3, −3, 0]

• Dimensions of x⃗, y⃗, z⃗?

• Values of x1, y2, z0, y0?

• Compute: x⃗ + y⃗    x⃗ + 0.5 × y⃗    y⃗ + z⃗

39
