Linear Regression
Machine Intelligence
Lecture # 5
Fall 2023
1
Tentative Course Topics
2
Agenda
• Regression Problem: Univariate Linear Regression
• Optimization Technique: Gradient Descent
• Regression Problem: Univariate Linear Regression Example
• Regression Problem: Multivariate Linear Regression
• Optimization Technique: Normal Equations
3
Regression Problem
4
Linear Regression with One Variable
• Linear regression with one variable is also known as "univariate" linear regression.
• We want to predict a single output value y from a single input value x.

[Figure: scatter plot of training data, output y versus input x (a supervised learning example)]
The Hypothesis Function
hθ(x) = θ₀ + θ₁x    (hypothesis)
Andrew Ng 7
Example
• Suppose we have the following set of training data:

    input x    output y
    0          4
    1          7
    2          7
    3          8

• Now we can make a random guess about our hθ, for example:
      hθ(x) = 2 + 2x
• For an input of 1, this hypothesis predicts hθ(1) = 4, while the actual output is 7, so it is off by 3.

[Figure: the training points together with the guessed line hθ(x) = 2 + 2x]
9
Cost Function
• We can measure the accuracy of our hypothesis function by using a cost function.
• This takes an average of all the results of the hypothesis with inputs from the x's compared to the actual outputs y's:

    J(θ₀, θ₁) = (1/(2m)) Σᵢ₌₁ᵐ (hθ(x⁽ⁱ⁾) − y⁽ⁱ⁾)²
10
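To make the cost function concrete, here is a minimal NumPy sketch (the function name compute_cost and the array names are illustrative, not from the slides), evaluated on the small training set from the previous example:

```python
import numpy as np

def compute_cost(theta0, theta1, x, y):
    """Squared-error cost J(theta0, theta1) = (1/(2m)) * sum((h(x) - y)^2)."""
    m = len(y)                            # number of training examples
    predictions = theta0 + theta1 * x     # h_theta(x) for every example
    errors = predictions - y
    return np.sum(errors ** 2) / (2 * m)

# Training data from the earlier example
x = np.array([0.0, 1.0, 2.0, 3.0])
y = np.array([4.0, 7.0, 7.0, 8.0])
print(compute_cost(2.0, 2.0, x, y))       # cost of the guess h(x) = 2 + 2x -> 1.75
```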
Simplified Example

    input x    output y
    1          1
    2          2
    3          3

Simplified hypothesis (fix θ₀ = 0):  hθ(x) = θ₁ · x

Cost function:  J(θ₁) = (1/(2m)) Σᵢ₌₁ᵐ (hθ(x⁽ⁱ⁾) − y⁽ⁱ⁾)²

[Figures: the three training points plotted with candidate lines such as hθ(x) = 0.5x and hθ(x) = 0]

For θ₁ = 1 the line passes through all three training points, so J(1) = 0.
Simplified Example

J(0) ≈ 2.3
J(0.5) ≈ 0.58
J(1) = 0

[Figure: on the left, the training data with the candidate lines; on the right, J(θ₁) plotted against θ₁ with the three values above marked. The minimum is at θ₁ = 1.]
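These three values can be reproduced with a short sketch (assuming the same squared-error cost as above, with θ₀ fixed at 0):

```python
import numpy as np

x = np.array([1.0, 2.0, 3.0])
y = np.array([1.0, 2.0, 3.0])
m = len(y)

for theta1 in (0.0, 0.5, 1.0):
    J = np.sum((theta1 * x - y) ** 2) / (2 * m)   # h(x) = theta1 * x, theta0 = 0
    print(f"J({theta1}) = {J:.2f}")               # approx. 2.33, 0.58, 0.00
```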
Summary of cost function
15
Summary of cost function
16
(for fixed θ₀, θ₁, hθ(x) is a function of x)        (J is a function of the parameters θ₀, θ₁)

Each choice of parameters gives a line through the data on the left and a single point on the contour plot of J(θ₀, θ₁) on the right; all points on the same contour have the same value of J(θ₀, θ₁). For example:

• hθ(x) = 800 − 0.15x   (θ₀ = 800, θ₁ = −0.15)
• hθ(x) = 500 − 0.5x    (θ₀ = 500, θ₁ = −0.5)
• hθ(x) = 100 + 0.1x    (θ₀ = 100, θ₁ = 0.1), the closest of the three to our objective, the minimum of J
Quick Summary until now
Hypothesis:  hθ(x) = θ₀ + θ₁x
Cost Function:  J(θ₀, θ₁) = (1/(2m)) Σᵢ₌₁ᵐ (hθ(x⁽ⁱ⁾) − y⁽ⁱ⁾)²
Goal:  minimize J(θ₀, θ₁) over θ₀, θ₁
Our strategy until now: keep changing θ₀ and θ₁ to reduce J(θ₀, θ₁) until we hopefully end up at a minimum!
20
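As a naive illustration of this strategy (not what we will actually do; gradient descent, introduced next, searches far more efficiently), one could simply try many candidate parameter pairs on a grid and keep the one with the smallest cost. A minimal sketch, reusing the small training set from the earlier example:

```python
import numpy as np

x = np.array([0.0, 1.0, 2.0, 3.0])
y = np.array([4.0, 7.0, 7.0, 8.0])
m = len(y)

best = (None, None, np.inf)                   # (theta0, theta1, cost)
for t0 in np.linspace(0.0, 10.0, 101):        # candidate theta_0 values
    for t1 in np.linspace(-2.0, 4.0, 61):     # candidate theta_1 values
        J = np.sum((t0 + t1 * x - y) ** 2) / (2 * m)
        if J < best[2]:
            best = (t0, t1, J)
print("best (theta0, theta1, J):", best)
```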
Optimization Technique
Gradient Descent
Optimization
• Optimization is the process of finding the set of parameters/weights that minimizes the cost function.
22
Optimization
• Strategy 2: Following the Gradient
  ◮ Compute the best direction along which we should change our parameter (weight) vector, the direction that is mathematically guaranteed to be the direction of steepest descent.
  ◮ This direction is related to the gradient of the cost function.
23
Problem setup
24
Gradient Descent
Andrew Ng 25
Gradient Descent
Andrew Ng 27
Gradient Descent
• We take steps down the cost function in the direction of steepest descent, and the size of each step is determined by the parameter α, called the learning rate.
• The gradient descent algorithm is:

  Repeat until convergence {
      θⱼ := θⱼ − α · ∂J(θ₀, θ₁)/∂θⱼ        (simultaneously for j = 0 and j = 1)
  }

[Figure: J plotted as a function of a single parameter, with successive gradient descent steps marked moving toward the minimum]
29
GD algorithm
Plugging the linear regression cost function into the update rule gives (repeat until convergence, updating both parameters simultaneously):
  θ₀ := θ₀ − α · (1/m) Σᵢ₌₁ᵐ (hθ(x⁽ⁱ⁾) − y⁽ⁱ⁾)
  θ₁ := θ₁ − α · (1/m) Σᵢ₌₁ᵐ (hθ(x⁽ⁱ⁾) − y⁽ⁱ⁾) · x⁽ⁱ⁾
30
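A minimal NumPy sketch of these updates for univariate linear regression (function and variable names are illustrative, not from the slides):

```python
import numpy as np

def gradient_descent(x, y, alpha=0.1, n_iters=1000):
    """Batch gradient descent for h(x) = theta0 + theta1 * x."""
    m = len(y)
    theta0, theta1 = 0.0, 0.0                     # initial guess
    for _ in range(n_iters):
        errors = theta0 + theta1 * x - y          # h(x) - y for all examples
        grad0 = errors.sum() / m                  # dJ/dtheta0
        grad1 = (errors * x).sum() / m            # dJ/dtheta1
        theta0 -= alpha * grad0                   # simultaneous update
        theta1 -= alpha * grad1
    return theta0, theta1

x = np.array([0.0, 1.0, 2.0, 3.0])
y = np.array([4.0, 7.0, 7.0, 8.0])
print(gradient_descent(x, y))
```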
Two ways to compute the gradient
• There are two ways to compute the gradient:
1) Numerical gradient: A slow, approximate, but easy-to-implement way. It is approximate (since we have to pick a small value of h, while the true gradient is defined as the limit as h goes to zero) and can be very computationally expensive to compute.
2) Analytic gradient: A fast, exact but more error-prone way that requires
calculus. It allows us to derive a direct formula for the gradient (no
approximations) that is also very fast to compute.
32
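A small sketch contrasting the two approaches for the simplified cost J(θ₁) = (1/(2m)) Σᵢ (θ₁x⁽ⁱ⁾ − y⁽ⁱ⁾)² with θ₀ fixed at 0 (the centered-difference step h and the names are illustrative):

```python
import numpy as np

x = np.array([1.0, 2.0, 3.0])
y = np.array([1.0, 2.0, 3.0])
m = len(y)

def J(theta1):
    return np.sum((theta1 * x - y) ** 2) / (2 * m)

theta1 = 0.5
h = 1e-5
numerical_grad = (J(theta1 + h) - J(theta1 - h)) / (2 * h)   # approximate, needs a small h
analytic_grad = np.sum((theta1 * x - y) * x) / m             # exact formula from calculus
print(numerical_grad, analytic_grad)                         # the two should agree closely
```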
Gradient Descent variants
• There are three variants of gradient descent, based on the amount of data used to calculate the gradient: batch, stochastic, and mini-batch gradient descent.
33
Batch Gradient Descent
• Batch gradient descent (also called vanilla gradient descent) calculates the error for each observation in the dataset but performs an update only after all observations have been evaluated.
• One cycle through the entire training dataset is called a training epoch.
Therefore, it is often said that batch gradient descent performs model
updates at the end of each training epoch.
34
Batch Gradient Descent
35
Stochastic Gradient Descent (SGD)
• Stochastic gradient descent, often abbreviated SGD, is a variation of the
gradient descent algorithm that calculates the error and updates the model
for each example in the training dataset.
• The noisy update process can allow the model to avoid local minima (i.e. premature convergence).
• SGD is usually faster than batch gradient descent, but its frequent updates cause a higher variance in the error, which can sometimes jump around instead of decreasing steadily.
36
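A minimal sketch of SGD for univariate linear regression, performing one update per training example per epoch (the shuffling, names, and hyperparameters are illustrative):

```python
import numpy as np

def sgd(x, y, alpha=0.05, n_epochs=50, seed=0):
    rng = np.random.default_rng(seed)
    theta0, theta1 = 0.0, 0.0
    for _ in range(n_epochs):
        for i in rng.permutation(len(y)):          # visit the examples in random order
            error = theta0 + theta1 * x[i] - y[i]  # error on a single example
            theta0 -= alpha * error                # update from that one example
            theta1 -= alpha * error * x[i]
    return theta0, theta1

x = np.array([0.0, 1.0, 2.0, 3.0])
y = np.array([4.0, 7.0, 7.0, 8.0])
print(sgd(x, y))
```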
Mini-Batch Gradient Descent
• Mini-batch gradient descent seeks to find a balance between the robustness of
stochastic gradient descent and the efficiency of batch gradient descent.
• It is the most common implementation of gradient descent used in the field of deep
learning.
• It splits the training dataset into small batches that are used to calculate model error
and update model coefficients.
37
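A sketch of the mini-batch variant for the same univariate model (the batch size of 2 and the names are illustrative):

```python
import numpy as np

def minibatch_gd(x, y, alpha=0.05, n_epochs=100, batch_size=2, seed=0):
    rng = np.random.default_rng(seed)
    theta0, theta1 = 0.0, 0.0
    m = len(y)
    for _ in range(n_epochs):
        idx = rng.permutation(m)                         # shuffle once per epoch
        for start in range(0, m, batch_size):
            b = idx[start:start + batch_size]            # indices of one mini-batch
            errors = theta0 + theta1 * x[b] - y[b]
            theta0 -= alpha * errors.mean()              # average gradient over the batch
            theta1 -= alpha * (errors * x[b]).mean()
    return theta0, theta1

x = np.array([0.0, 1.0, 2.0, 3.0])
y = np.array([4.0, 7.0, 7.0, 8.0])
print(minibatch_gd(x, y))
```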
(for fixed θ₀, θ₁, hθ(x) is a function of x)        (J is a function of the parameters θ₀, θ₁)

[Slides 39-47: a sequence of figures showing successive gradient descent steps. The current (θ₀, θ₁) moves across the contour plot of J(θ₀, θ₁) toward the minimum, while the corresponding line hθ(x) fits the training data better and better.]
Learning Rate
• The gradient tells us the direction, but it does not tell us how far along this
direction we should step.
• The learning rate (step size) determines how big the step would be on
each iteration. It determines how fast or slow we will move towards the
optimal weights.
48
Learning Rate
• If the learning rate is too large, gradient descent may overshoot the minimum and fail to converge.
• If the learning rate is very small, it will take a long time to converge and become computationally expensive.
• The most commonly used rates are: 0.001, 0.003, 0.01 (default), 0.03, 0.1, 0.3.
49
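A quick sketch of the effect of the learning rate: run a fixed number of batch gradient descent iterations with several values of α and compare the final cost (all names and values are illustrative):

```python
import numpy as np

x = np.array([0.0, 1.0, 2.0, 3.0])
y = np.array([4.0, 7.0, 7.0, 8.0])
m = len(y)

def final_cost(alpha, n_iters=100):
    t0, t1 = 0.0, 0.0
    for _ in range(n_iters):
        errors = t0 + t1 * x - y
        t0 -= alpha * errors.sum() / m
        t1 -= alpha * (errors * x).sum() / m
    return np.sum((t0 + t1 * x - y) ** 2) / (2 * m)

for alpha in (0.001, 0.01, 0.1, 1.0):
    # very small alpha: barely moves in 100 iterations; too large: the cost can blow up
    print(alpha, final_cost(alpha))
```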
Regression Problem
50
Multiple variables (Features)
• Linear regression with multiple variables is also known as "multivariate linear
regression".
51
Multiple variables (Features)
52
Hypothesis

hθ(x) = θ₀x₀ + θ₁x₁ + θ₂x₂ + ⋯ + θₙxₙ

For convenience of notation, define x₀ = 1. Then

    x = [x₀, x₁, x₂, …, xₙ]ᵀ ∈ ℝⁿ⁺¹        θ = [θ₀, θ₁, θ₂, …, θₙ]ᵀ ∈ ℝⁿ⁺¹
Hypothesis for Multiple Features

With x, θ ∈ ℝⁿ⁺¹ written as column vectors (and x₀ = 1), the hypothesis can be written compactly as

    hθ(x) = θᵀx = θ₀x₀ + θ₁x₁ + ⋯ + θₙxₙ
GD for Multiple Variables
Repeat until convergence, updating all parameters simultaneously:
  θⱼ := θⱼ − α · (1/m) Σᵢ₌₁ᵐ (hθ(x⁽ⁱ⁾) − y⁽ⁱ⁾) · xⱼ⁽ⁱ⁾        for j = 0, 1, …, n
57
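A vectorized sketch of the multivariate update, using a design matrix X of shape m × (n+1) whose first column is all ones and a parameter vector θ ∈ ℝⁿ⁺¹ (names are illustrative):

```python
import numpy as np

def gradient_descent_multi(X, y, alpha=0.01, n_iters=1000):
    """Batch gradient descent for h(x) = X @ theta, where X includes the x0 = 1 column."""
    m, n_plus_1 = X.shape
    theta = np.zeros(n_plus_1)
    for _ in range(n_iters):
        gradient = X.T @ (X @ theta - y) / m   # vector of partial derivatives dJ/dtheta_j
        theta -= alpha * gradient              # simultaneous update of all parameters
    return theta
```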
Gradient Descent in Practice I - Feature Scaling
• We can speed up gradient descent by having each of our input values in roughly the same
range.
• This is because θ will:
  – descend quickly on small ranges,
  – descend slowly on large ranges, and
  – oscillate inefficiently down to the optimum when the variables are very uneven.
58
Gradient Descent in Practice I - Feature Scaling
https://www.blog.nipunarora.net/ml_multi_variate_linear_regression/ 59
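One common way to put the features on a similar scale is mean normalization, xⱼ := (xⱼ − μⱼ) / sⱼ, where μⱼ is the mean and sⱼ the standard deviation (or the range) of feature j. A minimal sketch, applied to the raw feature columns before the x₀ = 1 column is added (names and data are illustrative):

```python
import numpy as np

def scale_features(X):
    """Mean-normalize each feature column: (x - mean) / std."""
    mu = X.mean(axis=0)
    sigma = X.std(axis=0)
    return (X - mu) / sigma, mu, sigma     # keep mu and sigma to scale future inputs the same way

# Features with very uneven ranges: size in feet^2 and number of bedrooms
X = np.array([[2104.0, 5.0],
              [1416.0, 3.0],
              [1534.0, 3.0],
              [ 852.0, 2.0]])
X_scaled, mu, sigma = scale_features(X)
print(X_scaled)
```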
Feature Selection
• We can improve our features and the form of our hypothesis function.
• We can combine multiple features into one. For example, we can combine x₁ and x₂ into a new feature x₃ by taking x₁ · x₂.
  – Ex: if x₁ is the length and x₂ is the width of a house, we can combine them into a new feature area = length × width.
63
Optimization Technique:
Normal Equations
Better values for θ
• The normal equation solves for θ analytically, instead of iterating.
• Minimize J(θ) by setting its partial derivatives to zero:
    ∂J(θ)/∂θⱼ = 0   for every j
  and solve for θ₀, θ₁, …, θₙ. This yields the closed-form solution θ = (XᵀX)⁻¹ Xᵀ y.
67
Example: m = 4

x₀   Size (feet²)   Number of      Number of     Age of home    Price ($1000)
     x₁             bedrooms x₂    floors x₃     (years) x₄     y
1    2104           5              1             45             460
1    1416           3              2             40             232
1    1534           3              2             30             315
1    852            2              1             36             178
68
Notation: m training examples; n features.

x⁽ⁱ⁾ = [x₀⁽ⁱ⁾, x₁⁽ⁱ⁾, …, xₙ⁽ⁱ⁾]ᵀ ∈ ℝⁿ⁺¹ is the feature vector of the i-th example (with x₀⁽ⁱ⁾ = 1).

Stacking the examples as rows gives the design matrix X ∈ ℝ^(m × (n+1)), and stacking the targets gives y ∈ ℝᵐ:

    X = [ (x⁽¹⁾)ᵀ ; (x⁽²⁾)ᵀ ; … ; (x⁽ᵐ⁾)ᵀ ]        y = [ y⁽¹⁾, y⁽²⁾, …, y⁽ᵐ⁾ ]ᵀ

Example with one feature (n = 1):

    x⁽ⁱ⁾ = [1, x₁⁽ⁱ⁾]ᵀ        X = [ 1  x₁⁽¹⁾ ; 1  x₁⁽²⁾ ; ⋮ ; 1  x₁⁽ᵐ⁾ ]        y = [ y⁽¹⁾, …, y⁽ᵐ⁾ ]ᵀ
69
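A sketch of the normal equation θ = (XᵀX)⁻¹Xᵀy applied to the four-example table above. With only m = 4 examples and n + 1 = 5 parameters, XᵀX is not invertible here, so the sketch uses the pseudo-inverse np.linalg.pinv; with more examples than parameters, solving (XᵀX)θ = Xᵀy with np.linalg.solve would be the usual choice:

```python
import numpy as np

# Design matrix from the table: columns x0 = 1, size, bedrooms, floors, age
X = np.array([[1.0, 2104.0, 5.0, 1.0, 45.0],
              [1.0, 1416.0, 3.0, 2.0, 40.0],
              [1.0, 1534.0, 3.0, 2.0, 30.0],
              [1.0,  852.0, 2.0, 1.0, 36.0]])
y = np.array([460.0, 232.0, 315.0, 178.0])     # prices in $1000

theta = np.linalg.pinv(X.T @ X) @ X.T @ y      # theta = (X^T X)^+ X^T y
print(theta)
print(X @ theta)                               # predictions for the four training examples
```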
Gradient Descent Vs. Normal Equation

Gradient descent
• Need to choose α
• Needs many iterations
• Works well even when n is massive (millions); better suited to big data
• What is a big n, though? 100 or even 1,000 is still (relatively) small; if n is around 10,000, look at using gradient descent

Normal equation
• No need to choose α
• No need to iterate
• Needs to compute (XᵀX)⁻¹, the inverse of an n × n matrix
• With most implementations, computing a matrix inverse grows as O(n³), so it can be much slower for large n
70
Example 2:
X= y=
71
Check?
• Suppose you have m = 25 training examples with n = 6 features. The normal equation is θ = (XᵀX)⁻¹ Xᵀ y.
• For the given values of m and n, what are the dimensions of θ, X, and y in this equation?
  X is m × (n + 1) = 25 × 7
  y is an m-vector: 25 × 1
  θ is an (n + 1)-vector: 7 × 1
72