L3 Linear Regression and Gradient Descent
Gradient Descent
Mariette Awad
• Learn more about regression and applications of the above key concepts with linear regression:
• Learn what "regression" models are
• Learn what linear "regression" models are
• Illustrate how parameters are used to represent a linear model
• Explain what a hypothesis function is and how it is learned
• Explain what a cost function is and how it is used to learn the hypothesis
• Illustrate how the cost function changes as the parameters change
• Explain the gradient descent algorithm
Outline
• Predict house prices in Portland, OR, given prices (in $1000s of dollars) per size.
• A learning algorithm can fit a straight line or a quadratic; later we will see how to choose.
• We have given the algorithm the "right answers" (actual prices), and the task of the algorithm is to find more "right answers".
• The house-price problem is called regression: predict continuous-valued numbers.

[Figure: scatter plot of price ($1000s, 0–300) vs. size (feet², 0–3000) for the training data]

Size in feet² (x)    Price ($) in 1000's (y)
2104                 460
1416                 232
1534                 315
852                  178
…                    …
Notations
• $m$ = number of training examples
• $x$ = input variable / feature (e.g., house size in feet²)
• $y$ = output / target variable (e.g., price in $1000s)
• $(x^{(i)}, y^{(i)})$ = the $i$-th training example
• $h$ = the hypothesis function that maps an input $x$ to a predicted $y$
[Figure: scatter plot of the housing training data, price ($1000s, 0–400) vs. size (feet², 0–3000)]
[Figure: three small panels (axes 0–3) showing different candidate hypotheses]
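To make the notation concrete, here is a minimal sketch in Python (NumPy assumed; the variable names are illustrative, not from the slides) that stores the housing training set from the table above and shows how m, x^(i), and y^(i) map onto arrays.

```python
import numpy as np

# Training set from the housing example (size in feet^2, price in $1000s)
x = np.array([2104.0, 1416.0, 1534.0, 852.0])   # inputs x
y = np.array([460.0, 232.0, 315.0, 178.0])      # targets y

m = len(x)                  # m = number of training examples
i = 1                       # Python is 0-indexed, so x[0] corresponds to x^(1)
print(m)                    # 4
print(x[i - 1], y[i - 1])   # the i-th training example (x^(i), y^(i))
```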
Learning Objective
Parameters: $\theta_0, \theta_1$ (of the linear hypothesis $h_\theta(x) = \theta_0 + \theta_1 x$)
Cost Function: $J(\theta_0, \theta_1) = \frac{1}{2m}\sum_{i=1}^{m}\left(h_\theta(x^{(i)}) - y^{(i)}\right)^2$
Goal/Objective Function: $\min_{\theta_0, \theta_1} J(\theta_0, \theta_1)$
Behavior of Cost Function
Two Key Things to Consider
• Two key functions we want to understand:
  • the hypothesis function (h), and
  • the cost function (J)
Parameters: $\theta_0, \theta_1$, so that $h_\theta(x) = \theta_0 + \theta_1 x$
Cost Function: $J(\theta_0, \theta_1) = \frac{1}{2m}\sum_{i=1}^{m}\left(h_\theta(x^{(i)}) - y^{(i)}\right)^2$
Goal: $\min_{\theta_0, \theta_1} J(\theta_0, \theta_1)$
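As a rough illustration of these two functions, here is a small Python sketch (NumPy assumed; the function names are my own) of the linear hypothesis $h_\theta(x) = \theta_0 + \theta_1 x$ and the squared-error cost $J(\theta_0, \theta_1)$ defined above.

```python
import numpy as np

def hypothesis(theta0, theta1, x):
    """h_theta(x) = theta0 + theta1 * x, applied elementwise to an array x."""
    return theta0 + theta1 * x

def cost(theta0, theta1, x, y):
    """J(theta0, theta1) = (1 / 2m) * sum_i (h_theta(x^(i)) - y^(i))^2."""
    m = len(x)
    errors = hypothesis(theta0, theta1, x) - y
    return np.sum(errors ** 2) / (2 * m)

# Housing data from the earlier slide (size in feet^2, price in $1000s)
x = np.array([2104.0, 1416.0, 1534.0, 852.0])
y = np.array([460.0, 232.0, 315.0, 178.0])
print(cost(0.0, 0.2, x, y))  # cost of the hypothesis h(x) = 0.2 x
```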
Illustration & Intuition
• Consider the case where the training data consists of the following (x, y) values: (1,1), (2,2), and (3,3).
• Let's check the two key parts (h and J).
• Consider different possible parameter values (for h) and see the impact on J.
Behavior of Cost Function with
one parameter
Case study: different $\theta_1$ values (with $\theta_0 = 0$, so $h_\theta(x) = \theta_1 x$):
$h(x) = x$, $h(x) = 0.5x$, and $h(x) = 0$
$\theta_1 = 1$: the hypothesis $h_\theta(x) = x$ passes through every training point, so $h_\theta(x^{(i)}) = y^{(i)}$ for all $i$.

[Figure: left panel — $h_\theta(x)$ vs. $x$ (for fixed $\theta_1$, this is a function of x); right panel — $J(\theta_1)$ (a function of the parameter $\theta_1$)]

$$J(\theta_1) = \frac{1}{2m}\sum_{i=1}^{m}\left(h_\theta(x^{(i)}) - y^{(i)}\right)^2 = \frac{1}{2m}\sum_{i=1}^{m}\left(\theta_1 x^{(i)} - y^{(i)}\right)^2 = \frac{1}{2m}\left(0^2 + 0^2 + 0^2\right) = 0, \qquad J(1) = 0$$
$\theta_1 = 0.5$: the hypothesis $h_\theta(x) = 0.5x$ predicts $h_\theta(x^{(i)})$ below each $y^{(i)}$.

[Figure: left panel — $h_\theta(x)$ vs. $x$ (for fixed $\theta_1 = 0.5$, a function of x); right panel — the corresponding point on $J(\theta_1)$]

$$J(0.5) = \frac{1}{2m}\left[(0.5 - 1)^2 + (1 - 2)^2 + (1.5 - 3)^2\right] = \frac{3.5}{2 \times 3} = \frac{3.5}{6} \approx 0.58$$
$\theta_1 = 0$: the hypothesis $h_\theta(x) = 0$ is a horizontal line through the origin.

[Figure: left panel — the hypotheses for $\theta_1 = 1$, $\theta_1 = 0.5$, and $\theta_1 = 0$; right panel — the corresponding points on the $J(\theta_1)$ curve over $-0.5 \le \theta_1 \le 2.5$]

$$J(0) = \frac{1}{2m}\left[1^2 + 2^2 + 3^2\right] = \frac{1}{6} \times 14 \approx 2.3$$

Learning amounts to finding $\theta_1$ such that we minimize $J(\theta_1)$.
Best $\theta_1$ (and best h)
• For this example, the best $\theta_1 = 1$, where $J(\theta_1)$ reaches its minimum value of 0.

[Figure: $J(\theta_1)$ plotted over $-0.5 \le \theta_1 \le 2.5$; the minimum is at $\theta_1 = 1$]
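The worked values above are easy to check numerically. Here is a small self-contained sketch (Python/NumPy assumed) evaluating $J(\theta_1)$ on the toy set (1,1), (2,2), (3,3) and scanning a grid for the minimizer.

```python
import numpy as np

x = np.array([1.0, 2.0, 3.0])
y = np.array([1.0, 2.0, 3.0])
m = len(x)

def J(theta1):
    """Cost for h_theta(x) = theta1 * x (theta0 fixed at 0)."""
    return np.sum((theta1 * x - y) ** 2) / (2 * m)

for t in (1.0, 0.5, 0.0):
    print(t, J(t))        # 1.0 -> 0.0,  0.5 -> ~0.583,  0.0 -> ~2.333

# Scanning a grid of theta1 values confirms the minimum sits at theta1 = 1
grid = np.linspace(-0.5, 2.5, 301)
print(grid[np.argmin([J(t) for t in grid])])   # ~1.0
```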
Behavior of Cost Function
with multiple parameters
Visualization of the cost function when considering both $\theta_0$ and $\theta_1$
Hypothesis: $h_\theta(x) = \theta_0 + \theta_1 x$
Parameters: $\theta_0, \theta_1$
Cost Function: $J(\theta_0, \theta_1) = \frac{1}{2m}\sum_{i=1}^{m}\left(h_\theta(x^{(i)}) - y^{(i)}\right)^2$
Goal: $\min_{\theta_0, \theta_1} J(\theta_0, \theta_1)$
Cost Function Visualization for two parameters
• A particular set of $(\theta_0, \theta_1)$ values corresponds to a point in the 3D surface plot of $J(\theta_0, \theta_1)$ or on one of its contour plots.

[Figure: left panel — $h(x)$ for fixed $(\theta_0, \theta_1)$, a function of x; right panel — $J(\theta_0, \theta_1)$ as a function of the parameters, shown as a 3D surface / contour plot]
[Figure: left panel — $h(x) = 360 + 0 \cdot x$ (for fixed $\theta_0 = 360$, $\theta_1 = 0$, a function of x); right panel — the corresponding point on the contour plot of $J(\theta_0, \theta_1)$]
[Figures: two further examples pairing an $h(x)$ on the left with the corresponding point on the contour plot of $J(\theta_0, \theta_1)$ on the right]
Gradient Descent
Intuitions and Math Description
Overview of Gradient Descent
• We need an algorithm to efficiently find the values of the $\theta_j$'s that optimize the cost function $J$.
• Solution: the gradient descent algorithm to minimize a cost function $J$.
• Approach:
  • Start with initial (possibly random) choices of $\theta_0$ and $\theta_1$; a common choice is to set them both to 0.
  • Keep changing $\theta_0$ and $\theta_1$ to reduce $J(\theta_0, \theta_1)$,
  • until we hopefully get to a minimum.
Idea behind Gradient Descent
• Consider the figure with two hills, and imagine standing at a point on one of the hills.
• The idea of gradient descent is that you look around you 360 degrees and ask yourself: in what direction should I move to go downhill the fastest?
• Starting from a given point, you may end up in one local minimum (one valley bottom).
• If you start from another point, you may end up in another local minimum (another valley bottom).
Mathematical Description

Gradient descent algorithm:
repeat until convergence {
    $\theta_j := \theta_j - \alpha \frac{\partial}{\partial \theta_j} J(\theta_0, \theta_1)$   (for $j = 0$ and $j = 1$)
}

Correct: simultaneous update
    $temp_0 := \theta_0 - \alpha \frac{\partial}{\partial \theta_0} J(\theta_0, \theta_1)$
    $temp_1 := \theta_1 - \alpha \frac{\partial}{\partial \theta_1} J(\theta_0, \theta_1)$
    $\theta_0 := temp_0$
    $\theta_1 := temp_1$

• The notation ":=" means we assign the right-hand side to the left-hand side; it is similar to "=" in programming.
• $\alpha$ is called the "learning rate". It controls the size of the step we take on each iteration of steepest descent.
• The derivative term is the partial derivative of $J$ with respect to $\theta_j$ (this requires a calculus background).
• Note the subtlety in the algorithm: the update must be SIMULTANEOUS. If you do not do a simultaneous update, you would probably be implementing a different algorithm with different properties.
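To make the simultaneous-update subtlety concrete, here is a minimal Python sketch (names and the example cost are illustrative, not from the slides) of one gradient descent step, given functions that return the two partial derivatives; both temps are computed before either parameter is overwritten.

```python
def gradient_descent_step(theta0, theta1, alpha, dJ_dtheta0, dJ_dtheta1):
    """One simultaneous update of (theta0, theta1) with learning rate alpha."""
    # Evaluate BOTH partial derivatives at the current (theta0, theta1) ...
    temp0 = theta0 - alpha * dJ_dtheta0(theta0, theta1)
    temp1 = theta1 - alpha * dJ_dtheta1(theta0, theta1)
    # ... and only then assign, so the second derivative is not evaluated
    # at an already-updated theta0 (that would be a different algorithm).
    return temp0, temp1

# Example on J(theta0, theta1) = theta0^2 + theta1^2 (partials: 2*theta0, 2*theta1)
t0, t1 = 3.0, -2.0
for _ in range(50):
    t0, t1 = gradient_descent_step(t0, t1, 0.1,
                                   lambda a, b: 2 * a,
                                   lambda a, b: 2 * b)
print(t0, t1)   # both shrink towards the minimum at (0, 0)
```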
Consider the case of one variable
• To convey the intuition, we will use one parameter: minimize (over $\theta_1$) $J(\theta_1)$.
• Notice that the partial derivative is now just the ordinary derivative, and the derivative with respect to $\theta_1$ is the slope of the tangent line through that point on the curve.
• With a parabolic shape, on the right side of the minimum the slope is positive; on the left side of the minimum the slope is negative.

[Figure: $J(\theta_1)$ vs. $\theta_1$ over $-0.5 \le \theta_1 \le 2.5$, with tangent lines illustrating the sign of the slope]
Behavior
• On the right side of the minimum, we update the parameter with $-\alpha \times$ (the positive slope), which is a negative number, so we move left (backwards) towards the minimum; this is the right thing to do.
• On the left side, the slope is negative, so $-\alpha \times$ (the negative slope) is positive: we move right, again towards the minimum.
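As a quick check of this sign behavior, using the toy data (1,1), (2,2), (3,3) from earlier (this derivation is mine, not from the slides), the one-parameter cost is a parabola:

$$J(\theta_1) = \frac{1}{2 \cdot 3}\sum_{i=1}^{3}\left(\theta_1 x^{(i)} - y^{(i)}\right)^2 = \frac{14}{6}\left(\theta_1 - 1\right)^2, \qquad \frac{dJ}{d\theta_1} = \frac{14}{3}\left(\theta_1 - 1\right).$$

At $\theta_1 = 2$ the slope is $+\tfrac{14}{3}$, so the update $\theta_1 := \theta_1 - \alpha \cdot \tfrac{14}{3}$ moves left towards the minimum at $\theta_1 = 1$; at $\theta_1 = 0$ the slope is $-\tfrac{14}{3}$, so the update moves right, again towards the minimum.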
How does the cost function vary with the iterations, for different learning rates?
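A minimal numerical sketch of this question (Python/NumPy assumed; the specific learning rates are illustrative): run one-parameter gradient descent on the toy data for several values of $\alpha$ and watch $J(\theta_1)$ over the iterations. A small $\alpha$ decreases $J$ slowly, a moderate $\alpha$ converges quickly, and a too-large $\alpha$ makes $J$ blow up.

```python
import numpy as np

x = np.array([1.0, 2.0, 3.0])
y = np.array([1.0, 2.0, 3.0])
m = len(x)

def J(t1):
    return np.sum((t1 * x - y) ** 2) / (2 * m)

def dJ(t1):
    # derivative of J with respect to theta1
    return np.sum((t1 * x - y) * x) / m

for alpha in (0.01, 0.2, 0.5):        # small, moderate, too large
    t1 = 0.0
    history = []
    for _ in range(20):
        t1 = t1 - alpha * dJ(t1)
        history.append(J(t1))
    print(alpha, [round(v, 3) for v in history[:5]], "...", round(history[-1], 3))
# alpha = 0.01: J decreases slowly; alpha = 0.2: J drops to ~0 in a few steps;
# alpha = 0.5: J grows each iteration (gradient descent diverges).
```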
Applying Gradient Descent to
Linear Regression
Applying GD to Linear Regression
• Putting the gradient descent algorithm together with the cost function for linear regression gives us our first ML algorithm.
• The cost function is $J(\theta_0, \theta_1) = \frac{1}{2m}\sum_{\text{training set}}\left(h(x^{(i)}) - y^{(i)}\right)^2$.
Applying gradient descent, we need to find the partial derivatives:

$$\frac{\partial}{\partial \theta_j} J(\theta_0, \theta_1) = \frac{\partial}{\partial \theta_j}\, \frac{1}{2m}\sum_{i=1}^{m}\left(h_\theta(x^{(i)}) - y^{(i)}\right)^2 = \frac{\partial}{\partial \theta_j}\, \frac{1}{2m}\sum_{i=1}^{m}\left(\theta_0 + \theta_1 x^{(i)} - y^{(i)}\right)^2$$

For $\theta_0$ (or $j = 0$): $\quad \dfrac{\partial}{\partial \theta_0} J(\theta_0, \theta_1) = \dfrac{1}{m}\sum_{i=1}^{m}\left(h_\theta(x^{(i)}) - y^{(i)}\right)$

For $\theta_1$ (or $j = 1$): $\quad \dfrac{\partial}{\partial \theta_1} J(\theta_0, \theta_1) = \dfrac{1}{m}\sum_{i=1}^{m}\left(h_\theta(x^{(i)}) - y^{(i)}\right) \cdot x^{(i)}$
Gradient descent algorithm for linear regression:
repeat until convergence {
    $\theta_0 := \theta_0 - \alpha \dfrac{1}{m}\sum_{i=1}^{m}\left(h_\theta(x^{(i)}) - y^{(i)}\right)$
    $\theta_1 := \theta_1 - \alpha \dfrac{1}{m}\sum_{i=1}^{m}\left(h_\theta(x^{(i)}) - y^{(i)}\right) x^{(i)}$
}   (update $\theta_0$ and $\theta_1$ simultaneously)
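Putting the update rules above together, here is a minimal end-to-end sketch in Python (NumPy assumed; the hyperparameters are illustrative). It is shown on the small (1,1), (2,2), (3,3) dataset, where the parameters should approach $\theta_0 \approx 0$, $\theta_1 \approx 1$; for the raw housing data the feature would need scaling (or a much smaller $\alpha$) for gradient descent to behave well.

```python
import numpy as np

def gradient_descent(x, y, alpha=0.1, num_iters=5000):
    """Batch gradient descent for h_theta(x) = theta0 + theta1 * x."""
    m = len(x)
    theta0, theta1 = 0.0, 0.0
    for _ in range(num_iters):
        errors = theta0 + theta1 * x - y          # h_theta(x^(i)) - y^(i)
        grad0 = np.sum(errors) / m                # dJ/dtheta0
        grad1 = np.sum(errors * x) / m            # dJ/dtheta1
        # simultaneous update of both parameters
        theta0, theta1 = theta0 - alpha * grad0, theta1 - alpha * grad1
    return theta0, theta1

x = np.array([1.0, 2.0, 3.0])
y = np.array([1.0, 2.0, 3.0])
t0, t1 = gradient_descent(x, y)
print(t0, t1)   # close to (0.0, 1.0), the line that fits the data exactly
```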