
Linear Regression and Gradient Descent
Mariette Awad

Slide sources for this set of slides: Stanford Intro to ML course


Lesson Objectives
• How to model ML problems (using a hypothesis function)
• How to learn ML models (using a cost function and an objective function)
• Learn gradient descent (the most popular algorithm) for solving the optimization problem in the objective function and deriving the ML model parameters
• Learn more about regression, applying the above key concepts with linear regression:
  • Learn what "regression" models are
  • Learn what linear regression models are
  • Illustrate how parameters are used to represent a linear model
  • Explain what a hypothesis function is and how it is learned
  • Explain what a cost function is and how it is used to learn the hypothesis
  • Illustrate how the cost function changes as the parameters change
  • Explain the gradient descent algorithm
Outline

• Intro to Linear Models


• Regression Model Representation / Hypothesis Function
• Cost Function and Objective Function for Regression
• Behavior of Cost Function with one parameter
• Behavior of Cost Function with multiple parameters
• Gradient Descent: Intuition and Mathematical Description
Linear Models
Example 1 - Predicting House Prices

• Predict house prices in Portland, OR, given prices (in 1000s of dollars) per size.
• [Figure: scatter plot of Price ($, in 1000's) versus Size (feet²), for sizes from 0 to 3000 feet² and prices from 0 to 500.]
• A learning algorithm can fit a straight line or a quadratic. Later we will see how to choose.
• We have given the algorithm the "right answers" (the actual prices), and the task of the algorithm is to find more "right answers".
• The house prices problem is called regression: predict continuous valued numbers.

Training data:
Size in feet² (x)    Price ($) in 1000's (y)
2104                 460
1416                 232
1534                 315
852                  178
…                    …
Notations

• m = Number of training examples


• x’s = “input” variable / features
• y’s = “output” variable / “target” variable
Supervised Learning
Learning Goal
[Diagram: Training Set → Learning Algorithm → hypothesis h; Size of house → h → Estimated price]

Process:
• Feed the training data to the learning algorithm.
• The output of the learning algorithm is a function h, called a hypothesis (a name used historically; it may not be the best name).
• h is a function that maps the x-values to the y's.
• The hypothesis function takes the size of the house as input and produces the price of the house.
Model Representation
How do we represent h?
[Diagram: Training Set → Learning Algorithm → h; Size of house → h → Estimated price]

• For now (to start simple), we will use the following representation:
  hθ(x) = θ0 + θ1x
• The θi's (θ0 and θ1) are the parameters of the model.
• Shorthand: h(x).
• In this case, h is predicting that y is a linear function of x.
• Why a linear function? Just to start simple. Later we will do more complicated models.
• This is linear regression with one variable, also called univariate linear regression.
Training Set:
Size in feet² (x)    Price ($) in 1000's (y)
2104                 460
1416                 232
1534                 315
852                  178
…                    …

Hypothesis:   hθ(x) = θ0 + θ1x
θ's: the parameters
How do we choose the θ's?
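To make the hypothesis concrete, here is a minimal Python sketch (not part of the original slides) that evaluates hθ(x) = θ0 + θ1x on the house sizes from the table above; the parameter values used are arbitrary placeholders.

def h(x, theta0, theta1):
    # Univariate linear hypothesis: h_theta(x) = theta0 + theta1 * x
    return theta0 + theta1 * x

sizes = [2104, 1416, 1534, 852]   # size in feet^2 (x)
prices = [460, 232, 315, 178]     # price in $1000's (y)

# Arbitrary example parameters, only to illustrate the mapping from x to a predicted y.
theta0, theta1 = 50.0, 0.15
for x, y in zip(sizes, prices):
    print(f"size={x:4d}  predicted={h(x, theta0, theta1):7.1f}  actual={y}")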
Cost Function
How do we choose the θ's? How do we pick the best "h"?

[Figure: scatter plot of price versus size (0 to 3000 feet²) with several candidate straight-line fits.]

We need a cost function to assess the "goodness" of a candidate function.

Consider the following three cases:

[Figure: three small panels showing the same (x, y) points fit with three different straight lines.]
Learning Objective

• We want to come up with values for the θi's so that the model fits the data well.

[Figure: scatter of (x, y) training points with a fitted straight line.]

Idea: Choose θ0, θ1 so that hθ(x) is close to y for our training examples (x, y).
Formal Description of Cost Function (1/2)
• Formally, we want to solve the minimization problem:
  min over (θ0, θ1) of  (1/(2m)) Σ_{i=1}^{m} (hθ(x^(i)) − y^(i))²   (sum over the training set)
• In other words, find the values of θ0 and θ1 so that the sum is minimized.
• This is our objective function for linear regression.
• By convention, we define a cost function
  J(θ0, θ1) = (1/(2m)) Σ_{i=1}^{m} (hθ(x^(i)) − y^(i))²
• Objective function: we want to minimize J(θ0, θ1) over θ0 and θ1,
  min over (θ0, θ1) of J(θ0, θ1)
• This particular choice of cost function is called the squared error function. It is very common for regression problems.
Formal Description (2/2)
Hypothesis:      hθ(x) = θ0 + θ1x
Parameters:      θ0, θ1
Cost Function:   J(θ0, θ1) = (1/(2m)) Σ_{i=1}^{m} (hθ(x^(i)) − y^(i))²
Goal/Objective Function:   minimize J(θ0, θ1) over (θ0, θ1)
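As a quick sanity check on the definition, here is a small Python sketch (not from the slides) that evaluates the squared-error cost J(θ0, θ1) on the house-price table above; the parameter values passed in are arbitrary.

def compute_cost(x, y, theta0, theta1):
    # J(theta0, theta1) = (1/(2m)) * sum((h(x_i) - y_i)^2)
    m = len(x)
    return sum((theta0 + theta1 * xi - yi) ** 2 for xi, yi in zip(x, y)) / (2 * m)

sizes = [2104, 1416, 1534, 852]   # x: size in feet^2
prices = [460, 232, 315, 178]     # y: price in $1000's

# Arbitrary parameter values, just to show how J is evaluated.
print(compute_cost(sizes, prices, 0.0, 0.2))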
Behavior of Cost Function
Two key things to consider
• Two key functions we want to understand:
  • the hypothesis function (h) and
  • the cost function (J)
• To visualize the cost function, we simplify the hypothesis to hθ(x) = θ1x (i.e., we set θ0 = 0).
• So the optimization objective is to minimize J(θ1) over θ1.

Simplified
Hypothesis:      hθ(x) = θ1x
Parameters:      θ1
Cost Function:   J(θ1) = (1/(2m)) Σ_{i=1}^{m} (θ1x^(i) − y^(i))²
Goal:            minimize J(θ1) over θ1
Illustration & Intuition
• Consider the case where the training data consists of the following (x, y) values: (1,1), (2,2), and (3,3).
• Let's check the two key parts (h and J).
• Consider different possible parameter values (for h) and see the impact on J.
Behavior of Cost Function with one parameter
Case study with different θ1 values:
hθ(x) = x (θ1 = 1); hθ(x) = 0.5x (θ1 = 0.5); and hθ(x) = 0 (θ1 = 0)

[Figure, left panel: hθ(x) as a function of x for a fixed θ1, together with the data points. Right panel: J(θ1) as a function of the parameter θ1.]

Case θ1 = 1: the line hθ(x) = x passes exactly through the three points, so hθ(x^(i)) = y^(i) for every example, and

J(θ1) = (1/(2m)) Σ_{i=1}^{m} (hθ(x^(i)) − y^(i))² = (1/(2m)) Σ_{i=1}^{m} (θ1x^(i) − y^(i))² = (1/(2m)) (0² + 0² + 0²) = 0,  so J(1) = 0.
[Figure, left panel: the line hθ(x) = 0.5x plotted with the data points. Right panel: J(θ1) with the value at θ1 = 0.5 marked.]

Case θ1 = 0.5: the predictions are hθ(1) = 0.5, hθ(2) = 1, and hθ(3) = 1.5, so

J(0.5) = (1/(2m)) [(0.5 − 1)² + (1 − 2)² + (1.5 − 3)²] = (1/(2×3)) × 3.5 = 3.5/6 ≈ 0.58
[Figure, left panel: the data points with the lines for θ1 = 1, θ1 = 0.5, θ1 = 0, and θ1 = −0.5. Right panel: J(θ1) with the corresponding cost values marked.]

Case θ1 = 0: all predictions are 0, so

J(0) = (1/(2m)) [1² + 2² + 3²] = (1/6) × 14 ≈ 2.3

We are looking for the θ1 that minimizes J(θ1).
Best θ1 (h)
• For this example, the best θ1 = 1.

[Figure: the bowl-shaped curve J(θ1), with its minimum at θ1 = 1.]
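The slide's numbers are easy to reproduce; the following small Python sketch (not from the slides) evaluates J(θ1) on the toy data set (1,1), (2,2), (3,3) for the three θ1 values discussed above.

def cost_one_param(x, y, theta1):
    # J(theta1) = (1/(2m)) * sum((theta1 * x_i - y_i)^2) for h(x) = theta1 * x
    m = len(x)
    return sum((theta1 * xi - yi) ** 2 for xi, yi in zip(x, y)) / (2 * m)

x = [1, 2, 3]
y = [1, 2, 3]
for theta1 in (1.0, 0.5, 0.0):
    print(f"J({theta1}) = {cost_one_param(x, y, theta1):.2f}")
# Prints J(1.0) = 0.00, J(0.5) = 0.58, J(0.0) = 2.33, matching the slides.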
Behavior of Cost Function with multiple parameters
Visualization of the cost function when considering both θ0 and θ1

Hypothesis:      hθ(x) = θ0 + θ1x
Parameters:      θ0, θ1
Cost Function:   J(θ0, θ1) = (1/(2m)) Σ_{i=1}^{m} (hθ(x^(i)) − y^(i))²
Goal:            minimize J(θ0, θ1) over (θ0, θ1)
Cost Function Visualization for two parameters
• A particular pair of (θ0, θ1) values corresponds to a point on the 3D surface plot of J(θ0, θ1), or to a point on one of the contour plots.

[Figure, left panel: a candidate line hθ(x) over the housing data (for a fixed (θ0, θ1)). Right panel: the bowl-shaped 3D surface J(θ0, θ1) over the (θ0, θ1) plane.]

[Figure, left panel: the flat line hθ(x) = 360 + 0·x, i.e. θ0 = 360, θ1 = 0, over the housing data. Right panel: the corresponding point on the contour plot of J(θ0, θ1).]

[Figure: two more examples pairing a line hθ(x) (for a fixed (θ0, θ1)) with the corresponding point on the contour plot of J(θ0, θ1).]
Gradient Descent
Intuitions and Math Description
Overview of Gradient Descent
• We need an algorithm to efficiently find the values of the θi's that minimize the cost function J(θ0, θ1).
• Solution: the gradient descent algorithm, which minimizes a cost function J.
• Approach:
  • Start with some initial guesses for θ0 and θ1 (a common choice is to set them both to 0).
  • Keep changing θ0 and θ1 to reduce J(θ0, θ1),
  • until we hopefully end up at a minimum.
Idea behind Gradient Descent
• Consider the figure with two hills, and imagine standing at a point on one of the hills.
• The idea of gradient descent is that you look around you 360 degrees and ask yourself: in which direction should I take a step to go downhill the fastest?
• Starting from a given point, you may end up in one local minimum (one valley bottom).
• If you start from another point, you may end up in another local minimum (another valley bottom).
Mathematical Description

Gradient descent algorithm:
repeat until convergence {
    θj := θj − α (∂/∂θj) J(θ0, θ1)    (for j = 0 and j = 1)
}

Correct: simultaneous update
    temp0 := θ0 − α (∂/∂θ0) J(θ0, θ1)
    temp1 := θ1 − α (∂/∂θ1) J(θ0, θ1)
    θ0 := temp0
    θ1 := temp1

Notes:
• The notation ":=" means we are assigning the right-hand side to the left-hand side. This is similar to "=" in programming.
• α (alpha) is called the "learning rate". It controls the size of the step we take each time in the direction of steepest descent.
• The derivative term is a partial derivative (it needs some calculus background).
• The subtlety in the algorithm is the SIMULTANEOUS update. If you do not do a simultaneous update, you would probably be implementing a different algorithm with different properties.
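To make the simultaneous update concrete, here is a minimal Python sketch (an illustration, not code from the course); grad_J stands for a function returning the two partial derivatives, whose formulas for linear regression are derived later in these slides.

def gradient_descent_step(theta0, theta1, grad_J, alpha):
    # One gradient descent step with the SIMULTANEOUS update from the slide.
    # grad_J(theta0, theta1) returns (dJ/dtheta0, dJ/dtheta1) at the current parameters.
    g0, g1 = grad_J(theta0, theta1)   # both partials use the OLD parameter values
    temp0 = theta0 - alpha * g0
    temp1 = theta1 - alpha * g1
    return temp0, temp1               # theta0 and theta1 change together

Note the design point: both partial derivatives are evaluated before either parameter is overwritten; updating θ0 first and then evaluating the θ1 derivative at the new θ0 would implement a different algorithm.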
Consider the case of one variable
• To convey the intuition, we will use one parameter: minimize J(θ1) over θ1.
• Notice that now the partial derivative is just the ordinary derivative, and the derivative with respect to θ1 is the slope of the tangent line through that point of the curve J(θ1).
• With a parabolic (bowl) shape, on the right side of the minimum the slope is positive; on the left side of the minimum the slope is negative.

[Figure: the bowl-shaped curve J(θ1) with tangent lines on either side of the minimum.]

Behavior
• On the right side, we update the parameter by −α × (a positive slope), i.e., a negative number. So we move left (backwards) towards the minimum, which is the right thing to do.
• On the left side, we update the parameter by −α × (a negative slope), i.e., a positive number. So we move right (forward) towards the minimum, which is again the right thing to do.
• So in both cases, gradient descent moves in the correct direction.
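For the toy data set (1,1), (2,2), (3,3) the sign argument can be checked with numbers; the derivative formula used here, dJ/dθ1 = (1/m) Σ_{i=1}^{m} (θ1x^(i) − y^(i))·x^(i), is the one derived for linear regression later in these slides.

At θ1 = 2 (right of the minimum): dJ/dθ1 = (1/3)[(2−1)·1 + (4−2)·2 + (6−3)·3] = 14/3 > 0, so the update θ1 := θ1 − α·(14/3) decreases θ1 back towards 1.
At θ1 = 0 (left of the minimum): dJ/dθ1 = (1/3)[(0−1)·1 + (0−2)·2 + (0−3)·3] = −14/3 < 0, so the update increases θ1 towards 1.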
Learning Rate
About the learning rate (α):
• If the learning rate is too small, gradient descent can be slow.
• If the learning rate is too large, gradient descent may overshoot the minimum. It may even fail to converge, or even diverge.

What if the parameter θ is already at a local minimum; what would gradient descent do?
• It turns out that at a local minimum the derivative is zero, so the parameter is left unchanged (which is what we want).

• Note that even if α is fixed, gradient descent may still converge, because the derivatives get smaller as we get closer to the minimum, so we do not overshoot it. In effect, we are taking smaller and smaller steps. As we approach a local minimum, gradient descent automatically takes smaller steps, and there is no need to reduce α.
Visualizing Learning Rate

[Figure: how the cost function varies with the number of iterations, for different learning rates.]
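The effect of the learning rate is easy to see on the one-parameter toy problem. The following Python sketch (not from the slides) runs gradient descent on J(θ1) for the data (1,1), (2,2), (3,3) with a small and a large learning rate; the specific α values are arbitrary choices.

def run_gd(theta1, alpha, steps, x=(1, 2, 3), y=(1, 2, 3)):
    # Gradient descent on J(theta1) = (1/(2m)) * sum((theta1*x_i - y_i)^2)
    m = len(x)
    costs = []
    for _ in range(steps):
        grad = sum((theta1 * xi - yi) * xi for xi, yi in zip(x, y)) / m
        theta1 = theta1 - alpha * grad
        costs.append(sum((theta1 * xi - yi) ** 2 for xi, yi in zip(x, y)) / (2 * m))
    return theta1, costs

# Small learning rate: the cost decreases steadily (but slowly).
print(run_gd(theta1=0.0, alpha=0.05, steps=10)[1])
# Large learning rate: each step overshoots, and the cost grows instead of shrinking.
print(run_gd(theta1=0.0, alpha=0.5, steps=10)[1])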
Applying Gradient Descent to Linear Regression
Applying GD to Linear Regression
• We now put the gradient descent algorithm together with the cost function for linear regression to get our first ML algorithm.
• We want to apply gradient descent to our minimum squared-error objective:

  min over (θ0, θ1) of  (1/(2m)) Σ_{i=1}^{m} (hθ(x^(i)) − y^(i))²

• with the cost function J(θ0, θ1) = (1/(2m)) Σ_{i=1}^{m} (hθ(x^(i)) − y^(i))².
To apply gradient descent, we need to find the partial derivatives:

(∂/∂θj) J(θ0, θ1) = (∂/∂θj) (1/(2m)) Σ_{i=1}^{m} (hθ(x^(i)) − y^(i))²
                  = (∂/∂θj) (1/(2m)) Σ_{i=1}^{m} (θ0 + θ1x^(i) − y^(i))²

For θ0, i.e. j = 0:   (∂/∂θ0) J(θ0, θ1) = (1/m) Σ_{i=1}^{m} (hθ(x^(i)) − y^(i))
For θ1, i.e. j = 1:   (∂/∂θ1) J(θ0, θ1) = (1/m) Σ_{i=1}^{m} (hθ(x^(i)) − y^(i)) · x^(i)
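For completeness (this step is not spelled out on the slide), the j = 1 case follows from the chain rule; the j = 0 case is identical except for the trailing factor x^(i):

(∂/∂θ1) (1/(2m)) Σ_{i=1}^{m} (θ0 + θ1x^(i) − y^(i))² = (1/(2m)) Σ_{i=1}^{m} 2(θ0 + θ1x^(i) − y^(i)) · x^(i) = (1/m) Σ_{i=1}^{m} (hθ(x^(i)) − y^(i)) · x^(i)

The factor of 2 from differentiating the square cancels the 1/2 in 1/(2m), which is the main reason the 1/2 convention in the cost function is convenient.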
Gradient descent algorithm for linear regression:

repeat until convergence {
    θ0 := θ0 − α (1/m) Σ_{i=1}^{m} (hθ(x^(i)) − y^(i))
    θ1 := θ1 − α (1/m) Σ_{i=1}^{m} (hθ(x^(i)) − y^(i)) · x^(i)
}
(update θ0 and θ1 simultaneously)
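Putting everything together, here is a compact Python sketch of batch gradient descent for univariate linear regression (an illustration under the slides' setup, not code from the course), run on the toy data set (1,1), (2,2), (3,3); the learning rate and iteration count are arbitrary choices.

def fit_linear_regression(x, y, alpha=0.1, iterations=1000):
    # Batch gradient descent for h(x) = theta0 + theta1 * x with squared-error cost.
    m = len(x)
    theta0, theta1 = 0.0, 0.0                                  # common starting point: zeros
    for _ in range(iterations):
        errors = [theta0 + theta1 * xi - yi for xi, yi in zip(x, y)]
        grad0 = sum(errors) / m                                # dJ/dtheta0
        grad1 = sum(e * xi for e, xi in zip(errors, x)) / m    # dJ/dtheta1
        # Simultaneous update: both gradients were computed with the old parameters.
        theta0, theta1 = theta0 - alpha * grad0, theta1 - alpha * grad1
    return theta0, theta1

theta0, theta1 = fit_linear_regression([1, 2, 3], [1, 2, 3])
print(theta0, theta1)   # approaches theta0 ≈ 0, theta1 ≈ 1 for this data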
