The document discusses gradient descent for linear regression. It introduces linear regression as an optimization problem to minimize the mean squared error between predicted and actual output values. Gradient descent is then presented as an algorithm to optimize the parameters in linear regression. It works by iteratively updating the parameters in the direction of the negative gradient of the cost function until convergence. The document contrasts batch, mini-batch, and stochastic gradient descent, noting that stochastic gradient descent uses a single randomly selected training example to approximate the gradient, making it suitable for large datasets.


Class 22

Gradient Descent in ML
SLIDES FROM ANDREW NG
Let’s Learn a bit of ML
Linear Regression
Start with a (grossly simplified) example: predicting housing prices in Hyderabad.

We are given some data in the form of a table, with x as the size of a house in sqft and y as its price in INR Lakhs.

    Size x (sqft)    Price y (INR Lakhs)
    1100             199
    1400             245
    1425             319
    1550             240
    1600             312
    1700             279
    1700             310
    1875             380
    2350             405
    2450             540

Notation:
m = number of training examples
x = input variable
y = output variable
(x, y) = one training example
(x^(i), y^(i)) = the i-th training example

The task of ML is to learn the hidden relationship between x, y and many other hidden features, and to be able to predict correctly.

If I want to buy a property of 2000 sqft, how much is the price?
Given this data, a friend has a house of 750 sqft - how much can she expect to get?
Predicting Housing prices in Hyderabad

Assume that x and y are related in a linear fashion: y = θ₀ + θ₁x.

The ML task is to find the best θ₀ and θ₁ from the given table.

What do we mean by the best θ₀ and θ₁?

Suppose the ML system tells us that the best values are θ₀ = 1 and θ₁ = 0.2.
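With these values the hypothesis can answer the questions posed earlier. A minimal sketch, assuming the illustrative parameters θ₀ = 1 and θ₁ = 0.2 from the slide:

```python
# Hypothesis: y = theta0 + theta1 * x, with the illustrative values from the slide
theta0, theta1 = 1.0, 0.2

def predict(size_sqft):
    """Predicted price in INR Lakhs for a house of the given size."""
    return theta0 + theta1 * size_sqft

print(predict(2000))  # 401.0 Lakhs for the 2000 sqft property
print(predict(750))   # 151.0 Lakhs for the friend's 750 sqft house
```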
Let's formulate it:

$$\underset{\theta_0,\,\theta_1}{\text{Minimize}}\;\; \frac{1}{2m}\sum_{i=1}^{m}\left(y^{(i)} - \left(\theta_0 + \theta_1 x^{(i)}\right)\right)^2$$

For any given θ₀, θ₁ and x we predict ŷ = θ₀ + θ₁x, and compute the squared error (ŷ − y)² for each observation.

For θ₀ = 1 and θ₁ = 0.2:

    Size x (sqft)   Predicted y   Actual y   Sq Error
    1100            221           199        484
    1400            281           245        1296
    1425            286           319        1089
    1550            311           240        5041
    1600            321           312        81
    1700            341           279        3844
    1700            341           310        961
    1875            376           380        16
    2350            471           405        4356
    2450            491           540        2401

The Mean Square Error is taken over all observations.
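As a check on the table, a short NumPy sketch that recomputes the squared errors and the mean squared error for these illustrative parameters (the 1/(2m) scaling of the cost J is applied separately):

```python
import numpy as np

# Housing data from the table: size in sqft, price in INR Lakhs
x = np.array([1100, 1400, 1425, 1550, 1600, 1700, 1700, 1875, 2350, 2450], dtype=float)
y = np.array([199, 245, 319, 240, 312, 279, 310, 380, 405, 540], dtype=float)

theta0, theta1 = 1.0, 0.2          # illustrative parameters from the slide
pred = theta0 + theta1 * x         # predicted prices
sq_err = (pred - y) ** 2           # per-observation squared errors

print(sq_err)                      # 484, 1296, 1089, 5041, 81, 3844, 961, 16, 4356, 2401
print(sq_err.mean())               # mean squared error = 1956.9
print(sq_err.sum() / (2 * len(x))) # cost J(theta0, theta1) = 978.45
```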
Linear Regression → Unconstrained Optimization

$$\underset{\theta_0,\,\theta_1}{\text{Minimize}}\;\; J(\theta_0, \theta_1) = \frac{1}{2m}\sum_{i=1}^{m}\left(y^{(i)} - \left(\theta_0 + \theta_1 x^{(i)}\right)\right)^2$$

For Linear Regression, J is convex.
SLIDES FROM ANDREW NG


Contour Plots

SLIDES FROM ANDREW NG


Graphical Interpretation

[Scatter plot of the training data: Size in sqft on the x-axis (500 to 2500), Price in INR Lakhs on the y-axis (100 to 400+), with the fitted line y = θ₀ + θ₁x. For each point, the error is the vertical gap between the actual price and the predicted price on the line.]
Gradient descent for linear regression

Repeat until convergence
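The update equations on the original slide did not survive extraction; for this cost J they are the standard simultaneous updates, with learning rate α:

$$\theta_0 := \theta_0 - \alpha\,\frac{1}{m}\sum_{i=1}^{m}\left(\left(\theta_0 + \theta_1 x^{(i)}\right) - y^{(i)}\right)$$

$$\theta_1 := \theta_1 - \alpha\,\frac{1}{m}\sum_{i=1}^{m}\left(\left(\theta_0 + \theta_1 x^{(i)}\right) - y^{(i)}\right)x^{(i)}$$

Both parameters are updated simultaneously on every iteration.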

SLIDES FROM ANDREW NG
[A sequence of slides from Andrew Ng's course shows the hypothesis h(x) for fixed θ₀, θ₁ (a function of x) side by side with the cost J(θ₀, θ₁) (a function of the parameters), tracing how successive gradient descent steps move across the contour plot of J toward its minimum while the fitted line improves.]

SLIDES FROM ANDREW NG

The ML problem can be much more complicated

1. Prices do not depend only on sqft, but also on
◦ Floor,
◦ Facing: east-facing, north-facing, lake-facing, park-facing
◦ Mode of payment
◦ Layout …

2. The relationship need not be linear

3. The error measure may be different from MSE

4. and so on.
Optimization problems faced in Data Science/Machine Learning are primarily finite-sum optimization problems.

Finite Sum Optimization

The objective function is a sum of per-example loss terms, and the goal is to minimize J. What is the gradient of such an objective function, and what does the GD step look like? (See the formulas below.)
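The formulas from the slide did not survive extraction; in standard finite-sum notation (n training examples, per-example losses L_i, learning rate α) they read:

$$J(\theta) = \frac{1}{n}\sum_{i=1}^{n} L_i(\theta), \qquad \nabla J(\theta) = \frac{1}{n}\sum_{i=1}^{n} \nabla L_i(\theta)$$

GD step: $\theta := \theta - \alpha\,\nabla J(\theta)$, which requires a full pass over all n examples.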
Linear Regression is a finite-sum optimization

$$\underset{\theta_0,\,\theta_1}{\text{Minimize}}\;\; J(\theta_0, \theta_1) = \frac{1}{2m}\sum_{i=1}^{m}\left(y^{(i)} - \left(\theta_0 + \theta_1 x^{(i)}\right)\right)^2$$

To compute the gradient ∇J, we need to enumerate all m training data points.
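A minimal sketch of that full-batch gradient computation, reusing the housing data above (the explicit loop is only to emphasize that every training example is visited):

```python
import numpy as np

x = np.array([1100, 1400, 1425, 1550, 1600, 1700, 1700, 1875, 2350, 2450], dtype=float)
y = np.array([199, 245, 319, 240, 312, 279, 310, 380, 405, 540], dtype=float)
m = len(x)

def full_gradient(theta0, theta1):
    """Gradient of J(theta0, theta1) = (1/2m) * sum_i (y_i - (theta0 + theta1*x_i))^2."""
    g0 = g1 = 0.0
    for i in range(m):                      # one term per training example
        err = (theta0 + theta1 * x[i]) - y[i]
        g0 += err
        g1 += err * x[i]
    return g0 / m, g1 / m                   # dJ/dtheta0, dJ/dtheta1

print(full_gradient(1.0, 0.2))              # gradient at the illustrative parameters
```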


Finite Sum Optimization: the GD step uses the full training set.

Computing the gradient can be very time consuming.

Approximating the gradient is still useful as long as it points in roughly the same direction as the true gradient.
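To make "roughly the same direction" concrete, a small illustrative sketch (not from the slides) that compares the full-batch gradient with a single-example gradient via their cosine similarity, repeating the data so the snippet is self-contained:

```python
import numpy as np

x = np.array([1100, 1400, 1425, 1550, 1600, 1700, 1700, 1875, 2350, 2450], dtype=float)
y = np.array([199, 245, 319, 240, 312, 279, 310, 380, 405, 540], dtype=float)
theta0, theta1 = 0.0, 0.0

def grad(xb, yb):
    """Gradient of the MSE cost on an arbitrary batch (xb, yb)."""
    err = (theta0 + theta1 * xb) - yb
    return np.array([err.mean(), (err * xb).mean()])

g_full = grad(x, y)                       # true (full-batch) gradient
i = np.random.default_rng(0).integers(len(x))
g_one = grad(x[i:i+1], y[i:i+1])          # noisy one-example estimate

cos = g_full @ g_one / (np.linalg.norm(g_full) * np.linalg.norm(g_one))
print(cos)                                # close to 1: the noisy gradient points the same way
```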

Stochastic Gradient Descent (SGD)

Stochastic gradient descent is a stochastic approximation of the gradient descent method for minimizing an objective function that is written as a sum of differentiable functions.

The word "stochastic" here refers to the fact that we acknowledge that we do not know the gradient precisely, but instead only know a noisy approximation to it.
Stochastic Gradient Descent (SGD)

BATCH: traditional GD. The gradient is computed over the entire set of training data.

MINI-BATCH GD: the gradient is computed over a subset of the training data.

STOCHASTIC GRADIENT DESCENT (SGD): the gradient is computed from a single randomly selected training example.

SGD is very effective in large-scale machine learning problems such as training deep neural networks on millions of images, topic models, reinforcement learning, or training of large-scale Gaussian process models.
$$\underset{\theta_0,\,\theta_1}{\text{Minimize}}\;\; J(\theta_0, \theta_1) = \frac{1}{2m}\sum_{i=1}^{m}\left(y^{(i)} - \left(\theta_0 + \theta_1 x^{(i)}\right)\right)^2$$

m: Number of training examples

Batch Gradient Descent
Repeat until convergence, computing the gradient on all m training examples.

Mini-Batch Gradient Descent
Choose a mini-batch size b. Repeat until convergence: randomly sample b data points and compute the gradient on them.

Stochastic Gradient Descent
Repeat until convergence, computing the gradient on a single randomly selected training example.
BATCH GD
Repeat until convergence:
• $g = \frac{1}{m}\left(\nabla L_1(\theta) + \nabla L_2(\theta) + \dots + \nabla L_m(\theta)\right)$
• $\theta = \theta - \alpha\, g$

MINIBATCH GD (with batch size b = 3)
Repeat for each minibatch until convergence:
Minibatch 1
• $g = \frac{1}{3}\left(\nabla L_1(\theta) + \nabla L_2(\theta) + \nabla L_3(\theta)\right)$
• $\theta = \theta - \alpha\, g$
Minibatch 2
• $g = \frac{1}{3}\left(\nabla L_4(\theta) + \nabla L_5(\theta) + \nabla L_6(\theta)\right)$
• $\theta = \theta - \alpha\, g$
…

Stochastic GD
Repeat until convergence:
Select i randomly
• $\theta = \theta - \alpha\, \nabla L_i(\theta)$
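A minimal NumPy sketch of all three variants on the housing data. The learning rate, epoch count, and seed are illustrative assumptions, not tuned values; with unscaled features a very small α is needed for stability:

```python
import numpy as np

x = np.array([1100, 1400, 1425, 1550, 1600, 1700, 1700, 1875, 2350, 2450], dtype=float)
y = np.array([199, 245, 319, 240, 312, 279, 310, 380, 405, 540], dtype=float)
m = len(x)

def gradient(theta, xb, yb):
    """Gradient of the squared-error cost on the batch (xb, yb)."""
    err = (theta[0] + theta[1] * xb) - yb
    return np.array([err.mean(), (err * xb).mean()])

def run(batch_size, alpha=1e-7, epochs=200, seed=0):
    """batch_size = m gives Batch GD, 1 gives SGD, anything in between is mini-batch."""
    rng = np.random.default_rng(seed)
    theta = np.zeros(2)
    for _ in range(epochs):
        order = rng.permutation(m)                 # shuffle once per epoch
        for start in range(0, m, batch_size):
            idx = order[start:start + batch_size]  # current (mini-)batch
            theta -= alpha * gradient(theta, x[idx], y[idx])
    return theta

print("Batch GD:     ", run(batch_size=m))
print("Mini-batch GD:", run(batch_size=3))
print("SGD:          ", run(batch_size=1))
```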
For gradient descent to converge, we only require that the gradient we use is an unbiased estimate of the true gradient.

In fact, the finite-sum term $\frac{1}{n}\sum_{i=1}^{n}\nabla L_i(\theta)$ used in GD is itself an empirical estimate of the expected value of the gradient.

Therefore, any other unbiased empirical estimate of that expected value, for example one computed from a subsample of the data, suffices for convergence of gradient descent.
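In symbols (a standard fact, with the index i drawn uniformly at random from {1, …, n}):

$$\mathbb{E}_{i \sim \mathrm{Unif}\{1,\dots,n\}}\left[\nabla L_i(\theta)\right] = \frac{1}{n}\sum_{i=1}^{n}\nabla L_i(\theta) = \nabla J(\theta)$$

so the single-example (or mini-batch) gradient used by SGD is an unbiased estimate of the full gradient.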
Key Insight

When the learning rate decreases at an appropriate rate, and subject to relatively mild assumptions, stochastic gradient descent converges almost surely to a local minimum.

Intuition: as long as each noisy step takes us in a direction that is correct on average, over many steps we will make progress in minimizing the loss.
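"Decreases at an appropriate rate" is usually formalized by the standard Robbins-Monro conditions on the step sizes α_t (satisfied, for example, by α_t = c/t):

$$\sum_{t=1}^{\infty} \alpha_t = \infty, \qquad \sum_{t=1}^{\infty} \alpha_t^2 < \infty$$

The first condition ensures the steps can reach any point; the second ensures the injected noise eventually dies out.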
How to Choose Batch Size

Use a large batch when the noise/randomness in the gradient estimate is large.

Use a small batch when the noise/randomness is small.

A large batch allows more efficient computation on GPUs.

It is often best to increase the batch size up to the GPU memory limit.
End of Class

Slides From Andrew Ng
