lec6_7_Linear_regression

1

Linear regression

CSCI-P 556
ZORAN TIGANJ
2
Reminders/Announcements

• Don’t forget the Quiz deadline on Wednesday
• Homework assignment 1 released
• Updates to ML quest
• Stretch goal for paper 1
• New paper
3
Deeper dive into ML techniques

• So far we have treated Machine Learning models and their training algorithms mostly like black boxes.
• As of today, we will start taking a deeper dive into how popular machine learning techniques actually work.
• Having a good understanding of how things work can help you quickly home in on the appropriate model, the right training algorithm to use, and a good set of hyperparameters for your task.
• Understanding what’s under the hood will also help you debug issues and perform error analysis more efficiently.
4
Today: linear regression

• Based on Chapter 4 of the Hands-On ML textbook.
• We will discuss two ways to train linear regression models:
  • Using a “closed-form” equation that directly computes the model parameters that best fit the model to the training set.
  • Using an iterative optimization approach called Gradient Descent (GD) that gradually tweaks the model parameters to minimize the cost function over the training set, eventually converging to the same set of parameters as the first method.
• (Note on the stochastic variant of GD, covered later: it is much faster because it has very little data to manipulate at every iteration, but its cost function bounces up and down, decreasing only on average.)
5
Linear regression

Linear regression model
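The model equation on this slide was an image that did not survive extraction; in the book’s notation, the linear regression prediction is

ŷ = θ₀ + θ₁x₁ + θ₂x₂ + … + θₙxₙ

where ŷ is the predicted value, n is the number of features, xᵢ is the i-th feature value, and θⱼ is the j-th model parameter (θ₀ being the bias term).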


6
Linear regression

Linear regression model:

More concise form of linear regression model:
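The concise form shown on the slide was also an image; it can be written as

ŷ = hθ(x) = θᵀx

where hθ is the hypothesis function and θᵀx is the dot product of the parameter vector and the feature vector (with x₀ = 1).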

Note: θ and x are both column vectors (in ML vectors are commonly represented as column vectors,
in other words vectors are represented as 2D arrays with a single column).
7
Training

• We first need a measure of how well (or poorly) the model fits the training data.
• For example, we can use RMSE.
• In practice, it is simpler to minimize the mean squared error (MSE) than the RMSE, and it leads to the same result (because the value that minimizes a function also minimizes its square root).
• The MSE of a Linear Regression hypothesis hθ on a training set X is given by the formula below (here m = number of instances and x = the feature vector of an instance).
• We write hθ instead of just h to make it clear that the model is parametrized by the vector θ.
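The MSE formula itself was lost with the slide image; in the book’s notation it is

MSE(X, hθ) = (1/m) Σᵢ₌₁ᵐ (θᵀx⁽ⁱ⁾ − y⁽ⁱ⁾)²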
8
Training

• To find the value of θ that minimizes the cost function, there is a closed-form solution—in other words, a mathematical equation that gives the result directly.
• This is called the Normal Equation, shown below.
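The equation itself was an image on the slide; the Normal Equation is

θ̂ = (XᵀX)⁻¹ Xᵀ y

where θ̂ is the value of θ that minimizes the cost function and y is the vector of target values.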
9
Deriving the normal equation

First, we use linear algebra to reorganize the loss function.
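The reorganized form was shown as an image; dropping the 1/m factor, the MSE can be written in matrix form as

MSE(θ) ∝ (Xθ − y)ᵀ(Xθ − y) = θᵀXᵀXθ − 2θᵀXᵀy + yᵀy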
10
Deriving the normal equation

Note: we ignore the 1/m factor.

Compute the gradient with respect to θ, set it equal to 0, and solve for θ:

∇θ MSE(θ) = 0

Useful matrix-derivative identities:

y        ∂y/∂X
AX       Aᵀ
XᵀA      A
XᵀX      2X
XᵀAX     AX + AᵀX
11
Deriving the normal equation

Applying these identities to MSE(θ) ∝ θᵀXᵀXθ − 2θᵀXᵀy + yᵀy:

∇θ MSE(θ) = 2XᵀXθ − 2Xᵀy = 0
XᵀXθ = Xᵀy
θ̂ = (XᵀX)⁻¹ Xᵀ y   (the Normal Equation)
12
Linear regression

• Let’s generate some linear-looking data to test this equation on.
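The data-generation code on the slide was not preserved; a minimal NumPy sketch matching the y = 4 + 3x + Gaussian noise setup described two slides later (the seed is an arbitrary choice for reproducibility):

```python
import numpy as np

np.random.seed(42)                      # arbitrary seed, for reproducibility
m = 100                                 # number of instances
X = 2 * np.random.rand(m, 1)            # feature values in [0, 2)
y = 4 + 3 * X + np.random.randn(m, 1)   # y = 4 + 3x + Gaussian noise
```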


13
The Normal Equation

• Now let’s compute θ̂ using the Normal Equation.
• The function that we used to generate the data is y = 4 + 3x + Gaussian noise.
• Close enough, but the noise made it impossible to recover the exact parameters of the original function.
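The slide’s code was not preserved; a sketch in the style of the book’s example, continuing from the X and y arrays generated above:

```python
from numpy.linalg import inv

X_b = np.c_[np.ones((m, 1)), X]              # add x0 = 1 to each instance
theta_best = inv(X_b.T @ X_b) @ X_b.T @ y    # Normal Equation: (XᵀX)⁻¹ Xᵀ y
print(theta_best)                            # values near [[4.], [3.]], shifted by the noise
```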
14
The Normal Equation

• Now we can make predictions using θ̂.

y = 4 + 3x + Gaussian noise.
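A sketch of the prediction step, continuing from theta_best above (the two query points x = 0 and x = 2 follow the book’s example):

```python
X_new = np.array([[0.0], [2.0]])          # query points x = 0 and x = 2
X_new_b = np.c_[np.ones((2, 1)), X_new]   # add x0 = 1
y_predict = X_new_b @ theta_best
print(y_predict)                          # approximately [[4.2], [9.8]]
```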
15
The Normal Equation

• Let’s plot this model’s predictions.
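One way to reproduce the plot, assuming the arrays from the previous sketches and Matplotlib:

```python
import matplotlib.pyplot as plt

plt.plot(X_new, y_predict, "r-", label="Predictions")   # fitted line
plt.plot(X, y, "b.")                                     # training data
plt.xlabel("$x_1$")
plt.ylabel("$y$")
plt.legend()
plt.show()
```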


16
Linear regression using Scikit-Learn

XᵀX is not invertible if m (instances) < n (features).

In the example output, 4.21 is the prediction ŷ for x = 0 and 9.75 is the prediction for x = 2.

This approach is more efficient than computing the Normal Equation, plus it handles edge cases nicely: indeed, the Normal Equation may not work if the matrix XᵀX is not invertible (i.e., singular), such as:
• if m < n (where n is the number of features), or
• if some features are redundant.
In either case the pseudoinverse is always defined.
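A sketch of the Scikit-Learn version referred to on this slide, assuming the same X, y, and X_new as in the earlier sketches:

```python
from sklearn.linear_model import LinearRegression

lin_reg = LinearRegression()                 # uses an SVD-based least-squares solver
lin_reg.fit(X, y)
print(lin_reg.intercept_, lin_reg.coef_)     # intercept near 4.2, slope near 3
print(lin_reg.predict(X_new))                # approximately [[4.2], [9.8]]
```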
17
Computational Complexity of The
Normal Equation

• The Normal Equation computes the inverse of XᵀX, which is an (n + 1) × (n + 1) matrix (where n is the number of features).
• The computational complexity of inverting such a matrix is typically about O(n^2.4) to O(n^3), depending on the implementation.
• In other words, if you double the number of features, you multiply the computation time by roughly 2^2.4 ≈ 5.3 to 2^3 = 8.
• The SVD approach used by Scikit-Learn’s LinearRegression class is about O(n^2). If you double the number of features, you multiply the computation time by roughly 4.

Matrix inversion scales worse with the number of features than Scikit-Learn’s SVD approach.
18
Computational Complexity of The
Normal Equation

• Both the Normal Equation and the SVD approach get very slow when the number of features grows large (e.g., 100,000).
• On the positive side, both are linear with regard to the number of instances in the training set (they are O(m)), so they handle large training sets efficiently, provided they can fit in memory.

Polynomial (superlinear) compute complexity in the number of features; linear compute complexity in the number of instances.
19
Gradient Descent

• Now we will look at a very different way to train a Linear Regression model, which is better suited for cases where there are a large number of features or too many training instances to fit in memory.
20
Gradient Descent

• Gradient Descent is a generic optimization algorithm capable of finding optimal solutions to a wide range of problems.
• The general idea of Gradient Descent is to tweak parameters iteratively in order to minimize a cost function.
• Suppose you are lost in the mountains in a dense fog, and you can only feel the slope of the ground below your feet. A good strategy to get to the bottom of the valley quickly is to go downhill in the direction of the steepest slope.
• This is exactly what Gradient Descent does: it measures the local gradient of the error function with regard to the parameter vector θ, and it goes in the direction of descending gradient.

In this depiction of Gradient Descent, the model parameters are initialized randomly and get tweaked repeatedly to minimize the cost function; the learning step size is proportional to the slope of the cost function, so the steps gradually get smaller as the parameters approach the minimum.
21
Learning rate in Gradient Descent

• An important parameter in Gradient Descent is the size of the steps, determined by the learning rate hyperparameter. If the learning rate is too small, then the algorithm will have to go through many iterations to converge, which will take a long time.

The learning rate is too small


22
Learning rate in Gradient Descent

• On the other hand, if the learning rate is too high, you might jump across the valley and end up on the other side, possibly even higher up than you were before. This might make the algorithm diverge, with larger and larger values, failing to find a good solution.

The learning rate is too large


23
Challenges in Gradient Descent

• Finally, not all cost functions look like nice, regular bowls. There may be holes, ridges, plateaus, and all sorts of irregular terrains, making convergence to the minimum difficult.
• If the random initialization starts the algorithm on the left, then it will converge to a local minimum, which is not as good as the global minimum.
• If it starts on the right, then it will take a very long time to cross the plateau.
24
Gradient descent for MSE cost
function and Linear Regression
• Fortunately, the MSE cost function for a Linear Regression model happens to be a convex function, which means that if you pick any two points on the curve, the line segment joining them never crosses the curve.
• This implies that there are no local minima, just one global minimum, and Gradient Descent is guaranteed to approach arbitrarily close to the global minimum (if you wait long enough and if the learning rate is not too high).

You should ensure that all features have a similar scale, or else it will take much longer to converge.
25
Batch Gradient Descent

• To implement Gradient Descent, you need to compute the gradient of the cost function with regard to each model parameter θⱼ.
• In other words, you need to calculate how much the cost function will change if you change θⱼ just a little bit (the partial derivative, given below).
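The partial-derivative formula was an image on the slide; in the book’s notation it is

∂MSE(θ)/∂θⱼ = (2/m) Σᵢ₌₁ᵐ (θᵀx⁽ⁱ⁾ − y⁽ⁱ⁾) xⱼ⁽ⁱ⁾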
26
Batch Gradient Descent

• Instead of computing these partial derivatives individually, we can use the gradient vector to compute them all in one step, as shown below:
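The gradient-vector formula was an image on the slide; in the book’s notation it is

∇θ MSE(θ) = (2/m) Xᵀ(Xθ − y)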

• Notice that this formula involves calculations over the full training set X, at each Gradient
Descent step! This is why the algorithm is called Batch Gradient Descent: it uses the whole
batch of training data at every step.
• As a result it is terribly slow on very large training sets.
• However, Gradient Descent scales well with the number of features; training a Linear
Regression model when there are hundreds of thousands of features is much faster using
Gradient Descent than using the Normal Equation or SVD decomposition.
27
Batch Gradient Descent

• Once you have the gradient vector, which points uphill, just go in the opposite direction to go downhill, using the update step below.

η is the learning rate.
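The update rule was an image on the slide; it is

θ(next step) = θ − η ∇θ MSE(θ)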


28
Batch Gradient Descent

• First 10 steps of Gradient Descent using various learning rates.
• When to stop? When the gradient vector becomes tiny.
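The accompanying code was not preserved; a minimal Batch Gradient Descent sketch in the style of the book’s example, reusing X_b, y, and m from the earlier sketches (eta and n_iterations are illustrative choices):

```python
eta = 0.1                         # learning rate, one of the values compared on the slide
n_iterations = 1000
theta = np.random.randn(2, 1)     # random initialization

for iteration in range(n_iterations):
    gradients = 2 / m * X_b.T @ (X_b @ theta - y)   # full-batch gradient of MSE
    theta = theta - eta * gradients                 # step in the downhill direction
# theta ends up very close to theta_best from the Normal Equation
```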


29
Stochastic Gradient Descent

• The main problem with Batch Gradient Descent is the fact that it uses the whole training set to compute the gradients at every step, which makes it very slow when the training set is large.
• At the opposite extreme, Stochastic Gradient Descent picks a random instance in the training set at every step and computes the gradients based only on that single instance.

Batch Gradient Descent: the entire training set at every step.
Stochastic Gradient Descent: one random instance at every step.
30
Stochastic Gradient Descent

• When the cost function is very irregular, this can actually help the algorithm jump out of local minima, so Stochastic Gradient Descent has a better chance of finding the global minimum than Batch Gradient Descent does.

SGD can jump out of local minima.


31
Stochastic Gradient Descent

• Therefore, randomness is good to escape from local optima, but bad because it means that the algorithm can never settle at the minimum.
• The solution to this dilemma is to gradually reduce the learning rate. The steps start out large (which helps make quick progress and escape local minima), then get smaller and smaller, allowing the algorithm to settle at the global minimum (this is akin to simulated annealing).
• The function that determines the learning rate at each iteration is called the learning schedule.

For SGD: start with a high learning rate but decrease it over time so the algorithm can settle near the global minimum.
32
Stochastic Gradient Descent
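The body of this slide was not preserved; it most likely showed an SGD implementation with a learning schedule. A sketch under that assumption, reusing X_b, y, and m from the earlier sketches (n_epochs, t0, and t1 are illustrative values):

```python
n_epochs = 50
t0, t1 = 5, 50                        # learning-schedule hyperparameters (illustrative)

def learning_schedule(t):
    return t0 / (t + t1)              # learning rate decreases over time

theta = np.random.randn(2, 1)         # random initialization

for epoch in range(n_epochs):
    for i in range(m):
        random_index = np.random.randint(m)       # pick one instance at random
        xi = X_b[random_index:random_index + 1]
        yi = y[random_index:random_index + 1]
        gradients = 2 * xi.T @ (xi @ theta - yi)  # gradient from a single instance
        eta = learning_schedule(epoch * m + i)
        theta = theta - eta * gradients
```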
33
Stochastic Gradient Descent
34
Stochastic Gradient Descent

• When using Stochastic Gradient Descent, the training instances must be independent and identically distributed (IID) to ensure that the parameters get pulled toward the global optimum, on average.
• A simple way to ensure this is to shuffle the instances during training (e.g., pick each instance randomly, or shuffle the training set at the beginning of each epoch).
• If you do not shuffle the instances—for example, if the instances are sorted by label—then SGD will start by optimizing for one label, then the next, and so on, and it will not settle close to the global minimum.
35
Stochastic Gradient Descent
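The body of this slide was not preserved; it likely demonstrated Scikit-Learn’s out-of-the-box SGD regressor. A sketch under that assumption (the hyperparameter values are illustrative):

```python
from sklearn.linear_model import SGDRegressor

# Linear regression trained by SGD; penalty=None disables regularization
sgd_reg = SGDRegressor(max_iter=1000, tol=1e-3, penalty=None, eta0=0.1)
sgd_reg.fit(X, y.ravel())                    # SGDRegressor expects a 1-D target array
print(sgd_reg.intercept_, sgd_reg.coef_)     # again roughly 4.2 and a slope near 3
```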
36
Mini-batch Gradient Descent

• At each step, instead of computing the gradients based on the full training set (as in Batch GD) or based on just one instance (as in Stochastic GD), Mini-batch GD computes the gradients on small random sets of instances called mini-batches.
• The main advantage of Mini-batch GD over Stochastic GD is that you can get a performance boost from hardware optimization of matrix operations, especially when using GPUs; a sketch follows below.

Mini-batch: small random sets of instances; performance boost from vectorized matrix operations.
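Not part of the slide, but a minimal Mini-batch GD sketch may make the contrast concrete; it reuses X_b, y, and m from the earlier sketches, and batch_size, n_epochs, and eta are illustrative choices:

```python
batch_size = 20
n_epochs = 50
eta = 0.05                                       # fixed learning rate, for simplicity
theta = np.random.randn(2, 1)

for epoch in range(n_epochs):
    shuffled = np.random.permutation(m)          # reshuffle the instances each epoch
    X_shuffled, y_shuffled = X_b[shuffled], y[shuffled]
    for start in range(0, m, batch_size):
        xi = X_shuffled[start:start + batch_size]    # one mini-batch
        yi = y_shuffled[start:start + batch_size]
        gradients = 2 / len(xi) * xi.T @ (xi @ theta - yi)
        theta = theta - eta * gradients
```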
37
Comparison
38
Next time

• Regularization and polynomial regression, from Chapter 4 of the Hands-On Machine Learning textbook.

Normal Equation + SVD: slow when there are many features.
Batch GD: slow when there are many instances.
All GD variants require feature scaling.
