lec6_7_Linear_regression

1

Linear regression

CSCI-P 556
ZORAN TIGANJ
2
Reminders/Announcements

• Don’t forget the Quiz deadline on Wednesday
• Homework assignment 1 released
• Updates to ML quest
• Stretch goal for paper 1
• New paper
3
Deeper dive into ML techniques

• So far we have treated Machine Learning models and their training algorithms mostly like black boxes.
• As of today, we will start taking a deeper dive into how popular machine learning techniques actually work.
• Having a good understanding of how things work can help you quickly home in on the appropriate model, the right training algorithm to use, and a good set of hyperparameters for your task.
• Understanding what’s under the hood will also help you debug issues and perform error analysis more efficiently.
4
Today: linear regression

• Based on Chapter 4 of the Hands-On ML textbook.
• We will discuss two ways to train linear regression models:
  • Using a “closed-form” equation that directly computes the model parameters that best fit the model to the training set.
  • Using an iterative optimization approach called Gradient Descent (GD) that gradually tweaks the model parameters to minimize the cost function over the training set, eventually converging to the same set of parameters as the first method.
• (Note on the stochastic variant of GD, covered later: it is much faster because it has very little data to manipulate at every iteration, but its cost function bounces up and down, decreasing only on average.)
5
Linear regression

Linear regression model
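The model equation on this slide was an image that did not survive extraction; in the book’s notation, the linear regression prediction is

ŷ = θ₀ + θ₁x₁ + θ₂x₂ + … + θₙxₙ

where ŷ is the predicted value, n is the number of features, xᵢ is the i-th feature value, and θⱼ is the j-th model parameter (θ₀ being the bias term).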


6
Linear regression

Linear regression model:

More concise form of linear regression model:
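The concise form shown on the slide was also an image; it can be written as

ŷ = hθ(x) = θᵀx

where hθ is the hypothesis function and θᵀx is the dot product of the parameter vector and the feature vector (with x₀ = 1).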

Note: θ and x are both column vectors (in ML vectors are commonly represented as column vectors,
in other words vectors are represented as 2D arrays with a single column).
7
Training

• We first need a measure of how well (or poorly) the model fits the training data.
• For example, we can use RMSE.
• In practice, it is simpler to minimize the mean squared error (MSE) than the RMSE, and it leads to the same result (because the value that minimizes a function also minimizes its square root).
• The MSE of a Linear Regression hypothesis hθ on a training set X is given by the formula below (here m = number of instances and x = the feature vector of an instance).
• We write hθ instead of just h to make it clear that the model is parametrized by the vector θ.
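The MSE formula itself was lost with the slide image; in the book’s notation it is

MSE(X, hθ) = (1/m) Σᵢ₌₁ᵐ (θᵀx⁽ⁱ⁾ − y⁽ⁱ⁾)²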
8
Training

• To find the value of θ that minimizes the cost function, there is a closed-form solution—in other words, a mathematical equation that gives the result directly.
• This is called the Normal Equation, shown below.
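The equation itself was an image on the slide; the Normal Equation is

θ̂ = (XᵀX)⁻¹ Xᵀ y

where θ̂ is the value of θ that minimizes the cost function and y is the vector of target values.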
9
Deriving the normal equation

First, we use linear algebra to reorganize the loss function.
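The reorganized form was shown as an image; dropping the 1/m factor, the MSE can be written in matrix form as

MSE(θ) ∝ (Xθ − y)ᵀ(Xθ − y) = θᵀXᵀXθ − 2θᵀXᵀy + yᵀy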
10
Deriving the normal equation

Note: we ignore the 1/m factor.

Compute the gradient with respect to θ, set it equal to 0, and solve for θ:

∇θ MSE(θ) = 0

Useful matrix-derivative identities:

y        ∂y/∂X
AX       Aᵀ
XᵀA      A
XᵀX      2X
XᵀAX     AX + AᵀX
11
Deriving the normal equation

Applying these identities to MSE(θ) ∝ θᵀXᵀXθ − 2θᵀXᵀy + yᵀy:

∇θ MSE(θ) = 2XᵀXθ − 2Xᵀy = 0
XᵀXθ = Xᵀy
θ̂ = (XᵀX)⁻¹ Xᵀ y   (the Normal Equation)
12
Linear regression

• Let’s generate some linear-looking data to test this equation on.
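The data-generation code on the slide was not preserved; a minimal NumPy sketch matching the y = 4 + 3x + Gaussian noise setup described two slides later (the seed is an arbitrary choice for reproducibility):

```python
import numpy as np

np.random.seed(42)                      # arbitrary seed, for reproducibility
m = 100                                 # number of instances
X = 2 * np.random.rand(m, 1)            # feature values in [0, 2)
y = 4 + 3 * X + np.random.randn(m, 1)   # y = 4 + 3x + Gaussian noise
```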


13
The Normal Equation

• Now let’s compute θ̂ using the Normal Equation.
• The function that we used to generate the data is y = 4 + 3x + Gaussian noise.
• Close enough, but the noise made it impossible to recover the exact parameters of the original function.
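The slide’s code was not preserved; a sketch in the style of the book’s example, continuing from the X and y arrays generated above:

```python
from numpy.linalg import inv

X_b = np.c_[np.ones((m, 1)), X]              # add x0 = 1 to each instance
theta_best = inv(X_b.T @ X_b) @ X_b.T @ y    # Normal Equation: (XᵀX)⁻¹ Xᵀ y
print(theta_best)                            # values near [[4.], [3.]], shifted by the noise
```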
14
The Normal Equation

• Now we can make predictions using θ̂.

y = 4 + 3x + Gaussian noise.
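A sketch of the prediction step, continuing from theta_best above (the two query points x = 0 and x = 2 follow the book’s example):

```python
X_new = np.array([[0.0], [2.0]])          # query points x = 0 and x = 2
X_new_b = np.c_[np.ones((2, 1)), X_new]   # add x0 = 1
y_predict = X_new_b @ theta_best
print(y_predict)                          # approximately [[4.2], [9.8]]
```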
15
The Normal Equation

• Let’s plot this model’s predictions.
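One way to reproduce the plot, assuming the arrays from the previous sketches and Matplotlib:

```python
import matplotlib.pyplot as plt

plt.plot(X_new, y_predict, "r-", label="Predictions")   # fitted line
plt.plot(X, y, "b.")                                     # training data
plt.xlabel("$x_1$")
plt.ylabel("$y$")
plt.legend()
plt.show()
```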


16
Linear regression using Scikit-Learn

XᵀX is not invertible if m (instances) < n (features).

In the example output, 4.21 is the prediction ŷ for x = 0 and 9.75 is the prediction for x = 2.

This approach is more efficient than computing the Normal Equation, plus it handles edge cases nicely: indeed, the Normal Equation may not work if the matrix XᵀX is not invertible (i.e., singular), such as:
• if m < n (where n is the number of features), or
• if some features are redundant.
In either case the pseudoinverse is always defined.
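A sketch of the Scikit-Learn version referred to on this slide, assuming the same X, y, and X_new as in the earlier sketches:

```python
from sklearn.linear_model import LinearRegression

lin_reg = LinearRegression()                 # uses an SVD-based least-squares solver
lin_reg.fit(X, y)
print(lin_reg.intercept_, lin_reg.coef_)     # intercept near 4.2, slope near 3
print(lin_reg.predict(X_new))                # approximately [[4.2], [9.8]]
```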
17
Computational Complexity of The
Normal Equation

• The Normal Equation computes the inverse of XᵀX, which is an (n + 1) × (n + 1) matrix (where n is the number of features).
• The computational complexity of inverting such a matrix is typically about O(n^2.4) to O(n^3), depending on the implementation.
• In other words, if you double the number of features, you multiply the computation time by roughly 2^2.4 ≈ 5.3 to 2^3 = 8.
• The SVD approach used by Scikit-Learn’s LinearRegression class is about O(n^2). If you double the number of features, you multiply the computation time by roughly 4.

Matrix inversion scales worse with the number of features than Scikit-Learn’s SVD approach.
18
Computational Complexity of The
Normal Equation

• Both the Normal Equation and the SVD approach get very slow when the number of features grows large (e.g., 100,000).
• On the positive side, both are linear with regard to the number of instances in the training set (they are O(m)), so they handle large training sets efficiently, provided they can fit in memory.

Polynomial (superlinear) compute complexity in the number of features; linear compute complexity in the number of instances.
19
Gradient Descent

• Now we will look at a very different way to train a Linear Regression model, which is better suited for cases where there are a large number of features or too many training instances to fit in memory.
20
Gradient Descent

• Gradient Descent is a generic optimization algorithm capable of finding optimal solutions to a wide range of problems.
• The general idea of Gradient Descent is to tweak parameters iteratively in order to minimize a cost function.
• Suppose you are lost in the mountains in a dense fog, and you can only feel the slope of the ground below your feet. A good strategy to get to the bottom of the valley quickly is to go downhill in the direction of the steepest slope.
• This is exactly what Gradient Descent does: it measures the local gradient of the error function with regard to the parameter vector θ, and it goes in the direction of descending gradient.

In this depiction of Gradient Descent, the model parameters are initialized randomly and get tweaked repeatedly to minimize the cost function; the learning step size is proportional to the slope of the cost function, so the steps gradually get smaller as the parameters approach the minimum.
21
Learning rate in Gradient Descent

• An important parameter in Gradient Descent is the size of the steps, determined by the learning rate hyperparameter. If the learning rate is too small, then the algorithm will have to go through many iterations to converge, which will take a long time.

The learning rate is too small


22
Learning rate in Gradient Descent

• On the other hand, if the learning rate is too high, you might jump across the valley and end up on the other side, possibly even higher up than you were before. This might make the algorithm diverge, with larger and larger values, failing to find a good solution.

The learning rate is too large


23
Challenges in Gradient Descent

• Finally, not all cost functions look like nice, regular bowls. There may be holes, ridges, plateaus, and all sorts of irregular terrains, making convergence to the minimum difficult.
• If the random initialization starts the algorithm on the left, then it will converge to a local minimum, which is not as good as the global minimum.
• If it starts on the right, then it will take a very long time to cross the plateau.
24
Gradient descent for MSE cost
function and Linear Regression
• Fortunately, the MSE cost function for a Linear Regression model happens to be a convex function, which means that if you pick any two points on the curve, the line segment joining them never crosses the curve.
• This implies that there are no local minima, just one global minimum, and Gradient Descent is guaranteed to approach arbitrarily close to the global minimum (if you wait long enough and if the learning rate is not too high).

You should ensure that all features have a similar scale, or else it will take much longer to converge.
25
Batch Gradient Descent

• To implement Gradient Descent, you need to compute the gradient of the cost function with regard to each model parameter θⱼ.
• In other words, you need to calculate how much the cost function will change if you change θⱼ just a little bit (the partial derivative, given below).
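The partial-derivative formula was an image on the slide; in the book’s notation it is

∂MSE(θ)/∂θⱼ = (2/m) Σᵢ₌₁ᵐ (θᵀx⁽ⁱ⁾ − y⁽ⁱ⁾) xⱼ⁽ⁱ⁾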
26
Batch Gradient Descent

• Instead of computing these partial derivatives individually, we can use the gradient vector to compute them all in one step, as shown below:
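The gradient-vector formula was an image on the slide; in the book’s notation it is

∇θ MSE(θ) = (2/m) Xᵀ(Xθ − y)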

• Notice that this formula involves calculations over the full training set X, at each Gradient
Descent step! This is why the algorithm is called Batch Gradient Descent: it uses the whole
batch of training data at every step.
• As a result it is terribly slow on very large training sets.
• However, Gradient Descent scales well with the number of features; training a Linear
Regression model when there are hundreds of thousands of features is much faster using
Gradient Descent than using the Normal Equation or SVD decomposition.
27
Batch Gradient Descent

• Once you have the gradient vector, which points uphill, just go in the opposite direction to go downhill, using the update step below.

η is the learning rate.
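The update rule was an image on the slide; it is

θ(next step) = θ − η ∇θ MSE(θ)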


28
Batch Gradient Descent

• First 10 steps of Gradient Descent using various learning rates.
• When to stop? When the gradient vector becomes tiny.
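The accompanying code was not preserved; a minimal Batch Gradient Descent sketch in the style of the book’s example, reusing X_b, y, and m from the earlier sketches (eta and n_iterations are illustrative choices):

```python
eta = 0.1                         # learning rate, one of the values compared on the slide
n_iterations = 1000
theta = np.random.randn(2, 1)     # random initialization

for iteration in range(n_iterations):
    gradients = 2 / m * X_b.T @ (X_b @ theta - y)   # full-batch gradient of MSE
    theta = theta - eta * gradients                 # step in the downhill direction
# theta ends up very close to theta_best from the Normal Equation
```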


29
Stochastic Gradient Descent

• The main problem with Batch Gradient Descent is the fact that it uses the whole training set to compute the gradients at every step, which makes it very slow when the training set is large.
• At the opposite extreme, Stochastic Gradient Descent picks a random instance in the training set at every step and computes the gradients based only on that single instance.

Batch Gradient Descent: the entire training set at every step.
Stochastic Gradient Descent: one random instance at every step.
30
Stochastic Gradient Descent

• When the cost function is very irregular, this can actually help the algorithm jump out of local minima, so Stochastic Gradient Descent has a better chance of finding the global minimum than Batch Gradient Descent does.

SGD can jump out of local minima.


31
Stochastic Gradient Descent

• Therefore, randomness is good to escape from local optima, but bad because it means that the algorithm can never settle at the minimum.
• The solution to this dilemma is to gradually reduce the learning rate. The steps start out large (which helps make quick progress and escape local minima), then get smaller and smaller, allowing the algorithm to settle at the global minimum (this is akin to simulated annealing).
• The function that determines the learning rate at each iteration is called the learning schedule.

For SGD: start with a high learning rate but decrease it over time so the algorithm can settle near the global minimum.
32
Stochastic Gradient Descent
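The body of this slide was not preserved; it most likely showed an SGD implementation with a learning schedule. A sketch under that assumption, reusing X_b, y, and m from the earlier sketches (n_epochs, t0, and t1 are illustrative values):

```python
n_epochs = 50
t0, t1 = 5, 50                        # learning-schedule hyperparameters (illustrative)

def learning_schedule(t):
    return t0 / (t + t1)              # learning rate decreases over time

theta = np.random.randn(2, 1)         # random initialization

for epoch in range(n_epochs):
    for i in range(m):
        random_index = np.random.randint(m)       # pick one instance at random
        xi = X_b[random_index:random_index + 1]
        yi = y[random_index:random_index + 1]
        gradients = 2 * xi.T @ (xi @ theta - yi)  # gradient from a single instance
        eta = learning_schedule(epoch * m + i)
        theta = theta - eta * gradients
```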
33
Stochastic Gradient Descent
34
Stochastic Gradient Descent

• When using Stochastic Gradient Descent, the training instances must be independent and identically distributed (IID) to ensure that the parameters get pulled toward the global optimum, on average.
• A simple way to ensure this is to shuffle the instances during training (e.g., pick each instance randomly, or shuffle the training set at the beginning of each epoch).
• If you do not shuffle the instances—for example, if the instances are sorted by label—then SGD will start by optimizing for one label, then the next, and so on, and it will not settle close to the global minimum.
35
Stochastic Gradient Descent
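The body of this slide was not preserved; it likely demonstrated Scikit-Learn’s out-of-the-box SGD regressor. A sketch under that assumption (the hyperparameter values are illustrative):

```python
from sklearn.linear_model import SGDRegressor

# Linear regression trained by SGD; penalty=None disables regularization
sgd_reg = SGDRegressor(max_iter=1000, tol=1e-3, penalty=None, eta0=0.1)
sgd_reg.fit(X, y.ravel())                    # SGDRegressor expects a 1-D target array
print(sgd_reg.intercept_, sgd_reg.coef_)     # again roughly 4.2 and a slope near 3
```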
36
Mini-batch Gradient Descent

• At each step, instead of computing the gradients based on the full training set (as in Batch GD) or based on just one instance (as in Stochastic GD), Mini-batch GD computes the gradients on small random sets of instances called mini-batches.
• The main advantage of Mini-batch GD over Stochastic GD is that you can get a performance boost from hardware optimization of matrix operations, especially when using GPUs; a sketch follows below.

Mini-batch: small random sets of instances; performance boost from vectorized matrix operations.
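Not part of the slide, but a minimal Mini-batch GD sketch may make the contrast concrete; it reuses X_b, y, and m from the earlier sketches, and batch_size, n_epochs, and eta are illustrative choices:

```python
batch_size = 20
n_epochs = 50
eta = 0.05                                       # fixed learning rate, for simplicity
theta = np.random.randn(2, 1)

for epoch in range(n_epochs):
    shuffled = np.random.permutation(m)          # reshuffle the instances each epoch
    X_shuffled, y_shuffled = X_b[shuffled], y[shuffled]
    for start in range(0, m, batch_size):
        xi = X_shuffled[start:start + batch_size]    # one mini-batch
        yi = y_shuffled[start:start + batch_size]
        gradients = 2 / len(xi) * xi.T @ (xi @ theta - yi)
        theta = theta - eta * gradients
```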
37
Comparison
38
Next time

• Regularization and polynomial regression, from Chapter 4 of the Hands-On Machine Learning textbook.

Normal Equation + SVD: slow when there are many features.
Batch GD: slow when there are many instances.
All GD variants require feature scaling.
