lec6_7_Linear_regression
Linear regression
CSCI-P 556
ZORAN TIGANJ
Reminders/Announcements
Note: θ and x are both column vectors (in ML, vectors are commonly represented as column vectors, i.e., as 2D arrays with a single column).
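A small NumPy illustration of this column-vector convention (the numbers are made up); it computes the linear model's prediction h_θ(x) = θᵀx:

```python
import numpy as np

# Column vectors represented as 2D arrays with a single column (made-up values).
theta = np.array([[4.0], [3.0], [2.0]])   # shape (3, 1): bias term plus 2 weights
x = np.array([[1.0], [0.5], [-1.2]])      # shape (3, 1): x0 = 1 (bias), x1, x2

# Linear model prediction: h_theta(x) = theta^T x, a 1x1 array.
y_hat = theta.T @ x
print(y_hat.item())   # the scalar prediction
```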
Training
u We first need a measure of how well (or poorly) the model fits the training
data.
u For example, we can use RMSE.
u In practice, it is simpler to minimize the mean squared error (MSE) than the
RMSE, and it leads to the same result (because the value that minimizes a
function also minimizes its square root).
u The MSE of a Linear Regression hypothesis h_θ on a training set X is
MSE(X, h_θ) = (1/m) Σ_{i=1..m} (θᵀ x^(i) − y^(i))²,
where m is the number of instances and x^(i) is the feature vector of the i-th instance.
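A minimal NumPy sketch of this cost (the names mse and X_b are illustrative; X_b denotes the instance matrix with a leading column of 1s for the bias term):

```python
import numpy as np

def mse(theta, X_b, y):
    """MSE of the linear hypothesis h_theta(x) = theta^T x over a training set.

    theta: (n+1) x 1 parameter vector, X_b: m x (n+1) instance matrix
    (including the bias column of 1s), y: m x 1 target vector.
    """
    errors = X_b @ theta - y            # m x 1 vector of prediction errors
    return float((errors ** 2).mean())  # mean of the squared errors
```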
u To find the value of θ that minimizes the cost function, there is a closed-
form solution—in other words, a mathematical equation that gives the
result directly.
u This is called the Normal Equation.
Deriving the normal equation
Note: we ignore the 1/m factor, since it does not change the minimizer.
Matrix-derivative rule used: if y = AX, then ∂y/∂X = Aᵀ.
Setting the gradient of the cost to zero,
∇_θ MSE(θ) = 2Xᵀ(Xθ − y) = 0,
and solving for θ gives the Normal Equation:
θ̂ = (XᵀX)⁻¹ Xᵀ y
u Now we can make predictions using θ̂.
u Example: generate linear-looking data as y = 4 + 3x + Gaussian noise, then recover θ̂ from it.
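A minimal NumPy sketch of this example (the seed, sample size, and variable names are arbitrary choices): generate the data and solve the Normal Equation directly.

```python
import numpy as np

rng = np.random.default_rng(42)

# Generate linear-looking data: y = 4 + 3x + Gaussian noise.
m = 100
X = 2 * rng.random((m, 1))
y = 4 + 3 * X + rng.normal(size=(m, 1))

# Add x0 = 1 to every instance so that theta_0 plays the role of the bias term.
X_b = np.c_[np.ones((m, 1)), X]

# Normal Equation: theta_hat = (X^T X)^(-1) X^T y
theta_hat = np.linalg.inv(X_b.T @ X_b) @ X_b.T @ y
print(theta_hat)   # expected to be close to [[4.], [3.]] (noise shifts it a bit)
```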
The Normal Equation
This approach (computing the pseudoinverse X⁺ of X via SVD) is more efficient than computing the Normal Equation, and it handles edge cases nicely: the Normal Equation may not work if the matrix XᵀX is not invertible (i.e., singular), for example:
• if m < n (fewer training instances than features), or
• if some features are redundant (linearly dependent).
The pseudoinverse, by contrast, is always defined.
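A short sketch of this pseudoinverse route, reusing X_b and y from the previous sketch; both calls below rely on the SVD of X_b, so they work even when XᵀX is singular:

```python
import numpy as np

# Pseudoinverse X^+ computed via SVD; always defined, even for singular X^T X.
theta_hat_pinv = np.linalg.pinv(X_b) @ y

# Equivalent least-squares solve, also SVD-based.
theta_hat_lstsq, residuals, rank, singular_values = np.linalg.lstsq(X_b, y, rcond=None)
```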
Computational Complexity of the Normal Equation
u Inverting XᵀX has a higher computational complexity with respect to the number of features n than the SVD approach used by Scikit-Learn (roughly O(n^2.4) to O(n^3) for the inversion vs. about O(n^2) for SVD).
Computational Complexity of the Normal Equation
u Both the Normal Equation and the SVD approach get very slow when the
number of features grows large (e.g., 100,000).
u On the positive side, both are linear with regard to the number of instances
in the training set (they are O(m)), so they handle large training sets
efficiently, provided they can fit in memory.
u Now we will look at a very different way to train a Linear Regression model,
which is better suited for cases where there are a large number of features
or too many training instances to fit in memory.
Gradient Descent
u On the other hand, if the learning rate is too high, you might jump across
the valley and end up on the other side, possibly even higher up than you
were before. This might make the algorithm diverge, with larger and larger
values, failing to find a good solution.
u Finally, not all cost functions look like nice, regular bowls. There may be holes,
ridges, plateaus, and all sorts of irregular terrains, making convergence to the
minimum difficult.
u If the random initialization starts the algorithm on the left (of the cost curve shown on the slide), then it will converge to a local minimum, which is not as good as the global minimum.
u If it starts on the right, then it will take a very long time to cross the plateau.
Gradient descent for MSE cost
function and Linear Regression
u Fortunately, the MSE cost function for a Linear Regression model happens to be a convex function, which means that if you pick any two points on the curve, the line segment joining them never crosses the curve.
u This implies that there are no local minima, just one global minimum, and Gradient Descent is guaranteed to approach arbitrarily close to the global minimum (if you wait long enough and if the learning rate is not too high).
• Notice that this formula for the gradient, ∇_θ MSE(θ) = (2/m) Xᵀ(Xθ − y), involves calculations over the full training set X at each Gradient Descent step! This is why the algorithm is called Batch Gradient Descent: it uses the whole batch of training data at every step.
• As a result it is terribly slow on very large training sets.
• However, Gradient Descent scales well with the number of features; training a Linear
Regression model when there are hundreds of thousands of features is much faster using
Gradient Descent than using the Normal Equation or the SVD approach.
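A minimal Batch Gradient Descent sketch using this gradient formula (eta and n_iterations are arbitrary illustrative choices; X_b and y are assumed from the earlier Normal Equation sketch):

```python
import numpy as np

eta = 0.1            # learning rate
n_iterations = 1000
m = len(X_b)

theta = np.random.randn(2, 1)   # random initialization

for iteration in range(n_iterations):
    # Gradient of the MSE computed over the whole training set.
    gradients = 2 / m * X_b.T @ (X_b @ theta - y)
    # Step opposite to the gradient, i.e., downhill.
    theta = theta - eta * gradients

print(theta)   # should end up close to the Normal Equation solution
```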
Batch Gradient Descent
u Once you have the gradient vector, which points uphill, just go in the
opposite direction to go downhill.
u The main problem with Batch Gradient Descent is the fact that it uses the
whole training set to compute the gradients at every step, which makes it
very slow when the training set is large.
u At the opposite extreme, Stochastic Gradient Descent picks a random
instance in the training set at every step and computes the gradients
based only on that single instance.
For SGD: start with a relatively high learning rate and gradually decrease it (a learning schedule), so the algorithm can settle close to the global minimum despite the noisy updates.
Stochastic Gradient Descent
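A sketch of SGD with such a decreasing learning rate (the schedule and the hyperparameters t0, t1, and n_epochs are illustrative; X_b and y are assumed from the earlier sketches):

```python
import numpy as np

n_epochs = 50
m = len(X_b)
t0, t1 = 5, 50                     # learning-schedule hyperparameters

def learning_schedule(t):
    # Learning rate decreases as training progresses.
    return t0 / (t + t1)

theta = np.random.randn(2, 1)      # random initialization

for epoch in range(n_epochs):
    for i in range(m):
        # Pick a single random instance and compute the gradient on it only.
        random_index = np.random.randint(m)
        xi = X_b[random_index:random_index + 1]
        yi = y[random_index:random_index + 1]
        gradients = 2 * xi.T @ (xi @ theta - yi)
        eta = learning_schedule(epoch * m + i)
        theta = theta - eta * gradients
```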
Mini-batch Gradient Descent
u At each step, instead of computing the gradients based on the full training
set (as in Batch GD) or based on just one instance (as in Stochastic GD),
Mini-batch GD computes the gradients on small random sets of instances
called mini-batches.
u The main advantage of Mini-batch GD over Stochastic GD is that you can
get a performance boost from hardware optimization of matrix operations,
especially when using GPUs.
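A minimal Mini-batch Gradient Descent sketch (batch_size, eta, and n_epochs are illustrative choices; X_b and y are assumed from the earlier sketches):

```python
import numpy as np

n_epochs = 50
batch_size = 20
eta = 0.05
m = len(X_b)

theta = np.random.randn(2, 1)      # random initialization

for epoch in range(n_epochs):
    shuffled = np.random.permutation(m)          # new random order every epoch
    for start in range(0, m, batch_size):
        idx = shuffled[start:start + batch_size]
        xi, yi = X_b[idx], y[idx]
        # Gradient over the mini-batch only: one matrix product per step,
        # which is what vectorized hardware (e.g., GPUs) accelerates well.
        gradients = 2 / len(idx) * xi.T @ (xi @ theta - yi)
        theta = theta - eta * gradients
```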
Comparison
Next time