Group 30
CSN-382
Regression: Pictorially
Regression: A supervised learning problem. The goal is to model the relationship between an input ($\boldsymbol{x}$) and a real-valued output ($y$). This is akin to a line/plane or curve fitting problem.
(Figure: three example fits of output vs. input. Left: linear regression with one variable, a line through output vs. a single input feature. Middle: linear regression with multiple variables, a plane over Feature 1 and Feature 2. Right: polynomial regression, a curve fit to output vs. a single input feature.)
Linear Regression
Given: Training data with $N$ input-output pairs $(\boldsymbol{x}_n, y_n)$, $n = 1, \dots, N$, where $\boldsymbol{x}_n \in \mathbb{R}^D$ and $y_n \in \mathbb{R}$

Goal: Learn a model to predict the output for new test inputs

Assume the function that approximates the I/O relationship to be a linear model:
$$y_n \approx f(\boldsymbol{x}_n) = \boldsymbol{w}^\top \boldsymbol{x}_n \qquad (n = 1, 2, \dots, N)$$

Can also write all of them compactly using matrix-vector notation as $\boldsymbol{y} \approx \boldsymbol{X}\boldsymbol{w}$, where $\boldsymbol{y}$ is $N \times 1$, $\boldsymbol{X}$ is $N \times D$ (with rows $\boldsymbol{x}_n^\top$), and $\boldsymbol{w}$ is $D \times 1$:
$$\begin{bmatrix} y_1 \\ \vdots \\ y_N \end{bmatrix} \approx \begin{bmatrix} \boldsymbol{x}_1^\top \\ \vdots \\ \boldsymbol{x}_N^\top \end{bmatrix} \begin{bmatrix} w_1 \\ \vdots \\ w_D \end{bmatrix}$$
Let's write the total error or "loss" of this model over the training data as the sum of squared errors
$$L(\boldsymbol{w}) = \sum_{n=1}^{N} (y_n - \boldsymbol{w}^\top \boldsymbol{x}_n)^2$$
(a small NumPy sketch of this model and loss follows the bullet below).
• Here, w represents the coefficients (also known as weights), which determine the contribution of each input
feature to the output prediction.
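A minimal NumPy sketch of the linear model $y_n \approx \boldsymbol{w}^\top \boldsymbol{x}_n$ and the squared-error loss above, on synthetic data (all names and values here are illustrative, not from the slides):

```python
# Minimal sketch: the linear model y ≈ Xw and its squared-error training loss,
# evaluated on synthetic data for illustration.
import numpy as np

rng = np.random.default_rng(0)
N, D = 100, 3                                # N training examples, D features
X = rng.normal(size=(N, D))                  # N x D feature matrix
w_true = np.array([1.5, -2.0, 0.5])          # illustrative "true" weights
y = X @ w_true + 0.1 * rng.normal(size=N)    # noisy real-valued outputs

def squared_loss(w, X, y):
    """L(w) = sum_n (y_n - w^T x_n)^2."""
    residuals = y - X @ w
    return np.sum(residuals ** 2)

w_guess = np.zeros(D)
print("loss at w = 0:", squared_loss(w_guess, X, y))
print("loss at true w:", squared_loss(w_true, X, y))
```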
Linear Regression: Pictorially
Linear regression is like fitting a line or (hyper)plane to a set of points.

(Figure: left, output vs. a single input feature with a fitted line; right, output vs. two features (Feature 1, Feature 2) with a fitted plane.)

What if a line/plane doesn't model the input-output relationship very well, e.g., if their relationship is better modeled by a nonlinear curve or curved surface? Do linear models become useless in such cases? No. We can even fit a curve using a linear model after suitably transforming the inputs, as in the sketch below.

(Figure: with the original (single) feature, a nonlinear curve is needed; after transforming the input as $[z_1, z_2]^\top = \phi(x)$ (two features), we can fit a plane (linear).)
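As a rough illustration of this idea (my own sketch, not from the slides): a polynomial feature map $\phi(x) = [1, x, x^2, x^3]$ lets an ordinary linear least squares fit produce a nonlinear curve in the original input. The sine data and the `phi` helper are illustrative assumptions.

```python
# Sketch: fitting a nonlinear curve with a *linear* model by first transforming
# the single input x into polynomial features, then doing least squares.
import numpy as np

rng = np.random.default_rng(1)
x = np.linspace(-3, 3, 50)
y = np.sin(x) + 0.1 * rng.normal(size=x.shape)   # nonlinear input-output relation

def phi(x, degree=3):
    """Map each scalar input to polynomial features [1, x, ..., x^degree]."""
    return np.vander(x, N=degree + 1, increasing=True)

Phi = phi(x)                                  # n x (degree+1) design matrix
w, *_ = np.linalg.lstsq(Phi, y, rcond=None)   # linear model in the new features
y_hat = Phi @ w                               # a curve in the original input space
```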
• Squared loss objective functions are easy to optimize (convex and differentiable)
• Absolute/Huber loss is preferred if there are outliers in the data: it is less affected by large errors as compared to the squared loss. (Outliers are observations that lie at an abnormal distance from other values in a dataset.)
• $\epsilon$-sensitive loss is used where small errors (within $\epsilon$) are ignored, making it suitable for applications where minor deviations are acceptable.
(Figure: output vs. input (single feature).)
A sketch of these losses as functions of the residual follows below.
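For concreteness, here is one way the losses above could be written as functions of the residual $r = y_n - \boldsymbol{w}^\top \boldsymbol{x}_n$ (a sketch under my own assumptions; the thresholds `delta` and `eps` are user-chosen, not values from the slides):

```python
# Sketch of the loss functions mentioned above, as functions of the residual r.
import numpy as np

def squared_loss(r):
    return r ** 2                              # convex, differentiable, sensitive to outliers

def absolute_loss(r):
    return np.abs(r)                           # grows linearly, less affected by large errors

def huber_loss(r, delta=1.0):
    quad = 0.5 * r ** 2                        # quadratic near zero
    lin = delta * (np.abs(r) - 0.5 * delta)    # linear for |r| > delta
    return np.where(np.abs(r) <= delta, quad, lin)

def eps_insensitive_loss(r, eps=0.1):
    return np.maximum(0.0, np.abs(r) - eps)    # errors within eps are ignored
```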
Cost Function
Setting the partial derivative of the cost w.r.t. each element of $\boldsymbol{w}$ to zero (the result of this derivative is a vector of the same size as $\boldsymbol{w}$):
$$\sum_{n=1}^{N} 2\,\boldsymbol{x}_n (y_n - \boldsymbol{x}_n^\top \boldsymbol{w}) = 0 \;\;\Rightarrow\;\; \sum_{n=1}^{N} y_n \boldsymbol{x}_n - \boldsymbol{x}_n \boldsymbol{x}_n^\top \boldsymbol{w} = 0$$
Solving for $\boldsymbol{w}$ gives
$$\boldsymbol{w} = (\boldsymbol{X}^\top \boldsymbol{X})^{-1} \boldsymbol{X}^\top \boldsymbol{y}$$
A minimal NumPy sketch of this closed-form solution is given below.
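A minimal sketch of the closed-form least squares solution (my own illustration; `least_squares` is a hypothetical helper name, and it assumes $\boldsymbol{X}^\top \boldsymbol{X}$ is invertible):

```python
# Sketch of the closed-form least squares solution w = (X^T X)^(-1) X^T y.
import numpy as np

def least_squares(X, y):
    """Solve the normal equations X^T X w = X^T y."""
    # np.linalg.solve is preferred over forming the explicit inverse
    return np.linalg.solve(X.T @ X, X.T @ y)
```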
Problem(s) with the Solution!
We minimized the objective w.r.t. $\boldsymbol{w}$ and got
$$\boldsymbol{w} = (\boldsymbol{X}^\top \boldsymbol{X})^{-1} \boldsymbol{X}^\top \boldsymbol{y}$$
Problem: The matrix $\boldsymbol{X}^\top \boldsymbol{X}$ may not be invertible. This may lead to non-unique solutions for $\boldsymbol{w}$.

Problem: Overfitting, since we only minimized the loss defined on the training data. Weights may become arbitrarily large to fit the training data perfectly; such weights may however perform poorly on the test data.

One Solution: Minimize a regularized objective $L(\boldsymbol{w}) + \lambda R(\boldsymbol{w})$.
• $R(\boldsymbol{w})$ is called the regularizer and measures the "magnitude" of $\boldsymbol{w}$; the regularizer will prevent the elements of $\boldsymbol{w}$ from becoming too large.
• $\lambda$ is the regularization hyperparameter: it controls how much we wish to regularize (needs to be tuned via cross-validation).
• Reason: now we are minimizing training error + magnitude of the weight vector.
• Two popular examples of regularization for linear regression are:
1. Ridge Regression, or L2 Regularization
2. Lasso Regression, or L1 Regularization
Regularized Least Squares (a.k.a. Ridge Regression)
Recall that the regularized objective is of the form
$$L_{reg}(\boldsymbol{w}) = \sum_{n=1}^{N} (y_n - \boldsymbol{w}^\top \boldsymbol{x}_n)^2 + \lambda\, \boldsymbol{w}^\top \boldsymbol{w}$$
and ridge regression finds $\boldsymbol{w}_{ridge} = \arg\min_{\boldsymbol{w}} L_{reg}(\boldsymbol{w})$.
(In the closed-form solution, this amounts to adding a small value to the diagonals of the $D \times D$ matrix $\boldsymbol{X}^\top \boldsymbol{X}$, like adding a ridge/mountain to some land, hence the name.)
• Minimizing the above objective w.r.t. $\boldsymbol{w}$ does two things:
• Keeps the training error small
• Keeps the norm of $\boldsymbol{w}$ small (and thus also the individual components of $\boldsymbol{w}$): regularization
• There is a trade-off between the two terms: the regularization hyperparameter $\lambda > 0$ controls it
• Very small $\lambda$ means almost no regularization (can overfit)
• Very large $\lambda$ means very high regularization (can underfit: high training error)
• Can use cross-validation to choose the "right" $\lambda$
• Note that, in this case, regularization also made inversion possible (note the $\lambda I_D$ term added to $\boldsymbol{X}^\top \boldsymbol{X}$ in the closed-form ridge solution, sketched below)
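A minimal sketch of the corresponding ridge solution $(\boldsymbol{X}^\top \boldsymbol{X} + \lambda I_D)^{-1} \boldsymbol{X}^\top \boldsymbol{y}$ (my own illustration; `ridge_regression` and the default `lam` value are illustrative choices):

```python
# Sketch of the ridge regression closed-form solution; lam is the regularization
# hyperparameter (to be tuned, e.g., via cross-validation).
import numpy as np

def ridge_regression(X, y, lam=1.0):
    D = X.shape[1]
    # Adding lam to the diagonal makes X^T X + lam*I invertible for any lam > 0
    return np.linalg.solve(X.T @ X + lam * np.eye(D), X.T @ y)
```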
Other Ways to Control Overfitting
• Use a regularizer defined by other norms, e.g., the $\ell_1$ norm regularizer
$$\|\boldsymbol{w}\|_1 = \sum_{d=1}^{D} |w_d|$$
or the $\ell_0$ norm regularizer
$$\|\boldsymbol{w}\|_0 = \mathrm{nnz}(\boldsymbol{w}) \quad \text{(counts the number of nonzeros in } \boldsymbol{w}\text{)}$$
• When should these regularizers be used instead of the $\ell_2$ regularizer? Use them if you have a very large number of features but many irrelevant features. These regularizers can help in automatic feature selection.
• Using such regularizers gives a sparse weight vector $\boldsymbol{w}$ as the solution. "Sparse" means many entries in $\boldsymbol{w}$ will be zero or near zero; those features will be considered irrelevant by the model and will not influence prediction.
• Note that optimizing loss functions with such regularizers is usually harder than ridge regression, but several advanced techniques exist (we will see some of those later).
• Alternatively, use non-regularization based approaches.
• Note: Since they learn a sparse $\boldsymbol{w}$, $\ell_0$ or $\ell_1$ regularization is also useful for doing feature selection ($w_d = 0$ means feature $d$ is irrelevant). We will revisit this later to see formally how it works; a small illustrative sketch follows below.
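As a purely illustrative sketch of the sparsity effect (not the slides' method; it assumes scikit-learn's `Lasso` solver and an arbitrary `alpha` as the regularization strength):

```python
# Illustrative sketch: an l1-regularized ("Lasso") fit typically yields a sparse
# weight vector when many features are irrelevant.
import numpy as np
from sklearn.linear_model import Lasso

rng = np.random.default_rng(2)
N, D = 100, 20
X = rng.normal(size=(N, D))
w_true = np.zeros(D)
w_true[:3] = [2.0, -1.0, 0.5]          # only the first 3 features are relevant
y = X @ w_true + 0.1 * rng.normal(size=N)

lasso = Lasso(alpha=0.1).fit(X, y)
print("number of nonzero weights:", np.count_nonzero(lasso.coef_))
# Features whose weight is exactly 0 are effectively dropped from the prediction.
```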
Linear/Ridge Regression via Gradient Descent
• Both least squares regression and ridge regression require matrix inversion:
$$\boldsymbol{w}_{LS} = (\boldsymbol{X}^\top \boldsymbol{X})^{-1} \boldsymbol{X}^\top \boldsymbol{y} \qquad \boldsymbol{w}_{ridge} = (\boldsymbol{X}^\top \boldsymbol{X} + \lambda I_D)^{-1} \boldsymbol{X}^\top \boldsymbol{y}$$
• This can be computationally expensive when $D$ is very large
• A faster way is to use iterative optimization, such as batch or stochastic gradient descent
• A basic batch gradient-descent based procedure looks like:
• Start with an initial value of $\boldsymbol{w}$
• Update $\boldsymbol{w}$ by moving in the opposite direction of the gradient of the loss function,
$$\boldsymbol{w} \leftarrow \boldsymbol{w} - \eta \frac{\partial L(\boldsymbol{w})}{\partial \boldsymbol{w}}$$
where $\eta$ is the learning rate
• Repeat until convergence
(Such iterative methods for optimizing loss functions are widely used in ML. We will revisit these later in detail.)
• For least squares, the gradient is
$$\frac{\partial L(\boldsymbol{w})}{\partial \boldsymbol{w}} = -2 \sum_{n=1}^{N} \boldsymbol{x}_n (y_n - \boldsymbol{w}^\top \boldsymbol{x}_n)$$
(no matrix inversion involved). A small sketch of this procedure is given below.
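A minimal sketch of this batch gradient descent loop for the least squares loss (illustrative values for `eta` and `n_iters`; in practice they need tuning):

```python
# Sketch of batch gradient descent for the (unregularized) least squares loss.
import numpy as np

def gd_least_squares(X, y, eta=1e-3, n_iters=5000):
    N, D = X.shape
    w = np.zeros(D)                          # start with an initial value of w
    for _ in range(n_iters):
        grad = -2 * X.T @ (y - X @ w)        # gradient of sum_n (y_n - w^T x_n)^2
        w = w - eta * grad                   # move opposite to the gradient (rate eta)
    return w

# Tiny usage example on synthetic data; no matrix inversion involved.
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))
w_true = np.array([1.5, -2.0, 0.5])
y = X @ w_true + 0.1 * rng.normal(size=100)
print(gd_least_squares(X, y))                # should approach w_true
```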
Linear/Ridge Regression via Gradient Descent
We will revisit gradient-based methods later, but a few things to keep in mind:
• (Includes an illustration, with points labeled A and B, showing convex and non-convex functions.)
• A function is convex if its second derivative is non-negative everywhere (for scalar functions) or if its Hessian is positive semi-definite (for functions of vector inputs).
For a convex function, every local minimum is also a global minimum.
• For Gradient Descent, the learning rate is important (should not be too large or too
small).
Linear Regression as Solving System of Linear Eqs
The form of the linear regression model is akin to a system of linear equations. Assuming $N$ training examples with $D$ features each, we have $N$ equations and $D$ unknowns ($w_1, \dots, w_D$):

First training example: $y_1 = x_{11} w_1 + x_{12} w_2 + \dots + x_{1D} w_D$

Second training example: $y_2 = x_{21} w_1 + x_{22} w_2 + \dots + x_{2D} w_D$

$\vdots$

N-th training example: $y_N = x_{N1} w_1 + x_{N2} w_2 + \dots + x_{ND} w_D$

Note: Here $x_{nd}$ denotes the $d$-th feature of the $n$-th training example. A sketch of solving this system in the least squares sense is given below.
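A small sketch of this viewpoint (my own illustration): stacking the $N$ equations into $\boldsymbol{X}\boldsymbol{w} = \boldsymbol{y}$ and solving in the least squares sense with NumPy:

```python
# Sketch: the N equations y_n = x_n1*w_1 + ... + x_nD*w_D form the linear system
# X w = y; for N > D it is overdetermined, so we solve it in the least squares sense.
import numpy as np

rng = np.random.default_rng(3)
N, D = 5, 2
X = rng.normal(size=(N, D))                     # coefficients of the unknowns w_1..w_D
y = rng.normal(size=N)                          # right-hand sides
w, residuals, rank, _ = np.linalg.lstsq(X, y, rcond=None)
print("least squares solution w:", w)
```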
References:
https://ml-cheatsheet.readthedocs.io/en/latest/linear_regression.html
https://towardsdatascience.com/introduction-to-machine-learning-algorithms-linear-regression-14c4e325882a
https://www.edureka.co/blog/linear-regression-for-machine-learning/
https://www.analyticsvidhya.com/blog/2021/06/linear-regression-in-machine-learning/
https://towardsdatascience.com/polynomial-regression-bbe8b9d97491
https://www.analyticsvidhya.com/blog/2021/10/understanding-polynomial-regression-model/
https://www.coursera.org/lecture/machine-learning/features-and-polynomial-regression-Rqgfz
Bias and Variance
Overfitting: The model fits the training set very well but fails to generalize to new
examples. Overfitting happens when a model learns the detail and noise in the
training data to the extent that it negatively impacts the performance of the model
on new data. This means that the noise or random fluctuations in the training data are picked up and learned as concepts by the model. The problem is that these concepts
do not apply to new data and negatively impact the model's ability to generalize.
Underfitting: Underfitting refers to a model that can neither model the training data nor generalize to new data. An underfit
machine learning model is not a suitable model; this is usually obvious because it performs poorly even on the training data.
Variance: Captures how much your classifier changes if you train on a different training set. How "over-specialized" is your
classifier to a particular training set (overfitting)? If we have the best possible model for our training data, how far off are we
from the average classifier?
Bias: What is the inherent error that you obtain from your classifier even with infinite training data? This is due to your
classifier being "biased" to a particular kind of solution (e.g. linear classifier). In other words, bias is inherent to your model.
Bias and Variance
Variance: The variance is an error from sensitivity to small fluctuations in the training set. High variance may result from an
algorithm modeling the random noise in the training data (overfitting). The target function is estimated from the training data
by a machine learning algorithm, so we should expect the algorithm to have some variance. Ideally, it should not change too
much from one training dataset to the next, meaning that the algorithm is good at picking out the hidden underlying mapping
between the inputs and the output variables. Machine learning algorithms with low variance include linear regression and logistic regression; algorithms with high variance include decision trees and k-nearest neighbours.
The bias–variance tradeoff is a central problem in supervised learning. Ideally, one wants to choose a model that both
accurately captures the regularities in its training data and generalizes well to unseen data. Unfortunately, it is typically
impossible to do both simultaneously. High-variance learning methods may be able to represent their training set well but are at
risk of overfitting to noisy or unrepresentative training data. In contrast, algorithms with high bias typically produce simpler
models that may fail to capture important regularities (i.e. underfit) in the data.
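A rough sketch of how variance can be estimated empirically (my own illustration, not from the slides): refit a low-complexity and a high-complexity model on many freshly drawn training sets and measure how much their predictions change.

```python
# Rough illustration: a degree-1 vs. a degree-9 polynomial fit, each refit on many
# resampled training sets; the spread of their predictions estimates model variance.
import numpy as np

rng = np.random.default_rng(4)
x_test = np.linspace(0.05, 0.95, 50)

def prediction_variance(degree, n_trials=200, n_train=20):
    preds = []
    for _ in range(n_trials):
        x = rng.uniform(0, 1, n_train)
        y = np.sin(2 * np.pi * x) + 0.3 * rng.normal(size=n_train)
        coeffs = np.polyfit(x, y, degree)            # fit polynomial of given degree
        preds.append(np.polyval(coeffs, x_test))
    preds = np.array(preds)
    return preds.var(axis=0).mean()                  # average variance over test points

for degree in (1, 9):
    print(f"degree {degree}: estimated prediction variance = {prediction_variance(degree):.3f}")
```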
Bias Variance Tradeoffs
• Models with low bias and low variance are golden, but usually they exist only for specific domains (e.g. linear models may do very well in predicting income as a function of education). Expecting low variance and low bias for a general task is a pipe dream.
• Two main sources of bad test performance for ML algos:
• Bias: the model is too weak, e.g. a linear model for a very complex task. Even the best trained linear model is pathetic.
• Variance: the model is strong but you could not train it properly, e.g. a NN (Neural Network). The best trained NN is NP-hard to learn.
(Figure: test error.)
Next Lecture
• Solving linear regression using iterative optimization methods (e.g., Gradient Descent)
• These are faster and don't require matrix inversion
• Brief intro to optimization techniques