Group 30
CSN-382
Regression: Pictorially
Regression: A supervised learning problem. The goal is to model the relationship between an input ($\boldsymbol{x}$) and a real-valued output ($y$). This is akin to a line/plane or curve fitting problem.
(Figure: three example fits of output vs. input. Left: linear regression with one variable, a line through output vs. a single input feature. Middle: linear regression with multiple variables, a plane over Feature 1 and Feature 2. Right: polynomial regression, a curve fit to output vs. a single input feature.)
Linear Regression
Given: Training data with $N$ input-output pairs $(\boldsymbol{x}_n, y_n)$, $n = 1, \dots, N$, where $\boldsymbol{x}_n \in \mathbb{R}^D$ and $y_n \in \mathbb{R}$

Goal: Learn a model to predict the output for new test inputs

Assume the function that approximates the I/O relationship to be a linear model:
$$y_n \approx f(\boldsymbol{x}_n) = \boldsymbol{w}^\top \boldsymbol{x}_n \qquad (n = 1, 2, \dots, N)$$

Can also write all of them compactly using matrix-vector notation as $\boldsymbol{y} \approx \boldsymbol{X}\boldsymbol{w}$, where $\boldsymbol{y}$ is $N \times 1$, $\boldsymbol{X}$ is $N \times D$ (with rows $\boldsymbol{x}_n^\top$), and $\boldsymbol{w}$ is $D \times 1$:
$$\begin{bmatrix} y_1 \\ \vdots \\ y_N \end{bmatrix} \approx \begin{bmatrix} \boldsymbol{x}_1^\top \\ \vdots \\ \boldsymbol{x}_N^\top \end{bmatrix} \begin{bmatrix} w_1 \\ \vdots \\ w_D \end{bmatrix}$$
Let's write the total error or "loss" of this model over the training data as the sum of squared errors
$$L(\boldsymbol{w}) = \sum_{n=1}^{N} (y_n - \boldsymbol{w}^\top \boldsymbol{x}_n)^2$$
(a small NumPy sketch of this model and loss follows the bullet below).
• Here, w represents the coefficients (also known as weights), which determine the contribution of each input
feature to the output prediction.
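A minimal NumPy sketch of the linear model $y_n \approx \boldsymbol{w}^\top \boldsymbol{x}_n$ and the squared-error loss above, on synthetic data (all names and values here are illustrative, not from the slides):

```python
# Minimal sketch: the linear model y ≈ Xw and its squared-error training loss,
# evaluated on synthetic data for illustration.
import numpy as np

rng = np.random.default_rng(0)
N, D = 100, 3                                # N training examples, D features
X = rng.normal(size=(N, D))                  # N x D feature matrix
w_true = np.array([1.5, -2.0, 0.5])          # illustrative "true" weights
y = X @ w_true + 0.1 * rng.normal(size=N)    # noisy real-valued outputs

def squared_loss(w, X, y):
    """L(w) = sum_n (y_n - w^T x_n)^2."""
    residuals = y - X @ w
    return np.sum(residuals ** 2)

w_guess = np.zeros(D)
print("loss at w = 0:", squared_loss(w_guess, X, y))
print("loss at true w:", squared_loss(w_true, X, y))
```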
Linear Regression: Pictorially
Linear regression is like fitting a line or (hyper)plane to a set of points.

(Figure: left, output vs. a single input feature with a fitted line; right, output vs. two features (Feature 1, Feature 2) with a fitted plane.)

What if a line/plane doesn't model the input-output relationship very well, e.g., if their relationship is better modeled by a nonlinear curve or curved surface? Do linear models become useless in such cases? No. We can even fit a curve using a linear model after suitably transforming the inputs, as in the sketch below.

(Figure: with the original (single) feature, a nonlinear curve is needed; after transforming the input as $[z_1, z_2]^\top = \phi(x)$ (two features), we can fit a plane (linear).)
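As a rough illustration of this idea (my own sketch, not from the slides): a polynomial feature map $\phi(x) = [1, x, x^2, x^3]$ lets an ordinary linear least squares fit produce a nonlinear curve in the original input. The sine data and the `phi` helper are illustrative assumptions.

```python
# Sketch: fitting a nonlinear curve with a *linear* model by first transforming
# the single input x into polynomial features, then doing least squares.
import numpy as np

rng = np.random.default_rng(1)
x = np.linspace(-3, 3, 50)
y = np.sin(x) + 0.1 * rng.normal(size=x.shape)   # nonlinear input-output relation

def phi(x, degree=3):
    """Map each scalar input to polynomial features [1, x, ..., x^degree]."""
    return np.vander(x, N=degree + 1, increasing=True)

Phi = phi(x)                                  # n x (degree+1) design matrix
w, *_ = np.linalg.lstsq(Phi, y, rcond=None)   # linear model in the new features
y_hat = Phi @ w                               # a curve in the original input space
```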
• Squared loss objective functions are easy to optimize (convex and differentiable)
• Absolute/Huber loss is preferred if there are outliers in the data: it is less affected by large errors as compared to the squared loss. (Outliers are observations that lie at an abnormal distance from other values in a dataset.)
• $\epsilon$-sensitive loss is used where small errors (within $\epsilon$) are ignored, making it suitable for applications where minor deviations are acceptable.
(Figure: output vs. input (single feature).)
A sketch of these losses as functions of the residual follows below.
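For concreteness, here is one way the losses above could be written as functions of the residual $r = y_n - \boldsymbol{w}^\top \boldsymbol{x}_n$ (a sketch under my own assumptions; the thresholds `delta` and `eps` are user-chosen, not values from the slides):

```python
# Sketch of the loss functions mentioned above, as functions of the residual r.
import numpy as np

def squared_loss(r):
    return r ** 2                              # convex, differentiable, sensitive to outliers

def absolute_loss(r):
    return np.abs(r)                           # grows linearly, less affected by large errors

def huber_loss(r, delta=1.0):
    quad = 0.5 * r ** 2                        # quadratic near zero
    lin = delta * (np.abs(r) - 0.5 * delta)    # linear for |r| > delta
    return np.where(np.abs(r) <= delta, quad, lin)

def eps_insensitive_loss(r, eps=0.1):
    return np.maximum(0.0, np.abs(r) - eps)    # errors within eps are ignored
```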
Cost Function
Setting the partial derivative of the cost w.r.t. each element of $\boldsymbol{w}$ to zero (the result of this derivative is a vector of the same size as $\boldsymbol{w}$):
$$\sum_{n=1}^{N} 2\,\boldsymbol{x}_n (y_n - \boldsymbol{x}_n^\top \boldsymbol{w}) = 0 \;\;\Rightarrow\;\; \sum_{n=1}^{N} y_n \boldsymbol{x}_n - \boldsymbol{x}_n \boldsymbol{x}_n^\top \boldsymbol{w} = 0$$
Solving for $\boldsymbol{w}$ gives
$$\boldsymbol{w} = (\boldsymbol{X}^\top \boldsymbol{X})^{-1} \boldsymbol{X}^\top \boldsymbol{y}$$
A minimal NumPy sketch of this closed-form solution is given below.
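A minimal sketch of the closed-form least squares solution (my own illustration; `least_squares` is a hypothetical helper name, and it assumes $\boldsymbol{X}^\top \boldsymbol{X}$ is invertible):

```python
# Sketch of the closed-form least squares solution w = (X^T X)^(-1) X^T y.
import numpy as np

def least_squares(X, y):
    """Solve the normal equations X^T X w = X^T y."""
    # np.linalg.solve is preferred over forming the explicit inverse
    return np.linalg.solve(X.T @ X, X.T @ y)
```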
Problem(s) with the Solution!
We minimized the objective w.r.t. $\boldsymbol{w}$ and got
$$\boldsymbol{w} = (\boldsymbol{X}^\top \boldsymbol{X})^{-1} \boldsymbol{X}^\top \boldsymbol{y}$$
Problem: The matrix $\boldsymbol{X}^\top \boldsymbol{X}$ may not be invertible. This may lead to non-unique solutions for $\boldsymbol{w}$.

Problem: Overfitting, since we only minimized the loss defined on the training data. Weights may become arbitrarily large to fit the training data perfectly; such weights may however perform poorly on the test data.

One Solution: Minimize a regularized objective $L(\boldsymbol{w}) + \lambda R(\boldsymbol{w})$.
• $R(\boldsymbol{w})$ is called the regularizer and measures the "magnitude" of $\boldsymbol{w}$; the regularizer will prevent the elements of $\boldsymbol{w}$ from becoming too large.
• $\lambda$ is the regularization hyperparameter: it controls how much we wish to regularize (needs to be tuned via cross-validation).
• Reason: now we are minimizing training error + magnitude of the weight vector.
• Two popular examples of regularization for linear regression are:
1. Ridge Regression, or L2 Regularization
2. Lasso Regression, or L1 Regularization
Regularized Least Squares (a.k.a. Ridge Regression)
Recall that the regularized objective is of the form
$$L_{reg}(\boldsymbol{w}) = \sum_{n=1}^{N} (y_n - \boldsymbol{w}^\top \boldsymbol{x}_n)^2 + \lambda\, \boldsymbol{w}^\top \boldsymbol{w}$$
and ridge regression finds $\boldsymbol{w}_{ridge} = \arg\min_{\boldsymbol{w}} L_{reg}(\boldsymbol{w})$.
(In the closed-form solution, this amounts to adding a small value to the diagonals of the $D \times D$ matrix $\boldsymbol{X}^\top \boldsymbol{X}$, like adding a ridge/mountain to some land, hence the name.)
• Minimizing the above objective w.r.t. $\boldsymbol{w}$ does two things:
• Keeps the training error small
• Keeps the norm of $\boldsymbol{w}$ small (and thus also the individual components of $\boldsymbol{w}$): regularization
• There is a trade-off between the two terms: the regularization hyperparameter $\lambda > 0$ controls it
• Very small $\lambda$ means almost no regularization (can overfit)
• Very large $\lambda$ means very high regularization (can underfit: high training error)
• Can use cross-validation to choose the "right" $\lambda$
• Note that, in this case, regularization also made inversion possible (note the $\lambda I_D$ term added to $\boldsymbol{X}^\top \boldsymbol{X}$ in the closed-form ridge solution, sketched below)
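A minimal sketch of the corresponding ridge solution $(\boldsymbol{X}^\top \boldsymbol{X} + \lambda I_D)^{-1} \boldsymbol{X}^\top \boldsymbol{y}$ (my own illustration; `ridge_regression` and the default `lam` value are illustrative choices):

```python
# Sketch of the ridge regression closed-form solution; lam is the regularization
# hyperparameter (to be tuned, e.g., via cross-validation).
import numpy as np

def ridge_regression(X, y, lam=1.0):
    D = X.shape[1]
    # Adding lam to the diagonal makes X^T X + lam*I invertible for any lam > 0
    return np.linalg.solve(X.T @ X + lam * np.eye(D), X.T @ y)
```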
Other Ways to Control Overfitting
• Use a regularizer defined by other norms, e.g., the $\ell_1$ norm regularizer
$$\|\boldsymbol{w}\|_1 = \sum_{d=1}^{D} |w_d|$$
or the $\ell_0$ norm regularizer
$$\|\boldsymbol{w}\|_0 = \mathrm{nnz}(\boldsymbol{w}) \quad \text{(counts the number of nonzeros in } \boldsymbol{w}\text{)}$$
• When should these regularizers be used instead of the $\ell_2$ regularizer? Use them if you have a very large number of features but many irrelevant features. These regularizers can help in automatic feature selection.
• Using such regularizers gives a sparse weight vector $\boldsymbol{w}$ as the solution. "Sparse" means many entries in $\boldsymbol{w}$ will be zero or near zero; those features will be considered irrelevant by the model and will not influence prediction.
• Note that optimizing loss functions with such regularizers is usually harder than ridge regression, but several advanced techniques exist (we will see some of those later).
• Alternatively, use non-regularization based approaches.
• Note: Since they learn a sparse $\boldsymbol{w}$, $\ell_0$ or $\ell_1$ regularization is also useful for doing feature selection ($w_d = 0$ means feature $d$ is irrelevant). We will revisit this later to see formally how it works; a small illustrative sketch follows below.
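As a purely illustrative sketch of the sparsity effect (not the slides' method; it assumes scikit-learn's `Lasso` solver and an arbitrary `alpha` as the regularization strength):

```python
# Illustrative sketch: an l1-regularized ("Lasso") fit typically yields a sparse
# weight vector when many features are irrelevant.
import numpy as np
from sklearn.linear_model import Lasso

rng = np.random.default_rng(2)
N, D = 100, 20
X = rng.normal(size=(N, D))
w_true = np.zeros(D)
w_true[:3] = [2.0, -1.0, 0.5]          # only the first 3 features are relevant
y = X @ w_true + 0.1 * rng.normal(size=N)

lasso = Lasso(alpha=0.1).fit(X, y)
print("number of nonzero weights:", np.count_nonzero(lasso.coef_))
# Features whose weight is exactly 0 are effectively dropped from the prediction.
```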
Linear/Ridge Regression via Gradient Descent
• Both least squares regression and ridge regression require matrix inversion:
$$\boldsymbol{w}_{LS} = (\boldsymbol{X}^\top \boldsymbol{X})^{-1} \boldsymbol{X}^\top \boldsymbol{y} \qquad \boldsymbol{w}_{ridge} = (\boldsymbol{X}^\top \boldsymbol{X} + \lambda I_D)^{-1} \boldsymbol{X}^\top \boldsymbol{y}$$
• This can be computationally expensive when $D$ is very large
• A faster way is to use iterative optimization, such as batch or stochastic gradient descent
• A basic batch gradient-descent based procedure looks like:
• Start with an initial value of $\boldsymbol{w}$
• Update $\boldsymbol{w}$ by moving in the opposite direction of the gradient of the loss function,
$$\boldsymbol{w} \leftarrow \boldsymbol{w} - \eta \frac{\partial L(\boldsymbol{w})}{\partial \boldsymbol{w}}$$
where $\eta$ is the learning rate
• Repeat until convergence
(Such iterative methods for optimizing loss functions are widely used in ML. We will revisit these later in detail.)
• For least squares, the gradient is
$$\frac{\partial L(\boldsymbol{w})}{\partial \boldsymbol{w}} = -2 \sum_{n=1}^{N} \boldsymbol{x}_n (y_n - \boldsymbol{w}^\top \boldsymbol{x}_n)$$
(no matrix inversion involved). A small sketch of this procedure is given below.
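A minimal sketch of this batch gradient descent loop for the least squares loss (illustrative values for `eta` and `n_iters`; in practice they need tuning):

```python
# Sketch of batch gradient descent for the (unregularized) least squares loss.
import numpy as np

def gd_least_squares(X, y, eta=1e-3, n_iters=5000):
    N, D = X.shape
    w = np.zeros(D)                          # start with an initial value of w
    for _ in range(n_iters):
        grad = -2 * X.T @ (y - X @ w)        # gradient of sum_n (y_n - w^T x_n)^2
        w = w - eta * grad                   # move opposite to the gradient (rate eta)
    return w

# Tiny usage example on synthetic data; no matrix inversion involved.
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))
w_true = np.array([1.5, -2.0, 0.5])
y = X @ w_true + 0.1 * rng.normal(size=100)
print(gd_least_squares(X, y))                # should approach w_true
```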
Linear/Ridge Regression via Gradient Descent
We will revisit gradient-based methods later, but a few things to keep in mind:
• (Includes an illustration, with points labeled A and B, showing convex and non-convex functions.)
• A function is convex if its second derivative is non-negative everywhere (for scalar functions) or if its Hessian is positive semi-definite (for functions of vector inputs).
For a convex function, every local minimum is also a global minimum.
• For Gradient Descent, the learning rate is important (should not be too large or too
small).
Linear Regression as Solving System of Linear Eqs
The form of the linear regression model is akin to a system of linear equations. Assuming $N$ training examples with $D$ features each, we have $N$ equations and $D$ unknowns ($w_1, \dots, w_D$):

First training example: $y_1 = x_{11} w_1 + x_{12} w_2 + \dots + x_{1D} w_D$

Second training example: $y_2 = x_{21} w_1 + x_{22} w_2 + \dots + x_{2D} w_D$

$\vdots$

N-th training example: $y_N = x_{N1} w_1 + x_{N2} w_2 + \dots + x_{ND} w_D$

Note: Here $x_{nd}$ denotes the $d$-th feature of the $n$-th training example. A sketch of solving this system in the least squares sense is given below.
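A small sketch of this viewpoint (my own illustration): stacking the $N$ equations into $\boldsymbol{X}\boldsymbol{w} = \boldsymbol{y}$ and solving in the least squares sense with NumPy:

```python
# Sketch: the N equations y_n = x_n1*w_1 + ... + x_nD*w_D form the linear system
# X w = y; for N > D it is overdetermined, so we solve it in the least squares sense.
import numpy as np

rng = np.random.default_rng(3)
N, D = 5, 2
X = rng.normal(size=(N, D))                     # coefficients of the unknowns w_1..w_D
y = rng.normal(size=N)                          # right-hand sides
w, residuals, rank, _ = np.linalg.lstsq(X, y, rcond=None)
print("least squares solution w:", w)
```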
References:
https://ml-cheatsheet.readthedocs.io/en/latest/linear_regression.html
https://towardsdatascience.com/introduction-to-machine-learning-algorithms-linear-regression-14c4e325882a
https://www.edureka.co/blog/linear-regression-for-machine-learning/
https://www.analyticsvidhya.com/blog/2021/06/linear-regression-in-machine-learning/
https://towardsdatascience.com/polynomial-regression-bbe8b9d97491
https://www.analyticsvidhya.com/blog/2021/10/understanding-polynomial-regression-model/
https://www.coursera.org/lecture/machine-learning/features-and-polynomial-regression-Rqgfz
Bias and Variance
Overfitting: The model fits the training set very well but fails to generalize to new
examples. Overfitting happens when a model learns the detail and noise in the
training data to the extent that it negatively impacts the performance of the model
on new data. This means that the noise or random fluctuations in the training data are picked up and learned as concepts by the model. The problem is that these concepts
do not apply to new data and negatively impact the model's ability to generalize.
Underfitting: Underfitting refers to a model that can neither model the training data nor generalize to new data. An underfit
machine learning model is not a suitable model; this is usually obvious because it performs poorly even on the training data.
Variance: Captures how much your classifier changes if you train on a different training set. How "over-specialized" is your
classifier to a particular training set (overfitting)? If we have the best possible model for our training data, how far off are we
from the average classifier?
Bias: What is the inherent error that you obtain from your classifier even with infinite training data? This is due to your
classifier being "biased" to a particular kind of solution (e.g. linear classifier). In other words, bias is inherent to your model.
Bias and Variance
Variance: The variance is an error from sensitivity to small fluctuations in the training set. High variance may result from an
algorithm modeling the random noise in the training data (overfitting). The target function is estimated from the training data
by a machine learning algorithm, so we should expect the algorithm to have some variance. Ideally, it should not change too
much from one training dataset to the next, meaning that the algorithm is good at picking out the hidden underlying mapping
between the inputs and the output variables. Machine learning algorithms with low variance include linear regression and logistic regression; algorithms with high variance include decision trees and k-nearest neighbours.
The bias–variance tradeoff is a central problem in supervised learning. Ideally, one wants to choose a model that both
accurately captures the regularities in its training data and generalizes well to unseen data. Unfortunately, it is typically
impossible to do both simultaneously. High-variance learning methods may be able to represent their training set well but are at
risk of overfitting to noisy or unrepresentative training data. In contrast, algorithms with high bias typically produce simpler
models that may fail to capture important regularities (i.e. underfit) in the data.
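A rough sketch of how variance can be estimated empirically (my own illustration, not from the slides): refit a low-complexity and a high-complexity model on many freshly drawn training sets and measure how much their predictions change.

```python
# Rough illustration: a degree-1 vs. a degree-9 polynomial fit, each refit on many
# resampled training sets; the spread of their predictions estimates model variance.
import numpy as np

rng = np.random.default_rng(4)
x_test = np.linspace(0.05, 0.95, 50)

def prediction_variance(degree, n_trials=200, n_train=20):
    preds = []
    for _ in range(n_trials):
        x = rng.uniform(0, 1, n_train)
        y = np.sin(2 * np.pi * x) + 0.3 * rng.normal(size=n_train)
        coeffs = np.polyfit(x, y, degree)            # fit polynomial of given degree
        preds.append(np.polyval(coeffs, x_test))
    preds = np.array(preds)
    return preds.var(axis=0).mean()                  # average variance over test points

for degree in (1, 9):
    print(f"degree {degree}: estimated prediction variance = {prediction_variance(degree):.3f}")
```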
Bias Variance Tradeoffs
• Models with low bias and low variance are golden, but usually they exist only for specific domains (e.g. linear models may do very well in predicting income as a function of education). Expecting low variance and low bias for a general task is a pipe dream.
• Two main sources of bad test performance for ML algos:
• Bias: the model is too weak, e.g. a linear model for a very complex task. Even the best trained linear model is pathetic.
• Variance: the model is strong but you could not train it properly, e.g. a NN (Neural Network). The best trained NN is NP-hard to learn.
(Figure: test error.)
Next Lecture
• Solving linear regression using iterative optimization methods (e.g., Gradient Descent)
• These are faster and don't require matrix inversion
• Brief intro to optimization techniques