Lecture 13 - Regularization

Transfer Functions

Regression - Regularization
Overfitting and underfitting

 Overfitting occurs when the model captures the noise and the outliers in the data along with the underlying pattern. These models usually have high variance and low bias.

 Underfitting occurs when the model is unable to capture the underlying pattern of the data. These models usually have low variance and high bias.

Bias and variance

 Bias error
• How far the predicted values are from the true values
• The systematic error of the model
• It is about the model and the data itself

 Variance error
• The error caused by sensitivity to small variations in the training data set
• The dispersion of the predicted values around the target values across different training sets
• It is about the model's sensitivity
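
One way to make these two components concrete is a small resampling sketch (not from the slides): it fits the same polynomial model on many freshly drawn training sets, then measures the squared gap between the average prediction and the truth (bias) and the spread of the predictions across training sets (variance). The sine ground truth, the noise level, and the polynomial degree are arbitrary illustrative choices.

```python
import numpy as np

rng = np.random.default_rng(0)
true_f = lambda x: np.sin(2 * np.pi * x)       # assumed ground-truth function
x_test = np.linspace(0, 1, 50)
degree = 3                                      # model-complexity knob

# Fit the same model on many independently drawn training sets
preds = []
for _ in range(200):
    x_tr = rng.uniform(0, 1, 30)
    y_tr = true_f(x_tr) + rng.normal(0, 0.3, size=30)   # noisy targets
    coefs = np.polyfit(x_tr, y_tr, deg=degree)
    preds.append(np.polyval(coefs, x_test))
preds = np.array(preds)

# Bias^2: squared gap between the average prediction and the true value
bias2 = np.mean((preds.mean(axis=0) - true_f(x_test)) ** 2)
# Variance: spread of the predictions across the different training sets
variance = np.mean(preds.var(axis=0))
print(f"bias^2 = {bias2:.4f}, variance = {variance:.4f}")
```

Re-running the sketch with degree = 1 versus degree = 9 shows the pattern described on the next slide: bias falls and variance rises as model complexity grows.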

The bias-variance curse
 As the complexity of the model increases, the bias decreases but the variance increases.

 There is a trade-off between bias and variance: you can't get both low bias and low variance at the same time.

Regularization
 A regularizer is an additional criterion added to the loss function to make sure that we don't overfit

 It's called a regularizer since it tries to keep the parameters more normal/regular

 It is a bias on the model that forces the learning to prefer certain types of weights over others
Regularization

• Ridge/Lasso regression is a model tuning method that is used to analyze any data that suffers from multicollinearity.

• This method performs L2/L1 regularization.

• When multicollinearity occurs, the least-squares estimates remain unbiased but their variances are large, which results in predicted values being far away from the actual values.
Regularization
Multicollinearity

• It occurs when the independent variables show moderate to high correlation.


• In a model with correlated variables, it becomes a tough task to figure out the
true relationship of a predictors with response variable. In other words, it
becomes difficult to find out which variable is actually contributing to predict the
response variable.
• Another point, with presence of correlated predictors, the standard errors tend
to increase. And, with large standard errors, the confidence interval becomes
wider leading to less precise estimates of slope parameters.
• Additionally, when predictors are correlated, the estimated regression
coefficient of a correlated variable depends on the presence of other predictors
in the model.

Y = W0+W1*X1+W2*X2

• Coefficient W1 is the increase in Y for a unit increase in X1 while keeping X2


constant. But since X1 and X2 are highly correlated, changes in X1 would also
cause changes in X2, and we would not be able to see their individual effect on
Y.
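
To make the coefficient instability concrete, here is a small simulation sketch (not from the slides): it repeatedly draws data with true coefficients W1 = 2 and W2 = 3 and fits ordinary least squares, once with highly correlated X1 and X2 and once with nearly uncorrelated ones. The correlation level, sample size, and coefficient values are illustrative choices only.

```python
import numpy as np

rng = np.random.default_rng(1)

def fit_ols(n=100, corr=0.98):
    """Draw one dataset with correlated X1, X2 and return the OLS coefficients."""
    x1 = rng.normal(size=n)
    x2 = corr * x1 + np.sqrt(1 - corr**2) * rng.normal(size=n)   # X2 closely tracks X1
    y = 1.0 + 2.0 * x1 + 3.0 * x2 + rng.normal(size=n)           # true W1 = 2, W2 = 3
    X = np.column_stack([np.ones(n), x1, x2])
    return np.linalg.lstsq(X, y, rcond=None)[0]

coefs_hi = np.array([fit_ols(corr=0.98) for _ in range(500)])
coefs_lo = np.array([fit_ols(corr=0.10) for _ in range(500)])
print("std of W1, W2 with corr=0.98:", coefs_hi[:, 1:].std(axis=0))
print("std of W1, W2 with corr=0.10:", coefs_lo[:, 1:].std(axis=0))
```

With the near-duplicate predictors, the individual coefficient estimates swing widely from sample to sample, even though the overall fit is fine.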
What Causes Multicollinearity?

• Poorly designed experiments, highly observational data, or the inability to manipulate the data.

• Multicollinearity can also occur when new variables are created that are dependent on other variables.
• For example, creating a BMI variable from the height and weight variables would include redundant information in the model, and the new variable would be highly correlated with them.

• Including identical variables in the dataset.
• For example, including variables for temperature in Fahrenheit and temperature in Celsius.

• Insufficient data can, in some cases, also cause multicollinearity problems.
Multicollinearity

• How to check: one can use a scatter plot to visualize the correlation among variables.

• One can also use the VIF (Variance Inflation Factor).

• VIF measures the strength of the correlation between an independent variable and the other independent variables. It is computed by taking a variable and regressing it against every other independent variable, which gives an R^2 value for that regression; VIF = 1 / (1 - R^2).

• The closer the R^2 value is to 1, the higher the VIF and the higher the multicollinearity associated with that particular independent variable.

• VIF starts at 1 and has no upper limit.
• VIF = 1: no correlation between the independent variable and the other variables.
• VIF exceeding 5 or 10 indicates high multicollinearity between this independent variable and the others.
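
As an illustration of this check, the sketch below computes VIFs with statsmodels; the DataFrame and its column names are hypothetical stand-ins for the employee dataset used in the example on the next slide.

```python
import pandas as pd
from statsmodels.stats.outliers_influence import variance_inflation_factor
from statsmodels.tools.tools import add_constant

# Hypothetical independent variables (column names are assumptions for the demo)
df = pd.DataFrame({
    "Gender": [0, 1, 1, 0, 1, 0, 0, 1],
    "Age": [25, 32, 41, 28, 54, 36, 47, 30],
    "Years_of_service": [2, 8, 18, 4, 30, 12, 22, 6],
    "Education_level": [1, 2, 1, 0, 2, 1, 2, 1],
})

X = add_constant(df)   # include an intercept column before computing VIFs
vif = pd.Series(
    [variance_inflation_factor(X.values, i) for i in range(X.shape[1])],
    index=X.columns,
)
print(vif.drop("const"))   # 'Age' and 'Years_of_service' should show large VIFs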
Multicollinearity - Example

Gender (0 – female, 1 – male)
Education level (0 – no formal education, 1 – under-graduation, 2 – post-graduation)

• We can see here that 'Age' and 'Years of service' have high VIF values, meaning they can be predicted by the other independent variables in the dataset.
Fixing Multicollinearity - Example

• We were able to drop the variable ‘Age’ from the dataset because its information was
being captured by the ‘Years of service’ variable.

• This has reduced the redundancy in our dataset.

• Dropping variables should be an iterative process, starting with the variable that has the largest VIF value, because its trend is largely captured by the other variables.

• If you do this, you will notice that the VIF values of the other variables decrease too, although to varying extents.

• In our example, after dropping the 'Age' variable, the VIF values of all remaining variables decreased to varying degrees.
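
A minimal sketch of that iterative procedure is below; it reuses the statsmodels helpers and the hypothetical df from the earlier VIF sketch, and the threshold of 5 is just a common rule of thumb.

```python
import pandas as pd
from statsmodels.stats.outliers_influence import variance_inflation_factor
from statsmodels.tools.tools import add_constant

def drop_high_vif(df, threshold=5.0):
    """Repeatedly drop the variable with the largest VIF until all VIFs <= threshold."""
    cols = list(df.columns)
    while len(cols) > 1:
        X = add_constant(df[cols])
        vifs = pd.Series(
            [variance_inflation_factor(X.values, i) for i in range(X.shape[1])],
            index=X.columns,
        ).drop("const")
        if vifs.max() <= threshold:
            break
        cols.remove(vifs.idxmax())   # e.g. 'Age' is dropped first in the example above
    return df[cols]

# reduced = drop_high_vif(df)   # df: the hypothetical DataFrame from the previous sketch
```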
Need to fix Multicollinearity

• Multicollinearity may not be a problem every time. The need to fix multicollinearity depends primarily on the following considerations:

1. When you care about how much each individual feature, rather than a group of features, affects the target variable, then removing multicollinearity may be a good option.

2. If multicollinearity is not present among the features you are interested in, then it may not be a problem.
Regularization methods

 Ridge regression

 Lasso regression

 Elastic regression
Regularization: An Overview

Common regularizers

sum of the weights: Σ_j |w_j|

sum of the squared weights: Σ_j w_j^2

What's the difference between these?

• Squared weights penalize large values more

• The sum of the weights penalizes small values relatively more
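
To see the difference numerically, here is a tiny sketch (illustrative numbers only) comparing the two penalties for a weight vector of many small values versus one with a single large value of the same L1 norm.

```python
import numpy as np

many_small = np.array([0.5] * 8)            # eight small weights
one_large = np.array([4.0] + [0.0] * 7)     # one large weight, same L1 norm

for name, w in [("many small", many_small), ("one large", one_large)]:
    print(f"{name}: sum |w| = {np.abs(w).sum():.1f}, sum w^2 = {(w**2).sum():.2f}")
# Both vectors have sum |w| = 4.0, but the single large weight has a far bigger
# squared penalty, so the L2 regularizer punishes large individual values much more.
```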
Ridge Regression
 The regularized loss function is:

L_ridge(b) = MSE(b) + λ * Σ_j b_j^2

 Note that Σ_j b_j^2 = ||b||_2^2 is the square of the l2 norm of the vector b
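
For reference, here is a minimal scikit-learn sketch of ridge regression on synthetic data; alpha plays the role of λ, and the values are arbitrary illustrative choices.

```python
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 5))
y = X @ np.array([3.0, -2.0, 0.0, 0.0, 1.0]) + rng.normal(size=100)

# Standardize the features so the penalty treats all coefficients on the same scale
model = make_pipeline(StandardScaler(), Ridge(alpha=1.0)).fit(X, y)
print(model.named_steps["ridge"].coef_)   # shrunk towards zero, none exactly zero
```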

LASSO Regression
 The regularized loss function is:

L_LASSO(b) = MSE(b) + λ * Σ_j |b_j|

 Note that Σ_j |b_j| = ||b||_1 is the l1 norm of the vector b
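
A matching scikit-learn sketch for the lasso, on the same kind of synthetic data (alpha again stands in for λ and is an arbitrary choice):

```python
import numpy as np
from sklearn.linear_model import Lasso
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 5))
y = X @ np.array([3.0, -2.0, 0.0, 0.0, 1.0]) + rng.normal(size=100)

model = make_pipeline(StandardScaler(), Lasso(alpha=0.5)).fit(X, y)
print(model.named_steps["lasso"].coef_)   # the weakest features are typically driven exactly to zero
```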

Choosing λ
 In both Ridge and LASSO regression, we see that the larger our choice of the regularization parameter λ, the more heavily we penalize large values in b.
• If λ is close to zero, we recover the MSE, i.e. ridge and LASSO regression are just ordinary regression.

• If λ is sufficiently large, the MSE term in the regularized loss function will be insignificant and the regularization term will force b_ridge and b_LASSO to be close to zero.

 To avoid ad-hoc choices, we should select λ using cross-validation.
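
A minimal sketch of that cross-validated choice with scikit-learn's RidgeCV and LassoCV (the alpha grid and data are illustrative assumptions):

```python
import numpy as np
from sklearn.linear_model import LassoCV, RidgeCV

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5))
y = X @ np.array([3.0, -2.0, 0.0, 0.0, 1.0]) + rng.normal(size=200)

# Ridge: pick lambda (alpha) from an explicit grid via 5-fold cross-validation
ridge = RidgeCV(alphas=np.logspace(-3, 3, 13), cv=5).fit(X, y)
# Lasso: LassoCV builds its own alpha path and cross-validates over it
lasso = LassoCV(cv=5, random_state=0).fit(X, y)

print("chosen ridge alpha:", ridge.alpha_)
print("chosen lasso alpha:", lasso.alpha_)
```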

Ridge Regularization

Ridge Regularization - Example
Ridge visualized

The ridge estimator is where the constraint and the loss intersect. The values of the coefficients decrease as lambda increases, but they are not nullified.
Ridge regularization: step by step
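
Since the slide's worked steps are not reproduced here, below is a minimal NumPy sketch of the closed-form ridge solution, assuming standardized features and centred targets so that no intercept needs to be penalized; the data and lambda values are illustrative.

```python
import numpy as np

def ridge_fit(X, y, lam):
    """Closed-form ridge solution: b = (X^T X + lam * I)^(-1) X^T y."""
    A = X.T @ X + lam * np.eye(X.shape[1])
    return np.linalg.solve(A, X.T @ y)

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 4))
y = X @ np.array([2.0, 0.0, -1.0, 0.5]) + rng.normal(size=100)

for lam in [0.0, 1.0, 10.0, 100.0]:
    print(lam, np.round(ridge_fit(X, y, lam), 3))   # coefficients shrink as lambda grows
```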

Lasso Regression

• The objective function is not differentiable at zero, so there is no closed-form solution; it is typically solved with coordinate descent or subgradient methods rather than standard gradient descent.
LASSO visualized

The Lasso estimator tends to zero out parameters because the OLS loss can easily intersect the constraint on one of the axes. The values of the coefficients decrease as lambda increases, and are nullified fast.
Lasso regularization: step by step
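
The slide's steps are not reproduced here, so below is a minimal coordinate-descent sketch with the soft-thresholding update, for the objective (1/(2n))·||y - Xb||^2 + λ·||b||_1; the data, lambda, and iteration count are illustrative assumptions.

```python
import numpy as np

def soft_threshold(rho, lam):
    """Soft-thresholding operator: the solution of the one-dimensional lasso problem."""
    return np.sign(rho) * max(abs(rho) - lam, 0.0)

def lasso_cd(X, y, lam, n_iter=200):
    """Coordinate descent for (1/(2n)) * ||y - X b||^2 + lam * ||b||_1."""
    n, p = X.shape
    b = np.zeros(p)
    for _ in range(n_iter):
        for j in range(p):
            r = y - X @ b + X[:, j] * b[j]        # residual with feature j's contribution removed
            rho = X[:, j] @ r / n
            b[j] = soft_threshold(rho, lam) / (X[:, j] @ X[:, j] / n)
    return b

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 6))
X = (X - X.mean(axis=0)) / X.std(axis=0)           # standardize the columns
y = X @ np.array([3.0, 0.0, -2.0, 0.0, 0.0, 1.0]) + rng.normal(size=200)
y = y - y.mean()

print(np.round(lasso_cd(X, y, lam=0.1), 3))         # weak coefficients end up exactly zero
```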

Lasso vs. Ridge Regression
 The lasso has a major advantage over ridge regression, in that it produces simpler and more interpretable models that involve only a subset of the predictors.

 The lasso leads to qualitatively similar behavior to ridge regression, in that as λ increases, the variance decreases and the bias increases.

 The lasso can generate more accurate predictions than ridge regression when the response depends on only a small subset of the predictors.

 Cross-validation can be used in order to determine which approach is better on a particular data set.
Lasso vs. Ridge Regression

Description
• Ridge: Ridge regression, also known as Tikhonov regularization, is a technique that introduces a penalty term to the linear regression model to shrink the coefficient values.
• Lasso: Lasso regression, or Least Absolute Shrinkage and Selection Operator, is a regularization method that also includes a penalty term but can set some coefficients exactly to zero, effectively selecting relevant features.

Penalty Type
• Ridge: Ridge regression utilizes an L2 penalty, which adds the sum of the squared coefficient values multiplied by a tuning parameter (lambda).
• Lasso: Lasso regression employs an L1 penalty, which sums the absolute values of the coefficients multiplied by lambda.

Coefficient Impact
• Ridge: The L2 penalty discourages large coefficient values, pushing them towards zero but never exactly reaching zero. This shrinks the less important features' impact.
• Lasso: The L1 penalty can drive some coefficients to exactly zero when the lambda value is large enough, performing feature selection and resulting in a sparse model.

Feature Selection
• Ridge: Ridge regression retains all features in the model, reducing the impact of less important features by shrinking their coefficients.
• Lasso: Lasso regression can set some coefficients to zero, effectively selecting the most relevant features and improving model interpretability.

Use Case
• Ridge: Ridge regression is useful when the goal is to minimize the impact of less important features while keeping all variables in the model.
• Lasso: Lasso regression is preferred when the goal is feature selection, resulting in a simpler and more interpretable model with fewer variables.

Model Complexity
• Ridge: Ridge regression tends to favor a model with a higher number of parameters, as it shrinks less important coefficients but keeps them in the model.
• Lasso: Lasso regression can lead to a less complex model by setting some coefficients to zero, reducing the number of effective parameters.

Sparsity
• Ridge: Ridge regression does not yield sparse models, since all coefficients remain non-zero.
• Lasso: Lasso regression can produce sparse models by setting some coefficients to exactly zero.

Sensitivity
• Ridge: More robust and less sensitive to outliers compared to lasso regression.
• Lasso: More sensitive to outliers due to the absolute value in the penalty term.

Interpretability
• Ridge: The results of ridge regression may be less interpretable due to the inclusion of all features, each with a reduced but non-zero coefficient.
• Lasso: Lasso regression can improve interpretability by selecting only the most relevant features, making the model's predictions more explainable.
Elastic Regression
• If there is a group of highly correlated variables, then the LASSO tends to select one variable from the group and ignore the others.

• Elastic net regression addresses this by combining the L1 and L2 penalties, so correlated predictors tend to be kept or dropped together.
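
Assuming scikit-learn's ElasticNet as the implementation, here is a minimal sketch of that behaviour on synthetic data with a group of three nearly identical predictors; alpha and l1_ratio are arbitrary illustrative values, and the exact coefficients depend on the random seed.

```python
import numpy as np
from sklearn.linear_model import ElasticNet, Lasso

rng = np.random.default_rng(0)
z = rng.normal(size=(200, 1))
group = z + 0.01 * rng.normal(size=(200, 3))     # three nearly identical predictors
others = rng.normal(size=(200, 2))               # two unrelated predictors
X = np.hstack([group, others])
y = group.sum(axis=1) + rng.normal(size=200)

lasso = Lasso(alpha=0.1).fit(X, y)
enet = ElasticNet(alpha=0.1, l1_ratio=0.5).fit(X, y)   # mixes the L1 and L2 penalties

print("lasso:", np.round(lasso.coef_, 2))   # tends to concentrate weight on one group member
print("enet :", np.round(enet.coef_, 2))    # tends to spread weight across the correlated group
```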
