Ridge Regression
L2 Regularization (Ridge Regression)
How It Works
● Adding a Penalty: When we train our model, we want not only to minimize
the loss function (the error) but also to keep the model simple. L2
regularization does this by adding a "penalty" to the loss function based on
the size of the coefficients (the numbers that represent how much each
feature contributes to the prediction).
● Squaring the Coefficients: Specifically, L2 regularization adds a term that is
proportional to the square of each coefficient. This means that if any coefficient
becomes too large, it significantly increases the penalty. The formula looks
something like this:

Loss = Σᵢ (yᵢ − ŷᵢ)² + α Σⱼ βⱼ²

Here, α is a number that controls how much we want to penalize large coefficients.
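As a rough illustration of this formula (a minimal sketch; the function name and inputs below are our own, not from any library), the penalized loss can be computed directly in Python:

import numpy as np

def ridge_loss(y_true, y_pred, coefs, alpha):
    # First term: squared error between actual and predicted values.
    error_term = np.sum((y_true - y_pred) ** 2)
    # Second term: alpha times the sum of squared coefficients (the L2 penalty).
    penalty_term = alpha * np.sum(coefs ** 2)
    return error_term + penalty_term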
Purpose of L2 Regularization
The main purpose of L2 regularization is to keep the model's coefficients
from becoming excessively large, thus stabilizing the model's predictions across
different datasets [1][2].
The first term measures the error between predicted and actual values, while the
second term penalizes large coefficients. As α increases, more regularization is
applied, leading to smaller coefficient values but potentially higher bias [3][4].
● Increasing α: When you increase α, you make the penalty for large
coefficients stronger. This encourages the model to keep coefficients smaller,
which can help prevent overfitting. However, it may also lead to a simpler
model that doesn’t capture all patterns in the data very well—this can
increase bias (the model might miss some important relationships).
● Decreasing α: When you decrease α, you reduce the penalty on large
coefficients. This allows your model to fit more closely to your training data,
which can reduce bias but may lead to overfitting if it captures too much
noise from that data.
Understanding the bias-variance tradeoff helps us adjust α to strike a good
balance. A good balance between bias and variance helps ensure that your model
performs well not just on training data but also in real-world situations.
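To make this concrete, here is a minimal sketch (synthetic data and arbitrary alpha values, purely illustrative) of how training and test R² typically shift as α changes, using scikit-learn's Ridge:

from sklearn.datasets import make_regression
from sklearn.linear_model import Ridge
from sklearn.model_selection import train_test_split

# Synthetic regression data standing in for a real dataset.
X, y = make_regression(n_samples=200, n_features=30, noise=15.0, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

for alpha in [0.01, 1, 10, 100]:
    model = Ridge(alpha=alpha).fit(X_train, y_train)
    # Larger alpha -> stronger penalty -> smaller coefficients and, usually, lower training R².
    print(alpha, model.score(X_train, y_train), model.score(X_test, y_test))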
1. Handling Multicollinearity
What is Multicollinearity?
● Multicollinearity occurs when two or more independent variables (features)
in a dataset are highly correlated. For example, if you have both "height in
centimeters" and "height in inches" as features, they provide similar
information, which can confuse the model.
Why is it a Problem?
● When features are highly correlated, it can lead to unstable estimates of the
coefficients in a regression model. This means that small changes in the data
can lead to large changes in the estimated coefficients and, in turn, the model's
predictions.
How L2 Regularization Helps:
● L2 regularization helps by distributing the influence of these correlated
features across all of them instead of allowing one feature to dominate the
model. It does this by shrinking the coefficients of all correlated features
towards each other, which stabilizes the model and makes it more reliable.
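A small sketch of this effect, using entirely synthetic data based on the height example above, is shown here; with two perfectly correlated features, Ridge gives each a similar, moderate coefficient instead of letting one dominate:

import numpy as np
from sklearn.linear_model import Ridge

rng = np.random.default_rng(0)
height_cm = rng.uniform(150, 200, size=100)
height_in = height_cm / 2.54                       # the same information in different units
X = np.column_stack([height_cm, height_in])
y = 0.5 * height_cm + rng.normal(0, 5, size=100)   # target driven by height plus noise

# Ridge spreads the weight across both correlated columns.
print(Ridge(alpha=1.0).fit(X, y).coef_)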
2. Model Interpretability is Not a Primary Concern
What Does This Mean?
● In some cases, understanding exactly how each feature contributes to the
prediction is not as important as getting accurate predictions overall. For
example, in complex models used for tasks like image recognition, knowing
the exact contribution of each pixel may not be necessary.
Why Choose L2 Regularization?
● Unlike L1 regularization (lasso), which can eliminate some features entirely
by setting their coefficients to zero (making them irrelevant), L2
regularization keeps all features in the model. This means you won't lose any
potentially useful information, even if you don’t fully understand how each
feature affects the outcome.
3. High-Dimensional Datasets
What Are High-Dimensional Datasets?
● High-dimensional datasets have a large number of features compared to the
number of observations (data points). For instance, if you have thousands of
features but only a few hundred samples, this situation can lead to
overfitting.
Challenges with High Dimensions
● In high-dimensional spaces, models can easily become overly complex and
fit noise rather than actual patterns in the data. This makes them perform
poorly on new, unseen data.
Benefits of L2 Regularization:
● L2 regularization is particularly effective in these situations because it helps
control the complexity of the model by shrinking all coefficients towards
zero without eliminating any features. This ensures that even with many
predictors, the model remains stable and performs well on new data.
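As an illustration (synthetic data with far more features than samples; all numbers are arbitrary), Ridge can still be fit and cross-validated sensibly in this regime:

from sklearn.datasets import make_regression
from sklearn.linear_model import Ridge
from sklearn.model_selection import cross_val_score

# 100 samples but 2,000 features: a setting where an unpenalized fit overfits badly.
X, y = make_regression(n_samples=100, n_features=2000, n_informative=20,
                       noise=10.0, random_state=0)
scores = cross_val_score(Ridge(alpha=10.0), X, y, cv=5, scoring="r2")
print(scores.mean())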
Limitations of L2 Regularization
While L2 regularization is a powerful technique for improving model performance,
it does have some drawbacks. Here are the main limitations:
1. No Feature Selection
What Does This Mean?
● In machine learning, feature selection is the process of identifying and
keeping only the most important variables (features) for making predictions.
Some methods, like L1 regularization (lasso), can completely remove less
important features by setting their coefficients to zero.
L2 Regularization's Approach
● L2 regularization does not perform feature selection. Instead of eliminating
features, it shrinks all coefficients towards zero but never actually sets any of
them to zero. This means that even if a feature is not useful or relevant to the
prediction, it will still remain in the model with a small coefficient.
Why Is This a Problem?
● Keeping irrelevant features can make the model more complex and harder to
interpret. It may also lead to longer training times and less efficient models,
because the model still considers all features, even those that don't
contribute meaningful information [1][4].
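The difference can be seen in a short sketch (synthetic data; the alpha values are arbitrary and chosen only for illustration): Lasso drives some coefficients exactly to zero, while Ridge merely shrinks them:

import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import Lasso, Ridge

X, y = make_regression(n_samples=100, n_features=10, n_informative=3,
                       noise=5.0, random_state=0)
ridge_coefs = Ridge(alpha=10.0).fit(X, y).coef_
lasso_coefs = Lasso(alpha=10.0).fit(X, y).coef_

# Ridge keeps every feature with a (possibly tiny) nonzero coefficient;
# Lasso typically sets the uninformative ones exactly to zero.
print("zero coefficients (Ridge):", np.sum(ridge_coefs == 0))
print("zero coefficients (Lasso):", np.sum(lasso_coefs == 0))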
2. Bias-Variance Tradeoff
What Is Bias and Variance?
● Bias refers to the error introduced by approximating a real-world problem
with a simplified model. High bias can cause a model to miss important
patterns (underfitting).
● Variance refers to how much the model's predictions change when it is
trained on different datasets. High variance can cause a model to fit noise in
the training data (overfitting).
How Does Regularization Affect Bias and Variance?
When you increase the strength of L2 regularization (the parameter α), you make
the penalty for large coefficients stronger:
● This typically leads to higher bias because the model becomes simpler and
may miss some relationships in the data.
● It also leads to lower variance because the model becomes less sensitive to
fluctuations in training data.
Why Is This Important?
While increasing regularization can help improve how well the model performs on
new data (generalization), it may also result in poorer performance on training data
because it oversimplifies the model. Striking the right balance between bias and
variance is crucial for creating effective models [2][6].
3. Sensitivity to Outliers
What Are Outliers?
● Outliers are data points that are significantly different from others in your
dataset. For example, if you're predicting house prices and one house costs
$10 million while most others cost around $300,000, that $10 million house
is an outlier.
How Does L2 Regularization Respond to Outliers?
● L2 regularization squares the coefficients when calculating its penalty term.
This means that larger coefficients (which might be influenced by outliers)
will have an even greater penalty because squaring amplifies their value.
Why Is This a Concern?
● If there are outliers in your data, they can disproportionately affect the
model's performance because they can lead to larger coefficient values,
which then receive higher penalties during training. As a result, the model
may become skewed by these outliers, leading to less reliable predictions
[7].
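A rough, synthetic illustration of this sensitivity (all values invented) is to fit the same Ridge model with and without a single extreme target value and compare the coefficients:

import numpy as np
from sklearn.linear_model import Ridge

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))
y = X @ np.array([2.0, -1.0, 0.5]) + rng.normal(0, 0.5, size=100)

coefs_clean = Ridge(alpha=1.0).fit(X, y).coef_

y_outlier = y.copy()
y_outlier[0] += 100.0                     # inject one extreme outlier
coefs_outlier = Ridge(alpha=1.0).fit(X, y_outlier).coef_

# The squared-error objective lets a single extreme point pull the fit noticeably.
print("without outlier:", coefs_clean)
print("with outlier   :", coefs_outlier)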
In short, L2 regularization keeps every feature, trades some additional bias for
lower variance, and can be sensitive to outliers, which may disproportionately
affect how well the model performs.

Understanding Bias and Variance
What is Bias?
● Definition: Bias refers to the error introduced when a model makes
assumptions about the data that simplify the problem too much. This
simplification can lead to missing important patterns in the data.
● Example: Imagine you are trying to predict the price of houses based on
their size. If you use a very simple model, like a straight line (linear
regression), it might predict house prices based only on size without
considering other important factors like location or number of bedrooms.
This model might consistently predict prices that are too low or too high because it
ignores these other factors. This is an example of high bias, which can lead to
underfitting—the model is too simple to capture the underlying trends in the data.
What is Variance?
● Definition: Variance refers to how much a model's predictions change when
it is trained on different subsets of data. A model with high variance pays too
much attention to the training data, including its noise and outliers.
● Example: Now, imagine you use a very complex model, like a polynomial
regression with many curves and bends, to fit your house price data. This
model might fit the training data perfectly, capturing every fluctuation and
detail.
However, when you test it on new data (houses not included in the training set), it
may perform poorly because it has learned patterns that do not hold in general
(such as noise). This is an example of high variance, leading to
overfitting—the model is too complex and captures noise instead of the true signal.
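The two extremes can be sketched on synthetic "house price" data (everything below is invented for illustration): a straight line tends to underfit, while a very high-degree polynomial typically fits the training set almost perfectly but generalizes worse:

import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures

rng = np.random.default_rng(0)
size = rng.uniform(0.5, 3.0, size=40)                       # house size (arbitrary units)
price = 100 * size + 40 * np.sin(3 * size) + rng.normal(0, 10, size=40)
X_train, X_test, y_train, y_test = train_test_split(size.reshape(-1, 1), price, random_state=0)

for degree in [1, 15]:
    model = make_pipeline(PolynomialFeatures(degree), LinearRegression()).fit(X_train, y_train)
    # degree 1: high bias (underfits); degree 15: high variance (tends to overfit).
    print(degree, model.score(X_train, y_train), model.score(X_test, y_test))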
Why Is This Important?
Understanding the bias-variance tradeoff is crucial for building effective machine
learning models:
● Generalization: The goal of any predictive model is to generalize well to
new, unseen data. A good balance between bias and variance helps ensure
that your model performs well not just on training data but also on
real-world situations.
● Model Selection: By recognizing where your current model lies on the
bias-variance spectrum, you can make informed decisions about whether to
simplify or complicate your model, or whether to apply regularization
techniques.
● Performance Evaluation: Monitoring both bias and variance helps you
identify issues like underfitting (high bias) or overfitting (high variance)
during training. You can adjust your approach accordingly based on these
insights.
Housing price prediction with the Ridge regularization technique
Problem statement
A US-based real estate firm has decided to enter the Australian market.
Using data analytics, the business buys homes below their true value and then
sells them at a higher price. For this purpose, the business has gathered a
dataset of Australian home sales.
The business is searching for potential properties to purchase in order to enter
the market. To decide whether or not to invest in these prospective properties, we
must use regularization to build a regression model that predicts their true
value.
Business Objective
We must use the available independent variables to model house prices.
Management will then use this model to understand exactly how prices change with
these variables. As a result, they can shape the company's strategy and focus on
areas that will generate large profits. The model will also help management
understand the price dynamics of a new market.
Dataset: The dataset has been submitted together with the data dictionary.
Code Implementations
We have uploaded our code to GitHub; below we describe the steps we followed to
implement it.
1. Data Loading and Exploration
Step: Load the Dataset and Inspect It
Why: This step provides an overview of the dataset's features and potential issues,
such as missing values or inconsistent data types, that need resolution.
2. Data Preprocessing
Step: Prepare Data for Modeling
Substeps:
● Handle Missing Values:
a. Fill in missing values using strategies like mean, median, or mode.
b. Alternatively, drop rows or columns with excessive missing data.
● Encode Categorical Variables:
a. Use one-hot encoding or label encoding to convert text labels into
numeric format.
● Normalize/Scale Features:
a. Apply scaling with StandardScaler (standardization) or MinMaxScaler
(min-max normalization) to bring features to a similar scale.
Why: Proper preprocessing ensures data consistency and improves model
performance by avoiding scale-related biases.
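A hedged sketch of this preprocessing step is shown below; the column names ("area", "bedrooms", "suburb") are placeholders, not necessarily the actual columns of the dataset:

from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Hypothetical column names, used purely for illustration.
numeric_cols = ["area", "bedrooms"]
categorical_cols = ["suburb"]

preprocess = ColumnTransformer([
    ("num", Pipeline([("impute", SimpleImputer(strategy="median")),
                      ("scale", StandardScaler())]), numeric_cols),
    ("cat", Pipeline([("impute", SimpleImputer(strategy="most_frequent")),
                      ("encode", OneHotEncoder(handle_unknown="ignore"))]), categorical_cols),
])
# preprocess.fit_transform(df) would then return the cleaned, scaled feature matrix.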
3. Model Training and Evaluation
Step: Train the Ridge Regression Model and Evaluate It Using:
● Mean Squared Error (MSE): Measures prediction error.
● R-squared: Indicates the proportion of variance explained by the
model.
Why: Ridge regression ensures all features contribute to predictions but reduces
the impact of less important ones.
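A minimal, self-contained sketch of this step (synthetic data stands in for the housing dataset; alpha = 2 matches the value reported later in this report):

from sklearn.datasets import make_regression
from sklearn.linear_model import Ridge
from sklearn.metrics import mean_squared_error, r2_score
from sklearn.model_selection import train_test_split

X, y = make_regression(n_samples=500, n_features=20, noise=10.0, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

ridge = Ridge(alpha=2.0).fit(X_train, y_train)
pred = ridge.predict(X_test)
print("MSE:", mean_squared_error(y_test, pred))
print("R² :", r2_score(y_test, pred))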
Model Performance Before and After Regularization
This section compares the performance of the ML model before and after
regularization using Ridge regression. It evaluates the trade-offs between bias and
variance, as well as the impact of hyperparameter tuning on the model's
performance. The R² score, a metric that indicates how well the model explains the
variance in the data, is used to assess both training and test set performance.
Before applying Ridge regression, the model achieved an R² score of 0.8485 on the
training set and 0.8374 on the test set. These scores indicate a well-performing
model, explaining approximately 85% of the variance in the data. Notably, the
small gap between training and test scores indicates that the model generalizes
well to unseen data, with minimal overfitting.
However, despite this solid performance, the absence of regularization means that
the model likely assigns high importance to all features, including less relevant
ones. This can lead to overfitting in scenarios where the dataset is noisy or
contains irrelevant features. Without regularization, the model is also sensitive
to small changes in the training data, which reduces its robustness.
The high R² score for the training set demonstrates that the model closely fits the
training data, potentially capturing noise or minor fluctuations that do not
generalize well. While the test performance was strong in this case, relying on such
a model without addressing overfitting could lead to issues if the data distribution
changes or new datasets introduce unseen complexities.
After applying Ridge regression, the R² score for the training set decreased to
0.8461, and the test set score dropped to 0.8247. This change reflects the impact of
introducing a regularization penalty, which prevents the model from assigning
excessively high weights to certain features.
As the regularization parameter (alpha) increases, the R² score for the training set
consistently decreases. This behavior is expected because higher alpha values
introduce a greater penalty for large coefficients, making the model simpler and
more generalized. By discouraging the model from closely fitting the training data,
ridge regression reduces the likelihood of overfitting. However, this also means
that the model sacrifices some of its ability to capture intricate patterns, leading to
a slight increase in error and a corresponding drop in the R² score.
On the test set, Ridge regression introduced a more complex dynamic. Initially,
with very low values of alpha, the test error was high, and the R² score was low.
This indicates that the model was overfitting the training data, as evidenced by the
discrepancy between high training performance and low test performance. As
alpha increased, the test error decreased, and the R² score improved, reaching its
peak at alpha = 2, where the test R² score was approximately 81%. Beyond this
point, further increases in alpha caused the R² score to decline, signaling that the
model had become overly simplistic and was underfitting the data.
This behavior underscores the importance of choosing an optimal alpha value. In
this case, alpha = 2 gave the right balance between bias and variance, creating a
simpler yet effective model. The regularized model at this stage was better
generalized and less prone to overfitting while maintaining strong predictive
performance on unseen data.
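A hedged sketch of how an optimal alpha such as the reported value (alpha = 2) can be selected by cross-validation; the candidate grid below is illustrative, not the one actually used:

from sklearn.datasets import make_regression
from sklearn.linear_model import RidgeCV
from sklearn.model_selection import train_test_split

X, y = make_regression(n_samples=500, n_features=20, noise=10.0, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# RidgeCV tries each candidate alpha with cross-validation and keeps the best one.
search = RidgeCV(alphas=[0.01, 0.1, 1, 2, 5, 10, 50, 100], cv=5).fit(X_train, y_train)
print("best alpha:", search.alpha_)
print("test R²  :", search.score(X_test, y_test))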
Analysis of Differences
Conclusions
Before regularization, the model achieved high R² scores for both training and test
sets, indicating strong performance but with a potential risk of overfitting. After
applying Ridge regression, the training R² score decreased, reflecting the reduction
in overfitting as the model became less tailored to the training data. On the test set,
the R² score dropped as well, as the model became more generalized, balancing its
ability to predict unseen data. Regularization effectively ensured that the model
focused on meaningful patterns, even if it meant sacrificing some predictive power
in exchange for robustness.
Overall, Ridge regression was instrumental in creating a simpler and more robust
model. By introducing regularization and carefully selecting the alpha parameter,
the model achieved a better balance between bias and variance, ensuring reliable
performance on both training and test datasets. This highlights the importance of
using regularization techniques like Ridge regression, particularly when overfitting
is a concern, to build models that generalize effectively to unseen data.
References
[1] https://builtin.com/data-science/l2-regularization
[2] https://www.ibm.com/think/topics/ridge-regression
[3] https://scikit-learn.org/dev/auto_examples/linear_model/plot_ridge_coeffs.html
[4] https://www.dataquest.io/blog/regularization-in-machine-learning/
[5] https://towardsdatascience.com/l1-and-l2-regularization-methods-ce25e7fc831c?gi=5ac6fe73dad4
[6] https://www.linkedin.com/pulse/l1-l2-regularization-why-neededwhat-doeshow-helps-ravi-shankar
[7] https://www.researchgate.net/publication/342725398_Ridge_Regularization_An_Essential_Concept_in_Data_Science