Regularization & Gradient Descent
Introduction
In the field of machine learning, one of the fundamental challenges is finding the right
balance between a model that captures underlying patterns in the data and one that
avoids overfitting to noise. This project aims to explore that challenge through the lens
of polynomial regression, with a focus on the powerful tools of Regularization and
Gradient Descent.
We work with a sparse, noisy dataset generated from a known underlying function — the
sine wave:
y = sin(2πx)
This function serves as our ground truth, and we compare it with a small sample of
noisy observations to simulate real-world data scenarios where signals are often
imperfect and incomplete.
As the complexity of a polynomial model increases, it tends to fit the training data more
closely. While this can reduce training error, it often leads to poor generalization on
unseen data — a phenomenon known as overfitting. Conversely, models with too little
complexity may fail to capture important relationships — leading to underfitting. This
project investigates how regularization techniques such as Ridge Regression (L2) and
Lasso Regression (L1) help mitigate overfitting by constraining the model's coefficients.
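For reference, with a squared-error loss, these two penalties add the following terms to the cost being minimized, writing β for the model coefficients and λ for the regularization strength (in Scikit-learn, λ corresponds to the alpha parameter, up to library-specific scaling of the loss term):
J_ridge(β) = Σᵢ (yᵢ − ŷᵢ)² + λ Σⱼ βⱼ²
J_lasso(β) = Σᵢ (yᵢ − ŷᵢ)² + λ Σⱼ |βⱼ|
The squared (L2) penalty shrinks all coefficients smoothly, while the absolute-value (L1) penalty can drive some coefficients exactly to zero.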
Along the way, the project also aims to:
- Gain intuition for how gradient descent works to minimize cost functions with and without regularization.
- Learn to visualize and evaluate model performance in the presence of noise and sparse data.
This foundation is crucial not only for academic understanding but also for real-world
applications where interpretability, generalizability, and robustness are essential.
We will begin with a short tutorial on regression, polynomial features, and regularization based on a very simple, sparse data set that contains a column of x data and the associated noisy y data. The data file is called X_Y_Sinusoid_Data.csv.
In [1]: import os
data_path = ['data']
Also generate approximately 100 equally spaced x data points over the range of 0 to
1. Using these points, calculate the y-data which represents the "ground truth" (the
real function) from the equation: y = sin(2πx)
In [40]: import os
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
file_path = "/content/X_Y_Sinusoid_Data.csv"
data = pd.read_csv(file_path)
# Ground truth: ~100 equally spaced points over [0, 1], y = sin(2*pi*x)
X_real = np.linspace(0, 1.0, 100)
Y_real = np.sin(2 * np.pi * X_real)
sns.set_style('white')
sns.set_context('talk')
sns.set_palette('dark')
# Plot the sparse noisy samples against the ground-truth sine curve
# (assumes the CSV columns are named 'x' and 'y')
ax = data.set_index('x')['y'].plot(ls='', marker='o', label='data')
ax.plot(X_real, Y_real, ls='--', label='real function')
ax.legend()
ax.set(xlabel='x data', ylabel='y data');
Note that PolynomialFeatures requires either a DataFrame (with one column, not a Series) or a 2D array of shape (N, 1), where N is the number of samples.
ax = plt.gca()
ax.set(xlabel='x data', ylabel='y data');
- Perform the regression on the data with polynomial features, using ridge regression (α = 0.001) and lasso regression (α = 0.0001). (A minimal sketch follows this list.)
- Plot the results, as was done in Question 1.
- Also plot the magnitude of the coefficients obtained from these regressions, and compare them to those obtained from linear regression in the previous question. The linear regression coefficients will likely need a separate plot (or their own y-axis) due to their large magnitude.
- What does the comparatively large magnitude of the linear-regression coefficients tell us about the role of regularization?
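A hedged sketch of what such a fitting step could look like, assuming the data and X_real grid from the earlier cell and an example polynomial degree of 20 (the notebook's exact degree and plotting code may differ):
from sklearn.preprocessing import PolynomialFeatures
from sklearn.linear_model import LinearRegression, Ridge, Lasso
degree = 20                                   # assumed degree
pf = PolynomialFeatures(degree)
X_poly = pf.fit_transform(data[['x']])        # assumes columns named 'x' and 'y'
X_real_poly = pf.transform(X_real.reshape(-1, 1))
lr = LinearRegression().fit(X_poly, data['y'])
rr = Ridge(alpha=0.001).fit(X_poly, data['y'])
lassor = Lasso(alpha=0.0001, max_iter=100000).fit(X_poly, data['y'])
# Predictions on the dense grid, ready to plot against the ground truth
y_pred_lr = lr.predict(X_real_poly)
y_pred_rr = rr.predict(X_real_poly)
y_pred_lasso = lassor.predict(X_real_poly)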
plt.legend()
ax = plt.gca()
ax.set(xlabel='x data', ylabel='y data');
In [9]: # let's look at the absolute value of coefficients for each model
coefficients = pd.DataFrame()
coefficients['linear regression'] = lr.coef_.ravel()
coefficients['ridge regression'] = rr.coef_.ravel()
coefficients['lasso regression'] = lassor.coef_.ravel()
coefficients = coefficients.applymap(abs)
fig, ax1 = plt.subplots(figsize=(12, 6))
ax2 = ax1.twinx()  # second y-axis for the much smaller regularized coefficients
colors = sns.color_palette()  # assumed; the original color definition was lost in export
ax1.plot(lr.coef_.ravel(), color=colors[0], marker='o', label='linear regression')
ax2.plot(rr.coef_.ravel(), color=colors[1], marker='o', label='ridge regression')
ax2.plot(lassor.coef_.ravel(),
         color=colors[2], marker='o', label='lasso regression')
ax1.set(xlabel='coefficients', ylabel='linear regression')
ax2.set(ylabel='ridge and lasso regression')
ax1.set_xticks(range(len(lr.coef_)));
For the remaining questions, we will be working with the data set from last lesson, which
is based on housing prices in Ames, Iowa. There are an extensive number of features--
see the exercises from week three for a discussion of these features.
To begin:
- Import the data with Pandas, remove any null values, and one hot encode categoricals. Either Scikit-learn's feature encoders or Pandas get_dummies method can be used.
- Split the data into train and test sets.
- Log transform skewed features.
- Scaling can be attempted, although it can be interesting to see how well regularization works without scaling features.
Create a list of categorical data and one-hot encode it. Pandas' one-hot encoder ( get_dummies ) works well with data that is defined as categorical.
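A hedged sketch of these preprocessing steps; the file name, test fraction, and random seed below are assumptions rather than the notebook's exact choices:
import pandas as pd
from sklearn.model_selection import train_test_split
ames = pd.read_csv('Ames_Housing_Data.csv')    # assumed file name
ames = ames.dropna()                           # remove rows with null values
# One-hot encode the object/categorical columns with get_dummies
categorical_cols = ames.select_dtypes(include=['object', 'category']).columns
ames = pd.get_dummies(ames, columns=categorical_cols, drop_first=True)
train, test = train_test_split(ames, test_size=0.3, random_state=42)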
There are a number of columns with skewed distributions, to which a log transformation can be applied. Note that this includes SalePrice , our target variable. However, let's keep that one as is.
# Compute skew for the numeric columns and keep those beyond the threshold
skew_limit = 0.75  # assumed threshold, consistent with the text below
skew_vals = train.select_dtypes(include=[np.number]).skew()
skew_cols = (skew_vals
             .sort_values(ascending=False)
             .to_frame()
             .rename(columns={0: 'Skew'})
             .query('abs(Skew) > {0}'.format(skew_limit)))
skew_cols
Out[19]: Skew
MiscVal 26.915364
PoolArea 15.777668
LotArea 11.501694
LowQualFinSF 11.210638
3SsnPorch 10.150612
ScreenPorch 4.599803
BsmtFinSF2 4.466378
EnclosedPorch 3.218303
LotFrontage 3.138032
MasVnrArea 2.492814
OpenPorchSF 2.295489
SalePrice 2.106910
BsmtFinSF1 2.010766
TotalBsmtSF 1.979164
1stFlrSF 1.539692
GrLivArea 1.455564
WoodDeckSF 1.334388
BsmtUnfSF 0.900308
GarageArea 0.838422
2ndFlrSF 0.773655
Transform all the columns where the skew is greater than 0.75, excluding "SalePrice".
In [20]: # Let's look at what happens to one of these features when we apply np.log1p, visually
field = "BsmtFinSF1"
fig, (ax_before, ax_after) = plt.subplots(1, 2, figsize=(10, 5))
train[field].hist(ax=ax_before)
train[field].apply(np.log1p).hist(ax=ax_after)
ax_before.set(title='before np.log1p', ylabel='frequency', xlabel='value')
ax_after.set(title='after np.log1p', ylabel='frequency', xlabel='value')
fig.suptitle('Field "{}"'.format(field));
# a little bit better
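To carry out the transformation described above on every flagged column, a minimal sketch (assuming the skew_cols frame and the train / test split defined earlier):
# Apply np.log1p to each skewed column, leaving the target SalePrice untouched
for col in skew_cols.index.values:
    if col == 'SalePrice':
        continue
    train[col] = np.log1p(train[col])
    test[col] = np.log1p(test[col])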
X_test = test[feature_cols]
y_test = test['SalePrice']
Write a function rmse that takes in truth and prediction values and returns the
root-mean-squared error. Use sklearn's mean_squared_error .
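One straightforward way to write that helper:
import numpy as np
from sklearn.metrics import mean_squared_error
def rmse(ytrue, ypredicted):
    """Root-mean-squared error between true and predicted values."""
    return np.sqrt(mean_squared_error(ytrue, ypredicted))
The value printed below is the test-set RMSE of the plain linear regression model; its size already hints at how poorly the unregularized fit generalizes on this data.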
947309.7044202151
In [25]: f = plt.figure(figsize=(6,6))
ax = plt.axes()
ax.plot(y_test, linearRegression.predict(X_test),
marker='o', ls='', ms=3.0)
lim = (0, y_test.max())
ax.set(xlabel='Actual Price',
ylabel='Predicted Price',
xlim=lim,
ylim=lim,
title='Linear Regression Results');
Ridge regression uses L2 regularization to reduce the magnitude of the coefficients. This can be helpful in situations where there is high variance. The regularized regression models in Scikit-learn each have a version with cross-validation built in ( RidgeCV , LassoCV , ElasticNetCV ).
Fit a regular (non-cross-validated) Ridge model to a range of α values and plot the RMSE, using the rmse error function created above.
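A hedged sketch of that loop; the α grid below is illustrative rather than the notebook's exact list:
from sklearn.linear_model import Ridge
alphas = [0.005, 0.05, 0.1, 0.3, 1, 3, 5, 10, 15, 30, 80]   # illustrative grid
ridge_rmses = {}
for alpha in alphas:
    ridge = Ridge(alpha=alpha)
    ridge.fit(X_train, y_train)
    ridge_rmses[alpha] = rmse(y_test, ridge.predict(X_test))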
Then repeat the fitting of the Ridge models using the range of α values from the
prior section. Compare the results.
Now for the RidgeCV method. Unfortunately, it only reports the single best α, not the error values for the candidates that weren't selected. The resulting error and α value are very similar to those obtained above.
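A minimal RidgeCV sketch consistent with that description (the α grid and cv setting are assumptions); the selected α and RMSE printed immediately below come from the notebook's own run:
from sklearn.linear_model import RidgeCV
ridgeCV = RidgeCV(alphas=alphas, cv=4)       # cv value is an assumption
ridgeCV.fit(X_train, y_train)
ridgeCV_rmse = rmse(y_test, ridgeCV.predict(X_test))
print(ridgeCV.alpha_, ridgeCV_rmse)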
15.0 32195.778260172978
Much like the RidgeCV function, there is also a LassoCV function that uses an L1 regularization penalty and cross-validation. L1 regularization will selectively shrink some coefficients all the way to zero, effectively performing feature elimination.
The LassoCV function does not allow the scoring function to be set. However, the
custom error function ( rmse ) created above can be used to evaluate the error on the
final model.
Similarly, there is also an elastic net function with cross validation, ElasticNetCV ,
which is a combination of L2 and L1 regularization.
- Fit a Lasso model using cross-validation and determine the optimum value for α and the RMSE, using the rmse function created above. Note that the magnitude of α may be quite different from that of the Ridge model. (A minimal sketch follows this list.)
- Repeat this with the Elastic Net model.
- Compare the results via table and/or plot.
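A hedged LassoCV sketch; the α grid is illustrative, and a high max_iter simply helps Lasso converge on this data:
import numpy as np
from sklearn.linear_model import LassoCV
alphas2 = np.geomspace(1e-5, 1e-1, num=20)   # illustrative grid of smaller alphas
lassoCV = LassoCV(alphas=alphas2, max_iter=100000, cv=3)
lassoCV.fit(X_train, y_train)
lassoCV_rmse = rmse(y_test, lassoCV.predict(X_test))
print(lassoCV.alpha_, lassoCV_rmse)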
0.0005 37753.025305153475
Now try the Elastic Net, with the same alphas as in Lasso, and l1_ratio values between 0.1 and 0.9.
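And the corresponding Elastic Net sketch (same hedging as above), reusing the Lasso α grid and sweeping l1_ratio from 0.1 to 0.9:
from sklearn.linear_model import ElasticNetCV
l1_ratios = np.linspace(0.1, 0.9, 9)
elasticNetCV = ElasticNetCV(alphas=alphas2, l1_ratio=l1_ratios, max_iter=100000)
elasticNetCV.fit(X_train, y_train)
elasticNetCV_rmse = rmse(y_test, elasticNetCV.predict(X_test))
print(elasticNetCV.alpha_, elasticNetCV.l1_ratio_, elasticNetCV_rmse)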
Out[32]: RMSE
Linear 947309.704420
Ridge 32195.778260
Lasso 37753.025305
ElasticNet 35009.076126
In [33]: f = plt.figure(figsize=(6,6))
ax = plt.axes()
labels = ['Ridge', 'Lasso', 'ElasticNet']
models = [ridgeCV, lassoCV, elasticNetCV]
for mod, lab in zip(models, labels):
ax.plot(y_test, mod.predict(X_test),
marker='o', ls='', ms=3.0, label=lab)
leg = plt.legend(frameon=True)
leg.get_frame().set_edgecolor('black')
leg.get_frame().set_linewidth(1.0)
ax.set(xlabel='Actual Price',
ylabel='Predicted Price',
title='Linear Regression Results');
- Fit a stochastic gradient descent model without a regularization penalty (the relevant parameter is penalty ).
- Now fit stochastic gradient descent models with each of the three penalties (L2, L1, Elastic Net) using the parameter values determined by cross validation above.
- Do not scale the data before fitting the model.
- Compare the results to those obtained without using stochastic gradient descent.
from sklearn.linear_model import SGDRegressor
model_parameters_dict = {
    'Linear': {'penalty': None},                # no regularization penalty
    'Lasso': {'penalty': 'l1',                  # Lasso corresponds to the L1 penalty
              'alpha': lassoCV.alpha_},
    'Ridge': {'penalty': 'l2',                  # Ridge corresponds to the L2 penalty
              'alpha': ridgeCV.alpha_},
    'ElasticNet': {'penalty': 'elasticnet',
                   'alpha': elasticNetCV.alpha_,
                   'l1_ratio': elasticNetCV.l1_ratio_}
}
new_rmses = {}
for modellabel, parameters in model_parameters_dict.items():
    # the ** notation passes the dict items as keyword arguments
    SGD = SGDRegressor(**parameters)
    SGD.fit(X_train, y_train)
    new_rmses[modellabel] = rmse(y_test, SGD.predict(X_test))
rmse_df['RMSE-SGD'] = pd.Series(new_rmses)
rmse_df
Notice how high the error values are! The algorithm is diverging, which can be caused by unscaled features and/or a learning rate that is too high. Let's lower the learning rate and see what happens.
model_parameters_dict = {
    'Linear': {'penalty': None},                # no regularization penalty
    'Lasso': {'penalty': 'l1',
              'alpha': lassoCV.alpha_},
    'Ridge': {'penalty': 'l2',
              'alpha': ridgeCV.alpha_},
    'ElasticNet': {'penalty': 'elasticnet',
                   'alpha': elasticNetCV.alpha_,
                   'l1_ratio': elasticNetCV.l1_ratio_}
}
new_rmses = {}
for modellabel, parameters in model_parameters_dict.items():
    # lower the initial learning rate (eta0) to keep the updates from diverging
    SGD = SGDRegressor(eta0=1e-7, **parameters)
    SGD.fit(X_train, y_train)
    new_rmses[modellabel] = rmse(y_test, SGD.predict(X_test))
rmse_df['RMSE-SGD-learningrate'] = pd.Series(new_rmses)
rmse_df
Now let's scale the features instead, using MinMaxScaler , and refit the models with the default learning rate.
from sklearn.preprocessing import MinMaxScaler
scaler = MinMaxScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)
new_rmses = {}
for modellabel, parameters in model_parameters_dict.items():
    # the ** notation passes the dict items as keyword arguments
    SGD = SGDRegressor(**parameters)
    SGD.fit(X_train_scaled, y_train)
    new_rmses[modellabel] = rmse(y_test, SGD.predict(X_test_scaled))
rmse_df['RMSE-SGD-scaled'] = pd.Series(new_rmses)
rmse_df
Conclusion
This project has provided a hands-on exploration of the interplay between model
complexity, regularization, and optimization in the context of polynomial regression.
Using a simple yet powerful example — a noisy sinusoidal dataset — we observed how
different polynomial degrees influence the model's ability to fit the data, and how easily
overfitting can occur in high-capacity models.
To address this, we applied two key regularization techniques:
- Ridge Regression (L2), which penalizes large coefficients and stabilizes the model, and
- Lasso Regression (L1), which not only reduces overfitting but can also eliminate irrelevant features through coefficient shrinkage.
In essence, this project reinforced core machine learning principles: managing model complexity, using regularization to control overfitting, and applying gradient descent to minimize cost functions.
The insights and techniques applied here are directly transferable to broader machine
learning problems — from linear models to deep neural networks — where the same
concepts of regularization and optimization continue to play a central role.