BDS-Homework-1-Submission.ipynb - Colab
Abhiram Iyengar
Anshul Joshi
Nikita Andhale
Go through all the notebooks we have done in class and make sure you understand what we did, and why.
1. Let's start with our first Kaggle submission in a playground regression competition. Make an account on Kaggle and find
https://www.kaggle.com/c/house-prices-advanced-regression-techniques/
2. Follow the data preprocessing steps from https://www.kaggle.com/code/apapiu/regularized-linear-models. Then run a ridge regression using λ = 0.1. Make a submission of this prediction; what RMSE do you get? (Hint: remember to exponentiate your predictions with np.expm1(ypred).)
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib
from scipy.stats import pearsonr
train = pd.read_csv("train.csv")
test = pd.read_csv("test.csv")
train.head()
Id MSSubClass MSZoning LotFrontage LotArea Street Alley LotShape LandContour Utilities ... PoolArea PoolQC Fence ...
0 1 60 RL 65.0 8450 Pave NaN Reg Lvl AllPub ... 0 NaN NaN
1 2 20 RL 80.0 9600 Pave NaN Reg Lvl AllPub ... 0 NaN NaN
2 3 60 RL 68.0 11250 Pave NaN IR1 Lvl AllPub ... 0 NaN NaN
3 4 70 RL 60.0 9550 Pave NaN IR1 Lvl AllPub ... 0 NaN NaN
4 5 60 RL 84.0 14260 Pave NaN IR1 Lvl AllPub ... 0 NaN NaN
5 rows × 81 columns
all_data = pd.concat((train.loc[:,'MSSubClass':'SaleCondition'],
test.loc[:,'MSSubClass':'SaleCondition']))
First I'll transform the skewed numeric features by taking log(feature + 1) - this will make the features more normal
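The cell that computes skewed_feats (and log-transforms the target) is not reproduced above; a minimal sketch following the referenced kernel, with the usual skewness cutoff of 0.75 assumed:

from scipy.stats import skew

# log-transform the target as well (predictions must later be back-transformed with np.expm1)
train["SalePrice"] = np.log1p(train["SalePrice"])

# compute the skewness of the numeric features on the training data and keep the highly skewed ones
numeric_feats = all_data.dtypes[all_data.dtypes != "object"].index
skewed_feats = train[numeric_feats].apply(lambda x: skew(x.dropna()))
skewed_feats = skewed_feats[skewed_feats > 0.75].index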
all_data[skewed_feats] = np.log1p(all_data[skewed_feats])
Convert the categorical features to dummy variables, then replace the numeric missing values (NaN's) with the mean of their respective columns:

all_data = pd.get_dummies(all_data)

#filling NA's with the mean of the column:
all_data = all_data.fillna(all_data.mean())
Models

Now we are going to use regularized linear regression models from the scikit-learn module. I'm going to try both ℓ1 (Lasso) and ℓ2 (Ridge) regularization. I'll also define a function that returns the cross-validation RMSE so we can evaluate our models and pick the best tuning parameter.
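The imports and the train/test design matrices used below are not reproduced in this excerpt; a minimal setup sketch, assuming the split used in the referenced kernel:

from sklearn.linear_model import Ridge, Lasso, LassoCV
from sklearn.model_selection import cross_val_score

# split the combined matrix back into the original train/test rows;
# y is the (log-transformed) sale price
X_train = all_data[:train.shape[0]]
X_test = all_data[train.shape[0]:]
y = train.SalePrice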
def rmse_cv(model):
    rmse = np.sqrt(-cross_val_score(model, X_train, y, scoring="neg_mean_squared_error", cv=5))
    return rmse
model_ridge = Ridge()
The main tuning parameter for the Ridge model is alpha, a regularization parameter that controls how flexible our model is. The higher the regularization, the less prone our model will be to overfitting. However, it will also lose flexibility and might not capture all of the signal in the data.
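The alpha sweep that produces cv_ridge is not shown above; a minimal sketch, with the grid of alphas assumed:

# cross-validated RMSE for a range of ridge regularization strengths
alphas = [0.05, 0.1, 0.3, 0.5, 1, 3, 5, 10, 15, 30, 50, 75]
cv_ridge = pd.Series([rmse_cv(Ridge(alpha=a)).mean() for a in alphas], index=alphas)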
cv_ridge.min()
0.12731233261727531
So for the Ridge regression we get an RMSLE of about 0.127 with alpha = 0.5.
Run a ridge regression using λ = 0.1. Make a submission of this prediction; what RMSE do you get?

#(Hint: remember to exponentiate your predictions with np.expm1(ypred).)
Ridge(alpha=0.1)
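The cell that fits the λ = 0.1 ridge model and writes the submission file is not reproduced above; a minimal sketch, assuming X_train, X_test, and y from the setup (the submission filename is arbitrary):

# fit ridge with the requested regularization strength and predict on the test rows
ridge_01 = Ridge(alpha=0.1).fit(X_train, y)
ridge_preds = np.expm1(ridge_01.predict(X_test))  # undo the log1p transform of the target

# Kaggle expects an Id / SalePrice file
submission = pd.DataFrame({"Id": test.Id, "SalePrice": ridge_preds})
submission.to_csv("ridge_alpha_0.1_submission.csv", index=False)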
1. What is the best score you can get from a single ridge regression model and from a single lasso model?
2. The ℓ0 (or L0) norm is the number of nonzeros of a vector. Plot the L0 norm of the coefficients that Lasso produces as you vary the strength of the regularization parameter λ.
PROBLEM 3 : 1. Let's try out the Lasso model. We will do a slightly different approach here and use the built-in LassoCV to figure out the best alpha for us. For some reason the alphas in LassoCV are really the inverse of the alphas in Ridge.
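The LassoCV fit itself is not shown above; a minimal sketch, with the alpha grid assumed from the referenced kernel:

# let LassoCV pick the best alpha from a small (assumed) grid via cross-validation
model_lasso = LassoCV(alphas=[1, 0.1, 0.001, 0.0005]).fit(X_train, y)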
rmse_cv(model_lasso).mean()
0.1225674790699958
Nice! The lasso performs even better so we'll just use this one to predict on the test set. Another neat thing about the Lasso is that it does
feature selection for you - setting coefficients of features it deems unimportant to zero. Let's take a look at the coefficients:
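A minimal sketch of the coefficient count, assuming model_lasso from above:

# wrap the fitted coefficients in a Series to count how many features Lasso kept
coef = pd.Series(model_lasso.coef_, index=X_train.columns)
print("Lasso picked " + str(sum(coef != 0)) + " variables and eliminated the other " + str(sum(coef == 0)) + " variables")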
Lasso picked 110 variables and eliminated the other 177 variables
optimal_alpha = model_lasso.alpha_
print(f"Optimal alpha for Lasso: {optimal_alpha}")
Let's find the best alpha for the Ridge model with cross-validation.
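A minimal sketch, reusing the cv_ridge series from the alpha sweep above:

# the alpha with the lowest cross-validated RMSE, and that RMSE
best_ridge_alpha = cv_ridge.idxmin()
print(f"Best Ridge alpha: {best_ridge_alpha}, CV RMSE: {cv_ridge.min():.4f}")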
PROBLEM 3 : 1) What is the best score you can get from a single ridge regression model and from a single lasso
model?
The best cross-validated score (lowest RMSE) between these two models is achieved by the Lasso regression model, with an RMSE of about 0.1226 versus roughly 0.1273 for Ridge. The Lasso model outperforms the Ridge model by a small margin in this case.
PROBLEM 3 : 2) The ℓ0 (or L0) norm is the number of nonzeros of a vector. Plot the L0 norm of the coefficients that Lasso produces as you vary the strength of the regularization parameter λ.
import numpy as np
import matplotlib.pyplot as plt
from sklearn.linear_model import Lasso
# Initialize lists to store results
l0_norms = []
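The loop that fills l0_norms and the accompanying plot are not shown above; a minimal sketch, assuming a log-spaced grid of λ values and the X_train/y matrices from the Ridge/Lasso section:

# sweep lambda on a log scale and record how many coefficients survive
lambdas = np.logspace(-4, 1, 30)
for lam in lambdas:
    lasso = Lasso(alpha=lam, max_iter=10000)
    lasso.fit(X_train, y)
    l0_norms.append(np.sum(lasso.coef_ != 0))  # L0 norm = number of nonzero coefficients

plt.figure(figsize=(10, 5))
plt.plot(lambdas, l0_norms, marker='o')
plt.xscale('log')
plt.xlabel(r'Regularization strength $\lambda$ (alpha)')
plt.ylabel('L0 norm (number of nonzero coefficients)')
plt.title('Lasso sparsity vs. regularization strength')
plt.show()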
Trends:

- High L0 norm at low λ: minimal regularization leaves nearly all coefficients non-zero.
- Decreasing L0 norm with increasing λ: as λ increases, more coefficients are set to zero, showcasing Lasso's feature-selection capability.
- Plateau at high λ: beyond around 10^-1, the number of non-zero coefficients stabilizes near zero, indicating strong regularization and the exclusion of most features.

Interpretation:

- Feature selection: Lasso effectively reduces features by zeroing out coefficients as λ increases.
- Model complexity: lower λ values yield complex models with more features, while higher λ values simplify the model.
- Optimal regularization: the ideal λ balances retaining essential features and eliminating noise, typically where the curve flattens.
import numpy as np
import pandas as pd
from sklearn.linear_model import Ridge, Lasso
from sklearn.model_selection import train_test_split, cross_val_score
from sklearn.metrics import mean_squared_error
Problem 5
Use the data-generation procedure from the LASSO notebook (where we first introduced Lasso) to generate data.
1. Manually implement forward selection. Report the order in which you add features.
2. In this example, we know the true support size is 5. But what if we did not know this? Plot test error as a function of the size of the
support. Use this to recover the true support size. Justify your answer.
3. Use Lasso with a manually implemented cross-validation using the metric of your choice. What is the value of the hyperparameter?
(Manually implemented means that you can either do it entirely on your own, or you can use GridSearchCV, but I'm asking you not to use LassoCV, which you will use in the next problem.)
4. (Optional) Change the number of folds in your CV and repeat the previous step. How does the optimal value of the hyperparameter
change? Try to explain any trends that you find.
5. (Optional) Read about and use LassoCV from sklearn.linear_model. How does this compare with what you did in the previous step? If they agree, explain why they agree; if they disagree, explain why. This will require you to make sure you understand what LassoCV is doing.
k = 5

# beta generated with k nonzeros
# (X, n_samples and n_features are defined in the data-generation cell of the class notebook, not shown here)
#coef = 10 * np.random.randn(n_features)
coef = 10 * np.ones(n_features)
inds = np.arange(n_features)
np.random.shuffle(inds)
coef[inds[k:]] = 0  # sparsify coef
y = np.dot(X, coef)

# add noise
y += 0.01 * np.random.normal(size=n_samples)
remaining_features.remove(best_feature)
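Only the last line of the greedy selection loop appears above; a minimal sketch of the full forward selection, assuming a train/test split of the generated X and y (the split, the selection criterion, and the variable names are assumptions):

from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split

# hold out part of the generated data for measuring test error
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)

selected_features = []    # order in which features are added
corresponding_mse = []    # test MSE after each addition
remaining_features = list(range(X.shape[1]))

for _ in range(10):  # grow the support up to 10 features
    best_feature, best_train_mse = None, np.inf
    # greedily pick the feature whose addition gives the best training fit
    for f in remaining_features:
        cols = selected_features + [f]
        model = LinearRegression().fit(X_train[:, cols], y_train)
        train_mse = mean_squared_error(y_train, model.predict(X_train[:, cols]))
        if train_mse < best_train_mse:
            best_feature, best_train_mse = f, train_mse
    selected_features.append(best_feature)
    # record the held-out error of the model with the enlarged support
    model = LinearRegression().fit(X_train[:, selected_features], y_train)
    corresponding_mse.append(mean_squared_error(y_test, model.predict(X_test[:, selected_features])))
    remaining_features.remove(best_feature)

print("Selected features:", selected_features)
print("MSE for Selected features:", corresponding_mse)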
Selected features: [15, 18, 78, 76, 29, 80, 55, 0, 27, 62]
MSE for Selected features: [274.9259047713406, 127.87410588555692, 50.25889203977732, 28.96253432071043, 17.198394502743234, 11.59848226
Step 2: Estimate the True Support Size by Plotting Test Error
import numpy as np
import matplotlib.pyplot as plt

# test MSEs recorded during forward selection, one per support size
test_errors = corresponding_mse
support_sizes = range(1, len(test_errors) + 1)
plt.figure(figsize=(12, 6))
plt.plot(support_sizes, test_errors, marker='o')
plt.title('Test Error vs. Support Size')
plt.xlabel('Support Size')
plt.ylabel('Test Error (MSE)')
plt.yscale('log') # Log scale for better visualization
plt.grid(True)
plt.show()
Based on the results, the support size with the minimum test error is 10, at approximately 8.15e-01. This indicates that as we added more features, the test error continued to decrease, reaching its lowest value when all 10 selected features were used.

However, the key observation is that while the error decreases steadily as more features are added, the improvement becomes much less pronounced after a certain point, indicating diminishing returns. Even though the error is minimized at a support size of 10 here, the earliest features (around 5 of them) account for most of the reduction in error, and additional features improve the model only gradually. This elbow at roughly 5 features is consistent with the true support size k = 5: features added beyond it mostly fit noise.
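Step 3: Lasso with a manually implemented cross-validation (via GridSearchCV). The scaling and grid-search cells are not reproduced above; a minimal sketch, assuming the train/test split from Step 1, standardized copies of the features, and a log-spaced alpha grid consistent with the alphas reported below:

from sklearn.model_selection import GridSearchCV
from sklearn.preprocessing import StandardScaler

# standardize features so the Lasso penalty treats them comparably
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)

# 5-fold grid search over a log-spaced alpha grid, scored by (negative) MSE
param_grid = {'alpha': np.logspace(-4, 1, 50)}
grid_search = GridSearchCV(Lasso(max_iter=10000), param_grid, scoring='neg_mean_squared_error', cv=5)
grid_search.fit(X_train_scaled, y_train)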
# Find the best alpha and evaluate the test MSE
best_alpha = grid_search.best_params_['alpha']
lasso_best = Lasso(alpha=best_alpha, max_iter=10000)
lasso_best.fit(X_train_scaled, y_train)
y_pred_test = lasso_best.predict(X_test_scaled)
# Test MSE
test_mse = mean_squared_error(y_test, y_pred_test)
print(f"Best alpha (5 fold): {best_alpha}")
print(f"Test MSE with scaled features: {test_mse:.4f}")
# Store results
results = {}
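Step 4: Repeat the search with different numbers of folds. The loop that fills results is not shown above; a minimal sketch, assuming the same scaled data and alpha grid as in Step 3:

# rerun the grid search with 3, 5 and 10 folds and record the outcome of each
for n_folds in [3, 5, 10]:
    gs = GridSearchCV(Lasso(max_iter=10000), param_grid, scoring='neg_mean_squared_error', cv=n_folds)
    gs.fit(X_train_scaled, y_train)
    best_lasso = Lasso(alpha=gs.best_params_['alpha'], max_iter=10000).fit(X_train_scaled, y_train)
    fold_test_mse = mean_squared_error(y_test, best_lasso.predict(X_test_scaled))
    results[n_folds] = {'best_alpha': gs.best_params_['alpha'], 'test_mse': fold_test_mse}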
# Print results
for n_folds, res in results.items():
print(f"Number of Folds: {n_folds}, Best Alpha: {res['best_alpha']}, Test MSE: {res['test_mse']:.4f}")
- 3 folds: the best alpha found is 0.00268, with a relatively high test MSE of 67.00. This suggests that the model is likely too complex or not effectively regularized for this dataset when using just three folds.
- 5 folds: the best alpha increases significantly to 0.05690, resulting in a much lower test MSE of 0.0543. This indicates improved regularization, as the model now performs better with a more appropriate alpha value.
- 10 folds: the optimal alpha is 0.00869, and the test MSE drops even further to 0.0018, showing excellent performance. This lower test MSE reflects better generalization of the model on unseen data.
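Step 5: LassoCV. The LassoCV fit that is compared against the grid search is not shown above; a minimal sketch, assuming the same scaled data and the same alpha grid:

from sklearn.linear_model import LassoCV

# LassoCV runs its own 5-fold cross-validation over the supplied alpha grid
lasso_cv = LassoCV(alphas=np.logspace(-4, 1, 50), cv=5, max_iter=10000)
lasso_cv.fit(X_train_scaled, y_train)
print(f"LassoCV best alpha: {lasso_cv.alpha_}")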
# Make predictions
y_pred_cv = lasso_cv.predict(X_test_scaled)
- GridSearchCV: best alpha of 0.0054 with a test MSE of 0.0012.
- LassoCV: best alpha of 1.9307 with a test MSE of 60.1140.

The significant difference in the best alpha values and MSE results suggests that the two methods are identifying different optimal hyperparameters for the Lasso model. Potential reasons for this discrepancy:
- Regularization sensitivity: Lasso regression is sensitive to the choice of the alpha parameter, which controls the strength of the penalty. The vastly different optimal alphas indicate that the model is responding differently to the regularization effect in each approach.
- Data characteristics: the distribution of the features and the target variable can affect how regularization is applied. If the features have a wide range or differing scales, it can lead to different model performance across methods.
- Hyperparameter exploration: the search strategies may lead to different regions of the alpha parameter space being explored. While both methods use the same range, LassoCV optimizes with its own built-in cross-validation procedure, potentially converging on a less optimal solution than the grid search.
- Variance in cross-validation: even though both methods used 5-fold cross-validation, the specific splits and their interaction with the model could lead to variability in performance estimates, especially with a small sample size.