
Abhiram Iyengar

Anshul Joshi

Nikita Andhale

BDS: Homework 1


Submit:

1. A PDF of your notebook with solutions.
2. A link to your Colab notebook.

Goals of this homework


1. More experience with regression and ridge regression (regularization)
2. Start playing with Kaggle
3. More experience with Lasso.
4. An initial shot at ensembling and stacking.

Problem 1 (Nothing to turn in)

Go through all the notebooks we have done in class and make sure you understand what we did, and why.

Problem 2: Starting in Kaggle


Later this month, we are opening a Kaggle competition made for this class. In that one, you will be participating on your own. This is an intro to get us started, and also an excuse to work with regularization and regression, which we have been discussing.

1. Let's start with our first Kaggle submission in a playground regression competition. Make an account on Kaggle and find
https://www.kaggle.com/c/house-prices-advanced-regression-techniques/

2. Follow the data preprocessing steps from https://www.kaggle.com/code/apapiu/regularized-linear-models. Then run a ridge regression using λ = 0.1. Make a submission of this prediction; what RMSE do you get? (Hint: remember to exponentiate your predictions with np.expm1(ypred); a short round-trip check is shown below.)
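As a quick illustration of the hint (not part of the required solution; the price values below are arbitrary examples): the model is trained on np.log1p(SalePrice), so predictions must be mapped back with np.expm1 before writing the submission file.

import numpy as np

price = np.array([208500.0, 181500.0, 223500.0])  # arbitrary example prices
log_price = np.log1p(price)                       # the scale the model is trained on
recovered = np.expm1(log_price)                   # the scale the submission expects
print(np.allclose(recovered, price))              # True: expm1 undoes log1p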

import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib

import matplotlib.pyplot as plt


from scipy.stats import skew, pearsonr

%config InlineBackend.figure_format = 'retina'  # set 'png' here when working on a notebook

%matplotlib inline

train = pd.read_csv("train.csv")
test = pd.read_csv("test.csv")

train.head()


   Id  MSSubClass MSZoning  LotFrontage  LotArea Street Alley LotShape LandContour Utilities ... PoolArea PoolQC Fence MiscFeat...
0   1          60       RL         65.0     8450   Pave   NaN      Reg         Lvl    AllPub ...        0    NaN   NaN
1   2          20       RL         80.0     9600   Pave   NaN      Reg         Lvl    AllPub ...        0    NaN   NaN
2   3          60       RL         68.0    11250   Pave   NaN      IR1         Lvl    AllPub ...        0    NaN   NaN
3   4          70       RL         60.0     9550   Pave   NaN      IR1         Lvl    AllPub ...        0    NaN   NaN
4   5          60       RL         84.0    14260   Pave   NaN      IR1         Lvl    AllPub ...        0    NaN   NaN

5 rows × 81 columns (display truncated at the page edge)

all_data = pd.concat((train.loc[:,'MSSubClass':'SaleCondition'],
test.loc[:,'MSSubClass':'SaleCondition']))

First I'll transform the skewed numeric features by taking log(feature + 1); this will make the features closer to normally distributed.

matplotlib.rcParams['figure.figsize'] = (12.0, 6.0)


prices = pd.DataFrame({"price":train["SalePrice"], "log(price + 1)":np.log1p(train["SalePrice"])})
prices.hist()

array([[<Axes: title={'center': 'price'}>,


<Axes: title={'center': 'log(price + 1)'}>]], dtype=object)

Log-transform the target and the skewed numeric features:

#log transform the target:


train["SalePrice"] = np.log1p(train["SalePrice"])

#log transform skewed numeric features:


numeric_feats = all_data.dtypes[all_data.dtypes != "object"].index

skewed_feats = train[numeric_feats].apply(lambda x: skew(x.dropna())) #compute skewness


skewed_feats = skewed_feats[skewed_feats > 0.75]
skewed_feats = skewed_feats.index

all_data[skewed_feats] = np.log1p(all_data[skewed_feats])

Create dummy variables for the categorical features, then replace the numeric missing values (NaNs) with the mean of their respective columns:

all_data = pd.get_dummies(all_data)
#filling NA's with the mean of the column:

all_data = all_data.fillna(all_data.mean())

#creating matrices for sklearn:


X_train = all_data[:train.shape[0]]
X_test = all_data[train.shape[0]:]
y = train.SalePrice

Models

Now we are going to use regularized linear regression models from the scikit-learn module. I'm going to try both ℓ1 (Lasso) and ℓ2 (Ridge) regularization. I'll also define a function that returns the cross-validation RMSE so we can evaluate our models and pick the best tuning parameter.

from sklearn.linear_model import Ridge, RidgeCV, ElasticNet, LassoCV, LassoLarsCV


from sklearn.model_selection import cross_val_score

def rmse_cv(model):
rmse= np.sqrt(-cross_val_score(model, X_train, y, scoring="neg_mean_squared_error", cv = 5))
return(rmse)
model_ridge = Ridge()

The main tuning parameter for the Ridge model is alpha, a regularization parameter that controls how flexible the model is. The higher the regularization, the less prone our model will be to overfitting; however, it will also lose flexibility and might not capture all of the signal in the data.
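For reference (these are the scikit-learn formulations, not something stated in the assignment), the two penalized objectives compared in this homework are:

Ridge:  minimize ||y - Xw||_2^2 + alpha * ||w||_2^2
Lasso:  minimize (1 / (2 * n_samples)) * ||y - Xw||_2^2 + alpha * ||w||_1

The extra 1/(2 * n_samples) factor in the Lasso objective is why the useful alpha values for Lasso later in this notebook are orders of magnitude smaller than those for Ridge.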

alphas = [0.05, 0.1, 0.3, 1, 3, 5, 10, 15, 30, 50, 75]


cv_ridge = [rmse_cv(Ridge(alpha = alpha)).mean()
for alpha in alphas]

cv_ridge = pd.Series(cv_ridge, index = alphas)


cv_ridge.plot(title = "Validation - Just Do It")
plt.xlabel("alpha")
plt.ylabel("rmse")

Text(0, 0.5, 'rmse')

cv_ridge.min()

0.12731233261727531

So for the Ridge regression we get a cross-validated RMSE of about 0.127 (on the log-price scale), attained around alpha = 10.


Run a ridge regression using λ = 0.1. Make a submission of this prediction; what RMSE do you get?
(Hint: remember to exponentiate your predictions with np.expm1(ypred).)

from sklearn.linear_model import Ridge


from sklearn.model_selection import cross_val_score
import numpy as np

# Initialize Ridge model with alpha (λ) = 0.1


ridge_model = Ridge(alpha=0.1)

# Fit the Ridge regression model to the training data


ridge_model.fit(X_train, y)

Ridge(alpha=0.1)

# Make predictions on the test set


ridge_preds = ridge_model.predict(X_test)

# Exponentiate the predictions to reverse the log1p transformation


ridge_preds_exp = np.expm1(ridge_preds)

# Prepare the submission file


submission = pd.DataFrame({"Id": test["Id"], "SalePrice": ridge_preds_exp})

# Save the submission to a CSV file


submission.to_csv("ridge_submission.csv", index=False)

def rmse_cv(model):
rmse = np.sqrt(-cross_val_score(model, X_train, y, scoring="neg_mean_squared_error", cv=5))
return rmse

# Calculate RMSE for Ridge model


rmse_ridge = rmse_cv(ridge_model).mean()
print(f"RMSE for Ridge Regression with λ=0.1: {rmse_ridge}")

RMSE for Ridge Regression with λ=0.1: 0.13774989813144883

House Prices - Advanced Regression Techniques : Kaggle Score = 0.13564

Problem 3: Continuing in Kaggle


1. Compare a ridge regression and a lasso regression model. Optimize the regularization constants using cross-validation. This means that you will have to select different values of the regularization parameters, set up a k-fold cross-validation experiment to decide which of these is best, and then finally compare your best ridge regression model with your best lasso regression model.

What is the best score you can get from a single ridge regression model and from a single lasso model? (A minimal sketch of this kind of comparison appears right after this list.)

2. The ℓ0 (or L0) norm is the number of nonzeros of a vector. Plot the L0 norm of the coefficients that lasso produces as you vary the strength of the regularization parameter λ.
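A minimal sketch of an explicit k-fold comparison of the two penalties, assuming X_train and y from the preprocessing above; the alpha grids are illustrative choices, not values prescribed by the assignment:

from sklearn.linear_model import Ridge, Lasso
from sklearn.model_selection import cross_val_score
import numpy as np

def cv_rmse(model):
    # 5-fold cross-validated RMSE on the log-scale target
    return np.sqrt(-cross_val_score(model, X_train, y,
                                    scoring="neg_mean_squared_error", cv=5)).mean()

ridge_alphas = [0.05, 0.1, 0.3, 1, 3, 5, 10, 15, 30]  # illustrative grid
lasso_alphas = [1e-4, 5e-4, 1e-3, 5e-3, 1e-2]          # illustrative grid

best_ridge = min((cv_rmse(Ridge(alpha=a)), a) for a in ridge_alphas)
best_lasso = min((cv_rmse(Lasso(alpha=a, max_iter=10000)), a) for a in lasso_alphas)
print("Best Ridge (RMSE, alpha):", best_ridge)
print("Best Lasso (RMSE, alpha):", best_lasso)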

PROBLEM 3: 1. Let's try out the Lasso model. We will take a slightly different approach here and use the built-in LassoCV to figure out the best alpha for us. Note that the useful alphas for LassoCV are much smaller than those for Ridge, because of the 1/(2 * n_samples) scaling in the Lasso objective noted earlier.

model_lasso = LassoCV(alphas = [1, 0.1, 0.001, 0.0005]).fit(X_train, y)

rmse_cv(model_lasso).mean()

0.1225674790699958

Nice! The lasso performs even better so we'll just use this one to predict on the test set. Another neat thing about the Lasso is that it does
feature selection for you - setting coefficients of features it deems unimportant to zero. Let's take a look at the coefficients:

coef = pd.Series(model_lasso.coef_, index = X_train.columns)


print("Lasso picked " + str(sum(coef != 0)) + " variables and eliminated the other " + str(sum(coef == 0)) + " variables")

Lasso picked 110 variables and eliminated the other 177 variables

optimal_alpha = model_lasso.alpha_
print(f"Optimal alpha for Lasso: {optimal_alpha}")

Optimal alpha for Lasso: 0.0005

Let's find the best alpha for the Ridge model with cross-validation.

from sklearn.linear_model import RidgeCV


import numpy as np

# Define a range of alpha values to test


alphas = [0.1, 1.0, 10.0, 100.0]

# Initialize RidgeCV model with the specified alphas


ridge_cv = RidgeCV(alphas=alphas, scoring='neg_mean_squared_error', cv=5)

# Fit the model to the training data


ridge_cv.fit(X_train, y)

# Get the best alpha value


best_alpha_ridge = ridge_cv.alpha_
print(f"Optimal alpha for Ridge: {best_alpha_ridge}")

# Calculate RMSE for the best Ridge model


rmse_ridge_cv = np.sqrt(-cross_val_score(ridge_cv, X_train, y, scoring="neg_mean_squared_error", cv=5)).mean()
print(f"RMSE for Ridge Regression with best alpha: {rmse_ridge_cv}")

Optimal alpha for Ridge: 10.0


RMSE for Ridge Regression with best alpha: 0.12731233261727531

PROBLEM 3: 1) What is the best score you can get from a single ridge regression model and from a single lasso model?

Best Score Comparison

Optimal alpha for Ridge: 10.0, with cross-validated RMSE 0.12731233261727531.
Optimal alpha for Lasso: 0.0005, with cross-validated RMSE 0.1225674790699958.

The best score (lowest RMSE) between these two models is achieved by the Lasso regression model, with RMSE 0.1225674790699958. The Lasso model outperforms the Ridge model by a small margin in this case.

PROBLEM 3: 2) The ℓ0 (or L0) norm is the number of nonzeros of a vector. Plot the L0 norm of the coefficients that lasso produces as you vary the strength of the regularization parameter λ.

import numpy as np
import matplotlib.pyplot as plt
from sklearn.linear_model import Lasso

# Define a range of alpha values (regularization parameters)


alphas = np.logspace(-4, 1, 50)

# Initialize lists to store results
l0_norms = []

# Loop over each alpha value


for alpha in alphas:
# Initialize and fit the Lasso model
lasso = Lasso(alpha=alpha, max_iter=10000)
lasso.fit(X_train, y) # Using X_train and y as per your previous code

# Calculate the L0 norm (number of non-zero coefficients)


l0_norm = np.sum(lasso.coef_ != 0)
l0_norms.append(l0_norm)

# Plot the results


plt.figure(figsize=(10, 6))
plt.plot(alphas, l0_norms, marker='o')
plt.xscale('log')
plt.xlabel('Regularization Parameter (λ)')
plt.ylabel('L0 Norm of Coefficients')
plt.title('L0 Norm vs Regularization Strength in Lasso Regression')
plt.grid(True)
plt.show()

Trends:

High L0 Norm at Low λ: Minimal regularization leads to nearly all coefficients being non-zero.

Decreasing L0 Norm with Increasing λ: As λ increases, more coefficients are set to zero, showcasing Lasso's feature selection capability.

Plateau at High λ: Beyond around 10^-1, the number of non-zero coefficients stabilizes near zero, indicating strong regularization and the
exclusion of most features.

Interpretation:

Feature Selection: Lasso effectively reduces features by zeroing out coefficients as λ increases.

Model Complexity: Lower λ values yield complex models with more features, while higher λ values simplify the model.

Optimal Regularization: The ideal λ balances retaining essential features and eliminating noise, typically where the curve flattens.

Problem 4: Introduction to Stacking and Ensembling

Add the outputs of your models as features and train a ridge regression on all the features plus the model outputs (this is called ensembling and stacking). Be careful not to overfit. What score can you get?


import numpy as np
import pandas as pd
from sklearn.linear_model import Ridge, Lasso
from sklearn.model_selection import train_test_split, cross_val_score
from sklearn.metrics import mean_squared_error

# Assume X_train, y, X_test, and test are defined from the Kaggle preprocessing above


# Train Ridge and Lasso models
ridge_model = Ridge(alpha=10).fit(X_train, y)
lasso_model = Lasso(alpha=0.0005).fit(X_train, y)

# Generate predictions on training data


ridge_preds_train = ridge_model.predict(X_train)
lasso_preds_train = lasso_model.predict(X_train)

# Generate predictions on test data


ridge_preds_test = ridge_model.predict(X_test)
lasso_preds_test = lasso_model.predict(X_test)

# Create new feature sets


X_train_stack = np.hstack((X_train, ridge_preds_train.reshape(-1, 1), lasso_preds_train.reshape(-1, 1)))
X_test_stack = np.hstack((X_test, ridge_preds_test.reshape(-1, 1), lasso_preds_test.reshape(-1, 1)))

# Train final Ridge regression model on stacked features


ridge_final_model = Ridge(alpha=10).fit(X_train_stack, y)

# Evaluate performance using cross-validation


rmse_final = np.sqrt(-cross_val_score(ridge_final_model, X_train_stack, y, scoring='neg_mean_squared_error', cv=5)).mean()
print(f"RMSE for stacked model: {rmse_final}")

# Predict on test data using stacked model


final_predictions = ridge_final_model.predict(X_test_stack)

# Prepare submission file with IDs and predicted SalePrice


submission = pd.DataFrame({"Id": test["Id"], "SalePrice": final_predictions})
submission.to_csv("stacked_submission.csv", index=False)

RMSE for stacked model: 0.12356315812543787

Kaggle Score after Stacked Submission: 0.12496
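One caveat on the overfitting warning in the problem statement: the stack above feeds the meta-model in-sample predictions (the base models predict the same rows they were fit on), which can leak information into the cross-validation estimate. Below is a minimal sketch of a leakage-safer variant using out-of-fold predictions; it assumes X_train, X_test, y, and test as defined above and was not part of the submitted run.

from sklearn.linear_model import Ridge, Lasso
from sklearn.model_selection import cross_val_predict, cross_val_score
import numpy as np
import pandas as pd

ridge_base = Ridge(alpha=10)
lasso_base = Lasso(alpha=0.0005, max_iter=10000)

# Out-of-fold predictions: each training row is predicted by a model that never saw it
ridge_oof = cross_val_predict(ridge_base, X_train, y, cv=5)
lasso_oof = cross_val_predict(lasso_base, X_train, y, cv=5)
X_train_stack_oof = np.hstack((X_train, ridge_oof.reshape(-1, 1), lasso_oof.reshape(-1, 1)))

# For the test set, refit the base models on the full training data (standard stacking practice)
ridge_test = ridge_base.fit(X_train, y).predict(X_test)
lasso_test = lasso_base.fit(X_train, y).predict(X_test)
X_test_stack_oof = np.hstack((X_test, ridge_test.reshape(-1, 1), lasso_test.reshape(-1, 1)))

# Meta-model: a ridge regression on the original features plus the out-of-fold predictions
meta_model = Ridge(alpha=10)
rmse_oof = np.sqrt(-cross_val_score(meta_model, X_train_stack_oof, y,
                                    scoring="neg_mean_squared_error", cv=5)).mean()
print(f"RMSE for out-of-fold stacked model: {rmse_oof}")

meta_model.fit(X_train_stack_oof, y)
oof_preds = np.expm1(meta_model.predict(X_test_stack_oof))
pd.DataFrame({"Id": test["Id"], "SalePrice": oof_preds}).to_csv("stacked_oof_submission.csv", index=False)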

Problem 5

Use the data generation from the LASSO notebook, where we first introduced Lasso, to generate data.

You can find that in the pages tab in Canvas.

1. Manually implement forward selection. Report the order in which you add features.
2. In this example, we know the true support size is 5. But what if we did not know this? Plot test error as a function of the size of the
support. Use this to recover the true support size. Justify your answer.
3. Use Lasso with a manually implemented Cross validation using the metric of your choice. What is the value of the hyperparameter?
(Manually implemented means that you can either do it entirely on your own, or you can use GridSearchCV, but I’m asking you not to use
LassoCV, which you will use in the next problem).
4. (Optional) Change the number of folds in your CV and repeat the previous step. How does the optimal value of the hyperparameter
change? Try to explain any trends that you find.
5. (Optional) Read about and use LassoCV from sklearn.linear_model. How does this compare with what you did in the previous step? If they agree, then explain why they agree, and if they disagree explain why. This will require you to make sure you understand what LassoCV is doing.

Step 0: Generate Data


np.random.seed(7)

n_samples, n_features = 100, 200


X = np.random.randn(n_samples, n_features)


k = 5
# beta generated with k nonzeros
#coef = 10 * np.random.randn(n_features)
coef = 10 * np.ones(n_features)
inds = np.arange(n_features)
np.random.shuffle(inds)
coef[inds[k:]] = 0 # sparsify coef
y = np.dot(X, coef)

# add a small amount of zero-mean Gaussian noise (one draw per sample)
y += 0.01 * np.random.normal(size=(n_samples,))

# Split data in train set and test set


n_samples = X.shape[0]
X_train, y_train = X[:25], y[:25]
X_test, y_test = X[25:], y[25:]

Step 1: Manually Implement Forward Selection


import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error

# Assuming X_train, y_train are already defined as in the previous code

# Forward selection implementation


selected_features = []
corresponding_mse = []
remaining_features = list(range(X_train.shape[1]))

# Limiting selection to top 10 for demonstration purposes


for _ in range(10):
best_feature = None
best_mse = float('inf')
for feature in remaining_features:
current_features = selected_features + [feature]
model = LinearRegression().fit(X_train[:, current_features], y_train)
y_pred = model.predict(X_train[:, current_features])
mse = mean_squared_error(y_train, y_pred)
if mse < best_mse:
best_mse = mse
best_feature = feature
selected_features.append(best_feature)
corresponding_mse.append(best_mse)

remaining_features.remove(best_feature)

print("Selected features:", selected_features)


print("MSE for Selected features:", corresponding_mse)

Selected features: [15, 18, 78, 76, 29, 80, 55, 0, 27, 62]
MSE for Selected features: [274.9259047713406, 127.87410588555692, 50.25889203977732, 28.96253432071043, 17.198394502743234, 11.59848226

Step 2: Estimate the True Support Size by Plotting Test Error
import numpy as np
import matplotlib.pyplot as plt

# Errors recorded during forward selection
# (note: corresponding_mse holds the training MSEs from the selection loop above)
test_errors = corresponding_mse

# Support sizes for the feature selections


support_sizes = range(1, len(selected_features) + 1)

# Plot the test error as a function of the support size

plt.figure(figsize=(12, 6))
plt.plot(support_sizes, test_errors, marker='o')
plt.title('Test Error vs. Support Size')
plt.xlabel('Support Size')
plt.ylabel('Test Error (MSE)')
plt.yscale('log') # Log scale for better visualization
plt.grid(True)
plt.show()

# Find the minimum error and its corresponding support size


min_error = min(test_errors)
optimal_support_size = support_sizes[test_errors.index(min_error)]

print(f"Optimal support size: {optimal_support_size}")


print(f"Minimum test error: {min_error:.2e}")

Optimal support size: 10
Minimum test error: 8.15e-01

Based on the results, the lowest recorded error occurs at a support size of 10, with a minimum error of approximately 8.15e-01. The error continued to decrease as more features were added, reaching its lowest value when all 10 selected features were used.

However, the key observation is that while the error decreases steadily as more features are added, the improvement becomes far less pronounced after a certain point, indicating diminishing returns. Even though the curve keeps improving out to 10 features, the first features (around 5 of them, matching the true support size) account for most of the reduction in error, and the remaining features improve the model only gradually.
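A small follow-up sketch (not part of the run above): since the problem asks for test error, each prefix of the forward-selection order can be scored on the held-out split; the held-out curve should flatten, or start to rise, once the true 5 features are included, which is one way to recover the support size. This assumes selected_features, X_train, y_train, X_test, and y_test from the cells above.

# Score each nested feature set on the held-out test split
held_out_mse = []
for k_feats in range(1, len(selected_features) + 1):
    feats = selected_features[:k_feats]
    model = LinearRegression().fit(X_train[:, feats], y_train)
    held_out_mse.append(mean_squared_error(y_test, model.predict(X_test[:, feats])))

plt.plot(range(1, len(held_out_mse) + 1), held_out_mse, marker='o')
plt.xlabel('Support Size')
plt.ylabel('Held-out MSE')
plt.yscale('log')
plt.show()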

Step 3: Lasso Regression with Manual Cross-Validation


from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import GridSearchCV

# Normalize the feature matrix


scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)

# Perform cross-validation again with a wider range of alphas


alphas = np.logspace(-4, 1, 50)
param_grid = {'alpha': alphas}
grid_search = GridSearchCV(Lasso(max_iter=10000), param_grid, scoring='neg_mean_squared_error', cv=5, n_jobs=-1)
grid_search.fit(X_train_scaled, y_train)

# Find the best alpha and evaluate the test MSE
best_alpha = grid_search.best_params_['alpha']
lasso_best = Lasso(alpha=best_alpha, max_iter=10000)
lasso_best.fit(X_train_scaled, y_train)
y_pred_test = lasso_best.predict(X_test_scaled)

# Test MSE
test_mse = mean_squared_error(y_test, y_pred_test)
print(f"Best alpha (5 fold): {best_alpha}")
print(f"Test MSE with scaled features: {test_mse:.4f}")

Best alpha (5 fold): 0.005428675439323859


Test MSE with scaled features: 0.0012

Step 4: (Optional) Vary the Number of Folds in Cross-Validation


from sklearn.linear_model import Lasso
from sklearn.model_selection import KFold, GridSearchCV
from sklearn.metrics import mean_squared_error
from sklearn.preprocessing import StandardScaler
# Set up alpha ranges for Lasso
lasso_alphas = {'alpha': np.logspace(-4, 1, 50)}

# Define different number of folds for cross-validation


folds = [3, 5, 10]

# Store results
results = {}

for n_folds in folds:


# Define k-fold cross-validation
kf = KFold(n_splits=n_folds, shuffle=True, random_state=42)

# Lasso Regression with GridSearchCV


lasso = Lasso(max_iter=10000)
lasso_cv = GridSearchCV(lasso, lasso_alphas, cv=kf, scoring='neg_mean_squared_error')
lasso_cv.fit(X_train_scaled, y_train)

# Get best model and error


best_alpha = lasso_cv.best_params_['alpha']
test_pred = lasso_cv.predict(X_test_scaled)
test_mse = mean_squared_error(y_test, test_pred)

results[n_folds] = {'best_alpha': best_alpha, 'test_mse': test_mse}

# Print results
for n_folds, res in results.items():
print(f"Number of Folds: {n_folds}, Best Alpha: {res['best_alpha']}, Test MSE: {res['test_mse']:.4f}")

Number of Folds: 3, Best Alpha: 0.002682695795279727, Test MSE: 67.0029


Number of Folds: 5, Best Alpha: 0.05689866029018299, Test MSE: 0.0543
Number of Folds: 10, Best Alpha: 0.008685113737513529, Test MSE: 0.0018

Observations: Variation in Optimal Alpha

3 folds: The best alpha found is 0.00268, with a relatively high test MSE of 67.00. This suggests the model is not effectively regularized for this dataset when using just three folds; with only 25 training samples, each fold trains on roughly 17 samples, so the alpha selection is noisy.

5 folds: The best alpha increases significantly to 0.05690, resulting in a much lower test MSE of 0.0543. This indicates improved regularization, as the model now performs better with a more appropriate alpha value.

10 folds: The optimal alpha is 0.00869, and the test MSE drops even further to 0.0018, showing excellent performance. This lower test MSE reflects better generalization of the model on unseen data.

Step 5: (Optional) Compare with LassoCV


from sklearn.linear_model import LassoCV
from sklearn.metrics import mean_squared_error, r2_score

# Create LassoCV object
lasso_cv = LassoCV(alphas=np.logspace(-4, 1, 50), cv=5, random_state=42)

# Fit the model


lasso_cv.fit(X_train_scaled, y_train)

# Make predictions
y_pred_cv = lasso_cv.predict(X_test_scaled)

# Calculate MSE and R2 score


mse_cv = mean_squared_error(y_test, y_pred_cv)
r2_cv = r2_score(y_test, y_pred_cv)

print(f"Best alpha: {lasso_cv.alpha_}")


print(f"MSE: {mse_cv}")
print(f"R2 Score: {r2_cv}")

Best alpha: 1.9306977288832496


MSE: 60.114048729555506
R2 Score: 0.8491435237962413

Given the results from two approaches:

GridSearchCV: best alpha of 0.0054 with a test MSE of 0.0012.
LassoCV: best alpha of 1.9307 with a test MSE of 60.1140.

Brief Explanation of Discrepancy

The significant difference in the best alpha values and MSE results suggests that the two methods are identifying different optimal
hyperparameters for the Lasso model. Here are potential reasons for this discrepancy:

Regularization Sensitivity: Lasso regression is sensitive to the choice of the alpha parameter, which controls the strength of the penalty. The
vastly different optimal alphas indicate that the model is responding differently to the regularization effect in each approach.

Data Characteristics: The distribution of the features and the target variable can affect how regularization is applied. If the features have a wide
range or differing scales, it can lead to different model performance across methods.

Hyperparameter Exploration: The search strategies may lead to different regions in the alpha parameter space being explored. While both
methods utilize the same range, LassoCV optimizes based on a built-in cross-validation approach, potentially leading it to converge on a less
optimal solution compared to the grid search.

Variance in Cross-Validation: Even though both methods used 5-fold cross-validation, the specific splits and their interaction with the model can lead to variability in the performance estimates. This matters especially here, where the training set has only 25 samples, so each validation fold contains just 5 points.
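One way to narrow this down (a sketch only, not part of the submitted run): hand both searches exactly the same folds. If they then pick similar alphas, the earlier discrepancy came from the CV splits; if they still disagree, something else (for example convergence settings) is responsible. This assumes X_train_scaled and y_train from the cells above.

from sklearn.linear_model import Lasso, LassoCV
from sklearn.model_selection import KFold, GridSearchCV
import numpy as np

alphas = np.logspace(-4, 1, 50)
shared_folds = KFold(n_splits=5, shuffle=True, random_state=0)

# Grid search over the same alpha grid, with the shared folds
grid = GridSearchCV(Lasso(max_iter=10000), {'alpha': alphas},
                    scoring='neg_mean_squared_error', cv=shared_folds)
grid.fit(X_train_scaled, y_train)

# LassoCV with the identical folds
lcv = LassoCV(alphas=alphas, cv=shared_folds, max_iter=10000)
lcv.fit(X_train_scaled, y_train)

print("GridSearchCV best alpha:", grid.best_params_['alpha'])
print("LassoCV best alpha:", lcv.alpha_)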

