Exp 2 (Multiple Linear Regression)
To perform multiple linear regression on multiple datasets, examine the results, and determine which dataset gives the better output.
Theory:
Multiple linear regression (MLR) is a statistical technique used to model the relationship between a
single dependent variable (what you want to predict) and multiple independent variables (features
that influence the dependent variable). It assumes a linear relationship between these variables and
builds a linear equation to capture this relationship.
Key Concepts:
Equation:
y = β₀ + β₁x₁ + β₂x₂ + … + βₙxₙ + ε
β₀ is the intercept term (the constant value when all independent variables are zero).
β₁ … βₙ are the coefficients, each giving the expected change in y for a one-unit change in the corresponding independent variable, holding the others fixed.
ε is the error term, capturing variation not explained by the model.
Limitations of MLR:
It assumes a strictly linear relationship between the predictors and the target, is sensitive to multicollinearity among the independent variables and to outliers, and cannot capture non-linear effects unless the features are transformed (for example with polynomial terms, as done in the code below).
Applications of MLR:
Predicting house prices based on features like size, location, and amenities.
Understanding how factors like age, income, and education affect job satisfaction.
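As a quick illustration of how the equation above maps to code (the house-feature numbers here are made up purely for illustration and are not from either dataset used in this experiment), a bare MLR fit with scikit-learn looks like this:
import numpy as np
from sklearn.linear_model import LinearRegression
# Hypothetical data: house size (sq ft), number of rooms, age (years) -> price
X = np.array([[1200, 3, 10],
              [1500, 4, 5],
              [900, 2, 30],
              [2000, 4, 2],
              [1700, 3, 8]])
y = np.array([200000, 260000, 140000, 340000, 290000])
model = LinearRegression()
model.fit(X, y)
print("Intercept (beta_0):", model.intercept_)        # constant term
print("Coefficients (beta_1..beta_n):", model.coef_)  # one coefficient per feature
print("Predicted price for [1600, 3, 6]:", model.predict([[1600, 3, 6]]))
The full experiment code used for this lab follows below.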
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from sklearn.preprocessing import StandardScaler, PolynomialFeatures
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestRegressor
from sklearn.feature_selection import SelectFromModel
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score, mean_squared_error
df = pd.read_csv('boston.csv')
# Remove outliers using the IQR (interquartile range) rule on the numeric columns
numeric_cols = df.select_dtypes(include=[np.number]).columns
Q1 = df[numeric_cols].quantile(0.25)
Q3 = df[numeric_cols].quantile(0.75)
IQR = Q3 - Q1
df = df[~((df[numeric_cols] < (Q1 - 1.5 * IQR)) | (df[numeric_cols] > (Q3 + 1.5 * IQR))).any(axis=1)]
# Extract features and target variable (using the provided column names)
X = df.drop('MEDV', axis=1)
y = df['MEDV']
# Feature Scaling
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)
# Train-test split
X_train, X_test, y_train, y_test = train_test_split(X_scaled, y, test_size=0.2, random_state=0)
# Feature selection: keep only the features a random forest ranks as important
rf_model = RandomForestRegressor(random_state=0)
rf_model.fit(X_train, y_train)
sfm = SelectFromModel(rf_model, prefit=True)
X_train = sfm.transform(X_train)
X_test = sfm.transform(X_test)
# Polynomial feature expansion to capture non-linear terms
poly = PolynomialFeatures(degree=2)
X_train_poly = poly.fit_transform(X_train)
X_test_poly = poly.transform(X_test)
# Model fitting
regressor = LinearRegression()
regressor.fit(X_train_poly, y_train)
# Evaluation
y_pred = regressor.predict(X_test_poly)
r2 = r2_score(y_test, y_pred)
mse = mean_squared_error(y_test, y_pred)
# Visualization (optional)
plt.scatter(y_test, y_pred)
plt.xlabel("Actaul Medv")
plt.ylabel("Predicted Medv")
plt.show()
Performance Metrics:
Multiple Regression Dataset Output:
Boston Housing Dataset Output:
(The key figures from both outputs are summarized in the comparison below.)
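A sketch of how these metrics can be obtained from the Boston Housing pipeline above (the variable names follow the code listed earlier; the figures quoted in the comparison come from the recorded runs):
from sklearn.metrics import mean_squared_error, r2_score
train_score = regressor.score(X_train_poly, y_train)  # R^2 on the training split
test_score = regressor.score(X_test_poly, y_test)     # R^2 on the held-out split
mse = mean_squared_error(y_test, y_pred)
r2 = r2_score(y_test, y_pred)
print("Train Score:", train_score)
print("Test Score:", test_score)
print("MSE:", mse)
print("R2:", r2)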
Comparison:
Comparing the performance of models trained on a multiple regression dataset and the Boston
Housing dataset:
Train Score:
The multiple regression model achieves a very high train score (0.983), indicating an excellent fit to
the training data.
The Boston Housing model also demonstrates a reasonably high train score (0.822), suggesting a
good fit to its training data.
Test Score:
Both models exhibit high test scores, with the multiple regression model at 0.887 and the Boston
Housing model at 0.877, indicating strong generalization performance.
Mean Squared Error (MSE):
The multiple regression model has a relatively high MSE of 2,611,228, suggesting higher prediction
errors on average.
In contrast, the Boston Housing model shows a much lower MSE of 5.379, indicating superior
prediction accuracy.
R-squared (R2):
The multiple regression model and the Boston Housing model both achieve high R-squared values
(0.887 and 0.877 respectively), indicating good explanatory power over the variance in their
respective dependent variables.
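For a side-by-side view, the figures discussed above can be collected into a small table (values copied from the two runs; this is only a reading convenience, not a new computation):
import pandas as pd
comparison = pd.DataFrame(
    {"Multiple Regression Dataset": [0.983, 0.887, 2611228, 0.887],
     "Boston Housing Dataset": [0.822, 0.877, 5.379, 0.877]},
    index=["Train Score", "Test Score", "MSE", "R2"])
print(comparison)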
Conclusion:
While both models exhibit strong performance in terms of train and test scores, the Boston Housing
model outperforms in terms of MSE, suggesting superior prediction accuracy.
Despite the multiple regression model's higher R-squared value, indicating a better fit to the data, its
higher MSE implies potential issues with prediction accuracy on unseen data.
Therefore, for accurate prediction of housing prices, the Boston Housing model is preferred.
However, if the goal is to explain variance in the dependent variable, the multiple regression model
may be more suitable.