
House Price Prediction

Data Exploration
In [1]: # importing necessary libraries
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.preprocessing import LabelEncoder
!pip install lazypredict
from sklearn.model_selection import train_test_split
from lazypredict.Supervised import LazyRegressor
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.metrics import mean_squared_error, mean_absolute_error

/opt/conda/lib/python3.10/site-packages/scipy/__init__.py:146: UserWarning: A NumPy version >=1.16.5 and <1.23.0 is required for this version of SciPy (detected version 1.23.5)
  warnings.warn(f"A NumPy version >={np_minversion} and <{np_maxversion}"
Collecting lazypredict
  Downloading lazypredict-0.2.12-py2.py3-none-any.whl (12 kB)
Requirement already satisfied: click in /opt/conda/lib/python3.10/site-packages (from lazypredict) (8.1.3)
Requirement already satisfied: scikit-learn in /opt/conda/lib/python3.10/site-packages (from lazypredict) (1.2.2)
Requirement already satisfied: pandas in /opt/conda/lib/python3.10/site-packages (from lazypredict) (1.5.3)
Requirement already satisfied: tqdm in /opt/conda/lib/python3.10/site-packages (from lazypredict) (4.65.0)
Requirement already satisfied: joblib in /opt/conda/lib/python3.10/site-packages (from lazypredict) (1.2.0)
Requirement already satisfied: lightgbm in /opt/conda/lib/python3.10/site-packages (from lazypredict) (3.3.2)
Requirement already satisfied: xgboost in /opt/conda/lib/python3.10/site-packages (from lazypredict) (1.7.6)
Requirement already satisfied: wheel in /opt/conda/lib/python3.10/site-packages (from lightgbm->lazypredict) (0.40.0)
Requirement already satisfied: numpy in /opt/conda/lib/python3.10/site-packages (from lightgbm->lazypredict) (1.23.5)
Requirement already satisfied: scipy in /opt/conda/lib/python3.10/site-packages (from lightgbm->lazypredict) (1.11.1)
Requirement already satisfied: threadpoolctl>=2.0.0 in /opt/conda/lib/python3.10/site-packages (from scikit-learn->lazypredict) (3.1.0)
Requirement already satisfied: python-dateutil>=2.8.1 in /opt/conda/lib/python3.10/site-packages (from pandas->lazypredict) (2.8.2)
Requirement already satisfied: pytz>=2020.1 in /opt/conda/lib/python3.10/site-packages (from pandas->lazypredict) (2023.3)
Requirement already satisfied: six>=1.5 in /opt/conda/lib/python3.10/site-packages (from python-dateutil>=2.8.1->pandas->lazypredict) (1.16.0)
Installing collected packages: lazypredict
Successfully installed lazypredict-0.2.12

In [2]: # loading dataset


df = pd.read_csv("/kaggle/input/housing-price-prediction/Housing.csv")

In [3]: # checking first 5 rows


df.head()

Out[3]:
       price  area  bedrooms  bathrooms  stories mainroad guestroom basement hotwaterheating  ...
0   13300000  7420         4          2        3      yes        no       no              no  ...
1   12250000  8960         4          4        4      yes        no       no              no  ...
2   12250000  9960         3          2        2      yes        no      yes              no  ...
3   12215000  7500         4          2        2      yes        no      yes              no  ...
4   11410000  7420         4          1        2      yes       yes      yes              no  ...

In [4]: # checking last 5 rows


df.tail()

Out[4]:
       price  area  bedrooms  bathrooms  stories mainroad guestroom basement hotwaterheating  ...
540  1820000  3000         2          1        1      yes        no      yes              no  ...
541  1767150  2400         3          1        1       no        no       no              no  ...
542  1750000  3620         2          1        1      yes        no       no              no  ...
543  1750000  2910         3          1        1       no        no       no              no  ...
544  1750000  3850         3          1        2      yes        no       no              no  ...

In [5]: # checking null values


df.isnull().sum()

Out[5]:
price               0
area                0
bedrooms            0
bathrooms           0
stories             0
mainroad            0
guestroom           0
basement            0
hotwaterheating     0
airconditioning     0
parking             0
prefarea            0
furnishingstatus    0
dtype: int64

In [6]: # checking duplicate values


df.duplicated().value_counts()

Out[6]:
False    545
dtype: int64

In [7]: # checking column names


df.columns

Out[7]:
Index(['price', 'area', 'bedrooms', 'bathrooms', 'stories', 'mainroad',
       'guestroom', 'basement', 'hotwaterheating', 'airconditioning',
       'parking', 'prefarea', 'furnishingstatus'],
      dtype='object')

In [8]: # checking data types


df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 545 entries, 0 to 544
Data columns (total 13 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 price 545 non-null int64
1 area 545 non-null int64
2 bedrooms 545 non-null int64
3 bathrooms 545 non-null int64
4 stories 545 non-null int64
5 mainroad 545 non-null object
6 guestroom 545 non-null object
7 basement 545 non-null object
8 hotwaterheating 545 non-null object
9 airconditioning 545 non-null object
10 parking 545 non-null int64
11 prefarea 545 non-null object
12 furnishingstatus 545 non-null object
dtypes: int64(6), object(7)
memory usage: 55.5+ KB

In [9]: # checking unique values


df.nunique()

Out[9]:
price               219
area                284
bedrooms              6
bathrooms             4
stories               4
mainroad              2
guestroom             2
basement              2
hotwaterheating       2
airconditioning       2
parking               4
prefarea              2
furnishingstatus      3
dtype: int64

In [10]: # getting statistical summary


df.describe()

Out[10]:
              price      area  bedrooms  bathrooms  stories  parking
count        545.00    545.00    545.00     545.00   545.00   545.00
mean     4766729.25   5150.54      2.97       1.29     1.81     0.69
std      1870439.62   2170.14      0.74       0.50     0.87     0.86
min      1750000.00   1650.00      1.00       1.00     1.00     0.00
25%      3430000.00   3600.00      2.00       1.00     1.00     0.00
50%      4340000.00   4600.00      3.00       1.00     2.00     0.00
75%      5740000.00   6360.00      3.00       2.00     2.00     1.00
max     13300000.00  16200.00      6.00       4.00     4.00     3.00
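
The summary suggests 'price' and 'area' are right-skewed (their means sit well above their medians). A quick check of this, as a sketch:

In [ ]: # sketch: quantify the skew hinted at by the summary above
df[['price', 'area']].skew()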

Data Visualization
In [11]: # Visualizing 'price'
plt.hist(df['price'], color='r')
plt.xlabel('Price')
plt.ylabel('Frequency')
plt.title('Distribution of Prices')
plt.show()

In [12]: # Visualizing 'bedrooms'


df['bedrooms'].value_counts().plot(kind='bar', color='g')
plt.xlabel('Bedrooms')
plt.ylabel('Count')
plt.title('Number of Properties for each Number of Bedrooms')
plt.show()
In [13]: # Visualizing 'bathrooms'
df['bathrooms'].value_counts().plot(kind='barh', color='y')
plt.title('Number of Properties for each Number of Bathrooms')
plt.show()

In [14]: # Visualizing 'stories'


df['stories'].value_counts().plot(kind='barh', color='c')
plt.xlabel('Count')
plt.ylabel('Stories')
plt.title('Number of Properties for each Number of Stories')
plt.show()

In [15]: # Visualizing 'mainroad'


df['mainroad'].value_counts().plot(kind='pie', colors=['red', 'yellow'])
plt.title('Number of Properties for Availability of Main Road')
plt.show()

In [16]: # Visualizing 'guestrooms'


df['guestroom'].value_counts().plot(kind='pie', colors=['green', 'pink'])
plt.title('Number of Properties for Availability of Guestroom')
plt.show()

In [17]: # Visualizing 'basement'


df['basement'].value_counts().plot(kind='pie', colors=['grey', 'cyan'])
plt.title('Number of Properties for Availability of Basement')
plt.show()

In [18]: # Visualizing 'Hot Water Heating'


df['hotwaterheating'].value_counts().plot(kind='pie', colors=['brown', 'orange'])
plt.title('Number of Properties for Availability of Hot Water Heating')
plt.show()

In [19]: # Visualizing 'Air Conditioners'


df['airconditioning'].value_counts().plot(kind='pie', colors=['purple', 'magenta'])
plt.title('Number of Properties for Availability of Air Conditioners')
plt.show()

In [20]: # Visualizing 'parking'


df['parking'].value_counts().plot(kind='bar', color='m')
plt.xlabel('Parking')
plt.ylabel('Count')
plt.title('Number of Properties for each Number of Parking Spaces')
plt.show()

In [21]: # Visualizing 'prefarea'


df['prefarea'].value_counts().plot(kind='pie', colors=['darkgreen', 'lightgreen'])
plt.title('Number of Properties in a Preferred Area')
plt.show()

In [22]: # Visualizing 'furnishingstatus'


df['furnishingstatus'].value_counts().plot(kind='pie', colors=['skyblue', 'lightblue', 'lightsteelblue'])
plt.title('Number of Properties by Furnishing Status')
plt.show()

In [23]: # Visualizing 'area' vs. 'price'


plt.scatter(df['area'], df['price'], color='orange')
plt.xlabel('Area')
plt.ylabel('Price')
plt.title('Area vs. Price')
plt.show()
In [24]: # Creating a pair plot
sns.pairplot(df)
plt.show()

In [25]: # Calculating the correlation matrix


correlation_matrix = df.corr(numeric_only=True)  # restrict to numeric columns; the object columns are encoded below

# Creating a correlation heatmap


plt.figure(figsize=(10, 8)) # Adjust the figure size as per your preference
sns.heatmap(correlation_matrix, annot=True, cmap='coolwarm')
plt.title('Correlation Heatmap')
plt.show()
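
To read the heatmap at a glance, the same correlations can also be ranked against the target directly; a minimal sketch:

In [ ]: # sketch: rank numeric features by linear correlation with price
correlation_matrix['price'].sort_values(ascending=False)
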
Feature Engineering
In [26]: # Select the columns to encode
categorical_columns = ['mainroad', 'guestroom', 'basement', 'hotwaterheating',
                       'airconditioning', 'prefarea', 'furnishingstatus']

# Perform label encoding
label_encoder = LabelEncoder()
for col in categorical_columns:
    df[col] = label_encoder.fit_transform(df[col])

# Display the updated DataFrame
df

Out[26]:
        price  area  bedrooms  bathrooms  stories  mainroad  guestroom  basement  hotwaterheating  ...
0    13300000  7420         4          2        3         1          0         0                0  ...
1    12250000  8960         4          4        4         1          0         0                0  ...
2    12250000  9960         3          2        2         1          0         1                0  ...
3    12215000  7500         4          2        2         1          0         1                0  ...
4    11410000  7420         4          1        2         1          1         1                0  ...
..        ...   ...       ...        ...      ...       ...        ...       ...              ...  ...
540   1820000  3000         2          1        1         1          0         1                0  ...
541   1767150  2400         3          1        1         0          0         0                0  ...
542   1750000  3620         2          1        1         1          0         0                0  ...
543   1750000  2910         3          1        1         0          0         0                0  ...
544   1750000  3850         3          1        2         1          0         0                0  ...

545 rows × 13 columns
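One caveat with reusing a single LabelEncoder in the loop: each fit overwrites the previous mapping, so the integer codes cannot be mapped back to their labels afterwards. A sketch of the same loop that keeps one fitted encoder per column (an alternative, not what was run above):

In [ ]: # sketch: keep each fitted encoder so codes can be inverted later
encoders = {}
for col in categorical_columns:
    encoders[col] = LabelEncoder()
    df[col] = encoders[col].fit_transform(df[col])
# e.g. encoders['mainroad'].inverse_transform([0, 1]) recovers the original labels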

Machine Learning Model

Splitting the dataset


In [27]: X = df.drop('price', axis=1) # Features (excluding the target variable)
y = df['price'] # Target variable

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

In [28]: regressor = LazyRegressor(verbose=0, ignore_warnings=True, custom_metric=None)


models, predictions = regressor.fit(X_train, X_test, y_train, y_test)

100%|██████████| 42/42 [00:14<00:00, 2.98it/s]

In [29]: print(models)

                               Adjusted R-Squared  R-Squared          RMSE  \
Model
GradientBoostingRegressor 0.62 0.66 1301871.87
PoissonRegressor 0.62 0.66 1303698.42
LassoLarsCV 0.61 0.65 1331071.42
LassoLarsIC 0.61 0.65 1331071.42
LarsCV 0.61 0.65 1331071.42
Lars 0.61 0.65 1331071.42
TransformedTargetRegressor 0.61 0.65 1331071.42
LinearRegression 0.61 0.65 1331071.42
Lasso 0.61 0.65 1331072.08
LassoLars 0.61 0.65 1331072.09
Ridge 0.61 0.65 1331290.05
SGDRegressor 0.60 0.65 1332795.68
LassoCV 0.60 0.65 1332883.21
RidgeCV 0.60 0.65 1333447.11
HistGradientBoostingRegressor 0.58 0.63 1369076.05
BaggingRegressor 0.58 0.63 1370335.03
XGBRegressor 0.58 0.63 1374945.10
LGBMRegressor 0.58 0.62 1381195.18
ExtraTreesRegressor 0.57 0.62 1391999.83
RandomForestRegressor 0.56 0.61 1400765.84
ElasticNet 0.55 0.60 1418765.63
HuberRegressor 0.55 0.60 1420233.36
KNeighborsRegressor 0.53 0.58 1451363.57
OrthogonalMatchingPursuitCV 0.52 0.57 1467413.32
AdaBoostRegressor 0.50 0.56 1497403.63
TweedieRegressor 0.49 0.55 1512162.75
GammaRegressor 0.49 0.55 1515460.98
RANSACRegressor 0.48 0.54 1527036.86
DecisionTreeRegressor 0.40 0.47 1639566.30
ExtraTreeRegressor 0.33 0.41 1729079.53
OrthogonalMatchingPursuit 0.18 0.27 1917103.70
ElasticNetCV -0.14 -0.02 2265132.23
BayesianRidge -0.15 -0.02 2268298.23
DummyRegressor -0.15 -0.02 2268298.23
NuSVR -0.17 -0.04 2294658.59
QuantileRegressor -0.24 -0.10 2356800.96
SVR -0.24 -0.10 2359647.74
KernelRidge -4.62 -4.00 5024858.77
PassiveAggressiveRegressor -4.78 -4.13 5094459.54
LinearSVR -5.71 -4.96 5488681.78
MLPRegressor -5.71 -4.96 5488874.14
GaussianProcessRegressor -12833.58 -11407.52 240135712.32

                               Time Taken
Model
GradientBoostingRegressor 0.19
PoissonRegressor 0.02
LassoLarsCV 0.03
LassoLarsIC 0.02
LarsCV 0.05
Lars 0.09
TransformedTargetRegressor 0.01
LinearRegression 0.01
Lasso 0.01
LassoLars 0.01
Ridge 0.01
SGDRegressor 0.01
LassoCV 0.08
RidgeCV 0.01
HistGradientBoostingRegressor 0.25
BaggingRegressor 0.05
XGBRegressor 0.13
LGBMRegressor 0.32
ExtraTreesRegressor 0.25
RandomForestRegressor 0.32
ElasticNet 0.02
HuberRegressor 0.02
KNeighborsRegressor 0.01
OrthogonalMatchingPursuitCV 0.02
AdaBoostRegressor 0.13
TweedieRegressor 0.02
GammaRegressor 0.02
RANSACRegressor 0.21
DecisionTreeRegressor 0.01
ExtraTreeRegressor 0.01
OrthogonalMatchingPursuit 0.03
ElasticNetCV 0.08
BayesianRidge 0.02
DummyRegressor 0.01
NuSVR 0.08
QuantileRegressor 9.24
SVR 0.02
KernelRidge 0.12
PassiveAggressiveRegressor 0.05
LinearSVR 0.01
MLPRegressor 1.87
GaussianProcessRegressor 0.13

In [30]: predictions

Out[30]:
                               Adjusted R-Squared  R-Squared          RMSE  Time Taken
Model
GradientBoostingRegressor                    0.62       0.66    1301871.87        0.19
PoissonRegressor                             0.62       0.66    1303698.42        0.02
LassoLarsCV                                  0.61       0.65    1331071.42        0.03
LassoLarsIC                                  0.61       0.65    1331071.42        0.02
LarsCV                                       0.61       0.65    1331071.42        0.05
Lars                                         0.61       0.65    1331071.42        0.09
TransformedTargetRegressor                   0.61       0.65    1331071.42        0.01
LinearRegression                             0.61       0.65    1331071.42        0.01
Lasso                                        0.61       0.65    1331072.08        0.01
LassoLars                                    0.61       0.65    1331072.09        0.01
Ridge                                        0.61       0.65    1331290.05        0.01
SGDRegressor                                 0.60       0.65    1332795.68        0.01
LassoCV                                      0.60       0.65    1332883.21        0.08
RidgeCV                                      0.60       0.65    1333447.11        0.01
HistGradientBoostingRegressor                0.58       0.63    1369076.05        0.25
BaggingRegressor                             0.58       0.63    1370335.03        0.05
XGBRegressor                                 0.58       0.63    1374945.10        0.13
LGBMRegressor                                0.58       0.62    1381195.18        0.32
ExtraTreesRegressor                          0.57       0.62    1391999.83        0.25
RandomForestRegressor                        0.56       0.61    1400765.84        0.32
ElasticNet                                   0.55       0.60    1418765.63        0.02
HuberRegressor                               0.55       0.60    1420233.36        0.02
KNeighborsRegressor                          0.53       0.58    1451363.57        0.01
OrthogonalMatchingPursuitCV                  0.52       0.57    1467413.32        0.02
AdaBoostRegressor                            0.50       0.56    1497403.63        0.13
TweedieRegressor                             0.49       0.55    1512162.75        0.02
GammaRegressor                               0.49       0.55    1515460.98        0.02
RANSACRegressor                              0.48       0.54    1527036.86        0.21
DecisionTreeRegressor                        0.40       0.47    1639566.30        0.01
ExtraTreeRegressor                           0.33       0.41    1729079.53        0.01
OrthogonalMatchingPursuit                    0.18       0.27    1917103.70        0.03
ElasticNetCV                                -0.14      -0.02    2265132.23        0.08
BayesianRidge                               -0.15      -0.02    2268298.23        0.02
DummyRegressor                              -0.15      -0.02    2268298.23        0.01
NuSVR                                       -0.17      -0.04    2294658.59        0.08
QuantileRegressor                           -0.24      -0.10    2356800.96        9.24
SVR                                         -0.24      -0.10    2359647.74        0.02
KernelRidge                                 -4.62      -4.00    5024858.77        0.12
PassiveAggressiveRegressor                  -4.78      -4.13    5094459.54        0.05
LinearSVR                                   -5.71      -4.96    5488681.78        0.01
MLPRegressor                                -5.71      -4.96    5488874.14        1.87
GaussianProcessRegressor                -12833.58  -11407.52  240135712.32        0.13
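
The leaderboard is already sorted by Adjusted R-Squared; a shortlist of candidates can also be pulled out programmatically, e.g. as a sketch:

In [ ]: # sketch: the five lowest-RMSE models from the LazyRegressor leaderboard
models.sort_values('RMSE').head(5)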


Gradient Boosting Regressor
In [31]: # Create and fit the Gradient Boosting Regression model
model = GradientBoostingRegressor()
model.fit(X_train, y_train)

# Predict on the training and testing sets


train_predictions = model.predict(X_train)
test_predictions = model.predict(X_test)
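
The model above uses scikit-learn's defaults. If tuning is wanted, a minimal grid search sketch (the grid values are illustrative assumptions, not tuned settings from this notebook):

In [ ]: # sketch: small, illustrative hyperparameter search
from sklearn.model_selection import GridSearchCV

param_grid = {'n_estimators': [100, 300],      # illustrative grid, not tuned values
              'learning_rate': [0.05, 0.1],
              'max_depth': [2, 3]}
search = GridSearchCV(GradientBoostingRegressor(random_state=42), param_grid,
                      scoring='neg_root_mean_squared_error', cv=5)
search.fit(X_train, y_train)
search.best_params_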

Model Evaluation
In [32]: # Evaluate the model using metrics
train_rmse = mean_squared_error(y_train, train_predictions, squared=False)
train_mae = mean_absolute_error(y_train, train_predictions)
test_rmse = mean_squared_error(y_test, test_predictions, squared=False)
test_mae = mean_absolute_error(y_test, test_predictions)

# Print the evaluation metrics


print("Training set - RMSE:", train_rmse)
print("Training set - MAE:", train_mae)
print("Testing set - RMSE:", test_rmse)
print("Testing set - MAE:", test_mae)

Training set - RMSE: 641817.0283186645
Training set - MAE: 476055.9856965125
Testing set - RMSE: 1304741.320665057
Testing set - MAE: 966476.970839526
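
The training RMSE is roughly half the testing RMSE, which points to some overfitting. r2_score puts the same comparison on a scale-free footing; a sketch:

In [ ]: # sketch: compare train/test fit on the R-squared scale
from sklearn.metrics import r2_score
print('Training set - R2:', r2_score(y_train, train_predictions))
print('Testing set - R2:', r2_score(y_test, test_predictions))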

In [33]: # Visualize the predicted values vs. actual values for the training set
plt.scatter(y_train, train_predictions, color='violet', alpha=0.5)
plt.plot([min(y_train), max(y_train)], [min(y_train), max(y_train)], color='cyan', linestyle='--')
plt.xlabel('Actual Price')
plt.ylabel('Predicted Price')
plt.title('Training Set - Actual vs. Predicted Price')
plt.show()

# Visualize the predicted values vs. actual values for the testing set
plt.scatter(y_test, test_predictions, color='violet', alpha=0.5)
plt.plot([min(y_test), max(y_test)], [min(y_test), max(y_test)], color='cyan', linestyle='--')
plt.xlabel('Actual Price')
plt.ylabel('Predicted Price')
plt.title('Testing Set - Actual vs. Predicted Price')
plt.show()
Model Interpretation
In [34]: importances = model.feature_importances_
feature_names = X_train.columns
# Sort the feature importances in descending order
sorted_indices = importances.argsort()[::-1]
sorted_importances = importances[sorted_indices]
sorted_features = feature_names[sorted_indices]

# Plot the feature importances


plt.figure(figsize=(10, 6))
plt.bar(range(len(sorted_importances)), sorted_importances, tick_label=sorted_features)
plt.xlabel('Features')
plt.ylabel('Importance')
plt.title('Feature Importances')
plt.xticks(rotation=45)
plt.show()
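
Impurity-based importances from boosted trees can favour high-cardinality numeric features such as 'area'. Permutation importance on the held-out set is a common cross-check; a sketch:

In [ ]: # sketch: permutation importance as a cross-check on the test split
from sklearn.inspection import permutation_importance

result = permutation_importance(model, X_test, y_test, n_repeats=10, random_state=42)
pd.Series(result.importances_mean, index=X_test.columns).sort_values(ascending=False)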

Predicting price for new house


In [35]: # Example input for a new house
new_house = np.array([[2000, 4, 3, 2, 1, 1, 2, 1, 1, 2, 3, 1]])

# Predict the price for the new house


predicted_price = model.predict(new_house)

print('Predicted Price:', predicted_price)

Predicted Price: [7152296.80587577]
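
Because the model was fitted on a DataFrame, predicting from a bare NumPy array triggers a feature-name warning in scikit-learn. Building the input with named columns avoids it; a sketch using the same illustrative values:

In [ ]: # sketch: same input, with column names matching the training data
new_house_df = pd.DataFrame(new_house, columns=X.columns)
model.predict(new_house_df)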
