Predicting Home Prices in Bangalore
The dataset was downloaded from
https://www.kaggle.com/datasets/amitabhajoy/bengaluru-house-price-data
Out[2]:        area_type   availability          location       size  society total_sqft  bath  balcony   price
1              Plot Area  Ready To Move  Chikka Tirupathi  4 Bedroom  Theanmp       2600   5.0      3.0  120.00
2          Built-up Area  Ready To Move       Uttarahalli      3 BHK      NaN       1440   2.0      3.0   62.00
In [3]: df.shape
Out[3]: (13320, 9)
In [4]: df.groupby('area_type')['area_type'].agg('count')
Out[4]:
area_type
Built-up Area           2418
Carpet Area               87
Plot Area               2025
Super built-up Area     8790
Name: area_type, dtype: int64
Out[6]:
location       1
size          16
total_sqft     0
bath          73
price          0
dtype: int64
Out[7]:
location      0
size          0
total_sqft    0
bath          0
price         0
dtype: int64
In [8]: df2.shape
Out[8]: (13246, 5)
In [9]: df2['size'].unique()
Out[9]: array(['2 BHK', '4 Bedroom', '3 BHK', '4 BHK', '6 Bedroom', '3 Bedroom',
               '1 BHK', '1 RK', '1 Bedroom', '8 Bedroom', '2 Bedroom',
               '7 Bedroom', '5 BHK', '7 BHK', '6 BHK', '5 Bedroom', '11 BHK',
               '9 BHK', '9 Bedroom', '27 BHK', '10 Bedroom', '11 Bedroom',
               '10 BHK', '19 BHK', '16 BHK', '43 Bedroom', '14 BHK', '8 BHK',
               '12 Bedroom', '13 BHK', '18 Bedroom'], dtype=object)
Feature Engineering
Add a new integer feature for bhk (Bedrooms Hall Kitchen)
C:\Users\47455\AppData\Local\Temp\ipykernel_6512\1142257054.py:1: SettingWithCopyWarning:
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead
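The cell that created the `bhk` column is not in this export. A minimal sketch of the idea, assuming the size strings shown above (taking an explicit copy and assigning via `.loc` avoids the SettingWithCopyWarning):

```python
import pandas as pd

# Sketch: parse the leading number out of size strings like '2 BHK' or
# '4 Bedroom' into an integer 'bhk' column. Sample rows are illustrative.
df2 = pd.DataFrame({'size': ['2 BHK', '4 Bedroom', '1 RK']})
df2 = df2.copy()  # work on a real copy, not a view of another frame
df2.loc[:, 'bhk'] = df2['size'].apply(lambda s: int(s.split(' ')[0]))
print(df2['bhk'].tolist())  # [2, 4, 1]
```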
In [12]: df2['bhk'].unique()
Out[12]: array([ 2,  4,  3,  6,  1,  8,  7,  5, 11,  9, 27, 10, 19, 16, 43, 14, 12,
                13, 18], dtype=int64)
In [13]: df2[df2.bhk>20]
In [14]: df2.total_sqft.unique()
Out[14]: array(['1056', '2600', '1440', ..., '1133 - 1384', '774', '4689'],
               dtype=object)
In [16]: df2[~df2['total_sqft'].apply(is_float)].head()
Out[19]: 2475.0
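The definitions of `is_float` (used in In [16]) and of the converter behind Out[19] are not in this export. A sketch under those assumptions — `convert_sqft_to_num` is a hypothetical name, and averaging a range like '2100 - 2850' reproduces the 2475.0 seen above:

```python
# Sketch of the helpers this section relies on (names and details assumed).
def is_float(x):
    # True if the string parses as a single number, False for ranges etc.
    try:
        float(x)
    except (TypeError, ValueError):
        return False
    return True

def convert_sqft_to_num(x):
    # A range like '2100 - 2850' becomes its midpoint; plain numbers pass
    # through; anything else (e.g. '34.46Sq. Meter') stays unparsed.
    tokens = x.split('-')
    if len(tokens) == 2:
        return (float(tokens[0]) + float(tokens[1])) / 2
    try:
        return float(x)
    except (TypeError, ValueError):
        return None

print(convert_sqft_to_num('2100 - 2850'))  # 2475.0
```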
Feature Engineering
Add a new feature called price per square foot
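The cell itself is missing from this export; a minimal sketch, assuming prices are quoted in lakhs (1 lakh = 100,000 rupees) and the sample values are illustrative:

```python
import pandas as pd

# Sketch: derive price_per_sqft from price (in lakhs) and total_sqft.
df4 = pd.DataFrame({'total_sqft': [1056.0, 2600.0], 'price': [39.07, 120.0]})
df4['price_per_sqft'] = df4['price'] * 100000 / df4['total_sqft']
print(df4['price_per_sqft'].round(2).tolist())  # [3699.81, 4615.38]
```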
In [22]: df4.location.unique()
Out[22]: array(['Electronic City Phase II', 'Chikka Tirupathi', 'Uttarahalli', ...,
                '12th cross srinivas nagar banshankari 3rd stage',
                'Havanur extension', 'Abshot Layout'], dtype=object)
In [23]: len(df4.location.unique())
Out[23]: 1304
Out[24]:
location
Whitefield               535
Sarjapur Road            392
Electronic City          304
Kanakpura Road           266
Thanisandra              236
                        ...
1 Giri Nagar               1
Kanakapura Road,           1
Kanakapura main Road       1
Karnataka Shabarimala      1
whitefiled                 1
Name: location, Length: 1293, dtype: int64
In [25]: len(location_stats[location_stats<=10])
Out[25]: 1052
Dimensionality Reduction
Any location with 10 or fewer data points will be tagged as "other". This
reduces the number of categories by a huge amount, and later, when we do
one-hot encoding, it leaves us with far fewer dummy columns.
Out[26]:
location
Basapura                 10
1st Block Koramangala    10
Gunjur Palya             10
Kalkere                  10
Sector 1 HSR Layout      10
                         ..
1 Giri Nagar              1
Kanakapura Road,          1
Kanakapura main Road      1
Karnataka Shabarimala     1
whitefiled                1
Name: location, Length: 1052, dtype: int64
In [27]: len(df4.location.unique())
Out[27]: 1293
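The tagging cell is not shown in this export. A sketch of the step described above (variable names and the toy data are assumptions):

```python
import pandas as pd

# Sketch: relabel every location with 10 or fewer listings as 'other',
# which is what collapses the unique locations down to 242 below.
df4 = pd.DataFrame({'location': ['Whitefield'] * 12 + ['1 Giri Nagar']})
location_stats = df4['location'].value_counts()
rare = location_stats[location_stats <= 10]          # 10 or fewer listings
df4['location'] = df4['location'].apply(lambda x: 'other' if x in rare else x)
print(sorted(df4['location'].unique()))  # ['Whitefield', 'other']
```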
Out[28]: 242
In [29]: df4.head(5)
Check the data points above. We have a 6 BHK apartment with only 1020 sqft; another
has 8 BHK in just 600 sqft. These are clear data errors that can be removed safely.
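A sketch of such a filter, assuming a minimum of 300 sqft per BHK as the threshold (the exact cutoff used is not shown in this export):

```python
import pandas as pd

# Sketch: drop rows where the square footage per bedroom is implausibly
# small (under an assumed 300 sqft per BHK). Sample rows are illustrative.
df4 = pd.DataFrame({'total_sqft': [1020.0, 600.0, 2600.0],
                    'bhk': [6, 8, 4]})
df5 = df4[~(df4['total_sqft'] / df4['bhk'] < 300)]
print(df5.shape[0])  # 1  (only the 4 BHK / 2600 sqft row survives)
```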
In [31]: df4.shape
Out[31]: (13246, 7)
Out[32]: (12502, 7)
Out[34]: (10241, 7)
Let's check how 2 BHK and 3 BHK property prices compare for a given location.
In [35]: def plot_scatter_chart(df,location):
             bhk2 = df[(df.location==location) & (df.bhk==2)]
             bhk3 = df[(df.location==location) & (df.bhk==3)]
             matplotlib.rcParams['figure.figsize'] = (15,10)
             plt.scatter(bhk2.total_sqft,bhk2.price,color='blue',label='2 BHK',s=50)
             plt.scatter(bhk3.total_sqft,bhk3.price,marker='+',color='green',label='3 BHK',s=50)
             plt.xlabel('Total Square Feet Area')
             plt.ylabel('Price')
             plt.title(location)
             plt.legend()

         plot_scatter_chart(df6, 'Rajaji Nagar')
Now we can remove those 2 BHK apartments whose price_per_sqft is less than the mean
price_per_sqft of 1 BHK apartments in the same location.
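The removal cell is not in this export. A sketch of a per-location filter implementing the rule above (the `remove_bhk_outliers` name and the count > 5 guard are assumptions):

```python
import numpy as np
import pandas as pd

def remove_bhk_outliers(df):
    # Assumed implementation: within each location, drop n-BHK rows whose
    # price_per_sqft is below the mean price_per_sqft of the (n-1)-BHK rows
    # at the same location, provided enough (n-1)-BHK samples exist.
    exclude_indices = np.array([])
    for _, location_df in df.groupby('location'):
        bhk_stats = {}
        for bhk, bhk_df in location_df.groupby('bhk'):
            bhk_stats[bhk] = {'mean': np.mean(bhk_df.price_per_sqft),
                              'count': bhk_df.shape[0]}
        for bhk, bhk_df in location_df.groupby('bhk'):
            stats = bhk_stats.get(bhk - 1)
            if stats and stats['count'] > 5:   # guard threshold is assumed
                exclude_indices = np.append(
                    exclude_indices,
                    bhk_df[bhk_df.price_per_sqft < stats['mean']].index.values)
    return df.drop(exclude_indices, axis='index')

# usage sketch: df7 = remove_bhk_outliers(df6)
```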
Out[37]: (7329, 7)
Plot the scatter chart again to visualize price_per_sqft for 2 BHK and 3 BHK properties.
In [40]: plt.figure(figsize=(20,10))
plt.hist(df7.price_per_sqft,rwidth = 0.8)
plt.xlabel('Price Per Square Feet')
plt.ylabel('Count')
Out[41]: array([ 4.,  3.,  2.,  5.,  8.,  1.,  6.,  7.,  9., 12., 16., 13.])
In [42]: df7[df7.bath>10]
In [43]: plt.hist(df7.bath,rwidth=0.8)
plt.xlabel('Number of bathrooms')
plt.ylabel('Count')
In [45]: df7[df7.bath>df7.bhk+2]
In [46]: df8 = df7[df7.bath<df7.bhk+2]
         df8.shape
Out[46]: (7251, 7)
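The encoding cell is missing from this export. A sketch of one-hot encoding with pd.get_dummies that would produce dummy columns like those shown below (dropping one category avoids the dummy-variable trap; the frame names and toy rows are assumptions):

```python
import pandas as pd

# Sketch: one 0/1 column per location, with the 'other' dummy dropped.
df10 = pd.DataFrame({'location': ['1st Block Jayanagar', 'Whitefield', 'other'],
                     'total_sqft': [2850.0, 1630.0, 1875.0]})
dummies = pd.get_dummies(df10['location'])
df11 = pd.concat([df10, dummies.drop('other', axis='columns')], axis='columns')
df12 = df11.drop('location', axis='columns')  # locations now live in dummies
print(list(df12.columns))
```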
Out[..]: (head of the one-hot location dummies: the '1st Block Jayanagar' column is 1 for these first five rows and every other dummy column is 0)
Out[..]:
              location  total_sqft  bath  price  bhk  1st Block Jayanagar  ...
0  1st Block Jayanagar      2850.0   4.0  428.0    4                    1  ...  0
1  1st Block Jayanagar      1630.0   3.0  194.0    3                    1  ...  0
2  1st Block Jayanagar      1875.0   2.0  235.0    3                    1  ...  0
In [52]: X = df11.drop('price',axis='columns')
X.head()
In [53]: y = df11.price
         y.head()
Out[53]:
0    428.0
1    194.0
2    235.0
3    130.0
4    148.0
Name: price, dtype: float64
Out[55]: 0.845227769787429
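The split/fit cells behind Out[55] and the "5 iterations" remark are not in this export. A sketch under those assumptions (an 80/20 train_test_split, a LinearRegression fit, and a 5-split ShuffleSplit cross-validation), with synthetic data standing in for X and y:

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import ShuffleSplit, cross_val_score, train_test_split

# Synthetic stand-ins for the notebook's X and y.
rng = np.random.default_rng(0)
X = rng.random((200, 3))
y = X @ np.array([3.0, 1.5, -2.0]) + rng.normal(0, 0.1, 200)

# Hold out 20% and fit a plain linear regression.
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=10)
lr_clf = LinearRegression()
lr_clf.fit(X_train, y_train)
print(lr_clf.score(X_test, y_test))  # R^2 on the held-out 20%

# Five shuffled 80/20 splits, one R^2 score per split.
cv = ShuffleSplit(n_splits=5, test_size=0.2, random_state=0)
print(cross_val_score(LinearRegression(), X, y, cv=cv))
```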
We can see that across 5 iterations we get a score above 80% every time. This is pretty
good, but we want to test a few other regression algorithms to see if we can get an even
better score. We will use GridSearchCV for this purpose.
In [57]: def find_best_model_using_gridsearchcv(X,y):
             algos = {
                 'linear_regression' : {
                     'model': LinearRegression(),
                     'params': {
                         'normalize': [True, False]
                     }
                 },
                 'lasso': {
                     'model': Lasso(),
                     'params': {
                         'alpha': [1,2],
                         'selection': ['random', 'cyclic']
                     }
                 },
                 'decision_tree': {
                     'model': DecisionTreeRegressor(),
                     'params': {
                         'criterion' : ['mse','friedman_mse'],
                         'splitter': ['best','random']
                     }
                 }
             }
             scores = []
             cv = ShuffleSplit(n_splits=5, test_size=0.2, random_state=0)
             for algo_name, config in algos.items():
                 gs = GridSearchCV(config['model'], config['params'], cv=cv, return_train_score=False)
                 gs.fit(X,y)
                 scores.append({
                     'model': algo_name,
                     'best_score': gs.best_score_,
                     'best_params': gs.best_params_
                 })
             return pd.DataFrame(scores,columns=['model','best_score','best_params'])

         find_best_model_using_gridsearchcv(X,y)
C:\Users\47455\anaconda3\lib\site-packages\sklearn\linear_model\_base.py:141: FutureWarning: 'normalize' was deprecated in version 1.0 and will be removed in 1.2.
If you wish to scale the data, use Pipeline with a StandardScaler in a preprocessing stage. To reproduce the previous behavior:
  from sklearn.pipeline import make_pipeline
If you wish to pass a sample_weight parameter, you need to pass it as a fit parameter to each step of the pipeline as follows:
  warnings.warn(
C:\Users\47455\anaconda3\lib\site-packages\sklearn\linear_model\_base.py:148: FutureWarning: 'normalize' was deprecated in version 1.0 and will be removed in 1.2. Please leave the normalize parameter to its default value to silence this warning. The default behavior of this estimator is to not do any normalization. If normalization is needed please use sklearn.preprocessing.StandardScaler instead.
  warnings.warn(
C:\Users\47455\anaconda3\lib\site-packages\sklearn\tree\_classes.py:359: FutureWarning: Criterion 'mse' was deprecated in v1.0 and will be removed in version 1.2. Use `criterion='squared_error'` which is equivalent.
  warnings.warn(
Out[57]: model best_score best_params
Based on the above results we can say that LinearRegression gives the best score, so we
will use it.
def predict_price(location,sqft,bath,bhk):
    # position of this location's dummy column in X (-1 if not present)
    loc_index = np.where(X.columns==location)[0][0] if location in X.columns else -1
    x = np.zeros(len(X.columns))
    x[0] = sqft
    x[1] = bath
    x[2] = bhk
    if loc_index >= 0:
        x[loc_index] = 1
    return lr_clf.predict([x])[0]