Capstone Project 6 April
By Nandini Priya M
Date: 06-04-2024
INDEX -
1) Introduction of the business problem
1) Introduction of the business problem -
The objective is to analyze the housing dataset and determine which property features have the most
significant impact on price. This will help the organization:
1. Predict property values more accurately.
2. Identify neighborhoods or home types with high or low investment potential.
3. Assist buyers with better-informed decisions.
4. Guide city planning and affordable housing strategies.
Dataset Description:
The dataset contains over 21,000 property listings with features such as:
Number of bedrooms and bathrooms
Living area and lot size
Quality and condition ratings
Year built/renovated
Location (zipcode, latitude, longitude)
Features like basement, floors, waterfront, and furnishings
Data Report -
Sample of the Dataset -
                   0     1     2     3     4
room_bed           4     2     4     3     2
ceil               1     1     2     2     1
coast              0     0     1     0     0
sight              0     0     4     0     0
condition          3     4     3     3     3
quality            8     6     8     8     7
basement        1250     0     0     0     0
yr_renovated       0     0     0     0     0
furnished          0     0     0     0     0
Table 1 – Sample of the dataset
The table above shows a sample of the data, which contains 21,613 rows and 23 columns. (The table is
transposed so that it fits on the page.)
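A transposed preview of this kind can be produced with pandas. The sketch below is only illustrative: it builds a small DataFrame from three of the columns in Table 1 (values copied from the table) rather than loading the full dataset, and the variable names are assumptions.

```python
import pandas as pd

# Toy frame holding three of the columns shown in Table 1 (values copied from the table).
df = pd.DataFrame({
    "room_bed": [4, 2, 4, 3, 2],
    "ceil": [1, 1, 2, 2, 1],
    "quality": [8, 6, 8, 8, 7],
})

# head() returns the first rows; .T transposes so that features become rows.
preview = df.head().T
print(preview)
print(df.shape)  # (rows, columns) of the toy frame, not of the full dataset
```

On the full dataset, `df.shape` would report the row and column counts quoted above.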
Descriptive Details -
General Observations:
Large Dataset: With a count of over 21,000 for most features, you have a reasonably large dataset,
which can support more robust statistical analysis and modeling.
Missing Values: The date_house_sold column has nan in its descriptive statistics, indicating missing
values in this column. You'll need to address these (e.g., imputation or removal) before using this
feature in modeling.
Varied Scales: The features have vastly different scales (e.g., price in millions, room_bed in single
digits, lat and long around specific values). This suggests that feature scaling (like standardization
or normalization) will be important for many machine learning algorithms.
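As a minimal sketch of what standardization does, the example below rescales two columns with very different magnitudes to zero mean and unit standard deviation. The values are hypothetical and not taken from the dataset.

```python
import pandas as pd

# Hypothetical values on very different scales (not taken from the dataset).
df = pd.DataFrame({
    "price": [300000.0, 450000.0, 1200000.0, 540000.0],
    "room_bed": [2.0, 3.0, 5.0, 4.0],
})

# Standardization: subtract each column's mean and divide by its
# standard deviation (population form, ddof=0).
standardized = (df - df.mean()) / df.std(ddof=0)
print(standardized.round(3))
```

After this step both columns are on a comparable scale, which matters for distance-based and gradient-based algorithms.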
Insights into Specific Features:
Price:
o The average price (mean) is approximately 540,182.
o The price ranges from a minimum of 75,000 to a maximum of 7,700,000, indicating a
wide variety of property values.
o The standard deviation (std) of around 387,382 is quite high, suggesting significant price
dispersion around the mean.
Room Features (room_bed, room_bath):
o The average number of bedrooms is around 3.7, with most properties having between 3 and 4
bedrooms (as seen in the quartiles). The maximum of 33 seems like a potential outlier or error.
o The average number of bathrooms is around 2.1, with the interquartile range (25th to
75th percentile) being between 1.75 and 2.5.
Size Features (living_measure, lot_measure, ceil_measure, basement, total_area):
o These features show a wide range of values and relatively high standard deviations, indicating
significant differences in property sizes.
o The minimum values being 0 for some of these (e.g., basement, total_area) might indicate
the absence of that feature in some properties or potential data entry issues.
Location Features (zipcode, lat, long):
o The zipcode has a relatively small standard deviation compared to its mean, suggesting
properties are clustered within a limited number of zip codes.
o The lat and long values have small standard deviations, indicating the properties are
geographically concentrated. The mean values give you a sense of the general location
(though you'd need a map to pinpoint it).
Categorical/Binary-like Features (coast, sight, condition, quality, furnished):
o coast has a mean of 0.00745, suggesting that very few properties in the dataset are classified
as being on the coast (assuming 1 represents coastal).
o sight has a mean of 0.23, indicating that a small proportion of properties have a
noteworthy view (assuming higher values represent better sight).
o condition has a mean of around 3.4, with most properties having a condition score between
3 and 4. The range is relatively small (1 to 5), suggesting a defined scale.
o quality has a mean of around 7.7, with a wider range (1 to 13), indicating a more granular
scale for property quality.
o furnished has a very low mean (0.19), suggesting that most properties in the dataset are
not furnished (assuming 1 represents furnished).
Age/Renovation Features (yr_built, yr_renovated):
o The average year built is around 1969, with properties ranging from 0 (potentially an error
or placeholder) to 2015.
o The average year renovated is around 84, which is unusual if it represents an actual year.
It's more likely that 0 indicates no renovation, and non-zero values represent the year of
renovation (though the scale seems off). Further investigation into the data dictionary or
context is needed for yr_renovated.
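The effect of a zero placeholder on the mean can be illustrated with a short sketch. It assumes that 0 means "never renovated"; the values are hypothetical.

```python
import pandas as pd

# Hypothetical yr_renovated values; 0 is assumed to mean "never renovated".
yr_renovated = pd.Series([0, 0, 0, 0, 1990, 0, 2005, 0, 0, 0])

# Mean over all rows is dragged towards 0 by the placeholder.
mean_with_zeros = yr_renovated.mean()

# Mean over actual renovation years only.
renovated_only = yr_renovated[yr_renovated > 0]
mean_renovated = renovated_only.mean()

# Fraction of houses with a recorded renovation.
share_renovated = (yr_renovated > 0).mean()

print(mean_with_zeros, mean_renovated, share_renovated)
```

A very low overall mean is therefore consistent with a column in which most entries are 0 and the rest are real years.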
Past Measurement Features (living_measure15, lot_measure15):
o Comparing these to the current measurements (living_measure, lot_measure) could provide
insights into how property sizes have changed over time due to renovations or other
factors. The means and standard deviations are different, suggesting some level of change.
Entries: The DataFrame contains 21,613 rows (entries), indexed from 0 to 21612. This confirms the
large size of your dataset as mentioned earlier.
Columns: There are a total of 23 columns, each representing a different feature or attribute of the
properties.
Data Types:
Numerical Features (float64): The majority of your features (18 columns) are stored as float64.
This is suitable for numerical data that may have decimal values, such as measurements (living
area, lot area), coordinates (latitude, longitude), and potentially ratings (condition, quality).
Integer Features (int64): You have 4 columns stored as int64:
o cid (likely a unique property identifier)
o yr_built (year the house was built)
o yr_renovated (year of renovation, with 0 likely indicating no renovation)
o zipcode (postal code)
Datetime Feature (datetime64[ns]): The date_house_sold column is correctly recognized
as datetime objects, which is essential for time-based analysis.
Missing Values:
This is a crucial insight from the .info() output:
Missing Values Present: Several columns have a "Non-Null Count" that is less than the total
number of entries (21613). This indicates the presence of missing values in these columns:
o price: 21613 non-null (No missing values in price itself)
o room_bed: 21505 non-null (108 missing values)
o room_bath: 21505 non-null (108 missing values)
o living_measure: 21596 non-null (17 missing values)
o lot_measure: 21571 non-null (42 missing values)
o ceil: 21571 non-null (42 missing values)
o coast: 21612 non-null (1 missing value)
o sight: 21556 non-null (57 missing values)
o condition: 21556 non-null (57 missing values)
o quality: 21612 non-null (1 missing value)
o ceil_measure: 21612 non-null (1 missing value)
o basement: 21612 non-null (1 missing value)
o living_measure15: 21447 non-null (166 missing values)
o lot_measure15: 21584 non-null (29 missing values)
o furnished: 21584 non-null (29 missing values)
o total_area: 21584 non-null (29 missing values)
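The missing-value counts follow from subtracting each non-null count from the total number of rows. A minimal sketch on a hypothetical four-row frame (column names reused from the report, values invented):

```python
import numpy as np
import pandas as pd

# Hypothetical frame with a few gaps (values are not from the dataset).
df = pd.DataFrame({
    "price": [300000.0, 450000.0, 540000.0, 720000.0],
    "room_bed": [3.0, np.nan, 4.0, 2.0],
    "living_measure15": [np.nan, 1800.0, np.nan, 2100.0],
})

# Missing = total rows - non-null count; equivalent to df.isna().sum().
missing = len(df) - df.count()
print(missing)
```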
2) EDA and Business Implications -
Univariate Analysis -
price
Property prices are highly skewed: while most homes are moderately priced, prices at the top 1% (99th percentile)
spike dramatically, nearing 2 million, indicating a small segment of high-value luxury properties in the dataset.
The median home price is ₹450,000, meaning that half the homes are priced below this value; this reflects the central tendency of the
market.
There is a steep jump at the 99th percentile, where the price reaches ₹1,968,800, suggesting a small group of very high-end properties
inflating the upper end of the market.
The price distribution is right-skewed, with most homes priced between ₹150k and ₹650k, showing a concentration in the mid-to-lower
range.
The majority of properties are priced between 200k and 500k, indicating a concentration in the mid-tier housing market.
Only 25 properties are under 100k, showing that affordable housing is extremely limited.
Interestingly, there is also a substantial number (3792) of high-value properties priced above 500k, reflecting demand for premium
housing.
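Percentiles and price-band counts like the ones above can be computed with `quantile` and `pd.cut`. The sketch below uses ten hypothetical prices and band edges chosen for illustration; neither comes from the dataset.

```python
import pandas as pd

# Hypothetical prices (not from the dataset).
price = pd.Series([95000, 180000, 250000, 320000, 450000,
                   480000, 520000, 650000, 900000, 2000000])

median_price = price.quantile(0.50)  # 50th percentile
p99 = price.quantile(0.99)           # 99th percentile

# Band edges are an assumption; intervals are right-closed, e.g. (200k, 500k].
bins = [0, 100_000, 200_000, 500_000, float("inf")]
labels = ["<100k", "100k-200k", "200k-500k", ">500k"]
bands = pd.cut(price, bins=bins, labels=labels)
counts = bands.value_counts()

print(median_price, p99)
print(counts)
```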
The histogram shows that house prices are most concentrated between ₹250,000 and ₹450,000, with the peak around ₹350,000.
The distribution is slightly right-skewed, indicating that while most houses are moderately priced, there are still a notable number of
higher-priced properties.
Very few houses are priced below ₹150,000 or above ₹600,000, highlighting the dominance of mid-range housing.
After log transformation, the price distribution becomes more symmetric, closely resembling a normal (bell-shaped) distribution.
This indicates that the log transformation effectively reduced the right skewness observed in the original price distribution.
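The effect of a log transformation on skewness can be checked numerically. The sketch below uses hypothetical right-skewed prices and the natural logarithm; the exact transform used in the analysis is not stated, so this is an assumption.

```python
import numpy as np
import pandas as pd

# Hypothetical right-skewed prices (not from the dataset).
price = pd.Series([150000, 220000, 300000, 350000, 400000,
                   450000, 520000, 650000, 1200000, 3500000], dtype=float)

# Natural log compresses the long right tail.
log_price = np.log(price)

skew_raw = price.skew()      # sample skewness of the raw prices
skew_log = log_price.skew()  # sample skewness after the transform

print(round(skew_raw, 2), round(skew_log, 2))
```

A skewness closer to zero after the transform is what "more symmetric" means in practice.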
Majority Without Basement: The value 0.0 has a count of 10391. Assuming that 0.0 represents the absence of a basement (values of 0
in basement_measure were mapped to "0"), this indicates that a majority of the properties in the house_price_1 DataFrame do not
have a basement.
Significant Number With Basement: The value 1.0 has a count of 5808. Assuming that 1.0 represents the presence of a basement
(mapped from positive values in basement_measure), this shows that a substantial number of properties in the dataset do have a
basement.
Replace all missing (blank) values with zero for the columns 'ceil', 'coast', 'condition', 'yr_built',
'long', 'total_area'.
Replace missing values of room_bed and room_bath with the mode.
Percentage of zeros in the 'ceil' column after imputing ceil = zero with the median.
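The three imputation steps above can be sketched as follows. The frame is hypothetical, and the choice to replace zero `ceil` values with the median of the non-zero values is an assumption, since the report does not say which median was used.

```python
import numpy as np
import pandas as pd

# Hypothetical frame (values are not from the dataset).
df = pd.DataFrame({
    "room_bed": [3.0, 4.0, np.nan, 3.0, 2.0],
    "coast": [0.0, np.nan, 0.0, 1.0, 0.0],
    "ceil": [1.0, 0.0, 2.0, 2.0, 1.5],
})

# Step 1: missing values in a flag-like column become 0.
df["coast"] = df["coast"].fillna(0)

# Step 2: missing room counts are filled with the most frequent value (mode).
df["room_bed"] = df["room_bed"].fillna(df["room_bed"].mode()[0])

# Step 3 (assumption): zeros in ceil are replaced by the median of non-zero values.
nonzero_median = df.loc[df["ceil"] > 0, "ceil"].median()
df["ceil"] = df["ceil"].replace(0, nonzero_median)

# Percentage of zeros left in ceil after imputation.
pct_zero_ceil = (df["ceil"] == 0).mean() * 100
print(df)
print(pct_zero_ceil)
```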
1. Price:
Histogram: The distribution of price appears to be right-skewed. This means that there are a larger number of properties with lower
prices, and fewer properties with very high prices. The peak of the distribution is towards the lower end of the price range.
Boxplot: The boxplot confirms the right skewness. The median line is closer to the first quartile, and the whisker extends further on the
higher price side. There are also several outliers indicated by individual points above the upper whisker, representing very expensive
properties.
Density Plot: The density plot shows a primary peak at a lower price point, with a long tail extending towards higher prices. There might
be a secondary, smaller peak or shoulder at a slightly higher price range, suggesting potential clusters in the price distribution.
2. Living Measure:
Histogram: The distribution of living measure also appears to be right-skewed. Most properties have a smaller living area, with fewer
properties having very large living spaces.
Boxplot: Similar to price, the boxplot shows the median closer to the lower quartile and a longer upper whisker. There are also
numerous outliers above the upper whisker, indicating some very large properties.
Density Plot: The density plot shows a clear peak at a lower living measure and a tail extending towards larger values.
3. Lot Measure:
Histogram: The distribution of lot measure is heavily right-skewed. The vast majority of properties have relatively small lot sizes, with a
very small number of properties having extremely large lots.
Boxplot: The boxplot strongly emphasizes the right skewness. The box itself is compressed towards the lower end, and there are a very
large number of outliers above the upper whisker, indicating exceptionally large lots.
Density Plot: The density plot shows a very sharp peak at a small lot measure and a very long, thin tail extending to the right, confirming
the extreme right skewness.
4. Room Bed:
Histogram: The distribution of the number of bedrooms is somewhat discrete and appears to have a few prominent peaks, likely
corresponding to common numbers of bedrooms (e.g., 3, 4). There's a significant frequency for a lower number of bedrooms, and the
frequency generally decreases as the number of bedrooms increases, although there might be smaller peaks at higher values. There
seems to be a potential outlier or less frequent occurrence at a very high number of bedrooms.
Boxplot: The boxplot shows a median around 3 bedrooms, with the interquartile range likely covering 3 and 4 bedrooms. There are
outliers on the higher end, representing properties with an unusually large number of bedrooms. The lower whisker extends to a smaller
number of bedrooms.
Density Plot: The density plot reflects the discrete nature with multiple humps, corresponding to the common numbers of bedrooms.
5. Room Bath:
Histogram: Similar to the number of bedrooms, the distribution of the number of bathrooms is also discrete with noticeable peaks, likely
at common numbers of bathrooms (e.g., 1, 2, 2.5). The distribution seems somewhat multimodal.
Boxplot: The boxplot shows a median likely around 2 bathrooms, with the interquartile range covering a common range of bathroom
counts. There are outliers on the higher end, indicating properties with a very large number of bathrooms.
Density Plot: The density plot shows multiple peaks, reflecting the common discrete values for the number of bathrooms.
Overall Inferences:
Right Skewness in Continuous Variables: The continuous variables (price, living_measure, lot_measure) all exhibit a right-skewed
distribution, indicating the presence of some very large values (high prices, large living areas, large lots) that pull the mean to the right of
the median.
Outliers: All the continuous variables show the presence of outliers, which might require investigation and potential handling
depending on the analysis goals.
Discrete Distributions for Room Features: The number of bedrooms and bathrooms have discrete distributions with peaks at
common integer or half-integer values.
The ceil_measure distribution is right-skewed, with most values concentrated between 1000 and 2000, indicating that typical ceiling
measurements fall within this range, while very high values are rare.
The boxplot for ceil_measure shows that the data is right-skewed with a large number of outliers above the upper whisker, indicating
the presence of unusually high ceiling measurements.
The histogram for basement_measure is highly right-skewed, with a large concentration of properties having zero basement area,
indicating that most houses do not have basements.
The boxplot for basement_measure shows that most properties have a relatively small or zero basement size, with a significant number
of outliers indicating some properties have exceptionally large basements. The median basement size is noticeably lower than the upper
The histogram for living_measure15 shows a distribution that is right-skewed, indicating that most properties had a smaller living area 15
years prior, with a tail extending towards larger living areas. The peak of the distribution is between approximately 1250 and 1750.
The boxplot for living_measure15 confirms a right-skewed distribution with a median around 1750 and a significant number of outliers
above the upper whisker, indicating some properties had considerably larger living areas 15 years ago. The lower whisker extends to a
value around 500.
The histogram for lot_measure15 is extremely right-skewed, with the vast majority of properties having a small lot size 15 years ago, and
a very long tail indicating a few properties with exceptionally large lots. The primary peak is very close to zero.
The boxplot for lot_measure15 confirms a highly right-skewed distribution with a very low median and a large number of extreme
outliers above the upper whisker, indicating that while most properties had small lot sizes 15 years ago, some had exceptionally
large lots.
yr_built -
Most of the houses were built between 1900 and 2015.
yr_renovated -
Almost 20,000 houses have not been renovated. The renovated houses were renovated between 1934 and 2015.
total_area -
We can see that the distribution is right-skewed. There are 4 houses with more than 1,000,000
square feet of total area.
1. Room (bedrooms):
Most houses have 3 bedrooms (50.8%), followed by 4 bedrooms (26.3%). Houses with more than 5 bedrooms are very rare.
2. Room (bathrooms):
Bathroom counts vary more widely, but the majority have 2.5 or fewer bathrooms, with 1.0, 1.75, and 2.0 being the most common.
4. Coast proximity:
An overwhelming majority of homes (99.8%) are not near the coast, making coastal properties very rare.
5. Sight (view):
95.1% of homes have no special view (sight = 0), while only a small fraction enjoy any notable view.
6. Condition:
The majority of homes are in average to good condition, with condition 3 (64.9%) and condition 4 (26.7%) being the most common.
7. Quality:
Most homes are of quality 7 (51.1%) and 8 (28.2%), indicating a high concentration of moderately good construction quality.
8. Furnished:
93.4% of homes are unfurnished, showing that furnished homes are quite uncommon in the dataset.
Overall insight:
The dataset predominantly contains standard-quality, unfurnished homes in average condition, with 3 bedrooms, 2 bathrooms, and no
special location advantages (like coast or view).
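Category shares like the percentages above can be obtained with `value_counts(normalize=True)`. The bedroom counts in this sketch are hypothetical and do not reproduce the figures quoted in the list.

```python
import pandas as pd

# Hypothetical bedroom counts (not from the dataset).
room_bed = pd.Series([3, 3, 4, 3, 2, 4, 3, 5, 3, 4])

# normalize=True returns fractions; multiply by 100 for percentages.
share = room_bed.value_counts(normalize=True) * 100
most_common = share.idxmax()

print(share.round(1))
print(most_common)
```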
Bivariate Analysis -
Pair plot -
We can see that price is right-skewed. Most of the variables don’t have a clear relationship with the price
variable.
Pearson correlation -
Fig 20 – Pearson Correlation
From the above heatmap we can see that most of the variables are correlated with each other.
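The numbers behind such a heatmap are a Pearson correlation matrix. As a minimal sketch on hypothetical values (column names reused from the report), the matrix and the ranking of features by correlation with price could be computed like this:

```python
import pandas as pd

# Hypothetical values (not from the dataset).
df = pd.DataFrame({
    "price": [300000, 450000, 520000, 700000, 900000],
    "living_measure": [1200, 1800, 2000, 2600, 3400],
    "lot_measure": [9000, 5000, 12000, 4000, 8000],
})

# Pairwise Pearson correlation between all numeric columns.
corr = df.corr(method="pearson")

# Correlation of each feature with price, strongest positive first.
corr_with_price = corr["price"].drop("price").sort_values(ascending=False)

print(corr.round(2))
print(corr_with_price.round(2))
```

Ranking by correlation with the target is usually more informative than reading the full matrix cell by cell.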
Bivariate analysis of year on price
The boxplots suggest that the median house price was slightly higher in 2015 compared to 2014. However, the
overall distribution of prices, as indicated by the interquartile range and the length of the whiskers, appears quite
similar in both years, suggesting no drastic change in the price range.
The boxplots generally show a positive correlation between the number of bedrooms and house price, with median prices tending to
increase as the number of bedrooms increases up to a certain point. However, there's considerable overlap in the price distributions
across different bedroom counts, and the variability in price also seems to increase with more bedrooms. The data also includes some
less common instances with a very high number of bedrooms (e.g., 11 and 33), which show high price points but might represent outliers
or different property types.
The boxplots generally illustrate a positive relationship between the number of bathrooms and house price,
with median prices tending to increase as the number of bathrooms goes up. However, the relationship isn't
strictly linear, and there's considerable overlap in price distributions across different bathroom counts,
especially in the mid-range. Properties with a very high number of bathrooms (e.g., above 5) tend to have higher
prices, but the number of data points in these categories might be smaller.
The scatter plot generally shows a trend of increasing house price with a higher number of bathrooms
(room_bath), particularly noticeable up to around 4 bathrooms. Similar to the number of bedrooms, there's a
significant range of prices for any given number of bathrooms, indicating other factors also strongly influence the
final value. The vertical lines of points reflect the discrete nature of bathroom counts (including half bathrooms).
The scatter plot illustrates a tendency for house prices to increase with the number of bedrooms, especially
within the common range of 1 to 6 bedrooms. However, for any given number of bedrooms, there's a significant
range of prices, suggesting other factors strongly influence the final value.
The scatter plot shows a weak positive correlation between lot measure and house price, particularly for smaller
lot sizes. While prices tend to be higher for slightly larger lots, the relationship becomes much less clear and more
scattered as the lot measure increases, suggesting other factors have a stronger influence on price for larger
properties. There are also many properties with small lot measures spanning a wide range of prices.
The scatter plot shows a clear positive correlation between living measure and house price, as indicated by the upward-sloping
trendline. Generally, as the living area of a property increases, its price also tends to increase. However, there is still
considerable scatter around the trendline, suggesting that living measure is not the only factor determining house price, and
other variables also play a significant role.
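A trendline of this kind is typically an ordinary least-squares straight line. The sketch below fits one with `numpy.polyfit` on hypothetical points; how the trendline in the plot was actually produced is not stated, so this is only an illustration.

```python
import numpy as np

# Hypothetical living areas and prices (not from the dataset).
living_measure = np.array([1000.0, 1500.0, 2000.0, 2500.0, 3000.0])
price = np.array([250000.0, 340000.0, 480000.0, 560000.0, 700000.0])

# Least-squares straight line: price ~ slope * living_measure + intercept.
slope, intercept = np.polyfit(living_measure, price, deg=1)
trend = slope * living_measure + intercept

# Pearson correlation between the two variables.
r = np.corrcoef(living_measure, price)[0, 1]

print(slope, intercept, r)
```

The residuals `price - trend` correspond to the scatter around the line that the other variables would need to explain.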
The scatter plot suggests a weak positive correlation between lot measure and house price, as indicated by the
slightly upward-sloping trendline. While there's a general tendency for prices to be higher for larger lots,
especially beyond a certain threshold, the relationship is not strong, and there's a wide spread of prices for
properties with similar lot measures. This indicates that other factors likely have a more substantial impact on the
final house price.
The scatter plot shows a positive correlation between the number of bedrooms and house price, with the
trendline indicating a general increase in price as the number of bedrooms increases. However, the relationship
appears somewhat stepped due to the discrete nature of bedroom counts, and there's a significant spread in
prices for a given number of bedrooms, especially in the lower range. The presence of data points with a very high
number of bedrooms (beyond the typical range) also correlates with higher prices, though these are less frequent.
The scatter plot reveals a positive correlation between the number of bathrooms and house price, with the
trendline showing a general increase in price as the number of bathrooms increases. Similar to the number of
bedrooms, the relationship appears somewhat stepped due to the discrete nature of bathroom counts, and
there's a noticeable spread in prices for properties with the same number of bathrooms. Properties with a higher
number of bathrooms tend to command higher prices overall.
The scatter plot indicates a weak positive correlation between basement measure and house price, as shown by
the slightly upward-sloping trendline. While there's a general tendency for prices to be somewhat higher for
properties with larger basements, the relationship is not very strong, and there's a significant amount of price
variation for properties with similar basement measures, especially for smaller basement sizes. This suggests that
other factors likely have a more substantial influence on house price than just the basement size.
The scatter plot shows a positive correlation between ceil measure and house price, with the trendline indicating
a general increase in price as the ceil measure increases. While there's a clear upward trend, there's also a
considerable spread of prices for a given ceil measure, suggesting that other factors contribute significantly to the
final price. The relationship seems stronger for larger ceil measures.
The scatter plot suggests a general tendency for house prices to be higher with an increasing number of
ceilings/stories (ceil), although the relationship isn't strictly linear and shows considerable overlap in price ranges
for different ceiling counts. Properties with 2.5 and 3.5 ceilings appear to have a wider range of higher prices
compared to those with fewer ceilings.
There is a slight increase in price for houses with a waterfront view.
Fig 26 – Lat and Long of House
Business Implications -
Core Value Drivers:
Living Space is Key: The strong positive correlation between living_measure and price across
multiple visualizations consistently highlights that the size of the living area is a primary driver of
house value. This emphasizes the importance of maximizing usable indoor space in
development and highlighting it in sales.
Bedrooms and Bathrooms Matter: The positive correlation between the number of bedrooms and
bathrooms with price suggests that these functional aspects significantly influence property value,
catering to different household sizes and needs. Developers should consider the optimal mix of
bedroom/bathroom counts based on target markets.
Basement Presence: The significant number of properties with and without basements, coupled
with the potential (though not directly visualized against price) for basement size to add value,
indicates that basement features are a relevant factor in the market, especially in areas where
they are common.
Land Size Influence:
Diminishing Returns for Lot Size: The weaker correlation of lot_measure with price, especially for
larger lots, suggests that beyond a certain point, simply having more land doesn't proportionally
increase the price. Value might be tied more to the usable land or other property features on the
lot.
Market Dynamics and Trends:
Slight Price Appreciation (2014-2015): The minor increase in median price from 2014 to 2015
suggests a slightly appreciating market during that period. Real estate professionals can use
such trend analysis (over longer periods with more data) to advise on investment timing.
Right-Skewed Distributions: The right skewness in price, living measure, and lot measure indicates
a market with a larger number of more affordable/smaller properties and a smaller segment of
high-end/larger properties. This informs inventory management and marketing strategies for
different price points.
Outliers Represent Opportunities or Errors: The presence of outliers in several features (price,
size measures) could represent unique, high-value properties or potential data entry errors that
need investigation. Identifying and understanding these outliers can be valuable for niche markets
or data quality control.
Temporal Considerations:
Changes Over Time (living_measure15, lot_measure15): The differences in distributions between
current and past measurements suggest potential trends in property development and land use
over the 15-year period. This historical data can inform long-term investment and planning
strategies.
The table shows significant variation in both the mean and median house prices across different zip codes. This
indicates that location, as represented by zip code, is a strong determinant of property value in this
dataset.
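A per-zip-code summary table can be built with a grouped aggregation. In the sketch below the zip codes and prices are hypothetical placeholders, not values from the dataset.

```python
import pandas as pd

# Hypothetical zip codes and prices (not from the dataset).
df = pd.DataFrame({
    "zipcode": [98001, 98001, 98004, 98004, 98004],
    "price": [250000, 310000, 900000, 1200000, 1500000],
})

# Mean, median and number of listings per zip code, highest median first.
by_zip = df.groupby("zipcode")["price"].agg(["mean", "median", "count"])
by_zip = by_zip.sort_values("median", ascending=False)

print(by_zip)
```

Including the count alongside the mean and median helps to spot zip codes whose averages rest on only a handful of listings.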
The table suggests a non-linear relationship between property condition and price. While the highest
mean and median prices are observed for conditions 0.0 and 5.0, lower condition scores (1.0 and 2.0)
correspond to notably lower average property values.
The table clearly shows a strong positive correlation between property quality and price, with both the
mean and median house prices consistently increasing as the quality score increases. Higher quality ratings
directly translate to higher property values.
Properties classified as being on the coast (coast = 1.0) have a significantly higher mean and median price
compared to those not on the coast (coast = 0.0), indicating that coastal proximity is a valuable feature.
This aligns with typical real estate market trends where waterfront or coastal properties command a
premium.
The table suggests a generally positive trend between the number of ceilings/stories (ceil) and house price, with
higher mean and median prices observed for properties with more levels. However, the relationship isn't strictly
linear, as seen with the dip at 3.0 ceilings and the subsequent increase at 3.5.
There were 403 null values, of which some were dropped and some were replaced. Most of the variables
that had outliers were treated using the interquartile range (IQR) method.
The IQR is the range of the middle 50% of the data values (Q3 minus Q1), so it is not affected by extreme outliers.
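One common form of IQR-based treatment is to cap values at the fences Q1 - 1.5×IQR and Q3 + 1.5×IQR. The sketch below assumes that variant (the 1.5 multiplier and capping rather than row removal are assumptions, since the report does not specify them) and uses hypothetical values.

```python
import pandas as pd

# Hypothetical measurements with one extreme value (not from the dataset).
values = pd.Series([1200, 1500, 1800, 2000, 2200, 2500, 9000], dtype=float)

q1 = values.quantile(0.25)
q3 = values.quantile(0.75)
iqr = q3 - q1  # spread of the middle 50% of the data

# Fences at 1.5 * IQR beyond the quartiles (assumed convention).
lower = q1 - 1.5 * iqr
upper = q3 + 1.5 * iqr

# Cap values outside the fences instead of dropping the rows.
capped = values.clip(lower=lower, upper=upper)

print(q1, q3, iqr, lower, upper)
print(capped.tolist())
```

Capping keeps every row, which matters when the outlying value in one column sits next to valid values in the others.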
After outlier treatment
Two new variables Basement and Renovated have been created. These tell us whether
a house has a basement and whether a house is renovated.
Seven unwanted variables have been dropped which were not required for our analysis.
Label encoding has been done on ceil, room_bed and room_bath columns.
Final data consisted of 21538 rows and 21 columns.
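The derived flags and the label encoding described above can be sketched with plain pandas. The values are hypothetical, the `ceil_encoded` column name is an illustrative assumption, and the encoding shown (integer codes in sorted order of the distinct values) is one possible scheme rather than necessarily the one used.

```python
import pandas as pd

# Hypothetical frame (values are not from the dataset).
df = pd.DataFrame({
    "basement": [1250.0, 0.0, 0.0, 400.0],
    "yr_renovated": [0, 0, 1995, 0],
    "ceil": [1.0, 1.5, 2.0, 1.0],
})

# Binary flags: 1 if the house has a basement / has been renovated, else 0.
df["Basement"] = (df["basement"] > 0).astype(int)
df["Renovated"] = (df["yr_renovated"] > 0).astype(int)

# Label encoding: each distinct value gets an integer code, in sorted order.
df["ceil_encoded"] = df["ceil"].astype("category").cat.codes

print(df)
```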
3) Model Building -
We use various regression and classification models for this problem. The models used, along with their
accuracies and RMSE values, are listed below.
1) Linear Regression Model –
5) Decision Tree Regression Model -
RMSE on Train Set for Decision Tree Regression Model was 0.0
RMSE on Test Set for Decision Tree Regression Model was 0.0
R square on Train Set for Decision Tree Regression Model: 1.0
R square on Test Set for Decision Tree Regression Model: 0.596609764509463
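For reference, RMSE and R square are both computed from the same residuals, so the two metrics move together. The sketch below states the standard definitions on hypothetical values; the helper names `rmse` and `r_squared` are illustrative and not taken from the project code.

```python
import numpy as np

def rmse(y_true, y_pred):
    """Root mean squared error: square root of the mean squared residual."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    return float(np.sqrt(np.mean((y_true - y_pred) ** 2)))

def r_squared(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    ss_res = np.sum((y_true - y_pred) ** 2)
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)
    return float(1.0 - ss_res / ss_tot)

# Hypothetical targets and predictions (not from the dataset).
y_true = np.array([200000.0, 350000.0, 500000.0, 650000.0])
y_perfect = y_true.copy()
y_off = np.array([250000.0, 300000.0, 550000.0, 600000.0])

print(rmse(y_true, y_perfect), r_squared(y_true, y_perfect))
print(rmse(y_true, y_off), r_squared(y_true, y_off))
```

Under these definitions an RMSE of exactly 0 goes together with an R square of exactly 1, which is a useful consistency check when reading reported metrics.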
Model Tuning -
Hyperparameter tuning of the random forest model gave an accuracy of 80.7% on the training set and 71.1% on
the test set. The RMSE value for the test set is higher than for the training set, which indicates that the
model is overfitting.
Hyperparameter tuning of the gradient boost model gave an accuracy of 84.5% on the training set and 72.2% on
the test set. The RMSE value for the train set is higher than for the test set, which indicates that the
model is underfitting.
The tuned gradient boost model gave the best accuracy. The model is also underfitting.
Mean squared error: 52050.694594988425
1) Dimension Reduction
2) Clustering
1) Model Validation -
The models were compared based on their accuracies and RMSE values.
Of all the models, the gradient boost model performed best and gave the highest accuracy.
Hyperparameter tuning of the gradient boost model with the best parameters gave an accuracy of 84.5% on the
training set and 72.2% on the test set.
The tuned gradient boost model performed best on both the test and training datasets.
These are the features that affect the price of the house according to the tuned gradient boost model.
Fig 29 – Important features that affect price
We can see that almost all the features affect the price of the house; quality and furnished_1.0 are the top
two features that affect the price of a house.
4) Final Interpretation / Recommendation -
Insights -
Quality rating is the most important feature that is looked for in a house.
People prefer furnished houses with good square footage to live in.
Overall, a house with good living space, furnished and with 1-2 floors is what people want to buy.
Recommendations -
Some of the features that affect the price the most, like quality, furnished and living measure,
should be looked for while purchasing or selling a house.
The most important feature is quality: a house with a higher quality rating is priced higher.
Selling a furnished house with ample living space is easier than selling an unfurnished house with
too little or excess living space. A house with 2,000 sq ft of living space is what people want; a house with
less than 1,000 sq ft or more than 3,000 sq ft will be hard to sell.
More than 50% of the houses are not furnished. Furnishing these houses will help sell them, as
people prefer furnished houses.
A house with a quality rating higher than 6 is what people prefer; therefore, scoring at least a 6
is recommended.