Capstone Project 6 April

The capstone project analyzes a housing dataset with over 21,000 listings to identify key property features impacting prices, aiding in accurate value predictions and informed buyer decisions. It includes sections on exploratory data analysis, data cleaning, model building, and validation, highlighting significant findings such as price distribution skewness and the prevalence of certain property features. The project aims to develop pricing models for real estate agents and urban planners while addressing missing values and data preprocessing challenges.


CAPSTONE PROJECT

By Nandini Priya M
Date: 06-04-2024

INDEX -
1) Introduction of the business problem

2) EDA and Business implications

3) Data Cleaning and Preprocessing

4) Model Building

5) Model Validation

6) Final Interpretation / Recommendation

1) Introduction of the business problem -

The objective is to analyze the housing dataset and determine which property features have the most significant impact
on price. This will help the organization:
1. Predict property values more accurately.
2. Identify neighborhoods or home types with high or low investment potential.
3. Assist buyers with better-informed decisions.
4. Guide city planning and affordable housing strategies.

Dataset Description:
The dataset contains over 21,000 property listings with features such as:
• Number of bedrooms and bathrooms
• Living area and lot size
• Quality and condition ratings
• Year built/renovated
• Location (zipcode, latitude, longitude)
• Features like basement, floors, waterfront, and furnishings

Business Use Cases:

• Develop pricing models to support real estate agents and urban planners.
• Identify how features like location, size, and condition affect pricing.

Data Report -

Sample of the data set -

                    0                1                2                3                4
cid                 3876100940       3145600250       7129303070       7338220280       7950300670
dayhours            20150427T000000  20150317T000000  20140820T000000  20141010T000000  20150218T000000
price               600000           190000           735000           257000           450000
room_bed            4                2                4                3                2
room_bath           1.75             1                2.75             2.5              1
living_measure      3050             670              3040             1740             1120
lot_measure         9440             3101             2415             3721             4590
ceil                1                1                2                2                1
coast               0                0                1                0                0
sight               0                0                4                0                0
condition           3                4                3                3                3
quality             8                6                8                8                7
ceil_measure        1800             670              3040             1740             1120
basement            1250             0                0                0                0
yr_built            1966             1948             1966             2009             1924
yr_renovated        0                0                0                0                0
zipcode             98034            98118            98118            98002            98118
lat                 47.7228          47.5546          47.5188          47.3363          47.5663
long                -122.183         -122.274         -122.256         -122.213         -122.285
living_measure15    2020             1660             2620             2030             1120
lot_measure15       8660             4100             2433             3794             5100
furnished           0                0                0                0                0
total_area          12490            3771             5455             5461             5710

Table 1 – Sample of the dataset

The table above shows a sample of the data. The dataset contains 21,613 rows and 23 columns. (The table is
transposed so that it fits on the page.)

Descriptive Details -

Table 2 – Descriptive Details

General Observations:
• Large Dataset: With a count of over 21,000 for most features, the dataset is reasonably large
and can support more robust statistical analysis and modeling.
• Missing Values: The date_house_sold column has nan in its descriptive statistics, indicating missing
values in this column. These need to be addressed (e.g., by imputation or removal) before the
feature is used in modeling.
• Varied Scales: The features have vastly different scales (e.g., price in millions, room_bed in single
digits, lat and long around specific values). This suggests that feature scaling (like standardization
or normalization) will be important for many machine learning algorithms.
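
The point about varied scales can be illustrated with a minimal standardization (z-score) sketch. It uses plain Python and the price and room_bed values of the five sample rows in Table 1; the helper name `standardize` is hypothetical, and the project itself may have used a library scaler instead.

```python
from statistics import mean, pstdev


def standardize(values):
    """Rescale values to zero mean and unit (population) standard deviation."""
    mu = mean(values)
    sigma = pstdev(values)
    if sigma == 0:
        # A constant column carries no scale information; map it to zeros.
        return [0.0 for _ in values]
    return [(v - mu) / sigma for v in values]


# Values of the five sample rows shown in Table 1.
price = [600000, 190000, 735000, 257000, 450000]
room_bed = [4, 2, 4, 3, 2]

price_scaled = standardize(price)
room_bed_scaled = standardize(room_bed)

# After scaling, both features are expressed in standard deviations
# from their own mean, so they are directly comparable.
print([round(v, 2) for v in price_scaled])
print([round(v, 2) for v in room_bed_scaled])
```

Distance-based models such as KNN are the ones most likely to be affected when this step is skipped, because an unscaled price-sized feature dominates the distance calculation.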
Insights into Specific Features:
• Price:
o The average price (mean) is approximately 540,182.
o The price ranges from a minimum of 75,000 to a maximum of 7,700,000, indicating a
wide variety of property values.
o The standard deviation (std) of around 387,382 is quite high, suggesting significant price
dispersion around the mean.
• Room Features (room_bed, room_bath):
o The average number of bedrooms is around 3.7, with most properties having between 3 and 4
bedrooms (as seen in the quartiles). The maximum of 33 looks like a potential outlier or error.
o The average number of bathrooms is around 2.1, with the interquartile range (25th to
75th percentile) being between 1.75 and 2.5.
• Size Features (living_measure, lot_measure, ceil_measure, basement, total_area):
o These features show a wide range of values and relatively high standard deviations, indicating
significant differences in property sizes.
o The minimum values being 0 for some of these (e.g., basement, total_area) might indicate
the absence of that feature in some properties or potential data entry issues.
• Location Features (zipcode, lat, long):
o The zipcode has a relatively small standard deviation compared to its mean, suggesting
properties are clustered within a limited number of zip codes.
o The lat and long values have small standard deviations, indicating the properties are
geographically concentrated. The mean values give a sense of the general location
(a map is needed to pinpoint it).
• Categorical/Binary-like Features (coast, sight, condition, quality, furnished):
o coast has a mean of 0.00745, suggesting that very few properties in the dataset are classified
as being on the coast (assuming 1 represents coastal).
o sight has a mean of 0.23, indicating that a small proportion of properties have a
noteworthy view (assuming higher values represent better sight).
o condition has a mean of around 3.4, with most properties having a condition score between
3 and 4. The range is relatively small (1 to 5), suggesting a defined scale.
o quality has a mean of around 7.7, with a wider range (1 to 13), indicating a more granular
scale for property quality.
o furnished has a very low mean (0.19), suggesting that most properties in the dataset are
not furnished (assuming 1 represents furnished).
• Age/Renovation Features (yr_built, yr_renovated):
o The average year built is around 1969, with properties ranging from 0 (potentially an error
or placeholder) to 2015.
o The average year renovated is around 84, which is unusual if it represents an actual year.
It is more likely that 0 indicates no renovation and non-zero values represent the year of
renovation, so the many zeros pull the mean down. Further investigation into the data dictionary or
context is needed for yr_renovated.
• Past Measurement Features (living_measure15, lot_measure15):
o Comparing these to the current measurements (living_measure, lot_measure) could provide
insights into how property sizes have changed over time due to renovations or other
factors. The means and standard deviations are different, suggesting some level of change.
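
The yr_renovated observation in the list above can be checked with a small sketch: when 0 is a placeholder for "never renovated", the column mean is pulled far below any real year. The ten values below are hypothetical and are not taken from the dataset.

```python
from statistics import mean

# Hypothetical yr_renovated column: 0 marks a house that was never renovated.
yr_renovated = [0, 0, 0, 0, 0, 0, 0, 0, 1990, 2010]

# The plain mean mixes placeholder zeros with real years.
overall_mean = mean(yr_renovated)

# Restricting to non-zero entries gives the average renovation year.
renovated_years = [y for y in yr_renovated if y > 0]
renovated_mean = mean(renovated_years)

# Share of houses that carry a real renovation year.
renovated_share = len(renovated_years) / len(yr_renovated)

print(overall_mean, renovated_mean, renovated_share)
```

By the same logic, a column mean of about 84 is consistent with only a small fraction of houses carrying a real renovation year, rather than with a scale problem.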

Columns in the dataset -

• Entries: The DataFrame contains 21,613 rows (entries), indexed from 0 to 21612. This confirms the
large size of the dataset noted earlier.
• Columns: There are a total of 23 columns, each representing a different feature or attribute of the
properties.
Data Types:
• Numerical Features (float64): The majority of the features (18 columns) are stored as float64.
This is suitable for numerical data that may have decimal values, such as measurements (living
area, lot area), coordinates (latitude, longitude), and potentially ratings (condition, quality).
• Integer Features (int64): 4 columns are stored as int64:
o cid (likely a unique property identifier)
o yr_built (year the house was built)
o yr_renovated (year of renovation, with 0 likely indicating no renovation)
o zipcode (postal code)
• Datetime Feature (datetime64[ns]): The date_house_sold column is correctly recognized
as datetime objects, which is essential for time-based analysis.
Missing Values:
This is a crucial insight from the .info() output:
• Missing Values Present: Several columns have a "Non-Null Count" that is less than the total
number of entries (21613). This indicates the presence of missing values in these columns:
o price: 21613 non-null (no missing values in price itself)
o room_bed: 21505 non-null (108 missing values)
o room_bath: 21505 non-null (108 missing values)
o living_measure: 21596 non-null (17 missing values)
o lot_measure: 21571 non-null (42 missing values)
o ceil: 21571 non-null (42 missing values)
o coast: 21612 non-null (1 missing value)
o sight: 21556 non-null (57 missing values)
o condition: 21556 non-null (57 missing values)
o quality: 21612 non-null (1 missing value)
o ceil_measure: 21612 non-null (1 missing value)
o basement: 21612 non-null (1 missing value)
o living_measure15: 21447 non-null (166 missing values)
o lot_measure15: 21584 non-null (29 missing values)
o furnished: 21584 non-null (29 missing values)
o total_area: 21584 non-null (29 missing values)
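
The missing-value counts listed above follow directly from subtracting each non-null count from the total number of rows. The sketch below reproduces that arithmetic in plain Python using the counts reported in the list; the percentage column is an added convenience, not part of the original output.

```python
TOTAL_ROWS = 21613

# Non-null counts as reported in the list above.
non_null = {
    "price": 21613,
    "room_bed": 21505,
    "room_bath": 21505,
    "living_measure": 21596,
    "lot_measure": 21571,
    "ceil": 21571,
    "coast": 21612,
    "sight": 21556,
    "condition": 21556,
    "quality": 21612,
    "ceil_measure": 21612,
    "basement": 21612,
    "living_measure15": 21447,
    "lot_measure15": 21584,
    "furnished": 21584,
    "total_area": 21584,
}

# Missing count = total rows minus non-null count.
missing = {col: TOTAL_ROWS - count for col, count in non_null.items()}

# Share of rows affected, in percent.
missing_pct = {col: round(100 * n / TOTAL_ROWS, 2) for col, n in missing.items()}

# Print columns from most to least affected.
for col in sorted(missing, key=missing.get, reverse=True):
    print(f"{col:18s} {missing[col]:4d} {missing_pct[col]:5.2f}%")
```

Even the most affected column, living_measure15, is missing in well under 1% of rows, which is why simple imputation is a reasonable choice here.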

2) EDA and Business Implications -

Univariate Analysis -
price

Property prices are highly skewed: while most homes are moderately priced, the top 1% (99th percentile) have prices that
spike dramatically, nearing 2 million, indicating a small segment of high-value luxury properties in the dataset.

The median home price is ₹450,000, meaning that half the homes are priced below this value; this reflects the central tendency of the
market.
There is a steep jump at the 99th percentile, where the price reaches ₹1,968,800, suggesting a small group of very high-end properties
inflating the upper end of the market.
The price distribution is right-skewed, with most homes priced between ₹150k and ₹650k, showing a concentration in the mid-to-lower
range.

The majority of properties are priced between 200k and 500k, indicating a concentration in the mid-tier housing market.
Only 25 properties are under 100k, showing that affordable housing is extremely limited.
Interestingly, there is also a substantial number (3792) of high-value properties priced above 500k, reflecting demand for premium
housing.

The histogram shows that house prices are most concentrated between ₹250,000 and ₹450,000, with the peak around ₹350,000.
The distribution is slightly right-skewed, indicating that while most houses are moderately priced, there are still a notable number of
higher-priced properties.

Very few houses are priced below ₹150,000 or above ₹600,000, highlighting the dominance of mid-range housing.

After log transformation, the price distribution becomes more symmetric, closely resembling a normal (bell-shaped) distribution.
This indicates that the log transformation effectively reduced the right skewness observed in the original price distribution.
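
The effect of the log transformation on skewness can be sketched as follows. The ten prices are hypothetical values chosen to be right-skewed, and `skewness` is an illustrative helper computing the population skewness (third central moment divided by the cubed standard deviation); the original analysis may have used a library function instead.

```python
import math


def skewness(values):
    """Population skewness: third central moment over std cubed."""
    n = len(values)
    mu = sum(values) / n
    m2 = sum((v - mu) ** 2 for v in values) / n
    m3 = sum((v - mu) ** 3 for v in values) / n
    return m3 / m2 ** 1.5


# Hypothetical right-skewed prices: many mid-range homes, a few expensive ones.
prices = [200000, 250000, 300000, 350000, 400000,
          450000, 500000, 650000, 1200000, 2000000]

# The natural log compresses large values more than small ones.
log_prices = [math.log(p) for p in prices]

raw_skew = skewness(prices)
log_skew = skewness(log_prices)

print(round(raw_skew, 2), round(log_skew, 2))
```

A skewness closer to zero after the transformation is what the more symmetric histogram reflects.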

Creation of year renovated and basement

Majority Without Basement: The value 0.0 has a count of 10391. Assuming that 0.0 represents the absence of a basement (values of 0
in basement_measure were mapped to 0), this indicates that a majority of the properties in the house_price_1 DataFrame do not
have a basement.

Significant Number With Basement: The value 1.0 has a count of 5808. Assuming that 1.0 represents the presence of a basement
(mapped from positive values in basement_measure), this shows that a substantial number of properties in the dataset do have a
basement.
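
The mapping described above (0 stays 0, any positive value becomes 1) can be sketched in a few lines. The inputs are the basement and yr_renovated values of the five sample rows in Table 1; the helper name `to_flag` is hypothetical.

```python
from collections import Counter


def to_flag(values):
    """Map a numeric column to 1 where the value is positive, else 0."""
    return [1 if v > 0 else 0 for v in values]


# basement and yr_renovated values of the five sample rows in Table 1.
basement_measure = [1250, 0, 0, 0, 0]
yr_renovated = [0, 0, 0, 0, 0]

has_basement = to_flag(basement_measure)
is_renovated = to_flag(yr_renovated)

# Counting the flags gives the same kind of 0/1 tally discussed above.
print(Counter(has_basement))
print(Counter(is_renovated))
```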

• Missing (blank) values in 'ceil', 'coast', 'condition', 'yr_built', 'long' and 'total_area' were replaced with zero.

• Missing values of room_bed and room_bath were replaced with the mode.

• Missing values of living_measure and lot_measure were replaced with the mean.

• Missing values of sight and furnished were replaced with the mode.

• Zeros in the 'ceil' column were then imputed with the median, and the percentage of zeros remaining in 'ceil' was checked.
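
The imputation rules above can be sketched with the standard library. The function name `fill_missing`, the strategy labels and the small columns are all hypothetical; `None` stands in for a missing value, and the project itself presumably applied the same rules to DataFrame columns.

```python
from statistics import mean, median, mode


def fill_missing(values, strategy):
    """Replace None entries using a summary of the observed values."""
    observed = [v for v in values if v is not None]
    if strategy == "zero":
        fill = 0
    elif strategy == "mode":
        fill = mode(observed)
    elif strategy == "mean":
        fill = mean(observed)
    elif strategy == "median":
        fill = median(observed)
    else:
        raise ValueError(f"unknown strategy: {strategy}")
    return [fill if v is None else v for v in values]


# Hypothetical columns with gaps (None marks a missing value).
room_bed = [3, 4, None, 3, 2]
living_measure = [1000, None, 2000, 3000]
coast = [0, None, 1, 0]

room_bed_filled = fill_missing(room_bed, "mode")       # discrete: most common value
living_filled = fill_missing(living_measure, "mean")   # continuous: average
coast_filled = fill_missing(coast, "zero")             # flag: assume absent

print(room_bed_filled, living_filled, coast_filled)
```

One caveat of zero-filling is that it is only safe where 0 is a meaningful value; for a column such as long or yr_built, a zero creates an implausible record that later analysis has to account for.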

Histograms, box plots, and density plots of numerical values

1. Price:

• Histogram: The distribution of price appears to be right-skewed. This means that there are a larger number of properties with lower
prices, and fewer properties with very high prices. The peak of the distribution is towards the lower end of the price range.

• Boxplot: The boxplot confirms the right skewness. The median line is closer to the first quartile, and the whisker extends further on the
higher price side. There are also several outliers indicated by individual points above the upper whisker, representing very expensive
properties.

• Density Plot: The density plot shows a primary peak at a lower price point, with a long tail extending towards higher prices. There might
be a secondary, smaller peak or shoulder at a slightly higher price range, suggesting potential clusters in the price distribution.

2. Living Measure:

• Histogram: The distribution of living measure also appears to be right-skewed. Most properties have a smaller living area, with fewer
properties having very large living spaces.

• Boxplot: Similar to price, the boxplot shows the median closer to the lower quartile and a longer upper whisker. There are also
numerous outliers above the upper whisker, indicating some very large properties.

• Density Plot: The density plot shows a clear peak at a lower living measure and a tail extending towards larger values.

3. Lot Measure:

• Histogram: The distribution of lot measure is heavily right-skewed. The vast majority of properties have relatively small lot sizes, with a
very small number of properties having extremely large lots.

• Boxplot: The boxplot strongly emphasizes the right skewness. The box itself is compressed towards the lower end, and there are a very
large number of outliers above the upper whisker, indicating exceptionally large lots.

• Density Plot: The density plot shows a very sharp peak at a small lot measure and a very long, thin tail extending to the right, confirming
the extreme right skewness.

4. Room Bed:

• Histogram: The distribution of the number of bedrooms is somewhat discrete and appears to have a few prominent peaks, likely
corresponding to common numbers of bedrooms (e.g., 3, 4). There is a significant frequency for a lower number of bedrooms, and the
frequency generally decreases as the number of bedrooms increases, although there might be smaller peaks at higher values. There
seems to be a potential outlier or less frequent occurrence at a very high number of bedrooms.

• Boxplot: The boxplot shows a median around 3 bedrooms, with the interquartile range likely covering 3 and 4 bedrooms. There are
outliers on the higher end, representing properties with an unusually large number of bedrooms. The lower whisker extends to a smaller
number of bedrooms.

• Density Plot: The density plot reflects the discrete nature with multiple humps, corresponding to the common numbers of bedrooms.

5. Room Bath:

• Histogram: Similar to the number of bedrooms, the distribution of the number of bathrooms is also discrete with noticeable peaks, likely
at common numbers of bathrooms (e.g., 1, 2, 2.5). The distribution seems somewhat multimodal.

• Boxplot: The boxplot shows a median likely around 2 bathrooms, with the interquartile range covering a common range of bathroom
counts. There are outliers on the higher end, indicating properties with a very large number of bathrooms.

• Density Plot: The density plot shows multiple peaks, reflecting the common discrete values for the number of bathrooms.

Overall Inferences:

• Right Skewness in Continuous Variables: The continuous variables (price, living_measure, lot_measure) all exhibit a right-skewed
distribution, indicating the presence of some very large values (high prices, large living areas, large lots) that pull the mean to the right of
the median.

• Outliers: All the continuous variables show the presence of outliers, which might require investigation and potential handling
depending on the analysis goals.

• Discrete Distributions for Room Features: The number of bedrooms and bathrooms have discrete distributions with peaks at
common integer or half-integer values.

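• Potential for Transformation: For statistical modeling, the right-skewed variables (price, living_measure, lot_measure) may benefit
from a transformation, such as a log transformation, to reduce skewness.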

The ceil_measure distribution is right-skewed, with most values concentrated between 1000 and 2000, indicating that typical ceiling
measurements fall within this range, while very high values are rare.

The boxplot for ceil_measure shows that the data is right-skewed with a large number of outliers above the upper whisker, indicating
the presence of unusually high ceiling measurements.

The histogram for basement_measure is highly right-skewed, with a large concentration of properties having zero basement area,
indicating that most houses do not have basements.
The boxplot for basement_measure shows that most properties have a relatively small or zero basement size, with a significant number
of outliers indicating some properties have exceptionally large basements. The median basement size is noticeably lower than the upper
whisker, suggesting a right-skewed distribution for basement size.

The histogram for living_measure15 shows a distribution that is right-skewed, indicating that most properties had a smaller living area 15
years prior, with a tail extending towards larger living areas. The peak of the distribution is between approximately 1250 and 1750.

The boxplot for living_measure15 confirms a right-skewed distribution with a median around 1750 and a significant number of outliers
above the upper whisker, indicating some properties had considerably larger living areas 15 years ago. The lower whisker extends to a
value around 500.

The histogram for lot_measure15 is extremely right-skewed, with the vast majority of properties having a small lot size 15 years ago, and
a very long tail indicating a few properties with exceptionally large lots. The primary peak is very close to zero.

The boxplot for lot_measure15 confirms a highly right-skewed distribution with a very low median and a large number of extreme
outliers above the upper whisker, indicating that while most properties had small lot sizes 15 years ago, some had exceptionally
large lots.
yr_built -

Most of the houses were built between 1900 and 2015.

The 14 houses with null values for yr_built will have those values replaced.

yr_renovated

Almost 20,000 houses have not been renovated. The renovated houses were renovated between 1934 and 2015.

total_area -

We can see that the distribution is right-skewed. There are 4 houses with more than 1,000,000
square feet of total area.

Pie plots of Categorical variables

1. Room (bedrooms):
Most houses have 3 bedrooms (50.8%), followed by 4 bedrooms (26.3%). Houses with more than 5 bedrooms are very rare.

2. Room (bathrooms):
Bathroom counts vary more widely, but the majority have 2.5 or fewer bathrooms, with 1.0, 1.75, and 2.0 being the most common.

3. Ceiling height (ceil):

Most homes have ceiling values around 1.0 (57.1%) and 2.0 (31.5%), indicating standard ceiling heights dominate.

4. Coast proximity:
An overwhelming majority of homes (99.8%) are not near the coast, making coastal properties very rare.

5. Sight (view):
95.1% of homes have no special view (sight = 0), while only a small fraction enjoy any notable view.

6. Condition:
The majority of homes are in average to good condition, with condition 3 (64.9%) and condition 4 (26.7%) being the most common.

7. Quality:
Most homes are of quality 7 (51.1%) and 8 (28.2%), indicating a high concentration of moderately good construction quality.

8. Furnished:
93.4% of homes are unfurnished, showing that furnished homes are quite uncommon in the dataset.

Overall insight:
The dataset predominantly contains standard-quality, unfurnished homes in average condition, with 3 bedrooms, 2 bathrooms, and no
special location advantages (like coast or view).


Bivariate Analysis -
Pair plot -

Fig 19 – Pair plot

We can see that price is right-skewed. Most of the variables do not have a clear relationship with the price
variable.

Some of these variables will have to be converted into categorical variables.

Pearson correlation -

Fig 20 – Pearson Correlation

The heatmap above shows that most of the variables are correlated with one another.
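
Each cell of such a heatmap is a Pearson correlation coefficient. As a minimal sketch, the function below computes it directly from its definition for the five sample rows in Table 1; the helper name `pearson` is illustrative, and five rows are far too few to draw conclusions about the full dataset.

```python
from math import sqrt


def pearson(x, y):
    """Pearson correlation: covariance divided by the product of std devs."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / sqrt(var_x * var_y)


# Values of the five sample rows shown in Table 1.
living_measure = [3050, 670, 3040, 1740, 1120]
lot_measure = [9440, 3101, 2415, 3721, 4590]
price = [600000, 190000, 735000, 257000, 450000]

r_living = pearson(living_measure, price)
r_lot = pearson(lot_measure, price)

print(round(r_living, 3), round(r_lot, 3))
```

The coefficient always lies between -1 and 1, with values near zero indicating no linear relationship.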

Bivariate analysis of year on price

The boxplots suggest that the median house price was slightly higher in 2015 compared to 2014. However, the
overall distribution of prices, as indicated by the interquartile range and the length of the whiskers, appears quite
similar in both years, suggesting no drastic change in the price range.

Bivariate analysis of price and room_bed -

The boxplots generally show a positive correlation between the number of bedrooms and house price, with median prices tending to
increase as the number of bedrooms increases up to a certain point. However, there's considerable overlap in the price distributions
across different bedroom counts, and the variability in price also seems to increase with more bedrooms. The data also includes some
less common instances with a very high number of bedrooms (e.g., 11 and 33), which show high price points but might represent outliers
or different property types.

Bivariate analysis of price and room_bath -

The boxplots generally illustrate a positive relationship between the number of bathrooms and house price,
with median prices tending to increase as the number of bathrooms goes up. However, the relationship isn't
strictly linear, and there's considerable overlap in price distributions across different bathroom counts,
especially in the mid-range. Properties with a very high number of bathrooms (e.g., above 5) tend to have higher
prices, but the number of data points in these categories might be smaller.

The scatter plot generally shows a trend of increasing house price with a higher number of bathrooms
(room_bath), particularly noticeable up to around 4 bathrooms. Similar to the number of bedrooms, there's a
significant range of prices for any given number of bathrooms, indicating other factors also strongly influence the
final value. The vertical lines of points reflect the discrete nature of bathroom counts (including half bathrooms).

The scatter plot illustrates a tendency for house prices to increase with the number of bedrooms, especially
within the common range of 1 to 6 bedrooms. However, for any given number of bedrooms, there's a significant
range of prices, suggesting other factors strongly influence the final value.

Bivariate analysis of price and Lot measure -

The scatter plot shows a weak positive correlation between lot measure and house price, particularly for smaller
lot sizes. While prices tend to be higher for slightly larger lots, the relationship becomes much less clear and more
scattered as the lot measure increases, suggesting other factors have a stronger influence on price for larger
properties. There are also many properties with small lot measures spanning a wide range of prices.

Bivariate analysis of Price and living measure

The scatter plot shows a clear positive correlation between living measure and house price, as indicated by the upward-sloping trendline.
Generally, as the living area of a property increases, its price also tends to increase. However, there is still considerable
scatter around the trendline, suggesting that living measure is not the only factor determining house price, and other
variables also play a significant role.

The scatter plot suggests a weak positive correlation between lot measure and house price, as indicated by the
slightly upward-sloping trendline. While there's a general tendency for prices to be higher for larger lots,
especially beyond a certain threshold, the relationship is not strong, and there's a wide spread of prices for
properties with similar lot measures. This indicates that other factors likely have a more substantial impact on the
final house price.

The scatter plot shows a positive correlation between the number of bedrooms and house price, with the
trendline indicating a general increase in price as the number of bedrooms increases. However, the relationship
appears somewhat stepped due to the discrete nature of bedroom counts, and there's a significant spread in
prices for a given number of bedrooms, especially in the lower range. The presence of data points with a very high
number of bedrooms (beyond the typical range) also correlates with higher prices, though these are less frequent.

The scatter plot reveals a positive correlation between the number of bathrooms and house price, with the
trendline showing a general increase in price as the number of bathrooms increases. Similar to the number of
bedrooms, the relationship appears somewhat stepped due to the discrete nature of bathroom counts, and
there's a noticeable spread in prices for properties with the same number of bathrooms. Properties with a higher
number of bathrooms tend to command higher prices overall.

The scatter plot indicates a weak positive correlation between basement measure and house price, as shown by
the slightly upward-sloping trendline. While there's a general tendency for prices to be somewhat higher for
properties with larger basements, the relationship is not very strong, and there's a significant amount of price
variation for properties with similar basement measures, especially for smaller basement sizes. This suggests that
other factors likely have a more substantial influence on house price than just the basement size.

The scatter plot shows a positive correlation between ceil measure and house price, with the trendline indicating
a general increase in price as the ceil measure increases. While there's a clear upward trend, there's also a
considerable spread of prices for a given ceil measure, suggesting that other factors contribute significantly to the
final price. The relationship seems stronger for larger ceil measures.

Bivariate analysis of price and ceil

The scatter plot suggests a general tendency for house prices to be higher with an increasing number of
ceilings/stories (ceil), although the relationship isn't strictly linear and shows considerable overlap in price ranges
for different ceiling counts. Properties with 2.5 and 3.5 ceilings appear to have a wider range of higher prices
compared to those with fewer ceilings.

Bivariate analysis of price and coast-

There is a slight increase in price for houses with a waterfront view.

Bivariate analysis of price and quality-

We can see an upward trend in price on better quality rating.

Latitude and Longitude of houses/properties -

Fig 26 – Lat and Long of House Properties

There are 21579 properties spread across Seattle, USA.

Fig 27 – Lat and Long of House Properties

We can see that there are also 34 properties in France.
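
Records that fall outside the expected region can be flagged with a simple bounding-box check. In the sketch below the box limits are hypothetical round numbers chosen to enclose the sample coordinates from Table 1, and the last point is a hypothetical record with a longitude of 0. One possible explanation for the 34 properties plotted in France, though it is not confirmed by the report, is that missing long values were replaced with zero during imputation.

```python
# Hypothetical bounding box enclosing the sample coordinates in Table 1.
LAT_RANGE = (47.0, 48.0)
LONG_RANGE = (-123.0, -121.0)


def in_region(lat, long):
    """Return True when the point lies inside the bounding box."""
    return (LAT_RANGE[0] <= lat <= LAT_RANGE[1]
            and LONG_RANGE[0] <= long <= LONG_RANGE[1])


# (lat, long) of the five sample rows in Table 1, plus one hypothetical
# record whose longitude is 0.
points = [
    (47.7228, -122.183),
    (47.5546, -122.274),
    (47.5188, -122.256),
    (47.3363, -122.213),
    (47.5663, -122.285),
    (47.5, 0.0),
]

inside = [p for p in points if in_region(*p)]
outside = [p for p in points if not in_region(*p)]

print(len(inside), outside)
```

Flagged records can then be inspected individually before deciding whether to correct, impute or drop them.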

Business Implications -

Core Value Drivers:
• Living Space is Key: The strong positive correlation between living_measure and price across
multiple visualizations consistently highlights that the size of the living area is a primary driver of
house value. This emphasizes the importance of maximizing usable indoor space in
development and highlighting it in sales.
• Bedrooms and Bathrooms Matter: The positive correlation between the number of bedrooms and
bathrooms with price suggests that these functional aspects significantly influence property value,
catering to different household sizes and needs. Developers should consider the optimal mix of
bedroom/bathroom counts based on target markets.
• Basement Presence: The significant number of properties with and without basements, coupled
with the potential (though not directly visualized against price) for basement size to add value,
indicates that basement features are a relevant factor in the market, especially in areas where
they are common.
Land Size Influence:
• Diminishing Returns for Lot Size: The weaker correlation of lot_measure with price, especially for
larger lots, suggests that beyond a certain point, simply having more land doesn't proportionally
increase the price. Value might be tied more to the usable land or other property features on the
lot.
Market Dynamics and Trends:
• Slight Price Appreciation (2014-2015): The minor increase in median price from 2014 to 2015
suggests a slightly appreciating market during that period. Real estate professionals can use
such trend analysis (over longer periods with more data) to advise on investment timing.
• Right-Skewed Distributions: The right skewness in price, living measure, and lot measure indicates
a market with a larger number of more affordable/smaller properties and a smaller segment of
high-end/larger properties. This informs inventory management and marketing strategies for
different price points.
• Outliers Represent Opportunities or Errors: The presence of outliers in several features (price,
size measures) could represent unique, high-value properties or potential data entry errors that
need investigation. Identifying and understanding these outliers can be valuable for niche markets
or data quality control.
Temporal Considerations:
• Changes Over Time (living_measure15, lot_measure15): The differences in distributions between
current and past measurements suggest potential trends in property development and land use
over the 15-year period. This historical data can inform long-term investment and planning
strategies.

The table shows significant variation in both the mean and median house prices across different zip codes. This
indicates that location, as represented by zip code, is a strong determinant of property value in this
dataset.
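
A per-zipcode summary of this kind can be sketched by grouping prices by zip code and taking the mean and median of each group. The example below uses only the five sample rows in Table 1, so the figures illustrate the method rather than reproduce the table.

```python
from collections import defaultdict
from statistics import mean, median

# (zipcode, price) pairs of the five sample rows shown in Table 1.
rows = [
    (98034, 600000),
    (98118, 190000),
    (98118, 735000),
    (98002, 257000),
    (98118, 450000),
]

# Collect the prices observed in each zip code.
prices_by_zip = defaultdict(list)
for zipcode, price in rows:
    prices_by_zip[zipcode].append(price)

# Summarise each group with its mean, median and number of listings.
summary = {
    zipcode: {"mean": mean(p), "median": median(p), "count": len(p)}
    for zipcode, p in prices_by_zip.items()
}

for zipcode, stats in sorted(summary.items()):
    print(zipcode, stats)
```

Reporting the count alongside the averages helps to spot zip codes whose figures rest on very few listings.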

The table suggests a non-linear relationship between property condition and price. While the highest
mean and median prices are observed for conditions 0.0 and 5.0, lower condition scores (1.0 and 2.0)
correspond to notably lower average property values.

The table clearly shows a strong positive correlation between property quality and price, with both the
mean and median house prices consistently increasing as the quality score increases. Higher quality ratings
directly translate to higher property values.

Properties classified as being on the coast (coast = 1.0) have a significantly higher mean and median price
compared to those not on the coast (coast = 0.0), indicating that coastal proximity is a valuable feature.
This aligns with typical real estate market trends where waterfront or coastal properties command a
premium.

The table suggests a generally positive trend between the number of ceilings/stories (ceil) and house price, with
higher mean and median prices observed for properties with more levels. However, the relationship isn't strictly
linear, as seen with the dip at 3.0 ceilings and the subsequent increase at 3.5.

3) Data Cleaning and Preprocessing -

Null values and Outliers in the data -

There were 403 null values, some of which were dropped and some replaced. Most of the variables that had outliers were treated using the interquartile range (IQR).

The IQR is the range of the middle 50% of the data values, so it is not affected by extreme outliers.
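
A minimal sketch of IQR-based outlier treatment is shown below. It assumes the common convention of fences at 1.5 times the IQR beyond the quartiles and assumes that outliers are capped at the fences rather than removed; the report does not state which variant was used. The data values are hypothetical.

```python
from statistics import quantiles


def iqr_bounds(values, k=1.5):
    """Return the lower and upper fences: Q1 - k*IQR and Q3 + k*IQR."""
    q1, _, q3 = quantiles(values, n=4)
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr


def cap_outliers(values, k=1.5):
    """Clip every value to the range defined by the IQR fences."""
    lower, upper = iqr_bounds(values, k)
    return [min(max(v, lower), upper) for v in values]


# Hypothetical feature with one extreme value.
values = [1, 2, 3, 4, 5, 6, 7, 8, 9, 100]

lower, upper = iqr_bounds(values)
capped = cap_outliers(values)

print(lower, upper)
print(capped)
```

Capping keeps every row in the dataset, whereas dropping outlier rows would reduce the row count; which is preferable depends on whether the extreme values are errors or genuine high-end properties.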

Before outlier treatment

After outlier treatment

• Two new variables, Basement and Renovated, have been created. These indicate whether
a house has a basement and whether it has been renovated.
• Seven unwanted variables that were not required for the analysis have been dropped.
• Label encoding has been applied to the ceil, room_bed and room_bath columns.
• The final data consists of 21,538 rows and 21 columns.
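
Label encoding replaces each distinct value in a column with an integer code. The sketch below assigns codes in sorted order of the distinct values, which is an assumption about the ordering used; the inputs are the room_bath and ceil values of the five sample rows in Table 1, and `label_encode` is a hypothetical helper.

```python
def label_encode(values):
    """Assign each distinct value an integer code, in sorted order."""
    categories = sorted(set(values))
    mapping = {category: code for code, category in enumerate(categories)}
    encoded = [mapping[v] for v in values]
    return encoded, mapping


# room_bath and ceil values of the five sample rows in Table 1.
room_bath = [1.75, 1, 2.75, 2.5, 1]
ceil = [1, 1, 2, 2, 1]

room_bath_encoded, room_bath_mapping = label_encode(room_bath)
ceil_encoded, ceil_mapping = label_encode(ceil)

print(room_bath_encoded, room_bath_mapping)
print(ceil_encoded, ceil_mapping)
```

Because these three columns are already numeric and ordered, the codes preserve the ordering but discard the spacing between values (for example, the step from 1 to 1.75 bathrooms becomes a step of one code).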

4) Model Building -

Various regression models are used for this problem. The models are listed below with their RMSE
and R-square values.
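
For reference, the two metrics reported for every model can be computed as sketched below. The actual prices are those of the five sample rows in Table 1, while the predictions are hypothetical; the function names `rmse` and `r_squared` are illustrative and not necessarily the ones used in the project.

```python
from math import sqrt


def rmse(y_true, y_pred):
    """Root mean squared error: typical prediction error in price units."""
    n = len(y_true)
    return sqrt(sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / n)


def r_squared(y_true, y_pred):
    """R square: share of the variance in y_true explained by y_pred."""
    mean_true = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)
    return 1 - ss_res / ss_tot


# Actual prices of the five sample rows in Table 1.
y_true = [600000, 190000, 735000, 257000, 450000]
# Hypothetical model predictions for the same rows.
y_pred = [550000, 240000, 700000, 300000, 430000]

model_rmse = rmse(y_true, y_pred)
model_r2 = r_squared(y_true, y_pred)

print(round(model_rmse, 2), round(model_r2, 4))
```

A lower RMSE and an R square closer to 1 indicate a better fit. Comparing the train and test values of each model matters as much as the values themselves: a large gap between them usually signals overfitting.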

1) Linear Regression Model –

2) Lasso Linear Regression Model

RMSE on Train Set for Lasso Regression Model: 77240.99326782417
RMSE on Test Set for Lasso Regression Model: 75617.74100181667
R square on Train Set for Lasso Regression Model: 0.5738500162927691
R square on Test Set for Lasso Regression Model: 0.583907421542122

3) Ridge Linear Regression Model -

RMSE on Train Set for Ridge Regression Model: 77241.63232554749
RMSE on Test Set for Ridge Regression Model: 75624.01481203036
R square on Train Set for Ridge Regression Model: 0.5738429647116476
R square on Test Set for Ridge Regression Model: 0.5837698728400622
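The train and test RMSE and R square values for the Lasso and Ridge models can be obtained along the following lines. This is a sketch on synthetic data: the `alpha` values, the 70/30 split and the random seeds are assumptions, and `evaluate` is a hypothetical helper, not the code used for the figures above.

```python
import numpy as np
from sklearn.linear_model import Lasso, Ridge
from sklearn.metrics import mean_squared_error, r2_score
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the housing features and price
rng = np.random.default_rng(0)
X = rng.normal(size=(500, 5))
y = X @ np.array([3.0, -2.0, 0.0, 1.5, 0.0]) + rng.normal(scale=0.5, size=500)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=1
)

def evaluate(model):
    """Fit the model and return RMSE and R square on the train and test sets."""
    model.fit(X_train, y_train)
    scores = {}
    for name, X_part, y_part in [("train", X_train, y_train),
                                 ("test", X_test, y_test)]:
        pred = model.predict(X_part)
        scores[name] = {
            "rmse": float(np.sqrt(mean_squared_error(y_part, pred))),
            "r2": float(r2_score(y_part, pred)),
        }
    return scores

results = {
    "lasso": evaluate(Lasso(alpha=0.1)),
    "ridge": evaluate(Ridge(alpha=1.0)),
}
for name, scores in results.items():
    print(name, scores)
```

Reporting both metrics on both sets, as done for each model in this section, makes it possible to compare train and test performance and spot overfitting.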

4) KNN Regression Model -

RMSE on Train Set for KNN Regression Model: 104143.07948409667
RMSE on Test Set for KNN Regression Model: 110459.001866508
R square on Train Set for KNN Regression Model: 0.22531061101532268
R square on Test Set for KNN Regression Model: 0.14490866619350762

5) Decision Tree Regression Model -
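• RMSE on Train Set for Decision Tree Regression Model was 0.0
• RMSE on Test Set for Decision Tree Regression Model was 0.0
• R square on Train Set for Decision Tree Regression Model: 1.0
• R square on Test Set for Decision Tree Regression Model: 0.596609764509463

An R square of 1.0 on the train set against roughly 0.60 on the test set indicates that the decision tree is overfitting. A test RMSE of 0.0 is not consistent with a test R square below 1.0, so the reported test RMSE should be rechecked.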

6) Random Forest Regression Model -

RMSE on Train Set for Random Forest Regression Model: 43814.47220967624
RMSE on Test Set for Random Forest Regression Model: 47994.637606156124
R square on Train Set for Random Forest Regression Model: 0.8628796397803093
R square on Test Set for Random Forest Regression Model: 0.7757951331367557

7) Gradient Boost Regression Model -

RMSE on Train Set for Gradient Boost Regression Model: 33290.984103866256
RMSE on Test Set for Gradient Boost Regression Model: 21121.8205447088
R square on Train Set for Gradient Boost Regression Model: 0.9208374011791041
R square on Test Set for Gradient Boost Regression Model: 0.798153797218344

Model Tuning -

1) Bagging Regression Model –

RMSE on Train Set for Bagging Regression Model: 24322.32800290531
RMSE on Test Set for Bagging Regression Model: 26073.995517530057
R square on Train Set for Bagging Regression Model: 0.9577451019432845
R square on Test Set for Bagging Regression Model: 0.7700227581332421

2) Random Forest Hyper Tune Model -

• Accuracy on Train Set for Random Forest Hyper Tune Model was 80.7%
• Accuracy on Test Set for Random Forest Hyper Tune Model was 71.1%
• RMSE on Train Set for Random Forest Hyper Tune Model was 109854.27
• RMSE on Test Set for Random Forest Hyper Tune Model was 110770.51

Hyperparameter tuning of the random forest model gave an accuracy of 80.7% on the training set and 71.1% on the test set. The accuracy drops by almost ten points from train to test and the RMSE is slightly higher on the test set than on the train set, which indicates that the model is overfitting.

3) Gradient Boost Hyper Tune Model -

• Accuracy on Train Set for Gradient Boost Hyper Tune Model was 84.5%
• Accuracy on Test Set for Gradient Boost Hyper Tune Model was 72.2%
• RMSE on Train Set for Gradient Boost Hyper Tune Model was 98316.62
• RMSE on Test Set for Gradient Boost Hyper Tune Model was 83986.66

Hyperparameter tuning of the gradient boost model gave an accuracy of 84.5% on the training set and 72.2% on the test set. The RMSE on the train set is higher than on the test set, which would suggest underfitting, although the drop in accuracy from train to test points towards overfitting, so the two metrics do not agree.

Of the two hyper-tuned models, the Gradient Boost model gave the higher accuracy on both the train and test sets.
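Hyperparameter tuning of this kind can be sketched with a grid search over a gradient boosting regressor. The parameter grid, the 3-fold cross-validation and the synthetic data below are assumptions chosen to keep the example small; the actual search space used for the tuned models is not listed here.

```python
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.model_selection import GridSearchCV, train_test_split

# Synthetic stand-in for the housing data
rng = np.random.default_rng(0)
X = rng.normal(size=(300, 4))
y = 2.0 * X[:, 0] + X[:, 1] ** 2 + rng.normal(scale=0.3, size=300)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=1
)

# Small, assumed search space
param_grid = {
    "n_estimators": [50, 100],
    "max_depth": [2, 3],
    "learning_rate": [0.05, 0.1],
}
search = GridSearchCV(
    GradientBoostingRegressor(random_state=1),
    param_grid,
    scoring="neg_root_mean_squared_error",  # lower RMSE is better
    cv=3,
)
search.fit(X_train, y_train)

best_model = search.best_estimator_
train_r2 = best_model.score(X_train, y_train)  # R square on the train set
test_r2 = best_model.score(X_test, y_test)     # R square on the test set
print(search.best_params_, round(train_r2, 3), round(test_r2, 3))
```

Because the parameters are chosen by cross-validation on the training data only, the test set remains an independent check of the tuned model.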

4) Recursive Feature Elimination

Mean squared error : 52050.694594988425
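Recursive feature elimination repeatedly fits a model and removes the weakest feature until the requested number of features remains. A minimal sketch follows, assuming a linear regression as the base estimator and three features to keep; neither choice is stated above, and the data is synthetic.

```python
import numpy as np
from sklearn.feature_selection import RFE
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(0)
X = rng.normal(size=(400, 6))
# Only the first three columns influence the target in this synthetic example
y = (4.0 * X[:, 0] - 3.0 * X[:, 1] + 2.0 * X[:, 2]
     + rng.normal(scale=0.1, size=400))

selector = RFE(estimator=LinearRegression(), n_features_to_select=3)
selector.fit(X, y)

# Column indices of the features that were kept
selected = np.flatnonzero(selector.support_).tolist()
print(selected, selector.ranking_.tolist())
```

The fitted selector can then be used with `selector.transform(X)` to train the final model on the retained columns only.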

5) Dimension Reduction

MSE using PCA on test set: 5525088530.344118

RMSE using PCA on test set: 74330.93925374627
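The two PCA figures are related by RMSE = sqrt(MSE). A sketch of how such values can be produced is given below, assuming the features are standardised, reduced with PCA and then passed to a linear regression. The number of components, the regressor and the synthetic data are assumptions.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the housing data
rng = np.random.default_rng(0)
X = rng.normal(size=(400, 8))
y = X[:, :3].sum(axis=1) + rng.normal(scale=0.2, size=400)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=1
)

# Scale, project onto 5 principal components, then regress
pipeline = make_pipeline(StandardScaler(), PCA(n_components=5), LinearRegression())
pipeline.fit(X_train, y_train)

mse_test = mean_squared_error(y_test, pipeline.predict(X_test))
rmse_test = float(np.sqrt(mse_test))
print(mse_test, rmse_test)
```

Fitting the scaler and PCA inside the pipeline means they are learned from the training set only, so the test error is not influenced by the test data.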

6) Clustering

5) Model Validation -

The models were compared on their accuracies and RMSE values.

Of all the models, the Gradient Boost model performed best and gave the highest accuracy.

Hyperparameter tuning of the Gradient Boost model with the best parameters gave an accuracy of 84.5% on the training set and 72.2% on the test set.

Of the hyper-tuned models, the Gradient Boost model performed best on both the train and test datasets.

Important Features that affect the price variable -

These are the features that affect the price of the house according to the gradient boost hyper tune model.

Fig 29 – Important features that affect price

We can see that almost all the features affect the price of the house. Quality and furnished_1.0 are the top two features.
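A ranking like this can be read from the `feature_importances_` attribute of a fitted gradient boosting model. The sketch below uses synthetic data in which `quality` and `furnished_1.0` are deliberately given the largest effect, so it only illustrates the mechanics. The name `coast_1.0` is an assumed encoded column name, and the model uses default parameters rather than the tuned ones.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import GradientBoostingRegressor

rng = np.random.default_rng(0)
features = ["quality", "furnished_1.0", "living_measure", "coast_1.0"]
X = pd.DataFrame(rng.normal(size=(400, 4)), columns=features)

# Synthetic target: the first two columns get the largest effect on purpose
y = (5.0 * X["quality"] + 3.0 * X["furnished_1.0"]
     + 1.0 * X["living_measure"] + rng.normal(scale=0.2, size=400))

model = GradientBoostingRegressor(random_state=1).fit(X, y)

# Importances sum to 1; sort so the most influential feature comes first
importances = pd.Series(model.feature_importances_, index=features)
importances = importances.sort_values(ascending=False)
print(importances)
```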

6) Final Interpretation / Recommendation -

Insights -

• Quality rating is the most important feature that buyers look for in a house.

• Whether or not a house is on the coast does not affect the price much.

• Houses with a quality rating of 6 to 9.5 are preferred.

• People prefer furnished houses with good square footage.

• People also prefer houses with 1 to 2 floors.

• Overall, people want to buy a furnished house with good living space and 1 to 2 floors.

Recommendations -

• Features that affect the price the most, such as quality, furnished and living measure, should be considered when purchasing or selling a house.

• The most important feature is quality: a house with a higher quality rating is priced higher.

• Selling a furnished house with ample living space is easier than selling an unfurnished house with too little or too much living space. Houses with about 2000 sq ft of living space are what people want; a house with less than 1000 sq ft or more than 3000 sq ft will be hard to sell.

• More than 50% of the houses are not furnished. Furnishing these houses will help sell them, as people prefer furnished houses.

• People prefer a house with a quality rating higher than 6, so a rating of at least 6 is recommended.

************

