120 GSJ10713
a
Post Graduate Scholar, Dept. Computer Engineering (Software Engineering), L J University, Ahmedabad,
Gujarat, India
b
Assistant Professor, Dept. Computer Engineering (Software Engineering), L J University, Ahmedabad,
Gujarat, India
Abstract: Owning a home is not only a basic requirement; it also signifies prestige. House prices
vary with a number of factors, such as location, size, number of bedrooms, lift availability,
parking spaces, etc. The goal of this study is to create a reliable house price prediction model.
The model uses the ensemble learning technique, with Decision Tree, Random Forest, and XGBoost
as the base models. The primary goal of combining several base models is to increase prediction
accuracy.
Keywords: Machine learning, Real estate, House price prediction, Machine learning algorithms, Ensemble learning,
Decision tree, Random Forest, XGBoost, MSE, MAE, R2 score.
1. Introduction
A house is an asset for many reasons. More than just a place to call home, it also provides security,
personal space, emotional attachment, stability, tax benefits, etc. Buying a house is also popular
among investors. Investing in property can be a viable way to build wealth over the long term, but it is
important to weigh the risks and benefits carefully before investing. Knowing the accurate
valuation of a property is very important, whether you are buying or selling. The price of a
house is influenced by several factors such as location, area, number of bedrooms, lift availability,
parking slots, etc.
Machine learning (ML) is the study of algorithms that can recognise patterns and make predictions or
judgements without being explicitly programmed. Ensemble learning is a machine learning technique
that combines the predictions of several models to obtain a prediction that is more reliable and
accurate. [1]
In this research, we use the bagging ensemble learning technique, with Decision Tree, Random
Forest and XGBoost as base models. The main purpose of combining different base models is to improve
prediction and achieve higher accuracy. This paper is divided into five sections: Section 1 contains
the Introduction, Section 2 contains the Literature Review, Section 3 contains the Research
Methodology (Dataset, Data exploration and transformation, Proposed model), Section 4 contains the
Results, and Section 5 contains the Conclusion and References.
2. Literature Review
Adetunji et al., (2022) [2] observed that housing prices depend on factors such as location and city. The
authors use the Random Forest algorithm for house price prediction. The “Boston housing” dataset from
the UCI machine learning repository is used in this paper. Performance evaluation metrics are used to
test the performance of the model.
Ghosalkar et al., (2018) [3] focus on predicting house prices for buyers, taking their
financial plans and needs into account. The study predicts house prices in the Indian city of Mumbai.
Linear Regression is used in this paper because it can predict a numerical target value. MAE,
MSE, and RMSE are used to check the quality of the model.
T.D. Phan, (2018) [4] studied historical data for house price prediction. The paper analyzes a real
historical transactional dataset, the “Melbourne Housing Market” dataset, to gain insight into the
housing market in Melbourne city. Several machine learning techniques are used: Linear Regression,
Polynomial Regression, Regression Tree, Neural Network, and SVM.
Truong et al., (2020) [5] estimate changes in house prices. The paper uses both traditional and
advanced machine learning methods on a dataset named “Housing Price in Beijing”. It discusses
traditional machine learning techniques such as Random Forest, Extreme Gradient
Boosting (XGBoost), and Light Gradient Boosting Machine (LightGBM), and advanced machine learning
techniques such as Hybrid Regression (65% Lasso and 35% XGBoost) and Stacked Generalization (Level
1: Random Forest and LightGBM, Level 2: XGBoost).
Jain et al., (2020) [6] apply a stacking algorithm on top of different regression algorithms. Other
algorithms such as SVM, decision tree, and Random Forest classifier are also used. The final prediction
is reported to have higher accuracy and precision.
Ahtesham et al., (2020) [7] examine data gathered from the Open Data Pakistan website.
The Karachi dataset was chosen specifically for predicting home prices. The paper uses the XGBoost
machine learning algorithm for house price prediction. Accuracy score and MAE are used to evaluate
the model's quality.
3. Research Methodology
3.1 Dataset
The price of a house is influenced by several factors such as location, area, number of bedrooms, lift
availability, parking slots, etc. We use data for the Indian city of Mumbai, downloaded
from the website Kaggle. The dataset has 6347 observations and 19 variables. It contains the
following information: Price, Area, Location, No. of Bedrooms, New/Resale, Gymnasium, Lift
Available, Car Parking, Maintenance Staff, 24x7 Security, Children's Play Area, Clubhouse, Intercom,
Landscaped Gardens, Indoor Games, Gas Connection, Jogging Track, Swimming Pool.
To understand the dataset better, we use a heatmap. A heatmap is a data exploration and analysis
tool that shows the pairwise correlations between the variables in a dataset, which helps reveal
relationships and dependencies within the data.
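The correlation matrix behind such a heatmap can be sketched as follows. This is a minimal illustration, not the authors' code: the rows and the column names (`Price`, `Area`, `Bedrooms`, `LiftAvailable`) are made-up stand-ins for the Kaggle data, and plotting is left as a comment so the snippet depends only on pandas.

```python
import pandas as pd

# Hypothetical rows standing in for the Mumbai dataset (values are illustrative).
df = pd.DataFrame({
    "Price": [4_500_000, 7_200_000, 9_800_000, 15_000_000, 6_100_000],
    "Area": [550, 820, 1100, 1650, 700],
    "Bedrooms": [1, 2, 2, 3, 2],
    "LiftAvailable": [0, 1, 1, 1, 0],
})

# Pairwise Pearson correlations between all numeric columns.
corr = df.corr()

# Correlation of every feature with the target, strongest first.
price_corr = corr["Price"].drop("Price").sort_values(ascending=False)
print(price_corr)

# To draw the heatmap itself (assuming seaborn is installed):
#   import seaborn as sns
#   sns.heatmap(corr, annot=True)
```

Sorting the `Price` column of the matrix is a quick way to read off which variables are most strongly related to price before looking at the full plot.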
From figure 3.1 we observe that the area of the house and the number of bedrooms are strongly
correlated with the price of the house. Since Location is a categorical variable, we use one-hot
encoding to convert it into numerical variables, one binary column per location.
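One-hot encoding of the Location column can be sketched with pandas as below. The locality names and the tiny data frame are illustrative assumptions, not rows from the actual dataset.

```python
import pandas as pd

# Hypothetical sample; the locality names are illustrative only.
df = pd.DataFrame({
    "Area": [550, 820, 1100, 700],
    "Location": ["Andheri", "Bandra", "Andheri", "Powai"],
})

# Replace the categorical column with one 0/1 indicator column per category.
encoded = pd.get_dummies(df, columns=["Location"], prefix="Location", dtype=int)
print(encoded)
```

Each row has exactly one indicator set to 1, so the models receive the location as numbers without an artificial ordering between localities being implied.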
Because the price of a house depends on many factors (location, area, number of bedrooms,
lift availability, parking slots, etc.), and because each of the many machine learning techniques
available for prediction has its own benefits and drawbacks, we use ensemble learning rather than
a single model. We use Decision Tree, Random Forest and XGBoost as base models.
1. Decision Tree
A decision tree is a supervised learning algorithm used for classification and regression
tasks. It is a non-parametric technique that generates predictions using a tree-like model of
decisions and their possible outcomes. The internal nodes of the tree represent decisions based
on the input features, and the branches represent the possible outcomes of those decisions. The
leaves of the tree represent the final output values.[8]
2. Random Forest
Random forest is a supervised ensemble learning technique that can be used for
classification and regression tasks. It constructs a large number of decision trees, each
trained on a random subset of the input features and training data. The final output is the
average (for regression) or the majority vote (for classification) of the individual tree outputs.
3. XGBoost
XGBoost (Extreme Gradient Boosting) is a supervised ensemble learning algorithm used for
regression, classification, and ranking. It uses decision trees to make predictions. Each tree is
fitted to the errors left by the trees built before it, and the final output is the sum of the
predictions made by all the trees in the ensemble. XGBoost provides parallel tree boosting,
which gives fast and accurate results. [12]
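The three base models described above could be fitted along the following lines. This is a sketch under stated assumptions rather than the authors' implementation: the data is synthetic (not the Mumbai dataset), the hyperparameters are arbitrary, and if the `xgboost` package is not installed the code falls back to scikit-learn's `GradientBoostingRegressor`, which is a different gradient boosting implementation used here only as a stand-in.

```python
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor, RandomForestRegressor
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeRegressor

try:
    from xgboost import XGBRegressor  # optional dependency
except ImportError:
    XGBRegressor = None

# Synthetic stand-in data: price driven by area and number of bedrooms plus noise.
rng = np.random.default_rng(0)
area = rng.uniform(300, 2000, size=400)
bedrooms = rng.integers(1, 5, size=400)
X = np.column_stack([area, bedrooms])
y = 10_000 * area + 500_000 * bedrooms + rng.normal(0, 200_000, size=400)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0
)

# Use XGBoost when available, otherwise a scikit-learn boosting stand-in.
if XGBRegressor is not None:
    boosting = XGBRegressor(n_estimators=200, learning_rate=0.1, random_state=0)
else:
    boosting = GradientBoostingRegressor(
        n_estimators=200, learning_rate=0.1, random_state=0
    )

base_models = {
    "decision_tree": DecisionTreeRegressor(max_depth=6, random_state=0),
    "random_forest": RandomForestRegressor(n_estimators=100, random_state=0),
    "boosting": boosting,
}

# Fit each base model and keep its test-set predictions for later combination.
predictions = {}
for name, model in base_models.items():
    model.fit(X_train, y_train)
    predictions[name] = model.predict(X_test)
    print(name, predictions[name][:3])
```

Keeping the per-model predictions in a dictionary makes it straightforward to combine them in a later step.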
Once all three models have produced their predictions, these must be combined into one. Ensemble
learning offers two common techniques for combining multiple outputs: Averaging and Max voting.
Since price prediction is a regression task, we use Averaging to obtain the final prediction:
the final value is the arithmetic mean of the predictions made by the individual models.
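The averaging step can be illustrated with a short sketch. The function name `average_ensemble` and the three prediction lists are hypothetical; in practice the inputs would be the outputs of the three fitted base models.

```python
import numpy as np

def average_ensemble(predictions):
    """Return the element-wise mean of several models' predictions."""
    stacked = np.vstack([np.asarray(p, dtype=float) for p in predictions])
    return stacked.mean(axis=0)

# Hypothetical predictions (in arbitrary price units) for three houses.
tree_pred = [100.0, 250.0, 80.0]
forest_pred = [110.0, 240.0, 90.0]
xgb_pred = [120.0, 245.0, 85.0]

final_pred = average_ensemble([tree_pred, forest_pred, xgb_pred])
print(final_pred)  # means per house: 110, 245 and 85
```

Every model contributes equally here; a weighted average would be a natural variation, but the text describes the plain mean.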
4. Result
After averaging, we use performance evaluation metrics to evaluate the performance and effectiveness
of our system: R2 score, Mean Absolute Error (MAE), Mean Squared Error (MSE), and
Root Mean Squared Error (RMSE).
RMSE = √(MSE)
MAE, MSE, and RMSE measure the prediction error of our system (lower is better), while the R2 score
measures how well the predicted output matches the actual output; the closer the score is to 1, the
better. Figure 4.2 shows a graph comparing the predicted price and the actual price.
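For reference, the four metrics can be computed from their standard definitions as in the sketch below. The helper name `regression_metrics` and the sample values are illustrative assumptions, not results from this study.

```python
import numpy as np

def regression_metrics(y_true, y_pred):
    """Compute MAE, MSE, RMSE and R2 from their standard definitions."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    errors = y_true - y_pred

    mae = np.mean(np.abs(errors))        # mean absolute error
    mse = np.mean(errors ** 2)           # mean squared error
    rmse = np.sqrt(mse)                  # RMSE = sqrt(MSE)

    ss_res = np.sum(errors ** 2)                      # residual sum of squares
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)    # total sum of squares
    r2 = 1.0 - ss_res / ss_tot

    return {"MAE": mae, "MSE": mse, "RMSE": rmse, "R2": r2}

# Hypothetical actual and predicted prices (arbitrary units).
actual = [100.0, 200.0, 300.0, 400.0]
predicted = [110.0, 190.0, 310.0, 390.0]

metrics = regression_metrics(actual, predicted)
print(metrics)  # MAE 10, MSE 100, RMSE 10, R2 0.992
```

RMSE is in the same units as the price, which makes it easier to interpret than MSE, while R2 is unitless and allows comparison across datasets.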
5. Conclusion
The housing market is a pillar of economic growth and stability, and house price prediction plays a
significant role in shaping and influencing the economy. Existing systems mostly focus on a single
model, even though numerous ML algorithms can be used for price prediction. Our system instead
uses the ensemble learning technique, in which multiple models are integrated
for a better outcome. The main purpose of this study is to build a model that improves house price
prediction and achieves higher accuracy.
References
[1] Polikar, R. (2012). Ensemble learning. Ensemble machine learning: Methods and applications, 1-34.
[2] Adetunji, A. B., Akande, O. N., Ajala, F. A., Oyewo, O., Akande, Y. F., & Oluwadara, G.
(2022). House Price Prediction using Random Forest Machine Learning Technique. Procedia
Computer Science, 199, 806-813.
[3] Ghosalkar, N. N., & Dhage, S. N. (2018, August). Real estate value prediction using linear
regression. In 2018 fourth international conference on computing communication control and
automation (ICCUBEA) (pp. 1-5). IEEE.
[4] Phan, T. D. (2018, December). Housing price prediction using machine learning algorithms:
The case of Melbourne city, Australia. In 2018 International conference on machine learning and
data engineering (iCMLDE) (pp. 35-42). IEEE.
[5] Truong, Q., Nguyen, M., Dang, H., & Mei, B. (2020). Housing price prediction via improved
machine learning techniques. Procedia Computer Science, 174, 433-442.
[6] Jain, M., Rajput, H., Garg, N., & Chawla, P. (2020, July). Prediction of house pricing using machine
learning with Python. In 2020 International Conference on Electronics and Sustainable Communication
Systems (ICESC) (pp. 570-574). IEEE.
[7] Ahtesham, M., Bawany, N. Z., & Fatima, K. (2020). House Price Prediction using Machine Learning
Algorithm - The Case of Karachi City, Pakistan. In 2020 21st International Arab Conference on
Information Technology (ACIT) (pp. 1-5). IEEE. doi:10.1109/ACIT50332.2020.9300074.
[8] Thamarai, M., & Malarvizhi, S. P. (2020). House Price Prediction Modeling Using Machine
Learning. International Journal of Information Engineering & Electronic Business, 12(2).
[9] RPubs - House Price Prediction with R. (2021, August 29). RPubs - House Price Prediction With
R. https://rpubs.com/Zetrosoft/lbb-rm
[10] Bagging & Boosting in Machine Learning world. (n.d.). Bagging & Boosting in Machine Learning
World. https://www.linkedin.com/pulse/bagging-boosting-machine-learning-world-debaditya-chakravorty
[11] What is a Random Forest? (n.d.). TIBCO Software.
https://www.tibco.com/reference-center/what-is-a-random-forest
[12] XGBoost Documentation — xgboost 1.7.3 documentation.
(n.d.). https://xgboost.readthedocs.io/en/stable/
[13] Simplified structure of XGBoost. (n.d.).
https://www.researchgate.net/figure/Simplified-structure-of-XGBoost_fig2_348025909