
University of Nottingham

School of Computer Science

Implementation and Study of K-Nearest Neighbour and Regression Algorithm for Real-time Housing Market Recommendation Application

Shujia Zhao
4274913
G53IDS
Supervised by Ke Zhou

April 2018
Abstract

Finding opportunities in the dynamic property market is a challenging problem. We need to establish a model for property buyers (e.g. investors) and then recommend appropriate properties to them in real time. In order to collect a comprehensive data set, I chose zoopla.com (the second biggest UK property website) as the data source. A powerful crawling spider is implemented using Scrapy to constantly collect property information on Zoopla. The goal of this project is to find potential houses through machine learning techniques and give recommendations to investors in the UK.

Acknowledgements

I would like to thank Dr Ke Zhou for supervising my project. Without his willingness and
ability to help, I could not have finished this project successfully. In addition, I would
like to thank my friends and open source developers for their kind help. Lastly, I would
like to thank my parents for their continuous support and unconditional love.

Contents

Abstract

Acknowledgements

1 Introduction
  1.1 Motivation
  1.2 Aims and Objectives
  1.3 Description of the Work

2 Background and Related Work
  2.1 Housing Market in UK
  2.2 Data Source and House Features
  2.3 Investment Estate Metrics
  2.4 Related Academic Project
  2.5 Summary

3 Technical Preliminaries
  3.1 Data Crawling Framework: Scrapy
  3.2 Statistics
    3.2.1 Skewness
    3.2.2 T-Test
  3.3 K-nearest-Neighbour Algorithm
    3.3.1 Euclidean distance
    3.3.2 Predicted Value
  3.4 Linear Regression
    3.4.1 Basis Function
    3.4.2 Loss Function
    3.4.3 Gradient Descent Algorithm
  3.5 Feature-Weighted KNN

4 Design and Implementation: Crawling and Prediction System
  4.1 Web Crawling System
    4.1.1 Design
    4.1.2 Implementation
  4.2 KNN and Feature Regression
    4.2.1 UK Postcode System
    4.2.2 Data Cleaning
    4.2.3 Optimization
    4.2.4 Training
    4.2.5 KNN Implementation

5 Design and Implementation: Web Application
  5.1 Client Side
  5.2 Server Side
    5.2.1 Database Design
    5.2.2 Query Handling
    5.2.3 Real-time Server Pipeline

6 Evaluation
  6.1 Feature Regression Analysis
    6.1.1 Curse of Dimensionality
    6.1.2 Single Feature Linear Regression Analysis
    6.1.3 Multiple Features Linear Regression
  6.2 KNN Prediction Result
    6.2.1 Choice of K
    6.2.2 Property Feature Ablation Study
    6.2.3 T-Test and Summary

7 Summary and Reflections
  7.1 Project Management
  7.2 Contributions and Reflections
    7.2.1 Project Contribution and Reflection
    7.2.2 Personal Reflection

Bibliography

List of Tables

2.1 Average price by country and government office region, Jan 2018
2.2 Example value of crawling data fields
4.1 House Features Description
4.2 Training Attributes and Label
4.3 Sample Feature Vector
6.1 Error rate of Single Feature KNN Prediction

List of Figures

1.1 Annual House Price Rates of Change, UK Land Registry[1]
1.2 Average Price Map in England, UK Land Registry[2]
2.1 Land Registry Transaction
2.2 Example Zoopla on-sale House
3.1 Scrapy architecture[7]
3.2 Skewness. Source: [8]
3.3 KNN Classification[12]
3.4 Detailed Process of Gradient Descent[15]
3.5 Illustration of gradient descent
4.1 Structure of the crawling spider
4.2 Tree hierarchy of the crawling spider
4.3 Spider configuration in settings.py
4.4 Mapping relation in different data files
4.5 Parse House Page
4.6 ID Indexing in Spider
4.7 UK Postcode Map
4.8 UK Postcode Explanation[20]
4.9 Crawling Data set Description
4.10 Extreme Price Count
4.11 Data set Skewness
4.12 Bedrooms and Bathrooms Distribution
4.13 Skewness After Filtering the Data
4.14 Unfiltered Housing Price Distribution
4.15 Filtered Housing Price Distribution
4.16 Data Transform
4.17 Shuffle the data points
4.18 Single Feature Performance Training
4.19 KNN Process
4.20 KNN Method
5.1 Capture of Starting Page
5.2 Example Properties Displayed on Google Map
5.3 Capture of Area Information
5.4 Routing in Flask
5.5 Flask GoogleMap Instance
5.6 Fetch Area Information
5.7 Server Side Pipelines
6.1 1-Bed Rental Price with Different House Types in Different Areas
6.2 Monthly Rental Price with Different Property Types and Beds in the Same Area
6.3 Coefficient of Variation Between Different Areas
6.4 num_bath regression result
6.5 num_bed regression result
6.6 month_view regression result
6.7 geographic_distance regression result
6.8 Multiple Features Training Result
6.9 K=3 Error Rate
6.10 K=5 Error Rate
6.11 K=10 Error Rate
6.12 Average Error Rate for different K values
6.13 Single Feature KNN Testing Error Rate 1
6.14 Single Feature KNN Testing Error Rate 2
6.15 Single Feature KNN Prediction Error Rate Comparison
6.16 T-Test on two Prediction Methods

Chapter 1

Introduction

1.1 Motivation
Finding a property worth investing in within the dynamic property market is a challenging problem. Many house buyers are confused when they first enter the market, leaving them vulnerable to dishonest estate agents who may take advantage of them. Though there are many property websites in the UK, it is rare to see any that give investment suggestions to house buyers. House prices differ between regions; in England, for example, there are more than 300 areas. It is not realistic for an estate agency to cover all the areas and give investors an accurate estimation.

Figure 1.1: Annual House Price Rates of Change, UK Land Registry[1]


Figure 1.2: Average Price Map in England, UK Land Registry[2]

As Figure 1.1 shows, regional price changes in the UK have different trends and volatility: Scotland nearly doubled its rate of price change while other regions' indices declined slightly from 2016 to 2017. Differences in the housing market lead to locality among house agencies; however, relying only on traditional estate valuers cannot satisfy the growing needs of the housing market. This is where machine learning methods might help people find promising properties.
Machine learning has been used in disciplines such as business, engineering, and finance. As Park and Bae mention in their study, machine learning algorithms can enhance the predictability of housing prices and significantly contribute to the correct evaluation of real estate prices.[3] Indeed, many researchers and developers have already tried to apply machine learning methods to house price prediction. Even on Kaggle (https://www.kaggle.com/), a machine learning competition platform, there are competitions on house price prediction.

1.2 Aims and Objectives


The aim of this project is to develop a real-time system that catches new properties coming onto the market and provides investors with property recommendations based on machine learning techniques. The key objectives are:

1. Implement a robust web crawling spider that can collect property information from the market and monitor property status changes (on sale to sold) in real time.

2. Train a feature-weighted K-nearest-neighbour (KNN) model to estimate housing rental income based on the neighbours, then use the estimated price to calculate estimated metrics for the property and its area.

3. Create a web application to provide users with property recommendations based on estimated investment metrics (price-to-rent ratio) and monitor the market in real time.

1.3 Description of the work


We are going to use real properties on sale on Zoopla2. However, the properties on Zoopla are either "on sale" or "to rent", while in the end we are going to rank these properties by "price-to-rent" ratio (covered in the next section), which means we need both a rental price and a sale price for each single property.
So here is the problem: if a house only has a sale price or only a rental price, how can we calculate its "price-to-rent" ratio?
After communicating with my supervisor, I chose the KNN algorithm: if a house only has a sale price, and we have both a rental data set and a sale data set, we can use KNN to estimate the rental price based on its neighbours' rental prices. Then we can calculate its "price-to-rent" ratio.

Here are the requirements to be achieved in this project:

1. The spider should be able to crawl enough house features and store them in a local database.

2. The spider should be fast enough to track a house's status in real time.

3. A regression model should be trained to obtain the weights for the KNN features.

4. An effective house pipeline should be implemented to clean the data, use the KNN algorithm to estimate the rental income, and insert it into the database.

5. An effective area pipeline should be implemented to update the area information, such as average price and average sale speed.

6. The result should be a web application that displays a group of recommended flags (houses) on the Google map based on the user's requirements (location, property type, etc.) and returns detailed area metrics from the system.

2 Zoopla is the second biggest online housing platform in the UK.
Chapter 2

Background and Related Work

In this chapter, we first give an overview of the housing price distribution and the existing housing platforms in the UK. We look at the data source that will be used in this project and explain the features we have. Then we introduce some metrics used in property investment. Finally, we discuss related work that we found during our research.

2.1 Housing Market in UK


Country and government office region | Price (GBP) | Monthly change | Annual change
England | £242,286 | -1% | 5%
Northern Ireland (Quarter 4, 2017) | £130,482 | 1.0% | 4%
Scotland | £148,512 | 1% | 7%
Wales | £153,034 | -1% | 5%
East Midlands | £185,568 | -0% | 7%
East of England | £289,729 | -1% | 5%
London | £485,830 | 1.0% | 2%
North East | £122,870 | -6% | 1%
North West | £155,788 | -2% | 4%
South East | £323,435 | 0% | 3%
South West | £255,307 | 1% | 7%
West Midlands Region | £187,905 | -2.0% | 5%
Yorkshire and The Humber | £156,484 | -1% | 5%

Table 2.1: Average price by country and government office region, Jan 2018

The UK Land Registry provides a general picture of the housing market. As the table shows, London's housing price is more than double that of most other areas, and southern regions are generally more expensive than northern ones. However, when it comes to small areas such as towns or a district within a city, the Land Registry cannot give us detailed information.


2.2 Data Source and House Features


The Land Registry does publish the house transactions made each year, but it does not give much information about the property itself, such as the number of rooms or the size of the property, which we normally consider factors affecting house prices. Moreover, because it is published annually, the data inevitably lag. We have to find a real-time, accurate data source.

Figure 2.1: Land Registry Transaction

Zoopla, the second biggest property website in the UK, with millions of detailed property listings and dynamically updated house status, is a good choice for us. However, it does not provide a data set for free download, so a stable web crawling spider needs to be implemented to collect the data.

Figure 2.2: Example Zoopla on-sale House

Table 2.2 describes these fields in detail. One of our tasks is to research how to predict the label "price" based on the other attributes. Some features have a significant influence on house price; other features are not related to the price. We will discuss how we select these features in the next chapter.

Attribute Name Value


agent address 63 New Road, Chippenham
agent logo https://st.zoocdn.com/zoopla agent logo (248467).png
agent name Kingsley Pike Estate Agents
agent phone 01249 584154
category Residential
country England
country code gb
county Wiltshire
description Located on the Western fringes of the town centre, of-
fering excellent road links to both the town centre and
the M4 motorway, a one bedroom Grade Two Listed
terrace cottage. The accommodation briefly comprises:
Glazed porch, sitting room, kitchen, one bedroom and
bathroom. To the front there is a small mature gar-
den with range of shrubs and lawn. The property
benefits from gas central heating and is offered with
no onward chain. Glazed Entrance Porch: Door leads into
glazed entrance porch, with front door leading into the
Sitting Room. Sitting Room (12'0" x 11'05" (0.30m x
3.48m)): Window to the front which is secondary glazed
details url http://www.zoopla.co.uk/for-sale/details/44285866
displayable address Bristol Road, Chippenham, Wiltshire SN15
first published date 2017-06-28 09:42:45
floor plan https://lc.zoocdn.com/4300ce353e29e00b091c6f4d74f.jpg
image url https://li.zoocdn.com/229af9e680e6cbb5ee2b 255.jpg
last published date 2017-07-04 10:33:33
latitude 51.465588
listing id 44285866
listing status sale
longitude -2.134677
num bathrooms 1
num bedrooms 1
num floors 0
num recepts 1
outcode SN15
post town Chippenham
price 169950
property type Terraced
status for sale

Table 2.2: Example value of crawling data fields



2.3 Investment Estate Metrics


Here I would like to list some useful real-estate metrics for reference when buying or
renting houses[4].

1. Number of Days on the Market (Sale Speed)

This metric indicates whether a property is priced too high or has too many issues. Good properties in the right neighbourhoods with correct prices usually spend the fewest days on the market.

2. Return on Investment (ROI)

$$\mathrm{ROI} = \frac{\text{Annual Rental} - \text{Annual Expense}}{\text{Property Cost}} \times 100$$

This number is important because it is an estimate of your potential return. The capitalization rate is the rate of return based on a real estate investment property's income.

3. Debt-to-Income Ratio

A standard owner-occupied home (buying a house to live in) should not have a debt-to-income ratio of more than 36%.

4. Price-to-Rent Ratio

$$\mathrm{PTR} = \frac{\text{Property Price}}{\text{Annual Rental}}$$

However, in this project we will use monthly rental price / property price for simplicity. Generally, an investor should follow a famous rule called the "1% rule": if monthly rental price / total price is bigger than 1%, buying such a property for rental is worthwhile. However, if you are looking for a house to live in, a ratio higher than 1% means buying is not efficient; instead, it is more efficient to rent.
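As a worked example with hypothetical numbers: a property priced at £200,000 that rents for £1,800 per month gives 1800/200000 = 0.9%, just below the 1% threshold, so under this rule it is not attractive as a rental purchase; at £2,200 per month the ratio is 1.1% and the rule is satisfied.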

2.4 Related Academic Project


We also found a graduate thesis by NG[5] researching housing price prediction in the London area. It compared different machine learning methods for housing price prediction, and a mobile application was implemented for searching properties. However, the main purpose of his work is comparing different machine learning methods, and there are several differences from our project:

1. Real-time housing market.
The housing market is changing dynamically, but his data are not the properties currently on the market, which makes them less useful for a user-facing application.

2. Training methods.
In his project, he tested many advanced machine learning methods such as Bayesian Linear Regression, Relevance Vector Machines, and Gaussian Processes. In our project, we use a simple but effective method, K-nearest neighbours, to predict the price of a house from its neighbours.

3. User needs.
In his project, the user can only define a budget, and the system provides the houses under that budget. In our project, we rank the properties based on housing metrics.

2.5 Summary
The public data set from the Land Registry, as discussed in Section 2.2, does not cover details of the house. Though Zoopla gives lots of information, it does not extract the useful parts or give users suggestions. NG's work has some parts in common with our project, but he does not give users specific suggestions. None of them achieve what this project sets out to do: an application providing customized house searching and dynamic recommendations for investors within the UK.
Chapter 3

Technical Preliminaries

3.1 Data Crawling Framework: Scrapy

Scrapy is a fast high-level web crawling and web scraping framework[6], used to crawl
websites and extract structured data from their pages. It can be used for a wide range of
purposes, from data mining to monitoring and automated testing. The following diagram
shows an overview of the Scrapy architecture with its components and an outline of the
data flow that takes place inside the system (shown by the red arrows).

Figure 3.1: Scrapy architecture[7]

The reason I chose Scrapy is, firstly, that Scrapy can crawl websites concurrently using asynchronous requests, which is fast and efficient. Secondly, it is well maintained by many developers and has a highly active community, which saved me a lot of time worrying about various network issues such as sessions. Furthermore, Scrapy's architecture is decoupled enough to allow developers to customize it to their needs. All of these advantages make it suitable for large-scale web crawling.
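As a minimal sketch of Scrapy's programming model (the start URL and CSS selectors below are hypothetical placeholders, not Zoopla's real markup; the project's actual spiders are described in Chapter 4):

import scrapy

class ListingSpider(scrapy.Spider):
    name = "listing_example"
    start_urls = ["https://www.example.com/for-sale/"]  # placeholder entry page

    def parse(self, response):
        # follow every listing link and hand the detail page to parse_house()
        for href in response.css("a.listing::attr(href)").extract():
            yield response.follow(href, callback=self.parse_house)

    def parse_house(self, response):
        # yield a structured item; Scrapy routes it through the item pipeline
        yield {
            "title": response.css("h1::text").extract_first(),
            "url": response.url,
        }

Run with scrapy runspider, the spider prints the scraped items; through the pipeline mechanism shown in Figure 3.1, the same items can be cleaned and stored.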

3.2 Statistics
3.2.1 Skewness
In probability theory and statistics, skewness is a measure of the asymmetry of the prob-
ability distribution of a real-valued random variable about its mean.[8]

Figure 3.2: Skewness. Source: [8]

Negative skew: The left tail is longer; the mass of the distribution is concentrated on the
right of the figure.
Positive skew: The right tail is longer; the mass of the distribution is concentrated on the
left of the figure.
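As a quick illustration, skewness can be computed directly; a small sketch with made-up prices (scipy's skew function, not part of the project code):

import numpy as np
from scipy.stats import skew

# toy prices: one extreme value drags the right tail out
prices = np.array([95000, 120000, 135000, 150000, 1200000])
print(skew(prices))  # large positive value, i.e. positive skew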

3.2.2 T-Test
The t-test is one type of inferential statistic. It is used to determine whether there is a significant difference between the means of two groups.[9]
With all inferential statistics, we assume the dependent variable fits a normal distribution.
When we assume a normal distribution exists, we can identify the probability of a par-
ticular outcome. We specify the level of probability (level of significance) we are willing
to accept before we collect data (p less than 0.05 is a common value that is used). After
we collect data we calculate a test statistic with a formula. We compare our test statistic
with a critical value found on a table to see if our results fall within the acceptable level
of probability. Modern computer programs calculate the test statistic for us and also
provide the exact probability of obtaining that test statistic.
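A sketch of such a test with scipy (the error values below are made up; Section 6.2.3 applies the same idea to the real prediction errors):

from scipy.stats import ttest_ind

# hypothetical per-house prediction errors from two methods
errors_a = [0.03, 0.05, 0.04, 0.02, 0.06, 0.05]
errors_b = [0.08, 0.07, 0.09, 0.06, 0.10, 0.08]

t_stat, p_value = ttest_ind(errors_a, errors_b)
print(p_value < 0.05)  # True means the difference in means is significant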

3.3 K-nearest-Neighbour Algorithm


The original KNN algorithm uses a database in which the data points are separated into several classes to predict the classification of a new sample point.[10] Predictions are made for a new instance x by searching the entire training set for the K most similar instances (the neighbours) and summarizing the output variable of those K instances; for regression, this might be the mean output variable.
The essence of the KNN algorithm is to find the "nearest" neighbour items to an item that has some attribute to be predicted, e.g. the "price" attribute in this research. "Nearest" also means "most similar", "closest", etc.[11]

Figure 3.3: KNN Classification[12]

3.3.1 Euclidean distance


How to define a distance function is a key problem. Inspired by geometry, we can use Euclidean distance as the metric. Although it is usually used in 2D, it can also be used in higher dimensions:

$$D = \sqrt{(x_1 - y_1)^2 + (x_2 - y_2)^2 + \dots + (x_n - y_n)^2}$$

3.3.2 Predicted Value


After we get all the distances to the neighbours, we sort the distances and pick the K nearest neighbours. Given the query point $x_q$, we take the mean value of its k nearest neighbours:

$$\hat{f}(x_q) = \frac{\sum_{i=1}^{k} f(x_i)}{k}$$
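Putting the two formulas together, a self-contained sketch of KNN regression (illustrative only; the project's implementation is described in Section 4.2.5):

import math

def euclidean(x, y):
    # distance between two equal-length feature vectors
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))

def knn_predict(query, samples, labels, k):
    # rank samples by distance to the query and average the k nearest labels
    order = sorted(range(len(samples)), key=lambda i: euclidean(query, samples[i]))
    return sum(labels[i] for i in order[:k]) / k

# toy usage: predict a monthly rent from (beds, baths) neighbours
samples = [(1, 1), (2, 1), (3, 2), (2, 2)]
labels = [900, 1200, 1800, 1400]
print(knn_predict((2, 1), samples, labels, k=3))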

3.4 Linear Regression


Linear regression involves a linear combination of independent variables to estimate a continuous dependent variable.[13]
When calculating the distance between the target and its neighbours, we treat every feature the same, but in reality some features may weigh more than others. For example, when measuring similarity, the geographic distance between two houses could be more important than the difference in the number of bathrooms. Thus, linear regression is introduced to obtain the weight of each feature, so as to improve the accuracy of the KNN algorithm.

3.4.1 Basis Function


Given a model h with solution space S and a training set (X, Y), a learning algorithm finds the solution that minimizes the cost function J(S). With h the predicted value and $x = (x_0, x_1, \dots, x_n)^T$ the vector of n independent variables, a linear regression model can be formulated as follows:

$$h(x, w) = \sum_{j=0}^{n} w_j \phi_j(x) + \varepsilon$$

where $\phi_j(x)$ is a basis function with corresponding parameter $w_j$, $\varepsilon$ is the noise term, M is the number of basis functions, and $w = (w_0, w_1, \dots, w_n)^T$.

3.4.2 Loss Function


A standard approach to solving this type of problem is to define an error function (also called a cost function) that measures how "good" a given prediction is. This function takes in a candidate function and returns an error value based on how well the line fits our data. To compute this error for a given line, we iterate through each $(x_i, h(x_i, w))$ point in our data set and sum the squared distances between each point's target value and the candidate line's value:

$$J(w) = \frac{1}{2n} \sum_{i=1}^{n} (h_w(x_i) - t_i)^2$$

3.4.3 Gradient Descent Algorithm

Gradient descent is a first-order iterative optimization algorithm for finding the minimum of a function.[14]

Data: parameter vector w
Result: parameter vector w
while not converged do
    $w_j := w_j - \alpha \frac{\partial}{\partial w_j} J(w)$  for j = 1...k
end
Algorithm 1: Pseudo-code for the Gradient Descent Algorithm

We first start from an initial parameter value and move downhill to find the line with the lowest error; that is, we repeat this update until the value no longer changes.
To run gradient descent on this error function, we first need to compute its gradient. The gradient will act like a compass and always point us downhill. To compute it, we need to differentiate our error function; since our function is defined by parameters, we need to compute a partial derivative for each.
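A minimal batch gradient-descent sketch for the linear model above (NumPy; a fixed iteration count stands in for a proper convergence test, and this is not the project's training code, which uses scikit-learn):

import numpy as np

def gradient_descent(X, t, alpha=0.1, iters=1000):
    # X: (n, m) design matrix (include a column of ones for the bias term)
    # t: (n,) targets; minimises J(w) = 1/(2n) * sum((X @ w - t) ** 2)
    n, m = X.shape
    w = np.zeros(m)
    for _ in range(iters):
        grad = X.T @ (X @ w - t) / n  # the partial derivatives of J(w)
        w -= alpha * grad
    return w

# toy usage: recover y = 1 + 2x
X = np.array([[1.0, 0.0], [1.0, 1.0], [1.0, 2.0], [1.0, 3.0]])
t = np.array([1.0, 3.0, 5.0, 7.0])
print(gradient_descent(X, t))  # approximately [1. 2.]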

Figure 3.4: Detailed Process of Gradient Descent[15]



Figure 3.5: Illustration of gradient descent

3.5 Feature-Weighted KNN

Euclidean distance treats every attribute as equal; however, in some real-world problems some attributes weigh more than others.[16]
In the feature-weighted KNN algorithm, every attribute is associated with a weight W, and the weights are learned by gradient descent and cross-validation. The distance between sample X1 and sample X2 is defined as:

$$D = \sqrt{W_1 (x_1 - y_1)^2 + W_2 (x_2 - y_2)^2 + \dots + W_n (x_n - y_n)^2}$$
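The change relative to plain Euclidean distance is small; a sketch:

import math

def weighted_euclidean(x, y, weights):
    # each squared feature difference is scaled by its learned weight W_i
    return math.sqrt(sum(w * (a - b) ** 2 for w, a, b in zip(weights, x, y)))

A feature with weight 0 is ignored entirely, while a large weight lets that feature dominate the distance.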
Chapter 4

Design and Implementation: Crawling and Prediction System

4.1 Web Crawling System


Because the number of houses on the UK market is more than a million, and new properties come up all the time, we have to find a good design to guarantee search speed as the house data grow large.
The first problem is how to design a data storage system that optimizes the speed of searching for and accessing a specific house among a million pieces of house information (approximately 3GB). Because of the limitations of my server, reading all the information into memory is not desirable. Though building a database could solve the problem, the search time complexity is not constant (at least O(log N)) and frequent IO would make the whole crawling process slow.

4.1.1 Design

The requirements of the system should be:

1. Take up no more than 60% of the memory; the rest of the memory is kept for the spiders.

2. Operations should be fast: constant time, O(1).

3. Easy to modify, updating the information when a house's status changes.

Luckily, every house on the market (Zoopla) has a unique ID number, which means we can map all the information as in a relational database, but through a hash-table file.


Figure 4.1: Structure of the crawling spider

Figure 4.1 shows the structural design of my crawling spider. The spider puts an HTTP request into the job queue and gets the response from the Scrapy engine. The information is then extracted, wrapped as a House Item object, and put into the pipeline. Each item pipeline component is a Python class that implements a simple method; the pipeline is mainly used to clean the data and store it in local storage. The details are introduced in the implementation section.
The checking spider reads all the on-sale property IDs and checks whether each house has been updated; if a house has been removed from Zoopla, we regard it as sold. The rental spiders are implemented in the same way as the house sale spiders.

Figure 4.2: Tree hierarchy of the crawling spider

As Figure 4.2 shows, there are four spiders: two are responsible for crawling new properties and the other two check for updates on existing houses. Item objects for the different data items are defined in item.py, and users can customize crawling settings through settings.py.

Figure 4.3: Spider configuration in settings.py

Figure 4.3 shows the settings file for the Scrapy spider. I set the delay to 3 seconds and concurrent requests to 32, since overly frequent HTTP requests would cause Zoopla to restrict our IP address.

Figure 4.4: Mapping relation in different data files

In order to optimize search speed, I maintain two property tables as indexes for quick access to the different files. This is effectively like a database system, but instead of sorting the data and creating a tree structure[17], I use a hash table when accessing a specific item.

4.1.2 Implementation
I chose Python as the project programming language because it needs much less code than Java or C++, which means I can code faster, and it also has excellent libraries across a broad range of areas. The Linux server is an Amazon Web Services Ubuntu 16.04.3 instance with 3GB memory and 1 CPU.

Parsing House

def parse_house(self, response):
    if self.close_down:
        raise CloseSpider(reason='Usage exceeded')
    listing_id = response.css("html").re('listing_id":"(.*?)"')[0]
    if listing_id in self.house_id_dict:
        print("Find duplicate item and Drop! Drop id is %s" % listing_id)
        return
    else:
        # start parsing the elements
        self.house_id_dict[listing_id] = len(self.house_id_dict)
        title = response.css("h2.listing-details-h1::text").extract_first()
        price = transGBP(response.css("div.listing-details-price.text-price strong::text").extract_first().strip())
        street_address = response.css("div.listing-details-address h2::text").extract_first()
        num_of_bedrooms = response.css("span.num-icon.num-beds::text").extract_first()
        num_of_bathrooms = response.css("span.num-icon.num-baths::text").extract_first()
        num_of_receptions = response.css("span.num-icon.num-reception::text").extract_first()

Figure 4.5: Parse House Page


The parse_house() method is implemented to extract the house information from the returned response. All the information is then wrapped together as a House Item object and transported to the pipeline.

Optimize Speed and Memory Allocation

Firstly, while crawling, the spider needs to know which houses have already been crawled. After a month of crawling the data had grown to 2GB, and reading the whole file into memory just to locate a specific house is memory-consuming. As my design in Figure 4.1 shows, I maintain an ID list recording the on-sale houses on the market. Once a house is sold, its ID is wiped off that list and put into a sold list. In addition, when the program runs, the ID list is converted into a hash table, which speeds up the search time to O(1).
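As a minimal sketch of this idea (the file name and format are illustrative, not the project's actual layout; the real indexing code appears in Figure 4.6):

import json

def load_id_index(path):
    # a JSON list of on-sale listing ids becomes a dict for O(1) membership tests
    with open(path) as f:
        ids = json.load(f)
    return {listing_id: pos for pos, listing_id in enumerate(ids)}

on_sale = load_id_index("on_sale_ids.json")  # hypothetical file
if "44285866" in on_sale:
    print("already crawled, skip")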

Figure 4.6: ID Indexing in Spider



4.2 KNN and Feature Regression

To get better performance and a more accurate estimation, we expect feature-weighted KNN to come closer to the actual value. Based on Abidoye's research[18] and the data we have, we picked 5 features to train the linear regression model. Though we think there are other useful features, given the current progress of the project and the time limit, we had to make a compromise.
We are going to use linear regression to find weights for each feature, in order to get a better measurement of house similarity.
Feature Name     Description
num_bed          Number of bedrooms
num_bath         Number of bathrooms
coordinate       Latitude and longitude
property_type    Property type: flat, detached, semi-detached, etc.
monthly_view     The monthly clicks on the website

Table 4.1: House Features Description

Attributes                       Label
abs(X.num_bed - Y.num_bed)       abs((X.price - Y.price) / X.price)
abs(X.num_bath - Y.num_bath)
geographic_distance
abs(X.monthview - Y.monthview)

Table 4.2: Training Attributes and Label

We transform our original house features into training attributes as Table 4.2 shows. X and Y are two data points (houses), and abs means absolute value. Our attributes are the absolute differences between the two houses' feature values, and the label is the absolute price-difference ratio of X and Y; so if the house prices are the same, we consider them 100% similar. We made some transformations from the original features: we have the coordinates of each house, and we translate the coordinates into geographic (metre) distance via geodesics on an ellipsoid[19] (see the sketch after Table 4.3). As you may notice, we drop the feature property_type, which we discuss in the evaluation chapter. Here is an example feature vector in training:
2 1 2300 78
1 0 652 121
2 2 1437 17
0 1 325 54

Table 4.3: Sample Feature Vector
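For the geographic_distance attribute, an off-the-shelf geodesic routine can translate a pair of coordinates into metres. A sketch using the geopy library (an illustrative assumption: the thesis cites [19] for the method but does not name the library here):

from geopy.distance import geodesic

house_x = (51.465588, -2.134677)  # (latitude, longitude), as in Table 2.2
house_y = (51.454000, -2.144000)  # hypothetical nearby house
print(geodesic(house_x, house_y).meters)  # ellipsoidal distance in metres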



4.2.1 UK Postcode System

The reason we introduce the UK postcode system is that we ran into memory problems during further training and testing, and postcodes turned out to be a suitable way to solve them.
The postcode system was devised by the Royal Mail to enable efficient mail delivery to all UK addresses. Initially introduced in London in 1857, the system as we now know it became operational for most of the UK in the late seventies.[20]

Figure 4.7: UK Postcode Map (source: https://www.electricmarketing.co.uk/map.html)

The structure of a postcode is a one or two-letter postcode area code named after a
local city, town or area of London, one or two digits signifying a district in that region,
a space, and then an arbitrary code of one number and two letters. For example, the
postcode of the University of Roehampton in London is SW15 5PU, where SW stands
for south-west London.
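A simplified sketch of splitting an outcode (the part before the space) into its area letters and district digits:

import re

# area letters, then district digits with an optional trailing letter (e.g. EC1A)
OUTCODE = re.compile(r"^([A-Z]{1,2})(\d{1,2}[A-Z]?)$")

area, district = OUTCODE.match("SW15").groups()
print(area, district)  # -> SW 15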


Figure 4.8: UK Postcode Explanation[20]

4.2.2 Data Cleaning

We first print out the basic statistics of our data set. As Figure 4.9 shows, the data set contains noisy data, with a minimum price of £3,000 and a maximum price of up to 16 million pounds. The average house price in the UK is around £220,000; however, 2,116 houses are priced under £30,000 and 16,577 houses are valued at more than one million pounds. Prices far above or below this range are not normal. I then checked the skewness of the data set: it is positively skewed, with a value of more than 70. In order to get better performance in the actual prediction, we need to cut some extreme data points on both sides so that the data form a better distribution.

Figure 4.9: Crawling Data set Description

Figure 4.10: Extreme Price Count

In addition, we plotted the distributions of bedrooms and bathrooms; there are noisy data points there too that need cleaning.

Figure 4.11: Data set Skewness

Figure 4.12: Bedrooms and Bathrooms Distribution

Figure 4.13: Skewness After Filtering the Data

Figure 4.14: Unfiltered Housing Price Distribution



Figure 4.15: Filtered Housing Price Distribution

The histograms show that after the adjustments we have 3,565,350 properties left, and the data are much more symmetric, with a skewness of 0.38.

4.2.3 Optimization
The training process needs two inputs: an example house X1 and a target house X2. The attributes are the value differences between the two houses and the label is the price difference price(X1) - price(X2). It is computationally expensive for a single house to be compared with all the houses across the UK, and the model should not be fitted on pairs located in different regions; an example house only needs to be compared with houses in its local area. Instead of putting all the data into a single file, we can split the houses into small chunks by region. So I got the full list of UK city postcodes, put it into a Python set, and classified the houses into different regions based on their postcodes.

with open(house_info, "r") as r:
    for item in r:
        # load house file
        info = json.loads(item)
        # for each house
        for house_id, house in info.items():
            outpostcode = extract_region(house["postcode"])
            city_postcode = citypattern.search(outpostcode).group(0)
            # extract postcode and put into different region
            if city_postcode in region:
                region_set[city_postcode].append(house)

for area, area_houses in region_set.items():
    # create file for different regions
    with open("%s_region/%s.json" % (folder, area), "w") as f:
        area_dict = {}
        area_dict[area] = area_houses
        f.write(json.dumps(area_dict, ensure_ascii=False))

4.2.4 Training
Due to the huge number of paired data points ($C_{3565350}^{2}$, more than ten billion), training a model on the entire dataset would be computationally intractable, so sampling the data is a feasible solution.
I divided the dataset 2:1 for cross-validation. I chose the scikit-learn[21] library for the regression model; it also provides a min-max scaler that can normalize the data:

def data_transform(features, target):
    min_max_scaler = preprocessing.MinMaxScaler()
    features = feature_scaling(min_max_scaler, features)
    X_train, X_test, y_train, y_test = train_test_split(features, target,
                                                        test_size=0.33,
                                                        random_state=42)
    return (X_train, X_test, y_train, y_test)

Figure 4.16: Data Transform


For every region, we first take half of the data points as a set, and each data point is matched against half of the re-shuffled set.

for i, house_info in enumerate(region_houses_list):
    if house_info['price'] == -1 or house_info['property_type'] not in house_type_set:
        continue
    count += 1
    if count >= len(region_houses_list) / 2:
        break
    clip = region_houses_list[i+1:len(region_houses_list)]
    shuffle(clip)
    # clean rare housing type and low price property
    clip = clean_noise(clip, minimum_price, house_type_set)
    for index, house in clip[:round(len(clip.index)/2)].iterrows():
        # do feature difference
Figure 4.17: Shuffle the data points


In the end, we sampled the resulting data with a fraction of 0.3; overall, we take 12% of the data points. It took 11 hours to train the model over the 121 areas.
The code in Figure 4.18 is used to test each feature's influence on house similarity; we discuss the results in the evaluation chapter.

def single_feature(examples, targets):
    for i in examples.columns:
        data = examples[i].reshape(-1, 1)
        data_frame = data_transform(data, targets)
        linear_fit(data_frame)

def linear_fit(data_frame):
    lr_model = linear_model.LinearRegression()
    model = lr_model.fit(data_frame[0], data_frame[2])
    predictions = model.predict(data_frame[1])
    RMSE = round(mean_squared_error(data_frame[3], predictions), 3)
    score = round(model.score(data_frame[1], data_frame[3]), 3)

Figure 4.18: Single Feature Performance Training

4.2.5 KNN Implementation

We have two KNN methods: plain KNN and feature-weighted KNN. The only difference is their distance functions, so I decided to use a callback in Python and pass the distance function as a parameter.

Figure 4.19: KNN Process

The choice of K is a factor in accuracy: when K=1 the result is the nearest neighbour's value, and either too few or too many neighbours will influence the prediction. In this research, we look at K = 3, 5 and 10.

def house_KNN(house_info, neighbours, K, attribute_list,
              distance_method, weights):
    # apply distance method
    neighbours['distance'] = neighbours.apply(
        lambda row: distance_method(house_info, row, attribute_list, weights),
        axis=1)
    neighbours = neighbours.sort_values('distance')
    K_neighbours_list = neighbours[:K].copy()
    return K_neighbours_list

Figure 4.20: KNN Method


Chapter 5

Design and Implementation: Web Application

In this chapter, we cover the design and implementation of our web application. The web application consists of a front-end web interface, where the user can type in areas and a property type, and a server-side component that handles user queries while dynamically catching new properties coming onto the market.

5.1 Client Side


I chose the Python Flask web framework[22], which supports Jinja2[23] templates. Flask and Django are two of the most popular web frameworks for Python; however, Flask provides more simplicity, flexibility and fine-grained control, and it supports a large range of extensions and libraries. I chose the 'model-view-controller' (MVC) design pattern to develop the web application, because MVC gives a clear layout and splits functionality into different parts, letting developers focus on individual functionality.

Figure 5.1: Capture of Starting Page

The application allows the user to search areas or cities for different property types and returns a list of properties, each displayed as a flag on the Google Map. The ranking is sorted by the estimated price-to-rent ratio. If the user clicks on a flag, basic information shows up along with a list of neighbouring properties: a house-ID list containing the 5 most similar neighbours calculated by our KNN method. If users want to check these properties, they just click the link in the property pane to see the property details on Zoopla. In addition, under the map there is an area information table, which tells investors the metrics of the different sub-areas.

Figure 5.2: Example Properties Displayed on Google Map

Figure 5.3: Capture of Area Information

Flask uses a simple "magic" mechanism called "routing", which binds a URL to an action[24]. This helps the developer maintain the code and makes development fast.

@app.route('/', methods=['GET', 'POST'])
def home_page():
    form = SearchForm(request.form)
    if request.method == 'POST' and form.validate():
        name = form.name.data
        house_type = form.house_type.data
        area = server.parse_sentence(name)
        if area == "Value Error":
            error = "Looks like you entered an unknown region!"
            flash(error)
        else:
            return redirect("/results/" + area + "&" + house_type)
    return flask.render_template('index.html', form=form)

Figure 5.4: Routing in Flask


When implementing the display panel, I considered the Google Maps API, but that API has to be used from JavaScript, so I searched online for Python-based map libraries. Luckily, I found Flask-GoogleMaps, which satisfies all my needs: it is Python-friendly and easy to install. Just create a Map instance and pass it to the render_template() function, and the Google Map will be loaded in a second.

from flask_googlemaps import Map

mymap = Map(
    identifier="housemap",
    # centre coordinates
    lat=house_info.iloc[0].lat,
    lng=house_info.iloc[0].lon,
    # flags
    markers=redmark + greenmark,
    fit_markers_to_bounds=True,
    style="height:600px;width:800px;margin:0;"
)

Figure 5.5: Flask GoogleMap Instance

5.2 Server Side


There are two parts at server side, query handling and prediction system pipeline.

5.2.1 Database Design


I divide the data into three separate tables: Houses, Area, and SuperArea.
The Houses table records the information for each individual house. The Area table stores the information for each sub-district: as mentioned in Section 4.2.1, a UK postcode has two parts, a prefix (e.g. NG7) and a suffix (e.g. 2RD); in the prefix, the letters represent a city or an area, and the number represents a district in that city. The SuperArea table stores the whole-area information, such as NG, SW, E, etc.
Here are the fields of the Area and SuperArea tables:

• index: region index
• region: region code
• city: city name
• property_type
• rent_price: average rental price per month
• sale_price: average sale price
• sale_speed: average sale speed
• rent_speed: average rental speed
• s_monthview: average sale monthly view
• r_monthview: average rental monthly view
• PTR: price-to-rent ratio

Here are the data fields in the Houses table:

• listing_id: ID number on Zoopla
• crawling_time: the date when this property was crawled
• region: region code
• city: city name
• property_type
• price: sale price
• estimate_price: estimated rental price
• neighbour: an array of neighbours
• num_bed: bedrooms
• num_bath: bathrooms
• lat: latitude
• lon: longitude
• monthview: monthly view
• postcode
• PTR: price-to-rent ratio

5.2.2 Query Handling


A Server class is implemented to handle search queries. The server first parses the user input: if it is a valid word, it starts fetching the data from the database; otherwise it returns an error. I created a postcode dict to store the mapping between city names and postcode prefixes. Users can type either a city name like London or an area code like SW or SW7 to get different search scales. In addition, the search keyword is not case-sensitive.

def fetch_area_info(self, name, house_type):
    if name.lower() in {v.lower() for v in self.postcode_set['city']}:
        sql_query = "SELECT * FROM G53DT.SuperArea where UPPER(city) = %s AND property_type = %s ORDER BY PTR DESC;"
        name = name.upper()
    else:
        sql_query = "SELECT * FROM G53DT.Area where region LIKE %s AND property_type = %s ORDER BY PTR DESC;"
        name = "%" + name + "%"
    t = pd.read_sql(sql_query, self.engine, params=[name, house_type])
    return t

Figure 5.6: Fetch Area Information



5.2.3 Real-time Server Pipeline

Figure 5.7: Server Side Pipelines

The server dynamically captures new properties coming onto the market and puts each house into the KNN pipeline. The KNN pipeline generates an estimated rental price based on the data points we have and calculates the price-to-rent ratio automatically.
The other pipeline is the area pipeline. After the spider finishes, new houses are classified into different region files by the region classifier, and then the area pipeline starts processing: it calculates each area's results based on all the sold and rented houses in that area. I use Pandas1 to process the data and do the statistical analysis.
Besides the prediction system running on the server, several Python scripts also exist on the application server to support it. The Extractor() class extracts useful information from the raw data, and the DataFetcher() class is used to retrieve sold properties by ID.

1 https://pandas.pydata.org/
Chapter 6

Evaluation

In this chapter, we discuss the challenges we faced while working on the project. We then evaluate single-feature and multiple-feature linear regression performance on property similarity. After that, we apply the weights obtained from regression in feature-weighted KNN, and compare the accuracy of feature-weighted KNN and unweighted KNN for different values of K. Finally, a feature ablation study is conducted on every feature to see its individual effect on house price prediction.

6.1 Feature Regression Analysis


6.1.1 Curse of Dimensionality
In our original training features, the feature property_type has 18 different categories, and the training process requires all features to be numeric. I used a one-hot encoder1 to convert the strings into binary representations. However, this greatly increases the dimensionality, which makes the training time endless and causes memory overflow.
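A sketch of the encoding step with a recent scikit-learn (toy categories; the real feature has 18):

import numpy as np
from sklearn.preprocessing import OneHotEncoder

types = np.array([["Terraced"], ["Flat"], ["Detached"], ["Flat"]])
encoded = OneHotEncoder().fit_transform(types).toarray()
print(encoded.shape)  # one binary column per category: (4, 3)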
Since keeping this feature would consume too many resources, I began to wonder whether property type strongly relates to the rental price. I randomly chose 30 different cities/areas across the UK and compared the average rental prices:

1 One-hot encoding is a process by which categorical variables are converted into a form that can be provided to ML algorithms to do a better job in prediction.


Figure 6.1: 1-Bed Rental Price with Different House Types in Different Areas

The figure suggests there is no direct relation between rental price and property type, so I plotted a more detailed graph:

Figure 6.2: Monthly Rental Price with Different Property Types and Beds in the Same Area

On average, there is not a big price difference between these property types (except for 4-bed flats, probably because there are not many big flats in reality). I reasoned that if property type did significantly affect rental price, then with other variables held constant, we could use the standard deviation divided by the mean rental price to measure how much variation property type introduces. Later on, I researched online and found that this measurement does exist: it is called the "coefficient of variation"2.
I randomly picked 20 areas' data to do the statistics:

Figure 6.3: Coefficient of Variation Between Different Areas

The graph shows that more than 75% of the areas are between 5% and 15%: in more than 3/4 of the tested areas, the average price deviation from the mean is between 5% and 15% across property types. We cannot say this 5-15% is all due to property type. Be aware that we do not account for all related variables in this test, because it is not feasible to hold everything else constant; for example, we cannot make all the houses' furniture the same. Property type may be a factor in rental price, but this experiment tells us that differences in property type do not affect rental price significantly.
Thus, in order to keep the data training going and save memory, I decided to drop this feature.

6.1.2 Single Feature Linear Regression Analysis

I take each single feature as input to train a linear regression model, in order to measure each feature's influence on the price difference, and use cross-validation to test the trained model. The "predicted value" and "actual value" in the graphs are price-difference ratios.

2 The coefficient of variation is a measure of relative variability: the ratio of the standard deviation to the mean (average).

Figure 6.4: num_bath regression result

Figure 6.5: num_bed regression result

Figure 6.6: month_view regression result



Figure 6.7: geographic_distance regression result

The results above show that num_bed has the lowest RMSE, 0.41, which means this feature carries more weight than the others when estimating house price. Interestingly, geographic distance does not weigh much in house similarity, with a score of 0.001 and a coefficient of 0.103. The reason may be that our training examples come from the same area: when properties are close to each other within the same region, this factor contributes less to the price difference.

6.1.3 Multiple Features Linear Regression


The result is slightly better than single-feature regression, with a score of 0.165.

Figure 6.8: Multiple Features Training Result

Due to the limited features and many other factors, the linear regression performance is not very satisfying. However, this is a real-world problem that is not as idealistic as artificial statistics, and the features and labels we have may not be exactly accurate. We will first apply the coefficients in the K-nearest-neighbour algorithm to see the accuracy.

6.2 KNN Prediction Result

KNN is an instance-based learning method, which means it has to consider all the instances every time it makes a hypothesis.[25] It is time-consuming to test on every example, so we decided to sample for testing our regression result. I randomly chose 30 areas, and from each area we take 10% but no more than 100 properties. In total, we have 2,123 properties to test.

6.2.1 Choice of K

I chose 3, 5 and 10 as the K values and evaluated the prediction accuracy. Blue lines are the feature-weighted KNN errors and yellow lines the unweighted KNN errors. The X-axis represents different UK areas; because one area had too few data points, only 29 areas are counted.

Figure 6.9: K=3 Error Rate

Figure 6.10: K=5 Error Rate



Figure 6.11: K=10 Error Rate

Figure 6.12: Average Error Rate for different K values

The graphs show little difference between K = 3 and K = 5; however, when K goes up to 10, the error rate of the unweighted KNN method reaches about 10%.
Overall, feature-weighted KNN gives the better performance on price prediction, with an average error rate of 4.2% for K = 3 and K = 5; unweighted KNN also gives a reasonable result, with an average error rate of 8% for the same K values.
I chose K = 5 for the final software: it puts more neighbours into the price estimate than K = 3, which averages the result more and makes extreme outcomes less likely, while keeping almost the same accuracy as K = 3.
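
For reference, the feature-weighted prediction can be sketched as below: distances are scaled by the regression coefficients before the K nearest neighbours' prices are averaged. This is a minimal sketch, not the exact project code; unweighted KNN is the special case where all weights equal 1.

    import numpy as np

    def weighted_knn_predict(X_train, y_train, x_query, weights, k=5):
        # Features with larger regression coefficients contribute
        # more to the distance measure.
        diffs = (X_train - x_query) * weights
        dists = np.sqrt((diffs ** 2).sum(axis=1))
        nearest = np.argsort(dists)[:k]
        # Predicted price: mean over the K nearest neighbours.
        return y_train[nearest].mean()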

6.2.2 Property Feature Ablation Study

Having fixed the value of K, we conduct a feature ablation study to quantify each feature's effect on price prediction.

Figure 6.13: Single Feature KNN Testing Error Rate 1

Figure 6.14: Single Feature KNN Testing Error Rate 2

Feature        Error rate
num bath       17.4%
num bed        11.1%
monthly view   18.0%
geo distance   10.1%

Table 6.1: Error rate of Single Feature KNN Prediction

Figures 6.13 and 6.14 give the error rate for each single-feature KNN test, on the same data set as the previous testing. Figure 6.14 shows the results for "num bath" and "num bed", and Figure 6.13 the results for "geographic distance" and "monthly view".
In Figure 6.15, 0 represents "num bath", 1 represents "num bed", 2 represents "monthly view" and 3 represents "geographic distance". The number of bedrooms and geographic distance have relatively lower errors than the other two, with error rates of 11.1% and 10.1%, while "monthly view" has the highest error rate of 18.0%. Interestingly, even when these features are combined in unweighted KNN, as in the previous section, the error rate is around 8%, which is not a large improvement. The feature-weighted KNN, however, shows a significant improvement: its error rate is less than half that of even the best single-feature KNN.

Figure 6.15: Single Feature KNN Prediction Error Rate Comparison

6.2.3 T-Test and Summary

Figure 6.16: T-Test on two Prediction Methods

We run a t-test to check whether the weighted and unweighted KNN prediction results are statistically different. We compare two vectors: the first holds the differences between the feature-weighted KNN predictions and the actual prices, and the second holds the differences between the unweighted KNN predictions and the actual prices. The result is a p-value of 1.117 × 10⁻⁹ < 0.05, so we can reject the null hypothesis that the two methods perform the same.
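
Since both methods were evaluated on the same test properties, a paired t-test is one natural choice; with SciPy it can be sketched as follows (the array names are assumptions):

    from scipy import stats

    # err_weighted[i] and err_unweighted[i] are the price differences
    # (prediction minus actual) for the same test property i.
    t_stat, p_value = stats.ttest_rel(err_weighted, err_unweighted)
    reject_null = p_value < 0.05  # p = 1.117e-9 in the experiment above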
In terms of error rate, the unweighted method's error is nearly double that of our feature-weighted KNN. I did not expect the method to give nearly 95% accuracy on actual predictions; that said, in extreme cases such as luxury houses it may not work as well as it does for ordinary houses. Overall, the project's result is good.
Chapter 7

Summary and Reflections

In this chapter, I summarize the progress made and the problems encountered while working on this project, and provide suggestions for future work and experiments.

7.1 Project Management


I started this project in August 2017 and finished in April 2018. Working on this dissertation took much more effort than I initially expected; however, I also gained things I had never anticipated.
In August and September, I was busy building and debugging the crawling system. After finishing the real-time crawling system, I started researching economic metrics and housing investments. I also researched learning-to-rank methods for recommendation.
At that point, I was not sure what the next step should be, because there were so many directions to choose from. I was a bit anxious, so I talked to my supervisor. During that period, a friend of mine, who was also doing a project under Dr Zhou, changed his topic; since we had planned to share some resources, I discussed the change with my supervisor and we adjusted our plan to fit the time. I then worked on my interim report until Christmas.
In the middle of January, I started preparing and cleaning the data for linear regression, which took quite a lot of time. Meanwhile, we were also preparing for semester exams, so progress slowed down. During that period, I was also taking a statistics course for machine learning, and I learned many techniques in the Pandas data analysis library for Python; I now consider myself proficient with Pandas.
After finishing and evaluating the linear regression and KNN methods, the project moved to its final stage, and I started to design and implement the web application. I have experience in web development, so it did not take me too long: I made a simple interface and finished the server's back end at the beginning of April. I then worked fully on my final report until 10th April.


7.2 Contributions and Reflections


7.2.1 Project Contribution and Reflection
We have managed to produce a web application that gives users real-time recommendations on properties for investment, based on property features. Users can monitor areas and catch investment opportunities on the housing market instantly. Machine learning methods have been explored, tested and finally applied in the software.
There are many directions for future work. Firstly, more property features, such as pictures and house descriptions, could be used to obtain more accurate predictions. Moreover, time-series methods could provide price trends to users. In addition, more complex combinations of metrics could be applied to deliver more professional recommendations.

7.2.2 Personal Reflection


The project is ambitious: unlike a coursework assignment or making a game, it is a real-world problem. There was no prepared training data, and I had to collect it myself, which was a tough and long process. However, I now understand the workflow of a machine learning project and how to build one from scratch. More importantly, through this project I have become more self-motivated in solving problems and learning new skills, and more independent, though I still need my supervisor's guidance sometimes. These are things I could not have learned from any specific module.
The weekly meetings with my supervisor, Dr Ke Zhou, were very beneficial. In each meeting we discussed my progress and the problems I had met during the week, and he offered his ideas and possible solutions. The questions put forward during these sessions motivated me to explore other subjects such as statistics and economics. I now realise how valuable this was: through these meetings, again and again, my supervisor taught me one thing that matters more than domain knowledge, namely how to communicate efficiently with people. This is still not easy for me, but I think he is right: expressing your ideas and work simply can be more important than the work itself.
Bibliography

[1] UK House Price Index (UK HPI) annual review 2017. URL: https://www.gov.uk/government/news/uk-house-price-index-uk-hpi-annual-review-2017.
[2] UK House Price Index England: January 2018. URL: https://www.gov.uk/government/publications/uk-house-price-index-england-january-2018/uk-house-price-index-england-january-2018.
[3] Byeonghwa Park and Jae Kwon Bae. "Using machine learning algorithms for housing price prediction: The case of Fairfax County, Virginia housing data". In: Expert Systems with Applications 42.6 (2015), pp. 2928–2934. ISSN: 0957-4174. DOI: https://doi.org/10.1016/j.eswa.2014.11.040.
[4] Diala Taneeb. What Are the Main Metrics That Real Estate Investors Look at When Deciding Which Property to Invest In? URL: https://mobe.com/main-metrics-real-estate-investors-look-deciding-property-invest/?aff_id=5538. (Accessed: 01.09.2018).
[5] Aaron NG. "Machine Learning for a London Housing Price Prediction Mobile Application". In: (2015).
[6] Scrapy at a glance. URL: https://docs.scrapy.org/en/latest/intro/overview.html. (Accessed: 01.08.2017).
[7] Architecture overview. URL: https://doc.scrapy.org/en/latest/topics/architecture.html.
[8] Skewness. URL: https://en.wikipedia.org/wiki/Skewness.
[9] T TEST. URL: https://researchbasics.education.uconn.edu/t-test/.
[10] Oliver Sutton. "Introduction to k Nearest Neighbour Classification and Condensed Nearest Neighbour Data Reduction". In: (2012).
[11] Weikun Zhao. "The Research on Price Prediction of Second-hand Houses Based on KNN and Stimulated Annealing Algorithm". In: (2014). DOI: http://dx.doi.org/10.14257/ijsh.2014.8.2.19.
[12] k-nearest neighbors algorithm. URL: https://en.wikipedia.org/wiki/K-nearest_neighbors_algorithm.


[13] Christopher M. Bishop. Pattern Recognition and Machine Learning. Springer, 2006. ISBN: 0387310738.
[14] Gradient descent wiki. URL: https://en.wikipedia.org/wiki/Gradient_descent.
[15] Matt Nedrich. An Introduction to Gradient Descent and Linear Regression. URL: https://spin.atomicobject.com/2014/06/24/gradient-descent-linear-regression/.
[16] Ming Zhao. "Improvement and Comparison of Weighted k Nearest Neighbors Classifiers for Model Selection". In: Journal of Software Engineering (2016). DOI: 10.3923/jse.2016.109.118.
[17] How is MySQL implemented? URL: https://www.quora.com/How-is-MySQL-implemented-What-data-structures-are-used-Are-there-any-unique-tricks-or-optimizations-that-the-developers-employed-that-allowed-for-the-fast-query-time.
[18] Rotimi Boluwatife Abidoye. "Factors That Influence Real Estate Project Investment: Professionals' Standpoint". In: (2016). URL: openbooks.uct.ac.za/cidb/index.php/cidb/catalog/download/3/1/134-2.
[19] Geodesics on an ellipsoid. URL: https://en.wikipedia.org/wiki/Geodesics_on_an_ellipsoid.
[20] Postcodes in the United Kingdom. URL: https://en.wikipedia.org/wiki/Postcodes_in_the_United_Kingdom.
[21] scikit-learn. URL: http://scikit-learn.org/stable/.
[22] Flask. URL: http://flask.pocoo.org/.
[23] Jinja2. URL: http://jinja.pocoo.org/.
[24] Routing in Flask. URL: http://flask.pocoo.org/docs/0.12/quickstart/.
[25] David W. Aha, Dennis Kibler, and Marc K. Albert. "Instance-based learning algorithms". In: Machine Learning 6.1 (1991), pp. 37–66. ISSN: 1573-0565. DOI: 10.1007/BF00153759. URL: https://doi.org/10.1007/BF00153759.
