Implementation and Study of K-Nearest Neighbour
Shujia Zhao
4274913
G53IDS
Supervised by Ke Zhou
April 2018
Abstract
Acknowledgements
I would like to thank Dr Ke Zhou for supervising my project. Without his willingness and
ability to help, I could not have finished this project successfully. In addition, I would
like to thank my friends and open source developers for their kind help. Lastly, I would
like to thank my parents for their continuous support and unconditional love.
Contents
Abstract i
Acknowledgements iii
1 Introduction 1
1.1 Motivation 1
2.5 Summary 8
3 Technical Preliminaries 9
3.2 Statistics 10
3.2.1 Skewness 10
3.2.2 T-Test 10
4.1.1 Design 15
4.1.2 Implementation 18
4.2.3 Optimization 24
4.2.4 Training 25
6 Evaluation 34
6.2.1 Choice of K 39
6.2.2 Property Feature Ablation Study 40
6.2.3 T-Test and Summary 42
Bibliography 44
List of Tables
2.1 Average price by country and government office region, January 2018 4
2.2 Example values of crawled data fields 6
List of Figures
6.1 1-Bed Rental Price with Different House Types in Different Areas 35
6.2 Monthly Rental Price with Different Property Types and Beds in the Same Area 35
Chapter 1
Introduction
1.1 Motivation
Finding a property worth investing in within the dynamic property market is a challenging problem. Many house buyers are confused when they first enter the market, leaving them vulnerable to dishonest estate agents who may take advantage of them. Though there are many property websites in the UK, it is rare to see any that can give investment suggestions to house buyers. House prices differ across regions; in England alone there are more than 300 areas. It is not realistic for an estate agency to cover all of them and give investors an accurate estimation.
As Figure 1.1 shows, regional price changes in the UK have different trends and volatility. Scotland nearly doubled its rate of price change while the indices of other regions declined slightly from 2016 to 2017. Differences in the housing market lead to locality among house agencies; however, relying on traditional estate valuers alone cannot satisfy the growing needs of the housing market. This is where machine learning methods might help people find potentially valuable properties.
Machine learning has been used in disciplines such as business, engineering and finance. As Park and Bae mention in their study, machine learning algorithms can enhance the predictability of housing prices and significantly contribute to the correct evaluation of real estate prices [3]. Many researchers and developers have already tried to apply machine learning methods to house price prediction. Even on Kaggle¹, a machine learning competition platform, there are competitions on house price prediction.
1. Implement a robust web crawling spider which can collect property information from the market and monitor property status changes (on sale to sold) in real time.
2. Implement a prediction system which estimates a property's rental income based on its neighbours, then uses the estimated price to calculate estimated metrics for the property and its area.
Here are the requirements which should be achieved for this project:
1. The spider should be able to crawl enough house features and store them in a local database.
2. The spider should be fast enough to track a house's status in real time.
3. A regression model should be trained to obtain the weights for the KNN features.
4. An effective house pipeline should be implemented to clean the data and use the KNN algorithm to estimate the rental income and insert the results into the database.
6. The result should be a web application which can display a group of flags (houses) recommended on the Google map based on the user's requirements (location, property type etc.) and return detailed area metrics computed by the system.
² Zoopla is the second biggest online housing platform in the UK.
Chapter 2

Background and Related Work
In this chapter, we will first give an overview of the housing price distribution and the existing housing platforms in the UK. We will look at the data source that will be used in this project and explain the features we have. Then we will introduce some metrics used in property investment. Finally, we will discuss related work that we found during our research.
Table 2.1: Average price by country and government office region, January 2018
The UK Land Registry gives a general picture of the housing market, and as the table shows, London's housing prices are more than double those of most other areas, and southern areas are generally more expensive than northern ones. However, when it comes to small areas such as towns or a district within a city, the Land Registry cannot give us detailed information.
Zoopla, the second biggest property website in the UK, has millions of detailed property listings with dynamically updated house status, which makes it a good choice for us. However, it does not provide a freely downloadable data set, so a stable web crawling spider needs to be implemented to collect the data.
Table 2.2 describes the details of these fields. One of our tasks is to research how to predict the label "Price" based on the other attributes. Some features have a significant influence on house price, while other house features are not related to the price. We will discuss how we select these features in the next chapter.
2. Return on Investment (ROI)
3. Debt-To-Income Ratio
A standard owner-occupied home (buying a house to live in) should not have a debt-to-income ratio of more than 36%.
4. Price-to-Rent Ratio
$$\text{PTR} = \frac{\text{Property Price}}{\text{Annual Rental}}$$
However, in this project we will use monthly rental price / property price for simplicity. Generally, an investor should follow a famous rule called the "1% rule": if the monthly rental price divided by the total price is greater than 1%, buying such a property to rent out is worthwhile. However, if you are looking for a house to live in, a ratio higher than 1% means buying is not efficient; instead, it is more efficient to rent.
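To make the rule concrete, here is a minimal sketch in Python; the price and rent figures are illustrative, not taken from our data set.

def monthly_rent_to_price_ratio(monthly_rent, property_price):
    # monthly rental price / property price, as used in this project
    return monthly_rent / property_price

price = 150000  # hypothetical asking price in GBP
rent = 1600     # hypothetical monthly rent in GBP
if monthly_rent_to_price_ratio(rent, price) > 0.01:
    print("1% rule satisfied: worth buying to rent out")
else:
    print("ratio below 1%: renting is more efficient than buying")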
1. Data source.
The housing market changes dynamically, but his data does not include the properties currently on the market, which is not very useful when it comes to a user-facing application.
2. Training methods.
In his project, he tested many advanced machine learning methods such as Bayesian Linear Regression, Relevance Vector Machines and Gaussian Processes. In our project, we will use a simple but useful method, the K-Nearest Neighbour method, to predict the price of a house from its neighbours.
3. User needs.
In his project, the user can only define a budget and the system will provide the houses under that budget. In our project, we will rank the properties based on housing metrics.
2.5 Summary
The public data set from the Land Registry, as discussed in Section 2.2, does not cover the details of each house. Though Zoopla gives lots of information, it does not extract this useful information to give users suggestions. NG's work has some parts in common with our project, but he does not give users specific suggestions. None of them has reached what this project sets out to achieve: an application providing customized house searching and dynamic recommendations for investors within the UK.
Chapter 3
Technical Preliminaries
3.1 Scrapy

Scrapy is a fast, high-level web crawling and web scraping framework[6], used to crawl websites and extract structured data from their pages. It can be used for a wide range of purposes, from data mining to monitoring and automated testing. The following diagram shows an overview of the Scrapy architecture with its components and an outline of the data flow that takes place inside the system (shown by the red arrows).
I chose Scrapy because, firstly, it can crawl websites concurrently using multiprocessing and asynchronous requests, which is fast and efficient. Secondly, it is well maintained by many developers and has a highly active community, which saves me a lot of time worrying about various network issues such as sessions. Further-
3.2 Statistics
3.2.1 Skewness
In probability theory and statistics, skewness is a measure of the asymmetry of the probability distribution of a real-valued random variable about its mean [8].

Negative skew: the left tail is longer; the mass of the distribution is concentrated on the right of the figure.

Positive skew: the right tail is longer; the mass of the distribution is concentrated on the left of the figure.
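As a quick illustration, the skewness of a sample can be checked with SciPy; the numbers below are made up to show a long right tail, much like house prices.

import numpy as np
from scipy.stats import skew

# most values are small, a few are very large -> positive (right) skew
prices = np.array([90, 100, 110, 120, 130, 150, 900], dtype=float)
print(skew(prices))  # prints a positive value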
3.2.2 T-Test
The t-test is a type of inferential statistic. It is used to determine whether there is a significant difference between the means of two groups [9].

As with all inferential statistics, we assume the dependent variable fits a normal distribution. When we assume a normal distribution exists, we can identify the probability of a particular outcome. We specify the level of probability (level of significance) we are willing to accept before we collect data (p less than 0.05 is a common value). After we collect data, we calculate a test statistic with a formula and compare it with a critical value found in a table to see if our results fall within the acceptable level of probability. Modern computer programs calculate the test statistic for us and also provide the exact probability of obtaining that test statistic.
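For example, a two-sample t-test can be run with SciPy; the two synthetic groups below are only an illustration.

import numpy as np
from scipy.stats import ttest_ind

rng = np.random.default_rng(0)
group_a = rng.normal(loc=5.0, scale=1.0, size=30)  # mean 5.0
group_b = rng.normal(loc=5.8, scale=1.0, size=30)  # mean 5.8
t_stat, p_value = ttest_ind(group_a, group_b)
# reject the null hypothesis of equal means when p_value < 0.05
print(t_stat, p_value)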
3.3 K-Nearest-Neighbour Algorithm
Given a query point $x_q$, take the mean value of its $k$ nearest neighbours:

$$\hat{f}(x_q) = \frac{\sum_{i=1}^{k} f(x_i)}{k}$$
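A minimal sketch of this estimator, assuming plain Euclidean distance over numeric feature vectors; the tiny data set is hypothetical.

import numpy as np

def knn_predict(query, examples, targets, k):
    # mean target value of the k examples closest to the query point
    distances = np.linalg.norm(examples - query, axis=1)
    nearest = np.argsort(distances)[:k]
    return targets[nearest].mean()

X = np.array([[2.0, 1.0], [1.0, 0.0], [3.0, 2.0]])  # feature vectors
y = np.array([200000.0, 150000.0, 280000.0])        # prices
print(knn_predict(np.array([2.0, 1.0]), X, y, k=2))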
$$y(\mathbf{x}, \mathbf{w}) = \sum_{j=0}^{M-1} w_j \phi_j(\mathbf{x}) + \epsilon$$

where $\phi_j(\mathbf{x})$ is a basis function with corresponding parameter $w_j$, $\epsilon$ is the noise term, $M$ is the number of basis functions and $\mathbf{w} = (w_0, w_1, \ldots, w_{M-1})^T$.
First, start from an initial parameter value and move downhill to find the line with the lowest error; that is, repeat the algorithm until the value no longer changes.

To run gradient descent on this error function, we first need to compute its gradient. The gradient acts like a compass and always points us downhill. To compute it, we need to differentiate our error function. Since our function is defined by several parameters, we need to compute a partial derivative for each.
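As a sketch, one gradient descent step for a line y = wx + b under mean squared error looks like this; the learning rate and data are illustrative.

import numpy as np

def gradient_descent_step(w, b, X, y, lr):
    error = (w * X + b) - y          # prediction error per point
    grad_w = 2 * np.mean(error * X)  # partial derivative w.r.t. w
    grad_b = 2 * np.mean(error)      # partial derivative w.r.t. b
    return w - lr * grad_w, b - lr * grad_b

X = np.array([1.0, 2.0, 3.0])
y = np.array([2.0, 4.0, 6.0])
w = b = 0.0
for _ in range(1000):  # repeat until the values settle
    w, b = gradient_descent_step(w, b, X, y, lr=0.05)
print(w, b)            # approaches w = 2, b = 0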
Chapter 4

Design and Implementation: Crawling and Prediction System

4.1 Web Crawling System

4.1.1 Design
1. Take up no more than 60% of the memory; the rest of the memory is kept for the spiders.
3. Be easy to modify, updating the information when a house's status has changed.
Luckily, every house on the market (Zoopla) has a unique id number, which means we can map all the information as in a relational database, but through a hash table file.
Figure 4.1 shows the structural design of my crawling spider. The spider puts an HTTP request into the job queue and gets the response from the Scrapy engine. The information is then extracted, wrapped as a House Item object and put into the pipeline. Each item pipeline component is a Python class that implements a simple method. The pipeline is mainly used to clean the data and store it in local storage. The details will be introduced in the implementation section.
The checking spider reads all the on-sale property IDs and checks whether each property has been updated; if a house has been removed from Zoopla, we regard it as sold. The spider for rentals is implemented in the same way as the sales spider.
As Figure 4.2 shows, there are four spiders: two of them are responsible for crawling new properties and the other two check whether there are any updates on existing houses. Item objects are defined in item.py for the different data items, and users can also customize crawling settings through settings.py.
Figure 4.3 shows the settings file for the Scrapy spider. I set the delay to 3 seconds and the number of concurrent requests to 32, since overly frequent HTTP requests will cause Zoopla to restrict our IP address.
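A sketch of the relevant part of settings.py; DOWNLOAD_DELAY and CONCURRENT_REQUESTS are standard Scrapy settings and their values come from the text, while the other lines are illustrative.

# settings.py (excerpt)
BOT_NAME = 'house_spider'  # hypothetical project name
DOWNLOAD_DELAY = 3         # wait 3 seconds between requests
CONCURRENT_REQUESTS = 32   # allow up to 32 requests in flight
ROBOTSTXT_OBEY = True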
In order to optimize search speed, I maintain two property tables as indexes for quick access to the different files. This is much like a database system, but instead of sorting the data and creating a tree structure[17], I use a hash table when accessing a specific item.
4.1.2 Implementation
I chose Python as the project's programming language because it requires much less code than Java or C++, which means I can code faster, and it has excellent libraries across a broad range of areas. The Linux server is an Amazon Web Services Ubuntu 16.04.3 instance with 3GB of memory and 1 CPU.
Parsing House

if self.close_down:
    raise CloseSpider(reason='Usage exceeded')
listing_id = response.css("html").re('listing_id":"(.*?)"')[0]
if listing_id in self.house_id_dict:
    print("Find duplicate item and Drop! Drop id is %s" % listing_id)
    return
else:
    # start parsing the elements
    self.house_id_dict[listing_id] = len(self.house_id_dict)
    title = response.css("h2.listing-details-h1::text").extract_first()
    price = transGBP(response.css(
        "div.listing-details-price.text-price strong::text"
    ).extract_first().strip())
    street_address = response.css(
        "div.listing-details-address h2::text").extract_first()
    num_of_bedrooms = response.css(
        "span.num-icon.num-beds::text").extract_first()
    num_of_bathrooms = response.css(
        "span.num-icon.num-baths::text").extract_first()
    num_of_receptions = response.css(
        "span.num-icon.num-reception::text").extract_first()
Firstly, when the spider is crawling, it needs to know which houses have already been crawled. After crawling for a month, the data had grown to 2GB; reading the whole file into memory just to locate a specific house is memory-consuming. As my design in Figure 4.1 shows, I maintain an id list recording the on-sale houses on the market. Once a house is sold, its id is wiped off that list and put into a sold list. In addition, when the program runs, the id list is converted into a hash table, which speeds the lookup up to O(1).
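A minimal sketch of this bookkeeping, with hypothetical listing ids; the dict gives O(1) membership tests, and sold ids move to a separate set.

on_sale_ids = ["100001", "100002"]  # hypothetical listing ids
on_sale = {lid: pos for pos, lid in enumerate(on_sale_ids)}
sold = set()

def mark_sold(listing_id):
    if listing_id in on_sale:  # O(1) hash lookup instead of a scan
        del on_sale[listing_id]
        sold.add(listing_id)

mark_sold("100002")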
Attributes                          Label
abs(X.num_bed - Y.num_bed)          abs((X.price - Y.price) / X.price)
abs(X.num_bath - Y.num_bath)
geographic distance
abs(X.monthview - Y.monthview)
We transform our original house features into training attributes as Table 4.2 shows. X and Y are two data points (houses), and abs means absolute value. Our attributes are the absolute differences between the two houses' features, and the label is the absolute price difference ratio of X and Y. So, if the house prices are the same, we consider the houses 100% similar. We made some transformations from the original features to the current ones: we obtained each house's coordinates at the start and translated the coordinates into geographic distance (in metres) using geodesics on an ellipsoid[19]. As you may notice, we drop the feature property type, which we will discuss in the evaluation chapter. Here are some example feature vectors from training:
num_bed diff  num_bath diff  distance (m)  monthview diff
2             1              2300          78
1             0              652           121
2             2              1437          17
0             1              325           54
The reason we introduce the UK postcode system here is that we met memory problems during further training and testing, and postcodes turned out to be a suitable way to solve the problem.
The postcode system was devised by the Royal Mail to enable efficient mail delivery to all UK addresses. Initially introduced in London in 1857, the system as we now know it became operational for most of the UK in the late seventies [20].
The structure of a postcode is a one- or two-letter postcode area code named after a local city, town or area of London, one or two digits signifying a district in that region, a space, and then an arbitrary code of one number and two letters. For example, the postcode of the University of Roehampton in London is SW15 5PU, where SW stands for south-west London.
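A simplified sketch of extracting the outward code (area plus district) with a regular expression; real postcodes have a few more variants, so this pattern is illustrative rather than exhaustive.

import re

# one or two letters, then one or two digits (optionally a trailing
# letter, as in some central London districts such as SW1A)
OUTWARD = re.compile(r'^([A-Z]{1,2})(\d[A-Z\d]?)')

match = OUTWARD.match("SW15 5PU")
print(match.group(1), match.group(2))  # prints: SW 15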
¹ https://www.electricmarketing.co.uk/map.html
We first print out basic statistics of the data set. As Figure 4.9 shows, the data set contains noisy data, with a minimum price of £3,000 and a maximum of up to £16 million. The average price in the UK is £220,000; however, 2,116 houses are priced under £30,000 and 16,577 houses are valued at more than one million pounds. Prices far above or below the normal range are outliers. I then checked the skewness of the data set: it is positively skewed, with a value of more than 70. In order to get better performance in the actual prediction, we need to cut some extreme data points from both sides so that the data forms a better distribution.
In addition, we plotted the distributions of bedrooms and bathrooms, and there are also noisy data points that need to be cleaned.
The histograms show that after the adjustments we have 3,565,350 properties left, and the data is much more symmetric, with a skewness of 0.38.
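A minimal sketch of this kind of two-sided cut in pandas; the frame is hypothetical and the bounds echo the £30,000 and £1,000,000 figures discussed above.

import pandas as pd

houses = pd.DataFrame({"price": [25000, 150000, 220000, 2500000]})
# drop extreme prices on both sides to reduce the positive skew
clipped = houses[(houses["price"] >= 30000) & (houses["price"] <= 1000000)]
print(clipped["price"].skew())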
4.2.3 Optimization
The training process needs two inputs: an example house X1 and a target house X2. The attributes are the value differences between the two houses, and the label is the price difference price(X1) - price(X2). It is computationally expensive to compare a single house with every house across the UK, and the model should not be applied to pairs located in different regions; an example house only needs to be compared with houses in its local area. Instead of putting all the data into a single file, we can split the houses into small chunks by region. So I obtained the full list of UK city postcodes, put it into a Python set, and classified the houses into different regions based on their postcodes.
import json

with open(house_info, "r") as r:
    for item in r:
        # load house file
        info = json.loads(item)
        # for each house
        for house_id, house in info.items():
            outpostcode = extract_region(house["postcode"])
            city_postcode = citypattern.search(outpostcode).group(0)
            # extract postcode and put into different region
            if city_postcode in region:
                region_set[city_postcode].append(house)
4.2.4 Training
Due to the huge number of paired data points ($C_{3565350}^{2}$, more than 10 billion pairs), training a model using the entire dataset would be computationally intractable. Thus, sampling the data is a feasible solution.
I divide our dataset 2:1 for cross-validation. I chose the scikit-learn[21] library for the regression model. It also provides a min-max scaler which can normalize the data:
from sklearn import preprocessing
from sklearn.model_selection import train_test_split

def data_transform(features, target):
    min_max_scaler = preprocessing.MinMaxScaler()
    features = feature_scaling(min_max_scaler, features)
    X_train, X_test, y_train, y_test = train_test_split(
        features, target, test_size=0.33, random_state=42)
    return (X_train, X_test, y_train, y_test)
clip = clean_noise(clip, minimum_price, house_type_set)
for index, house in clip[:round(len(clip.index) / 2)].iterrows():
    # do feature difference
from sklearn import linear_model
from sklearn.metrics import mean_squared_error

def single_feature(examples, targets):
    # fit a separate single-variable linear regression for each feature
    for i in examples.columns:
        data = examples[i].values.reshape(-1, 1)
        data_frame = data_transform(data, targets)
        linear_fit(data_frame)

def linear_fit(data_frame):
    lr_model = linear_model.LinearRegression()
    model = lr_model.fit(data_frame[0], data_frame[2])
    predictions = model.predict(data_frame[1])
    RMSE = round(mean_squared_error(data_frame[3], predictions), 3)
    score = round(model.score(data_frame[1], data_frame[3]), 3)
We have two KNN methods: plain KNN and feature-weighted KNN. The only difference between them is the distance function, so I decided to pass the distance function as a callback parameter in Python.
The choice of K is a factor in accuracy. When k = 1, the result is simply the nearest neighbour's value; either too few or too many neighbours will influence the prediction. In this research, we look at k = 3, 5 and 10.
def house_KNN(house_info, neighbours, K, attribute_list,
              distance_method, weights):
    K_neighbours_list = neighbours[:K].copy()
    return K_neighbours_list
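To illustrate the callback approach, here is a minimal sketch of the two distance functions; the weight values are placeholders, not the coefficients learned by the regression.

import math

def unweighted_distance(diffs, weights=None):
    # plain Euclidean distance over the attribute differences
    return math.sqrt(sum(d * d for d in diffs))

def weighted_distance(diffs, weights):
    # each squared difference is scaled by its learned weight
    return math.sqrt(sum(w * d * d for w, d in zip(weights, diffs)))

# the same KNN routine can take either function as distance_method
print(weighted_distance([1, 0, 652.0, 121], [0.8, 0.3, 0.001, 0.1]))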
Chapter 5

Design and Implementation: Web Application

In this chapter, we will cover the design and implementation of our web application. The web application consists of a client-side web interface, where the user can type in areas and property types, and a server-side component, which handles user queries while dynamically catching new properties coming onto the market.
5.1 Client Side

The application allows the user to search areas or cities with different property types and returns a list of properties, each displayed as a flag on the Google Map. The ranking is sorted by the estimated price-to-rent ratio. If the user clicks on a flag, basic information shows up along with a list of neighbouring properties. This neighbour list contains the house ids of the 5 most similar neighbours calculated by our KNN method. If users want to check these properties, they can simply click on the link in the property pane to see the property details on Zoopla. In addition, under the map there is an area information table, which tells investors the metrics of the different sub-areas.
Flask uses a simple "magic" mechanism called "routing" which binds a URL to an action [24]. This makes the code easier for the developer to maintain and makes development faster.
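As an illustration of routing, a minimal Flask sketch; the /search endpoint and its parameters are hypothetical, not the application's actual URLs.

from flask import Flask, request

app = Flask(__name__)

@app.route('/search')  # the decorator binds the URL to the action
def search():
    area = request.args.get('area', '')
    property_type = request.args.get('type', '')
    return "Searching for %s in %s" % (property_type, area)

if __name__ == '__main__':
    app.run()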
    # flags
    markers = redmark + greenmark,
    fit_markers_to_bounds = True,
    style = "height:600px;width:800px;margin:0;"
)
The server can dynamically capture new properties coming onto the market and put each house into the KNN pipeline. The KNN pipeline generates an estimated rental price based on the data points we have and calculates the price-to-rent ratio automatically.
Another pipeline is the area pipeline. After the spider finishes, new houses are classified into different region files by the region classifier, and then the area pipeline starts processing. The area pipeline calculates each area's results based on all the sold and rented houses in that area. I use Pandas¹ to process the data and do the statistical analysis.
Besides the prediction system running on the server, several Python scripts also exist on the application server to support it. The Extractor() class extracts useful information from the raw data. The DataFetcher() class is used to retrieve sold properties by id.
¹ https://pandas.pydata.org/
Chapter 6
Evaluation
In this chapter, we will discuss the challenges we faced while working on the project. Then we will evaluate the performance of each single-variable linear regression and of the multi-variable linear regression on property similarity. After that, we will apply the weights obtained from the regression to feature-weighted KNN and compare the accuracy of the two methods, feature-weighted KNN and unweighted KNN, for different values of K. Finally, a feature ablation study will be conducted on every feature to see its individual effect on house price prediction.
¹ One-hot encoding is a process by which categorical variables are converted into a form that can be provided to ML algorithms to do a better job in prediction.
Figure 6.1: 1-Bed Rental Price with Different House Types in Different Areas
From the figure, there seems to be no direct relation between rental price and property type, so I plotted a more detailed graph:
Figure 6.2: Monthly Rental Price with Different Property Types and Beds in the Same Area
On average, there is not a big price difference between these property types (except for the 4-bed flat, probably because there are not many big flats in reality). I thought that if property type significantly affects rental price, then, holding the other variables constant, we could use the standard deviation divided by the mean rental price to measure how much variation property type introduces. Later, I researched online and found that this measurement does exist: it is called the "Coefficient of Variation"². I randomly picked 20 areas' data to compute the statistics:
The graph shows that in more than 75% of the areas the coefficient is between 5% and 15%; that is, in more than three quarters of the tested areas, the average price deviation from the mean is between 5% and 15% across property types. We cannot attribute this 5% to 15% entirely to property type. Note that we do not control all related variables in this test, because it is not feasible to hold everything else constant; for example, we cannot make all houses' furniture the same. Property type may be a factor in rental price, but this experiment tells us that differences in property type do not affect rental price significantly. Thus, in order to continue with the data training and save memory, I decided to drop this feature.
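For reference, a minimal sketch of computing the coefficient of variation per area in pandas; the rental figures are invented for illustration.

import pandas as pd

rentals = pd.DataFrame({
    "area": ["NG7", "NG7", "NG7", "SW15", "SW15", "SW15"],
    "rent": [650, 700, 720, 1500, 1650, 1800],
})
# coefficient of variation = standard deviation / mean, per area
cv = rentals.groupby("area")["rent"].apply(lambda s: s.std() / s.mean())
print(cv)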
I take each single feature as input to train a linear regression model, in order to see each feature's influence on the price difference, and use cross-validation to test the trained model. The "predicted value" and "actual value" in the graph refer to the price difference ratio.
² The coefficient of variation is a measure of relative variability. It is the ratio of the standard deviation to the mean (average).
The results above show that "num bed" has the lowest RMSE of 0.41, which means this feature carries more weight than the others when estimating house price. However, it is interesting that geographic distance does not weigh much in house similarity, with a score of 0.001 and a coefficient of 0.103. The reason might be that our training examples come from the same area: when properties are close to each other in the same region, this factor contributes less to the price difference.
Due to the limited set of features and many other factors, the linear regression performance is not very satisfying. However, this is a real-world problem which is not as idealized as artificial statistics, and the features and labels we have may not be exactly accurate. We will first apply the coefficients to the K-Nearest-Neighbour algorithm to see the resulting accuracy.
KNN is an instance-based learning method, which means it has to consider all the instances every time it makes a hypothesis [25]. It is time-consuming to test on every example, so we decided to sample the data to test our regression result. I randomly chose 30 areas and from each area took 10% of the properties, but no more than 100 per area. In total, we have 2,123 properties to test.
6.2.1 Choice of K
I chose 3, 5 and 10 as the K values and evaluated the prediction accuracy. Blue lines show the feature-weighted KNN error and yellow lines the unweighted KNN error. The x-axis represents different UK areas; because one area had too few data points, only 29 areas are counted.
The graph shows that there is not much difference between K = 3 and K = 5; however, when K goes up to 10, the error rate of the unweighted KNN method reaches about 10%. As we can see, overall, feature-weighted KNN gives better performance on price prediction, with an average error rate of 4.2% for K = 3 and 5; unweighted KNN also gives a good result, with an average error rate of 8% for K = 3 and 5.
I chose K = 5 for the final software because it puts more neighbours into the price estimation than K = 3, which makes the result more averaged and less likely to be extreme, while keeping almost the same accuracy as K = 3.
Having determined the value of K, we conduct a feature ablation study in order to assess each feature's effect on price prediction.
Figures 6.13 and 6.14 give the error rate for each single-feature KNN test, using the same data set as the previous testing. Figure 6.14 shows the results for the features "num bath" and "num bed", and Figure 6.13 the results for "geographic distance" and "monthly view".
In Figure 6.15, 0 represents "num bath", 1 represents "num bed", 2 represents "monthly view" and 3 represents "geographic distance". The number of bedrooms and the geographic distance have relatively lower error than the other two, with error rates of 11.1% and 10.1%, while "monthly view" has the highest error rate of 18.0%. Interestingly, even when we combine these features and apply the KNN method, as in the previous section, the error rate stays around 8%, which is not much of an improvement. However, feature-weighted KNN shows a small but significant improvement: its error rate is less than half that of the best single-feature KNN.
We ran a t-test to see whether the weighted and unweighted KNN prediction results are statistically different. We compared two vectors: the first is the price difference between the feature-weighted KNN results and the actual prices, and the second is the price difference between the unweighted KNN results and the actual prices. The result is a p-value of 1.117 × 10⁻⁹ < 0.05, which means we can reject the null hypothesis that the two methods perform the same. In terms of error rate, our feature-weighted KNN's error is nearly half that of the unweighted KNN method. I did not really expect it to give nearly 95% accuracy on actual prediction; moreover, in extreme cases such as luxury houses the method may not work as well as it does for normal houses. Overall, the project's result is good.
Chapter 7

Summary and Reflections
In this chapter, I will summarize the progress made and the problems encountered while working on this project, as well as provide suggestions for future work and experiments.
Bibliography

[1] UK House Price Index (UK HPI) annual review 2017. URL: https://www.gov.uk/government/news/uk-house-price-index-uk-hpi-annual-review-2017.
[2] UK House Price Index England: January 2018. URL: https://www.gov.uk/government/publications/uk-house-price-index-england-january-2018/uk-house-price-index-england-january-2018.
[3] Byeonghwa Park and Jae Kwon Bae. "Using machine learning algorithms for housing price prediction: The case of Fairfax County, Virginia housing data". In: Expert Systems with Applications 42.6 (2015), pp. 2928–2934. ISSN: 0957-4174. DOI: 10.1016/j.eswa.2014.11.040.
[4] Diala Taneeb. What Are the Main Metrics That Real Estate Investors Look at When Deciding Which Property to Invest In? URL: https://mobe.com/main-metrics-real-estate-investors-look-deciding-property-invest/?aff_id=5538 (accessed: 01.09.2018).
[5] Aaron NG. "Machine Learning for a London Housing Price Prediction Mobile Application". In: (2015).
[6] Scrapy at a glance. URL: https://docs.scrapy.org/en/latest/intro/overview.html (accessed: 01.08.2017).
[7] Architecture overview. URL: https://doc.scrapy.org/en/latest/topics/architecture.html.
[8] Skewness. URL: https://en.wikipedia.org/wiki/Skewness.
[9] T TEST. URL: https://researchbasics.education.uconn.edu/t-test/.
[10] Oliver Sutton. "Introduction to k Nearest Neighbour Classification and Condensed Nearest Neighbour Data Reduction". In: (2012).
[11] Weikun Zhao. "The Research on Price Prediction of Second-hand Houses Based on KNN and Stimulated Annealing Algorithm". In: (2014). DOI: 10.14257/ijsh.2014.8.2.19.
[12] k-nearest neighbors algorithm. URL: https://en.wikipedia.org/wiki/K-nearest_neighbors_algorithm.
[13] Christopher M. Bishop. Pattern Recognition and Machine Learning. Springer, 2006. ISBN: 0387310738.
[14] Gradient descent. URL: https://en.wikipedia.org/wiki/Gradient_descent.
[15] Matt Nedrich. An Introduction to Gradient Descent and Linear Regression. URL: https://spin.atomicobject.com/2014/06/24/gradient-descent-linear-regression/.
[16] Ming Zhao. "Improvement and Comparison of Weighted k Nearest Neighbors Classifiers for Model Selection". In: Journal of Software Engineering (2016). DOI: 10.3923/jse.2016.109.118.
[17] How is MySQL implemented? URL: https://www.quora.com/How-is-MySQL-implemented-What-data-structures-are-used-Are-there-any-unique-tricks-or-optimizations-that-the-developers-employed-that-allowed-for-the-fast-query-time.
[18] Rotimi Boluwatife Abidoye. "Factors That Influence Real Estate Project Investment: Professionals' Standpoint". In: (2016). URL: openbooks.uct.ac.za/cidb/index.php/cidb/catalog/download/3/1/134-2.
[19] Geodesics on an ellipsoid. URL: https://en.wikipedia.org/wiki/Geodesics_on_an_ellipsoid.
[20] Postcodes in the United Kingdom. URL: https://en.wikipedia.org/wiki/Postcodes_in_the_United_Kingdom.
[21] scikit-learn. URL: http://scikit-learn.org/stable/.
[22] Flask. URL: http://flask.pocoo.org/.
[23] Jinja2. URL: http://jinja.pocoo.org/.
[24] Routing in Flask. URL: http://flask.pocoo.org/docs/0.12/quickstart/.
[25] David W. Aha, Dennis Kibler, and Marc K. Albert. "Instance-based learning algorithms". In: Machine Learning 6.1 (1991), pp. 37–66. ISSN: 1573-0565. DOI: 10.1007/BF00153759.