Project Report ME-315 Machine Learning in Practice
Sebastian Perez Viegener
LSE ID: 201870983
July 3, 2019
In this project we will examine two real-world datasets and try to make useful
predictions of a target variable from the features provided in each dataset.
I will work on one regression problem and one classification problem.
Regression:
In this particular case I chose the Real Estate Valuation Data Set.
https://archive.ics.uci.edu/ml/datasets/Real+estate+valuation+data+set
Its main characteristics are summed up as follows:
Data Set Characteristics: Multivariate
Number of Instances: 414
Area: Business
Attribute Characteristics: Integer, Real
Number of Attributes: 7
Date Donated: 2018-08-18
Associated Tasks: Regression
Missing Values: N/A
Data Set Information:
The market historical data set of real estate valuation is collected from Sindian Dist.,
New Taipei City, Taiwan. The real estate valuation is a regression problem. The data
set was randomly split into the training data set (2/3 samples) and the testing data set
(1/3 samples).
Attribute Information:
The inputs are as follows:
X1 = the transaction date (for example, 2013.250 = March 2013, 2013.500 = June 2013, etc.)
X2 = the house age (unit: year)
X3 = the distance to the nearest MRT station (unit: meter)
X4 = the number of convenience stores in the living circle on foot (integer)
X5 = the geographic coordinate, latitude (unit: degree)
X6 = the geographic coordinate, longitude (unit: degree)
The output is as follows:
Y = house price per unit area (10,000 New Taiwan Dollar/Ping, where Ping is a local unit, 1 Ping = 3.3 square meters)
It is first useful to take a look at a summary of the data to see the range of
values for each attribute.
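As a quick sketch, the data can be loaded and summarized as follows (the file name real_estate.csv and the shortened column names are assumptions for illustration; the UCI file is distributed as a spreadsheet and would first need to be exported to CSV):

# Load the data and rename the columns to the short names used below
# (file name and column order are assumed).
real <- read.csv("real_estate.csv")
names(real) <- c("date", "age", "dist", "stores", "lat", "lon", "price")
summary(real)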
## date age dist stores
## Min. :2013 Min. : 0.000 Min. : 23.38 Min. : 0.000
## 1st Qu.:2013 1st Qu.: 9.025 1st Qu.: 289.32 1st Qu.: 1.000
## Median :2013 Median :16.100 Median : 492.23 Median : 4.000
## Mean :2013 Mean :17.713 Mean :1083.89 Mean : 4.094
## 3rd Qu.:2013 3rd Qu.:28.150 3rd Qu.:1454.28 3rd Qu.: 6.000
## Max. :2014 Max. :43.800 Max. :6488.02 Max. :10.000
## lat lon price
## Min. :24.93 Min. :121.5 Min. : 7.60
## 1st Qu.:24.96 1st Qu.:121.5 1st Qu.: 27.70
## Median :24.97 Median :121.5 Median : 38.45
## Mean :24.97 Mean :121.5 Mean : 37.98
## 3rd Qu.:24.98 3rd Qu.:121.5 3rd Qu.: 46.60
## Max. :25.01 Max. :121.6 Max. :117.50
Linear Regression:
After having taken a general look at the data set, we will evaluate different
prediction models and compare them based on their mean square error on the test
set to decide which one performs best on this particular data set. We will
compare five different models:
• Linear regression
• Linear regression + interaction
• Linear regression + interaction + quadratic term
• Ridge
• Lasso
The training set used to fit the parameters will be (2/3) of the whole data set, and
the remaining (1/3) will be the test set used to evaluate how well the models predict
outcomes for yet unseen observations.
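A minimal sketch of how this random split can be constructed in R (the seed value is arbitrary and only fixed here for reproducibility):

# Randomly assign 2/3 of the rows to the training set, the rest to the test set.
set.seed(1)
n <- nrow(real)
train.idx <- sample(n, size = round(2 * n / 3))
real.train <- real[train.idx, ]
real.test  <- real[-train.idx, ]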
First, we analyze the three regression models.
##
## Call:
## lm(formula = price ~ ., data = real.train)
##
## Residuals:
## Min 1Q Median 3Q Max
## -18.311 -5.317 -0.799 4.618 72.947
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -1.365e+04 8.399e+03 -1.625 0.105235
## date 6.175e+00 1.895e+00 3.258 0.001267 **
## age -3.265e-01 4.730e-02 -6.903 3.66e-11 ***
## dist -4.992e-03 9.044e-04 -5.520 7.98e-08 ***
## stores 8.071e-01 2.352e-01 3.432 0.000694 ***
## lat 2.288e+02 5.113e+01 4.476 1.13e-05 ***
## lon -3.659e+01 6.237e+01 -0.587 0.557934
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 8.863 on 269 degrees of freedom
## Multiple R-squared: 0.5642, Adjusted R-squared: 0.5545
## F-statistic: 58.05 on 6 and 269 DF, p-value: < 2.2e-16
##
## Call:
## lm(formula = price ~ age + dist * stores + dist * lat + stores *
## lat, data = real.train)
##
## Residuals:
## Min 1Q Median 3Q Max
## -17.572 -4.612 -1.059 3.299 71.648
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -2.384e+04 3.451e+03 -6.909 3.54e-11 ***
## age -3.011e-01 4.406e-02 -6.835 5.53e-11 ***
## dist 5.458e+00 1.106e+00 4.933 1.43e-06 ***
## stores 2.726e+03 5.922e+02 4.602 6.46e-06 ***
## lat 9.566e+02 1.382e+02 6.923 3.27e-11 ***
## dist:stores -1.575e-03 3.021e-04 -5.213 3.72e-07 ***
## dist:lat -2.187e-01 4.433e-02 -4.935 1.41e-06 ***
## stores:lat -1.091e+02 2.372e+01 -4.600 6.52e-06 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 8.192 on 268 degrees of freedom
## Multiple R-squared: 0.6292, Adjusted R-squared: 0.6195
## F-statistic: 64.96 on 7 and 268 DF, p-value: < 2.2e-16
##
## Call:
## lm(formula = price ~ age + dist * stores + dist * lat + stores *
## lat + I(age^2), data = real.train)
##
## Residuals:
## Min 1Q Median 3Q Max
## -15.368 -4.211 -0.515 3.575 72.598
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -2.350e+04 3.368e+03 -6.979 2.35e-11 ***
## age -9.323e-01 1.712e-01 -5.447 1.17e-07 ***
## dist 5.439e+00 1.080e+00 5.038 8.69e-07 ***
## stores 2.510e+03 5.806e+02 4.322 2.18e-05 ***
## lat 9.433e+02 1.349e+02 6.994 2.15e-11 ***
## I(age^2) 1.594e-02 4.185e-03 3.810 0.000173 ***
## dist:stores -1.451e-03 2.965e-04 -4.892 1.73e-06 ***
## dist:lat -2.180e-01 4.325e-02 -5.040 8.61e-07 ***
## stores:lat -1.005e+02 2.325e+01 -4.321 2.20e-05 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 7.993 on 267 degrees of freedom
## Multiple R-squared: 0.6483, Adjusted R-squared: 0.6378
## F-statistic: 61.52 on 8 and 267 DF, p-value: < 2.2e-16
From the summarized information for each of the regression models we can see
improvements in the residual standard error and in the multiple R-squared, and we
can also see an improvement in the residuals-vs-fitted-values plot when going
from the plain linear model to the models that include interaction and quadratic
terms.
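For reference, a minimal sketch of the three fits and their test-set mean square errors, mirroring the formulas shown in the summaries above (the mse helper function is introduced here only for illustration):

# Fit the three linear models and compute each one's test-set MSE.
fit1 <- lm(price ~ ., data = real.train)
fit2 <- lm(price ~ age + dist * stores + dist * lat + stores * lat,
           data = real.train)
fit3 <- lm(price ~ age + dist * stores + dist * lat + stores * lat + I(age^2),
           data = real.train)
mse <- function(fit) mean((real.test$price - predict(fit, real.test))^2)
sapply(list(fit1, fit2, fit3), mse)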
Ridge and Lasso:
Considering now the additional Ridge and Lasso regression models, we compare the
mean square errors for all five models.
We can see from the following graphs that the best mean square error for the Ridge
regression is obtained for a lambda of
## [1] 0.1873817
and for the Lasso Regression the best mean square error is obtained for a lambda of
## [1] 0.1873817
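A minimal sketch of how the two penalized fits can be obtained with the glmnet package, assuming lambda is chosen by cv.glmnet's built-in cross-validation (variable names follow the earlier summary):

# Ridge (alpha = 0) and Lasso (alpha = 1) with cross-validated lambda.
library(glmnet)
x.train <- as.matrix(real.train[, c("date", "age", "dist", "stores", "lat", "lon")])
x.test  <- as.matrix(real.test[,  c("date", "age", "dist", "stores", "lat", "lon")])
cv.ridge <- cv.glmnet(x.train, real.train$price, alpha = 0)
cv.lasso <- cv.glmnet(x.train, real.train$price, alpha = 1)
mean((real.test$price - predict(cv.ridge, newx = x.test, s = "lambda.min"))^2)
mean((real.test$price - predict(cv.lasso, newx = x.test, s = "lambda.min"))^2)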
Comparing the mean square errors for each model, we conclude that the best model is
the linear model that also includes the interaction and quadratic terms.
## [,1]
## linear 81.87439
## linear+interaction 61.11175
## linear+interaction+square 55.72100
## Ridge 81.54077
## Lasso 81.41223
Classification:
For the classification problem I will be using the “Churn_Modelling.csv” dataset from
the Kaggle webpage. https://www.kaggle.com/yogirj/churn-modelling
The churn rate, also known as the rate of attrition, is the percentage of service
subscribers who discontinue their subscriptions within a given time period. For a
company to expand its clientele, its growth rate (measured by the number of new
customers) must exceed its churn rate.
The data set contains the features CustomerId, Surname, CreditScore, Geography,
Gender, Age, Tenure, Balance, NumOfProducts, HasCrCard, IsActiveMember, and
EstimatedSalary, and the variable to be predicted: Exited.
We will take a random sample of 8,000 cases as the training set from a total of
10,000 cases.
I will compare four models and their mean error to determine which one performs
best on this particular data set:
• KNN
• Logistic Regression
• QDA
• LDA
Again, as a first step, we take a look at the basic summarized information from the
data set.
## CreditScore Geography Gender Age
## Min. :350.0 Min. :1.000 Min. :1.000 Min. :18.00
## 1st Qu.:584.0 1st Qu.:1.000 1st Qu.:1.000 1st Qu.:32.00
## Median :652.0 Median :1.000 Median :2.000 Median :37.00
## Mean :650.5 Mean :1.746 Mean :1.546 Mean :38.92
## 3rd Qu.:718.0 3rd Qu.:2.000 3rd Qu.:2.000 3rd Qu.:44.00
## Max. :850.0 Max. :3.000 Max. :2.000 Max. :92.00
## Tenure Balance NumOfProducts HasCrCard
## Min. : 0.000 Min. : 0 Min. :1.00 Min. :0.0000
## 1st Qu.: 3.000 1st Qu.: 0 1st Qu.:1.00 1st Qu.:0.0000
## Median : 5.000 Median : 97199 Median :1.00 Median :1.0000
## Mean : 5.013 Mean : 76486 Mean :1.53 Mean :0.7055
## 3rd Qu.: 7.000 3rd Qu.:127644 3rd Qu.:2.00 3rd Qu.:1.0000
## Max. :10.000 Max. :250898 Max. :4.00 Max. :1.0000
## IsActiveMember EstimatedSalary Exited
## Min. :0.0000 Min. : 11.58 Min. :0.0000
## 1st Qu.:0.0000 1st Qu.: 51002.11 1st Qu.:0.0000
## Median :1.0000 Median :100193.91 Median :0.0000
## Mean :0.5151 Mean :100090.24 Mean :0.2037
## 3rd Qu.:1.0000 3rd Qu.:149388.25 3rd Qu.:0.0000
## Max. :1.0000 Max. :199992.48 Max. :1.0000
Taking a look at the correlation matrix and its heat map, one can see that there are
no clear correlations between the variable to be predicted, “Exited”, and its features.
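A minimal sketch of the preprocessing behind this step, covering the integer encoding of Geography and Gender visible in the summary above, the 8,000-case training sample, and the correlation check (the dropped identifier columns and all variable names are assumptions for illustration):

# Read the data, encode the categorical columns as integers, drop identifiers,
# and take a random training sample of 8,000 of the 10,000 rows.
churn <- read.csv("Churn_Modelling.csv")
churn$Geography <- as.integer(factor(churn$Geography))
churn$Gender    <- as.integer(factor(churn$Gender))
churn <- churn[, !(names(churn) %in% c("RowNumber", "CustomerId", "Surname"))]
set.seed(1)
train.idx <- sample(nrow(churn), 8000)
churn.train <- churn[train.idx, ]
churn.test  <- churn[-train.idx, ]
round(cor(churn.train)["Exited", ], 2)  # correlations with the target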
KNN:
As a first model I will consider KNN. Plotting the number K of neighbors against
the mean error gives the following results.
## K neighbors  1      2      3       4      5       6      7       8       9      10
## Mean Error   0.306  0.328  0.2695  0.271  0.2335  0.239  0.2275  0.2315  0.223  0.220
It can be seen that for K > 5 the reduction in the mean error does not justify the
additional complexity of the model, so we choose K = 5. To compare KNN to the other
models we will use the mean error obtained for K = 5.
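A minimal sketch of the error sweep over K using class::knn; scaling the features first is an assumption, made because KNN is distance-based:

# Scale features, then compute the test mean error for K = 1, ..., 10.
library(class)
feat <- names(churn.train) != "Exited"
x.train <- scale(churn.train[, feat])
x.test  <- scale(churn.test[, feat],
                 center = attr(x.train, "scaled:center"),
                 scale  = attr(x.train, "scaled:scale"))
knn.err <- sapply(1:10, function(k) {
  pred <- knn(x.train, x.test, cl = churn.train$Exited, k = k)
  mean(pred != churn.test$Exited)
})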
Logistic Regression:
Now we will run a logistic regression to see whether it performs better than the
KNN model.
After fitting the logistic regression we get the following coefficients.
## (Intercept) CreditScore Geography Gender
## -3.052234e+00 -6.905519e-04 6.695423e-02 -5.286857e-01
## Age Tenure Balance NumOfProducts
## 7.122400e-02 -1.331991e-02 5.374569e-06 8.588360e-03
## HasCrCard IsActiveMember EstimatedSalary
## -4.531044e-02 -1.096925e+00 3.260459e-07
We can also compute a matrix that displays the relationship between the mean
classification error and the threshold t of the logistic regression prediction,
that is, the probability above which we consider the prediction to be positive:
predict Exited = 1 if P(Exited = 1 | X) > t, and Exited = 0 otherwise.
## Logistic Regression threshold t  0.1    0.2     0.3    0.4    0.5    0.6    0.7    0.8    0.9     1.0
## Mean Error                       0.514  0.3055  0.217  0.193  0.188  0.194  0.204  0.207  0.2075  0.2075
It can be seen that the lowest mean error is obtained for a threshold of 0.5.
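A minimal sketch of the logistic fit and the threshold sweep, using glm with a binomial family and predict(type = "response") for the probabilities:

# Fit the logistic regression and compute the mean error at each threshold.
log.fit <- glm(Exited ~ ., data = churn.train, family = binomial)
probs <- predict(log.fit, churn.test, type = "response")
log.err <- sapply(seq(0.1, 1, by = 0.1), function(t)
  mean(as.integer(probs > t) != churn.test$Exited))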
Comparing the mean error for each model, we can see that QDA is the best model for
making predictions on this data set.
##     KNN  LogReg     QDA     LDA
##  0.2335   0.188  0.1665   0.191