Exercise 1 Statistical Learning
Johan S. Wind
February 2018
a)
In Figure 2 from the task description we see the result of running K-NN many
times with different random seeds for the noise. This causes the resulting (blue)
lines to vary, and therefore to create a band around the true (black) signal. We
see that the band is tighter when we average over more points (larger K). Also,
the averaged fit forms a straight line at the edges, since the K nearest neighbors
are the same for all points far enough to the right (or left). We also see that
higher K leads to a smoother fit.
A low value of K gives the most flexible fit. The most extreme case is K = 1,
when the fit passes through all available data points. Higher values of K force
the fit to be smoother and to take more data points into account for each
predicted point.
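As a small illustration of this flexibility trade-off, here is a hypothetical one-dimensional K-NN regression sketch (the data points below are made up, not the ones from the task); K = 1 reproduces the training responses exactly, while a larger K averages over a neighborhood:

```python
# Hypothetical sketch of K-NN regression in one dimension.
def knn_regress(x_train, y_train, x0, K):
    # Average the responses of the K training points nearest to x0.
    nearest = sorted(range(len(x_train)), key=lambda i: abs(x_train[i] - x0))[:K]
    return sum(y_train[i] for i in nearest) / K

x = [-2.0, -1.0, 0.0, 1.0, 2.0]
y = [4.0, 1.0, 0.0, 1.0, 4.0]

# K = 1: the fit passes through every training point (most flexible).
print(knn_regress(x, y, 1.0, 1))  # 1.0
# K = 3: the prediction is smoothed by averaging over the neighborhood.
print(knn_regress(x, y, 1.0, 3))  # (0.0 + 1.0 + 4.0) / 3
```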
In Figure 3 from the task description we see the training error and test error for
different values of K. As expected, a low K gives a better fit to the training data,
as it is more flexible. However, very low values of K (< 3) have a higher test error
than K = 5, showing that we are over-fitting the noise rather than only capturing
the signal. As K increases further the model becomes too rigid to fit the data
well, so both training error and test error steadily go up. Judging from the test
error as a function of K, the minimum seems to be around K = 4, so this would
be the best value.
b)
By repeating the fitting M times with different random seeds for the noise, we
can estimate the variance and bias of the fit. The variance is simply the variance
of the estimated fits, and the bias is the average difference between the fits and
the true signal. So the variance is high if the model overfits the noise, and the bias
is high if the model is too restrictive to fit the data well (this shows up as
consistent differences between fit and true signal over many random seeds).
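A minimal sketch of this procedure, with a hypothetical signal and noise level (not the ones from the task):

```python
import random
import statistics

# Hypothetical true signal and noise level, for illustration only.
def f(x):
    return x * x

def knn_fit(xs, ys, x0, K):
    # K-NN regression: average the K responses nearest to x0.
    nearest = sorted(range(len(xs)), key=lambda i: abs(xs[i] - x0))[:K]
    return sum(ys[i] for i in nearest) / K

random.seed(1)
xs = [i / 10 - 3 for i in range(61)]  # grid on [-3, 3]
x0, K, M = 0.0, 5, 200

fits = []
for _ in range(M):  # repeat the fit M times with fresh noise
    ys = [f(x) + random.gauss(0, 0.5) for x in xs]
    fits.append(knn_fit(xs, ys, x0, K))

variance = statistics.pvariance(fits)   # variance of the estimated fits
bias = statistics.mean(fits) - f(x0)    # average fit minus true signal
```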
In Figure 4 from the task description we see how the different components of the
error vary with K. The irreducible error is of course not dependent on K. The
variance goes down as K increases, because larger K enforces a smoother fit.
The bias increases with K, again because larger K restricts the fit by enforcing
smoothness and wider averaging. We now see the reason why the total error has
a minimum around K = 3: this is where the bias-variance trade-off is optimized.
The value K = 3 found here is slightly lower than the answer K = 4 found in
a), but there isn't much difference in the actual test error for K ∈ {3, 4, 5}, so we
would say the answers are consistent.
Extra: If we naively look at each plot independently we find the minima are at
about K = 9, K = 15, K = 10 and K = 12. Averaging these gives K = 11.5,
which is much greater than the previous answers. This is, however, a bad way
to find the optimal K, because it does not take into account that the edges (x
around −3 or 3) are a special case where larger K gives a much worse fit. Also,
it doesn't take into account whether the function has a clear minimum or a very
flat one for each x. We would therefore still trust K ∈ {1, 2, 3} much more.
2
a)
The fitted model is:

−1/√SYSBP = −1.103e−01 − 2.989e−04 · SEX + 2.378e−04 · AGE
  − 2.504e−04 · CURSMOKE + 3.087e−04 · BMI
  + 9.288e−06 · TOTCHOL + 5.469e−03 · BPMEDS
The "estimate" field in the summary is the least squares estimate of the con-
tribution of the corresponding factor. This means that it should be how much
a change in the factor would affect the prediction (here −1/√SYSBP). For example,
AGE's estimate of 2.378e−04 means that, according to the model, being one
year older makes negative one over the square root of the systolic blood pres-
sure 2.378e−04 higher. The intercept estimate represents the baseline, as it gives
the result if all other parameters were set to zero, and its (linear) contribution
is the same no matter the parameters. The formula for the estimate is the least
squares fit β = (X^T X)^{-1} X^T y, where X is the matrix of observed factors
augmented with a column of ones (for the intercept) and y is the observed response.
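The least squares formula can be checked numerically. Here is a sketch on made-up data (the factor values and coefficients below are hypothetical, not those of the SYSBP model):

```python
import numpy as np

rng = np.random.default_rng(0)
n = 200
# Design matrix: a column of ones (intercept) plus two made-up factors.
X = np.column_stack([np.ones(n), rng.normal(size=n), rng.normal(size=n)])
beta_true = np.array([-0.11, 0.3, -0.05])
y = X @ beta_true + rng.normal(scale=0.01, size=n)

# Least squares estimate: beta = (X^T X)^{-1} X^T y
beta_hat = np.linalg.solve(X.T @ X, X.T @ y)
print(beta_hat)  # close to beta_true, since the noise is small
```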
The "Std.Error" field is an estimate of the standard error of the "estimate" field
of the same factor. So, basically, a low "Std.Error" means that the "estimate"
field is certain, and conversely a high "Std.Error" means the corresponding "es-
timate" field is uncertain. A formula for "Std.Error" is the square root of the
corresponding diagonal entry of σ̂²(X^T X)^{-1}, where σ̂² = ε^T ε/(n − p − 1)
and ε = y − Xβ is the empirical residual.
2
The "t value" is the t statistic of a test of whether the corresponding
"estimate" field is different from zero. The formula is simply the corresponding
"estimate" divided by "Std.Error".
The "Pr(>|t|)" field is the probability that a t-distributed random variable
(here with 2600 − 7 = 2593 degrees of freedom) exceeds the absolute
value in the "t value" field. So it is the p-value for the null hypothesis that the
true coefficient is zero and the observed correlation was a coincidence. These fields
are also marked with stars, making it easy to see which factors contribute significantly.
The "Residual standard error" is the standard deviation of the residuals, so it
measures how closely the model fits the observed data. The formula is √(Var(ε)),
where Var is the empirical variance estimator.
The "F-statistic" field shows how the model compares to a constant model,
testing against the hypothesis that all the factors are uncorrelated with the
response. The p-value of this test is the chance that a constant model would
perform this well. The formula is

F = ((TSS − RSS)/p) / (RSS/(n − p − 1)),

where TSS = Σ(y_i − ȳ)² and RSS = Σε_i².
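These summary quantities are easy to recompute from the residuals. A small sketch with made-up observations and fitted values (not the model's actual output):

```python
# Recompute R^2 and the F statistic from observed and fitted values.
def regression_summary(y, y_hat, p):
    # p is the number of predictors (excluding the intercept).
    n = len(y)
    y_bar = sum(y) / n
    tss = sum((yi - y_bar) ** 2 for yi in y)               # total sum of squares
    rss = sum((yi - fi) ** 2 for yi, fi in zip(y, y_hat))  # residual sum of squares
    r2 = 1 - rss / tss
    f = ((tss - rss) / p) / (rss / (n - p - 1))
    return r2, f

# Made-up observations and fits, for illustration only.
r2, f = regression_summary([1.0, 2.0, 3.0, 4.0], [1.1, 1.9, 3.2, 3.8], p=1)
print(r2, f)  # 0.98 and 98.0
```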
b)
The proportion of the variability explained by the model is given by the "Mul-
tiple R-squared" field, so about 25% is explained. This metric does not take
into account overfitting of noise, so if we have many factors we can often
get a high R² based on overfitting rather than on explaining the underlying
causes. In such cases the model would generalize poorly to other datasets
and predictions. "Adjusted R-squared" tries to mitigate this problem, and is
very similar to the "Multiple R-squared" in this case. This, combined with the
fact that we only have 6 factors (7 with intercept) and thousands of data points,
makes me believe 25% is fairly close to the proportion of explained variability of
the underlying distribution.
The "fitted values vs. standardized residuals" plot shows fairly uniformly distributed
residuals, in the "QQ-plot of standardized residuals" the residuals follow the
line well, and the Anderson-Darling normality test gives no reason to suspect
anything wrong with the model (p = 0.8959). So modelA passes all diagnostic
tests well.
modelB is a whole other story. The residuals in the "fitted values vs. standardized
residuals" plot are clearly biased towards higher (positive) values, in the "QQ-plot
of standardized residuals" the residuals clearly curve upwards compared to the
reference line, and the Anderson-Darling normality test gives extremely strong
evidence (A = 13.2, p-value < 2.2e−16) that the regression assumptions are violated.
For all purposes we would therefore prefer modelA over modelB.
As a side note, we note that ”Multiple R-squared” and ”Adjusted R-squared”
are both around 25% for modelB, like they were for modelA. This shows that
R2 isn’t an indicator of model validity (this time at least).
Figure 1: Plots of standardized residuals for modelA (left) and modelB (right)
c)
The estimate β̂_BMI = 3.087e−04 is given in the summary. This says that,
according to the model, if you keep the other parameters the same but increase
the BMI by one, you will increase negative one over the square root of the
systolic blood pressure by 3.087e−04. So it is the amount the BMI is estimated
to affect the response in our model.
Our estimate β̂_BMI follows a t distribution centered at the estimate (3.087e−04)
with standard error 2.955e−05. We are working with 2593 degrees of freedom, so
we can approximate it with a normal distribution. A 99% confidence interval
is then (3.087e−04 − 2.58 · 2.955e−05, 3.087e−04 + 2.58 · 2.955e−05) =
(2.324610e−04, 3.849390e−04), where 2.58 is approximately the 0.995 quantile of the
standard normal distribution. This interval tells me there is about a 99% chance
of the true β_BMI lying in this interval. The confidence interval also tells me
that the p-value of the null hypothesis β_BMI = 0 is less than 0.01, since 0 is
not in the 99% confidence interval.
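The interval arithmetic, using the numbers from the summary:

```python
# 99% confidence interval for beta_BMI via the normal approximation;
# 2.58 is (approximately) the 0.995 quantile of the standard normal.
est, se = 3.087e-04, 2.955e-05
z = 2.58
ci = (est - z * se, est + z * se)
print(ci)  # about (2.3246e-04, 3.8494e-04)

# 0 lies outside the interval, so beta_BMI = 0 is rejected at the 1% level.
assert not (ci[0] <= 0 <= ci[1])
```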
d)
"predict(modelA, new)" gives us our best estimate for his −1/√SYSBP of
−0.08667246, meaning our best guess for his SYSBP is 133.1.
"predict(modelA, new, interval="prediction", level = 0.90)" constructs a 90%
prediction interval for −1/√SYSBP. The output is (−0.09625664, −0.07708829).
The corresponding interval in SYSBP is then (107.93, 168.28). We
see that the prediction is quite uncertain, as the range is very wide even at 90%
confidence. Because of the high uncertainty we wouldn’t find this interval very
useful, unless it counts that the interval shows that it IS very uncertain, which
you wouldn’t know if you just looked at the best prediction produced by the
model. I find the underlying model parameters the more useful result of this
task, as they indicate what and how you could change to affect your systolic
blood pressure (even if some of it might be correlation and not causality).
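The back-transformation from the modelled response to SYSBP is a one-liner; with the predicted values quoted above:

```python
# If t = -1/sqrt(SYSBP), then SYSBP = 1/t^2.
def to_sysbp(t):
    return 1.0 / (t * t)

fit = -0.08667246
lo, hi = -0.09625664, -0.07708829  # 90% prediction interval, transformed scale

print(to_sysbp(fit))  # about 133.1
print(to_sysbp(lo), to_sysbp(hi))  # about (107.9, 168.3)
```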
3
We chose random seed 1234 and reuse the code given in the task description as
much as possible.
a)
This is clearly a linear boundary: if we plot x1 against x2 along the boundary
we simply get a straight line (x2 = −β0/β2 − (β1/β2) x1).
The relevant part of the summary output is β̂0 = 3.3824, β̂1 = 0.3354 and
β̂2 = −1.9645. Inserting this into the formula for p_i given in the problem
statement gives

p̂(x) = exp(3.3824 + 0.3354 x1 − 1.9645 x2) / (1 + exp(3.3824 + 0.3354 x1 − 1.9645 x2)).
The accuracy is then 54/65 = 83%; this classifier performs reasonably taking
the number of data points into account. However, we see that the model is
clearly biased towards answering Y = 1: the error rate is less than half if we put
the threshold at 0.9 instead of 0.5. Changing this hyperparameter based on the
test data is, however, not recommended when we have so few data points and
no validation dataset.
The sensitivity is 100% while the specificity is 24/35 = 69%. So we also here
observe a strong bias towards false positives.
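A sketch of how the fitted logistic model classifies a point, using the coefficients from the summary and a tunable threshold (the two test points below are made up, chosen near the two group means):

```python
import math

# Fitted coefficients from the summary.
b0, b1, b2 = 3.3824, 0.3354, -1.9645

def prob(x1, x2):
    # P(Y = 1 | x) under the logistic model.
    eta = b0 + b1 * x1 + b2 * x2
    return 1.0 / (1.0 + math.exp(-eta))

def classify(x1, x2, threshold=0.5):
    return 1 if prob(x1, x2) > threshold else 0

# Made-up points near the two group means:
print(classify(16.4, 5.7))  # 0
print(classify(19.7, 3.2))  # 1
```

Raising the threshold (e.g. to 0.9) flips borderline points from class 1 to class 0, which is how the bias towards Y = 1 can be traded away.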
b)
In the expression

P(Y = 1 | X = x0) = (1/K) Σ_{i ∈ N0} I(yi = 1),

N0 is the set of the K nearest neighbors of x0 in the training data.
The accuracy is 85%, the sensitivity is 100% and the specificity is 25/35 = 71%.
So we also here observe a strong bias towards false positives.
We get the following confusion matrix for K = 9.

         pred 0  pred 1
true 0       22      13
true 1        0      30
The accuracy is 80%, the sensitivity is 100% and the specificity is 22/35 = 63%.
I prefer the K = 3 case because it gives higher test accuracy.
We can't choose K too low because of the risk of overfitting, and we can't choose
K too high because that would limit our model from accurately modelling the
ideal solution (in the extreme case K = n we predict the same class for every point).
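The K-NN rule above can be sketched directly; the tiny training set here is made up for illustration, not the data from the task:

```python
# Majority-vote K-NN classifier on (x1, x2) points.
def knn_classify(train, x0, K):
    # train: list of ((x1, x2), y) pairs; returns the majority class
    # among the K nearest training points.
    dist2 = lambda item: (item[0][0] - x0[0]) ** 2 + (item[0][1] - x0[1]) ** 2
    nearest = sorted(train, key=dist2)[:K]
    votes = sum(y for _, y in nearest)
    return 1 if votes > K / 2 else 0

train = [((16.4, 5.7), 0), ((17.0, 6.0), 0),
         ((19.7, 3.2), 1), ((20.1, 3.0), 1), ((18.9, 3.5), 1)]

print(knn_classify(train, (16.5, 5.8), 3))  # 0: the local neighborhood wins
print(knn_classify(train, (16.5, 5.8), 5))  # 1: with K = n, the overall majority wins
```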
c)
π_k is the prior probability that a random element has Y = k, μ_k is the mean of
x = (x1, x2)^T for all elements with Y = k (not just the training samples, but
the whole population), Σ is the covariance matrix of x (again of the population),
and f_k(x) is the probability density of x conditional on Y = k (the class-conditional
density).
π_k is easily estimated as the fraction of the training data having class Y = k,
μ_k is estimated as the empirical mean of x = (x1, x2)^T over all training elements
with Y = k, and Σ is estimated as the empirical, unbiased covariance matrix of x.
Writing out the lda object from R, and using "var(train[c("x1", "x2")])", gives
the estimates. Here "Prior probabilities of groups" is π_k, the rows of "Group
means" are μ_k and "Covariance" is Σ.
Prior probabilities of groups:
        0         1
0.3692308 0.6307692

Group means:
       x1       x2
0 16.4375 5.737500
1 19.7122 3.179024

Covariance:
          x1        x2
x1 12.499365 -2.190996
x2 -2.190996  2.674946
If we want the class boundary at Pr(Y = 1|x) > 0.5 we get the same line as
above, giving classification Y = 1 iff δ1(x) > δ0(x), where

δk(x) = x^T Σ^{-1} μ_k − (1/2) μ_k^T Σ^{-1} μ_k + log π_k.   (1)

We solve δ0(x) = δ1(x) using (1) to get a linear equation for the boundary between
classes and plot it.
ltrain = lda(y ~ x1 + x2, data = train)
ltrain
cat("\nCovariance:\n")
cov = var(train[c("x1", "x2")])
cov
info = solve(cov)
mu1 = c(16.4375, 5.737500)   # group 0 mean
pi1 = 0.3692308              # group 0 prior
cat("\n")
mu2 = c(19.7122, 3.179024)   # group 1 mean
pi2 = 0.6307692              # group 1 prior
ab = info %*% mu1 - info %*% mu2
c = -0.5 * (t(mu1) %*% info %*% mu1) + log(pi1) - (-0.5 * (t(mu2) %*% info %*% mu2) + log(pi2))
slope = -ab[1] / ab[2]
intercept = -c / ab[2]

g1 + geom_abline(slope = slope, intercept = intercept) +
  ggtitle("train data and lda boundary") + geom_point(data = test, pch = 3)
Figure 4: Training data is represented by circles (o), while test data is repre-
sented by plusses (+)
The accuracy is 83%, the sensitivity is 100% while the specificity is 24/35 = 69%.
So we also here observe a strong bias towards false positives.
The QDA lets each class have its own covariance matrix; this results in a number
of degrees of freedom quadratic in the number of explanatory factors. This means
we should use it if each class is suspected to have a different covariance matrix,
and be extra careful about overfitting.
d)
Because of the strong bias towards false positives across all methods, we expect
that the training/test data split we got was hard for the algorithms with the
default threshold of 0.5. This also means we wouldn't put much trust in choosing
between the methods based on results gained with this threshold. But for the
sake of the task question, K-NN with K = 3 scored the best with the default
threshold; this is probably because K-NN is less sensitive than the other methods
to the threshold (there isn't really a continuous threshold in the same way as
the other methods have).
When we change the threshold in the previous algorithms, we get a trade-
off between sensitivity and specificity. ROC traces out this trade-off for all
thresholds so we can get an informative plot of it. AUC is simply the area
under the ROC.
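The AUC can also be computed without explicitly tracing the curve, via the rank formulation (the score vector below is made up for illustration):

```python
# AUC as the probability that a random positive example is scored
# above a random negative one (ties count one half).
def roc_auc(scores, labels):
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

# Made-up scores and true labels:
print(roc_auc([0.9, 0.8, 0.7, 0.4, 0.3], [1, 1, 0, 1, 0]))  # 5/6
```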
We use the code provided in the task description to get the ROC and AUC.
We see that K-NN performs significantly worse than the other two methods,
which are quite similar. If we had to choose one we would choose LDA, as it is
slightly better than logistic regression in both the ROC plot and the AUC. Here
we see the importance of taking the threshold into account, as K-NN, the method
which seemed the most promising with the default threshold, scores considerably
worse at other thresholds.