Lecture BDS 3 23 24 Print
14 February 2024
Constructing a test
Let us now look at how to test the hypothesis in the Remark (testing significance), part
(ii).
■ The first thing to note is that the regression model in H0 is nested in the model in
H1 .
■ Nested means here: Any regression function belonging to H0 also belongs to H1 if
we choose β1 , . . . , βd appropriately.
■ For H0 and H1 in the Remark (testing significance), part (ii) this property holds
because a given function β1 xi1 + . . . + βr xir belonging to H0 can be written as
β1 xi1 + . . . + βr xir + 0 · xi,r+1 + . . . + 0 · xid , which belongs to H1 .
8
Constructing a test (cont’d)
A simple yet powerful idea for constructing a test statistic is as follows:
■ Under H0 the unknown vector is $\beta^r = (\beta_1, \ldots, \beta_r)$.
■ Estimating the unknown vector $\beta^r = (\beta_1, \ldots, \beta_r)$ by least squares we have
$$\hat\beta^r_n = \big((X_n^r)^T X_n^r\big)^{-1}(X_n^r)^T Y_n,$$
where $X_n^r$ denotes the design matrix containing the first r covariates.
■ With this estimator we can calculate the sum of squared errors (or sum of squared
residuals) SSEH0 under H0 which is given by
$$SSE_{H_0} = \sum_{i=1}^{n}\Big(y_i - \sum_{j=1}^{r}\hat\beta^r_j x_{ij}\Big)^2 .$$
9
Constructing a test (cont’d)
■ Under the alternative the unknown vector is β = (β1 , . . . , βr , βr+1 , . . . , βd ).
■ This is just our good old friend from Lecture 1, which can be estimated by
$$\hat\beta_n = (X_n^T X_n)^{-1} X_n^T Y_n .$$
■ With this estimator we can calculate the sum of squared errors SSEH1 under H1
which is given by
$$SSE_{H_1} = \sum_{i=1}^{n}\Big(y_i - \sum_{j=1}^{d}\hat\beta_j x_{ij}\Big)^2 .$$
■ Intuitively, if including the covariates Xr+1 , . . . , Xd adds little value, we
should have
SSEH0 ≈ SSEH1 .
10
Constructing a test (cont’d)
■ The test statistic indeed builds on the difference between SSEH0 and SSEH1 ,
which is
◆ large if the alternative is true;
◆ small if the null hypothesis is true.
■ Our test statistic is defined by
$$F = \frac{n-d}{d-r}\cdot\frac{SSE_{H_0} - SSE_{H_1}}{SSE_{H_1}} .$$
11
Constructing a test (cont’d)
■ For the test statistic to be useful we need its distribution under H0 .
■ Thinking back to Lecture 1 we conjecture that this distribution will depend on
whether we have
◆ the classical linear model (either with fixed or random design); or
◆ the linear model (either with fixed or random design).
■ This conjecture is indeed true and we have that
◆ under H0 our test statistic F has exactly an F -distribution with (d − r) and
n − d degrees of freedom for the classical linear model (either with fixed or
random design);
◆ under H0 our test statistic F has approximately an F -distribution with
(d − r) and n − d degrees of freedom for the linear model (either with fixed
or random design).
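To make the construction concrete, here is a minimal numerical sketch of the F-test for nested linear models (my own illustration, not part of the slides); the simulated data, the sizes n = 100, r = 2, d = 4, and the use of numpy/scipy are assumptions made purely for the example.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
n, r, d = 100, 2, 4                      # illustrative sizes: the r-covariate model is nested in the d-covariate one

X = rng.normal(size=(n, d))              # full design matrix X_n (d covariates)
beta_true = np.array([1.0, -0.5, 0.0, 0.0])
y = X @ beta_true + rng.normal(size=n)   # H0 is true here: the last d - r coefficients are zero

def sse(X_sub, y):
    """Least-squares fit and sum of squared errors for a given design matrix."""
    beta_hat, *_ = np.linalg.lstsq(X_sub, y, rcond=None)
    resid = y - X_sub @ beta_hat
    return resid @ resid

sse_h0 = sse(X[:, :r], y)                # model under H0 (first r covariates)
sse_h1 = sse(X, y)                       # model under H1 (all d covariates)

F = (n - d) / (d - r) * (sse_h0 - sse_h1) / sse_h1
p_value = stats.f.sf(F, d - r, n - d)    # upper tail of the F(d - r, n - d) distribution
print(F, p_value)
```

A large value of F, i.e. a small p-value, speaks against H0.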
12
Motivating test statistic
■ Our above test for testing the hypothesis on slide 6 for linear models builds on
◆ The squared residuals (or squared errors) under H0 ; and
◆ The squared residuals (or squared errors) under H1 .
■ If we want to use a similar test construction for testing H0 on slide 14 for GLMs
we need to define residuals for GLMs.
■ Let us recall the comparison made by residuals for linear models: they compare
yi to its expected value $\sum_{j=1}^{d}\beta_j x_{ij}$ !
■ Because GLMs also explain the expected value of the responses it is meaningful to
define residuals (or errors) for GLMs analogously to linear models. For GLMs one
defines residuals and standardized residuals by
$$y_i - h\Big(\sum_{j=1}^{d}\hat\beta_j x_{ij}\Big) \qquad\text{and}\qquad \frac{y_i - h\big(\sum_{j=1}^{d}\hat\beta_j x_{ij}\big)}{\hat\sigma_i},$$
respectively, where $\hat\sigma_i^2$ is the estimated variance; see Lecture 2, slide 38 and
Exercise sheet 2.
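As an illustration of these definitions (my own sketch, not from the slides), the following computes raw and standardized residuals for a hypothetical Poisson GLM with log link, where the estimated variance of Yi equals the fitted mean:

```python
import numpy as np

rng = np.random.default_rng(1)
n, d = 50, 3
X = rng.normal(size=(n, d))
beta_hat = np.array([0.3, -0.2, 0.1])   # assume these estimates were already obtained

mu_hat = np.exp(X @ beta_hat)           # fitted means h(sum_j beta_j x_ij), here h = exp
y = rng.poisson(mu_hat)                 # toy responses

resid = y - mu_hat                      # residual: y_i minus its estimated expected value
resid_std = resid / np.sqrt(mu_hat)     # standardized residual: divide by sigma_hat_i (Poisson: variance = mean)
print(resid_std[:5])
```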
15
The test statistic
Let’s make it more concrete what we do.
■ Suppose we have n independent Yi each with pmf or pdf according to the first
element of GLMs, i.e. of the form
$$f^{Y_i}_{\theta_i}(y_i) = \exp\Big(\frac{y_i\theta_i - b(\theta_i)}{\psi} - c(\psi, y_i)\Big), \quad y_i \in D. \qquad (3)$$
■ We have two different models for their expected values (r < d as before)
$$\text{Model 0: } E[Y_i] = h\Big(\sum_{j=1}^{r}\beta_j x_{ij}\Big) \qquad\text{and}\qquad \text{Model 1: } E[Y_i] = h\Big(\sum_{j=1}^{d}\beta_j x_{ij}\Big).$$
■ From slide 36 of Lecture 2 we know that the θi s in (3) are determined by the
choice of h and the linear regression function. Therefore, two models for E[Yi ]
imply two models for the θi s
$$\text{Model 0: } \theta_i = (b')^{-1}\Big(h\Big(\sum_{j=1}^{r}\beta_j x_{ij}\Big)\Big) \quad\text{and}\quad \text{Model 1: } \theta_i = (b')^{-1}\Big(h\Big(\sum_{j=1}^{d}\beta_j x_{ij}\Big)\Big). \qquad (4)$$
19
The test statistic (cont’d)
■ What we want to know is whether we can work with the more parsimonious
Model 0, i.e. whether we do not need the regressors r + 1, . . . , d, or whether they
really contribute to explaining the response.
■ More formally we want to test
$$H_0: \beta_{r+1} = \ldots = \beta_d = 0 \quad\text{versus}\quad H_1: \beta_j \neq 0 \text{ for at least one } j \in \{r+1, \ldots, d\}. \qquad (5)$$
■ For a concrete example think of the CPS data and the response "how often
employee i is laid off" (see slide 14).
■ We proceed as follows: We use the likelihood (see slide 36 of Lecture 2) to get
MLEs β̆1 , . . . , β̆r for β1 , . . . , βr , i.e. we fit Model 0 to the data.
■ This gives us estimated values θ̆i under Model 0 for the θi s by plugging them into
the relation (4), i.e.
$$\breve\theta_i = (b')^{-1}\Big(h\Big(\sum_{j=1}^{r}\breve\beta_j x_{ij}\Big)\Big).$$
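As a quick illustration (my own example, not on the slides): for a Poisson GLM with log link we have $b(\theta) = e^\theta$, so $b'(\theta) = e^\theta$ and $(b')^{-1} = \log$, and $h = \exp$, hence the estimated natural parameters are simply the fitted linear predictors:
$$\breve\theta_i = (b')^{-1}\Big(h\Big(\sum_{j=1}^{r}\breve\beta_j x_{ij}\Big)\Big) = \log\Big(\exp\Big(\sum_{j=1}^{r}\breve\beta_j x_{ij}\Big)\Big) = \sum_{j=1}^{r}\breve\beta_j x_{ij}.$$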
20
The test statistic (cont’d)
■ What’s next? Right, we do the same for Model 1.
■ When fitting Model 1 to the data using the likelihood we estimate d parameters.
■ Denote the MLEs for β1 , . . . , βd under Model 1 by β̂1 , . . . , β̂d .
■ The corresponding estimators for the θi s are
$$\hat\theta_i = (b')^{-1}\Big(h\Big(\sum_{j=1}^{d}\hat\beta_j x_{ij}\Big)\Big).$$
■ Remark: Note that we estimate more parameters under Model 1 than under
Model 0 and that typically also the MLEs β̂1 , . . . , β̂r for β1 , . . . , βr are different
from β̆1 , . . . , β̆r .
■ Let’s see what we have
◆ We fitted two models; and
◆ Above we spent quite some time finding ways to transfer the idea of the SSE
based test to GLMs.
■ Well, if we have these two ingredients let us combine them.
21
The test statistic (cont’d)
■ Combining means calculating
$$L_n = -2\log\frac{\prod_{i=1}^{n}\exp\Big(\frac{y_i\breve\theta_i - b(\breve\theta_i)}{\psi} - c(\psi, y_i)\Big)}{\prod_{i=1}^{n}\exp\Big(\frac{y_i\hat\theta_i - b(\hat\theta_i)}{\psi} - c(\psi, y_i)\Big)}.$$
■ This is the log-likelihood ratio test statistic for (5), or, stated in terms of regression
functions, for testing (1).
■ Note the idea: We compare
◆ The maximum value of the likelihood function under Model 0 (our numerator)
to
◆ The maximum value of the likelihood function under Model 1 (our
denominator).
■ Under the null hypothesis Ln has asymptotically a χ2 distribution with d − r
degrees of freedom.
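A minimal sketch of this likelihood ratio comparison for a hypothetical Poisson GLM, assuming the statsmodels and scipy packages are available; the simulated data and the split r = 2, d = 4 are made up for illustration.

```python
import numpy as np
import statsmodels.api as sm
from scipy import stats

rng = np.random.default_rng(2)
n, r, d = 200, 2, 4
X = rng.normal(size=(n, d))
y = rng.poisson(np.exp(0.5 * X[:, 0] - 0.3 * X[:, 1]))   # only the first r covariates matter

# Fit Model 0 (first r covariates) and Model 1 (all d covariates) by maximum likelihood.
fit0 = sm.GLM(y, X[:, :r], family=sm.families.Poisson()).fit()
fit1 = sm.GLM(y, X, family=sm.families.Poisson()).fit()

# L_n = -2 * (log-likelihood under Model 0 - log-likelihood under Model 1)
L_n = -2 * (fit0.llf - fit1.llf)
p_value = stats.chi2.sf(L_n, df=d - r)    # asymptotic chi-square with d - r degrees of freedom
print(L_n, p_value)
```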
22
Null model and full model
■ If you are familiar with GLMs (or in case you are not but open a book on GLMs)
you know (will learn) that GLM researchers have a soft spot for two particular
models:
◆ the full model; and
◆ the null model.
■ Definition (full model): We call the model with linear predictor
$\sum_{j=1}^{n}\beta_j x_{ij} = \beta_i$ for all i = 1, . . . , n the full model (cf. Week 2, Quiz 3, Question 2);
this means E[Yi ] = h(βi ).
■ Definition (null model): We call the model with linear predictor
$\sum_{j=1}^{n}\beta_j x_{ij} = \beta_1$ for all i = 1, . . . , n the null model;
this means E[Yi ] = h(β1 ).
■ We see that these two models are the most extreme ones we can think of: the
full model has n parameters for n observations, while the null model has only one parameter.
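As a small illustration (my own, with made-up dimension n = 5), one way to realize the two linear predictors as design matrices: the null model uses a single column of ones, the full model one indicator column per observation:

```python
import numpy as np

n = 5
X_null = np.ones((n, 1))   # null model: one parameter beta_1, linear predictor beta_1 for every i
X_full = np.eye(n)         # full model: n parameters, linear predictor beta_i for observation i
print(X_null.shape, X_full.shape)
```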
23
Prediction error
■ Above we discussed hypothesis tests to choose between two nested models.
■ In this section we look at other criteria to make such a choice.
■ Our setting will be as follows
◆ For each response Yi we also have measured d covariates xi1 , . . . , xid .
◆ We assume a linear relation between the expected values of the Yi and the
covariates.
◆ In our linear regression model we include M covariates. Here M can be any
number between 1 and d. I use here M instead of r to make clear that we
follow a different approach in this section.
◆ We are interested in predicting k future responses denoted by
$Y^F = (Y_1^F, \ldots, Y_k^F)$.
◆ We denote the expectations of these future responses by
$\mu_{Y^F} = (\mu_{Y_1^F}, \ldots, \mu_{Y_k^F})$.
◆ The covariates associated with $Y^F$ that will be used in our model are denoted
by $x_{ij}^F$, 1 ≤ i ≤ k, 1 ≤ j ≤ M.
26
Prediction error (cont’d)
Our setting (cont’d)
■ If we knew the regression coefficients, our prediction function for $Y_i^F$ would be
$\sum_{j=1}^{M}\beta_j x_{ij}^F$.
■ !!! Note that here we may have
$$\mu_{Y_i^F} \neq \sum_{j=1}^{M}\beta_j x_{ij}^F,$$
because the model with only M covariates need not contain the true regression function.
■ Estimating the coefficients of the model with M covariates by least squares gives
$\breve\beta_1, \ldots, \breve\beta_M$. Here as before we do not use a hat because we have estimates for a model that
does not use all covariates.
27
Prediction error (cont’d)
■ Having described our setting, we now need a performance measure.
■ As a measure for the performance of our model we look here at the squared
prediction error (SPE), i.e.
$$\sum_{i=1}^{k} E\Big[\Big(Y_i^F - \sum_{j=1}^{M}\breve\beta_j x_{ij}^F\Big)^2\Big].$$
■ We will look at the special case k = n and xFij = xij . In this case the derivations
simplify a bit without affecting the main message.
■ As a first step for calculating the SPE we determine the squared error between
$\mu_{Y_i^F}$ and our prediction $\sum_{j=1}^{M}\breve\beta_j x_{ij}$. We find
$$\sum_{i=1}^{n} E\Big[\Big(\mu_{Y_i^F} - \sum_{j=1}^{M}\breve\beta_j x_{ij}\Big)^2\Big]
= \sum_{i=1}^{n} E\Big[\Big(\mu_{Y_i^F} - E\Big[\sum_{j=1}^{M}\breve\beta_j x_{ij}\Big] + E\Big[\sum_{j=1}^{M}\breve\beta_j x_{ij}\Big] - \sum_{j=1}^{M}\breve\beta_j x_{ij}\Big)^2\Big]$$
28
Prediction error (cont’d)
■ The expectation on the previous slide can be re-written as
$$\sum_{i=1}^{n} E\Big[\Big(\mu_{Y_i^F} - E\Big[\sum_{j=1}^{M}\breve\beta_j x_{ij}\Big]\Big)^2\Big]$$
$$+\ 2\sum_{i=1}^{n} E\Big[\Big(\mu_{Y_i^F} - E\Big[\sum_{j=1}^{M}\breve\beta_j x_{ij}\Big]\Big)\Big(E\Big[\sum_{j=1}^{M}\breve\beta_j x_{ij}\Big] - \sum_{j=1}^{M}\breve\beta_j x_{ij}\Big)\Big]$$
$$+\ \sum_{i=1}^{n} E\Big[\Big(E\Big[\sum_{j=1}^{M}\breve\beta_j x_{ij}\Big] - \sum_{j=1}^{M}\breve\beta_j x_{ij}\Big)^2\Big]$$
■ The term inside the first expectation is constant. Hence, we can drop the
expectation.
■ The second expectation is zero.
■ The third expectation is actually a variance which can be re-written as on the next
slide.
29
Prediction error (cont’d)
We have
$$\sum_{i=1}^{n} E\Big[\Big(E\Big[\sum_{j=1}^{M}\breve\beta_j x_{ij}\Big] - \sum_{j=1}^{M}\breve\beta_j x_{ij}\Big)^2\Big] = \sigma^2\,\mathrm{tr}\Big(X_n^M\big((X_n^M)^T X_n^M\big)^{-1}(X_n^M)^T\Big) = M\sigma^2,$$
where
■ tr denotes the trace of a matrix;
■ the matrix XnM is the design matrix as on slide 9 but with M in place of r;
■ σ 2 = Var[Yi ] with Yi as on slide 27; and
■ the last equality used that the trace of a projection matrix equals its rank (here M ).
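A tiny numerical check (my own sketch, with an arbitrary design matrix) that the trace of the projection matrix indeed equals M:

```python
import numpy as np

rng = np.random.default_rng(3)
n, M = 30, 4
X_M = rng.normal(size=(n, M))                   # design matrix with M covariates
H = X_M @ np.linalg.inv(X_M.T @ X_M) @ X_M.T    # projection ("hat") matrix
print(np.trace(H))                              # equals M (here 4), up to rounding
```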
30
Prediction error (cont’d)
■ Let us now calculate the SPE.
■ We have
$$\sum_{i=1}^{n} E\Big[\Big(Y_i^F - \sum_{j=1}^{M}\breve\beta_j x_{ij}\Big)^2\Big] = \sum_{i=1}^{n} E\Big[\Big(Y_i^F - \mu_{Y_i^F} + \mu_{Y_i^F} - \sum_{j=1}^{M}\breve\beta_j x_{ij}\Big)^2\Big]$$
$$= \sum_{i=1}^{n} E\Big[\big(Y_i^F - \mu_{Y_i^F}\big)^2\Big] + \sum_{i=1}^{n} E\Big[\Big(\mu_{Y_i^F} - \sum_{j=1}^{M}\breve\beta_j x_{ij}\Big)^2\Big]$$
$$= n\sigma^2 + M\sigma^2 + \sum_{i=1}^{n}\Big(\mu_{Y_i^F} - E\Big[\sum_{j=1}^{M}\breve\beta_j x_{ij}\Big]\Big)^2,$$
where
◆ The second equality used $E\big[(Y_i^F - \mu_{Y_i^F})(\mu_{Y_i^F} - \sum_{j=1}^{M}\breve\beta_j x_{ij})\big] = 0$; and
◆ The third equality that we already calculated $E\big[(\mu_{Y_i^F} - \sum_{j=1}^{M}\breve\beta_j x_{ij})^2\big]$ on
the previous slides.
31
Prediction error (cont’d)
The formula for the SPE just derived tells us several interesting things:
■ The first term, i.e. $n\sigma^2$, is unavoidable because it comes from the fact that $Y_i^F$ is
random and fluctuates around its expected value $\mu_{Y_i^F}$;
■ The second term, i.e. $M\sigma^2$, stems from estimating the M regression coefficients.
It is an estimation error;
■ The third term, i.e. $\sum_{i=1}^{n}\big(\mu_{Y_i^F} - E[\sum_{j=1}^{M}\breve\beta_j x_{ij}]\big)^2$, can be seen as an
approximation error because $\mu_{Y_i^F}$ is the true expectation of $Y_i$ that we approximate
with our M regressors;
■ We can decrease the approximation error by increasing the number of covariates used in our
regression function. Yet, this will increase the estimation error;
■ We can decrease the estimation error by decreasing the number of covariates used
in our regression function. Yet, this will increase the approximation error.
The last two bullets are known as the variance-bias trade-off. It is a fundamental property
of prediction in all statistical models.
32
Prediction error (cont’d)
■ Given the original sample y1 , . . . , yn and x11 , . . . , xnd (and no future values) we
can estimate the sum of the second and third part of the SPE, i.e.,
$$M\sigma^2 + \sum_{i=1}^{n}\Big(\mu_{Y_i^F} - E\Big[\sum_{j=1}^{M}\breve\beta_j x_{ij}^F\Big]\Big)^2,$$
by
$$M\hat\sigma^2 + \sum_{i=1}^{n}\Big(y_i - \sum_{j=1}^{M}\breve\beta_j x_{ij}\Big)^2, \qquad (6)$$
where $\hat\sigma^2$ is an estimator for the variance that uses all covariates, i.e.
$$\hat\sigma^2 = \frac{1}{n-d}\sum_{i=1}^{n}\Big(y_i - \sum_{j=1}^{d}\hat\beta_j x_{ij}\Big)^2.$$
■ We can determine (6) for any subset of regressors we are interested in and then
choose the subset for which it is smallest.
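To close, a minimal sketch (my own, with made-up data and arbitrarily chosen candidate subsets) of how criterion (6) can be used: fit each candidate subset by least squares, add M σ̂² with σ̂² computed from the fit using all d covariates, and pick the subset with the smallest value.

```python
import numpy as np
from itertools import combinations

rng = np.random.default_rng(4)
n, d = 80, 4
X = rng.normal(size=(n, d))
y = X[:, 0] - 0.5 * X[:, 1] + rng.normal(size=n)   # only the first two covariates matter

def rss(X_sub, y):
    """Residual sum of squares of the least-squares fit on X_sub."""
    beta, *_ = np.linalg.lstsq(X_sub, y, rcond=None)
    resid = y - X_sub @ beta
    return resid @ resid

sigma2_hat = rss(X, y) / (n - d)                   # variance estimate using all d covariates

scores = {}
for M in range(1, d + 1):
    for subset in combinations(range(d), M):       # every subset of M covariates
        scores[subset] = M * sigma2_hat + rss(X[:, list(subset)], y)   # criterion (6)

best = min(scores, key=scores.get)
print(best, scores[best])
```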
34