
Selective inference in regression models

with groups of variables


arXiv:1511.01478v1 [stat.ME] 4 Nov 2015

Joshua Loftus∗,† and Jonathan Taylor†


Department of Statistics
Sequoia Hall
390 Serra Mall
Stanford, CA
e-mail: joftius@stanford.edu; jonathan.taylor@stanford.edu

Abstract: We discuss a general mathematical framework for selective inference with supervised model selection procedures characterized by quadratic forms in the outcome variable. Forward stepwise with groups of variables is an important special case, as it allows models with categorical variables or factors. Models can be chosen by AIC, BIC, or a fixed number of steps. We provide an exact significance test for each group of variables in the selected model, based on appropriately truncated χ or F distributions for the cases of known and unknown σ², respectively. An efficient software implementation is available as a package in the R statistical programming language.

1. Introduction

A common strategy in data analysis is to assume a class of models M specified a priori contains a particular model M ∈ M which is well-suited to the data. Unfortunately, when the data has been used to pick a particular model M̂, inference about the selected model is usually invalid. For example, in the regression setting with many predictors, if the "best" predictors are chosen by lasso or forward stepwise, then the classical t, χ², or F tests for their coefficients will be anti-conservative. The most widespread solution is data splitting, where one subset of the data is used only for model selection and another subset is used only for inference. This results in a loss of accuracy for model selection and of power for inference.
Recently, Lockhart et al. (2014) and Lee et al. (2015) developed methods of adjusting the classical significance tests to account for model selection. These methods make use of the fact that the model chosen by lasso and forward stepwise
can be characterized by affine inequalities in the data. Inference conditional on
the selected model requires truncating null distributions to the affine selection
region. The present work provides a new and more general mathematical frame-
work based on quadratic inequalities. Forward stepwise with groups of variables
serves as the main example which we develop in some detail.
Allowing variables to be grouped provides important modeling flexibility to respect known structure in the predictors. Grouping allows us, for example, to fit hierarchical interaction models as in Lim and Hastie (2013), non-linear models with groups of spline bases as in Chouldechova and Hastie (2015), or factor models where groups are determined by scientific knowledge, such as variables in common biological pathways. Further, in the presence of categorical variables, without grouping a model selection procedure may include only some subset of the levels of that variable.

∗ Supported in part by NIH grant 5T32GMD96982.
† Supported in part by NSF grant DMS 1208857 and AFOSR grant 113039.
All methods described here are implemented in the selectiveInference R
package (Tibshirani et al., 2015) available on CRAN.
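As a quick orientation to the software, a minimal sketch of the intended call pattern is shown below. The function names groupfs and groupfsInf and their arguments are recalled from the package documentation and are assumptions here, not guaranteed by this paper; they may differ across package versions, so consult the package help pages before use.

## Sketch only: argument names (index =, maxsteps =, k =) are assumptions and
## may differ between versions of the selectiveInference package.
library(selectiveInference)

set.seed(1)
n <- 50; p <- 10
X <- matrix(rnorm(n * p), n, p)
index <- rep(1:5, each = 2)                       # five groups of two columns each
y <- as.vector(X[, 1:2] %*% c(1, -1) + rnorm(n))  # signal only in group 1

fit <- groupfs(X, y, index = index, maxsteps = 5, k = log(n))  # BIC-type penalty
groupfsInf(fit)                                   # selective (truncated) p-values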

1.1. California county health data example

Information on health and various other indicators has been aggregated to the county level by the University of Wisconsin Population Health Institute (2015). We attempt to fit a linear model for California counties with outcome given by log-years of potential life lost, and search among 31 predictors using forward stepwise. This example has groups only of size 1, but it still serves to illustrate the procedure. With the BIC criterion for choosing the model size, the resulting model has the 8 variables listed in Table 1. The naive p-values computed from standard hypothesis tests for linear regression models are biased to be small because we chose the 8 best predictors out of 31.
Table 1
Significance test p-values for predictors chosen by forward stepwise with BIC.

Step  Variable                   Naive    Selective
 1    %80th Percentile Income    <0.001   0.001
 2    Injury Death Rate          <0.001   0.086
 3    Chlamydia Rate              0.078   0.287
 4    %Obese                     <0.001   0.170
 5    %Receiving HbA1c           <0.001   0.335
 6    %Some College               0.005   0.864
 7    Teen Birth Rate             0.071   0.940
 8    Violent Crime Rate          0.067   0.179

The selective p-values reported here are the T χ statistics described in Sec-
tion 3. These have been adjusted to account for model selection. Using the data
to select a model and then conducting inference about that model means we are
randomly choosing which hypotheses to test. In this scenario, Fithian, Sun and Taylor
(2014) propose controlling the selective type 1 error defined as

P_{M, H_0(M)}(reject H_0(M) | M selected).    (1)

The notation H0 (M ) emphasizes the fact that the hypothesis depends upon the
model M . Scientific studies reporting p-values which control (1) would not suffer
from model selection bias, and consequently would have greater replicability.
One of the most common methods which controls selective type 1 error is data
splitting, where independent subsets of the data are used for model selection
and inference separately. Unfortunately, by using less data to select the model
and less data to conduct tests this method suffers from loss of model selection
accuracy and lower power. The T χ and TF hypothesis tests described in the

present work control (1) while using the whole data for both selection and
inference.

1.2. Background: the affine framework

Many model selection procedures, including lasso, forward stepwise, least angle
regression, marginal screening, and others, can be characterized by affine in-
equalities. In other words, for each m ∈ M there exists a matrix Am and vector
bm such that M̂ (y) = m is equivalent to Am y ≤ bm . Hence, assuming

y = µ + ǫ,   ǫ ∼ N(0, σ²I)    (2)

we can carry out inference conditional on M̂(y) = m by constraining the multivariate normal distribution to the polyhedral region {z : A_m z ≤ b_m}.
One limitation of these methods is that the definition of a model m includes
not only the active set E of variables chosen by the lasso or forward stepwise,
but also their signs. Without including signs, the model selection region becomes
a union of 2^|E| polytopes. Details are provided in Lee et al. (2015). The forward
stepwise and least angle regression cases without groups of variables appear in
Tibshirani et al. (2014).
In an earlier work, Loftus and Taylor (2014) iterated the global null hypothe-
sis test of Taylor, Loftus and Tibshirani (2015) at each step of forward stepwise
with groups. The test there is only exact at the first step where it coincides with
the T χ test described here, and empirically it was increasingly conservative at
later steps. The present work does not pursue sequential testing, and instead
conducts exact tests for all variables included in the final model.

1.3. The quadratic framework

Our approach will exploit the structure of model selection procedures charac-
terized by quadratic inequalities in the following sense.
Definition 1.1. A quadratic model selection procedure is a map M : X → M determining a model m = M(y), such that for any m ∈ M

E_m := {y : M(y) = m} = ∩_{j∈J_m} E_{m,j},
E_{m,j} := {y : y^T Q_{m,j} y + a_{m,j}^T y + b_{m,j} ≥ 0}.    (3)

The power of this definition lies in the ease with which we can compute
one-dimensional slices through Em . This enables us to compute closed-form
results for certain one-dimensional statistics and to potentially implement hit-
and-run sampling schemes. Consider tests based on T² := ‖P y‖₂² for some fixed
projection matrix P. Write u = P y/T, z = y − P y, and substitute uT + z for y in
the definition (3). Conditioning on z and u, the only remaining variation is in T,
and the equations above are univariate quadratics. This allows us to determine

the truncation region for T induced by the model selection M (y) = m. We now
apply this approach and work out the details for several examples. The example
in Section 4 is a slight extension beyond the quadratic framework that illustrates
how this approach may be further generalized.
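To make the one-dimensional slicing concrete, the short sketch below (an illustration, not code from the paper or its software) expands a single selection inequality y^T Q y + a^T y + b ≥ 0 along the line y = z + tσu into a univariate quadratic in t; the coefficients mirror those that appear later in (21).

## Sketch: restrict one quadratic selection inequality y'Qy + a'y + b >= 0
## to the line y = z + t*sigma*u; returns (A, B, C) with A t^2 + B t + C >= 0.
## Assumes Q is symmetric.
slice_quadratic <- function(Q, a, b, u, z, sigma) {
  A <- sigma^2 * as.numeric(t(u) %*% Q %*% u)
  B <- sigma * as.numeric(2 * t(u) %*% Q %*% z + sum(a * u))
  C <- as.numeric(t(z) %*% Q %*% z + sum(a * z) + b)
  c(A = A, B = B, C = C)
}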

2. Forward stepwise with groups

This section establishes notation and describes the particular variant of grouped
forward stepwise referred to throughout this paper. Readers familiar with the
statistical programming language R may know of this algorithm as the one im-
plemented in the step function. Because we are interested in groups of variables,
the stepwise algorithm is not stated in terms of univariate correlations the way
forward stepwise is often presented.
We observe an outcome y ∼ N (µ, σ 2 I) and wish to model µ using a sparse
linear model µ ≈ Xβ for a matrix of predictors X and sparse coefficient vector
β. We assume the matrix X is subdivided a priori into groups
 
X = [X_1  X_2  · · ·  X_G]    (4)

with Xg denoting the submatrix of X with all columns of the group g for
1 ≤ g ≤ G. Forward stepwise, described in Algorithm 1, picks at each step the
group minimizing a penalized residual sum of squares criterion. The penalty is
an AIC-type criterion penalizing group size, since otherwise the algorithm would
be biased toward selecting larger groups. First we describe the case where σ² is
known. With a parameter k ≥ 0, the penalized RSS criterion is

g_1 = argmin_g ‖(I − P_g^1)y‖₂² + kσ² Tr(P_g^1)    (5)

where P_g^1 := X_g X_g^† and Tr(P_g^1) is the degrees of freedom used by X_g. Minimizing
this with respect to g is equivalent to maximizing

RSS_1(g) := y^T P_g^1 y − kσ² Tr(P_g^1).    (6)

Note that the event E_1 := {z : argmax_g RSS_1(g) = g_1} can be decomposed as
an intersection

E_1 = ∩_{g ≠ g_1} {z : z^T Q_g^1 z + k_g^1 ≥ 0}    (7)

where Q_g^1 := P_{g_1}^1 − P_g^1 and k_g^1 := kσ² Tr(P_{g_1}^1 − P_g^1). We ignore ties and use >
and ≥ interchangeably; with probability 1, ties do not occur under fairly general
conditions on the design matrix.
After adding a group of variables to the active set, we orthogonalize the
outcome and the remaining groups with respect to the added group. So, at step
s > 1, we have A_s = {g_1, g_2, . . . , g_{s−1}} and define

P^s := (I − P_{g_{s−1}}^{s−1})(I − P_{g_{s−2}}^{s−2}) · · · (I − P_{g_1}^1),
r_s := P^s y,   X_g^s := P^s X_g,   P_g^s := X_g^s (X_g^s)^†.    (8)

Data: An n-vector y and an n × p matrix X with G groups, complexity penalty parameter k ≥ 0
Result: Ordered active set A of groups included in the model at each step
begin
    A ← ∅, A^c ← {1, . . . , G}, r_0 ← y
    for s = 1 to steps do
        P_g ← X_g X_g^† for each g ∈ A^c
        g* ← argmax_{g ∈ A^c} { r_{s−1}^T P_g r_{s−1} − kσ² Tr(P_g) }
        A ← A ∪ {g*},  A^c ← A^c \ {g*}
        for h ∈ A^c do
            X_h ← (I − P_{g*}) X_h
        r_s ← (I − P_{g*}) r_{s−1}
    return A
Algorithm 1: Forward stepwise with groups, σ² known
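A minimal base-R sketch of Algorithm 1 is given below for illustration; it assumes each group's submatrix has full column rank and is not the package implementation.

## Sketch of Algorithm 1: grouped forward stepwise with sigma^2 known.
## X: n x p matrix; groups: length-p vector of group labels; k: penalty (log(n) gives BIC).
group_fs <- function(X, y, groups, sigma, steps, k = 2) {
  stopifnot(length(groups) == ncol(X))
  active <- c(); inactive <- unique(groups); r <- y
  for (s in seq_len(steps)) {
    crit <- sapply(inactive, function(g) {
      Qg <- qr.Q(qr(X[, groups == g, drop = FALSE]))    # orthonormal basis of group g
      sum((t(Qg) %*% r)^2) - k * sigma^2 * ncol(Qg)     # r'P_g r - k sigma^2 Tr(P_g)
    })
    g_star <- inactive[which.max(crit)]
    active <- c(active, g_star); inactive <- setdiff(inactive, g_star)
    Qs <- qr.Q(qr(X[, groups == g_star, drop = FALSE]))
    r <- r - Qs %*% (t(Qs) %*% r)                        # orthogonalize the residual...
    for (h in inactive) {                                # ...and the remaining groups
      cols <- groups == h
      X[, cols] <- X[, cols] - Qs %*% (t(Qs) %*% X[, cols, drop = FALSE])
    }
  }
  active
}
## e.g. active <- group_fs(X, y, groups, sigma = 1, steps = 3, k = log(nrow(X)))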

Then the group added at step s is

g_s = argmax_{g ∈ A^c} r_s^T P_g^s r_s − kσ² Tr(P_g^s).    (9)

Now for
E_s := {z ∈ E_{s−1} : argmax_g RSS_s(g) = g_s}    (10)
we have
E_s = E_{s−1} ∩ ∩_{g ∈ A_s^c, g ≠ g_s} {z : z^T P^s Q_g^s P^s z + k_g^s ≥ 0}    (11)

where Q_g^s := P_{g_s}^s − P_g^s and k_g^s := kσ² Tr(P_{g_s}^s − P_g^s). We have established the
following lemma.
Lemma 2.1. The selection event E that forward stepwise selects the ordered
active set As = {g1 , g2 , . . . , gs } can be written as an intersection of quadratic
inequalities all of the form y T Qy + b ≥ 0, which satisfies Definition 1.1.
We leverage this fact to compute truncation intervals for selective significance
tests based on one-dimensional slices through E. When σ is known, the trun-
cated χ test statistic T χ described in Section 3 is used to test the significance
of each group in the active set. For unknown σ, the corresponding truncated
F statistic TF is detailed in Section 4. The unknown σ 2 case alters the above
algorithm by changing the RSS criterion to the form

g_s = argmax_g (r_s^T r_s − r_s^T P_g^s r_s) exp(−k Tr(P_g^s)).    (12)

This merely replaces the additive constant with a multiplicative one. Finally,
we note that k = 2 corresponds to the classic AIC criterion of Akaike (1973),
k = log(n) yields BIC (Schwarz, 1978), and k = 2 log(p) gives the RIC
criterion of Foster and George (1994). The extension of Algorithm 1 to allow
stopping using the AIC criterion is given later in Section 5.

2.1. Linear model and null hypothesis

Once forward stepwise has terminated yielding an active set A, we wish to


conduct inference about the model regressing y on XA where the subscript
denotes all columns of X corresponding to groups in A. Throughout this paper
we do not assume the linear model is correctly specified, i.e. it is possible that
µ 6= XA β for all β. This is a strength for robustness purposes, however it also
results in lower power against alternatives where the linear model is correctly
specified and forward stepwise captures the true active set.
Even when the linear model is not correctly specified, there is still a well-
defined best linear approximation

β_A := argmin_β E[‖y − X_A β‖₂²]    (13)

with the usual estimate given by the least squares fit β̂_A = X_A^† y.
The probability modeling assumption throughout this paper is (2), and the
null hypothesis for a group g in A is given by

H_0(A, g) : β_{A,g} = 0    (14)

where βA,g denotes the coordinates of coefficients for group g in the linear
model determined by XA . Note that this null hypothesis depends on A, as
the coefficients for the same group g in a different active set A′ 6= A with g ∈ A′
have a different meaning, namely they are regression coefficients controlling for
the variables in A′ rather than in A. Several equivalent formulations of (14) are

H_0(A, g) : X_g^T (I − P_{A/g})µ = 0,
H_0(A, g) : P_g (I − P_{A/g})µ = 0,    (15)
H_0(A, g) : P̃_g µ = 0

where P̃g is the projection onto the column space of (I − PA/g )Xg . We use these
forms to emphasize that the probability model for y is determined by µ and not
by β ∗ .
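For illustration (a sketch under a full-column-rank assumption, not the package code), the projection P̃_g in (15) and the least squares fit in (13) can be formed as follows.

## Sketch: the projection P~_g of (15) and the least squares fit for active set A.
## Xg: columns of group g; X_minus_g: the remaining columns of X_A.
nullproj <- function(Xg, X_minus_g) {
  Q_minus <- qr.Q(qr(X_minus_g))
  Xg_tilde <- Xg - Q_minus %*% (t(Q_minus) %*% Xg)   # (I - P_{A/g}) X_g
  Ug <- qr.Q(qr(Xg_tilde))                           # orthonormal basis of its column space
  Ug %*% t(Ug)                                       # P~_g
}
beta_hat <- function(XA, y) qr.coef(qr(XA), y)       # least squares estimate of (13)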

3. Truncated χ significance test

We now describe our significance test for a group of variables g in the active set
A, under the probability model y ∼ N (µ, σ 2 I) where we assume σ is known or
can be estimated independently from y. Section 4 concerns the case where σ is
unknown.

3.1. Test statistic, null hypothesis, and distribution

First we consider forward stepwise with a fixed, deterministic number of steps S.
Choosing S with an AIC-type criterion requires some additional work described
in Section 5. Let A denote the ordered active set and E the event that A =
{g_1, g_2, . . . , g_S}. With X̃_g = (I − P_{A/g})X_g, let X̃_g = U_g D_g V_g^T be the singular
value decomposition of X̃_g. Note that P̃_g = U_g U_g^T. In this notation, without
model selection, the test statistic and null distribution

T² := σ^{−2} ‖U_g^T y‖₂² ∼ χ²_{Tr(P̃_g)} under H_0(A, g)    (16)

form the usual regression model hypothesis test for the group g of variables in
the model X_A when σ² is known.
Let u ∝ P̃_g y be a unit vector, so y = z + σT u where z = (I − P̃_g)y. Under the
null, (T, z, u) are independent because P̃_g y and z are orthogonal and (T, u) are
independent by Basu's theorem. We condition on z and u without changing the
test under the null, but it should be noted that this may result in some loss of
power under certain alternatives. Now the only remaining variation is in T, and
we still have T² ∼ χ²_{Tr(P̃_g)} if there is no model selection. Finally, to obtain the
selective hypothesis test we apply the following theorem with P_A = P̃_g.


Theorem 3.1. If y ∼ N(µ, σ²I) with σ² known, and P_A µ = 0 for some P_A
which is constant on {M(y) = A}, then defining r := Tr(P_A), R := P_A y, u :=
R/‖R‖₂, z := y − R, the truncation interval M_A := {t ≥ 0 : M(utσ + z) = A},
and the observed statistic T = ‖R‖₂/σ, we have

T | (A, z, u) ∼ χ_r |_{M_A}    (17)

where the vertical bar denotes truncation. Hence, with f_r the pdf of a central χ_r
random variable,

Tχ := [ ∫_{M_A ∩ [T,∞)} f_r(t) dt ] / [ ∫_{M_A} f_r(t) dt ] ∼ U[0, 1]    (18)

is a p-value.
Proof. First consider PA = P as fixed independently of y, and write TP , zP , uP ,
and MA (P ) to emphasize these are determined by P . Without selection, we
know (TP , zP , uP ) are independent and TP |(zP , uP ) ∼ χr . Since the selection
event is determined entirely by M (uP tσ + zP ) = A, conditioning further on
{M (y) = A} has the effect of truncating TP to MA (P ), so

TP |(A, zP , uP ) ∼ χr |MA (P ) . (19)

Now we let P = PA , and note that TPA , zPA , uPA depend on P only through
A, hence the same is true of MA (P ). So since PA is fixed on {M (y) = A}, the
conclusion (19) still holds with this random choice of P .
This establishes (17), and (18) follows by application of the truncated survival
function transform, or, equivalently, truncated CDF transform.
Since the right hand side of (18) does not functionally depend on any of the
conditions involving A, the p-value result holds unconditionally. But in this
case the left hand side cannot be interpreted as a test statistic for a fixed group
of variables in a specific model. We interpret the p-value Tχ conditionally on
selection so that it has the desired meaning.
This test is not optimal in the selective UMPU sense described in Fithian, Sun and Taylor
(2014). We do not assume the linear model is correct and we condition on z;
this corresponds to the “saturated model” in their paper. We do this for
computational reasons, as described in the next section. Optimizing computation to
make increased power feasible by not conditioning on z is an area of ongoing
work.

3.2. Computing the T χ truncation interval

Computing the support of Tχ is possible due to Lemma 2.1, since

M_A = {t ≥ 0 : M(utσ + z) = A}
    = ∩_{s=1}^{S} ∩_{g ∈ A_s^c, g ≠ g_s} {t ≥ 0 : (utσ + z)^T P^s Q_g^s P^s (utσ + z) + k_g^s ≥ 0}
    = ∩_{s=1}^{S} ∩_{g ∈ A_s^c, g ≠ g_s} {t ≥ 0 : a_g^s t² + b_g^s t + c_g^s ≥ 0}    (20)

with

a_g^s := σ² u^T P^s Q_g^s P^s u,   b_g^s := 2σ u^T P^s Q_g^s P^s z,   c_g^s := z^T P^s Q_g^s P^s z + k_g^s.    (21)

Each set in the above intersection can be computed from the roots of the
corresponding quadratic in t, yielding either a single interval or a union of two
intervals. Computing M_A as the intersection of these unions of intervals can be
done in O(GS) operations. Empirically we observe that the support M_A is almost
always a single interval, a fact we hope to exploit in further work on sampling.
Since each matrix P_g^s is low rank, in practice we only store the left singular
vectors, which are sufficient for computing the above coefficients. This improves
both storage, since we do not store the full n × n projections, and computation,
since the coefficients (21) require only several inner products rather than a full
matrix-vector product.
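A self-contained sketch of this computation is shown below. For simplicity it evaluates the truncated χ_r integrals in (18) on a grid rather than by exact interval arithmetic, so it is an illustration of the idea rather than the package's implementation.

## Sketch: Tchi p-value (18) by numerical integration of the chi_r density over
## M_A = {t >= 0 : a_j t^2 + b_j t + c_j >= 0 for all j}. quads is a matrix with
## columns (a, b, c), one row per inequality from (21); Tobs is the observed statistic.
tchi_pvalue <- function(Tobs, r, quads, upper = NULL, ngrid = 20000) {
  if (is.null(upper)) upper <- 2 * max(Tobs, sqrt(qchisq(1 - 1e-10, df = r)))
  in_MA <- function(t) all(quads[, 1] * t^2 + quads[, 2] * t + quads[, 3] >= 0)
  tt <- seq(0, upper, length.out = ngrid)
  w <- 2 * tt * dchisq(tt^2, df = r) * vapply(tt, in_MA, logical(1))  # chi_r density on M_A
  sum(w[tt >= Tobs]) / sum(w)
}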

4. Truncated F significance test

We next turn to the case when σ 2 is unknown. We will see that this example
does not satisfy Definition 1.1, but the spirit of the approach remains the same.
A selective F test was first explored by Gross, Taylor and Tibshirani (2015) for
affine model selection procedures. Following these authors, we write

P_sub := P_{A/g},   P_full := P_A,
R_1 := (I − P_sub)y,   R_2 := (I − P_full)y.    (22)

and consider the F statistic

T := (‖R_1‖₂² − ‖R_2‖₂²) / (c ‖R_2‖₂²),   with c := Tr(P_full − P_sub) / Tr(I − P_full) > 0.    (23)

Writing u := R_1/r with r = ‖R_1‖₂, we have the following decomposition

u = u(T) = g_1(T) v_Δ + g_2(T) v_2,
with v_Δ := (R_1 − R_2)/‖R_1 − R_2‖₂,   v_2 := R_2/‖R_2‖₂,    (24)
g_1(T) := √(cT/(1 + cT)),   and g_2(T) := 1/√(1 + cT).

Then y = r u(T) + z for z = P_sub y. After conditioning on (r, z, v_Δ, v_2) the only
remaining variation is in T and hence u(T). Without model selection, the usual
regression test for such nested models would be

T ∼ F_{Tr(P_A − P_{A/g}), Tr(I − P_A)} under H_0(A, g).    (25)

Theorem 4.1. If y ∼ N(µ, σ²I) with σ² unknown, then with definitions (22),
(23), (24), the truncation interval M_A := {t ≥ 0 : M(r u(t) + z) = A}, d_1 :=
Tr(P_A − P_{A/g}), and d_2 := Tr(I − P_A), under H_0(A, g) we have

T | (z, u, A) ∼ F_{d_1, d_2} |_{M_A}    (26)

where the vertical bar denotes truncation. Hence, with f_{d_1,d_2} the pdf of an F_{d_1,d_2}
random variable,

TF := [ ∫_{M_A ∩ [T,∞)} f_{d_1,d_2}(t) dt ] / [ ∫_{M_A} f_{d_1,d_2}(t) dt ] ∼ U[0, 1]    (27)

is a p-value conditional on (v_Δ, v_2, A). Since the right hand side does not depend
on these conditions, (27) also holds marginally.
Proof. Using the fact that (R1 − R2 , R2 ) are independent, the rest of the proof
follows the same argument as Theorem 3.1.
Conditioning on (z, v_Δ, v_2) reduces power, but it is unclear how to compute
the truncation region without using this strategy. As we see next, this is already
non-trivial as a one-dimensional problem for quadratic selection regions.

4.1. Computing the TF truncation interval

Instead of a quadratically parametrized curve through the selection region as
we saw in the Tχ case, we now must compute positive level sets of functions of
the form

I_{Q,a,b}(t) := [z + r u(t)]^T Q [z + r u(t)] + a^T [z + r u(t)] + b
            = g_1² x_11 + g_1 g_2 x_12 + g_2² x_22 + g_1 x_1 + g_2 x_2 + x_0    (28)

where (absorbing the factor r into the coefficients)

g_1(t) = r √(ct/(1 + ct)),   g_2(t) = r/√(1 + ct),
x_11 := v_Δ^T Q v_Δ,   x_12 := 2 v_Δ^T Q v_2,   x_22 := v_2^T Q v_2,    (29)
x_1 := 2 v_Δ^T Q z + a^T v_Δ,   x_2 := 2 v_2^T Q z + a^T v_2,   x_0 := z^T Q z + a^T z + b.
By continuity the positive level sets {t ≥ 0 : IQ,a,b (t) ≥ 0} are unions of intervals,
and the selection event is an intersection of sets of this form. We can use the
same strategy as before, solving for each one and then intersecting the unions of
intervals. However, the curve is no longer quadratic: it is not clear how many
roots it may have, and its derivative near 0 may approach ∞. So finding the
roots is not trivial. Our approach begins by reparametrizing with trigonometric
functions. There is an associated complex quartic polynomial and we solve for
its roots first using numerical algorithms specialized for polynomials. A subset
of these are potential roots of the original function. This allows us to isolate
the roots of the original function and solve for them numerically in bounded
intervals containing only one root. The details are as follows.
Since c > 0 and t ≥ 0, we can make the substitution ct = tan²(θ) with
0 ≤ θ < π/2. Then

g_1(θ) = r sin(θ),   g_2(θ) = r cos(θ).    (30)

Using Euler's formula,

g_1(θ)² = r² sin²(θ) = (r²/2)[1 − cos(2θ)] = (r²/4)(2 − e^{i2θ} − e^{−i2θ}),
g_2(θ)² = r² cos²(θ) = (r²/2)[1 + cos(2θ)] = (r²/4)(2 + e^{i2θ} + e^{−i2θ}),
g_1(θ) g_2(θ) = r² [(e^{iθ} − e^{−iθ})/(2i)] [(e^{iθ} + e^{−iθ})/2] = −(ir²/4)(e^{i2θ} − e^{−i2θ}),

motivates us to consider the complex function

p(z) := −(r²/4)(z² + z^{−2} − 2) x_11 − (ir²/4)(z² − z^{−2}) x_12 + (r²/4)(z² + z^{−2} + 2) x_22
        − (ir/2)(z − z^{−1}) x_1 + (r/2)(z + z^{−1}) x_2 + x_0
      = (r/2) [ (r/2)(−x_11 − i x_12 + x_22) z² + (r/2)(−x_11 + i x_12 + x_22) z^{−2}
        + (−i x_1 + x_2) z + (i x_1 + x_2) z^{−1} + (r x_11 + r x_22 + 2 x_0/r) ]

which agrees with (28) when z = e^{iθ}. Hence, the zeroes of I_{Q,a,b}(t) coincide with
the zeroes of the polynomial p̃(z) := z² p(z) with z = e^{iθ} and 0 ≤ θ ≤ π/2. This
is the polynomial we solve numerically, as described above. Finally, transforming
the polynomial roots back to the original domain may not be numerically stable,
which is why we use a numerical method to solve for them again. In the
selectiveInference software we use the R functions polyroot for the polynomial
and uniroot in the original domain. This customized approach is highly robust
and numerically stable, enabling accurate computation of the selection region.
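The sketch below illustrates this root-isolation strategy for a single selection inequality (it is an illustration of the approach just described, not the package's internal code): it builds the quartic p̃ from the coefficients in (29), finds its roots with polyroot, keeps those near the unit circle with argument in (0, π/2), maps them back to candidate values of t, and refines each root with uniroot in a bracket containing only that candidate.

## Sketch: roots of I_{Q,a,b}(t) in (28) via the substitution c*t = tan(theta)^2.
## z, vD, v2, r and cc (the constant c of (23)) are as in (22)-(24).
tf_roots <- function(Q, a, b, z, vD, v2, r, cc, t_max = 1e6) {
  x11 <- c(t(vD) %*% Q %*% vD); x12 <- c(2 * t(vD) %*% Q %*% v2)
  x22 <- c(t(v2) %*% Q %*% v2)
  x1  <- c(2 * t(vD) %*% Q %*% z + sum(a * vD))
  x2  <- c(2 * t(v2) %*% Q %*% z + sum(a * v2))
  x0  <- c(t(z) %*% Q %*% z + sum(a * z) + b)
  ## coefficients of the quartic z^2 p(z), lowest degree first (Section 4.1)
  coefs <- c(r^2 / 4 * (-x11 + 1i * x12 + x22),
             r / 2 * (1i * x1 + x2),
             r^2 / 2 * (x11 + x22) + x0,
             r / 2 * (-1i * x1 + x2),
             r^2 / 4 * (-x11 - 1i * x12 + x22))
  ws <- polyroot(coefs)
  th <- Arg(ws[abs(Mod(ws) - 1) < 1e-6])              # keep roots on the unit circle
  th <- th[th > 0 & th < pi / 2]
  cand <- sort(tan(th)^2 / cc)                        # candidate roots in t
  Ifun <- function(t) {                               # the original function I(t)
    yt <- z + r * sqrt(cc * t / (1 + cc * t)) * vD + r / sqrt(1 + cc * t) * v2
    c(t(yt) %*% Q %*% yt + sum(a * yt) + b)
  }
  ## re-solve in bounded brackets that each contain at most one candidate root
  ends <- unique(c(0, if (length(cand) > 1) (head(cand, -1) + tail(cand, -1)) / 2, t_max))
  roots <- c()
  for (i in seq_len(length(ends) - 1))
    if (Ifun(ends[i]) * Ifun(ends[i + 1]) < 0)
      roots <- c(roots, uniroot(Ifun, c(ends[i], ends[i + 1]))$root)
  roots
}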

5. Choosing a model with AIC

Instead of running forward stepwise for S steps with S chosen a priori, we would
like to choose when to stop adaptively. An AIC-style criterion is one classical
way of accomplishing this, choosing the s that minimizes

−2 log(L_s) + k · edf_s    (31)

where L_s is the likelihood and edf_s denotes the effective degrees of freedom of
the model at step s. As noted in Section 2, with k = 2 this is the usual AIC,
with k = log(n) it is BIC, and with k = 2 log(p) it is RIC. Note that edf
accounts for whether the error variance σ has been estimated. For example, for a
linear model with unknown σ, minimizing the above is equivalent to minimizing

log(‖y − X_s β_s‖₂²) + k · (1 + ‖β_s‖₀)/n    (32)

where ‖·‖₀ counts the number of nonzero entries. Instead of taking the approach
which minimizes this over 1 ≤ s ≤ S, we adopt an early-stopping rule which
picks the s after which the AIC criterion increases s₊ times in a row, for some
s₊ ≥ 1. Readers familiar with the R programming language might recognize
this as the default behavior of step when s₊ = 1.
Fortunately, since the stopping rule depends only on successive values of the
quadratic objective which we are already tracking, we can condition on the
event {ŝ = s} by appending some additional quadratic inequalities encoding
the successive comparisons. The results concerning computing T χ or TF remain
intact, with the only complication being the addition of a few more inequalities
in the computation of MA .
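In code, the early-stopping rule is only a few lines; the sketch below assumes the criterion values (31) have already been computed at each step.

## Sketch: stop once the AIC-type criterion (31) has increased s_plus times in a row.
## aic_by_step: criterion values at steps 1, ..., S; returns the chosen step.
stop_step <- function(aic_by_step, s_plus = 1) {
  increases <- 0
  for (s in seq_along(aic_by_step)[-1]) {
    increases <- if (aic_by_step[s] > aic_by_step[s - 1]) increases + 1 else 0
    if (increases == s_plus) return(s - s_plus)   # last step before the run of increases
  }
  length(aic_by_step)                             # the criterion never triggered early stopping
}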

6. Theory and power

In this section we describe the behavior of the T χ statistic under some simpli-
fying assumptions and evaluate power against some alternatives.
Definition 6.1. By groupwise orthogonality or orthogonal groups we mean that
X_g^T X_h = 0_{p_g × p_h} for all g ≠ h, and the columns of X satisfy ‖X_j‖₂ = 1 for all j.
Proposition 6.1. Let Ts denote the observed T χ statistic for the group that
entered the model at step s. With orthogonal groups of equal size these are or-
dered T1 > T2 > · · · > TS . Further, the truncation regions are given exactly by
this ordering along with T1 < ∞, TS > TS+1 where TS+1 is the T χ statistic for
the next variable that would enter the model, understood to be 0 if there are no
remaining variables.

Proof. For simplicity we assume S = G so that T_{S+1} = 0. For each g let X_g =
U_g D_g V_g^T be a singular value decomposition. Hence U_g is a matrix of orthonormal
columns forming a basis for the column space of X_g. The first group to enter, g_1,
satisfies

g_1 = argmax_g ‖U_g^T y‖.    (33)

Note that r_1 = y − U_{g_1} U_{g_1}^T y is the residual after the first step, and that U_g −
U_{g_1} U_{g_1}^T U_g = U_g for all g ≠ g_1 by orthogonality. The second group g_2 to enter
satisfies

g_2 = argmax_{g ≠ g_1} ‖U_g^T r_1‖ = argmax_{g ≠ g_1} ‖U_g^T y‖    (34)

because U_g^T U_{g_1} = 0 implies U_g^T r_1 = U_g^T y. By induction it follows that the
T_i = ‖U_{g_i}^T y‖ are ordered.
Since we have assumed all groups have equal size, the truncation region for
T_s is determined by the intersection of the inequalities

(η_{s,i} t + z_{s,i})^T (U_{g_i} U_{g_i}^T − U_{g_j} U_{g_j}^T)(η_{s,i} t + z_{s,i}) ≥ 0   for all i, j with 1 ≤ i < j ≤ G.    (35)

By orthogonality, η_{s,i} = η_s ∝ U_{g_s} U_{g_s}^T y and z_{s,i} = z_s = y − U_{g_s} U_{g_s}^T y for all i.
Hence if s ∉ {i, j} the inequality is satisfied for all t. If s = i, after expanding,
the inequalities become t² − T_j² ≥ 0 for all j, and by the ordering this reduces
to t ≥ T_{s+1}. The case s = j is similar and yields one upper bound T_{s−1} ≥ t
(implicitly we have defined T_0 := ∞).
Proposition 6.1 allows us to analyze the power in the orthogonal groups case
by evaluating χ survival functions with upper and lower limits given simply by
the neighboring T χ statistics. To apply this we still need to deal with the fact
that the order of the noncentrality parameters may not correspond to the order
of groups entering the model, and it is also possible for null statistics to be
interspersed with the non-nulls.
Proposition 6.1 also yields some heuristic understanding. For example, we see
explicitly the non-independence of the T χ statistics, but also note that in this
case the dependency is only on neighboring statistics. Further, we see how the
power for a given test depends on the values of the neighboring statistics. An
example with groups of size 5 is plotted in Figure 1. This example considers a
fixed value of T2 and plots contours of the p-value for T2 as a function of the
neighboring statistics. It is apparent that when T3 → T2 (right side of plot), the
p-value for T2 will be large irrespective of T1 .
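Concretely, by Proposition 6.1 the selective p-value for T_2 in this setting is a truncated χ_k survival probability on [T_3, T_1]; the sketch below computes it (the particular values of T_1 and T_3 are arbitrary and only for illustration).

## Sketch: Tchi p-value for T2 when its truncation interval is [T3, T1]
## (orthogonal groups of equal size k, Proposition 6.1); all T's on the chi scale.
tchi_neighbors <- function(T2, T1, T3, k) {
  (pchisq(T1^2, df = k) - pchisq(T2^2, df = k)) /
    (pchisq(T1^2, df = k) - pchisq(T3^2, df = k))
}
## e.g. groups of size 5, T2 fixed so that T2^2 is the chi^2_5 median (as in Figure 1):
tchi_neighbors(T2 = sqrt(qchisq(0.5, df = 5)), T1 = 3.5, T3 = 1.5, k = 5)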
This motivates us to consider when a given nonnull Ti is likely to have a
close lower limit, since that scenario results in low power. Since a non-central
χ2k (λ) with non-centrality parameter λ has mean k + λ and variance 2k + 4λ, we
expect that the nonnull Ti will be larger and have greater spread than the Ti
corresponding to null variables. Hence, if the true signal is sparse then the risk
of having a nearby lower limit is mostly due to the bulk of null statistics. By
upper bounding the largest null statistic, we get an idea of both when forward
stepwise will select the nonnull variables first and when the worst case for power
can be excluded.

[Figure 1: contour plot of the Tχ p-value for T_2 over a grid of its neighboring
statistics, with T_3 on the horizontal axis and T_1 on the vertical axis.]

Fig 1: Contours of the Tχ p-value for T_2 as a function of its neighbors. Here T_2 ≈ 4.35
(corresponding to the 50% quantile of a χ²_5) is fixed and the upper and lower
truncation limits vary.

Using Lemma 1 of Laurent and Massart (2000), for ǫ > 0, if
we define
x = −log(1 − (1 − ǫ)^{1/G})    (36)

then if all T_i are null (central) we have

P( max_{1≤i≤G} T_i² > k + 2√(kx) + 2x ) < ǫ.    (37)

Table 2 gives values of the upper bound k + 2√(kx) + 2x as a function of G and
k, where there are G null groups of equal size k and error variance σ² = 1. For
example, with G = 50 groups of size k = 2 the null χ²_k statistics will all be below
27.28 with 99% probability and 21.35 with 90% probability.
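The bound in (37) is straightforward to tabulate; the short sketch below reproduces entries of Table 2.

## Sketch: the upper bound k + 2*sqrt(k*x) + 2*x of (37) for G null groups of size k.
null_bound <- function(G, k, eps) {
  x <- -log(1 - (1 - eps)^(1 / G))
  k + 2 * sqrt(k * x) + 2 * x
}
null_bound(G = 50, k = 2, eps = 0.01)  # ~27.28, cf. the 99% panel of Table 2
null_bound(G = 50, k = 2, eps = 0.10)  # ~21.35, cf. the 90% panel of Table 2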
The power curves plotted in Fig 2 show this theoretical bound is likely
more conservative than necessary. In fact, let us simplify a bit more by assuming
orthogonal groups of size 1 (single columns), and a 1-sparse alternative
with magnitude (1 + δ)√(2 log(p)) for δ > 0. Then the standard tail
bounds for Gaussian random variables imply that the Tχ test is asymptotically
equivalent to Bonferroni and hence asymptotically optimal for this alternative.
This is discussed in Loftus and Taylor (2014) and a little more generally in

[Figure 2: two panels of curves over the non-centrality parameter λ, one curve
per G ∈ {10, 20, 50, 100, 1000}.]

(a) Probability of a single nonnull exceeding the 90% theoretical bounds in Table 2.
(b) Smoothed estimates of power functions for the 1-sparse alternative and G null variables.

Fig 2: Theoretical and empirical curves related to power against a one-sparse
non-central χ²_5 with non-centrality parameter λ.

Table 2
High-probability upper bounds for central χ²_k statistics corresponding to noise variables.
The left panel gives 99% bounds and the right panel 90%. This assumes G noise groups of
equal size k, groupwise orthogonality, and error variance σ² = 1.

               99% bounds                         90% bounds
  G      k=2     k=5     k=10    k=50       k=2     k=5     k=10    k=50
  10     23.24   30.56   40.42   100.96     17.16   23.66   32.62    89.31
  20     24.99   32.52   42.62   104.17     18.98   25.74   34.99    92.90
  50     27.28   35.07   45.48   108.29     21.35   28.43   38.03    97.44
  100    28.99   36.98   47.60   111.32     23.12   30.42   40.27   100.74
  1000   34.61   43.19   54.47   120.99     28.88   36.85   47.46   111.11

Taylor, Loftus and Tibshirani (2015). The 1-sparse case with orthogonal groups
of fixed, equal size follows similarly using tail bounds for χ² random variables.
To translate between the non-centrality parameter λ and a linear model coefficient,
consider that when y = X_g β_g + ǫ and the expected column norms of X_g scale
like √n, then the non-centrality parameter for T_g will be roughly equal to
n‖β_g‖₂² (under groupwise orthogonality).
Finally, when groups are not orthogonal, the greedy nature of forward step-
wise will generally result in less power since part of the variation in y due to Xg
may be regressed out at previous steps.

7. Simulations

To evaluate the T χ test when groupwise orthogonality does not hold, we conduct
a simulation experiment with correlated Gaussian design matrices. In each of
1000 realizations forward stepwise is run with number of steps chosen by BIC.
Table 3 summarizes the resulting model sizes and how many of the 5 nonnull
variables were included.
Table 3
Size and composition of models chosen by BIC in the simulation described in Fig 3.

Model size                 4    5    6    7    8    9
# Occurrences              4   42  511  328   95   20

# True signals included    3    4    5
# Occurrences             18  106  876
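For concreteness, one replication of this experiment could be set up as in the sketch below. The correlation structure of the Gaussian design and the placement and signs of the nonzero coefficients are not fully specified in the text, so the equicorrelated design and the sign pattern used here are assumptions made purely for illustration.

## Sketch of one replication (dimensions from the Fig 3 caption; the 0.3
## equicorrelation and the coefficient placement/signs are assumed).
set.seed(2015)
n <- 100; p <- 100; G <- 50
groups <- rep(1:G, each = 2)
Sigma <- matrix(0.3, p, p); diag(Sigma) <- 1        # equicorrelated columns (assumed)
X <- matrix(rnorm(n * p), n, p) %*% chol(Sigma)
beta <- numeric(p)
beta[seq(1, 9, by = 2)] <- 2 * sqrt(log(G) / n) * c(1, -1, 1, -1, 1)  # 5 nonzero entries, ~ +/- 0.4
y <- as.vector(X %*% beta + rnorm(n))
## grouped forward stepwise with a BIC-type penalty, e.g. using the group_fs sketch from Section 2:
## active <- group_fs(X, y, groups, sigma = 1, steps = 10, k = log(n))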

Fig 3 shows the empirical distributions of both null and nonnull T χ p-values
plotted by the step the corresponding variable was added. Table 4 shows the
power of rejecting at the α = 0.05 level also by step. Note that it was rare for
nonnull variables to enter at later steps, so the observed power of 0 at step 8 is
not too surprising.
Table 4
Empirical power for nonnull variables in the simulation described in Fig 3.
Step     1      2      3      4      5      6      7      8
Power   0.532  0.470  0.454  0.521  0.641  0.315  0.500  0.000

[Figure 3: nine panels (one per step) showing empirical CDFs of the Tχ p-values,
with separate curves for null and nonnull variables.]

Fig 3: Empirical CDFs of Tχ p-values plotted by step. Models chosen by BIC,
with model size distribution described in Table 3. The design matrix had n = 100
observations and p = 100 variables in G = 50 groups of size 2. The true sparsity
is 5, and nonzero entries of β equal ±2√(log(G)/n) ≈ ±0.4. Dotted lines correspond
to null variables and the solid, off-diagonal lines to signal variables.

References

Akaike, H. (1973). Information theory and an extension of the maximum likelihood
principle. In Second International Symposium on Information Theory 267–281.
Chouldechova, A. and Hastie, T. (2015). Generalized Additive Model Se-
lection. ArXiv e-prints.
Fithian, W., Sun, D. and Taylor, J. (2014). Optimal inference after model
selection. arXiv preprint arXiv:1410.2597.
Foster, D. P. and George, E. I. (1994). The risk inflation criterion for
multiple regression. The Annals of Statistics 1947–1975.
Gross, S. M., Taylor, J. and Tibshirani, R. (2015). A Selective Approach
to Internal Inference. ArXiv e-prints.
Laurent, B. and Massart, P. (2000). Adaptive estimation of a quadratic
functional by model selection. Annals of Statistics 1302–1338.
Lee, J. D., Sun, D. L., Sun, Y. and Taylor, J. E. (2015). Exact post-
selection inference with the lasso. Ann. Statist. To appear.
Lim, M. and Hastie, T. (2013). Learning interactions through hierarchical
group-lasso regularization. arXiv preprint arXiv:1308.2719.

Lockhart, R., Taylor, J., Tibshirani, R. J. and Tibshirani, R. (2014).


A significance test for the lasso. Ann. Statist. 42 413–468.
Loftus, J. R. and Taylor, J. E. (2014). A significance test for forward step-
wise model selection. arXiv preprint arXiv:1405.3920.
University of Wisconsin Population Health Institute (2015). California
county health rankings. http://www.countyhealthrankings.org. Accessed: 2015-10-28.
Schwarz, G. (1978). Estimating the dimension of a model. The Annals of
Statistics 6 461–464.
Taylor, J. E., Loftus, J. R. and Tibshirani, R. J. (2015). Tests in adaptive
regression via the Kac-Rice formula. Ann. Statist. To appear.
Tibshirani, R. J., Taylor, J., Lockhart, R. and Tibshirani, R. (2014).
Exact Post-Selection Inference for Sequential Regression Procedures. ArXiv
e-prints.
Tibshirani, R., Tibshirani, R., Taylor, J., Loftus, J. and Reid, S. (2015).
selectiveInference: Tools for Selective Inference. R package version 1.1.1.
