1. Introduction
Information on health and various other indicators has been aggregated to the
county level by the University of Wisconsin Population Health Institute (2015). We
fit a linear model for California counties with the outcome given by log-years of
potential life lost, and search among 31 predictors using forward stepwise. This
example has groups only of size 1, but it still serves to illustrate the procedure.
With the BIC criterion for choosing model size the resulting model has the 8
variables listed in Table 1. The naive p-values computed from standard hypothesis
tests for linear regression models are biased toward small values because we chose
the 8 best predictors out of 31.
Table 1
Significance test p-values for predictors chosen by forward stepwise with BIC.
Step Variable Naive Selective
1 %80th Percentile Income <0.001 0.001
2 Injury Death Rate <0.001 0.086
3 Chlamydia Rate 0.078 0.287
4 %Obese <0.001 0.170
5 %Receiving HbA1c <0.001 0.335
6 %Some College 0.005 0.864
7 Teen Birth Rate 0.071 0.940
8 Violent Crime Rate 0.067 0.179
The selective p-values reported here are the Tχ statistics described in Section 3.
These have been adjusted to account for model selection. Using the data to select
a model and then conducting inference about that model means we are randomly
choosing which hypotheses to test. In this scenario, Fithian, Sun and Taylor (2014)
propose controlling the selective type 1 error, defined by requiring
  P( reject H0(M) | M̂(y) = M ) ≤ α under H0(M).   (1)
The notation H0(M) emphasizes the fact that the hypothesis depends upon the
model M. Scientific studies reporting p-values which control (1) would not suffer
from model selection bias, and consequently would have greater replicability.
One of the most common methods which controls selective type 1 error is data
splitting, where independent subsets of the data are used separately for model
selection and for inference. Unfortunately, by using less data to select the model
and less data to conduct tests, this method suffers from loss of model selection
accuracy and lower power. The Tχ and TF hypothesis tests described in the
present work control (1) while using the whole data for both selection and
inference.
Many model selection procedures, including the lasso, forward stepwise, least angle
regression, marginal screening, and others, can be characterized by affine in-
equalities. In other words, for each m ∈ M there exists a matrix A_m and vector
b_m such that M̂(y) = m is equivalent to A_m y ≤ b_m. Hence, assuming
  y = µ + ε,  ε ∼ N(0, σ²I),   (2)
the law of y conditional on the selection event {A_m y ≤ b_m} is a Gaussian
distribution truncated to a convex polytope, and selective inference can proceed
by conditioning on this event.
Our approach will exploit the structure of model selection procedures charac-
terized by quadratic inequalities in the following sense.
Definition 1.1. A quadratic model selection procedure is a map M : X → M
determining a model m = M(y), such that for any m ∈ M
  E_m := {y : M(y) = m} = ⋂_{j ∈ J_m} E_{m,j},   (3)
  E_{m,j} := {y : y^T Q_{m,j} y + a_{m,j}^T y + b_{m,j} ≥ 0}.
The power of this definition lies in the ease with which we can compute
one-dimensional slices through E_m. This enables us to compute closed-form
results for certain one-dimensional statistics and to potentially implement hit-
and-run sampling schemes. Consider tests based on T² := ‖P y‖₂² for some fixed
projection matrix P. Write u = P y / T, z = y − P y, and substitute uT + z for y in
the definition (3). Conditioning on z and u, the only remaining variation is in T,
and the equations above are univariate quadratics. This allows us to determine
the truncation region for T induced by the model selection M (y) = m. We now
apply this approach and work out the details for several examples. The example
in Section 4 is a slight extension beyond the quadratic framework that illustrates
how this approach may be further generalized.
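To make this concrete, the following minimal R sketch (the function name and interface are ours, not from any package) restricts a single quadratic selection inequality y^T Q y + a^T y + b ≥ 0 to the line {t u + z : t ∈ R}.

# Substitute y = t*u + z into y^T Q y + a^T y + b >= 0 and return the
# coefficients of the resulting univariate quadratic in t.
# Assumes Q is symmetric; otherwise replace Q by (Q + t(Q)) / 2.
slice_quadratic <- function(Q, a, b, u, z) {
  A <- drop(t(u) %*% Q %*% u)
  B <- drop(2 * t(u) %*% Q %*% z + t(a) %*% u)
  C <- drop(t(z) %*% Q %*% z + t(a) %*% z + b)
  c(A = A, B = B, C = C)  # the constraint becomes A*t^2 + B*t + C >= 0
}

Each set E_{m,j} in (3) contributes one such univariate quadratic, and intersecting their solution sets in t gives the one-dimensional truncation region.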
This section establishes notation and describes the particular variant of grouped
forward stepwise referred to throughout this paper. Readers familiar with the
statistical programming language R may know of this algorithm as the one im-
plemented in the step function. Because we are interested in groups of variables,
the stepwise algorithm is not stated in terms of univariate correlations the way
forward stepwise is often presented.
We observe an outcome y ∼ N(µ, σ²I) and wish to model µ using a sparse
linear model µ ≈ Xβ for a matrix of predictors X and sparse coefficient vector
β. We assume the matrix X is subdivided a priori into groups
  X = [X₁ X₂ ⋯ X_G]   (4)
with X_g denoting the submatrix of X containing all columns of group g, for
1 ≤ g ≤ G. Forward stepwise, described in Algorithm 1, picks at each step the
group minimizing a penalized residual sum of squares criterion. The penalty
is an AIC-type criterion penalizing group size, since otherwise the algorithm would
be biased toward selecting larger groups. First we describe the case where σ² is
known. With a parameter k ≥ 0, the penalized RSS criterion for a candidate group g is
  ‖y − P_g^1 y‖₂² + kσ² Tr(P_g^1),   (5)
where P_g^1 := X_g X_g^† and Tr(P_g^1) is the degrees of freedom used by X_g. Minimizing
this with respect to g is equivalent to maximizing
  RSS₁(g) := y^T P_g^1 y − kσ² Tr(P_g^1).   (6)
Note that the event E₁ := {z : arg max_g RSS₁(g) = g₁} can be decomposed as
the intersection
  E₁ = ⋂_{g ≠ g₁} {z : z^T Q_g^1 z + k_g^1 ≥ 0}   (7)
where Q_g^1 := P_{g₁}^1 − P_g^1 and k_g^1 := kσ² Tr(P_g^1 − P_{g₁}^1). We ignore ties and use >
and ≥ interchangeably; with probability 1, ties do not occur under fairly general
conditions on the design matrix.
After adding a group of variables to the active set, we orthogonalize the
outcome and the remaining groups with respect to the added group. So, at step
s > 1, we have A_s = {g₁, g₂, . . . , g_{s−1}} and define
  P_s := (I − P_{g_{s−1}}^{s−1})(I − P_{g_{s−2}}^{s−2}) ⋯ (I − P_{g₁}^1),
  r_s := P_s y,  X_g^s := P_s X_g,  P_g^s := X_g^s (X_g^s)^†.   (8)
A ← ∅, r₀ ← y
for s = 1, . . . , S do
    g* ← arg max_{g ∈ A^c} RSS_s(g)
    A ← A ∪ {g*}
    for h ∈ A^c do
        X_h ← (I − P_{g*}) X_h
    r_s ← (I − P_{g*}) r_{s−1}
return A
Algorithm 1: Forward stepwise with groups, σ² known
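For illustration, here is a minimal R sketch of the selection path in Algorithm 1 using the criterion (6). It is a simplified stand-in, not the implementation in the selectiveInference package; the function name, the group-membership vector index, and the use of MASS::ginv for the pseudoinverse are our own choices.

# Minimal sketch of the selection path in Algorithm 1 using criterion (6).
# 'index' maps each column of X to a group label; requires MASS for ginv().
group_forward_stepwise <- function(X, y, index, steps, sigma = 1, k = 2) {
  groups <- unique(index)
  active <- c()
  r <- y
  Xwork <- X
  for (s in seq_len(steps)) {
    candidates <- setdiff(groups, active)
    crit <- sapply(candidates, function(g) {
      Xg <- Xwork[, index == g, drop = FALSE]
      Pg <- Xg %*% MASS::ginv(Xg)              # projection onto span of group g
      drop(t(r) %*% Pg %*% r) - k * sigma^2 * sum(diag(Pg))
    })
    gstar <- candidates[which.max(crit)]
    active <- c(active, gstar)
    Xg <- Xwork[, index == gstar, drop = FALSE]
    Pg <- Xg %*% MASS::ginv(Xg)
    r <- r - Pg %*% r                          # orthogonalize the residual, cf. (8)
    for (h in setdiff(groups, active)) {       # and the remaining groups
      Xwork[, index == h] <- Xwork[, index == h] - Pg %*% Xwork[, index == h]
    }
  }
  active
}

A call such as group_forward_stepwise(X, y, index, steps = 5, sigma = 1, k = log(nrow(X))) would run five BIC-penalized steps on a hypothetical design X with group labels index.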
Now, writing RSS_s(g) := r_s^T P_g^s r_s − kσ² Tr(P_g^s) for the criterion at step s, for
  E_s := {z ∈ E_{s−1} : arg max_g RSS_s(g) = g_s}   (10)
we have
  E_s = E_{s−1} ∩ ⋂_{g ∈ A_s^c, g ≠ g_s} {z : z^T P_s Q_g^s P_s z + k_g^s ≥ 0}   (11)
where Q_g^s := P_{g_s}^s − P_g^s and k_g^s := kσ² Tr(P_g^s − P_{g_s}^s). We have established the
following lemma.
Lemma 2.1. The selection event E that forward stepwise selects the ordered
active set As = {g1 , g2 , . . . , gs } can be written as an intersection of quadratic
inequalities, all of the form y^T Q y + b ≥ 0, which satisfies Definition 1.1.
We leverage this fact to compute truncation intervals for selective significance
tests based on one-dimensional slices through E. When σ is known, the truncated
χ test statistic Tχ described in Section 3 is used to test the significance of each
group in the active set. For unknown σ, the corresponding truncated F statistic
TF is detailed in Section 4. The unknown-σ² case alters the above algorithm by
changing the RSS criterion so that the degrees-of-freedom penalty enters
multiplicatively rather than additively. Finally, we note that k = 2 corresponds
to the classic AIC criterion of Akaike (1973), k = log(n) yields the BIC of
Schwarz (1978), and k = 2 log(p) gives the RIC criterion of Foster and George
(1994). The extension of Algorithm 1 to allow stopping using the AIC criterion
is given later in Section 5.
For a group g in the active set A, the null hypothesis of interest is
  H0 : β_{A,g} = 0,   (14)
where β_{A,g} denotes the coordinates of the coefficients for group g in the linear
model determined by X_A. Note that this null hypothesis depends on A, as
the coefficients for the same group g in a different active set A′ ≠ A with g ∈ A′
have a different meaning, namely they are regression coefficients controlling for
the variables in A′ rather than in A. An equivalent formulation of (14) is
  H0 : P̃_g µ = 0,   (15)
where P̃_g is the projection onto the column space of (I − P_{A/g}) X_g. We use these
forms to emphasize that the probability model for y is determined by µ and not
by β*.
We now describe our significance test for a group of variables g in the active set
A, under the probability model y ∼ N(µ, σ²I) where we assume σ is known or
can be estimated independently from y. Section 4 concerns the case where σ is
unknown.
For now the number of steps S is treated as fixed; adaptive stopping rules are discussed
in Section 5. Let A denote the ordered active set and E the event that A =
{g₁, g₂, . . . , g_S}. With X̃_g = (I − P_{A/g}) X_g, let X̃_g = U_g D_g V_g^T be the singular
value decomposition of X̃_g, and note that P̃_g = U_g U_g^T. In this notation, without
model selection, the test statistic and null distribution
  T := ‖P̃_g y‖₂ / σ,  T² ∼ χ²_r,  r := Tr(P̃_g),   (16)
form the usual regression model hypothesis test for the group g of variables in
the model X_A when σ² is known.
Let u ∝ P̃_g y be a unit vector, so that y = z + σT u where z = (I − P̃_g) y. Under the
null, (T, z, u) are independent because P̃_g y and z are orthogonal and (T, u) are
independent by Basu's theorem. We condition on z and u without changing the
test under the null, but it should be noted that this may result in some loss of
power under certain alternatives. Now the only remaining variation is in T, and
we still have T² ∼ χ²_{Tr(P̃_g)} if there is no model selection. Finally, to obtain the
selective null distribution we condition on the selection event {M(y) = A}, which
truncates T to a region M_A ⊂ [0, ∞) computed explicitly in (20) below:
  T | {z, u, M(y) = A} ∼ χ_r |_{M_A},   (17)
where the vertical bar on the right-hand side denotes truncation. Hence, with f_r
the pdf of a central χ_r random variable,
  Tχ := ( ∫_{M_A ∩ [T,∞)} f_r(t) dt ) / ( ∫_{M_A} f_r(t) dt ) ∼ U[0, 1]   (18)
is a p-value.
Proof. First consider P_A = P as fixed independently of y, and write T_P, z_P, u_P,
and M_A(P) to emphasize that these are determined by P. Without selection, we
know (T_P, z_P, u_P) are independent and T_P | (z_P, u_P) ∼ χ_r. Since the selection
event is determined entirely by M(u_P tσ + z_P) = A, conditioning further on
{M(y) = A} has the effect of truncating T_P to M_A(P), so
  T_P | {z_P, u_P, M(y) = A} ∼ χ_r |_{M_A(P)}.   (19)
Now we let P = P_A, and note that T_{P_A}, z_{P_A}, u_{P_A} depend on P only through
A, hence the same is true of M_A(P). So, since P_A is fixed on {M(y) = A}, the
conclusion (19) still holds with this random choice of P.
This establishes (17), and (18) follows by application of the truncated survival
function transform, or, equivalently, truncated CDF transform.
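Numerically, evaluating (18) only requires χ² distribution function differences over the intervals making up M_A, since P(χ_r ∈ [a, b]) = P(χ²_r ≤ b²) − P(χ²_r ≤ a²) for 0 ≤ a ≤ b. A minimal R sketch, assuming M_A is supplied as a two-column matrix of interval endpoints (the function names are ours):

# Probability that a central chi_r variable falls in [lo, hi], for 0 <= lo <= hi.
chi_interval_prob <- function(lo, hi, r) pchisq(hi^2, df = r) - pchisq(lo^2, df = r)

# T-chi p-value (18): MA has one truncation interval per row, Tobs is the
# observed statistic, and r = Tr(P-tilde_g) is the degrees of freedom.
tchi_pvalue <- function(MA, Tobs, r) {
  num <- sum(apply(MA, 1, function(iv)
    chi_interval_prob(max(iv[1], Tobs), max(iv[2], Tobs), r)))
  den <- sum(apply(MA, 1, function(iv) chi_interval_prob(iv[1], iv[2], r)))
  num / den
}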
Since the right hand side of (18) does not functionally depend on any of the
conditions involving A, the p-value result holds unconditionally. In that case,
however, the left hand side cannot be interpreted as a test statistic for a single
fixed hypothesis, since the hypothesis being tested is itself chosen by the data.
It remains to describe the computation of the truncation region. By Lemma 2.1,
  M_A = {t ≥ 0 : M(utσ + z) = A}
      = ⋂_{s=1}^{S} ⋂_{g ∈ A_s^c, g ≠ g_s} {t ≥ 0 : (utσ + z)^T Q_g^s (utσ + z) + k_g^s ≥ 0}   (20)
      = ⋂_{s=1}^{S} ⋂_{g ∈ A_s^c, g ≠ g_s} {t ≥ 0 : a_g^s t² + b_g^s t + c_g^s ≥ 0}
with
  a_g^s := σ² u^T Q_g^s u,  b_g^s := 2σ u^T Q_g^s z,  c_g^s := z^T Q_g^s z + k_g^s.   (21)
Each set in the above intersection can be computed from the roots of the cor-
responding quadratic in t, yielding either a single interval or a union of two
intervals. Computing M_A as the intersection of these unions of intervals can be
done in O(GS) time. Empirically we observe that the support M_A is almost always a
single interval, a fact we hope to exploit in further work on sampling.
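A sketch of this interval computation in R, assuming each inequality has a nonzero leading coefficient and ignoring ties (the function names are ours):

# Set {t >= 0 : A*t^2 + B*t + C >= 0} for one inequality, as a matrix of
# disjoint intervals (one row per interval); assumes A != 0 and ignores ties.
solve_one <- function(A, B, C, tmax = Inf) {
  disc <- B^2 - 4 * A * C
  if (disc <= 0) {                            # no sign change on t >= 0
    if (A > 0) return(rbind(c(0, tmax))) else return(NULL)
  }
  rts <- sort((-B + c(-1, 1) * sqrt(disc)) / (2 * A))
  if (A > 0) {                                # satisfied outside the roots
    out <- rbind(c(0, max(0, rts[1])), c(max(0, rts[2]), tmax))
  } else {                                    # satisfied between the roots
    out <- rbind(c(max(0, rts[1]), max(0, rts[2])))
  }
  out[out[, 1] < out[, 2], , drop = FALSE]
}

# Intersect two unions of intervals, each given as a matrix with rows [lo, hi].
intersect_intervals <- function(I1, I2) {
  if (is.null(I1) || is.null(I2)) return(NULL)
  out <- NULL
  for (i in seq_len(nrow(I1))) for (j in seq_len(nrow(I2))) {
    lo <- max(I1[i, 1], I2[j, 1]); hi <- min(I1[i, 2], I2[j, 2])
    if (lo < hi) out <- rbind(out, c(lo, hi))
  }
  out
}

Applying solve_one to the coefficients (21) of every inequality and folding the results through intersect_intervals yields M_A.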
Since each matrix P_g^s is low rank, in practice we store only the left singular
vectors, which are sufficient for computing the above coefficients. This improves
both storage, since we do not store the full n × n projections, and computation,
since the coefficients (21) require only a few inner products rather than full
matrix-vector products.
We next turn to the case when σ² is unknown. We will see that this example
does not satisfy Definition 1.1, but the spirit of the approach remains the same.
A selective F test was first explored by Gross, Taylor and Tibshirani (2015) for
affine model selection procedures. Following these authors, we condition on the
selection event together with the components (z, v_∆, v_2) of a decomposition of y,
leaving variation only in the statistic TF, whose null distribution is an F_{d₁,d₂}
distribution truncated to a region M_A induced by {M(y) = A}. Hence, with f_{d₁,d₂}
the pdf of an F_{d₁,d₂} random variable,
  TF := ( ∫_{M_A ∩ [T,∞)} f_{d₁,d₂}(t) dt ) / ( ∫_{M_A} f_{d₁,d₂}(t) dt ) ∼ U[0, 1]   (27)
is a p-value conditional on (v_∆, v_2, A). Since the right hand side does not depend
on these conditions, (27) also holds marginally.
Proof. Using the fact that (R1 − R2 , R2 ) are independent, the rest of the proof
follows the same argument as Theorem 3.1.
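The numerical evaluation of (27) mirrors the Tχ case, with the F distribution function pf in place of the χ² one; a minimal sketch with our own function name:

# TF p-value (27): same pattern as the chi case, with the F distribution
# function pf(); MA has one truncation interval per row.
tf_pvalue <- function(MA, Tobs, d1, d2) {
  p_int <- function(lo, hi) pf(hi, d1, d2) - pf(lo, d1, d2)
  num <- sum(apply(MA, 1, function(iv) p_int(max(iv[1], Tobs), max(iv[2], Tobs))))
  den <- sum(apply(MA, 1, function(iv) p_int(iv[1], iv[2])))
  num / den
}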
Conditioning on (z, v_∆, v_2) reduces power, but it is unclear how to compute
the truncation region without this strategy. As we see next, this is already
non-trivial as a one-dimensional problem for quadratic selection regions.
Substituting y = z + g₁(t) v_∆ + g₂(t) v_2 into a quadratic inequality
y^T Q y + a^T y + b ≥ 0 defining part of the selection event gives a function of t,
  I_{Q,a,b}(t) := x₁₁ g₁(t)² + x₁₂ g₁(t) g₂(t) + x₂₂ g₂(t)² + x₁ g₁(t) + x₂ g₂(t) + x₀,   (28)
where, for constants r and c > 0 arising from the decomposition,
  g₁(t) = r √( ct / (1 + ct) ),   g₂(t) = r / √(1 + ct),
and
  x₁₁ := v_∆^T Q v_∆,   x₁₂ := 2 v_∆^T Q v_2,   x₂₂ := v_2^T Q v_2,
  x₁ := 2 v_∆^T Q z + a^T v_∆,   x₂ := 2 v_2^T Q z + a^T v_2,   x₀ := z^T Q z + a^T z + b.   (29)
By continuity the positive level sets {t ≥ 0 : I_{Q,a,b}(t) ≥ 0} are unions of intervals,
and the selection event is an intersection of sets of this form. We can use the
same strategy as before, solving for each one and then intersecting the unions of
intervals. However, the function is no longer a quadratic: it is not clear how many
roots it may have, and its derivative may approach ∞ near 0, so finding the
roots is not trivial. Our approach begins by reparametrizing with trigonometric
functions. There is an associated complex quartic polynomial and we solve for
its roots first using numerical algorithms specialized for polynomials. A subset
of these are potential roots of the original function. This allows us to isolate
the roots of the original function and solve for them numerically in bounded
intervals containing only one root. The details are as follows.
Since c > 0 and t ≥ 0, we can make the substitution ct = tan²(θ) with
0 ≤ θ < π/2. Then
  g₁(θ)² = r² sin²(θ) = (r²/2)[1 − cos(2θ)] = (r²/4)(2 − e^{i2θ} − e^{−i2θ}),
  g₂(θ)² = r² cos²(θ) = (r²/2)[1 + cos(2θ)] = (r²/4)(2 + e^{i2θ} + e^{−i2θ}),
  g₁(θ)g₂(θ) = r² ((e^{iθ} − e^{−iθ})/(2i)) ((e^{iθ} + e^{−iθ})/2) = −(ir²/4)(e^{i2θ} − e^{−i2θ}).
This motivates us to consider the complex function
  p(z) := −(r²/4)(z² + z⁻² − 2) x₁₁ − (ir²/4)(z² − z⁻²) x₁₂ + (r²/4)(z² + z⁻² + 2) x₂₂
          − (ir/2)(z − z⁻¹) x₁ + (r/2)(z + z⁻¹) x₂ + x₀
        = (r/2) [ (r/2)(−x₁₁ − ix₁₂ + x₂₂) z² + (r/2)(−x₁₁ + ix₁₂ + x₂₂) z⁻²
          + (−ix₁ + x₂) z + (ix₁ + x₂) z⁻¹ + (r x₁₁ + r x₂₂ + 2x₀/r) ],
which agrees with (28) when z = e^{iθ}. Hence, the zeroes of I(t) coincide with
the zeroes of the polynomial p̃(z) := z² p(z) with z = e^{iθ} and 0 ≤ θ ≤ π/2. This
is the polynomial we solve numerically, as described above. Finally, transforming
the polynomial roots back to the original domain may not be numerically
stable, which is why we use a numerical method to solve for them again. In the
selectiveInference software we use the R functions polyroot for the poly-
nomial and uniroot in the original domain. This customized approach is highly
robust and numerically stable, enabling accurate computation of the selection
region.
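The following R sketch illustrates this pipeline for one inequality using the coefficients in (29). The constants r and cc, the tolerances, and the function name are our own assumptions; the internal implementation in selectiveInference may differ in detail.

# Sketch of the root isolation for one inequality: build the quartic z^2 * p(z)
# from the coefficients in (29), take its roots with polyroot(), keep those on
# the unit circle with argument in (0, pi/2), map back to t = tan(theta)^2 / c,
# and refine with uniroot(). The common factor r/2 is dropped (it does not
# change the roots); r, cc and the x's are assumed to be given.
find_I_roots <- function(r, cc, x11, x12, x22, x1, x2, x0, tol = 1e-8) {
  I_fun <- function(t) {
    g1 <- r * sqrt(cc * t / (1 + cc * t)); g2 <- r / sqrt(1 + cc * t)
    x11 * g1^2 + x12 * g1 * g2 + x22 * g2^2 + x1 * g1 + x2 * g2 + x0
  }
  coefs <- c((r / 2) * complex(real = -x11 + x22, imaginary =  x12),  # z^0
             complex(real = x2, imaginary =  x1),                     # z^1
             r * x11 + r * x22 + 2 * x0 / r,                          # z^2
             complex(real = x2, imaginary = -x1),                     # z^3
             (r / 2) * complex(real = -x11 + x22, imaginary = -x12))  # z^4
  zs <- polyroot(coefs)
  th <- Arg(zs)
  keep <- abs(Mod(zs) - 1) < 1e-6 & th > tol & th < pi / 2 - tol
  cand <- sort(tan(th[keep])^2 / cc)
  unlist(lapply(cand, function(t0) {          # refine each candidate locally
    lo <- t0 * (1 - 1e-3); hi <- t0 * (1 + 1e-3)
    if (sign(I_fun(lo)) != sign(I_fun(hi))) uniroot(I_fun, c(lo, hi))$root else t0
  }))
}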
Instead of running forward stepwise for S steps with S chosen a priori, we would
like to choose when to stop adaptively. An AIC-style criterion is one classical
way of accomplishing this, choosing the s that minimizes
  −2 log(L_s) + k · edf_s   (31)
where L_s is the likelihood and edf_s denotes the effective degrees of freedom of
the model at step s. As noted in Section 2, with k = 2 this is the usual AIC,
with k = log(n) it is BIC, and with k = 2 log(p) it is RIC. Note that edf_s
accounts for whether the error variance σ has been estimated. For example, for a
linear model with unknown σ, minimizing the above is equivalent to minimizing
  log(‖y − X_s β_s‖₂²) + k (1 + ‖β_s‖₀)/n   (32)
where ‖·‖₀ counts the number of nonzero entries. Instead of minimizing this
over 1 ≤ s ≤ S, we adopt an early-stopping rule which picks the s after which
the AIC criterion increases s₊ times in a row, for some s₊ ≥ 1. Readers familiar
with the R programming language might recognize this as the default behavior
of step when s₊ = 1.
Fortunately, since the stopping rule depends only on successive values of the
quadratic objective, which we are already tracking, we can condition on the
event {ŝ = s} by appending additional quadratic inequalities encoding the
successive comparisons. The results concerning the computation of Tχ or TF remain
intact, with the only complication being the addition of a few more inequalities
in the computation of M_A.
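A small R sketch of this stopping rule (the function name is ours; s_plus plays the role of s₊), applied to the vector of criterion values over the steps actually run:

# Early-stopping rule: return the step after which the criterion increases
# s_plus times in a row; 'aic' is the vector of criterion values by step.
stop_step <- function(aic, s_plus = 1) {
  increases <- diff(aic) > 0
  run <- 0
  for (i in seq_along(increases)) {
    run <- if (increases[i]) run + 1 else 0
    if (run == s_plus) return(i - s_plus + 1)
  }
  length(aic)                                 # never stopped early
}

With s_plus = 1 this stops at the first increase of the criterion, matching the behavior described above.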
In this section we describe the behavior of the Tχ statistic under some simplifying
assumptions and evaluate power against some alternatives.
Definition 6.1. By groupwise orthogonality or orthogonal groups we mean that
X_g^T X_h = 0_{p_g × p_h} for all g ≠ h, and that the columns of X satisfy ‖X_j‖₂ = 1 for all j.
Proposition 6.1. Let T_s denote the observed Tχ statistic for the group that
entered the model at step s. With orthogonal groups of equal size these are or-
dered T₁ > T₂ > · · · > T_S. Further, the truncation regions are given exactly by
this ordering along with T₁ < ∞ and T_S > T_{S+1}, where T_{S+1} is the Tχ statistic for
the next variable that would enter the model, understood to be 0 if there are no
remaining variables.
Proof. Note that r₁ = y − U_{g₁} U_{g₁}^T y is the residual after the first step, and that
U_g − U_{g₁} U_{g₁}^T U_g = U_g for all g ≠ g₁ by orthogonality. The second group g₂ to enter
satisfies
  g₂ = arg max_{g ≠ g₁} ‖U_g^T r₁‖ = arg max_{g ≠ g₁} ‖U_g^T y‖.   (34)
Iterating this argument, the truncation region for the statistic entering at step s
is determined by the inequalities
  (η_{s,i} t + z_{s,i})^T (U_{g_i} U_{g_i}^T − U_{g_j} U_{g_j}^T)(η_{s,i} t + z_{s,i}) ≥ 0  for all i, j such that i ≤ j ≤ G.   (35)
By orthogonality, η_{s,i} = η_s ∝ U_{g_s} U_{g_s}^T y and z_{s,i} = z_s = y − U_{g_s} U_{g_s}^T y for all i.
Hence if s ∉ {i, j} the inequality is satisfied for all t. If s = i, after expanding,
the inequalities become t² − T_j² ≥ 0 for all j, and by the ordering this reduces
to t ≥ T_{g_{s+1}}. The case s = j is similar and yields the single upper bound T_{g_{s−1}} ≥ t
(implicitly we have defined T_{g_0} = ∞).
Proposition 6.1 allows us to analyze the power in the orthogonal-groups case
by evaluating χ survival functions with upper and lower limits given simply by
the neighboring Tχ statistics. To apply this we still need to deal with the fact
that the order of the noncentrality parameters may not correspond to the order
in which groups enter the model, and it is also possible for null statistics to be
interspersed with the non-nulls.
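In the orthogonal case, the selective p-value for the statistic entering at step s therefore depends only on its two neighbors; a minimal R sketch (the function name is ours):

# Selective p-value for the step-s statistic under orthogonal groups of size k:
# the truncation region is [T_next, T_prev] with T_prev = Inf at the first step
# and T_next = 0 when no variables remain (Proposition 6.1).
tchi_pvalue_orth <- function(T_s, T_prev, T_next, k) {
  (pchisq(T_prev^2, df = k) - pchisq(T_s^2, df = k)) /
    (pchisq(T_prev^2, df = k) - pchisq(T_next^2, df = k))
}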
Proposition 6.1 also yields some heuristic understanding. For example, we see
explicitly the non-independence of the Tχ statistics, but also note that in this
case the dependency is only on neighboring statistics. Further, we see how the
power for a given test depends on the values of the neighboring statistics. An
example with groups of size 5 is plotted in Figure 1. This example considers a
fixed value of T2 and plots contours of the p-value for T2 as a function of the
neighboring statistics. It is apparent that when T3 → T2 (right side of plot), the
p-value for T2 will be large irrespective of T1 .
This motivates us to consider when a given nonnull Ti is likely to have a
close lower limit, since that scenario results in low power. Since a non-central
χ²_k(λ) with non-centrality parameter λ has mean k + λ and variance 2k + 4λ, we
expect that the nonnull Ti will be larger and have greater spread than the Ti
corresponding to null variables. Hence, if the true signal is sparse then the risk
of having a nearby lower limit is mostly due to the bulk of null statistics. By
upper bounding the largest null statistic, we get an idea of both when forward
stepwise will select the nonnull variables first and when the worst case for power
can be excluded.

Fig 1. Contours of the p-value for T2 as a function of the neighboring statistics T1 (vertical axis) and T3 (horizontal axis), for groups of size 5.

Using Lemma 1 of Laurent and Massart (2000), for ε > 0, if we define
  x = −log(1 − (1 − ε)^{1/G})   (36)
then if all T_i are null (central) we have
  P( max_{1≤i≤G} T_i² > k + 2√(kx) + 2x ) < ε.   (37)
Table 2 gives values of the upper bound k + 2√(kx) + 2x as a function of G and
k, where there are G null groups of equal size k and error variance σ² = 1. For
example, with G = 50 groups of size k = 2 the null χ²_k statistics will all be below
27.28 with 99% probability and below 21.35 with 90% probability.
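The entries of Table 2 can be reproduced directly from (36) and (37); for example, in R:

# Reproduce the Table 2 bounds: with probability at least 1 - eps all G null
# chi^2_k statistics fall below k + 2*sqrt(k*x) + 2*x, with x from (36).
null_chisq_bound <- function(G, k, eps) {
  x <- -log(1 - (1 - eps)^(1 / G))
  k + 2 * sqrt(k * x) + 2 * x
}
null_chisq_bound(G = 50, k = 2, eps = 0.01)  # approximately 27.28 (99% bound)
null_chisq_bound(G = 50, k = 2, eps = 0.10)  # approximately 21.35 (90% bound)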
The power curves plotted in Fig 2 show this theoretical bound is likely
more conservative than necessary. In fact, let us simplify a bit more by as-
suming orthogonal groups of size 1 (single columns), and a 1-sparse alter-
native with magnitude (1 + δ)√(2 log p) for δ > 0. Then the standard tail
bounds for Gaussian random variables imply that the Tχ test is asymptotically
equivalent to Bonferroni and hence asymptotically optimal for this alternative.
This is discussed in Loftus and Taylor (2014) and a little more generally in
Taylor, Loftus and Tibshirani (2015).
Fig 2. (a) Probability of a single nonnull exceeding the 90% theoretical bounds in Table 2, as a function of the noncentrality λ, for G ∈ {10, 20, 50, 100, 1000}. (b) Smoothed estimates of power functions for a 1-sparse alternative and G null variables.
Table 2
High-probability upper bounds for central χ²_k statistics corresponding to noise variables. The
left columns give 99% bounds and the right columns 90% bounds. This assumes G noise groups
of equal size k, groupwise orthogonality, and error variance σ² = 1.
            99% bounds                        90% bounds
G        k=2     k=5     k=10    k=50      k=2     k=5     k=10    k=50
10       23.24   30.56   40.42   100.96    17.16   23.66   32.62   89.31
20       24.99   32.52   42.62   104.17    18.98   25.74   34.99   92.90
50       27.28   35.07   45.48   108.29    21.35   28.43   38.03   97.44
100      28.99   36.98   47.60   111.32    23.12   30.42   40.27   100.74
1000     34.61   43.19   54.47   120.99    28.88   36.85   47.46   111.11
The 1-sparse case with orthogonal groups of fixed, equal size follows similarly
using tail bounds for χ² random variables.
To translate between the non-centrality parameter λ and a linear model co-
efficient, consider that when y = X_g β_g + ε and the expected column norms of
X_g scale like √n, then the non-centrality parameter for T_g will be roughly equal
to n‖β_g‖₂² (under groupwise orthogonality).
Finally, when groups are not orthogonal, the greedy nature of forward step-
wise will generally result in less power since part of the variation in y due to Xg
may be regressed out at previous steps.
7. Simulations
To evaluate the Tχ test when groupwise orthogonality does not hold, we conduct
a simulation experiment with correlated Gaussian design matrices. In each of
1000 realizations forward stepwise is run with the number of steps chosen by BIC.
Table 3 summarizes the resulting model sizes and how many of the 5 nonnull
variables were included.
Table 3
Size and composition of models chosen by BIC in the simulation described in Fig 3.
Model size                 4     5     6     7     8     9
# Occurrences              4    42   511   328    95    20

# True signals included    3     4     5
# Occurrences             18   106   876
Fig 3 shows the empirical distributions of both null and nonnull Tχ p-values,
plotted by the step at which the corresponding variable was added. Table 4 shows
the power for rejection at the α = 0.05 level, also by step. Note that it was rare for
nonnull variables to enter at later steps, so the observed power of 0 at step 8 is
not too surprising.
Table 4
Empirical power for nonnull variables in the simulation described in Fig 3.
step 1 2 3 4 5 6 7 8
Power 0.532 0.470 0.454 0.521 0.641 0.315 0.500 0.000
Fig 3. Empirical distributions of null and nonnull Tχ p-values, by the step at which the corresponding variable was added.
References