
STATS 300C: Theory of Statistics Spring 2023

Lecture 2 — April 5
Lecturer: Prof. Emmanuel Candès Scribe: Nathan Tung

 Warning: These notes may contain factual and/or typographic errors. They are
based on Emmanuel Candès’s course from 2018 to 2023, and scribe notes written by
Debolina Paul, John Cherian, Will Fithian, Kenneth Tay, Paulo Orenstein, Stephen Bates,
and XY Han.

2.1 Recap: Needle in a Haystack Problem


In the previous lecture, we introduced Bonferroni’s global test. To study its properties
further, we considered the following independent Gaussian sequence model:
\[ Y_i \overset{\text{ind}}{\sim} \mathcal{N}(\mu_i, 1), \qquad i = 1, \dots, n. \]

To precisely characterize the sense in which Bonferroni’s global test is optimal, we com-
puted the asymptotic power of the test under a particular null and alternative hypothesis.
To this end, we defined the global null as
\[ H_0 = \bigcap_{i=1}^{n} H_{0,i}, \qquad \text{where } H_{0,i}: \mu_i = 0. \]

Under this testing criterion, we test the global null against the alternative hypothesis, which we denote by H_1: there exists an i such that µ_i = µ > 0.
Then, relying on the fact that max_i Y_i / √(2 log n) → 1 in probability under the global null (see Appendix A.1), we argued that there exists a sharp detection threshold for the needle in a haystack alternative hypothesis.

2.2 Optimality against sparse alternatives


Given the failure of the Bonferroni global test when µ(n) is below the decision threshold,
we might wonder if there exists some test that has non-negligible asymptotic power in this
scenario. However, we show below that this decision threshold cannot be improved using
any test of the global null against the “needle in a haystack” alternative. To prove this,
we reduce our composite alternative to a simple hypothesis, and show that the optimal test
given by the Neyman-Pearson Lemma still does no better than flipping a biased coin.


Bayesian Decision Problem. We consider the problem of testing
\[ H_0: \mu_i = 0 \text{ for all } i, \qquad H_1: \{\mu_i\} \sim \pi, \]
where π selects a coordinate I uniformly at random and sets µ_I = µ^(n) = (1 − ϵ)√(2 log n), with all other µ_i = 0.
Note that both the null and alternative hypotheses specified above are simple (one reason
for using this random alternative). Thus, we can apply the Neyman-Pearson Lemma, which
states that the most powerful test rejects for large values of the likelihood ratio.

Remark: If the coordinate were chosen in a non-random manner, say I = 1, the problem above would reduce to testing H_0: µ_1 = 0 against H_1: µ_1 > 0. As we saw in STATS 300A, there is no threshold below which this test is powerless. By contrast, we will see that for the "least favorable" alternative defined above, even the optimal likelihood ratio test exhibits the same asymptotic cutoff behavior as the Bonferroni global test.
Proceeding to define the likelihood ratio, we first observe that the densities under the null and alternative (writing µ for µ^(n)) are given by
\[ f_0(y) = \prod_{j=1}^{n} \frac{1}{\sqrt{2\pi}} e^{-y_j^2/2}, \qquad f_1(y) = \frac{1}{n} \sum_{i=1}^{n} \frac{1}{\sqrt{2\pi}} e^{-(y_i - \mu)^2/2} \prod_{j \neq i} \frac{1}{\sqrt{2\pi}} e^{-y_j^2/2}. \]

So, in this case, the likelihood ratio becomes
\[ L = \frac{f_1(y)}{f_0(y)} = \frac{1}{n} \sum_{i=1}^{n} \frac{\frac{1}{\sqrt{2\pi}} e^{-(y_i - \mu)^2/2}}{\frac{1}{\sqrt{2\pi}} e^{-y_i^2/2}} = \frac{1}{n} \sum_{i=1}^{n} e^{y_i \mu - \frac{1}{2}\mu^2}, \qquad \mathbb{E}_{H_0} L = 1. \]

Naively, we might think to apply the central limit theorem in order to derive a limiting distribution for L under H_0. However, the central limit theorem (for triangular arrays) cannot be applied here; see Appendix A.2. We shall focus, therefore, on deriving a weaker result.
Proposition 1. Under H_0, if µ^(n) = (1 − ϵ)√(2 log n), then L → 1 in probability.

Proof. See Appendix A.3.


Since L converges in probability to 1, we claim that the likelihood ratio test is asymp-
totically powerless.

Proposition 2. If T_n(α) is such that P_{H_0}(L ≥ T_n(α)) = α, then
\[ \lim_{n \to \infty} P_{H_1}(\text{Type II error}) = 1 - \alpha. \]


Proof. Note that
\begin{align*}
P_{H_1}(\text{Type II error}) = P_{H_1}(L \le T_n(\alpha)) &= \int \mathbf{1}\{L \le T_n(\alpha)\} \, dP_1^n \\
&= \int \mathbf{1}\{L \le T_n(\alpha)\}\, L \, dP_0^n \\
&= \int \mathbf{1}\{L \le T_n(\alpha)\} \, dP_0^n + \int \mathbf{1}\{L \le T_n(\alpha)\}(L - 1) \, dP_0^n \\
&= (1 - \alpha) + \int \mathbf{1}\{L \le T_n(\alpha)\}(L - 1) \, dP_0^n \\
&\to (1 - \alpha).
\end{align*}
The last step follows from the fact that L → 1 in probability. We can make this rigorous as follows: let Z_n = 1{L ≤ T_n(α)}(L − 1). Then Z_n → 0 in probability by Slutsky's theorem. Moreover, because L → 1 in probability, T_n(α) is uniformly bounded, and hence so is Z_n. Thus Z_n is uniformly integrable, and by Vitali's convergence theorem Z_n → 0 in L¹, which means that ∫ 1{L ≤ T_n(α)}(L − 1) dP_0^n vanishes.


Conclusion: If µ_I = (1 − ϵ)√(2 log n), then the optimal test has
\[ P_{H_0}(\text{Type I error}) + P_{H_1}(\text{Type II error}) \to 1. \]

We further note that the following inequality holds for any test:
\[ P_{H_0}(\text{Type I error}) + \sup_{H_1} P_{H_1}(\text{Type II error}) \ge P_{H_0}(\text{Type I error}) + \mathbb{E}_{H_1 \sim \pi}\, P_{H_1}(\text{Type II error}), \]
where the supremum is taken over all alternative hypotheses for {µ_i} in which one coordinate has mean µ^(n) = (1 − ϵ)√(2 log n). For a test with Type I error α′, the right-hand side is at least α′ plus the average Type II error of the Neyman-Pearson test at level α′, and we have just shown that this sum tends to 1; hence the lim inf of the right-hand side is at least 1. Taking a lim inf on both sides of the inequality yields the desired result:
\[ \liminf_{n \to \infty} \left( P_{H_0}(\text{Type I error}) + \sup_{H_1} P_{H_1}(\text{Type II error}) \right) \ge 1. \]

In summary, we have proven that there is no test that is asymptotically able to distinguish between the null and alternative hypotheses when the mean of the needle in the haystack, µ^(n), is smaller than the √(2 log n) threshold.
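This phenomenon is easy to see in simulation. The following is a minimal sketch (not part of the original lecture; it assumes numpy, and the helper name lr_statistic is ours). It draws the likelihood ratio L = (1/n) Σ_i exp(y_i µ − µ²/2) under both hypotheses with µ = (1 − ϵ)√(2 log n) and checks that a level-5% test based on L rejects under the alternative only about 5% of the time.

```python
import numpy as np

rng = np.random.default_rng(0)

def lr_statistic(y, mu):
    """Likelihood ratio L = (1/n) * sum_i exp(y_i * mu - mu^2 / 2)."""
    return np.mean(np.exp(y * mu - 0.5 * mu**2))

n, eps, reps = 10_000, 0.5, 2_000
mu = (1 - eps) * np.sqrt(2 * np.log(n))      # below the detection threshold

L_null, L_alt = np.empty(reps), np.empty(reps)
for r in range(reps):
    y0 = rng.standard_normal(n)              # global null
    y1 = rng.standard_normal(n)
    y1[rng.integers(n)] += mu                # one needle at a random coordinate
    L_null[r], L_alt[r] = lr_statistic(y0, mu), lr_statistic(y1, mu)

cutoff = np.quantile(L_null, 0.95)           # empirical level-5% rejection cutoff
print(np.median(L_null), np.median(L_alt))   # nearly identical, approaching 1 as n grows
print(np.mean(L_alt > cutoff))               # stays close to alpha = 0.05: essentially no power
```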

2.3 χ2 test
Turning our attention for the time being away from the Bonferroni test, we might also ask
ourselves if the Fisher combination test considered in the previous lecture is asymptotically
powerful for a different set of alternatives.
The definition of the Fisher combination test statistic, T_n = −2 Σ_{i=1}^n log p_i, makes it difficult to analyze directly. However, for the Gaussian model we considered in our power analysis of the Bonferroni global test, Figure 2.1 demonstrates that each term in the Fisher test statistic is qualitatively similar to the better-understood statistic y². This observation motivates our analysis of the χ² test. This family of tests is also known as ANOVA when at least one of the variables concerned is continuous.

[Figure 2.1: Fisher combination test statistic (solid) and the z = |y|² line (dashed).]
For our power analysis of the χ² test, we define our null and alternative hypotheses as follows:
\[ Y_i = \mu_i + z_i, \qquad z_i \overset{\text{i.i.d.}}{\sim} \mathcal{N}(0, 1), \]
\[ H_0: \mu_i = 0 \text{ for all } i \qquad \text{vs.} \qquad H_1: \text{at least one } \mu_i \neq 0. \]
Then, the test statistic for the χ² test is
\[ T_n = \sum_{i=1}^{n} Y_i^2 = \|Y\|_2^2. \]

Under H_0, T_n ∼ χ²_n, and the level-α test rejects H_0 when T_n > χ²_n(1 − α). Note that under H_0,
\[ T_n = \sum_{i=1}^{n} z_i^2 \]
with E[z_1²] = 1 and Var(z_1²) = 2. Hence, by a CLT approximation, for large n we roughly have
\[ \frac{T_n - n}{\sqrt{2n}} \sim \mathcal{N}(0, 1), \]


implying that
\[ \chi^2_n(1 - \alpha) \approx n + \sqrt{2n}\, z(1 - \alpha). \]
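As a quick sanity check (our own, assuming scipy is available), the exact χ² quantile and this normal approximation agree closely for large n:

```python
import numpy as np
from scipy import stats

n, alpha = 10**6, 0.05
exact = stats.chi2.ppf(1 - alpha, df=n)                   # chi^2_n(1 - alpha)
approx = n + np.sqrt(2 * n) * stats.norm.ppf(1 - alpha)   # n + sqrt(2n) * z(1 - alpha)
print(exact, approx)   # both are about 1.0023e6; the relative gap is negligible
```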

Under H_1, T_n is a non-central χ². Here,
\[ T_n = \sum_{i=1}^{n} (\mu_i + z_i)^2, \qquad \mathbb{E}\!\left[(\mu_i + z_i)^2\right] = \mu_i^2 + 1, \qquad \mathrm{Var}\!\left[(\mu_i + z_i)^2\right] = 4\mu_i^2 + 2. \]
Again applying a CLT approximation for large n, we have that
\[ \frac{T_n - (n + \|\mu\|^2)}{\sqrt{2n + 4\|\mu\|^2}} \sim \mathcal{N}(0, 1). \]

Remark: What is a computationally efficient way to draw samples from the non-central χ² distribution with n degrees of freedom? The distribution of χ²_n(∥µ∥₂²), i.e. that of Σ_{i=1}^n (µ_i + z_i)², does not depend on how the mass of µ is distributed over its indices, but only on the value of ∥µ∥₂²; indeed,
\[ \chi^2_n(\|\mu\|_2^2) \overset{d}{=} \chi^2_{n-1} + (z + \|\mu\|_2)^2, \qquad z \sim \mathcal{N}(0,1) \text{ independent of the } \chi^2_{n-1} \text{ term}. \]
This representation is both efficient to compute and captures the idea that, in this setting, the χ² test detects one large deviation from the null just as well as it detects many small deviations, of the same total ℓ₂ size, spread across all or many hypotheses.
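A sketch of this sampling trick in code (our own illustration; the helper name sample_noncentral_chi2 is ours, and numpy is assumed). It draws χ²_n(∥µ∥₂²) without touching the individual coordinates of µ, and checks the result against the brute-force definition:

```python
import numpy as np

rng = np.random.default_rng(1)

def sample_noncentral_chi2(n, mu_norm_sq, size, rng=rng):
    """Draw chi^2_n(||mu||_2^2) via chi^2_{n-1} + (z + ||mu||_2)^2."""
    z = rng.standard_normal(size)
    return rng.chisquare(n - 1, size) + (z + np.sqrt(mu_norm_sq)) ** 2

n, size = 50, 200_000
mu = rng.standard_normal(n)
mu *= 3.0 / np.linalg.norm(mu)              # any mu with ||mu||_2^2 = 9
brute = ((mu + rng.standard_normal((size, n))) ** 2).sum(axis=1)
fast = sample_noncentral_chi2(n, 9.0, size)
print(brute.mean(), fast.mean())            # both close to n + ||mu||^2 = 59
print(brute.var(), fast.var())              # both close to 2n + 4||mu||^2 = 136
```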

2.4 Detection thresholds for small distributed effects


To ascertain a detection threshold for the χ² test, we first define two useful quantities: Z_n, a normalized test statistic, and θ_n, which we will see is proportional to the signal-to-noise ratio (SNR) for the test:
\[ Z_n := \frac{T_n - n}{\sqrt{2n}}, \qquad \theta_n := \frac{\|\mu\|^2}{\sqrt{2n}}. \]
Applying some elbow grease to the normal approximation above and rearranging terms, we derive the following limiting distributions:
\[ H_0: \; Z_n \sim \mathcal{N}(0, 1), \qquad H_1: \; Z_n \sim \mathcal{N}\!\left(\theta_n, \; 1 + \frac{\theta_n}{\sqrt{n/8}}\right). \]

The asymptotic power of the χ² test is thus determined by θ_n, which measures the size of ∥µ∥² relative to √n. In particular, we can see from our expressions for the asymptotic null and alternative distributions above that detection is easy for the χ² test when θ_n ≫ 1 and hard when θ_n ≪ 1. For example, when θ_n = 2, the power of the test is roughly P(z_1 > 1.65 − 2) ≈ 64%.
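This back-of-the-envelope power calculation can be written out as follows (a small sketch under the normal approximation above; the function name is ours and scipy is assumed):

```python
import numpy as np
from scipy import stats

def chi2_power_normal_approx(theta_n, n, alpha=0.05):
    """Approximate power of the level-alpha chi^2 test when ||mu||^2 = theta_n * sqrt(2n),
    using Z_n ~ N(theta_n, 1 + theta_n / sqrt(n/8)) under the alternative."""
    z_crit = stats.norm.ppf(1 - alpha)
    sd_alt = np.sqrt(1 + theta_n / np.sqrt(n / 8))
    return stats.norm.sf((z_crit - theta_n) / sd_alt)

print(chi2_power_normal_approx(theta_n=2.0, n=10**6))   # about 0.64
```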


SNR: If we had started with a model in which the noise variance is defined to be σ², i.e.
\[ Y_i = \mu_i + \sigma z_i, \qquad i = 1, \dots, n, \]
then by the same argument as above (the special case being σ = 1), we would see that the detection power depends sensitively on
\[ \theta_n = \sqrt{\frac{n}{2}} \cdot \frac{\|\mu\|^2}{\sigma^2 n}. \]
Therefore, if we define the SNR as
\[ \mathrm{SNR} = \frac{\text{total signal power}}{\text{total expected noise power}} = \frac{\|\mu\|^2}{\sigma^2 n}, \]
we can see that θ_n ∝ SNR with a constant of proportionality equal to √(n/2).
Though this test does not have a sharp decision threshold, a natural question arises:
when θn ≪ 1, is there a test that is asymptotically more powerful than the χ2
test? To show that the answer is no, we use the strategy employed above for the Bonferroni
test: we first introduce a pair of simple hypotheses using a “Bayesian” alternative, and show
that in this setting, the most powerful (likelihood ratio) test is powerless as θn → 0.

Decision Problem:
\[ H_0: \mu = 0_n \qquad \text{vs.} \qquad H_1: \mu \sim \pi_\rho^{(n)}, \]
where π_ρ^(n) distributes mass uniformly on the sphere (in R^n) of radius ρ^(n).

Likelihood ratio: By way of the Neyman-Pearson Lemma, we note that the optimal test for this pair of hypotheses rejects for large values of the likelihood ratio. To derive this statistic and its limiting distribution under the null, we introduce some notation. Let µ = ρ^(n) u, where u is uniformly distributed on the unit sphere S^{n−1}, and let π denote the uniform distribution on this sphere. We then have
\[ L = \int_{S^{n-1}} \frac{e^{-\frac{1}{2}\|y - \rho^{(n)} u\|^2}}{e^{-\frac{1}{2}\|y\|^2}} \, \pi(du) = \int_{S^{n-1}} e^{-\frac{1}{2}(\rho^{(n)})^2 + \rho^{(n)} u^T y} \, \pi(du). \]
In this setting, θ_n := (ρ^(n))²/√(2n) measures the signal-to-noise ratio for our test, and as expected, when θ_n → 0, our test becomes powerless. This claim is summarized by the following proposition.
Proposition 3. Under H_0, if θ_n → 0, then L → 1 in probability and the Neyman-Pearson test is asymptotically no better than flipping a biased coin.

Proof. See Appendix A.5.
Heuristically, this proposition leads us to conclude that the χ2 test is a “good” test. It is
only asymptotically powerless for vanishing SNR, and in this setting, no test can do better
in the limit.
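For intuition, the likelihood ratio for the spherical prior can be approximated by Monte Carlo over π (a sketch of our own, not part of the original notes; sphere_lr is a hypothetical helper and numpy is assumed). With a small θ_n, the statistic lands near 1 whether the data come from the null or the alternative:

```python
import numpy as np

rng = np.random.default_rng(2)

def sphere_lr(y, rho, num_mc=20_000, rng=rng):
    """Monte Carlo estimate of L = int exp(-rho^2/2 + rho * u'y) d(pi)(u), u uniform on S^{n-1}."""
    g = rng.standard_normal((num_mc, y.shape[0]))
    u = g / np.linalg.norm(g, axis=1, keepdims=True)   # uniform draws on the unit sphere
    return np.exp(-0.5 * rho**2 + rho * (u @ y)).mean()

n, theta = 500, 0.1                                # small signal-to-noise ratio
rho = np.sqrt(theta * np.sqrt(2 * n))              # so that rho^2 / sqrt(2n) = theta

y_null = rng.standard_normal(n)
mu = rng.standard_normal(n)
mu *= rho / np.linalg.norm(mu)                     # one draw from the spherical prior
y_alt = mu + rng.standard_normal(n)

print(sphere_lr(y_null, rho), sphere_lr(y_alt, rho))   # both near 1: L barely separates the hypotheses
```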


2.5 Comparison of Bonferroni’s and χ2 tests


We turn now to a more practical comparison of the two global tests; namely, we provide a
qualitative characterization of the sets of alternatives for which the Bonferroni and χ2 tests
are most powerful.
Before we present this summary, we consider two examples in which the tests have very
different power characteristics.

Example 1: n^{1/4} of the µ_i's are equal to √(2 log n). To make this concrete, if n = 10^6, then n^{1/4} ≈ 32 and √(2 log n) ≈ 5.3. In this set-up, our asymptotic results suggest that the Bonferroni test will have (approximately) full power, but because
\[ \theta_n = \frac{n^{1/4}\,(2 \log n)}{\sqrt{2n}} \to 0, \]
the χ² test has (approximately) no power.



Example 2: √(2n) of the µ_i's are equal to 3. The χ² test has (almost) full power. The Bonferroni test has no power, however, because when n is large, it is very likely that the smallest p-value comes from a null µ_i rather than a true signal. An intuitive argument is as follows: among the nulls, the largest y_i has size ≈ √(2 log n), while among the true signals, the largest y_i has size ≈ 3 + √(2 log √(2n)). If n is large, the former value is larger.

Numerical illustration: Let n = 10^6 and α = 0.05, and consider Bonferroni's and χ² tests for the following alternatives:

• Sparse strong effects: µ_i equals the Bonferroni threshold (|z(α/(2n))| ≈ 5.45) for 1 ≤ i ≤ 4, and 0 otherwise.

• Distributed weak effects: µ_i = 1.1 for 1 ≤ i ≤ k = 2400, and 0 otherwise.

• Distributed moderate effects: µ_i = 1.3 for 1 ≤ i ≤ k = 2400, and 0 otherwise.

Sparse strong effects: In the sparse setting, the asymptotic power of Bonferroni's method can be approximated using the inclusion-exclusion principle. Let A denote the event that we reject because at least one of Y_1, ..., Y_4 exceeds the rejection threshold, and let B denote the event that we reject because one of the "noise" variables exceeds the Bonferroni threshold. Since each of Y_1, ..., Y_4 has mean equal to the threshold, each exceeds it with probability 1/2, so P(A) ≈ 1 − (1/2)^4 = 0.9375; writing q(α) := P(B) ≈ α and using independence of A and B,
\begin{align*}
P(A \cup B) &= P(A) + P(B) - P(A \cap B) \\
&\approx \left(1 - \left(\tfrac{1}{2}\right)^4\right) + q(\alpha) - \left(1 - \left(\tfrac{1}{2}\right)^4\right) q(\alpha) \\
&\approx 0.94.
\end{align*}



On the other hand, the χ² test would be almost powerless, as θ_n = ∥µ∥²/√(2n) ≈ 0.084 ≪ 1. A numerical estimate of the power of these tests with 500 trials yields the expected result:

Bonferroni ≈ 94%
Chi-sq ≈ 6%

Distributed weak effects: For the alternative of distributed weak effects, the power of the Bonferroni test satisfies
\[ P_{H_1}\!\left[\max_i |Y_i| > |z(\alpha/(2n))|\right] \le P_{H_1}\!\left[\max_{i \le k} |Y_i| > |z(\alpha/(2n))|\right] + P\!\left[\max_{i > k} |z_i| > |z(\alpha/(2n))|\right] \approx 0.066. \]
By contrast, θ_n = ∥µ∥²/√(2n) ≈ 2.05. So, while the Bonferroni test has almost no power, the χ² test can detect the alternative more often than not. Numerically,

Bonferroni = 6%
Chi-sq = 66%

Distributed moderate effects: For the "moderate effects" scenario listed above, θ_n = ∥µ∥²/√(2n) ≈ 2.8. Here, the χ² test is very powerful, but the Bonferroni test remains nearly powerless. Numerically,

Bonferroni = 8%
Chi-sq = 89%
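A sketch of the kind of Monte Carlo that produces these estimates (our own code, not the original simulation; numpy/scipy are assumed, and we use 200 trials instead of 500 to keep the runtime to a minute or so, so the percentages will wobble by a few points):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(3)
n, alpha, trials = 10**6, 0.05, 200

bonf_cut = stats.norm.isf(alpha / (2 * n))   # two-sided Bonferroni cutoff, about 5.45
chi2_cut = stats.chi2.isf(alpha, df=n)       # chi^2_n(1 - alpha)

scenarios = {
    "sparse strong":        (bonf_cut, 4),
    "distributed weak":     (1.1, 2400),
    "distributed moderate": (1.3, 2400),
}

for name, (effect, k) in scenarios.items():
    mu = np.zeros(n)
    mu[:k] = effect
    bonf_rejections = chi2_rejections = 0
    for _ in range(trials):
        y = mu + rng.standard_normal(n)
        bonf_rejections += np.max(np.abs(y)) > bonf_cut
        chi2_rejections += np.sum(y**2) > chi2_cut
    print(f"{name}: Bonferroni {bonf_rejections / trials:.0%}, chi-square {chi2_rejections / trials:.0%}")
```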

Simulating the weak distributed effects experiment confirms that the practical challenge for the Bonferroni cutoff is that the largest values of Y_i tend to come from the noise variables. Figure 2.2 thus captures why the Bonferroni global test is incapable of detecting weak distributed effects.

[Figure 2.2: Distribution of the maximum Y_i observed for the "distributed weak effects" example.]

We can summarize our conclusions in the following table:

                               Small, distributed effects    Few strong effects
ANOVA (Analysis of Variance)   Powerful                      Weak
Bonferroni                     Weak                          Powerful

Next week: Can we introduce a method that has the best of both worlds? Is there a
single test that is powerful for any alternative?

A Appendix
A.1 Concentration of Gaussian Maxima
How large is our threshold t_n = |z(α/n)| (one-sided) or |z(α/(2n))| (two-sided)? If ϕ(t) is the standard normal pdf, then one can derive the useful bounds
\[ \frac{\phi(t)}{t}\left(1 - \frac{1}{t^2}\right) \le P(Z > t) \le \frac{\phi(t)}{t}, \]


where Z ∼ N(0, 1). That is, for large t, ϕ(t)/t is a good approximation to the Gaussian tail probability. Roughly speaking, then,
\[ P(Z > t) = \alpha/n \;\; \text{``}\Longleftrightarrow\text{''} \;\; \frac{\phi(t)}{t} \approx \alpha/n. \]
Holding α fixed, then, we can show that for large n,
\[ |z(\alpha/n)| \approx \sqrt{2 \log n}\left(1 - \frac{\log \log n}{4 \log n}\right) \approx \sqrt{2 \log n}. \]
Hence, the quantiles grow like √(2 log n), with a small correction factor.
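A quick numerical check of this growth rate (our own, assuming scipy): the ratio of the exact cutoff to √(2 log n) drifts toward 1 as n grows.

```python
import numpy as np
from scipy import stats

alpha = 0.05
for n in [10**3, 10**6, 10**9, 10**12]:
    exact = stats.norm.isf(alpha / n)      # one-sided cutoff |z(alpha/n)|
    crude = np.sqrt(2 * np.log(n))
    print(f"n = {n}: exact {exact:.3f}, sqrt(2 log n) {crude:.3f}, ratio {exact / crude:.4f}")
```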

A.2 Triangular Array CLT Failure


Writing X_i = e^{y_i \mu^{(n)} - \frac{1}{2}(\mu^{(n)})^2}, we have that under H_0 the X_i are i.i.d. and
\[ L = \frac{1}{n} \sum_{i=1}^{n} X_i. \]
This is a sample average with mean E X_1 and variance (1/n) Var(X_1).
We would like to apply the CLT; however, because µ^(n) = (1 − ε)√(2 log n) → ∞, we would need a triangular array argument.


The (sufficient but not necessary) Lyapunov condition, for instance, is violated for q = 3 (third moments):
\[ \frac{\sum_i \mathbb{E}|X_i - \mathbb{E}X_i|^3}{\left[\sum_i \mathrm{Var}(X_i)\right]^{3/2}} \to \infty \]
as n → ∞, whereas the condition would require this ratio to tend to zero. Though this is not a proof that the CLT fails, it does suggest that we might take the alternative approach discussed in the main text of these notes.

A.3 Convergence of L for the Sparse Alternative


Proposition. If µ^(n) = (1 − ϵ)√(2 log n), then L → 1 in probability.

Proof. Recall
\[ L = \frac{1}{n} \sum_{i=1}^{n} X_i, \qquad X_i = e^{y_i \mu^{(n)} - \frac{1}{2}(\mu^{(n)})^2} \;\text{ i.i.d.} \]
Assume first 0 < ϵ < 1/2, take T_n = √(2 log n), and write
\[ \tilde{L} = \frac{1}{n} \sum_{i=1}^{n} X_i \mathbf{1}\{y_i \le T_n\}. \]
We have
\[ P(\tilde{L} \neq L) \le P(\max_i y_i \ge T_n) \to 0, \]
and it suffices to establish that
\[ \tilde{L} = \Phi(\epsilon \sqrt{2 \log n}) + o_{P_0}(1), \]
which in particular follows if

1. E_0(L̃) = Φ(ϵ√(2 log n)),
2. Var_0(L̃) = o(1).

(Note that Φ(ϵ√(2 log n)) → 1.) Proceeding, and writing µ = µ^(n),
\begin{align*}
\mathbb{E}_0(\tilde{L}) = \mathbb{E}_0\left[X_1 \mathbf{1}\{y_1 \le T_n\}\right] &= \int_{-\infty}^{T_n} e^{\mu z - \mu^2/2} \, \frac{1}{\sqrt{2\pi}} e^{-z^2/2} \, dz \\
&= \int_{-\infty}^{T_n} \frac{1}{\sqrt{2\pi}} e^{-(z - \mu)^2/2} \, dz \\
&= \Phi(T_n - \mu) = \Phi(\epsilon \sqrt{2 \log n}).
\end{align*}
Furthermore,
\[ \mathrm{Var}_0(\tilde{L}) = \frac{1}{n} \mathrm{Var}\!\left(X_1 \mathbf{1}\{y_1 \le T_n\}\right) \le \frac{1}{n} \mathbb{E}_0\!\left[X_1^2 \mathbf{1}\{y_1 \le T_n\}\right] = \frac{1}{n} e^{-\mu^2} \int_{-\infty}^{T_n} e^{2\mu z} \phi(z)\, dz = \frac{1}{n} e^{\mu^2} \Phi(T_n - 2\mu). \]


Since Φ(T_n − 2µ) ≤ ϕ(2µ − T_n), this gives
\begin{align*}
\mathrm{Var}_0(\tilde{L}) \le \frac{1}{n} e^{\mu^2} \phi(2\mu - T_n) &= \frac{1}{n} e^{(1-\varepsilon)^2 T_n^2} \, \frac{1}{\sqrt{2\pi}} e^{-(1 - 2\varepsilon)^2 T_n^2/2} \\
&= \frac{1}{n\sqrt{2\pi}} e^{(1 - 2\varepsilon^2) T_n^2/2} \\
&= \frac{1}{\sqrt{2\pi}} e^{-\varepsilon^2 T_n^2} \to 0,
\end{align*}
where the last equality uses n = e^{T_n^2/2}. This proves the result for 0 < ϵ < 1/2. The claim for 1/2 ≤ ϵ < 1 is even simpler, since e^{(µ^(n))²}/n → 0 in this case, so Var_0(L) ≤ e^{(µ^(n))²}/n → 0 without any truncation.

A.4 Asymptotic Powerlessness of LRT (Proof via Contiguity)


We argued in the main text of these notes that when L → 1 in probability, the likelihood ratio test, and with it any test (in particular Bonferroni's and the χ² test), becomes asymptotically powerless. Here, we describe an alternative proof of this claim that relies on two contiguity results proven in STATS 300B.
First, we state a simplified version of Le Cam's first lemma.

Lemma 1. Let P_n and Q_n be sequences of probability measures on measurable spaces (Ω_n, A_n). Then, the following statements are equivalent:

1. Q_n ◁ P_n;

2. if dQ_n/dP_n ⇝ V under P_n along a subsequence, then E[V] = 1.

Taking Q_n = P_1^n and P_n = P_0^n, we see that L → 1 in probability under P_0^n implies condition 2 (every subsequential limit is V ≡ 1, so E[V] = 1), and we conclude that P_1^n ◁ P_0^n, namely, P_1^n is contiguous with respect to P_0^n. We can now apply a generalized version of Le Cam's third lemma to obtain the desired result.

Theorem 1. Let P_n and Q_n be sequences of probability measures on measurable spaces (Ω_n, A_n) and let X_n: Ω_n → R^k be a sequence of random vectors. Suppose that Q_n ◁ P_n and
\[ \left(X_n, \frac{dQ_n}{dP_n}\right) \rightsquigarrow (X, V) \quad \text{under } P_n. \]
Then X_n ⇝ W under Q_n, where W is any random variable whose law satisfies P(W ∈ B) = E[1_B(X) V] for every Borel set B.

Let X_n be the test statistic T_n (which we know to have some limiting distribution T under the null), and let P_n = P_0^n and Q_n = P_1^n as above. Since dQ_n/dP_n = L → 1 under the null, we have V = 1, so the limit law of T_n under the alternative can be re-expressed as E[1_B(X) · 1] = P(T ∈ B). We conclude that T_n converges to T under both the null and alternative hypotheses. Since the limiting distributions under the null and alternative sequences are identical, our test must be asymptotically powerless.


A.5 Convergence of L for the Distributed Alternative


We use some shorthand in this argument for brevity’s sake. Namely, for the derivation, we
drop the n-index corresponding to the sequence of hypotheses and refer to an expectation
under the null using the subscript 0 rather than H0 .

Useful Relationship: If y ∼ N(0_n, I_n), then
\[ \mathbb{E}\left[e^{a^T y}\right] = e^{\|a\|^2/2}, \]
which is the mgf of a Gaussian random vector. Then
\begin{align*}
\mathbb{E}_0(L^2) &= \mathbb{E}_0\left[\int\!\!\int e^{-\rho^2/2 + \rho u^T y}\, e^{-\rho^2/2 + \rho v^T y} \, \pi(du)\, \pi(dv)\right] \\
&= \mathbb{E}_0\left[\int\!\!\int e^{-\rho^2 + \rho (u+v)^T y} \, \pi(du)\, \pi(dv)\right] \\
&= e^{-\rho^2} \int\!\!\int e^{\rho^2 \|u+v\|^2/2} \, \pi(du)\, \pi(dv) \\
&= \int\!\!\int e^{\rho^2 u^T v} \, \pi(du)\, \pi(dv),
\end{align*}
where the third equality uses the mgf (together with Fubini's theorem) and the fourth uses u^T u = v^T v = 1. By spherical symmetry, we can fix v = e_1 = (1, 0, \dots, 0) to obtain
\[ \mathbb{E}_0(L^2) = \int e^{\rho^2 u_1} \, \pi(du), \]
with u = (u_1, \dots, u_n) uniform on S^{n−1}. Using the Taylor expansion
\[ e^{\rho^2 u_1} = 1 + \rho^2 u_1 + \frac{\rho^4 u_1^2}{2} + \cdots, \]
we have
\[ \mathbb{E}\left[e^{\rho^2 u_1}\right] = 1 + \mathbb{E}[\rho^2 u_1] + \mathbb{E}\left[\frac{\rho^4 u_1^2}{2}\right] + \cdots = 1 + 0 + \frac{\rho^4}{2n} + 0 + O\!\left(\frac{\rho^8}{n^2}\right), \]
using E[u_1] = 0 and E[u_1²] = 1/n, which is to say
\[ \mathbb{E}_0(L^2) = 1 + \theta_n^2 + O(\theta_n^4) \to 1 \]
when θ_n = ρ²/√(2n) → 0.

Conclusion: Since E_0(L) = 1, the above shows that Var_0(L) = E_0(L²) − 1 → 0, so L → 1 in probability under H_0; hence the LR test has no power if θ_n = ∥µ∥²/√(2n) → 0 as n → ∞.
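The second-moment calculation is easy to verify by simulation (a sketch of our own, assuming numpy): sample u_1 as the first coordinate of a uniform point on S^{n−1} and compare E[exp(ρ² u_1)] with 1 + θ_n².

```python
import numpy as np

rng = np.random.default_rng(4)

n, rho, reps = 200, 2.0, 100_000
theta_n = rho**2 / np.sqrt(2 * n)

g = rng.standard_normal((reps, n))
u1 = g[:, 0] / np.linalg.norm(g, axis=1)    # first coordinate of a uniform point on S^{n-1}
estimate = np.exp(rho**2 * u1).mean()       # Monte Carlo estimate of E_0(L^2)
print(estimate, 1 + theta_n**2)             # both roughly 1.04 for these values
```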

