Lecture 8: Inference 36-401, Fall 2015, Section B
1 Sampling Distribution of β̂0, β̂1 and σ̂²
The Gaussian-noise simple linear regression model has three parameters: the intercept β0, the slope β1, and the noise variance σ². We've seen, previously, how to estimate all of these by maximum likelihood; the MLE for the βs is the same as their least-squares estimates. These are
\[
\hat{\beta}_1 = \frac{c_{XY}}{s_X^2} = \sum_{i=1}^{n} \frac{X_i - \bar{x}}{n s_X^2} Y_i \quad (1)
\]
\[
\hat{\beta}_0 = \bar{y} - \hat{\beta}_1 \bar{x} \quad (2)
\]
\[
\hat{\sigma}^2 = \frac{1}{n} \sum_{i=1}^{n} \left( Y_i - (\hat{\beta}_0 + \hat{\beta}_1 X_i) \right)^2 \quad (3)
\]
We have also seen how to re-write the first two of these as a deterministic part plus a weighted sum of the noise terms εi:
\[
\hat{\beta}_1 = \beta_1 + \sum_{i=1}^{n} \frac{X_i - \bar{x}}{n s_X^2} \epsilon_i \quad (4)
\]
\[
\hat{\beta}_0 = \beta_0 + \frac{1}{n} \sum_{i=1}^{n} \left( 1 - \bar{x} \frac{X_i - \bar{x}}{s_X^2} \right) \epsilon_i \quad (5)
\]
Finally, we have our modeling assumption that the εi are independent Gaussians, εi ∼ N(0, σ²).
1.1 Reminders of Basic Properties of Gaussian Distributions
Suppose U ∼ N(µ, σ²). By the basic algebra of expectations and variances, E[a + bU] = a + bµ, while Var[a + bU] = b²σ². This would be true of any random variable; what is special to Gaussians is that the transformed variable is still Gaussian, a + bU ∼ N(a + bµ, b²σ²).
Suppose U1, U2, . . . , Un are independent Gaussians, with means µi and variances σ²i. Then
\[
\sum_{i=1}^{n} U_i \sim N\!\left( \sum_{i=1}^{n} \mu_i ,\; \sum_{i=1}^{n} \sigma_i^2 \right)
\]
That the expected values add up for a sum is true of all random variables; that the variances add
up is true for all uncorrelated random variables. That the sum follows the same type of distribution
as the summands is a special property of Gaussians.
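It's easy to see these facts at work in a quick simulation; the particular means and variances below are arbitrary choices, picked only for illustration.
# Simulation check: sums of independent Gaussians are Gaussian, with
# means and variances adding up (the particular values here are arbitrary)
mu <- c(1, -2, 0.5)
sigma.sq <- c(0.25, 1, 4)
sums <- replicate(1e5, sum(rnorm(3, mean=mu, sd=sqrt(sigma.sq))))
mean(sums)   # should be close to sum(mu) = -0.5
var(sums)    # should be close to sum(sigma.sq) = 5.25
hist(sums, breaks=50, freq=FALSE, main="")
curve(dnorm(x, mean=sum(mu), sd=sqrt(sum(sigma.sq))), add=TRUE, col="blue")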
# Simulate a Gaussian-noise simple linear regression model
# Inputs: x sequence; intercept; slope; noise variance; switch for whether to
#   return the simulated values, or run a regression and return the coefficients
# Output: data frame or coefficient vector
sim.gnslrm <- function(x, intercept, slope, sigma.sq, coefficients=TRUE) {
  n <- length(x)
  y <- intercept + slope*x + rnorm(n, mean=0, sd=sqrt(sigma.sq))
  if (coefficients) {
    return(coefficients(lm(y ~ x)))
  } else {
    return(data.frame(x=x, y=y))
  }
}
Figure 1: Code setting up a simulation of a Gaussian-noise simple linear regression model, along a fixed
vector of Xi values.
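The simulation code below all presumes that a fixed vector x of Xi values has already been set up; the original setup isn't shown here, and any fixed vector will do. For concreteness, one arbitrary choice:
# An arbitrary fixed vector of X values, held constant across simulations
# (the particular choice is unimportant)
x <- seq(from=-2, to=2, length.out=42)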
\[
\frac{\hat{\beta}_0 - \beta_0}{\sqrt{\frac{\sigma^2}{n}\left( 1 + \frac{\bar{x}^2}{s_X^2} \right)}} \sim N(0, 1)
\]
The right-hand side of this equation is admirably simple and easy for us to calculate, but the left-
hand side unfortunately involves two unknown parameters, and that complicates any attempt to
use it.
1.4 Sampling Distribution of σ̂²
It is mildly challenging, but certainly not too hard, to show that
\[
E\left[ \hat{\sigma}^2 \right] = \frac{n-2}{n} \sigma^2
\]
We can be much more specific. When εi ∼ N(0, σ²), it can be shown that
\[
\frac{n \hat{\sigma}^2}{\sigma^2} \sim \chi^2_{n-2}
\]
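We can check this distributional claim by simulation, re-using sim.gnslrm and the fixed x from above; the intercept, slope and noise variance below are arbitrary. Since the MLE σ̂² is the mean squared residual, nσ̂²/σ² is just the residual sum of squares over σ².
# Simulation check (sketch): n * sigma.hat^2 / sigma^2 should be chi^2 with n-2 d.f.
sigma.sq <- 0.1
scaled.mles <- replicate(1e4, {
  sim.data <- sim.gnslrm(x=x, intercept=5, slope=-2, sigma.sq=sigma.sq,
                         coefficients=FALSE)
  fit <- lm(y ~ x, data=sim.data)
  sum(residuals(fit)^2) / sigma.sq   # = n * (MLE of sigma^2) / sigma^2
})
hist(scaled.mles, breaks=50, freq=FALSE, xlab=expression(n*hat(sigma)^2/sigma^2),
     main="")
n <- length(x)
curve(dchisq(x, df=n-2), add=TRUE, col="blue")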
[Figure 2 plot: histogram of the simulated slope estimates β̂1, on a density scale, with the theoretical Gaussian sampling distribution overlaid in blue; see the caption below.]
# Run the simulation 10,000 times and collect all the coefficients
# What intercept, slope and noise variance does this impose?
many.coefs <- replicate(1e4, sim.gnslrm(x=x, 5, -2, 0.1, coefficients=TRUE))
# Histogram of the slope estimates
hist(many.coefs[2,], breaks=50, freq=FALSE, xlab=expression(hat(beta)[1]),
main="")
# Theoretical Gaussian sampling distribution
theoretical.se <- sqrt(0.1/(length(x)*var(x)))
curve(dnorm(x,mean=-2,sd=theoretical.se), add=TRUE,
col="blue")
Figure 2: Simulating 10,000 runs of a Gaussian-noise simple linear regression model, calculating β̂1 each
time, and comparing the histogram of estimates to the theoretical Gaussian distribution (Eq. 8, in blue).
1.5 Standard Errors of β̂0 and β̂1
The standard error of an estimator is its standard deviation. We've just seen that the true standard errors of β̂0 and β̂1 are, respectively,
\[
\mathrm{se}\left[ \hat{\beta}_1 \right] = \frac{\sigma}{s_X \sqrt{n}} \quad (9)
\]
\[
\mathrm{se}\left[ \hat{\beta}_0 \right] = \frac{\sigma}{s_X \sqrt{n}} \sqrt{s_X^2 + \bar{x}^2} \quad (10)
\]
Unfortunately, these standard errors involve the unknown parameter σ² (or its square root σ, equally unknown to us).
We can, however, estimate the standard errors. The maximum-likelihood estimates just substitute σ̂ for σ:
\[
\widehat{\mathrm{se}}\left[ \hat{\beta}_1 \right] = \frac{\hat{\sigma}}{s_X \sqrt{n}} \quad (11)
\]
\[
\widehat{\mathrm{se}}\left[ \hat{\beta}_0 \right] = \frac{\hat{\sigma}}{s_X \sqrt{n}} \sqrt{s_X^2 + \bar{x}^2} \quad (12)
\]
For later theoretical purposes, however, things will work out slightly nicer if we use the de-biased version, (n/(n−2)) σ̂²:
\[
\widehat{\mathrm{se}}\left[ \hat{\beta}_1 \right] = \frac{\hat{\sigma}}{s_X \sqrt{n-2}} \quad (13)
\]
\[
\widehat{\mathrm{se}}\left[ \hat{\beta}_0 \right] = \frac{\hat{\sigma}}{s_X \sqrt{n-2}} \sqrt{s_X^2 + \bar{x}^2} \quad (14)
\]
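Since Eqs. 13 and 14 involve only things we can compute from the data, we can check them against the standard errors that lm reports; here is a sketch on one simulated data set (parameter values arbitrary, x re-used from above), being careful to use the divide-by-n versions of σ̂² and s²X that these notes use.
# Computing the estimated standard errors of Eqs. 13 and 14 by hand (sketch;
# data simulated with arbitrary parameter values, re-using x from above)
sim.data <- sim.gnslrm(x=x, intercept=5, slope=-2, sigma.sq=0.1,
                       coefficients=FALSE)
fit <- lm(y ~ x, data=sim.data)
n <- nrow(sim.data)
sigma.hat.sq <- mean(residuals(fit)^2)    # MLE of sigma^2 (divides by n)
s.sq.x <- mean((x - mean(x))^2)           # s_X^2, also dividing by n
se.slope <- sqrt(sigma.hat.sq) / (sqrt(s.sq.x) * sqrt(n-2))            # Eq. 13
se.intercept <- sqrt(sigma.hat.sq) * sqrt(s.sq.x + mean(x)^2) /
  (sqrt(s.sq.x) * sqrt(n-2))                                           # Eq. 14
# These should match the "Std. Error" column of lm's summary
cbind(by.hand = c(se.intercept, se.slope),
      from.lm = coefficients(summary(fit))[, "Std. Error"])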
These standard errors — approximate or estimated though they be — are one important way of quantifying how much uncertainty there is around our point estimates. However, we can't use them alone to say anything terribly precise about, say, the probability that β1 is in the interval [β̂1 − ŝe[β̂1], β̂1 + ŝe[β̂1]], which is the sort of thing we'd want in order to give guarantees about the reliability of our estimates.
2 Sampling distribution of (β̂ − β)/ŝe[β̂]
It should take only a little work with the properties of the Gaussian distribution to convince yourself
that
\[
\frac{\hat{\beta}_1 - \beta_1}{\mathrm{se}\left[ \hat{\beta}_1 \right]} \sim N(0, 1)
\]
the standard Gaussian distribution. If the Oracle told us σ², we'd know se[β̂1], and so we could assert that (for example)
\[
P\left( \beta_1 - 1.96\,\mathrm{se}\left[\hat{\beta}_1\right] \leq \hat{\beta}_1 \leq \beta_1 + 1.96\,\mathrm{se}\left[\hat{\beta}_1\right] \right) \quad (15)
\]
\[
= P\left( -1.96\,\mathrm{se}\left[\hat{\beta}_1\right] \leq \hat{\beta}_1 - \beta_1 \leq 1.96\,\mathrm{se}\left[\hat{\beta}_1\right] \right) \quad (16)
\]
\[
= P\left( -1.96 \leq \frac{\hat{\beta}_1 - \beta_1}{\mathrm{se}\left[\hat{\beta}_1\right]} \leq 1.96 \right) \quad (17)
\]
\[
= \Phi(1.96) - \Phi(-1.96) = 0.95 \quad (18)
\]
Of course, the Oracle tells us nothing, so we have to use ŝe[β̂1] in place of se[β̂1], and that changes the distribution. The key fact is the following.
Proposition 1 If Z ∼ N(0, 1) and S ∼ χ²_d, with Z and S independent, then Z/√(S/d) ∼ t_d, a t distribution with d degrees of freedom.
(I call this a proposition, but it's almost a definition of what we mean by a t distribution with d degrees of freedom. Of course, if we take this as the definition, the proposition that this distribution has a probability density ∝ (1 + x²/d)^{−(d+1)/2} would become yet another proposition to be demonstrated.)
Let's try to manipulate (β̂1 − β1)/ŝe[β̂1] into this form.
\[
\frac{\hat{\beta}_1 - \beta_1}{\widehat{\mathrm{se}}\left[\hat{\beta}_1\right]}
= \frac{\hat{\beta}_1 - \beta_1}{\sigma} \, \frac{\sigma}{\widehat{\mathrm{se}}\left[\hat{\beta}_1\right]}
= \frac{\frac{\hat{\beta}_1 - \beta_1}{\sigma}}{\frac{\widehat{\mathrm{se}}[\hat{\beta}_1]}{\sigma}}
= \frac{N(0, 1/n s_X^2)}{\frac{\hat{\sigma}}{\sigma s_X \sqrt{n-2}}}
= \frac{s_X \, N(0, 1/n s_X^2)}{\frac{\hat{\sigma}}{\sigma \sqrt{n-2}}}
\]
\[
= \frac{N(0, 1/n)}{\frac{\hat{\sigma}}{\sigma \sqrt{n-2}}}
= \frac{\sqrt{n}\, N(0, 1/n)}{\frac{\sqrt{n}\,\hat{\sigma}}{\sigma \sqrt{n-2}}}
= \frac{N(0, 1)}{\sqrt{\frac{n \hat{\sigma}^2}{\sigma^2} \frac{1}{n-2}}}
= \frac{N(0, 1)}{\sqrt{\chi^2_{n-2}/(n-2)}}
= t_{n-2}
\]
where in the last step I’ve used the proposition I stated (without proof) above.
To sum up:
Proposition 2 Using the ŝe[β̂1] of Eq. 13,
\[
\frac{\hat{\beta}_1 - \beta_1}{\widehat{\mathrm{se}}\left[\hat{\beta}_1\right]} \sim t_{n-2} \quad (19)
\]
Notice that we can compute ŝe[β̂1] without knowing any of the true parameters — it's a pure statistic, just a function of the data. This is a key to actually using the proposition for anything useful.
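As with Figure 2, it's easy to check Proposition 2 by simulation: simulate many data sets, form (β̂1 − β1)/ŝe[β̂1] from each (using the standard error that lm reports, which is Eq. 13), and compare the histogram with the t density with n − 2 degrees of freedom. A sketch; the small, made-up x vector and the parameter values here are arbitrary.
# Simulation check of Proposition 2 (sketch; x.few and parameters arbitrary)
x.few <- seq(from=-1, to=1, length.out=10)   # small n, so t and Gaussian differ
true.slope <- -2
t.stats <- replicate(1e4, {
  sim.data <- sim.gnslrm(x=x.few, intercept=5, slope=true.slope, sigma.sq=0.1,
                         coefficients=FALSE)
  est <- coefficients(summary(lm(y ~ x, data=sim.data)))["x", ]
  (est["Estimate"] - true.slope) / est["Std. Error"]
})
hist(t.stats, breaks=50, freq=FALSE, main="", xlab="t statistic for the slope")
curve(dt(x, df=length(x.few)-2), add=TRUE, col="blue")
curve(dnorm(x), add=TRUE, lty="dashed")   # standard Gaussian, for contrast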
By exactly parallel reasoning, we may also demonstrate that
\[
\frac{\hat{\beta}_0 - \beta_0}{\widehat{\mathrm{se}}\left[\hat{\beta}_0\right]} \sim t_{n-2}
\]
What about β0? By exactly parallel reasoning, a 1 − α confidence interval for β0 is [β̂0 − k(n, α) ŝe[β̂0], β̂0 + k(n, α) ŝe[β̂0]].
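Concretely, with k(n, α) the 1 − α/2 quantile of the t distribution with n − 2 degrees of freedom (which is what the code for Figure 3 below uses), we can build these intervals by hand and check them against R's confint function. A sketch on one arbitrary simulated data set, re-using x and sim.gnslrm from above.
# Building 1 - alpha confidence intervals by hand and checking against confint
# (sketch; the simulated data set is arbitrary, x re-used from above)
fit <- lm(y ~ x, data=sim.gnslrm(x=x, intercept=5, slope=-2, sigma.sq=0.1,
                                 coefficients=FALSE))
alpha <- 0.05
k <- qt(1 - alpha/2, df=length(x)-2)        # k(n, alpha)
ests <- coefficients(summary(fit))
by.hand <- cbind(lower = ests[, "Estimate"] - k * ests[, "Std. Error"],
                 upper = ests[, "Estimate"] + k * ests[, "Std. Error"])
by.hand
confint(fit, level=1-alpha)                 # should agree exactly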
What α should we use? It’s become conventional to set α = 0.05. To be honest, this owes more
to the fact that the resulting k tends to 1.96 as n → ∞, and 1.96 ≈ 2, and most psychologists and
economists could multiply by 2, even in 1950, than to any genuine principle of statistics or scientific
method. A 5% error rate corresponds to messing up about one working day in every month, which
you might well find high. On the other hand, there is nothing which stops you from increasing α.
It’s often illuminating to plot a series of confidence sets, at different values of α.
What about power? The coverage of a confidence set is the probability that it includes the
true parameter value. This is not, however, the only virtue we want in a confidence set; if it was, we
could just say “Every possible parameter is in the set”, and have 100% coverage no matter what.
We would also like the wrong values of the parameter to have a high probability of not being in the
set. Just as the coverage is controlled by the size / false-alarm probability / type-I error rate α of
the hypothesis test, the probability of excluding the wrong parameters is controlled by the power
/ miss probability / type-II error rate. Tests with higher power exclude (correctly) more parameter
values, and give smaller confidence sets.
2. If we have a way of constructing a 1 − α confidence set, we can use it to test the hypothesis
that β = β ∗ : reject when β ∗ is outside the confidence set, retain the null when β ∗ is inside
the set.
I will leave it as a pair of exercises (2 and 3) to show that inverting a test of size α gives a 1 − α confidence set, and that inverting a 1 − α confidence set gives a test of size α.
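Point 2 is easy to act on in code: to test β1 = β∗ at level α, just check whether β∗ falls inside the 1 − α confidence interval for the slope. A sketch, using a simulated fit; the helper function test.from.ci and the particular β∗ values are made up for illustration.
# Testing beta1 = beta.star by inverting a confidence interval (sketch)
# test.from.ci is a made-up helper, not a standard R function
test.from.ci <- function(fit, beta.star, alpha=0.05) {
  ci <- confint(fit, parm="x", level=1-alpha)   # interval for the slope
  if (beta.star < ci[1] || beta.star > ci[2]) "reject" else "retain"
}
fit <- lm(y ~ x, data=sim.gnslrm(x=x, intercept=5, slope=-2, sigma.sq=0.1,
                                 coefficients=FALSE))
test.from.ci(fit, beta.star=-2)   # the truth; retained about 95% of the time
test.from.ci(fit, beta.star=0)    # far from the truth; should be rejected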
As n → ∞, the t distribution converges to the standard Gaussian, so that
\[
\frac{\hat{\beta} - \beta}{\widehat{\mathrm{se}}\left[\hat{\beta}\right]} \rightarrow N(0, 1)
\]
which considerably simplifies the sampling intervals and confidence sets; as n grows, we can forget
about the t distribution and just use the standard Gaussian distribution. Figure 3 plots the
convergence of k(n, α) towards the k(∞, α) we’d get from the Gaussian approximation. As you
can see from the figure, by the time n = 100 — a quite small data set by modern standards — the difference between the t distribution and the standard Gaussian is pretty trivial.
[Figure 3 plot: k(n, α) against sample size n, for α = 0.01 (blue), α = 0.05 (black) and α = 0.5 (orange), with dashed horizontal lines at the corresponding Gaussian limits; see the code and caption below.]
curve(qt(0.995,df=x-2),from=3,to=1e4,log="x", ylim=c(0,10),
xlab="Sample size (n)", ylab=expression(k(n,alpha)),col="blue")
abline(h=qnorm(0.995),lty="dashed",col="blue")
curve(qt(0.975,df=x-2), add=TRUE)
abline(h=qnorm(0.975),lty="dashed")
curve(qt(0.75,df=x-2), add=TRUE, col="orange")
abline(h=qnorm(0.75), lty="dashed", col="orange")
legend("topright", legend=c(expression(alpha==0.01), expression(alpha==0.05),
expression(alpha==0.5)),
col=c("blue","black","orange"), lty="solid")
Figure 3: Convergence of k(n, α) as n → ∞, illustrated for α = 0.01, α = 0.05 and α = 0.5. (Why do I
plot the 97.5th percentile when I’m interested in α = 0.05?)
4 Statistical Significance: Uses and Abuses
4.1 p-Values
The test statistic for the Wald test,
\[
T = \frac{\hat{\beta}_1 - \beta_1^*}{\widehat{\mathrm{se}}\left[\hat{\beta}_1\right]}
\]
has the nice, intuitive property that it ought to be close to zero when the null hypothesis β1 = β1∗
is true, and take large values (either positive or negative) when the null hypothesis is false. When
a test statistic works like this, it makes sense to summarize just how bad the data looks for the
null hypothesis in a p-value: when our observed value of the test statistic is Tobs , the p-value is
\[
P = P\left( |T| \geq |T_{\mathrm{obs}}| \right) ,
\]
where the probability is calculated under the null hypothesis. (I write a capital P here as a reminder that
this is a random quantity, though it’s conventional to write the phrase “p-value” with a lower-case
p.) This is the probability, under the null, of getting results which are at least as extreme as what
we saw. It should be easy to convince yourself that rejecting the null in a level-α test is the same
as getting a p-value < α.
It is not too hard (Exercise 4) to show that P has a uniform distribution over [0, 1] under the
null hypothesis.
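Exercise 4 asks for the proof, but the phenomenon is easy to see by simulation: generate data where the null β1 = 0 really is true, collect the p-value that lm reports for the slope each time, and look at the histogram, which should be flat. (A sketch, with x and sim.gnslrm re-used from above and the other parameter values arbitrary.)
# Sketch: under the null beta1 = 0, the p-value should be Unif(0,1)
p.vals <- replicate(1e4, {
  sim.data <- sim.gnslrm(x=x, intercept=5, slope=0, sigma.sq=0.1,
                         coefficients=FALSE)
  coefficients(summary(lm(y ~ x, data=sim.data)))["x", "Pr(>|t|)"]
})
hist(p.vals, breaks=20, freq=FALSE, xlab="p-value", main="")
abline(h=1, col="blue")     # density of the Unif(0,1) distribution
mean(p.vals < 0.05)         # should be close to 0.05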
2. β1 ≠ 0, but ŝe[β̂1] is so large that we can't tell anything about β1 with any confidence.
There is a very big difference between data which lets us say “we can be quite confident that the
true β1 is, if not perhaps exactly 0, then very small”, and data which only lets us say “we have
no earthly idea what β1 is, and it may as well be zero for all we can tell.” It is good practice to
always compute a confidence interval, but it is especially important to do so when you retain the
null, so you know whether you can say “this parameter is zero to within such-and-such a (small)
precision”, or whether you have to admit “I couldn’t begin to tell you what this parameter is”.
Substantive vs. statistical significance Even a huge β1, which it would be crazy to ignore in any circumstance, can be statistically insignificant, so long as ŝe[β̂1] is large enough. Conversely, any β1 which isn't exactly zero, no matter how close it might be to 0, will become statistically significant at any threshold once ŝe[β̂1] is small enough. Since, as n → ∞,
\[
\widehat{\mathrm{se}}\left[\hat{\beta}_1\right] \rightarrow \frac{\sigma}{s_X \sqrt{n}} ,
\]
we can show that ŝe[β̂1] → 0, and β̂1/ŝe[β̂1] → ±∞, unless β1 is exactly 0 (see below).
Statistical significance is a weird mixture of how big the coefficient is, how big a sample we’ve got,
how much noise there is around the regression line, and how spread out the data is along the x axis.
This has so little to do with “significance” in ordinary language that it’s pretty unfortunate we’re
stuck with the word; if the Ancestors had decided to say “statistically detectable” or “statistically
distinguishable from 0”, we might have avoided a lot of confusion.
If you confuse substantive and statistical significance in this class, it will go badly for you.
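To see the mixture at work, here is a sketch with made-up numbers: a slope that is tiny compared to the noise becomes "statistically significant" once the sample is enormous, while a huge slope estimated from a handful of very noisy points typically is not.
# Sketch: statistical significance depends on n and noise, not just on beta1
# (all numbers below are made up for illustration)
# Tiny slope (0.01), huge sample, modest noise: the p-value will typically be tiny
x.big <- runif(1e6, min=-1, max=1)
y.big <- 10 + 0.01*x.big + rnorm(length(x.big), sd=1)
coefficients(summary(lm(y.big ~ x.big)))["x.big", c("Estimate", "Pr(>|t|)")]
# Huge slope (50), ten points, lots of noise: typically not "significant"
x.tiny <- runif(10, min=-1, max=1)
y.tiny <- 10 + 50*x.tiny + rnorm(length(x.tiny), sd=200)
coefficients(summary(lm(y.tiny ~ x.tiny)))["x.tiny", c("Estimate", "Pr(>|t|)")]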
Model checking Our statistical models often make very strong claims about the probability
distribution of the data, with little wiggle room. The simple linear regression model, for instance,
claims that the regression function is exactly linear, and that the noise around this line has exactly
constant variance. If we test these claims and find very small p-values, then we have evidence that
there’s a detectable, systematic departure from the model assumptions, and we should re-formulate
the model.
Actual scientific interest Some scientific theories make very precise predictions about coeffi-
cients. According to Newton, the gravitational force between two masses is inversely proportional
to the square of the distance between them, ∝ r−2 . The prediction is exactly ∝ r−2 , not ∝ r−1.99
nor ∝ r−2.05 . Measuring that exponent and finding even tiny departures from 2 would be big news,
if we had reason to think they were real and not just noise. One of the most successful theories
11
in physics, quantum electrodynamics, makes predictions about some properties of hydrogen atoms
with a theoretical precision of one part in a trillion; finding even tiny discrepancies between what
the theory predicts and what we estimate would force us to rethink lots of physics. Experiments to
detect new particles, like the Higgs boson, essentially boil down to hypothesis testing, looking for
deviations from theoretical predictions which should be exactly zero if the particle doesn’t exist.
Outside of the natural sciences, however, it is harder to find examples of interesting, exact null hypotheses which are, so to speak, "live options". The best I can come up with are theories of
economic growth and business cycles which predict that the share of national income going to labor
(as opposed to capital) should be constant over time. Otherwise, in the social sciences, there’s
usually little theoretical reason to think that certain regression coefficients should be exactly zero,
or exactly one, or anything else.
To get confidence intervals for the coefficients of a fitted linear model in R, use the confint function:
confint(object, level=0.95)
Here object is the name of the fitted model object, and level is the confidence level; if you
want 95% confidence, you can omit that argument. For instance:
library(gamair); data(chicago)
death.temp.lm <- lm(death ~ tmpd, data=chicago)
confint(death.temp.lm)
## 2.5 % 97.5 %
## (Intercept) 128.8783687 131.035734
## tmpd -0.3096816 -0.269607
confint(death.temp.lm, level=0.90)
## 5 % 95 %
## (Intercept) 129.0518426 130.8622598
## tmpd -0.3064592 -0.2728294
If you want p-values for the coefficients, those are conveniently computed as part of the summary
function:
coefficients(summary(death.temp.lm))
Notice how this actually gives us an array with four columns: the point estimate, the standard
error, the t statistic, and finally the p-value. Each row corresponds to a different coefficient of the
model. If we want, say, the p-value of the intercept, that’s
coefficients(summary(death.temp.lm))[1,4]
## [1] 0
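The p-value in that table is nothing exotic: it is the two-sided tail probability of the observed t statistic, computed under a t distribution with the residual degrees of freedom, so we can re-derive it by hand.
# Re-computing the Wald-test p-value for the slope coefficient by hand
coef.table <- coefficients(summary(death.temp.lm))
t.obs <- coef.table["tmpd", "t value"]
2 * pt(-abs(t.obs), df=death.temp.lm$df.residual)  # two-sided tail probability
coef.table["tmpd", "Pr(>|t|)"]                     # same thing, from the table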
The summary function will also print out a lot of information about the model:
summary(death.temp.lm)
##
## Call:
## lm(formula = death ~ tmpd, data = chicago)
##
## Residuals:
## Min 1Q Median 3Q Max
## -42.275 -9.018 -0.754 8.187 305.952
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 129.95705 0.55023 236.19 <2e-16 ***
## tmpd -0.28964 0.01022 -28.34 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 14.22 on 5112 degrees of freedom
## Multiple R-squared: 0.1358,Adjusted R-squared: 0.1356
## F-statistic: 803.1 on 1 and 5112 DF, p-value: < 2.2e-16
##
A more compact version of the same output, with the significance stars turned off and fewer digits printed:
## Call:
## lm(formula = death ~ tmpd, data = chicago)
##
## Residuals:
## Min 1Q Median 3Q Max
## -42.27 -9.02 -0.75 8.19 305.95
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 129.9571 0.5502 236.2 <2e-16
## tmpd -0.2896 0.0102 -28.3 <2e-16
##
## Residual standard error: 14.2 on 5112 degrees of freedom
## Multiple R-squared: 0.136,Adjusted R-squared: 0.136
## F-statistic: 803 on 1 and 5112 DF, p-value: <2e-16
5.1 Coverage of the Confidence Intervals: A Demo
Here is a little computational demonstration of how the confidence interval for a parameter is
itself a random interval, and how it covers the true parameter value with the probability we want.
I’ll repeat many simulations of the model from Figure 2, calculate the confidence interval on each
simulation, and plot those. I’ll also keep track of how often, in the first m simulations, the confidence
interval covers the truth; this should converge to 1 − α as m grows.
Exercises
To think through or to practice on, not to hand in.
1. (a) Find a formula for the 1 − α sampling interval for σ̂², in terms of the CDF of the χ²_{n−2} distribution, α, n and σ². (Some of these might not appear in your answer.) Is the width of your sampling interval the same for all σ², the way the width of the sampling interval for β̂1 doesn't change with β1?
(b) Fix α = 0.05, n = 40, and plot the sampling intervals against σ².
(c) Find a formula for the 1 − α confidence interval for σ², in terms of σ̂², the CDF of the χ²_{n−2} distribution, α and n.
2. Suppose we start from a way of testing the hypothesis β = β∗ which can be applied to any β∗, and which has size (false alarm / type I error) probability α for each β∗. Show that the set of β∗ retained by these tests is a confidence set, with confidence level 1 − α. What happens if the size is ≤ α for all β∗ (rather than exactly α)?
3. Suppose we start from a way of creating confidence sets which we know has confidence level
1 − α. We test the hypothesis β = β ∗ by rejecting when β ∗ is outside the confidence set, and
retaining when β ∗ is inside the confidence set. Show that the size of this test is α. What
happens if the initial confidence level is ≥ 1 − α, rather than exactly 1 − α?
4. Prove that the p-value P is uniformly distributed under the null hypothesis. You may,
throughout, assume that the test statistic T has a continuous distribution.
(a) Show that if Q ∼ Unif(0, 1), then P = 1 − Q has the same distribution.
(b) Let X be a continuous random variable with CDF F . Show that F (X) ∼ Unif(0, 1).
Hint: the CDF of the Unif(0, 1) distribution is F_{Unif(0,1)}(x) = x for 0 ≤ x ≤ 1.
(c) Show that P, as defined, is 1 − F_{|T|}(|T_obs|).
(d) Using the previous parts, show that P ∼ Unif(0, 1).
5. Use Eq. ?? to show Eq. ??, following the derivation of Eq. ??.
[Plot: the first 100 simulated confidence intervals for the slope, with confidence limits for the slope (roughly −2.05 to −1.95) on the vertical axis and simulation number (1 to 100) on the horizontal axis, and a grey horizontal line at the true slope of −2; produced by the code below.]
# Run 1000 simulations and get the confidence interval from each
CIs <- replicate(1000, confint(lm(y~x,data=sim.gnslrm(x=x,5,-2,0.1,FALSE)))[2,])
# Plot the first 100 confidence intervals; start with the lower limits
plot(1:100, CIs[1,1:100], ylim=c(min(CIs),max(CIs)),
xlab="Simulation number", ylab="Confidence limits for slope")
# Now the upper limits
points(1:100, CIs[2,1:100])
# Draw line segments connecting them
segments(x0=1:100, x1=1:100, y0=CIs[1,1:100], y1=CIs[2,1:100], lty="dashed")
# Horizontal line at the true coefficient value
abline(h=-2, col="grey")
[Plot: cumulative sample coverage proportion (vertical axis, 0 to 1) against the number of simulations (horizontal axis), with a grey horizontal line at the nominal coverage of 0.95; produced by the code below.]
# For each simulation, check whether the interval covered the truth
covered <- (CIs[1,] <= -2) & (CIs[2,] >= -2)
# Calculate the cumulative proportion of simulations where the interval
# contained the truth, plot vs. number of simulations.
plot(1:length(covered), cumsum(covered)/(1:length(covered)),
xlab="Number of simulations",
ylab="Sample coverage proportion", ylim=c(0,1))
abline(h=0.95, col="grey")