
Lecture 8: Inference

36-401, Fall 2015, Section B


Having gone over the Gaussian-noise simple linear regression model, over ways of estimating
its parameters and some of the properties of the model, and over how to check the model’s as-
sumptions, we are now ready to begin doing some serious statistical inference within the model. In
previous lectures, we came up with point estimators of the parameters and the conditional mean
(prediction) function, but we weren’t able to say much about the margin of uncertainty around
these estimates. In this lecture we will focus on supplementing point estimates with reliable mea-
sures of uncertainty. This will naturally lead us to testing hypotheses about the true parameters
— again, we will want hypothesis tests which are unlikely to get the answer wrong, whatever the
truth might be.
To accomplish all this, we first need to understand the sampling distribution of our point
estimators. We can find them, mathematically, but they involve the unknown true parameters
in inconvenient ways. We will therefore work to find combinations of our estimators and the true
parameters with fixed, parameter-free distributions; we’ll get our confidence sets and our hypothesis
tests from them.
Throughout this lecture, I am assuming, unless otherwise noted, that all of the assumptions of
the Gaussian-noise simple linear regression model hold.

1 Sampling Distribution of $\hat{\beta}_0$, $\hat{\beta}_1$ and $\hat{\sigma}^2$
The Gaussian-noise simple linear regression model has three parameters: the intercept β0 , the slope
β1 , and the noise variance σ 2 . We’ve seen, previously, how to estimate all of these by maximum
likelihood; the MLE for the βs is the same as their least-squares estimates. These are
$$\hat{\beta}_1 = \frac{c_{XY}}{s_X^2} = \sum_{i=1}^n \frac{X_i - \bar{x}}{n s_X^2} Y_i \tag{1}$$

$$\hat{\beta}_0 = \bar{y} - \hat{\beta}_1 \bar{x} \tag{2}$$

$$\hat{\sigma}^2 = \frac{1}{n} \sum_{i=1}^n \left(Y_i - (\hat{\beta}_0 + \hat{\beta}_1 X_i)\right)^2 \tag{3}$$

We have also seen how to re-write the first two of these as a deterministic part plus a weighted sum of the noise terms:

$$\hat{\beta}_1 = \beta_1 + \sum_{i=1}^n \frac{X_i - \bar{x}}{n s_X^2} \epsilon_i \tag{4}$$

$$\hat{\beta}_0 = \beta_0 + \frac{1}{n} \sum_{i=1}^n \left(1 - \bar{x}\, \frac{X_i - \bar{x}}{s_X^2}\right) \epsilon_i \tag{5}$$

Finally, we have our modeling assumption that the $\epsilon_i$ are independent Gaussians, $\epsilon_i \sim N(0, \sigma^2)$.

1.1 Reminders of Basic Properties of Gaussian Distributions
Suppose $U \sim N(\mu, \sigma^2)$. By the basic algebra of expectations and variances, $E[a + bU] = a + b\mu$, while $\mathrm{Var}[a + bU] = b^2 \sigma^2$. This would be true of any random variable; the special property of Gaussians is that $a + bU$ is itself Gaussian, $a + bU \sim N(a + b\mu, b^2 \sigma^2)$.
Suppose $U_1, U_2, \ldots, U_n$ are independent Gaussians, with means $\mu_i$ and variances $\sigma_i^2$. Then
$$\sum_{i=1}^n U_i \sim N\left(\sum_i \mu_i, \sum_i \sigma_i^2\right)$$
That the expected values add up for a sum is true of all random variables; that the variances add up is true for all uncorrelated random variables. That the sum follows the same type of distribution as the summands is a special property of Gaussians.
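
Both properties are easy to check by simulation. Here is a quick sketch (the particular constants are arbitrary choices for illustration, not from the lecture):

# Check the two basic properties of Gaussians by simulation
# (a, b, mu, sigma and the vectors below are arbitrary illustrative values)
a <- 3; b <- -2; mu <- 1; sigma <- 0.5
u <- rnorm(1e5, mean=mu, sd=sigma)
c(mean(a + b*u), a + b*mu)        # empirical vs. theoretical mean
c(var(a + b*u), b^2*sigma^2)      # empirical vs. theoretical variance
# A sum of independent Gaussians with different means and variances
mus <- c(0, 1, 2); sigmas <- c(1, 0.5, 2)
sums <- replicate(1e4, sum(rnorm(3, mean=mus, sd=sigmas)))
c(mean(sums), sum(mus))           # the means add up
c(var(sums), sum(sigmas^2))       # the variances add up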

1.2 Sampling Distribution of $\hat{\beta}_1$

Since we're assuming Gaussian noise, the $\epsilon_i$ are independent Gaussians, $\epsilon_i \sim N(0, \sigma^2)$. Hence (using the first basic property of Gaussians)
$$\frac{X_i - \bar{x}}{n s_X^2} \epsilon_i \sim N\left(0, \sigma^2 \left(\frac{X_i - \bar{x}}{n s_X^2}\right)^2\right)$$
Thus, using the second basic property of Gaussians,
$$\sum_{i=1}^n \frac{X_i - \bar{x}}{n s_X^2} \epsilon_i \sim N\left(0, \sigma^2 \sum_{i=1}^n \left(\frac{X_i - \bar{x}}{n s_X^2}\right)^2\right) \tag{6}$$
$$= N\left(0, \frac{\sigma^2}{n s_X^2}\right) \tag{7}$$
Using the first property of Gaussians again,
$$\hat{\beta}_1 \sim N\left(\beta_1, \frac{\sigma^2}{n s_X^2}\right) \tag{8}$$
This is the distribution of estimates we'd see if we repeated the experiment (survey, observation, etc.) many times, and collected the results. Every particular run of the experiment would give a slightly different $\hat{\beta}_1$, but they'd average out to $\beta_1$, the average squared difference from $\beta_1$ would be $\sigma^2 / n s_X^2$, and a histogram of them would follow the Gaussian probability density function (Figure 2).
It is a bit hard to use Eq. 8, because it involves two of the unknown parameters. We can manipulate it a bit to remove one of the parameters from the probability distribution,
$$\hat{\beta}_1 - \beta_1 \sim N\left(0, \frac{\sigma^2}{n s_X^2}\right)$$
but that still has $\sigma^2$ on the right-hand side, so we can't actually calculate anything. We could write
$$\frac{\hat{\beta}_1 - \beta_1}{\sigma / \sqrt{n s_X^2}} \sim N(0, 1)$$
but now we've got two unknown parameters on the left-hand side, which is also awkward.

# Simulate a Gaussian-noise simple linear regression model
# Inputs: x sequence; intercept; slope; noise variance; switch for whether to
#   return the simulated values, or run a regression and return the coefficients
# Output: data frame or coefficient vector
sim.gnslrm <- function(x, intercept, slope, sigma.sq, coefficients=TRUE) {
  n <- length(x)
  y <- intercept + slope*x + rnorm(n, mean=0, sd=sqrt(sigma.sq))
  if (coefficients) {
    return(coefficients(lm(y ~ x)))
  } else {
    return(data.frame(x=x, y=y))
  }
}

# Fix an arbitrary vector of x's
x <- seq(from=-5, to=5, length.out=42)

Figure 1: Code setting up a simulation of a Gaussian-noise simple linear regression model, along a fixed
vector of Xi values.

1.3 Sampling Distribution of $\hat{\beta}_0$

Starting from Eq. 5 rather than Eq. 4, an argument exactly parallel to the one we just went through gives
$$\hat{\beta}_0 \sim N\left(\beta_0, \frac{\sigma^2}{n}\left(1 + \frac{\bar{x}^2}{s_X^2}\right)\right)$$
It follows, again by parallel reasoning, that
$$\frac{\hat{\beta}_0 - \beta_0}{\sqrt{\frac{\sigma^2}{n}\left(1 + \frac{\bar{x}^2}{s_X^2}\right)}} \sim N(0, 1)$$
The right-hand side of this equation is admirably simple and easy for us to calculate, but the left-hand side unfortunately involves two unknown parameters, and that complicates any attempt to use it.
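
Just as with the slope, we can check this distribution by simulation, reusing x and sim.gnslrm from Figure 1 with the same parameter values as Figure 2. A sketch (like the Figure 2 code, it approximates $s_X^2$ by var(x)):

# Histogram of the intercept estimates across 10,000 simulations, vs. theory
many.coefs <- replicate(1e4, sim.gnslrm(x=x, 5, -2, 0.1, coefficients=TRUE))
hist(many.coefs[1,], breaks=50, freq=FALSE, xlab=expression(hat(beta)[0]),
     main="")
se.beta0 <- sqrt((0.1/length(x))*(1 + mean(x)^2/var(x)))
curve(dnorm(x, mean=5, sd=se.beta0), add=TRUE, col="blue")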

1.4 Sampling Distribution of $\hat{\sigma}^2$

It is mildly challenging, but certainly not too hard, to show that
$$E\left[\hat{\sigma}^2\right] = \frac{n-2}{n} \sigma^2$$
We can be much more specific. When $\epsilon_i \sim N(0, \sigma^2)$, it can be shown that
$$\frac{n \hat{\sigma}^2}{\sigma^2} \sim \chi^2_{n-2}$$
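
This, too, is easy to see in simulation. A sketch, reusing x and sim.gnslrm from Figure 1, this time keeping the simulated data and computing $\hat{\sigma}^2$ by hand (remember that the MLE divides by $n$, not $n-2$):

# Check that n * (MLE of sigma^2) / sigma^2 follows a chi-squared, n-2 d.f.
sigma.sq <- 0.1
n <- length(x)
scaled.mse <- replicate(1e4, {
  sim.data <- sim.gnslrm(x=x, 5, -2, sigma.sq, coefficients=FALSE)
  fit <- lm(y ~ x, data=sim.data)
  n*mean(residuals(fit)^2)/sigma.sq   # n * sigma.sq.hat / sigma.sq
})
hist(scaled.mse, breaks=50, freq=FALSE, main="", xlab="scaled MSE")
curve(dchisq(x, df=n-2), add=TRUE, col="blue")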

[Figure 2 appears here: a histogram of the simulated slope estimates $\hat{\beta}_1$ (horizontal axis, roughly $-2.06$ to $-1.94$), with density on the vertical axis; see the caption below.]

# Run the simulation 10,000 times and collect all the coefficients
# What intercept, slope and noise variance does this impose?
many.coefs <- replicate(1e4, sim.gnslrm(x=x, 5, -2, 0.1, coefficients=TRUE))
# Histogram of the slope estimates
hist(many.coefs[2,], breaks=50, freq=FALSE, xlab=expression(hat(beta)[1]),
     main="")
# Theoretical Gaussian sampling distribution
theoretical.se <- sqrt(0.1/(length(x)*var(x)))
curve(dnorm(x, mean=-2, sd=theoretical.se), add=TRUE, col="blue")

Figure 2: Simulating 10,000 runs of a Gaussian-noise simple linear regression model, calculating $\hat{\beta}_1$ each time, and comparing the histogram of estimates to the theoretical Gaussian distribution (Eq. 8, in blue).
1.5 Standard Errors of $\hat{\beta}_0$ and $\hat{\beta}_1$

The standard error of an estimator is its standard deviation. We've just seen that the true standard errors of $\hat{\beta}_1$ and $\hat{\beta}_0$ are, respectively,
$$\mathrm{se}\left[\hat{\beta}_1\right] = \frac{\sigma}{s_X \sqrt{n}} \tag{9}$$
$$\mathrm{se}\left[\hat{\beta}_0\right] = \frac{\sigma}{s_X \sqrt{n}} \sqrt{s_X^2 + \bar{x}^2} \tag{10}$$
Unfortunately, these standard errors involve the unknown parameter $\sigma^2$ (or its square root $\sigma$, equally unknown to us).

We can, however, estimate the standard errors. The maximum-likelihood estimates just substitute $\hat{\sigma}$ for $\sigma$:
$$\widehat{\mathrm{se}}\left[\hat{\beta}_1\right] = \frac{\hat{\sigma}}{s_X \sqrt{n}} \tag{11}$$
$$\widehat{\mathrm{se}}\left[\hat{\beta}_0\right] = \frac{\hat{\sigma}}{s_X \sqrt{n}} \sqrt{s_X^2 + \bar{x}^2} \tag{12}$$
For later theoretical purposes, however, things will work out slightly nicer if we use the de-biased version, $\frac{n}{n-2} \hat{\sigma}^2$:
$$\widehat{\mathrm{se}}\left[\hat{\beta}_1\right] = \frac{\hat{\sigma}}{s_X \sqrt{n-2}} \tag{13}$$
$$\widehat{\mathrm{se}}\left[\hat{\beta}_0\right] = \frac{\hat{\sigma}}{s_X \sqrt{n-2}} \sqrt{s_X^2 + \bar{x}^2} \tag{14}$$
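
To see these formulas at work, here is a sketch computing the estimated standard error of $\hat{\beta}_1$ by hand, via Eq. 13, on one simulated data set (reusing x and sim.gnslrm from Figure 1), and comparing it to the "Std. Error" column that R's summary reports; the two should agree, since both reduce to $\sqrt{RSS/(n-2)}$ divided by $\sqrt{\sum_i (x_i - \bar{x})^2}$:

# Estimated standard error of the slope, by hand (Eq. 13) vs. lm's report
sim.data <- sim.gnslrm(x=x, 5, -2, 0.1, coefficients=FALSE)
fit <- lm(y ~ x, data=sim.data)
n <- nrow(sim.data)
sigma.hat <- sqrt(mean(residuals(fit)^2))              # MLE: divides by n
s.X <- sqrt(mean((sim.data$x - mean(sim.data$x))^2))   # s_X, also with 1/n
c(by.hand = sigma.hat/(s.X*sqrt(n-2)),
  from.lm = coefficients(summary(fit))[2, 2])          # should agree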
These standard errors — approximate or estimated though they be — are one important way of quantifying how much uncertainty there is around our point estimates. However, we can't use them alone to say anything terribly precise about, say, the probability that $\beta_1$ is in the interval $\left[\hat{\beta}_1 - \widehat{\mathrm{se}}[\hat{\beta}_1],\ \hat{\beta}_1 + \widehat{\mathrm{se}}[\hat{\beta}_1]\right]$, which is the sort of thing we'd want in order to give guarantees about the reliability of our estimates.
2 Sampling distribution of $(\hat{\beta} - \beta)/\widehat{\mathrm{se}}[\hat{\beta}]$

It should take only a little work with the properties of the Gaussian distribution to convince yourself that
$$\frac{\hat{\beta}_1 - \beta_1}{\mathrm{se}\left[\hat{\beta}_1\right]} \sim N(0, 1)$$
the standard Gaussian distribution. If the Oracle told us $\sigma^2$, we'd know $\mathrm{se}[\hat{\beta}_1]$, and so we could assert that (for example)
$$P\left(\beta_1 - 1.96\,\mathrm{se}[\hat{\beta}_1] \leq \hat{\beta}_1 \leq \beta_1 + 1.96\,\mathrm{se}[\hat{\beta}_1]\right) \tag{15}$$
$$= P\left(-1.96\,\mathrm{se}[\hat{\beta}_1] \leq \hat{\beta}_1 - \beta_1 \leq 1.96\,\mathrm{se}[\hat{\beta}_1]\right) \tag{16}$$
$$= P\left(-1.96 \leq \frac{\hat{\beta}_1 - \beta_1}{\mathrm{se}[\hat{\beta}_1]} \leq 1.96\right) \tag{17}$$
$$= \Phi(1.96) - \Phi(-1.96) = 0.95 \tag{18}$$
where $\Phi$ is the cumulative distribution function of the $N(0, 1)$ distribution.


Since the oracles have fallen silent, we can't use this approach. What we can do is use the following fact:

Proposition 1 If $Z \sim N(0, 1)$, $S^2 \sim \chi^2_d$, and $Z$ and $S^2$ are independent, then
$$\frac{Z}{\sqrt{S^2 / d}} \sim t_d$$
(I call this a proposition, but it's almost a definition of what we mean by a $t$ distribution with $d$ degrees of freedom. Of course, if we take this as the definition, the proposition that this distribution has a probability density $\propto (1 + x^2/d)^{-(d+1)/2}$ would become yet another proposition to be demonstrated.)

Let's try to manipulate $(\hat{\beta}_1 - \beta_1)/\widehat{\mathrm{se}}[\hat{\beta}_1]$ into this form.

$$\frac{\hat{\beta}_1 - \beta_1}{\widehat{\mathrm{se}}\left[\hat{\beta}_1\right]} = \frac{\hat{\beta}_1 - \beta_1}{\sigma}\, \frac{\sigma}{\widehat{\mathrm{se}}\left[\hat{\beta}_1\right]} = \frac{\frac{\hat{\beta}_1 - \beta_1}{\sigma}}{\frac{\widehat{\mathrm{se}}[\hat{\beta}_1]}{\sigma}} = \frac{N(0, 1/n s_X^2)}{\frac{\hat{\sigma}}{\sigma s_X \sqrt{n-2}}} = \frac{s_X\, N(0, 1/n s_X^2)}{\frac{\hat{\sigma}}{\sigma \sqrt{n-2}}}$$
$$= \frac{N(0, 1/n)}{\frac{\hat{\sigma}}{\sigma \sqrt{n-2}}} = \frac{\sqrt{n}\, N(0, 1/n)}{\frac{\sqrt{n}\, \hat{\sigma}}{\sigma \sqrt{n-2}}} = \frac{N(0, 1)}{\sqrt{\frac{n \hat{\sigma}^2}{\sigma^2}\, \frac{1}{n-2}}} = \frac{N(0, 1)}{\sqrt{\chi^2_{n-2} / (n-2)}} = t_{n-2}$$
where in the last step I've used the proposition I stated (without proof) above.

To sum up:

Proposition 2 Using the $\widehat{\mathrm{se}}[\hat{\beta}_1]$ of Eq. 13,
$$\frac{\hat{\beta}_1 - \beta_1}{\widehat{\mathrm{se}}\left[\hat{\beta}_1\right]} \sim t_{n-2} \tag{19}$$

Notice that we can compute $\widehat{\mathrm{se}}[\hat{\beta}_1]$ without knowing any of the true parameters — it's a pure statistic, just a function of the data. This is a key to actually using the proposition for anything useful.

By exactly parallel reasoning, we may also demonstrate that
$$\frac{\hat{\beta}_0 - \beta_0}{\widehat{\mathrm{se}}\left[\hat{\beta}_0\right]} \sim t_{n-2}$$
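
Proposition 2 can also be seen in simulation: collect $(\hat{\beta}_1 - \beta_1)/\widehat{\mathrm{se}}[\hat{\beta}_1]$ across many runs and compare the histogram to the $t_{n-2}$ density. A sketch, reusing x and sim.gnslrm from Figure 1, and taking $\widehat{\mathrm{se}}$ from summary (which matches Eq. 13):

# t-statistics for the slope across 10,000 simulations, vs. the t density
t.stats <- replicate(1e4, {
  fit <- lm(y ~ x, data=sim.gnslrm(x=x, 5, -2, 0.1, coefficients=FALSE))
  (coefficients(fit)[2] - (-2))/coefficients(summary(fit))[2, 2]
})
hist(t.stats, breaks=50, freq=FALSE, main="", xlab="t statistic")
curve(dt(x, df=length(x)-2), add=TRUE, col="blue")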

3 Confidence Intervals and Tests

Define $k \equiv k(n, \alpha)$ such that
$$\int_{-k(n,\alpha)}^{k(n,\alpha)} f(u)\, du = 1 - \alpha$$
where $f$ is the density of a $t_{n-2}$ distribution. Let $s = \widehat{\mathrm{se}}(\hat{\beta}_1)$. A $1 - \alpha$ confidence interval for $\beta_1$ is
$$C = \left[\hat{\beta}_1 - ks,\ \hat{\beta}_1 + ks\right].$$
To verify this, note that
$$P(\beta_1 \in C) = P(\hat{\beta}_1 - ks \leq \beta_1 \leq \hat{\beta}_1 + ks) = P\left(-k \leq \frac{\hat{\beta}_1 - \beta_1}{s} \leq k\right) = P(-k \leq T \leq k) = 1 - \alpha$$
where $T$ denotes a random variable with a $t_{n-2}$ distribution. So the interval traps $\beta_1$ with probability $1 - \alpha$.

Suppose we want to test
$$H_0: \beta_1 = 0 \quad \text{versus} \quad H_1: \beta_1 \neq 0.$$
We can just reject $H_0$ if 0 is not in $C$. Equivalently, reject $H_0$ if
$$\frac{|\hat{\beta}_1|}{s} > k(n, \alpha).$$
This is called the Wald test.
h i
Width of the confidence interval Notice that the width of the confidence interval is $2 k(n, \alpha)\, \widehat{\mathrm{se}}[\hat{\beta}_1]$. This tells us what controls the width of the confidence interval:
1. As $\alpha$ shrinks, the interval widens. (High confidence comes at the price of big margins of error.)

2. As $n$ grows, the interval shrinks. (Large samples mean precise estimates.)

3. As $\sigma^2$ increases, the interval widens. (The more noise there is around the regression line, the less precisely we can measure the line.)

4. As $s_X^2$ grows, the interval shrinks. (Widely-spread measurements give us a precise estimate of the slope.)
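
In R, $k(n, \alpha)$ is just a quantile of the $t_{n-2}$ distribution, so the interval is easy to compute by hand. A sketch on one simulated data set, reusing x and sim.gnslrm from Figure 1 (qt(1 - alpha/2, n - 2) gives $k$ because $\alpha$ is split evenly between the two tails; the result should match confint):

# A 1-alpha confidence interval for the slope, built by hand
alpha <- 0.05
fit <- lm(y ~ x, data=sim.gnslrm(x=x, 5, -2, 0.1, coefficients=FALSE))
k <- qt(1 - alpha/2, df=length(x)-2)            # k(n, alpha)
beta1.hat <- coefficients(fit)[2]
s <- coefficients(summary(fit))[2, 2]           # estimated se of the slope
c(beta1.hat - k*s, beta1.hat + k*s)             # compare to confint(fit)[2,]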

What about $\beta_0$? By exactly parallel reasoning, a $1 - \alpha$ confidence interval for $\beta_0$ is $\left[\hat{\beta}_0 - k(n, \alpha)\, \widehat{\mathrm{se}}[\hat{\beta}_0],\ \hat{\beta}_0 + k(n, \alpha)\, \widehat{\mathrm{se}}[\hat{\beta}_0]\right]$.

What about $\sigma^2$? See Exercise 1.

What α should we use? It’s become conventional to set α = 0.05. To be honest, this owes more
to the fact that the resulting k tends to 1.96 as n → ∞, and 1.96 ≈ 2, and most psychologists and
economists could multiply by 2, even in 1950, than to any genuine principle of statistics or scientific
method. A 5% error rate corresponds to messing up about one working day in every month, which
you might well find high. On the other hand, there is nothing which stops you from increasing α.
It’s often illuminating to plot a series of confidence sets, at different values of α.

What about power? The coverage of a confidence set is the probability that it includes the
true parameter value. This is not, however, the only virtue we want in a confidence set; if it was, we
could just say “Every possible parameter is in the set”, and have 100% coverage no matter what.
We would also like the wrong values of the parameter to have a high probability of not being in the
set. Just as the coverage is controlled by the size / false-alarm probability / type-I error rate $\alpha$ of the hypothesis test, the probability of excluding the wrong parameters is controlled by the power of the test (one minus the miss probability / type-II error rate). Tests with higher power exclude (correctly) more parameter values, and give smaller confidence sets.

3.1 Confidence Sets and Hypothesis Tests


There is a general relationship between confidence sets and hypothesis tests.

1. Inverting any hypothesis test gives us a confidence set.

2. If we have a way of constructing a 1 − α confidence set, we can use it to test the hypothesis
that β = β ∗ : reject when β ∗ is outside the confidence set, retain the null when β ∗ is inside
the set.

I will leave it as a pair of exercises (2 and 3) to show that inverting a test of size $\alpha$ gives a $1 - \alpha$ confidence set, and that inverting a $1 - \alpha$ confidence set gives a test of size $\alpha$.

3.2 Large-n Asymptotics

As $n \to \infty$, $\hat{\sigma}^2 \to \sigma^2$. It follows (by continuity) that $\widehat{\mathrm{se}}[\hat{\beta}] \to \mathrm{se}[\hat{\beta}]$. Hence,
$$\frac{\hat{\beta} - \beta}{\widehat{\mathrm{se}}[\hat{\beta}]} \to N(0, 1)$$
which considerably simplifies the sampling intervals and confidence sets; as $n$ grows, we can forget about the $t$ distribution and just use the standard Gaussian distribution. Figure 3 plots the convergence of $k(n, \alpha)$ towards the $k(\infty, \alpha)$ we'd get from the Gaussian approximation. As you can see from the figure, by the time $n = 100$ — a quite small data set by modern standards — the difference between the $t$ distribution and the standard Gaussian is pretty trivial.

[Figure 3 appears here: $k(n, \alpha)$ (vertical axis, 0 to 10) plotted against sample size $n$ (log scale, roughly 5 to 5000), with one curve each for $\alpha = 0.01$, $\alpha = 0.05$ and $\alpha = 0.5$; see the caption below.]

curve(qt(0.995, df=x-2), from=3, to=1e4, log="x", ylim=c(0,10),
      xlab="Sample size (n)", ylab=expression(k(n,alpha)), col="blue")
abline(h=qnorm(0.995), lty="dashed", col="blue")
curve(qt(0.975, df=x-2), add=TRUE)
abline(h=qnorm(0.975), lty="dashed")
curve(qt(0.75, df=x-2), add=TRUE, col="orange")
abline(h=qnorm(0.75), lty="dashed", col="orange")
legend("topright", legend=c(expression(alpha==0.01), expression(alpha==0.05),
                            expression(alpha==0.5)),
       col=c("blue","black","orange"), lty="solid")

Figure 3: Convergence of $k(n, \alpha)$ as $n \to \infty$, illustrated for $\alpha = 0.01$, $\alpha = 0.05$ and $\alpha = 0.5$. (Why do I plot the 97.5th percentile when I'm interested in $\alpha = 0.05$?)
4 Statistical Significance: Uses and Abuses
4.1 p-Values
The test statistic for the Wald test,
$$T = \frac{\hat{\beta}_1 - \beta_1^*}{\widehat{\mathrm{se}}\left[\hat{\beta}_1\right]}$$
has the nice, intuitive property that it ought to be close to zero when the null hypothesis $\beta_1 = \beta_1^*$ is true, and take large values (either positive or negative) when the null hypothesis is false. When a test statistic works like this, it makes sense to summarize just how bad the data looks for the null hypothesis in a p-value: when our observed value of the test statistic is $T_{obs}$, the p-value is
$$P = P\left(|T| \geq |T_{obs}|\right)$$

calculating the probability under the null hypothesis. (I write a capital P here as a reminder that
this is a random quantity, though it’s conventional to write the phrase “p-value” with a lower-case
p.) This is the probability, under the null, of getting results which are at least as extreme as what
we saw. It should be easy to convince yourself that rejecting the null in a level-α test is the same
as getting a p-value < α.
It is not too hard (Exercise 4) to show that P has a uniform distribution over [0, 1] under the
null hypothesis.
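
We can watch this uniformity emerge in simulation, by testing the null at the true value of the slope ($\beta_1^* = -2$ in the simulations above, reusing x and sim.gnslrm from Figure 1), so that the null holds. A sketch:

# Distribution of the Wald-test p-value when the null hypothesis is true
p.vals <- replicate(1e4, {
  fit <- lm(y ~ x, data=sim.gnslrm(x=x, 5, -2, 0.1, coefficients=FALSE))
  t.obs <- (coefficients(fit)[2] - (-2))/coefficients(summary(fit))[2, 2]
  2*pt(-abs(t.obs), df=length(x)-2)   # two-sided p-value from t_{n-2}
})
hist(p.vals, breaks=20, freq=FALSE, main="", xlab="p-value")
# the histogram should be roughly flat over [0,1]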

4.2 p-Values and Confidence Sets


When our test lets us calculate a p-value, we can form a 1 − α confidence set by taking all the β’s
where the p-value is ≥ α. Conversely, if we have some way of making confidence sets already, we
can get a p-value for the hypothesis β = β ∗ ; it’s the largest α such that β ∗ is in the 1 − α confidence
set.

4.3 Statistical Significance


If we test the hypothesis that $\beta_1 = \beta_1^*$ and reject it, we say that the difference between $\beta_1$ and $\beta_1^*$ is statistically significant. Since, as I mentioned, many professions have an overwhelming urge to test the hypothesis $\beta_1 = 0$, it's common to hear people say that "$\beta_1$ is statistically significant" when they mean "$\beta_1$'s difference from 0 is statistically significant".
This is harmless enough, as long as we keep firmly in mind that “significant” is used here as a
technical term, with a special meaning, and is not the same as “important”, “relevant”, etc. When
we reject the hypothesis that β1 = 0, what we’re saying is “It’s really implausibly hard to fit this
data with a flat line, as opposed to one with a slope”. This is informative, if we had serious reasons
to think that a flat line was a live option.
It is incredibly common for researchers from other fields, and even some statisticians, to reason
as follows: “I tested whether β1 = 0 or not, and I retained the null; therefore β1 is insignificant,
and I can ignore it.” This is, of course, a complete fallacy.
To see why, it is enough to realize that there are (at least) two reasons why our hypothesis test might retain the null $\beta_1 = 0$:

1. $\beta_1$ is, in fact, zero;

2. $\beta_1 \neq 0$, but $\widehat{\mathrm{se}}[\hat{\beta}_1]$ is so large that we can't tell anything about $\beta_1$ with any confidence.

There is a very big difference between data which lets us say “we can be quite confident that the
true β1 is, if not perhaps exactly 0, then very small”, and data which only lets us say “we have
no earthly idea what β1 is, and it may as well be zero for all we can tell.” It is good practice to
always compute a confidence interval, but it is especially important to do so when you retain the
null, so you know whether you can say “this parameter is zero to within such-and-such a (small)
precision”, or whether you have to admit “I couldn’t begin to tell you what this parameter is”.

Substantive vs. statistical significance Even a huge $\beta_1$, which it would be crazy to ignore in any circumstance, can be statistically insignificant, so long as $\widehat{\mathrm{se}}[\hat{\beta}_1]$ is large enough. Conversely, any $\beta_1$ which isn't exactly zero, no matter how close it might be to 0, will become statistically significant at any threshold once $\widehat{\mathrm{se}}[\hat{\beta}_1]$ is small enough. Since, as $n \to \infty$,
$$\widehat{\mathrm{se}}\left[\hat{\beta}_1\right] \to \frac{\sigma}{s_X \sqrt{n}}$$
we can show that $\widehat{\mathrm{se}}[\hat{\beta}_1] \to 0$, and $\frac{\hat{\beta}_1}{\widehat{\mathrm{se}}[\hat{\beta}_1]} \to \pm\infty$, unless $\beta_1$ is exactly 0 (see below).
Statistical significance is a weird mixture of how big the coefficient is, how big a sample we’ve got,
how much noise there is around the regression line, and how spread out the data is along the x axis.
This has so little to do with “significance” in ordinary language that it’s pretty unfortunate we’re
stuck with the word; if the Ancestors had decided to say “statistically detectable” or “statistically
distinguishable from 0”, we might have avoided a lot of confusion.
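
To make the point concrete, here is a sketch in which the true slope is a substantively trivial 0.001 (the slope, noise level, and sample sizes here are hypothetical choices for illustration); as $n$ grows, the p-value for testing $\beta_1 = 0$ nonetheless collapses towards zero:

# A tiny but non-zero slope becomes statistically significant as n grows
# (slope, noise level, and sample sizes are illustrative choices)
p.for.n <- sapply(c(1e2, 1e3, 1e4, 1e5, 1e6), function(n) {
  x.big <- runif(n, -5, 5)
  y.big <- 5 + 0.001*x.big + rnorm(n, sd=0.1)
  coefficients(summary(lm(y.big ~ x.big)))[2, 4]   # p-value for slope = 0
})
signif(p.for.n, 2)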
If you confuse substantive and statistical significance in this class, it will go badly for you.

4.4 Appropriate Uses of p-Values and Significance Testing


I do not want this section to give the impression that p-values, hypothesis testing, and statistical
significance are unimportant or necessarily misguided. They’re often used badly, but that’s true
of every statistical tool from the sample mean on down the line. There are certainly situations
where we really do want to know whether we have good evidence against some exact statistical
hypothesis, and that’s just the job these tools do. What are some of these situations?

Model checking Our statistical models often make very strong claims about the probability distribution of the data, with little wiggle room. The simple linear regression model, for instance, claims that the regression function is exactly linear, and that the noise around this line has exactly constant variance. If we test these claims and find very small p-values, then we have evidence that there's a detectable, systematic departure from the model assumptions, and we should re-formulate the model.

Actual scientific interest Some scientific theories make very precise predictions about coefficients. According to Newton, the gravitational force between two masses is inversely proportional to the square of the distance between them, $\propto r^{-2}$. The prediction is exactly $\propto r^{-2}$, not $\propto r^{-1.99}$ nor $\propto r^{-2.05}$. Measuring that exponent and finding even tiny departures from 2 would be big news, if we had reason to think they were real and not just noise. One of the most successful theories in physics, quantum electrodynamics, makes predictions about some properties of hydrogen atoms with a theoretical precision of one part in a trillion; finding even tiny discrepancies between what the theory predicts and what we estimate would force us to rethink lots of physics. Experiments to detect new particles, like the Higgs boson, essentially boil down to hypothesis testing, looking for deviations from theoretical predictions which should be exactly zero if the particle doesn't exist.

Outside of the natural sciences, however, it is harder to find examples of interesting, exact null hypotheses which are, so to speak, "live options". The best I can come up with are theories of economic growth and business cycles which predict that the share of national income going to labor (as opposed to capital) should be constant over time. Otherwise, in the social sciences, there's usually little theoretical reason to think that certain regression coefficients should be exactly zero, or exactly one, or anything else.

5 Confidence Sets and p-Values in R


When we estimate a model with lm, R makes it easy for us to extract the confidence intervals of
the coefficients:

confint(object, level=0.95)

Here object is the name of the fitted model object, and level is the confidence level; if you
want 95% confidence, you can omit that argument. For instance:

library(gamair); data(chicago)
death.temp.lm <- lm(death ~ tmpd, data=chicago)
confint(death.temp.lm)

## 2.5 % 97.5 %
## (Intercept) 128.8783687 131.035734
## tmpd -0.3096816 -0.269607

confint(death.temp.lm, level=0.90)

## 5 % 95 %
## (Intercept) 129.0518426 130.8622598
## tmpd -0.3064592 -0.2728294

If you want p-values for the coefficients, those are conveniently computed as part of the summary
function:

coefficients(summary(death.temp.lm))

##               Estimate  Std. Error   t value      Pr(>|t|)
## (Intercept) 129.9570512 0.55022802 236.18763  0.00000e+00
## tmpd         -0.2896443 0.01022089 -28.33845 3.23449e-164

Notice how this actually gives us an array with four columns: the point estimate, the standard
error, the t statistic, and finally the p-value. Each row corresponds to a different coefficient of the
model. If we want, say, the p-value of the intercept, that’s

coefficients(summary(death.temp.lm))[1,4]

## [1] 0

The summary function will also print out a lot of information about the model:

summary(death.temp.lm)

##
## Call:
## lm(formula = death ~ tmpd, data = chicago)
##
## Residuals:
## Min 1Q Median 3Q Max
## -42.275 -9.018 -0.754 8.187 305.952
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 129.95705 0.55023 236.19 <2e-16 ***
## tmpd -0.28964 0.01022 -28.34 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 14.22 on 5112 degrees of freedom
## Multiple R-squared: 0.1358,Adjusted R-squared: 0.1356
## F-statistic: 803.1 on 1 and 5112 DF, p-value: < 2.2e-16

As my use of coefficients(summary(death.temp.lm)) above suggests, the summary function actually returns a complex object, which can be stored for later access, and printed. Controlling how it gets printed is done through the print function:

print(summary(death.temp.lm), signif.stars=FALSE, digits=3)

##
## Call:
## lm(formula = death ~ tmpd, data = chicago)
##
## Residuals:
## Min 1Q Median 3Q Max
## -42.27 -9.02 -0.75 8.19 305.95
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 129.9571 0.5502 236.2 <2e-16
## tmpd -0.2896 0.0102 -28.3 <2e-16
##
## Residual standard error: 14.2 on 5112 degrees of freedom
## Multiple R-squared: 0.136,Adjusted R-squared: 0.136
## F-statistic: 803 on 1 and 5112 DF, p-value: <2e-16

5.1 Coverage of the Confidence Intervals: A Demo

Here is a little computational demonstration of how the confidence interval for a parameter is a random interval, and how it covers the true parameter value with the probability we want. I'll repeat many simulations of the model from Figure 2, calculate the confidence interval on each simulation, and plot those. I'll also keep track of how often, in the first $m$ simulations, the confidence interval covers the truth; this should converge to $1 - \alpha$ as $m$ grows.

Exercises
To think through or to practice on, not to hand in.

1. Confidence interval for $\sigma^2$: Start with the observation that $n\hat{\sigma}^2/\sigma^2 \sim \chi^2_{n-2}$.

   (a) Find a formula for the $1 - \alpha$ sampling interval for $\hat{\sigma}^2$, in terms of the CDF of the $\chi^2_{n-2}$ distribution, $\alpha$, $n$ and $\sigma^2$. (Some of these might not appear in your answer.) Is the width of your sampling interval the same for all $\sigma^2$, the way the width of the sampling interval for $\hat{\beta}_1$ doesn't change with $\beta_1$?

   (b) Fix $\alpha = 0.05$, $n = 40$, and plot the sampling intervals against $\sigma^2$.

   (c) Find a formula for the $1 - \alpha$ confidence interval for $\sigma^2$, in terms of $\hat{\sigma}^2$, the CDF of the $\chi^2_{n-2}$ distribution, $\alpha$ and $n$.

2. Suppose we start from a way of testing the hypothesis $\beta = \beta^*$ which can be applied to any $\beta^*$, and which has size (false alarm / type I error) probability $\alpha$ for each $\beta^*$. Show that the set of $\beta$ retained by these tests is a confidence set, with confidence level $1 - \alpha$. What happens if the size is $\leq \alpha$ for all $\beta^*$ (rather than exactly $\alpha$)?

3. Suppose we start from a way of creating confidence sets which we know has confidence level $1 - \alpha$. We test the hypothesis $\beta = \beta^*$ by rejecting when $\beta^*$ is outside the confidence set, and retaining when $\beta^*$ is inside the confidence set. Show that the size of this test is $\alpha$. What happens if the initial confidence level is $\geq 1 - \alpha$, rather than exactly $1 - \alpha$?

4. Prove that the p-value P is uniformly distributed under the null hypothesis. You may,
throughout, assume that the test statistic T has a continuous distribution.

   (a) Show that if $Q \sim \mathrm{Unif}(0,1)$, then $P = 1 - Q$ has the same distribution.

   (b) Let $X$ be a continuous random variable with CDF $F$. Show that $F(X) \sim \mathrm{Unif}(0,1)$. Hint: the CDF of the uniform distribution is $F_{\mathrm{Unif}(0,1)}(x) = x$.

   (c) Show that $P$, as defined, is $1 - F_{|T|}(|T_{obs}|)$.

   (d) Using the previous parts, show that $P \sim \mathrm{Unif}(0,1)$.

5. Use Eq. ?? to show Eq. ??, following the derivation of Eq. ??.

[Figure appears here: the first 100 simulated confidence intervals for the slope, plotted as dashed vertical segments against simulation number, with confidence limits for the slope (roughly $-2.05$ to $-1.95$) on the vertical axis and a grey horizontal line at the true value $-2$.]

# Run 1000 simulations and get the confidence interval from each
CIs <- replicate(1000, confint(lm(y~x, data=sim.gnslrm(x=x,5,-2,0.1,FALSE)))[2,])
# Plot the first 100 confidence intervals; start with the lower limits
plot(1:100, CIs[1,1:100], ylim=c(min(CIs),max(CIs)),
     xlab="Simulation number", ylab="Confidence limits for slope")
# Now the upper limits
points(1:100, CIs[2,1:100])
# Draw line segments connecting them
segments(x0=1:100, x1=1:100, y0=CIs[1,1:100], y1=CIs[2,1:100], lty="dashed")
# Horizontal line at the true coefficient value
abline(h=-2, col="grey")

[Figure appears here: the cumulative sample coverage proportion (vertical axis, 0 to 1) plotted against the number of simulations (0 to 1000), with a grey horizontal line at 0.95.]

# For each simulation, check whether the interval covered the truth
covered <- (CIs[1,] <= -2) & (CIs[2,] >= -2)
# Calculate the cumulative proportion of simulations where the interval
# contained the truth, plot vs. number of simulations.
plot(1:length(covered), cumsum(covered)/(1:length(covered)),
     xlab="Number of simulations",
     ylab="Sample coverage proportion", ylim=c(0,1))
abline(h=0.95, col="grey")

