
Stat 111 Final Review, Spring 2023

Joe Blitzstein

1 General Information
The final will be in Science Center Hall B on Monday, May 8, from 2:00 pm to 5:00 pm.
Some additional information and instructions:

• There will be six problems, weighted equally.

• No collaboration, copying, calculators, computers, or cell phones are allowed.

• No books or notes are allowed, except that you may bring four standard-sized
pages of notes (double-sided, so a total of eight sides), with whatever you want
handwritten or typed on them.

• Feel free to cite and apply any result from the Stat 110 or Stat 111 lectures,
homeworks, or books without rederiving it. Of course you don’t have to remember
that, e.g., what you’re using is labeled as Theorem 2.3.6 in the book – just be sure
to describe clearly what result you are applying.

• Show your work and justify your answers. Simplify your answers fully, unless
otherwise specified.

The final will be cumulative. Any material covered in class, Chapters 1–11 (except
for starred sections/subsections and R sections), or the homeworks is fair game, unless
otherwise specified. You will not need to read/write R code on the exam, though there
could be a question asking you to describe in words how you would run a simulation in
order to, e.g., approximate the MSE of an estimator.
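For instance, a simulation to approximate the MSE of an estimator could look like the following minimal R sketch (the model, true parameter value, sample size, and number of replications are arbitrary illustrative choices; on the exam you would only need to describe these steps in words). It compares the sample mean and the sample median as estimators of θ in an assumed N(θ, 1) model.

set.seed(111)
theta <- 2        # true parameter value (arbitrary choice for illustration)
n <- 50           # sample size (arbitrary)
nsim <- 10^4      # number of simulated datasets

# Simulate many datasets and compute each estimator on each dataset.
est_mean <- replicate(nsim, mean(rnorm(n, mean = theta, sd = 1)))
est_median <- replicate(nsim, median(rnorm(n, mean = theta, sd = 1)))

# Approximate each MSE by averaging the squared estimation errors.
mean((est_mean - theta)^2)    # should be close to 1/n
mean((est_median - theta)^2)  # larger: the sample median is less efficient here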
Here is a detailed list of the main topics. The sampling of topics is done with replacement,
e.g., permutation tests are listed both under hypothesis testing and under resampling.

• Statistical goals: describe and explore data and a phenomenon, predict y’s from
x’s, causal inference.

• Trident: Theory, simulation, data. Strengths and limitations of each.

• Models: parametric vs. nonparametric model, estimand vs. estimator vs. estimate,
statistics, sufficient statistics, natural exponential families (NEFs).

• Likelihood: likelihood function, reparameterization, transformations of the data,
log-likelihood, maximum likelihood estimator (MLE), score function, Fisher infor-
mation, information equality.

• Asymptotics: central limit theorem, law of large numbers, continuous mapping
theorem, Slutsky’s theorem, delta method. Asymptotic distributions of sample
quantiles, method of moments estimator (MoM), and MLE.
• Point estimation: MoM, MLE, posterior mean/median/mode, invariance of MLE,
consistency, empirical CDF, sample mean, sample variance, sample covariance,
sample quantile. Bias, standard error, bias-variance tradeoff, loss functions, mean
square error (MSE), Cramér-Rao lower bound (CRLB), Rao-Blackwell theorem.
• Interval estimation: interval estimator, interval estimate, coverage probability,
confidence interval, credible interval, pivot, asymptotic approaches, bootstrap con-
fidence intervals.
• Regression: linear regression, predictive regression, descriptive regression, least
squares, homoskedasticity vs. heteroskedasticity, residuals, logistic regression.
• Hypothesis testing: simple vs. composite, one-sided vs. two-sided, Type I error,
Type II error, power function, test statistic, critical value, p-value, p-hacking,
t-test, Wald test, score test, likelihood ratio test, permutation test.
• Bayesian inference: Bayes’ rule, prior and posterior, conjugate prior (especially
Beta-Binomial, Gamma-Poisson, Normal-Normal), posterior mean/median/mode,
Bayesian model choice, posterior predictive distribution, loss function, risk func-
tion, admissibility, shrinkage factor, Stein’s paradox.
• Sampling: design-based vs. population model-based inference, simple random sam-
pling (SRS) with or without replacement, finite population correction, stratified
sampling, Horvitz–Thompson estimator.
• Resampling: bootstrap, parametric bootstrap, bootstrap estimate of bias, bootstrap
estimate of standard error, bootstrap confidence intervals, permutation test (a short
R sketch of a permutation test appears after this list).
• Causal inference: potential outcomes framework, switching equation, treatment
effects, finite sample approach vs. population model-based approach, assignment
mechanism, randomized control trials (RCTs), Fisher’s null hypothesis vs. Ney-
man’s null hypothesis, randomization test, MoM estimator of finite sample estimand,
observational studies, unconfoundedness.
• Mathematical tools: Taylor approximation, differentiation under the integral sign
(DUThIS), sum of squares identity.
• Important examples: MLE and MoM of mean and variance in a Normal model,
sufficient statistic and MLE in an NEF, censored data, German tank problem,
sample mean vs. sample median (efficiency vs. robustness), variance-stabilizing
transformation of a Poisson, pivot based on a t distribution, Gamma–Poisson
story for buses in Blotchville, kidney cancer example, Basu’s elephant, smoking
and birth permutation test example, James-Stein estimator for batting averages.
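As referenced in the Resampling bullet above, here is a minimal R sketch of a two-group permutation test, using the difference in sample means as the test statistic. The data and the number of permutations are made up purely for illustration; this is an editorial study aid, not part of the handout's required material.

set.seed(111)
x <- c(5.1, 6.3, 4.8, 7.0, 5.9)          # made-up data, group 1
y <- c(4.2, 5.0, 3.9, 4.7, 5.4, 4.1)     # made-up data, group 2
obs <- mean(x) - mean(y)                 # observed test statistic

pooled <- c(x, y)
perm_stats <- replicate(10^4, {
  idx <- sample(length(pooled), length(x))   # randomly relabel which units are "group 1"
  mean(pooled[idx]) - mean(pooled[-idx])
})

# Two-sided permutation p-value: how often is a relabeled statistic as extreme as observed?
mean(abs(perm_stats) >= abs(obs))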

2 Table of Distributions
The following table will be provided as the last page of the exam. Feel free to detach it.

Name (parameters): PMF or PDF; Mean; Variance. Throughout, q = 1 − p.

Bernoulli(p): P(X = 1) = p, P(X = 0) = q; mean p; variance pq.

Binomial(n, p): P(X = k) = C(n, k) p^k q^(n−k), for k ∈ {0, 1, . . . , n}; mean np; variance npq.

FS(p): P(X = k) = p q^(k−1), for k ∈ {1, 2, . . . }; mean 1/p; variance q/p².

Geom(p): P(X = k) = p q^k, for k ∈ {0, 1, 2, . . . }; mean q/p; variance q/p².

NBin(r, p): P(X = n) = C(r + n − 1, r − 1) p^r q^n, for n ∈ {0, 1, 2, . . . }; mean rq/p; variance rq/p².

HGeom(w, b, n): P(X = k) = C(w, k) C(b, n − k)/C(w + b, n), for k ∈ {0, 1, . . . , n}; mean µ = nw/(w + b); variance ((w + b − n)/(w + b − 1)) µ (1 − µ/n).

Poisson(λ): P(X = k) = e^(−λ) λ^k / k!, for k ∈ {0, 1, 2, . . . }; mean λ; variance λ.

Uniform(a, b), with a < b: f(x) = 1/(b − a), for x ∈ (a, b); mean (a + b)/2; variance (b − a)²/12.

Normal(µ, σ²): f(x) = (1/(σ√(2π))) e^(−(x−µ)²/(2σ²)); mean µ; variance σ².

Log-Normal(µ, σ²): f(x) = (1/(xσ√(2π))) e^(−(log x − µ)²/(2σ²)), for x > 0; mean θ = e^(µ+σ²/2); variance θ²(e^(σ²) − 1).

Expo(λ): f(x) = λe^(−λx), for x > 0; mean 1/λ; variance 1/λ².

Gamma(a, λ): f(x) = Γ(a)^(−1) (λx)^a e^(−λx) x^(−1), for x > 0; mean a/λ; variance a/λ².

Beta(a, b): f(x) = (Γ(a + b)/(Γ(a)Γ(b))) x^(a−1) (1 − x)^(b−1), for 0 < x < 1; mean µ = a/(a + b); variance µ(1 − µ)/(a + b + 1).

Chi-Square(n): f(x) = (1/(2^(n/2) Γ(n/2))) x^(n/2−1) e^(−x/2), for x > 0; mean n; variance 2n.

Student-t(n): f(x) = (Γ((n + 1)/2)/(√(nπ) Γ(n/2))) (1 + x²/n)^(−(n+1)/2); mean 0 if n > 1; variance n/(n − 2) if n > 2.

3 Stat 111 final from 2019
1. Jane, a student at Harvard, has developed a cool new prediction method that she
calls DeepLearner. Her method uses characteristics of the individual as well as sensor
data from the individual’s phone over 7 days to predict whether the individual’s mood
on the 8th day will be low or high. She uses a large dataset to estimate all of the
parameters in her prediction method. Jane wants to estimate the misclassification rate,
that is, the fraction of people on which DeepLearner will predict the wrong mood.

(a) Jane obtains a new test dataset of n = 100 individuals. Let X be the number of
times out of n that DeepLearner incorrectly labels the 8th day mood out of these 100
individuals. It turns out that X = 5 for this dataset. Jane is thrilled. But Jane’s
mentor is quite skeptical. The usual misclassification rate for mood predictions is at
least 10%. Jane’s mentor asks Jane the following question: If DeepLearner has a 10%
misclassification rate, then what is the chance that the misclassification rate will be at
least as low on a second set of 100 individuals as it was on the first set of individuals?
Explain how Jane’s mentor’s question can be interpreted in terms of a hypothesis
testing problem. Provide the details of this hypothesis testing problem: null and alter-
native hypothesis, test statistic, and an expression for the p-value (you don’t have to
calculate the p-value numerically).

(b) Jack, a student at Stanford, has developed an alternate prediction method, which
Jane calls ShallowLearner. Jane hopes to show that DeepLearner is better than Shal-
lowLearner, in terms of lower misclassification rates. So she obtains a second dataset of
100 individuals and calculates Y , the number of times out of 100 that ShallowLearner
incorrectly labels the 8th day mood. What are Jane’s null and alternative hypothe-
ses? Provide a test statistic and, using large sample approximations (e.g., assume that
n = 100 is sufficiently large so that asymptotic approximations will be pretty accurate),
give the formula for the critical value(s) in order to get Type I error rate α.

(c) Jane realizes that her approach to compare DeepLearner and ShallowLearner is
wasteful, as she actually has 200 individuals on which both DeepLearner and Shal-
lowLearner can be used to predict mood. So Jane decides to run both prediction meth-
ods on all 200 people. Let Xj , j = 1, . . . , 200 indicate whether mood was misclassified
by DeepLearner on each of the 200 people. Define the Yj ’s similarly. Note that Xj and
Yj may be correlated, both unconditionally and conditional on person j’s mood on the
8th day. Revise Jane’s null and alternative hypothesis, if appropriate. Provide a test
statistic, and give the formula for the critical value.
Hint: Consider the differences Zj = Xj − Yj .

2. We observe i.i.d. Y1 , . . . , Yn ∼ Bern(p), and want a 95% confidence interval for p. Let
Y = Y1 + · · · + Yn and p̂ = Y /n.

(a) Show that

(p̂ − p) / √(p̂(1 − p̂)/n) →d N(0, 1)

as n → ∞.

(b) Use the bootstrap to construct a 95% confidence interval for p with the percentile
method.

(c) Would your answer to (b) change if you used the parametric bootstrap instead? If
so, construct a 95% interval using the parametric bootstrap; if not, explain why not.
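(As a study aid, not part of the original exam: here is a minimal R sketch of the percentile-method procedure referred to in (b). The data below are simulated rather than real, and the true p, sample size, and number of bootstrap samples are arbitrary illustrative choices.)

set.seed(111)
y <- rbinom(100, size = 1, prob = 0.3)   # made-up Bernoulli data standing in for Y1,...,Yn
n <- length(y)
B <- 10^4

# Resample the data with replacement and recompute p-hat on each bootstrap sample.
boot_phat <- replicate(B, mean(sample(y, n, replace = TRUE)))

# Percentile method: take the 2.5% and 97.5% quantiles of the bootstrap estimates.
quantile(boot_phat, c(0.025, 0.975))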

(d) The result of (a) implies that, asymptotically, the interval p̂ ± 1.96 √(p̂(1 − p̂)/n) is
a 95% confidence interval for p. However, it can be shown via simulation that the
coverage probability of this interval can be poor if p is close to 0 or 1. For example,
for p = 0.005 and n = 592, the coverage probability turns out to be 79%, not 95%. A
confidence interval with better performance in terms of coverage probabilities, known
as the Agresti-Coull interval, uses the estimator

p̃ = (Y + 2)/(n + 4)

in place of p̂ and ñ = n + 4 in place of n, resulting in the interval

p̃ ± 1.96 √(p̃(1 − p̃)/ñ).

For what choice of (a, b) is p̃ the posterior mean of p, if we take a Bayesian approach
with p ∼ Beta(a, b) prior? For this choice of (a, b), find the posterior distribution of p.
Assuming that the posterior is approximately Normal, construct a 95% credible interval
for p, centered at p̃, and compare it with the Agresti-Coull interval.

3. A perennial challenge that occurs in the evaluation of new treatments is the test-
retest conundrum. Suppose that a group of n children are given a math test at age 9
and another math test at age 10. Let Xj be the score on the math test at age 9 and Yj
be the score on the math test at age 10, for child j. Let the correlation between Xj and
Yj be ρ, with 0 < ρ < 1. Suppose that (Yj , Xj ), j = 1, . . . , n are i.i.d. Let
Sxx = Σ_{j=1}^n (Xj − X̄)², Syy = Σ_{j=1}^n (Yj − Ȳ)², Sxy = Σ_{j=1}^n (Xj − X̄)(Yj − Ȳ),

and let

r = Sxy / √(Sxx Syy)

be the sample correlation. Feel free to use the following fact: the MLE (β̂0, β̂1) of (β0, β1) in the
predictive regression E[Y |X] = β0 + β1 X is

β̂1 = Sxy / Sxx, β̂0 = Ȳ − β̂1 X̄.

(a) Find the estimated slope β̂1 for a linear regression of age 10 test result on age 9 test
result, in terms of r, Sxx , and Syy .

(b) Define Ŷj = β̂0 + β̂1 Xj . Show that

(Ŷj − Ȳ) / √(Syy /n) = r (Xj − X̄) / √(Sxx /n).

(c) Suppose you have developed a new computerized math intervention. To test your
intervention, you consider conducting a test-retest study. That is, you test the children’s
math ability at age 9, and provide the computerized math intervention to all of the
children. Then you do a retest of their math ability at age 10, to see if they show
improvement. Suppose that the data are (Yj , Xj ) satisfying the same assumptions as
above, e.g., the sample correlation is r. What is the predicted score at age 10 for a child
who scored two standard deviations below the mean on the test at age 9?

(d) Explain intuitively why the answer to (c) is not that they will score two standard
deviations below the mean at age 10.

(e) Use potential outcomes notation to explain to a policy maker whether using a test-
retest study will allow them to assess the causal effect of the new computerized math
intervention.

4. The Inverse Gaussian distribution with parameter µ > 0 has PDF

f(y) = (1/√(2πy³)) exp(−(y − µ)²/(2µ²y)), for y > 0.

This distribution has been applied in physics, meteorology, industrial quality control,
and various other fields. Let Y1 , . . . , Yn be i.i.d. r.v.s with this distribution.
(a) Show that the distribution of Yj is a natural exponential family.

(b) Find E[Yj ].

(c) Find Var(Yj ).

(d) Find the MLE of µ.

(e) Find the MLE of µ2 .

5. A traffic safety engineer is studying the car accident rate in his state. He collects data
on the number of car accidents in a particular one year period in n different counties.
Let Yj be the number of car accidents in county j, and cj be the number of people who
live in county j. Suppose that Yj ∼ Pois(cj θ), independently, where the cj are known
and θ, the underlying car accident rate per person, is the estimand.

(a) Find the MLE θ̂ of θ.

(b) Find the standard error of θ̂.

(c) Find a one-dimensional sufficient statistic S. Is it possible to use Rao-Blackwell
(with respect to S) to obtain an estimator for θ whose MSE is strictly smaller than that
of θ̂? If so, compute the Rao-Blackwellized estimator; if not, explain why not.

(d) For the remainder of the problem, take a Bayesian perspective. Now the model is

Yj |θ ∼ Pois(cj θ), with prior θ ∼ Gamma(r0 , b0 ),

where r0 and b0 are known positive numbers, with r0 an integer. Find the posterior
mean of θ given the data Y1 , . . . , Yn .

(e) A new county, with cn+1 residents, is added to the study. Find an approximate
95% predictive interval for the number of car accidents in the new county in a one year
period, given the previous data Y1 , . . . , Yn . That is, the posterior probability that the
new observation will be in your interval, given Y1 , . . . , Yn , should be 95%. Your answer
can be in terms of the quantile function of a named distribution.
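(As a study aid, not part of the original exam: once you have found the Gamma posterior asked for in (d), a predictive interval like the one in (e) can also be approximated by simulation, as in the R sketch below. The values of r_post, b_post, and c_new are made-up stand-ins; finding the actual updated parameters is part of the problem.)

set.seed(111)
r_post <- 120    # stand-in value for the updated Gamma shape parameter from part (d)
b_post <- 2e5    # stand-in value for the updated Gamma rate parameter from part (d)
c_new <- 5000    # made-up number of residents in the new county

# Simulate theta from its posterior, then the new county's count given theta.
theta_draws <- rgamma(10^5, shape = r_post, rate = b_post)
y_new <- rpois(10^5, lambda = c_new * theta_draws)

# A central interval containing 95% of the simulated predictive distribution.
quantile(y_new, c(0.025, 0.975))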

6. You are given a limited edition, collector’s item book, of which only N copies exist;
you have little information about how large N is. To assess how rare the book is, you
wish to estimate N . The jth book printed was inscribed with the number j (so the
books are numbered 1, 2, . . . , N ). Your book turns out to be Number 111. Assume that
your copy is equally likely to be any of the N copies of the book.
(a) Find the maximum likelihood estimate for N .

(b) Find a method of moments estimate for N .

(c) Using the prior π(N ) ∝ 1/N² (for N = 1, 2, . . . ), find the posterior mean of N , in
terms of one or more of the constants cj defined by cj = Σ_{N=111}^∞ 1/N^j.

4 Stat 111 final from 2020
1. The Pareto distribution is used to model extremes in insurance, e.g., modeling very
large insurance claims above a known high threshold. The density has the form
fY1 (y|α) = α k^α / y^(α+1), for y > k,

where the threshold k is known (with k > 0) and the parameter α is unknown (with
α > 0).

(a) Show that the distribution of Y1 follows a natural exponential family.

(b) Find the MLE of α based on observing n i.i.d. Pareto r.v.s Y1 , . . . , Yn .

(c) Find the Fisher information in the sample for α, and use this to obtain the approx-
imate variance of the MLE of α.

(d) One of the main uses of models of extremes is to offer a glimpse at possible extremes
which can happen beyond what we have seen so far in the data. This might be helpful,
for example, in modeling insurance claims which are larger than any we have previously
seen. Find the MLE of the upper tail quantile (a 1 in 1,000 event) θ = QY1 (0.999).

(e) What is the approximate variance of the MLE of θ?

2. A forensic scientist is studying crime patterns in a certain city, and wishes to compare
crime rates in two neighborhoods. In a certain month, Y1 ∼ Pois(λ1 ) crimes are observed
on the first block, and Y2 ∼ Pois(λ2 ) crimes are observed on the second block. Assume
that Y1 and Y2 are independent.
The forensic scientist is interested in comparing λ1 and λ2 , and decides to focus on
the estimand θ = λ1 /λ2 . The reparameterization p = θ/(1 + θ) is also sometimes useful.
The scientist decides to condition on the total number of crimes, Y1 + Y2 , reasoning that
the total doesn’t carry much information about the proportion of crimes on the first
block. So throughout this problem, condition on Y1 + Y2 = t.

(a) Find the (conditional) likelihood function for p.

(b) Find the (conditional) likelihood function for θ.

(c) Derive the test statistic for a Wald test for H0 : θ = 1 vs. H1 : θ ̸= 1.

(d) Derive the test statistic for a score test for H0 : θ = 1 vs. H1 : θ ̸= 1.

(e) Derive the test statistic for a likelihood ratio test for H0 : θ = 1 vs. H1 : θ ̸= 1.

                             Placebo (w = 0)    Antibiotic (w = 1)
Did not feel better (y = 0)        16                  19
Felt better (y = 1)                65                  66

Table 1: Treatment of acute sinusitis through amoxicillin compared to placebo.

3. In 2012, J. M. Garbutt, et al. published in the Journal of the American Medical As-
sociation the results from a randomized control trial on the effect of taking antibiotics
on treating acute sinusitis. The treatment was a 10 day course of amoxicillin (an antibi-
otic). The control was a 10 day course of a placebo, which tasted and looked identical
to a tablet of amoxicillin. At the end of the 10 day course, patients self-reported if their
symptoms had improved. The results of the experiment are given in Table 1.
Your numerical answers in this problem can be somewhat unsimplified, e.g., you
don’t have to do the arithmetic of, say, subtracting fractions or getting decimal approx-
imations.
(a) What does non-interference mean in terms of this acute sinusitis study? Is it rea-
sonable to assume non-interference in this study? Explain.

(b) Estimate the population-based average treatment effect E[τ1 ] based on this experi-
ment. (Find a mathematical expression for the MLE and then plug in the numbers.)

(c) What is the estimated standard deviation of your estimator of E[τ1 ]?

(d) Find an approximate 95% CI for E[τ1 ].

4. Each of n basketball players shoots one free throw. The outcome Yj is observed,
where Yj = 1 if player j makes their shot, and Yj = 0 if they miss. Let pj be the true
free throw shooting percentage of player j, and write Y = (Y1 , . . . , Yn ), p = (p1 , . . . , pn ).
Consider the model

Yj |p, µ ∼ Bern(pj )
pj |µ ∼ Beta(µr0 , (1 − µ)r0 )
µ ∼ Beta(a0 , b0 ),

where p and µ are unknown, and a0 , b0 , r0 are known, positive constants. In this model,
the Yj are conditionally independent given p, and the pj are conditionally i.i.d. given µ.
(a) Are Y1 , . . . , Yn conditionally independent given µ? Are Y1 , . . . , Yn unconditionally
independent? You can explain your answers either mathematically or with a clear
intuitive explanation in words.

(b) Find the conditional distribution of Yj |µ.
Hint: First find E(Yj |µ) and consider the support of Yj .

(c) Find the conditional distribution of µ|Y.

5. Jerry loves predicting things: stock returns, basketball scores, and whether his friends
will like his book recommendations. His approach is to focus on the conditional mean,
given some predictors:
µ(x) = E[Y |X = x].
To approximate the conditional mean he uses a model µ(x|θ), where the functional form
is known; it is just θ that is unknown. We will think of θ as a scalar parameter, to focus
on statistical ideas.
Let the maximum likelihood estimator for θ be θ̂. (You do not need to try to find
an explicit formula for θ̂ in this problem. Answers can be left in terms of θ̂ and/or
∂µ(x|θ)/∂θ.)
Suppose that Jerry observes (X1 , Y1 ), . . . , (Xn , Yn ) and uses the model
Yj |(X1 = x1 , ..., Xn = xn ) ∼ N (µ(xj |θ), σ²) independently, for j = 1, ..., n.

For simplicity, assume σ 2 is known.


(a) Why does it make sense to use the conditional likelihood (conditioning on X = x)
for this problem, i.e., work with
log L(θ) = −(1/(2σ²)) Σ_{j=1}^n {yj − µ(xj |θ)}²

rather than an unconditional likelihood? Assume that this conditional likelihood is used
for the rest of this problem.
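(As a study aid, not part of the original exam: the sketch below shows how this conditional log-likelihood could be maximized numerically in R. The functional form µ(x|θ) = exp(θx), the true θ used to simulate fake data, and the sample size are all made-up illustrative choices; the problem itself leaves µ(x|θ) abstract.)

set.seed(111)
mu <- function(x, theta) exp(theta * x)   # made-up functional form, just for illustration
sigma <- 1                                # sigma^2 treated as known, as in the problem

# Simulate fake data from the model, using a made-up true theta.
n <- 200
x <- runif(n, 0, 2)
theta_true <- 0.7
y <- rnorm(n, mean = mu(x, theta_true), sd = sigma)

# Conditional log-likelihood from part (a), up to an additive constant.
loglik <- function(theta) -sum((y - mu(x, theta))^2) / (2 * sigma^2)

# Numerically maximize over a plausible range to get the (conditional) MLE of theta.
optimize(loglik, interval = c(-5, 5), maximum = TRUE)$maximum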

(b) What is the corresponding score for θ?

(c) What is the Fisher information for θ? What is the approximate variance of θ̂?

(d) What is the asymptotic distribution for θ̂?

(e) What is the asymptotic distribution for µ(xj |θ̂)?

(f) Jerry is interested in predicting the outcome Yn+1 for a new individual whose predic-
tor variables are Xn+1 = xn+1 . So his prediction is µ(xn+1 |θ̂) and his prediction error
is
Yn+1 − µ(xn+1 |θ̂).
What is the approximate distribution of his prediction error?
Hint: Add and subtract µ(xn+1 |θ).

6. Let Y1 , . . . , Yn be i.i.d. observations from a model such that the sample mean Ȳ is a
sufficient statistic. Let σ 2 = Var(Y1 ). Let the estimand be

µ = E[Y1 ].

A fan of the bootstrap proposes using a bootstrapped version of Ȳ , instead of Ȳ itself, to
estimate µ. Specifically, the proposal is to generate B bootstrap samples (recall that one
bootstrap sample is Y1∗ , . . . , Yn∗ , obtained by resampling from Y1 , . . . , Yn using a simple
random sample with replacement), compute the sample mean of each bootstrap sample,
and then average those sample means together. That is, the proposed new estimator is
µ̂ = (1/B) Σ_{j=1}^B Ȳj∗ ,

where Ȳj∗ is the sample mean of the jth bootstrap sample.

(a) Show that it is possible to improve upon µ̂ in terms of MSE (you do not need to do
any calculations for this part).

(b) Conditional on Y1 , . . . , Yn , what happens to µ̂ as B → ∞?

(c) Find E(µ̂|Y1 , . . . , Yn ) and E(µ̂).

(d) Find Var(µ̂|Y1 , . . . , Yn ) and Var(µ̂).

5 Stat 111 final from 2021
1. Suppose that Y1 , . . . , Yn are i.i.d. random variables with µ = E[Y1 ] and σ 2 = Var(Y1 ).
Josie also assumes that Y1 , . . . , Yn are Normally distributed. Assume for parts (a)
through (c) that the Normal model is correct. Josie finds that the MLE of σ 2 is
σ̂² = (1/n) Σ_{j=1}^n (Yj − Ȳ)²,

and recalls from Example 10.4.3 in the Stat 110 book that
(1/σ²) Σ_{j=1}^n (Yj − Ȳ)² ∼ χ²_{n−1}.

(a) Find the mean square error (MSE) of σ̂².

(b) Find the MLE of σ.

(c) Derive a 95% confidence interval for σ (in terms of quantiles of famous distributions).

(d) Beau does not feel comfortable with the Normal assumption Josie maintains. Instead
Beau decides to use an asymptotic approximation to σ̂². Derive the asymptotic distribution
of σ̂². Specifically, show that as n → ∞,

√n (σ̂² − σ²) →d N(0, λ_{σ²}),

for some constant λ_{σ²} which you should specify in terms of central moments of Y1 . (The
kth central moment of an r.v. X with mean µ is E[(X − µ)^k].)

2. A company is investigating the performance of a new battery it has developed. The
battery lasts for an Expo(λ) number of hours, where λ is the rate parameter and µ = 1/λ
is the mean parameter. The company collects data by using the batteries in two kinds
of devices: remote controls and calculators.
A remote control needs only one battery. The company installs a new battery in each
of n remote controls, and records the i.i.d. times X1 , . . . , Xn until the remote controls
stop working. Calculators need two batteries. The company installs two new batteries
in each of m calculators, and records the i.i.d. times Y1 , . . . , Ym until the calculators
stop working. Assume that a remote control stops working when (and only when) its
battery dies, and that a calculator stops working when (and only when) either of its
two batteries dies.
So the data are independent random variables X1 , . . . , Xn , Y1 , . . . , Ym , where Xj ∼
Expo(λ) and Yj is whichever of two i.i.d. Expo(λ) random variables is smaller.
(a) Find the likelihood function L(λ).

(b) Find a one-dimensional sufficient statistic. What is its distribution?

(c) Find the MLE of λ.

(d) Find the MLE of the probability that a remote control will work for at least c hours
(starting from the time when a new battery is installed), where c is a known constant.

(e) Find the Fisher information I(λ) in the entire dataset about λ.

3. Suppose that Y1 , . . . , Yn are i.i.d. N (θ, 1), and consider testing H0 : θ ≤ 0 vs. H1 : θ > 0
using the test statistic Tn = √n Ȳ ,
rejecting the null for large values of Tn . Your answers to this question can be in terms
of the N (0, 1) CDF and quantile function.
(a) For a test of level α, what is the critical value of the test?

(b) Let tn be the observed value of Tn . What is the p-value for the test?

(c) Now suppose that John had the prior θ ∼ N (0, 1/c), leading to the posterior

θ|y1 , . . . , yn ∼ N (mn , τn²).

What are mn and τn², in terms of n, c, ȳ?

(d) Continuing (c), find P (θ ≤ 0|y1 , . . . , yn ), the posterior probability of the null, in
terms of tn , n, c.

(e) For n large, determine whether the p-value from (b) and posterior probability of the
null from (d) are close to each other.

4. Sam observes i.i.d. random variables Y1 , . . . , Yn , and wants to estimate λ = E[Y1 ],
based on the model Yj ∼ Pois(λ). Enamored of the fact that both the mean and the
variance of the Pois(λ) distribution equal λ, Sam decides to use the unbiased sample
variance to estimate λ. So consider the estimator
λ̂ = (1/(n − 1)) Σ_{j=1}^n (Yj − Ȳ)².

(a) Suppose for this part only that the Poisson model is wrong, and in fact the Yj are
Geom(p), with mean λ = (1 − p)/p. What is the bias if λ̂ is used to estimate λ? Express
your answer in terms of λ only.

(b) For the remainder of this problem, assume that the Poisson model is correct. De-
scribe clearly how Sam can use the bootstrap to estimate the standard error of λ̂.
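(As a study aid, not part of the original exam: part (b) asks for a description in words; a minimal R sketch of one such bootstrap procedure is below, with made-up Poisson data standing in for the observed sample.)

set.seed(111)
y <- rpois(60, lambda = 4)    # made-up Poisson data standing in for Y1,...,Yn
B <- 10^4

# Resample the data with replacement and recompute lambda-hat (the unbiased sample
# variance) on each bootstrap sample; var() in R divides by n - 1.
lambda_hat_boot <- replicate(B, var(sample(y, length(y), replace = TRUE)))

# The bootstrap estimate of the standard error is the SD of the bootstrap estimates.
sd(lambda_hat_boot)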

(c) Show that T = Σ_{j=1}^n Yj is a sufficient statistic.

(d) Can λ̂ be improved by Rao-Blackwellization? Explain why or why not. If the answer
is yes, derive the fully simplified Rao-Blackwellized estimator.
Hint: Recall from Theorem 4.8.2 in the Stat 110 book that if X1 and X2 are independent
with X1 ∼ Pois(λ1 ), X2 ∼ Pois(λ2 ), then

X1 |(X1 + X2 = t) ∼ Bin(t, λ1 /(λ1 + λ2 )).

(e) Now suppose instead that a Bayesian approach is taken, with the prior λ ∼ Gamma(r0 , b0 ).
Find the posterior mean of λ, and show that it can be written as a weighted average of
the prior mean and the sample mean.

5. A government agency would like to assess the overall accuracy of the tax returns
that were filed by people in the country last year. There were N returns filed, where
N is in the millions, and they do not have nearly enough resources to audit all of the
returns. So they plan to audit a random sample of the tax returns (not necessarily a
simple random sample).
Label the people who filed tax returns last year as i = 1, 2, . . . , N . Let xi be the
amount that person i reported as their tax obligation and yi be their true tax obligation.
Hopefully xi equals yi , but they may not be equal due to the possibility of errors or
fraud. Here xi is fixed and known, while yi is fixed but unknown (unless/until person i
gets audited).
Let di = yi − xi be the discrepancy between the correct tax obligation and the reported
tax obligation for person i. The estimand is the total discrepancy (which is called the
tax gap),
τ = Σ_{i=1}^N di .

The agency draws a random sample of size n with replacement (sampling without re-
placement is more common in practice but makes the math messier). For each sampled
individual i, the agency conducts a thorough audit and ascertains the true yi .
Each time a return is randomly chosen to be audited, individual i is selected with
probability wi , where the wi are known, positive constants that sum to 1. (If wi = 1/N
for all i then this is simple random sampling, but some returns may be considered to
be more important to audit than other returns, based on how much money is at stake.)
So the data are i.i.d. triples

(Yj , Xj , Wj ), j = 1, . . . , n,

where if the jth individual sampled is person i then (Yj , Xj , Wj ) = (yi , xi , wi ). Let

Dj = Yj − Xj ,

and consider the estimator

τ̂ = (1/n) Σ_{j=1}^n Dj /Wj .

(a) Find the bias of τ̂ (fully simplified, not left as a sum). Is τ̂ unbiased?

(b) Find the variance of τ̂ (as a sum).

(c) Find an unbiased estimator for the variance of τ̂ .

(d) Find the Horvitz-Thompson estimator for τ .

6. You are studying, for first year College students, the causal effect that being on the
selective College sprint team on 1 September has on the time it takes them to run 100
meters on 1 April of the following calendar year (this time is the outcome variable). Let
Wj be the treatment assignment (1 if on the team, 0 otherwise) and Yj be the outcome
for student j.
Let Xj be a predictor variable for student j, such as their sprint time from 2 years ago.
Assume a potential outcomes framework with non-interference, where

{Y1 (0), Y1 (1), X1 , W1 , Y1 } , . . . , {Yj (0), Yj (1), Xj , Wj , Yj } , . . . {Yn (0), Yn (1), Xn , Wn , Yn }

are i.i.d. across j. Suppose that

P (W1 = 1|X1 = x) = λ(x), Y1 (1) = β1 X1 + ε1 , Y1 (0) = β0 X1 + η1 ,

where β0 and β1 are constants and {(ε1 , η1 ) ⊥⊥ W1 } |X1 , i.e., (ε1 , η1 ) is conditionally
independent of W1 , given X1 .
The estimand in this problem is E[τ1 ], where τ1 = Y1 (1) − Y1 (0).
(a) Define

τ̃ = (1/n) Σ_{j=1}^n ( Wj Yj /E[W1 ] − (1 − Wj )Yj /E[1 − W1 ] ).

Find the unconditional expectation E[τ̃] in terms of E[W1 Y1 (1)], E[(1 − W1 )Y1 (0)], E[W1 ],
and E[1 − W1 ].

(b) Express the bias of τ̃ in terms of Cov(W1 , Y1 (1)), Cov(W1 , Y1 (0)), E[W1 ], E[1 − W1 ].

(c) Show that unconfoundedness holds.

6 Stat 111 final from 2022
1. Let θ be the estimand and θ̂n be the estimator

θ̂n = θ + U/n^(1/2) + 1/n^(3000) + V /n^(2/3),

where U and V are independent, unobserved random variables which do not change
with n. Assume that E[U ] = 0, E[V ] = µV , Var(U ) = σ²_U < ∞ and Var(V ) = σ²_V < ∞.
Additionally, assume that U is Normally distributed.
(a) What is the bias of θ̂n ?

(b) What is the variance of θ̂n ?

(c) What is the mean square error of θ̂n ?

(d) Is θ̂n consistent for θ?

(e) What does √n (θ̂n − θ) converge to in distribution as n → ∞?

(f) I have consistent estimators σ̂²_U and σ̂²_V of σ²_U and σ²_V , respectively. How can I use
these to generate an asymptotically valid 95% confidence interval for θ by finding an
asymptotic pivot?

2. Richard models the survival times of (a particular type of) cancer patients using the
parametric model

fY1 (y; δ, γ) = (δ e^(δγ) /√(2π)) y^(−3/2) exp(−(1/2)(δ²/y + γ²y)), y > 0, δ > 0, γ > 0. (1)

Assume that Y1 , . . . , Yn are i.i.d. from (1).


(a) What is the log-likelihood function for δ, γ?

(b) What are the sufficient statistics for δ, γ? Please find the smallest dimensional
sufficient statistic you can.

(c) What is the score for δ?

(d) Use the properties of the score to find E[Y1^(−1)] in terms of δ and γ.

(e) Calculate the Fisher information in the sample for δ.

(f) What are the maximum likelihood estimates for δ and γ?

3. Suppose that we have a parametric statistical model for Y|θ and prior π(θ), where Y
has a continuous distribution. The marginal likelihood f (y) is the corresponding density
for Y, averaging out the effect of the prior:
f (y) = ∫ f (y|θ)π(θ) dθ.

The posterior mean can be thought of as a Bayes estimate. Write it as

θ̂(y) = Eθ|Y=y [θ] = E[θ|Y = y].

George is a frequentist. He would like to know the properties of the Bayes estimator
and credible intervals when the data are drawn from the above model. That is, θ is
drawn from the prior and then Y is drawn from Y|θ.
(a) The expected value of the Bayesian estimator, averaged over the marginal likelihood
f (y), is written as EY [θ̂(Y)] and equals ∫ θ̂(y)f (y) dy. Show that

EY [θ̂(Y)] = ∫ θπ(θ) dθ = E[θ],

the mean of the prior.

(b) A 95% credible interval C(y), computed from the posterior π(θ|y), has the property

P (θ ∈ C(y)|Y = y) = 0.95.

What is P (θ ∈ C(Y))?

(c) Does the result of the previous part imply that C(Y) is a 95% confidence interval?
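(As a study aid, not part of the original exam: George's thought experiment can be checked numerically. The R sketch below uses an assumed toy Normal-Normal model chosen by the editor, not given in the problem, and simply estimates the probability asked about in (b); compare the output with your answer.)

set.seed(111)
nsim <- 10^5

# Assumed toy model (not from the problem): theta ~ N(0, 1), one observation Y | theta ~ N(theta, 1),
# so the posterior is N(y/2, 1/2) and the 95% credible interval is y/2 +/- qnorm(0.975)*sqrt(1/2).
theta <- rnorm(nsim, 0, 1)              # draw theta from the prior
y <- rnorm(nsim, mean = theta, sd = 1)  # then draw Y given theta

lower <- y / 2 - qnorm(0.975) * sqrt(1/2)
upper <- y / 2 + qnorm(0.975) * sqrt(1/2)

# Fraction of simulations in which the credible interval contains theta; compare with (b).
mean(theta >= lower & theta <= upper)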

4. The data are y1 , ..., yn (with no repeated values), and θ is the estimand. Livie focuses
on a method of moments estimate

θ̂ = (1/n) Σ_{j=1}^n g(yj ),

where g is a known function. She draws B bootstrap replications of the data,

Y1∗ = (Y1∗(1) , ..., Yn∗(1) ), ..., YB∗ = (Y1∗(B) , ..., Yn∗(B) ),

and computes

θ̂b∗ = (1/n) Σ_{j=1}^n g(Yj∗(b) ), for b = 1, 2, ..., B.

(a) What is the bootstrap estimate of the standard error of Livie’s estimator θ̂, in terms
of θ̂1∗ , ..., θ̂B∗ ?

(b) What is Eboot [θ̂b∗ ], in terms of y1 , . . . , yn ?

(c) The probability that none of the elements within Y1∗ equal y1 , i.e.,

P (Y1∗(1) ̸= y1 , ..., Yn∗(1) ̸= y1 ),

converges to a simple constant as n gets large. What is that constant?
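(As a study aid, not part of the original exam: the sketch below estimates this probability by simulation for one moderately large n, so you can compare the numerical value with the limiting constant you derive. The data values are arbitrary; any dataset with no repeats works.)

set.seed(111)
n <- 200
y <- rnorm(n)        # arbitrary dataset with no repeated values
nsim <- 10^4

# For each simulated bootstrap sample, check whether y1 was missed entirely.
missed_y1 <- replicate(nsim, !any(sample(y, n, replace = TRUE) == y[1]))

# Estimated probability that y1 does not appear in a bootstrap sample; compare with (c).
mean(missed_y1)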

5. This question is about using outcomes and predictors to estimate a model with an
initial dataset, and using that fitted model to predict a new value Y1∗ based on X1∗ ,
where (X1∗ , Y1∗ ) is a new data point not used in the estimation (doing this is called out
of sample prediction).
Suppose that the outcomes (Y1 , ..., Yn , Y1∗ ) are conditionally independent given the pre-
dictors (X1 , ..., Xn , X1∗ ), i.e., the joint density factors as
f (y1 , ..., yn , y1∗ |x1 , . . . , xn , x∗1 ) = f (y1∗ |x∗1 ) ∏_{j=1}^n fj (yj |xj ).

Also assume that for all x ∈ R,

E[Yj | (Xj = x)] = µ(x), Var[Yj |Xj = x] = σ 2 (x), j = 1, ..., n,

and
E[Y1∗ |X1∗ = x] = µ(x), Var[Y1∗ |X1∗ = x] = σ 2 (x).
Gail does not know how to write down a parametric model for µ(x). Instead she decides
to make predictions of outcomes using a linear method θ̂x, where
θ̂ = (Σ_{j=1}^n Xj Yj ) / (Σ_{j=1}^n Xj²).

Note that (X1∗ , Y1∗ ) are not used in the estimator θ̂. Let

θn = E[θ̂|X1 = x1 , ..., Xn = xn , X1∗ = x∗1 ].

(a) What is θn , in terms of Σ_{j=1}^n xj µ(xj ) and Σ_{j=1}^n xj²?

(b) What is
Var[θ̂|X1 = x1 , ..., Xn = xn , X1∗ = x∗1 ],
in terms of Σ_{j=1}^n xj² σ²(xj ) and Σ_{j=1}^n xj²?

(c) What is
Cov(θ̂, Y1∗ |X1 = x1 , ..., Xn = xn , X1∗ = x∗1 )?

(d) Gail predicts Y1∗ by θ̂x∗1 . Then the out of sample prediction error Û1∗ = Y1∗ − θ̂x∗1
can be decomposed as

Û1∗ = (Y1∗ − µ(x∗1 )) + (µ(x∗1 ) − θn x∗1 ) + (θn − θ̂)x∗1 .

What is the expected prediction error (given X1 , . . . , Xn , X1∗ ), i.e.,

E[Û1∗ |X1 = x1 , ..., Xn = xn , X1∗ = x∗1 ]?

(e) What is the variance of the prediction error (given X1 , . . . , Xn , X1∗ ), i.e.,

Var(Û1∗ |X1 = x1 , ..., Xn = xn , X1∗ = x∗1 )?

