Practice Questions - Final With Feedback
Spring 2022
40% of the population has the virus. What is the probability that an individual who tests
positive really does have the virus?
Answer – It’s helpful to imagine many people, say 100,000, who all take a COVID test.
So the probability that a person who has tested positive actually has the virus is
38,000/(38,000 + 4,800) = 0.89
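If you wanted to check this arithmetic in R, a minimal sketch could look like the following. The variable names are illustrative only, and the two counts are simply the ones that appear in the fraction above.

```r
# Counts taken from the fraction in the answer
# (out of the 100,000 imagined people taking the test)
true_positives  <- 38000  # have the virus and test positive
false_positives <- 4800   # do not have the virus but test positive

# Probability of having the virus, given a positive test
p_virus_given_positive <- true_positives / (true_positives + false_positives)
round(p_virus_given_positive, 2)
```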
2
The following table reports data on public health expenditures and COVID-19 death
rates for a selected set of countries.
Provide R code to compute the correlation between public health expenditures and the
COVID-19 death rate.
Answer
cor(health, death)
If you actually run this in R, which you wouldn’t be able to do on the final, you would find
a strong negative correlation of -0.99 between health expenditure percentage and the
COVID-19 death rate.
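The table is not reproduced here, so the sketch below uses made-up numbers purely to show the mechanics of cor(). The values in health and death are hypothetical placeholders, not the data from the question, so the result will not match the -0.99 quoted above.

```r
# Hypothetical placeholder values, NOT the data from the table
health <- c(4.0, 5.5, 7.0, 8.5, 10.0)  # public health expenditure (illustrative)
death  <- c(120, 95, 80, 50, 30)       # COVID-19 death rate (illustrative)

# Pearson correlation between the two variables
r <- cor(health, death)
r
```

A value close to -1 indicates a strong negative linear relationship, and a value close to +1 a strong positive one.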
Answer
p=342/1000=0.342
Note - We have quite a large sample so we can use the normal distribution. You should
remember that the critical value for a 95% confidence interval using the normal
distribution is 1.96. If you forget this fact you could write the code qnorm(0.975) into
your answer.
$$p \pm z_{\alpha}\sqrt{\frac{p(1-p)}{n}} = 0.342 \pm 1.96\sqrt{\frac{0.342(1-0.342)}{1000}} = 0.342 \pm 1.96 \cdot 0.015001 = [0.3126,\ 0.3714]$$
The 95% confidence interval for the population proportion of individuals
testing positive runs from 31.26% to 37.14%. The lower bound is 31.26%.
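The same interval can be reproduced in R. This is a sketch using the numbers from the answer; the variable names are illustrative, and it uses the normal approximation described in the note above.

```r
# Sample proportion and sample size from the answer
p_hat <- 342 / 1000
n     <- 1000

# Critical value for a 95% two-sided interval (about 1.96)
z_crit <- qnorm(0.975)

# Standard error of the sample proportion
se <- sqrt(p_hat * (1 - p_hat) / n)

# Lower and upper bounds of the confidence interval
ci <- p_hat + c(-1, 1) * z_crit * se
round(ci, 4)
```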
A government official claims that each of the 100 test centres in the country conducts, on
average, 1,000 COVID19 tests per week. You obtain data from a random sample of 9
testing centres and find that, on average, they have conducted 820 tests per week with
a sample variance of 144. Assume that the distribution of tests per centre is normal and
provide R code that will help you determine whether, at a 5% significance level, the
official claim is valid. State how acceptance or rejection of the government claim will
depend on the result you would get from running your R code.
Answer
This is a small sample so we should use the t distribution rather than the normal
distribution.
We should conduct a one-sided test of the hypothesis that mean tests per week are
1,000 against the alternative hypothesis that mean tests per week are less than 1,000.
Step 2 – Determine the Critical Value
It’s a one-sided test with 5% significance so we need to find the value for which 5% of
the area of a t distribution with 9 – 1 = 8 degrees of freedom lies to the left. We get this
from:
qt(0.05, 8)
Step 3 – Calculate the Test Statistic
You can compute the test statistic on a calculator, or even with paper and pencil in this case, if you want.
i. Calculate the standard error. This is the sample standard deviation divided by the
square root of the sample size. The sample standard deviation is the square root of the
sample variance so the standard error is:
12/3 = 4
ii. The test statistic is the difference between the sample mean and the null-hypothesis
mean divided by the standard error.
(820 - 1000)/((144^0.5)/9^0.5)
The question will be whether -45 is smaller than qt(0.05, 8). If so then we’ll reject at the
5% level the hypothesis that the mean number of tests at each centre is 1,000 in favour
of the hypothesis that this mean is less than 1,000.
FYI, if you compute qt(0.05, 8) in R you get -1.86, so the null hypothesis would be
rejected by a country mile. But on the test it would be good enough to leave your
answer at the previous paragraph.
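Putting the pieces together, the whole test could be sketched in R as follows. The variable names are illustrative; the numbers are the ones given in the question.

```r
# Summary statistics from the question
x_bar <- 820    # sample mean of tests per week
mu_0  <- 1000   # mean claimed by the official (null hypothesis)
s2    <- 144    # sample variance
n     <- 9      # number of test centres sampled

# Standard error: sample standard deviation over the square root of n
se <- sqrt(s2) / sqrt(n)

# Test statistic
t_stat <- (x_bar - mu_0) / se

# Left-tail critical value at 5% with n - 1 degrees of freedom
t_crit <- qt(0.05, df = n - 1)

# Reject the null if the test statistic is below the critical value
reject_null <- t_stat < t_crit
c(t_stat = t_stat, t_crit = t_crit)
reject_null
```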
5
Among a random sample of 100 students, 20 test positive for COVID19. A politician
claims that at least 1/3rd of students in the country have COVID. Provide R code that
would test the politician’s claim at a 5% significance level.
Answer –
This will be a one-sided test (note the phrase “at least” above). The null hypothesis that
1/3rd of students have COVID will be tested against the alternative hypothesis that less
than 1/3rd of students have COVID.
100 students can be considered a large sample so we can use the normal distribution.
However, it is totally fine to use the t instead. The two will hardly differ.
qnorm(0.05)
qt(0.05, 99)
$$z = \frac{p-\pi}{\sqrt{\frac{\pi(1-\pi)}{n}}} = \frac{0.2 - 1/3}{\sqrt{\frac{(1/3)(1-1/3)}{100}}} = -2.8284$$
Step 4: draw a conclusion
If the test statistic, -2.8284, is less than the left-tail critical value qnorm(0.05), then reject the
null hypothesis.
If you run the R code you see that we would reject the politician’s claim in favour of the
alternative that less than a third of students have tested positive nationwide.
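One way to run the full test in R is sketched below. The variable names are illustrative; the numbers come from the question, and the normal critical value is used as discussed above.

```r
# Numbers from the question
p_hat <- 20 / 100   # sample proportion testing positive
pi_0  <- 1 / 3      # proportion claimed by the politician (null hypothesis)
n     <- 100

# Standard error under the null hypothesis
se_0 <- sqrt(pi_0 * (1 - pi_0) / n)

# Test statistic
z_stat <- (p_hat - pi_0) / se_0

# Left-tail critical value at the 5% level
z_crit <- qnorm(0.05)

# Reject the null if the test statistic is below the critical value
reject_null <- z_stat < z_crit
c(z_stat = z_stat, z_crit = z_crit)
reject_null
```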
A scientist wants to assess whether age affects the effectiveness of a drug. She runs a
regression on a large dataset where a measure of drug effectiveness is the response
variable and Age is the explanatory variable. She obtains the following results:
Calculate the t-statistic on Age for a 5% significance test. Decide whether age is likely to
be informative regarding the effectiveness of the drug. How would your answer change
if the estimated coefficient came out to be 0.100 or 0.350? Note – you do not need R
code to answer this question.
Answer
The question states that the dataset is large. So, although you are asked to do a t test,
we can still use the normal distribution to get the critical value. In fact, you should
probably just remember that the critical value for the normal distribution for 5%
significance in a two-sided test is 1.96. But, if not, you can use the following, which returns -1.96, and take its absolute value:
qnorm(0.025)
t = -0.200/0.110 = -1.8182
The test statistic is less, in absolute value, than the critical value, so you would not reject the
null hypothesis at the 5% level.
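If the coefficient is 0.100 (keeping the standard error at 0.110): test statistic = 0.100/0.110 = 0.9091, which is below 1.96, so you would again not reject the null hypothesis.
If the coefficient is 0.350: test statistic = 0.350/0.110 = 3.1818, so reject the null hypothesis that age is not related to
the effectiveness of the drug in favour of the alternative hypothesis that there is a
relationship between age and the effectiveness of the drug.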
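Although the question does not require R, the three cases can be compared side by side with a short sketch. It assumes the standard error stays at 0.110 for every coefficient value, as in the calculations above; the variable names are illustrative.

```r
# Standard error on Age, assumed unchanged across the three cases
std_error <- 0.110

# Estimated coefficient and the two alternative values from the question
estimates <- c(-0.200, 0.100, 0.350)

# t-statistic for the null hypothesis that the coefficient is zero
t_stats <- estimates / std_error

# Two-sided 5% critical value from the normal distribution (about 1.96)
z_crit <- qnorm(0.975)

# TRUE where the null hypothesis is rejected
significant <- abs(t_stats) > z_crit
data.frame(estimates, t_stats = round(t_stats, 4), significant)
```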
1. Write down the regression equation for Life expectancy as a function of GDP per
capita.
Answer
Life_Expectancy = a + b*GDPpc + e
2. Provide R code that will estimate the intercept and the slope of the linear model.
Answer
lm(development_dataset$Life_Expectancy ~ development_dataset$GDPpc)
3. If your answer to part 2 does not already do so, then provide R code that will give an
estimate of the R² for your regression.
Answer
The code in part 2 will only give you the coefficient estimates. You get a lot more
information, including the R², by using the summary() command:
summary(lm(development_dataset$Life_Expectancy ~ development_dataset$GDPpc))
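If you only want the coefficients or the R² rather than the full printout, they can be pulled out of the fitted model directly. The sketch below uses a small made-up data frame called toy_dataset, a hypothetical stand-in for development_dataset with the same column names, and passes it through the data argument of lm() as an alternative to the $ notation.

```r
# Hypothetical placeholder data, NOT the real development_dataset
toy_dataset <- data.frame(
  GDPpc           = c(1, 5, 10, 20, 40),
  Life_Expectancy = c(55, 62, 68, 74, 80)
)

# Fit the linear model; data = lets you refer to columns by name
fit <- lm(Life_Expectancy ~ GDPpc, data = toy_dataset)

# Intercept and slope
coef(fit)

# R-squared extracted from the model summary
r_squared <- summary(fit)$r.squared
r_squared
```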