243 Final Exam - Practice
1) When doing a significance test, a student gets a p-value of 0.06. This means: [1
mark]
I. Assuming Ho is true, the sample results are a likely event.
II. 94% of samples should give results that fall in this interval.
III. We reject Ho.
A. I only
B. III only
C. I and II
D. I and III
E. II and III
ANSWER: ____A______
ANSWER: _____B_____
STUDY A (questions 3 & 4) You are interested in the flowering time of tulips in early
spring and decide to keep track of the date that different gardens in your neighborhood
bloom. You use April 1st as a reference point and record the number of days past this date
for each garden. You sample 13 gardens and find that the mean number of days is 12.3
with a standard deviation of 3.1.
3) For STUDY A, what is the 95% confidence interval for the average days past April 1st?
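As a hedged check on question 3, the interval can be sketched as mean ± t* × SD/√n. The critical value t* = 2.179 (two-tailed, 12 degrees of freedom) is an assumption read from a standard t-table rather than computed, and the variable names are illustrative only.

```python
import math

n = 13        # gardens sampled in STUDY A
mean = 12.3   # mean number of days past April 1
sd = 3.1      # sample standard deviation

# Assumption: two-tailed 95% critical t-value for df = n - 1 = 12,
# taken from a standard t-table (not computed here).
t_crit = 2.179

se = sd / math.sqrt(n)        # standard error of the mean
margin = t_crit * se          # half-width of the confidence interval
ci_low = mean - margin
ci_high = mean + margin

print(f"95% CI: {ci_low:.2f} to {ci_high:.2f} days past April 1")
```

A t critical value is used rather than a z value because the population standard deviation is estimated from a small sample.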
4) For STUDY A, which of the following statements about the confidence interval is
true? [1 mark]
A. The confidence interval indicates the range over which you would fail to
reject the null hypothesis of a single-sample t-test.
B. The average number of days before flowering in the sample is between the
days given by the interval.
C. There is a 95% probability that the true average number of days before
flowering is between the days given by the interval.
D. The confidence interval is based on a z-distribution
E. None of the above.
C
ANSWER: ____B______
STUDY B (questions 5, 6 & 7) As a mushroom farmer in the Kingston region, you are
trying to find new opportunities to sell the range of goods you produce. You decide to try
developing homegrown mushroom kits, but want to do a market survey first to see if
the average homeowner could grow mushrooms successfully. You approach 23 random
people at the weekend farmers market and ask if they would be willing to try growing
Oyster mushrooms from your kit. You first ask them whether they have previous
experience growing mushrooms and then follow up with each participant each week over
the next three weeks and ask them whether the mushrooms successfully grew. You are
interested in evaluating whether success at growing mushrooms by the end of the 3 weeks
depended on previous experience.
5) For STUDY B, what statistical test would be most appropriate for testing your
question? [1 mark]
A. One-sample t-test
B. Two-sample t-test
C. Chi-square test
D. Single-factor ANOVA
E. Two-factor ANOVA
ANSWER: ____C______
7) For STUDY B, select the plot that is most appropriate for showing whether success at
growing mushrooms by the end of the 3 weeks depended on previous experience. [1
mark]
A. Bar chart
B. Box plot
C. Grouped bar chart
D. Grouped box chart
E. Scatter plot
ANSWER: _____C_____
ANSWER: _____A_____
9) Which of the following statements best explains why confidence intervals cannot be
used to test for significance in a t-test? [1 mark]
A. Confidence intervals do not share a common sampling distribution with t-
tests.
B. Confidence intervals indicate the range over which the true population value
lies.
C. Confidence intervals are a statement about the sampling distribution of your
population, not of the null distribution.
D. Confidence intervals do not use the standard error estimated from your
sample.
E. Confidence intervals are always two-tailed whereas t-tests can be one- or two-
tailed.
ANSWER: _____C____
STUDY C (questions 10-13) You discover a pamphlet advertising final exam prep
sessions for statistics. Since exams are coming soon, you decide to read what they have to
say. Use your statistical knowledge to find the statistical and methodological errors.
10) For STUDY C, which of the following is the most important methodological error
found in the passage? (Multiple answers possible: 1 mark for each correct response.
You will lose 0.5 marks for each incorrect answer, but only applied to this question.)
A. The study design has pseudoreplicated sampling units
B. The study design has biased sampling unit allocation
C. The study design uses non-random sampling units
D. The study design is missing a control level needed to compare.
E. There are no methodological errors
ANSWER: ____C_____
11) For STUDY C, which of the following is an error found in the passage? [1 mark]
I. Testing the study methods should not be done using linear regression.
II. Linear regression is not evaluated using a t-test.
III. There is a mismatch between the test scores and the statistical conclusion.
A. I only
B. III only
C. I and II
D. I and III
E. II and III
ANSWER: ____D______
12) For STUDY C, which of the following is an error found in the passage? [1 mark]
I. Evaluating the relationship between sleep and stress should not be done with a
Chi-square test.
II. Evaluating the effectiveness of the practice questions on exam success should
not be done using a two-sample t-test.
III. The researchers have drawn an incorrect scientific conclusion about the p-
value for the effectiveness of the practice questions on exam success.
A. I only
B. III only
C. I and II
D. I and III
E. II and III
ANSWER: ____E______
STUDY D Use the following study to answer questions 13, 14 & 15.
Airbnb is an accommodation-sharing platform that relies heavily on guest ratings.
Interestingly, the distribution of ratings (shown below) is not equal across the possible
scores and is also not Normally distributed (5.0 is excellent, 1.0 is poor). The proportion
of accommodations in each category is:
13) If you took repeated samples of potential accommodations for STUDY D, which of
the following statements about the distribution of the mean ratings would be
INCORRECT? [1 mark]
A. Since the data are not Normally distributed, the distribution of mean ratings
cannot be Normally distributed.
B. The mean of the distribution is the same as the mean of the population
distribution.
C. The variance of the distribution depends on sample size.
D. Sampling error causes the variation in the distribution.
E. The standard deviation of the distribution can be estimated from a sample.
ANSWER: ____A_____
14) You collect a sample of 32 accommodations for STUDY D and find the following
information about the ratings: Mean=4.2, Median=4.5, SD=0.2, IQR=1.5. Based on
this information, what is the estimated standard deviation of the distribution of mean
ratings? [1 mark]
A. 0 ≤ ANSWER < 0.05
B. 0.05 ≤ ANSWER < 0.1
C. 0.1 ≤ ANSWER < 0.2
D. 0.2 ≤ ANSWER < 0.5
E. 0.5 ≤ ANSWER < 1.0
ANSWER: _____A_____
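The arithmetic behind question 14 can be sketched as follows: the standard deviation of the distribution of mean ratings is estimated by the standard error, SD/√n. The variable names are illustrative; the median and IQR given in the question are not needed for this estimate.

```python
import math

n = 32     # accommodations sampled in STUDY D
sd = 0.2   # sample standard deviation of the ratings

# Standard error: estimated SD of the sampling distribution of the mean
se = sd / math.sqrt(n)

print(f"Estimated SD of mean ratings: {se:.4f}")
```

The result falls below 0.05, which is consistent with option A.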
15) What is the probability that a randomly selected accommodation in STUDY D will
have a rating that is less than 4.0? [1 mark]
A. 0 ≤ ANSWER < 0.2
B. 0.2 ≤ ANSWER < 0.4
C. 0.4 ≤ ANSWER < 0.6
D. 0.6 ≤ ANSWER < 0.8
E. 0.8 ≤ ANSWER < 1.0
ANSWER: _____A_____
16) You have been investigating the human health effects of a factory that has been
releasing contaminated water into a river from which all local communities draw
their drinking water. The toxin in question is known to cause liver cancer in lab
rats. To determine whether pollution in the river is affecting human health, you
randomly sample 200 people living in the community upstream from the factory
(who don’t drink the polluted water) and 200 individuals who live downstream
(and hence do drink the contaminated water). For each person sampled, you
perform a liver enzyme test to look for the telltale signs of liver cancer. What
type of statistical test would be best suited to analyze these data? [1 mark]
A. Chi-square test
B. Paired-sample t-test
C. 2-sample t-test
D. 1-factor analysis of variance (ANOVA)
E. Regression
ANSWER: _____C_____
17) You want to plant a butterfly garden on the balcony of your apartment, but are not
sure what species of milkweed is best suited for pots. You grow 10 plants of the
‘Common Milkweed’ and 10 plants of a related species called the ‘Butterfly Weed’
and measure their height after 2 months. Which of the following statements are about
descriptive statistics for your data? [1 mark]
I. A boxplot showing that the interquartile range for the common milkweed is
greater than for the butterfly weed.
II. The difference in median height between the two species is 3.2 cm.
III. The height difference between the two species is unexpected from sampling
error.
A. I only.
B. II only.
C. I and II.
D. I and III.
E. II and III.
ANSWER: _____C_____
18) Which of the following questions should be analyzed using a 1-tailed t-test? [1
mark]
A. Is the fluorine concentration in Kingston’s water above the recommended
guide of 0.7 mg/L?
B. Does the mean patient wait time differ between an After-Hours health clinic
compared to a regular doctor’s office?
C. Has the first calendar date that stores put out their Christmas ornaments for
sale changed over the past 20 years?
D. Is the amount of sea ice in our Arctic waters different now from what it was
10 years ago?
ANSWER: _____A_____
19) Sara is an ornithologist and has been studying stress hormones (µg/mL) in black-
capped chickadees over the winter months. She collects blood samples from 10
random birds in a forest site, 10 random birds from a field site, and 10 random
birds from an urban site. What statistical test should be used to evaluate whether
there is a difference in stress hormones among the sites? [1 mark]
A. Chi-square test
B. Paired-sample t-test
C. 2-sample t-test
D. 1-factor analysis of variance (ANOVA)
E. Regression
ANSWER: _____D_____
20) The following R output is for a linear regression of brain mass in bats as a function of
their body mass. Which of the following values is used to test the hypothesis that
body mass can be used to predict brain mass? [1 mark]
21) You have been given a dataset that contains the viral load for patients who are
sick with the flu. The dataset also includes whether the patients had been given a
flu vaccination (levels of yes/no) and their age group (levels of
child/teen/adult/senior). Which of the following figures would best illustrate
whether the efficacy of the flu vaccine for reducing viral load depends on patient
age? [1 mark]
A. Boxplot
B. Contingency table
C. Interaction plot
D. Histogram
E. Scatter plot
ANSWER: _____C_____
22) Select which of the following null and alternative hypotheses are most
appropriate for a two-sample t-test that answers the following question: “Are the
means of my samples different?”. [1 mark]
A. H0: μ ≤ 0 HA: μ > 0
B. H0: μA = μB HA: μA ≠ μB
C. H0: μA ≥ μB HA: μA < μB
D. H0: μ > 0 HA: μ ≤ 0
E. H0: μA ≠ μB HA: μA = μB
ANSWER: ____B_______
SHORT ANSWER - Write all your answers in the space provided.
23) Carbon tax is a hot political issue right now in Canada. The idea is that by taxing
activities or products that generate a lot of carbon dioxide, consumers will change
their behavior and opt for choices that generate less carbon dioxide. You have been
asked to conduct a survey looking at how people feel about carbon taxes in five key
areas: gasoline, cement manufacturing, electricity generation from natural gas plants,
home heating with fossil fuels, agriculture. You decide to run the survey online by
sending emails to 1000 people who have their email registered with Revenue Canada.
The survey asks participants to select their level of support (strongly support, support,
neutral, do not support, strongly do not support) for carbon taxes under each area.
A. Indicate the sampling unit and the statistical population (be specific). [2 marks]
[sampling unit: a person (or person’s email); statistical population: all the people
(or emails) who have an email registered with Revenue Canada. (Zero marks for
not identifying that the statistical population is just those that have an email
registered with Revenue Canada.)]
B. Give one example of a hidden bias that could arise from this sampling method (be
specific). [1 mark]
[People without an email registered with Revenue Canada will not be included in
the sample, such as those who are older or who can’t afford a computer. These
people are likely to have a different perspective on carbon tax, which may cause a
bias in your answer.]
C. Indicate the most appropriate statistical test (be as specific as possible) [1 mark]
[Chi-square test]
E. Indicate the null and alternative hypothesis (be mindful of direction in the test if
appropriate) [1 mark]
[The null hypothesis is H0: observed counts are not different from the expected
counts, HA: observed counts are different from the expected counts. *grade part c
based on the answer in part a, even if that was incorrect]
F. Name the appropriate test statistic (e.g., F-score) (You do not need to find its
value.) [1 mark]
[Chi-square score. *grade part d based on the answer in part a, even if that was
incorrect]
24) You work for a cosmetic company that develops tanning solutions designed to
modify skin color. You have been asked to evaluate the effectiveness of two
formulations (Dihydroxyacetone, Erythrulose) and three methods of
application (Cream, Gel, Solution) in terms of color change. After running the
experiments and collecting the data, you analyze the data and generate the
following R output. Use these results to answer the following questions.
A. Indicate the most appropriate statistical test (e.g., t-test, regression etc.) for this
data and explain your rationale. Be as specific as possible. [1 mark]
This is a two-factor ANOVA [0.5 marks], which is most appropriate because we are
studying a numerical response under two categorical explanatory variables [0.5
marks].
B. Indicate the statistical distribution you will use to test the null hypothesis. [1
mark]
It is an F-distribution.
C. Your employers are interested in whether the color change for each formulation
depends on the type of application. For this specific question, state:
a. The null and alternative hypothesis. [1 mark]
H0: δF1A1=δF2A1=δF1A2=δF2A2=δF1A3=δF2A3=0; HA: at least one δ is not 0, where δ
is the difference of each cell from additivity.
b. The observed test score (3 significant figures). [1 mark]
F_observed=4.677
c. The appropriate degrees of freedom and the associated critical test score
(3 significant figures) from the table provided at the end of the exam.
Assume a Type I error rate of 5%. [1 mark]
Numerator degrees of freedom is 2, denominator degrees of freedom is 18.
From the table, this gives an F_crit=3.555
d. Your statistical conclusion. [1 mark]
Since F_crit<F_obs, we reject the null hypothesis (0.5 marks) and conclude
that some of the cells are different from additivity (0.5 marks).
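The degrees of freedom and decision rule in part C can be sketched as below. The R output is not reproduced here, so the observed score (4.677), the denominator degrees of freedom (18), and the critical score (3.555) are taken directly from the answers above; only the interaction degrees of freedom are derived, and the variable names are illustrative.

```python
levels_formulation = 2   # Dihydroxyacetone, Erythrulose
levels_application = 3   # Cream, Gel, Solution

# Interaction (numerator) degrees of freedom: (a - 1) * (b - 1)
df_interaction = (levels_formulation - 1) * (levels_application - 1)

# Values copied from the answer key, not recomputed
df_residual = 18
f_obs = 4.677
f_crit = 3.555   # from the F-table for (2, 18) df at a 5% Type I error rate

# Reject the null hypothesis of additivity when the observed score
# exceeds the critical score
reject_null = f_obs > f_crit

print(f"df = ({df_interaction}, {df_residual}), reject H0: {reject_null}")
```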
D. The following boxplot shows the results of the experiment. Labels are ‘D’ for
Dihydroxyacetone, ‘E’ for Erythrulose, ‘Cream’ for cream application, ‘Gel’
for gel application, and ‘Sol’ for solution application. Indicate directly on the
boxplot which groups are significantly different from each other based on
the above R output. Use the letter scheme shown in lecture. [2 marks]
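[Figure: boxplot of Color Change (y-axis, 0 to 40) for the six formulation-by-application
groups. Letter labels marked on the boxes, as extracted: a b a b c b]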
E. Use the medians in the above boxplot to draw the interaction plot below.
Draw the figure to scale and be as accurate as possible. [2 marks]
[Figure: interaction plot of Color Change (y-axis) against Formulation (x-axis:
Dihydroxyacetone, Erythrulose), with one line per Application (Solution, Gel, Cream)]
25) A study was published in the Lancet on the body mass index (BMI) of people from
around the globe. The following table shows a random subset of the data for 10
people. For this question, test the hypothesis that the mean BMI of people in this
sample is above the ‘healthy’ threshold of 25.
25.76 25.72 25.39 24.65 24.90 26.01 25.97 25.11 25.91 26.54
A. Indicate the statistical test that is most appropriate for this data and the
scientific question [1 mark] [single sample t-test]
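A minimal sketch of the single-sample t-test for these ten BMI values follows. The one-tailed critical value of 1.833 (df = 9, 5% Type I error rate) is an assumption read from a standard t-table, and the variable names are illustrative only.

```python
import math
import statistics

bmi = [25.76, 25.72, 25.39, 24.65, 24.90,
       26.01, 25.97, 25.11, 25.91, 26.54]
mu0 = 25.0   # 'healthy' threshold under the null hypothesis

n = len(bmi)
mean = statistics.mean(bmi)
sd = statistics.stdev(bmi)      # sample SD (n - 1 in the denominator)
se = sd / math.sqrt(n)          # standard error of the mean

# Observed t-score: distance of the sample mean from mu0 in SE units
t_obs = (mean - mu0) / se

# Assumption: one-tailed critical t for df = 9 at alpha = 0.05,
# taken from a standard t-table (not computed here).
t_crit = 1.833
reject_null = t_obs > t_crit

print(f"mean = {mean:.3f}, t = {t_obs:.2f}, reject H0: {reject_null}")
```

The test is one-tailed because the hypothesis is directional (mean BMI above 25).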
26) For each of the following studies, identify and rationalize the most appropriate
statistical test (e.g., t-test, regression etc.). Include the null and alternative hypotheses
(be mindful of direction in the test), as well as the appropriate test statistic (e.g., F-
score for an F-test).
STUDY 2 A sports medicine study was conducted looking at whether changing the
source of dietary proteins improved race times in marathon runners. The study asked
50 racers to increase the proportion of their dietary protein coming from plants for a
season. The researchers recorded the proportion of dietary protein derived from plants
(proportion) and the change in race time (minutes) for each racer. They wanted to test
the hypothesis that increasing the proportion of plant proteins could predict faster race
times.
STUDY 3 Researchers are interested in whether a new form of biocontrol can
successfully manage pest insects on tea plantations compared to using pesticides.
They conduct an experimental study where 120 tea plants are randomly allocated to
either i) no biocontrol or pesticide, ii) biocontrol but no pesticide, or iii) pesticide
but no biocontrol. For each tea plant, they measured the mass consumed after
exposure to the insects.
ii. State the null and alternative hypothesis for the question whether there is
an effect of biocontrol. [1 mark]
The null hypothesis is H0: μ1=μ2=μ3, HA: at least one treatment mean differs from
the others, where μ is the mean for each treatment.
27) Maple syrup production is looking to be good this year for sugar bush operators. You
are interested in how soil type impacts the amount of maple syrup produced per
hectare of forest, so conduct a study where you randomly sample 60 maple sugar bush
operators from each municipality in Ontario and Quebec. For each operator, you
record the soil type (humic, acidic, sandy, bog) and the amount of production (<100L
per ha, 100-200L per ha, >200L per ha).
A. Indicate the survey design used in the study [1 mark] [stratified sampling]
B. Indicate the sampling unit and observation unit [1 mark] [0.5 marks for each
of sampling unit: sugar bush, observation unit: sugar bush]
C. Indicate the type of data for each of the measurement variables in the study [1
mark] [0.5 marks for each of soil type: categorical, production: categorical]
D. The following table shows the data from the Maple Syrup Study for one of the
municipalities in question 27.
                   Humic soils   Acidic soils   Sandy soils   Bog soils
<100L per ha            8             12             5            4
100-200L per ha        13              5             1            1
>200L per ha            5              3             1            2
i. Indicate the statistical test that is most appropriate for this data and the
scientific question [1 mark] [Chi-square test]
iii. Calculate the missing expected counts in the table below under the null
hypothesis. Report your answers to one decimal place, showing your work
in the space below the table for full marks. [3 marks]
[For the first cell, it would be ((8+13+5)/60)*((8+12+5+4)/60)*60=12.6. Full
table is shown below]
Expected           Humic soils   Acidic soils   Sandy soils   Bog soils
<100L per ha          12.6           9.7            3.4          3.4
100-200L per ha        8.7           6.7            2.3          2.3
>200L per ha           4.8           3.7            1.3          1.3
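The expected counts can be reproduced with a short sketch: each cell is (row total × column total) / grand total, which is equivalent to the proportion-based calculation shown for the first cell. The variable names are illustrative only.

```python
# Observed counts: rows are production classes, columns are soil types
# (Humic, Acidic, Sandy, Bog)
observed = [
    [8, 12, 5, 4],    # <100L per ha
    [13, 5, 1, 1],    # 100-200L per ha
    [5, 3, 1, 2],     # >200L per ha
]

row_totals = [sum(row) for row in observed]
col_totals = [sum(col) for col in zip(*observed)]
grand_total = sum(row_totals)

# Expected count under independence for every cell
expected = [[r * c / grand_total for c in col_totals] for r in row_totals]

for row in expected:
    print([round(value, 1) for value in row])
```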
iv. Calculate your observed test score to two decimal places. [2 marks]
[Observed test score is 7.70; df=(number of rows - 1)*(number of columns - 1)=6]
v. Find the critical score and write both your statistical conclusions and
scientific conclusions. [2 marks]
[chisq-crit=12.592. Since the chisq-observed is less than chisq-crit, we fail to
reject the null hypothesis. We conclude that there is no evidence that maple
syrup production depends on soil type.]
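The observed score and the decision can be checked with the sketch below, which sums (observed - expected)² / expected over all cells. The helper names are hypothetical. The score is computed twice: once with expected counts rounded to one decimal place, as reported in the table, and once with unrounded expected counts, since the two differ slightly. The critical value of 12.592 is taken from the answer above.

```python
observed = [
    [8, 12, 5, 4],    # <100L per ha
    [13, 5, 1, 1],    # 100-200L per ha
    [5, 3, 1, 2],     # >200L per ha
]

def expected_counts(table):
    """Expected counts under independence: row total * column total / grand total."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    total = sum(row_totals)
    return [[r * c / total for c in col_totals] for r in row_totals]

def chi_square(obs, exp):
    """Sum of (observed - expected)^2 / expected over all cells."""
    return sum((o - e) ** 2 / e
               for obs_row, exp_row in zip(obs, exp)
               for o, e in zip(obs_row, exp_row))

exact = expected_counts(observed)
rounded = [[round(e, 1) for e in row] for row in exact]

chi_rounded = chi_square(observed, rounded)   # one-decimal expected counts
chi_exact = chi_square(observed, exact)       # unrounded expected counts

df = (len(observed) - 1) * (len(observed[0]) - 1)
chi_crit = 12.592   # from the answer key (df = 6, alpha = 0.05)
reject_null = chi_exact > chi_crit

print(f"rounded: {chi_rounded:.2f}, unrounded: {chi_exact:.2f}, "
      f"df = {df}, reject H0: {reject_null}")
```

Rounded expected counts give roughly 7.70 and unrounded ones roughly 7.81; both are below the critical score, so the statistical conclusion is unchanged.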
28) Explain what a null F-distribution represents. Include a clear definition of the F-score
(i.e., how is an observed F-score calculated), and then include a clear description of
what the null distribution for the F-score represents. Be as specific as possible. [3
marks]
[The F-score is the ratio of the variation among the group means to the residual
variation within groups. The null distribution for the F-score represents the
variation in that ratio you would expect from repeated sampling of a population where
there was no true difference in the means.]
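To make the definition concrete, here is a minimal sketch of an observed F-score for a single-factor design. The three groups are made-up numbers chosen only for illustration, not data from this exam, and the variable names are hypothetical.

```python
# Hypothetical data: three groups with three observations each
groups = [
    [1.0, 2.0, 3.0],
    [2.0, 3.0, 4.0],
    [5.0, 6.0, 7.0],
]

all_values = [x for g in groups for x in g]
grand_mean = sum(all_values) / len(all_values)
group_means = [sum(g) / len(g) for g in groups]

# Variation among groups: group means around the grand mean
ss_among = sum(len(g) * (m - grand_mean) ** 2
               for g, m in zip(groups, group_means))
df_among = len(groups) - 1

# Residual variation: observations around their own group mean
ss_within = sum((x - m) ** 2
                for g, m in zip(groups, group_means)
                for x in g)
df_within = len(all_values) - len(groups)

# F-score: mean square among groups divided by mean square within groups
f_score = (ss_among / df_among) / (ss_within / df_within)

print(f"F = {f_score:.2f} on ({df_among}, {df_within}) df")
```

Under the null hypothesis the among-group and within-group mean squares estimate the same variance, so repeated sampling would produce F-scores scattered around 1; the null F-distribution describes that scatter.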
FORMULAE