Introduction to Data Science and Statistical Thinking
Table of contents

Preface

I Introduction

1 Intro to data
    1.1 Data basics
        Types of variables
        Relationships among variables
        Associated vs. independent
        Explanatory and response variables
    1.2 Populations and samples
        Anecdotal evidence and early smoking research
        Census
        From exploratory analysis to inference
        Sampling bias
    1.3 Observational studies and experiments
    1.4 Obtaining good samples - Sampling principles and strategies
        1.4.1 Simple random sampling
        1.4.2 Stratified sampling
        1.4.3 Cluster sampling
        1.4.4 Multistage sampling
    1.5 More on experiments

II R
    2.4.4 geoms
    2.4.5 Themes

IV Probability

7 Probability
    7.1 Defining probability
        7.1.1 Sample space
        7.1.2 Events and complements
        7.1.3 Rules of probability
    7.2 Law of large numbers
    7.3 Addition rule
    7.4 Probability distributions
    7.5 Independence
        7.5.1 Disjoint, complementary, independent
    7.6 Conditional probability
        7.6.1 General multiplication rule
        7.6.2 Independence and conditional probabilities
        7.6.3 Case study: breast cancer screening
        7.6.4 Bayes' Theorem
    7.7 Random variables
        7.7.1 Expected value
        7.7.2 Variability
        7.7.3 Linear combinations
    7.8 Continuous distributions
        7.8.1 Expected value and variability
        7.8.2 The normal distributions

VI Inference

References

Appendices
Preface
These lecture notes accompany a course at TUM of the same title.
Many different sources inspire the content of these notes. We are happy to mention:
The notes contain, in some parts, derivatives of the sources mentioned above. None of these changes
were approved by the authors of the original resources.
The content may be copied, edited, and shared via the CC BY-NC-SA license.
Part I
Introduction
1 Intro to data
Objective: Evaluate the effectiveness of cognitive-behavior therapy for chronic fatigue syndrome (Deale et al., 1997).
Participant pool: 142 patients who were recruited from referrals by primary care physicians and
consultants to a hospital clinic specializing in chronic fatigue syndrome.
Actual participants: Only 60 of the 142 referred patients entered the study. Some were excluded
because they didn’t meet the diagnostic criteria, some had other health issues, and some refused to
be a part of the study.
Study design
Patients randomly assigned to treatment and control groups, 30 patients in each group:
Treatment: Cognitive behavior therapy - collaborative, educative, and with a behavioral emphasis. Patients were shown how activity could be increased steadily and safely without exacerbating symptoms.
Control: No advice was given about how activity could be increased. Instead, progressive muscle
relaxation, visualization, and rapid relaxation skills were taught.
Results
The table below shows the distribution of the outcomes at six-month follow-up. (Seven patients dropped out of the study: 3 from the treatment and 4 from the control group.)

                Good Outcome
Group           No    Yes    Sum
Control         21      5     26
Treatment        8     19     27
Sum             29     24     53
Proportion with good outcomes in the treatment group: 19/27 ≈ 0.70.
Proportion with good outcomes in the control group: 5/26 ≈ 0.19.
Question

Do the data show a "real" difference between the treatment and control groups?
Random variation
While the chance a coin lands heads in any given flip is 50%, if we flip a coin 100 times we probably won't observe exactly 50 heads.
table(coin_flips)
# coin_flips
# heads tails
# 44 56
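The object coin_flips is not created in the excerpt above; a minimal sketch of how it could be simulated (assuming 100 flips, matching the total count shown) is:

coin_flips <- sample(c("heads", "tails"), size = 100, replace = TRUE)  # 100 fair flips; counts vary from run to run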
Back to whether the data shows a “real” difference between the groups.
The observed difference between the two groups (0.70 - 0.19 = 0.51) may be real or due to natural
variation.
Since the difference is quite large, it is more believable that the difference is real.
Conclusion: We need statistical tools to determine quantitatively if the difference is so large that we
should reject the notion that it was due to chance.
Question
Are the results of this study generalizable to all patients with chronic fatigue syndrome?
These patients were recruited from referrals to a hospital clinic specializing in chronic fatigue syndrome and volunteered to participate in this study. Therefore, they may not be representative of all patients with chronic fatigue syndrome.
While we cannot immediately generalize the results to all patients, this first study is encouraging.
The method works for patients with some narrow set of characteristics, which gives hope that it will
work, at least to some degree, with other patients.
A survey was conducted among students in an introductory statistics course. Below are a few of the variables in which the responses to the survey questions were stored:
classroom_survey
# # A tibble: 86 x 6
# gender intro_extro sleep bedtime countries dread
# <chr> <chr> <int> <chr> <int> <int>
# 1 male introvert 8 9-11 10 1
# 2 female extrovert 5 12-2 1 1
# 3 male introvert 5 12-2 1 5
# 4 male extrovert 8 9-11 12 1
# # i 82 more rows
Types of variables
• Numerical: Variable can take a wide range of numerical values, and it is sensible to add,
subtract, or take averages with those values.
• Categorical: Variable has a finite number of values, which are categories (called levels),
and it is not sensible to add, subtract, or take averages with those values.
Example 1.1.
head(classroom_survey)
# # A tibble: 6 x 6
# gender intro_extro sleep bedtime countries dread
# <chr> <chr> <int> <chr> <int> <int>
# 1 male introvert 8 9-11 10 1
# 2 female extrovert 5 12-2 1 1
# 3 male introvert 5 12-2 1 5
# 4 male extrovert 8 9-11 12 1
# 5 female introvert 6 9-11 4 3
# 6 male introvert 8 9-11 5 3
Your turn
Does there appear to be a relationship between GPA and number of hours students study per week?
[Scatterplot: GPA (3.0-4.0) vs. study hours per week (0-60)]
Can you spot anything unusual about any of the data points?
When two variables show some connection with one another, they are called associated or dependent variables. Conversely, if two variables are not associated, i.e., there is no evident connection between them, they are said to be independent.
When we look at the relationship between two variables, we often want to analyze whether a change
in one variable causes a change in the other.
If study time is increased, will this lead to an improvement in GPA?

We are asking whether study time affects GPA. If this is our underlying belief, then study time is the explanatory variable and GPA is the response variable in the hypothesized relationship.
Definition 1.1. When we suspect one variable might causally affect another, we label the first vari-
able the explanatory variable and the second the response variable.
Remember
Labeling variables as explanatory and response does not guarantee the relationship between
the two is causal, even if an association is identified between the two variables.
Research question: Can people become better, more efficient runners on their own,
merely by running?
Anti-smoking research started in the 1930s and 1940s when cigarette smoking became increasingly
popular. While some smokers seemed to be sensitive to cigarette smoke, others were completely
unaffected.
Anti-smoking research was faced with resistance based on anecdotal evidence such as
“My uncle smokes three packs a day and he’s in perfectly good health”.
Evidence based on a limited sample size that might not be representative of the population.
It was concluded that "smoking is a complex human behavior, by its nature difficult to study, confounded by human variability." (Brandt, 2009)
In time, researchers could examine larger samples of cases (smokers), and the trends showing that smoking has negative health impacts became much clearer.
Census
Wouldn’t it be better to just include everyone and “sample” the entire population?
1. It can be difficult to complete a census: there always seem to be some subjects who are hard to
locate or hard to measure. And these difficult-to-find subjects may have certain characteristics
that distinguish them from the rest of the population.
2. Populations rarely stand still. Even if you could take a census, the population changes constantly, so it's never possible to get a perfect measure.
3. Taking a census may be more complex than sampling.
Sampling is natural. Think about sampling something you are cooking - you taste (examine) a small
part of what you’re cooking to get an idea about the dish as a whole.
Exploratory analysis:
You taste a spoonful of soup and decide the spoonful you tasted isn’t salty enough.
Inference:
You generalize and conclude that your entire soup needs salt.
For your inference to be valid, the spoonful you tasted (the sample) must represent the entire pot (the
population).
If your spoonful comes only from the surface and the salt is collected at the bottom of the pot, what you tasted is probably not representative of the whole pot.
If you first stir the soup thoroughly before you taste it, your spoonful will more likely be representative
of the whole pot.
Sampling bias
Non-response: If only a small fraction of the randomly sampled people chooses to respond to a
survey, the sample may no longer be representative of the population.
Voluntary response: The sample consists of people who volunteer to respond, because they have
strong opinions on the issue. Such a sample will also not be representative of the population.
Convenience sample: Individuals who are easily accessible are more likely to be included in the sample.

Example 1.2.
The Literary Digest polled about 10 million Americans, and got responses from about 2.4 million.
The poll showed that Landon would likely be the overwhelming winner and FDR would get only
43% of the votes.
This resulted in lists of voters far more likely to support Republicans than a truly typical voter of the
time, i.e., the sample was not representative of the American population at the time.
The Literary Digest election poll was based on a sample size of 2.4 million, which is huge, but since
the sample was biased, the sample did not yield an accurate prediction.
Conclusion: If the soup is not well stirred, it doesn’t matter how large a spoon you have, it will still
not taste right. If the soup is well stirred, a small spoon will suffice to test the soup.
Your turn
A school district is considering whether it will no longer allow high school students to park at
school after two recent accidents where students were severely injured. As a first step, they
survey parents by mail, asking them whether or not the parents would object to this policy
change. Of 6,000 surveys that go out, 1,200 are returned. Of these 1,200 surveys that were
completed, 960 agreed with the policy change and 240 disagreed.
Which of the following statements are true?
ii) The school district has strong support from parents to move forward with the policy
approval.
iii) It is possible that the majority of the parents of high school students disagree with the
policy change.
iv) The survey results are unlikely to be biased because all parents were mailed a survey.
A i) and ii)
B i) and iii)
C ii) and iii)
D iii) and iv)
Definition 1.2.
1. If researchers collect data in a way that does not directly interfere with how the data arise, i.e.,
they merely “observe”, we call it an observational study.
In this case, only an association between the explanatory and the response variables can be established.
2. If researchers randomly assign subjects to various treatments in order to establish causal connections between the explanatory and response variables, we call it an experiment.
If you're going to walk away with one thing from this class, let it be this: correlation does not imply causation.
New study sponsored by General Mills says that eating breakfast makes girls thinner
Girls who regularly ate breakfast, particularly one that includes cereal, were slimmer than those
who skipped the morning meal, according to a larger NIH survey of 2,379 girls in California,
Ohio, and Maryland who were tracked between ages 9 and 19.
Girls who ate breakfast of any type had a lower average body mass index, a common obesity
gauge, than those who said they didn’t. The index was even lower for girls who said they
ate cereal for breakfast, according to findings of the study conducted by the Maryland Medical
Research Institute. The study received funding from the National Institutes of Health and cereal
maker General Mills.
“Not eating breakfast is the worst thing you can do, that’s really the take-home message for
teenage girls,” said study author Bruce Barton, the Maryland institute’s president and CEO.
Results of the study appear in the September issue of the Journal of the American Dietetic
Association.
As part of the survey, the girls were asked once a year what they had eaten during the previous
three days. The data were adjusted to compensate for factors such as differences in physical
activity among the girls and normal increases in body fat during adolescence.
A girl who reported eating breakfast on all three days had, on average, a body mass index 0.7
units lower than a girl who did not eat breakfast at all. If the breakfast included cereal, the
average was 1.65 units lower, the researchers found.
Breakfast consumption dropped as the girls aged, the researchers found, and those who did not
eat breakfast tended to eat higher fat foods later in the day.
Remark. One should be aware that the body mass index is generally a poor metric for measuring
people’s health.
This is an observational study since the researchers merely observed the girls’ (subjects) behavior
instead of imposing treatments on them. The study, which was sponsored by General Mills, found an
association between girls eating breakfast and being slimmer.
Definition 1.3. Extraneous variables that affect both the explanatory and the response variable, and that make it seem like there is a relationship between the two, are called confounding variables.
Example 1.4.
A study found a rather strong correlation between the ice cream sales and the number of shark attacks
for a number of beaches that were sampled.
[Scatterplot: number of shark attacks (20-50) vs. ice cream sales (70-110)]
Conclusion: Increasing ice cream sales causes more shark attacks (sharks like eating people full of
ice cream).
Better explanation: The confounding variable is temperature. Warmer temperatures cause ice
cream sales to go up. Warmer temperatures also bring more people to the beaches, increasing the
chances of shark attacks.
Note

Almost all statistical methods are based on the notion of implied randomness. If observational data are not collected in a random framework from a population, the estimates and the errors associated with the estimates are not reliable.
Randomly select cases from the population without implied connection between the selected
points.
A sample is called a simple random sample if each case in the population has an equal chance of
being included in the final sample.
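As a minimal sketch (assuming a data frame population with one row per case, which is not part of the original notes), a simple random sample of 50 cases can be drawn with dplyr:

library(dplyr)

srs <- population |>
  slice_sample(n = 50)  # every row has the same chance of being selected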
Similar cases from the population are grouped into so-called strata. Afterward, a simple random
sample is taken from each stratum.
[Diagram: population divided into strata 1-6, with a simple random sample drawn within each stratum]
Stratified sampling is especially useful when the cases in each stratum are very similar in terms
of the outcome of interest.
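A sketch of stratified sampling with dplyr, assuming a hypothetical column stratum that identifies the strata:

stratified_sample <- population |>
  group_by(stratum) |>     # strata of similar cases
  slice_sample(n = 10) |>  # simple random sample within each stratum
  ungroup()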
Clusters are usually not made up of homogeneous observations. In cluster sampling, we take a simple random sample of clusters and then include all observations within the sampled clusters.
[Diagram: cluster sampling - clusters 2, 4, and 6 are sampled and all of their observations are included]
In multistage sampling, we likewise take a simple random sample of clusters, but then take only a simple random sample of observations within each sampled cluster.
[Diagram: multistage sampling - clusters 2, 4, and 6 are sampled, then a simple random sample is drawn within each]
Remark. Cluster or multistage sampling can be more economical than the other sampling techniques.
Also, unlike stratified samples, they are most useful when there is a large case-to-case variability
within a cluster, but the clusters themselves do not look very different.
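A sketch of the two stages of multistage sampling, assuming a hypothetical column cluster in the data frame population:

# stage 1: simple random sample of clusters
sampled_clusters <- sample(unique(population$cluster), size = 3)

# stage 2: simple random sample within each sampled cluster
# (for plain cluster sampling, keep all observations of the sampled clusters instead)
multistage_sample <- population |>
  filter(cluster %in% sampled_clusters) |>
  group_by(cluster) |>
  slice_sample(n = 10) |>
  ungroup()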
Your turn
A city council has requested a household survey be conducted in a suburban area of their city.
The area is broken into many distinct and unique neighborhoods, some including large homes
and some with only apartments. Which approach would likely be the least effective?
A Simple random sampling
B Cluster sampling
C Stratified sampling
D Multistage sampling
Your turn
On a large college campus, first-year students and sophomores live in dorms located on the eastern part of the campus and juniors and seniors live in dorms located on the western part of the campus. Suppose you want to collect student opinions on a new housing structure the college administration is proposing and you want to make sure your survey equally represents opinions from students from all years. Which sampling strategy would you choose, and why?
Remark. In this course, our focus will be on statistical methods for simple random sampling.
Proper analysis of more involved sampling schemes requires statistical methods beyond the scope
of the course, e.g., linear mixed models for dependent data obtained from stratified sampling, where
observations from the same stratum are considered as being dependent but independent from observations from other strata.
Example 1.5.
It is suspected that energy gels might affect pro and amateur athletes differently; therefore, we block for pro status:
Your turn
A study is designed to test the effect of light level and noise level on the exam performance
of students. The researcher also believes that light and noise levels might affect males and
females differently, so she wants to make sure both genders are equally represented in each
group. Which of the descriptions is correct?
A There are 3 explanatory variables (light, noise, gender) and 1 response variable (exam performance)
B There are 2 explanatory variables (light and noise), 1 blocking variable (gender), and 1 response variable (exam performance)
C There is 1 explanatory variable (gender) and 3 response variables (light, noise, exam performance)
D There are 2 blocking variables (light and noise), 1 explanatory variable (gender), and 1 response variable (exam performance)
Placebo: fake treatment, often used as the control group for medical studies
Placebo effect: experimental units showing improvement simply because they believe they are receiving a special treatment
Blinding: when experimental units do not know whether they are in the control or treatment
group
Double-blind: when both the experimental units and the researchers who interact with the patients
do not know who is in the control and who is in the treatment group
Online experiments
In 2012 a Microsoft employee working on Bing had an idea about changing the way the search engine displayed ad headlines. Developing it would have required little effort, but it was just one of hundreds of ideas proposed, and it therefore received low priority from program managers.

It sat for more than six months, until an engineer realized that the cost of writing the code for it would be small and launched a simple online controlled experiment — an A/B test (A: control, current system; B: treatment, modification) — to assess its impact.
Within hours the new headline variation was producing abnormally high revenue, triggering a
“too good to be true” alert.
Usually, such alerts signal a bug, but not in this case. An analysis showed that the change
had increased revenue by an astonishing 12%—which on an annual basis would come to
more than $100 million in the United States alone.
It was the best revenue-generating idea in Bing's history, but until the test, its value was underestimated.
Figure 1.7: From Harvard Business Review: The Surprising Power of Online Experiments
Short summary

In this chapter we use a case study on chronic fatigue syndrome to illustrate study design, control groups, and the challenge of distinguishing real effects from random variation. We then cover data basics, including variable types (numerical and categorical, with subcategories of ordinal and nominal) and the identification of relationships between variables, such as association, dependence, and the roles of explanatory and response variables. Furthermore, we distinguish between populations and samples, highlighting potential biases in sampling methods like non-response and voluntary response, using the historical example of the Literary Digest poll. Sampling strategies such as simple random sampling and cluster sampling are introduced to reduce potential bias. Finally, the chapter explains observational studies versus experiments, emphasising that correlation does not imply causation, and discusses principles of experimental design including control, randomisation, replication, and blocking, along with concepts like placebos and blinding.
Part II
R
2 Data and visualization
starwars
# # A tibble: 87 x 14
# name height mass hair_color skin_color eye_color birth_year
# <chr> <int> <dbl> <chr> <chr> <chr> <dbl>
# 1 Luke Skywalker 172 77 blond fair blue 19
# 2 C-3PO 167 75 <NA> gold yellow 112
# 3 R2-D2 96 32 <NA> white, blue red 33
# 4 Darth Vader 202 136 none white yellow 41.9
# 5 Leia Organa 150 49 brown light brown 19
# # i 82 more rows
# # i 7 more variables: sex <chr>, gender <chr>, homeworld <chr>, ...
Some variables are more complex objects than others. For example, films is a so-called list. It contains the names of all the films the character starred in. So, this information varies between characters. For Luke it contains the following titles.
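The object luke is not created in the excerpt; presumably it is the single row of starwars belonging to Luke Skywalker, for example:

luke <- starwars |>
  filter(name == "Luke Skywalker")  # one-row tibble; luke$films is then a list of length one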
luke$films
# [[1]]
# [1] "A New Hope" "The Empire Strikes Back"
# [3] "Return of the Jedi" "Revenge of the Sith"
# [5] "The Force Awakens"
starwars is a data object of type tibble. Therefore, we can inspect available variables by looking at
the data object.
starwars
# # A tibble: 87 x 14
# name height mass hair_color skin_color eye_color birth_year
# <chr> <int> <dbl> <chr> <chr> <chr> <dbl>
# 1 Luke Skywalker 172 77 blond fair blue 19
# 2 C-3PO 167 75 <NA> gold yellow 112
# 3 R2-D2 96 32 <NA> white, blue red 33
# 4 Darth Vader 202 136 none white yellow 41.9
# 5 Leia Organa 150 49 brown light brown 19
# 6 Owen Lars 178 120 brown, grey light blue 52
# # i 81 more rows
# # i 7 more variables: sex <chr>, gender <chr>, homeworld <chr>, ...
But what exactly does each column represent? Take a look at the help page:
?starwars
Definition 2.1. Exploratory data analysis (EDA) is an approach to analyzing data sets to summarize
their main characteristics.
Questions:
• How would you describe the relationship between mass and height of Star Wars characters?
• What other variables would help us understand data points that don’t follow the overall
trend?
[Scatterplot: Weight (kg) vs. Height (cm) of Star Wars characters]
Let’s take a look at a famous collection of four datasets known as Anscombe’s quartet:
quartet
# set x y
# 1 I 10 8.04
# 2 I 8 6.95
# 3 I 13 7.58
# 4 I 9 8.81
# 5 I 11 8.33
# 6 I 14 9.96
# 7 I 6 7.24
# 8 I 4 4.26
# 9 I 12 10.84
# 10 I 7 4.82
# 11 I 5 5.68
# 12 II 10 9.14
# 13 II 8 8.14
# 14 II 13 8.74
# 15 II 9 8.77
# 16 II 11 9.26
# 17 II 14 8.10
# 18 II 6 6.13
# 19 II 4 3.10
# 20 II 12 9.13
# 21 II 7 7.26
# 22 II 5 4.74
# 23 III 10 7.46
# 24 III 8 6.77
# 25 III 13 12.74
# 26 III 9 7.11
# 27 III 11 7.81
# 28 III 14 8.84
# 29 III 6 6.08
# 30 III 4 5.39
# 31 III 12 8.15
# 32 III 7 6.42
# 33 III 5 5.73
# 34 IV 8 6.58
# 35 IV 8 5.76
# 36 IV 8 7.71
# 37 IV 8 8.84
# 38 IV 8 8.47
# 39 IV 8 7.04
# 40 IV 8 5.25
# 41 IV 19 12.50
# 42 IV 8 5.56
# 43 IV 8 7.91
# 44 IV 8 6.89
The datasets look pretty similar when summarizing each of the four datasets by computing the sample
mean and standard deviation for each of the two variables x and y, as well as their correlation.
quartet |>
group_by(set) |>
summarise(
mean_x = mean(x),
mean_y = mean(y),
sd_x = sd(x),
sd_y = sd(y),
r = cor(x, y)
)
# # A tibble: 4 x 6
# set mean_x mean_y sd_x sd_y r
# <fct> <dbl> <dbl> <dbl> <dbl> <dbl>
# 1 I 9 7.50 3.32 2.03 0.816
# 2 II 9 7.50 3.32 2.03 0.816
# 3 III 9 7.5 3.32 2.03 0.816
# 4 IV 9 7.50 3.32 2.03 0.817
When visualizing Anscombe’s quartet, we realize that they are, in fact, quite different.
[Scatterplots of y vs. x for the four datasets I-IV of Anscombe's quartet]
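One way to produce such a faceted plot (a sketch; the plotting code is not shown in the original notes):

ggplot(quartet, aes(x = x, y = y)) +
  geom_point() +
  facet_grid(. ~ set)  # one panel per dataset, side by side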
2.4 ggplot2
“The simple graph has brought more information to the data analyst’s mind than any other
device.” — John Tukey
library(tidyverse)
Remark.
1. In case we only need ggplot2 and no other package from the tidyverse, we can load explicitly
just ggplot2 by running the following code.
library(ggplot2)
2. The gg in ggplot2 stands for Grammar of Graphics, since the package is inspired by the book
The Grammar of Graphics by Leland Wilkinson. A grammar of graphics is a tool that enables
us to concisely describe the components of a graphic.
Let’s look again at the plot of mass vs. height of Star Wars characters.
ggplot(
data = starwars,
mapping = aes(x = height, y = mass)
) +
geom_point() +
labs(
title = "Mass vs. height of Starwars characters",
x = "Height (cm)", y = "Weight (kg)"
)
[Scatterplot: Mass vs. height of Starwars characters - Weight (kg) vs. Height (cm)]
ggplot(
data = starwars,
mapping = aes(x = height, y = mass)
) +
geom_point() +
labs(
title = "Mass vs. height of Starwars characters",
x = "Height (cm)", y = "Weight (kg)"
)
ggplot() is the main function in ggplot2. It initializes the plot. The different layers of
the plots are then added consecutively.
ggplot(data = [dataset],
mapping = aes(x = [x-variable], y = [y-variable])) +
geom_xxx() +
other options
Palmer penguins

The penguins dataset from the palmerpenguins package contains measurements of penguins, including:
• species,
• island in Palmer Archipelago,
• size (flipper length, body mass, bill dimensions),
• sex.
library(palmerpenguins)
glimpse(penguins)
# Rows: 344
# Columns: 8
# $ species <fct> Adelie, Adelie, Adelie, Adelie, Adelie, Adelie, A~
# $ island <fct> Torgersen, Torgersen, Torgersen, Torgersen, Torge~
# $ bill_length_mm <dbl> 39.1, 39.5, 40.3, NA, 36.7, 39.3, 38.9, 39.2, 34.~
# $ bill_depth_mm <dbl> 18.7, 17.4, 18.0, NA, 19.3, 20.6, 17.8, 19.6, 18.~
# $ flipper_length_mm <int> 181, 186, 195, NA, 193, 190, 181, 195, 193, 190, ~
# $ body_mass_g <int> 3750, 3800, 3250, NA, 3450, 3650, 3625, 4675, 347~
# $ sex <fct> male, female, female, NA, female, male, female, m~
# $ year <int> 2007, 2007, 2007, 2007, 2007, 2007, 2007, 2007, 2~
[Target plot: bill length vs. bill depth, coloured by species (Adelie, Chinstrap, Gentoo)]
1. Initialize the plot with ggplot() and specify the data argument.
ggplot(data = penguins)
2. Map the variables bill_depth_mm and bill_length_mm to the x- and y-axis, respectively. The
function aes() creates the mapping from the dataset variables to the plot’s aesthetics.
ggplot(data = penguins,
       mapping = aes(x = bill_depth_mm,
                     y = bill_length_mm))
[Plots for the intermediate steps: axes bill_depth_mm and bill_length_mm, then points added and coloured by species (Adelie, Chinstrap, Gentoo)]
6. Add the subtitle “Dimensions for Adelie, Chinstrap, and Gentoo Penguins”.
ggplot(data = penguins,
       mapping = aes(x = bill_depth_mm,
                     y = bill_length_mm,
                     colour = species)) +
  geom_point() +
  labs(
    title = "Bill depth and length",
    subtitle = "Dimensions for Adelie, Chinstrap, and Gentoo Penguins")
[Resulting plot]
7. Label the x- and y-axis as “Bill depth (mm)” and “Bill length (mm)”, respectively.
ggplot(data = penguins,
       mapping = aes(x = bill_depth_mm,
                     y = bill_length_mm,
                     colour = species)) +
  geom_point() +
  labs(
    title = "Bill depth and length",
    subtitle = "Dimensions for Adelie, Chinstrap, and Gentoo Penguins",
    x = "Bill depth (mm)", y = "Bill length (mm)")
[Resulting plots]
10. Finally, use a discrete colour scale that can be perceived by viewers with common forms of colour blindness.
ggplot(data = penguins,
       mapping = aes(x = bill_depth_mm,
                     y = bill_length_mm,
                     colour = species)) +
  geom_point() +
  labs(
    title = "Bill depth and length",
    subtitle = "Dimensions for Adelie, Chinstrap, and Gentoo Penguins",
    x = "Bill depth (mm)", y = "Bill length (mm)",
    colour = "Species",
    caption = "Source: Palmer Station LTER/palmerpenguins package") +
  scale_colour_viridis_d()
[Final plot]
Remark. Like for all other R functions, you can omit the names of the arguments when building plots
with ggplot(), as long as you keep the order of arguments as given in the function’s help page.
So, using

ggplot(penguins,
       aes(x = bill_depth_mm,
           y = bill_length_mm,
           colour = species))

instead of

ggplot(data = penguins,
       mapping = aes(x = bill_depth_mm,
                     y = bill_length_mm,
                     colour = species))

produces the same plot. However, the following call would not work as intended, because the unnamed arguments no longer appear in the order given on the help page (the aes() object would be taken as the data argument):

ggplot(aes(x = bill_depth_mm,
           y = bill_length_mm,
           colour = species),
       penguins)
2.4.1 Aesthetics
colour
ggplot(penguins,
       aes(x = bill_depth_mm, y = bill_length_mm,
           colour = species)) +
  geom_point() +
  scale_colour_viridis_d()
[Plot: bill_length_mm vs. bill_depth_mm, coloured by species]
shape
In addition to specifying colour with respect to species, we now define shape based on island.
ggplot(penguins,
       aes(x = bill_depth_mm, y = bill_length_mm, colour = species,
           shape = island)) +
  geom_point() +
  scale_colour_viridis_d()
[Plot: points coloured by species and shaped by island (Biscoe, Dream, Torgersen)]
One can, of course, use the same variable for specifying different aesthetics, e.g., using species to
define shape and colour.
ggplot(penguins,
       aes(x = bill_depth_mm, y = bill_length_mm,
           colour = species, shape = species)) +
  geom_point() +
  scale_colour_viridis_d()
[Plot: points coloured and shaped by species]
However, the information displayed by shape and colour is then redundant, which effectively reduces the number of distinct variables shown in the plot.
Remark. The values of shape can only be specified by a discrete variable. Using instead a continuous
variable will lead to an error.
ggplot(penguins,
aes(x = bill_depth_mm, y = bill_length_mm,
shape = body_mass_g)) +
geom_point()
# Error in `geom_point()`:
# ! Problem while computing aesthetics.
# i Error occurred in the 1st layer.
# Caused by error in `scale_f()`:
# ! A continuous variable cannot be mapped to the shape aesthetic.
# i Choose a different aesthetic or use `scale_shape_binned()`.
size
The effect of the size aesthetic is rather apparent: based on the values of the mapped variable, we get symbols of different size.
ggplot(penguins,
       aes(x = bill_depth_mm, y = bill_length_mm,
           colour = species, shape = species,
           size = body_mass_g)) +
  geom_point() +
  scale_colour_viridis_d()
[Plot: point size mapped to body_mass_g, colour and shape mapped to species]
When a continuous variable is used to define size, a set of representative values is chosen to be displayed in the legend.
alpha
ggplot(penguins,
       aes(x = bill_depth_mm, y = bill_length_mm,
           colour = species, shape = species,
           alpha = flipper_length_mm)) +
  geom_point() +
  scale_colour_viridis_d()
[Plot: point transparency (alpha) mapped to flipper_length_mm]
mapping
ggplot(
  penguins,
  aes(
    x = bill_depth_mm, y = bill_length_mm,
    size = body_mass_g,
    alpha = flipper_length_mm)) +
  geom_point()
[Plot: size and alpha mapped to body_mass_g and flipper_length_mm]
setting
ggplot(penguins,
       aes(x = bill_depth_mm, y = bill_length_mm)) +
  geom_point(
    size = 2,
    alpha = 0.5)
[Plot: all points drawn with size 2 and alpha 0.5]
Mapping: Determine the size, alpha, etc., of the geometric objects, like points, based on the values
of a variable in the dataset: Use aes() to define the mapping.
Setting: Determine the size, alpha, etc., of the geometric objects, like points, not based on the values
of a variable in the dataset: Specify the aesthetics within geom_*().
Remark. The * is a placeholder for one of the available geoms. We used geom_point() in the previous
example, but we’ll learn about other geoms soon!
2.4.3 Faceting
Faceting means creating smaller plots that display different subsets of the data. Useful for exploring
conditional relationships and large data.
[Faceted plot: bill_length_mm vs. bill_depth_mm with one panel per species (Adelie, Chinstrap, Gentoo)]
Task: In the following few slides, describe what each plot displays. Think about how the code relates
to the output.
Note
The plots in the next few slides do not have proper titles, axis labels, etc. because we want you
to figure out what’s happening in the plots. But you should always label your plots!
[Faceted plot: panels in a grid with rows by species and columns by sex (female, male, NA)]
[Faceted plot: panels stacked by sex (female, male, NA)]
Faceting by species
[Faceted plot: one panel per species]
facet_wrap() allows for specifying the number of columns (or rows) in the output.
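For example (a sketch using the penguins data from above):

ggplot(penguins,
       aes(x = bill_depth_mm, y = bill_length_mm)) +
  geom_point() +
  facet_wrap(~ species, ncol = 2)  # wrap the species panels into two columns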
[Faceted plot: species panels wrapped into two columns (Adelie, Chinstrap / Gentoo)]
Summary
facet_grid(): 2d grid of panels, defined by a rows ~ columns formula
facet_wrap(): 1d ribbon of panels, wrapped according to the number of rows and columns specified or the available plotting area
When facets are built based on a variable used for coloring, the output will contain an unnecessary
legend.
ggplot(penguins,
       aes(x = bill_depth_mm, y = bill_length_mm, color = species)) +
  geom_point() +
  facet_grid(species ~ sex) +
  scale_color_viridis_d()
[Faceted plot (species by sex) with a redundant species legend]
The information about the different species is already shown on the y-axis and, hence, doesn’t need to
be repeated in the legend. One can remove the legend using either guides(), see the below example,
or theme(legend.position = "none").
ggplot(penguins,
       aes(x = bill_depth_mm, y = bill_length_mm, color = species)) +
  geom_point() +
  facet_grid(species ~ sex) +
  scale_color_viridis_d() +
  guides(color = "none")
[Faceted plot (species by sex) without the legend]
2.4.4 geoms
[Two plots of bill_length_mm vs. bill_depth_mm, coloured by species, drawn with different geoms]
Both plots use the same data, x-aesthetic and y-aesthetic. But they use different geometric
objects to represent the data, and therefore look quite different.
Other geoms are applied analogously to geom_point(). One can also combine several geoms in one
plot.
[Plot: bill_length_mm vs. bill_depth_mm combining several geoms, coloured by species]
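As a sketch of such a combination (assuming, for illustration, points overlaid with a linear trend per species):

ggplot(penguins,
       aes(x = bill_depth_mm, y = bill_length_mm, colour = species)) +
  geom_point() +                 # raw observations
  geom_smooth(method = "lm") +   # linear trend line per species
  scale_colour_viridis_d()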
There are a variety of geoms. Some of them are given in the ggplot2 cheatsheet, which one can also
download as a PDF.
For a complete list, visit the ggplot2 website.
Different geoms describe different aspects of the data, and the choice of the appropriate geom
also depends on the type of the data.
This is explained in more detail when we speak about exploring data.
After figuring out which geoms to use, there might still be the question of how to use them. In that case, open the documentation for the chosen geom function (like for any other R function) by typing

?geom_function
2.4.5 Themes
The above plots will look different when you run the given code on your own machine. This is because we defined a different theme to be the default theme for these notes; in general, theme_gray() is the default.
[Plot drawn with the notes' default theme]
ggplot(penguins,
aes(x = bill_depth_mm, y = bill_length_mm,
color = species)) +
geom_point() +
theme_minimal()
[Plot drawn with theme_minimal()]
The complete list of all built-in themes is available on the ggplot2 website.
Short summary
This chapter introduces the ggplot2 package in R for data visualisation. It begins by explaining
dataset terminology using the starwars dataset as an example. The text then demonstrates
the fundamentals of ggplot2, including initialising plots with ggplot(), mapping variables
to aesthetics with aes(), and adding geometric objects with geom_point(). Further sections
cover customising plots with labels and titles, exploring the use of faceting for creating
multiple subplots, and differentiating between mapping aesthetics to variables versus
setting them manually. Finally, it touches upon various geoms for different visual representations and the application of themes to alter the overall appearance of plots.
3 Grammar of data wrangling
Example: We have data from two categories of hotels: resort hotels and city hotels. Each row, i.e.,
each observation, represents a hotel booking.
Data source: TidyTuesday
Goal for original data collection: development of prediction models to classify a hotel
booking's likelihood to be cancelled (Antonio et al., 2019).
We will pursue a much simpler data exploration, but before doing that we have to get the data into R.
The dataset is available as a text file, to be precise, as a CSV file.
The tidyverse toolbox for data input/output is the readr package. It is one of the core tidyverse
packages loaded when loading the tidyverse. Since we already loaded the tidyverse, readr is ready
for usage.
The most general function for reading in data is read_delim(). Several variants with respect to the
relevant field separator exist to make our lives easier. In our case, it is a comma. Therefore, we use
read_csv() (in case of a semicolon, it would be read_csv2()).
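The actual import call is not shown in the excerpt; it presumably looked something like the following, where the file name hotels.csv is an assumption:

hotels <- read_csv("hotels.csv")  # comma-separated values; use read_csv2() for semicolon-separated files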
hotels
# # A tibble: 119,390 x 32
# hotel is_canceled lead_time arrival_date_year arrival_date_month
# <chr> <dbl> <dbl> <dbl> <chr>
# 1 Resort Hotel 0 342 2015 July
# 2 Resort Hotel 0 737 2015 July
# 3 Resort Hotel 0 7 2015 July
# 4 Resort Hotel 0 13 2015 July
# 5 Resort Hotel 0 14 2015 July
# 6 Resort Hotel 0 14 2015 July
# # i 119,384 more rows
# # i 27 more variables: arrival_date_week_number <dbl>, ...
We start by extracting just a single column. For example, we want to look at lead_time, which is the
number of days between booking and arrival date.
select(
hotels,
lead_time
)
# # A tibble: 119,390 x 1
# lead_time
# <dbl>
# 1 342
# 2 737
# 3 7
# 4 13
# 5 14
# 6 14
# # i 119,384 more rows
Note
In this example, hotels and the output of select() are tibbles, which are a special kind of data frame. In particular, a tibble prints information about the dimension of the data and the type of the variables. Most of the time we will work with tibbles.
Next, let's extract two columns, hotel and lead_time:

select(
  hotels,
  hotel, lead_time
)
# # A tibble: 119,390 x 2
# hotel lead_time
# <chr> <dbl>
# 1 Resort Hotel 342
# 2 Resort Hotel 737
# 3 Resort Hotel 7
# 4 Resort Hotel 13
# 5 Resort Hotel 14
# 6 Resort Hotel 14
# # i 119,384 more rows
That was easy. We just had to provide the additional variable name as a further argument of
select().
3.3 Arrange
But what if we wanted to select these columns and then arrange the data in descending order of
lead time?
To accomplish this task, we need to take two steps that we can implement as follows:
arrange(
select(hotels,
hotel, lead_time),
desc(lead_time)
)
# # A tibble: 119,390 x 2
# hotel lead_time
# <chr> <dbl>
# 1 Resort Hotel 737
# 2 Resort Hotel 709
# 3 City Hotel 629
# 4 City Hotel 629
# 5 City Hotel 629
# 6 City Hotel 629
# # i 119,384 more rows
Often, tasks will have four, five, or more steps. Writing code from inside to outside in this way will
get extremely messy.
Hence, we want to introduce a more efficient way of combining several steps into one command.
3.4 Pipes
R knows a number of different pipe operators. In Version 4.1.0 the pipe operator
|>
was introduced in the base R distribution. It is therefore known as the native pipe. We will use this pipe operator most of the time.
In programming, a pipe is a technique for passing information from one process to another.
Expressed as a set of nested functions in R pseudo code this would look like
park(drive(start_car(find("keys")), to = "work"))
Writing it out using pipes gives a more natural (and more accessible to read) structure.
find("keys") |>
start_car() |>
drive(to = "work") |>
park()
Let’s see the native pipe in action. We start with the tibble hotels, and pass it to the select()
function to extract the variables hotel and lead_time.
hotels |>
select(hotel, lead_time)
# # A tibble: 119,390 x 2
# hotel lead_time
# <chr> <dbl>
# 1 Resort Hotel 342
# 2 Resort Hotel 737
# 3 Resort Hotel 7
# 4 Resort Hotel 13
# 5 Resort Hotel 14
# 6 Resort Hotel 14
# # i 119,384 more rows
Combining the above code with the arrange() function leads to the result we are looking for.
hotels |>
select(hotel, lead_time) |>
arrange(desc(lead_time))
# # A tibble: 119,390 x 2
# hotel lead_time
# <chr> <dbl>
# 1 Resort Hotel 737
# 2 Resort Hotel 709
# 3 City Hotel 629
# 4 City Hotel 629
# 5 City Hotel 629
# 6 City Hotel 629
# # i 119,384 more rows
3.4.3 Aside

dplyr knows its own pipe operator %>%, which is actually implemented in the package magrittr. This operator is older but has the drawback of working only in an "extended tidyverse".

Any guesses as to why the package is called magrittr? The Treachery of Images.
When chaining commands, we pipe the output of the previous line of code as the first input of the next line of code.

dplyr (and the tidyverse in general) uses the pipe for this; it does not use + like ggplot2:
hotels +
select(hotel, lead_time)
# Error: object 'hotel' not found
hotels |>
select(hotel, lead_time)
# # A tibble: 119,390 x 2
# hotel lead_time
# <chr> <dbl>
# 1 Resort Hotel 342
# 2 Resort Hotel 737
# 3 Resort Hotel 7
# 4 Resort Hotel 13
# 5 Resort Hotel 14
# 6 Resort Hotel 14
# # i 119,384 more rows
ggplot2, in contrast, uses + to add layers to a plot:
ggplot(
hotels,
aes(x = hotel,
fill = deposit_type)) +
geom_bar()
[Bar chart: booking counts for City Hotel and Resort Hotel, filled by deposit_type (No Deposit, Non Refund, Refundable)]
We have already seen select() at work. However, selecting can also be done based on specific
characteristics.
We could be interested in all variables which have a name starting with the string “arrival”,
hotels |>
select(starts_with("arrival"))
# # A tibble: 119,390 x 4
# arrival_date_year
# <dbl>
# 1 2015
# 2 2015
# 3 2015
# 4 2015
# 5 2015
# 6 2015
# # i 119,384 more rows
# # i 3 more variables: ...
or in all variables whose names end with "type":

hotels |>
select(ends_with("type"))
# # A tibble: 119,390 x 4
# reserved_room_type
# <chr>
# 1 C
# 2 C
# 3 A
# 4 A
# 5 A
# 6 A
# # i 119,384 more rows
# # i 3 more variables: ...
See help for any of these functions for more info, e.g. ?everything.
By default, the arrange() function will sort the entries in ascending order.
hotels |>
select(adults, children,
babies) |>
arrange(babies)
# # A tibble: 119,390 x 3
# adults children babies
# <dbl> <dbl> <dbl>
# 1 2 0 0
# 2 2 0 0
# 3 1 0 0
# 4 1 0 0
# 5 2 0 0
# 6 2 0 0
# # i 119,384 more rows
If the output should be given in descending order, we must specify this using desc().
hotels |>
select(adults, children,
babies) |>
arrange(desc(babies))
# # A tibble: 119,390 x 3
# adults children babies
# <dbl> <dbl> <dbl>
# 1 2 0 10
# 2 1 0 9
# 3 2 0 2
# 4 2 0 2
# 5 2 0 2
# 6 2 0 2
# # i 119,384 more rows
The arguments of filter() specify conditions that need to be fulfilled by a row (= an observation) for it to become part of the output. Let's filter for all bookings in City Hotels.
hotels |>
filter(hotel == "City Hotel")
# # A tibble: 79,330 x 32
# hotel is_canceled
# <chr> <dbl>
# 1 City Hotel 0
# 2 City Hotel 1
# 3 City Hotel 1
# 4 City Hotel 1
# 5 City Hotel 1
# 6 City Hotel 1
# # i 79,324 more rows
# # i 30 more variables: ...
We can specify multiple conditions, which will be combined with an &. The following command
extracts all observations, which are bookings with no adults and at least one child.
hotels |>
filter(
adults == 0,
children >= 1
) |>
select(adults, babies, children)
# # A tibble: 223 x 3
# adults babies children
# <dbl> <dbl> <dbl>
# 1 0 0 3
# 2 0 0 2
# 3 0 0 2
# 4 0 0 2
# 5 0 0 2
# 6 0 0 3
# # i 217 more rows
If two (or more) conditions should be combined with an “or”, we must do this explicitly using the |
operator. So, let’s check again for bookings with no adults, but this time we allow for some children
or babies in the room.
hotels |>
filter(
adults == 0,
children >= 1 | babies >= 1 # | means or
) |>
select(adults, babies, children)
# # A tibble: 223 x 3
# adults babies children
# <dbl> <dbl> <dbl>
# 1 0 0 3
# 2 0 0 2
# 3 0 0 2
# 4 0 0 2
# 5 0 0 2
# 6 0 0 3
# # i 217 more rows
We end up with the same number of observations. So, there are no bookings with only babies (and neither adults nor children) in the room.
In some cases, we might be interested in the unique observations of a variable. That’s when we want
to use the distinct() function.
hotels |>
distinct(market_segment)
# # A tibble: 8 x 1
# market_segment
# <chr>
# 1 Direct
# 2 Corporate
# 3 Online TA
# 4 Offline TA/TO
# 5 Complementary
# 6 Groups
# 7 Undefined
# 8 Aviation
Combining distinct() with arrange() makes the output friendlier to read, since the unique combinations are returned in order.
hotels |>
distinct(hotel,
market_segment) |>
arrange(hotel, market_segment)
# # A tibble: 14 x 2
# hotel market_segment
# <chr> <chr>
# 1 City Hotel Aviation
# 2 City Hotel Complementary
# 3 City Hotel Corporate
# 4 City Hotel Direct
# 5 City Hotel Groups
# 6 City Hotel Offline TA/TO
# 7 City Hotel Online TA
# 8 City Hotel Undefined
# 9 Resort Hotel Complementary
# 10 Resort Hotel Corporate
# 11 Resort Hotel Direct
# 12 Resort Hotel Groups
# 13 Resort Hotel Offline TA/TO
# 14 Resort Hotel Online TA
hotels |>
slice(1:5) # first five
# # A tibble: 5 x 32
# hotel is_canceled
# <chr> <dbl>
# 1 Resort Hotel 0
# 2 Resort Hotel 0
# 3 Resort Hotel 0
# 4 Resort Hotel 0
# 5 Resort Hotel 0
# # i 30 more variables:
# # lead_time <dbl>, ...
Every data frame you encounter contains more information than what is visible at first glance. For instance, the hotels data frame does not directly show the average number of days between booking and arrival for city hotels. But we can compute it using summarize():
hotels |>
filter(hotel == "City Hotel") |>
summarize(mean_lead_time = mean(lead_time))
# # A tibble: 1 x 1
# mean_lead_time
# <dbl>
# 1 110.
To use the summarize() function, you need to pass in a data frame along with one or more named
arguments. Each named argument should correspond to an R expression that produces a single value.
The summarize() function will create a new data frame where each named argument is converted
into a column. The name of the argument will become the column name, while the value returned
by the argument will fill the column. This means that we are not restricted to computing only one summary statistic.
We can apply several summary functions (we will see more examples of summary functions in Explore
Data) as subsequent arguments of summarize.
For example, consider determining the number of bookings in city hotels in addition to computing
the average lead time. We can use the n() function from the dplyr package to count the number of
observations.
hotels |>
filter(hotel == "City Hotel") |>
summarize(mean_lead_time = mean(lead_time), n = n())
# # A tibble: 1 x 2
# mean_lead_time n
# <dbl> <int>
# 1 110. 79330
Now assume we want to compute the same two quantities also for resort hotels. We can definitely
do the following.
hotels |>
filter(hotel == "Resort Hotel") |>
summarize(mean_lead_time = mean(lead_time), n = n())
# # A tibble: 1 x 2
# mean_lead_time n
# <dbl> <int>
# 1 92.7 40060
That would be feasible if the grouping variable has only a very limited number of unique values. But
it is still quite inefficient.
Imagine we want to compute those two summary statistics for each level of market_segment. Based
on the previous solution, we need to repeat the code eight times. Fortunately, dplyr provides a much
more efficient approach using group_by().
The group_by() function takes a data frame and one or more column names from that data frame as
input. It returns a new data frame that groups the rows based on the unique combinations of values
in the specified columns. When we now apply a dplyr function like summarize() or mutate() on
the grouped data frame, it executes the function in a group-wise manner.
hotels |>
group_by(market_segment) |>
summarize(mean_lead_time = mean(lead_time), n = n())
# # A tibble: 8 x 3
# market_segment mean_lead_time n
# <chr> <dbl> <int>
# 1 Aviation 4.44 237
# 2 Complementary 13.3 743
# 3 Corporate 22.1 5295
# 4 Direct 49.9 12606
# 5 Groups 187. 19811
# 6 Offline TA/TO 135. 24219
# 7 Online TA 83.0 56477
# 8 Undefined 1.5 2
Imagine that it’s not essential for the analysis to distinguish between children and babies. Instead,
we would like to have the number of little ones (children or babies) staying in the room.
hotels |>
mutate(little_ones = children + babies) |>
select(children, babies, little_ones) |>
arrange(desc(little_ones))
# # A tibble: 119,390 x 3
# children babies little_ones
# <dbl> <dbl> <dbl>
# 1 10 0 10
# 2 0 10 10
# 3 0 9 9
# 4 2 1 3
# 5 2 1 3
# 6 2 1 3
# # i 119,384 more rows
Good code styling is not strictly necessary, but it is highly beneficial: you benefit most from the readability of your own code. But what is good code styling?

Two principles when using |> and + are, for example, to put a space before the operator and to start a new line after it, instead of writing everything in one line like

ggplot(hotels,aes(x=hotel,fill=deposit_type))+geom_bar()
These are just two examples. There is a lot more to consider and it makes sense for a beginner to get
inspired by looking at a style guide. Therefore, we encourage you to look at the tidyverse style guide.
We will not always follow this style guide with our code but try to do so as often as possible.
Following a style guide is easier than you think. The styler package provides functions for converting
code to follow a chosen style.
library(styler)
style_text(
"ggplot(hotels,aes(x=hotel,y=deposit_type))+geom_bar()",
transformers = tidyverse_style())
# ggplot(hotels, aes(x = hotel, y = deposit_type)) +
# geom_bar()
The styler addin for the RStudio IDE makes this even easier. Just look at the styler website for an example.
Short summary
This chapter introduces the dplyr package in R, a key component of the tidyverse for data
manipulation. It explains how dplyr provides a consistent set of functions, or “verbs”, such as
select, arrange, filter, mutate, and summarise, to tackle common data wrangling tasks. The
text details the use of the native pipe operator (|>) to chain these verbs together in a readable
manner, contrasting it with the layering concept in ggplot2. Furthermore, it illustrates how to
read data using the readr package and demonstrates fundamental dplyr functions for extracting columns, ordering rows, filtering data, and creating summary statistics, often in conjunction
with the group_by() function for group-wise operations.
Part III
Explore Data
4 Exploring categorical data
We consider a dataset of loans, available as loans_full_schema in the openintro package.

• Not all loans are created equal – the ease of getting a loan depends on the (apparent) ability to pay it back.
• The data includes loans that were made; these are not loan applications.
library(openintro)
loans_full_schema
# # A tibble: 10,000 x 55
# emp_title emp_length state homeownership annual_income
# <chr> <dbl> <fct> <fct> <dbl>
# 1 "global config engineer " 3 NJ MORTGAGE 90000
# 2 "warehouse office clerk" 10 HI RENT 40000
# 3 "assembly" 3 WI RENT 40000
# 4 "customer service" 1 PA RENT 30000
# 5 "security supervisor " 10 CA RENT 35000
# 6 "" NA KY OWN 34000
# # i 9,994 more rows
# # i 50 more variables: verified_income <fct>, debt_to_income <dbl>, ...
We focus on two of its categorical variables:

• homeownership, which can take one of the values rent, mortgage (owns but has a mortgage), or own, and
• application_type, which indicates whether the loan application was made with a partner (joint) or whether it was an individual application.
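The object loans used below is not defined in the excerpt; presumably it is derived from loans_full_schema, for example along these lines (an assumption, not the original preprocessing):

loans <- loans_full_schema |>
  mutate(homeownership = fct_relabel(homeownership, tolower))  # e.g. MORTGAGE -> mortgage; requires the tidyverse (forcats)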
We start exploring the distribution of the two variables by computing the absolute frequencies of
the different outcome values.
Definition 4.1. Let $\{v_1, \dots, v_k\}$ be the unique values of a categorical variable $X$, and let $x_1, \dots, x_n$ be $n$ sample observations from that variable. Then we define
$$ n_j = \sum_{i=1}^{n} \mathbb{1}_{v_j}(x_i), \qquad j \in \{1, \dots, k\}, $$
as the absolute frequency of outcome $v_j$. Here, $\mathbb{1}_{v_j}(x_i) = \begin{cases} 1, & x_i = v_j, \\ 0, & x_i \neq v_j, \end{cases}$ is called the indicator or characteristic function. Note that $n_j$ simply counts the occurrences of $v_j$ among $\{x_1, \dots, x_n\}$.
loans |>
count(homeownership)
# # A tibble: 3 x 2
# homeownership n
# <fct> <int>
# 1 rent 3858
# 2 mortgage 4789
# 3 own 1353
loans |>
count(application_type)
# # A tibble: 2 x 2
# application_type n
# <fct> <int>
# 1 joint 1495
# 2 individual 8505
Instead of absolute frequencies, we can compute the relative frequencies (proportions) $r_j = n_j / n$.
loans |>
count(homeownership) |>
mutate(prop = n / sum(n))
# # A tibble: 3 x 3
# homeownership n prop
# <fct> <int> <dbl>
# 1 rent 3858 0.386
# 2 mortgage 4789 0.479
# 3 own 1353 0.135
A bar chart is a common way to display the distribution of a single categorical variable.
In ggplot2 we can use geom_bar() to create a bar chart.
[Bar chart: counts by homeownership (rent, mortgage, own)]
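The plotting code is not shown in the excerpt; a sketch that produces such a chart:

ggplot(loans, aes(x = homeownership)) +
  geom_bar() +
  labs(x = "Homeownership", y = "Count")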
Note that we did not present the data as they are: in a preliminary step, absolute frequencies were calculated for homeownership, and then these values were plotted.
Each geom has its own set of variables to be calculated. For geom_bar() these are count (the default) and prop.

Remark. The help page of each geom function contains a list with all computed variables, where the first entry is the default computation.
To create a bar chart of relative frequencies (not absolute), we first have to apply the statistical
transformation prop to the whole data set.
after_stat(prop) computes group-wise proportions. The data contains three groups concerning
homeownership. If we want to calculate proportions for each group with respect to the size of the
whole dataset, we first have to assign a common group value (e.g., group = 1) for all three groups.
ggplot(loans,
       aes(x = homeownership,
           y = after_stat(prop), group = 1)) +
  geom_bar(fill = "gold") +
  labs(x = "Homeownership")
[Bar chart of the relative frequencies (prop) of homeownership: rent, mortgage, own.]
More generally, we can determine the joint frequency distribution for two categorical variables.
Definition 4.2. Let $\{v_1, \dots, v_k\}$ and $\{w_1, \dots, w_m\}$ be the unique values of the categorical variables $X$ and $Y$, respectively. Further let $(x_1, y_1), \dots, (x_n, y_n)$ be $n$ sample observations from the bivariate variable $(X, Y)$. Then we define
$$n_{j,\ell} = \sum_{i=1}^{n} \mathbf{1}_{(v_j, w_\ell)}\big((x_i, y_i)\big), \qquad j \in \{1, \dots, k\},\ \ell \in \{1, \dots, m\},$$
as the absolute frequency of the outcome pair $(v_j, w_\ell)$.
loans |>
select(homeownership, application_type) |>
table()
# application_type
# homeownership joint individual
# rent 362 3496
# mortgage 950 3839
# own 183 1170
loans |>
select(homeownership, application_type) |>
table() |>
addmargins()
# application_type
# homeownership joint individual Sum
# rent 362 3496 3858
# mortgage 950 3839 4789
# own 183 1170 1353
# Sum 1495 8505 10000
prop.table() converts a contingency table with absolute frequencies into one with proportions.
loans |>
select(homeownership, application_type) |>
table() |>
prop.table()
# application_type
# homeownership joint individual
# rent 0.0362 0.3496
# mortgage 0.0950 0.3839
# own 0.0183 0.1170
To add row or column proportions, one can use the margin argument. For row proportions¹ we have to use margin = 1, for column proportions² margin = 2.
loans |>
select(homeownership, application_type) |>
table() |>
prop.table(margin = 1) |>
addmargins()
# application_type
# homeownership joint individual Sum
# rent 0.0938310 0.9061690 1.0000000
# mortgage 0.1983713 0.8016287 1.0000000
# own 0.1352550 0.8647450 1.0000000
# Sum 0.4274573 2.5725427 3.0000000
loans |>
select(homeownership, application_type) |>
table() |>
prop.table(margin = 2) |>
addmargins()
# application_type
# homeownership joint individual Sum
# rent 0.2421405 0.4110523 0.6531928
# mortgage 0.6354515 0.4513815 1.0868330
# own 0.1224080 0.1375661 0.2599742
# Sum 1.0000000 1.0000000 2.0000000
¹ absolute frequencies divided by the row totals
² absolute frequencies divided by the column totals
Remark. Row and column proportions can also be thought of as conditional proportions as they
tell us about the proportion of observations in a given level of a categorical variable conditional on
the level of another categorical variable.
We can plot the distributions of two categorical variables simultaneously in a bar chart. Such charts
are generally helpful to visualize the relationship between two categorical variables.
[Stacked bar chart of counts by homeownership (rent, mortgage, own), filled by application type (joint, individual).]
Loan applicants most often live in homes with mortgages. But it is not so easy to say how the different
types of applications differ over the levels of homeownership.
The stacked bar chart is most useful when it's reasonable to assign one variable as the explanatory variable (here homeownership) and the other as the response (here application_type), since we are effectively grouping by one variable first and then breaking it down by the other.
One can vary the bars’ position with the position argument of geom_bar().
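A sketch of how the charts in this section could be produced; the default position is "stack", and "dodge" or "fill" give the other two variants shown below (the labels are assumptions matching the figures):

ggplot(loans, aes(x = homeownership, fill = application_type)) +
  geom_bar() +  # default position = "stack"; try position = "dodge" or "fill"
  labs(x = "Homeownership", y = "Count", fill = "Application type")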
[Dodged bar chart of counts by homeownership (rent, mortgage, own) and application type (joint, individual).]
Dodged bar charts are more agnostic in their display about which variable, if any, represents the
explanatory and which is the response variable. It is also easy to discern the number of cases in the
six group combinations. However, one downside is that it tends to require more horizontal space.
Additionally, when two groups are of very different sizes, as we see in the group own relative to either
of the other two groups, it is difficult to discern if there is an association between the variables.
A third option for the position argument is fill. Using this option makes it easy to compare the
distribution within one group over all groups in the dataset. But we have no idea about the sizes of
the different groups.
[Filled bar chart: within-group proportions of application type (joint, individual) for each level of homeownership.]
Conclusion: Joint applications are most common for applicants who live in mortgaged homes. Since
the proportions of joint and individual loans vary across the groups, we can conclude that the two
variables are associated in this sample.
¾ Your turn
A study was conducted to determine whether an experimental heart transplant program in-
creased lifespan. Each patient entering the program was officially designated a heart transplant
candidate. Patients were randomly assigned into treatment and control groups. Patients in the
treatment group received a transplant, and those in the control group did not. The charts below display the data in two different ways.
[Stacked and filled bar charts of survival status (alive, deceased) for the treatment and control groups.]
a) What is one aspect of the two-group comparison that is easier to see from the stacked bar chart?
b) What is one aspect of the two-group comparison that is easier to see from the filled bar chart?
c) For the Heart Transplant Study, which of those aspects would be more important to display? That is, which bar plot would be better as a data visualization?
We have previously used bar charts to visualize the distribution of two categorical variables. However,
in, e.g., a filled bar chart, it is impossible to identify the groups’ relative sizes. To visualize the values
in a contingency table for two variables, you can use geom_count().
This function creates a point for each combination of values from the two variables. The size of each
point corresponds to the number of observations associated with that combination. Rare combina-
tions will appear as small points, while more common combinations will be represented by larger
points.
ggplot(loans) +
geom_count(mapping = aes(x = homeownership, y = application_type)) +
labs(x = "Homeownership", y = "Application type")
[Count plot of homeownership vs. application type (joint, individual); the point size encodes the count n, ranging up to about 3000.]
We can argue based on this plot that the homeownership and application_type variables are asso-
ciated.
The distribution of individual applications across homeownership levels is unequal. The same is true
for joint applications.
Heat maps are a second way to visualize the relationship between two categorical variables. They
function similar to count plots, but use color fill instead of point size to display the number of obser-
vations in each combination.
ggplot2 does not provide a geom function for heat maps, but you can construct a heat map by plotting
the results of count() with geom_tile().
To do this, set the x and y aesthetics of geom_tile() to the variables that you pass to count(). Then
map the fill aesthetic to the n variable computed by count().
loans |>
count(homeownership, application_type) |>
ggplot(aes(x = homeownership, y = application_type, fill = n)) +
geom_tile() + labs(x = "Homeownership", y = "Application type")
[Heat map of homeownership vs. application type; the fill colour encodes the count n, ranging up to about 3000.]
Remark (Pie charts). Pie charts can work for visualizing a categorical variable with very few levels.
[Pie chart of homeownership: rent (3858), mortgage (4789), own (1353).]
However, they can be pretty tricky to read when used to visualize a categorical variable with many
levels, like grade.
[Pie chart of grade with seven levels (A–G); the many slices are hard to compare by eye.]
Hence, it would be best if you never used a pie chart. Use a bar chart instead.
[Bar chart of grade (A–G) showing the count for each level.]
Short summary
This chapter introduces methods for analysing categorical data. It uses the Lending Club loan
dataset to illustrate concepts such as frequency distributions, bar charts, and contingency
tables for examining the relationships between different categories. The document explains
how to visualise single and paired categorical variables using various graphical techniques,
including stacked, dodged, and filled bar charts, as well as count plots and heatmaps, while also
briefly discouraging the use of pie charts.
5 Exploring numerical data
In the beginning, we work again with the loan data from the Lending Club. But this time, we are
considering only a subsample of size 50. In addition, we select just some of the variables.
Let’s start by visualizing the shape of the distribution of a single variable. In these cases, a dot plot
provides the most basic of displays.
ggplot(
loans,
aes(x = interest_rate)
) +
labs(x = "Interest rate") +
geom_dotplot()
[Dot plot of interest rate (roughly 10% to 25%) for the 50 sampled loans.]
Remark. The rates have been rounded to the nearest percent in this plot.
Empirical mean
The empirical mean, often called the average or sample mean, is a common way to measure the
center of a distribution of data.
$$\bar{x}_n = \frac{x_1 + x_2 + \cdots + x_n}{n},$$
where $x_1, x_2, \dots, x_n$ represent the $n$ observed values. Sometimes it is convenient to drop the index $n$ and write just $\bar{x}$.
The population mean is often denoted as $\mu$. Sometimes a subscript, such as $\mu_x$, is used to indicate which variable the population mean refers to.
Often it is too expensive or even not possible (population data are rarely available) to measure the
population mean 𝜇 precisely. Hence we have to estimate 𝜇 using the sample mean 𝑥𝑛 .
5.1.1 Summarize
Although we cannot calculate the average interest rate across all loans in the populations, we can
estimate the population value using the sample data.
We can use summarize() from the dplyr package to summarize the data by computing the sample
mean of the interest rate:
loans |>
summarize(
mean_ir = mean(interest_rate)
)
# # A tibble: 1 x 1
# mean_ir
# <dbl>
# 1 11.6
The sample mean is a point estimate of 𝜇𝑥 . It’s not perfect, but it is our best guess of the average
interest rate on all loans in the population studied.
Later, we will discuss methods for assessing the accuracy of point estimates, which is necessary be-
cause accuracy varies with the sample size.
Remark. We could also have indexed the interest_rate with the $ notation and then applied mean()
to the result.
mean(loans$interest_rate)
# [1] 11.5672
Now we know that the average interest rate in the sample is equal to 11.5672. However, we would
expect that the interest rate varies with the grade of the loan.
Can we compute the sample mean for each level of grade in an easy way?
loans |>
  group_by(grade) |>
  summarize(
    mean_ir = mean(interest_rate)
  )
# # A tibble: 5 x 2
#   grade mean_ir
#   <fct>   <dbl>
# 1 A        6.77
# 2 B       10.2
# 3 C       13.8
# 4 D       18.6
# 5 E       25.6
After the group_by() step, all computations are performed separately for each level of grade.
We detect an increasing average interest rate with a decreasing grade.
Can we compute several statistics for each level of grade of all mortgage observations
quickly?
loans |>
  filter(homeownership == "mortgage") |>
  group_by(grade) |>
  summarize(
    mean_ir = mean(interest_rate),
    mean_la = mean(loan_amount),
    n = n()
  )
# # A tibble: 5 x 4
#   grade mean_ir mean_la     n
#   <fct>   <dbl>   <dbl> <int>
# 1 A        6.31  18286.     7
# 2 B       10.1   18370     10
# 3 C       13.0   25500      4
# 4 D       20.3   29333.     3
# 5 E       25.6   27200      2
Remark. These values should be viewed with great caution, as the sample size for several grades is very small.
Dot plots show the exact value for each observation. They are useful for small datasets but can become
hard to read with larger samples.
Especially for larger samples, we prefer to think of the value as belonging to a bin. For the loans
dataset, we created a table of counts for the number of loans with interest rates between 5.0% and
7.5%, then the number of loans with rates between 7.5% and 10.0%, and so on.
loans |>
pull(interest_rate) |>
cut(breaks = seq(5, 27.5, by = 2.5)) |>
table()
#
# (5,7.5] (7.5,10] (10,12.5] (12.5,15] (15,17.5] (17.5,20]
# 11 15 8 4 5 4
# (20,22.5] (22.5,25] (25,27.5]
# 1 1 1
ggplot(
loans,
aes(x = interest_rate)) +
geom_histogram(
breaks = seq(5, 27.5, 2.5),
colour = "white") +
labs(x = "Interest rate")
[Histogram of interest rate with bin width 2.5; counts range up to about 15.]
Histograms provide a view of the data density. Higher bars represent where the data are relatively
more common.
A smoothed-out histogram is known as a density plot.
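A density plot can be drawn with geom_density(); a minimal sketch (the axis labels are assumptions matching the figure below):

ggplot(loans, aes(x = interest_rate)) +
  geom_density() +
  labs(x = "Interest rate", y = "Density")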
[Density plot of interest rate; the density is highest around 10% and trails off towards 25%.]
Histograms, as well as density plots, are especially convenient for understanding the shape of the
data distribution.
Both plots suggest that most loans have rates under 15%, while only a handful have rates above 20%.
When the distribution of a variable trails off to the right in this way and has a longer right tail, the
shape is said to be right skewed.
[Histogram (left) and density plot (right) of interest rate, shown side by side.]
Variables with the reverse characteristic – a long, thinner tail to the left – are said to be left skewed.
Variables that show roughly equal trailing off in both directions are called symmetric.
5.2.1 Modality
[Three histograms illustrating different shapes of modality.]
Remark. The search for modes is not about finding a clear and correct answer to the number of
modes in a distribution, which is why prominent is not strictly defined. The most important part of
this investigation is to better understand your data.
The mean describes the center of a distribution. But we also need to understand the variability in
the data.
Here, we introduce two related measures of variability: the empirical variance and the empirical
standard deviation. The standard deviation roughly describes how far away the typical observation
is from the mean. We call the distance of an observation 𝑥𝑖 from its empirical mean 𝑥𝑛̄ its deviation
𝑥𝑖 − 𝑥𝑛̄ .
If we square these deviations and then take an average, the result is equal to the empirical variance.
Definition 5.2. Given a sample $x_1, \dots, x_n$, the empirical variance of the sample is defined as the average squared deviation from the empirical mean
$$s^2_{x,n} = \frac{\sum_{i=1}^{n} (x_i - \bar{x}_n)^2}{n-1}.$$
Remark. We divide by $n-1$ rather than $n$, since we average over $n-1$ "free" values. Indeed, from $n-1$ of the values $x_i - \bar{x}_n$ we can determine the last remaining value, because $\sum_{i=1}^{n} (x_i - \bar{x}_n) = 0$.
In practice, we wouldn’t use that formula for a larger dataset like this one. We would use the following
formula:
$$s^2_{x,n} = \frac{\sum_{i=1}^{n}(x_i - \bar{x}_n)^2}{n-1}
= \frac{\sum_{i=1}^{n} x_i^2 - 2\sum_{i=1}^{n} x_i \cdot \bar{x}_n + \sum_{i=1}^{n} \bar{x}_n^2}{n-1}
= \frac{\sum_{i=1}^{n} x_i^2 - 2n \cdot \bar{x}_n \cdot \bar{x}_n + n \bar{x}_n^2}{n-1}
= \frac{\sum_{i=1}^{n} x_i^2 - n \cdot \bar{x}_n^2}{n-1}.$$
Or just use R:
var(loans$interest_rate)
# [1] 25.52387
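As a quick check, the shortcut formula gives the same value as var(); a small sketch using the 50-observation sample from above:

x <- loans$interest_rate
n <- length(x)
(sum(x^2) - n * mean(x)^2) / (n - 1)  # same value as var(x)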
The empirical standard deviation is the square root of the empirical variance.
Definition 5.3. Given a sample $x_1, \dots, x_n$, the empirical standard deviation is defined as the square root of the empirical variance: $s_{x,n} = \sqrt{\frac{1}{n-1}\sum_{i=1}^{n}(x_i - \bar{x}_n)^2}$.
Remark. The subscripts $x$ and $n$ may be omitted if it is clear which sample the variance and standard deviation refer to. But in general, it's helpful to keep them as a reminder.
Summary
The empirical variance is the average squared distance from the mean, and the empirical stan-
dard deviation is its square root. The standard deviation is useful when considering how far the
data are distributed from the mean.
Like the mean, the population values for variance and standard deviation have typical symbols:
𝜎2 for the variance and 𝜎 for the standard deviation.
loans |>
summarize(
var_ir = var(interest_rate), sd_ir = sd(interest_rate)
)
# # A tibble: 1 x 2
# var_ir sd_ir
# <dbl> <dbl>
# 1 25.5 5.05
A box plot summarizes a dataset using five statistics while also identifying unusual observations.
The next figure contains a histogram alongside a box plot of the interest_rate variable.
[Histogram (top) and box plot (bottom) of interest rate, sharing the same horizontal axis from 5% to 25%.]
The dark line inside the box represents the empirical median.
5.4.1 Median
At least 50% of the data are less than or equal to the median, and at least 50% are greater than or equal
to it.
Definition 5.4. The empirical median is the value that splits the data in half when ordered in
ascending order.
Remark. When there is an odd number of observations, there will be precisely one observation that
splits the data into two halves, and in such a case, that observation is the median.
The empirical median can be computed in several ways when $n$ is an even number. One common approach is to define the empirical median of a sample $x_1, \dots, x_n$ to be the average $\frac{1}{2}\big(x_{(n/2)} + x_{(n/2+1)}\big)$, where $x_{(k)}$ is the $k$-th smallest value.
We can use median() to compute the empirical median in R. median() knows nine different methods
for calculating an empirical median. We will always use the default method.
The interest_rate dataset has an even number of observations, and its median is:
median(loans$interest_rate)
# [1] 9.93
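The averaging rule above can be checked by hand; a sketch for this sample, where n = 50 is even:

x <- sort(loans$interest_rate)
n <- length(x)
(x[n / 2] + x[n / 2 + 1]) / 2  # matches median(loans$interest_rate)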
Definition 5.5. The $k$-th percentile is a number such that at least $k\%$ of the observations are below or equal to it and at least $(100-k)\%$ are above or equal to it.
The box in a box plot represents the middle 50% of the data. The length of the box is called the interquartile range, or IQR for short:
$$IQR = Q_3 - Q_1,$$
where $Q_1$ and $Q_3$ are the first and third quartiles (the 25th and 75th percentiles).
Like the standard deviation, the IQR is a measure of variability in data. The more variable the data,
the larger the standard deviation and IQR.
IQR(loans$interest_rate)
# [1] 5.755
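The same value can be obtained from the quartiles directly; a small sketch:

q <- quantile(loans$interest_rate, probs = c(0.25, 0.75))
unname(q[2] - q[1])  # equals IQR(loans$interest_rate)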
[Box plot of interest rate: the box spans the IQR with the median marked inside, and the whiskers extend to the non-outlying values.]
The whiskers reach to the minimum and the maximum values in the data, unless there are points that are considered unusually high or unusually low: observations more than 1.5 times the IQR away from the first or the third quartile.
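The box plots shown in this section can be produced with geom_boxplot(); a minimal sketch for a single variable:

ggplot(loans, aes(x = interest_rate)) +
  geom_boxplot() +
  labs(x = "Interest rate")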
¾ Your turn
Create a box plot for the mass variable from the starwars dataset. Try to guess the correct
geom function. Let RStudio help you by typing rather slowly.
Hint: use library(tidyverse) since starwars is part of dplyr.
An outlier is an observation that appears extreme relative to the rest of the data. Examining data for outliers serves many useful purposes, such as identifying strong skew in the distribution, spotting possible data collection or entry errors, and providing insight into interesting properties of the data.
However, remember that some datasets have a naturally long skew, and outlying points do not represent any sort of problem in the dataset.
Side-by-side box plots are a common tool to compare the distribution of a numerical variable across groups.
ggplot(loans, aes(
  x = interest_rate,
  y = grade
)) +
  geom_boxplot() +
  labs(
    x = "Interest rate (%)",
    y = "Grade",
    title = "Interest rates of Lending Club loans",
    subtitle = "by grade of loan"
  )
[Side-by-side box plots of interest rate (%) by grade.]
When using a histogram, we can fill the bars with different colors according to the levels of the
categorical variable.
ggplot(loans, aes(
  x = interest_rate,
  fill = homeownership
)) +
  geom_histogram(binwidth = 2, colour = "white") +
  labs(
    x = "Interest rate (%)",
    title = "Interest rates of Lending Club loans"
  )
[Stacked histogram of interest rate (%), filled by homeownership (rent, mortgage, own).]
With the position argument of geom_histogram(), one can vary where to put the bars for the
different groups. The default is to put them on top of each other. The dodge position puts them next
to each other.
ggplot(loans, aes(
  x = interest_rate,
  fill = homeownership
)) +
  geom_histogram(binwidth = 2, colour = "white",
                 position = "dodge") +
  labs(
    x = "Interest rate (%)",
    title = "Interest rates of Lending Club loans"
  )
[Dodged histogram of interest rate (%) by homeownership (rent, mortgage, own).]
Another technique for comparing numerical data across different groups would be faceting.
ggplot(loans_full_schema,
       aes(x = interest_rate)) +
  geom_histogram(
    bins = 10, colour = "white") +
  facet_grid(term ~ homeownership)
[Faceted histograms of interest_rate by term (36, 60 months) and homeownership.]
Remark. We used the complete dataset in the above plot, not just 50 observations.
How are the sample statistics of interest_rate affected if the observation at 26.3% had instead been
• only 15%?
• even larger, say 35%?
[Three dot plots of interest rate: the original data, the data with 26.3% moved to 15%, and the data with 26.3% moved to 35%.]
We compute the median, IQR, empirical mean and empirical standard deviation for all three
datasets.
loan50_robust_check |>
group_by(Scenario) |>
summarise(
Median = median(interest_rate),
IQR = IQR(interest_rate),
Mean = mean(interest_rate),
SD = sd(interest_rate)
)
# # A tibble: 3 x 5
# Scenario Median IQR Mean SD
# <fct> <dbl> <dbl> <dbl> <dbl>
# 1 Original data 9.93 5.76 11.6 5.05
# 2 Move 26.3% to 15% 9.93 5.76 11.3 4.61
# 3 Move 26.3% to 35% 9.93 5.76 11.7 5.68
The median and IQR are called robust statistics because extreme observations/skewness have
little effect on their values.
On the other hand, the mean and standard deviation are more heavily influenced by changes in
extreme observations.
For skewed distributions it is often more helpful to use the median and IQR to describe the center and spread, since the mean is pulled in the direction of the longer tail and in general
mean ≠ median.
For symmetric distributions, it is often more helpful to use the mean and SD to describe the center and spread. It holds that
mean ≈ median.
¾ Your turn
Marathon winners. The histogram and box plots below show the distribution of finishing
times for male and female winners of the New York Marathon between 1970 and 1999.
[Histogram (left) and box plot (right) of the winning times, ranging from about 2.1 to 3.25 hours.]
a) What features of the distribution are apparent in the histogram and not the box plot?
What features are apparent in the box plot but not in the histogram?
c) Compare the distribution of marathon times for men and women based on the box plot
shown below.
[Side-by-side box plots of winning times by gender.]
A scatterplot provides a case-by-case view of data for two numerical variables. Let’s consider the
relation between annual_income and loan_amount.
[Scatterplot of annual income versus loan amount for the 50 sampled loans.]
Scatterplots are useful for quickly identifying associations between the variables under consideration, whether they are simple trends or more complex relationships.
5.6.1 Correlation
A measure of linear dependence between two variables is the empirical or sample correlation
coefficient of the two variables.
Definition 5.7. Given a paired sample $(x_1, y_1), \dots, (x_n, y_n)$, the empirical correlation coefficient of the sample is defined as
$$r_{(x,y),n} = \frac{\sum_{i=1}^{n}(x_i - \bar{x}_n)(y_i - \bar{y}_n)}{\sqrt{\sum_{i=1}^{n}(x_i - \bar{x}_n)^2 \sum_{i=1}^{n}(y_i - \bar{y}_n)^2}}
= \frac{\sum_{i=1}^{n}(x_i - \bar{x}_n)(y_i - \bar{y}_n)}{(n-1)\, s_{x,n}\, s_{y,n}}
= \frac{s_{xy,n}}{s_{x,n}\, s_{y,n}},$$
where $\bar{x}_n, \bar{y}_n$ and $s_{x,n}, s_{y,n}$ are the empirical means and standard deviations, respectively. $s_{xy,n}$ is called the empirical covariance of the sample.
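The definition can be checked against cov() and cor(); a small sketch using the two variables considered below:

x <- loans$annual_income
y <- loans$loan_amount
s_xy <- sum((x - mean(x)) * (y - mean(y))) / (length(x) - 1)  # empirical covariance, cov(x, y)
s_xy / (sd(x) * sd(y))                                        # equals cor(x, y)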
Calculating the correlation across all 50 observations of income and loan amount results in an empir-
ical correlation coefficient of
cor(loans$annual_income, loans$loan_amount)
# [1] 0.396303
This indicates, at most, a moderate linear relation. Computing the correlation coefficient for each
level of homeownership yields
loans |>
group_by(homeownership) |>
summarize(r = cor(annual_income, loan_amount))
# # A tibble: 3 x 2
# homeownership r
# <fct> <dbl>
# 1 rent 0.372
# 2 mortgage 0.308
# 3 own 0.954
Do we have a strong linear dependence between the two variables in case of owners?
loans_full_schema |>
group_by(homeownership) |>
summarize(r = cor(annual_income, loan_amount), n())
# # A tibble: 3 x 3
# homeownership r `n()`
# <fct> <dbl> <int>
# 1 MORTGAGE 0.284 4789
# 2 OWN 0.366 1353
# 3 RENT 0.331 3858
Creating a scatterplot of the annual income and the loan amount for the complete datasets leads to a
lot of overplotting due to the size of the dataset. In that case, a hexplot can be advantageous compared
to a scatterplot.
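A hex plot can be drawn with geom_hex(), which requires the hexbin package to be installed; a minimal sketch for the full dataset:

ggplot(loans_full_schema,
       aes(x = annual_income, y = loan_amount)) +
  geom_hex() +
  labs(x = "annual_income", y = "loan_amount")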
[Scatterplot (left) and hex plot (right) of annual_income versus loan_amount for the full dataset; in the hex plot, the fill colour encodes the count per hexagon, up to about 800.]
In the hexplot, we are given additional information (compared to the scatterplot) about the absolute
frequency of each hexagon.
Short summary
This chapter introduces fundamental techniques for analysing numerical information using a
dataset about loans for illustration. The text demonstrates how to visualise single variables
through dot plots, histograms, and density plots to understand their distribution, including con-
cepts like skewness and modality. It explains methods to quantify the centre and spread
of data using measures such as the mean, variance, and standard deviation, alongside robust
alternatives like the median and the IQR. Furthermore, the material covers comparing nu-
merical data across different categories using box plots and grouped visualisations. Finally,
it examines relationships between pairs of numerical variables using scatterplots and the
correlation coefficient to identify linear dependencies.
Part IV
Probability
6 Case study: Gender discrimination
Study design:
In 1972, as a part of a study on gender discrimination, 48 male bank supervisors were each
given the same personnel file and asked to judge whether the person should be promoted
to a branch manager job that was described as “routine”.
The files were identical except that half of the supervisors had files showing the person was
male while the other half had files showing the person was female.
It was randomly determined which supervisors got “male” and which got “female” applications.
See Rosen and Jerdee (1974) for more details.
Data: Of the 48 files reviewed, 35 were promoted. The study is testing whether females are unfairly
discriminated against.
At first glance, does there appear to be a relationship between promotion and gender?
Absolute counts:
gender_discrimination |>
table() |>
addmargins()
# decision
# gender promoted not promoted Sum
# male 21 3 24
# female 14 10 24
# Sum 35 13 48
gender_discrimination |>
table() |>
prop.table(margin = 1) |>
addmargins()
# decision
# gender promoted not promoted Sum
# male 0.8750000 0.1250000 1.0000000
# female 0.5833333 0.4166667 1.0000000
# Sum 1.4583333 0.5416667 2.0000000
Conclusion?
We saw a difference of almost 30% (29.2% to be exact) between the proportion of male and female files
that were promoted.
For the sample, we observe a promotion rate that is dependent on gender.
But at this stage, we don’t know how to decide which of the following statements could be true for
the population:
1. Promotion is dependent on gender; males are more likely to be promoted, and hence, there is
gender discrimination against women in promotion decisions.
2. The difference in the proportions of promoted male and female files is due to chance. This is
not evidence of gender discrimination against women in promotion decisions.
Promotion and gender are independent, no gender discrimination, observed difference in pro-
portions is simply due to chance. → Null hypothesis
Promotion and gender are dependent, there is gender discrimination, observed difference in
proportions is not due to chance. → Alternative hypothesis
The logic of a hypothesis test resembles a court trial. We start with two competing claims:
$H_0$: Defendant is innocent
$H_A$: Defendant is guilty
Then we judge the evidence - Could these data plausibly have happened by chance if the null
hypothesis were true?
If they were very unlikely to have occurred, then the evidence raises more than a reasonable doubt
in our minds about the null hypothesis.
Ultimately we must make a decision: What is too unlikely?
Conclusion:
If the evidence is not strong enough to reject the assumption of innocence, the jury returns with a
verdict of not guilty.
The jury does not say that the defendant is innocent, just that there is not enough evidence to
convict.
The defendant may, in fact, be innocent, but the jury has no way of being sure.
In statistical terms, we fail to reject the null hypothesis.
Note
We never declare the null hypothesis to be true, because we simply do not know whether
it’s true or not.
In a trial, the burden of proof is on the prosecution. In a hypothesis test, the burden of proof is on
the unusual claim.
The null hypothesis is the ordinary state of affairs (the status quo). So, it’s the alternative
hypothesis that we consider unusual and for which we must gather evidence.
We start with a null hypothesis 𝐻0 that represents the status quo and an alternative hypothesis
𝐻𝐴 that represents our research question, i.e., what we’re testing for.
Testing process:
1. Choose a summary (test) statistic $T$ and compute its observed value $T(\mathbf{x})$ from the data.
2. Under the assumption that the null hypothesis is true, compute how likely the observed value $T(\mathbf{x})$ is.
3. Decide if the test results suggest that the data provide convincing evidence against the null hypothesis. If that's the case, reject the null hypothesis in favor of the alternative.
The second step can be done via simulation (briefly now, more detailed later) or theoretical meth-
ods (later in the course).
Returning to our example of gender discrimination, we want to simulate the experiment under the
assumption of independence, i.e., leave things up to chance.
Two possible outcomes:
1. Results from the simulations based on the chance model look like the data.
We can conclude that the difference between the proportions of promoted files between males and
females was simply due to chance: promotion and gender are independent.
2. Results from the simulations based on the chance model do not look like the data.
We can conclude that the difference between the proportions of promoted files between males and
females was not due to chance, but due to an actual effect of gender: promotion and gender are
dependent.
Let’s start by recomputing the proportions in the four different categories to give the output as a
tibble.
Using the dplyr approach introduced in Chapter 3, we do the following:
1. Summarize the dataset by computing the sample size for each combination of gender and decision.
2. Compute the proportion for each level of gender.
Remark. We specify the grouping through the .by argument of summarize() and mutate().
A reasonable summary statistic 𝑇 is the difference in proportions for the groups of promoted females
and males.
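The code that produced the output below is not reproduced on this page; a sketch consistent with the steps just described (the column name diff_in_prop is taken from the output) might look like this:

gender_discrimination |>
  summarise(n = n(), .by = c(gender, decision)) |>
  mutate(prop = n / sum(n), .by = gender) |>
  arrange(gender) |>
  filter(decision == "promoted") |>
  summarize(diff_in_prop = diff(prop))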
# # A tibble: 1 x 1
# diff_in_prop
# <dbl>
# 1 -0.292
Under the assumption of independence between gender and decision, the information about gender
has no influence on decision. To decide if the observed difference in proportions is unusual under
this assumption, we need to simulate differences in proportions under independence between the two
variables.
Idea:
Assign the decision independent of the gender. We achieve this by randomly permuting
the variable gender while leaving decision as it is. If the two variables are independent, the
value of the statistic should be comparable.
gender_discrimination
# # A tibble: 48 x 2
# gender decision
# <fct> <fct>
# 1 male promoted
# 2 male promoted
# 3 male promoted
# 4 male promoted
# 5 male promoted
# 6 male promoted
# 7 male promoted
# 8 male promoted
# 9 male promoted
# 10 male promoted
# # i 38 more rows
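The permutation step that produces df_perm_perc is not shown on this page; a sketch of how it could be constructed from the steps used in the for loop below (the name df_perm_perc is taken from the code that follows):

df_perm_perc <- gender_discrimination |>
  mutate(gender = sample(gender, n())) |>          # permute the gender labels
  summarise(n = n(), .by = c(gender, decision)) |> # counts per combination
  mutate(prop = n / sum(n), .by = gender) |>       # proportions within gender
  arrange(gender)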
df_perm_perc |>
filter(decision == "promoted") |>
summarize(diff(prop))
# # A tibble: 1 x 1
# `diff(prop)`
# <dbl>
# 1 0.125
and hence very different from the observed difference. But doing this once does not provide any evidence in favor of the alternative. Hence, we need to repeat those steps several times.
Let's repeat everything N = 100 times in a for loop. We initialize a vector of length N and then run the for loop with the previous commands.
N <- 100
diff_stat_perm <- numeric(N)

for (i in 1:N) {
  diff_stat_perm[i] <- gender_discrimination |>
    # permute the labels
    mutate(
      gender = sample(gender, nrow(gender_discrimination))
    ) |>
    summarise(n = n(), .by = c(gender, decision)) |>
    # compute the proportions
    mutate(prop = n / sum(n), .by = gender) |>
    arrange(gender) |>
    filter(decision == "promoted") |>
    # compute diff of props
    summarize(diff_prop = diff(prop)) |>
    pull() # pull the diff out of the tibble
}
The proportion of permuted samples that are at least as extreme as the original data describes how unusual the observed difference is under independence. Here we observed a difference (in absolute value) at least as large as 0.29 in only 3 of the N = 100 simulation runs.
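A sketch of how this relative frequency could be computed from the simulated values (0.292 is the absolute value of the observed difference):

mean(abs(diff_stat_perm) >= 0.292)
# e.g. 0.03 when 3 of the 100 permuted differences are at least as extreme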
Do the simulation results provide convincing evidence of gender discrimination against women, i.e., de-
pendence between gender and promotion decisions?
Yes, the data provide convincing evidence for the alternative hypothesis of gender discrimination
against women in promotion decisions. We conclude that the observed difference between the two
proportions was due to a real effect of gender.
The relative frequency 0.03 can also be understood as an estimate of the probability that the
statistic takes on a value at least as extreme as the observed value under a so-called null distri-
bution. Before we are able to discuss null distributions and related topics in statistical inference
(in later chapters), we have to rigorously discuss the concept of randomness in Chapter 7.
7 Probability
Probability theory forms an important part of the foundations of statistics. When doing inferen-
tial statistics we need to measure / quantify uncertainty.
Probability theory is the mathematical tool to describe uncertain / random events.
Note
The study of probability arose in part from an interest in understanding games of chance, such as
cards or dice. We will use the intuitive setting of these games to introduce some of the concepts.
Definition 7.1. A random process is one in which we know all possible outcomes, but we cannot
predict which outcome will occur.
Introductory examples:
Tossing a coin
sample(c("Heads", "Tails"), 1)
# [1] "Heads"
or rolling a die
sample(1:6, 1)
# [1] 2
Definition 7.2. The sample space 𝑆 is the set of all possible outcomes of a random process.
Further examples:
Starting from the sample space and its elementary events, we can construct more complex events
𝐴 by forming a union of elementary events. In other words, general events 𝐴 are subsets of the
sample space 𝑆.
Example: Observing an even number when rolling a die is then represented by the event 𝐴 =
{2, 4, 6}.
For each event we are able to define its complement with respect to the sample space.
Definition 7.3. Let 𝐴 be an event from a sample space 𝑆. Then we call 𝐴𝑐 the complementary
event of 𝐴, if their union is equal to the sample space
𝐴 ∪ 𝐴𝑐 = 𝑆 .
The event 𝐴 and its complement 𝐴𝑐 are therefore mutually exclusive or disjoint.
There are several possible interpretations of probability but they all agree on the mathematical rules
probability must follow.
Rules
Let 𝑆 be the sample space, 𝐸 a single event and 𝐸1 , 𝐸2 , ... a disjoint sequence of events. Let P
be a function that assigns a probability P(𝐸) to each 𝐸 ⊆ 𝑆. Then it should hold that:
1. The probability of any event is between 0 and 1: $0 \le P(E) \le 1$.
2. The probability that at least one (elementary) event occurs is one: $P(S) = 1$.
3. $P\left(\bigcup_{i=1}^{\infty} E_i\right) = \sum_{i=1}^{\infty} P(E_i)$ (addition rule of disjoint events).
Remark. A direct consequence of 2. and 3. is the following. For an event 𝐴 and its complement 𝐴𝑐
it holds that
P(𝐴) + P(𝐴𝑐 ) = P(𝑆) ⇒ P(𝐴𝑐 ) = 1 − P(𝐴) .
Frequentist interpretation:
The probability of an outcome is the proportion of times the outcome would occur if we observed
the random process an infinite number of times.
Bayesian interpretation:
A Bayesian interprets probability as a subjective degree of belief. For the same event, two separate
people could have different viewpoints and so assign different probabilities.
The Bayesian probabilist specifies a prior probability. This, in turn, is then updated to a posterior
probability in the light of new, relevant data/evidence.
Probability can be illustrated by rolling a die many times. Let $\hat{p}_n$ be the proportion of outcomes that are 1 out of the first $n$ rolls. As the number of rolls increases, $\hat{p}_n$ will converge to the probability of rolling a 1, $p = \frac{1}{6}$. The tendency of $\hat{p}_n$ to stabilize around $p$ is described by the Law of Large Numbers (LLN).
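The stabilization of the running proportion can be simulated directly; a small sketch (the seed and number of rolls are arbitrary assumptions):

set.seed(1)
rolls <- sample(1:6, size = 10000, replace = TRUE)
p_hat <- cumsum(rolls == 1) / seq_along(rolls)  # running proportion of ones
tail(p_hat, 1)                                   # close to 1/6 for large n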
[Plot of the running proportion of ones against the number of rolls; the proportion stabilizes around 1/6.]
When tossing a fair coin, if heads comes up on each of the first 10 tosses
HHHHHHHHHH
what do you think the chance is that another head will come up on the next toss?
0.5, less than 0.5, or more than 0.5?
The probability is still 0.5, i.e., there is still a 50% chance that another head will come up on the next
toss:
The common misunderstanding of the LLN is that random processes are supposed to compen-
sate for whatever happened in the past. This is not true and also called gambler’s fallacy or
law of averages.
What is the probability of drawing a jack or a red card from a well shuffled full deck?
(a) Full card deck. Figure from www.milefoot.com, and edited by OpenIntro.
The jack of hearts and the jack of diamonds are, of course, also red cards. Adding the probabilities for a jack and for a red card would count these two jacks twice. Therefore, the probability of being one of these two jacks needs to be subtracted when calculating:
$$P(\text{jack or red}) = P(\text{jack}) + P(\text{red}) - P(\text{jack and red}) = \frac{4}{52} + \frac{26}{52} - \frac{2}{52} = \frac{28}{52}.$$
In general, the addition rule reads $P(A \cup B) = P(A) + P(B) - P(A \cap B)$.
Note
For disjoint events $A$ and $B$ we have $P(A \cap B) = 0$, so the above formula simplifies to $P(A \cup B) = P(A) + P(B)$.
¾ Your turn
What is the probability that a student, randomly sampled out of a population of size 165, thinks
marijuana should be legalized (No, Yes) or they agree with their parents’ political views (No,
Yes) ?
A $\frac{40 + 36 - 78}{165}$
B $\frac{114 + 118 - 78}{165}$
C $\frac{78}{165}$
D $\frac{114 + 118}{165}$
Definition 7.4. A discrete probability distribution is a list of all possible elementary events 𝑥𝑖
(countably many) and the probabilities P({𝑥𝑖 }) with which they occur.
Remark.
• An event $A$, e.g., drawing a jack, can be the union of several elementary events. For drawing a jack, we have $P(A) = 4 \cdot \frac{1}{52} = \frac{4}{52} = \frac{1}{13}$, the sum of the probabilities of the four elementary events (one jack per suit).
7.5 Independence
Example 7.1.
• Knowing that the coin landed on a head on the first toss does not provide any valuable infor-
mation for determining what the coin will land on in the second toss.
⇒ Outcomes of two tosses of a coin are independent.
• Knowing that the first card drawn from a deck is an ace does provide helpful information for
determining the probability of drawing an ace in the second draw.
⇒ Outcomes of two draws from a deck of cards (without replacement) are dependent.
Two events $A$ and $B$ are independent if $P(A \cap B) = P(A) \cdot P(B)$, which is equivalent (see next section) to saying: $A$ and $B$ are independent if and only if $P(A|B) = P(A)$.¹
Example 7.2. Consider the random process of throwing a fair coin twice. Let 𝐴 and 𝐵 be the events
of 𝐻 (head) in the first and second toss, respectively. The sample space for this random process is
equal to 𝑆 = {(𝐻, 𝑇 ), (𝐻, 𝐻), (𝑇 , 𝑇 ), (𝑇 , 𝐻)}. This implies
$$P(A \cap B) = P(\{(H,H)\}) = \frac{1}{4} = \frac{1}{2} \cdot \frac{1}{2}
= P(\{(H,T)\} \cup \{(H,H)\}) \cdot P(\{(H,H)\} \cup \{(T,H)\}) = P(A) \cdot P(B),$$
which shows that 𝐴 and 𝐵 are independent (confirms the intuition).
¹ Later we introduce the notation $P(A|B) = P(A \text{ occurs, given } B)$.
More generally, several events $A_1, \dots, A_n$ are independent if the probability of any intersection formed from the events is the product of the individual event probabilities. In particular,
$$P(A_1 \cap \dots \cap A_n) = P(A_1) \cdot \ldots \cdot P(A_n).$$
¾ Your turn
In a multiple choice exam, there are 5 questions and 4 choices for each question (a, b, c, d).
Nancy has not studied for the exam at all and decides to randomly guess the answers. What is
the probability that:
Nancy guesses the answer to five multiple choice questions, with four choices for each question.
What is the probability that she gets at least one question right?
Let 𝑄𝑘 be the event, that the answer to the k-th question is correct.
We are interested in the event:
So we can divide up the sample space into two categories: 𝑆 = {𝐴, 𝐴𝑐 }, where
$$P(A) = 1 - P(A^c) = 1 - P(Q_1^c \cap \dots \cap Q_5^c) = 1 - P(Q_1^c) \cdot \ldots \cdot P(Q_5^c) = 1 - 0.75^5 \approx 0.7627.$$
¾ Your turn
Roughly 20% of undergraduates at a university are vegetarian or vegan. What is the probability
that, among a random sample of 3 undergraduates, at least one is neither vegetarian nor vegan?
Relapse study:
Researchers randomly assigned 72 chronic users of cocaine into three groups:
• desipramine (antidepressant)
• lithium (standard treatment for cocaine)
• placebo.
group         relapse   no relapse   total
desipramine   10        14           24
lithium       18        6            24
placebo       20        4            24
total         48        24           72
We can think of the above table as describing the joint distribution of the two random processes
group and outcome, which can take values in {desipramine, lithium, placebo} and {relapse, no relapse},
respectively.
Using the joint distribution, we can answer questions such as:
What is the probability that a randomly selected patient received the antidepressant (desipramine) and
relapsed?
$$P(\text{outcome = relapse and group = desipramine}) = \frac{10}{72} \approx 0.14$$
Focusing on just one of the random processes, means working with the marginal distribution. As
an example, consider the following question.
What is the probability that a randomly selected patient relapsed?
$$P(\text{outcome = relapse}) = \frac{48}{72} \approx 0.67$$
Definition 7.8. Let P be a probability measure, and let 𝐴, 𝐵 be two events from a sample space 𝑆,
with P(𝐵) > 0. Then the conditional probability of the event 𝐴 (outcome of interest) given the
event 𝐵 (condition) is defined as
$$P(A \mid B) := \frac{P(A \cap B)}{P(B)}.$$
If we know that a patient received the antidepressant desipramine, what is the probability that they relapsed?
$$P(\text{relapse} \mid \text{desipramine}) = \frac{P(\text{relapse and desipramine})}{P(\text{desipramine})} = \frac{10/72}{24/72} = \frac{10}{24} \approx 0.42.$$
¾ Your turn
For each one of the three treatments, if we know that a randomly selected patient received this treatment, what is the probability that they relapsed?
group         relapse   no relapse   total
desipramine   10        14           24
lithium       18        6            24
placebo       20        4            24
total         48        24           72
Earlier we saw that if two events are independent, their joint probability is simply the product
of their probabilities. If the events are not believed to be independent, then the dependence is
reflected in the calculation of the joint probability.
If $A$ and $B$ represent two outcomes or events, then the general multiplication rule states
$$P(A \cap B) = P(A \mid B) \cdot P(B).$$
Remark. Note that this formula is simply the conditional probability formula, rearranged.
Consider the following (hypothetical) distribution of gender and major of students in an introduc-
tory statistics class:
gender    social science   non-social science   total
female    30               20                   50
male      30               20                   50
total     60               40                   100
The probability that a randomly selected student is a social science major is $P(\text{major = sosc}) = \frac{60}{100} = 0.6$, while $P(\text{major = non-sosc}) = 0.4$.
The probability that a randomly selected student is a social science major given that they are female is
$$P(\text{major = sosc} \mid \text{gender = female}) = \frac{30}{50} = 0.6.$$
Since $P(\text{major = sosc} \mid \text{gender = male})$ also equals 0.6 and
$$P(A \mid B) = \frac{P(A \cap B)}{P(B)} = P(A) \;\Rightarrow\; P(A \cap B) = P(A) \cdot P(B),$$
major and gender are independent in this (hypothetical) class.
American Cancer Society estimates that about 1.7% of women have breast cancer.
Susan G. Komen for The Cure Foundation states that mammography correctly identifies about 78 %
of women who truly have breast cancer.
An article published in 2003 suggests that up to 10% of all mammograms result in false positives for
patients who do not have cancer.
When a patient goes through breast cancer screening there are two competing claims:
• the patient has cancer
• the patient does not have cancer
Let $C$ describe whether the patient has cancer or not, and let $M \in \{+, -\}$ be the result of the mammogram. Then we are interested in the probability
$$P(C = \text{yes} \mid M = +).$$
Note
Tree diagrams are useful for inverting probabilities. We are given P(𝑀 = +|𝐶 = yes) and
ask for P(𝐶 = yes|𝑀 = +).
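Using the numbers quoted above (1.7% prevalence, 78% correct positives, 10% false positives), the inverted probability can be computed directly; a sketch in R:

p_cancer  <- 0.017   # P(C = yes)
p_pos_yes <- 0.78    # P(M = + | C = yes)
p_pos_no  <- 0.10    # P(M = + | C = no)
p_pos <- p_pos_yes * p_cancer + p_pos_no * (1 - p_cancer)  # law of total probability
p_pos_yes * p_cancer / p_pos                               # P(C = yes | M = +), roughly 0.12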
¾ Your turn
Suppose a woman who gets tested once and obtains a positive result wants to get tested again.
In the second test, what should we assume to be the probability of this specific woman having
cancer?
A 0.017
B 0.12
C 0.0133
D 0.88
¾ Your turn
What is the probability that this woman has cancer if this second mammogram also yielded a
positive result?
The conditional probability formula we have seen so far is a special case of Bayes’ Theorem, which is
applicable even when events are defined by variables that have more than just two outcomes.
In the previous example, we calculated the probability P(𝑀 = +) by summing the probabilities of
two disjoint events: obtaining a positive test result for a person with cancer and for a person without
cancer. This rule is an application of the law of total probability, which is given in the next theorem.
Theorem 7.1. Assume $A_1, \dots, A_k$ are a partition of the sample space $S$, i.e., the events $A_1, \dots, A_k$ are all disjoint, $P(A_j) > 0$ and $\bigcup_{i=1}^{k} A_i = S$. Let $B$ be any other event; then the law of total probability says
$$P(B) = P(B \cap A_1) + P(B \cap A_2) + \dots + P(B \cap A_k)
= P(B \mid A_1)P(A_1) + P(B \mid A_2)P(A_2) + \dots + P(B \mid A_k)P(A_k)
= \sum_{j=1}^{k} P(B \mid A_j)\, P(A_j).$$
Bayes' Theorem is an application of the conditional probability definition, $P(A_j \mid B) = \frac{P(A_j \cap B)}{P(B)}$, along with the law of total probability.
Remark. We think of 𝐴1 , … , 𝐴𝑘 as all possible (disjoint) outcomes of one random process and 𝐵 is
the outcome of a second random process.
A common epidemiological model for the spread of diseases is the SIR model, where the popula-
tion is partitioned into three groups: Susceptible, Infected, and Recovered.
This is a reasonable model for diseases like chickenpox, where a single infection usually provides
immunity to subsequent infections. Sometimes these diseases can also be difficult to detect.
Imagine a population in the midst of an epidemic, where 60% of the population is considered sus-
ceptible, 10% is infected, and 30% is recovered.
The only test for the disease is accurate 95% of the time for susceptible individuals, 99% for in-
fected individuals, but 65% for recovered individuals.
Note: In this case accurate means returning a negative result for susceptible and recovered individuals
and a positive result for infected individuals.
¾ Your turn
Draw a probability tree to reflect the information given above. If the individual has tested
positive, what is the probability that they are actually infected?
The concept of random variables is a helpful and intuitive tool to describe a random process.
Definition 7.9. Let 𝑆 be a sample space and P a probability measure. Then we call a real-valued
function
𝑋 ∶ 𝑆 → R; 𝑠 ↦ 𝑋(𝑠)
a random variable. $X$ is called a discrete random variable if the set $X(S) \subset \mathbb{R}$ is finite or countably infinite. Otherwise, $X$ is called a continuous random variable.
Informal interpretation
We take a measurement 𝑋 for a sample unit 𝑠. Each sample unit 𝑠 contains uncertainty, which
carries over to the measurement 𝑋(𝑠).
Remark. We often write 𝑋 instead of 𝑋(𝑠). The values of random variables are denoted with a
lowercase letter. So, for a discrete random variable 𝑋, we may write, for example, P(𝑋 = 𝑥) for the
probability that the sampled value of 𝑋(𝑠) is equal to 𝑥.
Definition 7.10. Let $X$ be a discrete random variable with outcome values $x_1, \dots, x_k$ and corresponding probabilities $P(X = x_1), \dots, P(X = x_k)$. Then we call the weighted average of the possible outcomes
$$E[X] = \sum_{i=1}^{k} x_i\, P(X = x_i)$$
the expected value of $X$.
Remark.
Example 7.3. In a game of cards, you win one dollar if you draw a heart, five dollars if you draw an
ace (including the ace of hearts), ten dollars if you pull the king of spades and nothing for any other
card you draw.
The random variable $X$ representing the winnings in this card game has distribution
$$P(X = 1) = \frac{12}{52}, \quad P(X = 5) = \frac{4}{52}, \quad P(X = 10) = \frac{1}{52}, \quad P(X = 0) = \frac{35}{52}.$$
¾ Your turn
A casino game costs 5 Dollars to play. If the first card you draw is red, then you get to draw a
second card (without replacement). If the second card is the ace of clubs, you win 500 Dollars.
If not, you don’t win anything, i.e. lose your 5 Dollars. What is your expected profit/loss from
playing this game?
Hint: The random variable
7.7.2 Variability
In addition to knowing the average value of a random experiment, we are also often interested in the
variability of the values of a random variable.
Definition 7.11. Let $X$ be a discrete random variable with outcome values $x_1, \dots, x_k$, probabilities $P(X = x_1), \dots, P(X = x_k)$ and expected value $E[X]$. Then we call the weighted average of squared distances
$$\mathrm{Var}[X] = \sum_{i=1}^{k} (x_i - E[X])^2\, P(X = x_i)$$
the variance of $X$. Its square root, $\mathrm{SD}[X] = \sqrt{\mathrm{Var}[X]}$, is the standard deviation of $X$.
Remark.
Example 7.4. For the card game from Example 7.3, how much would you expect the winnings to
vary from game to game? Using
$$P(X = 0) = \frac{35}{52}, \quad P(X = 1) = \frac{12}{52}, \quad P(X = 5) = \frac{4}{52}, \quad P(X = 10) = \frac{1}{52}, \quad \text{and} \quad E[X] = \frac{42}{52},$$
we get
$$\begin{aligned}
\mathrm{Var}[X] &= \sum_{i=1}^{4}(x_i - E[X])^2\, P(X = x_i) \\
&= (0 - E[X])^2 \cdot P(X = 0) + (1 - E[X])^2 \cdot P(X = 1) + (5 - E[X])^2 \cdot P(X = 5) + (10 - E[X])^2 \cdot P(X = 10) \\
&= \left(0 - \tfrac{42}{52}\right)^2 \cdot \tfrac{35}{52} + \left(1 - \tfrac{42}{52}\right)^2 \cdot \tfrac{12}{52} + \left(5 - \tfrac{42}{52}\right)^2 \cdot \tfrac{4}{52} + \left(10 - \tfrac{42}{52}\right)^2 \cdot \tfrac{1}{52} \approx 3.425,
\end{aligned}$$
and $\mathrm{SD}[X] = \sqrt{3.425} \approx 1.85$.
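The same numbers can be reproduced in R; a small sketch using the distribution from Example 7.3:

x <- c(0, 1, 5, 10)
p <- c(35, 12, 4, 1) / 52
ex <- sum(x * p)           # expected value, 42/52
vx <- sum((x - ex)^2 * p)  # variance, approx. 3.425
sqrt(vx)                   # standard deviation, approx. 1.85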
To describe the variability of a linear combination of random variables, we need to determine the variance of the linear combination.
The variance of the linear combination $aX + bY$ of the random variables $X$ and $Y$ is calculated as
$$\mathrm{Var}[aX + bY] = a^2\, \mathrm{Var}[X] + b^2\, \mathrm{Var}[Y] + 2ab\, \mathrm{Cov}[X, Y],$$
where
$$\mathrm{Cov}[X, Y] = E[X \cdot Y] - E[X] \cdot E[Y]$$
is the covariance of the random variables 𝑋 and 𝑌 . The covariance is a measure of linear depen-
dence between two random variables.
Definition 7.12. Let 𝑋 and 𝑌 be two random variables. We call two random variables uncorrelated
if and only if
Cov[𝑋, 𝑌 ] = 0 .
For pairwise uncorrelated random variables $X_1, \dots, X_k$ the variance of the linear combination $\sum_{i=1}^{k} a_i X_i$ is equal to
$$\mathrm{Var}\left[\sum_{i=1}^{k} a_i X_i\right] = \sum_{i=1}^{k} a_i^2\, \mathrm{Var}[X_i].$$
Being uncorrelated is a weaker property compared to being independent. The definition of indepen-
dence between two random variables relies on the independence of events, see Definition 7.6.
Definition 7.13. Let $X: S \to \mathbb{R}$ and $Y: S \to \mathbb{R}$ be two random variables and $A, B \subseteq \mathbb{R}$. The random variables $X$ and $Y$ are called independent if the events $\{X \in A\}$ and $\{Y \in B\}$ are independent for all $A, B \in \mathcal{F}$, i.e.,
$$P(X \in A,\, Y \in B) = P(X \in A) \cdot P(Y \in B), \qquad (7.1)$$
where $\mathcal{F}$ is a collection² of subsets of $\mathbb{R}$, such that the probabilities in (Equation 7.1) are well-defined.
² Such a collection is called a $\sigma$-algebra and needs to fulfill certain properties. For a discrete r.v. you can think of it as the power set.
Remark. One can show that the independence between random variables implies uncorrelatedness.
Hence, when two random variables are independent, they are also uncorrelated. The other way
around does not hold in general.
The covariance measures linear dependence. However, it is influenced by the scale of the random
variables 𝑋 and 𝑌 . So, we could say that they are linearly dependent, but specifying the strength is
not straightforward. Therefore, we introduce a scale-free measure of linear dependence.
Definition 7.14. Let $X: S \to \mathbb{R}$ and $Y: S \to \mathbb{R}$ be two random variables. Then, the correlation between $X$ and $Y$ is defined by
$$\mathrm{Corr}[X, Y] = \frac{\mathrm{Cov}[X, Y]}{\sqrt{\mathrm{Var}[X]} \cdot \sqrt{\mathrm{Var}[Y]}} \in [-1, 1].$$
¾ Your turn
A company has 5 Lincoln Town Cars in its fleet. Historical data show that annual maintenance
cost for each car is on average 2154 Dollars with a standard deviation of 132 Dollars. What is the mean
and the standard deviation of the total annual maintenance cost for this fleet?
Note: you can assume that the annual maintenance costs of the five cars are uncorrelated.
Up until now, we have always considered discrete probability distributions. The goal of this section is
to introduce continuous probability distributions. We want to motivate their definition by considering
the effect of an increasing population size, which allows us to measure the outcome of the random
process on finer and finer levels.
Consider the height (in cm) of a randomly selected adult and the event
$$A = \{\text{a randomly selected adult has a height between 180 and 184 cm}\} = [180, 184].$$
The height of the bar corresponding to the interval 𝐴 = [180, 184] will tell us something about the
probability P(𝐴). The fraction of observations falling into this height range will be an approximation
of the probability P(𝐴).
[Histogram of 100,000 simulated heights (140 to 200 cm), with the bin from 180 to 184 cm highlighted.]
By looking at the counts in the plot we get $P(A) \approx \frac{10000}{100000} = 0.1$. The precise number would be $P(A) = \frac{9852}{100000} = 0.09852$.
As height is a continuous numerical variable, the size of the bins in the histogram can be made smaller
as the sample size increases.
[Histogram of heights for a larger sample with narrower bins.]
Now we see the distribution of heights for a sample of size 1000000. We will visualize the distribution
in the limit using a curve.
The curve is called the density or density function of the distribution and is denoted by 𝑓.
[Density curve of the height distribution, peaking at roughly 0.03 around 170 cm.]
Remark. As we form the limit, the y-axis in each histogram is rescaled such that the total area of all
bars is equal to 1. Thus, the area under a density equals one.
[Density function f(x) of height; the total area under the curve equals one.]
Returning to the probability of the event $A = [180, 184]$, that a randomly selected adult has a height between 180 and 184 cm:
Using the density, we can calculate the probability as follows: $P(A) = \int_{180}^{184} f(x)\,\mathrm{d}x$.
Note
• The distribution function at 𝑥 is the probability of the event/interval (−∞, 𝑥], i.e., 𝐹 (𝑥) =
P((−∞, 𝑥]).
• We already know $0 \le P(B) \le 1$ for any event $B$. In particular, we know that the probability of "all possible events" is one, i.e.,
$$P(\mathbb{R}) = \int_{-\infty}^{\infty} f(x)\,\mathrm{d}x = 1.$$
So, $f$ needs to integrate to 1 and must be non-negative, because otherwise $P(A) > P(B)$ with $A \subset B$ could be the case, which is not allowed (= "doesn't make sense").
Also in the continuous case, we are often interested in the expected or average outcome of a random
variable.
Definition 7.16. Let $X$ be a random variable with continuous d.f. $F$ and corresponding density $f$ defined on $\mathbb{R}$. Then we call the weighted average of the possible outcomes
$$E[X] = \int_{\mathbb{R}} x \cdot f(x)\,\mathrm{d}x$$
the expected value of $X$ (or of $F$).
Definition 7.17. Let $X$ be a random variable with continuous d.f. $F$ and corresponding density $f$. Then we call the weighted average of squared distances
$$\mathrm{Var}[X] = \int_{\mathbb{R}} (x - E[X])^2 \cdot f(x)\,\mathrm{d}x$$
the variance of $X$ (or of $F$).
Note
The formula Var[𝑋] = E[𝑋 2 ] − E[𝑋]2 holds for continuous and discrete random variables.
¾ Your turn
In the production of cylinder pistons, the manufacturing process ensures that the deviations of
the diameter upwards or downwards are at most equal to 1.
We interpret the deviations in the current production deviations as realizations of a random
variable 𝑋 with density
$$f(x) = \frac{3}{4}(1 - x^2) \cdot \mathbf{1}_{[-1,1]}(x), \qquad \text{where } \mathbf{1}_A(x) = \begin{cases} 1, & x \in A, \\ 0, & x \notin A. \end{cases}$$
In the manufacturing process, there should be an average deviation of zero. Decide if this is the case for the distribution with density $f(x) = \frac{3}{4}(1 - x^2) \cdot \mathbf{1}_{[-1,1]}(x)$. Justify your answer.
¾ Your turn
How much variability in deviations can be expected based on this distribution? Quantify by
computing the variance of the distribution.
Quantiles
In statistical inference, quantiles are used to construct interval estimates for an unknown parameter
or to specify the critical value in a hypothesis test. In these cases, we compute quantiles of a normal
distribution as well as of other distributions. Therefore, we define the q-quantile in a general way: the q-quantile $x_q$ is the value for which the probability of observing an outcome less than or equal to $x_q$ equals $q$.
Graphically, 𝑞 is the area below the probability density curve to the left of the q-quantile 𝑥𝑞 .
Figure 7.2: The gray area represents a probability of 0.9 based on a standard normal distribution.
Let's consider the most prominent example of a continuous distribution in a bit more detail than the others. We defined (see Definition A.8) the distribution in the following way:
Let $\mu \in \mathbb{R}$ and $\sigma > 0$. The normal distribution with mean $\mu$ and variance $\sigma^2$ is the continuous distribution on $\mathbb{R}$ with density function
$$f(x) = \frac{1}{\sqrt{2\pi\sigma^2}}\, \mathrm{e}^{-\frac{(x-\mu)^2}{2\sigma^2}}, \qquad x \in \mathbb{R},$$
and we will denote it by $\mathcal{N}(\mu, \sigma^2)$.
Remark.
• Many variables are nearly normally distributed (although none are exactly normal). While not
perfect for any single problem, normal distributions are useful for a variety of problems.
• The normal density has a well-known bell shape. In particular, it is unimodal and symmetric around the mean $\mu$.
A normal distribution with mean $\mu = 0$ and variance $\sigma^2 = 1$ is called the standard normal distribution; in symbols, $\mathcal{N}(0, 1)$.
[Density of the standard normal distribution N(0, 1).]
If we vary the mean 𝜇, the density will be shifted along the x-axis
[Normal densities shifted along the x-axis for different values of the mean µ.]
[Normal densities with the same mean but different variances; a larger variance gives a flatter, wider curve.]
Linear combinations
Let 𝑋_1 ∼ N(𝜇_1, 𝜎_1²), … , 𝑋_𝑘 ∼ N(𝜇_𝑘, 𝜎_𝑘²) be 𝑘 independent normally distributed random variables. Then, each linear combination of these 𝑘 random variables is also normally distributed:
𝑎 + ∑_{𝑖=1}^{𝑘} 𝑏_𝑖 𝑋_𝑖 ∼ N( 𝑎 + ∑_{𝑖=1}^{𝑘} 𝑏_𝑖 𝜇_𝑖 , ∑_{𝑖=1}^{𝑘} 𝑏_𝑖² 𝜎_𝑖² ) .
Example 7.5. Consider a single r.v. 𝑋_1 ∼ N(𝜇_1, 𝜎_1²), i.e., 𝑘 = 1. Let 𝑎 = −𝜇_1/𝜎_1 and 𝑏_1 = 1/𝜎_1, and define
𝑍 = 𝑎 + 𝑏_1 𝑋_1 = −𝜇_1/𝜎_1 + 𝑋_1/𝜎_1 = (𝑋_1 − 𝜇_1)/𝜎_1 .
Then, we get
𝑍 = (𝑋_1 − 𝜇_1)/𝜎_1 ∼ N( −𝜇_1/𝜎_1 + 𝜇_1/𝜎_1 , (1/𝜎_1²) ⋅ 𝜎_1² ) = N(0, 1) .
So, 𝑍 has standard normal distribution. It is called a standardized score, or Z score.
Remark.
1. The Z-score of an observation represents the number of standard deviations it deviates from
the mean.
2. Z scores can be formed for distributions of any shape, but only when the distribution is normal do we get 𝑍 ∼ N(0, 1).
3. Observations that are more than 2 standard deviations away from the mean (|𝑍| > 2) are
usually considered unusual.
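A small simulation sketch (not part of the original text) illustrating these points:
set.seed(1)
x <- rnorm(10000, mean = 177, sd = 8)   # e.g. heights
z <- (x - mean(x)) / sd(x)              # Z-scores
c(mean(z), sd(z))                       # approximately 0 and 1
mean(abs(z) > 2)                        # roughly 5% of observations are "unusual"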
¾ Your turn
Probabilities:
Quantiles:
Example 7.6. At the Heinz ketchup factory, the amounts that go into ketchup bottles are supposed
to be normally distributed with a mean of 36 oz and a standard deviation of 0.11 oz. Every 30 minutes,
a bottle is selected from the production line, and its contents are precisely noted. If the amount of
ketchup in the bottle is below 35.8 oz or above 36.2 oz, the bottle fails the quality control inspection.
Question: What is the probability that a bottle contains less than 35.8 ounces of ketchup?
So, we need to find P(𝑋 < 35.8), the area under the normal density curve to the left of 35.8.
The amount of ketchup per bottle is denoted by 𝑋. It is assumed that 𝑋 ∼ N(36, 0.11²). To compute the probability, we can compute the z-score
𝑍 = (35.8 − 36)/0.11 = −1.81818… ≈ −1.82
and look up the corresponding probability for the standard normal distribution:
pnorm(-1.82, mean = 0, sd = 1)
# [1] 0.0343795
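Alternatively (a sketch, not shown in the original text), pnorm() can be called with the original mean and standard deviation directly:
pnorm(35.8, mean = 36, sd = 0.11)
# approximately 0.0345 (slightly different from above, since the z-score was rounded)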
¾ Your turn
Example 7.7. Body temperatures of healthy humans are typically nearly normally distributed, with a mean of 36.7℃ and a standard deviation of 0.4℃.
Question: What is the cut-off (quantile) for the lowest 3% of human body temperatures?
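A sketch of the computation in R (not worked out in the original text): the cut-off is the 0.03-quantile of N(36.7, 0.4²).
qnorm(0.03, mean = 36.7, sd = 0.4)
# approximately 35.95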
¾ Your turn
Question: Which R codes compute the cut-off for the highest 10% of human body tempera-
tures?
Remember: body temperature was normally distributed with 𝜇 = 36.7 and 𝜎 = 0.4
A qnorm(0.9) * 0.4 + 36.7
B qnorm(0.1, mean = 36.7, sd = 0.4, lower.tail = FALSE)
C qnorm(0.1) * 0.4 + 36.7
D qnorm(0.9, mean = 36.7, sd = 0.4)
68-95-99.7 rule
For nearly normally distributed data,
• about 68% falls within 1 SD of the mean,
• about 95% falls within 2 SD of the mean,
• about 99.7% falls within 3 SD of the mean.
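These percentages can be checked with pnorm() (a small check, not part of the original text):
pnorm(1) - pnorm(-1)   # about 0.68
pnorm(2) - pnorm(-2)   # about 0.95
pnorm(3) - pnorm(-3)   # about 0.997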
Observations can occasionally fall 4, 5, or more standard deviations away from the mean; however,
such occurrences are quite rare if the data follows a nearly normal distribution.
Let’s create a histogram and a normal probability plot of a sample of 100 male heights with empir-
ical mean 177.88 and empirical standard deviation 8.36.
(Left: histogram of the 100 heights on the density scale; right: normal probability plot of the heights.)
Empirical quantiles, derived from the data, are plotted on the y-axis of a normal probability
plot, while theoretical quantiles from a normal distribution are displayed on the x-axis.
In detail, for data 𝑥1 , … , 𝑥𝑛 , the plot shows a point for each index 𝑖 = 1, … , 𝑛, with the 𝑖-th
point having the following coordinates:
• The y-coordinate is the 𝑖-th smallest value among the data points 𝑥1 , … , 𝑥𝑛 .
• The x-coordinate is a normal quantile that approximates the expected value of the 𝑖-th
smallest value in a random sample of 𝑛 values drawn from N (0, 1).
Interpretation: If a linear relationship is present in the plot, then the data nearly follows a
normal distribution.
Constructing a normal probability plot involves calculating percentiles and corresponding z-scores
for each observation. R performs the detailed calculations when we request it to create these plots.
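A minimal ggplot2 sketch (not from the original text) of how such a pair of plots can be created; the heights are simulated here to match the stated mean and standard deviation:
library(ggplot2)
d <- data.frame(heights = rnorm(100, mean = 177.88, sd = 8.36))
# histogram on the density scale
ggplot(d, aes(x = heights)) +
  geom_histogram(aes(y = after_stat(density)), bins = 12)
# normal probability plot (empirical vs. theoretical quantiles)
ggplot(d, aes(sample = heights)) +
  geom_qq() +
  geom_qq_line()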
Right skew: Points bend up and to the left of the line.
Left skew: Points bend down and to the right of the line.
Short tails (narrower than the normal distribution): Points follow an S-shaped curve.
Long tails (wider than the normal distribution): Points start below the line, bend to follow it, and
end above it.
Short summary
This chapter provides a foundational overview of probability theory and its crucial role in statistical
thinking and data science. The text introduces core concepts such as sample spaces, events,
and the rules of probability, including frequentist and Bayesian interpretations. It further
explains important principles like the law of large numbers and the addition rule. The ma-
terial extends to conditional probability, independence, and Bayes’ Theorem, illustrating
their applications with examples. Finally, the resource covers random variables, both dis-
crete and continuous distributions (with a focus on the normal distribution), alongside
concepts like expected value and variance.
Part V
Predictive modeling
8 Statistical learning
In our course we will use the term statistical learning to refer to methods for predicting, or estimat-
ing, an output based on one or more inputs.
• Such prediction is often termed supervised learning (the development of the prediction method
is supervised by the output).
Example 8.1 (Advertising). Data on the sales of a product in 200 different markets, along with ad-
vertising budgets for the product in each of those markets for three different media: TV, radio, and
newspaper.
Suppose we wish to predict the sales for a randomly selected market with a given budget for radio
advertising.
In this problem, radio is the input variable and sales the output variable.
Terminology:
• Often, the input is also referred to as predictors, features, covariates, or independent variables.
Notation:
𝑋 = input, 𝑌 = output.
There seems to be some relationship between sales and radio, but it is certainly noisy. We may
capture the noise in a statistical model that invokes probability.
Consider a randomly selected market with radio budget 𝑋 and sales 𝑌 . Mathematically, 𝑋 and 𝑌
are random variables. Then we may posit the model
𝑌 = 𝑓0 (𝑋) + 𝜖0 ,
where 𝜖0 is a random error term, i.e., a random variable that is independent of 𝑋 and has mean 0. The random error 𝜖0 encodes the noise in the prediction problem.
In this formulation, 𝑓0 is an unknown function that represents the systematic information that 𝑋
provides about the numeric variable 𝑌 .
i) A very useful framework is to consider functions that are linear in 𝑥, i.e., 𝑓(𝑥) = 𝛽0 + 𝛽1 𝑥,
and choose the unknown parameters 𝛽0 and 𝛽1 such that 𝑓 optimally approximates the
data (we will look at this in depth under the headings of linear regression and least
squares).
ii) To form a prediction of 𝑌 , we may evaluate the line at values of 𝑋 that are of interest. So
each prediction takes the form
𝑌̂ = 𝑓(𝑥)̂ ∶= 𝛽̂_0 + 𝛽̂_1 𝑥 ,
where 𝛽̂_0 and 𝛽̂_1 are the optimally chosen parameters and 𝑥 is the value of interest.
In order to (hopefully) obtain better predictions, we may simultaneously draw on several predictors.
Advertising example:
Consider as input the vector 𝑋 = (𝑋1 , 𝑋2 , 𝑋3 ) comprising all three budgets, so 𝑋1 is radio, 𝑋2 is
TV, and 𝑋3 is newspaper.
The systematic part of our model 𝑌 = 𝑓(𝑋) + 𝜖 is now given by a function 𝑓 ∶ R3 → R.
Again, a very useful framework is to consider functions that linearly combine predictors, so
𝑓(𝑋) = 𝛽0 + 𝛽1 𝑋1 + 𝛽2 𝑋2 + 𝛽3 𝑋3 .
We will discuss this in depth under the heading of multiple linear regression.
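As a rough sketch (not from the original text) of how such a model could be fit in R, assuming a data frame advertising with columns sales, TV, radio and newspaper:
fit <- lm(sales ~ radio + TV + newspaper, data = advertising)
coef(fit)   # estimates of the parameters beta_0, ..., beta_3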
8.2.2 Noise
In the model
𝑌 = 𝑓(𝑋) + 𝜖,
the random error 𝜖 encodes the noise in the prediction problem.
The noise 𝜖 may contain unmeasured variables that are useful in predicting 𝑌 : Since we don’t measure
them, 𝑓 cannot use them for its prediction (e.g., TV matters but we only know and predict from
radio).
The noise 𝜖 may also contain unmeasurable variation (a stochastic aspect). For example, the risk of an
adverse reaction might vary for a given patient on a given day, depending on manufacturing variation
in the drug itself or the patient’s general feeling of well-being on that day.
We assume that the true, but unknown, relationship between 𝑋 and 𝑌 is described by 𝑌 = 𝑓0 (𝑋)+𝜖0 .
Under the chosen model 𝑌 = 𝑓(𝑋) + 𝜖, the accuracy of 𝑌̂ = 𝑓(𝑋)̂ as a prediction for 𝑌 depends on two quantities:
In general, 𝑓 ̂ is not a perfect estimate of 𝑓, and this inaccuracy will introduce some error. This error
is reducible because we can potentially improve the accuracy of our predictions by estimating 𝑓 via
more appropriate statistical learning techniques.
However, even if our model is correctly specified, meaning that 𝑓0 = 𝑓, and we have a perfect estimate
𝑓 ̂ = 𝑓0 , we still face some prediction error, namely,
𝑌 − 𝑌̂ = 𝑌 − 𝑓(𝑋)̂ = 𝜖0 .
Indeed, the noise 𝜖0 cannot be predicted using 𝑋. By the modeling assumption, 𝜖0 is independent of
𝑋. The prediction error resulting from the variability associated with 𝜖0 is known as the irreducible
error, because no matter how well we estimate 𝑓0 , we cannot reduce the error introduced by 𝜖0 .
1. Model the data by specifying assumptions about the form of 𝑓, e.g., 𝑓 is linear in 𝑋:
𝑓(𝑋) = 𝛽0 + 𝛽1 𝑋1 + 𝛽2 𝑋2 + ⋯ + 𝛽𝑝 𝑋𝑝 .
Estimating 𝑓 amounts to estimating a vector of unknown parameters 𝛽 = (𝛽0 , 𝛽1 , … , 𝛽𝑝 ).
2. Fit the specified model to training data. In the above linear model, we optimize the choice of
𝛽 so as to find a linear function 𝑓 that optimally approximates the training data. That is, we
want to find values of the parameters such that
𝑌 ≈ 𝛽0 + 𝛽1 𝑋1 + 𝛽2 𝑋2 + ⋯ + 𝛽𝑝 𝑋𝑝 .
The most common approach to optimize the choice of 𝛽 is referred to as (ordinary) least squares,
which we will discuss in detail soon. (Other approaches can be useful…)
The following plot shows a least squares line for the first 20 data points from Advertising. The red line
segments show the prediction errors. The blue line minimizes the sum of squared prediction errors,
among all possible lines.
A straightforward method to enhance the expressivity of linear models is to generate additional pre-
dictors by transforming the original predictors. This allows one to form predictions that are non-linear
in the original predictors.
For instance, adding the squared predictor 𝑋4 ∶= 𝑋2² gives 𝑓(𝑋) = 𝛽0 + 𝛽1 𝑋2 + 𝛽2 𝑋4 = 𝛽0 + 𝛽1 𝑋2 + 𝛽2 𝑋2² .
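In R such a transformed predictor can be added with I() in the model formula (a sketch, not from the original text; advertising is again an assumed data frame):
fit_quad <- lm(sales ~ TV + I(TV^2), data = advertising)   # quadratic in TV, still linear in the parameters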
Non-parametric methods do not make explicit assumptions about the functional form of 𝑓. Instead,
they seek an estimate 𝑓 ̂ that gets as close to the data points as possible without being too rough or
wiggly.
Example 8.7 (Example: 𝑘-NN regression). For a numeric response 𝑌 , 𝑘-nearest neighbor regression
forms the prediction as
𝑓(𝑋)̂ = (1/𝑘)(𝑌_{𝑖_1} + ⋯ + 𝑌_{𝑖_𝑘}) ,
where 𝑖1 , … , 𝑖𝑘 ∈ {1, … , 𝑛} are the indices such that 𝑋𝑖1 , … , 𝑋𝑖𝑘 are the 𝑘 inputs in the training
data that are closest to the input 𝑋 for which we predict.
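A minimal sketch (not part of the original text) of 𝑘-NN regression "by hand" for a single numeric input:
knn_predict <- function(x0, x, y, k) {
  nn <- order(abs(x - x0))[1:k]   # indices of the k training inputs closest to x0
  mean(y[nn])                     # average of their responses
}
set.seed(1)
x <- runif(50, 0, 3); y <- sin(x) + rnorm(50, sd = 0.2)   # simulated training data
knn_predict(1.5, x, y, k = 5)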
Example 8.8 (Example: 𝑘-NN regression with varying 𝑘). Results for 𝑘 = 3, 10, 25 for a data set of
size 𝑛 = 300.
No one method dominates all others over all possible data sets.
• Selecting a suitable method is often the most challenging part of performing statistical learning
in practice:
– is a linear model suitable?
– does adding transformations help?
– is a nonparametric method better?
– but if we apply, say, 𝑘-NN regression, what should be the value of 𝑘?
• Sometimes the best predictive performance results from ensemble methods that aver-
age/combine predictions from several different methods
• Speaking of “best” requires specifying a precise measure for prediction errors.
How can we measure the accuracy of a predicted value in relation to the true response?
In regression, the most commonly-used measure is the mean squared error (MSE), which is an
average squared prediction error.
Let (𝑥_1, 𝑦_1), … , (𝑥_𝑛, 𝑦_𝑛) be the observed values in a data set. Let 𝑓̂ be a prediction rule that may be used to compute a prediction 𝑓(𝑥_𝑖)̂ for each response value 𝑦_𝑖. Then the MSE for 𝑓̂ is defined as
MSE = (1/𝑛) ∑_{𝑖=1}^{𝑛} (𝑦_𝑖 − 𝑓(𝑥_𝑖)̂ )² .
The MSE will be small if the predicted responses are very close to the true responses and large if for
some of the observations, the predicted and true responses differ substantially.
In practice, the prediction rule 𝑓 ̂ is determined using a training data set consisting of the pairs
(𝑥1 , 𝑦1 ), … , (𝑥𝑛 , 𝑦𝑛 ). The term training MSE refers to the mean squared error for 𝑓,̂ calculated
by averaging the squared prediction errors based on the training data.
In contrast, the term test MSE refers to an MSE calculated using an independent test data set (𝑥_1*, 𝑦_1*), … , (𝑥_𝑛*, 𝑦_𝑛*). So, while 𝑓̂ is found using the training data, the test MSE is computed as
MSE = (1/𝑛) ∑_{𝑖=1}^{𝑛} (𝑦_𝑖* − 𝑓(𝑥_𝑖*)̂ )² .
Statistical learning methods aim to find 𝑓 ̂ that minimizes training MSE, hoping that it will also mini-
mize test MSE. However, there is no guarantee that this will happen.
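A small simulated sketch (not from the original text) contrasting training and test MSE for a fitted linear model:
set.seed(1)
n <- 200
x <- runif(n); y <- sin(2 * pi * x) + rnorm(n, sd = 0.3)
dat <- data.frame(x, y)
train_id <- sample(n, 100)                          # half of the data for training
fit <- lm(y ~ poly(x, 5), data = dat[train_id, ])   # flexible polynomial fit
train_mse <- mean((dat$y[train_id] - predict(fit))^2)
test_mse  <- mean((dat$y[-train_id] - predict(fit, newdata = dat[-train_id, ]))^2)
c(train_mse, test_mse)   # the test MSE is typically the larger of the two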
Figure 8.1: Figure 2.9 from ISLR (Left: true 𝑓 in black; estimates 𝑓 ̂ in color | Right: training MSE in
grey; test MSE in red)
• U-shape in test MSE curve is due to two competing properties of statistical learning methods:
bias versus variance.
• Bias arises when there are systematic deviations between test case predictions 𝑓(𝑥_𝑖*)̂ and the expected response 𝑓0(𝑥_𝑖*). This occurs, in particular, when the learning method infers 𝑓̂ from an
overly simple class of functions M, that fails to contain/closely approximate the true function
𝑓0 . This indicates that we select our model 𝑓 from the set M (e.g. all linear functions in 𝑋),
but the true model/relationship 𝑓0 is not included in M. The bias increases with the distance
between 𝑓0 and our chosen model class M.
• Variance encompasses chance errors that arise when the learned prediction rule 𝑓 ̂ heavily de-
pends on the noise in the training data. This occurs, for instance, when fitting polynomials of
overly high degree to training data.
ĺ Important
For accurate prediction, we need a learning method that is not only sufficiently flexible
but also filters out noise in the training data.
View the training data as a random sample (𝑋_1, 𝑌_1), … , (𝑋_𝑛, 𝑌_𝑛), and consider the problem of predicting the response 𝑌* in an independent test case (𝑋*, 𝑌*). Here, we take 𝑋* = 𝑥* to be fixed (non-random) but generate the response randomly according to our model
𝑌* = 𝑓0(𝑥*) + 𝜖0 .
Then the prediction 𝑓(𝑥*)̂ is a random variable whose expected test MSE can be shown (see Section C.1 for a proof) to decompose as
Bias-variance decomposition
E[(𝑌* − 𝑓(𝑥*)̂ )²] = Var[𝑓(𝑥*)̂ ] + [Bias(𝑓(𝑥*)̂ )]² + Var[𝜖0] .   (8.1)
Here, the bias is defined as the deviation between expected prediction and expected response:
Bias(𝑓(𝑥*)̂ ) = E[𝑓(𝑥*)̂ ] − 𝑓0(𝑥*) .
In the three-term decomposition, Var[𝜖0] is due to the irreducible error, whereas the contribution of reducible errors is decomposed into the variance and the squared bias of the prediction.
Example 8.10. Assume we have 100 observations (𝑥1 , 𝑦1 ), … , (𝑥100 , 𝑦100 ) from the model
𝑌 = 𝑓0(𝑥) + 𝜖0 .
Our unbiased estimator with the smallest variance has the least MSE, while the other two are similar.
To determine if this finding is consistent, we should repeat the entire procedure multiple times.
Assume we take a sample of 100 observations and compute all three estimators 1000 times. The
following histograms show the estimates of the intercept 𝛽0 for all three estimators.
(Histograms of the 1000 intercept estimates 𝛽̂_0 obtained from the three estimators 𝑓,̂ 𝑓̂_𝑏 and 𝑓̂_𝑣.)
Averaged over all 1000 samples we get the following MSE values:
• So far we have discussed problems in which response is numeric (i.e., regression problems).
• When the response is instead categorical, then the prediction problem is a classification
problem.
• Each value of a categorical response defines a class that a test case may belong to.
One important method to solve classification problems with two classes (i.e., a binary response) is
logistic regression. This is a suitable generalization of linear regression that we will discuss in-depth
later in the course.
Algorithm
Let’s return to the practical problem of finding a statistical learning method that yields a good pre-
diction rule 𝑓 ̂ on the basis of a data set (𝑥1 , 𝑦1 ), … , (𝑥𝑛 , 𝑦𝑛 ).
A good rule is one that generalizes well. That is, it gives accurate predictions 𝑓(𝑥)̂ for new data points (for which we observe the input value 𝑥 but not the response 𝑦).
As noted earlier, finding a learning method typically requires picking one of several possible mod-
els, one of several possible estimation methods, or possibly also setting tuning parameters (like the
number 𝑘 of nearest neighbors). A natural idea to make such choices is to
1. randomly split the available data into a training and a validation part, and
2. make all statistical choices such that the MSE (or a misclassification rate) for the validation
cases is minimized.
This leads to a specific learning method that can then be applied to the entire original data set
(𝑥1 , 𝑦1 ), … , (𝑥𝑛 , 𝑦𝑛 ) to form 𝑓.̂
8.10.1 Cross-validation
In order to reduce the variability due to the randomness in splitting the data, we may consider several
different random splits. This idea is typically implemented in the specific form of cross validation.
In 𝑣-fold cross validation, the data (𝑥1 , 𝑦1 ), … , (𝑥𝑛 , 𝑦𝑛 ) are randomly split into 𝑣 parts, also known
as the folds. Typical choices for the number of folds are 𝑣 = 5 or 𝑣 = 10.
Each fold then plays the role of the validation data once. This leads to 𝑣 validation errors, which
are averaged to form an overall cross-validation error CV(𝑣) . Learning methods are then designed to
minimize CV(𝑣) .
Example 8.13. (CV error in regression) For a regression problem, let MSE𝑖 be the MSE when pre-
dicting the data in the 𝑖-th fold (𝑖 = 1, … , 𝑣). Then the overall cross-validation error is the average
MSE:
CV(𝑣) = (1/𝑣) ∑_{𝑖=1}^{𝑣} MSE_𝑖 .
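A sketch (not part of the original text) of 𝑣-fold cross-validation for a simple linear model, with simulated data:
set.seed(1)
dat <- data.frame(x = runif(100))
dat$y <- 1 + 2 * dat$x + rnorm(100, sd = 0.5)
v <- 5
folds <- sample(rep(1:v, length.out = nrow(dat)))    # random fold assignment
mse_fold <- numeric(v)
for (i in 1:v) {
  fit  <- lm(y ~ x, data = dat[folds != i, ])        # train on all folds but the i-th
  pred <- predict(fit, newdata = dat[folds == i, ])  # predict the held-out fold
  mse_fold[i] <- mean((dat$y[folds == i] - pred)^2)
}
mean(mse_fold)                                       # CV(v), the average of the fold-wise MSEs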
Illustrating cross-validation
Figure 8.3: 4-fold cross validation; from Hardin, Çetinkaya-Rundel (2021). Introduction to Modern
Statistics
Below is a plot of accuracy (% correctly classified) in 5-fold cross-validation for 𝑘-NN classification.
(Plot: cross-validated accuracy against the number of neighbors, ranging roughly between 0.92 and 0.935.)
Note: 10.3% of emails in the dataset are spam.
Our discussion focused on supervised learning, where we observe a response in the available data.
In contrast, unsupervised learning problems are learning problems in which we do not get to observe
data for a response.
(Two scatterplots of the pairs (𝑋1, 𝑋2), with colors/plotting symbols indicating group membership.)
Each data point is a pair (𝑋1 , 𝑋2 ). Colors/Plotting symbols indicate which group each data point
belongs to. But imagine not knowing the color and having to form three clusters.
Reference
Short summary
This chapter offers an introduction to statistical learning, which involves methods for predicting
outputs based on inputs. It differentiates between supervised learning, where predictions
are guided by an output, covering regression for numeric outputs and classification for categori-
cal ones, and unsupervised learning, which explores data without a supervising output. The
text discusses modelling noisy relationships, highlighting the concepts of reducible and
irreducible error, and explores both parametric, model-based approaches like linear regression
and non-parametric methods such as nearest neighbours. Furthermore, it addresses how to
assess model accuracy using metrics like mean squared error and the importance of training
and test data, alongside the bias-variance trade-off. Finally, the text touches upon classi-
fication problems and methods for selecting learning approaches using validation sets and
cross-validation, contrasting supervised with unsupervised learning.
9 Linear regression
Simple linear regression allows us to predict a quantitative response 𝑌 based on a single pre-
dictor variable 𝑋. It assumes that there is approximately a linear relationship between 𝑋 and 𝑌 . We
can write this linear relationship as
𝑌 ≈ 𝛽 0 + 𝛽1 𝑋 .
You can interpret the symbol ≈ as meaning “is approximately modeled as”. We say that 𝑌 is regressed
on 𝑋.
The simple linear regression model for the i-th observation is then given by the equation
𝑌_𝑖 = 𝛽0 + 𝛽1 𝑥_𝑖 + 𝜖_𝑖 , 𝑖 = 1, … , 𝑛 ,   (9.1)
where 𝑥_𝑖 is the i-th observation of the predictor variable 𝑋 and 𝜖_1, … , 𝜖_𝑛 are independent random error terms with zero mean and constant variance 𝜎², which are independent of 𝑋.
Remark. Later on, we will also make inference for the slope parameter. There, we will make the
additional assumption that the random errors have a N (0, 𝜎2 ) distribution.
The regression parameters 𝛽0 and 𝛽1 are two unknown constants that represent the intercept and
slope terms in the linear model.
Once we have computed estimates 𝛽0̂ and 𝛽1̂ for the regression parameters, we can predict future
response values based on (new) values 𝑥 of the predictor variable
𝑦̂ = 𝛽̂_0 + 𝛽̂_1 𝑥 .
Here and in the following, we use a hat symbol ̂ to denote the estimated value for an unknown parameter, or to denote the predicted value of the response.
Data
poverty
# # A tibble: 51 x 7
# State Metro_Res White Graduates Poverty Female_House Region
# <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <fct>
# 1 Alabama 55.4 71.3 79.9 14.6 14.2 South
# 2 Alaska 65.6 70.8 90.6 8.3 10.8 West
# 3 Arizona 88.2 87.7 83.8 13.3 11.1 West
# 4 Arkansas 52.5 81 80.9 18 12.1 South
# 5 California 94.4 77.5 81.1 12.8 12.6 West
# 6 Colorado 84.5 90.2 88.7 9.4 9.6 West
# # i 45 more rows
The scatterplot below illustrates the relationship between high school graduation rates (in percent)
across all 50 US states and DC, and the percentage of residents living below the poverty level 1 .
1
income below $23050 for a family of 4 in 2012.
Poverty is the response, and Graduates is the predictor variable. The relationship can be described
as a moderately strong negative linear relationship.
In practice, the regression parameters 𝛽0 and 𝛽1 are unknown. So, we must use the data to estimate
them. The data consists of 𝑛 paired observations (𝑥_1, 𝑦_1), … , (𝑥_𝑛, 𝑦_𝑛).
This means we want to find an intercept 𝛽̂_0 and a slope 𝛽̂_1 such that the distance between the observations 𝑦_𝑖 and the predicted values on the regression line 𝑦_𝑖̂ is as small as possible for all 𝑛 observations.
There are several ways of measuring the distance. However, the most common approach in this
setting is minimizing the least squares distance, and we take that approach in this chapter.
Example 9.1. As an illustrative example we consider the relation between Poverty and Graduates
only in the states Florida, Louisiana, Minnesota and Washington. The observed Poverty values are
12.1, 17, 6.5 and 10.8. Imagine for a moment we want to describe the relation between Graduates
and Poverty only using the observations (79.8, 17) and (91.6, 6.5), which belong to Louisiana and
Minnesota, respectively. In this case we simply remember from secondary school how to compute a
slope and use it as our best estimate of 𝛽1:
𝛽̂_1 = (6.5 − 17)/(91.6 − 79.8) ≈ −0.8898 .
Figure 9.1: Fitted least squares regression line for the regression of Poverty onto Graduates given
only the observations from Louisiana and Minnesota. The observations for Florida and
Washington are added in red.
When we consider all four observations, it’s clear that we cannot fit a single line through all of the
points, as shown in Figure 9.1. This raises the question: how can we determine the line that best fits
this point cloud?
The goal is to keep the distance between our predictions 𝑦𝑖̂ and the observations 𝑦𝑖 , for 𝑖 ∈ {1, 2, 3, 4},
as small as possible. It will not be zero but should be minimized. There are several distance measures
available, but we will use squared distances, as they are easier to work with.
We consider the prediction 𝑦𝑖̂ = 𝑏0 + 𝑏1 𝑥𝑖 as a function of two parameters, 𝑏0 ∈ R and 𝑏1 ∈ R and
compute the squared distance between the predicted values and the actual observations, 𝑦𝑖 , for all
data points. By summing all these squared distances, we obtain a total that remains a function of the
two parameters. This sum can then be minimized with respect to 𝑏0 and 𝑏1 .
In our example, this will lead to the function
𝑓(𝑏0, 𝑏1) = (12.1 − (𝑏0 + 𝑏1⋅84.7))² + (17 − (𝑏0 + 𝑏1⋅79.8))² + (6.5 − (𝑏0 + 𝑏1⋅91.6))² + (10.8 − (𝑏0 + 𝑏1⋅89.1))² ,
which has a minimum at the point where the partial derivatives with respect to 𝑏0 and 𝑏1 vanish. Hence, our two estimates 𝛽̂_{0,4} and 𝛽̂_{1,4} (the 4 denotes the fact that we have four observations) will be the solution of the following system of equations:
d𝑓(𝑏0, 𝑏1)/d𝑏0 = −2(12.1 − (𝑏0 + 𝑏1⋅84.7)) − 2(17 − (𝑏0 + 𝑏1⋅79.8)) − 2(6.5 − (𝑏0 + 𝑏1⋅91.6)) − 2(10.8 − (𝑏0 + 𝑏1⋅89.1)) = −92.8 + 8⋅𝑏0 + 690.4⋅𝑏1 = 0 ,
d𝑓(𝑏0, 𝑏1)/d𝑏1 = −2⋅84.7(12.1 − (𝑏0 + 𝑏1⋅84.7)) − 2⋅79.8(17 − (𝑏0 + 𝑏1⋅79.8)) − 2⋅91.6(6.5 − (𝑏0 + 𝑏1⋅91.6)) − 2⋅89.1(10.8 − (𝑏0 + 𝑏1⋅89.1)) = −7878.3 + 690.4⋅𝑏0 + 59743⋅𝑏1 = 0 .
Definition 9.1. Let 𝛽0̂ and 𝛽1̂ be estimates of intercept and slope. The residual of the 𝑖-th observa-
tion (𝑥𝑖 , 𝑦𝑖 ) is the difference of the observed response 𝑦𝑖 and the prediction based on the model
fit 𝑦𝑖̂ = 𝛽0̂ + 𝛽1̂ 𝑥𝑖 :
𝑒𝑖 = 𝑦𝑖 − 𝑦𝑖̂ .
Considering the predictions 𝑦𝑖̂ as a function of two parameters 𝑏0 ∈ R and 𝑏1 ∈ R allows us to say
that the least squares estimates minimize the residual sum of squares
Figure 9.2: Fitted least squares regression line for the regression of Poverty onto Graduates. Each
red line segment represents one of the errors 𝑦𝑖 − 𝑦𝑖̂ .
Definition 9.2. The least squares estimates 𝛽̂_{0,𝑛} and 𝛽̂_{1,𝑛} (point estimates) for the parameters 𝛽0 and 𝛽1 (population parameters) are defined as the point (𝛽̂_{0,𝑛}, 𝛽̂_{1,𝑛}) that minimizes the function
RSS(𝑏0, 𝑏1) = ∑_{𝑖=1}^{𝑛} 𝑒_𝑖² = ∑_{𝑖=1}^{𝑛} (𝑦_𝑖 − (𝑏0 + 𝑏1 𝑥_𝑖))² ,
i.e.,
(𝛽̂_{0,𝑛}, 𝛽̂_{1,𝑛}) = argmin_{(𝑏0,𝑏1)∈R²} RSS(𝑏0, 𝑏1) .
Remark. We often drop the index 𝑛 and denote the least squares estimates just by 𝛽0̂ and 𝛽1̂ .
The least squares estimates are the solution to the minimization problem argmin_{(𝑏0,𝑏1)∈R²} RSS(𝑏0, 𝑏1).
To understand why this is actually true, we will first rewrite our regression model Equation 9.1 using
matrix notation.
The simple linear regression model is defined through the following equation
Y = X𝛽 + 𝜖 , where Y = (𝑌_1, … , 𝑌_𝑛)⊤, X is the 𝑛 × 2 matrix whose i-th row is (1, 𝑥_𝑖), 𝛽 = (𝛽0, 𝛽1)⊤ and 𝜖 = (𝜖_1, … , 𝜖_𝑛)⊤.
Taking the derivative with respect to 𝑏0 and 𝑏1 leads to the following system of equations
X⊤X b = X⊤y ,
which are called the normal equations, and they have the solution
𝛽̂(y) = (X⊤X)⁻¹ X⊤y = ( 𝑦̄_𝑛 − 𝛽̂_{1,𝑛} 𝑥̄_𝑛 , 𝛽̂_{1,𝑛} )⊤  with  𝛽̂_{1,𝑛} = ∑_{𝑖=1}^{𝑛}(𝑥_𝑖 − 𝑥̄_𝑛)(𝑦_𝑖 − 𝑦̄_𝑛) / ∑_{𝑖=1}^{𝑛}(𝑥_𝑖 − 𝑥̄_𝑛)² .
See Section C.2 for a proof of this last result.
Remark. The notation 𝛽̂(y) indicates that the formula is evaluated using the observed values y. In this situation, we typically simplify the notation and just write 𝛽̂. If the formula for the estimator uses the response vector Y, i.e. computing 𝛽̂(Y), the estimator becomes a random quantity (since Y is random) and it makes sense to compute an expectation or variance of 𝛽̂(Y).
Example 9.2. Consider again our model of regressing Poverty onto Graduates. The normal equa-
tions for this model are given by
X⊤X b = X⊤y ,
where X is the 51 × 2 matrix whose rows are (1, 𝑥_𝑖) with the graduation rates 79.9, 90.6, … , 90.9 in the second column, and y is the vector of poverty rates 14.6, 8.3, … , 9.5.
Solving this system of equations with respect to 𝑏0 and 𝑏1 gives the least-squares estimates:
solve(XtX, Xy)
# [1] 64.7809658 -0.6212167
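A sketch (not shown in the original text) of how the objects XtX and Xy used above could be constructed:
X <- cbind(1, poverty$Graduates)   # design matrix with a column of ones
y <- poverty$Poverty
XtX <- t(X) %*% X
Xy  <- t(X) %*% y
solve(XtX, Xy)                     # least squares estimates, as above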
These estimates differ from our earlier ones based only on Louisiana and Minnesota data. However,
they are more aligned with the estimates derived from the complete dataset, which we will see later.
Given
poverty |>
summarise(mean_pov = mean(Poverty),
mean_grad = mean(Graduates),
sd_pov = sd(Poverty),
sd_grad = sd(Graduates))
# # A tibble: 1 x 4
# mean_pov mean_grad sd_pov sd_grad
# <dbl> <dbl> <dbl> <dbl>
# 1 11.3 86.0 3.10 3.73
cor(poverty$Poverty, poverty$Graduates)
# [1] -0.7468583
we can compute the estimated slope. The formula for the slope estimator was
𝛽̂_1 = ∑_{𝑖=1}^{𝑛}(𝑥_𝑖 − 𝑥̄)(𝑦_𝑖 − 𝑦̄) / ∑_{𝑖=1}^{𝑛}(𝑥_𝑖 − 𝑥̄)² ,
which can be rearranged in the following way:
𝛽̂_{1,𝑛} = ∑_{𝑖=1}^{𝑛}(𝑥_𝑖 − 𝑥̄_𝑛)(𝑦_𝑖 − 𝑦̄_𝑛) / ∑_{𝑖=1}^{𝑛}(𝑥_𝑖 − 𝑥̄_𝑛)² = (𝑛 − 1)𝑠_{𝑥𝑦,𝑛} / ((𝑛 − 1)𝑠²_{𝑥,𝑛}) = 𝑠_{𝑥𝑦,𝑛} / (𝑠_{𝑥,𝑛} ⋅ 𝑠_{𝑥,𝑛}) = (𝑠_{𝑦,𝑛} / 𝑠_{𝑥,𝑛}) ⋅ 𝑠_{𝑥𝑦,𝑛} / (𝑠_{𝑥,𝑛} ⋅ 𝑠_{𝑦,𝑛}) = (𝑠_{𝑦,𝑛} / 𝑠_{𝑥,𝑛}) ⋅ 𝑟_{(𝑥,𝑦),𝑛} .
Plugging the summary statistics above into this formula gives the estimate
𝛽̂_{1,𝑛} = (𝑠_{𝑦,𝑛} / 𝑠_{𝑥,𝑛}) ⋅ 𝑟_{(𝑥,𝑦),𝑛} ≈ (3.10 / 3.73) ⋅ (−0.75) = −0.62 .
Interpretation
When comparing two states, then for each additional percentage point in the high school
graduation rate, we would expect the percentage of people living in poverty to be 0.62 percent-
age points lower on average.
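A small sketch (not part of the original text) reproducing these estimates from the summary statistics in R:
b1 <- sd(poverty$Poverty) / sd(poverty$Graduates) * cor(poverty$Graduates, poverty$Poverty)
b0 <- mean(poverty$Poverty) - b1 * mean(poverty$Graduates)
c(b0, b1)   # approximately 64.78 and -0.62, matching the normal-equation solution above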
Given
poverty |>
summarise(mean_pov = mean(Poverty), mean_grad = mean(Graduates),
sd_pov = sd(Poverty), sd_grad = sd(Graduates))
# # A tibble: 1 x 4
# mean_pov mean_grad sd_pov sd_grad
# <dbl> <dbl> <dbl> <dbl>
# 1 11.35 86.01 3.099 3.726
¾ Your turn
Be aware: Our interpretation of intercept and slope is not causal, where by causal, we mean effects
resulting from interventions (such as policy changes, new treatments, …).
When interpreting the slope in our example, we consider differences in the expected responses of
two states with (naturally) different high school graduation rates. We do not conclude that the slope
provides an estimate of how an intervention that in-/decreases the high school graduation rate in one
state leads to a de-/increase in the poverty rate in that same state.
Remember: Causal conclusions may be drawn if a study is a randomized controlled experiment
(i.e., the value of 𝑥 was controlled and randomly assigned by the experimenter).
In R, one can fit a linear regression model by using the function lm()
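A sketch of the call for the poverty data; the formula and the object name model_pov_0 match the summary output shown below:
model_pov_0 <- lm(Poverty ~ Graduates, data = poverty)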
Note
After fitting the model, one should analyse the residuals. To check:
• if the residuals show no structure → implies that relationship between predictor and
response is roughly linear
• the shape of the distribution of the residuals
• if variability of residuals around the 0 line is roughly constant
The process for carrying out these three checks is explained in Section 14.2.
In this chapter, we will focus on the residual plot, which is a scatterplot of the residuals 𝑒𝑖 against
the predicted values 𝑦𝑖̂ . It helps to evaluate how well a linear model fits a dataset. To create the plot,
we first add the residuals and the predictions to the dataset with the functions add_residuals() and
add_predictions() from the modelr package.
library(modelr)
poverty <- poverty |>
  add_residuals(model = model_pov_0) |>
  add_predictions(model = model_pov_0)
# poverty now contains the additional columns pred and resid
ggplot(poverty, aes(x = pred, y = resid)) +
  geom_point() +
  labs(x = "predicted perc. of people living below the poverty line",
       y = "residuals") +
  geom_hline(yintercept = 0, linetype = 2)
Figure 9.3: The given residual plot indicates a not-so-bad fit since no apparent structure is visible.
One might say that there are two clusters (below and above 12%) with slightly different
variability.
Note
One purpose of residual plots is to identify characteristics or patterns still apparent in data after
fitting a model.
If the chosen model fits the data rather well, there should be no pattern.
The empirical correlation coefficient is
𝑟_{(𝑥,𝑦),𝑛} = ∑_{𝑖=1}^{𝑛}(𝑥_𝑖 − 𝑥̄_𝑛)(𝑦_𝑖 − 𝑦̄_𝑛) / √( ∑_{𝑖=1}^{𝑛}(𝑥_𝑖 − 𝑥̄_𝑛)² ∑_{𝑖=1}^{𝑛}(𝑦_𝑖 − 𝑦̄_𝑛)² ) = ∑_{𝑖=1}^{𝑛}(𝑥_𝑖 − 𝑥̄_𝑛)(𝑦_𝑖 − 𝑦̄_𝑛) / ((𝑛 − 1) 𝑠_{𝑥,𝑛} 𝑠_{𝑦,𝑛}) ,
and its extreme values −1 and +1, as well as the value 0, correspond to the cases of perfect negative / positive correlation and no linear association, respectively. In general 𝑟_{(𝑥,𝑦),𝑛} ∈ [−1, 1].
¾ Your turn
Which of the following is the best guess for the correlation between % in poverty and the high
school graduation rate?
A -0.75
B -0.1
C 0.02
D -1.5
¾ Your turn
Which of the following plots shows the strongest correlation, i.e. correlation coefficient closest
to +1 or -1?
(Four scatterplots labeled A–D with differing strength and direction of linear association.)
9.1.3 Extrapolation
When regressing the percentage of people living below the poverty line onto the percentage of high
school graduates, we estimated an intercept of 64.78. Since there are no states in the dataset with no
high school graduates, the intercept is of no interest, not very useful, and also not reliable since
the predicted value of the intercept is so far from the bulk of the data.
Applying a fitted model to values outside the original data realm is called extrapolation. Sometimes,
the intercept might be an extrapolation.
Example 9.3. Figure 9.4 shows, for each year, the median age at first marriage of men living in the US. Using only the data up to 1950 leads to a trend that dramatically underestimates the median age for
1970 and onwards.
Figure 9.4: Median age at first marriage for men living in the US.
Example 9.4. In 2004, the BBC reported that women “may outsprint men by 2156”. In their report
they were referring to results found in Tatem et al. (2004).
The study’s authors fitted linear regression lines to the winning times of males and females over the
past 100 years.
Then, they extrapolated these trends to the 2008 Olympic Games and concluded that the women's 100-meter race could be won in a time of 10.57 ± 0.232 seconds and the men's event in 9.73 ± 0.144 seconds. The actual winning times were 10.78 and 9.69 seconds, respectively, both within the given 95% confidence intervals.
But already in the Tokyo 2020 Olympics, this wasn’t the case anymore.
Figure 9.5: Momentous sprint at the 2156 Olympics? Figure from Tatem et al. (2004); we added the
Tokyo 2020 results.
The quality of a linear regression fit can be evaluated using the residual standard error (RSE) and the
R squared value (𝑅2 ), among other criteria.
Recall that the linear regression model Equation 9.1 assumes that the response 𝑌𝑖 is a linear combi-
nation of the linear predictor 𝛽0 + 𝛽1 𝑥𝑖 and the error 𝜖𝑖 .
The RSE is an estimate of the standard deviation 𝜎 of the unobservable random error 𝜖𝑖 . Remember
that all 𝜖𝑖 are assumed to have the same variance. It is defined by the following formula
RSE = √( RSS/(𝑛 − 2) ) = √( (1/(𝑛 − 2)) ∑_{𝑖=1}^{𝑛} (𝑦_𝑖 − 𝑦_𝑖̂)² ) .
The RSE is a measure of how well the model fits the data.
If the model’s predictions closely match the actual outcome values, the RSE will be small, indicating
a good fit. Conversely, if the predictions 𝑦𝑖̂ differ significantly from the actual values 𝑦𝑖 for some
observations, the RSE will be large, suggesting a poor fit of the model to the data.
The glance() function from the broom package computes several fit criteria. The RSE is denoted
by sigma.
broom::glance(model_pov_0)
# # A tibble: 1 x 12
# r.squared adj.r.squared sigma statistic p.value df
# <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
# 1 0.5578 0.5488 2.082 61.81 3.109e-10 1
# # i 6 more variables: logLik <dbl>, AIC <dbl>, BIC <dbl>,
# # deviance <dbl>, df.residual <int>, nobs <int>
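A small check (not part of the original text): the RSE can also be computed directly from the residuals of the fitted model:
sqrt(sum(resid(model_pov_0)^2) / (nrow(poverty) - 2))   # approximately 2.082, the value of sigma above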
Roughly speaking, the RSE measures the average amount that the response will deviate from the true
regression line.
In our example, this means that the actual percentage of people living below the poverty line in
each state differs from the true regression line by approximately two percentage points, on average.
Whether or not 2 percentage points is an acceptable prediction error depends on the problem con-
text. For the poverty dataset, the mean percentage living below the poverty line over all states is
approximately 11.35, and so the percentage error is 2.082/11.35 ⋅ 100% ≈ 18%.
𝑅2
The RSE provides an absolute measure of the lack of fit. However, since it is measured in the units
of the response, it is not always clear what constitutes a good RSE. The 𝑅2 statistic provides an
alternative measure of fit. It takes the form of a proportion and is hence independent of the scale of
𝑌.
Definition 9.3. The strength of the fit of a linear model can be evaluated using the R squared value
𝑅2 . It is defined in the following way
𝑅² = (TSS − RSS)/TSS = ( ∑_{𝑖=1}^{𝑛}(𝑦_𝑖 − 𝑦̄_𝑛)² − ∑_{𝑖=1}^{𝑛}(𝑦_𝑖 − 𝑦_𝑖̂)² ) / ∑_{𝑖=1}^{𝑛}(𝑦_𝑖 − 𝑦̄_𝑛)² ,
where TSS denotes the total sum of squares and RSS the residual sum of squares.
Remark.
2. One can show that the 𝑅² value is equal to the square of the correlation coefficient in the simple linear model, i.e.
𝑅² = 𝑟²_{(𝑥,𝑦),𝑛} .
This equation implies that 𝑅² ∈ [0, 1].
3. One can show that
∑_{𝑖=1}^{𝑛}(𝑦_𝑖̂ − 𝑦̄_𝑛)² = ∑_{𝑖=1}^{𝑛}(𝑦_𝑖 − 𝑦̄_𝑛)² − ∑_{𝑖=1}^{𝑛}(𝑦_𝑖 − 𝑦_𝑖̂)² ,
and, since 𝑦̄_𝑛 equals the mean of the fitted values, that
𝑅² = ∑_{𝑖=1}^{𝑛}(𝑦_𝑖̂ − 𝑦̄_𝑛)² / ∑_{𝑖=1}^{𝑛}(𝑦_𝑖 − 𝑦̄_𝑛)² = 𝑠²_{𝑦̂} / 𝑠²_{𝑦} ,
where 𝑠²_{𝑦̂} denotes the empirical variance of the fitted values.
Interpretation
𝑅² is the percentage of variability in the response variable that the model explains.
The remainder of the variability is explained by variables not included in the model or by inher-
ent randomness in the data.
model_pov_0
#
# Call:
# lm(formula = Poverty ~ Graduates, data = poverty)
#
# Coefficients:
# (Intercept) Graduates
# 64.7810 -0.6212
summary(model_pov_0)
#
# Call:
# lm(formula = Poverty ~ Graduates, data = poverty)
#
# Residuals:
# Min 1Q Median 3Q Max
# -4.1624 -1.2593 -0.2184 0.9611 5.4437
#
# Coefficients:
# Estimate Std. Error t value Pr(>|t|)
# (Intercept) 64.78097 6.80260 9.523 9.94e-13 ***
# Graduates -0.62122 0.07902 -7.862 3.11e-10 ***
# ---
# Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#
# Residual standard error: 2.082 on 49 degrees of freedom
# Multiple R-squared: 0.5578, Adjusted R-squared: 0.5488
# F-statistic: 61.81 on 1 and 49 DF, p-value: 3.109e-10
summary(model_pov_0)$r.squared
# [1] 0.5577973
¾ Your turn
Which of the below is the correct interpretation of 𝑟(𝑥,𝑦),𝑛 = −0.75 and 𝑅2 = 0.56?
A The model explains 56% of the variability in the percentage of high school graduates among
the 51 states.
B The model explains 56% of the variability in the percentage of residents living in poverty
among the 51 states.
C 56% of the time the percentage of high school graduates predict the percentage of residents
living in poverty correctly.
D The model explains 75% of the variability in the percentage of residents living in poverty
among the 51 states.
Simple linear regression is a useful method for predicting a response based on a single predictor
variable. However, in practical applications, we often have more than one predictor variable.
In the poverty data, we have used the high school graduation rate as predictor. However, the dataset contains much more information. It consists of the following variables:
• State: US state
• Metro_Res: metropolitan residence
• White: percent of population that is white
• Graduates: percent of high school graduates
• Female_House: percent of female householder families (no husband present)
• Poverty: percent living below the poverty line
• Region: region in the United States
When describing the relationship between multiple predictor variables and the response variable
Poverty, fitting separate simple linear regressions for each predictor is not ideal. This approach over-
looks potential dependencies between the predictor variables and does not provide a single prediction
for Poverty based on multiple fits.
Therefore, we extend the simple linear regression model in such a way that it can directly accommo-
date several predictor variables.
Let 𝑌 be our quantitative response, and let 𝑋1 , … , 𝑋𝑘 be the considered predictor variables. Then
multiple linear regression assumes that there is approximately a linear relationship between
𝑋1 , … , 𝑋𝑘 and 𝑌 , with
𝑌 ≈ 𝛽0 + 𝛽1 𝑋1 + ⋅ ⋅ ⋅ + 𝛽𝑘 𝑋𝑘 .
More formally, let 𝑌1 , … , 𝑌𝑛 be 𝑛 observations of the response variable, and write 𝑥𝑗,1 , … , 𝑥𝑗,𝑛 , for
𝑗 ∈ {1, … , 𝑘}, for the associated given values of the 𝑘 predictors.
Definition 9.4. The multiple linear regression model is defined through the equation
𝑌_𝑖 = 𝛽0 + 𝛽1 𝑥_{1,𝑖} + ⋯ + 𝛽𝑘 𝑥_{𝑘,𝑖} + 𝜖_𝑖 , 𝑖 = 1, … , 𝑛 ,   (9.2)
with independent errors 𝜖_1, … , 𝜖_𝑛, which have zero mean and constant variance 𝜎². In the model,
the regression parameters 𝛽𝑗 ∈ R, 𝑗 ∈ {0, … , 𝑘}, are fixed, but unknown, coefficients.
We interpret 𝛽𝑗 as the average effect on 𝑌 of a one-unit increase in 𝑥𝑗 , while holding all other pre-
dictors fixed.
Remark. Using matrix notation, the multiple linear regression model is given by
Y = X𝛽 + 𝜖 , (9.3)
with response vector Y ∈ R𝑛 , design matrix X ∈ R𝑛×(𝑘+1) , population parameters 𝛽 ∈ R𝑘+1 and
residual errors 𝜖 ∈ R𝑛 .
Let’s visualize the poverty data (except for State) in a pairsplot. But first we remove the fitted values
and residuals again, which we added when fitting the simple linear regression model model_pov_0.
The following pairsplot, created using GGally::ggpairs(), compactly visualizes pairwise relation-
ships between variables.
GGally::ggpairs(
relocate(
select(poverty, -State), Poverty, .after = Female_House)
) # we removed State and show Poverty and Female_House next to each other
(Pairsplot of the poverty data. Selected correlations: Metro_Res with White −0.342, with Graduates 0.018, with Female_House 0.300, with Poverty −0.204; White with Graduates 0.238, with Female_House −0.751, with Poverty −0.309; Graduates with Female_House −0.612, with Poverty −0.747; Female_House with Poverty 0.525.)
The second last row is most interesting. It shows the relationship between Poverty and the other
predictor variables. To highlight one fact: The percentage of female householder families with no
husband present seems to have a positive relationship with Poverty.
Let’s fit another simple linear regression model, but this time using Female_House as predictor.
As noted, Female_House seems to help predict the percentage of people living below the poverty line.
So, the question is how to fit a joint model by estimating the regression parameters in Equation 9.2.
In Equation 9.2, as in the simple linear regression model, the regression parameters 𝛽0 , 𝛽1 , … , 𝛽𝑘 are
unknown and need to be estimated. Once we have estimates 𝛽0̂ , 𝛽1̂ , … , 𝛽𝑘̂ , we can use them to make
predictions using the following formula: 𝑦̂ = 𝛽̂_0 + 𝛽̂_1 𝑥_1 + ⋯ + 𝛽̂_𝑘 𝑥_𝑘 .
The parameters are estimated using the same least squares approach that we used in Section 9.1.1.
The least squares estimates are defined as the point 𝛽̂ = (𝛽̂_0, 𝛽̂_1, … , 𝛽̂_𝑘)⊤ ∈ R^{𝑘+1} that minimizes the function
RSS(b) = RSS(𝑏0, 𝑏1, … , 𝑏𝑘) = ∑_{𝑖=1}^{𝑛} 𝑒_𝑖² = ∑_{𝑖=1}^{𝑛} (𝑦_𝑖 − 𝑦_𝑖̂)² = ∑_{𝑖=1}^{𝑛} (𝑦_𝑖 − (𝑏0 + 𝑏1 𝑥_{1,𝑖} + ⋯ + 𝑏𝑘 𝑥_{𝑘,𝑖}))² .
In symbols,
𝛽̂(y) = argmin_{b∈R^{𝑘+1}} RSS(b) = (X⊤X)⁻¹ X⊤y .
This formula is the same as what we encountered in Section 9.1.1. However, at that time, we were
able to evaluate it manually. Now, the process is becoming more complex, and we will rely on R to
compute the estimates.
Using the formula for the least squares estimates 𝛽̂ gives the following representation of the fitted
values
ŷ = X𝛽̂ = X(X⊤X)⁻¹X⊤y =: Hy .   (9.4)
The matrix H is called hat matrix and will be used in Section 14.2, when we speak about outlier
detection.
The variance 𝜎2 of the unobservable errors will be estimated through
𝜎̂² = ∑_{𝑖=1}^{𝑛} 𝑒_𝑖² / (𝑛 − 𝑘 − 1) .
In R we simply have to extend the formula argument of lm() to Poverty ~ Graduates +
Female_House.
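A sketch of the corresponding call; the resulting object model_pov_2 is the one used in the anova() and coef() output below:
model_pov_2 <- lm(Poverty ~ Graduates + Female_House, data = poverty)
summary(model_pov_2)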
The output contains, besides the estimates 𝛽̂ = (𝛽̂_0, 𝛽̂_1, 𝛽̂_2)⊤, also the corresponding standard errors.
There are two equivalent ways to compute 𝑅² in the multiple linear regression model.
1. Square the empirical correlation between the observed response values y and the computed predictions ŷ:
𝑅² = 𝑟²_{(𝑦,𝑦̂),𝑛} ,
where the predictions/fitted values are now 𝑦_𝑖̂ = 𝛽̂_0 + 𝛽̂_1 𝑥_{1,𝑖} + ⋯ + 𝛽̂_𝑘 𝑥_{𝑘,𝑖}.
Remark: This is equivalent to squaring the correlation coefficient of 𝑦 and 𝑥 in a simple linear regression model (because in this case 𝑦̂ is a linear transformation of 𝑥, which leaves correlation invariant).
2. Use the sums of squares:
𝑅² = (TSS − RSS) / TSS .
If we use the second approach, we need to determine the relevant sums of squares. First, the total
sum of squares, which is defined precisely as in our discussion of the simple linear regression model.
Second, the residual sum of squares, which again is defined as before and sums the squares of the
residuals 𝑒𝑖 = 𝑦𝑖 − 𝑦𝑖̂ .
We recap:
TSS = ∑_{𝑖=1}^{𝑛} (𝑦_𝑖 − 𝑦̄_𝑛)² ,
RSS = ∑_{𝑖=1}^{𝑛} (𝑦_𝑖 − 𝑦_𝑖̂)² .
Their difference is the explained sum of squares, which by our previous calculations is given by
ESS = TSS − RSS = ∑_{𝑖=1}^{𝑛} (𝑦_𝑖̂ − 𝑦̄_𝑛)² .
In R, the different sums of squares can be readily extracted from a so-called ANOVA (analysis of
variance) table:
anova(model_pov_2)
# Analysis of Variance Table
#
# Response: Poverty
# Df Sum Sq Mean Sq F value Pr(>F)
# Graduates 1 267.881 267.881 61.5896 3.741e-10 ***
# Female_House 1 3.593 3.593 0.8262 0.3679
# Residuals 48 208.773 4.349
# ---
# Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
RSS = ∑_{𝑖=1}^{𝑛} 𝑒_𝑖² = 208.8 ,
ESS = ∑_{𝑖=1}^{𝑛} (𝑦_𝑖̂ − 𝑦̄_𝑛)² ≈ 267.9 + 3.593 = 271.493 ,
TSS = ∑_{𝑖=1}^{𝑛} (𝑦_𝑖 − 𝑦̄_𝑛)² = ESS + RSS = 271.493 + 208.8 = 480.293 .
This then leads to
𝑅² = ESS / TSS = 271.493 / 480.293 ≈ 0.5653 .
glance(model_pov_2)
# # A tibble: 1 x 12
# r.squared adj.r.squared sigma statistic p.value df
# <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
# 1 0.5653 0.5472 2.086 31.21 2.075e-9 2
# # i 6 more variables: logLik <dbl>, AIC <dbl>, BIC <dbl>,
# # deviance <dbl>, df.residual <int>, nobs <int>
The R computation makes it evident that the explained sum of squares (explained variability) ESS will
increase with every additional predictor. More formally, note that RSS will only decrease when opti-
mizing over additional coefficients 𝑏𝑗 . Hence, 𝑅2 increases with every additional predictor, making
it a less reliable measure for assessing the fit of a multiple linear regression model and unsuitable for
comparing different models.
Note
1. 𝑅2 is a biased estimate of the percentage of variability the model explains when there
are many variables. If we compute predictions for new data using the current model,
the 𝑅2 will tend to be slightly overly optimistic.
To get a better estimate of the amount of variability explained by the model we use:
Definition 9.5. Let 𝑦𝑖 be the i-th response value and 𝑒𝑖 the estimated i-th residual of a fitted multiple
linear regression model with 𝑘 + 1 parameters. Then the adjusted 𝑅2 is defined as
𝑅²_adj = 1 − ( ∑ 𝑒_𝑖² / (𝑛 − 𝑘 − 1) ) / ( ∑(𝑦_𝑖 − 𝑦̄)² / (𝑛 − 1) ) = 1 − ( ∑ 𝑒_𝑖² / ∑(𝑦_𝑖 − 𝑦̄)² ) ⋅ (𝑛 − 1)/(𝑛 − 𝑘 − 1) ,
where 𝑛 is the number of cases/observations. The specific divisor 𝑛 − 𝑘 − 1 is connected to avoid-
ing bias in estimation of the variance of the 𝜖𝑖 ; we will revisit this point when discussing statistical
inference for linear regression models.
Remark. Since (𝑛 − 1)/(𝑛 − 𝑘 − 1) > 1, we always have
𝑅²_adj < 𝑅² = 1 − ∑ 𝑒_𝑖² / ∑(𝑦_𝑖 − 𝑦̄)² .
Let’s compare the (adjusted) 𝑅2 values for the regression model using Graduates and Female_House
as predictors to the one using in addition White.
glance(model_pov_2)[,1:2]
# # A tibble: 1 x 2
# r.squared adj.r.squared
# <dbl> <dbl>
# 1 0.5653 0.5472
glance(model_pov_3)[,1:2]
# # A tibble: 1 x 2
# r.squared adj.r.squared
# <dbl> <dbl>
# 1 0.5769 0.5499
We detect a stronger increase in 𝑅2 than in the adjusted 𝑅2 . This indicates that the actual amount of
explained variability hasn’t increased that much.
Does adding the variable White to the model add valuable information that wasn’t provided by
Female_House?
(Pairsplot of the poverty data, as shown above; note in particular the correlation of −0.751 between White and Female_House.)
In the pairsplot we can detect a quite strong linear dependence between Female_House and White,
which indicates that White doesn’t contain much additional information.
Fitting a model with dependent predictor variables also affects the least squares estimates. Comparing
the slope estimate for Female_House over the two models shows that it decreases from a positive
value of 0.1438501 to -0.0859684.
coef(model_pov_2)
# (Intercept) Graduates Female_House
# 58.3202572 -0.5655586 0.1438501
coef(model_pov_3)
# (Intercept) Graduates Female_House White
# 68.85606105 -0.61872923 -0.08596838 -0.04024676
This effect of reversing the sign of the estimate would not be present if we had added Metro_Res to the model instead of White.
coef(model_pov_4)
# (Intercept) Graduates Female_House Metro_Res
# 54.05960674 -0.49422868 0.31784887 -0.05396264
Multicollinearity
Multicollinearity happens when the predictor variables are correlated among themselves.
When the predictor variables are correlated, the coefficients in a multiple regression model can
be challenging to interpret.
Remember: Predictors are also called explanatory or independent variables. Ideally, they would
be independent of each other.
While it’s more or less impossible to avoid collinearity from arising in observational data, experi-
ments are usually designed to prevent correlation among predictors.
In our analysis of the poverty dataset, we used only numeric predictor variables so far. But the dataset
also contains the categorical variable Region. Categorical variables like Region are also helpful in
predicting outcomes. Using a categorical variable in a simple linear regression as a predictor variable
means fitting different mean levels of the response variable with respect to the explanatory variable.
As an example, we analyze if the percentage of people living below the poverty line varies with the
region:
poverty |>
summarise(`mean Pov per Region` = mean(Poverty), .by = Region)
# # A tibble: 4 x 2
# Region `mean Pov per Region`
# <fct> <dbl>
# 1 South 13.66
# 2 West 11.29
# 3 Northeast 9.5
# 4 Midwest 9.525
If we want to use a linear model to analyze the different mean values, we need to understand how
the information contained in the categorical variable is coded.
There exist different types of coding. R uses by default treatment coding, which is also called dummy
coding. In the case of Region (which has 4 levels), this amounts to creating indicators for three specific
regions, which is done as follows:
contrasts(poverty$Region)
# West Northeast Midwest
# South 0 0 0
# West 1 0 0
# Northeast 0 1 0
# Midwest 0 0 1
We see that South is the reference category and each estimated parameter (for the other three
levels) is the difference to the reference category. Let’s compare the results from fitting the linear
model
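A sketch of such a fit (the object name is hypothetical); its coefficients correspond to the group means computed below:
model_pov_region <- lm(Poverty ~ Region, data = poverty)
coef(model_pov_region)
# (Intercept) is the mean of the reference category South (about 13.66);
# RegionWest, RegionNortheast and RegionMidwest are the differences to that mean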
poverty |>
summarise(`mean Pov per Region` = mean(Poverty), .by = Region)
# # A tibble: 4 x 2
# Region `mean Pov per Region`
# <fct> <dbl>
# 1 South 13.66
# 2 West 11.29
# 3 Northeast 9.5
# 4 Midwest 9.525
The intercept estimate 𝛽0̂ is the mean value of the reference category South. All other mean values
can be obtained by adding the corresponding estimate to 𝛽0̂ .
Example 9.5. As an example, let’s analyze the influence of the high school graduation rate on
Poverty in the West and Midwest. In the first step we reduce the dataset to observations from those
two regions.
Using this dataset we fit a model with the two predictor variables Graduates and Region.
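A sketch of these two steps (not shown in the original text); the object names poverty_west and model_pov_5 match the summary output below:
poverty_west <- poverty |>
  filter(Region %in% c("West", "Midwest")) |>
  mutate(Region = droplevels(Region))   # drop the unused factor levels
model_pov_5 <- lm(Poverty ~ Graduates + Region, data = poverty_west)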
summary(model_pov_5)
#
# Call:
# lm(formula = Poverty ~ Graduates + Region, data = poverty_west)
#
# Residuals:
# Min 1Q Median 3Q Max
# -3.8039 -0.9648 -0.0985 0.3900 3.8086
#
# Coefficients:
# Estimate Std. Error t value Pr(>|t|)
# (Intercept) 53.5317 12.1951 4.390 0.000233 ***
# Graduates -0.4840 0.1396 -3.466 0.002194 **
# RegionMidwest -1.1310 0.7306 -1.548 0.135881
# ---
# Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#
# Residual standard error: 1.767 on 22 degrees of freedom
# Multiple R-squared: 0.4536, Adjusted R-squared: 0.4039
# F-statistic: 9.131 on 2 and 22 DF, p-value: 0.001297
Figure 9.6: Predicted values for the regions West and Midwest. By design, the model does not allow us to estimate different slope values in the two regions.
coef(model_pov_5)
# (Intercept) Graduates RegionMidwest
# 53.5317008 -0.4839698 -1.1310115
Slope of Graduates:
All else held constant, with each additional unit increase in the high school graduation rate, the percentage of people living below the poverty line decreases, on average, by 0.484.
Slope of Region:
All else held constant, the model predicts that for states in the Midwest (compared to the West), the
percentage of people living below the poverty line is on average lower by 1.13 percentage points.
Intercept:
In Western States with a high school graduation rate of zero percent the percentage of people living
below the poverty line is on average 53.532 percent.
Remark. Obviously, the intercept does not make sense in context. It only serves to adjust the height
of the line.
¾ Your turn
Use the estimated regression parameters to compute the predicted percentage of people living below the poverty line for a Midwestern state with a high school graduation rate of 88%.
coef(model_pov_5)
# (Intercept) Graduates RegionMidwest
# 53.5317008 -0.4839698 -1.1310115
Example 9.6. In this example, we analyze data from a survey of adult American women and their
children, a sub-sample from the National Longitudinal Survey of Youth. The aim of this analysis is
to predict the cognitive test scores of three- and four-year-old children using characteristics of their
mothers.
kidiq
# # A tibble: 434 x 5
# kid_score mom_hs mom_iq mom_work mom_age
# <int> <fct> <dbl> <fct> <int>
# 1 65 1 121.1 1 27
# 2 98 1 89.36 1 25
# 3 85 1 115.4 1 27
# 4 83 1 99.45 1 25
# 5 115 1 92.75 1 27
# 6 98 0 107.9 0 18
# # i 428 more rows
¾ Your turn
tidy(model_iq)
# # A tibble: 5 x 5
# term estimate std.error statistic p.value
# <chr> <dbl> <dbl> <dbl> <dbl>
# 1 (Intercept) 19.59 9.219 2.125 3.414e- 2
# 2 mom_hs1 5.095 2.315 2.201 2.825e- 2
# 3 mom_iq 0.5615 0.06064 9.259 9.973e-19
# 4 mom_work1 2.537 2.351 1.079 2.810e- 1
# 5 mom_age 0.2180 0.3307 0.6592 5.101e- 1
In this section we work with the evals dataset from the openintro package. The dataset contains
student evaluations of instructors’ beauty and teaching quality for 463 courses at the University of
Texas.
The teaching evaluations were conducted at the end of the semester. The beauty judgments were
made later, by six students who had not attended the classes and were not aware of the course evaluations
(two upper-level females, two upper-level males, one lower-level female, one lower-level male), see
Hamermesh and Parker (2005) for further details.
evals
# # A tibble: 463 x 11
# prof_id score bty_avg age gender cls_level cls_students
# <int> <dbl> <dbl> <int> <fct> <fct> <int>
# 1 1 4.7 5 36 female upper 43
# 2 1 4.1 5 36 female upper 125
# 3 1 3.9 5 36 female upper 125
# 4 1 4.8 5 36 female upper 123
# 5 2 4.6 3 59 male upper 20
# 6 2 4.3 3 59 male upper 40
# # i 457 more rows
# # i 4 more variables: rank <fct>, ethnicity <fct>, ...
(Scatterplots of evaluation score against the average beauty rating bty_avg, shown separately for female and male professors.)
Conclusion: There seems to be a positive relation between the beauty score of the professor and the
evaluation score. For a given beauty rating, it is hard to see if male professors are evaluated higher,
lower, or about the same as female professors.
But we can fit a model to see if the evaluation score varies with gender.
¾ Your turn
For a given beauty score, are male professors evaluated higher, lower, or about the same as
female professors?
A higher
B lower
C about the same
One possible model choice is the full model, which involves using all relevant variables as predictor
variables. In our case, it may not make sense to use the professor ID as a predictor variable. Therefore,
we will remove this variable from the dataset and then fit the full model using the . notation.
Choosing all variables as predictor variables is definitely not always a good choice. But to decide
which model is preferable regarding the prediction accuracy, we need to talk about measures again.
Remember that we introduced in Section 8.6 the MSE (mean squared error):
$$\mathrm{MSE} = \frac{1}{m}\sum_{i=1}^{m} (y_i - \hat{y}_i)^2,$$
where 𝑦𝑖̂ are the predictions based on a multiple linear regression model.
Assume we have a fitted model based on 𝑛 observations and obtained the parameter estimates
𝛽0̂ , 𝛽1̂ , … , 𝛽𝑘̂ . Now we are given 𝑚 new outcomes of the predictor variables 𝑥𝑗,𝑖 . This allows us to
compute 𝑚 predictions 𝑦𝑖̂ for these new outcomes.
Further, assume that in addition to the 𝑚 new observations of the predictor variables, we are also
given the corresponding observations 𝑦𝑖 of the response. In that case, we can evaluate the predictive
accuracy through the MSE. However, since the MSE is measured in squared deviations, one often
prefers a measure on the scale of the observations and uses, therefore, the root mean square error:
$$\mathrm{RMSE} = \sqrt{\mathrm{MSE}} = \sqrt{\frac{1}{m}\sum_{i=1}^{m} (y_i - \hat{y}_i)^2}\,.$$
Remark. This measure is difficult to evaluate if we don't have any new outcomes, which will often
be the case.
Solution: Use the idea of splitting the available data in training and test data, introduced in Sec-
tion 8.6.1.
To use the different packages contained in tidymodels we load directly the collection (similar to
tidyverse):
library(tidymodels)
To get started, let’s split evals into a training set and a testing set. We’ll keep most (around 90%) of
the rows from the original dataset (subset chosen randomly) in the training set. The training data will
be used to fit the model, and the test data will be used to measure model performance.
To do this, we can use the rsample package to create an object that contains the information on how
to split the data (apply initial_split()), and then use training() and testing() from the rsample
package to create data frames for the training and test data:
# for reproducibility
set.seed(111)
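A minimal sketch of this split (the names evals_split and train are our choice; test is the test data referred to below):

evals_split <- initial_split(evals, prop = 0.9)
train <- training(evals_split)
test  <- testing(evals_split)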
Now we fit the full model to the training data and compute predictions on the test data using
predict().
pred_full contains the predictions 𝑦𝑖̂ of the response values 𝑦𝑖 contained in test. We can use this
information to compute the RMSE.
Let’s create a function for computing the RMSE, as we will need to do this multiple times.
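One possible implementation of such a function (the name rmse is our choice):

rmse <- function(observed, predicted) {
  # root mean squared error between observed values and predictions
  sqrt(mean((observed - predicted)^2))
}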
Roughly speaking, we can say that the average difference between the predicted and actual scores
is approximately 0.52 on a scale of 1 to 5. That’s a measure of the goodness of fit. Our goal in this
section is to use the RMSE to compare the predictive accuracy between different models. So, we have
to fit another model.
The boxplots of the evaluation score for the two levels of ethnicity (minority, not minority) show no
clear difference in score. Therefore, we can remove ethnicity from the model and observe the effect
on predictive accuracy.
rmse_red
# [1] 0.5211824
rmse_full
# [1] 0.5243576
Warning
The result is based on one split of the data. But in practice, this has to be done multiple times
to come up with a conclusion.
We already know how to do this multiple times. In Section 8.10.1 we introduced the concept of 𝑣-fold
cross validation. The idea was to randomly split the data into 𝑣 parts/folds. Each fold then plays the
role of the validation (test) data once. This leads to 𝑣 validation errors, which are averaged to form
an overall cross-validation error
$$\mathrm{CV}_{(v)} = \frac{1}{v}\sum_{i=1}^{v} \mathrm{RMSE}_i,$$
where RMSE𝑖 is the validation error when using the 𝑖-th fold as validation data.
Luckily we don’t have to do the 𝑣 splits by hand. We can use vfold_cv().
vfold_cv() is a function in the rsample package from tidymodels. To use it, we must adhere to the
tidymodels approach for specifying and fitting models.
We start by specifying the functional form of the model that we want to fit using the parsnip package.
We can define a linear regression model with the linear_reg() function:
linear_reg()
# Linear Regression Model Specification (regression)
#
# Computational engine: lm
On its own, the function really doesn’t do much. It only specifies the type of the model. Next, we
must choose a method to fit the model, the so-called engine. We have introduced the least squares
method for estimating the parameters of a linear regression. But that’s not the only method for fitting
a linear model. In this course we will not discuss any further methods for fitting linear regression,
but have a look at ?linear_reg to see the list of available engines.
The engine for least squares estimation is called lm.
lin_mod <-
linear_reg(engine = "lm")
lin_mod still doesn’t contain any estimated parameters, which makes sense, since we haven’t speci-
fied a concrete model equation. This can be done using the fit() function.
For illustration, let’s fit a model with bty_avg being the only predictor variable.
lin_mod |>
fit(score ~ bty_avg, data = evals)
# parsnip model object
#
#
# Call:
# stats::lm(formula = score ~ bty_avg, data = data)
#
# Coefficients:
# (Intercept) bty_avg
# 3.88034 0.06664
Now we are ready to create the different folds. We want to do 10-fold cross validation and therefore
choose v=10.
set.seed(123)
folds <- vfold_cv(evals, v = 10)
folds
# # 10-fold cross-validation
# # A tibble: 10 x 2
# splits id
# <list> <chr>
# 1 <split [416/47]> Fold01
# 2 <split [416/47]> Fold02
# 3 <split [416/47]> Fold03
# 4 <split [417/46]> Fold04
# 5 <split [417/46]> Fold05
# 6 <split [417/46]> Fold06
# # i 4 more rows
Defining a workflow
The task is now to fit the linear regression model to the ten different folds.
We know how to fit the model to one dataset using fit(). Luckily, we don’t have to repeat that step
ten times by hand.
We can add the model specification and the model formula to a workflow using the workflow() func-
tion from the workflows package. Given a workflow, the fit_resamples() function fits the specified
model to all folds.
We start by creating a workflow, which contains only the linear model.
wf_lm <-
workflow() |>
add_model(lin_mod)
By adding different formulas we create workflows for the full and reduced model.
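For instance, assuming prof_id has already been dropped from evals as described above, the two workflows could look like this (the reduced model drops ethnicity):

wf_full <- wf_lm |>
  add_formula(score ~ .)

wf_red <- wf_lm |>
  add_formula(score ~ . - ethnicity)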
lm_fit_full <-
wf_full |>
fit_resamples(folds)
lm_fit_red <-
wf_red |>
fit_resamples(folds)
collect_metrics(lm_fit_full)[,1:5]
# # A tibble: 2 x 5
# .metric .estimator mean n std_err
# <chr> <chr> <dbl> <int> <dbl>
# 1 rmse standard 0.5241 10 0.01619
# 2 rsq standard 0.1041 10 0.02549
collect_metrics(lm_fit_red)[,1:5]
# # A tibble: 2 x 5
# .metric .estimator mean n std_err
# <chr> <chr> <dbl> <int> <dbl>
# 1 rmse standard 0.5235 10 0.01606
# 2 rsq standard 0.1065 10 0.02559
The output includes the RMSE and 𝑅2 values, calculated over ten different folds. We notice slightly
better values for the reduced model compared to the full model for both measures. This confirms our
findings based on a single split.
In this section, we will explore methods for selecting subsets of predictors, including best subset and
stepwise model selection procedures.
To perform best subset selection, we must fit a separate least squares regression for every possible
combination of the 𝑘 predictors.
This means fitting all 𝑘 models that contain exactly one predictor and choosing the best one among
those $k$ models. Then continue with all $\binom{k}{2} = \frac{k(k-1)}{2}$ models that contain exactly two predictors, and
choose the best one containing two predictors, i.e., the one with the lowest RSS. Continue in this way
until the model with 𝑘 predictor variables has been fitted. In the end one needs to select a single best
model, which is not so obvious. Before we discuss how to do that, let’s summarize the Best subset
selection algorithm.
1. Let M0 denote the null model, which contains no predictors but only an intercept.
2. For 𝑗 = 1, 2, … , 𝑘:
i) Fit all $\binom{k}{j}$ models that contain exactly $j$ predictors.
ii) Among these $\binom{k}{j}$ models, pick the one with the lowest RSS, and call it $M_j$.
The set of $k$ predictor variables allows for $2^k$ subsets, i.e., we have to consider $2^k$ models. Step 2
reduces this number to $k + 1$ models.
To choose the best model, we need to select from these 𝑘 + 1 options. This process requires careful
consideration because the RSS of these 𝑘 + 1 models consistently decreases as the number of features
increases.
If we use the RSS to select the best model, we will always end up with a model involving all variables.
The issue is that a low RSS or a high 𝑅2 indicates a model with low training error, but we actually
want to choose a model with low error on new data (low test error).
In order to select the best model with respect to test error, we need to estimate this test error. There
are two common approaches:
1. We can indirectly estimate test error by making an adjustment to the training error to account
for the bias due to overfitting.
2. We can directly estimate the test error, using either a validation set approach or a cross-
validation approach, as discussed in Section 8.10.1.
The second approach requires computing the average RMSE (test error) over all v folds,
for all models under consideration. We will not discuss this approach any further; see (James et al.
2021, chap. 6.1.3) for details.
In Section 9.2.2, we already introduced one way of adjusting by using the adjusted 𝑅2 instead of the
𝑅2 . Another way of adjusting the RSS is the AIC, which we introduce in the following.
In Section 9.2.2, we discussed that the 𝑅2 value calculated on the training data can be overly optimistic
when applied to new data, especially in models with multiple predictors. This is because the training
MSE $= \frac{1}{n} \cdot \mathrm{RSS}$ tends to underestimate the MSE on the test data. A more accurate measure of the
explained variability is the adjusted 𝑅2 . This characteristic makes adjusted 𝑅2 a valuable criterion
for model selection.
A different, but related adjustment of the RSS is the AIC (An Information Criterion or Akaike Informa-
tion Criterion), which will now be defined.
Definition 9.6. Given a fitted multiple linear regression model with 𝑛 residuals 𝑒1 , … , 𝑒𝑛 and 𝑘 + 1
estimated parameters, the AIC is defined in the following way
$$\mathrm{AIC} = \text{constant} + n \cdot \ln\!\left(\frac{\mathrm{RSS}}{n}\right) + 2(k+1)\,.$$
Like the adjusted 𝑅2 , the AIC strikes a balance between the flexibility of using more predictors (RSS
decreases) and the complexity of the resulting model (the last summand 2(𝑘 + 1) increases).
Remark. Besides the adjusted 𝑅2 and the AIC, there exist further criteria defined as a transformation
of the RSS, like Mallows’s 𝐶𝑝 or BIC. We will not discuss the advantages and disadvantages of one
criterion over the others. Instead, we will focus on using the AIC. But please be aware that other
criteria do exist, which one can use instead of the AIC, and which might be preferable in some
scenarios.
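As a quick check of Definition 9.6, the value reported by extractAIC() for a fitted lm object should match $n \cdot \ln(\mathrm{RSS}/n) + 2(k+1)$, i.e., the AIC with the constant omitted. For example, using the built-in mtcars data:

fit <- lm(mpg ~ wt + hp, data = mtcars)
n   <- nobs(fit)
rss <- sum(residuals(fit)^2)
k   <- length(coef(fit)) - 1
n * log(rss / n) + 2 * (k + 1)  # AIC without the constant
extractAIC(fit)[2]              # should give the same value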
Due to computational limitations, best subset selection is not feasible for really large values of 𝑘.
In addition, as the search space increases, there is a greater risk of identifying models that appear
to perform well on the training data but may lack predictive power for future data. Therefore, an
extensive search space can lead to statistical challenges.
Stepwise methods, which explore a more restricted set of models, are appealing alternatives to best
subset selection.
Forward stepwise selection starts with a model that has only an intercept and no predictors. At each
step, the algorithm adds the predictor variable that provides the greatest improvement to the current
model’s fit. The process stops when no further improvement is possible.
Let M0 denote the null model, which contains no predictors, only an intercept.
1. For 𝑗 = 0, … , 𝑘 − 1
a) Fit all 𝑘 − 𝑗 models that can be obtained by adding one additional predictor to the
ones already in M𝑗 .
b) Choose among these 𝑘 − 𝑗 models the one with the smallest AIC, and call it M𝑗+1 .
Unlike best subset selection, which involved fitting $2^k$ models, forward stepwise selection consists of fitting
$$1 + \sum_{j=0}^{k-1} (k - j) = 1 + \frac{k(k+1)}{2}$$
models.
Figure 9.7: Comparison of computational advantage between Best Subset Selection (BSS) and Forward
Stepwise Selection (FSS) for models with k=1,…,10 predictors.
The computational advantage of forward stepwise selection over best subset selection is clear. While
forward stepwise tends to perform well in practice, it is not guaranteed to find the best possible model
out of all $2^k$ possible models (best in terms of the considered criterion, e.g., AIC).
For example, consider a scenario where you have a dataset with 𝑘 = 3 predictors. Suppose the
best one-variable model includes 𝑋1 , while the best two-variable model includes 𝑋2 and 𝑋3 . Then,
forward stepwise selection does not choose the best two-variable model because M1 contains 𝑋1 ,
requiring M2 to include 𝑋1 and one additional variable.
Backward stepwise selection is similar to forward stepwise selection, providing an efficient alternative
to best subset selection. However, unlike forward stepwise selection, it starts with the full model
containing all predictors and then iteratively removes the least useful predictor one at a time.
1. For 𝑗 = 𝑘, 𝑘 − 1, … , 1:
a) Fit all 𝑗 models that are obtained by omitting one of the predictors in 𝑀𝑗 . Each such
model has then a total of 𝑗 − 1 predictors.
b) Choose among these 𝑗 models the one with the smallest AIC, and call it M𝑗−1 .
Similar to forward stepwise selection, the backward selection approach examines at most
$1 + \frac{k(k+1)}{2}$ models. It is suitable for situations where the number of predictors $k$ is too large for best
subset selection.
As in forward stepwise selection, backward stepwise selection does not guarantee finding the best
model containing a subset of the 𝑘 predictors.
It’s important to note that backward selection requires the number of samples 𝑛 to be larger than the
number of predictors 𝑘, which allows the estimation of the full model. On the other hand, forward
stepwise can also be used in the case of 𝑛 < 𝑘.
Hybrid approaches
The best subset, forward stepwise, and backward stepwise selection approaches typically result in
similar but not identical models. In addition, hybrid versions of forward and backward stepwise se-
lection exist as an alternative. In a hybrid selection algorithm, variables are either added or removed
in each step. It continues to incrementally remove or add single predictors until no further improve-
ment to the model fit can be achieved. Such a stepwise selection approach aims to closely resemble
best subset selection while retaining the computational advantages of forward and backward stepwise
selection.
The stepwise selection algorithms are implemented in the step() function from the stats package.
The type of stepwise selection can be chosen with the direction argument. Possible values are
both (hybrid), backward and forward. The first argument of step() is object, which has to be the
fitted linear regression model considered in the first step (so M0 for forward and M𝑘 for backward
selection). With the scope argument one can specify the range of models examined in the stepwise
search. In the case of backward selection, step() considers all models between the full and the null
model, if scope is unspecified. The same is true in the hybrid approach, when starting with a full
model.
We run a backward and forward selection for selecting a model predicting the evaluation score in
the evals dataset. We start with the backward stepwise selection and use model_beauty_2 (=M𝑘 ) as
a starting point. In each step (if possible), we drop the variable whose removal leads to the largest
reduction in AIC.
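A sketch of the corresponding call (the lengthy step-by-step output is shown only in part below):

stats::step(
  object = model_beauty_2,
  direction = "backward")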
The last part of the output shows the estimated coefficients of the selected model:
#                   age            gendermale
#             -0.008691              0.209779
#      ranktenure track           ranktenured
#             -0.206806             -0.175756
#   languagenon-english  pic_outfitnot formal
#             -0.244128             -0.130906
Now we compare this model with the outcome of the forward stepwise selection. We specify the null
model M0 as input. In addition, we have to define the scope, which has to be done by specifying a
formula describing the largest possible model under consideration.
stats::step(
object = lm(score ~ 1, data = evals),
scope = ~ bty_avg + age + gender + cls_level + cls_students +
rank + ethnicity + language + pic_outfit,
direction = "forward")
# Start: AIC=-562.99
# score ~ 1
#
# Df Sum of Sq RSS AIC
# + bty_avg 1 4.7859 131.87 -577.49
# + gender 1 2.2602 134.39 -568.71
# + language 1 1.6023 135.05 -566.45
# + age 1 1.5655 135.09 -566.32
# + rank 2 1.5891 135.06 -564.40
# + cls_level 1 0.9575 135.70 -564.24
# + ethnicity 1 0.7857 135.87 -563.66
# <none> 136.65 -562.99
# + pic_outfit 1 0.1959 136.46 -561.65
# + cls_students 1 0.0922 136.56 -561.30
#
# Step: AIC=-577.49
# score ~ bty_avg
#
# Df Sum of Sq RSS AIC
# + gender 1 3.2934 128.57 -587.20
# + language 1 1.6846 130.18 -581.45
#
# Step: AIC=-596.9
# score ~ bty_avg + gender + language + age + rank
#
# Df Sum of Sq RSS AIC
# + pic_outfit 1 0.91002 122.84 -598.32
# <none> 123.75 -596.90
# + ethnicity 1 0.20537 123.55 -595.67
# + cls_level 1 0.10498 123.65 -595.29
# + cls_students 1 0.01436 123.74 -594.95
#
# Step: AIC=-598.32
# score ~ bty_avg + gender + language + age + rank + pic_outfit
#
# Df Sum of Sq RSS AIC
# <none> 122.84 -598.32
# + cls_students 1 0.16684 122.68 -596.94
# + ethnicity 1 0.13898 122.70 -596.84
# + cls_level 1 0.11982 122.72 -596.77
#
# Call:
# lm(formula = score ~ bty_avg + gender + language + age + rank +
# pic_outfit, data = evals)
#
# Coefficients:
# (Intercept) bty_avg
# 4.490380 0.056916
# gendermale languagenon-english
# 0.209779 -0.244128
# age ranktenure track
# -0.008691 -0.206806
# ranktenured pic_outfitnot formal
# -0.175756 -0.130906
Forward stepwise selection gives us the same model when specifying the scope to be the full
model.
Finally, let’s apply the hybrid approach. We start again with the null model.
stats::step(
object = lm(score ~ 1, data = evals),
scope = ~ bty_avg + age + gender + cls_level + cls_students +
rank + ethnicity + language + pic_outfit,
direction = "both")
A bit of caution
The selection algorithms we just discussed can be viewed as heuristics to optimize a chosen model
selection criterion over the set of all $2^k$ possible models.
Note, however, that even if these algorithms were able to find the “best” model, such a model would
necessarily correspond to some subset of the predictors that were originally supplied. Hence, the
algorithms are entirely unable to select models containing transformations of the predictor variables
unless the user supplies such transformations as additional predictors.
Interaction effects
For example, we did not account for any interaction effects. An interaction effect in a regression
model occurs when the impact of one predictor variable on the response variable varies based on
the value of another predictor variable. In other words, the influence of one variable is different at
different levels of another variable.
Possible interactions in this context might be, for example, between age and gender.
Let's try to visualize this possible interaction effect.
(Figure: evaluation score plotted against age, separately for female and male professors.)
Conclusion: We observe a more pronounced decline in the evaluation score for women compared to
men as age increases.
Including this interaction effect improves the model fit, as we can see when comparing the resulting
model with the model chosen by the stepwise selection algorithms.
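One way such an interaction model could be specified, consistent with the coefficients shown in the following exercise (the name model_int is taken from there; the exact set of main effects is our reconstruction):

model_int <- lm(
  score ~ bty_avg + gender * age + language + rank + pic_outfit,
  data = evals)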
¾ Your turn
Determine the slope of age for male and female observations. Give an interpretation of the
slope.
coef(model_int)
# (Intercept) bty_avg
# 5.06163919 0.05767324
# gendermale age
# -0.59415367 -0.02057155
# languagenon-english ranktenure track
# -0.18133702 -0.26383826
# ranktenured pic_outfitnot formal
# -0.21350431 -0.14013002
# gendermale:age
# 0.01712965
Short summary
This chapter introduces linear regression as a foundational supervised learning method for pre-
dicting quantitative responses. It highlights the continued relevance of this seemingly basic
technique despite more advanced approaches. The chapter explores simple linear regres-
sion with a single predictor, detailing the least squares method for model fitting and
parameter estimation using the poverty dataset. It then extends to multiple linear regres-
sion, considering scenarios with several predictor variables and addressing concepts such as
multicollinearity. The text also covers the inclusion of categorical predictors and meth-
ods for model assessment, including (adjusted) R-squared, residual standard error and
AIC. Furthermore, it discusses model selection techniques like best subset and stepwise selec-
tion, along with the importance of cross-validation for evaluating predictive accuracy using the
evals dataset.
10 Logistic regression
In this chapter, we will illustrate the concept of logistic regression using the email dataset from the
openintro package. The data represents incoming emails from David Diez’s mail account for the
first three months of 2012.
We will be interested in predicting the spam status (0=no, 1=yes) of an incoming email, based on
further features of the email.
library(openintro)
email
# # A tibble: 3,921 x 21
# spam to_multiple from cc sent_email time image attach
# <fct> <fct> <fct> <int> <fct> <dttm> <dbl> <dbl>
# 1 0 0 1 0 0 2012-01-01 07:16:41 0 0
# 2 0 0 1 0 0 2012-01-01 08:03:59 0 0
# 3 0 0 1 0 0 2012-01-01 17:00:32 0 0
# 4 0 0 1 0 0 2012-01-01 10:09:49 0 0
# 5 0 0 1 0 0 2012-01-01 11:00:01 0 0
# 6 0 0 1 0 0 2012-01-01 11:04:46 0 0
# # i 3,915 more rows
# # i 13 more variables: dollar <dbl>, winner <fct>, inherit <dbl>, ...
Let’s visualize the distribution of winner and line_breaks for each level of spam. We start with a
plot of the joint distribution of spam and winner.
library(ggmosaic)
ggplot(email) +
geom_mosaic(aes(x = product(winner, spam), fill = winner)) +
scale_fill_brewer(palette = "Set1")
Figure 10.1: A larger percentage of spam emails contain the word winner compared to non-spam
emails.
Boxplots for line_breaks, for each combination of spam and winner are shown in Figure 10.2.
Figure 10.2: The average number of line breaks is smaller in spam emails compared to non-spam
emails.
It seems clear that both winner and line_breaks affect the spam status. But how do we develop
a model to explore this relationship?
Why not use linear regression? The response variable is the binary variable spam with levels
0 (no spam) and 1 (spam). The expected value of spam is equal to the probability of 1 (spam),
and thus a number in [0, 1]. On the other hand, a linear regression model will yield predictions
of the form $\beta_0 + \sum_{j=1}^{k} \beta_j X_j$, which, depending on the values of the $X_j$'s, may take on any real
number as a value.
While we could proceed in an ad-hoc fashion and map the linear predictions to the nearest num-
ber in [0, 1], it is evident that a different type of model that always produces sensible estimates
of the probability of 1 (spam) may be preferable.
In linear regression, we model the response variable 𝑌 directly. In contrast, logistic regression models
focus on the probability that the response takes one of two possible values.
For the email data, logistic regression models the probability
P(𝑌 = 1|𝑋1 = 𝑥1 , 𝑋2 = 𝑥2 ) ,
where 𝑌 is the response spam and 𝑋1 and 𝑋2 are the two predictors line_breaks and winner.
The values of 𝑝(𝑥1 , 𝑥2 ) = P(𝑌 = 1|𝑋1 = 𝑥1 , 𝑋2 = 𝑥2 ) will range between 0 and 1. A prediction
for spam can then be made based on the probability value 𝑝(𝑥1 , 𝑥2 ). For instance, one could set the
prediction as spam=1 for any email with 𝑝(𝑥1 , 𝑥2 ) > 0.5. On the other hand, if we are particularly
bothered by spam, we might opt to lower the threshold, for example, 𝑝(𝑥1 , 𝑥2 ) > 0.3, making it
easier to classify an email as spam. However, this will also increase the chances of misclassifying a
non-spam email as spam.
We already argued that using the linear predictor
𝜂𝑖 ∶= 𝛽0 + 𝛽1 𝑥1,𝑖 + ⋅ ⋅ ⋅ + 𝛽𝑘 𝑥𝑘,𝑖
to model the response values directly, as in linear regression, leads to fitted values outside the interval
[0, 1]. Nevertheless, we would like to maintain the linear predictor as a way of describing the influence
of the predictor variables. To resolve this issue, we will use the linear predictor not to directly predict
the probabilities, 𝑝(x𝑖 ), but rather a transformation of these probabilities. This transformation will
have values on the real line.
To complete the specification of the logistic model, we must introduce a suitable transformation,
known as the link function, which links the linear predictor 𝜂𝑖 to 𝑝(x𝑖 ). There are a variety of options
but the most commonly used is the logit function:
$$\mathrm{logit}(p) = \log\!\left(\frac{p}{1-p}\right), \quad p \in (0, 1).$$
ggplot() + xlim(0.01,.99)+
geom_function(fun = function(x) log(x / (1-x)))+
labs(x = expression(p[i]), y = expression(logit(p[i])))
Solving $\eta = \log\!\left(\frac{p}{1-p}\right)$ for $p$, we find the inverse of the logit function:
$$\mathrm{logit}^{-1}(\eta) = \frac{\exp(\eta)}{1 + \exp(\eta)} = \frac{1}{1 + \exp(-\eta)} \in (0, 1).$$
ggplot() + xlim(-5, 5) +
geom_function(fun = function(x) 1 / (1 + exp(-x))) +
labs(x = expression(eta[i]), y = expression(logit^-1*(eta[i])))
Definition 10.1. Let 𝑌𝑖 be independent binary response variables with associated predictor variable
values x𝑖 = (𝑥1,𝑖 , … , 𝑥𝑘,𝑖 ), 𝑖 ∈ {1, … , 𝑛}. Then the logistic regression model is defined through
the equation
$$P(Y_i = 1 \mid \mathbf{x}_i) = \frac{\exp(\eta_i)}{1 + \exp(\eta_i)} = \frac{1}{1 + \exp(-\eta_i)}$$
with linear predictor
$$\eta_i = \beta_0 + \beta_1 x_{i,1} + \cdots + \beta_k x_{i,k}$$
and parameters $\boldsymbol{\beta} = (\beta_0, \beta_1, \dots, \beta_k)$. Written out, we arrive at:
$$P(Y_i = 1 \mid \mathbf{x}_i) = \frac{\exp(\beta_0 + \beta_1 x_{i,1} + \cdots + \beta_k x_{i,k})}{1 + \exp(\beta_0 + \beta_1 x_{i,1} + \cdots + \beta_k x_{i,k})}\,.$$
Note
It turns out that the logistic regression model is a special case of a more general class of regression
models, the generalized linear models (GLMs).
All generalized linear models have the following three characteristics:
1. A probability distribution describing the response variable.
2. A linear predictor
$$\eta = \beta_0 + \beta_1 x_1 + \cdots + \beta_k x_k\,.$$
3. A link function that relates the linear predictor to the parameter of the response distribution.
The odds of an event is the ratio of the probability of the event and the probability of the complementary
event. Thus, the odds of the event $\{Y_i = 1\} \mid \mathbf{x}_i$ are
$$\frac{p(\mathbf{x}_i)}{1 - p(\mathbf{x}_i)}\,.$$
Values close to 0 and $\infty$ indicate very low or very high probabilities for the event of interest, such
as being spam.
Using Definition 10.1 leads to
$$\frac{p(\mathbf{x}_i)}{1 - p(\mathbf{x}_i)} = \exp(\beta_0) \cdot \exp(\beta_1 x_{i,1}) \cdots \exp(\beta_k x_{i,k}),$$
and when applying the logarithm on both sides we arrive at
$$\log\!\left(\frac{p(\mathbf{x}_i)}{1 - p(\mathbf{x}_i)}\right) = \beta_0 + \sum_{j=1}^{k} \beta_j x_{j,i}\,.$$
Therefore, $e^{\beta_j}$ is the factor by which the odds change when $x_{i,j}$ increases by one unit, holding all other
variables constant.
Thus,
$$1 - p(\mathbf{x}_i) = P(Y_i = 0 \mid \mathbf{x}_i) = \frac{1}{1 + \exp(\beta_0 + \beta_1 x_{1,i} + \cdots + \beta_k x_{k,i})}\,.$$
To find estimates (𝛽0̂ , … , 𝛽𝑘̂ ), we may apply the principle of maximum likelihood. That is, we find
the parameters for which the probability of the data (𝑦1 , … , 𝑦𝑛 ) is as large as possible.
This can also be thought of as seeking estimates for $(\beta_0, \beta_1, \dots, \beta_k)$ such that the predicted probability
$\hat{p}(\mathbf{x}_i)$ corresponds as closely as possible to the observed response value.
By independence of the observations, the probability of our data is a product in which the factors are
𝑝(x𝑖 ) or 1 − 𝑝(x𝑖 ) depending on whether 𝑦𝑖 = 1 or 𝑦𝑖 = 0, respectively.
For any probability $p$ we have $p^0 = 1$ and $p^1 = p$, so the likelihood function (probability of the
data) may be written conveniently as
$$L(\mathbf{y}, \mathbf{x} \mid \boldsymbol{\beta}) = \prod_{i=1}^{n} p(\mathbf{x}_i)^{y_i} (1 - p(\mathbf{x}_i))^{1 - y_i} = \prod_{i=1}^{n} \left(\frac{p(\mathbf{x}_i)}{1 - p(\mathbf{x}_i)}\right)^{y_i} (1 - p(\mathbf{x}_i))\,.$$
While we won’t go into any details here, this is a function that is easy to maximize using iterative
algorithms (implemented in R).
In R, we fit a Generalized Linear Model (GLM) similarly to a linear model. It’s important to remember
that logistic regression is a special case of a generalized linear model. Instead of using lm(), we use
glm() and specify the type of the GLM with the family argument of glm().
Let’s start by fitting a model using only line_breaks as predictor variable.
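A minimal sketch of this fit (the name model_lb matches the output below):

model_lb <- glm(spam ~ line_breaks, data = email, family = binomial)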
library(broom)
tidy(model_lb)
# # A tibble: 2 x 5
# term estimate std.error statistic p.value
# <chr> <dbl> <dbl> <dbl> <dbl>
# 1 (Intercept) -1.74 0.0717 -24.3 5.61e-130
# 2 line_breaks -0.00345 0.000416 -8.31 9.37e- 17
email |>
mutate(pred = predict(model_lb, type = "response"),
spam = ifelse(spam == "1", 1, 0 )) |>
ggplot(aes(x = line_breaks, y = spam)) +
geom_point(aes(colour = winner)) +
ylab("Spam / predicted values") +
geom_line(aes(y = pred), size = 1.2) +
scale_color_brewer(palette = "Set1")
Interpretation:
tidy(model_lb)
# # A tibble: 2 x 5
# term estimate std.error statistic p.value
# <chr> <dbl> <dbl> <dbl> <dbl>
# 1 (Intercept) -1.74 0.0717 -24.3 5.61e-130
# 2 line_breaks -0.00345 0.000416 -8.31 9.37e- 17
Interpretations in terms of log odds for the intercept and slope terms are easy.
Intercept: The log odds of spam for an email with zero line breaks are -1.7391.
Slope: For each additional line break, the log odds decrease by 0.00345.
Problem
These interpretations are not particularly intuitive. Most of the time, we care only about sign
and relative magnitude. This is described by the odds ratio.
Interpretation: If the number of line breaks is increased by 10, the odds of being spam decrease
by roughly 3%.
Remark. The choice of 10 additional line breaks was our decision. In a different application, we would
choose a different value. A standard value would be one, but in this case, a one-unit change would
have been too insignificant.
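A quick check of this number, assuming model_lb from above:

exp(10 * coef(model_lb)["line_breaks"])
# roughly 0.966: the odds are multiplied by about 0.97, a decrease of roughly 3%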
We have observed that the likelihood of an email being classified as spam decreases as the number of
line breaks increases. However, we obtained a relatively small probability of being spam for a small
number of line breaks. This indicates that only using line_breaks as a predictor will probably not
lead to a good classifier. Therefore, we extend the model by including winner as a second predictor.
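A sketch of the extended fit (the name model_lb_win matches the code below):

model_lb_win <- glm(spam ~ line_breaks + winner, data = email, family = binomial)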
For a given number of line_breaks, the odds ratio for being spam is $e^{1.97} \approx 7.17$ if the level of
winner is changed from no (reference level) to yes.
For the different levels of winner we get the following fitted models:
1. winner=no
$$\log\!\left(\frac{p(\mathbf{x}_i)}{1 - p(\mathbf{x}_i)}\right) = -1.77 - 0.00036 \cdot x_{i,1} + 1.97 \cdot 0 = -1.77 - 0.00036 \cdot x_{i,1},$$
where $\mathbf{x}_i = (x_{i,1}, 0)^\top$ with $x_{i,1}$ being the number of line breaks in the $i$-th email.
2. winner=yes
$$\log\!\left(\frac{p(\mathbf{x}_i)}{1 - p(\mathbf{x}_i)}\right) = -1.77 - 0.00036 \cdot x_{i,1} + 1.97 \cdot 1 = 0.2 - 0.00036 \cdot x_{i,1},$$
where $\mathbf{x}_i = (x_{i,1}, 1)^\top$ with $x_{i,1}$ being the number of line breaks in the $i$-th email.
Let’s visualize the fitted values for the different levels of winner.
email |>
mutate(pred = predict(model_lb_win, type = "response"),
spam = ifelse(spam == "1", 1, 0 )) |>
ggplot(aes(x= line_breaks, y = spam, colour = winner)) +
geom_point() +
ylab("Spam / fitted values") +
geom_line(aes(y = pred), size = 1.2) +
scale_color_brewer(palette = "Set1")
The most common mistake when interpreting logistic regression is to treat an odds ratio as a ratio of
probabilities. It's a ratio of odds.
This means that emails containing the word winner are not $e^{\hat{\beta}_2} = e^{1.97} \approx 7.17$ times more likely to
be spam than emails not containing the word winner.
Such an interpretation would be the relative risk
$$\mathrm{RR} = \frac{P(\text{spam} \mid \text{exposed})}{P(\text{spam} \mid \text{unexposed})},$$
where "exposed" means in this case that the email contains the word winner. So, this is different
compared to the odds ratio
$$\mathrm{OR} = \frac{P(\text{spam} \mid \text{exposed}) \,/\, \big(1 - P(\text{spam} \mid \text{exposed})\big)}{P(\text{spam} \mid \text{unexposed}) \,/\, \big(1 - P(\text{spam} \mid \text{unexposed})\big)}\,.$$
Based on the fitted model (model_lb_win) one can compute the following probabilities of being
spam.
The probability of an email being spam that contains the word “winner” and has 20 line breaks is
given by:
predict(model_lb_win,
newdata = data.frame(winner = "yes", line_breaks = 20),
type = "response")
# 1
# 0.5317901
The probability of an email being spam that does not contain the word “winner” and has 20 line
breaks is given by:
predict(model_lb_win,
newdata = data.frame(winner = "no", line_breaks = 20),
type = "response")
# 1
# 0.1365626
Note
The relative risk depends on the context, in the current example on the number of line
breaks contained in the email.
The probability of an email being spam that contains the word “winner” and has 2 line breaks is given
by:
predict(model_lb_win,
newdata = data.frame(winner = "yes", line_breaks = 2),
type = "response")
# 1
# 0.5478812
The probability of an email being spam that does not contain the word “winner” and has 2 line breaks
is given by:
predict(model_lb_win,
newdata = data.frame(winner = "no", line_breaks = 2),
type = "response")
# 1
# 0.1443825
Given a fitted logistic regression model, it is possible to compute the relative risk using the
oddsratio_to_riskratio() function from the effectsize package, which is based on the odds
ratio and the predicted probability obtained under a specific set of predictor variables.
library(effectsize)
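For instance, one could use the odds ratio $e^{1.97}$ together with the predicted probability for an email without the word winner and 20 line breaks (computed above) as the baseline risk; the result should be close to the ratio of the two predicted probabilities, about 0.53 / 0.14 ≈ 3.9 (a sketch, assuming this is the intended usage of the function):

oddsratio_to_riskratio(exp(1.97), p0 = 0.1365626)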
In linear regression, we use transformations of the residual sum of squares, such as the MSE, to assess
the accuracy of our predictions. However, in classification tasks, we need additional metrics.
For logistic regression, it is important to evaluate how well we predict each of the two possible out-
comes. To illustrate the need for these new measures, let’s consider an example.
Example 10.1. If you’ve ever watched the TV show House, you know that Dr. House regularly states,
“It’s never lupus.”
Lupus is a medical phenomenon where antibodies that are supposed to attack foreign
cells to prevent infections instead see plasma proteins as foreign bodies, leading to a high
risk of blood clotting. It is believed that 2% of the population suffers from this disease.
The test for lupus is very accurate if the person actually has lupus. However, it is very inaccurate if
the person does not.
More specifically, the test is 98% accurate if a person actually has the disease. The test is 74% accu-
rate if a person does not have the disease.
Is Dr. House correct when he says it’s never lupus, even if someone tests positive for lupus?
Let’s use the following tree to compute the conditional probability of having lupus given a positive
test result.
(Probability tree: first branch on lupus = yes (0.02) / lupus = no (0.98), then on the test result within
each branch.)
$$P(\text{lupus} = \text{yes} \mid \text{test} = +) = \frac{P(\text{test} = +,\, \text{lupus} = \text{yes})}{P(\text{test} = +)}
= \frac{P(\text{test} = +,\, \text{lupus} = \text{yes})}{P(\text{test} = +,\, \text{lupus} = \text{yes}) + P(\text{test} = +,\, \text{lupus} = \text{no})}
= \frac{0.0196}{0.0196 + 0.2548} = 0.0714$$
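The arithmetic can be checked directly in R:

p_pos_and_lupus <- 0.02 * 0.98  # P(lupus = yes) * P(test = + | lupus = yes)
p_pos_no_lupus  <- 0.98 * 0.26  # P(lupus = no)  * P(test = + | lupus = no)
p_pos_and_lupus / (p_pos_and_lupus + p_pos_no_lupus)  # approximately 0.0714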
Testing for lupus is actually quite complicated. A diagnosis usually relies on the results of multi-
ple tests, including a complete blood count, an erythrocyte sedimentation rate, a kidney and liver
assessment, a urinalysis, and an anti-nuclear antibody (ANA) test.
It is important to consider the implications of each of these tests and how they contribute to the overall
decision to diagnose a patient with lupus.
At some level, a diagnosis can be seen as a binary decision (lupus or no lupus) that involves the
complex integration of various predictor variables.
The diagnosis should try to ensure a high probability of a positive test result when the patient is
actually ill. This is referred to as the test’s sensitivity.
On the other hand, the diagnosis should also have the property of yielding a high probability of a
negative test result if the patient does not have the disease. This is known as the specificity of the
test.
The example does not provide any information about how a diagnosis/decision is made, but it
does give us something equally important - the concept of the “sensitivity” and “specificity” of
the test.
Sensitivity and specificity are crucial for understanding the true meaning of a positive or nega-
tive test result.
Definition 10.2. The sensitivity of a test refers to its ability to accurately detect a condition in
subjects who do have the condition:
$$\text{sensitivity} = P(\text{Test} = + \mid \text{Condition} = +)\,.$$
The specificity of a test refers to its ability to correctly rule out the condition in subjects who do not
have it:
$$\text{specificity} = P(\text{Test} = - \mid \text{Condition} = -)\,.$$
A positive or negative test result can also be a mistake; these are called false positives and false nega-
tives, respectively. All four outcomes are illustrated in Figure 10.3.
When given a sample, we can calculate the number of true positives (#TP), false positives (#FP), false
negatives (#FN), and true negatives (#TN). Using these numbers, we can then estimate sensitivity and
specificity:
$$\widehat{\text{sensitivity}} = \frac{\#TP}{\#TP + \#FN}\,, \qquad \widehat{\text{specificity}} = \frac{\#TN}{\#FP + \#TN}\,.$$
Given the definitions of sensitivity and specificity, we can further define the false negative
rate 𝛽
𝛽 = P(Test = − | Condition = +) ,
and the false positive rate 𝛼
𝛼 = P(Test = + | Condition = −) .
10.5 Classification
Given a fitted logistic regression model, we can compute the probability of spam given a set of pre-
dictor variables.
Classification algorithm
Input: response y_i, predictor variables x1_i,...,xk_i
1. Compute probabilities p_i
2. Decide if spam given p_i -> decisions d_i
Output: d_i
Important
We need a decision rule in the second step! The rule must be of the form:
email is spam if p_i > threshold
So, the rule is fully defined once we pick a suitable threshold.
Let’s begin by computing the probabilities p_i as the first step. We have already fitted two models to
the email dataset. But now let’s use the full model as our first choice.
We can include the predicted probabilities (pred) of being spam in the dataset using the
add_predictions() function. Since we want to compute predictions on the response level,
we will choose the option type = "response".
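A sketch of this step, assuming the full model has been fitted and stored as model_full (the name is our choice; add_predictions() comes from the modelr package):

library(modelr)
email_fit_full <- email |>
  add_predictions(model_full, var = "pred", type = "response")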
Based on these probabilities, a decision needs to be made concerning which emails should be flagged
as spam. This is accomplished by selecting a threshold probability. Any email that surpasses that
probability will be flagged as spam.
The computed probabilities are visualized in Figure 10.4. For each level of spam the probabilities of
being spam are shown on the y-axis.
Figure 10.4: Jitter plot of predicted probabilities of being spam for spam and non-spam emails.
We could start with a conservative choice for the threshold, such as 0.75, to avoid classifying non-
spam emails as spam. In Figure 10.5, a horizontal red line indicates a threshold of 0.75.
Figure 10.5: Jitter plot of predicted probabilities of being spam for spam and non-spam emails. Points
above the red line are classified as being spam.
library(tidymodels)
email_fit_full<- email_fit_full |>
mutate(
# transform obs as factor
obs = factor(spam == 1),
# transform pred as factor
pred_f = factor(pred > 0.75)
)
email_fit_full |>
accuracy(obs, pred_f)
# # A tibble: 1 x 3
# .metric .estimator .estimate
# <chr> <chr> <dbl>
# 1 accuracy binary 0.911
A confusion matrix for categorical data is a contingency table of observed and predicted response
values. It can be computed with the conf_mat() function from the yardstick package.
conf_mat_email
# Truth
# Prediction FALSE TRUE
# FALSE 3544 339
# TRUE 10 28
From the matrix, we can observe that there are 3544 true negatives.
What is the sensitivity and specificity of this decision rule? We can estimate both using the values
from the confusion matrix.
$$\widehat{\text{sens}} = \frac{28}{28 + 339} = 0.076294\,, \qquad \widehat{\text{spec}} = \frac{3544}{10 + 3544} = 0.997186$$
The values are already stored in the conf_mat_email object, as evident from the following output.
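For instance, summary() applied to a conf_mat object returns a tibble of metrics derived from the table, including sensitivity and specificity:

summary(conf_mat_email)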
To strike a balance between sensitivity and specificity, we need to experiment with different thresh-
olds.
A good way to do this is by using the threshold_perf() function from the probably package, which
is linked to tidymodels but is not a part of it.
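A sketch of such a call, assuming the columns obs (truth) and pred (predicted probability) created above; the argument names and the threshold grid are our assumptions, see ?threshold_perf for details:

library(probably)
threshold_perf(email_fit_full, obs, pred,
               thresholds = seq(0.1, 0.9, by = 0.05))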
Youden’s J statistic
In addition to sensitivity and specificity, the output also includes a summary of both, the
Youden’s J statistic. The statistic is defined like this
𝐽 = sensitivity + specificity − 1 .
The statistic is at most one, which would be the case if there were no false positives and no false
negatives. Hence, one should choose a threshold such that Youden’s J becomes maximal.
Figure 10.6: Jitter plot of predicted probabilities of being spam for spam and non-spam emails. The
horizontal lines indicate the different threshold values.
The relationship between sensitivity and specificity is illustrated by plotting the sensitivity against
the specificity or the false positive rate. Such a curve is called receiver operating characteristic
(ROC) curve.
Remember: false positive rate = 1 - specificity.
Before we can plot the ROC curve, we split the data into a training and a test set.
set.seed(12345)
train_idx <- sample(1:nrow(email), floor(0.9 * nrow(email)))
email_train <- email[train_idx,]
email_test <- email[-train_idx,]
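The predictions pred_spam used below can then be obtained, for example, by fitting the full model to the training data and predicting on the test data (the object names are our choice):

model_full_train <- glm(spam ~ ., data = email_train, family = binomial)
pred_spam <- predict(model_full_train, newdata = email_test, type = "response")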
We then compare the predicted probabilities pred_spam with the actual spam status by plotting the ROC curve.
library(pROC)
spam_roc <-
roc(email_test$spam ~ pred_spam)
ggroc(spam_roc) +
geom_segment(
aes(x = 1, xend = 0, y = 0, yend = 1),
color="grey", linetype="dashed")
Figure 10.7: ROC curve for the full model fitted to email_train.
1. The graph shows the trade-off between sensitivity and specificity for various thresholds.
2. It’s simple to evaluate the model’s performance against random chance (shown by the
dashed line).
3. We can use the area under the curve (AUC) to assess the predictive ability of a model.
auc(spam_roc)
# Area under the curve: 0.9212
Remark. The vertical distance between a point on the ROC curve and the dashed line equals Youden’s
J statistic.
The ROC curve and the corresponding AUC value provide a useful way to measure and describe the
predictive accuracy of a model. However, the most common use case is likely comparing different
models based on their AUC value.
We will conduct a backward stepwise selection on the full model fitted to the training data to find an
additional competitor.
The algorithm removed two variables (exclaim_subj and cc). We can now compare the predictive
accuracy of the reduced and full models on the test data. Predictions and the ROC curve for the
model_step will be computed in the next step.
Figure 10.8 presents the ROC curves for both models. Visually, there is little to no difference.
ggroc(list(spam_roc, spam_roc_step)) +
  scale_color_brewer(palette = "Set1", name = "model",
                     labels = c("full", "step")) +
  geom_segment(aes(x = 1, xend = 0, y = 0, yend = 1),
               color = "grey", linetype = "dashed")
Figure 10.8: ROC curve for the full and reduced model fitted to email_train.
The full model has an area under the curve of 0.9212009, while the reduced model has an AUC value
of 0.9216457.
We prefer the reduced model based on these results. However, it’s important to note that this conclu-
sion is drawn from only one data split, so we cannot be certain of the accuracy of the area under the
curve estimates. To obtain a more reliable estimate, we will need to apply cross-validation, which we
will address in the next section.
The reduced model performed slightly better on the test data, but let’s conduct v-fold cross valida-
tion to verify if this outcome is consistent across multiple validation sets.
Remember, the idea of v-fold cross validation is to divide the entire dataset into v parts. Then, each
of these v parts will be selected as a validation set one at a time. The remaining v-1 parts are used to
train/fit the model.
In each round, the model is fitted using v-1 parts of the original data as training data. Afterward, the
accuracy and the AUC are computed on the validation set.
We will again use functions from tidymodels to perform the v-fold cross validation. The first step
will be to define the logistic regression model using functions from tidymodels.
library(tidymodels)
log_mod <-
logistic_reg(mode = "classification", engine = "glm")
folds <- vfold_cv(email, v = 10)
folds
# # 10-fold cross-validation
# # A tibble: 10 x 2
# splits id
# <list> <chr>
# 1 <split [3528/393]> Fold01
# 2 <split [3529/392]> Fold02
# 3 <split [3529/392]> Fold03
# 4 <split [3529/392]> Fold04
# 5 <split [3529/392]> Fold05
Next, we will fit the logistic regression model to ten different datasets (folds), beginning by creating
a workflow.
glm_wf <-
workflow() |>
add_model(log_mod) |>
add_formula(spam ~ .)
glm_fit_full <-
glm_wf |>
fit_resamples(folds)
Given the fitted models, the accuracy measures are computed with collect_metrics().
collect_metrics(glm_fit_full)
# # A tibble: 3 x 6
# .metric .estimator mean n std_err .config
# <chr> <chr> <dbl> <int> <dbl> <chr>
# 1 accuracy binary 0.914 10 0.00381 Preprocessor1_Model1
# 2 brier_class binary 0.0649 10 0.00203 Preprocessor1_Model1
# 3 roc_auc binary 0.885 10 0.00336 Preprocessor1_Model1
The output includes the average accuracy and area under the curve. Our average AUC is lower than
the one obtained on the single test dataset. Therefore, based on that one split, the estimate was too
optimistic.
Everything is now repeated for the reduced model. First, we update the formula.
glm_wf_red <- update_formula(glm_wf, spam ~ . - exclaim_subj - cc)

glm_fit_red <-
  glm_wf_red |>
  fit_resamples(folds)
Let’s compare the accuracy measures calculated for the two models now.
Conclusion: The average AUC for the reduced model is slightly larger. Barely, but larger.
¾ Your turn
The outcome variable, called pop, takes value Vic when a possum is from Victoria and other
(reference level) when it is from New South Wales or Queensland. We consider five predictors:
sex, headL (head length), skullW (skull width), totalL (total length) and tailL (tail length).
The full model was fitted to the data
# # A tibble: 6 x 5
# term estimate std.error statistic p.value
# <chr> <dbl> <dbl> <dbl> <dbl>
# 1 (Intercept) 39.2 11.5 3.40 0.000672
# 2 sexm -1.24 0.666 -1.86 0.0632
# 3 headL -0.160 0.139 -1.16 0.248
# 4 skullW -0.201 0.133 -1.52 0.129
# 5 totalL 0.649 0.153 4.24 0.0000227
a) Explain why the remaining estimates change between the two models.
b) Write out the form of the reduced model and give an interpretation of the estimated slope
of tail length.
c) The plot below displays the ROC curve for both the full (in red) and reduced (in blue)
models. The AUC values are 0.921 and 0.919. Which model would you prefer based on
the AUC?
d) While visiting a zoo in the US, we came across a brushtail possum with a sign indicating
that it was captured in the wild in Australia. However, the sign did not specify the exact
location within Australia. It mentioned that the possum is male, has a skull width of
approximately 63 mm, a tail length of 37 cm, and a total length of 83 cm. What is the
computed probability from the reduced model that this possum is from Victoria? How
confident are you in the model’s accuracy of this probability calculation?
Short summary
This chapter explains how to model binary outcomes, such as the spam status of emails. The
chapter uses an email dataset to illustrate key concepts, including the logit function, odds ra-
tios, and the process of fitting a logistic regression model in R. It further discusses evaluating
model performance through metrics like sensitivity, specificity, ROC curves, and cross-
validation, demonstrating the selection of an appropriate classification threshold and model
comparison.
Part VI
Inference
11 Foundations of Inference
11.1 Intro
• we know how to describe the outcome of a random experiment by using a sample space,
• we are aware that by choosing a probability measure/distribution we are able to describe
the distribution of the outcome of a random experiment.
Definition 11.1. Let 𝑆 be a sample space and P a probability measure, which defines probabilities
P(𝐴) for all events 𝐴 of a random experiment with outcomes on 𝑆. Then we call the pair (𝑆, P) a
probability model.
Example 11.1. For the random experiment of flipping a coin, we would use the following probability
model:
({Heads, Tails}, P),
with P(Heads) = 𝜃 and P(Tails) = 1 − 𝜃 for some value 𝜃 ∈ (0, 1).
In probability theory, we know 𝜃 and are interested in computing probabilities of events under the
probability model given by 𝜃. For instance, if we toss a fair coin, we know that the probability of
getting heads is 0.5.
In inferential statistics, we make assumptions about the probability model of a random experiment,
but we do not know the true probability measure.
We have to use data. In Example 11.1 we do not know 𝜃, the probability of heads, but we will toss the
coin repeatedly to infer 𝜃.
Our approach
Before proceeding to discuss how we use data, we introduce the concept of a statistical model
formally.
Definition 11.2. A statistical model is a pair (𝑆, (P𝜃 )𝜃∈Θ ), where 𝑆 is the sample space for the
considered random experiment and (P𝜃 )𝜃∈Θ is a collection of probability measures, each of which
defines probabilities P𝜃 (𝐴) for events 𝐴 ⊆ 𝑆.
The probability measures depend on the unknown population parameter 𝜃 whose values form the
parameter space Θ.
Example 11.2. If P𝜃 is a normal distribution (see Definition A.8) for which both the mean 𝜇 and
the variance 𝜎2 are unknown, then 𝜃 = (𝜇, 𝜎2 )⊤ and Θ = R × (0, ∞) reflecting that 𝜇 ∈ R and
𝜎2 > 0.
Given the sample X = (𝑋1 , … , 𝑋𝑛 ), we use sample statistics as point estimators for the unknown
population parameters of interest.
Definition 11.3. Let X = (𝑋1 , … , 𝑋𝑛 ) be a sample. Then we call any real-valued function 𝑇
defined on the sample space a statistic.
An example would be the sample mean $T: \mathbf{X} \mapsto T(\mathbf{X}) = \overline{X}_n := \frac{1}{n}\sum_{i=1}^{n} X_i$.
Whenever we compute the value of a point estimator, we will make an error, which is defined
as the difference between the value of the sample statistic 𝑇 (X) and the population parameter 𝜃 it
estimates, e.g., 𝑋 𝑛 − 𝜇.
Definition 11.4. The bias is the systematic tendency to over- or under-estimate the true popula-
tion parameter. It is defined as
E[𝑇 (X)] − 𝜃,
where 𝜃 is the population parameter of interest.
The bias is the expected error. In addition to the bias, the error contains the sampling error, 𝑇 (X) −
E[𝑇 (X)], which describes how much an estimate will tend to vary from one sample to the next. We
can summarize the sampling error by computing the standard deviation of the estimator, which is
also called the standard error.
Definition 11.5. Let 𝑋1 , … , 𝑋𝑛 be an i.i.d. sample and 𝑇 (X) a point estimator for the unknown
parameter 𝜃. The standard deviation of 𝑇 (X) is then called the standard error of 𝑇 (X) and will be
denoted by
SE(𝑇 (X)) = √Var[𝑇 (X)] .
Example 11.3. Consider again the example of the sample mean 𝑇 (X) = 𝑋 𝑛 of i.i.d. observations
𝑋1 , … , 𝑋𝑛 with Var[𝑋𝑖 ] = 𝜎2 . The standard error of the sample mean is then
$$\mathrm{SE}(\overline{X}_n) = \sqrt{\mathrm{Var}\!\left[\frac{1}{n}\sum_{i=1}^{n} X_i\right]} = \sqrt{\frac{1}{n^2}\,\mathrm{Var}\!\left[\sum_{i=1}^{n} X_i\right]} = \frac{1}{n}\sqrt{\sum_{i=1}^{n}\mathrm{Var}[X_i]} = \frac{1}{n}\sqrt{\sum_{i=1}^{n}\sigma^2} = \sqrt{\frac{n \cdot \sigma^2}{n^2}} = \frac{\sigma}{\sqrt{n}}\,.$$
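A small simulation illustrates this result (the values of n and sigma are chosen purely for illustration):

set.seed(1)
n     <- 50
sigma <- 2
# 10000 sample means of n i.i.d. normal observations
means <- replicate(10000, mean(rnorm(n, mean = 0, sd = sigma)))
sd(means)        # empirical standard error of the sample mean
sigma / sqrt(n)  # theoretical value sigma / sqrt(n)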
Much of statistics is focused on understanding and quantifying the sampling error. The sample size
is important when quantifying this error.
In the following we analyse the sampling error through simulations. We introduce the sampling
distribution of a sample statistic.
Let’s consider all balls in the bowl as our population of interest. Assume we are interested in
answering the question: What is the proportion of red balls in the bowl?
In general, we can precisely answer this question by counting the number of red (and white, if we
don’t know the total number) balls. But this wouldn’t be any fun at all, unless we have a digital
version of the bowl.
The package moderndive, which accompanies the book Ismay and Kim (2019), contains such a digital
version of the bowl.
library(moderndive)
bowl
# # A tibble: 2,400 x 2
# ball_ID color
# <int> <chr>
# 1 1 white
# 2 2 white
# 3 3 white
# 4 4 red
# 5 5 white
# 6 6 white
# # i 2,394 more rows
So, the bowl contains 2400 balls. Now let’s compute the proportion of red balls.
Using the summarize() function, we can easily determine the total number of balls and the number
of red balls. With this information, we can calculate the proportion of red balls in the bowl.
bowl |>
summarise(
n = n(),
sum = sum(color == "red"),
prop = sum / n)
# # A tibble: 1 x 3
# n sum prop
# <int> <int> <dbl>
# 1 2400 900 0.375
Remark. By taking all balls (=the whole population) into account, we have done a full survey.
Virtual sampling
In reality, however, no one would want to check all 2400 balls for their color. Therefore, sampling is
typically the only realistic option.
But of course, given the sample, we would still like to answer the question: What is the proportion
of red balls in the bowl?
To answer this question, we define the proportion as the population parameter 𝜃 of a statistical
model and then consider the sample as a sample from this model.
If we take a sample 𝑋1 , … , 𝑋𝑛 of size 𝑛, we will perform 𝑛 Bernoulli trials with success probability
𝜃, which is equal to the proportion of red balls in the bowl.
Hence, the statistical model is given by $(\{0, 1\}, P_\theta)$, with $P_\theta(\{1\}) = \theta$ and $P_\theta(\{0\}) = 1 - \theta$.
We will draw samples of the different sizes $n \in \{25, 50, 100\}$ to also analyze the influence of the
sample size on the outcome.
For each of the three samples, we then calculate the relative frequency of red balls.
To draw the samples we use the function rep_sample_n() from the infer package, which is another
tidymodels package.
Remark. It is not absolutely necessary to load another package here; we could use dplyr functions
instead. However, the infer package is beneficial at this point, especially later on, and the included
functions are intuitive to use.
library(infer)
bowl |>
rep_sample_n(size = 25) |>
summarise(
prop = sum(color == "red") / 25
)
# # A tibble: 1 x 2
# replicate prop
# <int> <dbl>
# 1 1 0.32
bowl |>
rep_sample_n(size = 50) |>
summarise(
prop = sum(color == "red") / 50
)
# # A tibble: 1 x 2
# replicate prop
# <int> <dbl>
# 1 1 0.44
bowl |>
rep_sample_n(size = 100) |>
summarise(
prop = sum(color == "red") / 100
)
# # A tibble: 1 x 2
# replicate prop
# <int> <dbl>
# 1 1 0.39
Conclusion: All three values differ compared to the true value 0.375.
In practice, we would not know the true value. Therefore, we further consider how to evaluate or
estimate the quality of the calculated values.
One question we should ask ourselves in this regard is:
How much does the estimated value deviate from the center (ideally the unknown 𝜃) of
the distribution?
If we were able to calculate not just one estimate but many, we could simply use the empirical standard
deviation as a measure of dispersion.
In reality, this is usually not possible, since it involves costs and/or time.
But since we only collect our samples on the computer, it is no problem for us to draw 1000 samples
of each size $n \in \{25, 50, 100\}$.
Now we have, for each sample size, $N = 1000$ samples $\mathbf{x}^j = (x_1^j, \dots, x_n^j)$, $j \in \{1, \dots, N\}$, where
$x_i^j = 1$ if the $i$-th ball of the $j$-th sample is red and $x_i^j = 0$ otherwise.
In the next step we will compute for each sample the empirical mean (=proportion of red balls)
$$\hat{\theta}_n^j = \overline{x}_n^j = \frac{1}{n}\sum_{i=1}^{n} x_i^j, \quad j \in \{1, \dots, N\}\,.$$
Then, we are able to compute the empirical standard deviation of the 1000 estimated proportions
$$s_n = \sqrt{\frac{1}{N-1}\sum_{j=1}^{N} \big(\hat{\theta}_n^j - \overline{\hat{\theta}}_n\big)^2}\,, \quad n \in \{25, 50, 100\},$$
which is an estimate for the standard error of 𝑋 𝑛 .
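The objects stp_25, stp_50 and stp_100 used below hold these repeated samples; one way they could have been created with rep_sample_n() (1000 replicates each) is:

stp_25  <- bowl |> rep_sample_n(size = 25,  reps = 1000)
stp_50  <- bowl |> rep_sample_n(size = 50,  reps = 1000)
stp_100 <- bowl |> rep_sample_n(size = 100, reps = 1000)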
stp_25 |>
summarise(
prop = sum(color == "red") / 25
) |>
summarise(sd_prop = sd(prop))
# # A tibble: 1 x 1
# sd_prop
# <dbl>
# 1 0.0986
stp_50 |>
summarise(
prop = sum(color == "red") / 50
) |>
summarise(sd_prop = sd(prop))
# # A tibble: 1 x 1
# sd_prop
# <dbl>
# 1 0.0704
stp_100 |>
summarise(
prop = sum(color == "red") / 100
) |>
summarise(sd_prop = sd(prop))
# # A tibble: 1 x 1
# sd_prop
# <dbl>
# 1 0.0477
Recap
Given the whole population data (the bowl), we were able to create 1000 (our choice) samples
of size $n = 100$ (our choice) from the population data.
For each sample we computed an estimate for the proportion of red balls
stp_100 |>
summarise(
prop = sum(color == "red") / 100
)
[Figure 11.2 shows histograms (count vs. prop) of the 1000 estimated proportions for n = 25, n = 50 and n = 100.]
Figure 11.2: We see a reduction in variability with increasing sample size. In addition, we see that the empirical distribution is symmetrically distributed around the true parameter 0.375.
[Figure 11.3: Histogram (count vs. prop) of the 1000 estimated proportions of red balls.]
This empirical distribution is then considered as an approximation to the sampling distribution of the statistic under consideration, in our case the empirical mean.
Definition 11.6. Let X = (𝑋1 , … , 𝑋𝑛 ) be a sample from a statistical model and 𝑇 (X) a statistic.
Then we call the distribution of the r.v. 𝑇 (X) the sampling distribution.
As said, in reality we will usually not be able to collect more than one sample. Nevertheless, we
would like to say something about the distribution of our estimator (here 𝑋 𝑛 ). So we have to think
about other strategies.
One can think of three different approaches for approximating the sampling distribution.
1. Theoretical approach: For several statistics 𝑇 (X) one can derive the distribution of
𝑇 (X) by making assumptions about the distribution of the sample X. In this approach,
it’s important to consider whether the assumed distribution for the sample X actually
applies to the observed sample x.
2. Asymptotic approach: There are methods that enable us to approximate the distribution
of 𝑇 (X) for “large” samples. One important method is the Central Limit Theorem.
3. Simulation-based approach: One can approximate the sampling distribution by simulation, most prominently with the bootstrap, which repeatedly resamples with replacement from the observed sample.
In this section, we won’t focus on deriving the exact distribution of a specific statistic 𝑇 (X) ourselves.
Instead, this part is more about providing several examples of statistics for which one can derive the
exact distribution after making assumptions about the distribution of the sample X.
Example 11.4. Given a sample $\mathbf{X} = (X_1, \dots, X_n)^\top$ of i.i.d. observations from the statistical model $(\mathbb{R}, \mathcal{N}(\mu, \sigma^2))$, with $\theta = (\mu, \sigma^2)^\top$, one can derive the following results.
1. The sample mean $T(\mathbf{X}) = \bar{X}_n$ has a normal distribution with mean $\mu$ and variance $\frac{\sigma^2}{n}$. This result implies
$$\frac{\bar{X}_n - \mu}{\sigma / \sqrt{n}} \sim \mathcal{N}(0, 1)\,.$$
2. When inferring about the mean $\mu$, the variance $\sigma^2$ is generally unknown. In this scenario, the sample mean is standardized using the empirical variance $S_n^2(\mathbf{X}) = \frac{1}{n-1} \sum_{i=1}^{n} (X_i - \bar{X}_n)^2$, resulting in a known distribution: the t-distribution with $n - 1$ degrees of freedom,
$$\frac{\bar{X}_n - \mu}{\sqrt{S_n^2(\mathbf{X}) / n}} \sim t(n-1)\,.$$
Example 11.5. Given a sample $\mathbf{X} = (X_1, \dots, X_n)^\top$ of i.i.d. observations from the statistical model $(\{0, 1\}, \mathbb{P}_\theta)$, with $\mathbb{P}_\theta$ being a Bernoulli distribution with parameter $\theta \in (0, 1)$. Then, the distribution of the statistic $T(\mathbf{X}) = \sum_{i=1}^{n} X_i$ is a binomial distribution with parameters $n$ and $\theta$, i.e.,
$$\sum_{i=1}^{n} X_i \sim \mathrm{Bin}(n, \theta)\,.$$
Important
One advantage of the theoretical approach is that we can work with a well-specified distribution.
However, this benefit comes with a caveat. We need to consider whether the assumed distribu-
tion for the observations 𝑋1 , … , 𝑋𝑛 is realistic given the observed sample 𝑥1 , … , 𝑥𝑛 . If this is
not the case, the resulting distribution of the sample statistic 𝑇 (X) is likely to be incorrect.
The literature on the asymptotic distributions of sample statistics is extensive. However, we are specifically interested in the approximate distribution of the sample mean $T(\mathbf{X}) = \frac{1}{n} \sum_{i=1}^{n} X_i$.
In the case of i.i.d. random variables, the famous Central Limit Theorem (CLT) is applicable to derive
the asymptotic distribution of 𝑇 (X).
Remark. When estimating, e.g., a proportion 𝜃 ∈ (0, 1), sufficiently large is characterized through
𝑛𝜃 ≥ 10 and 𝑛(1 − 𝜃) ≥ 10.
[The figure shows histograms (count vs. x) for samples of size n = 25 and n = 100, illustrating the normal approximation.]
We will use this result to make inferences about the mean 𝜇 of a distribution P. Even though the
observations 𝑋1 , … , 𝑋𝑛 in an i.i.d. sample come from a non-normal distribution, we can use the
normal distribution defined by the Central Limit Theorem to draw conclusions about the mean.
The bootstrap is a powerful statistical tool used to measure uncertainty in estimators. In practical
terms, while we can gain a good understanding of the accuracy of an estimator 𝑇 (X) by drawing
samples from the population multiple times, this method is not feasible for real-world data. This is
because with real data, it’s rarely possible to generate new samples from the original population.
In the bootstrap approach we use a computer to simulate the process of obtaining new sample sets.
This way, we can estimate the variability of 𝑇 (X) without needing to generate additional samples.
Instead of repeatedly sampling independent datasets from the population, we create samples by re-
peatedly sampling with replacement observations from the original dataset.
The idea behind the bootstrap approach is, that the original sample approximates the population. So,
resamples from the observed sample approximate independent samples from the population.
The bootstrap distribution of a statistic, based on many resamples, approximates the sampling distribu-
tion of the statistic.
Bootstrap algorithm
1. Draw $B$ resamples $\mathbf{x}_b = (x_1^b, \dots, x_n^b)^\top$, $b \in \{1, \dots, B\}$, by sampling $n$ observations with replacement from the original sample.
2. Compute for each resample $\mathbf{x}_b$ the value of the statistic $T(\mathbf{x}_b)$.
Evaluating the algorithm allows us to estimate the standard error $SE(T(\mathbf{X}))$ of the statistic $T(\mathbf{X})$. An estimator is given by
$$\widehat{SE}(T(\mathbf{X})) = \sqrt{\frac{1}{B-1} \sum_{b=1}^{B} \left( T(\mathbf{x}_b) - \frac{1}{B} \sum_{j=1}^{B} T(\mathbf{x}_j) \right)^2}\,.$$
Let’s illustrate this algorithm with a very small dataset of size 𝑛 = 3, 𝐵 = 3 resamples and as statistic
𝑇 (x) the empirical mean.
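The objects x and x_B shown below could have been created along the following lines (a sketch; the exact code and seed are not shown, so the drawn balls will differ):
# original sample of size n = 3 (drop the replicate column added by rep_sample_n)
x <- bowl |>
  rep_sample_n(size = 3) |>
  ungroup() |>
  select(ball_ID, color)
# B = 3 bootstrap resamples of size 3, drawn with replacement from x
x_B <- x |>
  rep_sample_n(size = 3, replace = TRUE, reps = 3)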
x
# # A tibble: 3 x 2
# ball_ID color
# <int> <chr>
# 1 2227 red
# 2 526 white
# 3 195 white
x_B
# # A tibble: 9 x 3
# # Groups: replicate [3]
# replicate ball_ID color
# <int> <int> <chr>
# 1 1 526 white
# 2 1 526 white
# 3 1 526 white
# 4 2 195 white
# 5 2 2227 red
# 6 2 526 white
# 7 3 526 white
# 8 3 2227 red
# 9 3 526 white
One can then calculate the average from each of these three samples to estimate the proportion of
red balls.
x_B |>
summarise(
prop = mean(color == "red"))
# # A tibble: 3 x 2
# replicate prop
# <int> <dbl>
# 1 1 0
# 2 2 0.333
# 3 3 0.333
It’s evident that the choice of $B = 3$ was purely for illustrative purposes. For small values of $B$, the estimate of the standard error $\widehat{SE}(T(\mathbf{X}))$ will lack accuracy. A typical value of $B$ in real applications is 1000.
Hence, we will now increase 𝐵. In addition, we will simplify the code for generating the resam-
ples and visualizing the distribution of these resamples using additional functions from the infer
package.
In the next step, we need to specify the variable under consideration and select the success argument
since we want to determine a proportion.
Now, we can generate 1000 bootstrap samples and calculate the proportion of red balls for each sam-
ple.
After completing this step, we removed the 100000 observations from the 1000 bootstrap samples and
retained only the 1000 estimates.
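A sketch of how this infer pipeline might look (the exact code is not shown; since the text mentions 100000 removed observations, the sample x presumably now contains 100 balls):
# a new sample of size 100 from the bowl (hypothetical reconstruction)
x <- bowl |>
  rep_sample_n(size = 100) |>
  ungroup() |>
  select(ball_ID, color)

bootstrap_means <- x |>
  specify(response = color, success = "red") |>   # variable and success level
  generate(reps = 1000, type = "bootstrap") |>    # 1000 resamples of size 100
  calculate(stat = "prop")                        # keep only the 1000 proportions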
Finally we can visualize the bootstrap distribution:
visualize(bootstrap_means) +
geom_vline(xintercept = mean(x$color == "red"), color = "blue",
size = 2)
The bootstrap distribution is centered around the mean of the sample x, and not the unknown pro-
portion 0.375 of the population.
Short summary
This chapter introduces core concepts in statistical inference. The text explains statistical mod-
els and the crucial idea of an independent and identically distributed (i.i.d.) sample. It further
discusses point estimation of population parameters, including concepts like bias and stan-
dard error, and explores the idea of a sampling distribution. Finally, it examines different
approaches for approximating this distribution, including theoretical, asymptotic (specifically
the Central Limit Theorem), and bootstrap methods, providing practical examples and R code
snippets for illustration.
12 Confidence intervals
In Section 11.3.3 we computed 1000 bootstrap means as estimates of the true proportion of red balls,
which was 0.375. All of them have been “wrong” (=not equal to the true value).
sum(bootstrap_means == 0.375)
# [1] 0
Even for the 1000 samples stp_100 from the bowl (the population), we did not see one estimate being
equal to the true value.
stp_100 |>
summarise(
prop = sum(color == "red") / 100
) |>
summarise(sum(prop == 0.375))
# # A tibble: 1 x 1
# `sum(prop == 0.375)`
# <int>
# 1 0
Hence, none of our point estimates for 𝜃, the proportion of red balls, produced the correct value of
0.375.
But maybe our choice of using the point estimator $T(\mathbf{X}) = \frac{1}{n} \sum_{i=1}^{n} X_i$ was bad?
That’s not the case. 𝑇 (X) is actually the maximum likelihood estimator of 𝜃, and as such has
“nice” statistical properties (which we won’t discuss in this course).
MLE
We will not discuss the derivation of the maximum likelihood estimator (MLE) in general, but let's revisit the concept in the case of i.i.d. Bernoulli trials:
$$X_i = \begin{cases} 1, & \text{ball is red}, \\ 0, & \text{ball is white}. \end{cases}$$
MLE method
Given a statistical model, the Maximum Likelihood Estimation (MLE) method estimates the
unknown parameter 𝜃 in a way that is most consistent with the observed data. In other words,
MLE gives the distribution under which the observed data are most likely.
For the observed Bernoulli sample $x_1, \dots, x_n$, the probability of the data as a function of $\theta$ is
$$L(x_1, \dots, x_n \mid \theta) = \prod_{i=1}^{n} \mathbb{P}(X_i = x_i) = \prod_{i=1}^{n} \theta^{x_i} (1-\theta)^{1-x_i} = \theta^{n_1} (1-\theta)^{n-n_1}\,,$$
where $n_1 = \#\{i : x_i = 1\}$ is the number of 'successes' (coded as 1). When considered as a function of $\theta$ this expression gives the likelihood function, and the maximum likelihood estimate $\hat\theta$ is the parameter value at which it is maximized. Since
$$\frac{d}{d\theta}\, \theta^{n_1} (1-\theta)^{n-n_1} = (1-\theta)^{n-n_1-1}\, \theta^{n_1-1}\, (n_1 - n\theta)\,,$$
setting the derivative to zero yields
$$n_1 - n\hat\theta = 0 \iff \hat\theta = \frac{n_1}{n}\,.$$
So, $\hat\theta$ is the proportion of 'successes' out of the $n$ trials.
Using a point estimate is similar to fishing in a murky lake with a spear. The chance of hitting the
correct value is very low.
We also saw that each point estimate is the realization of a random variable 𝑇 (X) having a non-zero
standard deviation.
Confidence intervals
Definition 12.1. Let X = (𝑋1 , … , 𝑋𝑛 ) be an i.i.d. sample and 𝛼 ∈ (0, 1) a chosen error level. Then
we call an interval 𝐶𝐼(X) a 100(1 − 𝛼)% confidence interval for the population parameter 𝜃 of
interest, if 𝐶𝐼(X) covers 𝜃 with probability at least 1 − 𝛼, i.e.,
P(𝜃 ∈ 𝐶𝐼(X)) ≥ 1 − 𝛼 , ∀𝜃 ∈ Θ .
Hence, 𝐶𝐼(X) is a plausible range of values for the population parameter 𝜃 of interest.
Conclusion: If we report a point estimate, we probably won’t hit the exact population parameter. If
we report a range of plausible values, we have a good shot at capturing the parameter.
We mentioned that a confidence interval for 𝜃 expresses the uncertainty of the corresponding point
estimate 𝑇 (X). Since we do not know the distribution of 𝑇 (X) (the sampling distribution), we must
approximate it using one of the three approaches introduced above. Given the approximation, we
can compute the confidence interval.
In this section, we will discuss the construction of a confidence interval based on the theoretical
approach. We will focus on an easy special case, rather than the general case.
Let’s assume we have i.i.d. observations 𝑋1 , … , 𝑋𝑛 from the statistical model (R, N (𝜃, 𝜎2 )), i.e.,
from a model specifying a normal distribution for which we know the variance 𝜎2 . The goal is to
construct a confidence interval for the unknown mean 𝜃.
Remark. This model is not very realistic. It’s rarely the case that you do not know the mean value
but do know the variance. We only consider it because deriving the confidence interval is illustrative
and easy to understand.
We know from Example 11.4 that the distribution of the sample mean $\bar{X}_n = \frac{1}{n} \sum_{i=1}^{n} X_i$ is $\mathcal{N}(\theta, \frac{\sigma^2}{n})$. But this is of course an unknown distribution, since we do not know $\theta$. Hence, we can't compute any probabilities of the form $\mathbb{P}(\bar{X}_n \le x)$ for given $x$.
Idea
Specify a function 𝑔 of the point estimator 𝑋 𝑛 and the unknown parameter 𝜃 (and perhaps even
other known parameters) so that the distribution of 𝑔(𝑋 𝑛 , 𝜃) is known and therefore indepen-
dent of 𝜃.
Using this known distribution, one can construct a two-sided interval [𝑥𝛼/2 , 𝑥1−𝛼/2 ] for which
the following holds:
$$\mathbb{P}\big(x_{\alpha/2} \le g(\bar{X}_n, \theta) \le x_{1-\alpha/2}\big) = 1 - \alpha\,.$$
According to the above calculation, we obtain our confidence interval 𝐶𝐼(X) as the set of all values
of 𝜃 for which 𝑥𝛼/2 ≤ 𝑔(𝑋 𝑛 , 𝜃) ≤ 𝑥1−𝛼/2 .
Remark. A value $x_\beta$ from a distribution with distribution function $F$ is said to be the $\beta$-quantile, if $F(x_\beta) = \beta$.
It is straightforward to find the function $g$ for the normal distribution. It holds that
$$\frac{\bar{X}_n - \theta}{\sqrt{\sigma^2 / n}} \sim \mathcal{N}(0, 1)\,,$$
and hence, the function $g$ has the form $g : (x, \theta) \mapsto \frac{x - \theta}{\sqrt{\sigma^2 / n}}$.
Solving $z_{\alpha/2} \le g(\bar{X}_n, \theta) \le z_{1-\alpha/2}$ for $\theta$ shows that
$$\left[\bar{X}_n - z_{1-\alpha/2} \frac{\sigma}{\sqrt{n}},\ \bar{X}_n + z_{1-\alpha/2} \frac{\sigma}{\sqrt{n}}\right]$$
is a two-sided confidence interval with the confidence level $1 - \alpha$ for the unknown mean value $\theta$ of a normal distribution with known variance $\sigma^2$.
Example 12.1. We consider again i.i.d. observations 𝑋1 , … , 𝑋𝑛 from a statistical model with a
normal distribution. However, this time both parameters are unknown. Therefore, we have 𝑋𝑖 ∼
N (𝜃1 , 𝜃2 ), with 𝜃 = (𝜃1 , 𝜃2 )⊤ unknown. In this example, our focus is on constructing a confidence
interval for the variance 𝜃2 .
It can be shown that
$$g(\mathbf{X}, \theta_2) := \sum_{i=1}^{n} \left( \frac{X_i - \bar{X}_n}{\sqrt{\theta_2}} \right)^2 = \frac{(n-1) S_n^2}{\theta_2} \sim \chi^2(n-1)\,,$$
where $\chi^2(n)$ denotes the $\chi^2$-distribution with $n$ degrees of freedom. Then the two-sided $1-\alpha$ confidence interval for $\theta_2$ is given by:
$$\left[ \frac{(n-1) S_n^2}{\chi^2_{1-\alpha/2}(n-1)},\ \frac{(n-1) S_n^2}{\chi^2_{\alpha/2}(n-1)} \right].$$
In Theorem 11.1 we have seen that the sampling distribution of the average of i.i.d. random variables $X_1, \dots, X_n$, with $E[X_i] = \mu$ and $\mathrm{Var}[X_i] = \sigma^2$, can be approximated by a normal distribution with mean $\mu$ and variance $\frac{\sigma^2}{n}$ for large $n$. We can use this property to construct asymptotic confidence intervals, i.e., for all $\theta \in \Theta$, the inequality
$$\mathbb{P}(\theta \in CI(\mathbf{X})) \ge 1 - \alpha$$
holds in the limit as $n \to \infty$ and, thus, approximately for "large" $n$. The construction builds on approximating the distribution of the concerned estimator $T(\mathbf{X})$ by the normal distribution with mean $\theta$ and standard deviation $SE(T(\mathbf{X}))$.
Definition 12.2. When the Central Limit Theorem (CLT) is applicable to the sampling distribution of a point estimator $T(\mathbf{X})$, i.e., the point estimator is the average of i.i.d. random variables, and the sampling distribution of the estimator thus closely follows a normal distribution, we can construct a $100(1-\alpha)\%$ confidence interval as
$$T(\mathbf{x}) \pm z_{1-\alpha/2} \cdot \widehat{SE}(T(\mathbf{X}))\,,$$
where $z_{1-\alpha/2}$ is the $(1-\alpha/2)$-quantile of the standard normal distribution and $\widehat{SE}(T(\mathbf{X}))$ is an estimate of the standard error of $T(\mathbf{X})$.
Example 12.2. We consider the outcome 𝑋1 , … , 𝑋𝑛 of 𝑛 i.i.d. Bernoulli trials. The success proba-
bility 𝜃 in each trial is unknown and we would like to compute a 95% confidence interval for 𝜃.
In order for the asymptotic approach to be effective, the sample size needs to be fairly large. Typically, the sample size is considered sufficiently large if there are at least 10 successes and 10 failures in the sample. The expected number of successes in a sample of size $n$ is
$$E\left[\sum_{i=1}^{n} X_i\right] = \sum_{i=1}^{n} E[X_i] = \sum_{i=1}^{n} \big( 0 \cdot \mathbb{P}(X_i = 0) + 1 \cdot \mathbb{P}(X_i = 1) \big) = \sum_{i=1}^{n} \theta = n\theta\,.$$
Hence, we need $n\theta$ (the expected number of successes) and $n(1-\theta)$ (the expected number of failures) both greater than or equal to 10. This requirement is called the success-failure condition.
When these conditions are met, the sampling distribution of the point estimator $\hat\theta(\mathbf{X}) := \bar{X}_n$ is approximately normal with mean $\theta$ and standard error
$$SE(\bar{X}_n) = \sqrt{\mathrm{Var}\left[\frac{1}{n}\sum_{i=1}^{n} X_i\right]} = \sqrt{\frac{1}{n^2}\sum_{i=1}^{n} \mathrm{Var}[X_i]} = \sqrt{\frac{1}{n^2}\sum_{i=1}^{n} \theta(1-\theta)} = \sqrt{\frac{\theta(1-\theta)}{n}}\,.$$
The confidence interval then has the form
$$\hat\theta(\mathbf{X}) \pm z_{1-\alpha/2} \cdot \sqrt{\frac{\hat\theta(\mathbf{X})\,\big(1 - \hat\theta(\mathbf{X})\big)}{n}}\,.$$
Remark. The success-failure condition depends on the unknown $\theta$. In applications we use our best guess to check the condition, i.e., we check if $n\hat\theta(\mathbf{x}) \ge 10$ and $n(1 - \hat\theta(\mathbf{x})) \ge 10$, with $\hat\theta(\mathbf{x})$ being the computed point estimate.
Let’s apply the above formula for a simulated dataset consisting of 100 (Pseudo-) Bernoulli trials with
a success probability of 0.3.
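One way such data could be simulated (the seed used in the text is not shown, so the counts below may differ from the printed 29 and 71):
x <- rbinom(100, size = 1, prob = 0.3)   # 100 pseudo Bernoulli trials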
100 * mean(x)
# [1] 29
100 * (1 - mean(x))
# [1] 71
Hence, the condition holds and we can compute the 95% confidence interval based on the CLT:
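A sketch of the computation, using the formula from Example 12.2 with x being the simulated 0/1 vector:
theta_hat <- mean(x)
theta_hat + c(-1, 1) * qnorm(0.975) * sqrt(theta_hat * (1 - theta_hat) / 100)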
We observe that the computed interval covers the true value of 0.3.
We may construct a confidence interval with the help of the simulated bootstrap distribution.
Idea
Use the simulated bootstrap distribution to determine the lower (𝑙𝑒) and upper (𝑢𝑒) endpoint
of the interval (𝑙𝑒, 𝑢𝑒), such that the interval has a probability of 1 − 𝛼 under the bootstrap
distribution.
But this means nothing more than identifying the cut-off values (quantiles) of the bootstrap
distribution. This approach is referred to as the percentile method.
As an alternative, we can use the standard error method, which calculates the standard devi-
ation of the bootstrap distribution and then computes the interval based on the formula given
in Definition 12.2.
Let’s assume we want to build a 95% confidence interval for the proportion of red balls in the bowl.
Using the percentile method, we would consider the middle 95% of values from the bootstrap distri-
bution.
We can achieve this by calculating the 2.5th and 97.5th percentiles of the bootstrap distribution.
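For instance (a sketch; the original code is not shown):
quantile(bootstrap_means$stat, probs = c(0.025, 0.975))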
[Figure 12.1 shows the bootstrap distribution (count vs. stat) with the shaded 95% percentile interval and a vertical line marking the true proportion.]
Figure 12.1: Bootstrap distribution and the corresponding 95% confidence interval based on the percentile method.
If the bootstrap distribution has a symmetric shape, like a normal distribution, we can construct
the confidence interval based on the formula given in Definition 12.2. The point estimate of the
unknown parameter 𝜃 is given by 𝜃(X) ̂ = 𝑇 (X). The standard error of the statistic is estimated
through the standard deviation of the bootstrap distribution. The interval has the form
̂
𝜃(X) ̂ (X)) ,
± 𝑧1−𝛼/2 ⋅ SE(𝑇
where 𝑧1−𝛼/2 is the (1 − 𝛼/2)-quantile of the standard normal distribution and SE(𝑇 ̂ (X)) is the
standard deviation of the bootstrap distribution. It is a (1 − 𝛼) ⋅ 100% confidence interval for the
mean of the sampling distribution E[𝑇 (X)] based on the standard error method.
Example 12.3. Let’s calculate a 95% confidence interval for the proportion of red balls using the
standard error method. Therefore, we need the 0.975 quantile.
qnorm(0.975)
# [1] 1.959964
Using the point estimate $\hat\theta(\mathbf{x}) = \bar{x}_n$ and the standard deviation of the bootstrap distribution, we obtain the following ingredients:
# sample mean
x_bar <- x |>
specify(response = color, success = "red") |>
calculate(stat = "prop") |>
pull(stat)
x_bar
# [1] 0.36
# sd of the bootstrap distribution
sd(bootstrap_means$stat)
# [1] 0.0485163
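Putting both ingredients together gives the standard error interval (a sketch of the computation):
x_bar + c(-1, 1) * qnorm(0.975) * sd(bootstrap_means$stat)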
Using the infer package to construct confidence intervals is summarized in Figure 12.2.
A confidence interval based on the percentile method can be computed using get_ci() with the
argument type equal to percentile (default).
(per_ci <-
bootstrap_means |>
get_ci(level = 0.95, type = "percentile")
)
# # A tibble: 1 x 2
# lower_ci upper_ci
# <dbl> <dbl>
# 1 0.26 0.46
visualize(bootstrap_means) +
shade_ci(endpoints = per_ci,
color="blue", fill="gold")
Interpretation: We are 95% confident that the proportion of red balls is between 0.26 and 0.46.
In Example 12.3 we already used the standard error method to compute a 95% confidence interval
for the proportion of red balls. So, let’s see how to compute this interval with get_ci(). As type we
have to choose se and in addition we need to specify the point_estimate.
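A possible call (se_ci is the name assumed in the plotting code below):
(se_ci <-
  bootstrap_means |>
    get_ci(level = 0.95, type = "se", point_estimate = x_bar)
)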
So, we obtain an interval that is very close to the one obtained using the percentile method. Let’s
create a graphic that shows both intervals at the same time.
visualize(bootstrap_means) +
shade_ci(endpoints = per_ci, color = "gold") +
shade_ci(endpoints = se_ci)
Short summary
This chapter begins by highlighting the limitations of point estimates and introduces the con-
cept of a confidence interval as a range of plausible values for a population parameter, thus
quantifying uncertainty. The text then explores three primary methods for constructing confi-
dence intervals: the theoretical approach, the asymptotic approach leveraging the Cen-
tral Limit Theorem, and the bootstrap approach (including percentile and standard
error methods). Practical examples and the use of the infer R package are demonstrated to
illustrate these concepts and their application in statistical inference. Ultimately, the chapter
emphasises the importance of reporting a range of values to better capture the true,
unknown population parameter.
13 Hypothesis testing
The question considered in this chapter, about the share of the world's children who are vaccinated, comes from a TED talk of Hans and Ola Rosling. The answer to this question can be given based on data collected within the Gapminder project.
We might be wondering about the level of awareness people have regarding global health. Assume
that individuals either possess knowledge about the topic or are influenced by false information. How-
ever, in addressing the question about vaccination, their responses are not random guesses. This leads
to our research question:
People have knowledge, whether correct or incorrect, about the topic of vaccination and do not randomly
guess an answer.
We can now transfer this research question into two competing hypotheses:
𝐻0 ∶ People never learn these particular topics and their responses are simply equivalent to ran-
dom guesses.
versus
𝐻𝐴 ∶ People have knowledge, either correct or incorrect, which they apply and hence do not ran-
domly guess an answer.
Note
The null hypothesis 𝐻0 often represents a claim to be tested. The alternative hypothesis
𝐻𝐴 represents an alternative claim under consideration.
Definition 13.1. Let 𝑋1 , … , 𝑋𝑛 be an i.i.d. sample from a statistical model (𝑆, P𝜃 ), 𝜃 ∈ Θ. The test
problem consists of the null hypothesis 𝐻0 and the alternative 𝐻𝐴 , which constitute a partition of
the parameter space Θ, i.e.,
𝐻0 ∶ 𝜃 ∈ Θ0 𝐻𝐴 ∶ 𝜃 ∈ Θ ∖ Θ0 .
A statistical test decides, based on a sample 𝑋1 , … , 𝑋𝑛 , whether the null hypothesis can be rejected
or cannot be rejected.
The corresponding decision is taken with the help of a suitably chosen test statistic 𝑇 (X). The range
of 𝑇 (X) can be split up in a rejection region 𝑅 and its complement 𝑅𝑐 .
Given an observed sample x = {𝑥1 , … , 𝑥𝑛 } and the corresponding realization of the test statistic
𝑇 (x), one decides in the following manner:
• 𝐻0 is rejected, if 𝑇 (x) ∈ 𝑅.
• 𝐻0 is not rejected, if 𝑇 (x) ∈ 𝑅𝑐 .
Hypothesis tests are not flawless. There are two competing hypotheses: the null and the alternative.
In a hypothesis test, we make a decision about which hypothesis might be true, but our decision
might be incorrect. Two types of errors are possible. Both are visualized in Figure 13.1.
A type 1 error is rejecting the null hypothesis when 𝐻0 is true. A type 2 error is failing to reject
the null hypothesis when 𝐻𝐴 is true.
The way we will construct the test ensures that the probability of a type 1 error is at most 𝛼 ∈ (0, 1),
the significance level.
Important
13.1.1 p-value
Given the two hypotheses, the data can now either support the alternative hypothesis or not. There-
fore, we need some quantification to understand how much the alternative is favored.
The p-value is a way of quantifying the strength of the evidence against the null hypothesis and in
favor of the alternative hypothesis.
Definition 13.2. The p-value is the probability of observing data at least as favorable to the
alternative hypothesis as our current dataset, under the assumption that the null hypothesis is
true.
Example 13.1. If we test a hypothesis with one-sided alternative $H_A: \mu > \mu_0$ about the mean value $\mu$, we could use the sample mean $T(\mathbf{X}) = \bar{X}_n$ as test statistic. The p-value is then given by $\mathbb{P}_{H_0}(\bar{X}_n \ge \bar{x}_n)$, the probability, computed under the null hypothesis, of observing a sample mean at least as large as the observed value $\bar{x}_n$.
If the p-value quantifies the strength of the evidence against the null hypothesis, we should use it to
make decisions about rejecting the null hypothesis. But how?
Remember, that the probability of a type 1 error shall be at most the significance level 𝛼 ∈ (0, 1).
This can be achieved as follows:
The decision rule is: reject $H_0$ if the p-value is less than or equal to the significance level $\alpha$.
Remark.
1. Our definition of rejection regions and p-values is such that the decision rules 𝑇 (X) ∈ 𝑅 and
p-value ≤ 𝛼 are equivalent. We will work most of the time with the latter one.
2. The rule says that we reject the null hypothesis (𝐻0 ) when the p-value is less than the chosen
significance level (𝛼), which is determined before conducting the test. Common values for 𝛼
include 0.01, 0.05, or 0.1, but the specific choice depends on the particular application or setting.
Interpretation
The imposed significance level ensures that for those cases where 𝐻0 is actually true, we
incorrectly reject 𝐻0 at most 100 ⋅ 𝛼% of times.
In other words, when using a significance level 𝛼, there is about 100 ⋅ 𝛼% chance of making a
type 1 error if the null hypothesis is true.
Depending on the research question, the alternative hypothesis can be two-sided or one-sided.
Two-sided:
$$H_0: \theta = \theta_0 \qquad H_A: \theta \ne \theta_0$$
One-sided:
$$H_0: \theta = \theta_0 \qquad H_A: \theta < \theta_0$$
or
$$H_0: \theta = \theta_0 \qquad H_A: \theta > \theta_0$$
Remark. In the two-sided case, the rejection region 𝑅 "lives" in both tails of the distribution of the test statistic 𝑇 (X). This makes a two-sided alternative harder to verify and hence the test more conservative. Therefore, it is often preferable to test a two-sided alternative even if the research question is formulated as a directed claim.
Example 13.2. Let’s reconsider the question of vaccination rates. The population parameter we want
to test is the probability of providing a correct answer.
We may think that individuals possess incorrect knowledge, meaning they perform worse than ran-
dom guessing. However, we are uncertain, so the more cautious approach of a two-sided alternative
is preferred. This indicates that we assume they are not simply guessing.
To define the null hypothesis and alternative in mathematical notation, let's introduce the probability $\theta \in \Theta = (0, 1)$ of a correct answer. This then leads to the following test problem:
$$H_0: \theta = \frac{1}{3} \qquad H_A: \theta \ne \frac{1}{3}\,.$$
Since we lack background information, we decide to use a standard significance level 𝛼 of 5%.
Given a sample X, a test problem and a significance level 𝛼, we have to solve the following tasks:
1. Choose a suitable test statistic $T(\mathbf{X})$.
2. Determine or approximate the distribution of $T(\mathbf{X})$ under the null hypothesis (the null distribution).
3. Compute the p-value for the observed sample and compare it with $\alpha$.
Approximating the null distribution with the theoretical or asymptotic approach does not differ much compared to using these approaches to approximate the sampling distribution. In order to approximate the null distribution, we must first make an assumption about the population parameter 𝜃 (assuming the null hypothesis is true). Therefore, we will use the assumed value 𝜃0 when calculating the approximate null distribution.
We will consider some examples where a theoretical approach could be applied to test the given test problem.
Example 13.3. Assume that 50 people answered the question; the observed responses are:
table(responses)
# responses
# A B C
# 23 15 12
The correct answer is C. So, we can code the responses in the following way:
$$x_i = \begin{cases} 1, & \text{for } C, \\ 0, & \text{for } A, B. \end{cases}$$
Hence, 𝑥1 , … , 𝑥50 are realizations from 50 Bernoulli trials with success probability 𝜃 ∈ (0, 1).
Now we need to choose a test statistic. The sample mean would be a reasonable estimator for $\theta$ and could be used as test statistic. But we are unsure about the distribution of $\bar{X}_n$.
That is different if we consider the number of successes in 50 trials, $T(\mathbf{X}) = \sum_{i=1}^{50} X_i$, which is definitely also informative about $\theta$.
From the definition of the Binomial distribution (see Definition A.3) we know that the sum of i.i.d. Bernoulli random variables has a Binomial distribution. Hence, we know that
$$T(\mathbf{X}) = \sum_{i=1}^{50} X_i \sim \mathrm{Bin}\left(50, \frac{1}{3}\right)$$
under $H_0$. Given this distribution, we can compute the p-value, which is defined in the following way:
$$\text{p-value} = \mathbb{P}_{H_0}(T(\mathbf{X}) \le 12) + \mathbb{P}_{H_0}(T(\mathbf{X}) \ge 22)\,,$$
where 22 is the smallest value that has at least the same distance to the expected value $50 \cdot \frac{1}{3}$ as 12 has. For computing the actual value we use R:
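A sketch of the computation under $\mathrm{Bin}(50, \frac{1}{3})$:
# P(T <= 12) + P(T >= 22) under Bin(50, 1/3)
pbinom(12, size = 50, prob = 1/3) +
  pbinom(21, size = 50, prob = 1/3, lower.tail = FALSE)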
Computing the p-value this way is nothing you would typically do. Instead you would use the func-
tion binom.test().
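The call presumably looks like this (output omitted here):
binom.test(12, n = 50, p = 1/3)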
In the output, we observe the same p-value (after rounding) as calculated earlier. Because the p-value
is greater than the selected significance level of 0.05, we can conclude that the data does not provide
sufficient evidence to reject the null hypothesis of random guessing.
In the following example, we aim to draw conclusions about the mean value of a normally distributed
random variable.
Example 13.4. Assume we have observations from the statistical model (R, N (𝜃1 , 𝜃2 )), which means
mean value and variance are both unknown.
As an example we consider the loans_full_schema dataset from Chapter 4. To be specific, we will
not use the complete dataset of 10000 observations. Instead, we will only consider a sample of 50
individuals who are renting homes and have taken a grade A loan.
set.seed(1234)
loan_rentA <- loans_full_schema |>
filter(homeownership == "rent", grade == "A") |>
slice_sample(n = 50)
loan_rentA
# # A tibble: 50 x 8
# loan_amount interest_rate term grade state annual_income homeownership
# <int> <dbl> <dbl> <fct> <fct> <dbl> <fct>
# 1 10000 7.96 36 A PA 64000 rent
# 2 11000 7.35 36 A GA 56500 rent
# 3 15000 7.35 36 A CA 55000 rent
# 4 10000 6.08 36 A NY 200000 rent
# 5 1500 6.07 36 A CT 65000 rent
# 6 6000 7.97 36 A CA 36051 rent
# # i 44 more rows
# # i 1 more variable: debt_to_income <dbl>
We are interested in the average grade A loan amount of individuals who are renting homes.
Our claim is that the average loan amount is less than $15400. Hence, we want to consider the following test problem
$$H_0: \theta_1 = 15400 \qquad H_A: \theta_1 < 15400$$
at a 5% significance level.
In Example 11.4 we said that the statistic
$$T(\mathbf{X}) = \frac{\bar{X}_n - \theta_1}{\sqrt{S_n^2(\mathbf{X}) / n}} = \frac{\bar{X}_n - \theta_1}{S_n(\mathbf{X}) / \sqrt{n}}$$
has a t-distribution with $n - 1$ degrees of freedom. Since we know the value of $\theta_1$ under the null hypothesis, we can use $T(\mathbf{X})$ as our test statistic.
To calculate the test statistic for the given dataset, we require the sample mean and sample standard
deviation of the loan amount.
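A sketch of the computation (the book's exact code is not shown; the text reports a test statistic value of about -1.2671):
x_bar <- mean(loan_rentA$loan_amount)
s     <- sd(loan_rentA$loan_amount)
t     <- (x_bar - 15400) / (s / sqrt(50))   # observed value of T(x)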
The alternative hypothesis states that the average loan amount is less than 15400. Therefore, data
more favorable to the alternative hypothesis than the observed dataset would result in a test statistic
value being even smaller than the observed value -1.2671. So, the p-value is equal to
pt(t, df = 49)
# [1] 0.1055515
The p-value is greater than our pre-defined significance level of 0.05. Therefore, we fail to reject the null hypothesis at the given significance level. The data does not show enough evidence in favor of the alternative.
The whole test is implemented within the function t.test(). To confirm our result, let’s compare it
with the output of the function.
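The call presumably has this form:
t.test(loan_rentA$loan_amount, mu = 15400, alternative = "less")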
But how likely was it to reject the null hypothesis for a difference between the sample mean and the null value of roughly 2000 (in absolute value)?
Remember: A type 1 error occurs when we incorrectly reject 𝐻0 . The probability of a type 1 error is
(at most) 𝛼 (the significance level).
In case of a type 2 error, we fail to reject 𝐻0 although it is false. The probability of doing so is
denoted 𝛽.
Definition 13.3. A test’s power is defined as the probability of correctly rejecting 𝐻0 , and the
probability of doing so is 1 − 𝛽.
In hypothesis testing, we want to keep 𝛼 and 𝛽 low, but there are inherent trade-offs.
If the alternative hypothesis is true, what is the chance that we make a type 2 error?
• If the true population average is very close to the null hypothesis value, it will be difficult
to detect a difference (and reject 𝐻0 ).
• If the true population average is very different from the null hypothesis value, it will be
easier to detect a difference.
• Clearly, 𝛽 depends on the effect size 𝛿.
Example 13.5. In Example 13.4 we were not able to reject the null hypothesis of an average loan
amount being equal to $15400. The idea is now to determine a sample size, which leads to a power of
0.8 for an assumed effect size 𝛿 of $2000.
We can use the function power.t.test() to compute the necessary sample size. Besides the values
given above, we also need to enter an estimate/guess of the standard deviation of the loan amounts
as an argument.
A reasonable estimate is the sample standard deviation in the loan_rentA dataset:
sd(loan_rentA$loan_amount)
# [1] 11188.78
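A possible call (the one-sided alternative mirrors the test in Example 13.4; the book's exact arguments are not shown):
power.t.test(delta = 2000,
             sd = sd(loan_rentA$loan_amount),
             sig.level = 0.05, power = 0.8,
             type = "one.sample", alternative = "one.sided")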
The output says that we need a sample size of at least 196. Then let’s try our chance and take another
sample of size 196 from loans_full_schema.
set.seed(1234)
loan_rentA <- loans_full_schema |>
filter(homeownership == "rent", grade == "A") |>
slice_sample(n = 196)
dim(loan_rentA)
# [1] 196 8
Applying t.test() the same way as in Example 13.4 yields the following output.
# -Inf 14802.68
# sample estimates:
# mean of x
# 13661.22
We observe a p-value of 0.0063108, which is less than our significance level of 0.05. Hence, we can
conclude that the data shows enough evidence to reject the null hypothesis of an average grade A
loan amount of $15400 for individuals who are renting homes.
For larger sample sizes, we can use the asymptotic approach to approximate the null distribution. If
the test statistic has the form of an average of i.i.d. observation, this means applying the results from
Theorem 11.1. An application is shown in Example 13.6. In another example, we will analyze the
relationship between two categorical variables. The test statistic is derived from comparing observed
counts to the expected counts, assuming that both variables are independent. This test statistic follows
a non-normal distribution.
Example 13.6 (Continuation of Example 13.2). Remember, in Example 13.3 we analyzed the test problem
$$H_0: \theta = \frac{1}{3} \qquad H_A: \theta \ne \frac{1}{3}$$
using the theoretical approach on the test statistic $T(\mathbf{X}) = \sum_{i=1}^{50} X_i$. In this example, we aim to test the same problem using the asymptotic approach. To apply Theorem 11.1 in this context we need to choose another test statistic, since $T$ is not an average of random variables. But $\bar{X}_n$ is, which is also the maximum likelihood estimator of the unknown success probability $\theta$. For the observed sample
table(responses)
# responses
# A B C
# 23 15 12
we get as test statistic value $\bar{x}_{50} = \frac{12}{50} = 0.24$ (remember that C was the correct answer).
Hence, values more favorable to the alternative would be mean values even less than 0.24 or larger than $\frac{1}{3} + (\frac{1}{3} - 0.24)$. This implies that the p-value is equal to
$$\mathbb{P}_{H_0}(\bar{X}_{50} \le 0.24) + \mathbb{P}_{H_0}\!\left(\bar{X}_{50} \ge \tfrac{1}{3} + \left(\tfrac{1}{3} - 0.24\right)\right).$$
To compute (approximate) the probabilities $\mathbb{P}_{H_0}(\bar{X}_n \le 0.24)$ and $\mathbb{P}_{H_0}(\bar{X}_n \ge \frac{2}{3} - 0.24)$ we want to use the CLT, which implies that $\bar{X}_n$ is approximately $\mathcal{N}\!\left(\frac{1}{3}, \frac{2}{450}\right)$ distributed (see Example 12.2).
But this is only valid if the sample size is large enough. In Example 12.2 we introduced the success-
failure condition to check if the sample size is large enough. We used the point estimate 𝜃 ̂ to evaluate
the condition there. But now, the condition should hold under the null hypothesis, since we want
to approximate the null distribution. Thus, we use the assumed value of 𝜃 to check if the condition
holds.
Plugging in $\theta_0 = \frac{1}{3}$ shows that the success-failure condition holds, as $n\theta_0 = \frac{50}{3}$ and $n(1 - \theta_0) = \frac{100}{3}$ are both greater than 10.
Hence we can use R to compute the p-value based on the normal approximation $\mathcal{N}\!\left(\frac{1}{3}, \frac{2}{450}\right)$ of the null distribution.
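A sketch of the computation:
# P(Xbar <= 0.24) + P(Xbar >= 2/3 - 0.24) under N(1/3, 2/450); approx. 0.16
pnorm(0.24, mean = 1/3, sd = sqrt(2/450)) +
  pnorm(2/3 - 0.24, mean = 1/3, sd = sqrt(2/450), lower.tail = FALSE)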
The asymptotic approach of testing a null hypothesis about a proportion is implemented in the func-
tion prop.test(). Applying this function yields the following output:
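The call presumably looks like this; correct = FALSE (no continuity correction) is an assumption made here so that the result matches the normal approximation above:
prop.test(12, n = 50, p = 1/3, correct = FALSE)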
Remember that we have selected a significance level of 5%. Since the p-value is 0.1615, which is
greater than 0.05, the data indicate insufficient evidence to reject the null hypothesis.
Remark. The obtained p-value is larger than the one computed for the binomial test in Example 13.3.
Intuitively this makes sense. The binomial test is an exact (using the exact distribution of the Bernoulli
r.v. 𝑋𝑖 ) test, whereas the current one relied on an approximation. Therefore, it makes sense to obtain
a less “precise” answer when applying the two tests to the same data.
In the rest of this section we will discuss the chi-squared test of independence, which checks whether two categorical variables 𝑋 and 𝑌 are likely to be related or not. It analyzes the following testing problem:
$$H_0: X \text{ and } Y \text{ are independent} \qquad H_A: X \text{ and } Y \text{ are dependent}\,.$$
Before we discuss how to construct a reasonable test statistic, let’s look at an example.
Example 13.7. The popular dataset (available on Moodle) contains information about students in
grades 4 to 6.
They were asked whether good grades, athletic ability, or popularity was most important to them.
A two-way table separating the students by grade and choice of the most important factor is shown
below.
popular |>
table()
# goals
# grade Grades Popular Sports
# 4th 63 31 25
# 5th 88 55 33
# 6th 96 55 32
[The figure shows, for each grade (4th, 5th, 6th), the proportions of students choosing Grades, Popular or Sports as their most important goal.]
The idea of constructing the chi-squared test statistic is to compare the observed joint distribution of
𝑋 and 𝑌 with the joint distribution under independence.
Remember, the joint distribution of two categorical variables 𝑋 and 𝑌 with support {𝑢1 , … , 𝑢𝑘 } and
{𝑣1 , … , 𝑣ℓ }, respectively, can be summarized by a two-way table
        v_1     v_2     ...    v_ℓ
u_1     N_11    N_12    ...    N_1ℓ
u_2     N_21    N_22    ...    N_2ℓ
...     ...     ...            ...
u_k     N_k1    N_k2    ...    N_kℓ
containing observed counts $N_{ij}$ for each combination of the levels of $X$ and $Y$.
Under the null hypothesis $X$ and $Y$ are independent. This assumption implies that the expected counts are given by
$$E_{ij} = N \cdot \frac{N_{i\bullet}}{N} \cdot \frac{N_{\bullet j}}{N} = \frac{N_{i\bullet} \cdot N_{\bullet j}}{N}\,,$$
where $N = \sum_{i=1}^{k} \sum_{j=1}^{\ell} N_{ij}$ is the table total, $N_{i\bullet} = \sum_{j=1}^{\ell} N_{ij}$ is the row $i$ total and $N_{\bullet j} = \sum_{i=1}^{k} N_{ij}$ is the column $j$ total.
¾ Your turn
Have we observed more than expected 5th graders who have the goal of being popular?
#
# Grades Popular Sports Sum
# 4th 63 31 25 119
# 5th 88 55 33 176
# 6th 96 55 32 183
# Sum 247 141 90 478
A yes
B no
C can’t tell
The alternative hypothesis says that the variables $X$ and $Y$ are dependent. In this case, the observed counts will deviate from the expected counts under independence, and the alternative is favored when these differences are large.
The differences are summarized in the chi-squared statistic, which will be used as a test statistic for
the test of independence.
Definition 13.4. Let $(X_1, Y_1), \dots, (X_n, Y_n)$ be an i.i.d. sample, where each observation is a pair of two (possibly dependent) categorical random variables with levels $u_1, \dots, u_k$ and $v_1, \dots, v_\ell$, respectively. Further let $N_{ij}(\mathbf{X}, \mathbf{Y}) = \sum_{r=1}^{n} 1_{\{u_i, v_j\}}(X_r, Y_r)$ be the observed count and $E_{ij}(\mathbf{X}, \mathbf{Y}) = \frac{N_{i\bullet}(\mathbf{X}, \mathbf{Y}) \cdot N_{\bullet j}(\mathbf{X}, \mathbf{Y})}{N(\mathbf{X}, \mathbf{Y})}$ the expected count under independence for cell $(i, j)$.
The chi-squared statistic is then calculated as
$$\chi^2(\mathbf{X}, \mathbf{Y}) = \sum_{i=1}^{k} \sum_{j=1}^{\ell} \frac{\big(N_{ij}(\mathbf{X}, \mathbf{Y}) - E_{ij}(\mathbf{X}, \mathbf{Y})\big)^2}{E_{ij}(\mathbf{X}, \mathbf{Y})}\,.$$
Example 13.8 (Continuation of Example 13.7). To better understand the relationship between the
two variables, we should compare the observed counts with the expected counts. We need the
marginal distributions to compute the expected counts, which are added to the contingency table
with the addmargins() function.
popular |>
table() |>
addmargins()
# goals
# grade Grades Popular Sports Sum
# 4th 63 31 25 119
# 5th 88 55 33 176
# 6th 96 55 32 183
# Sum 247 141 90 478
Now let's calculate the expected counts for 4th graders who prioritize good grades or popularity:
$$E_{1,1} = \frac{119}{478} \cdot 247 = 61.4916\,, \qquad E_{1,2} = \frac{119}{478} \cdot 141 = 35.1025\,.$$
#
# Grades Popular Sports
# 4th 61.49163 35.10251 22.40586
# 5th 90.94561 51.91632 33.13808
# 6th 94.56276 53.98117 34.45607
Given only the observed statistic value of 1.3121, we cannot determine if this value is sufficiently
large to reject the null hypothesis. Therefore, we need to learn how to calculate or approximate the
null distribution of the chi-squared statistic.
Definition 13.5. Let $(X_1, Y_1), \dots, (X_n, Y_n)$ be an i.i.d. sample, where each observation is a pair of two (possibly dependent) categorical random variables with levels $u_1, \dots, u_k$ and $v_1, \dots, v_\ell$, respectively. The chi-squared statistic is calculated as
$$\chi^2(\mathbf{X}, \mathbf{Y}) = \sum_{i=1}^{k} \sum_{j=1}^{\ell} \frac{\big(N_{ij}(\mathbf{X}, \mathbf{Y}) - E_{ij}(\mathbf{X}, \mathbf{Y})\big)^2}{E_{ij}(\mathbf{X}, \mathbf{Y})}\,.$$
If the conditions of
• independent observations and
• at least 5 expected counts $E_{ij}$ in each cell
are met, then under the null hypothesis of independence the statistic approximately follows a $\chi^2$ distribution with $(k-1) \cdot (\ell-1)$ degrees of freedom. The p-value of the test is then
$$\mathbb{P}\big(\chi^2_{(k-1)(\ell-1)} \ge \chi^2(\mathbf{x}, \mathbf{y})\big)\,,$$
which corresponds to the area under the $\chi^2_{(k-1)\cdot(\ell-1)}$ density above the observed chi-squared statistic $\chi^2(\mathbf{x}, \mathbf{y})$.
¾ Your turn
Which of the following is the correct p-value for an observed test statistic value of 𝜒2 = 1.3121
and 𝑑𝑓 = 4 degrees of freedom?
A more than 0.3
B between 0.3 and 0.2
C between 0.2 and 0.1
D between 0.1 and 0.05
[The figure shows the density f(x), for x between 0 and 15, of the $\chi^2$ distribution with 4 degrees of freedom.]
Example 13.9 (Continuation of Example 13.8). The chi-squared test of independence is implemented
in the chisq.test() function. In this specific 𝜒2 test, which is one of several 𝜒2 tests, the arguments
of chisq.test() are the observations from both variables that we want to test for independence.
chisq.test(popular$grade, popular$goals)
#
# Pearson's Chi-squared test
#
# data: popular$grade and popular$goals
# X-squared = 1.3121, df = 4, p-value = 0.8593
Conclusion: Since the p-value is high, we fail to reject the null hypothesis 𝐻0 . The data do not
provide convincing evidence that grade and goals are dependent. It doesn’t appear that goals vary
by grade.
For constructing confidence intervals, we utilized the “infer workflow” to estimate the sampling dis-
tribution of the sample statistic. The idea for creating “new” samples was to take resamples with
replacement from the observed sample (the bootstrap approach). In the context of hypothesis testing,
we need to generate new values of the test statistic while assuming the null hypothesis to be true.
Hence, we need to adjust the sampling procedure. We consider the following two cases:
1. The null hypothesis specifies a specific probability model. In this instance, we will simply
generate new samples from this model.
2. The testing problem involves the relationship between two variables, and the null hypothesis
states that they are independent. In this scenario, we will randomly permute observations for
one of the variables, and this should not affect the test outcome under independence.
Compared to the interval estimation, we need one additional step for hypothesis testing:
hypothesize() is used to specify the null hypothesis. We have to choose for the null argument of hypothesize() one of the two following arguments:
• point: if the null hypothesis is about a single population parameter, where the chosen value
specifies a concrete probability model,
• independence: if the null hypothesis refers to the independence of the two variables under
consideration.
Remark. After specifying the null, generate() creates the resamples. Here, we do not need to specify
the correct type of resamples. It is automatically chosen. It will be type="draw" for a point null
hypothesis about a concrete probability model and type="permute" if the null hypothesis is of type
independence.
df <- tibble(
responses = ifelse(responses == "C",
"correct",
"not correct")
)
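The simulated null distribution used below could be generated with a pipeline like this (a sketch; the exact call is not shown):
null_distn <- df |>
  specify(response = responses, success = "correct") |>
  hypothesize(null = "point", p = 1/3) |>    # H0: theta = 1/3
  generate(reps = 1000, type = "draw") |>
  calculate(stat = "prop")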
The simulated null distribution is visualized in Figure 13.3. The figure also shows the p-value as
shaded area of the histogram.
null_distn |>
visualize() +
shade_p_value(obs_stat = 0.24, # observed value
direction = "both") # alternative hypothesis
[Figure 13.3 shows the simulated null distribution (count vs. stat) with the p-value shaded in both tails.]
Figure 13.3: Simulated null distribution of the sample proportion with the shaded p-value region.
null_distn |>
get_p_value(obs_stat = 0.24, direction = "both")
# # A tibble: 1 x 1
# p_value
# <dbl>
# 1 0.216
that is,
$$\frac{\#\{\text{resampled props} \le 0.24 \text{ or } \ge 0.427\}}{1000} = \frac{216}{1000}\,.$$
The result says that 216 out of the 1000 simulated proportions have been either less than or equal to 0.24 (= observed value) or greater than or equal to 0.427 ($\approx \frac{1}{3} + (\frac{1}{3} - 0.24)$). Hence, under the assumption of the true success probability being equal to $\frac{1}{3}$, these ranges are not so unlikely. Therefore, we have to conclude that the data doesn't provide sufficient evidence at the 5% significance level to reject the null hypothesis of random guessing in favor of doing worse or better than that.
Example 13.11. Remember the gender discrimination study from Chapter 6. We were considering
the following two hypothesis:
𝐻0 ∶ Promotion and gender are independent, no gender discrimination, observed difference in pro-
portions is simply due to chance.
𝐻1 ∶ Promotion and gender are dependent, there is gender discrimination, observed difference in
proportions is not due to chance.
Let’s use the infer workflow to test the null hypothesis. The null distribution can be simulated with
the following code.
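A sketch of such code, assuming the study data sit in a tibble called discrimination with columns gender and decision (hypothetical names):
null_distn <- discrimination |>
  specify(decision ~ gender, success = "promoted") |>
  hypothesize(null = "independence") |>
  generate(reps = 1000, type = "permute") |>
  calculate(stat = "diff in props", order = c("male", "female"))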
To compute the p-value, we need the observed value of the test statistic (here difference in propor-
tions).
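Under the same assumptions as above, the observed difference could be computed as:
diff_prop <- discrimination |>
  specify(decision ~ gender, success = "promoted") |>
  calculate(stat = "diff in props", order = c("male", "female"))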
null_distn |>
get_p_value(obs_stat = diff_prop$stat,
direction = "both")
# # A tibble: 1 x 1
# p_value
# <dbl>
# 1 0.028
We can conclude that there is something going on. The null hypothesis of promotion and gender
being independent can be rejected at the 0.05 significance level. The data provides evidence for an
existing gender discrimination; the observed difference in proportions is not due to chance.
In the previous two examples, we generated new samples by drawing observations from a specified
distribution or by permuting the original observations under the independence assumption of the
null hypothesis. But these are not the only methods for simulation-based hypothesis testing. Another
method is the bootstrap approach. Compared to approximating the sampling distribution using the
bootstrap approach, some slight adjustments have to be made.
We will employ this method to test the mean value 𝜃 of a distribution, which is otherwise not spec-
ified. For instance, let’s consider the following scenario. We have observations 𝑥1 , … , 𝑥𝑛 from a
distribution with mean 𝜃 and wish to test:
𝐻0 ∶ 𝜃 = 𝜃0 vs. 𝐻𝐴 ∶ 𝜃 ≠ 𝜃 0 .
1. Create new values $x_i^* = x_i - \bar{x}_n + \theta_0$, $i \in \{1, \dots, n\}$, which will have an empirical mean equal to $\theta_0$.
2. Draw $B$ bootstrap resamples with replacement from $x_1^*, \dots, x_n^*$, denoted $\mathbf{x}_b = (x_1^b, \dots, x_n^b)^\top$, $b \in \{1, \dots, B\}$.
3. Compute for each resample $\mathbf{x}_b$ the value of the test statistic $T(\mathbf{x}_b)$.
The resampled test statistic values are then based on samples with the assumed mean value under the null hypothesis, but they also represent the variation of the original sample, as sketched in the code below.
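A minimal sketch of this procedure, assuming obs is a numeric vector of observations and theta0 the hypothesized mean (both placeholders):
theta0    <- 15400                                  # hypothesized mean (illustrative)
obs_star  <- obs - mean(obs) + theta0               # step 1: shift to mean theta0
B         <- 1000
boot_stat <- replicate(B, mean(sample(obs_star, replace = TRUE)))  # steps 2 and 3
# two-sided p-value: resampled means at least as far from theta0 as the observed mean
mean(abs(boot_stat - theta0) >= abs(mean(obs) - theta0))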
Choosing a significance level for a test is important in many contexts. The traditional default level
is 0.05. However, it is often helpful to adjust the significance level based on the application.
We may select a level that is smaller or larger than 0.05 depending on the consequences of any
conclusions reached from the test.
If a type 2 error is important to avoid / costly, then we should choose a higher significance
level (e.g. 0.10). Then we are more cautious about failing to reject 𝐻0 when the null is actually
false (at the price of a higher type 1 error rate).
Short summary
This chapter introduces the fundamental concepts of hypothesis testing. It begins by framing a
research question regarding public knowledge about vaccination, translating it into competing
null and alternative hypotheses. The text then defines key elements of a statistical test, such
as the test statistic, rejection region, and the crucial concepts of Type I and Type II
errors. The document thoroughly explains the p-value as a measure of evidence against
the null hypothesis and details various approaches to conducting hypothesis tests, including
theoretical, asymptotic (for large samples), and simulation-based methods like permutation tests
and a modified bootstrap. Furthermore, it discusses the power of a test and the importance
of selecting an appropriate significance level based on the potential consequences of
errors. Finally, the material covers specific statistical tests, such as the binomial test, t-tests
for means, and the chi-squared test for independence between categorical variables, illustrating
their application with examples in R.
14 Inference for linear regression
In this section we work again with the evals dataset from the openintro package. The dataset con-
tains student evaluations of instructors’ beauty and teaching quality for 463 courses at the University
of Texas.
The teaching evaluations were conducted at the end of the semester. The beauty judgments were
made later, by six students who had not attended the classes and were not aware of the course evaluations
(two upper-level females, two upper-level males, one lower-level female, one lower-level male), see
Hamermesh and Parker (2005) for further details.
In Section 9.2.7, we applied stepwise selection algorithms to choose informative predictor variables
for predicting the evaluation score. As a result, we derived the following model.
Now we can ask a related but different question: Does the observed data provide enough evidence
to reject the assumption that there is no relation between one of the predictor variables and the
response?
This question can be answered by formulating it as a statistical test problem. For instance, let’s
consider the relationship between the average beauty score and the evaluation score. Given the
above model, we would consider the following test problem:
$$H_0: \beta_1 = 0 \qquad H_A: \beta_1 \ne 0\,.$$
Under the null hypothesis, the average beauty score has no relation with the evaluation score.
We will be exploring two different approaches for testing the slope parameter $\beta_j$ of the $j$-th predictor variable $x_j$ in the multiple linear regression model
$$Y_i = \beta_0 + \beta_1 x_{i1} + \dots + \beta_k x_{ik} + \epsilon_i\,, \quad i \in \{1, \dots, n\}\,.$$
Both the theoretical and simulation-based approach will conduct a partial test, analyzing the relation-
ship between one predictor and the response while considering the influence of all other predictor
variables.
Using the theoretical approach, one can derive the distribution of the test statistic used to test:
𝐻0 ∶ 𝛽𝑗 = 𝛽𝑗,0 𝐻𝐴 ∶ 𝛽𝑗 ≠ 𝛽𝑗,0 .
Assumptions: The random errors $\epsilon_1, \dots, \epsilon_n$
1. are independent, and
2. have a normal distribution with zero mean and constant variance, i.e., $\epsilon_i \sim \mathcal{N}(0, \sigma^2)$.
We can create a test statistic by comparing the least-squares estimate $\hat\beta_j(\mathbf{Y}, \mathbf{x})$ with the assumed value under the null hypothesis, i.e. considering the difference $\hat\beta_j(\mathbf{Y}, \mathbf{x}) - \beta_{j,0}$. To obtain a statistic with a known distribution, the difference is standardized using an estimate of the standard error $SE_{\hat\beta_j}(\mathbf{Y}, \mathbf{x})$.
Assuming the random errors follow the above assumptions, it can be demonstrated that the test statistic
$$T_j(\mathbf{Y}, \mathbf{x}) = \frac{\hat\beta_j(\mathbf{Y}, \mathbf{x}) - \beta_{j,0}}{\widehat{SE}_{\hat\beta_j}(\mathbf{Y}, \mathbf{x})}\,, \tag{14.2}$$
known as the t-statistic, follows a t-distribution with $n - (k+1)$ degrees of freedom under the assumption that the null hypothesis ($H_0: \beta_j = \beta_{j,0}$) is true.
Remark.
1. Most of the time the null value $\beta_{j,0}$ is assumed to be zero (no effect).
2. The standard error $SE_{\hat\beta_j}(\mathbf{Y}, \mathbf{x})$ depends on the unknown variance $\sigma^2$ of the random errors. But the residual variance
$$\hat\sigma^2 = \frac{1}{n - k - 1} \sum_{i=1}^{n} e_i^2$$
is an estimator for $\sigma^2$. Using this estimator allows us to define the estimated standard error $\widehat{SE}_{\hat\beta_j}(\mathbf{Y}, \mathbf{x})$.
With tidy() we can extract the estimates, their standard errors, the t-statistics and the corresponding
p-values.
tidy(evals_lm)
# # A tibble: 8 x 5
# term estimate std.error statistic p.value
# <chr> <dbl> <dbl> <dbl> <dbl>
# 1 (Intercept) 4.49 0.229 19.6 1.11e-62
# 2 bty_avg 0.0569 0.0170 3.35 8.65e- 4
# 3 age -0.00869 0.00322 -2.70 7.21e- 3
# 4 gendermale 0.210 0.0519 4.04 6.25e- 5
# 5 ranktenure track -0.207 0.0839 -2.47 1.40e- 2
# 6 ranktenured -0.176 0.0641 -2.74 6.32e- 3
# 7 languagenon-english -0.244 0.108 -2.26 2.41e- 2
# 8 pic_outfitnot formal -0.131 0.0713 -1.84 6.70e- 2
Let's try to verify the test statistic and p-value for testing $H_0: \beta_1 = 0$. The test statistic $T_1(\mathbf{Y}, \mathbf{x})$ has the value
$$T_1(\mathbf{y}, \mathbf{x}) = \frac{0.0569 - 0}{0.017} \approx 3.35\,.$$
The distribution of the test statistic is a t-distribution with parameter (called degrees of freedom) equal to $df = n - k - 1 = 463 - 7 - 1 = 455$. This then leads to the p-value reported in the tidy() output above, which can be computed from the t-distribution as follows.
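A sketch of the computation:
2 * pt(3.35, df = 455, lower.tail = FALSE)   # close to the 8.65e-04 from tidy();
                                             # small differences come from rounding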
Given the influence of all other variables being part of the model, there is strong evidence that the
average beauty score is related with the evaluation score.
Remark. Since the test examines the impact of one predictor while all other predictors are included
in the model, it is referred to as a partial t-test.
Using the t-statistic $\frac{\hat\beta_j(\mathbf{Y}, \mathbf{x}) - \beta_j}{\widehat{SE}_{\hat\beta_j}(\mathbf{Y}, \mathbf{x})}$ and applying the idea for constructing confidence intervals presented in Section 12.1 yields the $100(1-\alpha)\%$ confidence interval for the slope parameter $\beta_j$:
$$\hat\beta_j(\mathbf{Y}, \mathbf{x}) \pm t_{n-k-1,\, 1-\alpha/2} \cdot \widehat{SE}_{\hat\beta_j}(\mathbf{Y}, \mathbf{x})\,.$$
Given the estimate and standard error for the average beauty score,
tidy(evals_lm)[2,]
# # A tibble: 1 x 5
# term estimate std.error statistic p.value
# <chr> <dbl> <dbl> <dbl> <dbl>
# 1 bty_avg 0.0569 0.0170 3.35 0.000865
we can compute a 95% confidence interval for the slope of the average beauty score. For 𝛼 = 0.05 the
critical value 𝑡𝑛−𝑘−1,1−𝛼/2 is given by the 0.975 quantile of the t-distribution with 𝑛 − 𝑘 − 1 = 455
degrees of freedom:
qt(0.975, df = 455)
# [1] 1.965191
$$\hat\beta_1 \pm t_{455,\, 0.975} \cdot \widehat{SE}_{\hat\beta_1} \approx 0.0569 \pm 1.97 \cdot 0.017 \approx (0.0234,\ 0.0904)\,.$$
The confint() function can be used to compute confidence intervals for the slope parameters of a
linear model.
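For the beauty slope, the call might look like this:
confint(evals_lm, parm = "bty_avg", level = 0.95)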
¾ Your turn
# # A tibble: 1 x 5
# term estimate std.error statistic p.value
# <chr> <dbl> <dbl> <dbl> <dbl>
# 1 age -0.00869 0.00322 -2.70 0.00721
A Since the p-value is positive, the higher the professor’s age, the higher we would expect
them to be rated.
B If we keep all other variables in the model, there is strong evidence that professor’s age is
associated with their rating.
C Probability that the true slope parameter for age is 0 is 0.00721.
D There is about a 1% chance that the true slope parameter for age is -0.00869.
Before we discuss the simulation-based approach and talk about how to check the model assumption,
we will look at two special cases in the next two examples.
Example 14.1. We already know that several variables appear to be related to the evaluation score.
However, we will focus solely on comparing the evaluation scores of male and female professors in
this example.
[Figure 14.1: Boxplots of the evaluation score (score, roughly 3 to 5) by gender (female, male).]
A model for comparing the scores in the two groups and testing for a difference at a 5% significance
level is as follows.
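A sketch of such a model fit (the object name is hypothetical):
evals_gender_lm <- lm(score ~ gender, data = evals)
summary(evals_gender_lm)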
From the summary, we can infer that male professors appear to have significantly higher average
evaluation scores.
Testing
𝐻0 ∶ 𝛽male = 0 𝐻𝐴 ∶ 𝛽male ≠ 0
is equivalent to testing
𝐻0 ∶ 𝜇female − 𝜇male = 0 𝐻𝐴 ∶ 𝜇female − 𝜇male ≠ 0 ,
where 𝜇male and 𝜇female are the average evaluation score in the population of male and female profes-
sors. This test is also known as the two sample t-test with equal variance.
Example 14.2. In Example 14.1, we learned how to test mean values in two groups. If a variable di-
vides the population into more than two groups, we can no longer use the t-test. However, regression
analysis can still be used in this case.
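A sketch of the corresponding fit (object name hypothetical):
evals_rank_lm <- lm(score ~ rank, data = evals)
summary(evals_rank_lm)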
From the summary, it can be inferred that tenured professors receive significantly lower evaluation scores compared to teaching professors (reference level). At a 5% significance level, this difference cannot be confirmed for professors on a tenure track.
Based on the summary, we can conclude that we cannot reject the null hypothesis $H_0: \beta_{\text{ten. track}} = \beta_{\text{tenured}} = 0$ in favor of the alternative that at least one of them is different from zero. This is because the F-test, which is used to test this kind of hypothesis, has a value (F-statistic) of 2.706, which is not statistically significant (p-value: 0.06786) at a 5% level.
Testing both slope parameters jointly to be zero is equivalent to testing
$$H_0: \mu_{\text{teaching}} = \mu_{\text{ten. track}} = \mu_{\text{tenured}}\,.$$
A technique, known as one-way analysis of variance (ANOVA), can also be used to test hypothesis
like the one above.
Idea
When dealing with the null hypothesis 𝐻0 ∶ 𝛽𝑗 = 0, it implies that the j-th predictor has
no relationship with the response variable. Consequently, the values of 𝑥𝑗 have no impact on
the estimation results. Thus, we are able to randomly permute these values while keeping all
predictor values as they are.
This way we can create a large number of resamples under the null hypothesis of no relationship
between the j-th predictor and the response. This is again a partial test that evaluates the impact of
one predictor while considering all other predictors in the model.
We start by re-calculating the observed fit:
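A sketch of how obs_fit might be computed with infer's fit() (the formula mirrors evals_lm; the exact code is not shown):
obs_fit <- evals |>
  specify(score ~ bty_avg + age + gender + rank + language + pic_outfit) |>
  fit()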
Now we can generate a distribution of fits where each predictor variable is permuted indepen-
dently:
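A sketch of the corresponding pipeline (again assuming the evals_lm formula):
null_distn <- evals |>
  specify(score ~ bty_avg + age + gender + rank + language + pic_outfit) |>
  hypothesize(null = "independence") |>
  generate(reps = 1000, type = "permute",
           variables = c(bty_avg, age, gender, rank, language, pic_outfit)) |>
  fit()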
We can visualize the observed fit alongside the fits under the null hypothesis. This is done in Fig-
ure 14.2.
visualize(null_distn) +
shade_p_value(obs_stat = obs_fit, direction = "two-sided") +
plot_layout(ncol = 2)
[The panels show, for each model term (intercept, age, bty_avg, gendermale, languagenon-english, pic_outfitnot formal, ranktenure track, ranktenured), the permutation null distribution of the fitted coefficient with the observed fit marked and the two-sided p-value shaded.]
Figure 14.2: Permutation null distributions of the fitted coefficients with the observed fits and shaded p-values.
The p-values shown in the last figure can also be calculated using the get_p_value() function
again.
null_distn |>
get_p_value(obs_stat = obs_fit, direction = "two-sided")
# # A tibble: 8 x 2
# term p_value
# <chr> <dbl>
# 1 age 0
# 2 bty_avg 0.004
# 3 gendermale 0
# 4 intercept 0.056
# 5 languagenon-english 0.028
# 6 pic_outfitnot formal 0.038
# 7 ranktenure track 0.006
# 8 ranktenured 0.006
Remark. Be cautious in reporting a p-value of zero. This result is an approximation based on the
number of reps chosen in the generate() step. In theory, the p-value is never zero.
Caution
Always be aware of the type of data you’re working with: random sample, non-random
sample, or population.
Statistical inference, and the resulting p-values, are meaningless when you already have
population data.
If you have a sample that is non-random (biased), inference on the results will be unreliable.
The ideal situation is to have independent observations.
One often uses graphical methods to verify these conditions. We will discuss how to do that
next.
In addition we will have a look at how to detect outliers.
We want to analyse the residuals of four models based on simulated data to illustrate the cases of
having
• a nonlinear relationship,
• heterogeneous variance,
• outliers in the data.
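The code that simulates df is not included in this extract; a sketch that produces data of this structure (only the y_str formula and the outlier (x100, y100) = (20, 2) are taken from the text, all other choices are illustrative):

set.seed(1)
n  <- 100
x  <- runif(n, 1, 10)
df <- tibble::tibble(
  x     = x,
  y     = 2 + 0.4 * x   + rnorm(n),               # linear relation
  y_str = 2 + 0.4 * x^2 + rnorm(n),               # nonlinear relation
  y_het = 2 + 0.4 * x   + rnorm(n, sd = x),       # heterogeneous variance
  x_out = replace(x, n, 20),                      # one extreme predictor value
  y_out = replace(2 + 0.4 * x + rnorm(n), n, 2)   # corresponding outlier at (20, 2)
)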
We fit simple linear regression models to the four different response values using the simulated data.
# linear relation
model_reg <- lm(y ~ x, data = df)
# nonlinear relation
model_str <- lm(y_str ~ x, data = df)
# heterogeneous variance
model_het <- lm(y_het ~ x, data = df)
# contains outlier
model_out <- lm(y_out ~ x_out, data = df)
Figure 14.3: Pairs plot of the simulated data showing the relationship between the different response variables and x as well as x_out.
# # A tibble: 4 x 3
# model intercept x
# <chr> <dbl> <dbl>
# 1 linear 1.98 0.414
# 2 nonlinear -8.17 4.58
# 3 het. 1.12 0.724
# 4 outlier 2.68 0.283
The visualizations are created using the autoplot() function, which requires a fitted regression
model as input. In order to transform the information contained in the fitted model into something
that can be plotted, autoplot() needs helper functions from the ggfortify package. Therefore, we
have to load this package before we can use autoplot() for the first time.
library(ggfortify)
We assess the linear relationship by creating a scatterplot of residuals versus the fitted values (remember, the fitted values are a linear combination of all predictor variables).
p1 + p2 + p3 + p4 + plot_layout(ncol = 2)
Figure 14.4 [residuals vs. fitted values for the four models]
The constant variability is analyzed through a scatterplot of the square root of the absolute values of
standardized residuals vs. fitted values.
p1 + p2 + p3 + p4 + plot_layout(ncol = 2)
Figure 14.5 [scale-location plots: square root of |standardized residuals| vs. fitted values for the four models]
We verify the assumption of normality using a normal quantile-quantile plot (normal-probability plot)
of the standardized residuals.
p1 + p2 + p3 + p4 + plot_layout(ncol = 2)
Figure 14.6 [normal quantile-quantile plots of the standardized residuals for the four models]
Outliers
The presence of outliers can be analyzed using a scatterplot of standardized residuals versus leverage.
Before displaying plots, let’s define one component of these plots, the leverage score.
Remember from Equation 9.4 the representation of the fitted values, $\hat{\mathbf{y}} = \mathbf{H}\mathbf{y}$, with the hat matrix $\mathbf{H}$.
Definition 14.1. The leverage score 𝐻𝑖𝑖 of the i-th observation (𝑦𝑖 , x𝑖 ) is defined as H𝑖𝑖 , the i-th
diagonal element of the hat matrix H.
It can be interpreted as the degree by which the i-th observed response influences the i-th fitted value:
$$
H_{ii} = \frac{\partial \hat{y}_i}{\partial y_i}\,.
$$
Remark. The variance of the i-th residual 𝑒𝑖 is 𝜎2 (1 − 𝐻𝑖𝑖 ), and it holds that
0 ≤ 𝐻𝑖𝑖 ≤ 1 .
p1 + p2 + p3 + p4 + plot_layout(ncol = 2)
Figure 14.7 [standardized residuals vs. leverage for the four models]
If you detect a high leverage value and are uncertain about its influence on the parameter estimates, check the plot of Cook's distance vs. leverage.
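For the model containing the outlier, such a plot can be produced as follows (a sketch; which = 6 selects the Cook's distance vs. leverage panel, using the same numbering as plot.lm):

autoplot(model_out, which = 6)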
Figure 14.8 [Cook's distance vs. leverage for model_out]
The last plot indicated that the last observation $(x_{100}, y_{100}) = (20, 2)$ has a large Cook's distance

$$
D_i = \frac{1}{(k+1)\hat{\sigma}^2}\,\big(\hat{\mathbf{y}} - \hat{\mathbf{y}}_{(i)}\big)^{\top}\big(\hat{\mathbf{y}} - \hat{\mathbf{y}}_{(i)}\big)
    = \frac{H_{ii}\, r_i^2}{(1 - H_{ii})(k+1)}\,,
$$

where $\hat{\mathbf{y}}_{(i)}$ denotes predictions based on parameter estimates of the regression coefficients that have been computed when omitting the $i$-th observation, and $r_i = \frac{e_i}{\hat{\sigma}\sqrt{1 - H_{ii}}}$ is the $i$-th standardized residual.
The Cook’s distance is computed for each observation and can be used to indicate influential
observations.
Operational guideline
The function augment() from the broom package adds, among other characteristics, the Cook's distances to the dataset.
augment(model_out) |>
slice_tail(n = 10)
# # A tibble: 10 x 8
# y_out x_out .fitted .resid .hat .sigma .cooksd .std.resid
# <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
# 1 2.81 1 2.96 -0.156 0.0355 1.15 0.000352 -0.138
# 2 3.42 2 3.25 0.179 0.0260 1.15 0.000333 0.158
# 3 6.08 6 4.38 1.70 0.0100 1.14 0.0112 1.49
# 4 5.93 10 5.51 0.421 0.0288 1.15 0.00206 0.372
# 5 2.48 2 3.25 -0.767 0.0260 1.15 0.00613 -0.677
# 6 5.47 5 4.10 1.38 0.0108 1.14 0.00792 1.21
# 7 5.30 4 3.81 1.49 0.0137 1.14 0.0119 1.31
# 8 3.24 3 3.53 -0.286 0.0188 1.15 0.000605 -0.251
# 9 5.67 10 5.51 0.157 0.0288 1.15 0.000286 0.139
# 10 2 20 8.34 -6.34 0.228 0.890 5.85 -6.29
Remark. We use slice_tail() to show the last rows, as the very last observation exhibits a high
Cook’s distance.
Your turn
evals_lm
#
# Call:
# lm(formula = score ~ bty_avg + age + gender + rank + language +
# pic_outfit, data = evals)
#
# Coefficients:
# (Intercept) bty_avg age
# 4.490380 0.056916 -0.008691
# gendermale ranktenure track ranktenured
# 0.209779 -0.206806 -0.175756
# languagenon-english pic_outfitnot formal
# -0.244128 -0.130906
List the conditions required for linear regression and check if each one is satisfied for this model
based on the following diagnostic plots.
autoplot(evals_lm)
[Diagnostic plots for evals_lm: residuals vs. fitted values, normal Q-Q plot, scale-location plot, and residuals vs. leverage.]
• transforming variables
• seeking out additional variables to fill model gaps
• using more advanced methods (not part of the course)
We will examine the process of transforming variables. The data used to fit model_str conforms to
the model
𝑌𝑖 = 2 + 0.4 ⋅ 𝑥2𝑖 + 𝜖𝑖 .
This means that using 𝑥2𝑖 instead of 𝑥𝑖 as the predictor variable results in a linear relationship between
the response and predictor.
Figure 14.9 [scatterplot of y_str against x^2]
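The refitted model model_tf is not defined in this extract; presumably it is the fit on the transformed predictor:

model_tf <- lm(y_str ~ I(x^2), data = df)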
autoplot(model_tf)
Figure 14.10 [diagnostic plots for model_tf: residuals vs. fitted values, normal Q-Q plot, scale-location plot, and residuals vs. leverage]
Short summary
This chapter introduces inference for linear regression, using a dataset of student evaluations of university instructors. It explores how to statistically test the relationship between predictor variables like beauty, age, and gender, and the response variable evaluation score. The text details both theoretical approaches using t-statistics and p-values, alongside simulation-based methods involving permutation. Furthermore, it emphasises the assumptions underlying linear regression and provides guidance on residual analysis to check model validity, including identifying outliers and assessing linearity, constant variance, and normality. The text also touches upon improving model fit through variable transformation and briefly discusses special cases like comparing two groups and multiple groups.
References
A Some probability distributions
dnorm(0.5)
# [1] 0.3520653
rnorm(2, sd = 3)
# [1] -3.3220566 0.2682201
Definition A.1. A distribution over the set $S := \{x_1, \dots, x_n\}$ assigning equal weight $\mathrm{P}(\{x_i\}) = \frac{1}{n}$ to each element $x_i$ is called discrete uniform distribution.
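Draws from a discrete uniform distribution can be simulated in R with sample(); a minimal example for S = {1, ..., 6}:

sample(1:6, size = 5, replace = TRUE)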
Bernoulli distribution
Definition A.2. Let $p \in (0, 1)$ be the probability of success and $S := \{0, 1\}$. Then the probabilities
$$
\mathrm{P}(\{k\}) = \begin{cases} p, & k = 1,\\ 1 - p, & k = 0, \end{cases}
$$
define a Bernoulli distribution with parameter $p$.
Remark. A random trial with only two possible outcomes is called a Bernoulli random trial. A r.v.
𝑋 with sample space 𝑆 = {0, 1} and probabilities P(𝑋 = 1) = 𝑝 and P(𝑋 = 0) = 1 − 𝑝, is called a
Bernoulli random variable with E[𝑋] = 𝑝 and Var[𝑋] = (1 − 𝑝)𝑝.
Binomial distribution
Definition A.3. Let 𝑝 ∈ (0, 1) be the probability of success, 𝑛 ∈ N the number of independent
Bernoulli trials and 𝑘 ∈ 𝑆 ∶= {0, 1, … , 𝑛} the number of successes (elementary event). Then the
probabilities
$$
\mathrm{P}_{(n,p)}(\{k\}) = \binom{n}{k}\, p^k (1-p)^{n-k}\,, \quad k \in S,
$$
define a binomial distribution with parameters 𝑛 and 𝑝.
Interpretation
The binomial distribution describes the probability of observing exactly $k$ successes in $n$ independent Bernoulli trials.
Expectation and Variance of a r.v. 𝑋 with binomial distribution are E[𝑋] = 𝑛𝑝 and Var[𝑋] = 𝑛𝑝(1 −
𝑝), respectively.
In R, we can compute the probability P(𝑛,𝑝) ({𝑘}) using the command:
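dbinom(k, size = n, prob = p)  # assumed (standard base-R probability mass function)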
¹ $\binom{n}{k} = \frac{n!}{k!\,(n-k)!}$ is called the binomial coefficient, and $n! = n \cdot (n-1) \cdot (n-2) \cdots 2 \cdot 1$ is the factorial.
The probability of the event $\{X \le k\}$ is the distribution function of the binomial distribution, $\sum_{i=0}^{k} \mathrm{P}_{(n,p)}(\{i\})$, evaluated at the point $k$, and can be computed in R using the command:
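pbinom(k, size = n, prob = p)  # assumed (standard base-R distribution function)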
Example A.1. The number of successes in 10 independent Bernoulli trials follows a binomial distribution with parameters $n = 10$ and $p = 0.35$.
The probability of 6 successes in 10 independent Bernoulli trials is then equal to
\begin{align*}
\mathrm{P}(\text{6 successes out of 10})
 &= \binom{10}{6} \cdot 0.35^6 \cdot (1 - 0.35)^{10-6}
  = \frac{10!}{4!\,6!} \cdot 0.35^6 \cdot 0.65^4 \\
 &= \frac{10 \cdot 9 \cdot 8 \cdot 7}{4 \cdot 3 \cdot 2 \cdot 1} \cdot 0.35^6 \cdot 0.65^4
  = 210 \cdot 0.35^6 \cdot 0.65^4
  = 0.0689098\,.
\end{align*}
Using R we obtain
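dbinom(6, size = 10, prob = 0.35)  # assumed call; the value matches the computation above
# [1] 0.0689098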
Geometric distribution
Definition A.4. Let 𝑝 ∈ (0, 1) be the probability of success, (1 − 𝑝) the probability of failure and
𝑛 ∈ 𝑆 ∶= N the number of independent trials. Then the probabilities
$$
\mathrm{P}_p(\{n\}) = (1-p)^{n-1}\, p\,, \quad n \in S,
$$
define a geometric distribution with parameter $p$.
Expectation and Variance of a r.v. $X$ with geometric distribution are $\mathrm{E}(X) = \frac{1}{p}$ and $\mathrm{Var}(X) = \frac{1-p}{p^2}$.
dgeom(n-1, prob = p)
Remark. dgeom() uses a different parametrization compared to Definition A.4. Therefore, we have
to use n-1 instead of n.
Application
Waiting time until the first success in independent and identically distributed (i.i.d.)
Bernoulli random trials.
Poisson distribution
Definition A.5. Let 𝜆 ∈ R+ be a positive parameter, called rate. Then the probabilities
$$
\mathrm{P}_\lambda(\{k\}) = \frac{\lambda^k\, e^{-\lambda}}{k!}\,, \quad k \in S := \mathbb{N}_0\,,
$$
where $k!$ denotes the $k$-factorial, define a Poisson distribution with parameter $\lambda$.
Expectation and Variance of a r.v. $X$ with Poisson distribution are $\mathrm{E}[X] = \lambda$ and $\mathrm{Var}[X] = \lambda$.
Interpretation
The Poisson distribution is often useful for modeling the number of rare events in a large
population over a (short) unit of time. The population is assumed to be (mostly-)fixed, and the
units within the population should be independent.
Data, which can be modeled through a Poisson distribution, is also called count data.
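In R, these probabilities can be evaluated with the standard function dpois(); a minimal sketch:

dpois(k, lambda = lambda)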
Definition A.6. Let 𝑝 ∈ (0, 1) be the probability of success, 𝑛 ∈ N the number of independent
Bernoulli trials, and 𝑘 ≤ 𝑛 the number of successes. Then the probabilities
$$
\mathrm{P}_{(k,p)}(\{n\}) = \binom{n-1}{k-1}\, p^k\, (1-p)^{n-k}
$$
define a negative binomial distribution with parameters $k$ and $p$.
Interpretation
The negative binomial distribution describes the probability of observing the 𝑘𝑡ℎ success on the
𝑛𝑡ℎ trial.
R uses a different parametrization compared to Definition A.6. Therefore, we have to enter the number
of failures x and successes size in x + size Bernoulli trials with the last one being a success.
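A sketch of the corresponding call, following this parametrization (k successes and n − k failures):

dnbinom(n - k, size = k, prob = p)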
Definition A.7. Let $[a, b]$ be an interval on the real line with $a < b$. The distribution with density function
$$
f(x) = \frac{1}{b-a} \cdot \mathbb{1}_{[a,b]}(x)
     = \begin{cases} \frac{1}{b-a}\,, & x \in [a,b]\,,\\ 0\,, & x \notin [a,b]\,, \end{cases}
$$
is called (continuous) uniform distribution on the interval $[a, b]$ and we will denote it by $\mathrm{Unif}(a, b)$.
Expectation and Variance of a r.v. $X$ with uniform distribution on the interval $[a, b]$ are $\mathrm{E}[X] = \frac{b+a}{2}$ and $\mathrm{Var}[X] = \frac{(b-a)^2}{12}$.
The function
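dunif(x, min = a, max = b)  # assumed (standard base-R density)
computes the density function of the uniform distribution on the interval [a, b].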
[Plot of the density f(x) of a continuous uniform distribution.]
Normal distribution
Definition A.8. Let 𝜇 ∈ R and 𝜎 > 0. The normal distribution with mean 𝜇 and variance 𝜎2 is
the continuous distribution on R with density function
$$
f(x) = \frac{1}{\sqrt{2\pi\sigma^2}}\, \mathrm{e}^{-\frac{(x-\mu)^2}{2\sigma^2}}\,, \quad x \in \mathbb{R}\,.
$$
Remark. A normal distribution with mean $\mu = 0$ and variance $\sigma^2 = 1$ is called standard normal distribution; in symbols, $\mathrm{N}(0, 1)$.
Exponential distribution
Definition A.9. Let $\lambda > 0$ be a parameter called rate. The distribution with density function
$$
f(x) = \lambda\, \mathrm{e}^{-\lambda x}\, \mathbb{1}_{[0,\infty)}(x)
$$
is called exponential distribution and we will denote it by $\mathrm{Exp}(\lambda)$.
Expectation and Variance of a r.v. $X$ with distribution $\mathrm{Exp}(\lambda)$ are $\mathrm{E}[X] = \frac{1}{\lambda}$ and $\mathrm{Var}[X] = \frac{1}{\lambda^2}$.
The function
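dexp(x, rate = lambda)  # assumed (standard base-R density)
computes the density function of the exponential distribution with rate λ.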
[Plot of exponential densities for rates λ = 1 and λ = 2.]
Application
Chi-squared distribution
Definition A.10. Let 𝑘 > 0 be a parameter called degree of freedom. The distribution with density
function
$$
f(x) = \frac{1}{2^{k/2}\,\Gamma(k/2)}\, x^{k/2-1}\, \mathrm{e}^{-x/2}\, \mathbb{1}_{[0,\infty)}(x)\,,
$$
where Γ(⋅) is the Gamma function, is called chi-squared distribution and we will denote it by 𝜒2 (𝑘).
Expectation and Variance of a r.v. 𝑋 with distribution 𝜒2 (𝑘) are E[𝑋] = 𝑘 and Var[𝑋] = 2𝑘.
Application
The function
dchisq(x, df = k)
computes the density function of the chi-squared distribution with 𝑘 degrees of freedom.
[Plot of chi-squared densities for k ∈ {2, 3, 7}.]
F-distribution
Definition A.11. Let $d_1 > 0$ and $d_2 > 0$ be two parameters called degrees of freedom. The distribution with density function
$$
f(x) = \frac{\Gamma\!\left(\frac{d_1 + d_2}{2}\right)}{\Gamma\!\left(\frac{d_1}{2}\right)\Gamma\!\left(\frac{d_2}{2}\right)}
\left(\frac{d_1}{d_2}\right)^{d_1/2} x^{\,d_1/2 - 1}\left(1 + \frac{d_1}{d_2}\,x\right)^{-(d_1+d_2)/2} \mathbb{1}_{[0,\infty)}(x)\,,
$$
where $\Gamma(\cdot)$ is the Gamma function, is called F-distribution and we will denote it by $F(d_1, d_2)$.
Expectation and Variance of a r.v. $X$ with distribution $F(d_1, d_2)$ are $\mathrm{E}[X] = \frac{d_2}{d_2 - 2}$, for $d_2 > 2$, and $\mathrm{Var}[X] = \frac{2 d_2^2 (d_1 + d_2 - 2)}{d_1 (d_2 - 2)^2 (d_2 - 4)}$, for $d_2 > 4$.
Application
The function
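df(x, df1 = d1, df2 = d2)  # assumed (standard base-R density)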
computes the density function of the F-distribution with 𝑑1 and 𝑑2 degrees of freedom.
[Plot: densities of the F-distribution for (d1, d2) ∈ {(2, 1), (5, 1), (100, 100)}.]
𝑡 distribution
Definition A.12. Let 𝜈 > 0 be a parameter called degree of freedom. The distribution with density
function
$$
f(x) = \frac{\Gamma\!\left(\frac{\nu+1}{2}\right)}{\sqrt{\nu\pi}\,\Gamma\!\left(\frac{\nu}{2}\right)}
\left(1 + \frac{x^2}{\nu}\right)^{-(\nu+1)/2},
$$
where Γ(⋅) is the Gamma function, is called t-distribution and we will denote it by 𝑡(𝜈).
Expectation and Variance of a r.v. $X$ with distribution $t(\nu)$ are $\mathrm{E}[X] = 0$ and $\mathrm{Var}[X] = \frac{\nu}{\nu - 2}$, for $\nu > 2$.
Remark. The Student’s 𝑡-distribution is a generalization of the standard normal distribution. Its
density is also symmetric around zero and bell-shaped, but its tails are thicker than the normal
model’s. Therefore, observations are more likely to fall beyond two SDs from the mean than those
under the normal distribution.
[Plot comparing the density of the t-distribution with the standard normal density.]
Application
The function
dt(x, df = nu)
computes the density function of the t-distribution with $\nu$ degrees of freedom.
[Plot of t-densities for df ∈ {1, 2, 5, 10} together with the standard normal density.]
Definition A.13. Let (𝑋1 , 𝑋2 )⊤ ∈ R2 be a random vector. We say, that (𝑋1 , 𝑋2 )⊤ has a two-
dimensional normal distribution, if the density of the joint distribution of 𝑋1 and 𝑋2 is given
by
$$
f(\mathbf{x}) = \frac{1}{\sqrt{(2\pi)^2 \det(\Sigma)}}\,
\exp\left\{-\frac{1}{2}\,(\mathbf{x} - \mu)^{\top}\, \Sigma^{-1}\, (\mathbf{x} - \mu)\right\}, \quad \mathbf{x} \in \mathbb{R}^2,
$$
with $\mu = (\mu_1, \mu_2)^\top \in \mathbb{R}^2$ and
$$
\Sigma = \begin{pmatrix} \sigma_1^2 & \rho\sigma_1\sigma_2 \\ \rho\sigma_1\sigma_2 & \sigma_2^2 \end{pmatrix},
$$
where 𝜎1 , 𝜎2 > 0 and 𝜌 ∈ (−1, 1). The parameters 𝜇𝑖 and 𝜎𝑖2 are the expectation and variance of the
random variable 𝑋𝑖 , 𝑖 ∈ {1, 2}, respectively. We denote the two-dimensional normal distribution
with parameters 𝜇 and Σ by N2 (𝜇, Σ).
Remark.
1. Doing the matrix-vector multiplication, the density of the two-dimensional normal distribution
can be written as
$$
f(\mathbf{x}) = f(x_1, x_2) = \frac{1}{\sqrt{(2\pi)^2\,\sigma_1^2\sigma_2^2\,(1-\rho^2)}}\;
\mathrm{e}^{-\frac{1}{2(1-\rho^2)}\left[\frac{(x_1-\mu_1)^2}{\sigma_1^2}
 - 2\rho\,\frac{(x_1-\mu_1)(x_2-\mu_2)}{\sigma_1\sigma_2}
 + \frac{(x_2-\mu_2)^2}{\sigma_2^2}\right]}\,.
$$
2. It holds that each component, 𝑋1 and 𝑋2 , has a one-dimensional normal distribution, with 𝑋𝑖
following N (𝜇𝑖 , 𝜎𝑖2 ). For example, the density 𝑓1 of 𝑋1 is given
by
$$
f_1(x_1) = \frac{1}{\sqrt{2\pi\sigma_1^2}}\, \mathrm{e}^{-\frac{(x_1-\mu_1)^2}{2\sigma_1^2}}\,,
$$
which gives the density of N (𝜇1 , 𝜎12 ); see Section C.3 for a proof of this result. The case of 𝑋2
is analogous.
3. The form of the density from Definition A.13 also generalizes to an n-dimensional normal dis-
tribution with parameters 𝜇 ∈ R𝑛 and Σ ∈ S𝑛++ , where S𝑛++ is the set of all symmetric and
positive definite matrices of dimension 𝑛 × 𝑛.
We visualize the density of the two-dimensional normal distribution by contour plots. These plots
show the smallest regions containing 50%, 80%, 95%, and 99% of the probability mass and are created
using functions from the ggdensity package.
We start by defining the density function.
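The definition of f is not shown here; a sketch consistent with the geom_hdr_fun() calls below (the arguments mu1, mu2, sigma1, sigma2, rho and their defaults are assumptions read off those calls):

f <- function(x, y, mu1 = 0, mu2 = 0, sigma1 = 1, sigma2 = 1, rho = 0) {
  # two-dimensional normal density, evaluated on vectors x and y
  q <- (x - mu1)^2 / sigma1^2 -
       2 * rho * (x - mu1) * (y - mu2) / (sigma1 * sigma2) +
       (y - mu2)^2 / sigma2^2
  exp(-q / (2 * (1 - rho^2))) / (2 * pi * sigma1 * sigma2 * sqrt(1 - rho^2))
}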
First, we vary the expectation vector 𝜇. We start with 𝜇 = (0, 0)⊤ , then change 𝜇1 and in the last
step change both.
library(ggdensity)
p1 <- ggplot() +
geom_hdr_fun(fun = f, xlim = c(-3, 9), ylim = c(-7, 4),
fill = "blue") +
xlim(-3,9) + ylim(-7,4) +
labs(title = expression(paste(mu[1],"=0 and ", mu[2], "=0")))
p2 <- ggplot() +
geom_hdr_fun(fun = f, xlim = c(-10, 10), ylim = c(-10, 10),
fill = "blue", args = list(mu1 = 5)) +
xlim(-3,9) + ylim(-7,4) +
labs(title = expression(paste(mu[1],"=5 and ", mu[2], "=0")))
p3 <- ggplot() +
geom_hdr_fun(fun = f, xlim = c(-10, 10), ylim = c(-10, 10),
fill = "blue", args = list(mu1 = 5, mu2 = -3)) +
xlim(-3,9) + ylim(-7,4) +
labs(title = expression(paste(mu[1],"=5 and ", mu[2], "=-3")))
Now, we change the variance of the two components. The first plot shows again the N2 (𝜇, Σ) distri-
bution. Then, we increase both variances to 4, and in the last plot, the variance of 𝑋2 is 9.
p1 <- ggplot() +
geom_hdr_fun(fun = f, xlim = c(-10, 10), ylim = c(-10, 10),
fill = "blue") +
xlim(-10,10) + ylim(-10,10) +
labs(title = expression(paste(sigma[1],"=1 and ", sigma[2], "=1")))
p2 <- ggplot() +
geom_hdr_fun(fun = f, xlim = c(-10, 10), ylim = c(-10, 10),
fill = "blue", args = list(sigma1 = 2, sigma2 = 2)) +
xlim(-10,10) + ylim(-10,10) +
labs(title = expression(paste(sigma[1],"=2 and ", sigma[2], "=2")))
p3 <- ggplot() +
  geom_hdr_fun(fun = f, xlim = c(-10, 10), ylim = c(-10, 10),
               # assumed continuation of the call: sigma2 = 3 gives the variance of 9 described above
               fill = "blue", args = list(sigma2 = 3)) +
  xlim(-10,10) + ylim(-10,10) +
  labs(title = expression(paste(sigma[1],"=1 and ", sigma[2], "=3")))
p1 + p2 + p3 +
plot_layout(guides = 'collect')
[Contour plots for the three variance settings, showing the regions containing 50%, 80%, and 95% of the probability mass.]
In all the previous examples, the parameter 𝜌 was equal to 0. Before we visualize the effect of varying
𝜌, let’s show that 𝜌 is the correlation between 𝑋1 and 𝑋2 for (𝑋1 , 𝑋2 ) ∼ N2 (𝜇, Σ) with Σ as given
in Definition A.13.
To compute the covariance between two random variables 𝑋1 and 𝑋2 , we use the formula
Cov[𝑋1 , 𝑋2 ] = E[𝑋1 ⋅ 𝑋2 ] − E[𝑋1 ] ⋅ E[𝑋2 ]. As we already know the expected values of 𝑋1 and
𝑋2 , denoted by 𝜇1 and 𝜇2 respectively, we only need to determine E[𝑋1 ⋅ 𝑋2 ], which is done in
Section C.3. From there we get
E[𝑋1 ⋅ 𝑋2 ] = 𝜌𝜎1 𝜎2 + 𝜇1 𝜇2 ,
which leads to the covariance $\mathrm{Cov}[X_1, X_2] = \rho\sigma_1\sigma_2 + \mu_1\mu_2 - \mu_1\mu_2 = \rho\sigma_1\sigma_2$ and hence to the correlation
$$
\mathrm{Corr}[X_1, X_2] = \frac{\mathrm{Cov}[X_1, X_2]}{\sqrt{\mathrm{Var}[X_1]}\cdot\sqrt{\mathrm{Var}[X_2]}} = \frac{\rho\sigma_1\sigma_2}{\sigma_1\sigma_2} = \rho\,,
$$
which shows that $\rho$ is the correlation between $X_1$ and $X_2$, and describes the linear dependence between the two.
The last figure shows contour plots for the $\mathrm{N}_2(\mu, \Sigma)$ distribution, with $\mu = (0, 0)^\top$, $\Sigma = \begin{pmatrix} 1 & \rho \\ \rho & 1 \end{pmatrix}$ and $\rho \in \{-0.7, -0.2, 0, 0.2, 0.7\}$.
p1 <- ggplot() +
geom_hdr_fun(fun = f, xlim = c(-5, 5), ylim = c(-5, 5),
fill = "blue", args = list(rho = -.7)) +
xlim(-5,5) + ylim(-5,5) +
labs(title = expression(paste(rho,"=-0.7")))
p2 <- ggplot() +
geom_hdr_fun(fun = f, xlim = c(-5, 5), ylim = c(-5, 5),
fill = "blue", args = list(rho = -.2)) +
xlim(-5,5) + ylim(-5,5) +
labs(title = expression(paste(rho,"=-0.2")))
p3 <- ggplot() +
geom_hdr_fun(fun = f, xlim = c(-5, 5), ylim = c(-5, 5),
fill = "blue") +
xlim(-5,5) + ylim(-5,5) +
labs(title = expression(paste(rho,"=0")))
p4 <- ggplot() +
geom_hdr_fun(fun = f, xlim = c(-5, 5), ylim = c(-5, 5),
fill = "blue", args = list(rho = .2)) +
xlim(-5,5) + ylim(-5,5) +
labs(title = expression(paste(rho,"=0.2")))
p5 <- ggplot() +
geom_hdr_fun(fun = f, xlim = c(-5, 5), ylim = c(-5, 5),
fill = "blue", args = list(rho = .7)) +
xlim(-5,5) + ylim(-5,5) +
labs(title = expression(paste(rho,"=0.7")))
[Contour plots for ρ ∈ {−0.7, −0.2, 0, 0.2, 0.7}.]
B Inference for logistic regression
A health survey conducted in The Hague, Netherlands from 1972 to 1981 found a link between keeping
pet birds and an increased risk of lung cancer.
To investigate bird keeping as a risk factor, researchers conducted a case-control study of patients
in 1985 at four hospitals in The Hague (population 450,000).
They identified 49 cases of lung cancer among the patients who were registered with a general
practice, who were age 65 or younger and who had resided in the city since 1965. They also selected
98 controls from a population of residents having the same general age structure.
The data is contained in the Sleuth3 package accompanying the book The Statistical Sleuth (2002).
¹ Example taken from Ramsey and Schafer (2002).
case2002
# # A tibble: 147 x 7
# LC FM SS BK AG YR CD
# <fct> <fct> <fct> <fct> <int> <int> <int>
# 1 LungCancer Male Low Bird 37 19 12
# 2 LungCancer Male Low Bird 41 22 15
# 3 LungCancer Male High NoBird 43 19 15
# 4 LungCancer Male Low Bird 46 24 15
# 5 LungCancer Male Low Bird 49 31 20
# 6 LungCancer Male High NoBird 51 24 15
# # i 141 more rows
We want to visually analyze the impact of the predictor variables (age, smoking (years as well as the
rate), gender, socioeconomic status, and owning birds) on the likelihood of developing lung cancer.
We will create a separate bar plot for each of these variables and color each bar based on the proportions of individuals with and without lung cancer in the corresponding subgroup. Since age,
years of smoking, and the rate of smoking are numerical variables, we will need to categorize them
to create bar plots.
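The plotting code is not part of this extract; a sketch of one of these bar plots (the five age categories in Figure B.1 look like equal-count bins, so cut_number() is assumed):

library(ggplot2)
ggplot(case2002, aes(x = cut_number(AG, 5), fill = LC)) +
  geom_bar(position = "fill") +
  labs(x = "Age", y = "proportions")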
Figure B.1 [bar plots of the proportions of individuals with and without lung cancer within categories of age, years of smoking prior to diagnosis, cigarettes per day, gender, socioeconomic status, and bird keeping]
We observe different proportions across the categories of age (minimal), years of smoking, average number of cigarettes per day and bird ownership, while this is not evident for gender and the socioeconomic status.
Model selection
Based on our exploratory data analysis, we do not anticipate gender and socioeconomic status significantly contributing to a model that describes the probability of developing lung cancer. However, owning birds and smoking appear to have an impact. As for the influence of age, we are uncertain.
Let's try to verify our assumptions by fitting a logistic regression model to the data. Remember, the model states that
$$
\log\!\left(\frac{p(\mathbf{x}_i)}{1 - p(\mathbf{x}_i)}\right) = \beta_0 + \sum_{j=1}^{k} \beta_j\, x_{j,i}\,.
$$
To select the predictor variables in the model, a hybrid stepwise selection algorithm is used. The
algorithm begins with the intercept model:
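A sketch of the starting model (the object name model_0 is illustrative; it assumes LC has been recoded so that LungCancer is the second factor level, i.e., the modeled "success"):

# case2002 <- dplyr::mutate(case2002, LC = relevel(LC, ref = "NoCancer"))  # assumed recoding
model_0 <- glm(LC ~ 1, data = case2002, family = binomial)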
Now, we can use the step() function on this model and define the scope to include all other variables
in the dataset.
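The exact call is not shown here; a sketch consistent with the step output below and the model_step object used later:

model_step <- step(model_0,
                   scope = ~ FM + SS + BK + AG + YR + CD,
                   direction = "both")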
# Df Deviance AIC
# <none> 158.11 164.11
# + AG 1 156.22 164.22
# + CD 1 156.75 164.75
# + FM 1 157.10 165.10
# + SS 1 158.11 166.11
# - YR 1 172.93 176.93
# - BK 1 173.17 177.17
Let’s take a look at the estimated slope parameters 𝛽𝑗̂ , 𝑗 ∈ {1, 2}.
tidy(model_step)
# # A tibble: 3 x 5
# term estimate std.error statistic p.value
# <chr> <dbl> <dbl> <dbl> <dbl>
# 1 (Intercept) -3.18 0.636 -5.00 0.000000582
# 2 BKBird 1.48 0.396 3.73 0.000194
# 3 YR 0.0582 0.0168 3.46 0.000544
• the odds ratio of getting lung cancer for bird keepers vs non-bird keepers is exp(1.48) ≈
4.37,
• the odds ratio of getting lung cancer for an additional year of smoking is exp(0.0582) ≈
1.06.
Given the results of a fitted model, we want to test hypotheses about the slopes 𝛽𝑖 , as we have done
in the linear regression model.
We want to investigate whether the j-th explanatory variable has an impact on the probability of
success P(𝑌 = 1|x) within the population. Hence, we want to consider the testing problem
𝐻0 ∶ 𝛽𝑗 = 0 𝐻𝐴 ∶ 𝛽𝑗 ≠ 0 .
Remark. As with multiple linear regression, each hypothesis test is conducted with each of the other
variables remaining in the model. Hence, the null hypothesis 𝐻0 ∶ 𝛽𝑗 = 0 is actually:
The test statistic construction follows the same basic setup as linear regression using the theoretical
approach. The difference is that we only know the asymptotic distribution of the statistic constructed
this way.
Remark. For inference in the logistic regression model we will only use the asymptotic approach. The
infer package does not support the simulation-based approach for logistic regression. Therefore, one
must utilize the general setup provided by the tidymodels package to employ this approach.
Definition B.1. Let $\hat{\beta}_j(\mathbf{Y}, \mathbf{x})$, $j \in \{1, \dots, k\}$, be the maximum likelihood estimator of the population parameter $\beta_j$ in a logistic regression model and $\mathrm{SE}_{\hat{\beta}_j}(\mathbf{Y}, \mathbf{x})$ the corresponding standard error. Then the test statistic
$$
Z_j(\mathbf{Y}, \mathbf{x}) = \frac{\hat{\beta}_j(\mathbf{Y}, \mathbf{x}) - \beta_{j,0}}{\mathrm{SE}_{\hat{\beta}_j}(\mathbf{Y}, \mathbf{x})}\,,
$$
which has approximately (for large n) a standard normal distribution, is used for testing
𝐻0 ∶ 𝛽𝑗 = 𝛽𝑗,0 𝐻𝐴 ∶ 𝛽𝑗 ≠ 𝛽𝑗,0 ,
Remark. The only tricky bit, which is beyond the scope of this course, is how the standard error $\mathrm{SE}_{\hat{\beta}_j}(\mathbf{Y}, \mathbf{x})$ is calculated.
𝐻0 ∶ 𝛽𝑗 = 0 𝐻𝐴 ∶ 𝛽𝑗 ≠ 0 ,
given all the other variables in the model, i.e., 𝛽𝑗,0 is assumed to be zero.
In this case the null hypothesis would be rejected at the $\alpha$ significance level, if
$$
|z_j(\mathbf{y}, \mathbf{x})| > z_{1-\alpha/2}\,,
$$
where $z_j(\mathbf{y}, \mathbf{x})$ is the observed value of the test statistic $Z_j(\mathbf{Y}, \mathbf{x})$ and $z_{1-\alpha/2}$ is the $1-\alpha/2$ quantile of the standard normal distribution.
Let’s consider the impact of owning a bird on the likelihood of developing lung cancer. Under the
null hypothesis, we assume that keeping a bird does not affect the likelihood. In other words, the test
problem is as follows:
𝐻0 ∶ 𝛽 1 = 0 𝐻𝐴 ∶ 𝛽1 ≠ 0
tidy(model_step)
# # A tibble: 3 x 5
# term estimate std.error statistic p.value
# <chr> <dbl> <dbl> <dbl> <dbl>
# 1 (Intercept) -3.18 0.636 -5.00 0.000000582
# 2 BKBird 1.48 0.396 3.73 0.000194
# 3 YR 0.0582 0.0168 3.46 0.000544
The null hypothesis can be rejected at the 5% significance level, since the observed test statistic for BKBird is 3.73, which exceeds the critical value $z_{0.975} \approx 1.96$ (equivalently, the p-value 0.000194 is far below 0.05).
Using the z-statistic $\frac{\hat{\beta}_j(\mathbf{Y},\mathbf{x}) - \beta_j}{\mathrm{SE}_{\hat{\beta}_j}(\mathbf{Y},\mathbf{x})}$ and applying the idea for constructing confidence intervals presented in Section 12.1 yields the $100(1-\alpha)\%$ asymptotic confidence interval for the slope parameter $\beta_j$:
$$
\hat{\beta}_j(\mathbf{Y}, \mathbf{x}) \pm z_{1-\alpha/2} \cdot \mathrm{SE}_{\hat{\beta}_j}(\mathbf{Y}, \mathbf{x})\,,
$$
tidy(model_step)
# # A tibble: 3 x 5
# term estimate std.error statistic p.value
# <chr> <dbl> <dbl> <dbl> <dbl>
# 1 (Intercept) -3.18 0.636 -5.00 0.000000582
# 2 BKBird 1.48 0.396 3.73 0.000194
# 3 YR 0.0582 0.0168 3.46 0.000544
qnorm(0.975)
# [1] 1.959964
Remember, the odds ratio for a one unit change of BK is equal to e𝛽1 . Hence, an asymptotic 95%
confidence interval for the odds ratio is given by
confint.default(model_step)
# 2.5 % 97.5 %
# (Intercept) -4.42747020 -1.93284067
# BKBird 0.69964830 2.25145472
# YR 0.02523359 0.09126585
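The output above is on the log-odds scale; exponentiating its limits (not shown in the extract) gives the asymptotic interval for the odds ratio itself, roughly (2.0, 9.5) for bird keeping:

exp(confint.default(model_step))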
The relative risk of developing lung cancer for individuals who own birds and smoke for 10 or 20
years is presented in the following output.
library(effectsize)
# -------------------------------------
# (p0) | 0.07 |
# BK [Bird] | 3.54 | [1.91, 6.07]
# YR | 1.06 | [1.03, 1.09]
Remember, logistic regression assumes a linear relationship between the logit and the numeric predictor variables:
$$
\log\!\left(\frac{p(\mathbf{x}_i)}{1 - p(\mathbf{x}_i)}\right) = \beta_0 + \sum_{j=1}^{k} \beta_j\, x_{j,i}\,.
$$
To verify this assumption, we plot each numeric predictor variable against the estimated logit $\log\!\left(\frac{\hat{p}(\mathbf{x}_i)}{1 - \hat{p}(\mathbf{x}_i)}\right)$.
Since model_step has only one numeric predictor, let’s fit a model where we add the other two
numeric predictor variables AG and CD for illustration.
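A sketch of such a model (the object name model_lin is illustrative):

model_lin <- glm(LC ~ BK + YR + AG + CD, data = case2002, family = binomial)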
Now we can use the augment() function to add the fitted probabilities 𝑝𝑖̂ and the logit values to the dataset.
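A sketch of this step; it assumes the hypothetical model_lin from above and computes the logit from the fitted probabilities:

library(broom)
case2002_fit <- augment(model_lin, type.predict = "response") |>
  dplyr::mutate(logit = log(.fitted / (1 - .fitted)))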
We generate a scatterplot for each numeric predictor against the logit. Additionally, we use
geom_smooth() to add a non-parametric fit to the point cloud. If the fitted function appears to be
linear, we can infer that the linearity assumption may be satisfied.
# age
p_AG <- ggplot(case2002_fit, aes(x = AG, y = logit)) +
geom_point() + geom_smooth()
# years smoking
p_YR <- ggplot(case2002_fit, aes(x = YR, y = logit)) +
geom_point() + geom_smooth()
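The plot for CD and the combined display in Figure B.2 are not shown in the extract; a sketch using the patchwork syntax seen earlier:

# cigarettes per day
p_CD <- ggplot(case2002_fit, aes(x = CD, y = logit)) +
  geom_point() + geom_smooth()
p_AG + p_CD + p_YR + plot_layout(ncol = 2)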
Figure B.2 [scatterplots of the estimated logit against AG, CD, and YR, each with a non-parametric smooth]
It appears that the assumption only applies to YR. For AG, the relationship may also be linear, but
with a slope of approximately zero. The relationship between the logit and CD is evidently non-linear,
likely due to the non-smokers smoking zero cigarettes per day. For smokers who smoke 5 or more
cigarettes per day, the relationship appears to be linear, but with a rather small slope. Therefore, the
two plots once again confirm that a model without AG and CD makes more sense.
Remark.
1. The linearity assumption is obviously met for all categorical predictors. For categorical predic-
tors, the assumption states that the logit will have different mean values for the different levels
of the predictors, which will be the case.
C Technical points
This section presents various technical points raised in previous chapters. The purpose of this section
is to provide proofs for each of these points for readers who are interested. Please note that the content
provided here is not considered part of the lecture material.
Claim
\begin{align*}
\mathrm{E}\big[(Y^* - \hat{f}(x^*))^2\big]
 &= \mathrm{E}\Big[\big(Y^* - f_0(x^*) + f_0(x^*) - \mathrm{E}[\hat{f}(x^*)] + \mathrm{E}[\hat{f}(x^*)] - \hat{f}(x^*)\big)^2\Big] \\
 &= \mathrm{E}\Big[\big(Y^* - f_0(x^*)\big)^2
    + 2\big(Y^* - f_0(x^*)\big)\big(f_0(x^*) - \mathrm{E}[\hat{f}(x^*)] + \mathrm{E}[\hat{f}(x^*)] - \hat{f}(x^*)\big) \\
 &\qquad\quad + \big(f_0(x^*) - \mathrm{E}[\hat{f}(x^*)] + \mathrm{E}[\hat{f}(x^*)] - \hat{f}(x^*)\big)^2\Big] \\
 &= \mathrm{E}[\epsilon_0^2]
    + 2\,\mathrm{E}\Big[\epsilon_0\big(f_0(x^*) - \mathrm{E}[\hat{f}(x^*)] + \mathrm{E}[\hat{f}(x^*)] - \hat{f}(x^*)\big)\Big] \\
 &\qquad\quad + \mathrm{E}\Big[\big(f_0(x^*) - \mathrm{E}[\hat{f}(x^*)] + \mathrm{E}[\hat{f}(x^*)] - \hat{f}(x^*)\big)^2\Big].
\end{align*}

Now observe that $\epsilon_0$ is independent from all other random quantities and has expectation zero, i.e., $\mathrm{E}[\epsilon_0] = 0$. Hence,

\begin{align*}
\mathrm{E}\big[(Y^* - \hat{f}(x^*))^2\big]
 &= \mathrm{E}[\epsilon_0^2]
    + 2\,\mathrm{E}[\epsilon_0]\,\mathrm{E}\big[f_0(x^*) - \mathrm{E}[\hat{f}(x^*)] + \mathrm{E}[\hat{f}(x^*)] - \hat{f}(x^*)\big] \\
 &\qquad\quad + \mathrm{E}\Big[\big(f_0(x^*) - \mathrm{E}[\hat{f}(x^*)]\big)^2
    + 2\big(f_0(x^*) - \mathrm{E}[\hat{f}(x^*)]\big)\big(\mathrm{E}[\hat{f}(x^*)] - \hat{f}(x^*)\big)
    + \big(\mathrm{E}[\hat{f}(x^*)] - \hat{f}(x^*)\big)^2\Big] \\
 &= \mathrm{E}\Big[\big(f_0(x^*) - \mathrm{E}[\hat{f}(x^*)]\big)^2
    + 2\big(f_0(x^*) - \mathrm{E}[\hat{f}(x^*)]\big)\big(\mathrm{E}[\hat{f}(x^*)] - \hat{f}(x^*)\big)
    + \big(\mathrm{E}[\hat{f}(x^*)] - \hat{f}(x^*)\big)^2\Big] + \mathrm{E}[\epsilon_0^2]\,.
\end{align*}

Since $f_0(x^*) - \mathrm{E}[\hat{f}(x^*)]$ is deterministic it can be pulled out of the expectation, and the middle term below vanishes because $\mathrm{E}\big[\mathrm{E}[\hat{f}(x^*)] - \hat{f}(x^*)\big] = 0$:

\begin{align*}
\mathrm{E}\big[(Y^* - \hat{f}(x^*))^2\big]
 &= \mathrm{E}\Big[\big(f_0(x^*) - \mathrm{E}[\hat{f}(x^*)]\big)^2\Big]
    + 2\big(f_0(x^*) - \mathrm{E}[\hat{f}(x^*)]\big)\,\mathrm{E}\big[\mathrm{E}[\hat{f}(x^*)] - \hat{f}(x^*)\big] \\
 &\qquad\quad + \mathrm{E}\Big[\big(\mathrm{E}[\hat{f}(x^*)] - \hat{f}(x^*)\big)^2\Big] + \mathrm{Var}[\epsilon_0] \\
 &= \mathrm{E}\Big[\big(\mathrm{E}[\hat{f}(x^*)] - f_0(x^*)\big)^2\Big]
    + \mathrm{E}\Big[\big(\hat{f}(x^*) - \mathrm{E}[\hat{f}(x^*)]\big)^2\Big] + \mathrm{Var}[\epsilon_0] \\
 &= \mathrm{E}\Big[\big(\mathrm{Bias}(\hat{f}(x^*))\big)^2\Big] + \mathrm{Var}[\hat{f}(x^*)] + \mathrm{Var}[\epsilon_0] \\
 &= \mathrm{Var}[\hat{f}(x^*)] + \big[\mathrm{Bias}(\hat{f}(x^*))\big]^2 + \mathrm{Var}[\epsilon_0]\,.
\end{align*}
The least squares estimates in a simple linear regression model are given by
\begin{align*}
\hat{\beta}_{1,n} &= \frac{\sum_{i=1}^{n}(x_i - \bar{x}_n)(y_i - \bar{y}_n)}{\sum_{i=1}^{n}(x_i - \bar{x}_n)^2}\,, \\
\hat{\beta}_{0,n} &= \bar{y}_n - \hat{\beta}_{1,n}\,\bar{x}_n\,.
\end{align*}
Claim
These estimates can be computed using the general formula $\hat{\beta} = (\mathbf{X}^\top \mathbf{X})^{-1}\mathbf{X}^\top \mathbf{y}$ for least squares estimates in linear regression models.
\begin{align*}
\hat{\beta} &= (\mathbf{X}^\top \mathbf{X})^{-1}\mathbf{X}^\top \mathbf{y} \\
 &= \left(\begin{pmatrix} 1 & \dots & 1 \\ x_1 & \dots & x_n \end{pmatrix}
    \begin{pmatrix} 1 & x_1 \\ \vdots & \vdots \\ 1 & x_n \end{pmatrix}\right)^{-1}
    \begin{pmatrix} 1 & \dots & 1 \\ x_1 & \dots & x_n \end{pmatrix} \mathbf{y} \\
 &= \begin{pmatrix} n & \sum_{i=1}^{n} x_i \\ \sum_{i=1}^{n} x_i & \sum_{i=1}^{n} x_i^2 \end{pmatrix}^{-1}
    \begin{pmatrix} \sum_{i=1}^{n} y_i \\ \sum_{i=1}^{n} x_i y_i \end{pmatrix} \\
 &= \frac{1}{n\sum_{i=1}^{n} x_i^2 - \left(\sum_{i=1}^{n} x_i\right)^2}
    \begin{pmatrix} \sum_{i=1}^{n} x_i^2 & -\sum_{i=1}^{n} x_i \\ -\sum_{i=1}^{n} x_i & n \end{pmatrix}
    \begin{pmatrix} \sum_{i=1}^{n} y_i \\ \sum_{i=1}^{n} x_i y_i \end{pmatrix}.
\end{align*}

Next, note that

\begin{align*}
(n-1)s_{x,n}^2 &= \sum_{i=1}^{n}(x_i - \bar{x}_n)^2 = \sum_{i=1}^{n}\left(x_i^2 - 2x_i\bar{x}_n + \bar{x}_n^2\right) \\
 &= \sum_{i=1}^{n} x_i^2 - 2\bar{x}_n\sum_{i=1}^{n} x_i + \sum_{i=1}^{n}\bar{x}_n^2
  = \sum_{i=1}^{n} x_i^2 - n\,\bar{x}_n^2
  = \sum_{i=1}^{n} x_i^2 - \frac{1}{n}\left(\sum_{i=1}^{n} x_i\right)^2,
\end{align*}

and

\begin{align*}
(n-1)s_{xy,n} &= \sum_{i=1}^{n}(x_i - \bar{x}_n)(y_i - \bar{y}_n)
  = \sum_{i=1}^{n}\left(x_i y_i - x_i\bar{y}_n - \bar{x}_n y_i + \bar{x}_n\bar{y}_n\right) \\
 &= \sum_{i=1}^{n} x_i y_i - \bar{y}_n\sum_{i=1}^{n} x_i - \bar{x}_n\sum_{i=1}^{n} y_i + n\bar{x}_n\bar{y}_n \\
 &= \sum_{i=1}^{n} x_i y_i - n\bar{y}_n\bar{x}_n - n\bar{x}_n\bar{y}_n + n\bar{x}_n\bar{y}_n
  = \sum_{i=1}^{n} x_i y_i - n\bar{x}_n\bar{y}_n\,.
\end{align*}

Plugging these expressions into the general formula yields

\begin{align*}
\hat{\beta} &= (\mathbf{X}^\top \mathbf{X})^{-1}\mathbf{X}^\top \mathbf{y} \\
 &= \frac{1}{n\sum_{i=1}^{n}(x_i - \bar{x}_n)^2}
    \begin{pmatrix}
      \sum_{i=1}^{n} x_i^2 \sum_{i=1}^{n} y_i - \sum_{i=1}^{n} x_i \sum_{i=1}^{n} x_i y_i \\
      n\sum_{i=1}^{n} x_i y_i - \sum_{i=1}^{n} x_i \sum_{i=1}^{n} y_i
    \end{pmatrix} \\
 &= \frac{1}{n\sum_{i=1}^{n}(x_i - \bar{x}_n)^2}
    \begin{pmatrix}
      \sum_{i=1}^{n} x_i^2 \sum_{i=1}^{n} y_i - \sum_{i=1}^{n} x_i \sum_{i=1}^{n} x_i y_i \\
      n\sum_{i=1}^{n}(x_i - \bar{x}_n)(y_i - \bar{y}_n)
    \end{pmatrix} \\
 &= \frac{1}{n\sum_{i=1}^{n}(x_i - \bar{x}_n)^2}
    \begin{pmatrix}
      \sum_{i=1}^{n} x_i^2\, n\bar{y}_n - \left(n\sum_{i=1}^{n} x_i y_i - n^2\bar{x}_n\bar{y}_n\right)\bar{x}_n - n^2\bar{x}_n^2\bar{y}_n \\
      n\sum_{i=1}^{n}(x_i - \bar{x}_n)(y_i - \bar{y}_n)
    \end{pmatrix} \\
 &= \frac{1}{n\sum_{i=1}^{n}(x_i - \bar{x}_n)^2}
    \begin{pmatrix}
      n\left(\sum_{i=1}^{n} x_i^2 - n\bar{x}_n^2\right)\bar{y}_n - \left(n\sum_{i=1}^{n}(x_i - \bar{x}_n)(y_i - \bar{y}_n)\right)\bar{x}_n \\
      n\sum_{i=1}^{n}(x_i - \bar{x}_n)(y_i - \bar{y}_n)
    \end{pmatrix} \\
 &= \begin{pmatrix}
      \bar{y}_n - \dfrac{\sum_{i=1}^{n}(x_i - \bar{x}_n)(y_i - \bar{y}_n)}{\sum_{i=1}^{n}(x_i - \bar{x}_n)^2}\,\bar{x}_n \\[2ex]
      \dfrac{\sum_{i=1}^{n}(x_i - \bar{x}_n)(y_i - \bar{y}_n)}{\sum_{i=1}^{n}(x_i - \bar{x}_n)^2}
    \end{pmatrix}.
\end{align*}
Claim
Let $(X_1, X_2)^\top \sim \mathrm{N}_2(\mu, \Sigma)$ with $\mu = (\mu_1, \mu_2)^\top \in \mathbb{R}^2$ and
$\Sigma = \begin{pmatrix} \sigma_1^2 & \rho\sigma_1\sigma_2 \\ \rho\sigma_1\sigma_2 & \sigma_2^2 \end{pmatrix}$,
where $\sigma_1, \sigma_2 > 0$ and $\rho \in (-1, 1)$. Then it holds that the marginal distributions of $X_1$ and $X_2$ are also normal, e.g. the density of $X_1$ is given by
$$
f_1(x_1) = \frac{1}{\sqrt{2\pi\sigma_1^2}}\, \mathrm{e}^{-\frac{(x_1-\mu_1)^2}{2\sigma_1^2}}\,,
$$
and furthermore
$$
\mathrm{E}[X_1 \cdot X_2] = \rho\sigma_1\sigma_2 + \mu_1\mu_2\,.
$$
Proof. We start by computing the density of 𝑋1 . The result for 𝑋2 is analogous. We use the following
form of the joint density
$$
f(\mathbf{x}) = f(x_1, x_2) = \frac{1}{\sqrt{(2\pi)^2\,\sigma_1^2\sigma_2^2\,(1-\rho^2)}}\;
\mathrm{e}^{-\frac{1}{2(1-\rho^2)}\left[\frac{(x_1-\mu_1)^2}{\sigma_1^2}
 - 2\rho\,\frac{(x_1-\mu_1)(x_2-\mu_2)}{\sigma_1\sigma_2}
 + \frac{(x_2-\mu_2)^2}{\sigma_2^2}\right]}.
$$
To compute E[𝑋1 ⋅ 𝑋2 ] we use again the above given form of the joint density 𝑓. It holds that
\begin{align*}
\mathrm{E}[X_1 \cdot X_2]
 &= \int_{\mathbb{R}} x_1 \cdot \frac{1}{\sqrt{2\pi\sigma_1^2}}\,
    \mathrm{e}^{-\frac{1}{2(1-\rho^2)}\left[(1-\rho^2)\,\frac{(x_1-\mu_1)^2}{\sigma_1^2}\right]}
    \cdot \left(\rho\,(x_1-\mu_1)\,\frac{\sigma_2}{\sigma_1} + \mu_2\right) \mathrm{d}x_1 \\
 &= \rho\,\frac{\sigma_2}{\sigma_1} \int_{\mathbb{R}} \left(x_1^2 - \mu_1 x_1\right)
    \frac{1}{\sqrt{2\pi\sigma_1^2}}\, \mathrm{e}^{-\frac{(x_1-\mu_1)^2}{2\sigma_1^2}}\, \mathrm{d}x_1
  + \mu_2 \int_{\mathbb{R}} x_1\,
    \frac{1}{\sqrt{2\pi\sigma_1^2}}\, \mathrm{e}^{-\frac{(x_1-\mu_1)^2}{2\sigma_1^2}}\, \mathrm{d}x_1 \\
 &= \rho\,\frac{\sigma_2}{\sigma_1}\left(\mathrm{E}[X_1^2] - \mu_1\,\mathrm{E}[X_1]\right) + \mu_2\,\mathrm{E}[X_1]
  = \rho\,\frac{\sigma_2}{\sigma_1}\,\sigma_1^2 + \mu_1\mu_2
  = \rho\,\sigma_1\sigma_2 + \mu_1\mu_2\,.
\end{align*}
Index
power, 297
significance level, 289
test problem, 289
test statistic, 289
type 1 error, 289
type 2 error, 289
Stepwise selection
backward, 213
forward, 212
Subset selection
best, 210
hybrid, 214
Success-failure condition, 281
Supervised learning, 152
t-test, 294
two sample, 320
Variables
block, 17
categorical, 4
nominal, 4
ordinal, 4
explanatory, 6
numerical, 4
response, 6
treatment, 17