
Introduction to Data Science and Statistical Thinking

Mathias Drton and Stephan Haug


Table of contents

Preface 9

I Introduction 10

1 Intro to data 1
1.1 Data basics . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3
Types of variables . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 4
Relationships among variables . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 5
Associated vs. independent . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 5
Explanatory and response variables . . . . . . . . . . . . . . . . . . . . . . . . . . . . 5
1.2 Populations and samples . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 6
Anecdotal evidence and early smoking research . . . . . . . . . . . . . . . . . . . . . 7
Census . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 7
From exploratory analysis to inference . . . . . . . . . . . . . . . . . . . . . . . . . . 7
Sampling bias . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 8
1.3 Observational studies and experiments . . . . . . . . . . . . . . . . . . . . . . . . . . 10
1.4 Obtaining good samples - Sampling principles and strategies . . . . . . . . . . . . . . 13
1.4.1 Simple random sampling . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 13
1.4.2 Stratified sampling . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 14
1.4.3 Cluster sampling . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 14
1.4.4 Multistage sampling . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 15
1.5 More on experiments . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 16

II R 21

2 Data and visualization 22


2.1 What is in a dataset? . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 22
2.2 Exploratory data analysis . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 24
2.3 Why do we visualize? . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 25
2.4 ggplot2 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 28
2.4.1 Aesthetics . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 38
2.4.2 Mapping vs. setting . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 42
2.4.3 Faceting . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 44


2.4.4 geoms . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 49
2.4.5 Themes . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 50

3 Grammar of data wrangling 53


3.1 Data I/O . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 54
3.2 Select - extracting columns . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 55
3.3 Arrange . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 56
3.4 Pipes . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 57
3.4.1 What is a pipe? . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 57
3.4.2 How does a pipe work? . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 57
3.4.3 Aside . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 58
3.4.4 A note on piping and layering . . . . . . . . . . . . . . . . . . . . . . . . . . . 59
3.5 More on select() and arrange() . . . . . . . . . . . . . . . . . . . . . . . . . . . . 60
3.6 Filter rows . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 63
3.7 Deriving information with summarize() . . . . . . . . . . . . . . . . . . . . . . . . . 66
3.7.1 Summarizing by groups . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 67
3.8 Adding or changing variables with mutate() . . . . . . . . . . . . . . . . . . . . . . . 68
3.9 Tidyverse style guide . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 69

III Explore Data 71

4 Exploring categorical data 72


4.1 Data analysis example . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 73
4.2 Frequency distribution . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 74
4.3 Bar chart . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 75
4.3.1 Computed variables . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 75
4.4 Frequency distribution for two variables . . . . . . . . . . . . . . . . . . . . . . . . . 77
4.4.1 Computing contingency tables . . . . . . . . . . . . . . . . . . . . . . . . . . 77
4.5 Bar charts with two variables . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 79
4.6 Visualize the joint frequency distribution . . . . . . . . . . . . . . . . . . . . . . . . . 82

5 Exploring numerical data 86


5.1 Dot plots and the mean . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 87
5.1.1 Summarize . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 88
5.1.2 Group means . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 89
5.2 Histograms and shape . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 90
5.2.1 Modality . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 92
5.3 Variance and standard deviation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 93
5.4 Box plots, quartiles, and the median . . . . . . . . . . . . . . . . . . . . . . . . . . . . 94
5.4.1 Median . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 95
5.4.2 Interquartile range, whiskers and outliers . . . . . . . . . . . . . . . . . . . . 96
5.4.3 Comparing numerical data across groups . . . . . . . . . . . . . . . . . . . . 97


5.5 Robust statistics . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 100


5.6 Exploring paired numerical data . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 103
5.6.1 Correlation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 103

IV Probability 107

6 Case study: Gender discrimination 108


6.1 A trial as a hypothesis test . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 110
6.2 Simulate different outcomes by permutations . . . . . . . . . . . . . . . . . . . . . . . 112

7 Probability 117
7.1 Defining probability . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 117
7.1.1 Sample space . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 118
7.1.2 Events and complements . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 118
7.1.3 Rules of probability . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 119
7.2 Law of large numbers . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 120
7.3 Addition rule . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 121
7.4 Probability distributions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 122
7.5 Independence . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 123
7.5.1 Disjoint, complementary, independent . . . . . . . . . . . . . . . . . . . . . . 124
7.6 Conditional probability . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 125
7.6.1 General multiplication rule . . . . . . . . . . . . . . . . . . . . . . . . . . . . 126
7.6.2 Independence and conditional probabilities . . . . . . . . . . . . . . . . . . . 127
7.6.3 Case study: breast cancer screening . . . . . . . . . . . . . . . . . . . . . . . . 128
7.6.4 Bayes’ Theorem . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 129
7.7 Random variables . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 131
7.7.1 Expected value . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 131
7.7.2 Variability . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 132
7.7.3 Linear combinations . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 133
7.8 Continuous distributions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 135
7.8.1 Expected value and variability . . . . . . . . . . . . . . . . . . . . . . . . . . . 138
7.8.2 The normal distributions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 140

V Predictive modeling 151

8 Statistical learning 152


8.1 Learning problems . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 152
8.2 Modeling noisy relationships . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 154
8.2.1 Linear prediction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 154
8.2.2 Noise . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 156
8.2.3 Reducible and irreducible error . . . . . . . . . . . . . . . . . . . . . . . . . . 156


8.3 Parametric methods for statistical learning . . . . . . . . . . . . . . . . . . . . . . . . 157


8.3.1 Least squares for a linear model . . . . . . . . . . . . . . . . . . . . . . . . . . 157
8.3.2 Non-linearity with the help of transformations . . . . . . . . . . . . . . . . . 158
8.4 Non-parametric methods for statistical learning . . . . . . . . . . . . . . . . . . . . . 158
8.5 Assessing model accuracy . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 160
8.6 Measuring the quality of fit . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 161
8.6.1 Training and test data . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 161
8.6.2 What contributes to MSE? . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 163
8.7 The formal bias-variance trade-off . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 163
8.8 Regression versus classification problems . . . . . . . . . . . . . . . . . . . . . . . . . 165
8.9 Classification: Nearest neighbors . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 165
8.10 Selecting a learning method via a validation set . . . . . . . . . . . . . . . . . . . . . 166
8.10.1 Cross-validation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 166
8.11 Supervised versus unsupervised learning . . . . . . . . . . . . . . . . . . . . . . . . . 168

9 Linear regression 170


9.1 Simple linear regression . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 170
9.1.1 Least squares fit . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 172
9.1.2 Quantifying the relationship . . . . . . . . . . . . . . . . . . . . . . . . . . . . 180
9.1.3 Extrapolation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 181
9.1.4 Assessing the accuracy of the model . . . . . . . . . . . . . . . . . . . . . . . 183
9.2 Multiple linear regression . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 187
9.2.1 Estimating the regression parameters . . . . . . . . . . . . . . . . . . . . . . . 189
9.2.2 Another look at 𝑅2 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 191
9.2.3 Collinearity between predictors . . . . . . . . . . . . . . . . . . . . . . . . . . 194
9.2.4 Categorical predictor variables . . . . . . . . . . . . . . . . . . . . . . . . . . 196
9.2.5 Cross validation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 201
9.2.6 Subset selection . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 209
9.2.7 Stepwise selection . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 211

10 Logistic regression 225


10.1 EDA of the email dataset . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 225
10.2 The logistic regression model . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 227
10.2.1 Odds and odds-ratio . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 230
10.2.2 Estimation approach in logistic regression . . . . . . . . . . . . . . . . . . . . 231
10.2.3 Fitting the model in R . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 232
10.3 Relative risk . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 236
10.4 Assessing the accuracy of the predictions . . . . . . . . . . . . . . . . . . . . . . . . . 238
10.4.1 Sensitivity and specificity . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 240
10.5 Classification . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 241
10.5.1 Picking a threshold . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 242
10.5.2 Consequences of picking a threshold . . . . . . . . . . . . . . . . . . . . . . . 244
10.5.3 Trying other thresholds . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 245


10.5.4 ROC curve . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 246


10.5.5 Comparing models . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 248
10.6 Cross validation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 251

VI Inference 256

11 Foundations of Inference 257


11.1 Intro . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 257
11.1.1 Modeling data . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 257
11.2 Point estimation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 258
11.3 Sampling distribution . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 260
11.3.1 Theoretical approach . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 267
11.3.2 Asymptotic approach . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 268
11.3.3 Bootstrap approach . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 269

12 Confidence intervals 276


MLE . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 276
Confidence intervals . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 277
12.1 Theoretical approach . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 278
12.2 Asymptotic approach . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 280
12.3 Bootstrap approach . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 282
12.3.1 Percentile method . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 282
12.3.2 Standard error method . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 283
12.3.3 infer workflow . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 284

13 Hypothesis testing 288


13.1 Statistical test . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 289
13.1.1 p-value . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 290
13.1.2 Two-sided vs. one-sided alternative . . . . . . . . . . . . . . . . . . . . . . . . 291
13.2 Null distribution . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 292
13.3 Theoretical approach . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 293
13.3.1 Power of a test . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 297
13.4 Asymptotic approach . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 299
13.4.1 Chi-squared test of independence . . . . . . . . . . . . . . . . . . . . . . . . . 301
13.5 Simulation-based approach . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 307
13.5.1 Bootstrap method . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 311
13.6 Choosing a significance level . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 312

14 Inference for linear regression 314


14.1 Testing the slope parameters . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 314
14.1.1 Theoretical approach . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 315
14.1.2 Simulation-based approach . . . . . . . . . . . . . . . . . . . . . . . . . . . . 321


14.2 Residual analysis . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 323


14.3 Options for improving the model fit . . . . . . . . . . . . . . . . . . . . . . . . . . . . 333

References 336

Appendices 337

A Some probability distributions 337


A.1 R and probability distributions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 337
A.2 Discrete distributions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 337
A.3 Continuous distributions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 341

B Inference for logistic regression 353


B.1 Testing the slope parameters . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 358
B.2 Checking model conditions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 363

C Technical points 366


C.1 Technical points from Chapter 8 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 366
C.2 Technical points from Chapter 9 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 368
C.3 Technical points from Section A.3 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 369

Preface

These lecture notes accompany a course at TUM with the same title. The content draws on many different sources. We are happy to mention:

• Data Science in a Box


• Introduction to Modern Statistics
• An Introduction to Statistical Learning
• Statistical Inference via Data Science
• R for Data Science

The notes contain, in some parts, derivatives of the sources mentioned above. None of these changes
were approved by the authors of the original resources.
The content may be copied, edited, and shared via the CC BY-NC-SA license.

Current version: April 2025

Part I

Introduction

1 Intro to data

Case study: treating chronic fatigue syndrome 1

Objective: Evaluate the effectiveness of cognitive-behavior therapy for chronic fatigue syndrome.
Participant pool: 142 patients who were recruited from referrals by primary care physicians and
consultants to a hospital clinic specializing in chronic fatigue syndrome.
Actual participants: Only 60 of the 142 referred patients entered the study. Some were excluded
because they didn’t meet the diagnostic criteria, some had other health issues, and some refused to
be a part of the study.

Study design
Patients randomly assigned to treatment and control groups, 30 patients in each group:
Treatment: Cognitive behavior therapy - collaborative, educative, and with a behavioral emphasis. Patients were shown how activity could be increased steadily and safely without exacerbating symptoms.
Control: No advice was given about how activity could be increased. Instead, progressive muscle
relaxation, visualization, and rapid relaxation skills were taught.

Results
The table below shows the distribution of the outcomes at the six-month follow-up 2.

                 Good Outcome
  Group          No    Yes    Sum
  Control        21      5     26
  Treatment       8     19     27
  Sum            29     24     53

1 Deale et al. (1997)
2 Note: Seven patients dropped out of the study (3 from the treatment and 4 from the control group)


Proportion with good outcomes in treatment group: 19/27 ≈ 0.70
Proportion with good outcomes in control group: 5/26 ≈ 0.19
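For reference, a minimal R sketch of this arithmetic (the counts are typed in by hand from the table above):

# counts from the six-month follow-up table
treatment_good <- 19; treatment_total <- 27
control_good   <- 5;  control_total   <- 26

treatment_good / treatment_total                                    # ~ 0.70
control_good / control_total                                        # ~ 0.19
treatment_good / treatment_total - control_good / control_total    # observed difference ~ 0.51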

Understanding the results

Question

Do the data show a “real” difference between the groups?

Random variation

Suppose you flip a coin 100 times.

coin_flips <- sample(
  c("heads", "tails"),
  100,
  replace = TRUE)

While the chance a coin lands heads in any given coin flip is 50%, we probably won’t observe
exactly 50 heads.

table(coin_flips)
# coin_flips
# heads tails
# 44 56

This type of fluctuation is part of almost any type of data-generating process.

Back to whether the data shows a “real” difference between the groups.
The observed difference between the two groups (0.70 - 0.19 = 0.51) may be real or due to natural
variation.
Since the difference is quite large, it is more believable that the difference is real.
Conclusion: We need statistical tools to determine quantitatively if the difference is so large that we
should reject the notion that it was due to chance.


Generalizing the results

Question

Are the results of this study generalizable to all patients with chronic fatigue syndrome?

These patients were recruited from referrals to a hospital clinic specializing in chronic fatigue syndrome and volunteered to participate in this study. Therefore, they may not be representative of all patients with chronic fatigue syndrome.
While we cannot immediately generalize the results to all patients, this first study is encouraging.
The method works for patients with some narrow set of characteristics, which gives hope that it will
work, at least to some degree, with other patients.

1.1 Data basics


Classroom survey

A survey was conducted on students in an introductory statistics course. Below are a few of the survey questions and the variables in which the responses were stored:

• gender: What is your gender?


• intro_extro: Are you more introverted or rather extroverted?
• sleep: How many hours do you sleep at night, on average?
• bedtime: What time do you usually go to bed?
• countries: How many countries have you visited?
• dread: On a scale of 1-5, how much do you dread being here?

The collected data are summarized in the tibble classroom_survey:

classroom_survey
# # A tibble: 86 x 6
# gender intro_extro sleep bedtime countries dread
# <chr> <chr> <int> <chr> <int> <int>
# 1 male introvert 8 9-11 10 1
# 2 female extrovert 5 12-2 1 1
# 3 male introvert 5 12-2 1 5
# 4 male extrovert 8 9-11 12 1
# # i 82 more rows


Types of variables

The type of a variable is one of the following:

• Numerical: Variable can take a wide range of numerical values, and it is sensible to add,
subtract, or take averages with those values.

• Categorical: Variable has a finite number of values, which are categories (called levels),
and it is not sensible to add, subtract, or take averages with those values.

Categorical variables can be further distinguished as

• ordinal: the levels of the variable have a natural ordering, or


• nominal: the levels of the variable don’t have a natural ordering.

Example 1.1.

head(classroom_survey)
# # A tibble: 6 x 6
# gender intro_extro sleep bedtime countries dread
# <chr> <chr> <int> <chr> <int> <int>
# 1 male introvert 8 9-11 10 1
# 2 female extrovert 5 12-2 1 1
# 3 male introvert 5 12-2 1 5
# 4 male extrovert 8 9-11 12 1
# 5 female introvert 6 9-11 4 3
# 6 male introvert 8 9-11 5 3

• gender: categorical, nominal


• sleep: numerical
• bedtime: categorical, ordinal
• countries: numerical
• dread: categorical, ordinal - could also be used as numerical
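Since bedtime is ordinal, it can be stored as an ordered factor in R. A minimal sketch, assuming the classroom_survey tibble from above and using only the levels visible in the printout (the full level set may differ):

library(dplyr)

classroom_survey <- classroom_survey |>
  mutate(bedtime = factor(bedtime,
                          levels = c("9-11", "12-2"),  # assumed ordering of the observed levels
                          ordered = TRUE))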

Your turn

What type of variable is a telephone area code?


A numerical
B nominal
C ordinal


Relationships among variables

Does there appear to be a relationship between GPA and number of hours students study per week?

(Figure: scatterplot of grade point average (gpa, ≤ 4.0) versus study hours per week.)

Can you spot anything unusual about any of the data points?

Associated vs. independent

When two variables show some connection with one another, they are called associated or depen-
dent variables.
Conclusion: If two variables are not associated, i.e., there is no evident connection between them,
they are said to be independent.

Explanatory and response variables

When we look at the relationship between two variables, we often want to analyze whether a change
in one variable causes a change in the other.
If study time is increased, will this lead to an improvement in GPA?
We are asking whether study time affects GPA. If this is our underlying belief, then study time is the explanatory variable and GPA is the response variable in the hypothesized relationship.


Definition 1.1. When we suspect one variable might causally affect another, we label the first variable the explanatory variable and the second the response variable.

Remember

Labeling variables as explanatory and response does not guarantee the relationship between
the two is causal, even if an association is identified between the two variables.

1.2 Populations and samples

New York Times article: Finding Your Ideal Running Form

Research question: Can people become better, more efficient runners on their own,
merely by running?

The population of interest is all people.

The sample considered in the study was a group of adult women who recently joined a running group. Therefore, the population to which the results can be generalized is adult women, assuming the data were randomly sampled.


Anecdotal evidence and early smoking research

Anti-smoking research started in the 1930s and 1940s when cigarette smoking became increasingly
popular. While some smokers seemed to be sensitive to cigarette smoke, others were completely
unaffected.
Anti-smoking research was faced with resistance based on anecdotal evidence such as

“My uncle smokes three packs a day and he’s in perfectly good health”.

Such anecdotal evidence is based on a limited sample that might not be representative of the population.
It was concluded that “smoking is a complex human behavior, by its nature difficult to study, confounded by human variability.” 3
In time, researchers could examine larger samples of cases (smokers), and the trends showing that smoking has negative health impacts became much clearer.

Census

Wouldn’t it be better to just include everyone and “sample” the entire population?

This is called a census. There are problems with taking a census:

1. It can be difficult to complete a census: there always seem to be some subjects who are hard to
locate or hard to measure. And these difficult-to-find subjects may have certain characteristics
that distinguish them from the rest of the population.
2. Populations rarely stand still. Even if you could take a census, the population changes constantly, so it’s never possible to get a perfect measure.
3. Taking a census may be more complex than sampling.

From exploratory analysis to inference

Sampling is natural. Think about sampling something you are cooking - you taste (examine) a small
part of what you’re cooking to get an idea about the dish as a whole.

Exploratory analysis:
You taste a spoonful of soup and decide the spoonful you tasted isn’t salty enough.
Inference:
You generalize and conclude that your entire soup needs salt.

3 Brandt (2009)


For your inference to be valid, the spoonful you tasted (the sample) must represent the entire pot (the
population).
If your spoonful comes only from the surface and the salt has collected at the bottom of the pot, what you tasted is probably not representative of the whole pot.
If you first stir the soup thoroughly before you taste it, your spoonful will more likely be representative
of the whole pot.

Sampling bias

Non-response: If only a small fraction of the randomly sampled people chooses to respond to a
survey, the sample may no longer be representative of the population.
Voluntary response: The sample consists of people who volunteer to respond, because they have
strong opinions on the issue. Such a sample will also not be representative of the population.

Example 1.2.

(Figure: screenshot of an online survey and its result; source: cnn.com, Jan 14, 2012.)

Convenience sample: Individuals who are easily accessible are more likely to be included in the
sample.


Example 1.3. A historical example of a biased sample yielding misleading results:


In 1936, Alf Landon sought the Republican presidential nomination opposing the re-election of
Franklin D. Roosevelt (FDR).

(Photos: Alf Landon and Franklin D. Roosevelt.)

The Literary Digest polled about 10 million Americans, and got responses from about 2.4 million.
The poll showed that Landon would likely be the overwhelming winner and FDR would get only
43% of the votes.

Election result: Franklin D. Roosevelt won, with 62% of the votes.


The magazine was completely discredited because of the poll, and was soon discontinued.

The Literary Digest poll – what went wrong?

The magazine had surveyed

• its own readers,


• registered automobile owners, and
• registered telephone users.

These groups had incomes well above the national average of the day - this was the Great Depression era.

(Figure: Literary Digest issue from September 1936.)


This resulted in lists of voters far more likely to support Republicans than a truly typical voter of the
time, i.e., the sample was not representative of the American population at the time.
The Literary Digest election poll was based on a sample size of 2.4 million, which is huge, but since
the sample was biased, the sample did not yield an accurate prediction.
Conclusion: If the soup is not well stirred, it doesn’t matter how large a spoon you have; it will still not taste right. If the soup is well stirred, a small spoon will suffice to taste the soup.

Your turn

A school district is considering whether it will no longer allow high school students to park at
school after two recent accidents where students were severely injured. As a first step, they
survey parents by mail, asking them whether or not the parents would object to this policy
change. Of 6,000 surveys that go out, 1,200 are returned. Of these 1,200 surveys that were
completed, 960 agreed with the policy change and 240 disagreed.
Which of the following statements are true?

i) Some of the mailings may have never reached the parents.

ii) The school district has strong support from parents to move forward with the policy
approval.

iii) It is possible that the majority of the parents of high school students disagree with the
policy change.

iv) The survey results are unlikely to be biased because all parents were mailed a survey.

A i) and ii)
B i) and iii)
C ii) and iii)
D iii) and iv)

1.3 Observational studies and experiments

Definition 1.2.

1. If researchers collect data in a way that does not directly interfere with how the data arise, i.e.,
they merely “observe”, we call it an observational study.
In this case, only a relationship between the explanatory and the response variables can be
established.
2. If researchers randomly assign subjects to various treatments in order to establish causal connections between the explanatory and response variables, we call it an experiment.


If you’re going to walk away with one thing from this class, let it be

”Correlation does not imply causation!”

New study sponsored by General Mills says that eating breakfast makes girls thinner
Girls who regularly ate breakfast, particularly one that includes cereal, were slimmer than those
who skipped the morning meal, according to a larger NIH survey of 2,379 girls in California,
Ohio, and Maryland who were tracked between ages 9 and 19.
Girls who ate breakfast of any type had a lower average body mass index, a common obesity
gauge, than those who said they didn’t. The index was even lower for girls who said they
ate cereal for breakfast, according to findings of the study conducted by the Maryland Medical
Research Institute. The study received funding from the National Institutes of Health and cereal
maker General Mills.
“Not eating breakfast is the worst thing you can do, that’s really the take-home message for
teenage girls,” said study author Bruce Barton, the Maryland institute’s president and CEO.
Results of the study appear in the September issue of the Journal of the American Dietetic
Association.
As part of the survey, the girls were asked once a year what they had eaten during the previous
three days. The data were adjusted to compensate for factors such as differences in physical
activity among the girls and normal increases in body fat during adolescence.
A girl who reported eating breakfast on all three days had, on average, a body mass index 0.7
units lower than a girl who did not eat breakfast at all. If the breakfast included cereal, the
average was 1.65 units lower, the researchers found.
Breakfast consumption dropped as the girls aged, the researchers found, and those who did not
eat breakfast tended to eat higher fat foods later in the day.

Remark. One should be aware that the body mass index is generally a poor metric for measuring
people’s health.

We could ask the following questions about this study:

1. What type of study is this, observational study or an experiment?


2. What is the conclusion of the study?
3. Who sponsored the study?

This is an observational study since the researchers merely observed the girls’ (subjects) behavior
instead of imposing treatments on them. The study, which was sponsored by General Mills, found an
association between girls eating breakfast and being slimmer.


Three possible explanations for the found association:

1. Eating breakfast causes girls to be thinner.

2. Being thin causes girls to eat breakfast.

3. A third variable Z is responsible for both. What could it be?

(Diagram: Z affecting both “eating breakfast” and “stay slim”.)

Definition 1.3. Extraneous variables that affect both the explanatory and the response variable, and that make it seem like there is a relationship between the two, are called confounding variables.

Example 1.4.
A study found a rather strong correlation between the ice cream sales and the number of shark attacks
for a number of beaches that were sampled.

(Figure: scatterplot of the number of shark attacks versus ice cream sales.)

Conclusion: Increasing ice cream sales causes more shark attacks (sharks like eating people full of
ice cream).
Better explanation: The confounding variable is temperature. Warmer temperatures cause ice
cream sales to go up. Warmer temperatures also bring more people to the beaches, increasing the
chances of shark attacks.


1.4 Obtaining good samples - Sampling principles and strategies

Note

Almost all statistical methods are based on the notion of implied randomness.

If observational data are not collected in a random framework from a population, these statistical methods, i.e., the estimates and the errors associated with the estimates, are not reliable.

The most commonly used random sampling techniques are:

• simple random sampling


• stratified sampling
• cluster sampling
• multistage sampling

1.4.1 Simple random sampling

Randomly select cases from the population, with no implied connection between the selected cases.

A sample is called a simple random sample if each case in the population has an equal chance of
being included in the final sample.


1.4.2 Stratified sampling

Similar cases from the population are grouped into so-called strata. Afterward, a simple random
sample is taken from each stratum.

(Diagram: population divided into six strata, with a simple random sample drawn from each stratum.)

Stratified sampling is especially useful when the cases in each stratum are very similar in terms
of the outcome of interest.
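A sketch of stratified sampling under the same assumptions, with a hypothetical column stratum identifying the strata:

stratified_sample <- population |>
  group_by(stratum) |>
  slice_sample(n = 10) |>                  # simple random sample within each stratum
  ungroup()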

1.4.3 Cluster sampling

Clusters are usually not made up of homogeneous observations. We take a simple random sample
of clusters, and then sample all observations in that cluster.

(Diagram: population divided into clusters; clusters 2, 4, and 6 are sampled, and all observations within them are included.)


1.4.4 Multistage sampling

Clusters are usually not made up of homogeneous observations. We take a simple random sample
of clusters, and then take a simple random sample within each sampled cluster.

(Diagram: population divided into clusters; clusters 2, 4, and 6 are sampled, and a simple random sample is drawn within each of them.)

Remark. Cluster or multistage sampling can be more economical than the other sampling techniques.
Also, unlike stratified samples, they are most useful when there is a large case-to-case variability
within a cluster, but the clusters themselves do not look very different.
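A sketch of cluster and multistage sampling under the same assumptions, with a hypothetical column cluster identifying the clusters:

set.seed(2)
sampled_clusters <- sample(unique(population$cluster), size = 3)  # stage 1: pick clusters at random

cluster_sample <- population |>
  filter(cluster %in% sampled_clusters)    # cluster sampling: keep all cases in the chosen clusters

multistage_sample <- cluster_sample |>
  group_by(cluster) |>
  slice_sample(n = 10) |>                  # stage 2: simple random sample within each chosen cluster
  ungroup()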

Your turn

A city council has requested a household survey be conducted in a suburban area of their city.
The area is broken into many distinct and unique neighborhoods, some including large homes
and some with only apartments. Which approach would likely be the least effective?
A Simple random sampling
B Cluster sampling
C Stratified sampling
D Multistage sampling


Your turn

On a large college campus first-year students and sophomores live in dorms located on the
eastern part of the campus and juniors and seniors live in dorms located on the western part
of the campus. Suppose you want to collect student opinions on a new housing structure the
college administration is proposing and you want to make sure your survey equally represents
opinions from students from all years.

a) What type of study is this?

b) Suggest a sampling strategy for carrying out this study.

Remark. In this course, our focus will be on statistical methods for simple random sampling. Proper analysis of more involved sampling schemes requires statistical methods beyond the scope of the course, e.g., linear mixed models for dependent data obtained from stratified sampling, where observations from the same stratum are considered dependent but independent of observations from other strata.

1.5 More on experiments

Principles of experimental design

1. Control: Compare the treatment of interest to a control group.


2. Randomize: Randomly assign subjects to treatments, and randomly sample from the population whenever possible.
3. Replicate: Within a study, replicate by collecting a sufficiently large sample. Or replicate the
entire study.
4. Block: If there are variables that are known or suspected to affect the response variable, first
group subjects into blocks based on these variables, and then randomize cases within each
block to treatment groups.

Example 1.5.
It is suspected that energy gels might affect pro and amateur athletes differently; therefore, we block for pro status (an R sketch of this randomization follows below):

1. divide the sample into pro and amateur


2. randomly assign pro athletes to treatment and control groups
3. randomly assign amateur athletes to treatment and control groups
4. pro/amateur status is equally represented in the resulting treatment and control groups


• Treatment: energy gel

• Control: no energy gel

(Photo: GU Energy gel.)
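As referenced above, a sketch of this block randomization, assuming a hypothetical data frame athletes with a column status equal to "pro" or "amateur":

set.seed(3)
assigned <- athletes |>
  group_by(status) |>                                      # block on pro/amateur status
  mutate(group = sample(rep(c("treatment", "control"),
                            length.out = n()))) |>         # randomize within each block
  ungroup()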

Your turn

A study is designed to test the effect of light level and noise level on the exam performance of students. The researcher also believes that light and noise levels might affect males and females differently, so she wants to make sure both genders are equally represented in each group. Which of the descriptions is correct?

A There are 3 explanatory variables (light, noise, gender) and 1 response variable (exam performance)
B There are 2 explanatory variables (light and noise), 1 blocking variable (gender), and 1 response variable (exam performance)
C There is 1 explanatory variable (gender) and 3 response variables (light, noise, exam performance)
D There are 2 blocking variables (light and noise), 1 explanatory variable (gender), and 1 response variable (exam performance)

Blocking vs treatment variables

Treatment variables are conditions we can impose on the experimental units.


Blocking variables are characteristics that the experimental units come with and that we would like to control for.
Blocking is like stratifying, except used in experimental settings when randomly assigning, as
opposed to sampling.


More experimental design terminology

Placebo: fake treatment, often used as the control group for medical studies
Placebo effect: experimental units showing improvement simply because they believe they are receiving a special treatment
Blinding: when experimental units do not know whether they are in the control or treatment
group
Double-blind: when both the experimental units and the researchers who interact with the patients
do not know who is in the control and who is in the treatment group

Your turn

What is the main difference between observational studies and experiments?


A Experiments take place in a lab while observational studies do not need to.
B In an observational study we only look at what happened in the past.
C Most experiments use random assignment while observational studies do not.
D Observational studies are completely useless since no causal inference can be made based on
their findings.

Online experiments

In 2012 a Microsoft employee working on Bing had an idea about changing the way the search engine
displayed ad headlines. Development would have required little effort, but it was just one of hundreds of ideas proposed, and therefore received low priority from program managers.
So it took more than six months, until an engineer realized that the cost of writing the code for
it would be small, and launched a simple online controlled experiment — an A/B test (A: control,
current system; B: treatment, modification) — to assess its impact.
Within hours the new headline variation was producing abnormally high revenue, triggering a
“too good to be true” alert.

Usually, such alerts signal a bug, but not in this case. An analysis showed that the change
had increased revenue by an astonishing 12%—which on an annual basis would come to
more than $100 million in the United States alone.

It was the best revenue-generating idea in Bing’s history, but until the test, its value was underestimated.


Figure 1.7: From Harvard Business Review: The Surprising Power of Online Experiments


Short summary

In this chapter we use a case study on chronic fatigue syndrome to illustrate study design,
control groups, and the challenge of distinguishing real effects from random variation. We
then cover data basics, including variable types (numerical and categorical, with subcategories
of ordinal and nominal) and the identification of relationships between variables, such as
association, dependence, and the roles of explanatory and response variables. Furthermore,
we distinguish between populations and samples, highlighting potential biases in sampling
methods like non-response and voluntary response, using the historical example of the Literary
Digest poll. Sampling strategies were introduced to reduce potential bias, such as simple random sampling or cluster sampling. Finally, we contrast observational studies with experiments, emphasizing that correlation does not imply causation, and discuss principles of experimental design, including control, randomization, replication, and blocking, along with concepts like placebos and blinding.

Part II

R
2 Data and visualization

2.1 What is in a dataset?

Let’s start with some dataset terminology:

Each row is an observation and each column is a variable.

In the beginning, we will work with the dataset

starwars
# # A tibble: 87 x 14
# name height mass hair_color skin_color eye_color birth_year
# <chr> <int> <dbl> <chr> <chr> <chr> <dbl>
# 1 Luke Skywalker 172 77 blond fair blue 19
# 2 C-3PO 167 75 <NA> gold yellow 112
# 3 R2-D2 96 32 <NA> white, blue red 33
# 4 Darth Vader 202 136 none white yellow 41.9
# 5 Leia Organa 150 49 brown light brown 19
# # i 82 more rows
# # i 7 more variables: sex <chr>, gender <chr>, homeworld <chr>, ...

One of the observations in starwars is about Luke Skywalker.


The dataset contains the following information about Luke:

# we will soon learn how to read this code
luke <- filter(starwars, name == "Luke Skywalker")
print(luke, width = Inf)
# # A tibble: 1 x 14
# name height mass hair_color skin_color eye_color birth_year
# <chr> <int> <dbl> <chr> <chr> <chr> <dbl>
# 1 Luke Skywalker 172 77 blond fair blue 19
# sex gender homeworld species films vehicles starships
# <chr> <chr> <chr> <chr> <list> <list> <list>
# 1 male masculine Tatooine Human <chr [5]> <chr [2]> <chr [2]>

Some variables are more complex objects than others. For example, films is a so-called list. It contains the names of all the films the character starred in, so this information varies across characters. For Luke it contains the following titles.

luke$films
# [[1]]
# [1] "A New Hope" "The Empire Strikes Back"
# [3] "Return of the Jedi" "Revenge of the Sith"
# [5] "The Force Awakens"

What’s in the Star Wars data?

starwars is a data object of type tibble. Therefore, we can inspect available variables by looking at
the data object.

starwars
# # A tibble: 87 x 14
# name height mass hair_color skin_color eye_color birth_year
# <chr> <int> <dbl> <chr> <chr> <chr> <dbl>
# 1 Luke Skywalker 172 77 blond fair blue 19
# 2 C-3PO 167 75 <NA> gold yellow 112
# 3 R2-D2 96 32 <NA> white, blue red 33
# 4 Darth Vader 202 136 none white yellow 41.9
# 5 Leia Organa 150 49 brown light brown 19
# 6 Owen Lars 178 120 brown, grey light blue 52
# # i 81 more rows
# # i 7 more variables: sex <chr>, gender <chr>, homeworld <chr>, ...


But what exactly does each column represent? Take a look at the help page:

?starwars

2.2 Exploratory data analysis

Definition 2.1. Exploratory data analysis (EDA) is an approach to analyzing data sets to summarize
their main characteristics.

Often, EDA is visual – this is what we’ll focus on first.


But we might also calculate summary statistics and perform data transformation at (or before) this
analysis stage – this is what we’ll focus on next.
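As a minimal (non-visual) first pass of EDA on the starwars data, one might, for instance, look at the structure and a few summary statistics; the choice of variables here is just illustrative:

library(tidyverse)

glimpse(starwars)                          # variables, their types, and the first values

starwars |>
  summarise(
    n           = n(),
    mean_height = mean(height, na.rm = TRUE),
    mean_mass   = mean(mass, na.rm = TRUE)
  )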


Questions:

• How would you describe the relationship between mass and height of Star Wars characters?

• What other variables would help us understand data points that don’t follow the overall
trend?

• Who is the not-so-tall but really chubby character?

(Figure: scatterplot “Mass vs. height of Starwars characters”, Height (cm) versus Weight (kg).)

2.3 Why do we visualize?

Let’s take a look at a famous collection of four datasets known as Anscombe’s quartet:

quartet
# set x y
# 1 I 10 8.04
# 2 I 8 6.95
# 3 I 13 7.58
# 4 I 9 8.81
# 5 I 11 8.33
# 6 I 14 9.96


# 7 I 6 7.24
# 8 I 4 4.26
# 9 I 12 10.84
# 10 I 7 4.82
# 11 I 5 5.68
# 12 II 10 9.14
# 13 II 8 8.14
# 14 II 13 8.74
# 15 II 9 8.77
# 16 II 11 9.26
# 17 II 14 8.10
# 18 II 6 6.13
# 19 II 4 3.10
# 20 II 12 9.13
# 21 II 7 7.26
# 22 II 5 4.74
# 23 III 10 7.46
# 24 III 8 6.77
# 25 III 13 12.74
# 26 III 9 7.11
# 27 III 11 7.81
# 28 III 14 8.84
# 29 III 6 6.08
# 30 III 4 5.39
# 31 III 12 8.15
# 32 III 7 6.42
# 33 III 5 5.73
# 34 IV 8 6.58
# 35 IV 8 5.76
# 36 IV 8 7.71
# 37 IV 8 8.84
# 38 IV 8 8.47
# 39 IV 8 7.04
# 40 IV 8 5.25
# 41 IV 19 12.50
# 42 IV 8 5.56
# 43 IV 8 7.91
# 44 IV 8 6.89

The datasets look pretty similar when summarizing each of the four datasets by computing the sample
mean and standard deviation for each of the two variables x and y, as well as their correlation.


quartet |>
group_by(set) |>
summarise(
mean_x = mean(x),
mean_y = mean(y),
sd_x = sd(x),
sd_y = sd(y),
r = cor(x, y)
)
# # A tibble: 4 x 6
# set mean_x mean_y sd_x sd_y r
# <fct> <dbl> <dbl> <dbl> <dbl> <dbl>
# 1 I 9 7.50 3.32 2.03 0.816
# 2 II 9 7.50 3.32 2.03 0.816
# 3 III 9 7.5 3.32 2.03 0.816
# 4 IV 9 7.50 3.32 2.03 0.817

Remark. We haven’t introduced this kind of code, but we will in Chapter 3.

When visualizing Anscombe’s quartet, we realize that they are, in fact, quite different.

ggplot(quartet, aes(x = x, y = y)) +
  geom_point() +
  facet_wrap(~ set, ncol = 4)

(Scatterplots of y versus x for sets I–IV.)

Figure 2.2: Anscombe’s quartet.


2.4 ggplot2

“The simple graph has brought more information to the data analyst’s mind than any other
device.” — John Tukey

Data visualization is the creation and study of the visual representation of data.
Many tools exist for visualizing data, and R is one of them.
There exist many approaches/systems within R for making data visualizations – ggplot2 is one of them, and that’s what we will use.

Remember: The tidyverse is a collection of several packages.


ggplot2 is tidyverse’s data visualization package, i.e., ggplot2 is loaded after running

library(tidyverse)

Remark.

1. In case we only need ggplot2 and no other package from the tidyverse, we can explicitly load just ggplot2 by running the following code.
library(ggplot2)

2. The gg in ggplot2 stands for Grammar of Graphics, since the package is inspired by the book
The Grammar of Graphics by Leland Wilkinson. A grammar of graphics is a tool that enables
us to concisely describe the components of a graphic.


Let’s look again at the plot of mass vs. height of Star Wars characters.

ggplot(
data = starwars,
mapping = aes(x = height, y = mass)
) +
geom_point() +
labs(
title = "Mass vs. height of Starwars characters",
x = "Height (cm)", y = "Weight (kg)"
)

(Figure: scatterplot “Mass vs. height of Starwars characters”, Height (cm) versus Weight (kg).)

Questions:

• What are the functions doing the plotting?

• What is the dataset being plotted?

• Which variables map to which features (aesthetics) of the plot?


ggplot(
data = starwars,
mapping = aes(x = height, y = mass)
) +
geom_point() +
labs(
title = "Mass vs. height of Starwars characters",
x = "Height (cm)", y = "Weight (kg)"
)

ggplot() is the main function in ggplot2. It initializes the plot. The different layers of
the plots are then added consecutively.

The structure of the code for plots can be summarized as

ggplot(data = [dataset],
mapping = aes(x = [x-variable], y = [y-variable])) +
geom_xxx() +
other options

Remark. For help with ggplot2, see ggplot2.tidyverse.org.

Palmer penguins

The dataset penguins from the package palmerpenguins contains measurements for

• species,
• island in Palmer Archipelago,
• size (flipper length, body mass, bill dimensions),
• sex.

(Artwork by Allison Horst.)


library(palmerpenguins)
glimpse(penguins)
# Rows: 344
# Columns: 8
# $ species <fct> Adelie, Adelie, Adelie, Adelie, Adelie, Adelie, A~
# $ island <fct> Torgersen, Torgersen, Torgersen, Torgersen, Torge~
# $ bill_length_mm <dbl> 39.1, 39.5, 40.3, NA, 36.7, 39.3, 38.9, 39.2, 34.~
# $ bill_depth_mm <dbl> 18.7, 17.4, 18.0, NA, 19.3, 20.6, 17.8, 19.6, 18.~
# $ flipper_length_mm <int> 181, 186, 195, NA, 193, 190, 181, 195, 193, 190, ~
# $ body_mass_g <int> 3750, 3800, 3250, NA, 3450, 3650, 3625, 4675, 347~
# $ sex <fct> male, female, female, NA, female, male, female, m~
# $ year <int> 2007, 2007, 2007, 2007, 2007, 2007, 2007, 2007, 2~

Our goal is now to understand the code of the following plot.

(Figure: scatterplot titled “Bill depth and length”, subtitle “Dimensions for Adelie, Chinstrap, and Gentoo Penguins”; Bill depth (mm) versus Bill length (mm), coloured by Species: Adelie, Chinstrap, Gentoo.)

1. Initialize the plot with ggplot() and specify the data argument.
ggplot(data = penguins)


2. Map the variables bill_depth_mm and bill_length_mm to the x- and y-axis, respectively. The
function aes() creates the mapping from the dataset variables to the plot’s aesthetics.
ggplot(data = penguins,
       mapping = aes(x = bill_depth_mm,
                     y = bill_length_mm))

(Plot: empty panel with bill_depth_mm on the x-axis and bill_length_mm on the y-axis.)

3. Represent each observation with a point by using geom_point().


ggplot(data = penguins,
       mapping = aes(x = bill_depth_mm,
                     y = bill_length_mm)) +
  geom_point()

(Plot: scatterplot of bill_length_mm versus bill_depth_mm.)


4. Map species to the colour of each observation point.


ggplot(data = penguins,
       mapping = aes(x = bill_depth_mm,
                     y = bill_length_mm,
                     colour = species)) +
  geom_point()

(Plot: the same scatterplot, with points coloured by species: Adelie, Chinstrap, Gentoo.)

5. Title the plot “Bill depth and length”.


ggplot(data = penguins,
       mapping = aes(x = bill_depth_mm,
                     y = bill_length_mm,
                     colour = species)) +
  geom_point() +
  labs(
    title = "Bill depth and length")

(Plot: as before, now titled “Bill depth and length”.)

6. Add the subtitle “Dimensions for Adelie, Chinstrap, and Gentoo Penguins”.
ggplot(data = penguins,
       mapping = aes(x = bill_depth_mm,
                     y = bill_length_mm,
                     colour = species)) +
  geom_point() +
  labs(
    title = "Bill depth and length",
    subtitle = "Dimensions for Adelie, Chinstrap, and Gentoo Penguins")

(Plot: as before, with the subtitle added.)


7. Label the x- and y-axis as “Bill depth (mm)” and “Bill length (mm)”, respectively.
ggplot(data = penguins,
       mapping = aes(x = bill_depth_mm,
                     y = bill_length_mm,
                     colour = species)) +
  geom_point() +
  labs(
    title = "Bill depth and length",
    subtitle = "Dimensions for Adelie, Chinstrap, and Gentoo Penguins",
    x = "Bill depth (mm)", y = "Bill length (mm)")

(Plot: as before, with the axis labels “Bill depth (mm)” and “Bill length (mm)”.)

8. Label the legend “Species”.


ggplot(data = penguins,
       mapping = aes(x = bill_depth_mm,
                     y = bill_length_mm,
                     colour = species)) +
  geom_point() +
  labs(
    title = "Bill depth and length",
    subtitle = "Dimensions for Adelie, Chinstrap, and Gentoo Penguins",
    x = "Bill depth (mm)", y = "Bill length (mm)",
    colour = "Species")

(Plot: as before, with the legend now titled “Species”.)

9. Add a caption for the data source.


ggplot(data = penguins,
       mapping = aes(x = bill_depth_mm,
                     y = bill_length_mm,
                     colour = species)) +
  geom_point() +
  labs(
    title = "Bill depth and length",
    subtitle = "Dimensions for Adelie, Chinstrap, and Gentoo Penguins",
    x = "Bill depth (mm)", y = "Bill length (mm)",
    colour = "Species",
    caption = "Source: Palmer Station LTER/palmerpenguins package")

(Plot: as before, with the caption “Source: Palmer Station LTER/palmerpenguins package”.)


10. Finally, use a discrete colour scale to be perceived by viewers with common colour blindness.

ggplot(data = penguins,
       mapping = aes(x = bill_depth_mm,
                     y = bill_length_mm,
                     colour = species)) +
  geom_point() +
  labs(
    title = "Bill depth and length",
    subtitle = "Dimensions for Adelie, Chinstrap, and Gentoo Penguins",
    x = "Bill depth (mm)", y = "Bill length (mm)",
    colour = "Species",
    caption = "Source: Palmer Station LTER/palmerpenguins package") +
  scale_colour_viridis_d()

(Plot: the final version, drawn with the viridis discrete colour scale.)

Remark. As with all other R functions, you can omit the names of the arguments when building plots with ggplot(), as long as you keep the order of arguments as given in the function’s help page.


So, using

ggplot(penguins,
aes(x = bill_depth_mm,
y = bill_length_mm,
colour = species))

instead of

ggplot(data = penguins,
mapping = aes(x = bill_depth_mm,
y = bill_length_mm,
colour = species))

is valid code, whereas the following is not, since the positional arguments no longer follow the documented order (data first, then mapping):

ggplot(aes(x = bill_depth_mm,
y = bill_length_mm,
colour = species),
penguins)

2.4.1 Aesthetics

colour

ggplot(penguins,
       aes(x = bill_depth_mm, y = bill_length_mm,
           colour = species)) +
  geom_point() +
  scale_colour_viridis_d()

[Figure: scatterplot of bill length vs. bill depth, coloured by species using the viridis scale.]

shape

In addition to mapping colour to species, we now also map shape to island.

ggplot(penguins,
       aes(x = bill_depth_mm, y = bill_length_mm, colour = species,
           shape = island)) +
  geom_point() +
  scale_colour_viridis_d()

[Figure: scatterplot with colour mapped to species and shape mapped to island (Biscoe, Dream, Torgersen).]

One can, of course, use the same variable for specifying different aesthetics, e.g., using species to
define shape and colour.

ggplot(penguins,
       aes(x = bill_depth_mm, y = bill_length_mm,
           colour = species, shape = species)) +
  geom_point() +
  scale_colour_viridis_d()

[Figure: scatterplot with both colour and shape mapped to species.]

However, mapping both shape and colour to species is redundant: it conveys no additional information and uses up an aesthetic that could otherwise display another variable.

Remark. The shape aesthetic can only be mapped to a discrete variable. Mapping a continuous variable to it leads to an error.

ggplot(penguins,
aes(x = bill_depth_mm, y = bill_length_mm,
shape = body_mass_g)) +
geom_point()
# Error in `geom_point()`:
# ! Problem while computing aesthetics.
# i Error occurred in the 1st layer.
# Caused by error in `scale_f()`:
# ! A continuous variable cannot be mapped to the shape aesthetic.
# i Choose a different aesthetic or use `scale_shape_binned()`.


size

The effect of the size aesthetic is readily apparent: based on the values of the mapped variable, the points are drawn at different sizes.

ggplot(penguins,
       aes(x = bill_depth_mm, y = bill_length_mm,
           colour = species, shape = species,
           size = body_mass_g)) +
  geom_point() +
  scale_colour_viridis_d()

[Figure: scatterplot with colour and shape mapped to species and point size mapped to body mass (g).]

When a continuous variable is mapped to size, a set of representative values is chosen for display in the legend.


alpha

The alpha aesthetic introduces different levels of transparency.

ggplot(penguins,
       aes(x = bill_depth_mm, y = bill_length_mm,
           colour = species, shape = species,
           alpha = flipper_length_mm)) +
  geom_point() +
  scale_colour_viridis_d()

[Figure: scatterplot with colour and shape mapped to species and transparency (alpha) mapped to flipper length (mm).]

2.4.2 Mapping vs. setting

mapping

ggplot(
  penguins,
  aes(
    x = bill_depth_mm, y = bill_length_mm,
    size = body_mass_g,
    alpha = flipper_length_mm)) +
  geom_point()

[Figure: scatterplot with point size mapped to body mass and transparency mapped to flipper length.]

setting

ggplot(penguins,
       aes(x = bill_depth_mm, y = bill_length_mm)) +
  geom_point(
    size = 2,
    alpha = 0.5)

[Figure: scatterplot with all points drawn at size 2 and 50% transparency.]

Mapping: Determine the size, alpha, etc., of the geometric objects, like points, based on the values
of a variable in the dataset: Use aes() to define the mapping.
Setting: Determine the size, alpha, etc., of the geometric objects, like points, not based on the values
of a variable in the dataset: Specify the aesthetics within geom_*().

Remark. The * is a placeholder for one of the available geoms. We used geom_point() in the previous
example, but we’ll learn about other geoms soon!

2.4.3 Faceting

Faceting means creating smaller plots that display different subsets of the data. It is useful for exploring conditional relationships and large datasets.

ggplot(penguins, aes(x = bill_depth_mm, y = bill_length_mm)) +
  geom_point() +
  facet_grid(species ~ island)

[Figure: facet grid of the scatterplot with species in rows and island (Biscoe, Dream, Torgersen) in columns.]

Various ways to facet

Task: For the following few examples, describe what each plot displays. Think about how the code relates to the output.

Note

The plots in the next few examples do not have proper titles, axis labels, etc., because we want you to figure out what's happening in the plots. But you should always label your plots!


Faceting by species and sex

ggplot(penguins, aes(x = bill_depth_mm, y = bill_length_mm)) +
  geom_point() +
  facet_grid(species ~ sex)

[Figure: facet grid with species in rows and sex (female, male, NA) in columns.]

Faceting by sex and species

ggplot(penguins, aes(x = bill_depth_mm, y = bill_length_mm)) +
  geom_point() +
  facet_grid(sex ~ species)

[Figure: facet grid with sex in rows and species in columns.]

Faceting by species

ggplot(penguins, aes(x = bill_depth_mm, y = bill_length_mm)) +
  geom_point() +
  facet_wrap(~ species)

[Figure: panels wrapped into one row, one panel per species.]

Faceting again by species

ggplot(penguins, aes(x = bill_depth_mm, y = bill_length_mm)) +
  geom_point() +
  facet_grid(. ~ species)

[Figure: a single row of three panels, one per species.]

facet_wrap() allows for specifying the number of columns (or rows) in the output.

ggplot(penguins, aes(x = bill_depth_mm, y = bill_length_mm)) +
  geom_point() +
  facet_wrap(~ species, ncol = 2)

[Figure: three panels wrapped into two columns: Adelie and Chinstrap in the first row, Gentoo in the second.]

Summary

facet_grid():

• able to create a 2d grid
• rows ~ cols
• use . for no split in rows or columns (see the sketch below)

facet_wrap(): 1d ribbon wrapped according to the number of rows and columns specified or the available plotting area
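As a small illustration of the dot notation (a sketch only; it reuses the penguins data from above), the following splits the panels by species along the rows and leaves the columns unsplit:

ggplot(penguins, aes(x = bill_depth_mm, y = bill_length_mm)) +
  geom_point() +
  facet_grid(species ~ .)  # species in rows, no split in columns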

Facets based on variables defining aesthetics

When facets are built based on a variable used for coloring, the output will contain an unnecessary
legend.

ggplot(penguins,
       aes(x = bill_depth_mm, y = bill_length_mm, color = species)) +
  geom_point() +
  facet_grid(species ~ sex) +
  scale_color_viridis_d()

[Figure: facet grid by species and sex with points coloured by species; the colour legend repeats information already shown in the facet labels.]

The information about the different species is already shown in the facet row labels and hence doesn't need to be repeated in the legend. One can remove the legend using either guides(), see the example below, or theme(legend.position = "none").

ggplot(penguins,
       aes(x = bill_depth_mm, y = bill_length_mm, color = species)) +
  geom_point() +
  facet_grid(species ~ sex) +
  scale_color_viridis_d() +
  guides(color = "none")  # "none" is preferred over FALSE in current ggplot2

[Figure: the same facet grid, now without the redundant colour legend.]
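As mentioned above, theme() offers an equivalent way to drop the legend; a minimal variant of the previous plot (only the last line differs):

ggplot(penguins,
       aes(x = bill_depth_mm, y = bill_length_mm, color = species)) +
  geom_point() +
  facet_grid(species ~ sex) +
  scale_color_viridis_d() +
  theme(legend.position = "none")  # suppresses all legends for this plot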

2.4.4 geoms

[Figure: two plots of bill length vs. bill depth by species, built from the same data and x/y aesthetics but with different geoms, shown side by side.]

Both plots use the same data, x-aesthetic and y-aesthetic. But they use different geometric
objects to represent the data, and therefore look quite different.
Other geoms are applied analogously to geom_point(). One can also combine several geoms in one
plot.

ggplot(penguins, aes(x = bill_depth_mm, y = bill_length_mm,
                     color = species)) +
  geom_point() +
  geom_smooth()

[Figure: scatterplot with smoothed trend curves (geom_smooth) overlaid per species.]

There are a variety of geoms. Some of them are given in the ggplot2 cheatsheet, which one can also
download as a PDF.
For a complete list, visit the ggplot2 website.

Different geoms describe different aspects of the data, and the choice of the appropriate geom
also depends on the type of the data.
This is explained in more detail when we speak about exploring data.

After figuring out which geom to use, there might still be the question of how to use it. In that case, open the documentation for the chosen geom function (as for any other R function) by typing

?geom_function

• scan the page for relevant info
• ignore things that don't make sense
• try out the examples

2.4.5 Themes

The above plots will look different when you run the given code on your machine, because we have set a different theme as our default. In general, theme_gray() is the default.

ggplot(penguins, aes(x = bill_depth_mm, y = bill_length_mm,
                     color = species)) +
  geom_point() +
  theme_gray()

[Figure: the scatterplot rendered with the default theme_gray().]

But we used theme_minimal().

ggplot(penguins,
aes(x = bill_depth_mm, y = bill_length_mm,
color = species)) +
geom_point() +
theme_minimal()

[Figure: the scatterplot rendered with theme_minimal().]

The complete list of all built-in themes is available on the ggplot2 website.


Short summary

This chapter introduces the ggplot2 package in R for data visualisation. It begins by explaining
dataset terminology using the starwars dataset as an example. The text then demonstrates
the fundamentals of ggplot2, including initialising plots with ggplot(), mapping variables
to aesthetics with aes(), and adding geometric objects with geom_point(). Further sections
cover customising plots with labels and titles, exploring the use of faceting for creating
multiple subplots, and differentiating between mapping aesthetics to variables versus
setting them manually. Finally, it touches upon various geoms for different visual representations and the application of themes to alter the overall appearance of plots.

3 Grammar of data wrangling

In this chapter we introduce another star in the tidyverse, the dplyr package. dplyr is a grammar of data manipulation, providing a consistent set of verbs (= functions) that help you solve the most common data manipulation challenges.

The main verbs are:

select: pick columns by name,
arrange: reorder rows,
filter: pick rows matching criteria,
mutate: add new variables,
summarize: reduce variables to values.

Cleverly combining these verbs builds the grammar. Often, we will need to apply a set of commands group-wise. This can be achieved with the group_by() function, which will also be discussed in this chapter.
Besides the main verbs, dplyr contains many more functions, e.g., slice() or distinct(). We will talk about some of them, but definitely not all of them.

For all dplyr functions

• the first argument is always a data frame
• the subsequent arguments say what to do with that data frame

All dplyr functions

• always return a data frame
• don't modify in place


3.1 Data I/O

Example: We have data from two categories of hotels: resort hotels and city hotels. Each row, i.e.,
each observation, represents a hotel booking.
Data source: TidyTuesday

Goal for original data collection: development of prediction models to classify a hotel
booking's likelihood to be cancelled (Antonio et al., 2019).

We will pursue a much simpler data exploration, but before doing that we have to get the data into R.
The dataset is available as a text file, to be precise, as a CSV file.
The tidyverse toolbox for data input/output is the readr package. It is one of the core tidyverse
packages loaded when loading the tidyverse. Since we already loaded the tidyverse, readr is ready
for usage.
The most general function for reading in data is read_delim(). Several variants with respect to the
relevant field separator exist to make our lives easier. In our case, it is a comma. Therefore, we use
read_csv() (in case of a semicolon, it would be read_csv2()).

hotels <- read_csv("data/hotels.csv")

Let’s have a first look at the data.

hotels
# # A tibble: 119,390 x 32
# hotel is_canceled lead_time arrival_date_year arrival_date_month
# <chr> <dbl> <dbl> <dbl> <chr>
# 1 Resort Hotel 0 342 2015 July
# 2 Resort Hotel 0 737 2015 July
# 3 Resort Hotel 0 7 2015 July
# 4 Resort Hotel 0 13 2015 July
# 5 Resort Hotel 0 14 2015 July
# 6 Resort Hotel 0 14 2015 July
# # i 119,384 more rows
# # i 27 more variables: arrival_date_week_number <dbl>, ...


3.2 Select - extracting columns

We start by extracting just a single column. For example, we want to look at lead_time, which is the
number of days between booking and arrival date.

select(
hotels,
lead_time
)
# # A tibble: 119,390 x 1
# lead_time
# <dbl>
# 1 342
# 2 737
# 3 7
# 4 13
# 5 14
# 6 14
# # i 119,384 more rows

Our code consists of three parts:

1. the dplyr function, in this case select()
2. the data frame object hotels; each dplyr function expects a data object as its first argument
3. the name (without quotes) of the variable which we want to select

Result: a data frame with 119390 rows and 1 column


Remember, dplyr functions always expect a data frame and always yield a data frame.

In this example, hotels and the output of select() are tibbles, which are a special kind of data frame. In particular, a tibble prints information about the dimension of the data and the type of the variables. Most of the time we will work with tibbles.

In the next step, let’s select hotel and lead_time.

select(hotels, hotel, lead_time)


# # A tibble: 119,390 x 2
# hotel lead_time
# <chr> <dbl>
# 1 Resort Hotel 342
# 2 Resort Hotel 737


# 3 Resort Hotel 7
# 4 Resort Hotel 13
# 5 Resort Hotel 14
# 6 Resort Hotel 14
# # i 119,384 more rows

That was easy. We just had to provide the additional variable name as a further argument of
select().

3.3 Arrange

But what if we wanted to select these columns and then arrange the data in descending order of
lead time?
To accomplish this task, we need to take two steps that we can implement as follows:

arrange(
select(hotels,
hotel, lead_time),
desc(lead_time)
)
# # A tibble: 119,390 x 2
# hotel lead_time
# <chr> <dbl>
# 1 Resort Hotel 737
# 2 Resort Hotel 709
# 3 City Hotel 629
# 4 City Hotel 629
# 5 City Hotel 629
# 6 City Hotel 629
# # i 119,384 more rows

Often, tasks will have four, five, or more steps. Writing code from inside to outside in this way will
get extremely messy.
Hence, we want to introduce a more efficient way of combining several steps into one command.


3.4 Pipes

3.4.1 What is a pipe?

R knows a number of different pipe operators. In version 4.1.0, the pipe operator

|>

was introduced in the base R distribution. It is therefore known as the native pipe. We will use this pipe operator most of the time.

In programming, a pipe is a technique for passing information from one process to another.

This means the command

lhs |> rhs(further_arguments)

will be translated to

rhs(lhs, further_arguments)

By default, the object on the left is passed as the first argument of the function on the right. The native pipe also has a placeholder option, allowing objects to be passed to a specific argument. But we will have no reason to use this option.
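For completeness, a minimal sketch of the placeholder (an assumption worth stating: the placeholder _ is available from R 4.2.0 onwards and must be bound to a named argument):

# pass the left-hand side to the named `data` argument instead of the first argument
mtcars |> lm(mpg ~ cyl, data = _)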

3.4.2 How does a pipe work?

You can think about the following sequence of actions:

find keys, start car, drive to work, park

Expressed as a set of nested functions in R pseudo code this would look like

park(drive(start_car(find("keys")), to = "work"))

Writing it out using pipes gives a more natural (and more accessible to read) structure.

find("keys") |>
start_car() |>
drive(to = "work") |>
park()


Let’s see the native pipe in action. We start with the tibble hotels, and pass it to the select()
function to extract the variables hotel and lead_time.

hotels |>
select(hotel, lead_time)
# # A tibble: 119,390 x 2
# hotel lead_time
# <chr> <dbl>
# 1 Resort Hotel 342
# 2 Resort Hotel 737
# 3 Resort Hotel 7
# 4 Resort Hotel 13
# 5 Resort Hotel 14
# 6 Resort Hotel 14
# # i 119,384 more rows

Combining the above code with the arrange() function leads to the result we are looking for.

hotels |>
select(hotel, lead_time) |>
arrange(desc(lead_time))
# # A tibble: 119,390 x 2
# hotel lead_time
# <chr> <dbl>
# 1 Resort Hotel 737
# 2 Resort Hotel 709
# 3 City Hotel 629
# 4 City Hotel 629
# 5 City Hotel 629
# 6 City Hotel 629
# # i 119,384 more rows

3.4.3 Aside

dplyr knows its own pipe operator %>%, which is actually implemented in the package magrittr. This operator is older and has the drawback of only being available once magrittr (or a package re-exporting it, such as the core tidyverse packages) is loaded.
Any guesses as to why the package is called magrittr? The Treachery of Images.
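As a small illustration (assuming the tidyverse, and hence %>%, is already loaded), the earlier pipeline works unchanged with the magrittr pipe:

hotels %>%
  select(hotel, lead_time) %>%
  arrange(desc(lead_time))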


3.4.4 A note on piping and layering

We use |> mainly in dplyr pipelines:

we pipe the output of the previous line of code as the first input of the next line of
code

We use + in ggplot2 plots for “layering”:

we create the plot in layers, separated by +

dplyr

hotels +
select(hotel, lead_time)
# Error: object 'hotel' not found

hotels |>
select(hotel, lead_time)
# # A tibble: 119,390 x 2
# hotel lead_time
# <chr> <dbl>
# 1 Resort Hotel 342
# 2 Resort Hotel 737
# 3 Resort Hotel 7
# 4 Resort Hotel 13
# 5 Resort Hotel 14
# 6 Resort Hotel 14
# # i 119,384 more rows


ggplot2

ggplot(hotels, aes(x = hotel, fill = deposit_type)) |>
  geom_bar()
# Error in `geom_bar()`:
# ! `mapping` must be
# created by `aes()`.
# i Did you use `%>%` or `|>`
# instead of `+`?

ggplot(
hotels,
aes(x = hotel,
fill = deposit_type)) +
geom_bar()

[Figure: bar chart of booking counts for City Hotel and Resort Hotel, filled by deposit type (No Deposit, Non Refund, Refundable).]

3.5 More on select() and arrange()

We have already seen select() at work. However, selecting can also be done based on specific
characteristics.
We could be interested in all variables which have a name starting with the string “arrival”,


hotels |>
select(starts_with("arrival"))
# # A tibble: 119,390 x 4
# arrival_date_year
# <dbl>
# 1 2015
# 2 2015
# 3 2015
# 4 2015
# 5 2015
# 6 2015
# # i 119,384 more rows
# # i 3 more variables: ...

or have a name ending with “type”.

hotels |>
select(ends_with("type"))
# # A tibble: 119,390 x 4
# reserved_room_type
# <chr>
# 1 C
# 2 C
# 3 A
# 4 A
# 5 A
# 6 A
# # i 119,384 more rows
# # i 3 more variables: ...

Helper functions in combination with select() (a few usage sketches follow the list):

• starts_with(): Starts with a prefix
• ends_with(): Ends with a suffix
• contains(): Contains a literal string
• num_range(): Matches a numerical range like x01, x02, x03
• one_of(): Matches variable names in a character vector
• everything(): Matches all variables
• last_col(): Selects the last variable, possibly with an offset
• matches(): Matches a regular expression (a sequence of symbols/characters expressing a string/pattern to be searched for within text)
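A few hedged usage sketches with the hotels data (column names as introduced above; output omitted):

hotels |> select(contains("children"))  # all columns whose name contains "children"
hotels |> select(last_col())            # only the last column
hotels |> select(hotel, everything())   # reorder: hotel first, then everything else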


See help for any of these functions for more info, e.g. ?everything.

Arrange in ascending or descending order

By default, the arrange() function will sort the entries in ascending order.

hotels |>
select(adults, children,
babies) |>
arrange(babies)
# # A tibble: 119,390 x 3
# adults children babies
# <dbl> <dbl> <dbl>
# 1 2 0 0
# 2 2 0 0
# 3 1 0 0
# 4 1 0 0
# 5 2 0 0
# 6 2 0 0
# # i 119,384 more rows

If the output should be given in descending order, we must specify this using desc().

hotels |>
select(adults, children,
babies) |>
arrange(desc(babies))
# # A tibble: 119,390 x 3
# adults children babies
# <dbl> <dbl> <dbl>
# 1 2 0 10
# 2 1 0 9
# 3 2 0 2
# 4 2 0 2
# 5 2 0 2
# 6 2 0 2
# # i 119,384 more rows


3.6 Filter rows

The arguments of filter() after the data frame specify conditions that need to be fulfilled by a row (= an observation) for it to become part of the output. Let's filter for all bookings in City Hotels.

hotels |>
filter(hotel == "City Hotel")
# # A tibble: 79,330 x 32
# hotel is_canceled
# <chr> <dbl>
# 1 City Hotel 0
# 2 City Hotel 1
# 3 City Hotel 1
# 4 City Hotel 1
# 5 City Hotel 1
# 6 City Hotel 1
# # i 79,324 more rows
# # i 30 more variables: ...

We can specify multiple conditions, which will be combined with an &. The following command
extracts all observations, which are bookings with no adults and at least one child.

hotels |>
filter(
adults == 0,
children >= 1
) |>
select(adults, babies, children)
# # A tibble: 223 x 3
# adults babies children
# <dbl> <dbl> <dbl>
# 1 0 0 3
# 2 0 0 2
# 3 0 0 2
# 4 0 0 2
# 5 0 0 2
# 6 0 0 3
# # i 217 more rows

If two (or more) conditions should be combined with an “or”, we must do this explicitly using the |
operator. So, let’s check again for bookings with no adults, but this time we allow for some children
or babies in the room.


hotels |>
filter(
adults == 0,
children >= 1 | babies >= 1 # | means or
) |>
select(adults, babies, children)
# # A tibble: 223 x 3
# adults babies children
# <dbl> <dbl> <dbl>
# 1 0 0 3
# 2 0 0 2
# 3 0 0 2
# 4 0 0 2
# 5 0 0 2
# 6 0 0 3
# # i 217 more rows

We end up with the same number of observations. So, there are no bookings with just babies in the
room.
In some cases, we might be interested in the unique observations of a variable. That’s when we want
to use the distinct() function.

hotels |>
distinct(market_segment)
# # A tibble: 8 x 1
# market_segment
# <chr>
# 1 Direct
# 2 Corporate
# 3 Online TA
# 4 Offline TA/TO
# 5 Complementary
# 6 Groups
# 7 Undefined
# 8 Aviation

Combining distinct() with arrange() makes the output friendlier to read, since the rows are ordered.


hotels |>
distinct(hotel,
market_segment) |>
arrange(hotel, market_segment)
# # A tibble: 14 x 2
# hotel market_segment
# <chr> <chr>
# 1 City Hotel Aviation
# 2 City Hotel Complementary
# 3 City Hotel Corporate
# 4 City Hotel Direct
# 5 City Hotel Groups
# 6 City Hotel Offline TA/TO
# 7 City Hotel Online TA
# 8 City Hotel Undefined
# 9 Resort Hotel Complementary
# 10 Resort Hotel Corporate
# 11 Resort Hotel Direct
# 12 Resort Hotel Groups
# 13 Resort Hotel Offline TA/TO
# 14 Resort Hotel Online TA

Slice for specific row numbers

If we know which rows to extract, slice() can do the job.

hotels |>
slice(1:5) # first five
# # A tibble: 5 x 32
# hotel is_canceled
# <chr> <dbl>
# 1 Resort Hotel 0
# 2 Resort Hotel 0
# 3 Resort Hotel 0
# 4 Resort Hotel 0
# 5 Resort Hotel 0
# # i 30 more variables:
# # lead_time <dbl>, ...

slice() also comes with different variants:


slice_head(df, n = 1, by = group)  # first row from each group
slice_tail(df, n = 1, by = group)  # last row in each group
slice_min(df, x, n = 1)            # smallest value of column x
slice_max(df, x, n = 1)            # largest value of column x
slice_sample(df, n = 1)            # one random row
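For instance, a minimal sketch (no new columns assumed) that returns the three bookings with the longest lead time:

hotels |>
  slice_max(lead_time, n = 3) |>
  select(hotel, lead_time)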

3.7 Deriving information with summarize()

Every data frame you encounter contains more information than what is visible at first glance. For
instance, the hotels data frame does not show the average number of days between arrival at a city
hotel and booking. But we can definitely compute it using summarize():

hotels |>
filter(hotel == "City Hotel") |>
summarize(mean_lead_time = mean(lead_time))
# # A tibble: 1 x 1
# mean_lead_time
# <dbl>
# 1 110.

To use the summarize() function, you need to pass in a data frame along with one or more named
arguments. Each named argument should correspond to an R expression that produces a single value.
The summarize() function will create a new data frame where each named argument is converted
into a column. The name of the argument will become the column name, while the value returned
by the argument will fill the column. This means that we are not restricted to computing only one
summary statistic.
We can apply several summary functions (we will see more examples of summary functions in Explore
Data) as subsequent arguments of summarize().
For example, consider determining the number of bookings in city hotels in addition to computing
the average lead time. We can use the n() function from the dplyr package to count the number of
observations.

hotels |>
filter(hotel == "City Hotel") |>
summarize(mean_lead_time = mean(lead_time), n = n())
# # A tibble: 1 x 2
# mean_lead_time n
# <dbl> <int>
# 1 110. 79330


3.7.1 Summarizing by groups

Now assume we want to compute the same two quantities also for resort hotels. We can definitely
do the following.

hotels |>
filter(hotel == "Resort Hotel") |>
summarize(mean_lead_time = mean(lead_time), n = n())
# # A tibble: 1 x 2
# mean_lead_time n
# <dbl> <int>
# 1 92.7 40060

That would be feasible if the grouping variable has only a very limited number of unique values. But
it is still quite inefficient.
Imagine we want to compute those two summary statistics for each level of market_segment. Based
on the previous solution, we need to repeat the code eight times. Fortunately, dplyr provides a much
more efficient approach using group_by().
The group_by() function takes a data frame and one or more column names from that data frame as
input. It returns a new data frame that groups the rows based on the unique combinations of values
in the specified columns. When we now apply a dplyr function like summarize() or mutate() on
the grouped data frame, it executes the function in a group-wise manner.

hotels |>
group_by(market_segment) |>
summarize(mean_lead_time = mean(lead_time), n = n())
# # A tibble: 8 x 3
# market_segment mean_lead_time n
# <chr> <dbl> <int>
# 1 Aviation 4.44 237
# 2 Complementary 13.3 743
# 3 Corporate 22.1 5295
# 4 Direct 49.9 12606
# 5 Groups 187. 19811
# 6 Offline TA/TO 135. 24219
# 7 Online TA 83.0 56477
# 8 Undefined 1.5 2


3.8 Adding or changing variables with mutate()

Imagine that it’s not essential for the analysis to distinguish between children and babies. Instead,
we would like to have the number of little ones (children or babies) staying in the room.

hotels |>
mutate(little_ones = children + babies) |>
select(children, babies, little_ones) |>
arrange(desc(little_ones))
# # A tibble: 119,390 x 3
# children babies little_ones
# <dbl> <dbl> <dbl>
# 1 10 0 10
# 2 0 10 10
# 3 0 9 9
# 4 2 1 3
# 5 2 1 3
# 6 2 1 3
# # i 119,384 more rows

Your turn

What is happening in the following chunk?


hotels |>
mutate(little_ones = children + babies) |>
count(hotel, little_ones) |>
mutate(prop = n / sum(n))
# # A tibble: 12 x 4
# hotel little_ones n prop
# <chr> <dbl> <int> <dbl>
# 1 City Hotel 0 73923 0.619
# 2 City Hotel 1 3263 0.0273
# 3 City Hotel 2 2056 0.0172
# 4 City Hotel 3 82 0.000687
# 5 City Hotel 9 1 0.00000838
# 6 City Hotel 10 1 0.00000838
# 7 City Hotel NA 4 0.0000335
# 8 Resort Hotel 0 36131 0.303
# 9 Resort Hotel 1 2183 0.0183
# 10 Resort Hotel 2 1716 0.0144
# 11 Resort Hotel 3 29 0.000243
# 12 Resort Hotel 10 1 0.00000838


3.9 Tidyverse style guide

Good code styling is not strictly necessary, but it is highly beneficial: you yourself benefit the most from readable code. But what is good code style?
Two principles when using |> and + are, e.g.,

• always add a space before them
• always add a line break after them (for pipelines with more than two lines)

ggplot(hotels,aes(x=hotel,fill=deposit_type))+geom_bar()

ggplot(hotels, aes(x = hotel, fill = deposit_type)) +
  geom_bar()

These are just two examples. There is a lot more to consider and it makes sense for a beginner to get
inspired by looking at a style guide. Therefore, we encourage you to look at the tidyverse style guide.
We will not always follow this style guide with our code but try to do so as often as possible.
Following a style guide is easier than you think. The styler package provides functions for converting
code to follow a chosen style.

library(styler)
style_text(
"ggplot(hotels,aes(x=hotel,y=deposit_type))+geom_bar()",
transformers = tidyverse_style())
# ggplot(hotels, aes(x = hotel, y = deposit_type)) +
# geom_bar()

The styler addin for the RStudio IDE makes it even easier than that. Just look at the styler website for an example.


Short summary

This chapter introduces the dplyr package in R, a key component of the tidyverse for data
manipulation. It explains how dplyr provides a consistent set of functions, or “verbs”, such as
select, arrange, filter, mutate, and summarise, to tackle common data wrangling tasks. The
text details the use of the native pipe operator (|>) to chain these verbs together in a readable
manner, contrasting it with the layering concept in ggplot2. Furthermore, it illustrates how to
read data using the readr package and demonstrates fundamental dplyr functions for extract-
ing columns, ordering rows, filtering data, and creating summary statistics, often in conjunction
with the group_by() function for group-wise operations.

Part III

Explore Data

4 Exploring categorical data

Data: lending club

The dataset loans_full_schema is contained in


the openintro package and includes thousands of
loans made through the Lending Club, which is a
platform that allows individuals to lend to other
individuals.

• Not all loans are created equal – ease of getting a loan depends on (apparent) ability to pay back
the loan.
• Data includes loans made, these are not loan applications.

Let’s take a peek at the data.

library(openintro)
loans_full_schema
# # A tibble: 10,000 x 55
# emp_title emp_length state homeownership annual_income
# <chr> <dbl> <fct> <fct> <dbl>
# 1 "global config engineer " 3 NJ MORTGAGE 90000
# 2 "warehouse office clerk" 10 HI RENT 40000
# 3 "assembly" 3 WI RENT 40000
# 4 "customer service" 1 PA RENT 30000
# 5 "security supervisor " 10 CA RENT 35000
# 6 "" NA KY OWN 34000
# # i 9,994 more rows
# # i 50 more variables: verified_income <fct>, debt_to_income <dbl>, ...


4.1 Data analysis example

We will examine the relationship between the two variables

• homeownership, which can take one of the values of rent, mortgage (owns but has a mortgage),
or own, and
• application_type, which indicates whether the loan application was made with a partner
(joint) or whether it was an individual application.

The data requires some data cleaning.

loans <- loans_full_schema |>
  mutate(
    # lower case letters
    homeownership = tolower(homeownership),
    # pick new levels
    homeownership = fct_relevel(
      homeownership,
      "rent", "mortgage", "own"
    ),
    application_type = fct_relevel(
      as.character(application_type),
      "joint", "individual"
    )
  )

select(loans, homeownership, application_type)


# # A tibble: 10,000 x 2
# homeownership application_type
# <fct> <fct>
# 1 mortgage individual
# 2 rent individual
# 3 rent individual
# 4 rent individual
# 5 rent joint
# 6 own individual
# # i 9,994 more rows


4.2 Frequency distribution

We start exploring the distribution of the two variables by computing the absolute frequencies of
the different outcome values.

Definition 4.1. Let $\{v_1, \dots, v_k\}$ be the unique values of a categorical variable $X$, and let $x_1, \dots, x_n$ be $n$ sample observations from that variable. Then we define
$$n_j = \sum_{i=1}^{n} 1_{v_j}(x_i), \qquad j \in \{1, \dots, k\},$$
as the absolute frequency of outcome $v_j$. Here, $1_{v_j}(x_i) = \begin{cases} 1, & x_i = v_j \\ 0, & x_i \neq v_j \end{cases}$ is called the indicator or characteristic function. Note that $n_j$ simply counts the occurrences of $v_j$ among $\{x_1, \dots, x_n\}$.

We get the following absolute frequencies for homeownership and application_type.

loans |>
count(homeownership)
# # A tibble: 3 x 2
# homeownership n
# <fct> <int>
# 1 rent 3858
# 2 mortgage 4789
# 3 own 1353

loans |>
count(application_type)
# # A tibble: 2 x 2
# application_type n
# <fct> <int>
# 1 joint 1495
# 2 individual 8505

Instead of absolute frequencies, we can compute the relative frequencies (proportions) $r_j = n_j / n$.


loans |>
count(homeownership) |>
mutate(prop = n / sum(n))
# # A tibble: 3 x 3
# homeownership n prop
# <fct> <int> <dbl>
# 1 rent 3858 0.386
# 2 mortgage 4789 0.479
# 3 own 1353 0.135

4.3 Bar chart

A bar chart is a common way to display the distribution of a single categorical variable.
In ggplot2 we can use geom_bar() to create a bar chart.

ggplot(loans, aes(x = homeownership)) +
  geom_bar(fill = "gold") +
  labs(x = "Homeownership", y = "Count")

[Figure: bar chart of homeownership counts (rent, mortgage, own).]

4.3.1 Computed variables

In the previous plot

ggplot(loans, aes(x = homeownership)) +
  geom_bar(fill = "gold") +
  labs(x = "Homeownership", y = "Count")


we did not present the data as they are. In a preliminary step, absolute frequencies were calculated for homeownership, and these values were then plotted.
Each geom has its own set of computed variables. For geom_bar() these are count (the default) and prop.

Remark. The help page of each geom function contains a list with all computed variables, where the
first entry is the default computation.

To create a bar chart of relative frequencies (not absolute), we first have to apply the statistical
transformation prop to the whole data set.
after_stat(prop) computes group-wise proportions. The data contains three groups concerning
homeownership. If we want to calculate proportions for each group with respect to the size of the
whole dataset, we first have to assign a common group value (e.g., group = 1) for all three groups.

ggplot(loans,
       aes(x = homeownership,
           y = after_stat(prop), group = 1)) +
  geom_bar(fill = "gold") +
  labs(x = "Homeownership")

[Figure: bar chart of homeownership proportions.]


4.4 Frequency distribution for two variables

More generally, we can determine the joint frequency distribution for two categorical variables.

Definition 4.2. Let $\{v_1, \dots, v_k\}$ and $\{w_1, \dots, w_m\}$ be the unique values of the categorical variables $X$ and $Y$, respectively. Further, let $(x_1, y_1), \dots, (x_n, y_n)$ be $n$ sample observations from the bivariate variable $(X, Y)$. Then we define
$$n_{j,\ell} = \sum_{i=1}^{n} 1_{(v_j, w_\ell)}\big((x_i, y_i)\big), \qquad j \in \{1, \dots, k\},\ \ell \in \{1, \dots, m\},$$
as the absolute frequency of outcome $(v_j, w_\ell)$.

A table containing all these absolute frequencies is known as a contingency table.

4.4.1 Computing contingency tables

The table() function can be used to compute such a contingency table:

loans |>
select(homeownership, application_type) |>
table()
# application_type
# homeownership joint individual
# rent 362 3496
# mortgage 950 3839
# own 183 1170

We can also add the marginal frequency distributions.

loans |>
select(homeownership, application_type) |>
table() |>
addmargins()
# application_type
# homeownership joint individual Sum
# rent 362 3496 3858
# mortgage 950 3839 4789
# own 183 1170 1353
# Sum 1495 8505 10000

Remark. Contingency tables can also be computed using count().
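A minimal sketch of this route (output omitted): count() returns the frequencies in long format, and pivot_wider() from tidyr, which is loaded as part of the tidyverse, reshapes them into the familiar table layout.

loans |>
  count(homeownership, application_type) |>
  pivot_wider(names_from = application_type, values_from = n)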


prop.table() converts a contingency table with absolute frequencies into one with proportions.

loans |>
select(homeownership, application_type) |>
table() |>
prop.table()
# application_type
# homeownership joint individual
# rent 0.0362 0.3496
# mortgage 0.0950 0.3839
# own 0.0183 0.1170

To add row and column proportions, one can use the margin argument. For row proportions1 , we have to use margin = 1

loans |>
select(homeownership, application_type) |>
table() |>
prop.table(margin = 1) |>
addmargins()
# application_type
# homeownership joint individual Sum
# rent 0.0938310 0.9061690 1.0000000
# mortgage 0.1983713 0.8016287 1.0000000
# own 0.1352550 0.8647450 1.0000000
# Sum 0.4274573 2.5725427 3.0000000

and column proportions 2 are computed with margin=2.

loans |>
select(homeownership, application_type) |>
table() |>
prop.table(margin = 2) |>
addmargins()
# application_type
# homeownership joint individual Sum
# rent 0.2421405 0.4110523 0.6531928
# mortgage 0.6354515 0.4513815 1.0868330
# own 0.1224080 0.1375661 0.2599742
# Sum 1.0000000 1.0000000 2.0000000
1
absolute frequencies divided by the row totals
2
absolute frequencies divided by the column totals


Remark. Row and column proportions can also be thought of as conditional proportions as they
tell us about the proportion of observations in a given level of a categorical variable conditional on
the level of another categorical variable.

4.5 Bar charts with two variables

We can plot the distributions of two categorical variables simultaneously in a bar chart. Such charts
are generally helpful to visualize the relationship between two categorical variables.

ggplot(loans, aes(x = homeownership, fill = application_type)) +
  geom_bar() +
  scale_fill_brewer(palette = "Set1") +
  labs(
    x = "Homeownership", y = "Count", fill = "Application type",
    title = "Stacked bar chart"
  )

[Figure: stacked bar chart of homeownership counts, filled by application type (joint, individual).]

Loan applicants most often live in homes with mortgages. But it is not so easy to say how the different
types of applications differ over the levels of homeownership.
The stacked bar chart is most useful when it's reasonable to assign one variable as the explanatory variable (here homeownership) and the other variable as the response (here application_type), since we are effectively grouping by one variable first and then breaking it down by the other.
One can vary the bars’ position with the position argument of geom_bar().


ggplot(loans, aes(x = homeownership, fill = application_type)) +
  geom_bar(position = "dodge") +
  scale_fill_brewer(palette = "Set1") +
  labs(x = "Homeownership", y = "Count", fill = "Application type",
       title = "Dodged bar chart")

[Figure: dodged bar chart of homeownership counts by application type.]

Dodged bar charts are more agnostic in their display about which variable, if any, represents the
explanatory and which is the response variable. It is also easy to discern the number of cases in the
six group combinations. However, one downside is that it tends to require more horizontal space.
Additionally, when two groups are of very different sizes, as we see in the group own relative to either
of the other two groups, it is difficult to discern if there is an association between the variables.
A third option for the position argument is fill. Using this option makes it easy to compare the
distribution within one group over all groups in the dataset. But we have no idea about the sizes of
the different groups.

ggplot(loans, aes(x = homeownership, fill = application_type)) +
  geom_bar(position = "fill") +
  scale_fill_brewer(palette = "Set1") +
  labs(x = "Homeownership", y = "Count", fill = "Application type",
       title = "Filled bar chart")

[Figure: filled bar chart showing, within each homeownership level, the proportions of joint and individual applications.]

Conclusion: Joint applications are most common for applicants who live in mortgaged homes. Since
the proportions of joint and individual loans vary across the groups, we can conclude that the two
variables are associated in this sample.

Your turn

A study was conducted to determine whether an experimental heart transplant program increased lifespan. Each patient entering the program was officially designated a heart transplant candidate. Patients were randomly assigned into treatment and control groups. Patients in the treatment group received a transplant, and those in the control group did not. The charts below display the data in two different ways.

[Figure: left, a stacked bar chart of counts of outcome (alive, deceased) by group (treatment, control); right, the corresponding filled bar chart of proportions.]

a) Provide one aspect of the two-group comparison that is easier to see from the stacked bar chart.
b) Provide one aspect of the two-group comparison that is easier to see from the filled bar chart.
c) For the Heart Transplant Study, which of those aspects would be more important to display? That is, which bar plot would be better as a data visualization?


4.6 Visualize the joint frequency distribution

We have previously used bar charts to visualize the distribution of two categorical variables. However,
in, e.g., a filled bar chart, it is impossible to identify the groups’ relative sizes. To visualize the values
in a contingency table for two variables, you can use geom_count().
This function creates a point for each combination of values from the two variables. The size of each
point corresponds to the number of observations associated with that combination. Rare combina-
tions will appear as small points, while more common combinations will be represented by larger
points.

ggplot(loans) +
geom_count(mapping = aes(x = homeownership, y = application_type)) +
labs(x = "Homeownership", y = "Application type")

[Figure: count plot of application type vs. homeownership; point size encodes the number of observations n.]

We can argue based on this plot that the homeownership and application_type variables are asso-
ciated.
The distribution of individual applications across homeownership levels is unequal. The same is true
for joint applications.
Heat maps are a second way to visualize the relationship between two categorical variables. They function similarly to count plots but use colour fill instead of point size to display the number of observations in each combination.
ggplot2 does not provide a dedicated geom function for heat maps, but you can construct one by plotting the results of count() with geom_tile().


To do this, set the x and y aesthetics of geom_tile() to the variables that you pass to count(). Then
map the fill aesthetic to the n variable computed by count().

loans |>
count(homeownership, application_type) |>
ggplot(aes(x = homeownership, y = application_type, fill = n)) +
geom_tile() + labs(x = "Homeownership", y = "Application type")

[Figure: heat map of application type vs. homeownership; tile colour encodes the number of observations n.]

Remark (Pie charts). Pie charts can work for visualizing a categorical variable with very few levels.

[Figure: pie chart of homeownership with counts 3858 (rent), 4789 (mortgage), and 1353 (own).]


However, they can be pretty tricky to read when used to visualize a categorical variable with many
levels, like grade.

[Figure: pie chart of loan grade with seven levels (A–G), which is hard to read.]

Hence, it would be best if you never used a pie chart. Use a bar chart instead.

ggplot(loans, aes(x = grade, fill = grade)) +
  geom_bar()

[Figure: bar chart of loan grade counts (A–G), filled by grade.]


Short summary

This chapter introduces methods for analysing categorical data. It uses the Lending Club loan
dataset to illustrate concepts such as frequency distributions, bar charts, and contingency
tables for examining the relationships between different categories. The document explains
how to visualise single and paired categorical variables using various graphical techniques,
including stacked, dodged, and filled bar charts, as well as count plots and heatmaps, while also
briefly discouraging the use of pie charts.

5 Exploring numerical data

In the beginning, we work again with the loan data from the Lending Club. But this time, we are
considering only a subsample of size 50. In addition, we select just some of the variables.

loans <- loan50 |>
  select(loan_amount, interest_rate, term, grade,
         state, annual_income, homeownership, debt_to_income)
loans
# # A tibble: 50 x 8
# loan_amount interest_rate term grade state annual_income
# <int> <dbl> <dbl> <fct> <fct> <dbl>
# 1 22000 10.9 60 B NJ 59000
# 2 6000 9.92 36 B CA 60000
# 3 25000 26.3 36 E SC 75000
# 4 6000 9.92 36 B CA 75000
# 5 25000 9.43 60 B OH 254000
# 6 6400 9.92 36 B IN 67000
# # i 44 more rows
# # i 2 more variables: homeownership <fct>, ...

The selected variables are the following:

• loan_amount: Amount of the loan received, in US dollars (numerical)
• interest_rate: Interest rate on the loan, as an annual percentage (numerical)
• term: The length of the loan, which is always set as a whole number of months (numerical)
• grade: Loan grade, which takes values A through G and represents the quality of the loan and its likelihood of being repaid (categorical, ordinal)
• state: US state where the borrower resides (categorical, nominal)
• annual_income: Borrower's annual income, including any second income, in US dollars (numerical)
• homeownership: Indicates whether the person owns, owns but has a mortgage, or rents (categorical, nominal)
• debt_to_income: Debt-to-income ratio (numerical)


5.1 Dot plots and the mean

Let’s start by visualizing the shape of the distribution of a single variable. In these cases, a dot plot
provides the most basic of displays.

ggplot(
loans,
aes(x = interest_rate)
) +
labs(x = "Interest rate") +
geom_dotplot()

[Figure: dot plot of the interest rates.]

Remark. The rates have been rounded to the nearest percent in this plot.

Empirical mean

The empirical mean, often called the average or sample mean, is a common way to measure the
center of a distribution of data.

Definition 5.1. The empirical mean, denoted as $\bar{x}_n$, can be calculated as
$$\bar{x}_n = \frac{x_1 + x_2 + \dots + x_n}{n},$$
where $x_1, x_2, \dots, x_n$ represent the $n$ observed values. Sometimes it is convenient to drop the index $n$ and write just $\bar{x}$.

The population mean is often denoted as $\mu$. Sometimes a subscript, such as $\mu_x$, is used to represent which variable the population mean refers to.
Often it is too expensive or even not possible (population data are rarely available) to measure the population mean $\mu$ precisely. Hence we have to estimate $\mu$ using the sample mean $\bar{x}_n$.

5.1.1 Summarize

Although we cannot calculate the average interest rate across all loans in the populations, we can
estimate the population value using the sample data.
We can use summarize() from the dplyr package to summarize the data by computing the sample
mean of the interest rate:

loans |>
summarize(
mean_ir = mean(interest_rate)
)
# # A tibble: 1 x 1
# mean_ir
# <dbl>
# 1 11.6

The sample mean is a point estimate of 𝜇𝑥 . It’s not perfect, but it is our best guess of the average
interest rate on all loans in the population studied.
Later, we will discuss methods for assessing the accuracy of point estimates, which is necessary be-
cause accuracy varies with the sample size.

Remark. We could also have indexed the interest_rate with the $ notation and then applied mean()
to the result.

mean(loans$interest_rate)
# [1] 11.5672


5.1.2 Group means

Now we know that the average interest rate in the sample is equal to 11.5672. However, we would
expect that the interest rate varies with the grade of the loan.
Can we compute the sample mean for each level of grade in an easy way?

loans |>
  group_by(grade) |>
  summarize(
    mean_ir = mean(interest_rate)
  )
# # A tibble: 5 x 2
#   grade mean_ir
#   <fct>   <dbl>
# 1 A        6.77
# 2 B       10.2
# 3 C       13.8
# 4 D       18.6
# 5 E       25.6

After the group_by() step, all computations are performed separately for each level of grade.
We detect an increasing average interest rate with a decreasing grade.
Can we compute several statistics for each level of grade of all mortgage observations
quickly?

loans |>
  filter(homeownership == "mortgage") |>
  group_by(grade) |>
  summarize(
    mean_ir = mean(interest_rate),
    mean_la = mean(loan_amount),
    n = n()
  )
# # A tibble: 5 x 4
#   grade mean_ir mean_la     n
#   <fct>   <dbl>   <dbl> <int>
# 1 A        6.31  18286.     7
# 2 B       10.1   18370     10
# 3 C       13.0   25500      4
# 4 D       20.3   29333.     3
# 5 E       25.6   27200      2


For lower grades, we can detect a larger average loan amount.

Remark. These values should be viewed with great caution, as the sample sizes for several grades are very small.

5.2 Histograms and shape

Dot plots show the exact value for each observation. They are useful for small datasets but can become
hard to read with larger samples.
Especially for larger samples, we prefer to think of the value as belonging to a bin. For the loans
dataset, we created a table of counts for the number of loans with interest rates between 5.0% and
7.5%, then the number of loans with rates between 7.5% and 10.0%, and so on.

loans |>
pull(interest_rate) |>
cut(breaks = seq(5, 27.5, by = 2.5)) |>
table()
#
# (5,7.5] (7.5,10] (10,12.5] (12.5,15] (15,17.5] (17.5,20]
# 11 15 8 4 5 4
# (20,22.5] (22.5,25] (25,27.5]
# 1 1 1

These binned counts are plotted as bars in a histogram.


In ggplot2 we use geom_histogram() to create a histogram.

ggplot(
loans,
aes(x = interest_rate)) +
geom_histogram(
breaks = seq(5, 27.5, 2.5),
colour = "white") +
labs(x = "Interest rate")

[Figure: histogram of the interest rates with bins of width 2.5 between 5% and 27.5%.]

Histograms provide a view of the data density. Higher bars represent where the data are relatively
more common.
A smoothed-out histogram is known as a density plot.

ggplot(loan50, aes(x = interest_rate)) +
  geom_density(fill = "black", alpha = 0.3) +
  labs(x = "Interest rate", y = "Density")

[Figure: density plot of the interest rates.]

Histograms, as well as density plots, are especially convenient for understanding the shape of the
data distribution.
Both plots suggest that most loans have rates under 15%, while only a handful have rates above 20%.
When the distribution of a variable trails off to the right in this way and has a longer right tail, the
shape is said to be right skewed.


[Figure: the histogram and the density plot of the interest rates shown side by side; both have a longer right tail.]

Variables with the reverse characteristic – a long, thinner tail to the left – are said to be left skewed.
Variables that show roughly equal trailing off in both directions are called symmetric.

5.2.1 Modality

In addition to looking at whether a distribution is skewed or symmetric, histograms can be used to identify modes.
A mode is a prominent peak in the distribution. There is only one prominent peak in the histogram of interest_rate.
The following plots show histograms with one, two, or three prominent peaks. Such distributions are called unimodal, bimodal, and multimodal, respectively.

[Figure: three example histograms with one, two, and three prominent peaks, labelled unimodal, bimodal, and multimodal.]

Remark. The search for modes is not about finding a clear and correct answer to the number of
modes in a distribution, which is why prominent is not strictly defined. The most important part of
this investigation is to better understand your data.


5.3 Variance and standard deviation

The mean describes the center of a distribution. But we also need to understand the variability in
the data.
Here, we introduce two related measures of variability: the empirical variance and the empirical
standard deviation. The standard deviation roughly describes how far away the typical observation
is from the mean. We call the distance of an observation 𝑥𝑖 from its empirical mean 𝑥𝑛̄ its deviation
𝑥𝑖 − 𝑥𝑛̄ .
If we square these deviations and then take an average, the result is equal to the empirical variance.

Definition 5.2. Given a sample $x_1, \dots, x_n$, the empirical variance of the sample is defined as the average squared deviation from the empirical mean:
$$s^2_{x,n} = \frac{\sum_{i=1}^{n} (x_i - \bar{x}_n)^2}{n-1}.$$

Remark. We divide by $n-1$, rather than $n$, since we average over $n-1$ "free" values. Indeed, from $n-1$ of the values $x_i - \bar{x}_n$, we can determine the last remaining value because $\sum_{i=1}^{n} (x_i - \bar{x}_n) = 0$.

Let’s compute the empirical variance of the interest_rate.

$$s^2_{x,n} = \frac{(10.9 - 11.57)^2 + (9.92 - 11.57)^2 + \dots + (6.08 - 11.57)^2}{50 - 1} = \frac{(-0.67)^2 + (-1.65)^2 + \dots + (-5.49)^2}{49} \approx 25.52$$

In practice, we wouldn't use that formula for a larger dataset like this one. We would use the following formula:
$$s^2_{x,n} = \frac{\sum_{i=1}^{n} (x_i - \bar{x}_n)^2}{n-1} = \frac{\sum_{i=1}^{n} x_i^2 - 2 \sum_{i=1}^{n} x_i \cdot \bar{x}_n + \sum_{i=1}^{n} \bar{x}_n^2}{n-1} = \frac{\sum_{i=1}^{n} x_i^2 - 2n \cdot \bar{x}_n \cdot \bar{x}_n + n \bar{x}_n^2}{n-1} = \frac{\sum_{i=1}^{n} x_i^2 - n \cdot \bar{x}_n^2}{n-1}.$$
Or just use R:

var(loans$interest_rate)
# [1] 25.52387
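As a quick sanity check (a minimal sketch that only uses the sample already loaded above), the shortcut formula yields the same value:

x <- loans$interest_rate
n <- length(x)
(sum(x^2) - n * mean(x)^2) / (n - 1)  # same result as var(x)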


The empirical standard deviation is the square root of the empirical variance.

Definition 5.3. Given a sample $x_1, \dots, x_n$, the empirical standard deviation is defined as the square root of the empirical variance:
$$s_{x,n} = \sqrt{\frac{1}{n-1} \sum_{i=1}^{n} (x_i - \bar{x}_n)^2}.$$

Remark. The subscripts $x, n$ may be omitted if it's clear that we speak about the variance and standard deviation of $x_1, \dots, x_n$. But in general, it's helpful to have them as a reminder.

Summary

The empirical variance is the average squared distance from the mean, and the empirical stan-
dard deviation is its square root. The standard deviation is useful when considering how far the
data are distributed from the mean.
Like the mean, the population values for variance and standard deviation have typical symbols:
𝜎2 for the variance and 𝜎 for the standard deviation.

Let’s finish by computing the variance and standard deviation of interest_rate.

loans |>
summarize(
var_ir = var(interest_rate), sd_ir = sd(interest_rate)
)
# # A tibble: 1 x 2
# var_ir sd_ir
# <dbl> <dbl>
# 1 25.5 5.05

5.4 Box plots, quartiles, and the median

A box plot summarizes a dataset using five statistics while also identifying unusual observations.
The next figure contains a histogram alongside a box plot of the interest_rate variable.


[Figure: histogram of the interest rates (top) and the corresponding box plot (bottom).]

The dark line inside the box represents the empirical median.

5.4.1 Median

At least 50% of the data are less than or equal to the median, and at least 50% are greater than or equal
to it.

Definition 5.4. The empirical median is the value that splits the data in half when ordered in
ascending order.

Remark. When there is an odd number of observations, there will be precisely one observation that splits the data into two halves, and in such a case, that observation is the median.
For $n$ being an even number, the empirical median can be computed in several ways. One common approach is to define the empirical median of a sample $x_1, \dots, x_n$ to be the average $\frac{1}{2}\big(x_{(n/2)} + x_{(n/2+1)}\big)$, where $x_{(k)}$ is the $k$-th smallest value.
We can use median() to compute the empirical median in R; for quantiles more generally, R's quantile() function knows nine different methods, and we will always use the default.

The interest_rate dataset has an even number of observations, and its median is:

median(loans$interest_rate)
# [1] 9.93
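As a check, the averaging rule from the remark above gives the same value (a minimal sketch using only the sample at hand):

x <- sort(loans$interest_rate)
n <- length(x)                 # 50, an even number
(x[n / 2] + x[n / 2 + 1]) / 2  # average of the two middle values; matches median()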

Definition 5.5. The $k$-th percentile is a number such that at least $k\%$ of the observations are less than or equal to it and at least $(100-k)\%$ are greater than or equal to it.


5.4.2 Interquartile range, whiskers and outliers

The box in a box plot represents the middle 50% of the data. The length of the box is called the
interquartile range, or IQR for short.

Definition 5.6. Given a sample $x_1, \dots, x_n$, the range
$$IQR = Q_3 - Q_1$$
is called the interquartile range.

The statistics $Q_1$ and $Q_3$ are the 25th and 75th percentiles of the sample, respectively. $Q_1$ and $Q_3$ are also called the first and third quartile of the sample.
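In R, $Q_1$ and $Q_3$ can be obtained with quantile() (a minimal sketch using its default method):

quantile(loans$interest_rate, probs = c(0.25, 0.75))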

Like the standard deviation, the IQR is a measure of variability in data. The more variable the data,
the larger the standard deviation and IQR.

IQR(loans$interest_rate)
# [1] 5.755

The whiskers attempt to capture the data outside of the box.

[Figure: box plot of the interest rates with whiskers and potential outliers shown as dots.]

The whiskers reach to the minimum and the maximum values in the data, unless there are points that are considered unusually high or unusually low, namely points

more than 1.5 times the IQR away from the first or the third quartile,

which are labeled with a dot and referred to as potential outliers.
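A minimal sketch of this rule for the interest rates (no new data assumed): observations outside the interval computed below would be flagged as potential outliers.

q <- unname(quantile(loans$interest_rate, probs = c(0.25, 0.75)))
iqr <- q[2] - q[1]
c(lower = q[1] - 1.5 * iqr, upper = q[2] + 1.5 * iqr)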


Your turn

Create a box plot for the mass variable from the starwars dataset. Try to guess the correct
geom function. Let RStudio help you by typing rather slowly.
Hint: use library(tidyverse) since starwars is part of dplyr.

An outlier is an observation that appears extreme relative to the rest of the data. Examining data
for outliers serves many useful purposes, including:

• identifying strong skew in the distribution,


• identifying possible data collection or data entry errors, and
• providing insight into interesting properties of the data.

However, remember that some datasets have a naturally long skew, and outlying points do not
represent any sort of problem in the dataset.

5.4.3 Comparing numerical data across groups

Side-by-side box plots are a common tool to compare the distribution of a numerical variable across groups.

ggplot(loans, aes(
  x = interest_rate,
  y = grade
)) +
  geom_boxplot() +
  labs(
    x = "Interest rate (%)",
    y = "Grade",
    title = "Interest rates of Lending Club loans",
    subtitle = "by grade of loan"
  )


[Figure: "Interest rates of Lending Club loans, by grade of loan" — side-by-side box plots of interest rate (%) by grade.]

When using a histogram, we can fill the bars with different colors according to the levels of the
categorical variable.

ggplot(loans, aes(
  x = interest_rate,
  fill = homeownership
)) +
  geom_histogram(binwidth = 2, colour = "white") +
  labs(
    x = "Interest rate (%)",
    title = "Interest rates of Lending Club loans"
  )

[Figure: "Interest rates of Lending Club loans" — stacked histogram of interest rate (%), bars filled by homeownership (rent, mortgage, own).]


With the position argument of geom_histogram(), one can control how the bars of the different groups are arranged. The default, "stack", puts them on top of each other; position = "dodge" puts them next to each other.

ggplot(loans, aes(
  x = interest_rate,
  fill = homeownership
)) +
  geom_histogram(binwidth = 2, colour = "white",
                 position = "dodge") +
  labs(
    x = "Interest rate (%)",
    title = "Interest rates of Lending Club loans"
  )

[Figure: "Interest rates of Lending Club loans" — dodged histogram of interest rate (%) with bars grouped by homeownership (rent, mortgage, own).]

Another technique for comparing numerical data across different groups would be faceting.

ggplot(loans_full_schema,
       aes(x = interest_rate)) +
  geom_histogram(
    bins = 10, colour = "white") +
  facet_grid(term ~ homeownership)


[Figure: histograms of interest_rate faceted by term (rows: 36, 60 months) and homeownership (columns: MORTGAGE, OWN, RENT).]

Remark. We used the complete dataset in the above plot, not just 50 observations.

5.5 Robust statistics

How are the sample statistics of interest_rate affected by the observation 26.3%? What would have happened if this loan had instead been

• only 15%?
• even larger, say 35%?

[Figure: distribution of interest_rate under three scenarios — original data, 26.3% moved to 15%, and 26.3% moved to 35%.]

We compute the median, IQR, empirical mean and empirical standard deviation for all three
datasets.


loan50_robust_check |>
group_by(Scenario) |>
summarise(
Median = median(interest_rate),
IQR = IQR(interest_rate),
Mean = mean(interest_rate),
SD = sd(interest_rate)
)
# # A tibble: 3 x 5
# Scenario Median IQR Mean SD
# <fct> <dbl> <dbl> <dbl> <dbl>
# 1 Original data 9.93 5.76 11.6 5.05
# 2 Move 26.3% to 15% 9.93 5.76 11.3 4.61
# 3 Move 26.3% to 35% 9.93 5.76 11.7 5.68

The median and IQR are called robust statistics because extreme observations/skewness have
little effect on their values.
On the other hand, the mean and standard deviation are more heavily influenced by changes in
extreme observations.
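The tibble loan50_robust_check is not constructed in the text. One way such a scenario dataset could be built (a sketch, assuming the loan50 data whose largest interest rate is 26.3%; the helper name move_max is chosen for illustration):

move_max <- function(data, new_value, label) {
  data |>
    mutate(
      interest_rate = replace(interest_rate, which.max(interest_rate), new_value),
      Scenario = label
    )
}

loan50_robust_check <- bind_rows(
  loan50 |> mutate(Scenario = "Original data"),
  move_max(loan50, 15, "Move 26.3% to 15%"),
  move_max(loan50, 35, "Move 26.3% to 35%")
) |>
  mutate(Scenario = factor(Scenario, levels = unique(Scenario)))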

For skewed distributions it is often more helpful to use the median and IQR to describe the center and spread. In general,

• mean > median for right-skewed distributions,
• mean < median for left-skewed distributions.

For symmetric distributions, it is often more helpful to use the mean and SD to describe the center and spread. In this case,

mean ≈ median.


¾ Your turn

Marathon winners. The histogram and box plots below show the distribution of finishing
times for male and female winners of the New York Marathon between 1970 and 1999.
[Figure: histogram (left) and box plot (right) of the marathon finishing times.]

a) What features of the distribution are apparent in the histogram and not the box plot?
What features are apparent in the box plot but not in the histogram?

b) What may be the reason for the bimodal distribution? Explain.

c) Compare the distribution of marathon times for men and women based on the box plot
shown below.

[Figure: side-by-side box plots of finishing time by gender.]


5.6 Exploring paired numerical data

A scatterplot provides a case-by-case view of data for two numerical variables. Let’s consider the
relation between annual_income and loan_amount.

ggplot(loan50, aes(x = annual_income, y = loan_amount)) +
  geom_point() +
  labs(x = "annual income", y = "loan amount")

[Figure: scatterplot of loan amount against annual income for the loan50 data.]

Scatterplots are useful for quickly identifying associations between the variables under consideration, whether they are simple trends or more complex relationships.

Is there a linear relationship between income and loan amount?


How can we measure the strength of linear dependence?

5.6.1 Correlation

A measure of linear dependence between two variables is the empirical or sample correlation
coefficient of the two variables.

Definition 5.7. Given a paired sample (𝑥1 , 𝑦1 ), … , (𝑥𝑛 , 𝑦𝑛 ), the empirical correlation coefficient
of the sample is defined as

$$r_{(x,y),n} = \frac{\sum_{i=1}^{n} (x_i - \bar{x}_n)(y_i - \bar{y}_n)}{\sqrt{\sum_{i=1}^{n} (x_i - \bar{x}_n)^2 \sum_{i=1}^{n} (y_i - \bar{y}_n)^2}} = \frac{\sum_{i=1}^{n} (x_i - \bar{x}_n)(y_i - \bar{y}_n)}{(n-1)\, s_{x,n}\, s_{y,n}} = \frac{s_{xy,n}}{s_{x,n}\, s_{y,n}},$$

where $\bar{x}_n, \bar{y}_n$ and $s_{x,n}, s_{y,n}$ are the empirical means and standard deviations, respectively. $s_{xy,n}$ is called the empirical covariance of the sample.
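As a quick sanity check of Definition 5.7, the built-in cor() can be compared with a direct evaluation of the formula (a minimal sketch, assuming the loans data used in this chapter):

x <- loans$annual_income
y <- loans$loan_amount
r_manual <- sum((x - mean(x)) * (y - mean(y))) /
  sqrt(sum((x - mean(x))^2) * sum((y - mean(y))^2))
c(manual = r_manual,
  via_cov = cov(x, y) / (sd(x) * sd(y)),
  builtin = cor(x, y)) # all three agree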


Calculating the correlation across all 50 observations of income and loan amount results in an empir-
ical correlation coefficient of

cor(loans$annual_income, loans$loan_amount)
# [1] 0.396303

This indicates, at most, a moderate linear relation. Computing the correlation coefficient for each
level of homeownership yields

loans |>
group_by(homeownership) |>
summarize(r = cor(annual_income, loan_amount))
# # A tibble: 3 x 2
# homeownership r
# <fct> <dbl>
# 1 rent 0.372
# 2 mortgage 0.308
# 3 own 0.954

Do we have a strong linear dependence between the two variables in case of owners?

ggplot(loans, aes(x = annual_income, y = loan_amount, colour = homeownership)) +
  geom_point(aes(shape = homeownership)) +
  geom_smooth(aes(linetype = homeownership), method = "lm", se = FALSE)

[Figure: scatterplot of loan_amount against annual_income with points, colours, and fitted lines by homeownership (rent, mortgage, own).]

The empirical correlation between income and loan amount is strong in the sample for owners. But the value is based on just three observations.


Remark. Based on all 10000 loans we get

loans_full_schema |>
group_by(homeownership) |>
summarize(r = cor(annual_income, loan_amount), n())
# # A tibble: 3 x 3
# homeownership r `n()`
# <fct> <dbl> <int>
# 1 MORTGAGE 0.284 4789
# 2 OWN 0.366 1353
# 3 RENT 0.331 3858

Creating a scatterplot of the annual income and the loan amount for the complete dataset leads to a lot of overplotting due to the size of the dataset. In that case, a hexplot can be advantageous compared to a scatterplot.

p_point <- ggplot(loans_full_schema,
                  aes(x = annual_income, y = loan_amount)) +
  geom_point()
p_hex <- ggplot(loans_full_schema,
                aes(x = annual_income, y = loan_amount)) +
  geom_hex()

p_point + p_hex + plot_layout(axes = "collect") # using the patchwork package

[Figure: scatterplot (left) and hex plot (right) of loan_amount against annual_income for the full dataset; the hex plot colour encodes the count per hexagon.]

In the hexplot, we are given additional information (compared to the scatterplot) about the absolute
frequency of each hexagon.


Short summary

This chapter introduces fundamental techniques for analysing numerical information using a
dataset about loans for illustration. The text demonstrates how to visualise single variables
through dot plots, histograms, and density plots to understand their distribution, including con-
cepts like skewness and modality. It explains methods to quantify the centre and spread
of data using measures such as the mean, variance, and standard deviation, alongside robust
alternatives like the median and the IQR. Furthermore, the material covers comparing nu-
merical data across different categories using box plots and grouped visualisations. Finally,
it examines relationships between pairs of numerical variables using scatterplots and the
correlation coefficient to identify linear dependencies.

Part IV

Probability

6 Case study: Gender discrimination

Study design:

In 1972, as a part of a study on gender discrimination, 48 male bank supervisors were each
given the same personnel file and asked to judge whether the person should be promoted
to a branch manager job that was described as “routine”.
The files were identical except that half of the supervisors had files showing the person was
male while the other half had files showing the person was female.
It was randomly determined which supervisors got “male” and which got “female” applications.
See Rosen and Jerdee (1974) for more details.

Research question: Are females unfairly discriminated against?

Data: Of the 48 files reviewed, 35 were promoted. The study is testing whether females are unfairly
discriminated against.

At first glance, does there appear to be a relationship between promotion and gender?
Absolute counts:

gender_discrimination |>
table() |>
addmargins()
# decision
# gender promoted not promoted Sum
# male 21 3 24
# female 14 10 24
# Sum 35 13 48


Percentages given gender:

gender_discrimination |>
table() |>
prop.table(margin = 1) |>
addmargins()
# decision
# gender promoted not promoted Sum
# male 0.8750000 0.1250000 1.0000000
# female 0.5833333 0.4166667 1.0000000
# Sum 1.4583333 0.5416667 2.0000000

Conclusion?

We saw a difference of almost 30% (29.2% to be exact) between the proportion of male and female files
that were promoted.
For the sample, we observe a promotion rate that is dependent on gender.
But at this stage, we don’t know how to decide which of the following statements could be true for
the population:

1. Promotion is dependent on gender; males are more likely to be promoted, and hence, there is
gender discrimination against women in promotion decisions.
2. The difference in the proportions of promoted male and female files is due to chance. This is
not evidence of gender discrimination against women in promotion decisions.

Two competing claims

There is nothing going on:

Promotion and gender are independent, no gender discrimination, observed difference in pro-
portions is simply due to chance. → Null hypothesis

There is something going on:

Promotion and gender are dependent, there is gender discrimination, observed difference in
proportions is not due to chance. → Alternative hypothesis

The two claims can be challenged / tested through a hypothesis test.


6.1 A trial as a hypothesis test

Hypothesis testing is very much like a court trial.

𝐻0 ∶ Defendant is innocent
𝐻𝐴 ∶ Defendant is guilty

We first present the evidence - collect data.

[Image from freepik.com]

Then we judge the evidence - Could these data plausibly have happened by chance if the null
hypothesis were true?
If they were very unlikely to have occurred, then the evidence raises more than a reasonable doubt
in our minds about the null hypothesis.
Ultimately we must make a decision: What is too unlikely?

Conclusion:

We need to understand randomness and learn how to compute probabilities.

If the evidence is not strong enough to reject the assumption of innocence, the jury returns with a
verdict of not guilty.
The jury does not say that the defendant is innocent, just that there is not enough evidence to
convict.
The defendant may, in fact, be innocent, but the jury has no way of being sure.
In statistical terms, we fail to reject the null hypothesis.

Note

We never declare the null hypothesis to be true, because we simply do not know whether
it’s true or not.

In a trial, the burden of proof is on the prosecution. In a hypothesis test, the burden of proof is on
the unusual claim.


The null hypothesis is the ordinary state of affairs (the status quo). So, it’s the alternative
hypothesis that we consider unusual and for which we must gather evidence.

Recap: Hypothesis testing framework

We start with a null hypothesis 𝐻0 that represents the status quo and an alternative hypothesis
𝐻𝐴 that represents our research question, i.e., what we’re testing for.

Testing process:

1. Compute a summary statistic 𝑇 (x) for the observed sample x.

2. Under the assumption that the null hypothesis is true, compute how likely the ob-
served value 𝑇 (x) is.

3. Decide if the test results suggest that the data provides convincing evidence against
the null hypothesis. If that’s the case,

reject the null hypothesis in favour of the alternative.

Otherwise, we stick with the null hypothesis.

The second step can be done via simulation (briefly now, more detailed later) or theoretical meth-
ods (later in the course).
Returning to our example of gender discrimination, we want to simulate the experiment under the
assumption of independence, i.e., leave things up to chance.
Two possible outcomes:

1. Results from the simulations based on the chance model look like the data.

We can conclude that the difference between the proportions of promoted files between males and
females was simply due to chance: promotion and gender are independent.

2. Results from the simulations based on the chance model do not look like the data.

We can conclude that the difference between the proportions of promoted files between males and
females was not due to chance, but due to an actual effect of gender: promotion and gender are
dependent.


6.2 Simulate different outcomes by permutations

Let’s start by recomputing the proportions in the four different categories to give the output as a
tibble.
Using the dplyr approach introduced in Chapter 3, we do the following:

1. Summarize the dataset by computing the sample size for each combination of gender and decision.
2. Compute the proportion for each level of gender.

Remark. We specify the grouping through the .by argument of summarize() and mutate().

props <- gender_discrimination |>
  summarize(
    n = n(),
    .by = c(gender, decision)
  ) |>
  mutate(
    prop = n / sum(n),
    .by = c(gender)
  )
props
# # A tibble: 4 x 4
# gender decision n prop
# <fct> <fct> <int> <dbl>
# 1 male promoted 21 0.875
# 2 male not promoted 3 0.125
# 3 female promoted 14 0.583
# 4 female not promoted 10 0.417

A reasonable summary statistic 𝑇 is the difference in proportions for the groups of promoted females
and males.

T_stat <- props |>
  filter(
    decision == "promoted"
  ) |>
  summarize(
    diff_in_prop = diff(prop)
  )
T_stat


# # A tibble: 1 x 1
# diff_in_prop
# <dbl>
# 1 -0.292

Under the assumption of independence between gender and decision, the information about gender
has no influence on decision. To decide if the observed difference in proportions is unusual under
this assumption, we need to simulate differences in proportions under independence between the two
variables.

Idea:

Assign the decision independent of the gender. We achieve this by randomly permuting
the variable gender while leaving decision as it is. If the two variables are independent, the
value of the statistic should be comparable.

Let’s again look at the original data:

gender_discrimination
# # A tibble: 48 x 2
# gender decision
# <fct> <fct>
# 1 male promoted
# 2 male promoted
# 3 male promoted
# 4 male promoted
# 5 male promoted
# 6 male promoted
# 7 male promoted
# 8 male promoted
# 9 male promoted
# 10 male promoted
# # i 38 more rows

Now, we permute the variable gender.


set.seed(123) # makes the permutation reproducible


df_perm <- gender_discrimination |>
mutate(
gender = sample(
gender,
nrow(gender_discrimination)
)
)
df_perm
# # A tibble: 48 x 2
# gender decision
# <fct> <fct>
# 1 female promoted
# 2 male promoted
# 3 male promoted
# 4 male promoted
# 5 female promoted
# 6 female promoted
# 7 female promoted
# 8 female promoted
# 9 female promoted
# 10 female promoted
# # i 38 more rows

For the permuted data we observe the following proportions:

df_perm_perc <- df_perm |>
  summarise(n = n(), .by = c(gender, decision)) |>
  mutate(prop = n / sum(n), .by = gender) |>
  arrange(gender) # male is the first level of gender
df_perm_perc
# # A tibble: 4 x 4
# gender decision n prop
# <fct> <fct> <int> <dbl>
# 1 male promoted 16 0.667
# 2 male not promoted 8 0.333
# 3 female promoted 19 0.792
# 4 female not promoted 5 0.208


The value of the statistic is equal to

df_perm_perc |>
filter(decision == "promoted") |>
summarize(diff(prop))
# # A tibble: 1 x 1
# `diff(prop)`
# <dbl>
# 1 0.125

and hence very different from the observed difference. But doing this once does not provide any evidence in favor of the alternative. Hence, we need to repeat those steps several times.
Let's repeat everything 𝑁 times in a for loop. We initialize a vector of length 𝑁 = 100

set.seed(190503) # makes the permutations reproducible


N <- 100
diff_stat_perm <- vector("numeric", N)

and then run the for loop with the previous commands.

for (i in 1:N) {
diff_stat_perm[i] <- gender_discrimination |>
# permute the labels
mutate(
gender = sample(gender, nrow(gender_discrimination))
) |>
summarise(n = n(), .by = c(gender, decision)) |>
# compute the proportions
mutate(prop = n / sum(n), .by = gender) |>
arrange(gender) |>
filter(decision == "promoted") |>
# compute diff of props
summarize(diff_prop = diff(prop)) |>
pull() # pull the diff out of the tibble
}


Our decision will now be based on the relative frequency

sum(abs(diff_stat_perm) >= 0.29) / N


# [1] 0.03

which describes how often a permuted sample was at least as extreme as the original data.
This means that we observed a difference (in absolute value) at least as large as 0.29 only in 3 of the
𝑁 = 100 simulation runs.
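To see how unusual the observed difference is, the permutation results can be plotted (a minimal sketch, assuming diff_stat_perm from the loop above):

ggplot(tibble(diff_prop = diff_stat_perm), aes(x = diff_prop)) +
  geom_histogram(binwidth = 0.1, colour = "white") +
  geom_vline(xintercept = c(-0.292, 0.292), linetype = "dashed") +
  labs(x = "Difference in promotion proportions (permuted data)")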

Do the simulation results provide convincing evidence of gender discrimination against women, i.e., de-
pendence between gender and promotion decisions?
Yes, the data provide convincing evidence for the alternative hypothesis of gender discrimination
against women in promotion decisions. We conclude that the observed difference between the two
proportions was due to a real effect of gender.

The relative frequency 0.03 can also be understood as an estimate of the probability that the
statistic takes on a value at least as extreme as the observed value under a so-called null distri-
bution. Before we are able to discuss null distributions and related topics in statistical inference
(in later chapters), we have to rigorously discuss the concept of randomness in Chapter 7.

7 Probability

7.1 Defining probability

Probability theory forms an important part of the foundations of statistics. When doing inferen-
tial statistics we need to measure / quantify uncertainty.
Probability theory is the mathematical tool to describe uncertain / random events.

Note

While probability is not of independent interest in this course, it is required as a necessary


tool.

The study of probability arose in part from an interest in understanding games of chance, such as
cards or dice. We will use the intuitive setting of these games to introduce some of the concepts.

Definition 7.1. A random process is one in which we know all possible outcomes, but we cannot
predict which outcome will occur.

Introductory examples:
Tossing a coin

sample(c("Heads", "Tails"), 1)
# [1] "Heads"

or rolling a die

sample(1:6, 1)
# [1] 2

In these examples we have a finite and small number of possible outcomes.


But this doesn’t have to be the case, think e.g. about the closing price of a particular stock. It has a
finite, but very large number of possible outcomes.


7.1.1 Sample space

Rolling a “standard” die has as possible outcomes the elementary events: 1, 2, 3, 4, 5, 6.


The set of all possible outcomes in a random process will be called the sample space.

Definition 7.2. The sample space 𝑆 is the set of all possible outcomes of a random process.

Further examples:

• The sample space for rolling two dice is equal to


𝑆 = {(1, 1), (1, 2), … , (1, 6), (2, 1), … , (2, 6), … , (6, 1), … , (6, 6)}.
• The sample space for spinning the American roulette wheel once is equal to
  $$S = \{\underbrace{R1, R3, \ldots, R35}_{18 \text{ red pockets}},\ \underbrace{B2, B4, \ldots, B36}_{18 \text{ black pockets}},\ \underbrace{G0, G00}_{2 \text{ green pockets}}\}.$$

7.1.2 Events and complements

Starting from the sample space and its elementary events, we can construct more complex events
𝐴 by forming a union of elementary events. In other words, general events 𝐴 are subsets of the
sample space 𝑆.
Example: Observing an even number when rolling a die is then represented by the event 𝐴 =
{2, 4, 6}.
For each event we are able to define its complement with respect to the sample space.

Definition 7.3. Let 𝐴 be an event from a sample space 𝑆. Then we call 𝐴𝑐 the complementary
event of 𝐴, if their union is equal to the sample space

𝐴 ∪ 𝐴𝑐 = 𝑆 .

The event 𝐴 and its complement 𝐴𝑐 are therefore mutually exclusive or disjoint.


7.1.3 Rules of probability

There are several possible interpretations of probability but they all agree on the mathematical rules
probability must follow.

Rules

Let 𝑆 be the sample space, 𝐸 a single event and 𝐸1 , 𝐸2 , ... a disjoint sequence of events. Let P
be a function that assigns a probability P(𝐸) to each 𝐸 ⊆ 𝑆. Then it should hold that:

1. The probability of the event 𝐸 is non-negative: P(𝐸) ≥ 0.

2. The probability that at least one (elementary) event occurs is one: P(𝑆) = 1.
3. $P\left(\bigcup_{i=1}^{\infty} E_i\right) = \sum_{i=1}^{\infty} P(E_i)$ (addition rule of disjoint events).

A function 𝑃 that follows these rules is called a probability measure.

Remark. A direct consequence of 2. and 3. is the following. For an event 𝐴 and its complement 𝐴𝑐
it holds that
P(𝐴) + P(𝐴𝑐 ) = P(𝑆) ⇒ P(𝐴𝑐 ) = 1 − P(𝐴) .

What do probabilities mean?

Frequentist interpretation:
The probability of an outcome is the proportion of times the outcome would occur if we observed
the random process an infinite number of times.
Bayesian interpretation:
A Bayesian interprets probability as a subjective degree of belief. For the same event, two separate
people could have different viewpoints and so assign different probabilities.
The Bayesian probabilist specifies a prior probability. This, in turn, is then updated to a posterior
probability in the light of new, relevant data/evidence.


7.2 Law of large numbers

Probability can be illustrated by rolling a die many times. Let $\hat{p}_n$ be the proportion of outcomes that are 1 out of the first $n$ rolls. As the number of rolls increases, $\hat{p}_n$ will converge to the probability of rolling a 1, $p = 1/6$. The tendency of $\hat{p}_n$ to stabilize around $p$ is described by the Law of Large Numbers (LLN).

[Figure: running proportion $\hat{p}_n$ of rolls showing a 1, plotted against the number of rolls $n$ (up to 1000); the curve stabilizes around 1/6.]
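A figure like this can be reproduced with a short simulation (a minimal sketch; the seed and the number of rolls are arbitrary choices):

set.seed(1)
n_rolls <- 1000
rolls <- sample(1:6, n_rolls, replace = TRUE)
p_hat <- cumsum(rolls == 1) / seq_len(n_rolls) # running proportion of 1s

ggplot(tibble(n = seq_len(n_rolls), p_hat = p_hat), aes(x = n, y = p_hat)) +
  geom_line() +
  geom_hline(yintercept = 1 / 6, linetype = "dashed")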

When tossing a fair coin, if heads comes up on each of the first 10 tosses
HHHHHHHHHH
what do you think the chance is that another head will come up on the next toss?
0.5, less than 0.5, or more than 0.5?
The probability is still 0.5, i.e., there is still a 50% chance that another head will come up on the next
toss:

P(𝐻 on 11𝑡ℎ toss) = P(𝑇 on 11𝑡ℎ toss) = 0.5 .

The common misunderstanding of the LLN is that random processes are supposed to compen-
sate for whatever happened in the past. This is not true and also called gambler’s fallacy or
law of averages.


7.3 Addition rule

What is the probability of drawing a jack or a red card from a well shuffled full deck?

[Figure: a full card deck. Figure from www.milefoot.com, edited by OpenIntro.]

The figure shows that we are interested in the probability of a union of non-disjoint sets.

The jack of hearts and the jack of diamonds are, of course, also red cards. Adding the probabilities
for a jack and being a red card will count these two jacks twice. Therefore, the probability for being
one of these two jacks needs to be subtracted, when calculating:

$$P(\text{jack or red}) = P(\text{jack}) + P(\text{red}) - P(\text{jack and red}) = \frac{4}{52} + \frac{26}{52} - \frac{2}{52} = \frac{28}{52}$$

General addition rule:


P(𝐴 ∪ 𝐵) = P(𝐴) + P(𝐵) − P(𝐴 ∩ 𝐵) ,

where ∪ means or (union) and ∩ means and (intersection).

Note

For disjoint events 𝐴 and 𝐵 we have P(𝐴 ∩ 𝐵) = 0, so the above formula simplifies to

P(𝐴 ∪ 𝐵) = P(𝐴) + P(𝐵).


¾ Your turn

What is the probability that a student, randomly sampled out of a population of size 165, thinks
marijuana should be legalized (No, Yes) or they agree with their parents’ political views (No,
Yes) ?

                Agree: No   Agree: Yes   Total
Legalize: No        11          40          51
Legalize: Yes       36          78         114
Total               47         118         165

A  (40 + 36 − 78) / 165
B  (114 + 118 − 78) / 165
C  78 / 165
D  (114 + 118) / 165

7.4 Probability distributions

Definition 7.4. A discrete probability distribution is a list of all possible elementary events 𝑥𝑖
(countably many) and the probabilities P({𝑥𝑖 }) with which they occur.

Rules for discrete probability distributions

1. The elementary events 𝑥1 , … , 𝑥𝑛 must be disjoint.

2. Each probability must be between 0 and 1.


3. The probabilities must total 1, i.e., $\sum_{i=1}^{n} P(\{x_i\}) = 1$.

Remark.

• An event 𝐴, e.g., drawing a jack, can be the union of several elementary events. For drawing a
jack, we have

𝐴 = ”Jack of clubs” ∪ ”Jack of diamonds”


∪ ”Jack of hearts” ∪ ”Jack of spades”

• Continuous probability distributions are introduced in Section 7.8.


7.5 Independence

Definition 7.5. (informal)


Two random processes are independent, if knowing the outcome of one provides no useful in-
formation about the outcome of the other.

Example 7.1.

• Knowing that the coin landed on a head on the first toss does not provide any valuable infor-
mation for determining what the coin will land on in the second toss.
⇒ Outcomes of two tosses of a coin are independent.
• Knowing that the first card drawn from a deck is an ace does provide helpful information for
determining the probability of drawing an ace in the second draw.
⇒ Outcomes of two draws from a deck of cards (without replacement) are dependent.

Definition 7.6. (mathematical)


Two events 𝐴 and 𝐵 are called independent if

P(𝐴 ∩ 𝐵) = P(𝐴) ⋅ P(𝐵) ,

which is equivalent (see next section) to saying: 𝐴 and 𝐵 are independent if and only if¹

P(𝐴 occurs, given that 𝐵 occurred) =: P(𝐴|𝐵) = P(𝐴).

Example 7.2. Consider the random process of throwing a fair coin twice. Let 𝐴 and 𝐵 be the events
of 𝐻 (head) in the first and second toss, respectively. The sample space for this random process is
equal to 𝑆 = {(𝐻, 𝑇 ), (𝐻, 𝐻), (𝑇 , 𝑇 ), (𝑇 , 𝐻)}. This implies
$$P(A \cap B) = P(\{(H,H)\}) = \frac{1}{4} = \frac{1}{2} \cdot \frac{1}{2} = P(\{(H,T)\} \cup \{(H,H)\}) \cdot P(\{(H,H)\} \cup \{(T,H)\}) = P(A) \cdot P(B),$$
which shows that 𝐴 and 𝐵 are independent (confirms the intuition).

Definition 7.7. The product rule for independent events says

P(𝐴 ∩ 𝐵) = P(𝐴) ⋅ P(𝐵)

if 𝐴 and 𝐵 are independent.

¹ Later we introduce the notation P(𝐴|𝐵) = P(𝐴 occurs, given 𝐵).


More generally, several events 𝐴1 , … , 𝐴𝑛 are independent if the probability of any intersection
formed from the events is the product of the individual event probabilities. In particular,

P(𝐴1 ∩ ⋅ ⋅ ⋅ ∩ 𝐴𝑘 ) = P(𝐴1 ) ⋅ … ⋅ P(𝐴𝑘 ), 𝑘 = 2, … , 𝑛.

¾ Your turn

In a multiple choice exam, there are 5 questions and 4 choices for each question (a, b, c, d).
Nancy has not studied for the exam at all and decides to randomly guess the answers. What is
the probability that:

a) the first question she gets right is the 5th question?

b) she gets all of the questions right?

7.5.1 Disjoint, complementary, independent

Nancy guesses the answer to five multiple choice questions, with four choices for each question.
What is the probability that she gets at least one question right?
Let 𝑄𝑘 be the event, that the answer to the k-th question is correct.
We are interested in the event:

𝐴 = {the answer to at least one question is right}

So we can divide up the sample space into two categories: 𝑆 = {𝐴, 𝐴𝑐 }, where

𝐴𝑐 = {the answer to none of the questions is right} .

Since the probability of complementary events must add up to 1, we get

$$P(A) = 1 - P(A^c) = 1 - P(Q_1^c \cap \cdots \cap Q_5^c) = 1 - P(Q_1^c) \cdots P(Q_5^c) = 1 - 0.75^5 \approx 0.7627.$$
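A quick check of the arithmetic in R:

1 - 0.75^5
# [1] 0.7626953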

¾ Your turn

Roughly 20% of undergraduates at a university are vegetarian or vegan. What is the probability
that, among a random sample of 3 undergraduates, at least one is neither vegetarian nor vegan?


7.6 Conditional probability

Relapse study:
Researchers randomly assigned 72 chronic users of cocaine into three groups:

• desipramine (antidepressant)
• lithium (standard treatment for cocaine)
• placebo.

Results of the study are summarized below.

group          relapse   no relapse   total
desipramine       10          14         24
lithium           18           6         24
placebo           20           4         24
total             48          24         72

We can think of the above table as describing the joint distribution of the two random processes
group and outcome, which can take values in {desipramine, lithium, placebo} and {relapse, no relapse},
respectively.
Using the joint distribution, we can answer questions such as:
What is the probability that a randomly selected patient received the antidepressant (desipramine) and
relapsed?
$$P(\text{outcome = relapse and group = desipramine}) = \frac{10}{72} \approx 0.14$$

Focusing on just one of the random processes, means working with the marginal distribution. As
an example, consider the following question.
What is the probability that a randomly selected patient relapsed?

$$P(\text{outcome = relapse}) = \frac{48}{72} \approx 0.67$$

Definition 7.8. Let P be a probability measure, and let 𝐴, 𝐵 be two events from a sample space 𝑆,
with P(𝐵) > 0. Then the conditional probability of the event 𝐴 (outcome of interest) given the
event 𝐵 (condition) is defined as
$$P(A \mid B) := \frac{P(A \cap B)}{P(B)}.$$


If we know that a patient received the antidepressant desipramine, what is the probability that they
relapsed?

$$P(\text{outcome = relapse} \mid \text{group = desipramine}) = \frac{P(\text{outcome = relapse and group = desipramine})}{P(\text{group = desipramine})} = \frac{10/72}{24/72} = \frac{10}{24} \approx 0.42$$
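The joint and conditional probabilities can also be computed in R by entering the table as a matrix (a minimal sketch; the object name relapse_tbl is chosen here for illustration):

relapse_tbl <- matrix(
  c(10, 14,
    18, 6,
    20, 4),
  nrow = 3, byrow = TRUE,
  dimnames = list(
    group = c("desipramine", "lithium", "placebo"),
    outcome = c("relapse", "no relapse")
  )
)
prop.table(relapse_tbl)             # joint distribution
prop.table(relapse_tbl, margin = 1) # conditional distributions given group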

¾ Your turn

For each one of the three treatments, if we know that a randomly selected patient received this
treatment, what is the probability that they relapsed?

group          relapse   no relapse   total
desipramine       10          14         24
lithium           18           6         24
placebo           20           4         24
total             48          24         72

7.6.1 General multiplication rule

Earlier we saw that if two events are independent, their joint probability is simply the product
of their probabilities. If the events are not believed to be independent, then the dependence is
reflected in the calculation of the joint probability.
If A and B represent two outcomes or events, then

P(𝐴 ∩ 𝐵) = P(𝐴|𝐵) ⋅ P(𝐵).

Remark. Note that this formula is simply the conditional probability formula, rearranged.


7.6.2 Independence and conditional probabilities

Consider the following (hypothetical) distribution of gender and major of students in an introduc-
tory statistics class:

gender    social science   non-social science   total
female          30                  20            50
male            30                  20            50
total           60                  40           100

The probability that a randomly selected student is a social science major is P(major = sosc) = 60/100 = 0.6, while P(major = non-sosc) = 0.4.
The probability that a randomly selected student is a social science major given that they are female is

$$P(\text{major = sosc} \mid \text{gender = female}) = \frac{30}{50} = 0.6.$$
Since P(major = sosc|gender = male) also equals 0.6 and

P(major = non-sosc|gender = female)


= P(major = non-sosc|gender = male) = 0.4 ,

the major of students in this class is independent of their gender.


Remember, in Definition 7.6, we claimed that P(𝐴|𝐵) = P(𝐴) implies that the events 𝐴 and 𝐵 are independent. In other words, knowing that 𝐵 has occurred tells us nothing about 𝐴.
We know that events 𝐴 and 𝐵 are independent if and only if

P(𝐴 ∩ 𝐵) = P(𝐴) ⋅ P(𝐵).


Remember the definition of conditional probability, $P(A \mid B) = \frac{P(A \cap B)}{P(B)}$.

Thus, if P(𝐴|𝐵) = P(𝐴), we get

$$P(A \mid B) = \frac{P(A \cap B)}{P(B)} = P(A) \quad\Rightarrow\quad P(A \cap B) = P(A) \cdot P(B).$$

Hence, 𝐴 and 𝐵 are independent.


7.6.3 Case study: breast cancer screening

American Cancer Society estimates that about 1.7% of women have breast cancer.
Susan G. Komen for The Cure Foundation states that mammography correctly identifies about 78 %
of women who truly have breast cancer.
An article published in 2003 suggests that up to 10% of all mammograms result in false positives for
patients who do not have cancer.

Remark. These percentages are approximate, and very difficult to estimate.

When a patient goes through breast cancer screening there are two competing claims:

i. patient has cancer,


ii. patient doesn’t have cancer.

If a mammogram yields a positive result, what is the probability that the patient actually has cancer?

[Tree diagram: no cancer (0.983) → negative: 0.983 · 0.9 = 0.8847, positive: 0.983 · 0.1 = 0.0983; cancer (0.017) → negative: 0.017 · 0.22 = 0.0037, positive: 0.017 · 0.78 = 0.0133.]

Let 𝐶 describe if the patient has cancer or not, and let 𝑀 ∈ {+, −} be the result of the mammogram.
Then we are interested in the probability

$$P(C = \text{yes} \mid M = +) = \frac{P(C = \text{yes} \cap M = +)}{P(M = +)} = \frac{P(M = + \mid C = \text{yes}) \cdot P(C = \text{yes})}{P(M = +)} = \frac{0.017 \cdot 0.78}{0.017 \cdot 0.78 + 0.983 \cdot 0.1} = \frac{0.0133}{0.0133 + 0.0983} \approx 0.12.$$

Note

Tree diagrams are useful for inverting probabilities. We are given P(𝑀 = +|𝐶 = yes) and
ask for P(𝐶 = yes|𝑀 = +).


¾ Your turn

Suppose a woman who gets tested once and obtains a positive result wants to get tested again.
In the second test, what should we assume to be the probability of this specific woman having
cancer?
A 0.017
B 0.12
C 0.0133
D 0.88

¾ Your turn

What is the probability that this woman has cancer if this second mammogram also yielded a
positive result?

7.6.4 Bayes’ Theorem

The conditional probability formula we have seen so far is a special case of Bayes’ Theorem, which is
applicable even when events are defined by variables that have more than just two outcomes.
In the previous example, we calculated the probability P(𝑀 = +) by summing the probabilities of
two disjoint events: obtaining a positive test result for a person with cancer and for a person without
cancer. This rule is an application of the law of total probability, which is given in the next theorem.

Theorem 7.1. Assume $A_1, \ldots, A_k$ are a partition of the sample space $S$, i.e., the events $A_1, \ldots, A_k$ are all disjoint, $P(A_j) > 0$ and $\bigcup_{i=1}^{k} A_i = S$. Let $B$ be any other event. Then the law of total probability says

$$P(B) = P(B \cap A_1) + P(B \cap A_2) + \cdots + P(B \cap A_k) = P(B \mid A_1)P(A_1) + \cdots + P(B \mid A_k)P(A_k) = \sum_{j=1}^{k} P(B \mid A_j)P(A_j).$$

Bayes' Theorem is an application of the conditional probability definition, $P(A_j \mid B) = \frac{P(A_j \cap B)}{P(B)}$, along with the law of total probability.


Theorem 7.2. (Bayes' Theorem)

Assume $A_1, \ldots, A_k$ are a partition of the sample space $S$, i.e., the events $A_1, \ldots, A_k$ are all disjoint, $P(A_j) > 0$ and $\bigcup_{i=1}^{k} A_i = S$. Let $B$ be another event with $P(B) > 0$. Then

$$P(A_j \mid B) = \frac{P(A_j \cap B)}{P(B)} = \frac{P(B \mid A_j)P(A_j)}{P(B)} = \frac{P(B \mid A_j)P(A_j)}{P(B \mid A_1)P(A_1) + P(B \mid A_2)P(A_2) + \cdots + P(B \mid A_k)P(A_k)},$$

for $j \in \{1, \ldots, k\}$.

Remark. We think of 𝐴1 , … , 𝐴𝑘 as all possible (disjoint) outcomes of one random process and 𝐵 is
the outcome of a second random process.

Application activity: inverting probabilities

A common epidemiological model for the spread of diseases is the SIR model, where the popula-
tion is partitioned into three groups: Susceptible, Infected, and Recovered.
This is a reasonable model for diseases like chickenpox, where a single infection usually provides
immunity to subsequent infections. Sometimes these diseases can also be difficult to detect.
Imagine a population in the midst of an epidemic, where 60% of the population is considered sus-
ceptible, 10% is infected, and 30% is recovered.
The only test for the disease is accurate 95% of the time for susceptible individuals, 99% for in-
fected individuals, but 65% for recovered individuals.
Note: In this case accurate means returning a negative result for susceptible and recovered individuals
and a positive result for infected individuals.

¾ Your turn

Draw a probability tree to reflect the information given above. If the individual has tested
positive, what is the probability that they are actually infected?


7.7 Random variables

The concept of random variables is a helpful and intuitive tool to describe a random process.

Definition 7.9. Let 𝑆 be a sample space and P a probability measure. Then we call a real-valued
function
𝑋 ∶ 𝑆 → R; 𝑠 ↦ 𝑋(𝑠)
a random variable. 𝑋 is called a discrete random variable if the set 𝑋(𝑆) ⊂ R is finite or countably infinite. Otherwise, 𝑋 is called a continuous random variable.

Informal interpretation

We take a measurement 𝑋 for a sample unit 𝑠. Each sample unit 𝑠 contains uncertainty, which
carries over to the measurement 𝑋(𝑠).

Remark. We often write 𝑋 instead of 𝑋(𝑠). The values of random variables are denoted with a
lowercase letter. So, for a discrete random variable 𝑋, we may write, for example, P(𝑋 = 𝑥) for the
probability that the sampled value of 𝑋(𝑠) is equal to 𝑥.

7.7.1 Expected value

We are often interested in the average outcome of a random variable.

Definition 7.10. Let 𝑋 be a discrete random variable with outcome values 𝑥1 , … , 𝑥𝑘 and correspond-
ing probabilities P(𝑋 = 𝑥1 ), … , P(𝑋 = 𝑥𝑘 ). Then we call the weighted average of the possible
outcomes
$$E[X] = \sum_{i=1}^{k} x_i\, P(X = x_i)$$

the expected value of 𝑋.

Remark.

• The continuous case will be treated in Definition 7.16.


• 𝜇 is often used as a symbol for the expected value.


Example 7.3. In a game of cards, you win one dollar if you draw a heart, five dollars if you draw an
ace (including the ace of hearts), ten dollars if you pull the king of spades and nothing for any other
card you draw.
The random variable described below represents the outcome of this card game:

$$X(s) = \begin{cases} 1, & s \in \{2\heartsuit, \ldots, K\heartsuit\} \\ 5, & s \in \{A\heartsuit, A\diamondsuit, A\spadesuit, A\clubsuit\} \\ 10, & s = K\spadesuit \\ 0, & \text{all else} \end{cases}$$

with distribution $P(X = 1) = \frac{12}{52}$, $P(X = 5) = \frac{4}{52}$, $P(X = 10) = \frac{1}{52}$, $P(X = 0) = \frac{35}{52}$.

Given this input, we get the following expected value (winnings)

$$E[X] = 0 \cdot P(X = 0) + 1 \cdot P(X = 1) + 5 \cdot P(X = 5) + 10 \cdot P(X = 10) = 1 \cdot \frac{12}{52} + 5 \cdot \frac{4}{52} + 10 \cdot \frac{1}{52} = \frac{42}{52}.$$

¾ Your turn

A casino game costs 5 Dollars to play. If the first card you draw is red, then you get to draw a
second card (without replacement). If the second card is the ace of clubs, you win 500 Dollars.
If not, you don’t win anything, i.e. lose your 5 Dollars. What is your expected profit/loss from
playing this game?
Hint: The random variable

$$X(\mathbf{s}) = \begin{cases} 495, & (s_1, s_2) \in \{2\heartsuit, \ldots, A\heartsuit, 2\diamondsuit, \ldots, A\diamondsuit\} \times \{A\clubsuit\} \\ -5, & \text{all else} \end{cases}$$

describes the profit/loss.

7.7.2 Variability

In addition to knowing the average value of a random experiment, we are also often interested in the
variability of the values of a random variable.


Definition 7.11. Let 𝑋 be a discrete random variable with outcome values 𝑥1 , … , 𝑥𝑘 , probabilities
P(𝑋 = 𝑥1 ), … , P(𝑋 = 𝑥𝑘 ) and expected value E[𝑋]. Then we call the weighted average of squared
distances
$$\mathrm{Var}[X] = \sum_{i=1}^{k} (x_i - E[X])^2\, P(X = x_i)$$

the variance of 𝑋, and SD[𝑋] = √Var[𝑋] is the standard deviation of 𝑋.

Remark.

• The continuous case will be treated in Definition 7.17.


• 𝜎2 is often used as a symbol for the variance.

Example 7.4. For the card game from Example 7.3, how much would you expect the winnings to
vary from game to game? Using
$$P(X = 0) = \frac{35}{52}, \quad P(X = 1) = \frac{12}{52}, \quad P(X = 5) = \frac{4}{52}, \quad P(X = 10) = \frac{1}{52}, \quad \text{and} \quad E[X] = \frac{42}{52},$$

we get

$$\begin{aligned}
\mathrm{Var}[X] &= \sum_{i=1}^{4} (x_i - E[X])^2\, P(X = x_i) \\
&= (0 - E[X])^2 \cdot P(X = 0) + (1 - E[X])^2 \cdot P(X = 1) + (5 - E[X])^2 \cdot P(X = 5) + (10 - E[X])^2 \cdot P(X = 10) \\
&= \left(0 - \tfrac{42}{52}\right)^2 \cdot \tfrac{35}{52} + \left(1 - \tfrac{42}{52}\right)^2 \cdot \tfrac{12}{52} + \left(5 - \tfrac{42}{52}\right)^2 \cdot \tfrac{4}{52} + \left(10 - \tfrac{42}{52}\right)^2 \cdot \tfrac{1}{52} \approx 3.425
\end{aligned}$$

and

$$\mathrm{SD}[X] = \sqrt{3.425} \approx 1.85.$$
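The calculations from Example 7.3 and Example 7.4 can be verified numerically (a minimal sketch):

x <- c(0, 1, 5, 10)
p <- c(35, 12, 4, 1) / 52
ev <- sum(x * p)          # expected value, 42/52
v <- sum((x - ev)^2 * p)  # variance
c(E = ev, Var = v, SD = sqrt(v))
# E approx. 0.81, Var approx. 3.425, SD approx. 1.85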

7.7.3 Linear combinations

A linear combination of two random variables 𝑋 and 𝑌 is a weighted sum


𝑎𝑋 + 𝑏𝑌 ,
for some given fixed numbers 𝑎 ∈ R and 𝑏 ∈ R.
Taking an average (= computing an expected value) is a linear operation, meaning the expected value
of a weighted sum is the weighted sum of individual expectations.


The expected value of a linear combination of 𝑘 random variables equals

$$E\left[\sum_{i=1}^{k} a_i X_i\right] = \sum_{i=1}^{k} a_i \cdot E[X_i].$$

To describe the variability of a linear combination of random variables, we need to determine the
variance of the linear combination.
The variance of the linear combination 𝑎𝑋 + 𝑏𝑌 of the random variables 𝑋 and 𝑌 is calculated as

Var[𝑎𝑋 + 𝑏𝑌 ] = 𝑎2 ⋅ Var[𝑋] + 𝑏2 ⋅ Var[𝑌 ] + 2𝑎𝑏 ⋅ Cov[𝑋, 𝑌 ] ,

where
Cov[𝑋, 𝑌 ] = E[𝑋 ⋅ 𝑌 ] − E[𝑋] ⋅ E[𝑌 ] .
is the covariance of the random variables 𝑋 and 𝑌 . The covariance is a measure of linear depen-
dence between two random variables.

Definition 7.12. Let 𝑋 and 𝑌 be two random variables. We call two random variables uncorrelated
if and only if
Cov[𝑋, 𝑌 ] = 0 .

For pairwise uncorrelated random variables $X_1, \ldots, X_k$ the variance of the linear combination $\sum_{i=1}^{k} a_i X_i$ is equal to

$$\mathrm{Var}\left[\sum_{i=1}^{k} a_i X_i\right] = \sum_{i=1}^{k} a_i^2\, \mathrm{Var}[X_i].$$

Being uncorrelated is a weaker property compared to being independent. The definition of indepen-
dence between two random variables relies on the independence of events, see Definition 7.6.

Definition 7.13. Let 𝑋 ∶ 𝑆 → R and 𝑌 ∶ 𝑆 → R be two random variables and 𝐴, 𝐵 ⊆ R. The ran-
dom variables 𝑋 and 𝑌 are called independent, if the events {𝑋 ∈ 𝐴} and {𝑌 ∈ 𝐵} are independent
for all 𝐴, 𝐵 ∈ F, i.e.,

P({𝑋 ∈ 𝐴} ∩ {𝑌 ∈ 𝐵}) = P({𝑋 ∈ 𝐴}) ⋅ P({𝑌 ∈ 𝐵}) , (7.1)

where F is a collection² of subsets of R, such that the probabilities in (Equation 7.1) are well-defined.

2
Such a collection is called a 𝜎-algebra and needs to fulfill certain properties. For a discrete r.v. you can think of it being
the power set.


Remark. One can show that the independence between random variables implies uncorrelatedness.
Hence, when two random variables are independent, they are also uncorrelated. The other way
around does not hold in general.

The covariance measures linear dependence. However, it is influenced by the scale of the random
variables 𝑋 and 𝑌 . So, we could say that they are linearly dependent, but specifying the strength is
not straightforward. Therefore, we introduce a scale-free measure of linear dependence.

Definition 7.14. Let 𝑋 ∶ 𝑆 → R and 𝑌 ∶ 𝑆 → R be two random variables. Then, the correlation
between 𝑋 and 𝑌 is defined by

$$\mathrm{Corr}[X, Y] = \frac{\mathrm{Cov}[X, Y]}{\sqrt{\mathrm{Var}[X]} \cdot \sqrt{\mathrm{Var}[Y]}} \in [-1, 1].$$

¾ Your turn

A company has 5 Lincoln Town Cars in its fleet. Historical data show that annual maintenance
cost for each car is on average 2154 Dollars with a standard deviation of 132 Dollars. What is the mean and the standard deviation of the total annual maintenance cost for this fleet?
Note: you can assume that the annual maintenance costs of the five cars are uncorrelated.

7.8 Continuous distributions

Up until now, we have always considered discrete probability distributions. The goal of this section is
to introduce continuous probability distributions. We want to motivate their definition by considering
the effect of an increasing population size, which allows us to measure the outcome of the random
process on finer and finer levels.

But to make this clear at the beginning, the result of a random process having a continuous distribution does not depend on the population size. Let's consider the example of having a sample of heights from 100000 adults. The histogram visualizes the distribution of these heights.

[Figure: histogram of the 100000 heights; x-axis: height (140–200 cm, with 184 marked).]


Assume we are interested in the probability of the event

𝐴 = {a randomly selected adult has a height between 180 and 184 cm} = [180, 184].

The height of the bar corresponding to the interval 𝐴 = [180, 184] will tell us something about the
probability P(𝐴). The fraction of observations falling into this height range will be an approximation
of the probability P(𝐴).

[Figure: the histogram of heights with the bars over the interval [180, 184] highlighted.]

By looking at the counts in the plot we get $P(A) \approx \frac{10000}{100000} = 0.1$. The precise number would be $P(A) = \frac{9852}{100000} = 0.09852$.

From histograms to continuous distributions

As height is a continuous numerical variable, the size of the bins in the histogram can be made smaller
as the sample size increases.

[Figure: histogram of heights for a sample of size 1000000 with narrower bins.]


Now we see the distribution of heights for a sample of size 1000000. We will visualize the distribution
in the limit using a curve.
The curve is called the density or density function of the distribution and is denoted by 𝑓.

[Figure: density curve of the height distribution.]

Remark. As we form the limit, the y-axis in each histogram is rescaled such that the total area of all
bars is equal to 1. Thus, the area under a density equals one.

Let’s take a closer look at the density function 𝑓.

[Figure: the density function f(x) plotted against x (height).]

Returning to the probability of the event 𝐴 = [180, 184], a randomly selected adult has a height between 180 and 184 cm. Using the density, we can calculate the probability as follows: $P(A) = \int_{180}^{184} f(x)\,\mathrm{d}x$.


Definition 7.15. Let 𝑓 be the density of a continuous distribution on R, and let 𝐵 ⊂ R be an interval/event. Then the probability of 𝐵 is given by $P(B) = \int_B f(x)\,\mathrm{d}x$, and

$$F(x) = \int_{-\infty}^{x} f(y)\,\mathrm{d}y$$

is called the distribution function (abbreviated d.f.).

Note

• The distribution function at 𝑥 is the probability of the event/interval (−∞, 𝑥], i.e., 𝐹 (𝑥) =
P((−∞, 𝑥]).

• We already know 0 ≤ P(𝐵) ≤ 1 for any event 𝐵. In particular, we know that the probability of "all possible events" is one, i.e.,

  $$P(\mathbb{R}) = \int_{-\infty}^{\infty} f(x)\,\mathrm{d}x = 1.$$

So, 𝑓 needs to integrate to 1 and must be non-negative because otherwise P(𝐴) > P(𝐵)
with 𝐴 ⊂ 𝐵 could be the case, which is not allowed (=“doesn’t make sense”).

7.8.1 Expected value and variability

Also in the continuous case, we are often interested in the expected or average outcome of a random
variable.

Definition 7.16. Let 𝑋 be a random variable with continuous d.f. 𝐹 and corresponding density 𝑓
defined on R. Then we call the weighted average of the possible outcomes

$$E[X] = \int_{\mathbb{R}} x \cdot f(x)\,\mathrm{d}x$$
the expected value of 𝑋 (or of 𝐹 ).

Additionally, we are interested in the variability contained in a continuous distribution.

Definition 7.17. Let 𝑋 be a random variable with continuous d.f. 𝐹 and corresponding density 𝑓.
Then we call the weighted average of squared distances

$$\mathrm{Var}[X] = \int_{\mathbb{R}} (x - E[X])^2 \cdot f(x)\,\mathrm{d}x$$

the variance of 𝑋 (or of 𝐹 ). The square root of the variance

$$\mathrm{SD}[X] = \sqrt{\mathrm{Var}[X]}$$

is called the standard deviation of 𝑋 (or of 𝐹 ).


To compute the variance, we can use the following rule

$$\begin{aligned}
\mathrm{Var}[X] &= \int_{\mathbb{R}} (x - E[X])^2 \cdot f(x)\,\mathrm{d}x \\
&= \int_{\mathbb{R}} \left(x^2 - 2xE[X] + E[X]^2\right) \cdot f(x)\,\mathrm{d}x \\
&= \int_{\mathbb{R}} x^2 \cdot f(x)\,\mathrm{d}x - 2E[X]\int_{\mathbb{R}} x \cdot f(x)\,\mathrm{d}x + E[X]^2 \underbrace{\int_{\mathbb{R}} f(x)\,\mathrm{d}x}_{=1} \\
&= E[X^2] - 2E[X] \cdot E[X] + E[X]^2 \\
&= E[X^2] - E[X]^2.
\end{aligned}$$

Note

The formula Var[𝑋] = E[𝑋 2 ] − E[𝑋]2 holds for continuous and discrete random variables.

¾ Your turn

In the production of cylinder pistons, the manufacturing process ensures that the deviations of
the diameter upwards or downwards are at most equal to 1.
We interpret the deviations in the current production as realizations of a random variable 𝑋 with density

$$f(x) = \frac{3}{4}\left(1 - x^2\right) \cdot \mathbb{1}_{[-1,1]}(x), \qquad \text{where } \mathbb{1}_A(x) = \begin{cases} 1, & x \in A, \\ 0, & x \notin A. \end{cases}$$

In the manufacturing process, there should be an average deviation of zero. Decide if this is the case for the distribution with density $f(x) = \frac{3}{4}(1 - x^2) \cdot \mathbb{1}_{[-1,1]}(x)$. Justify your answer.

¾ Your turn

How much variability in deviations can be expected based on this distribution? Quantify by
computing the variance of the distribution.

Quantiles

In statistical inference, quantiles are used to construct interval estimates for an unknown parameter
or to specify the critical value in a hypothesis test. In these cases, we compute quantiles of a normal


distribution as well as for other distributions. Therefore, we define the q-quantile in a general case
using the following definition.

Definition 7.18. An observation 𝑥𝑞 from a distribution with distribution function 𝐹 is said to be


the q-quantile, if

𝐹 (𝑥𝑞 ) = P((−∞, 𝑥𝑞 ] ) ≥ 𝑞 and P([𝑥𝑞 , ∞)) ≥ 1 − 𝑞, 𝑞 ∈ (0, 1) .

Graphically, 𝑞 is the area below the probability density curve to the left of the q-quantile 𝑥𝑞 .

[Figure 7.2: Standard normal density; the gray area to the left of the 0.9-quantile 1.28 represents a probability of 0.9.]

7.8.2 The normal distributions

Let's consider the most prominent example of a continuous distribution in a bit more detail than the others. We defined (see Definition A.8) the distribution in the following way:
Let 𝜇 ∈ R and 𝜎 > 0. The normal distribution with mean 𝜇 and variance 𝜎² is the continuous distribution on R with density function

$$f(x) = \frac{1}{\sqrt{2\pi\sigma^2}}\, \mathrm{e}^{-\frac{(x-\mu)^2}{2\sigma^2}}, \qquad x \in \mathbb{R},$$

and we will denote it by N(𝜇, 𝜎²).

Remark.

• Many variables are nearly normally distributed (although none are exactly normal). While not
perfect for any single problem, normal distributions are useful for a variety of problems.


• The normal density has a well-known bell shape. In particular, it is unimodal and symmetric
around the mean 𝜇.

Normal distributions with different parameters

A normal distribution with mean 𝜇 = 0 and variance 𝜎² = 1 is called the standard normal distribution; in symbols, N(0, 1).

[Figure: density of the standard normal distribution.]

If we vary the mean 𝜇, the density will be shifted along the x-axis

[Figure: normal densities with different means; the curves are shifted along the x-axis.]

while varying the standard deviation 𝜎, changes the shape.


[Figure: normal densities with different standard deviations; the curves differ in spread.]

Linear combinations

Let $X_1 \sim \mathcal{N}(\mu_1, \sigma_1^2), \ldots, X_k \sim \mathcal{N}(\mu_k, \sigma_k^2)$ be 𝑘 independent normally distributed random variables. Then each linear combination of these 𝑘 random variables is also normally distributed:

$$a + \sum_{i=1}^{k} b_i X_i \sim \mathcal{N}\left(a + \sum_{i=1}^{k} b_i \mu_i,\ \sum_{i=1}^{k} b_i^2 \sigma_i^2\right),$$

where $a, b_1, \ldots, b_k$ are real-valued constants.

Example 7.5. Consider a single r.v. $X_1 \sim \mathcal{N}(\mu_1, \sigma_1^2)$, i.e., $k = 1$. Let $a = -\frac{\mu_1}{\sigma_1}$ and $b_1 = \frac{1}{\sigma_1}$, and define

$$Z = a + b_1 X_1 = -\frac{\mu_1}{\sigma_1} + \frac{1}{\sigma_1} X_1 = \frac{X_1 - \mu_1}{\sigma_1}.$$

Then, we get

$$Z = \frac{X_1 - \mu_1}{\sigma_1} \sim \mathcal{N}\left(-\frac{\mu_1}{\sigma_1} + \frac{1}{\sigma_1}\mu_1,\ \frac{1}{\sigma_1^2}\sigma_1^2\right) = \mathcal{N}(0, 1).$$

So, 𝑍 has a standard normal distribution. It is called a standardized score, or Z score.

Remark.

1. The Z-score of an observation represents the number of standard deviations it deviates from
the mean.
2. Z scores can be formed for distributions of any shape, but only when the distribution is
normal, we get 𝑍 ∼ N (0, 1).
3. Observations that are more than 2 standard deviations away from the mean (|𝑍| > 2) are
usually considered unusual.


¾ Your turn

Question: Which of the following is false?


A Majority of Z scores in a right skewed distribution are negative.
B In skewed distributions the Z score of the mean may be different from 0.
C For a normal distribution, IQR is less than 4 ⋅ 𝑆𝐷.
D Z scores are helpful for determining how unusual a data point is compared to the rest of the
data in the distribution.

Calculating normal probabilities and quantiles in R

Probabilities:

pnorm(1800, mean = 1500, sd = 300)


# [1] 0.8413447
z <- (1800-1500)/300
pnorm(z, mean = 0, sd = 1)
# [1] 0.8413447

Quantiles:

qnorm(0.8413447, mean = 1500, sd = 300)


# [1] 1800
qnorm(0.8413447, mean = 0, sd = 1)
# [1] 0.9999998
q <- qnorm(0.8413447, mean = 0, sd = 1)
q * 300 + 1500
# [1] 1800

Example 7.6. At the Heinz ketchup factory, the amounts that go into ketchup bottles are supposed
to be normally distributed with a mean of 36 oz and a standard deviation of 0.11 oz. Every 30 minutes,
a bottle is selected from the production line, and its contents are precisely noted. If the amount of
ketchup in the bottle is below 35.8 oz or above 36.2 oz, the bottle fails the quality control inspection.
Question: What is the probability, that a bottle contains less than 35.8 ounces of ketchup?
So, we need to find the probability indicated by the gray area in the figure below.


[Figure: density of the ketchup content distribution; the gray area marks P(X ≤ 35.8).]

The amount of ketchup per bottle is denoted by 𝑋. It is assumed that $X \sim \mathcal{N}(36, 0.11^2)$. To compute the probability, we can either compute the z-score

$$Z = \frac{35.8 - 36}{0.11} = -1.81818\ldots \approx -1.82$$

and use the standard normal distribution

P(𝑋 ≤ 35.8) ≈ P(𝑍 ≤ −1.82)

pnorm(-1.82, mean = 0, sd = 1)
# [1] 0.0343795

or use the $\mathcal{N}(36, 0.11^2)$ distribution directly

pnorm(35.8, mean = 36, sd = 0.11)


# [1] 0.03451817


¾ Your turn

Question: What percent of bottles pass the quality control inspection?


A ≈ 3.44%
B ≈ 6.88%
C ≈ 93.12%
D ≈ 96.56%

pnorm(c(-1.88,-1.82,-1.75,-1.32,1.32,1.75,1.82,1.88), mean=0, sd=1)


# [1] 0.03005404 0.03437950 0.04005916 0.09341751 0.90658249
# [6] 0.95994084 0.96562050 0.96994596

Example 7.7. Body temperatures of healthy humans are typically nearly normally distributed, with a mean of 36.7℃ and a standard deviation of 0.4℃.
Question: What is the cut-off (quantile) for the lowest 3% of human body temperatures?

[Figure: density of the body temperature distribution; the shaded area of 0.03 lies below the 3% cut-off.]

We want to find $x_{0.03}$, which satisfies

$$F(x_{0.03}) = \int_{-\infty}^{x_{0.03}} \frac{1}{\sqrt{2\pi \cdot 0.4^2}}\, \mathrm{e}^{-\frac{(x-36.7)^2}{2 \cdot 0.4^2}}\,\mathrm{d}x = 0.03.$$
Since there is no explicit representation of 𝐹 , we can’t solve this equation analytically and hence
have to use R.

qnorm(0.03, mean=36.7, sd=0.4)


# [1] 35.94768


¾ Your turn

Question: Which R codes compute the cut-off for the highest 10% of human body tempera-
tures?
Remember: body temperature was normally distributed with 𝜇 = 36.7 and 𝜎 = 0.4
A qnorm(0.9) * 0.4 + 36.7
B qnorm(0.1, mean = 36.7, sd = 0.4, lower.tail = FALSE)
C qnorm(0.1) * 0.4 + 36.7
D qnorm(0.9, mean = 36.7, sd = 0.4)

68-95-99.7 rule

For nearly normally distributed data,

• about 68% falls within 1 SD of the mean,
• about 95% falls within 2 SD of the mean,
• about 99.7% falls within 3 SD of the mean.

[Figure: normal density with the 68%, 95%, and 99.7% regions marked between µ − 3σ and µ + 3σ.]

Observations can occasionally fall 4, 5, or more standard deviations away from the mean; however,
such occurrences are quite rare if the data follows a nearly normal distribution.
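These percentages can be checked directly with pnorm(); after standardizing, the calculation is the same for any mean and SD:

pnorm(1:3) - pnorm(-(1:3))
# [1] 0.6826895 0.9544997 0.9973002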

Normal probability plot

Let’s create a histogram and a normal probability plot of a sample of 100 male heights with empir-
ical mean 177.88 and empirical standard deviation 8.36.

h <- ggplot(df, aes(x = heights)) +
  geom_histogram(aes(y = after_stat(density)), bins = 10, colour = "white") +
  geom_function(fun = dnorm, args = list(mean = 177.88, sd = 8.36),
                colour = "red", size = 1.5)
q <- ggplot(df, aes(sample = heights)) +
  geom_qq() + geom_qq_line()
h + q # using patchwork


[Figure: histogram of heights with fitted normal density (left) and normal probability plot (right).]

Anatomy of a normal probability plot

Empirical quantiles, derived from the data, are plotted on the y-axis of a normal probability
plot, while theoretical quantiles from a normal distribution are displayed on the x-axis.
In detail, for data 𝑥1 , … , 𝑥𝑛 , the plot shows a point for each index 𝑖 = 1, … , 𝑛, with the 𝑖-th
point having the following coordinates:

• The y-coordinate is the 𝑖-th smallest value among the data points 𝑥1 , … , 𝑥𝑛 .
• The x-coordinate is a normal quantile that approximates the expected value of the 𝑖-th
smallest value in a random sample of 𝑛 values drawn from N (0, 1).

Interpretation: If a linear relationship is present in the plot, then the data nearly follows a
normal distribution.

Constructing a normal probability plot involves calculating percentiles and corresponding z-scores
for each observation. R performs the detailed calculations when we request it to create these plots.
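A minimal sketch of this construction (assuming the simulated heights stored in df above); geom_qq() performs essentially these steps internally:

n <- nrow(df)
qq_df <- tibble(
  theoretical = qnorm(ppoints(n)), # approximate expected N(0, 1) order statistics
  empirical = sort(df$heights)     # i-th smallest observation
)
ggplot(qq_df, aes(x = theoretical, y = empirical)) +
  geom_point()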


Normal probability plot and skewness

Right skew: Points bend up and to the left of the line.

[Figure: normal probability plot and density for right-skewed data.]

Left skew: Points bend down and to the right of the line.

[Figure: normal probability plot and density for left-skewed data.]


Short tails (narrower than the normal distribution): Points follow an S-shaped curve.

[Figure: normal probability plot (left) and density (right) of data with short tails.]

Long tails (wider than the normal distribution): Points start below the line, bend to follow it, and
end above it.

[Figure: normal probability plot (left) and density (right) of data with long tails.]


Short summary

This chapter provides a foundational overview of probability theory and its crucial role in statistical
thinking and data science. The text introduces core concepts such as sample spaces, events,
and the rules of probability, including frequentist and Bayesian interpretations. It further
explains important principles like the law of large numbers and the addition rule. The ma-
terial extends to conditional probability, independence, and Bayes’ Theorem, illustrating
their applications with examples. Finally, the resource covers random variables, both dis-
crete and continuous distributions (with a focus on the normal distribution), alongside
concepts like expected value and variance.

Part V

Predictive modeling

8 Statistical learning

8.1 Learning problems

In our course we will use the term statistical learning to refer to methods for predicting, or estimat-
ing, an output based on one or more inputs.

• Such prediction is often termed supervised learning (the development of the prediction method
is supervised by the output).

• If the output is numeric, we also speak of a regression problem.


• If the output is categorical, we speak of a classification problem (prediction amounts to assign-
ing to one of the categories).
• Unsupervised learning refers to problems in which there is no supervising output (e.g., cluster-
ing = finding group structure among the data points). This is for another course.

Example 8.1 (Advertising). Data on the sales of a product in 200 different markets, along with ad-
vertising budgets for the product in each of those markets for three different media: TV, radio, and
newspaper.

advertising <- read_csv("data/Advertising.csv")


advertising
# # A tibble: 200 x 4
# TV radio newspaper sales
# <dbl> <dbl> <dbl> <dbl>
# 1 230. 37.8 69.2 22.1
# 2 44.5 39.3 45.1 10.4
# 3 17.2 45.9 69.3 9.3
# 4 152. 41.3 58.5 18.5
# 5 181. 10.8 58.4 12.9
# 6 8.7 48.9 75 7.2
# # i 194 more rows

Remark. The Advertising.csv data is from the website accompanying Introduction to Statistical Learning.


Predicting sales from radio budget

Suppose we wish to predict the sales for a randomly selected market with a given budget for radio
advertising.
In this problem, radio is the input variable and sales the output variable.
Terminology:

• Often, the input is also referred to as predictors, features, covariates, or independent variables.

• And the output is commonly termed the response or dependent variable.

Notation:
𝑋 = input, 𝑌 = output.

Can we predict sales from radio budget?

ggplot(advertising, aes(x = radio, y = sales)) +
  geom_point()

[Figure: scatterplot of sales against radio advertising budget.]


8.2 Modeling noisy relationships

There seems to be some relationship between sales and radio, but it is certainly noisy. We may
capture the noise in a statistical model that invokes probability.
Consider a randomly selected market with radio budget 𝑋 and sales 𝑌 . Mathematically, 𝑋 and 𝑌
are random variables. Then we may posit the model

𝑌 = 𝑓0 (𝑋) + 𝜖0 ,

where 𝜖0 is a random error term, i.e., a random variable that is independent of 𝑋 and has mean 0. The
random error 𝜖0 encodes the noise in the prediction problem.
In this formulation, 𝑓0 is an unknown function that represents the systematic information that 𝑋
provides about the numeric variable 𝑌 .

8.2.1 Linear prediction

A simple but powerful idea:

i) Approximate the unknown 𝑓0 by a linear function

𝑓(𝑥) = 𝛽0 + 𝛽1 𝑥

and choose the unknown parameters 𝛽0 and 𝛽1 such that 𝑓 optimally approximates the
data (we will look at this in depth under the headings of linear regression and least
squares).

ii) To form a prediction of 𝑌 , we may evaluate the line at values of 𝑋 that are of interest. So each prediction takes the form
$$\hat{Y} = \hat{f}(x) := \hat{\beta}_0 + \hat{\beta}_1 x\,,$$
where $\hat{\beta}_0$ and $\hat{\beta}_1$ are the optimally chosen parameters and 𝑥 is the value of interest.


Example 8.2 (Advertising cont.).

Linear prediction of sales from radio budget

ggplot(advertising, aes(x = radio, y = sales)) +
  geom_point() +
  geom_smooth(method = lm, formula = y ~ x, se = FALSE)

[Figure: scatterplot of sales against radio with the fitted least squares line.]

Predicting sales from the different budgets

[Figure: scatterplots of sales against the TV, radio and newspaper budgets.]


Combining several predictors

In order to (hopefully) obtain better predictions, we may simultaneously draw on several predictors.
Advertising example:
Consider as input the vector 𝑋 = (𝑋1 , 𝑋2 , 𝑋3 ) comprising all three budgets, so 𝑋1 is radio, 𝑋2 is
TV, and 𝑋3 is newspaper.
The systematic part of our model 𝑌 = 𝑓(𝑋) + 𝜖 is now given by a function 𝑓 ∶ R3 → R.
Again, a very useful framework is to consider functions that linearly combine predictors, so

𝑓(𝑋) = 𝛽0 + 𝛽1 𝑋1 + 𝛽2 𝑋2 + 𝛽3 𝑋3 .

We will discuss this in depth under the heading of multiple linear regression.

8.2.2 Noise

In the model
𝑌 = 𝑓(𝑋) + 𝜖,
the random error 𝜖 encodes the noise in the prediction problem.
The noise 𝜖 may contain unmeasured variables that are useful in predicting 𝑌 : Since we don’t measure
them, 𝑓 cannot use them for its prediction (e.g., TV matters but we only know and predict from
radio).
The noise 𝜖 may also contain unmeasurable variation (a stochastic aspect). For example, the risk of an
adverse reaction might vary for a given patient on a given day, depending on manufacturing variation
in the drug itself or the patient’s general feeling of well-being on that day.

8.2.3 Reducible and irreducible error

We assume that the true, but unknown, relationship between 𝑋 and 𝑌 is described by 𝑌 = 𝑓0 (𝑋)+𝜖0 .
Under the chosen model 𝑌 = 𝑓(𝑋) + 𝜖, the accuracy of $\hat{Y} = \hat{f}(X)$ as a prediction for 𝑌 depends on two quantities:

a **reducible** error + an **irreducible** error.

In general, 𝑓 ̂ is not a perfect estimate of 𝑓, and this inaccuracy will introduce some error. This error
is reducible because we can potentially improve the accuracy of our predictions by estimating 𝑓 via
more appropriate statistical learning techniques.


However, even if our model is correctly specified, meaning that $f_0 = f$, and we have a perfect estimate $\hat{f} = f_0$, we still face some prediction error, namely,
$$Y - \hat{Y} = Y - \hat{f}(X) = \epsilon_0\,.$$
Indeed, the noise 𝜖0 cannot be predicted using 𝑋. By the modeling assumption, 𝜖0 is independent of
𝑋. The prediction error resulting from the variability associated with 𝜖0 is known as the irreducible
error, because no matter how well we estimate 𝑓0 , we cannot reduce the error introduced by 𝜖0 .

8.3 Parametric methods for statistical learning

Parametric methods involve a two-step model-based approach.

1. Model the data by specifying assumptions about the form of 𝑓, e.g., 𝑓 is linear in 𝑋:
𝑓(𝑋) = 𝛽0 + 𝛽1 𝑋1 + 𝛽2 𝑋2 + ⋯ + 𝛽𝑝 𝑋𝑝 .
Estimating 𝑓 amounts to estimating a vector of unknown parameters 𝛽 = (𝛽0 , 𝛽1 , … , 𝛽𝑝 ).
2. Fit the specified model to training data. In the above linear model, we optimize the choice of
𝛽 so as to find a linear function 𝑓 that optimally approximates the training data. That is, we
want to find values of the parameters such that
𝑌 ≈ 𝛽0 + 𝛽1 𝑋1 + 𝛽2 𝑋2 + ⋯ + 𝛽𝑝 𝑋𝑝 .
The most common approach to optimize the choice of 𝛽 is referred to as (ordinary) least squares,
which we will discuss in detail soon. (Other approaches can be useful…)

8.3.1 Least squares for a linear model

The following plot shows a least squares line for the first 20 data points from Advertising. The red line
segments show the prediction errors. The blue line minimizes the sum of squared prediction errors,
among all possible lines.

[Figure: scatterplot of sales (Y) against TV (X) for the first 20 markets, with the least squares line in blue and the prediction errors as red segments.]


8.3.2 Non-linearity with the help of transformations

A straightforward method to enhance the expressivity of linear models is to generate additional pre-
dictors by transforming the original predictors. This allows one to form predictions that are non-linear
in the original predictors.

Example 8.3. (Quadratic regression)


Define 𝑋4 to be the squared TV budget, i.e., in terms of notation from above 𝑋4 = (𝑋2 )2 . Then the
following linear model in (𝑋2 , 𝑋4 ) specifies a quadratic model in 𝑋2 :

$$f(X) = \beta_0 + \beta_1 X_2 + \beta_2 X_4 = \beta_0 + \beta_1 X_2 + \beta_2 X_2^2\,.$$
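In R, such a transformed predictor can be added directly in the model formula. A minimal sketch with the advertising data from above; the name fit_quad is ours, and the I() wrapper ensures that TV is squared before the model is fit:

fit_quad <- lm(sales ~ TV + I(TV^2), data = advertising)
coef(fit_quad)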

Example 8.4 (Advertising cont.).

Non-linear prediction of Sales from TV

[Figure: non-linear predictions of sales from TV; left panel: quadratic fit f(X) = β₀ + β₁X₂ + β₂X₂², right panel: a fit with an additional transformed TV term.]

8.4 Non-parametric methods for statistical learning

Non-parametric methods do not make explicit assumptions about the functional form of 𝑓. Instead,
they seek an estimate 𝑓 ̂ that gets as close to the data points as possible without being too rough or
wiggly.


Advantages: Approximation of a wider range of possible shapes for 𝑓.


Disadvantage: A lot of data may be required to obtain accurate estimates.

Example 8.5. (Flexible function classes)


One common approach is to work with flexible function classes such as piecewise polynomial func-
tions (splines) or also neural networks. At times the number of parameters used to specify the func-
tion class is chosen as a function of the size of the available data set.

Example 8.6. (Nearest neighbors)


Nearest neighbor methods make predictions at a new input 𝑋 by finding one or more other inputs
𝑋 (𝑖) from the training data and averaging / combining the associated response values 𝑌 (𝑖) .

Example 8.7 (Example: 𝑘-NN regression). For a numeric response 𝑌 , 𝑘-nearest neighbor regression
forms the prediction as
$$\hat{f}(X) = \frac{1}{k}\,\big(Y_{i_1} + \dots + Y_{i_k}\big),$$
where 𝑖1 , … , 𝑖𝑘 ∈ {1, … , 𝑛} are the indices such that 𝑋𝑖1 , … , 𝑋𝑖𝑘 are the 𝑘 inputs in the training
data that are closest to the input 𝑋 for which we predict.
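A minimal sketch of this averaging step for a single numeric input, written directly in R; the vectors x_train, y_train and the value x_new are hypothetical placeholders:

knn_predict <- function(x_new, x_train, y_train, k = 3) {
  idx <- order(abs(x_train - x_new))[1:k]  # the k training inputs closest to x_new
  mean(y_train[idx])                       # average of the associated responses
}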

Nearest neighbor regression (1 and 3 neighbors)


[Figure: 1-nearest-neighbor and 3-nearest-neighbor regression fits for a small example data set.]


Example 8.8 (Example: 𝑘-NN regression with varying 𝑘). Results for 𝑘 = 3, 10, 25 for a data set of
size 𝑛 = 300.

[Figure: k-NN regression fits for k = 3, 10 and 25.]

8.5 Assessing model accuracy

ĺ No free lunch in statistical data analysis

No one method dominates all others over all possible data sets.

• Selecting a suitable method is often the most challenging part of performing statistical learning
in practice:
– is a linear model suitable?
– does adding transformations help?
– is a nonparametric method better?
– but if we apply, say, 𝑘-NN regression, what should be the value of 𝑘?
• Sometimes the best predictive performance results from ensemble methods that aver-
age/combine predictions from several different methods
• Speaking of “best” requires specifying a precise measure for prediction errors.


8.6 Measuring the quality of fit

How can we measure the accuracy of a predicted value in relation to the true response?
In regression, the most commonly-used measure is the mean squared error (MSE), which is an
average squared prediction error.
Let $(x_1, y_1), \dots, (x_n, y_n)$ be the observed values in a data set. Let $\hat{f}$ be a prediction rule that may be used to compute a prediction $\hat{f}(x_i)$ for each response value $y_i$. Then the MSE for $\hat{f}$ is defined as
$$\mathrm{MSE} = \frac{1}{n} \sum_{i=1}^{n} \big(y_i - \hat{f}(x_i)\big)^2\,.$$

The MSE will be small if the predicted responses are very close to the true responses and large if for
some of the observations, the predicted and true responses differ substantially.

8.6.1 Training and test data

In practice, the prediction rule 𝑓 ̂ is determined using a training data set consisting of the pairs
(𝑥1 , 𝑦1 ), … , (𝑥𝑛 , 𝑦𝑛 ). The term training MSE refers to the mean squared error for 𝑓,̂ calculated
by averaging the squared prediction errors based on the training data.
In contrast, the term test MSE refers to an MSE calculated using an independent test data set $(x_1^*, y_1^*), \dots, (x_n^*, y_n^*)$. So, while $\hat{f}$ is found using the training data, the test MSE is computed as
$$\mathrm{MSE} = \frac{1}{n} \sum_{i=1}^{n} \big(y_i^* - \hat{f}(x_i^*)\big)^2\,.$$

Statistical learning methods aim to find 𝑓 ̂ that minimizes training MSE, hoping that it will also mini-
mize test MSE. However, there is no guarantee that this will happen.
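A minimal sketch of both quantities, using an arbitrary random split of the advertising data into a training and a test part; the split size and the simple model are only for illustration:

set.seed(1)
train_id <- sample(nrow(advertising), size = 150)
train <- advertising[train_id, ]
test <- advertising[-train_id, ]
fit <- lm(sales ~ radio, data = train)
mean((train$sales - predict(fit, train))^2) # training MSE
mean((test$sales - predict(fit, test))^2)   # test MSE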


Example 8.9 (Estimating a sine function with polynomials).

[Figure: polynomial fits of degrees 1 and 2, degree 3, degree 4, and degrees 8 and 9 to data simulated around a sine function.]

Training versus test MSE

Figure 8.1: Figure 2.9 from ISLR (Left: true 𝑓 in black; estimates 𝑓 ̂ in color | Right: training MSE in
grey; test MSE in red)


8.6.2 What contributes to MSE?

• U-shape in test MSE curve is due to two competing properties of statistical learning methods:
bias versus variance.

• Bias arises when there are systematic deviations between test case predictions $\hat{f}(x_i^*)$ and the expected response $f_0(x_i^*)$. This occurs, in particular, when the learning method infers $\hat{f}$ from an overly simple class of functions M that fails to contain or closely approximate the true function $f_0$. That is, we select our model $f$ from the set M (e.g., all linear functions in 𝑋), but the true model/relationship $f_0$ is not included in M. The bias increases with the distance between $f_0$ and our chosen model class M.
• Variance encompasses chance errors that arise when the learned prediction rule 𝑓 ̂ heavily de-
pends on the noise in the training data. This occurs, for instance, when fitting polynomials of
overly high degree to training data.

ĺ Important

For accurate prediction, we need a learning method that is not only sufficiently flexible
but also filters out noise in the training data.

8.7 The formal bias-variance trade-off

View the training data as a random sample (𝑋1 , 𝑌1 ), … , (𝑋𝑛 , 𝑌𝑛 ), and consider the problem of pre-
dicting the response 𝑌 ∗ in an independent test case (𝑋 ∗ , 𝑌 ∗ ). Here, we take 𝑋 ∗ = 𝑥∗ to be fixed
(non-random) but generate the response randomly according to our model

𝑌 ∗ = 𝑓0 (𝑥∗ ) + 𝜖0 .

Then the prediction $\hat{f}(x^*)$ is a random variable whose expected test MSE can be shown (see Section C.1 for a proof) to decompose as

Bias-variance decomposition

$$\mathrm{E}\Big[\big(Y^* - \hat{f}(x^*)\big)^2\Big] = \mathrm{Var}\big[\hat{f}(x^*)\big] + \Big[\mathrm{Bias}\big(\hat{f}(x^*)\big)\Big]^2 + \mathrm{Var}[\epsilon_0]. \quad (8.1)$$

Here, the bias is defined as the deviation between expected prediction and expected response:
$$\mathrm{Bias}\big(\hat{f}(x^*)\big) = \mathrm{E}\big[\hat{f}(x^*)\big] - f_0(x^*)\,.$$

In the three-term decomposition, Var[𝜖0] is due to the irreducible error, whereas the contribution of reducible errors is decomposed into the variance and the squared bias of the prediction.


Example 8.10. Assume we have 100 observations (𝑥1 , 𝑦1 ), … , (𝑥100 , 𝑦100 ) from the model

𝑌 = 𝑓0 (𝑥) + 𝜖0 ,

with $f_0(x) = \beta_0 + \beta_1 x$, where $\beta_0 = 1$ and $\beta_1 = 2$. Based on these observations, we can compute an unbiased estimator $\hat{f}$, an alternative unbiased estimator $\hat{f}_v$ with increased variance, and a biased estimator $\hat{f}_b$.
Our estimates are given by
$$\hat{f}(x) = 1.16 + 1.991 \cdot x, \qquad \hat{f}_v(x) = 1.882 + 2.007 \cdot x, \qquad \hat{f}_b(x) = 2.16 + 1.991 \cdot x,$$
and for each one we can compute the MSE:
$$\mathrm{MSE}_{\hat{f}} \approx 0.798, \quad \mathrm{MSE}_{\hat{f}_v} \approx 1.664 \quad \text{and} \quad \mathrm{MSE}_{\hat{f}_b} \approx 1.798\,.$$

Our unbiased estimator with the smallest variance has the least MSE, while the other two are similar.
To determine if this finding is consistent, we should repeat the entire procedure multiple times.
Assume we take a sample of 100 observations and compute all three estimators 1000 times. The
following histograms show the estimates of the intercept 𝛽0 for all three estimators.

[Figure: histograms of the 1000 intercept estimates b₀ obtained from f̂ (top), f̂_b (middle) and f̂_v (bottom).]

Averaged over all 1000 samples we get the following MSE values:

$$\mathrm{MSE}_{\hat{f}} \approx 0.988, \quad \mathrm{MSE}_{\hat{f}_v} \approx 1.978 \quad \text{and} \quad \mathrm{MSE}_{\hat{f}_b} \approx 1.988\,,$$

which confirms our finding from above.


8.8 Regression versus classification problems

• So far we have discussed problems in which response is numeric (i.e., regression problems).
• When the response is instead categorical, then the prediction problem is a classification
problem.

• Each value of a categorical response defines a class that a test case may belong to.

Example 8.11. (Email spam)


Response is binary with, say, value 1 if an email is spam and value 0 otherwise.

Example 8.12. (Cancer diagnosis)


Response may have, e.g., the levels Acute Myelogenous Leukemia, Acute Lymphoblastic
Leukemia, or No Leukemia.

One important method to solve classification problems with two classes (i.e., a binary response) is
logistic regression. This is a suitable generalization of linear regression that we will discuss in-depth
later in the course.

8.9 Classification: Nearest neighbors

A simple nonparametric method: Majority vote among nearest neighbors

Figure 8.2: Figure 2.14 from ISLR: 3-NN


Algorithm

Input: New test observation 𝑥∗ .


Step 1: Select the number K of neighbors.
Step 2: Calculate the distance between 𝑥∗ and all available training data.
Step 3: Take the K nearest neighbors according to the calculated distance.
Step 4: Count the number of points belonging to each category among these K neighbors.
Step 5: Assign the new point to the category that is most frequently represented among these
K neighbors.
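A minimal sketch of these steps using the knn() function from the class package; the objects train_x, test_x and train_class are hypothetical placeholders for the training predictors, the new observations and the training labels:

library(class)
# train_x, test_x: numeric predictor data frames; train_class: factor of class labels
predicted_class <- knn(train = train_x, test = test_x, cl = train_class, k = 3)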

8.10 Selecting a learning method via a validation set

Let’s return to the practical problem of finding a statistical learning method that yields a good prediction rule $\hat{f}$ on the basis of a data set $(x_1, y_1), \dots, (x_n, y_n)$.
A good rule is one that generalizes well. That is, it gives accurate predictions $\hat{f}(x)$ for new data points (for which we observe the input value 𝑥 but not the response 𝑦).
As noted earlier, finding a learning method typically requires picking one of several possible mod-
els, one of several possible estimation methods, or possibly also setting tuning parameters (like the
number 𝑘 of nearest neighbors). A natural idea to make such choices is to

1. randomly split the available data into a training and a validation part, and
2. make all statistical choices such that the MSE (or a misclassification rate) for the validation
cases is minimized.

This leads to a specific learning method that can then be applied to the entire original data set
(𝑥1 , 𝑦1 ), … , (𝑥𝑛 , 𝑦𝑛 ) to form 𝑓.̂

8.10.1 Cross-validation

In order to reduce the variability due to the randomness in splitting the data, we may consider several
different random splits. This idea is typically implemented in the specific form of cross validation.
In 𝑣-fold cross validation, the data (𝑥1 , 𝑦1 ), … , (𝑥𝑛 , 𝑦𝑛 ) are randomly split into 𝑣 parts, also known
as the folds. Typical choices for the number of folds are 𝑣 = 5 or 𝑣 = 10.
Each fold then plays the role of the validation data once. This leads to 𝑣 validation errors, which
are averaged to form an overall cross-validation error CV(𝑣) . Learning methods are then designed to
minimize CV(𝑣) .


Example 8.13. (CV error in regression) For a regression problem, let MSE𝑖 be the MSE when pre-
dicting the data in the 𝑖-th fold (𝑖 = 1, … , 𝑣). Then the overall cross-validation error is the average
MSE:
$$\mathrm{CV}_{(v)} = \frac{1}{v} \sum_{i=1}^{v} \mathrm{MSE}_i\,.$$
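A minimal sketch of v-fold cross-validation for the simple advertising model; the fold assignment is random and the code is only for illustration (packages such as rsample automate this):

set.seed(1)
v <- 5
fold <- sample(rep(1:v, length.out = nrow(advertising)))
cv_mse <- sapply(1:v, function(i) {
  fit <- lm(sales ~ radio, data = advertising[fold != i, ])
  test <- advertising[fold == i, ]
  mean((test$sales - predict(fit, test))^2)  # MSE_i on the held-out fold
})
mean(cv_mse)  # CV_(v)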

Illustrating cross-validation

The idea is well illustrated through the following graphic:

Figure 8.3: 4-fold cross validation; from Hardin, Çetinkaya-Rundel (2021). Introduction to Modern
Statistics

Example 8.14 (Email spam).

• Data: email from library(openintro)


• Task: classify emails as spam / no spam
• Predictors: features such as number of characters, attachment?, contains word “winner”?, …

Below is a plot of accuracy (% correctly classified) in 5-fold cross-validation for 𝑘-NN classification.

[Figure: 5-fold cross-validation accuracy of k-NN classification plotted against the number of neighbors (0 to 30); accuracy ranges from about 0.92 to 0.935.]
Note: 10.3% of emails in the dataset are spam.


8.11 Supervised versus unsupervised learning

Our discussion focused on supervised learning, where we observe a response in the available data.
In contrast, unsupervised learning problems are learning problems in which we do not get to observe
data for a response.

Example 8.15. (Cluster analysis)


Each data point may really belong to one of three classes but this information is not part of the data.
The learning problem amounts to grouping data points into three clusters.

[Figure: two scatterplots of data points (X1, X2), illustrating group structure.]

Each data point is a pair (𝑋1 , 𝑋2 ). Colors/Plotting symbols indicate which group each data point
belongs to. But imagine not knowing the color and having to form three clusters.

Reference

This chapter is based on Chapter 2 in James et al. (2021).


Short summary

This chapter offers an introduction to statistical learning, which involves methods for predicting
outputs based on inputs. It differentiates between supervised learning, where predictions
are guided by an output, covering regression for numeric outputs and classification for categori-
cal ones, and unsupervised learning, which explores data without a supervising output. The
text discusses modelling noisy relationships, highlighting the concepts of reducible and
irreducible error, and explores both parametric, model-based approaches like linear regression
and non-parametric methods such as nearest neighbours. Furthermore, it addresses how to
assess model accuracy using metrics like mean squared error and the importance of training
and test data, alongside the bias-variance trade-off. Finally, the text touches upon classi-
fication problems and methods for selecting learning approaches using validation sets and
cross-validation, contrasting supervised with unsupervised learning.

9 Linear regression

This chapter is about linear regression, a straightforward approach to supervised learning.


Linear regression is a useful tool for predicting a quantitative response.
Although linear regression may seem outdated in comparison to more modern statistical learning
approaches, it remains a valuable and commonly used statistical learning method.
Consequently, the importance of having a good understanding of linear regression before studying
more complex learning methods cannot be overstated. In this chapter, we review some of the key ideas
underlying the linear regression model, as well as the least squares approach that is most commonly
used to fit this model.
As predictor variables, we will use numerical and categorical variables.

9.1 Simple linear regression

Simple linear regression allows us to predict a quantitative response 𝑌 based on a single pre-
dictor variable 𝑋. It assumes that there is approximately a linear relationship between 𝑋 and 𝑌 . We
can write this linear relationship as
𝑌 ≈ 𝛽 0 + 𝛽1 𝑋 .
You can interpret the symbol ≈ as meaning “is approximately modeled as”. We say that 𝑌 is regressed
on 𝑋.
The simple linear regression model for the i-th observation is then given by the equation

𝑌𝑖 = 𝛽0 + 𝛽1 𝑥𝑖 + 𝜖𝑖 , 𝑖 ∈ {1, … , 𝑛}, (9.1)

where 𝑥𝑖 is the i-th observation of the predictor variable 𝑋 and 𝜖1 , … , 𝜖𝑛 are independent random
error terms with zero mean and constant variance 𝜎2 , which are independent of 𝑋.

Remark. Later on, we will also make inference for the slope parameter. There, we will make the
additional assumption that the random errors have a N (0, 𝜎2 ) distribution.


The regression parameters 𝛽0 and 𝛽1 are two unknown constants that represent the intercept and
slope terms in the linear model.
Once we have computed estimates 𝛽0̂ and 𝛽1̂ for the regression parameters, we can predict future
response values based on (new) values 𝑥 of the predictor variable

𝑦 ̂ = 𝛽0̂ + 𝛽1̂ 𝑥 ,

where 𝑦 ̂ indicates a prediction of 𝑌 based on 𝑋 = 𝑥.

Here and in the following, we use a hat symbol ( ̂ ) to denote the estimated value of an unknown parameter, or to denote the predicted value of the response.

Data

In this chapter we will work with the poverty dataset.

poverty
# # A tibble: 51 x 7
# State Metro_Res White Graduates Poverty Female_House Region
# <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <fct>
# 1 Alabama 55.4 71.3 79.9 14.6 14.2 South
# 2 Alaska 65.6 70.8 90.6 8.3 10.8 West
# 3 Arizona 88.2 87.7 83.8 13.3 11.1 West
# 4 Arkansas 52.5 81 80.9 18 12.1 South
# 5 California 94.4 77.5 81.1 12.8 12.6 West
# 6 Colorado 84.5 90.2 88.7 9.4 9.6 West
# # i 45 more rows

The scatterplot below illustrates the relationship between high school graduation rates (in percent)
across all 50 US states and DC, and the percentage of residents living below the poverty level 1 .

ggplot(poverty, aes(x = Graduates, y = Poverty)) +
  geom_point() +
  labs(x = "high school graduation rate",
       y = "perc. living below the poverty line")

1
income below $23050 for a family of 4 in 2012.


[Figure: scatterplot of the percentage living below the poverty line against the high school graduation rate.]

Poverty is the response, and Graduates is the predictor variable. The relationship can be described
as a moderately strong negative linear relationship.

9.1.1 Least squares fit

In practice, the regression parameters 𝛽0 and 𝛽1 are unknown. So, we must use the data to estimate
them. The data consists of 𝑛 paired observations

(𝑥1 , 𝑦1 ), (𝑥2 , 𝑦2 ), … , (𝑥𝑛 , 𝑦𝑛 ) ,

which are measurements of 𝑋 and 𝑌 .


Our goal is to obtain estimates 𝛽0̂ and 𝛽1̂ such that the linear model fits the available data well, i.e.,

𝑦𝑖 ≈ 𝛽0̂ + 𝛽1̂ 𝑥𝑖 , 𝑖 ∈ {1, … , 𝑛} .

This means we want to find an intercept 𝛽0̂ and a slope 𝛽1̂ such that the distance between the observations 𝑦𝑖 and the predicted values 𝑦𝑖̂ on the regression line is as small as possible for all 𝑛 observations.
There are several ways of measuring the distance. However, the most common approach in this
setting is minimizing the least squares distance, and we take that approach in this chapter.


Example 9.1. As an illustrative example we consider the relation between Poverty and Graduates
only in the states Florida, Louisiana, Minnesota and Washington. The observed Poverty values are
12.1, 17, 6.5 and 10.8. Imagine for a moment we want to describe the relation between Graduates
and Poverty only using the observations (79.8, 17) and (91.6, 6.5), which belong to Louisiana and
Minnesota, respectively. In this case, we simply remember from secondary school how to compute a slope and use it as our best estimate of $\beta_1$,
$$\hat{\beta}_1 = \frac{6.5 - 17}{91.6 - 79.8} \approx -0.8898\,.$$

The intercept is given by 𝛽0̂ = 6.5 + 0.8898 ⋅ 91.6 ≈ 88.0085.


Given only two observations, we will not produce any error between the observations 𝑦1 , 𝑦2 and our
predictions 𝑦1̂ = 88.0085 − 0.8898 ⋅ 79.8, 𝑦2̂ = 88.0085 − 0.8898 ⋅ 91.6. The results are visualized in
the following figure.

Figure 9.1: Fitted least squares regression line for the regression of Poverty onto Graduates given
only the observations from Louisiana and Minnesota. The observations for Florida and
Washington are added in red.

When we consider all four observations, it’s clear that we cannot fit a single line through all of the
points, as shown in Figure 9.1. This raises the question: how can we determine the line that best fits
this point cloud?
The goal is to keep the distance between our predictions 𝑦𝑖̂ and the observations 𝑦𝑖 , for 𝑖 ∈ {1, 2, 3, 4},
as small as possible. It will not be zero but should be minimized. There are several distance measures
available, but we will use squared distances, as they are easier to work with.
We consider the prediction 𝑦𝑖̂ = 𝑏0 + 𝑏1 𝑥𝑖 as a function of two parameters, 𝑏0 ∈ R and 𝑏1 ∈ R and
compute the squared distance between the predicted values and the actual observations, 𝑦𝑖 , for all


data points. By summing all these squared distances, we obtain a total that remains a function of the
two parameters. This sum can then be minimized with respect to 𝑏0 and 𝑏1 .
In our example, this will lead to the function

$$\begin{aligned}
f(b_0, b_1) &= (y_1 - \hat{y}_1(b_0, b_1))^2 + (y_2 - \hat{y}_2(b_0, b_1))^2 + (y_3 - \hat{y}_3(b_0, b_1))^2 + (y_4 - \hat{y}_4(b_0, b_1))^2 \\
&= (12.1 - (b_0 + b_1\, 84.7))^2 + (17 - (b_0 + b_1\, 79.8))^2 + (6.5 - (b_0 + b_1\, 91.6))^2 + (10.8 - (b_0 + b_1\, 89.1))^2\,,
\end{aligned}$$

which has a minimum at the point where the partial derivatives with respect to $b_0$ and $b_1$ vanish. Hence, our two estimates $\hat{\beta}_{0,4}$ and $\hat{\beta}_{1,4}$ (the 4 denotes the fact that we have four observations) will be the solution of the following system of equations:
$$\begin{aligned}
\frac{\mathrm{d}}{\mathrm{d}b_0} f(b_0, b_1) &= -2(12.1 - (b_0 + b_1\, 84.7)) - 2(17 - (b_0 + b_1\, 79.8)) - 2(6.5 - (b_0 + b_1\, 91.6)) - 2(10.8 - (b_0 + b_1\, 89.1)) \\
&= -92.8 + 8 \cdot b_0 + 690.4 \cdot b_1 = 0\,, \\
\frac{\mathrm{d}}{\mathrm{d}b_1} f(b_0, b_1) &= -2 \cdot 84.7\,(12.1 - (b_0 + b_1\, 84.7)) - 2 \cdot 79.8\,(17 - (b_0 + b_1\, 79.8)) - 2 \cdot 91.6\,(6.5 - (b_0 + b_1\, 91.6)) - 2 \cdot 89.1\,(10.8 - (b_0 + b_1\, 89.1)) \\
&= -7878.3 + 690.4 \cdot b_0 + 59743 \cdot b_1 = 0\,,
\end{aligned}$$
whose solution is $\hat{\beta}_{0,4} \approx 81.26$ and $\hat{\beta}_{1,4} \approx -0.8072$. The fitted least squares regression line is shown in Figure 9.2.

Definition 9.1. Let 𝛽0̂ and 𝛽1̂ be estimates of intercept and slope. The residual of the 𝑖-th observa-
tion (𝑥𝑖 , 𝑦𝑖 ) is the difference of the observed response 𝑦𝑖 and the prediction based on the model
fit 𝑦𝑖̂ = 𝛽0̂ + 𝛽1̂ 𝑥𝑖 :
𝑒𝑖 = 𝑦𝑖 − 𝑦𝑖̂ .

Considering the predictions 𝑦𝑖̂ as a function of two parameters 𝑏0 ∈ R and 𝑏1 ∈ R allows us to say
that the least squares estimates minimize the residual sum of squares

$$\mathrm{RSS}(b_0, b_1) = e_1^2 + \dots + e_n^2 = (y_1 - \hat{y}_1(b_0, b_1))^2 + \dots + (y_n - \hat{y}_n(b_0, b_1))^2 = (y_1 - (b_0 + b_1 x_1))^2 + \dots + (y_n - (b_0 + b_1 x_n))^2$$
with respect to 𝑏0 and 𝑏1 . The least squares estimates are now defined in the following definition.



Figure 9.2: Fitted least squares regression line for the regression of Poverty onto Graduates. Each
red line segment represents one of the errors 𝑦𝑖 − 𝑦𝑖̂ .

Definition 9.2. The least squares estimates $\hat{\beta}_{0,n}$ and $\hat{\beta}_{1,n}$ (point estimates) for the parameters $\beta_0$ and $\beta_1$ (population parameters) are defined as the point $(\hat{\beta}_{0,n}, \hat{\beta}_{1,n})$ that minimizes the function
$$\mathrm{RSS}(b_0, b_1) = \sum_{i=1}^{n} e_i^2 = \sum_{i=1}^{n} (y_i - (b_0 + b_1 x_i))^2\,,$$
i.e.,
$$(\hat{\beta}_{0,n}, \hat{\beta}_{1,n}) = \operatorname*{argmin}_{(b_0, b_1) \in \mathbb{R}^2} \mathrm{RSS}(b_0, b_1)\,.$$

The least squares estimates are given by
$$\hat{\beta}_{1,n} = \frac{\sum_{i=1}^{n} (x_i - \bar{x}_n)(y_i - \bar{y}_n)}{\sum_{i=1}^{n} (x_i - \bar{x}_n)^2}\,, \qquad \hat{\beta}_{0,n} = \bar{y}_n - \hat{\beta}_{1,n}\, \bar{x}_n\,.$$

Remark. We often drop the index 𝑛 and denote the least squares estimates just by 𝛽0̂ and 𝛽1̂ .
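These two formulas can be evaluated directly for the poverty data; a minimal sketch whose result should agree with the lm() fit later in this chapter:

x <- poverty$Graduates
y <- poverty$Poverty
b1 <- sum((x - mean(x)) * (y - mean(y))) / sum((x - mean(x))^2)
b0 <- mean(y) - b1 * mean(x)
c(b0, b1)
# approximately 64.78 and -0.62, matching the lm() output below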

The least squares estimates are the solution to the minimization problem $\operatorname*{argmin}_{(b_0, b_1) \in \mathbb{R}^2} \mathrm{RSS}(b_0, b_1)$. To understand why this is actually true, we will first rewrite our regression model Equation 9.1 using matrix notation.
The simple linear regression model is defined through the following equation
$$\mathbf{Y} = \mathbf{X}\beta + \epsilon = \begin{pmatrix} 1 & x_1 \\ 1 & x_2 \\ \vdots & \vdots \\ 1 & x_n \end{pmatrix} \begin{pmatrix} \beta_0 \\ \beta_1 \end{pmatrix} + \begin{pmatrix} \epsilon_1 \\ \epsilon_2 \\ \vdots \\ \epsilon_n \end{pmatrix},$$


with response vector $\mathbf{Y} = (Y_1, Y_2, \dots, Y_n)^\top \in \mathbb{R}^n$, design matrix $\mathbf{X} \in \mathbb{R}^{n \times 2}$, population parameters $\beta \in \mathbb{R}^2$ and residual errors $\epsilon \in \mathbb{R}^n$. Using this notation, we get $\hat{\mathbf{y}} = \mathbf{X}\hat{\beta}$, which represents the fitted values. From this, we can derive an alternative form of the residual sum of squares:
$$\mathrm{RSS}(b_0, b_1) = \sum_{i=1}^{n} (y_i - (b_0 + b_1 x_i))^2 = (\mathbf{y} - \mathbf{X}\mathbf{b})^\top (\mathbf{y} - \mathbf{X}\mathbf{b})\,.$$
Taking the derivative with respect to $b_0$ and $b_1$ leads to the following system of equations
$$\mathbf{X}^\top \mathbf{X}\mathbf{b} = \mathbf{X}^\top \mathbf{y}\,,$$
which are called the normal equations, and they have the solution
$$\hat{\beta}(\mathbf{y}) = (\mathbf{X}^\top \mathbf{X})^{-1} \mathbf{X}^\top \mathbf{y} = \begin{pmatrix} \bar{y}_n - \dfrac{\sum_{i=1}^{n} (x_i - \bar{x}_n)(y_i - \bar{y}_n)}{\sum_{i=1}^{n} (x_i - \bar{x}_n)^2}\, \bar{x}_n \\[2ex] \dfrac{\sum_{i=1}^{n} (x_i - \bar{x}_n)(y_i - \bar{y}_n)}{\sum_{i=1}^{n} (x_i - \bar{x}_n)^2} \end{pmatrix}.$$

See Section C.2 for a proof of this last result.

Remark. The notation $\hat{\beta}(\mathbf{y})$ indicates that the formula is evaluated using the observed values $\mathbf{y}$. In this situation, we typically simplify the notation and just write $\hat{\beta}$. If the formula for the estimator uses the response vector $\mathbf{Y}$, i.e., computing $\hat{\beta}(\mathbf{Y})$, the estimator becomes a random quantity (since $\mathbf{Y}$ is random) and it makes sense to compute an expectation or variance of $\hat{\beta}(\mathbf{Y})$.

Example 9.2. Consider again our model of regressing Poverty onto Graduates. The normal equations for this model are given by
$$\begin{pmatrix} 1 & 1 & \cdots & 1 \\ 79.9 & 90.6 & \cdots & 90.9 \end{pmatrix} \begin{pmatrix} 1 & 79.9 \\ 1 & 90.6 \\ \vdots & \vdots \\ 1 & 90.9 \end{pmatrix} \begin{pmatrix} b_0 \\ b_1 \end{pmatrix} = \begin{pmatrix} 1 & 1 & \cdots & 1 \\ 79.9 & 90.6 & \cdots & 90.9 \end{pmatrix} \begin{pmatrix} 14.6 \\ 8.3 \\ \vdots \\ 9.5 \end{pmatrix}.$$
Simplifying these equations leads to
$$\begin{pmatrix} 51 & 4386.6 \\ 4386.6 & 377993.36 \end{pmatrix} \begin{pmatrix} b_0 \\ b_1 \end{pmatrix} = \begin{pmatrix} 578.8 \\ 49352.39 \end{pmatrix}.$$

Solving this system of equations with respect to 𝑏0 and 𝑏1 gives the least-squares estimates:

XtX <- matrix(c(51, 4386.6, 4386.6, 377993.36), ncol = 2, byrow = TRUE)


Xy <- c(578.8, 49352.39)

solve(XtX, Xy)
# [1] 64.7809658 -0.6212167


These estimates differ from our earlier ones based only on Louisiana and Minnesota data. However,
they are more aligned with the estimates derived from the complete dataset, which we will see later.

Given

poverty |>
summarise(mean_pov = mean(Poverty),
mean_grad = mean(Graduates),
sd_pov = sd(Poverty),
sd_grad = sd(Graduates))
# # A tibble: 1 x 4
# mean_pov mean_grad sd_pov sd_grad
# <dbl> <dbl> <dbl> <dbl>
# 1 11.3 86.0 3.10 3.73
cor(poverty$Poverty, poverty$Graduates)
# [1] -0.7468583

we can compute the estimated slope. The formula for the slope estimator was $\hat{\beta}_1 = \frac{\sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y})}{\sum_{i=1}^{n} (x_i - \bar{x})^2}$, which can be rearranged in the following way
$$\hat{\beta}_{1,n} = \frac{\sum_{i=1}^{n} (x_i - \bar{x}_n)(y_i - \bar{y}_n)}{\sum_{i=1}^{n} (x_i - \bar{x}_n)^2} = \frac{(n-1)\, s_{xy,n}}{(n-1)\, s_{x,n}^2} = \frac{s_{xy,n}}{s_{x,n} \cdot s_{x,n}} = \frac{s_{y,n}}{s_{x,n}} \cdot \frac{s_{xy,n}}{s_{x,n} \cdot s_{y,n}} = \frac{s_{y,n}}{s_{x,n}} \cdot r_{(x,y),n}\,.$$
Now we can plug the given output into the above formula and get the following estimate
$$\hat{\beta}_{1,n} = \frac{s_{y,n}}{s_{x,n}} \cdot r_{(x,y),n} \approx \frac{3.10}{3.73} \cdot (-0.75) = -0.62\,.$$

Interpretation

When comparing two states, then for each additional percentage point in the high school
graduation rate, we would expect the percentage of people living in poverty to be 0.62 percent-
age points lower on average.


Given

poverty |>
summarise(mean_pov = mean(Poverty), mean_grad = mean(Graduates),
sd_pov = sd(Poverty), sd_grad = sd(Graduates))
# # A tibble: 1 x 4
# mean_pov mean_grad sd_pov sd_grad
# <dbl> <dbl> <dbl> <dbl>
# 1 11.35 86.01 3.099 3.726

and the slope estimate $\hat{\beta}_{1,n} = -0.62$, we can compute the estimated intercept
$$\hat{\beta}_{0,n} = \bar{y}_n - \hat{\beta}_{1,n}\, \bar{x}_n \approx 11.35 - (-0.62) \cdot 86.01 = 64.68\,.$$

¾ Your turn

Which of the following is the correct interpretation of the intercept?


A For each percentage point increase in high school graduation rate, the percentage of people
living in poverty is expected to increase on average by 64.68%.
B Having no high school graduates leads to 64.68% of residents living below the poverty line.
C States with no high school graduates are expected, on average, to have 64.68% of residents
living below the poverty line.
D In states with no high school graduates, the poverty percentage is expected to increase on
average by 64.68%.

Summary: interpretation of slope and intercept

Intercept: In an observation with 𝑥 = 0, the response 𝑦 is expected to equal the intercept.


Slope: When comparing two observations, the slope tells us how much we expect their values
of 𝑦 to differ for each unit difference we see for their values of 𝑥.

Be aware: Our interpretation of intercept and slope is not causal, where by causal, we mean effects
resulting from interventions (such as policy changes, new treatments, …).
When interpreting the slope in our example, we consider differences in the expected responses of
two states with (naturally) different high school graduation rates. We do not conclude that the slope
provides an estimate of how an intervention that in-/decreases the high school graduation rate in one
state leads to a de-/increase in the poverty rate in that same state.
Remember: Causal conclusions may be drawn if a study is a randomized controlled experiment
(i.e., the value of 𝑥 was controlled and randomly assigned by the experimenter).


Fitting the linear regression model in R

In R, one can fit a linear regression model by using the function lm()

(model_pov_0 <- lm(Poverty ~ Graduates, data = poverty))


#
# Call:
# lm(formula = Poverty ~ Graduates, data = poverty)
#
# Coefficients:
# (Intercept) Graduates
# 64.7810 -0.6212

Note

After fitting the model, one should analyse the residuals. To check:

• if the residuals show no structure → implies that relationship between predictor and
response is roughly linear
• the shape of the distribution of the residuals
• if variability of residuals around the 0 line is roughly constant

The process for carrying out these three checks is explained in Section 14.2.

In this chapter, we will focus on the residual plot, which is a scatterplot of the residuals 𝑒𝑖 against
the predicted values 𝑦𝑖̂ . It helps to evaluate how well a linear model fits a dataset. To create the plot,
we first add the residuals and the predictions to the dataset with the functions add_residuals() and
add_predictions() from the modelr package.

library(modelr)
poverty <- poverty |>
  add_residuals(model = model_pov_0) |>
  add_predictions(model = model_pov_0)
# poverty now contains the additional columns pred and resid
ggplot(poverty, aes(x = pred, y = resid)) +
  geom_point() +
  labs(x = "predicted perc. of people living below the poverty line",
       y = "residuals") +
  geom_hline(yintercept = 0, linetype = 2)



Figure 9.3: The given residual plot indicates a not-so-bad fit since no apparent structure is visible.
One might say that there are two clusters (below and above 12%) with slightly different
variability.

Note

One purpose of residual plots is to identify characteristics or patterns still apparent in data after
fitting a model.
If the chosen model fits the data rather well, there should be no pattern.

9.1.2 Quantifying the relationship

We introduced in Definition 5.7 the empirical correlation coefficient

$$r_{(x,y),n} = \frac{\sum_{i=1}^{n} (x_i - \bar{x}_n)(y_i - \bar{y}_n)}{\sqrt{\sum_{i=1}^{n} (x_i - \bar{x}_n)^2 \sum_{i=1}^{n} (y_i - \bar{y}_n)^2}} = \frac{\sum_{i=1}^{n} (x_i - \bar{x}_n)(y_i - \bar{y}_n)}{(n-1)\, s_{x,n}\, s_{y,n}}$$

as a measure of linear association between two variables 𝑥 and 𝑦.


Assume $y_i = \hat{\beta}_0 + \hat{\beta}_1 x_i$ with $(\hat{\beta}_0, \hat{\beta}_1) \in \mathbb{R}^2$. Then
$$r_{(x,y),n} = \frac{\sum_{i=1}^{n} (x_i - \bar{x}_n)\big(\hat{\beta}_0 + \hat{\beta}_1 x_i - (\hat{\beta}_0 + \hat{\beta}_1 \bar{x}_n)\big)}{(n-1)\, s_{x,n} \sqrt{\frac{1}{n-1} \sum_{i=1}^{n} \big(\hat{\beta}_0 + \hat{\beta}_1 x_i - (\hat{\beta}_0 + \hat{\beta}_1 \bar{x}_n)\big)^2}} = \frac{\sum_{i=1}^{n} (x_i - \bar{x}_n)(\hat{\beta}_1 x_i - \hat{\beta}_1 \bar{x}_n)}{(n-1)\, s_{x,n} \sqrt{\frac{1}{n-1} \sum_{i=1}^{n} \hat{\beta}_1^2 (x_i - \bar{x}_n)^2}} = \frac{\hat{\beta}_1\, s_{x,n}^2}{|\hat{\beta}_1|\, s_{x,n}^2} = \begin{cases} -1, & \hat{\beta}_1 < 0 \\ 0, & \hat{\beta}_1 = 0 \\ 1, & \hat{\beta}_1 > 0 \end{cases}\,,$$


which corresponds to the cases of perfect negative (-1) / positive (+1) correlation and no linear asso-
ciation (0). In general 𝑟(𝑥,𝑦),𝑛 ∈ [−1, 1].

¾ Your turn

Which of the following is the best guess for the correlation between % in poverty and the high
school graduation rate?
A -0.75
B -0.1
C 0.02
D -1.5

¾ Your turn

Which of the following plots shows the strongest correlation, i.e. correlation coefficient closest
to +1 or -1?

[Figure: four scatterplots labelled A, B, C and D, showing linear relationships of different strengths and directions.]

9.1.3 Extrapolation

When regressing the percentage of people living below the poverty line onto the percentage of high
school graduates, we estimated an intercept of 64.78. Since there are no states in the dataset with no
high school graduates, the intercept is of no interest, not very useful, and also not reliable since
the predicted value of the intercept is so far from the bulk of the data.


[Figure: scatterplot of poverty against high school graduation rate with the regression line extended down to a graduation rate of 0, far outside the range of the observed data.]

Applying a fitted model to values outside the original data realm is called extrapolation. Sometimes,
the intercept might be an extrapolation.

Example 9.3. Figure 9.4 shows the median age for each year of men living in the US at first marriage.
Using only the data up to 1950 leads to a trend that dramatically underestimates the median age for
1970 and onwards.

[Figure: median age at first marriage by year. Data source: https://www.census.gov/data/tables/time-series/demo/families/marital.html]

Figure 9.4: Median age at first marriage for men living in the US.

Example 9.4. In 2004, the BBC reported that women “may outsprint men by 2156”. In their report
they were referring to results found in Tatem et al. (2004).
The study’s authors fitted linear regression lines to the winning times of males and females over the
past 100 years.


Then, they extrapolated these trends to the 2008 Olympic Games and concluded that the women's 100-meter race could be won in a time of 10.57 ± 0.232 seconds and the men's event in 9.73 ± 0.144 seconds. The actual winning times were 10.780 and 9.690 seconds, respectively, both within the given 95% confidence intervals.
But already in the Tokyo 2020 Olympics, this wasn’t the case anymore.

Figure 9.5: Momentous sprint at the 2156 Olympics? Figure from Tatem et al. (2004); we added the
Tokyo 2020 results.

9.1.4 Assessing the accuracy of the model

The quality of a linear regression fit can be evaluated using the residual standard error (RSE) and the
R squared value (𝑅2 ), among other criteria.

Residual standard error

Recall that the linear regression model Equation 9.1 assumes that the response 𝑌𝑖 is a linear combi-
nation of the linear predictor 𝛽0 + 𝛽1 𝑥𝑖 and the error 𝜖𝑖 .
The RSE is an estimate of the standard deviation 𝜎 of the unobservable random error 𝜖𝑖 . Remember
that all 𝜖𝑖 are assumed to have the same variance. It is defined by the following formula
$$\mathrm{RSE} = \sqrt{\frac{1}{n-2}\, \mathrm{RSS}} = \sqrt{\frac{1}{n-2} \sum_{i=1}^{n} (y_i - \hat{y}_i)^2}\,.$$

The RSE is a measure of how well the model fits the data.


If the model’s predictions closely match the actual outcome values, the RSE will be small, indicating
a good fit. Conversely, if the predictions 𝑦𝑖̂ differ significantly from the actual values 𝑦𝑖 for some
observations, the RSE will be large, suggesting a poor fit of the model to the data.
The glance() function from the broom package computes several fit criteria. The RSE is denoted by sigma.

broom::glance(model_pov_0)
# # A tibble: 1 x 12
# r.squared adj.r.squared sigma statistic p.value df
# <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
# 1 0.5578 0.5488 2.082 61.81 3.109e-10 1
# # i 6 more variables: logLik <dbl>, AIC <dbl>, BIC <dbl>,
# # deviance <dbl>, df.residual <int>, nobs <int>

Roughly speaking, the RSE measures the average amount that the response will deviate from the true
regression line.
In our example, this means that the actual percentage of people living below the poverty line in
each state differs from the true regression line by approximately two percentage points, on average.
Whether or not 2 percentage points is an acceptable prediction error depends on the problem context. For the poverty dataset, the mean percentage living below the poverty line over all states is approximately 11.35, and so the percentage error is $\frac{2.082}{11.35} \cdot 100\% \approx 18\%$.
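The value reported as sigma can also be reproduced by hand from the residuals of the fitted model; a minimal sketch:

sqrt(sum(residuals(model_pov_0)^2) / (nrow(poverty) - 2))
# approximately 2.082, the value reported as sigma by glance(model_pov_0)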

𝑅2

The RSE provides an absolute measure of the lack of fit. However, since it is measured in the units
of the response, it is not always clear what constitutes a good RSE. The 𝑅2 statistic provides an
alternative measure of fit. It takes the form of a proportion and is hence independent of the scale of
𝑌.

Definition 9.3. The strength of the fit of a linear model can be evaluated using the R squared value
𝑅2 . It is defined in the following way
$$R^2 = \frac{\mathrm{TSS} - \mathrm{RSS}}{\mathrm{TSS}} = \frac{\sum_{i=1}^{n} (y_i - \bar{y}_n)^2 - \sum_{i=1}^{n} (y_i - \hat{y}_i)^2}{\sum_{i=1}^{n} (y_i - \bar{y}_n)^2}\,,$$

where TSS is called total sum of squares.

Remark.

1. The R squared value is also called coefficient of determination.


2. One can show that the $R^2$ value is equal to the square of the correlation coefficient in the simple linear model, i.e.,
$$R^2 = r_{(x,y),n}^2\,.$$
This equation implies that $R^2 \in [0, 1]$.

3. One can show that
$$\sum_{i=1}^{n} (\hat{y}_i - \bar{y}_n)^2 = \sum_{i=1}^{n} (y_i - \bar{y}_n)^2 - \sum_{i=1}^{n} (y_i - \hat{y}_i)^2\,,$$
which, together with
$$\sum_{i=1}^{n} e_i = \sum_{i=1}^{n} (y_i - \hat{y}_i) = \sum_{i=1}^{n} y_i - \sum_{i=1}^{n} \hat{\beta}_0 - \sum_{i=1}^{n} \hat{\beta}_1 x_i = n\bar{y}_n - n\hat{\beta}_0 - n\hat{\beta}_1 \bar{x}_n = n\bar{y}_n - n(\bar{y}_n - \hat{\beta}_1 \bar{x}_n) - n\hat{\beta}_1 \bar{x}_n = 0$$
$$\Leftrightarrow \quad \sum_{i=1}^{n} y_i = \sum_{i=1}^{n} \hat{y}_i \quad \Leftrightarrow \quad \bar{y}_n = \bar{\hat{y}}_n\,,$$
implies that
$$R^2 = \frac{\sum_{i=1}^{n} (\hat{y}_i - \bar{y}_n)^2}{\sum_{i=1}^{n} (y_i - \bar{y}_n)^2} = \frac{s_{\hat{y}}^2}{s_y^2}\,,$$
where $s_{\hat{y}}^2$ denotes the empirical variance of the fitted values.

Interpretation

𝑅2 is the percentage of variability in the response variable that the model explains.
The remainder of the variability is explained by variables not included in the model or by inher-
ent randomness in the data.

The 𝑅2 value of a linear model object, like

model_pov_0
#
# Call:
# lm(formula = Poverty ~ Graduates, data = poverty)
#
# Coefficients:
# (Intercept) Graduates
# 64.7810 -0.6212

is contained in the summary of the model under Multiple R-squared.


summary(model_pov_0)
#
# Call:
# lm(formula = Poverty ~ Graduates, data = poverty)
#
# Residuals:
# Min 1Q Median 3Q Max
# -4.1624 -1.2593 -0.2184 0.9611 5.4437
#
# Coefficients:
# Estimate Std. Error t value Pr(>|t|)
# (Intercept) 64.78097 6.80260 9.523 9.94e-13 ***
# Graduates -0.62122 0.07902 -7.862 3.11e-10 ***
# ---
# Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#
# Residual standard error: 2.082 on 49 degrees of freedom
# Multiple R-squared: 0.5578, Adjusted R-squared: 0.5488
# F-statistic: 61.81 on 1 and 49 DF, p-value: 3.109e-10

It can be extracted from the summary using the $ notation

summary(model_pov_0)$r.squared
# [1] 0.5577973

¾ Your turn

Which of the below is the correct interpretation of 𝑟(𝑥,𝑦),𝑛 = −0.75 and 𝑅2 = 0.56?
A The model explains 56% of the variability in the percentage of high school graduates among
the 51 states.
B The model explains 56% of the variability in the percentage of residents living in poverty
among the 51 states.
C 56% of the time the percentage of high school graduates predict the percentage of residents
living in poverty correctly.
D The model explains 75% of the variability in the percentage of residents living in poverty
among the 51 states.


9.2 Multiple linear regression

Simple linear regression is a useful method for predicting a response based on a single predictor
variable. However, in practical applications, we often have more than one predictor variable.
In the poverty data, we have used the high school graduation rate as predictor. However, the dataset
contains much information. It consists of the following variables:

• State: US state
• Metro_Res: metropolitan residence
• White: percent of population that is white
• Graduates: percent of high school graduates
• Female_House: percent of female householder families (no husband present)
• Poverty: percent living below the poverty line
• Region: region in the United States

When describing the relationship between multiple predictor variables and the response variable
Poverty, fitting separate simple linear regressions for each predictor is not ideal. This approach over-
looks potential dependencies between the predictor variables and does not provide a single prediction
for Poverty based on multiple fits.
Therefore, we extend the simple linear regression model in such a way that it can directly accommo-
date several predictor variables.
Let 𝑌 be our quantitative response, and let 𝑋1 , … , 𝑋𝑘 be the considered predictor variables. Then
multiple linear regression assumes that there is approximately a linear relationship between
𝑋1 , … , 𝑋𝑘 and 𝑌 , with
𝑌 ≈ 𝛽0 + 𝛽1 𝑋1 + ⋅ ⋅ ⋅ + 𝛽𝑘 𝑋𝑘 .
More formally, let 𝑌1 , … , 𝑌𝑛 be 𝑛 observations of the response variable, and write 𝑥𝑗,1 , … , 𝑥𝑗,𝑛 , for
𝑗 ∈ {1, … , 𝑘}, for the associated given values of the 𝑘 predictors.

Definition 9.4. The multiple linear regression model is defined through the equation

𝑌𝑖 = 𝛽0 + 𝛽1 𝑥1,𝑖 + ⋅ ⋅ ⋅ + 𝛽𝑘 𝑥𝑘,𝑖 + 𝜖𝑖 , 𝑖 ∈ {1, … , 𝑛} , (9.2)

with independent errors 𝜖1 , … , 𝜖𝑛 , which have zero mean and constant variance 𝜎2 . In the model,
the regression parameters 𝛽𝑗 ∈ R, 𝑗 ∈ {0, … , 𝑘}, are fixed, but unknown, coefficients.

We interpret 𝛽𝑗 as the average effect on 𝑌 of a one-unit increase in 𝑥𝑗 , while holding all other pre-
dictors fixed.


Remark. Using matrix notation, the multiple linear regression model is given by
Y = X𝛽 + 𝜖 , (9.3)
with response vector Y ∈ R𝑛 , design matrix X ∈ R𝑛×(𝑘+1) , population parameters 𝛽 ∈ R𝑘+1 and
residual errors 𝜖 ∈ R𝑛 .

Let’s visualize the poverty data (except for State) in a pairsplot. But first we remove the fitted values
and residuals again, which we added when fitting the simple linear regression model model_pov_0.

poverty <- poverty |>
  select(-pred, -resid)

The following pairsplot, created using GGally::ggpairs(), compactly visualizes pairwise relation-
ships between variables.

GGally::ggpairs(
  relocate(
    select(poverty, -State), Poverty, .after = Female_House)
) # we removed State and show Poverty and Female_House next to each other

[Figure: pairs plot of Metro_Res, White, Graduates, Female_House, Poverty and Region; among the correlations with Poverty, Graduates (−0.747) and Female_House (0.525) stand out, and White and Female_House are strongly correlated (−0.751).]

The second last row is most interesting. It shows the relationship between Poverty and the other
predictor variables. To highlight one fact: The percentage of female householder families with no
husband present seems to have a positive relationship with Poverty.
Let’s fit another simple linear regression model, but this time using Female_House as predictor.

model_pov_1 <- lm(Poverty ~ Female_House, data = poverty)


tidy(model_pov_1)
# # A tibble: 2 x 5
# term estimate std.error statistic p.value
# <chr> <dbl> <dbl> <dbl> <dbl>
# 1 (Intercept) 3.309 1.897 1.745 0.08733
# 2 Female_House 0.6911 0.1599 4.322 0.00007534

[Figure: scatterplot of Poverty against Female_House.]

As noted, Female_House seems to help predict the percentage of people living below the poverty line.
So, the question is how to fit a joint model by estimating the regression parameters in Equation 9.2.

9.2.1 Estimating the regression parameters

In Equation 9.2, as in the simple linear regression model, the regression parameters 𝛽0 , 𝛽1 , … , 𝛽𝑘 are
unknown and need to be estimated. Once we have estimates 𝛽0̂ , 𝛽1̂ , … , 𝛽𝑘̂ , we can use them to make
predictions using the following formula:

𝑦𝑖̂ = 𝛽0̂ + 𝛽1̂ 𝑥1,𝑖 + ⋅ ⋅ ⋅ + 𝛽𝑘̂ 𝑥𝑘,𝑖 , 𝑖 ∈ {1, … , 𝑛}.

The parameters are estimated using the same least squares approach that we used in Section 9.1.1.


The least squares estimates are defined as the point $\hat{\beta} = (\hat{\beta}_0, \hat{\beta}_1, \dots, \hat{\beta}_k)^\top \in \mathbb{R}^{k+1}$ that minimizes the function
$$\mathrm{RSS}(\mathbf{b}) = \mathrm{RSS}(b_0, b_1, \dots, b_k) = \sum_{i=1}^{n} e_i^2 = \sum_{i=1}^{n} (y_i - \hat{y}_i)^2 = \sum_{i=1}^{n} \big(y_i - (b_0 + b_1 x_{1,i} + \dots + b_k x_{k,i})\big)^2\,.$$
In symbols,
$$\hat{\beta}(\mathbf{y}) = \operatorname*{argmin}_{\mathbf{b} \in \mathbb{R}^{k+1}} \mathrm{RSS}(\mathbf{b}) = (\mathbf{X}^\top \mathbf{X})^{-1} \mathbf{X}^\top \mathbf{y}\,.$$
This formula is the same as what we encountered in Section 9.1.1. However, at that time, we were
able to evaluate it manually. Now, the process is becoming more complex, and we will rely on R to
compute the estimates.
Using the formula for the least squares estimates 𝛽̂ gives the following representation of the fitted
values
ŷ = X𝛽̂ = X(X⊤ X)−1 X⊤ y =∶ Hy . (9.4)
The matrix H is called hat matrix and will be used in Section 14.2, when we speak about outlier
detection.
The variance $\sigma^2$ of the unobservable errors will be estimated through
$$\hat{\sigma}^2 = \frac{\sum_{i=1}^{n} e_i^2}{n - k - 1}\,.$$
In R we simply have to extend the formula argument of lm() to Poverty ~ Graduates +
Female_House.

model_pov_2 <- lm(


Poverty ~ Graduates + Female_House, data = poverty)
tidy(model_pov_2)
# # A tibble: 3 x 5
# term estimate std.error statistic p.value
# <chr> <dbl> <dbl> <dbl> <dbl>
# 1 (Intercept) 58.32 9.847 5.923 0.0000003290
# 2 Graduates -0.5656 0.1001 -5.651 0.0000008511
# 3 Female_House 0.1439 0.1583 0.9089 0.3679

Besides the estimates $\hat{\beta} = (\hat{\beta}_0, \hat{\beta}_1, \hat{\beta}_2)^\top$, the output also contains the corresponding standard errors
$$\mathrm{SE}(\hat{\beta}_{j-1}) = \sqrt{\hat{\sigma}^2 \big[(\mathbf{X}^\top \mathbf{X})^{-1}\big]_{jj}}\,, \qquad j \in \{1, \dots, k+1\}\,.$$
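A minimal sketch of how these standard errors arise from the design matrix; model.matrix() returns X for a fitted model and sigma() returns the RSE, i.e., an estimate of σ:

X <- model.matrix(model_pov_2)
sqrt(diag(sigma(model_pov_2)^2 * solve(t(X) %*% X)))
# should reproduce the std.error column shown by tidy(model_pov_2)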


9.2.2 Another look at 𝑅2

𝑅2 can be calculated in two ways:

1. Square the empirical correlation between the observed response values $\mathbf{y}$ and the computed predictions $\hat{\mathbf{y}}$:
$$R^2 = r^2_{(y, \hat{y}), n}\,.$$
Now our predictions/fitted values are
$$\hat{y}_i = \hat{\beta}_0 + \hat{\beta}_1 x_{1,i} + \dots + \hat{\beta}_k x_{k,i}\,.$$
Remark: This is equivalent to squaring the correlation coefficient of $y$ and $x$ in a simple linear regression model (because in this case $\hat{y}$ is a linear transformation of $x$, which leaves correlation invariant).

2. Use the definition
$$R^2 = \frac{\mathrm{TSS} - \mathrm{RSS}}{\mathrm{TSS}}\,.$$

If we use the second approach, we need to determine the relevant sums of squares. First, the total
sum of squares, which is defined precisely as in our discussion of the simple linear regression model.
Second, the residual sum of squares, which again is defined as before and sums the squares of the
residuals 𝑒𝑖 = 𝑦𝑖 − 𝑦𝑖̂ .

Sums of squares for 𝑅2

We recap:
$$\mathrm{TSS} = \sum_{i=1}^{n} (y_i - \bar{y}_n)^2\,, \qquad \mathrm{RSS} = \sum_{i=1}^{n} (y_i - \hat{y}_i)^2\,.$$
Their difference is the explained sum of squares, which by our previous calculations is given by
$$\mathrm{ESS} = \mathrm{TSS} - \mathrm{RSS} = \sum_{i=1}^{n} (\hat{y}_i - \bar{y}_n)^2\,.$$

In R, the different sums of squares can be readily extracted from a so-called ANOVA (analysis of
variance) table:


anova(model_pov_2)
# Analysis of Variance Table
#
# Response: Poverty
# Df Sum Sq Mean Sq F value Pr(>F)
# Graduates 1 267.881 267.881 61.5896 3.741e-10 ***
# Female_House 1 3.593 3.593 0.8262 0.3679
# Residuals 48 208.773 4.349
# ---
# Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

$$\mathrm{RSS} = \sum_{i=1}^{n} e_i^2 = 208.8$$
$$\mathrm{ESS} = \sum_{i=1}^{n} (\hat{y}_i - \bar{y}_n)^2 \approx 267.9 + 3.593 = 271.493$$
$$\mathrm{TSS} = \sum_{i=1}^{n} (y_i - \bar{y}_n)^2 = \mathrm{ESS} + \mathrm{RSS} = 271.493 + 208.8 = 480.293$$
This then leads to
$$R^2 = \frac{\mathrm{ESS}}{\mathrm{TSS}} = \frac{271.493}{480.293} \approx 0.5653$$

We can compare this result to the output of the glance() function.

glance(model_pov_2)
# # A tibble: 1 x 12
# r.squared adj.r.squared sigma statistic p.value df
# <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
# 1 0.5653 0.5472 2.086 31.21 2.075e-9 2
# # i 6 more variables: logLik <dbl>, AIC <dbl>, BIC <dbl>,
# # deviance <dbl>, df.residual <int>, nobs <int>
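The first way of computing $R^2$, squaring the correlation between observed responses and fitted values, can be checked in the same way; a minimal sketch:

cor(poverty$Poverty, fitted(model_pov_2))^2
# approximately 0.5653, matching r.squared from glance(model_pov_2)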

The R computation makes it evident that the explained sum of squares (explained variability) ESS will
increase with every additional predictor. More formally, note that RSS will only decrease when opti-
mizing over additional coefficients 𝑏𝑗 . Hence, 𝑅2 increases with every additional predictor, making
it a less reliable measure for assessing the fit of a multiple linear regression model and unsuitable for
comparing different models.


Note

1. 𝑅2 is a biased estimate of the percentage of variability the model explains when there
are many variables. If we compute predictions for new data using the current model,
the 𝑅2 will tend to be slightly overly optimistic.

2. 𝑅2 increases with an increasing number of explanatory variables, regardless of the


true information content of the new variables.

We need to adjust the 𝑅2

To get a better estimate of the amount of variability explained by the model we use:

Definition 9.5. Let 𝑦𝑖 be the i-th response value and 𝑒𝑖 the estimated i-th residual of a fitted multiple
linear regression model with 𝑘 + 1 parameters. Then the adjusted 𝑅2 is defined as

$$R^2_{adj} = 1 - \frac{\sum e_i^2 / (n - k - 1)}{\sum (y_i - \bar{y})^2 / (n - 1)} = 1 - \frac{\sum e_i^2}{\sum (y_i - \bar{y})^2} \cdot \frac{n-1}{n-k-1}\,,$$
where 𝑛 is the number of cases/observations. The specific divisor 𝑛 − 𝑘 − 1 is connected to avoid-
ing bias in estimation of the variance of the 𝜖𝑖 ; we will revisit this point when discussing statistical
inference for linear regression models.

Remark. Since $\frac{n-1}{n-k-1} > 1$, we always have
$$R^2_{adj} < R^2 = 1 - \frac{\sum e_i^2}{\sum (y_i - \bar{y})^2}\,.$$
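The definition can be checked by hand against glance(); a minimal sketch for model_pov_2, with n = 51 observations and k = 2 predictors:

e <- residuals(model_pov_2)
y <- poverty$Poverty
n <- length(y)
k <- 2
1 - (sum(e^2) / (n - k - 1)) / (sum((y - mean(y))^2) / (n - 1))
# approximately 0.5472, matching adj.r.squared from glance(model_pov_2)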

Let’s compare the (adjusted) 𝑅2 values for the regression model using Graduates and Female_House
as predictors to the one using in addition White.

model_pov_3 <- lm(


Poverty ~ Graduates + Female_House + White,
data = poverty)

glance(model_pov_2)[,1:2]
# # A tibble: 1 x 2
# r.squared adj.r.squared
# <dbl> <dbl>
# 1 0.5653 0.5472


glance(model_pov_3)[,1:2]
# # A tibble: 1 x 2
# r.squared adj.r.squared
# <dbl> <dbl>
# 1 0.5769 0.5499

We detect a stronger increase in 𝑅2 than in the adjusted 𝑅2 . This indicates that the actual amount of
explained variability hasn’t increased that much.

9.2.3 Collinearity between predictors

Does adding the variable White to the model add valuable information that wasn’t provided by
Female_House?

[Pairs plot of Metro_Res, White, Graduates, Female_House, Poverty and Region. Selected pairwise correlations: Corr(White, Female_House) = −0.751, Corr(Graduates, Female_House) = −0.612, Corr(Graduates, Poverty) = −0.747, Corr(Female_House, Poverty) = 0.525, Corr(White, Poverty) = −0.309, Corr(Metro_Res, White) = −0.342.]

In the pairs plot we can detect a quite strong (negative) linear dependence between Female_House and White, which indicates that White doesn't contain much additional information.
Fitting a model with dependent predictor variables also affects the least squares estimates. Comparing
the slope estimate for Female_House over the two models shows that it decreases from a positive
value of 0.1438501 to -0.0859684.

coef(model_pov_2)
# (Intercept) Graduates Female_House
# 58.3202572 -0.5655586 0.1438501

coef(model_pov_3)
# (Intercept) Graduates Female_House White
# 68.85606105 -0.61872923 -0.08596838 -0.04024676

This effect of reverting the sign of the estimate would not be present if we had added Metro_Res to
the model instead of White.

model_pov_4 <- lm(
  Poverty ~ Graduates + Female_House + Metro_Res,
  data = poverty)

coef(model_pov_4)
# (Intercept) Graduates Female_House Metro_Res
# 54.05960674 -0.49422868 0.31784887 -0.05396264

Multicollinearity

Multicollinearity occurs when the predictor variables are correlated with each other.
When the predictor variables are correlated, the coefficients in a multiple regression model can
be challenging to interpret.
Remember: Predictors are also called explanatory or independent variables. Ideally, they would
be independent of each other.

Remark. Female_House and White are an example of collinear predictor variables.

While it's more or less impossible to prevent collinearity from arising in observational data, experiments are usually designed to avoid correlation among predictors.
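A quick numerical check of such pairwise collinearity is the correlation between the two predictors (a small sketch, assuming the poverty data from above):

cor(poverty$Female_House, poverty$White)
# around -0.75, in line with the pairs plot above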


9.2.4 Categorical predictor variables

In our analysis of the poverty dataset, we used only numeric predictor variables so far. But the dataset
also contains the categorical variable Region. Categorical variables like Region are also helpful in
predicting outcomes. Using a categorical variable as the predictor in a simple linear regression amounts to fitting a separate mean level of the response for each level of the explanatory variable.
As an example, we analyze if the percentage of people living below the poverty line varies with the
region:

poverty |>
summarise(`mean Pov per Region` = mean(Poverty), .by = Region)
# # A tibble: 4 x 2
# Region `mean Pov per Region`
# <fct> <dbl>
# 1 South 13.66
# 2 West 11.29
# 3 Northeast 9.5
# 4 Midwest 9.525

If we want to use a linear model to analyze the different mean values, we need to understand how
the information contained in the categorical variable is coded.

Coding a categorical variable

There exist different types of coding. R uses by default treatment coding, which is also called dummy
coding. In the case of Region (which has 4 levels), this amounts to creating indicators for three specific
regions, which is done as follows:

contrasts(poverty$Region)
# West Northeast Midwest
# South 0 0 0
# West 1 0 0
# Northeast 0 1 0
# Midwest 0 0 1

We see that South is the reference category and each estimated parameter (for the other three
levels) is the difference to the reference category. Let’s compare the results from fitting the linear
model


region_lm_model <- lm(Poverty ~ Region, data = poverty)


coef(region_lm_model)
# (Intercept) RegionWest RegionNortheast RegionMidwest
# 13.658824 -2.366516 -4.158824 -4.133824

to the different empirical mean values we computed before

poverty |>
summarise(`mean Pov per Region` = mean(Poverty), .by = Region)
# # A tibble: 4 x 2
# Region `mean Pov per Region`
# <fct> <dbl>
# 1 South 13.66
# 2 West 11.29
# 3 Northeast 9.5
# 4 Midwest 9.525

The intercept estimate 𝛽0̂ is the mean value of the reference category South. All other mean values
can be obtained by adding the corresponding estimate to 𝛽0̂ .
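As a small check (a sketch using region_lm_model from above), the group means can be reconstructed from the treatment-coded coefficients:

b <- coef(region_lm_model)
c(South = b[["(Intercept)"]],
  West = b[["(Intercept)"]] + b[["RegionWest"]],
  Northeast = b[["(Intercept)"]] + b[["RegionNortheast"]],
  Midwest = b[["(Intercept)"]] + b[["RegionMidwest"]])
# reproduces the empirical means 13.66, 11.29, 9.5 and 9.525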

Example 9.5. As an example, let’s analyze the influence of the high school graduation rate on
Poverty in the West and Midwest. In the first step we reduce the dataset to observations from those
two regions.

poverty_west <- filter(poverty, Region %in% c("Midwest", "West"))

Using this dataset we fit a model with the two predictor variables Graduates and Region.

model_pov_5 <- lm(
  Poverty ~ Graduates + Region,
  data = poverty_west
)

The summary of the model is given by:

summary(model_pov_5)
#
# Call:
# lm(formula = Poverty ~ Graduates + Region, data = poverty_west)


#
# Residuals:
# Min 1Q Median 3Q Max
# -3.8039 -0.9648 -0.0985 0.3900 3.8086
#
# Coefficients:
# Estimate Std. Error t value Pr(>|t|)
# (Intercept) 53.5317 12.1951 4.390 0.000233 ***
# Graduates -0.4840 0.1396 -3.466 0.002194 **
# RegionMidwest -1.1310 0.7306 -1.548 0.135881
# ---
# Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#
# Residual standard error: 1.767 on 22 degrees of freedom
# Multiple R-squared: 0.4536, Adjusted R-squared: 0.4039
# F-statistic: 9.131 on 2 and 22 DF, p-value: 0.001297

The fitted regression model is given by the following equation

𝑦𝑖̂ ≈ 53.53 − 0.48 ⋅ 𝑥1,𝑖 − 1.13 ⋅ 𝑥2,𝑖 ,

where 𝑥1,𝑖 is the percentage of high school graduates in state 𝑖 and

1, if state i is in the Midwest


𝑥2,𝑖 = { ,
0, otherwise

and visualized in Figure 9.6.

ggplot(poverty_west, aes(x = Graduates, y = Poverty)) +
  geom_point(aes(colour = Region)) +
  geom_abline(intercept = 53.53, slope = -0.48,
              colour = "#e41a1c") +
  # intercept for the Midwest: 53.53 - 1.13 = 52.40
  geom_abline(intercept = 52.40, slope = -0.48,
              colour = "#377eb8") +
  scale_color_brewer(palette = "Set1") +
  labs(x = "high school graduation rate",
       y = "perc. living below the poverty line")

Figure 9.6: Predicted values for the regions West and Midwest. By design, the model does not allow for different slope estimates in the two regions.

Interpretation of the regression coefficients:

coef(model_pov_5)
# (Intercept) Graduates RegionMidwest
# 53.5317008 -0.4839698 -1.1310115

Slope of Graduates:
All else held constant, with each additional unit increase in the high school graduation rate, the percentage of people living below the poverty line decreases, on average, by 0.484.
Slope of Region:
All else held constant, the model predicts that for states in the Midwest (compared to the West), the percentage of people living below the poverty line is on average lower by 1.13 percentage points.
Intercept:
In Western states with a high school graduation rate of zero percent, the percentage of people living below the poverty line is on average 53.532 percent.

Remark. Obviously, the intercept does not make sense in context. It only serves to adjust the height
of the line.


¾ Your turn

Use the estimated regression parameters to compute the predicted percentage of people living below the poverty line for a Midwestern state with a high school graduation rate of 88%.

coef(model_pov_5)
# (Intercept) Graduates RegionMidwest
# 53.5317008 -0.4839698 -1.1310115

Example 9.6. In this example, we analyze data from a survey of adult American women and their
children, a sub-sample from the National Longitudinal Survey of Youth. The aim of this analysis is
to predict the cognitive test scores of three- and four-year-old children using characteristics of their
mothers.

kidiq
# # A tibble: 434 x 5
# kid_score mom_hs mom_iq mom_work mom_age
# <int> <fct> <dbl> <fct> <int>
# 1 65 1 121.1 1 27
# 2 98 1 89.36 1 25
# 3 85 1 115.4 1 27
# 4 83 1 99.45 1 25
# 5 115 1 92.75 1 27
# 6 98 0 107.9 0 18
# # i 428 more rows

• kid_score: cognitive test scores of three- and four-year-old children
• mom_hs: mother graduated from high school (1: yes, 0: no)
• mom_iq: IQ of the mother
• mom_work: mother worked during the first three years of the kid's life (1: yes, 0: no)
• mom_age: mother's age at the time she gave birth

From: Gelman and Hill (2006)
We fit a model using all available predictor variables to explain the variation in kid_score. Choosing
all predictor variables can be done using the . notation.

model_iq <- lm(kid_score ~ ., data = kidiq)


¾ Your turn

What is the correct interpretation of the slope for mom’s IQ?

tidy(model_iq)
# # A tibble: 5 x 5
# term estimate std.error statistic p.value
# <chr> <dbl> <dbl> <dbl> <dbl>
# 1 (Intercept) 19.59 9.219 2.125 3.414e- 2
# 2 mom_hs1 5.095 2.315 2.201 2.825e- 2
# 3 mom_iq 0.5615 0.06064 9.259 9.973e-19
# 4 mom_work1 2.537 2.351 1.079 2.810e- 1
# 5 mom_age 0.2180 0.3307 0.6592 5.101e- 1

9.2.5 Cross validation

In this section we work with the evals dataset from the openintro package. The dataset contains
student evaluations of instructors’ beauty and teaching quality for 463 courses at the University of
Texas.
The teaching evaluations were conducted at the end of the semester. The beauty judgments were
made later, by six students who had not attended the classes and were not aware of the course evaluations
(two upper-level females, two upper-level males, one lower-level female, one lower-level male), see
Hamermesh and Parker (2005) for further details.

evals
# # A tibble: 463 x 11
# prof_id score bty_avg age gender cls_level cls_students
# <int> <dbl> <dbl> <int> <fct> <fct> <int>
# 1 1 4.7 5 36 female upper 43
# 2 1 4.1 5 36 female upper 125
# 3 1 3.9 5 36 female upper 125
# 4 1 4.8 5 36 female upper 123
# 5 2 4.6 3 59 male upper 20
# 6 2 4.3 3 59 male upper 40
# # i 457 more rows
# # i 4 more variables: rank <fct>, ethnicity <fct>, ...

• score: evaluation score between 1 (very unsatisfactory) and 5 (excellent)
• bty_avg: average beauty rating of professor between 1 (lowest) and 10 (highest)
• cls_level: class level (lower, upper)
• rank: tenure status (teaching, tenure track, tenured)
• ethnicity: of professor (not minority, minority)
• language: of school where prof received education (english, non-english)
• pic_outfit: outfit of professor in picture (not formal, formal)
Our goal is to identify variables that help predict the evaluation score. We start with a small ex-
ploratory data analysis, plotting the relationship between the beauty average and the evaluation
score.

p1 <- ggplot(evals, aes(bty_avg, score)) +
  geom_point() +
  geom_smooth(method = "lm", se = FALSE)
p2 <- ggplot(evals, aes(bty_avg, score)) +
  geom_point(aes(colour = gender)) +
  scale_color_brewer(palette = "Set1")

p1 + p2 + plot_layout(axes = "collect") # using patchwork

[Scatterplots of score versus bty_avg: left panel with a fitted least squares line, right panel with points coloured by gender.]

Conclusion: There seems to be a positive relation between the beauty score of the professor and the
evaluation score. For a given beauty rating, it is hard to see if male professors are evaluated higher,
lower, or about the same as female professors.
But we can fit a model to see if the evaluation score varies with gender.


¾ Your turn

model_beauty <- lm(score ~ bty_avg + gender, data = evals)


library(broom)
tidy(model_beauty)
# # A tibble: 3 x 5
# term estimate std.error statistic p.value
# <chr> <dbl> <dbl> <dbl> <dbl>
# 1 (Intercept) 3.747 0.08466 44.27 6.227e-168
# 2 bty_avg 0.07416 0.01625 4.563 6.484e- 6
# 3 gendermale 0.1724 0.05022 3.433 6.518e- 4

For a given beauty score, are male professors evaluated higher, lower, or about the same as
female professors?
A higher
B lower
C about the same

One possible model choice is the full model, which involves using all relevant variables as predictor
variables. In our case, it may not make sense to use the professor ID as a predictor variable. Therefore,
we will remove this variable from the dataset and then fit the full model using the . notation.

evals <- select(evals, -prof_id) # remove id of prof


model_beauty_2 <- lm(
score ~ . # . select all other variables as
, data = evals) # explanatory variables
tidy(model_beauty_2)
# # A tibble: 11 x 5
# term estimate std.error statistic p.value
# <chr> <dbl> <dbl> <dbl> <dbl>
# 1 (Intercept) 4.489 0.2437 18.42 6.131e-57
# 2 bty_avg 0.05825 0.01703 3.421 6.815e- 4
# 3 age -0.008788 0.003245 -2.708 7.026e- 3
# 4 gendermale 0.2057 0.05272 3.903 1.095e- 4
# 5 cls_levelupper -0.05801 0.05527 -1.050 2.945e- 1
# 6 cls_students -0.0003772 0.0003599 -1.048 2.952e- 1
# # i 5 more rows


Choosing all variables as predictor variables is definitely not always a good choice. But to decide
which model is preferable regarding the prediction accuracy, we need to talk about measures again.
Remember that we introduced in Section 8.6 the MSE (mean squared error):

$$\mathrm{MSE} = \frac{1}{m} \sum_{i=1}^{m} (y_i - \hat{y}_i)^2 ,$$

where 𝑦𝑖̂ are the predictions based on a multiple linear regression model.

Root mean squared error

Assume we have a fitted model based on 𝑛 observations and obtained the parameter estimates
𝛽0̂ , 𝛽1̂ , … , 𝛽𝑘̂ . Now we are given 𝑚 new outcomes of the predictor variables 𝑥𝑗,𝑖 . This allows us to
compute 𝑚 predictions 𝑦𝑖̂ for these new outcomes.
Further, assume that in addition to the 𝑚 new observations of the predictor variables, we are also
given the corresponding observations 𝑦𝑖 of the response. In that case, we can evaluate the predictive
accuracy through the MSE. However, since the MSE is measured in squared deviations, one often
prefers a measure on the scale of the observations and uses, therefore, the root mean square error:

$$\mathrm{RMSE} = \sqrt{\mathrm{MSE}} = \sqrt{\frac{1}{m} \sum_{i=1}^{m} (y_i - \hat{y}_i)^2} .$$

Remark. This measure is difficult to evaluate if we don't have any new outcomes, which will often be the case.

Solution: Use the idea of splitting the available data in training and test data, introduced in Sec-
tion 8.6.1.

Training and test data

We could do this splitting into training and test data on our own. But we want to make our lives easy by using functions from tidymodels.


To use the different packages contained in tidymodels we load directly the collection (similar to
tidyverse):

library(tidymodels)

To get started, let’s split evals into a training set and a testing set. We’ll keep most (around 90%) of
the rows from the original dataset (subset chosen randomly) in the training set. The training data will
be used to fit the model, and the test data will be used to measure model performance.
To do this, we can use the rsample package to create an object that contains the information on how to split the data (apply initial_split()), and then use training() and testing() from the rsample package to create data frames for the training and test data:

# for reproducibility
set.seed(111)

evals_split <- initial_split(evals, prop = 0.9)

train <- training(evals_split)


test <- testing(evals_split)

# check if the size of training data is around 90%


nrow(train) / nrow(evals)
# [1] 0.8984881

Now we fit the full model to the training data and compute predictions on the test data using
predict().

model_train_full <- lm(score ~ ., data = train)


pred_full <- predict(model_train_full, newdata = test)

pred_full contains the predictions 𝑦𝑖̂ of the response values 𝑦𝑖 contained in test. We can use this
information to compute the RMSE.
Let’s create a function for computing the RMSE, as we will need to do this multiple times.

rmse <- function(u, v){
  sqrt(mean((u - v)^2))
}

rmse_full <- rmse(test$score, pred_full)

rmse_full
# [1] 0.5243576


Roughly speaking, we can say that the average difference between the predicted and actual scores
is approximately 0.52 on a scale of 1 to 5. That’s a measure of the goodness of fit. Our goal in this
section is to use the RMSE to compare the predictive accuracy between different models. So, we have
to fit another model.
The box plots of score by ethnicity (minority vs. not minority) indicate that the professor's ethnicity may not have an impact on the evaluation score. Therefore, we can remove ethnicity from the model and observe the effect on predictive accuracy.

model_train_red <- lm(score ~ . -ethnicity, data = train)


pred_red <- predict(model_train_red, newdata = test)

rmse_red <- rmse(test$score, pred_red)

rmse_red
# [1] 0.5211824
rmse_full
# [1] 0.5243576

The RMSE is smaller for the reduced model.

Warning

The result is based on one split of the data. But in practice, this has to be done multiple times to come up with a conclusion.

We already know how to do this multiple times. In Section 8.10.1 we introduced the concept of 𝑣-fold
cross validation. The idea was to randomly split the data into 𝑣 parts/folds. Each fold then plays the
role of the validation (test) data once. This leads to 𝑣 validation errors, which are averaged to form
an overall cross-validation error

$$\mathrm{CV}_{(v)} = \frac{1}{v} \sum_{i=1}^{v} \mathrm{RMSE}_i ,$$
where RMSE𝑖 is the validation error when using the 𝑖-th fold as validation data.
Luckily we don’t have to do the 𝑣 splits by hand. We can use vfold_cv().
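For intuition, here is what such a loop could look like when written out by hand, using vfold_cv() for the splits together with the analysis() and assessment() helpers from rsample and the rmse() function defined earlier (a sketch; the tidymodels workflow introduced below automates exactly this):

set.seed(123)
folds_manual <- vfold_cv(evals, v = 10)

rmse_per_fold <- sapply(folds_manual$splits, function(split) {
  fit <- lm(score ~ ., data = analysis(split))       # fit on 9 folds
  pred <- predict(fit, newdata = assessment(split))  # predict the held-out fold
  rmse(assessment(split)$score, pred)
})

mean(rmse_per_fold)  # the cross-validation error CV_(v)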


Model specification with parsnip

vfold_cv() is a function in the rsample package from tidymodels. To use it within a complete modelling pipeline, we must adhere to the tidymodels approach for specifying and fitting models.
We start by specifying the functional form of the model that we want to fit using the parsnip package.
We can define a linear regression model with the linear_reg() function:

linear_reg()
# Linear Regression Model Specification (regression)
#
# Computational engine: lm

On its own, the function really doesn’t do much. It only specifies the type of the model. Next, we
must choose a method to fit the model, the so-called engine. We have introduced the least squares
method for estimating the parameters of a linear regression. But that’s not the only method for fitting
a linear model. In this course we will not discuss any further methods for fitting linear regression,
but have a look at ?linear_reg to see the list of available engines.
The engine for least squares estimation is called lm.

lin_mod <-
linear_reg(engine = "lm")

lin_mod still doesn’t contain any estimated parameters, which makes sense, since we haven’t speci-
fied a concrete model equation. This can be done using the fit() function.
For illustration, let’s fit a model with bty_avg being the only predictor variable.

lin_mod |>
fit(score ~ bty_avg, data = evals)
# parsnip model object
#
#
# Call:
# stats::lm(formula = score ~ bty_avg, data = data)
#
# Coefficients:
# (Intercept) bty_avg
# 3.88034 0.06664

Now we are ready to create the different folds. We want to do 10-fold cross validation and therefore
choose v=10.


set.seed(123)
folds <- vfold_cv(evals, v = 10)

folds
# # 10-fold cross-validation
# # A tibble: 10 x 2
# splits id
# <list> <chr>
# 1 <split [416/47]> Fold01
# 2 <split [416/47]> Fold02
# 3 <split [416/47]> Fold03
# 4 <split [417/46]> Fold04
# 5 <split [417/46]> Fold05
# 6 <split [417/46]> Fold06
# # i 4 more rows

Defining a workflow

The task is now to fit the linear regression model to the ten different folds.
We know how to fit the model to one dataset using fit(). Luckily, we don’t have to repeat that step
ten times by hand.
We can add the model specification and the model formula to a workflow using the workflow() function from the workflows package. Given a workflow, the fit_resamples() function fits the specified model to all folds.
We start by creating a workflow, which contains only the linear model.

wf_lm <-
workflow() |>
add_model(lin_mod)

By adding different formulas we create workflows for the full and reduced model.

wf_full <- wf_lm |>
  add_formula(score ~ .)

wf_red <- wf_lm |>
  add_formula(score ~ . - ethnicity)


Given the workflow, we train/fit the models on all ten folds

lm_fit_full <-
wf_full |>
fit_resamples(folds)

lm_fit_red <-
wf_red |>
fit_resamples(folds)

and collect the computed accuracy measures:

collect_metrics(lm_fit_full)[,1:5]
# # A tibble: 2 x 5
# .metric .estimator mean n std_err
# <chr> <chr> <dbl> <int> <dbl>
# 1 rmse standard 0.5241 10 0.01619
# 2 rsq standard 0.1041 10 0.02549

collect_metrics(lm_fit_red)[,1:5]
# # A tibble: 2 x 5
# .metric .estimator mean n std_err
# <chr> <chr> <dbl> <int> <dbl>
# 1 rmse standard 0.5235 10 0.01606
# 2 rsq standard 0.1065 10 0.02559

The output includes the RMSE and 𝑅2 values, calculated over ten different folds. We noticed slightly
better values for the reduced model compared to the full model for both measures. This confirms our
findings based on a single split.

9.2.6 Subset selection

In this section, we will explore methods for selecting subsets of predictors, including best subset and
stepwise model selection procedures.

Best subset selection

To perform best subset selection, we must fit a separate least squares regression for every possible
combination of the 𝑘 predictors.


This means fitting all $k$ models that contain exactly one predictor and choosing the best one among those $k$ models. Then continue with all $\binom{k}{2} = \frac{k(k-1)}{2}$ models that contain exactly two predictors, and choose the best one containing two predictors, i.e., the one with the lowest RSS. Continue in this way until the model with $k$ predictor variables has been fitted. In the end one needs to select a single best model, which is not so obvious. Before we discuss how to do that, let's summarize the best subset selection algorithm.

Algorithm - Best subset selection

1. Let M0 denote the null model, which contains no predictors but only an intercept.

2. For $j = 1, 2, \dots, k$:
   i) Fit all $\binom{k}{j}$ models that contain exactly $j$ predictors.
   ii) Among these $\binom{k}{j}$ models, pick the one with lowest RSS, and call it $M_j$.

3. Select a single best model from M0 , … , M𝑘 .
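To make Step 2 concrete, a minimal sketch for the poverty data might look as follows (the set of candidate predictors is our choice for illustration, not fixed by the algorithm):

predictors <- c("Graduates", "Female_House", "White", "Metro_Res")

best_by_size <- lapply(seq_along(predictors), function(j) {
  subsets <- combn(predictors, j, simplify = FALSE)  # all subsets of size j
  fits <- lapply(subsets, function(vars) {
    lm(reformulate(vars, response = "Poverty"), data = poverty)
  })
  rss <- sapply(fits, function(m) sum(resid(m)^2))
  fits[[which.min(rss)]]  # M_j: lowest RSS among the models with j predictors
})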

The set of $k$ predictor variables allows for $2^k$ subsets, i.e., we have to consider $2^k$ models. Step 2 reduces this number to $k + 1$ models.
To choose the best model, we need to select from these 𝑘 + 1 options. This process requires careful
consideration because the RSS of these 𝑘 + 1 models consistently decreases as the number of features
increases.
If we use the RSS to select the best model, we will always end up with a model involving all variables.
The issue is that a low RSS or a high 𝑅2 indicates a model with low training error, but we actually
want to choose a model with low error on new data (low test error).
In order to select the best model with respect to test error, we need to estimate this test error. There
are two common approaches:

1. We can indirectly estimate test error by making an adjustment to the training error to account
for the bias due to overfitting.
2. We can directly estimate the test error, using either a validation set approach or a cross-
validation approach, as discussed in Section 8.10.1.

The second approach requires to compute the average RMSE (test error), calculated over all v folds,
for all models under consideration. We will not discuss this approach any further; see (James et al.
2021, chap. 6.1.3) for details.
In Section 9.2.2, we already introduced one way of adjusting by using the adjusted 𝑅2 instead of the
𝑅2 . Another way of adjusting the RSS is the AIC, which we introduce in the following.


AIC and adjusted 𝑅2

In Section 9.2.2, we discussed that the $R^2$ value calculated on the training data can be overly optimistic when applied to new data, especially in models with multiple predictors. This is because the training MSE $= \frac{1}{n} \cdot \mathrm{RSS}$ tends to underestimate the MSE on the test data. A more accurate measure of the explained variability is the adjusted $R^2$. This characteristic makes the adjusted $R^2$ a valuable criterion for model selection.
A different, but related adjustment of the RSS is the AIC (An Information Criterion or Akaike Informa-
tion Criterion), which will now be defined.

Definition 9.6. Given a fitted multiple linear regression model with $n$ residuals $e_1, \dots, e_n$ and $k + 1$ estimated parameters, the AIC is defined in the following way:
$$\mathrm{AIC} = \text{constant} + n \cdot \ln\left(\frac{\mathrm{RSS}}{n}\right) + 2(k + 1) .$$

Like the adjusted 𝑅2 , the AIC strikes a balance between the flexibility of using more predictors (RSS
decreases) and the complexity of the resulting model (the last summand 2(𝑘 + 1) increases).

Remark. Besides the adjusted 𝑅2 and the AIC, there exist further criteria defined as a transformation
of the RSS, like Mallows’s 𝐶𝑝 or BIC. We will not discuss the advantages and disadvantages of one
criterion over the others. Instead, we will focus on using the AIC. But please be aware that other criteria do exist, which one can use instead of the AIC, and which might be preferable in some scenarios.
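For illustration, the formula can be evaluated by hand for model_pov_2 and compared with R's extractAIC(), which for lm fits drops the additive constant (a small sketch, assuming the fitted model from above):

rss <- sum(resid(model_pov_2)^2)
n <- nobs(model_pov_2)
k <- length(coef(model_pov_2)) - 1
n * log(rss / n) + 2 * (k + 1)  # AIC without the constant
extractAIC(model_pov_2)[2]      # should agree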

9.2.7 Stepwise selection

Due to computational limitations, best subset selection is not feasible for really large values of 𝑘.
In addition, as the search space increases, there is a greater risk of identifying models that appear
to perform well on the training data but may lack predictive power for future data. Therefore, an
extensive search space can lead to statistical challenges.
Stepwise methods, which explore a more restricted set of models, are appealing alternatives to best
subset selection.

Forward stepwise selection

Forward stepwise selection starts with a model that has only an intercept and no predictors. At each
step, the algorithm adds the predictor variable that provides the greatest improvement to the current
model’s fit. The process stops when no further improvement is possible.


Algorithm - Forward stepwise selection

Let M0 denote the null model, which contains no predictors, only an intercept.

1. For 𝑗 = 0, … , 𝑘 − 1
a) Fit all 𝑘 − 𝑗 models that can be obtained by adding one additional predictor to the
ones already in M𝑗 .
b) Choose among these 𝑘 − 𝑗 models the one with the smallest AIC, and call it M𝑗+1 .

2. If AIC(M𝑗+1 ) < AIC(M𝑗 ), go to Step 1; otherwise STOP.

Unlike best subset selection, which involved fitting 2𝑘 models, forward stepwise selection consists of
fitting:

1. one null model


2. 𝑘 − 𝑗 models in the 𝑗-th iteration, for 𝑗 = 0, … , 𝑘 − 1.
This amounts to at most $1 + \sum_{j=0}^{k-1} (k - j) = 1 + \frac{k(k+1)}{2}$ models (if all $k$ iterations are needed). This is a substantial difference, even for small $k$, as illustrated in Figure 9.7.

df <- tibble(bss = 2^(1:10), fss = 1 + (1:10) * (2:11) / 2)


ggplot(df, aes(x = fss, y = bss)) +
geom_point() + geom_line() +
labs(x = "# fitted models in FSS", y = "# fitted models in BSS")


Figure 9.7: Comparison of computational advantage between Best Subset Selection (BSS) and Forward
Stepwise Selection (FSS) for models with k=1,…,10 predictors.

The computational advantage of forward stepwise selection over best subset selection is clear. While forward stepwise tends to perform well in practice, it is not guaranteed to find the best possible model out of all $2^k$ possible models (best in terms of the considered criterion, e.g., AIC).
For example, consider a scenario where you have a dataset with 𝑘 = 3 predictors. Suppose the
best one-variable model includes 𝑋1 , while the best two-variable model includes 𝑋2 and 𝑋3 . Then,
forward stepwise selection does not choose the best two-variable model because M1 contains 𝑋1 ,
requiring M2 to include 𝑋1 and one additional variable.

Backward stepwise selection

Backward stepwise selection is similar to forward stepwise selection, providing an efficient alternative
to best subset selection. However, unlike forward stepwise selection, it starts with the full model
containing all predictors and then iteratively removes the least useful predictor one at a time.

Algorithm - Backward stepwise selection

Let M𝑘 denote the full model, which contains all 𝑘 predictors.

1. For 𝑗 = 𝑘, 𝑘 − 1, … , 1:
a) Fit all 𝑗 models that are obtained by omitting one of the predictors in 𝑀𝑗 . Each such
model has then a total of 𝑗 − 1 predictors.
b) Choose among these 𝑗 models the one with the smallest AIC, and call it M𝑗−1 .

2. If AIC(M𝑗−1 ) < AIC(M𝑗 ), go to Step 1; otherwise STOP.

Similar to forward stepwise selection, the backward selection approach examines at most $1 + \frac{k(k+1)}{2}$ models. It is suitable for situations where the number of predictors $k$ is too large for best subset selection.
As in forward stepwise selection, backward stepwise selection does not guarantee finding the best
model containing a subset of the 𝑘 predictors.
It’s important to note that backward selection requires the number of samples 𝑛 to be larger than the
number of predictors 𝑘, which allows the estimation of the full model. On the other hand, forward
stepwise can also be used in the case of 𝑛 < 𝑘.

Hybrid approaches

The best subset, forward stepwise, and backward stepwise selection approaches typically result in
similar but not identical models. In addition, hybrid versions of forward and backward stepwise se-
lection exist as an alternative. In a hybrid selection algorithm, variables are either added or removed
in each step. It continues to incrementally remove or add single predictors until no further improve-
ment to the model fit can be achieved. Such a stepwise selection approach aims to closely resemble

best subset selection while retaining the computational advantages of forward and backward stepwise
selection.

Stepwise selection with step()

The stepwise selection algorithms are implemented in the step() function from the stats package.
The type of stepwise selection can be chosen with the direction argument. Possible values are
both (hybrid), backward and forward. The first argument of step() is object, which has to be the
fitted linear regression model considered in the first step (so M0 for forward and M𝑘 for backward
selection). With the scope argument one can specify the range of models examined in the stepwise
search. In the case of backward selection, step() considers all models between the full and the null
model, if scope is unspecified. The same is true in the hybrid approach, when starting with a full
model.
We run a backward and forward selection for selecting a model predicting the evaluation score in
the evals dataset. We start with the backward stepwise selection and use model_beauty_2 (=M𝑘 ) as
a starting point. In each step (if possible), we drop the variable which leads to the largest reduction in AIC.

stats::step(                 # we use the :: notation, since tidymodels
  object = model_beauty_2,   # also contains a step function (which does
  direction = "backward")    # something different)
# Start: AIC=-594.65
# score ~ bty_avg + age + gender + cls_level + cls_students + rank +
# ethnicity + language + pic_outfit
#
# Df Sum of Sq RSS AIC
# - ethnicity 1 0.2441 122.47 -595.72
# - cls_students 1 0.2970 122.52 -595.52
# - cls_level 1 0.2979 122.53 -595.52
# <none> 122.23 -594.65
# - language 1 0.9658 123.19 -593.00
# - pic_outfit 1 1.0816 123.31 -592.57
# - rank 2 1.9619 124.19 -591.27
# - age 1 1.9830 124.21 -589.19
# - bty_avg 1 3.1641 125.39 -584.81
# - gender 1 4.1193 126.35 -581.30
#
# Step: AIC=-595.72
# score ~ bty_avg + age + gender + cls_level + cls_students + rank +
# language + pic_outfit
#
# Df Sum of Sq RSS AIC

# - cls_level 1 0.2058 122.68 -596.94
# - cls_students 1 0.2528 122.72 -596.77
# <none> 122.47 -595.72
# - pic_outfit 1 1.1398 123.61 -593.43
# - language 1 1.3836 123.86 -592.52
# - rank 2 2.1091 124.58 -591.82
# - age 1 1.9909 124.46 -590.26
# - bty_avg 1 3.1271 125.60 -586.05
# - gender 1 4.3526 126.82 -581.55
#
# Step: AIC=-596.94
# score ~ bty_avg + age + gender + cls_students + rank + language +
# pic_outfit
#
# Df Sum of Sq RSS AIC
# - cls_students 1 0.1668 122.84 -598.32
# <none> 122.68 -596.94
# - pic_outfit 1 1.0625 123.74 -594.95
# - language 1 1.4969 124.17 -593.33
# - rank 2 2.3416 125.02 -592.19
# - age 1 2.0673 124.74 -591.21
# - bty_avg 1 3.1185 125.80 -587.32
# - gender 1 4.5463 127.22 -582.10
#
# Step: AIC=-598.32
# score ~ bty_avg + age + gender + rank + language + pic_outfit
#
# Df Sum of Sq RSS AIC
# <none> 122.84 -598.32
# - pic_outfit 1 0.9100 123.75 -596.90
# - language 1 1.3827 124.23 -595.13
# - rank 2 2.4475 125.29 -593.18
# - age 1 1.9671 124.81 -592.96
# - bty_avg 1 3.0358 125.88 -589.01
# - gender 1 4.4083 127.25 -583.99
#
# Call:
# lm(formula = score ~ bty_avg + age + gender + rank + language +
# pic_outfit, data = evals)
#
# Coefficients:
# (Intercept) bty_avg
# 4.490380 0.056916


# age gendermale
# -0.008691 0.209779
# ranktenure track ranktenured
# -0.206806 -0.175756
# languagenon-english pic_outfitnot formal
# -0.244128 -0.130906

The last part of the output shows the estimated coefficients of the selected model:

score ~ bty_avg + age + gender + rank + language + pic_outfit

Now we compare this model with the outcome of the forward stepwise selection. We specify the null
model M0 as input. In addition, we have to define the scope, which has to be done by specifying a
formula describing the largest possible model under consideration.

stats::step(
object = lm(score ~ 1, data = evals),
scope = ~ bty_avg + age + gender + cls_level + cls_students +
rank + ethnicity + language + pic_outfit,
direction = "forward")
# Start: AIC=-562.99
# score ~ 1
#
# Df Sum of Sq RSS AIC
# + bty_avg 1 4.7859 131.87 -577.49
# + gender 1 2.2602 134.39 -568.71
# + language 1 1.6023 135.05 -566.45
# + age 1 1.5655 135.09 -566.32
# + rank 2 1.5891 135.06 -564.40
# + cls_level 1 0.9575 135.70 -564.24
# + ethnicity 1 0.7857 135.87 -563.66
# <none> 136.65 -562.99
# + pic_outfit 1 0.1959 136.46 -561.65
# + cls_students 1 0.0922 136.56 -561.30
#
# Step: AIC=-577.49
# score ~ bty_avg
#
# Df Sum of Sq RSS AIC
# + gender 1 3.2934 128.57 -587.20
# + language 1 1.6846 130.18 -581.45

# + rank 2 1.5711 130.30 -579.04
# + ethnicity 1 0.9557 130.91 -578.86
# + cls_level 1 0.8183 131.05 -578.37
# <none> 131.87 -577.49
# + age 1 0.3770 131.49 -576.82
# + pic_outfit 1 0.0493 131.82 -575.67
# + cls_students 1 0.0076 131.86 -575.52
#
# Step: AIC=-587.2
# score ~ bty_avg + gender
#
# Df Sum of Sq RSS AIC
# + language 1 1.67592 126.90 -591.28
# + rank 2 1.87687 126.70 -590.01
# + age 1 1.25751 127.32 -589.75
# + cls_level 1 0.63045 127.94 -587.48
# + ethnicity 1 0.61220 127.96 -587.41
# <none> 128.57 -587.20
# + cls_students 1 0.02887 128.55 -585.31
# + pic_outfit 1 0.00132 128.57 -585.21
#
# Step: AIC=-591.28
# score ~ bty_avg + gender + language
#
# Df Sum of Sq RSS AIC
# + age 1 1.24853 125.65 -593.86
# + rank 2 1.60848 125.29 -593.18
# <none> 126.90 -591.28
# + cls_level 1 0.37668 126.52 -590.65
# + ethnicity 1 0.17731 126.72 -589.92
# + pic_outfit 1 0.11640 126.78 -589.70
# + cls_students 1 0.08197 126.82 -589.58
#
# Step: AIC=-593.86
# score ~ bty_avg + gender + language + age
#
# Df Sum of Sq RSS AIC
# + rank 2 1.89685 123.75 -596.90
# <none> 125.65 -593.86
# + pic_outfit 1 0.35934 125.29 -593.18
# + ethnicity 1 0.25467 125.40 -592.79
# + cls_level 1 0.24855 125.40 -592.77
# + cls_students 1 0.09430 125.56 -592.20


#
# Step: AIC=-596.9
# score ~ bty_avg + gender + language + age + rank
#
# Df Sum of Sq RSS AIC
# + pic_outfit 1 0.91002 122.84 -598.32
# <none> 123.75 -596.90
# + ethnicity 1 0.20537 123.55 -595.67
# + cls_level 1 0.10498 123.65 -595.29
# + cls_students 1 0.01436 123.74 -594.95
#
# Step: AIC=-598.32
# score ~ bty_avg + gender + language + age + rank + pic_outfit
#
# Df Sum of Sq RSS AIC
# <none> 122.84 -598.32
# + cls_students 1 0.16684 122.68 -596.94
# + ethnicity 1 0.13898 122.70 -596.84
# + cls_level 1 0.11982 122.72 -596.77
#
# Call:
# lm(formula = score ~ bty_avg + gender + language + age + rank +
# pic_outfit, data = evals)
#
# Coefficients:
# (Intercept) bty_avg
# 4.490380 0.056916
# gendermale languagenon-english
# 0.209779 -0.244128
# age ranktenure track
# -0.008691 -0.206806
# ranktenured pic_outfitnot formal
# -0.175756 -0.130906

Forward stepwise selection gives us the same model, when specifying the scope to be the full
model.
Finally, let’s apply the hybrid approach. We start again with the null model.

stats::step(
  object = lm(score ~ 1, data = evals),
  scope = ~ bty_avg + age + gender + cls_level + cls_students +
    rank + ethnicity + language + pic_outfit,
  direction = "both")
# Start: AIC=-562.99
# score ~ 1
#
# Df Sum of Sq RSS AIC
# + bty_avg 1 4.7859 131.87 -577.49
# + gender 1 2.2602 134.39 -568.71
# + language 1 1.6023 135.05 -566.45
# + age 1 1.5655 135.09 -566.32
# + rank 2 1.5891 135.06 -564.40
# + cls_level 1 0.9575 135.70 -564.24
# + ethnicity 1 0.7857 135.87 -563.66
# <none> 136.65 -562.99
# + pic_outfit 1 0.1959 136.46 -561.65
# + cls_students 1 0.0922 136.56 -561.30
#
# Step: AIC=-577.49
# score ~ bty_avg
#
# Df Sum of Sq RSS AIC
# + gender 1 3.2934 128.57 -587.20
# + language 1 1.6846 130.18 -581.45
# + rank 2 1.5711 130.30 -579.04
# + ethnicity 1 0.9557 130.91 -578.86
# + cls_level 1 0.8183 131.05 -578.37
# <none> 131.87 -577.49
# + age 1 0.3770 131.49 -576.82
# + pic_outfit 1 0.0493 131.82 -575.67
# + cls_students 1 0.0076 131.86 -575.52
# - bty_avg 1 4.7859 136.65 -562.99
#
# Step: AIC=-587.2
# score ~ bty_avg + gender
#
# Df Sum of Sq RSS AIC
# + language 1 1.6759 126.90 -591.28
# + rank 2 1.8769 126.70 -590.01
# + age 1 1.2575 127.32 -589.75
# + cls_level 1 0.6305 127.94 -587.48
# + ethnicity 1 0.6122 127.96 -587.41
# <none> 128.57 -587.20
# + cls_students 1 0.0289 128.55 -585.31

# + pic_outfit 1 0.0013 128.57 -585.21
# - gender 1 3.2934 131.87 -577.49
# - bty_avg 1 5.8192 134.39 -568.71
#
# Step: AIC=-591.28
# score ~ bty_avg + gender + language
#
# Df Sum of Sq RSS AIC
# + age 1 1.2485 125.65 -593.86
# + rank 2 1.6085 125.29 -593.18
# <none> 126.90 -591.28
# + cls_level 1 0.3767 126.52 -590.65
# + ethnicity 1 0.1773 126.72 -589.92
# + pic_outfit 1 0.1164 126.78 -589.70
# + cls_students 1 0.0820 126.82 -589.58
# - language 1 1.6759 128.57 -587.20
# - gender 1 3.2847 130.18 -581.45
# - bty_avg 1 5.9072 132.81 -572.21
#
# Step: AIC=-593.86
# score ~ bty_avg + gender + language + age
#
# Df Sum of Sq RSS AIC
# + rank 2 1.8969 123.75 -596.90
# <none> 125.65 -593.86
# + pic_outfit 1 0.3593 125.29 -593.18
# + ethnicity 1 0.2547 125.40 -592.79
# + cls_level 1 0.2486 125.40 -592.77
# + cls_students 1 0.0943 125.56 -592.20
# - age 1 1.2485 126.90 -591.28
# - language 1 1.6669 127.32 -589.75
# - bty_avg 1 4.0799 129.73 -581.06
# - gender 1 4.1603 129.81 -580.77
#
# Step: AIC=-596.9
# score ~ bty_avg + gender + language + age + rank
#
# Df Sum of Sq RSS AIC
# + pic_outfit 1 0.9100 122.84 -598.32
# <none> 123.75 -596.90
# + ethnicity 1 0.2054 123.55 -595.67
# + cls_level 1 0.1050 123.65 -595.29
# - language 1 0.9936 124.75 -595.20

# + cls_students 1 0.0144 123.74 -594.95
# - rank 2 1.8969 125.65 -593.86
# - age 1 1.5369 125.29 -593.18
# - bty_avg 1 3.8427 127.60 -584.74
# - gender 1 4.4648 128.22 -582.49
#
# Step: AIC=-598.32
# score ~ bty_avg + gender + language + age + rank + pic_outfit
#
# Df Sum of Sq RSS AIC
# <none> 122.84 -598.32
# + cls_students 1 0.1668 122.68 -596.94
# - pic_outfit 1 0.9100 123.75 -596.90
# + ethnicity 1 0.1390 122.70 -596.84
# + cls_level 1 0.1198 122.72 -596.77
# - language 1 1.3827 124.23 -595.13
# - rank 2 2.4475 125.29 -593.18
# - age 1 1.9671 124.81 -592.96
# - bty_avg 1 3.0358 125.88 -589.01
# - gender 1 4.4083 127.25 -583.99
#
# Call:
# lm(formula = score ~ bty_avg + gender + language + age + rank +
# pic_outfit, data = evals)
#
# Coefficients:
# (Intercept) bty_avg
# 4.490380 0.056916
# gendermale languagenon-english
# 0.209779 -0.244128
# age ranktenure track
# -0.008691 -0.206806
# ranktenured pic_outfitnot formal
# -0.175756 -0.130906

The hybrid approach also leads to the model

score ~ bty_avg + age + gender + rank + language + pic_outfit


A bit of caution

The selection algorithms we just discussed can be viewed as heuristics for optimizing a chosen model selection criterion over the set of all $2^k$ possible models.
Note, however, that even if these algorithms were able to find the “best” model, such a model would
necessarily correspond to some subset of the predictors that were originally supplied. Hence, the
algorithms are entirely unable to select models containing transformations of the predictor variables
as long as the user doesn’t specify such transformations.

Interaction effects

For example, we did not account for any interaction effects. An interaction effect in a regression
model occurs when the impact of one predictor variable on the response variable varies based on
the value of another predictor variable. In other words, the influence of one variable is different at
different levels of another variable.
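In R's formula language such an interaction is written with a colon, and the shorthand a * b expands to both main effects plus their interaction. For instance, the following two calls specify the same model (shown only to illustrate the notation used below):

# gender * age is shorthand for gender + age + gender:age
lm(score ~ bty_avg + gender * age, data = evals)
lm(score ~ bty_avg + gender + age + gender:age, data = evals)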
Possible interactions in this context might be between

• beauty and gender


• age and gender
• …

Let’s try to visualize a possible interaction effect between age and gender.

ggplot(evals, aes(age, score, colour = gender)) +
  geom_point() +
  geom_smooth(method = "lm", se = FALSE) +
  scale_color_brewer(palette = "Set1")

[Scatterplot of score versus age, coloured by gender, with separate fitted regression lines for female and male professors.]

Conclusion: We observe a more pronounced decline in the evaluation score for women compared to
men as age increases.
Including this interaction effect improves the model fit, as we can see when comparing the model
chosen by the stepwise selection algorithms

model_select <- lm(
  score ~ bty_avg + gender + age + language + rank +
    pic_outfit, data = evals)

with the one including the additional interaction effect

model_int <- lm(
  score ~ bty_avg + gender * age + language + rank +
    pic_outfit, data = evals)

with respect to their AIC values

extractAIC(model_select) # output: number of param, AIC
# [1] 8.0000 -598.3152
extractAIC(model_int)
# [1] 9.0000 -605.8161

¾ Your turn

Determine the slope of age for male and female observations. Give an interpretation of the
slope.

coef(model_int)
# (Intercept) bty_avg
# 5.06163919 0.05767324
# gendermale age
# -0.59415367 -0.02057155
# languagenon-english ranktenure track
# -0.18133702 -0.26383826
# ranktenured pic_outfitnot formal
# -0.21350431 -0.14013002
# gendermale:age
# 0.01712965


Short summary

This chapter introduces linear regression as a foundational supervised learning method for pre-
dicting quantitative responses. It highlights the continued relevance of this seemingly basic
technique despite more advanced approaches. The chapter explores simple linear regres-
sion with a single predictor, detailing the least squares method for model fitting and
parameter estimation using the poverty dataset. It then extends to multiple linear regres-
sion, considering scenarios with several predictor variables and addressing concepts such as
multicollinearity. The text also covers the inclusion of categorical predictors and meth-
ods for model assessment, including (adjusted) R-squared, residual standard error and
AIC. Furthermore, it discusses model selection techniques like best subset and stepwise selec-
tion, along with the importance of cross-validation for evaluating predictive accuracy using the
evals dataset.

10 Logistic regression

In this chapter, we will illustrate the concept of logistic regression using the email dataset from the
openintro package. The data represents incoming emails from David Diez’s mail account for the
first three months of 2012.
We will be interested in predicting the spam status (0=no, 1=yes) of an incoming email, based on
further features of the email.

library(openintro)
email
# # A tibble: 3,921 x 21
# spam to_multiple from cc sent_email time image attach
# <fct> <fct> <fct> <int> <fct> <dttm> <dbl> <dbl>
# 1 0 0 1 0 0 2012-01-01 07:16:41 0 0
# 2 0 0 1 0 0 2012-01-01 08:03:59 0 0
# 3 0 0 1 0 0 2012-01-01 17:00:32 0 0
# 4 0 0 1 0 0 2012-01-01 10:09:49 0 0
# 5 0 0 1 0 0 2012-01-01 11:00:01 0 0
# 6 0 0 1 0 0 2012-01-01 11:04:46 0 0
# # i 3,915 more rows
# # i 13 more variables: dollar <dbl>, winner <fct>, inherit <dbl>, ...

10.1 EDA of the email dataset

Two of the twenty predictors contained in email are

• winner: Indicates whether the word “winner” appeared in the email.


• line_breaks: The number of line breaks in the email (does not count text wrapping).

Let’s visualize the distribution of winner and line_breaks for each level of spam. We start with a
plot of the joint distribution of spam and winner.


library(ggmosaic)
ggplot(email) +
geom_mosaic(aes(x = product(winner, spam), fill = winner)) +
scale_fill_brewer(palette = "Set1")


Figure 10.1: A larger percentage of spam emails contain the word winner compared to non-spam
emails.

Boxplots for line_breaks, for each combination of spam and winner are shown in Figure 10.2.

ggplot(email, aes(x = line_breaks, y = spam, colour = winner)) +
  geom_boxplot() + scale_color_brewer(palette = "Set1")


Figure 10.2: The average number of line breaks is smaller in spam emails compared to non-spam
emails.

It seems clear that both winner and line_breaks affect the spam status. But how do we develop
a model to explore this relationship?
Why not use linear regression? The response variable is the binary variable spam with levels
0 (no spam) and 1 (spam). The expected value of spam is equal to the probability of 1 (spam),
and thus a number in $[0, 1]$. On the other hand, a linear regression model will yield predictions of the form $\beta_0 + \sum_{j=1}^{k} \beta_j X_j$, which, depending on the values of the $X_j$'s, may take on any real number as a value.
While we could proceed in an ad-hoc fashion and map the linear predictions to the nearest num-
ber in [0, 1], it is evident that a different type of model that always produces sensible estimates
of the probability of 1 (spam) may be preferable.

10.2 The logistic regression model

In linear regression, we model the response variable 𝑌 directly. In contrast, logistic regression models
focus on the probability that the response takes one of two possible values.
For the email data, logistic regression models the probability

P(𝑌 = 1|𝑋1 = 𝑥1 , 𝑋2 = 𝑥2 ) ,

where 𝑌 is the response spam and 𝑋1 and 𝑋2 are the two predictors line_breaks and winner.
The values of 𝑝(𝑥1 , 𝑥2 ) = P(𝑌 = 1|𝑋1 = 𝑥1 , 𝑋2 = 𝑥2 ) will range between 0 and 1. A prediction
for spam can then be made based on the probability value 𝑝(𝑥1 , 𝑥2 ). For instance, one could set the
prediction as spam=1 for any email with 𝑝(𝑥1 , 𝑥2 ) > 0.5. On the other hand, if we are particularly

bothered by spam, we might opt to lower the threshold, for example, 𝑝(𝑥1 , 𝑥2 ) > 0.3, making it
easier to classify an email as spam. However, this will also increase the chances of misclassifying a
non-spam email as spam.
We already argued that using the linear predictor
𝜂𝑖 ∶= 𝛽0 + 𝛽1 𝑥1,𝑖 + ⋅ ⋅ ⋅ + 𝛽𝑘 𝑥𝑘,𝑖
to model the response values directly, as in linear regression, leads to fitted values outside the interval
[0, 1]. Nevertheless, we would like to maintain the linear predictor as a way of describing the influence
of the predictor variables. To resolve this issue, we will use the linear predictor not to directly predict
the probabilities, 𝑝(x𝑖 ), but rather a transformation of these probabilities. This transformation will
have values on the real line.
To complete the specification of the logistic model, we must introduce a suitable transformation,
known as the link function, which links the linear predictor 𝜂𝑖 to 𝑝(x𝑖 ). There are a variety of options
but the most commonly used is the logit function:
$$\mathrm{logit}(p) = \log\left(\frac{p}{1 - p}\right), \qquad p \in [0, 1].$$

The logit function is a map from [0, 1] to R ∪ {±∞}.

ggplot() + xlim(0.01,.99)+
geom_function(fun = function(x) log(x / (1-x)))+
labs(x = expression(p[i]), y = expression(logit(p[i])))


Solving $\eta = \log\left(\frac{p}{1-p}\right)$ for $p$, we find the inverse of the logit function:

$$\mathrm{logit}^{-1}(\eta) = \frac{\exp(\eta)}{1 + \exp(\eta)} = \frac{1}{1 + \exp(-\eta)} \in [0, 1].$$


ggplot() + xlim(-5, 5) +
geom_function(fun = function(x) 1 / (1 + exp(-x))) +
labs(x = expression(eta[i]), y = expression(logit^-1*(eta[i])))


Definition 10.1. Let $Y_i$ be independent binary response variables with associated predictor variable values $\mathbf{x}_i = (x_{1,i}, \dots, x_{k,i})$, $i \in \{1, \dots, n\}$. Then the logistic regression model is defined through the equation
$$\mathrm{P}(Y_i = 1 \mid \mathbf{x}_i) = \frac{\exp(\eta_i)}{1 + \exp(\eta_i)} = \frac{1}{1 + \exp(-\eta_i)}$$
with linear predictor
$$\eta_i = \beta_0 + \beta_1 x_{1,i} + \dots + \beta_k x_{k,i}$$
and parameters $\beta = (\beta_0, \beta_1, \dots, \beta_k)$. Written out, we arrive at:
$$p(\mathbf{x}_i) := \mathrm{P}(Y_i = 1 \mid \mathbf{x}_i) = \frac{\exp(\beta_0 + \beta_1 x_{1,i} + \dots + \beta_k x_{k,i})}{1 + \exp(\beta_0 + \beta_1 x_{1,i} + \dots + \beta_k x_{k,i})} .$$

Note

It turns out that the logistic regression model is a special case of a more general class of regression models, the generalized linear models (GLMs).
All generalized linear models have the following three characteristics:

1. A probability distribution describing the outcome variable
2. A linear predictor
   $$\eta = \beta_0 + \beta_1 x_1 + \dots + \beta_k x_k$$
3. A link function $g$ such that
   $$\mathrm{E}(Y \mid \mathbf{x}) = g^{-1}(\eta) ,$$
   where $\mathrm{E}(Y \mid \mathbf{x})$ is the expected value of the response given the predictor variables $\mathbf{x}$.

10.2.1 Odds and odds-ratio

The odds of an event is the ratio of the probability of the event and the probability of the complementary event. Thus, the odds of the event $\{Y_i = 1\} \mid \mathbf{x}$ are

$$\mathrm{odds}(\{Y_i = 1\} \mid \mathbf{x}) = \frac{\mathrm{P}(\{Y_i = 1\} \mid \mathbf{x})}{\mathrm{P}(\{Y_i = 1\}^c \mid \mathbf{x})} = \frac{\mathrm{P}(\{Y_i = 1\} \mid \mathbf{x})}{1 - \mathrm{P}(\{Y_i = 1\} \mid \mathbf{x})} \in (0, \infty) .$$

Values close to 0 and ∞ indicate very low or very high probabilities for the event of interest, such
as being spam.
Using Definition 10.1 leads to
$$\frac{p(\mathbf{x}_i)}{1 - p(\mathbf{x}_i)} = \exp(\beta_0) \cdot \exp(\beta_1 x_{1,i}) \cdots \exp(\beta_k x_{k,i}) ,$$
and when applying the logarithm on both sides we arrive at
$$\log\left(\frac{p(\mathbf{x}_i)}{1 - p(\mathbf{x}_i)}\right) = \beta_0 + \sum_{j=1}^{k} \beta_j x_{j,i} ,$$
called the log odds or logit.


Comparing the odds of $\{Y_i = 1\} \mid \mathbf{x}$ to the odds when one of the predictors increases by one unit, conditional on all the others remaining constant, we obtain the odds ratio:

$$OR_j := \frac{\text{odds when } x_{i,j} = x + 1}{\text{odds when } x_{i,j} = x} = \frac{\mathrm{e}^{\beta_0 + \beta_1 x_{i,1} + \dots + \beta_j (x+1) + \dots + \beta_k x_{i,k}}}{\mathrm{e}^{\beta_0 + \beta_1 x_{i,1} + \dots + \beta_j x + \dots + \beta_k x_{i,k}}} = \mathrm{e}^{\beta_j} .$$

Therefore, $\mathrm{e}^{\beta_j}$ represents the change in odds when $x_{i,j}$ increases by one unit, holding all other variables constant.


10.2.2 Estimation approach in logistic regression

The logistic regression model states that

$$p(\mathbf{x}_i) = \mathrm{P}(Y_i = 1 \mid \mathbf{x}_i) = \frac{\exp(\beta_0 + \beta_1 x_{1,i} + \dots + \beta_k x_{k,i})}{1 + \exp(\beta_0 + \beta_1 x_{1,i} + \dots + \beta_k x_{k,i})} .$$

Thus,
$$1 - p(\mathbf{x}_i) = \mathrm{P}(Y_i = 0 \mid \mathbf{x}_i) = \frac{1}{1 + \exp(\beta_0 + \beta_1 x_{1,i} + \dots + \beta_k x_{k,i})} .$$

To find estimates $(\hat{\beta}_0, \dots, \hat{\beta}_k)$, we may apply the principle of maximum likelihood. That is, we find the parameters for which the probability of the data $(y_1, \dots, y_n)$ is as large as possible.
This can also be thought of as seeking estimates for $(\beta_0, \beta_1, \dots, \beta_k)$ such that the predicted probability $\hat{p}(\mathbf{x}_i)$ corresponds as closely as possible to the observed response value.
By independence of the observations, the probability of our data is a product in which the factors are $p(\mathbf{x}_i)$ or $1 - p(\mathbf{x}_i)$ depending on whether $y_i = 1$ or $y_i = 0$, respectively.
For any probability $p$ we have $p^0 = 1$ and $p^1 = p$, so the likelihood function (probability of the data) may be written conveniently as
$$L(\mathbf{y}, \mathbf{x} \mid \beta) = \prod_{i=1}^{n} p(\mathbf{x}_i)^{y_i} \, (1 - p(\mathbf{x}_i))^{1 - y_i} = \prod_{i=1}^{n} \left(\frac{p(\mathbf{x}_i)}{1 - p(\mathbf{x}_i)}\right)^{y_i} (1 - p(\mathbf{x}_i)) .$$

For computation it is advantageous to pass to the log-scale, and we find

$$\log\big(L(\mathbf{y}, \mathbf{x} \mid \beta)\big) = \sum_{i=1}^{n} \left[ y_i \log\left(\frac{p(\mathbf{x}_i)}{1 - p(\mathbf{x}_i)}\right) + \log(1 - p(\mathbf{x}_i)) \right]
= \sum_{i=1}^{n} \Big[ y_i (\beta_0 + \beta_1 x_{1,i} + \dots + \beta_k x_{k,i}) - \log\big(1 + \exp(\beta_0 + \beta_1 x_{1,i} + \dots + \beta_k x_{k,i})\big) \Big] .$$

The maximum likelihood estimate $\hat{\beta}$ is then defined by

$$\log\big(L(\mathbf{y}, \mathbf{x} \mid \hat{\beta})\big) = \max_{\beta \in \mathbb{R}^{k+1}} \log\big(L(\mathbf{y}, \mathbf{x} \mid \beta)\big) .$$

While we won’t go into any details here, this is a function that is easy to maximize using iterative
algorithms (implemented in R).
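To make this concrete, here is a minimal sketch that maximizes the log-likelihood above numerically with optim() for a model with line_breaks as the only predictor (assuming the email data from the beginning of the chapter); the result can be compared with the glm() fit in the next section:

y <- as.numeric(email$spam == "1")  # binary response
x <- email$line_breaks

negloglik <- function(beta) {
  eta <- beta[1] + beta[2] * x                      # linear predictor
  log1pexp <- pmax(eta, 0) + log1p(exp(-abs(eta)))  # stable log(1 + exp(eta))
  -sum(y * eta - log1pexp)                          # negative log-likelihood
}

optim(c(0, 0), negloglik, method = "BFGS")$par
# should be close to the glm() estimates below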


10.2.3 Fitting the model in R

In R, we fit a Generalized Linear Model (GLM) similarly to a linear model. It’s important to remember
that logistic regression is a special case of a generalized linear model. Instead of using lm(), we use
glm() and specify the type of the GLM with the family argument of glm().
Let’s start by fitting a model using only line_breaks as predictor variable.

model_lb <- glm(spam ~ line_breaks, data = email, family = binomial)

Remark. To fit a logistic regression model, we choose family equal to binomial.
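A small aside: the binomial family uses the logit link by default, which is exactly the link function introduced above.

binomial()$link
# [1] "logit"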

We can look at the estimated parameters with the tidy() function.

library(broom)
tidy(model_lb)
# # A tibble: 2 x 5
# term estimate std.error statistic p.value
# <chr> <dbl> <dbl> <dbl> <dbl>
# 1 (Intercept) -1.74 0.0717 -24.3 5.61e-130
# 2 line_breaks -0.00345 0.000416 -8.31 9.37e- 17

The fitted model can be described by the log odds:


$$\log\left(\frac{p(x_{i,1})}{1 - p(x_{i,1})}\right) \approx -1.74 - 0.00345 \cdot x_{i,1} ,$$
where $x_{i,1}$ is the number of line breaks in the $i$-th email.
Given the log odds we can predict the probability of being spam for a given number of line breaks.

Probability of spam for $x_{i,1} = 500$ line breaks:
$$\log\left(\frac{p(x_{i,1})}{1 - p(x_{i,1})}\right) = -1.74 - 0.00345 \cdot 500 = -3.465$$
$$\Rightarrow \frac{p(x_{i,1})}{1 - p(x_{i,1})} = \exp(-3.465) \approx 0.031273 \;\Rightarrow\; p(x_{i,1}) = \frac{0.031273}{1.031273} \approx 0.0303$$

Probability of spam for $x_{i,1} = 50$ line breaks:
$$\log\left(\frac{p(x_{i,1})}{1 - p(x_{i,1})}\right) = -1.74 - 0.00345 \cdot 50 = -1.9125$$
$$\Rightarrow \frac{p(x_{i,1})}{1 - p(x_{i,1})} = \exp(-1.9125) \approx 0.1477 \;\Rightarrow\; p(x_{i,1}) = \frac{0.1477}{1.1477} \approx 0.1287$$


email |>
mutate(pred = predict(model_lb, type = "response"),
spam = ifelse(spam == "1", 1, 0 )) |>
ggplot(aes(x = line_breaks, y = spam)) +
geom_point(aes(colour = winner)) +
ylab("Spam / predicted values") +
geom_line(aes(y = pred), size = 1.2) +
scale_color_brewer(palette = "Set1")

[Scatterplot of spam status versus line_breaks, with points coloured by winner and the fitted logistic curve overlaid.]

Interpretation:

tidy(model_lb)
# # A tibble: 2 x 5
# term estimate std.error statistic p.value
# <chr> <dbl> <dbl> <dbl> <dbl>
# 1 (Intercept) -1.74 0.0717 -24.3 5.61e-130
# 2 line_breaks -0.00345 0.000416 -8.31 9.37e- 17

Interpretations in terms of log odds for intercept and slope terms are easy.
Intercept: The log odds of spam for an email with zero line breaks are -1.7391.
Slope: For each additional line break the log odds decrease by 0.00345.

Problem

These interpretations are not particularly intuitive. Most of the time, we care only about sign
and relative magnitude. This is described by the odds ratio.


Let's compute the odds ratio for 10 additional line breaks:

$$OR = \frac{\mathrm{e}^{-1.74 - 0.00345(x + 10)}}{\mathrm{e}^{-1.74 - 0.00345 x}} = \frac{\mathrm{e}^{-1.74} \cdot \mathrm{e}^{-0.00345 x} \cdot \mathrm{e}^{-10 \cdot 0.00345}}{\mathrm{e}^{-1.74} \cdot \mathrm{e}^{-0.00345 x}} = \mathrm{e}^{-10 \cdot 0.00345} \approx 0.97$$

Interpretation: If the number of line breaks is increased by 10, the odds of being spam decrease by roughly 3%.

Remark. The choice of 10 additional line breaks was our decision. In a different application, we would
choose a different value. A standard value would be one, but in this case, a one-unit change would
have been too insignificant.
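The same odds ratio can be read off the fitted model directly (a quick check with model_lb):

exp(10 * coef(model_lb)["line_breaks"])
# approximately 0.97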

We have observed that the likelihood of an email being classified as spam decreases as the number of
line breaks increases. However, we obtained a relatively small probability of being spam for a small
number of line breaks. This indicates that only using line_breaks as a predictor will probably not
lead to a good classifier. Therefore, we extend the model by including winner as a second predictor.


model_lb_win <- glm(spam ~ line_breaks + winner, data = email,
                    family = binomial)
tidy(model_lb_win)
# # A tibble: 3 x 5
# term estimate std.error statistic p.value
# <chr> <dbl> <dbl> <dbl> <dbl>
# 1 (Intercept) -1.77 0.0723 -24.5 9.97e-133
# 2 line_breaks -0.00360 0.000419 -8.59 8.78e- 18
# 3 winneryes 1.97 0.303 6.51 7.69e- 11

For a given number of line_breaks, the odds ratio for being spam is $\mathrm{e}^{1.97} \approx 7.17$ if the level of winner is changed from no (reference level) to yes.
For the different levels of winner we get the following fitted models:

1. winner=no
   $$\log\left(\frac{p(\mathbf{x}_i)}{1 - p(\mathbf{x}_i)}\right) = -1.77 - 0.0036 \cdot x_{i,1} + 1.97 \cdot 0 = -1.77 - 0.0036 \cdot x_{i,1} ,$$
   where $\mathbf{x}_i = (x_{i,1}, 0)^\top$ with $x_{i,1}$ being the number of line breaks in the $i$-th email.

2. winner=yes
   $$\log\left(\frac{p(\mathbf{x}_i)}{1 - p(\mathbf{x}_i)}\right) = -1.77 - 0.0036 \cdot x_{i,1} + 1.97 \cdot 1 = 0.2 - 0.0036 \cdot x_{i,1} ,$$
   where $\mathbf{x}_i = (x_{i,1}, 1)^\top$ with $x_{i,1}$ being the number of line breaks in the $i$-th email.

Let’s visualize the fitted values for the different levels of winner.

email |>
mutate(pred = predict(model_lb_win, type = "response"),
spam = ifelse(spam == "1", 1, 0 )) |>
ggplot(aes(x= line_breaks, y = spam, colour = winner)) +
geom_point() +
ylab("Spam / fitted values") +
geom_line(aes(y = pred), size = 1.2) +
scale_color_brewer(palette = "Set1")


[Figure: observed spam status and fitted values ("Spam / fitted values") versus line_breaks, one fitted curve per level of winner]

10.3 Relative risk

The most common mistake when interpreting logistic regression is to treat an odds ratio as a ratio of probabilities. It's a ratio of odds.
This means that emails containing the word winner are not $e^{\hat{\beta}_2} = e^{1.97} \approx 7.17$ times more likely to be spam than emails not containing the word winner.
Such an interpretation would be the relative risk
$$RR = \frac{P(\text{spam} \mid \text{exposed})}{P(\text{spam} \mid \text{unexposed})}\,,$$
where "exposed" means in this case that the email contains the word winner. So, this is different from the odds ratio
$$OR = \frac{P(\text{spam} \mid \text{exposed})\,/\,\big(1 - P(\text{spam} \mid \text{exposed})\big)}{P(\text{spam} \mid \text{unexposed})\,/\,\big(1 - P(\text{spam} \mid \text{unexposed})\big)}\,.$$

Based on the fitted model (model_lb_win) one can compute the following probabilities of being
spam.
The probability of an email being spam that contains the word “winner” and has 20 line breaks is
given by:

predict(model_lb_win,
newdata = data.frame(winner = "yes", line_breaks = 20),
type = "response")
# 1
# 0.5317901


The probability of an email being spam that does not contain the word “winner” and has 20 line
breaks is given by:

predict(model_lb_win,
newdata = data.frame(winner = "no", line_breaks = 20),
type = "response")
# 1
# 0.1365626

This then leads to a relative risk of
$$RR = \frac{0.5317901}{0.1365626} \approx 3.89\,.$$
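The same value can be computed directly from the two predictions; a small sketch (the object names are ours):

p_yes <- predict(model_lb_win,
                 newdata = data.frame(winner = "yes", line_breaks = 20),
                 type = "response")
p_no <- predict(model_lb_win,
                newdata = data.frame(winner = "no", line_breaks = 20),
                type = "response")
unname(p_yes / p_no)  # relative risk, roughly 3.89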

Note

The relative risk depends on the context, i.e., on the values of the other predictors. In the current example, this means on the number of line breaks contained in the email.

The probability of an email being spam that contains the word “winner” and has 2 line breaks is given
by:

predict(model_lb_win,
newdata = data.frame(winner = "yes", line_breaks = 2),
type = "response")
# 1
# 0.5478812

The probability of an email being spam that does not contain the word “winner” and has 2 line breaks
is given by:

predict(model_lb_win,
newdata = data.frame(winner = "no", line_breaks = 2),
type = "response")
# 1
# 0.1443825

This then leads to a relative risk of
$$RR = \frac{0.5478812}{0.1443825} \approx 3.79\,.$$


Given a fitted logistic regression model, it is possible to compute the relative risk using the oddsratio_to_riskratio() function from the effectsize package. The conversion is based on the model's odds ratio and a baseline probability p0, i.e., the predicted probability obtained under a specific set of predictor values.

library(effectsize)

# p0: 20 line breaks and winner not contained in the email


oddsratio_to_riskratio(model_lb_win, p0 = 0.1365626)
# Parameter | Risk Ratio | 95% CI
# ----------------------------------------
# (p0) | 0.14 |
# line breaks | 1.00 | [1.00, 1.00]
# winner [yes] | 3.89 | [2.80, 4.92]

# p0: 2 line breaks and winner not contained in the email


oddsratio_to_riskratio(model_lb_win, p0 = 0.1443825)
# Parameter | Risk Ratio | 95% CI
# ----------------------------------------
# (p0) | 0.14 |
# line breaks | 1.00 | [1.00, 1.00]
# winner [yes] | 3.79 | [2.76, 4.75]

10.4 Assessing the accuracy of the predictions

In linear regression, we use transformations of the residual sum of squares, such as the MSE, to assess
the accuracy of our predictions. However, in classification tasks, we need additional metrics.
For logistic regression, it is important to evaluate how well we predict each of the two possible out-
comes. To illustrate the need for these new measures, let’s consider an example.

Example 10.1. If you’ve ever watched the TV show House, you know that Dr. House regularly states,
“It’s never lupus.”

Lupus is a medical phenomenon where antibodies that are supposed to attack foreign
cells to prevent infections instead see plasma proteins as foreign bodies, leading to a high
risk of blood clotting. It is believed that 2% of the population suffers from this disease.

The test for lupus is very accurate if the person actually has lupus. However, it is very inaccurate if
the person does not.
More specifically, the test is 98% accurate if a person actually has the disease. The test is 74% accu-
rate if a person does not have the disease.
Is Dr. House correct when he says it’s never lupus, even if someone tests positive for lupus?


Let’s use the following tree to compute the conditional probability of having lupus given a positive
test result.

Probability tree (first split: lupus status, second split: test result):

• lupus yes (0.02): test positive 0.02 * 0.98 = 0.0196; test negative 0.02 * 0.02 = 0.0004
• lupus no (0.98): test positive 0.98 * 0.26 = 0.2548; test negative 0.98 * 0.74 = 0.7252

$$P(\text{lupus} = \text{yes} \mid \text{test} = +) = \frac{P(\text{test} = +,\ \text{lupus} = \text{yes})}{P(\text{test} = +)} = \frac{P(\text{test} = +,\ \text{lupus} = \text{yes})}{P(\text{test} = +,\ \text{lupus} = \text{yes}) + P(\text{test} = +,\ \text{lupus} = \text{no})}$$
$$= \frac{0.0196}{0.0196 + 0.2548} \approx 0.0714$$
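The same tree calculation in R; a minimal sketch:

# joint probabilities from the tree
p_pos_lupus    <- 0.02 * 0.98  # lupus and positive test
p_pos_no_lupus <- 0.98 * 0.26  # no lupus and positive test

# conditional probability of lupus given a positive test
p_pos_lupus / (p_pos_lupus + p_pos_no_lupus)
# [1] 0.07142857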

Testing for lupus is actually quite complicated. A diagnosis usually relies on the results of multi-
ple tests, including a complete blood count, an erythrocyte sedimentation rate, a kidney and liver
assessment, a urinalysis, and an anti-nuclear antibody (ANA) test.
It is important to consider the implications of each of these tests and how they contribute to the overall
decision to diagnose a patient with lupus.
At some level, a diagnosis can be seen as a binary decision (lupus or no lupus) that involves the
complex integration of various predictor variables.
The diagnosis should try to ensure a high probability of a positive test result when the patient is
actually ill. This is referred to as the test’s sensitivity.


On the other hand, the diagnosis should also have the property of yielding a high probability of a
negative test result if the patient does not have the disease. This is known as the specificity of the
test.

The example does not provide any information about how a diagnosis/decision is made, but it
does give us something equally important - the concept of the “sensitivity” and “specificity” of
the test.
Sensitivity and specificity are crucial for understanding the true meaning of a positive or nega-
tive test result.

10.4.1 Sensitivity and specificity

Definition 10.2. The sensitivity of a test refers to its ability to accurately detect a condition in
subjects if they do have the condition:

sensitivity = P(Test = + | Condition = +) .

The sensitivity is also called the true positive rate.


The specificity of a test refers to the test’s ability to correctly identify the absence of a condition in
subjects:
specificity = P(Test = − | Condition = −) .
The specificity is also called the true negative rate.

A positive or negative test result can also be a mistake; these are called false positives and false nega-
tives, respectively. All four outcomes are illustrated in Figure 10.3.

Figure 10.3: All four scenarios related to a test decision.

When given a sample, we can calculate the number of true positives (#TP), false positives (#FP), false
negatives (#FN), and true negatives (#TN). Using these numbers, we can then estimate sensitivity and


specificity:
$$\widehat{\text{sensitivity}} = \frac{\#TP}{\#TP + \#FN}\,, \qquad \widehat{\text{specificity}} = \frac{\#TN}{\#FP + \#TN}\,.$$

False negative/positive rate

Given the definitions of sensitivity and specificity, we can further define the false negative
rate 𝛽
𝛽 = P(Test = − | Condition = +) ,
and the false positive rate 𝛼

𝛼 = P(Test = + | Condition = −) .

10.5 Classification

Given a fitted logistic regression model, we can compute the probability of spam given a set of pre-
dictor variables.
Classification algorithm
Input: response y_i, predictor variables x1_i,...,xk_i
1. Compute probabilities p_i
2. Decide if spam given p_i -> decisions d_i
Output: d_i

Important

We need a decision rule in the second step! The rule must be of the form:
email is spam if p_i > threshold
So the rule is fully specified once we pick a suitable threshold.

Let’s begin by computing the probabilities p_i as the first step. We have already fitted two models to
the email dataset. But now let’s use the full model as our first choice.

full_mod <- glm(spam ~ . , data = email, family = binomial)

We can include the predicted probabilities (pred) of being spam in the dataset using the
add_predictions() function. Since we want to compute predictions on the response level,
we will choose the option type = "response".


email_fit_full <- email |>
  add_predictions(full_mod, type = "response")

names(email_fit_full) # check content of email_fit_full


# [1] "spam" "to_multiple" "from" "cc"
# [5] "sent_email" "time" "image" "attach"
# [9] "dollar" "winner" "inherit" "viagra"
# [13] "password" "num_char" "line_breaks" "format"
# [17] "re_subj" "exclaim_subj" "urgent_subj" "exclaim_mess"
# [21] "number" "pred"

Based on these probabilities, a decision needs to be made concerning which emails should be flagged
as spam. This is accomplished by selecting a threshold probability. Any email that surpasses that
probability will be flagged as spam.

10.5.1 Picking a threshold

The computed probabilities are visualized in Figure 10.4. For each level of spam the probabilities of
being spam are shown on the y-axis.

Figure 10.4: Jitter plot of predicted probabilities of being spam for spam and non-spam emails.

We could start with a conservative choice for the threshold, such as 0.75, to avoid classifying non-
spam emails as spam. In Figure 10.5, a horizontal red line indicates a threshold of 0.75.



Figure 10.5: Jitter plot of predicted probabilities of being spam for spam and non-spam emails. Points
above the red line are classified as being spam.

library(tidymodels)
email_fit_full <- email_fit_full |>
  mutate(
    # observed spam status as a factor
    obs = factor(spam == 1),
    # predicted class at threshold 0.75 as a factor
    pred_f = factor(pred > 0.75)
  )

email_fit_full |>
accuracy(obs, pred_f)
# # A tibble: 1 x 3
# .metric .estimator .estimate
# <chr> <chr> <dbl>
# 1 accuracy binary 0.911

Choosing a threshold of 0.75 leads to an accuracy (percentage of correct predictions) of roughly 90%.
But that’s simply because there are a large number of true negatives, as indicated in the confusion
matrix, which will be explained in the next section.


10.5.2 Consequences of picking a threshold

A confusion matrix for categorical data is a contingency table of observed and predicted response
values. It can be computed with the conf_mat() function from the yardstick package.

conf_mat_email <- conf_mat( # the yardstick package is contained in tidymodels
  email_fit_full,
  truth = "obs",
  estimate = "pred_f")

conf_mat_email
# Truth
# Prediction FALSE TRUE
# FALSE 3544 339
# TRUE 10 28

From the matrix, we can observe that there are 3544 true negatives.
What is the sensitivity and specificity of this decision rule? We can estimate both using the values
from the confusion matrix.

$$\widehat{\text{sens}} = \frac{28}{28 + 339} = 0.076294\,, \qquad \widehat{\text{spec}} = \frac{3544}{10 + 3544} = 0.997186\,.$$

The values are already stored in the conf_mat_email object, as evident from the following output.

summary(conf_mat_email, event_level = "second")


# # A tibble: 13 x 3
# .metric .estimator .estimate
# <chr> <chr> <dbl>
# 1 accuracy binary 0.911
# 2 kap binary 0.123
# 3 sens binary 0.0763
# 4 spec binary 0.997
# 5 ppv binary 0.737
# 6 npv binary 0.913
# # i 7 more rows


10.5.3 Trying other thresholds

To strike a balance between sensitivity and specificity, we need to experiment with different thresh-
olds.
A good way to do this is by using the threshold_perf() function from the probably package, which
is linked to tidymodels but is not a part of it.

thresholds <- c(0.75, 0.625, 0.5, 0.375, 0.25)

# remember: levels of obs are (FALSE, TRUE)


email_fit_full |>
  probably::threshold_perf(obs, pred, thresholds,
                           event_level = "second") |>
  print(n = Inf)
# # A tibble: 15 x 4
# .threshold .metric .estimator .estimate
# <dbl> <chr> <chr> <dbl>
# 1 0.25 sensitivity binary 0.531
# 2 0.375 sensitivity binary 0.398
# 3 0.5 sensitivity binary 0.185
# 4 0.625 sensitivity binary 0.109
# 5 0.75 sensitivity binary 0.0763
# 6 0.25 specificity binary 0.918
# 7 0.375 specificity binary 0.961
# 8 0.5 specificity binary 0.991
# 9 0.625 specificity binary 0.996
# 10 0.75 specificity binary 0.997
# 11 0.25 j_index binary 0.449
# 12 0.375 j_index binary 0.359
# 13 0.5 j_index binary 0.176
# 14 0.625 j_index binary 0.105
# 15 0.75 j_index binary 0.0735

Youden’s J statistic

In addition to sensitivity and specificity, the output also includes a summary measure of both: Youden's J statistic. The statistic is defined as

𝐽 = sensitivity + specificity − 1 .

The statistic is at most one, which would be the case if there were no false positives and no false
negatives. Hence, one should choose a threshold such that Youden’s J becomes maximal.
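As a quick numerical check, the estimates at threshold 0.75 from the output above give:

# Youden's J at threshold 0.75, from the estimated sensitivity and specificity
0.0763 + 0.997 - 1
# [1] 0.0733   (the j_index of 0.0735 above is based on the unrounded values)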


The various threshold options are illustrated in Figure 10.6.


Figure 10.6: Jitter plot of predicted probabilities of being spam for spam and non-spam emails. The
horizontal lines indicate the different threshold values.

10.5.4 ROC curve

The relationship between sensitivity and specificity is illustrated by plotting the sensitivity against the specificity or, equivalently, against the false positive rate. Such a curve is called a receiver operating characteristic (ROC) curve.
Remember: false positive rate = 1 - specificity.
Before we can plot the ROC curve, we split the data into a training and a test set,

set.seed(12345)
train_idx <- sample(1:nrow(email), floor(0.9 * nrow(email)))
email_train <- email[train_idx,]
email_test <- email[-train_idx,]

and re-fit the model on the training set

model_train <- glm(spam ~ . , data = email_train,
                   family = binomial)

Now we can predict spam on the test data

pred_spam <- predict(model_train, newdata = email_test,
                     type = "response")


and compare it with the actual spam by plotting the ROC curve.

library(pROC)
spam_roc <-
roc(email_test$spam ~ pred_spam)

ggroc(spam_roc) +
geom_segment(
aes(x = 1, xend = 0, y = 0, yend = 1),
color="grey", linetype="dashed")


Figure 10.7: ROC curve for the full model fitted to email_train.

Why do we care about ROC curves?

1. The graph shows the trade-off between sensitivity and specificity for various thresholds.

2. It’s simple to evaluate the model’s performance against random chance (shown by the
dashed line).

3. We can use the area under the curve (AUC) to assess the predictive ability of a model.

auc(spam_roc)
# Area under the curve: 0.9212

The larger the AUC, the higher the predictive ability.


Remark. The vertical distance between a point on the ROC curve and the dashed line equals Youden’s
J statistic.

10.5.5 Comparing models

The ROC curve and the corresponding AUC value provide a useful way to measure and describe the
predictive accuracy of a model. However, the most common use case is likely comparing different
models based on their AUC value.
We will conduct a backward stepwise selection on the full model fitted to the training data to find an
additional competitor.

model_step <- stats::step(model_train, direction = "backward")


# Start: AIC=1527.59
# spam ~ to_multiple + from + cc + sent_email + time + image +
# attach + dollar + winner + inherit + viagra + password +
# num_char + line_breaks + format + re_subj + exclaim_subj +
# urgent_subj + exclaim_mess + number
# Df Deviance AIC
# - exclaim_subj 1 1484.0 1526.0
# - cc 1 1484.2 1526.2
# <none> 1483.6 1527.6
# - inherit 1 1487.5 1529.5
# - num_char 1 1488.0 1530.0
# - viagra 1 1488.1 1530.1
# - time 1 1492.0 1534.0
# - dollar 1 1492.0 1534.0
# - from 1 1492.9 1534.9
# - urgent_subj 1 1494.3 1536.3
# - image 1 1494.6 1536.6
# - format 1 1499.0 1541.0
# - line_breaks 1 1500.0 1542.0
# - attach 1 1503.1 1545.1
# - password 1 1503.7 1545.7
# - re_subj 1 1505.1 1547.1
# - winner 1 1509.6 1551.6
# - exclaim_mess 1 1513.1 1555.1
# - number 2 1542.7 1582.7
# - to_multiple 1 1587.4 1629.4
# - sent_email 1 1605.4 1647.4
#
# Step: AIC=1525.95
# spam ~ to_multiple + from + cc + sent_email + time + image +


# attach + dollar + winner + inherit + viagra + password +


# num_char + line_breaks + format + re_subj + urgent_subj +
# exclaim_mess + number
# Df Deviance AIC
# - cc 1 1484.6 1524.6
# <none> 1484.0 1526.0
# - inherit 1 1487.8 1527.8
# - num_char 1 1488.6 1528.6
# - viagra 1 1488.7 1528.7
# - dollar 1 1492.0 1532.0
# - time 1 1492.4 1532.4
# - from 1 1493.3 1533.3
# - urgent_subj 1 1494.6 1534.6
# - image 1 1495.1 1535.1
# - format 1 1499.1 1539.1
# - line_breaks 1 1501.0 1541.0
# - attach 1 1503.7 1543.7
# - password 1 1503.9 1543.9
# - re_subj 1 1505.7 1545.7
# - winner 1 1510.7 1550.7
# - exclaim_mess 1 1513.6 1553.6
# - number 2 1542.7 1580.7
# - to_multiple 1 1587.4 1627.4
# - sent_email 1 1605.6 1645.6
#
# Step: AIC=1524.59
# spam ~ to_multiple + from + sent_email + time + image + attach +
# dollar + winner + inherit + viagra + password + num_char +
# line_breaks + format + re_subj + urgent_subj + exclaim_mess +
# number
# Df Deviance AIC
# <none> 1484.6 1524.6
# - inherit 1 1488.6 1526.6
# - num_char 1 1489.2 1527.2
# - viagra 1 1489.4 1527.4
# - dollar 1 1492.8 1530.8
# - time 1 1493.0 1531.0
# - from 1 1493.9 1531.9
# - urgent_subj 1 1495.1 1533.1
# - image 1 1496.0 1534.0
# - format 1 1499.7 1537.7
# - line_breaks 1 1501.5 1539.5
# - password 1 1504.7 1542.7


# - attach 1 1504.8 1542.8


# - re_subj 1 1506.0 1544.0
# - winner 1 1511.4 1549.4
# - exclaim_mess 1 1514.3 1552.3
# - number 2 1543.6 1579.6
# - to_multiple 1 1587.4 1625.4
# - sent_email 1 1607.1 1645.1

The algorithm removed two variables (exclaim_subj and cc). We can now compare the predictive
accuracy of the reduced and full models on the test data. Predictions and the ROC curve for the
model_step will be computed in the next step.

pred_spam_step <- predict(model_step, newdata = email_test, type = "response")

spam_roc_step <- roc(email_test$spam ~ pred_spam_step)

Figure 10.8 presents the ROC curves for both models. Visually, there is little to no difference.

ggroc(list(spam_roc, spam_roc_step)) +
  scale_color_brewer(palette = "Set1", name = "model", labels = c("full", "step")) +
  geom_segment(aes(x = 1, xend = 0, y = 0, yend = 1),
               color = "grey", linetype = "dashed")


Figure 10.8: ROC curve for the full and reduced model fitted to email_train.


The full model has an area under the curve of 0.9212009, while the reduced model has an AUC value
of 0.9216457.
We prefer the reduced model based on these results. However, it’s important to note that this conclu-
sion is drawn from only one data split, so we cannot be certain of the accuracy of the area under the
curve estimates. To obtain a more reliable estimate, we will need to apply cross-validation, which we
will address in the next section.

10.6 Cross validation

The reduced model performed slightly better on the test data, but let’s conduct v-fold cross valida-
tion to verify if this outcome is consistent across multiple validation sets.
Remember, the idea of v-fold cross validation is to divide the entire dataset into v parts. Then, each
of these v parts will be selected as a validation set one at a time. The remaining v-1 parts are used to
train/fit the model.
In each round, the model is fitted using v-1 parts of the original data as training data. Afterward, the
accuracy and the AUC are computed on the validation set.
We will again use functions from tidymodels to perform the v-fold cross validation. The first step
will be to define the logistic regression model using functions from tidymodels.

library(tidymodels)
log_mod <-
logistic_reg(mode = "classification", engine = "glm")

Now we are ready to create the different folds.

set.seed(111) # for reproducibility


folds <- vfold_cv(email, v = 10)

folds
# # 10-fold cross-validation
# # A tibble: 10 x 2
# splits id
# <list> <chr>
# 1 <split [3528/393]> Fold01
# 2 <split [3529/392]> Fold02
# 3 <split [3529/392]> Fold03
# 4 <split [3529/392]> Fold04
# 5 <split [3529/392]> Fold05


# 6 <split [3529/392]> Fold06


# # i 4 more rows

Next, we will fit the logistic regression model to ten different datasets (folds), beginning by creating
a workflow.

glm_wf <-
workflow() |>
add_model(log_mod) |>
add_formula(spam ~ .)

Afterwards, we will fit the models based on the above workflow.

glm_fit_full <-
glm_wf |>
fit_resamples(folds)

Given the fitted models, the accuracy measures are computed with collect_metrics().

collect_metrics(glm_fit_full)
# # A tibble: 3 x 6
# .metric .estimator mean n std_err .config
# <chr> <chr> <dbl> <int> <dbl> <chr>
# 1 accuracy binary 0.914 10 0.00381 Preprocessor1_Model1
# 2 brier_class binary 0.0649 10 0.00203 Preprocessor1_Model1
# 3 roc_auc binary 0.885 10 0.00336 Preprocessor1_Model1

The output includes the average accuracy and area under the curve. Our average AUC is lower than
the one obtained on the single test dataset. Therefore, based on that one split, the estimate was too
optimistic.
Everything is now repeated for the reduced model. First, we update the formula.

glm_wf <- update_formula(glm_wf, spam ~ . - cc - exclaim_subj)

Afterwards, we fit the models based on the updated workflow.

glm_fit_red <-
glm_wf |>
fit_resamples(folds)


Let’s compare the accuracy measures calculated for the two models now.

# type="wide" only shows mean for all three accuracy measures


collect_metrics(glm_fit_full, type = "wide")
# # A tibble: 1 x 4
# .config accuracy brier_class roc_auc
# <chr> <dbl> <dbl> <dbl>
# 1 Preprocessor1_Model1 0.914 0.0649 0.885

collect_metrics(glm_fit_red, type = "wide")


# # A tibble: 1 x 4
# .config accuracy brier_class roc_auc
# <chr> <dbl> <dbl> <dbl>
# 1 Preprocessor1_Model1 0.913 0.0648 0.886

Conclusion: The average AUC for the reduced model is slightly larger. Barely, but larger.

Your turn

The common brushtail possum of the Australian region is a bit cuter than its distant cousin, the American opossum. We examined 104 brushtail possums from two regions in Australia, where the possums may be considered a random sample from the population. The first region is Victoria, located in the eastern half of Australia and spanning the southern coast. The second region comprises New South Wales and Queensland, constituting the eastern and north-eastern parts of Australia. We use logistic regression to differentiate between possums in these two regions.

The outcome variable, called pop, takes value Vic when a possum is from Victoria and other
(reference level) when it is from New South Wales or Queensland. We consider five predictors:
sex, headL (head length), skullW (skull width), totalL (total length) and tailL (tail length).
The full model was fitted to the data
# # A tibble: 6 x 5
# term estimate std.error statistic p.value
# <chr> <dbl> <dbl> <dbl> <dbl>
# 1 (Intercept) 39.2 11.5 3.40 0.000672
# 2 sexm -1.24 0.666 -1.86 0.0632
# 3 headL -0.160 0.139 -1.16 0.248
# 4 skullW -0.201 0.133 -1.52 0.129
# 5 totalL 0.649 0.153 4.24 0.0000227


# 6 tailL -1.87 0.374 -5.00 0.000000571

Running a backward selection using step() leads to the model

(Intercept)        sexm      skullW      totalL       tailL
 33.5094680  -1.4206720  -0.2787295   0.5687256  -1.8056547

a) Explain why the remaining estimates change between the two models.

b) Write out the form of the reduced model and give an interpretation of the estimated slope
of tail length.

c) The plot below displays the ROC curve for both the full (in red) and reduced (in blue)
models. The AUC values are 0.921 and 0.919. Which model would you prefer based on
the AUC?

[Figure: ROC curves (sensitivity versus specificity) for the full and reduced (step) possum models]

d) While visiting a zoo in the US, we came across a brushtail possum with a sign indicating
that it was captured in the wild in Australia. However, the sign did not specify the exact
location within Australia. It mentioned that the possum is male, has a skull width of
approximately 63 mm, a tail length of 37 cm, and a total length of 83 cm. What is the probability computed from the reduced model that this possum is from Victoria? How confident are you in the accuracy of this model-based probability?


Short summary

This chapter explains how to model binary outcomes, such as the spam status of emails. The
chapter uses an email dataset to illustrate key concepts, including the logit function, odds ra-
tios, and the process of fitting a logistic regression model in R. It further discusses evaluating
model performance through metrics like sensitivity, specificity, ROC curves, and cross-
validation, demonstrating the selection of an appropriate classification threshold and model
comparison.

Part VI

Inference

11 Foundations of Inference

11.1 Intro

What we have discussed so far:

• we know how to describe the outcome of a random experiment by using a sample space,
• we are aware that by choosing a probability measure/distribution we are able to describe
the distribution of the outcome of a random experiment.

Let’s summarize this in the definition of a probability model.

Definition 11.1. Let 𝑆 be a sample space and P a probability measure, which defines probabilities
P(𝐴) for all events 𝐴 of a random experiment with outcomes on 𝑆. Then we call the pair (𝑆, P) a
probability model.

Example 11.1. For the random experiment of flipping a coin, we would use the following probability
model:
({Heads, Tails}, P),
with P(Heads) = 𝜃 and P(Tails) = 1 − 𝜃 for some value 𝜃 ∈ (0, 1).

In probability theory, we know 𝜃 and are interested in computing probabilities of events under the
probability model given by 𝜃. For instance, if we toss a fair coin, we know that the probability of
getting heads is 0.5.
In inferential statistics, we make assumptions about the probability model of a random experiment,
but we do not know the true probability measure.

11.1.1 Modeling data

How can we determine the value of 𝜃, if it is unknown?

We have to use data. In Example 11.1 we do not know 𝜃, the probability of heads, but we will toss the
coin repeatedly to infer 𝜃.


Our approach

• We consider a sample of 𝑛 independent outcomes of the random experiment described by the unknown probability model.

• Here, 𝑛 is referred to as the sample size.

• We specify a statistical model = a collection of candidates for the unknown data-generating probability measure/distribution.

Before proceeding to discuss how we use data, we introduce the concept of a statistical model
formally.

Definition 11.2. A statistical model is a pair (𝑆, (P𝜃 )𝜃∈Θ ), where 𝑆 is the sample space for the
considered random experiment and (P𝜃 )𝜃∈Θ is a collection of probability measures, each of which
defines probabilities P𝜃 (𝐴) for events 𝐴 ⊆ 𝑆.
The probability measures depend on the unknown population parameter 𝜃 whose values form the
parameter space Θ.

Let 𝑋𝑖 , 𝑖 = 1, … , 𝑛, be independent random variables with values in 𝑆 and distribution P𝜃 for 𝜃 ∈ Θ. Then we call 𝑋1 , … , 𝑋𝑛 an independent and identically distributed (i.i.d.) sample from the statistical model (𝑆, (P𝜃 )𝜃∈Θ ).

11.2 Point estimation

We are interested in estimating the population parameter 𝜃.

Example 11.2. If P𝜃 is a normal distribution (see Definition A.8) for which both the mean 𝜇 and
the variance 𝜎2 are unknown, then 𝜃 = (𝜇, 𝜎2 )⊤ and Θ = R × (0, ∞) reflecting that 𝜇 ∈ R and
𝜎2 > 0.

Given the sample X = (𝑋1 , … , 𝑋𝑛 ), we use sample statistics as point estimators for the unknown
population parameters of interest.

Definition 11.3. Let X = (𝑋1 , … , 𝑋𝑛 ) be a sample. Then we call any real-valued function 𝑇
defined on the sample space a statistic.


An example would be the sample mean $T: \mathbf{X} \mapsto T(\mathbf{X}) = \bar{X}_n := \frac{1}{n}\sum_{i=1}^{n} X_i$.
Whenever we compute the value of a point estimator, we will make an error, which is defined as the difference between the value of the sample statistic $T(\mathbf{X})$ and the population parameter $\theta$ it estimates, e.g., $\bar{X}_n - \mu$.

Definition 11.4. The bias is the systematic tendency to over- or under-estimate the true popula-
tion parameter. It is defined as
E[𝑇 (X)] − 𝜃,
where 𝜃 is the population parameter of interest.

The bias is the expected error. In addition to the bias, the error contains the sampling error, 𝑇 (X) −
E[𝑇 (X)], which describes how much an estimate will tend to vary from one sample to the next. We
can summarize the sampling error by computing the standard deviation of the estimator, which is
also called the standard error.

Definition 11.5. Let 𝑋1 , … , 𝑋𝑛 be an i.i.d. sample and 𝑇 (X) a point estimator for the unknown
parameter 𝜃. The standard deviation of 𝑇 (X) is then called the standard error of 𝑇 (X) and will be
denoted by
SE(𝑇 (X)) = √Var[𝑇 (X)] .

Example 11.3. Consider again the example of the sample mean $T(\mathbf{X}) = \bar{X}_n$ of i.i.d. observations $X_1, \ldots, X_n$ with $\mathrm{Var}[X_i] = \sigma^2$. The standard error of the sample mean is then

$$SE(\bar{X}_n) = \sqrt{\mathrm{Var}\Big[\frac{1}{n}\sum_{i=1}^{n} X_i\Big]} = \sqrt{\frac{1}{n^2}\,\mathrm{Var}\Big[\sum_{i=1}^{n} X_i\Big]} = \sqrt{\frac{1}{n^2}\sum_{i=1}^{n}\mathrm{Var}[X_i]} = \sqrt{\frac{n\,\sigma^2}{n^2}} = \frac{\sigma}{\sqrt{n}}\,.$$
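A quick simulation sketch illustrates the formula; the choice of a normal distribution with σ = 2 and n = 50 is ours and purely illustrative:

set.seed(1)
n <- 50
sigma <- 2

# empirical sd of 10000 simulated sample means vs. the theoretical value
means <- replicate(10000, mean(rnorm(n, mean = 0, sd = sigma)))
sd(means)        # close to the theoretical standard error
sigma / sqrt(n)  # 0.2828...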

Much of statistics is focused on understanding and quantifying the sampling error. The sample size
is important when quantifying this error.
In the following we analyse the sampling error through simulations. We introduce the sampling
distribution of a sample statistic.


11.3 Sampling distribution

Suppose we are given a bowl like this one

Figure 11.1: From Ismay and Kim (2019)

Let’s consider all balls in the bowl as our population of interest. Assume we are interested in
answering the question: What is the proportion of red balls in the bowl?
In general, we can precisely answer this question by counting the number of red (and white, if we
don’t know the total number) balls. But this wouldn’t be any fun at all, unless we have a digital
version of the bowl.
The package moderndive, which accompanies the book Ismay and Kim (2019), contains such a digital
version of the bowl.

library(moderndive)
bowl
# # A tibble: 2,400 x 2
# ball_ID color
# <int> <chr>
# 1 1 white
# 2 2 white
# 3 3 white
# 4 4 red
# 5 5 white
# 6 6 white
# # i 2,394 more rows

So, the bowl contains 2400 balls. Now let’s compute the proportion of red balls.


Using the summarize() function, we can easily determine the total number of balls and the number
of red balls. With this information, we can calculate the proportion of red balls in the bowl.

bowl |>
summarise(
n = n(),
sum = sum(color == "red"),
prop = sum / n)
# # A tibble: 1 x 3
# n sum prop
# <int> <int> <dbl>
# 1 2400 900 0.375

Remark. By taking all balls (=the whole population) into account, we have done a full survey.

Virtual sampling

In reality, however, no one would want to check all 2400 balls for their color. Therefore, sampling is
typically the only realistic option.
But of course, given the sample, we would still like to answer the question: What is the proportion
of red balls in the bowl?
To answer this question, we define the proportion as the population parameter 𝜃 of a statistical
model and then consider the sample as a sample from this model.
If we take a sample 𝑋1 , … , 𝑋𝑛 of size 𝑛, we will perform 𝑛 Bernoulli trials with success probability
𝜃, which is equal to the proportion of red balls in the bowl.
Hence, the statistical model is given by ({0, 1}, P𝜃 ), with

P𝜃 (red ball in the i-th trial) = P𝜃 (𝑋𝑖 = 1) = 𝜃 .

We will draw samples of different sizes 𝑛 ∈ {25, 50, 100} to additionally analyze the influence of the sample size on the outcome.

For each of the three samples, we then calculate the relative frequency of red balls.

To draw the samples we use the function rep_sample_n() from the infer package, which is another
tidymodels package.

Remark. It is not absolutely necessary to load another package here, we could use dplyr functions
instead. However, the infer package is beneficial at this point, especially later, and the included
functions are intuitive to use.


library(infer)
bowl |>
rep_sample_n(size = 25) |>
summarise(
prop = sum(color == "red") / 25
)
# # A tibble: 1 x 2
# replicate prop
# <int> <dbl>
# 1 1 0.32

bowl |>
rep_sample_n(size = 50) |>
summarise(
prop = sum(color == "red") / 50
)
# # A tibble: 1 x 2
# replicate prop
# <int> <dbl>
# 1 1 0.44

bowl |>
rep_sample_n(size = 100) |>
summarise(
prop = sum(color == "red") / 100
)
# # A tibble: 1 x 2
# replicate prop
# <int> <dbl>
# 1 1 0.39

Conclusion: All three values differ compared to the true value 0.375.

Uncertainty of our estimator

In practice, we would not know the true value. Therefore, we further consider how to evaluate or
estimate the quality of the calculated values.
One question we should ask ourselves in this regard is:

How much does the estimated value deviate from the center (ideally the unknown 𝜃) of
the distribution?


If we were able to calculate not just one estimate but many, we could simply use the empirical standard
deviation as a measure of dispersion.
In reality, this is usually not possible, since it involves costs and/or time.
But since we only collect our samples on the computer, it is no problem for us to draw 1000 samples
of length 𝑛 ∈ {25, 50, 100}.

set.seed(123) # for reproducibility

stp_25 <- bowl |>
  rep_sample_n(size = 25, reps = 1000)

stp_50 <- bowl |>
  rep_sample_n(size = 50, reps = 1000)

stp_100 <- bowl |>
  rep_sample_n(size = 100, reps = 1000)

Now we have, for each sample size, $N = 1000$ samples $\mathbf{x}^j = (x_1^j, \ldots, x_n^j)$, $j \in \{1, \ldots, N\}$, where
$$x_i^j = \begin{cases} 1, & i\text{-th ball in sample } j \text{ is red}, \\ 0, & i\text{-th ball in sample } j \text{ is white}. \end{cases}$$
In the next step we will compute for each sample the empirical mean (= proportion of red balls)
$$\hat{\theta}_n^j = \bar{x}_n^j = \frac{1}{n}\sum_{i=1}^{n} x_i^j\,, \qquad j \in \{1, \ldots, N\}\,.$$
Then, we are able to compute the empirical standard deviation of the 1000 estimated proportions
$$s_n = \sqrt{\frac{1}{N-1}\sum_{j=1}^{N}\big(\hat{\theta}_n^j - \overline{\hat{\theta}_n}\big)^2}\,, \qquad n \in \{25, 50, 100\}\,,$$
which is an estimate for the standard error of $\bar{X}_n$.

stp_25 |>
summarise(
prop = sum(color == "red") / 25
) |>
summarise(sd_prop = sd(prop))
# # A tibble: 1 x 1
# sd_prop
# <dbl>
# 1 0.0986


stp_50 |>
summarise(
prop = sum(color == "red") / 50
) |>
summarise(sd_prop = sd(prop))
# # A tibble: 1 x 1
# sd_prop
# <dbl>
# 1 0.0704

stp_100 |>
summarise(
prop = sum(color == "red") / 100
) |>
summarise(sd_prop = sd(prop))
# # A tibble: 1 x 1
# sd_prop
# <dbl>
# 1 0.0477

Conclusion: We detect decreasing uncertainty with increasing sample size.


This can also be seen by taking a look at the histograms of $\bar{x}_n^1, \ldots, \bar{x}_n^N$ for $n \in \{25, 50, 100\}$.

Recap

Given the whole population data (the bowl). We were able to create 1000 (our choice) samples
of size 𝑛 = 100 (our choice) from the population data.

stp_100 <- bowl |>
  rep_sample_n(size = 100, reps = 1000)

For each sample we computed an estimate for the proportion of red balls

stp_100 |>
summarise(
prop = sum(color == "red") / 100
)



Figure 11.2: We see a reduction in variability with increasing sample size. In addition, we see that
the empirical distribution is symmetrically distributed around the true parameter
0.375.


The empirical distribution of the 1000 estimates, shown in Figure 11.3, is then considered as an approximation to the sampling distribution of the statistic under consideration - in our case the empirical mean.

Definition 11.6. Let X = (𝑋1 , … , 𝑋𝑛 ) be a sample from a statistical model and 𝑇 (X) a statistic.
Then we call the distribution of the r.v. 𝑇 (X) the sampling distribution.

As said, in reality we will usually not be able to collect more than one sample. Nevertheless, we
would like to say something about the distribution of our estimator (here 𝑋 𝑛 ). So we have to think
about other strategies.

Approximating the sampling distribution

One can think of three different approaches for approximating the sampling distribution.

1. Theoretical approach: For several statistics 𝑇 (X) one can derive the distribution of
𝑇 (X) by making assumptions about the distribution of the sample X. In this approach,
it’s important to consider whether the assumed distribution for the sample X actually
applies to the observed sample x.

2. Asymptotic approach: There are methods that enable us to approximate the distribution
of 𝑇 (X) for “large” samples. One important method is the Central Limit Theorem.

3. Bootstrap approach: In order to approximate the sampling distribution, we would ideally take resamples from the population. However, since this is not feasible, the bootstrap approach involves repeatedly sampling (with replacement) from the original dataset. This method also provides an approximation of the sampling distribution in various scenarios.

We will now have a closer look at all three approaches.


11.3.1 Theoretical approach

In this section, we won’t focus on deriving the exact distribution of a specific statistic 𝑇 (X) ourselves.
Instead, this part is more about providing several examples of statistics for which one can derive the
exact distribution after making assumptions about the distribution of the sample X.

Example 11.4. Given a sample $\mathbf{X} = (X_1, \ldots, X_n)^\top$ of i.i.d. observations from the statistical model $(\mathbb{R}, N(\mu, \sigma^2))$, with $\theta = (\mu, \sigma^2)^\top$, one can derive the following results.

1. The sample mean $T(\mathbf{X}) = \bar{X}_n$ has a normal distribution with mean $\mu$ and variance $\frac{\sigma^2}{n}$. This result implies
$$\frac{\bar{X}_n - \mu}{\sigma/\sqrt{n}} \sim N(0, 1)\,.$$

2. When inferring about the mean $\mu$, the variance $\sigma^2$ is generally unknown. In this scenario, the sample mean is standardized using the empirical variance $S_n^2(\mathbf{X}) = \frac{1}{n-1}\sum_{i=1}^{n}(X_i - \bar{X}_n)^2$, resulting in a known distribution: the t-distribution with $n-1$ degrees of freedom,
$$\frac{\bar{X}_n - \mu}{\sqrt{S_n^2(\mathbf{X})/n}} \sim t(n-1)\,.$$

3. Scaling the empirical variance appropriately leads to a chi-squared distribution with $n-1$ degrees of freedom:
$$\frac{n-1}{\sigma^2}\, S_n^2(\mathbf{X}) \sim \chi^2(n-1)\,.$$

Example 11.5. Given a sample $\mathbf{X} = (X_1, \ldots, X_n)^\top$ of i.i.d. observations from the statistical model $(\{0, 1\}, P_\theta)$, with $P_\theta$ being a Bernoulli distribution with parameter $\theta \in (0, 1)$. Then, the distribution of the statistic $T(\mathbf{X}) = \sum_{i=1}^{n} X_i$ is a binomial distribution with parameters $n$ and $\theta$, i.e.,
$$\sum_{i=1}^{n} X_i \sim \mathrm{Bin}(n, \theta)\,,$$
where $\mathrm{Bin}(n, \theta)$ denotes the binomial distribution with parameters $n$ and $\theta$.

Important

One advantage of the theoretical approach is that we can work with a well-specified distribution.
However, this benefit comes with a caveat. We need to consider whether the assumed distribu-
tion for the observations 𝑋1 , … , 𝑋𝑛 is realistic given the observed sample 𝑥1 , … , 𝑥𝑛 . If this is
not the case, the resulting distribution of the sample statistic 𝑇 (X) is likely to be incorrect.


11.3.2 Asymptotic approach

The literature on the asymptotic distributions of sample statistics is extensive. However, we are specifically interested in the approximate distribution of the sample mean $T(\mathbf{X}) = \frac{1}{n}\sum_{i=1}^{n} X_i$.
In the case of i.i.d. random variables, the famous Central Limit Theorem (CLT) is applicable to derive the asymptotic distribution of $T(\mathbf{X})$.

Theorem 11.1. Suppose $X_1, X_2, \ldots, X_n, \ldots$ is a sequence of independent and identically distributed random variables with mean $\mu$ and variance $\sigma^2$. Then, for sufficiently large sample size $n$, the r.v. $\bar{X}_n = \frac{1}{n}\sum_{i=1}^{n} X_i$ will tend to follow a normal distribution with mean $\mu$ and variance $\frac{\sigma^2}{n}$, which will be denoted by
$$\bar{X}_n \,\dot{\sim}\, N\Big(\mu, \frac{\sigma^2}{n}\Big)\,.$$

Remark. When estimating, e.g., a proportion 𝜃 ∈ (0, 1), sufficiently large is characterized through
𝑛𝜃 ≥ 10 and 𝑛(1 − 𝜃) ≥ 10.

Let’s try to verify Theorem 11.1 through simulations.

theta <- 0.4


# initialize vector
emp_means <- vector(mode = "double", length = 1000)

# take 1000 samples of size 25 from
# Binomial(1, theta) distribution
for(i in 1:1000){
  emp_means[i] <- mean(
    rbinom(25, size = 1, prob = theta)
  )
}

p1 <- ggplot(tibble(x = emp_means), aes(x)) +
  geom_histogram(color = "white") +
  labs(title = "n = 25")

# take 1000 samples of size 100 from
# Binomial(1, theta) distribution
for(i in 1:1000){
  emp_means[i] <- mean(
    rbinom(100, size = 1, prob = theta)
  )
}


p2 <- ggplot(tibble(x = emp_means), aes(x)) +
  geom_histogram(color = "white") +
  labs(title = "n = 100")

[Figure: histograms of the 1000 simulated sample means for n = 25 and n = 100]

We will use this result to make inferences about the mean 𝜇 of a distribution P. Even though the
observations 𝑋1 , … , 𝑋𝑛 in an i.i.d. sample come from a non-normal distribution, we can use the
normal distribution defined by the Central Limit Theorem to draw conclusions about the mean.

11.3.3 Bootstrap approach

The bootstrap is a powerful statistical tool used to measure uncertainty in estimators. In practical
terms, while we can gain a good understanding of the accuracy of an estimator 𝑇 (X) by drawing
samples from the population multiple times, this method is not feasible for real-world data. This is
because with real data, it’s rarely possible to generate new samples from the original population.
In the bootstrap approach we use a computer to simulate the process of obtaining new sample sets.
This way, we can estimate the variability of 𝑇 (X) without needing to generate additional samples.


Instead of repeatedly sampling independent datasets from the population, we create samples by re-
peatedly sampling with replacement observations from the original dataset.
The idea behind the bootstrap approach is, that the original sample approximates the population. So,
resamples from the observed sample approximate independent samples from the population.
The bootstrap distribution of a statistic, based on many resamples, approximates the sampling distribu-
tion of the statistic.

Bootstrap algorithm

Input: observed sample $\mathbf{x} = (x_1, \ldots, x_n)^\top$ and number of resamples $B$

1. For $b = 1, \ldots, B$, randomly select $n$ observations with replacement from $\{x_1, \ldots, x_n\}$ to create the resample $\mathbf{x}^b = (x_1^b, \ldots, x_n^b)^\top$.

2. Compute for each resample $\mathbf{x}^b$ the value of the statistic $T(\mathbf{x}^b)$.

Evaluating the algorithm allows us to estimate the standard error $SE(T(\mathbf{X}))$ of the statistic $T(\mathbf{X})$. An estimator is given by
$$\widehat{SE}(T(\mathbf{X})) = \sqrt{\frac{1}{B-1}\sum_{b=1}^{B}\Big(T(\mathbf{x}^b) - \frac{1}{B}\sum_{j=1}^{B} T(\mathbf{x}^j)\Big)^2}\,.$$
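A minimal base-R sketch of this estimator for the sample mean; the function name is ours, and the infer-based workflow used below is what we rely on in practice:

bootstrap_se <- function(x, B = 1000, statistic = mean) {
  # B bootstrap statistics, each computed on a resample drawn with replacement
  t_boot <- replicate(B, statistic(sample(x, size = length(x), replace = TRUE)))
  sd(t_boot)  # empirical sd of the bootstrap statistics (denominator B - 1)
}

set.seed(1)
x <- rbinom(100, size = 1, prob = 0.375)  # a sample resembling the bowl example
bootstrap_se(x)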

Let’s illustrate this algorithm with a very small dataset of size 𝑛 = 3, 𝐵 = 3 resamples and as statistic
𝑇 (x) the empirical mean.

set.seed(123) # for reproducibility

x <- bowl |>
  slice_sample(n = 3)

x
# # A tibble: 3 x 2
# ball_ID color
# <int> <chr>
# 1 2227 red
# 2 526 white
# 3 195 white

Three resamples can be created with the following code.


x_B <- x |>
  rep_sample_n(size = 3, replace = TRUE, reps = 3)

x_B
# # A tibble: 9 x 3
# # Groups: replicate [3]
# replicate ball_ID color
# <int> <int> <chr>
# 1 1 526 white
# 2 1 526 white
# 3 1 526 white
# 4 2 195 white
# 5 2 2227 red
# 6 2 526 white
# 7 3 526 white
# 8 3 2227 red
# 9 3 526 white

One can then calculate the average from each of these three samples to estimate the proportion of
red balls.

x_B |>
summarise(
prop = mean(color == "red"))
# # A tibble: 3 x 2
# replicate prop
# <int> <dbl>
# 1 1 0
# 2 2 0.333
# 3 3 0.333

It’s evident that the choice of 𝐵 = 3 was purely for illustrative purposes. For small values of 𝐵, the
̂ (X)) will lack accuracy. A typical value of 𝐵 in real applications
estimate of the standard error SE(𝑇
is 1000.
Hence, we will now increase 𝐵. In addition, we will simplify the code for generating the resam-
ples and visualizing the distribution of these resamples using additional functions from the infer
package.


infer package workflow

The infer workflow to generate bootstrap resamples is visualized in Figure 11.4.

Figure 11.4: From Ismay and Kim (2019).

To create and visualize the bootstrap distribution we use the functions:

• specify(), defines the variable of interest in the dataset,


• generate(), defines the number of sample repetitions and their type,
• calculate(), defines which statistic to calculate,
• visualize(), visualizes the bootstrap distribution of the calculated statistic,

from the infer package.


Let’s create an observed sample again, but this time of size 𝑛 = 100.

set.seed(123) # for reproducibility

x <- bowl |>
  slice_sample(n = 100)

In the next step, we need to specify the variable under consideration and select the success argument
since we want to determine a proportion.


(x_sp <- x |>
   specify(response = color, success = "red"))
# Response: color (factor)
# # A tibble: 100 x 1
# color
# <fct>
# 1 red
# 2 white
# 3 white
# 4 white
# 5 red
# 6 white
# # i 94 more rows

Now, we can generate 1000 bootstrap samples and calculate the proportion of red balls for each sam-
ple.

(bootstrap_means <- x_sp |>
   generate(reps = 1000, type = "bootstrap") |>
   calculate(stat = "prop"))
# Response: color (factor)
# # A tibble: 1,000 x 2
# replicate stat
# <int> <dbl>
# 1 1 0.32
# 2 2 0.37
# 3 3 0.39
# 4 4 0.34
# 5 5 0.37
# 6 6 0.4
# # i 994 more rows

After completing this step, we removed the 100000 observations from the 1000 bootstrap samples and
retained only the 1000 estimates.
Finally we can visualize the bootstrap distribution:

visualize(bootstrap_means) +
geom_vline(xintercept = mean(x$color == "red"), color = "blue",
size = 2)


[Figure: simulation-based bootstrap distribution of the 1000 bootstrap proportions]

The bootstrap distribution is centered around the mean of the sample x, and not the unknown pro-
portion 0.375 of the population.

REMEMBER: SAMPLING DISTRIBUTIONS ARE NEVER OBSERVED

In real-world applications, we never actually observe the sampling distribution. However, it is


useful to always think of a point estimate as coming from such a hypothetical distribution.
We can approximate the sampling distribution by simulation-based methods, asymptotic con-
siderations or make distributional assumptions about the sample X to derive a theoretical ap-
proximation.


Short summary

This chapter introduces core concepts in statistical inference. The text explains statistical mod-
els and the crucial idea of an independent and identically distributed (i.i.d.) sample. It further
discusses point estimation of population parameters, including concepts like bias and stan-
dard error, and explores the idea of a sampling distribution. Finally, it examines different
approaches for approximating this distribution, including theoretical, asymptotic (specifically
the Central Limit Theorem), and bootstrap methods, providing practical examples and R code
snippets for illustration.

12 Confidence intervals

In Section 11.3.3 we computed 1000 bootstrap means as estimates of the true proportion of red balls,
which was 0.375. All of them have been “wrong” (=not equal to the true value).

sum(bootstrap_means == 0.375)
# [1] 0

Even for the 1000 samples stp_100 from the bowl (the population), we did not see one estimate being
equal to the true value.

stp_100 |>
summarise(
prop = sum(color == "red") / 100
) |>
summarise(sum(prop == 0.375))
# # A tibble: 1 x 1
# `sum(prop == 0.375)`
# <int>
# 1 0

Hence, none of our point estimates for $\theta$, the proportion of red balls, produced the correct value of 0.375.
But maybe our choice of using the point estimator $T(\mathbf{X}) = \frac{1}{n}\sum_{i=1}^{n} X_i$ was bad?
That's not the case. $T(\mathbf{X})$ is actually the maximum likelihood estimator of $\theta$, and as such has "nice" statistical properties (which we won't discuss in this course).

MLE

We will not discuss the derivation of maximum likelihood estimator (MLE) in general, but let’s
revisit the concept in the case of i.i.d. Bernoulli trials:

1, ball is red,
𝑋𝑖 = {
0, ball is white.


MLE method

Given a statistical model, the Maximum Likelihood Estimation (MLE) method estimates the
unknown parameter 𝜃 in a way that is most consistent with the observed data. In other words,
MLE gives the distribution under which the observed data are most likely.

For $\theta \in (0, 1)$, the likelihood of the observed data $x_1, \ldots, x_n$ is given by
$$L(x_1, \ldots, x_n \mid \theta) = \prod_{i=1}^{n} P(X_i = x_i) = \prod_{i=1}^{n} \theta^{x_i}(1-\theta)^{1-x_i} = \theta^{n_1}(1-\theta)^{n-n_1}\,,$$
where $n_1 = \#\{i : x_i = 1\}$ is the number of 'successes' (coded as 1). When considered as a function of $\theta$ this expression gives the likelihood function, and the maximum likelihood estimate $\hat{\theta}$ is then the parameter value such that
$$L(x_1, \ldots, x_n \mid \hat{\theta}) = \max_{\theta \in (0,1)} L(x_1, \ldots, x_n \mid \theta)\,.$$
Setting the derivative to zero,
$$\frac{d}{d\theta}\, \theta^{n_1}(1-\theta)^{n-n_1} = (1-\theta)^{n-n_1-1}\, \theta^{n_1-1}\, (n_1 - n\theta) = 0\,,$$
we have that
$$n_1 - n\hat{\theta} = 0 \;\Longleftrightarrow\; \hat{\theta} = \frac{n_1}{n}\,.$$
So, $\hat{\theta}$ is the proportion of 'successes' out of the $n$ trials.
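The closed-form result can be checked numerically by maximizing the log-likelihood; a small sketch with simulated Bernoulli data (all values illustrative):

set.seed(1)
x <- rbinom(50, size = 1, prob = 0.375)

# log-likelihood of theta for the observed zeros and ones
loglik <- function(theta) sum(dbinom(x, size = 1, prob = theta, log = TRUE))

# numerical maximizer vs. the closed-form MLE n1 / n
optimize(loglik, interval = c(0, 1), maximum = TRUE)$maximum
mean(x)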
Using a point estimate is similar to fishing in a murky lake with a spear. The chance of hitting the
correct value is very low.
We also saw that each point estimate is the realization of a random variable 𝑇 (X) having a non-zero
standard deviation.

In conclusion, it is important to not only provide a point estimate (a data-informed "guess") but also understand and quantify the uncertainty about such an estimate.

Confidence intervals

A combination of point estimate and corresponding uncertainty is given by a confidence interval.


Definition 12.1. Let X = (𝑋1 , … , 𝑋𝑛 ) be an i.i.d. sample and 𝛼 ∈ (0, 1) a chosen error level. Then
we call an interval 𝐶𝐼(X) a 100(1 − 𝛼)% confidence interval for the population parameter 𝜃 of
interest, if 𝐶𝐼(X) covers 𝜃 with probability at least 1 − 𝛼, i.e.,
P(𝜃 ∈ 𝐶𝐼(X)) ≥ 1 − 𝛼 , ∀𝜃 ∈ Θ .
Hence, 𝐶𝐼(X) is a plausible range of values for the population parameter 𝜃 of interest.

Conclusion: If we report a point estimate, we probably won’t hit the exact population parameter. If
we report a range of plausible values, we have a good shot at capturing the parameter.
We mentioned that a confidence interval for 𝜃 expresses the uncertainty of the corresponding point
estimate 𝑇 (X). Since we do not know the distribution of 𝑇 (X) (the sampling distribution), we must
approximate it using one of the three approaches introduced above. Given the approximation, we
can compute the confidence interval.

12.1 Theoretical approach

In this section, we will discuss the construction of a confidence interval based on the theoretical
approach. We will focus on an easy special case, rather than the general case.
Let’s assume we have i.i.d. observations 𝑋1 , … , 𝑋𝑛 from the statistical model (R, N (𝜃, 𝜎2 )), i.e.,
from a model specifying a normal distribution for which we know the variance 𝜎2 . The goal is to
construct a confidence interval for the unknown mean 𝜃.

Remark. This model is not very realistic. It’s rarely the case that you do not know the mean value
but do know the variance. We only consider it because deriving the confidence interval is illustrative
and easy to understand.
We know from Example 11.4 that the distribution of the sample mean $\bar{X}_n = \frac{1}{n}\sum_{i=1}^{n} X_i$ is $N(\theta, \frac{\sigma^2}{n})$. But this is of course an unknown distribution, since we do not know $\theta$. Hence, we can't compute any probabilities of the form $P(\bar{X}_n \le x)$ for given $x$.

Idea

Specify a function 𝑔 of the point estimator 𝑋 𝑛 and the unknown parameter 𝜃 (and perhaps even
other known parameters) so that the distribution of 𝑔(𝑋 𝑛 , 𝜃) is known and therefore indepen-
dent of 𝜃.
Using this known distribution, one can construct a two-sided interval [𝑥𝛼/2 , 𝑥1−𝛼/2 ] for which
the following holds:

P(𝑔(𝑋 𝑛 , 𝜃) ∈ [𝑥𝛼/2 , 𝑥1−𝛼/2 ]) = P(𝑥𝛼/2 ≤ 𝑔(𝑋 𝑛 , 𝜃) ≤ 𝑥1−𝛼/2 ) = 1 − 𝛼 .


Here, the notation 𝑥𝛽 stands for the 𝛽-quantile of the distribution of 𝑔(𝑋 𝑛 , 𝜃).


According to the above calculation, we obtain our confidence interval 𝐶𝐼(X) as the set of all values
of 𝜃 for which 𝑥𝛼/2 ≤ 𝑔(𝑋 𝑛 , 𝜃) ≤ 𝑥1−𝛼/2 .

Remark. A value 𝑥𝛽 from a distribution with distribution function 𝐹 is said to be the 𝛽-quantile, if

𝐹 (𝑥𝛽 ) = P((−∞, 𝑥𝛽 ] ) ≥ 𝛽 and P([𝑥𝛽 , ∞)) ≥ 1 − 𝛽, 𝛽 ∈ (0, 1) ,

see also Definition 7.18.

It is straightforward to find the function $g$ for the normal distribution. It holds that
$$\frac{\bar{X}_n - \theta}{\sqrt{\sigma^2/n}} \sim N(0, 1)\,,$$
and hence, the function $g$ has the form $g: (x, \theta) \mapsto \frac{x - \theta}{\sqrt{\sigma^2/n}}$.

Let $z_\alpha$ be the $\alpha$-quantile of the standard normal distribution. Then we get
$$\begin{aligned}
1 - \alpha &= P\big(z_{\alpha/2} \le g(\bar{X}_n, \theta) \le z_{1-\alpha/2}\big) \\
&= P\Big(z_{\alpha/2} \le \frac{\bar{X}_n - \theta}{\sqrt{\sigma^2/n}} \le z_{1-\alpha/2}\Big) \\
&= P\Big(z_{\alpha/2}\,\frac{\sigma}{\sqrt{n}} \le \bar{X}_n - \theta \le z_{1-\alpha/2}\,\frac{\sigma}{\sqrt{n}}\Big) \\
&= P\Big(\bar{X}_n - z_{1-\alpha/2}\,\frac{\sigma}{\sqrt{n}} \le \theta \le \bar{X}_n - z_{\alpha/2}\,\frac{\sigma}{\sqrt{n}}\Big) \\
&= P\Big(\bar{X}_n - z_{1-\alpha/2}\,\frac{\sigma}{\sqrt{n}} \le \theta \le \bar{X}_n + z_{1-\alpha/2}\,\frac{\sigma}{\sqrt{n}}\Big) \\
&= P\Big(\theta \in \Big[\bar{X}_n - z_{1-\alpha/2}\,\frac{\sigma}{\sqrt{n}},\; \bar{X}_n + z_{1-\alpha/2}\,\frac{\sigma}{\sqrt{n}}\Big]\Big)\,,
\end{aligned}$$
where we used that $-z_{\alpha/2} = z_{1-\alpha/2}$, by symmetry of the normal distribution.


Thus,
$$\Big[\bar{X}_n - z_{1-\alpha/2}\,\frac{\sigma}{\sqrt{n}},\; \bar{X}_n + z_{1-\alpha/2}\,\frac{\sigma}{\sqrt{n}}\Big]$$
is a two-sided confidence interval with confidence level $1 - \alpha$ for the unknown mean value $\theta$ of a normal distribution with known variance $\sigma^2$.
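A small sketch of this interval in R, assuming the standard deviation σ = 2 is known and using a simulated normal sample of size n = 30 (all values illustrative):

set.seed(1)
sigma <- 2
n <- 30
x <- rnorm(n, mean = 5, sd = sigma)

alpha <- 0.05
mean(x) + c(-1, 1) * qnorm(1 - alpha / 2) * sigma / sqrt(n)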


Example 12.1. We consider again i.i.d. observations 𝑋1 , … , 𝑋𝑛 from a statistical model with a
normal distribution. However, this time both parameters are unknown. Therefore, we have 𝑋𝑖 ∼
N (𝜃1 , 𝜃2 ), with 𝜃 = (𝜃1 , 𝜃2 )⊤ unknown. In this example, our focus is on constructing a confidence
interval for the variance 𝜃2 .
It can be shown that
$$g(\mathbf{X}, \theta_2) := \sum_{i=1}^{n}\Big(\frac{X_i - \bar{X}_n}{\sqrt{\theta_2}}\Big)^2 = \frac{(n-1)\,S_n^2}{\theta_2} \sim \chi^2(n-1)\,,$$
where $\chi^2(n)$ denotes the $\chi^2$-distribution with $n$ degrees of freedom. Then the two-sided $1-\alpha$ confidence interval for $\theta_2$ is given by:
$$\Big[\frac{(n-1)\,S_n^2}{\chi^2_{1-\alpha/2}(n-1)},\; \frac{(n-1)\,S_n^2}{\chi^2_{\alpha/2}(n-1)}\Big]\,.$$
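This interval can be computed in R with the chi-squared quantile function qchisq(); a sketch for a simulated normal sample with true variance 4 (the data are illustrative):

set.seed(1)
x <- rnorm(40, mean = 0, sd = 2)  # true variance is 4
n <- length(x)
alpha <- 0.05

# two-sided 95% confidence interval for the variance
(n - 1) * var(x) / qchisq(c(1 - alpha / 2, alpha / 2), df = n - 1)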

12.2 Asymptotic approach

In Theorem 11.1 we have seen that the sampling distribution of the average of i.i.d. random variables $X_1, \ldots, X_n$, with $E[X_i] = \mu$ and $\mathrm{Var}[X_i] = \sigma^2$, can be approximated by a normal distribution with mean $\mu$ and variance $\frac{\sigma^2}{n}$ for large $n$. We can use this property to construct asymptotic confidence intervals, i.e., for all $\theta \in \Theta$, the inequality

P(𝜃 ∈ 𝐶𝐼(X)) ≥ 1 − 𝛼 ,

holds in the limit as 𝑛 → ∞ and, thus, approximately for “large” 𝑛. The construction builds on
approximating the distribution of the concerned estimator 𝑇 (X) by the normal distribution with
mean 𝜃 and standard deviation SE(𝑇 (X)).

Definition 12.2. When the Central Limit Theorem (CLT) is applicable to the sampling distribution
of a point estimator 𝑇 (X), i.e., the point estimator is the average of i.i.d. random variables, and the
sampling distribution of the estimator thus closely follows a normal distribution, we can construct a
100(1 − 𝛼)% confidence interval as
$$T(\mathbf{x}) \pm z_{1-\alpha/2} \cdot \widehat{SE}(T(\mathbf{X}))\,,$$
where $z_{1-\alpha/2}$ denotes the $1-\alpha/2$ quantile of the $N(0, 1)$ distribution and $\widehat{SE}(T(\mathbf{X}))$ is the estimated standard error of $T(\mathbf{X})$.


Example 12.2. We consider the outcome 𝑋1 , … , 𝑋𝑛 of 𝑛 i.i.d. Bernoulli trials. The success proba-
bility 𝜃 in each trial is unknown and we would like to compute a 95% confidence interval for 𝜃.
In order for the asymptotic approach to be effective, the sample size needs to be fairly large. Typically, the sample size is considered sufficiently large if there are at least 10 successes and 10 failures in the sample. The expected number of successes in a sample of size $n$ is
$$E\Big[\sum_{i=1}^{n} X_i\Big] = \sum_{i=1}^{n} E[X_i] = \sum_{i=1}^{n}\big(0 \cdot P(X_i = 0) + 1 \cdot P(X_i = 1)\big) = \sum_{i=1}^{n}\theta = n\theta\,.$$
Hence, we need $n\theta$ (the average number of successes) and $n(1-\theta)$ (the average number of failures) both greater than or equal to 10. This requirement is called the success-failure condition.
When these conditions are met, the sampling distribution of the point estimator $\hat{\theta}(\mathbf{X}) := \bar{X}_n$ is approximately normal with mean $\theta$ and standard error
$$SE(\bar{X}_n) = \sqrt{\mathrm{Var}\Big[\frac{1}{n}\sum_{i=1}^{n} X_i\Big]} = \sqrt{\frac{1}{n^2}\sum_{i=1}^{n}\mathrm{Var}[X_i]} = \sqrt{\frac{1}{n^2}\sum_{i=1}^{n}\theta(1-\theta)} = \sqrt{\frac{\theta(1-\theta)}{n}}\,.$$
The confidence interval then has the form
$$\hat{\theta}(\mathbf{X}) \pm z_{1-\alpha/2}\cdot\sqrt{\frac{\hat{\theta}(\mathbf{X})\,\big(1-\hat{\theta}(\mathbf{X})\big)}{n}}\,.$$

Remark. The success-failure condition depends on the unknown $\theta$. In applications we use our best guess to check the condition, i.e., we check if $n\hat{\theta}(\mathbf{x}) \ge 10$ and $n(1 - \hat{\theta}(\mathbf{x})) \ge 10$, with $\hat{\theta}(\mathbf{x})$ being the computed point estimate.

Let’s apply the above formula for a simulated dataset consisting of 100 (Pseudo-) Bernoulli trials with
a success probability of 0.3.

set.seed(123) # for reproducibility


x <- rbinom(n = 100, size = 1, prob = 0.3)

In the first step we check the success-failure condition


100 * mean(x)
# [1] 29
100 * (1 - mean(x))
# [1] 71

Hence, the condition holds and we can compute the 95% confidence interval based on the CLT:

c(mean(x) - qnorm(0.975) * sqrt(mean(x) * (1 - mean(x)) / 100),


mean(x) + qnorm(0.975) * sqrt(mean(x) * (1 - mean(x)) / 100))
# [1] 0.2010643 0.3789357

We observe that the computed interval covers the true value of 0.3.

12.3 Bootstrap approach

We may construct a confidence interval with the help of the simulated bootstrap distribution.

Idea

Use the simulated bootstrap distribution to determine the lower (𝑙𝑒) and upper (𝑢𝑒) endpoint
of the interval (𝑙𝑒, 𝑢𝑒), such that the interval has a probability of 1 − 𝛼 under the bootstrap
distribution.
But this means nothing more than identifying the cut-off values (quantiles) of the bootstrap
distribution. This approach is referred to as the percentile method.
As an alternative, we can use the standard error method, which calculates the standard devi-
ation of the bootstrap distribution and then computes the interval based on the formula given
in Definition 12.2.

12.3.1 Percentile method

Let’s assume we want to build a 95% confidence interval for the proportion of red balls in the bowl.
Using the percentile method, we would consider the middle 95% of values from the bootstrap distri-
bution.
We can achieve this by calculating the 2.5th and 97.5th percentiles of the bootstrap distribution.

(q95_bowl <- quantile(bootstrap_means$stat,


probs = c(0.025, 0.975)))
# 2.5% 97.5%
# 0.26 0.46


Hence, our 95% confidence interval is equal to (0.26, 0.46).


Interpretation: If we were to repeatedly take samples of size 𝑛 = 100 and then compute the con-
fidence interval in this manner, we would capture the unknown population parameter 𝜃 in 95% of
cases.

ggplot(bootstrap_means, aes(x = stat)) +


geom_histogram(bins = 13, colour = "white") +
geom_vline(xintercept = q95_bowl[1], colour = "gold", size = 2) +
geom_vline(xintercept = q95_bowl[2], colour = "gold", size = 2) +
geom_vline(xintercept = 0.375, color = "blue", size = 2) +
geom_text(x = 0.42, y = 225, label = "true proportion", color = "blue")


Figure 12.1: Bootstrap distribution and the corresponding 95% confidence interval based on the per-
centile method.

12.3.2 Standard error method

If the bootstrap distribution has a symmetric shape, like a normal distribution, we can construct
the confidence interval based on the formula given in Definition 12.2. The point estimate of the
unknown parameter $\theta$ is given by $\hat{\theta}(\mathbf{X}) = T(\mathbf{X})$. The standard error of the statistic is estimated
through the standard deviation of the bootstrap distribution. The interval has the form

$$\hat{\theta}(\mathbf{X}) \pm z_{1-\alpha/2} \cdot \widehat{\mathrm{SE}}(T(\mathbf{X})),$$

where $z_{1-\alpha/2}$ is the $(1-\alpha/2)$-quantile of the standard normal distribution and $\widehat{\mathrm{SE}}(T(\mathbf{X}))$ is the
standard deviation of the bootstrap distribution. It is a $(1-\alpha)\cdot 100\%$ confidence interval for the
mean of the sampling distribution $E[T(\mathbf{X})]$ based on the standard error method.


Example 12.3. Let's calculate a 95% confidence interval for the proportion of red balls using the
standard error method. For this, we need the 0.975 quantile.

qnorm(0.975)
# [1] 1.959964

Using the point estimate $\hat{\theta}(\mathbf{x}) = \bar{x}_n$ and the standard deviation of the bootstrap distribution

# sample mean
x_bar <- x |>
specify(response = color, success = "red") |>
calculate(stat = "prop") |>
pull(stat)
x_bar
# [1] 0.36
# sd of the bootstrap distribution
sd(bootstrap_means$stat)
# [1] 0.0485163

we get the 95% confidence interval

$$\bar{x}_n \pm z_{0.975} \cdot \widehat{\mathrm{SE}}(T(\mathbf{x})) \approx 0.36 \pm 1.96 \cdot 0.0485 \approx (0.2649, 0.4551).$$

12.3.3 infer workflow

The use of the infer package to construct confidence intervals is summarized in Figure 12.2.
A confidence interval based on the percentile method can be computed using get_ci() with the
argument type equal to percentile (default).

(per_ci <-
bootstrap_means |>
get_ci(level = 0.95, type = "percentile")
)
# # A tibble: 1 x 2
# lower_ci upper_ci
# <dbl> <dbl>
# 1 0.26 0.46

and visualized through the command


Figure 12.2: From Ismay and Kim (2019).

visualize(bootstrap_means) +
shade_ci(endpoints = per_ci,
color="blue", fill="gold")

[Figure: simulation-based bootstrap distribution (histogram of stat) with the 95% percentile interval shaded.]


Interpretation: We are 95% confident that the proportion of red balls is between 0.26 and 0.46.
In Example 12.3 we already used the standard error method to compute a 95% confidence interval
for the proportion of red balls. So, let’s see how to compute this interval with get_ci(). As type we
have to choose se and in addition we need to specify the point_estimate.

(se_ci <- bootstrap_means |>


get_ci(level = 0.95, type = "se",
point_estimate = x_bar))
# # A tibble: 1 x 2
# lower_ci upper_ci
# <dbl> <dbl>
# 1 0.26491 0.45509

So, we obtain an interval that is very close to the one obtained using the percentile method. Let’s
create a graphic that shows both intervals at the same time.

visualize(bootstrap_means) +
shade_ci(endpoints = per_ci, color = "gold") +
shade_ci(endpoints = se_ci)

[Figure: simulation-based bootstrap distribution with both the percentile and the standard error confidence intervals shaded.]


Short summary

This chapter begins by highlighting the limitations of point estimates and introduces the con-
cept of a confidence interval as a range of plausible values for a population parameter, thus
quantifying uncertainty. The text then explores three primary methods for constructing confi-
dence intervals: the theoretical approach, the asymptotic approach leveraging the Cen-
tral Limit Theorem, and the bootstrap approach (including percentile and standard
error methods). Practical examples and the use of the infer R package are demonstrated to
illustrate these concepts and their application in statistical inference. Ultimately, the chapter
emphasizes the importance of reporting a range of values to better capture the true,
unknown population parameter.

13 Hypothesis testing

The following question about the vaccination rate comes from a TED talk by Hans and Ola Rosling. The
answer to this question can be given based on data collected within the Gapminder project.
We might be wondering about the level of awareness people have regarding global health. Assume
that individuals either possess knowledge about the topic or are influenced by false information. In
either case, when addressing the question about vaccination, their responses would not be random
guesses. This leads to our research question:
People have knowledge, whether correct or incorrect, about the topic of vaccination and do not randomly
guess an answer.
We can now transfer this research question into two competing hypotheses:
𝐻0 ∶ People never learn these particular topics and their responses are simply equivalent to ran-
dom guesses.
versus
𝐻𝐴 ∶ People have knowledge, either correct or incorrect, which they apply and hence do not ran-
domly guess an answer.


Note

The null hypothesis 𝐻0 often represents a claim to be tested. The alternative hypothesis
𝐻𝐴 represents an alternative claim under consideration.

13.1 Statistical test

Definition 13.1. Let 𝑋1 , … , 𝑋𝑛 be an i.i.d. sample from a statistical model (𝑆, P𝜃 ), 𝜃 ∈ Θ. The test
problem consists of the null hypothesis 𝐻0 and the alternative 𝐻𝐴 , which constitute a partition of
the parameter space Θ, i.e.,
𝐻0 ∶ 𝜃 ∈ Θ0 𝐻𝐴 ∶ 𝜃 ∈ Θ ∖ Θ0 .

A statistical test decides, based on a sample 𝑋1 , … , 𝑋𝑛 , whether the null hypothesis can be rejected
or cannot be rejected.
The corresponding decision is taken with the help of a suitably chosen test statistic 𝑇 (X). The range
of 𝑇 (X) can be split up in a rejection region 𝑅 and its complement 𝑅𝑐 .
Given an observed sample x = {𝑥1 , … , 𝑥𝑛 } and the corresponding realization of the test statistic
𝑇 (x), one decides in the following manner:

• 𝐻0 is rejected, if 𝑇 (x) ∈ 𝑅.
• 𝐻0 is not rejected, if 𝑇 (x) ∈ 𝑅𝑐 .

Hypothesis tests are not flawless. There are two competing hypotheses: the null and the alternative.
In a hypothesis test, we make a decision about which hypothesis might be true, but our decision
might be incorrect. Two types of errors are possible. Both are visualized in Figure 13.1.

Figure 13.1: Type 1 and 2 error in a statistical test.

A type 1 error is rejecting the null hypothesis when 𝐻0 is true. A type 2 error is failing to reject
the null hypothesis when 𝐻𝐴 is true.
The way we will construct the test ensures that the probability of a type 1 error is at most 𝛼 ∈ (0, 1),
the significance level.


Important

The rejection region is determined in such a way that P𝐻0 (𝑇 (X) ∈ 𝑅) ≤ 𝛼.

13.1.1 p-value

Given the two hypotheses, the data can now either support the alternative hypothesis or not. There-
fore, we need some quantification to understand how much the alternative is favored.
The p-value is a way of quantifying the strength of the evidence against the null hypothesis and in
favor of the alternative hypothesis.

Definition 13.2. The p-value is the probability of observing data at least as favorable to the
alternative hypothesis as our current dataset, under the assumption that the null hypothesis is
true.

Example 13.1. If we test a hypothesis with one-sided alternative 𝐻𝐴 ∶ 𝜇 > 𝜇0 about the mean value
𝜇, we could use the sample mean 𝑇 (X) = 𝑋 𝑛 as test statistic. The p-value is then given by

P𝐻0 (𝑋 𝑛 ≥ 𝑥𝑛 ) ,

where 𝑥𝑛 is the observed sample mean.
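As a small numerical illustration, suppose the observations have known standard deviation $\sigma = 2$, the
hypothesized mean is $\mu_0 = 10$, $n = 25$ and the observed sample mean is $\bar{x}_n = 10.8$ (all values chosen
purely for illustration). Under $H_0$ the sample mean is then (approximately) $\mathcal{N}(\mu_0, \sigma^2/n)$ distributed and
the p-value can be computed as follows.

mu0 <- 10       # hypothesized mean (illustrative value)
sigma <- 2      # known standard deviation (illustrative value)
n <- 25
x_bar <- 10.8   # observed sample mean (illustrative value)

# P_H0(Xbar_n >= x_bar) with Xbar_n ~ N(mu0, sigma^2 / n) under H0
pnorm(x_bar, mean = mu0, sd = sigma / sqrt(n), lower.tail = FALSE)
# [1] 0.02275013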

If the p-value quantifies the strength of the evidence against the null hypothesis, we should use it to
make decisions about rejecting the null hypothesis. But how?
Remember, that the probability of a type 1 error shall be at most the significance level 𝛼 ∈ (0, 1).
This can be achieved as follows:

Compare the p-value to the significance level

• Reject 𝐻0 , if the p-value is less than or equal to the significance level, 𝛼.


We would report as conclusion that the data provides strong evidence supporting the al-
ternative hypothesis.

• Fail to reject 𝐻0 , if the p-value is greater than 𝛼.


We would report as conclusion that the data do not provide sufficient evidence to reject
the null hypothesis.


Remark.

1. Our definition of rejection regions and p-values is such that the decision rules 𝑇 (X) ∈ 𝑅 and
p-value ≤ 𝛼 are equivalent. We will work most of the time with the latter one.
2. The rule says that we reject the null hypothesis (𝐻0 ) when the p-value is less than the chosen
significance level (𝛼), which is determined before conducting the test. Common values for 𝛼
include 0.01, 0.05, or 0.1, but the specific choice depends on the particular application or setting.

Interpretation

The imposed significance level ensures that for those cases where 𝐻0 is actually true, we
incorrectly reject 𝐻0 at most 100 ⋅ 𝛼% of times.
In other words, when using a significance level 𝛼, there is about 100 ⋅ 𝛼% chance of making a
type 1 error if the null hypothesis is true.

13.1.2 Two-sided vs. one-sided alternative

In case we test a one-dimensional population parameter 𝜃 ∈ R, the alternative hypothesis can be


two-sided or one-sided:
Two-sided:

𝐻0 ∶ 𝜃 = 𝜃 0 𝐻𝐴 ∶ 𝜃 ≠ 𝜃0

One-sided:

𝐻0 ∶ 𝜃 = 𝜃 0 𝐻𝐴 ∶ 𝜃 < 𝜃0

or

𝐻0 ∶ 𝜃 = 𝜃 0 𝐻𝐴 ∶ 𝜃 > 𝜃0

Remark. In the two-sided case, the rejection region $R$ “lives” in both tails of the distribution of the
test statistic $T(\mathbf{X})$. This makes a two-sided alternative harder to establish and hence the test more
conservative. Therefore, it is often preferable to test a two-sided alternative even if the research question
is formulated as a directed claim.

Example 13.2. Let’s reconsider the question of vaccination rates. The population parameter we want
to test is the probability of providing a correct answer.


We may think that individuals possess incorrect knowledge, meaning they perform worse than ran-
dom guessing. However, we are uncertain, so the more cautious approach of a two-sided alternative
is preferred. This indicates that we assume they are not simply guessing.
To define the null hypothesis and alternative in mathematical notation, let’s introduce the probabil-
ity 𝜃 ∈ Θ = (0, 1) of a correct answer. This then leads to the following test problem:

$$H_0: \theta = \frac{1}{3} \qquad H_A: \theta \neq \frac{1}{3}.$$

Since we lack background information, we decide to use the standard significance level $\alpha$ of 5%.

13.2 Null distribution

Given a sample X, a test problem and a significance level 𝛼, we have to solve the following tasks:

1. Choose a suitable test statistic 𝑇 (X).


2. Determine a rejection region 𝑅 such that P𝐻0 (𝑇 (X) ∈ 𝑅) ≤ 𝛼 or compute the p-value.

Assume for a moment that the first task is already solved.

In Example 13.2 we could, e.g., choose the number of correct answers as the test statistic. The test
statistic would then be $T(\mathbf{X}) = \sum_{i=1}^{n} X_i$ with $X_i$ indicating if the answer of the $i$-th respondent was
correct or not.
But to solve the second task, we need to determine the distribution of the test statistic under the
assumption that the null hypothesis is true, which we will refer to as the null distribution.
The null distribution is unknown, so we need to approximate it. Depending on the situation, we can
use the theoretical, asymptotic, or simulation-based approach. However, there is a major difference


compared to using these approaches to approximate the sampling distribution. In order to approx-
imate the null distribution, we must first make an assumption about the population parameter 𝜃
(assuming the null hypothesis is true). Therefore, we will use the assumed value 𝜃0 when calculating
the approximate null distribution.

13.3 Theoretical approach

We will consider some examples where a theoretical approach could be applied to test the given test
problem.

Example 13.3. (Continuation of Example 13.2)


Assume that a random sample of 50 respondents answered the question about the vaccination rate.
We obtained the following responses:

table(responses)
# responses
# A B C
# 23 15 12

The correct answer is C. So, we can code the responses in the following way:

$$x_i = \begin{cases} 1, & \text{for } C, \\ 0, & \text{for } A, B. \end{cases}$$

Hence, 𝑥1 , … , 𝑥50 are realizations from 50 Bernoulli trials with success probability 𝜃 ∈ (0, 1).
Now we need to choose a test statistic. The sample mean would be a reasonable estimator for $\theta$ and
could be used as test statistic. But we are unsure about the distribution of $\bar{X}_n$.
That is different if we consider the number of successes in 50 trials, $T(\mathbf{X}) = \sum_{i=1}^{50} X_i$, which is
definitely also informative about $\theta$.
From the definition of the Binomial distribution (see Definition A.3) we know that the sum of i.i.d.
Bernoulli random variables has a Binomial distribution. Hence, we know that

$$T(\mathbf{X}) = \sum_{i=1}^{50} X_i \sim \mathrm{Bin}\left(50, \frac{1}{3}\right),$$

under 𝐻0 . Given this distribution, we can compute the p-value, which is defined in the following
way:
p-value = P𝐻0 (𝑇 (X) ≤ 12) + P𝐻0 (𝑇 (X) ≥ 22) ,


where 22 is the smallest value that has at least the same distance to the expected value $50 \cdot \frac{1}{3}$ as 12
has. For computing the actual value we use R:

pbinom(12, 50, prob = 1/3) +


pbinom(21, 50, prob = 1/3, lower.tail = FALSE)
# [1] 0.1791022

Computing the p-value this way is nothing you would typically do. Instead you would use the func-
tion binom.test().

binom.test(x = 12, n = 50, p = 1/3, alternative = "two.sided")


#
# Exact binomial test
#
# data: 12 and 50
# number of successes = 12, number of trials = 50, p-value = 0.1791
# alternative hypothesis: true probability of success is not equal to 0.3333333
# 95 percent confidence interval:
# 0.1306099 0.3816907
# sample estimates:
# probability of success
# 0.24

In the output, we observe the same p-value (after rounding) as calculated earlier. Because the p-value
is greater than the selected significance level of 0.05, we can conclude that the data does not provide
sufficient evidence to reject the null hypothesis of random guessing.

In the following example, we aim to draw conclusions about the mean value of a normally distributed
random variable.

Example 13.4. Assume we have observations from the statistical model (R, N (𝜃1 , 𝜃2 )), which means
mean value and variance are both unknown.
As an example we consider the loans_full_schema dataset from Chapter 4. To be specific, we will
not use the complete dataset of 10000 observations. Instead, we will only consider a sample of 50
individuals who are renting homes and have taken a grade A loan.

set.seed(1234)
loan_rentA <- loans_full_schema |>
filter(homeownership == "rent", grade == "A") |>
slice_sample(n = 50)


loan_rentA
# # A tibble: 50 x 8
# loan_amount interest_rate term grade state annual_income homeownership
# <int> <dbl> <dbl> <fct> <fct> <dbl> <fct>
# 1 10000 7.96 36 A PA 64000 rent
# 2 11000 7.35 36 A GA 56500 rent
# 3 15000 7.35 36 A CA 55000 rent
# 4 10000 6.08 36 A NY 200000 rent
# 5 1500 6.07 36 A CT 65000 rent
# 6 6000 7.97 36 A CA 36051 rent
# # i 44 more rows
# # i 1 more variable: debt_to_income <dbl>

We are interested in the average grade A loan amount of individuals who are renting homes.
Our claim is that the average loan amount is less than $15400. Hence, we want to consider the
following test problem

𝐻0 ∶ 𝜃1 = 15400 vs. 𝐻𝐴 ∶ 𝜃1 < 15400

at a 5% significance level.
In Example 11.4 we said that the statistic

$$T(\mathbf{X}) = \frac{\bar{X}_n - \theta_1}{\sqrt{S_n^2(\mathbf{X})/n}} = \frac{\bar{X}_n - \theta_1}{S_n(\mathbf{X})}\sqrt{n}$$

has a t-distribution with $n-1$ degrees of freedom. Since we know the value of $\theta_1$ under the null
hypothesis, we can use $T(\mathbf{X})$ as our test statistic.
To calculate the test statistic for the given dataset, we require the sample mean and sample standard
deviation of the loan amount.

lr_stat <- loan_rentA |>


summarize(x_bar = mean(loan_amount),
s = sd(loan_amount))
lr_stat
# # A tibble: 1 x 2
# x_bar s
# <dbl> <dbl>
# 1 13395 11189.

The value of the test statistic is then given by:


t <- (lr_stat$x_bar - 15400) / (lr_stat$s / sqrt(50))


t
# [1] -1.267117

The alternative hypothesis states that the average loan amount is less than 15400. Therefore, data
more favorable to the alternative hypothesis than the observed dataset would result in a test statistic
value being even smaller than the observed value -1.2671. So, the p-value is equal to

p-value = P𝐻0 (𝑇 (X) ≤ 𝑇 (x)) ≈ P𝐻0 (𝑇 (X) ≤ −1.2671) ,

which can be determined using the following code:

pt(t, df = 49)
# [1] 0.1055515

The p-value is greater than our pre-defined significance level of 0.05. Therefore, we fail to reject
the null hypothesis at the given significance level. The data do not show enough evidence in
favor of the alternative.
The whole test is implemented within the function t.test(). To confirm our result, let’s compare it
with the output of the function.

t.test(loan_rentA$loan_amount, mu = 15400, alternative = "less")


#
# One Sample t-test
#
# data: loan_rentA$loan_amount
# t = -1.2671, df = 49, p-value = 0.1056
# alternative hypothesis: true mean is less than 15400
# 95 percent confidence interval:
# -Inf 16047.86
# sample estimates:
# mean of x
# 13395

But how likely was it to reject the null hypothesis for a difference between the sample mean and the
null value of roughly 2000 (in absolute value)?


13.3.1 Power of a test

Remember: A type 1 error occurs when we incorrectly reject 𝐻0 . The probability of a type 1 error is
(at most) 𝛼 (the significance level).
In case of a type 2 error, we fail to reject 𝐻0 although it is false. The probability of doing so is
denoted 𝛽.

Definition 13.3. A test's power is defined as the probability of correctly rejecting $H_0$ when the alternative
is true; this probability equals $1 - \beta$.

In hypothesis testing, we want to keep 𝛼 and 𝛽 low, but there are inherent trade-offs.

If the alternative hypothesis is true, what is the chance that we make a type 2 error?

The answer is not obvious.

• If the true population average is very close to the null hypothesis value, it will be difficult
to detect a difference (and reject 𝐻0 ).
• If the true population average is very different from the null hypothesis value, it will be
easier to detect a difference.
• Clearly, 𝛽 depends on the effect size 𝛿.

Example 13.5. In Example 13.4 we were not able to reject the null hypothesis of an average loan
amount being equal to $15400. The idea is now to determine a sample size that leads to a power of
0.8 for an assumed effect size $\delta$ of $2000.
We can use the function power.t.test() to compute the necessary sample size. Besides the values
given above, we also need to enter an estimate/guess of the standard deviation of the loan amounts
as an argument.
A reasonable estimate is the sample standard deviation in the loan_rentA dataset:


sd(loan_rentA$loan_amount)
# [1] 11188.78

Given these values we obtain the following sample size:

power.t.test(delta = 2000, sd = 11200, sig.level = 0.05,


power = 0.8,
type = "one.sample",
alternative = "one.sided")
#
# One-sample t test power calculation
#
# n = 195.2449
# delta = 2000
# sd = 11200
# sig.level = 0.05
# power = 0.8
# alternative = one.sided

The output says that we need a sample size of at least 196. So let's try our luck and take another
sample of size 196 from loans_full_schema.

set.seed(1234)
loan_rentA <- loans_full_schema |>
filter(homeownership == "rent", grade == "A") |>
slice_sample(n = 196)

dim(loan_rentA)
# [1] 196 8

Applying t.test() the same way as in Example 13.4 yields the following output.

t.test(loan_rentA$loan_amount, mu = 15400, alternative = "less")


#
# One Sample t-test
#
# data: loan_rentA$loan_amount
# t = -2.5176, df = 195, p-value = 0.006311
# alternative hypothesis: true mean is less than 15400
# 95 percent confidence interval:


# -Inf 14802.68
# sample estimates:
# mean of x
# 13661.22

We observe a p-value of 0.0063108, which is less than our significance level of 0.05. Hence, we can
conclude that the data shows enough evidence to reject the null hypothesis of an average grade A
loan amount of $15400 for individuals who are renting homes.

13.4 Asymptotic approach

For larger sample sizes, we can use the asymptotic approach to approximate the null distribution. If
the test statistic has the form of an average of i.i.d. observations, this means applying the results from
Theorem 11.1. An application is shown in Example 13.6. In another example, we will analyze the
relationship between two categorical variables. The test statistic is derived from comparing observed
counts to the expected counts, assuming that both variables are independent. This test statistic follows
a non-normal distribution.

Example 13.6 (Continuation of Example 13.2). Remember, in Example 13.3 we analyzed the test
problem
$$H_0: \theta = \frac{1}{3} \qquad H_A: \theta \neq \frac{1}{3}$$

using the theoretical approach on the test statistic $T(\mathbf{X}) = \sum_{i=1}^{50} X_i$. In this example, we aim to test
the same problem using the asymptotic approach. To apply Theorem 11.1 in this context we need to
choose another test statistic, since 𝑇 is not an average of random variables. But 𝑋 𝑛 is, which is also
the maximum likelihood estimator of the unknown success probability 𝜃. For the observed sample

table(responses)
# responses
# A B C
# 23 15 12

we get as test statistic value $\bar{x}_{50} = \frac{12}{50} = 0.24$ (remember that C was the correct answer).
Hence, values more favorable to the alternative would be mean values even less than 0.24 or larger
than $\frac{1}{3} + (\frac{1}{3} - 0.24)$. This implies that the p-value is equal to

$$\text{p-value} = P_{H_0}(|\bar{X}_n - \theta_0| \geq |\bar{x}_n - \theta_0|) = P_{H_0}(\bar{X}_n \leq 0.24) + P_{H_0}\left(\bar{X}_n \geq \frac{2}{3} - 0.24\right).$$


To compute (approximate) the probabilities $P_{H_0}(\bar{X}_n \leq 0.24)$ and $P_{H_0}(\bar{X}_n \geq \frac{2}{3} - 0.24)$ we want to
use the CLT, which implies $\bar{X}_n \overset{\cdot}{\sim} \mathcal{N}(\frac{1}{3}, \frac{2}{450})$ (see Example 12.2).
But this is only valid if the sample size is large enough. In Example 12.2 we introduced the success-
failure condition to check if the sample size is large enough. We used the point estimate $\hat{\theta}$ to evaluate
the condition there. But now, the condition should hold under the null hypothesis, since we want
to approximate the null distribution. Thus, we use the assumed value of $\theta$ to check if the condition
holds.
Plugging in $\theta_0 = \frac{1}{3}$ shows that the success-failure condition holds, as $n\theta_0 = \frac{50}{3}$ and $n(1-\theta_0) = \frac{50 \cdot 2}{3}$
are both greater than 10.
Hence we can use R to compute the p-value based on the normal approximation, $\bar{X}_n \overset{\cdot}{\sim} \mathcal{N}(\frac{1}{3}, \frac{2}{450})$, of
the null distribution.

pnorm(2/3 - 0.24, mean = 1/3, sd = sqrt(2/450), lower.tail = FALSE) +


pnorm(0.24, mean = 1/3, sd = sqrt(2/450))
# [1] 0.1615133

The asymptotic approach of testing a null hypothesis about a proportion is implemented in the func-
tion prop.test(). Applying this function yields the following output:

prop.test(x = 12, n = 50, p = 1/3,


alternative = "two.sided",
correct = FALSE # we do not apply the correction
) # since we haven't introduced it
#
# 1-sample proportions test without continuity correction
#
# data: 12 out of 50, null probability 1/3
# X-squared = 1.96, df = 1, p-value = 0.1615
# alternative hypothesis: true p is not equal to 0.3333333
# 95 percent confidence interval:
# 0.1429739 0.3741268
# sample estimates:
# p
# 0.24

Remember that we have selected a significance level of 5%. Since the p-value is 0.1615, which is
greater than 0.05, the data indicate insufficient evidence to reject the null hypothesis.

Remark. The obtained p-value is larger than the one computed for the binomial test in Example 13.3.
Intuitively this makes sense. The binomial test is an exact (using the exact distribution of the Bernoulli


r.v. 𝑋𝑖 ) test, whereas the current one relied on an approximation. Therefore, it makes sense to obtain
a less “precise” answer when applying the two tests to the same data.

13.4.1 Chi-squared test of independence

In the rest of this section we will discuss the chi-squared test of independence, which checks whether
two categorical variables 𝑋 and 𝑌 are likely to be related or not. It analyzes the following testing
problem:

𝐻0 ∶ The variables 𝑋 and 𝑌 are independent.


𝐻𝐴 ∶ The variables 𝑋 and 𝑌 are dependent.

Before we discuss how to construct a reasonable test statistic, let’s look at an example.

Example 13.7. The popular dataset (available on Moodle) contains information about students in
grades 4 to 6.
They were asked whether good grades, athletic ability, or popularity was most important to them.
A two-way table separating the students by grade and choice of the most important factor is shown
below.

Can these data indicate that goals differ based on grade?

popular |>
table()
# goals
# grade Grades Popular Sports
# 4th 63 31 25
# 5th 88 55 33
# 6th 96 55 32


[Figure: stacked bar chart of the proportions of goals (Grades, Popular, Sports) within each grade (4th, 5th, 6th).]

The competing hypotheses are:


𝐻0 : Grade and goals are independent. Goals do not vary by grade.
𝐻𝐴 : Grade and goals are dependent. Goals vary by grade.

The idea of constructing the chi-squared test statistic is to compare the observed joint distribution of
𝑋 and 𝑌 with the joint distribution under independence.
Remember, the joint distribution of two categorical variables 𝑋 and 𝑌 with support {𝑢1 , … , 𝑢𝑘 } and
{𝑣1 , … , 𝑣ℓ }, respectively, can be summarized by a two-way table

        𝑣1     𝑣2    ⋅⋅⋅   𝑣ℓ
𝑢1     𝑁11    𝑁12   ⋅⋅⋅   𝑁1ℓ
𝑢2     𝑁21    𝑁22   ⋅⋅⋅   𝑁2ℓ
⋮       ⋮      ⋮     ⋮     ⋮
𝑢𝑘     𝑁𝑘1    𝑁𝑘2   ⋅⋅⋅   𝑁𝑘ℓ

containing observed counts 𝑁𝑖𝑗 for each combination of the levels of 𝑋 and 𝑌 .
Under the null hypothesis 𝑋 and 𝑌 are independent. This assumption implies that the expected
counts are given by
$$E_{ij} = N \cdot \frac{N_{i\cdot}}{N} \cdot \frac{N_{\cdot j}}{N} = \frac{N_{i\cdot} \cdot N_{\cdot j}}{N},$$

where $N = \sum_{i=1}^{k}\sum_{j=1}^{\ell} N_{ij}$ is the table total, $N_{i\cdot} = \sum_{j=1}^{\ell} N_{ij}$ is the row $i$ total and $N_{\cdot j} = \sum_{i=1}^{k} N_{ij}$
is the column $j$ total.


¾ Your turn

Have we observed more 5th graders with the goal of being popular than expected?

#
# Grades Popular Sports Sum
# 4th 63 31 25 119
# 5th 88 55 33 176
# 6th 96 55 32 183
# Sum 247 141 90 478

A yes
B no
C can’t tell

The alternative hypothesis says that the variables $X$ and $Y$ are dependent. The alternative is
favored when there are larger differences between the observed and expected counts.
The differences are summarized in the chi-squared statistic, which will be used as a test statistic for
the test of independence.

Definition 13.4. Let $(X_1, Y_1), \ldots, (X_n, Y_n)$ be an i.i.d. sample, where each observation is a pair of
two (possibly dependent) categorical random variables with levels $u_1, \ldots, u_k$ and $v_1, \ldots, v_\ell$, respec-
tively. Further let $N_{ij}(\mathbf{X}, \mathbf{Y}) = \sum_{r=1}^{n} 1_{\{u_i, v_j\}}(X_r, Y_r)$ be the observed count and
$E_{ij}(\mathbf{X}, \mathbf{Y}) = \frac{N_{i\cdot}(\mathbf{X}, \mathbf{Y}) \cdot N_{\cdot j}(\mathbf{X}, \mathbf{Y})}{N(\mathbf{X}, \mathbf{Y})}$ the expected count under independence for cell $(i, j)$.
The chi-squared statistic is then calculated as

$$\chi^2(\mathbf{X}, \mathbf{Y}) = \sum_{i=1}^{k}\sum_{j=1}^{\ell} \frac{(N_{ij}(\mathbf{X}, \mathbf{Y}) - E_{ij}(\mathbf{X}, \mathbf{Y}))^2}{E_{ij}(\mathbf{X}, \mathbf{Y})}.$$

Example 13.8 (Continuation of Example 13.7). To better understand the relationship between the
two variables, we should compare the observed counts with the expected counts. We need the
marginal distributions to compute the expected counts, which are added to the contingency table
with the addmargins() function.

popular |>
table() |>
addmargins()
# goals
# grade Grades Popular Sports Sum
# 4th 63 31 25 119


# 5th 88 55 33 176
# 6th 96 55 32 183
# Sum 247 141 90 478

Now let’s calculate the expected counts for 4th graders who prioritize good grades or popularity.
$$E_{1,1} = \frac{119}{478} \cdot 247 = 61.4916318$$
$$E_{1,2} = \frac{119}{478} \cdot 141 = 35.1025105$$

After computing all expected counts

#
# Grades Popular Sports
# 4th 61.49163 35.10251 22.40586
# 5th 90.94561 51.91632 33.13808
# 6th 94.56276 53.98117 34.45607

we can compute the 𝜒2 test statistic


$$\chi^2 = \sum_{i=1}^{3}\sum_{j=1}^{3} \frac{(N_{ij}(\text{grade}, \text{goal}) - E_{ij})^2}{E_{ij}} \approx \frac{(63 - 61.49)^2}{61.49} + \frac{(31 - 35.10)^2}{35.10} + \cdots + \frac{(32 - 34.46)^2}{34.46} = 1.3121.
$$
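The same computation can be done in R; the following sketch reproduces the expected counts and the
chi-squared statistic directly from the observed two-way table.

obs <- table(popular$grade, popular$goals)                 # observed counts N_ij
expected <- outer(rowSums(obs), colSums(obs)) / sum(obs)   # E_ij = N_i. * N_.j / N
sum((obs - expected)^2 / expected)                         # chi-squared statistic, approx. 1.3121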

Given only the observed statistic value of 1.3121, we cannot determine if this value is sufficiently
large to reject the null hypothesis. Therefore, we need to learn how to calculate or approximate the
null distribution of the chi-squared statistic.

Definition 13.5. Let (𝑋1 , 𝑌1 ), … , (𝑋𝑛 , 𝑌𝑛 ) be an i.i.d. sample, where each observation is a pair of
two (possibly dependent) categorical random variables with levels 𝑢1 , … , 𝑢𝑘 and 𝑣1 , … , 𝑣ℓ , respec-
tively. The chi-squared statistic is calculated as
$$\chi^2(\mathbf{X}, \mathbf{Y}) = \sum_{i=1}^{k}\sum_{j=1}^{\ell} \frac{(N_{ij}(\mathbf{X}, \mathbf{Y}) - E_{ij}(\mathbf{X}, \mathbf{Y}))^2}{E_{ij}(\mathbf{X}, \mathbf{Y})}.$$

Assuming the conditions:

• independent observations
• the expected count 𝐸𝑖𝑗 is at least 5 in each cell


are met, the statistic $\chi^2$ has an approximate chi-squared distribution with $\mathrm{df} = (k-1)\cdot(\ell-1)$
degrees of freedom given the null hypothesis $H_0$ is true. The chi-squared test of independence
will reject $H_0$ at significance level $\alpha$, if the observed chi-squared statistic $\chi^2(\mathbf{x}, \mathbf{y})$ is larger than the
$1-\alpha$ quantile of the chi-squared distribution with $(k-1)\cdot(\ell-1)$ degrees of freedom.
The p-value for the test is thus defined by

$$P_{H_0}(\chi^2(\mathbf{X}, \mathbf{Y}) > \chi^2(\mathbf{x}, \mathbf{y})),$$

which corresponds to the area under the $\chi^2_{(k-1)\cdot(\ell-1)}$ density, above the observed chi-squared
statistic $\chi^2(\mathbf{x}, \mathbf{y})$.


¾ Your turn

Which of the following is the correct p-value for an observed test statistic value of 𝜒2 = 1.3121
and 𝑑𝑓 = 4 degrees of freedom?
A more than 0.3
B between 0.3 and 0.2
C between 0.2 and 0.1
D between 0.1 and 0.05

[Figure: density of the chi-squared distribution with 4 degrees of freedom.]

qchisq(c(0.001, 0.05, 0.1, 0.2, 0.3), df = 4, lower.tail = FALSE)


# [1] 18.466827 9.487729 7.779440 5.988617 4.878433

Example 13.9 (Continuation of Example 13.8). The chi-squared test of independence is implemented
in the chisq.test() function. In this specific 𝜒2 test, which is one of several 𝜒2 tests, the arguments
of chisq.test() are the observations from both variables that we want to test for independence.

chisq.test(popular$grade, popular$goals)
#
# Pearson's Chi-squared test
#
# data: popular$grade and popular$goals
# X-squared = 1.3121, df = 4, p-value = 0.8593

Conclusion: Since the p-value is high, we fail to reject the null hypothesis 𝐻0 . The data do not
provide convincing evidence that grade and goals are dependent. It doesn’t appear that goals vary
by grade.
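As with the binomial and t-tests above, we can reproduce the reported p-value by hand from the approximate
null distribution, here the chi-squared distribution with $(3-1)\cdot(3-1) = 4$ degrees of freedom.

pchisq(1.3121, df = 4, lower.tail = FALSE)   # approx. 0.8593, matching the chisq.test() output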


13.5 Simulation-based approach

For constructing confidence intervals, we utilized the “infer workflow” to estimate the sampling dis-
tribution of the sample statistic. The idea for creating “new” samples was to take resamples with
replacement from the observed sample (the bootstrap approach). In the context of hypothesis testing,
we need to generate new values of the test statistic while assuming the null hypothesis to be true.
Hence, we need to adjust the sampling procedure. We consider the following two cases:

1. The null hypothesis specifies a specific probability model. In this instance, we will simply
generate new samples from this model.
2. The testing problem involves the relationship between two variables, and the null hypothesis
states that they are independent. In this scenario, we will randomly permute observations for
one of the variables, and this should not affect the test outcome under independence.

This approach is illustrated in Figure 13.2 using the infer workflow.

Figure 13.2: Infer workflow for hypothesis testing.

Compared to the interval estimation, we need one additional step for hypothesis testing:
hypothesize() is used to specify the null hypothesis. We have to choose for the null argument
of hypothesize() one of the two following arguments:

• point: if the null hypothesis is about a single population parameter, where the chosen value
specifies a concrete probability model,


• independence: if the null hypothesis refers to the independence of the two variables under
consideration.

Remark. After specifying the null, generate() creates the resamples. Here, we do not need to specify
the correct type of resamples. It is automatically chosen. It will be type="draw" for a point null
hypothesis about a concrete probability model and type="permute" if the null hypothesis is of type
independence.

Example 13.10. Let's consider again the question about the vaccination rate and use the infer workflow to test

$$H_0: \theta = \frac{1}{3} \qquad H_A: \theta \neq \frac{1}{3},$$
where 𝜃 is the probability of giving a correct answer.
In the first step, we need to recode the data because we require a binary outcome.

df <- tibble(
responses = ifelse(responses == "C",
"correct",
"not correct")
)

Now we are able to approximate the null distribution through simulation.


set.seed(12345) # for reproducibility


null_distn <- df |>
specify(
response = responses, # responses is the variable of interest
success = "correct" # correct is the success
) |>
hypothesize(
null = "point", # the null is about a population parameter
p = 1/3 # with assumed value 1/3
) |>
generate(
reps = 1000, # draw 1000 samples of size 50 from a
type = "draw" # Bernoulli distribution with success probability 1/3
) |>
calculate(
stat = "prop" # compute proportion of success as test statistic
)

The simulated null distribution is visualized in Figure 13.3. The figure also shows the p-value as
shaded area of the histogram.

null_distn |>
visualize() +
shade_p_value(obs_stat = 0.24, # observed value
direction = "both") # alternative hypothesis


Figure 13.3


The p-value can of course also be computed, using get_p_value().

null_distn |>
get_p_value(obs_stat = 0.24, direction = "both")
# # A tibble: 1 x 1
# p_value
# <dbl>
# 1 0.216

that is,

$$\frac{\#\{\text{resampled props} \leq 0.24 \text{ or } \geq 0.427\}}{1000} = \frac{216}{1000}.$$

The result says that 216 out of the 1000 simulated proportions have been either less than or equal to 0.24
(the observed value) or larger than or equal to 0.427 ($\approx \frac{1}{3} + (\frac{1}{3} - 0.24)$). Hence, under the assumption of
the true success probability being equal to $\frac{1}{3}$, these ranges are not so unlikely. Therefore, we have
to conclude that the data doesn't provide sufficient evidence at the 5% significance level to reject
the null hypothesis of random guessing in favor of doing worse or better than that.

Example 13.11. Remember the gender discrimination study from Chapter 6. We were considering
the following two hypotheses:
$H_0$: Promotion and gender are independent, no gender discrimination, observed difference in pro-
portions is simply due to chance.
$H_A$: Promotion and gender are dependent, there is gender discrimination, observed difference in
proportions is not due to chance.
Let’s use the infer workflow to test the null hypothesis. The null distribution can be simulated with
the following code.

set.seed(190503) # the same seed as in Chapter 6


null_distn <- gender_discrimination |>
specify(formula = decision ~ gender, # defines the relation
success = "promoted") |>
hypothesize(null = "independence") |>
generate(reps = 1000,
type = "permute") |>
calculate(stat = "diff in props",
order = c("male", "female")) # defines p_m - p_f

To compute the p-value, we need the observed value of the test statistic (here difference in propor-
tions).


diff_prop <- gender_discrimination |>


observe(decision ~ gender,
success = "promoted",
stat = "diff in props",
order = c("male", "female"))
diff_prop
# Response: decision (factor)
# Explanatory: gender (factor)
# # A tibble: 1 x 1
# stat
# <dbl>
# 1 0.292

The computed p-value is then given by:

null_distn |>
get_p_value(obs_stat = diff_prop$stat,
direction = "both")
# # A tibble: 1 x 1
# p_value
# <dbl>
# 1 0.028

We can conclude that there is something going on. The null hypothesis of promotion and gender
being independent can be rejected at the 0.05 significance level. The data provides evidence for an
existing gender discrimination; the observed difference in proportions is not due to chance.

13.5.1 Bootstrap method

In the previous two examples, we generated new samples by drawing observations from a specified
distribution or by permuting the original observations under the independence assumption of the
null hypothesis. But these are not the only methods for simulation-based hypothesis testing. Another
method is the bootstrap approach. Compared to approximating the sampling distribution using the
bootstrap approach, some slight adjustments have to be made.
We will employ this method to test the mean value 𝜃 of a distribution, which is otherwise not spec-
ified. For instance, let’s consider the following scenario. We have observations 𝑥1 , … , 𝑥𝑛 from a
distribution with mean 𝜃 and wish to test:

𝐻0 ∶ 𝜃 = 𝜃0 vs. 𝐻𝐴 ∶ 𝜃 ≠ 𝜃 0 .


Bootstrap method for testing a mean value

Input: observed sample x = (𝑥1 , … , 𝑥𝑛 )⊤ and number of resamples 𝐵

1. Create new values $x_i^* = x_i - \bar{x}_n + \theta_0$, $i \in \{1, \ldots, n\}$, which will have an empirical mean
equal to $\theta_0$.

2. For $b = 1, \ldots, B$, randomly select $n$ observations with replacement from $\{x_1^*, \ldots, x_n^*\}$
to create the resample

$$\mathbf{x}_b = (x_{b1}, \ldots, x_{bn})^\top.$$

3. Compute for each resample $\mathbf{x}_b$ the value of the test statistic $T(\mathbf{x}_b)$.

The resampled test statistic values are then based on samples with the assumed mean value under the
null hypothesis, but they also represent the variation of the original sample.
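A minimal implementation of this procedure with the sample mean as test statistic might look as follows;
the data vector x, the null value theta0 and the number of resamples B are purely illustrative.

set.seed(42)                       # for reproducibility
x <- rnorm(50, mean = 10, sd = 3)  # illustrative data
theta0 <- 9                        # hypothesized mean value
B <- 1000

x_star <- x - mean(x) + theta0     # step 1: shift the data so that the empirical mean equals theta0
boot_means <- replicate(B, mean(sample(x_star, size = length(x), replace = TRUE)))

# two-sided p-value: proportion of resampled means at least as far from theta0
# as the observed sample mean
mean(abs(boot_means - theta0) >= abs(mean(x) - theta0))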

13.6 Choosing a significance level

Choosing a significance level for a test is important in many contexts. The traditional default level
is 0.05. However, it is often helpful to adjust the significance level based on the application.
We may select a level that is smaller or larger than 0.05 depending on the consequences of any
conclusions reached from the test.

Caution regarding type 1 error:

If making a type 1 error is dangerous or exceptionally costly, we should choose a small


significance level (e.g., 0.01). Under this scenario, we want to be very cautious about rejecting
the null hypothesis, so we demand strong evidence favoring 𝐻𝐴 before rejecting 𝐻0 .

Caution regarding type 2 error:

If a type 2 error is important to avoid / costly, then we should choose a higher significance
level (e.g. 0.10). Then we are more cautious about failing to reject 𝐻0 when the null is actually
false (at the price of a higher type 1 error rate).


Short summary

This chapter introduces the fundamental concepts of hypothesis testing. It begins by framing a
research question regarding public knowledge about vaccination, translating it into competing
null and alternative hypotheses. The text then defines key elements of a statistical test, such
as the test statistic, rejection region, and the crucial concepts of Type I and Type II
errors. The document thoroughly explains the p-value as a measure of evidence against
the null hypothesis and details various approaches to conducting hypothesis tests, including
theoretical, asymptotic (for large samples), and simulation-based methods like permutation tests
and a modified bootstrap. Furthermore, it discusses the power of a test and the importance
of selecting an appropriate significance level based on the potential consequences of
errors. Finally, the material covers specific statistical tests, such as the binomial test, t-tests
for means, and the chi-squared test for independence between categorical variables, illustrating
their application with examples in R.

14 Inference for linear regression

In this section we work again with the evals dataset from the openintro package. The dataset con-
tains student evaluations of instructors’ beauty and teaching quality for 463 courses at the University
of Texas.
The teaching evaluations were conducted at the end of the semester. The beauty judgments were
made later, by six students who had not attended the classes and were not aware of the course evaluations
(two upper-level females, two upper-level males, one lower-level female, one lower-level male), see
Hamermesh and Parker (2005) for further details.
In Section 9.2.7, we applied stepwise selection algorithms to choose informative predictor variables
for predicting the evaluation score. As a result, we derived the following model.

score ~ bty_avg + age + gender + rank + language + pic_outfit

Now we can ask a related but different question: Does the observed data provide enough evidence
to reject the assumption that there is no relation between one of the predictor variables and the
response?
This question can be answered by formulating it as a statistical test problem. For instance, let’s
consider the relationship between the average beauty score and the evaluation score. Given the
above model, we would consider the following test problem:

$$H_0: \beta_1 = 0 \qquad H_A: \beta_1 \neq 0.$$

Under the null hypothesis, the average beauty score has no relation with the evaluation score.

14.1 Testing the slope parameters

We will be exploring two different approaches for testing the slope parameter 𝛽𝑗 of the 𝑗-th predictor
variable 𝑥𝑗 in the multiple linear regression model

𝑌𝑖 = 𝛽0 + 𝛽1 𝑥1,𝑖 + ⋅ ⋅ ⋅ + 𝛽𝑘 𝑥𝑘,𝑖 + 𝜖𝑖 . (14.1)

Both the theoretical and simulation-based approach will conduct a partial test, analyzing the relation-
ship between one predictor and the response while considering the influence of all other predictor
variables.


14.1.1 Theoretical approach

Using the theoretical approach, one can derive the distribution of the test statistic used to test:

𝐻0 ∶ 𝛽𝑗 = 𝛽𝑗,0 𝐻𝐴 ∶ 𝛽𝑗 ≠ 𝛽𝑗,0 .

But this is only possible by making the following assumptions.

Assumptions:

The random errors 𝜖𝑖 , 𝑖 ∈ {1, … , 𝑛}, in Equation 14.1

1. are independent, and

2. have a normal distribution with zero mean and constant variance, i.e., 𝜖𝑖 ∼ N (0, 𝜎2 ).

We can create a test statistic by comparing the least-squares estimate $\hat{\beta}_j(\mathbf{Y}, \mathbf{x})$ with the assumed value
under the null hypothesis, i.e. considering the difference $\hat{\beta}_j(\mathbf{Y}, \mathbf{x}) - \beta_{j,0}$. To obtain a statistic with a
known distribution, the difference is standardized using an estimate of the standard error $\mathrm{SE}_{\hat{\beta}_j}(\mathbf{Y}, \mathbf{x})$
of the slope estimator $\hat{\beta}_j(\mathbf{Y}, \mathbf{x})$.
Assuming the random errors follow the above assumptions, it can be demonstrated that the test
statistic

$$T_j(\mathbf{Y}, \mathbf{x}) = \frac{\hat{\beta}_j(\mathbf{Y}, \mathbf{x}) - \beta_{j,0}}{\widehat{\mathrm{SE}}_{\hat{\beta}_j}(\mathbf{Y}, \mathbf{x})}, \tag{14.2}$$

known as t-statistic, follows a t-distribution with $n - (k+1)$ degrees of freedom under the assump-
tion that the null hypothesis ($H_0: \beta_j = \beta_{j,0}$) is true.

Remark.

1. Most of the time the null value 𝛽𝑗,0 is assumed to be zero (no effect).
2. The standard error $\mathrm{SE}_{\hat{\beta}_j}(\mathbf{Y}, \mathbf{x})$ depends on the unknown variance $\sigma^2$ of the random errors. But
the residual variance

$$\hat{\sigma}^2 = \frac{1}{n-k-1}\sum_{i=1}^{n} e_i^2$$

is an estimator for $\sigma^2$. Using this estimator allows us to define the estimated standard error
$\widehat{\mathrm{SE}}_{\hat{\beta}_j}(\mathbf{Y}, \mathbf{x})$.

3. Often we write just 𝛽𝑗̂ instead of 𝛽𝑗̂ (Y, x).


Let’s fit the linear model

score ~ bty_avg + age + gender + rank + language + pic_outfit

to the evals dataset.

evals_lm <- lm(


score ~ bty_avg + age + gender + rank + language + pic_outfit,
data = evals
)

With tidy() we can extract the estimates, their standard errors, the t-statistics and the corresponding
p-values.

tidy(evals_lm)
# # A tibble: 8 x 5
# term estimate std.error statistic p.value
# <chr> <dbl> <dbl> <dbl> <dbl>
# 1 (Intercept) 4.49 0.229 19.6 1.11e-62
# 2 bty_avg 0.0569 0.0170 3.35 8.65e- 4
# 3 age -0.00869 0.00322 -2.70 7.21e- 3
# 4 gendermale 0.210 0.0519 4.04 6.25e- 5
# 5 ranktenure track -0.207 0.0839 -2.47 1.40e- 2
# 6 ranktenured -0.176 0.0641 -2.74 6.32e- 3
# 7 languagenon-english -0.244 0.108 -2.26 2.41e- 2
# 8 pic_outfitnot formal -0.131 0.0713 -1.84 6.70e- 2

Let’s try to verify the test statistic and p-value for testing 𝐻0 ∶ 𝛽1 = 0. The test statistic 𝑇1 (Y, x) has
the value
$$T_1(\mathbf{y}, \mathbf{x}) = \frac{0.0569 - 0}{0.017} \approx 3.35.$$
The distribution of the test statistic is a t-distribution with parameter (called degrees of freedom) equal
to 𝑑𝑓 = 𝑛 − 𝑘 − 1 = 463 − 7 − 1 = 455. This then leads to a p-value of

p-value = P𝐻0 (|𝑇 | > 3.35)

2 * pt(3.35, df = 455, lower.tail = FALSE)


# [1] 0.00087538

Given the influence of all other variables being part of the model, there is strong evidence that the
average beauty score is related with the evaluation score.


Remark. Since the test examines the impact of one predictor while all other predictors are included
in the model, it is referred to as a partial t-test.

Confidence interval for the slope parameter

Using the t-statistic $\frac{\hat{\beta}_j(\mathbf{Y}, \mathbf{x}) - \beta_j}{\widehat{\mathrm{SE}}_{\hat{\beta}_j}(\mathbf{Y}, \mathbf{x})}$ and applying the idea for constructing confidence intervals pre-
sented in Section 12.1, yields the $100(1-\alpha)\%$ confidence interval for the slope parameter $\beta_j$:

$$\hat{\beta}_j(\mathbf{Y}, \mathbf{x}) \pm t(n-k-1)_{1-\alpha/2} \cdot \widehat{\mathrm{SE}}_{\hat{\beta}_j}(\mathbf{Y}, \mathbf{x}),$$

where $t(n-k-1)_{1-\alpha/2}$ is the $1-\alpha/2$ quantile of the $t(n-k-1)$ distribution.

Given the estimates

tidy(evals_lm)[2,]
# # A tibble: 1 x 5
# term estimate std.error statistic p.value
# <chr> <dbl> <dbl> <dbl> <dbl>
# 1 bty_avg 0.0569 0.0170 3.35 0.000865

we can compute a 95% confidence interval for the slope of the average beauty score. For 𝛼 = 0.05 the
critical value 𝑡𝑛−𝑘−1,1−𝛼/2 is given by the 0.975 quantile of the t-distribution with 𝑛 − 𝑘 − 1 = 455
degrees of freedom:

qt(0.975, df = 455)
# [1] 1.965191

This then leads to the interval

$$\hat{\beta}_1 \pm t_{455,\,0.975} \cdot \widehat{\mathrm{SE}}_{\hat{\beta}_1} \approx 0.0569 \pm 1.97 \cdot 0.017 \approx (0.0234, 0.0904).$$

The confint() function can be used to compute confidence intervals for the slope parameters of a
linear model.

confint(evals_lm, "bty_avg", level = 0.95)


# 2.5 % 97.5 %
# bty_avg 0.02355994 0.09027135


¾ Your turn

The p-value for age is 0.00721. What does this indicate?

# # A tibble: 1 x 5
# term estimate std.error statistic p.value
# <chr> <dbl> <dbl> <dbl> <dbl>
# 1 age -0.00869 0.00322 -2.70 0.00721

A Since the p-value is positive, the higher the professor’s age, the higher we would expect
them to be rated.
B If we keep all other variables in the model, there is strong evidence that professor’s age is
associated with their rating.
C Probability that the true slope parameter for age is 0 is 0.00721.
D There is about a 1% chance that the true slope parameter for age is -0.00869.

Before we discuss the simulation-based approach and talk about how to check the model assumptions,
we will look at two special cases in the next two examples.

Example 14.1. We already know that several variables appear to be related to the evaluation score.
However, we will focus solely on comparing the evaluation scores of male and female professors in
this example.

ggplot(evals, aes(x = score, y = gender)) +


geom_boxplot()


Figure 14.1


A model for comparing the scores in the two groups and testing for a difference at a 5% significance
level is as follows.

evals_gender <- lm(score ~ gender, data = evals)


summary(evals_gender)
#
# Call:
# lm(formula = score ~ gender, data = evals)
#
# Residuals:
# Min 1Q Median 3Q Max
# -1.83433 -0.36357 0.06567 0.40718 0.90718
#
# Coefficients:
# Estimate Std. Error t value Pr(>|t|)
# (Intercept) 4.09282 0.03867 105.852 < 2e-16 ***
# gendermale 0.14151 0.05082 2.784 0.00558 **
# ---
# Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#
# Residual standard error: 0.5399 on 461 degrees of freedom
# Multiple R-squared: 0.01654, Adjusted R-squared: 0.01441
# F-statistic: 7.753 on 1 and 461 DF, p-value: 0.005583

From the summary, we can infer that male professors appear to have significantly higher average
evaluation scores.
Testing
𝐻0 ∶ 𝛽male = 0 𝐻𝐴 ∶ 𝛽male ≠ 0
is equivalent to testing
𝐻0 ∶ 𝜇female − 𝜇male = 0 𝐻𝐴 ∶ 𝜇female − 𝜇male ≠ 0 ,
where 𝜇male and 𝜇female are the average evaluation score in the population of male and female profes-
sors. This test is also known as the two sample t-test with equal variance.

t.test(score ~ gender, var.equal = TRUE, data = evals)


#
# Two Sample t-test
#
# data: score by gender
# t = -2.7844, df = 461, p-value = 0.005583
# alternative hypothesis: true difference in means between group female and group
↪ male is not equal to 0


# 95 percent confidence interval:


# -0.2413779 -0.0416378
# sample estimates:
# mean in group female mean in group male
# 4.092821 4.234328

Example 14.2. In Example 14.1, we learned how to test mean values in two groups. If a variable di-
vides the population into more than two groups, we can no longer use the t-test. However, regression
analysis can still be used in this case.

evals_rank <- lm(score ~ rank, data = evals)


summary(evals_rank)
#
# Call:
# lm(formula = score ~ rank, data = evals)
#
# Residuals:
# Min 1Q Median 3Q Max
# -1.8546 -0.3391 0.1157 0.4305 0.8609
#
# Coefficients:
# Estimate Std. Error t value Pr(>|t|)
# (Intercept) 4.28431 0.05365 79.853 <2e-16 ***
# ranktenure track -0.12968 0.07482 -1.733 0.0837 .
# ranktenured -0.14518 0.06355 -2.284 0.0228 *
# ---
# Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#
# Residual standard error: 0.5419 on 460 degrees of freedom
# Multiple R-squared: 0.01163, Adjusted R-squared: 0.007332
# F-statistic: 2.706 on 2 and 460 DF, p-value: 0.06786

From the summary, it can be inferred that tenured professors receive significantly lower evaluation
scores compared to teaching professors (reference level). At a 5% significance level, this difference
cannot be confirmed for professors on a tenure track.
Based on the summary, we can conclude that we cannot reject the null hypothesis $H_0: \beta_{\text{ten. track}} =
\beta_{\text{tenured}} = 0$ in favor of the alternative that at least one of them is different from zero. This is because
the F-test, which is used to test this kind of hypotheses, has a value (F-statistic) of 2.706,
which is not statistically significant (p-value: 0.06786) at a 5% level.
Testing both slope parameters jointly to be zero, is equivalent to testing
𝐻0 ∶ 𝜇teaching = 𝜇ten. track = 𝜇tenured .


A technique known as one-way analysis of variance (ANOVA) can also be used to test hypotheses
like the one above.

evals_rank_aov <- aov(score ~ rank, data = evals)


summary(evals_rank_aov)
# Df Sum Sq Mean Sq F value Pr(>F)
# rank 2 1.59 0.7946 2.706 0.0679 .
# Residuals 460 135.07 0.2936
# ---
# Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

14.1.2 Simulation-based approach

The concept behind the simulation-based approach is straightforward.

Idea

When dealing with the null hypothesis 𝐻0 ∶ 𝛽𝑗 = 0, it implies that the j-th predictor has
no relationship with the response variable. Consequently, the values of 𝑥𝑗 have no impact on
the estimation results. Thus, we are able to randomly permute these values while keeping all
predictor values as they are.

This way we can create a large number of resamples under the null hypothesis of no relationship
between the j-th predictor and the response. This is again a partial test that evaluates the impact of
one predictor while considering all other predictors in the model.
We start by re-calculating the observed fit:

obs_fit <- evals |>


specify(score ~ bty_avg + age + gender +
rank + language + pic_outfit) |>
fit()


Now we can generate a distribution of fits where each predictor variable is permuted indepen-
dently:

null_distn <- evals |>


specify(score ~ bty_avg + age + gender +
rank + language + pic_outfit) |>
hypothesize(null = "independence") |>
generate(reps = 1000, type = "permute",
variables = c(bty_avg, age, gender, rank,
language, pic_outfit)) |>
fit()

We can visualize the observed fit alongside the fits under the null hypothesis. This is done in Fig-
ure 14.2.

visualize(null_distn) +
shade_p_value(obs_stat = obs_fit, direction = "two-sided") +
plot_layout(ncol = 2)


Figure 14.2


The p-values shown in the last figure can also be calculated using the get_p_value() function
again.

null_distn |>
get_p_value(obs_stat = obs_fit, direction = "two-sided")
# # A tibble: 8 x 2
# term p_value
# <chr> <dbl>
# 1 age 0
# 2 bty_avg 0.004
# 3 gendermale 0
# 4 intercept 0.056
# 5 languagenon-english 0.028
# 6 pic_outfitnot formal 0.038
# 7 ranktenure track 0.006
# 8 ranktenured 0.006

Remark. Be cautious in reporting a p-value of zero. This result is an approximation based on the
number of reps chosen in the generate() step. In theory, the p-value is never zero.

14.2 Residual analysis

Caution

Always be aware of the type of data you’re working with: random sample, non-random
sample, or population.
Statistical inference, and the resulting p-values, are meaningless when you already have
population data.
If you have a sample that is non-random (biased), inference on the results will be unreliable.
The ideal situation is to have independent observations.

Inference methods for multiple regression based on the fitted model

$$\hat{y} = \hat{\beta}_0 + \hat{\beta}_1 x_1 + \hat{\beta}_2 x_2 + \cdots + \hat{\beta}_k x_k$$

depend on the following conditions:

(1) each variable is linearly related to the outcome,
(2) residuals have constant variability,
(3) residuals are nearly normal.


One often uses graphical methods to verify these conditions. We will discuss how to do that
next.
In addition we will have a look at how to detect outliers.

Models for residual analysis

We want to analyse the residuals of four models based on simulated data to illustrate the cases of
having

• a nonlinear relationship,
• heterogeneous variance,
• outliers in the data.

# generating the data


set.seed(1234) # for reproducibility
df <- tibble(
x = sample(1:10, 100, replace = TRUE),
x_out = c(x[1:99],20),
epsilon = rnorm(100),
# linear relation
y = 2 + 0.4 * x + epsilon,
# nonlinear relation
y_str = 2 + 0.4 * x^2 + epsilon,
# heterogeneous variance
y_het = 2 + 0.4 * x + rnorm(100, sd = x),
# contains outlier
y_out = c(2 + 0.4 * x[1:99] + epsilon[1:99], 2)
)

We fit simple linear regression models to the four different response values using the simulated data.

# linear relation
model_reg <- lm(y ~ x, data = df)
# nonlinear relation
model_str <- lm(y_str ~ x, data = df)
# heterogeneous variance
model_het <- lm(y_het ~ x, data = df)
# contains outlier
model_out <- lm(y_out ~ x_out, data = df)


Figure 14.3: Pairs plot of the simulated data showing the relationship between the different response variables and x as well as x_out.

The estimated coefficients are:

# # A tibble: 4 x 3
# model intercept x
# <chr> <dbl> <dbl>
# 1 linear 1.98 0.414
# 2 nonlinear -8.17 4.58
# 3 het. 1.12 0.724
# 4 outlier 2.68 0.283

The visualizations are created using the autoplot() function, which requires a fitted regression
model as input. In order to transform the information contained in the fitted model into something
that can be plotted, autoplot() needs helper functions from the ggfortify package. Therefore, we
have to load this package before we can use autoplot() for the first time.

library(ggfortify)


(1) Linear relationships

We assess the linear relationship by creating a scatterplot of residuals versus the fitted values (remember, the fitted values are a linear combination of all predictor variables).

p1 <- autoplot(model_reg, which = 1)


p2 <- autoplot(model_str, which = 1)
p3 <- autoplot(model_het, which = 1)
p4 <- autoplot(model_out, which = 1)

p1 + p2 + p3 + p4 + plot_layout(ncol = 2)

Figure 14.4: Residuals vs. fitted values for the four models model_reg, model_str, model_het and model_out.


(2) Constant variability in residuals

The constant variability is analyzed through a scatterplot of the square root of the absolute values of
standardized residuals vs. fitted values.
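For illustration, the quantity shown by autoplot(..., which = 3) can also be computed by hand. The following sketch (a hypothetical reconstruction, not part of the original workflow) uses the base R helpers fitted() and rstandard() for model_het and assumes the tidyverse is loaded.

# Scale-location plot for model_het built by hand:
# square root of |standardized residuals| against fitted values
df_sl <- tibble(
  fitted = fitted(model_het),
  sqrt_abs_rstd = sqrt(abs(rstandard(model_het)))
)
ggplot(df_sl, aes(x = fitted, y = sqrt_abs_rstd)) +
  geom_point() +
  geom_smooth(se = FALSE)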

p1 <- autoplot(model_reg, which = 3)


p2 <- autoplot(model_str, which = 3)
p3 <- autoplot(model_het, which = 3)
p4 <- autoplot(model_out, which = 3)

p1 + p2 + p3 + p4 + plot_layout(ncol = 2)

Figure 14.5: Scale-location plots (square root of the absolute standardized residuals vs. fitted values) for the four models.


(3) Nearly normal residuals

We verify the assumption of normality using a normal quantile-quantile plot (normal-probability plot)
of the standardized residuals.
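The same kind of plot can also be produced without ggfortify; a minimal sketch using ggplot2's stat_qq() and stat_qq_line() on the standardized residuals of model_reg:

# Normal Q-Q plot of the standardized residuals of model_reg by hand
ggplot(tibble(rstd = rstandard(model_reg)), aes(sample = rstd)) +
  stat_qq() +
  stat_qq_line()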

p1 <- autoplot(model_reg, which = 2)


p2 <- autoplot(model_str, which = 2)
p3 <- autoplot(model_het, which = 2)
p4 <- autoplot(model_out, which = 2)

p1 + p2 + p3 + p4 + plot_layout(ncol = 2)

Figure 14.6: Normal Q-Q plots of the standardized residuals for the four models.


Outliers

The presence of outliers can be analyzed using a scatterplot of standardized residuals versus leverage.
Before displaying plots, let’s define one component of these plots, the leverage score.
Remember from Equation 9.4 the representation of the fitted values

$$\hat{\mathbf{y}} = \mathbf{X}\hat{\boldsymbol{\beta}} = \mathbf{X}(\mathbf{X}^\top \mathbf{X})^{-1}\mathbf{X}^\top \mathbf{y} = \mathbf{H}\mathbf{y},$$

using the hat matrix H.

Definition 14.1. The leverage score $H_{ii}$ of the $i$-th observation $(y_i, \mathbf{x}_i)$ is defined as the $i$-th diagonal element of the hat matrix $\mathbf{H}$.
It can be interpreted as the degree by which the $i$-th observed response influences the $i$-th fitted value:

$$H_{ii} = \frac{\partial \hat{y}_i}{\partial y_i}.$$

Remark. The variance of the $i$-th residual $e_i$ is $\sigma^2(1 - H_{ii})$, and it holds that $0 \le H_{ii} \le 1$.
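As a small sanity check of Definition 14.1 (a sketch using the models fitted above), the leverage scores can be computed either from the hat matrix directly or with the base R helper hatvalues():

# Leverage scores of model_out: diagonal of the hat matrix H
X <- model.matrix(model_out)                   # design matrix
H <- X %*% solve(t(X) %*% X) %*% t(X)          # hat matrix
all.equal(unname(diag(H)), unname(hatvalues(model_out)))  # should be TRUE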

Now let’s have a look at the plot of residuals against leverage.

p1 <- autoplot(model_reg, which = 5)


p2 <- autoplot(model_str, which = 5)
p3 <- autoplot(model_het, which = 5)
p4 <- autoplot(model_out, which = 5)

p1 + p2 + p3 + p4 + plot_layout(ncol = 2)


Figure 14.7: Standardized residuals vs. leverage for the four models.

If you detect high leverage and are unsure about its influence on the parameter estimates, check the plot of Cook's distance vs. leverage.

autoplot(model_out, which = 6, ncol = 1)

Figure 14.8: Cook's distance vs. leverage for model_out.


The last plot indicated that the last observation $(x_{100}, y_{100}) = (20, 2)$ has a large Cook's distance

$$D_i = \frac{1}{(k+1)\hat{\sigma}^2}\,(\hat{\mathbf{y}} - \hat{\mathbf{y}}_{(i)})^\top (\hat{\mathbf{y}} - \hat{\mathbf{y}}_{(i)}) = \frac{H_{ii} \cdot r_i^2}{(1 - H_{ii})(k+1)},$$

where $\hat{\mathbf{y}}_{(i)}$ denotes predictions based on parameter estimates of the regression coefficients that have been computed when omitting the $i$-th observation and $r_i = \frac{e_i}{\hat{\sigma}\sqrt{1 - H_{ii}}}$ is the $i$-th standardized residual.
The Cook's distance is computed for each observation and can be used to indicate influential observations.

Operational guideline

$D_i > 0.5$ (or $> 1$) $\Rightarrow$ the $i$-th observation might be influential.

The function augment() from the broom package adds, among other characteristics, the Cook's distances to the dataset.

augment(model_out) |>
slice_tail(n = 10)
# # A tibble: 10 x 8
# y_out x_out .fitted .resid .hat .sigma .cooksd .std.resid
# <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
# 1 2.81 1 2.96 -0.156 0.0355 1.15 0.000352 -0.138
# 2 3.42 2 3.25 0.179 0.0260 1.15 0.000333 0.158
# 3 6.08 6 4.38 1.70 0.0100 1.14 0.0112 1.49
# 4 5.93 10 5.51 0.421 0.0288 1.15 0.00206 0.372
# 5 2.48 2 3.25 -0.767 0.0260 1.15 0.00613 -0.677
# 6 5.47 5 4.10 1.38 0.0108 1.14 0.00792 1.21
# 7 5.30 4 3.81 1.49 0.0137 1.14 0.0119 1.31
# 8 3.24 3 3.53 -0.286 0.0188 1.15 0.000605 -0.251
# 9 5.67 10 5.51 0.157 0.0288 1.15 0.000286 0.139
# 10 2 20 8.34 -6.34 0.228 0.890 5.85 -6.29

Remark. We use slice_tail() to show the last rows, as the very last observation exhibits a high
Cook’s distance.
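Alternatively, the Cook's distances are available through the base R function cooks.distance(). The following sketch flags all observations exceeding the operational guideline of 0.5, which should single out observation 100, in line with the augment() output above.

# Flag influential observations of model_out via the operational guideline
d_cook <- cooks.distance(model_out)
which(d_cook > 0.5)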


Your turn

Consider again the model

evals_lm
#
# Call:
# lm(formula = score ~ bty_avg + age + gender + rank + language +
# pic_outfit, data = evals)
#
# Coefficients:
# (Intercept) bty_avg age
# 4.490380 0.056916 -0.008691
# gendermale ranktenure track ranktenured
# 0.209779 -0.206806 -0.175756
# languagenon-english pic_outfitnot formal
# -0.244128 -0.130906

List the conditions required for linear regression and check if each one is satisfied for this model
based on the following diagnostic plots.

autoplot(evals_lm)

[Diagnostic plots for evals_lm: Residuals vs Fitted, Normal Q-Q, Scale-Location and Residuals vs Leverage.]


14.3 Options for improving the model fit

There are several options for improvement of a model:

• transforming variables
• seeking out additional variables to fill model gaps
• using more advanced methods (not part of the course)

We will examine the process of transforming variables. The data used to fit model_str conforms to the model
$$Y_i = 2 + 0.4 \cdot x_i^2 + \epsilon_i.$$
This means that using $x_i^2$ instead of $x_i$ as the predictor variable results in a linear relationship between the response and predictor.

ggplot(df, aes(x = x^2, y = y_str)) +
  geom_point()

Figure 14.9: Scatterplot of y_str against x^2.


Refitting the model with x^2 as predictor variable

model_tf <- lm(y_str ~ I(x^2), data = df)

leads to the following residual plots.

autoplot(model_tf)

Figure 14.10: Diagnostic plots (Residuals vs Fitted, Normal Q-Q, Scale-Location, Residuals vs Leverage) for model_tf.

Now, all conditions appear to be met.
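A quick numerical comparison of the two fits supports this impression. The following sketch uses glance() from the broom package (assumed, together with the tidyverse, to be loaded) to contrast the R squared value and the residual standard error of model_str and model_tf.

# Compare the untransformed and the transformed fit (sketch)
bind_rows(
  glance(model_str) |> mutate(model = "y_str ~ x"),
  glance(model_tf)  |> mutate(model = "y_str ~ I(x^2)")
) |>
  select(model, r.squared, sigma)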


Short summary

This chapter introduces inference for linear regression, using a dataset of student evaluations
of university instructors. It explores how to statistically test the relationship between pre-
dictor variables like beauty, age, and gender, and the response variable evaluation score.
The text details both theoretical approaches using t-statistics and p-values, alongside
simulation-based methods involving permutation. Furthermore, it emphasises the as-
sumptions underlying linear regression and provides guidance on residual analysis to
check model validity, including identifying outliers and assessing linearity, constant variance,
and normality. The text also touches upon improving model fit through variable transforma-
tion and briefly discusses special cases like comparing two groups and multiple groups.

References

Brandt, A. M. 2009. The Cigarette Century. Basic Books.
Çetinkaya-Rundel, M., and J. Hardin. 2021. Introduction to Modern Statistics. OpenIntro. https://openintro-ims.netlify.app.
Deale, A., T. Chalder, I. Marks, and S. Wessely. 1997. "Cognitive Behavior Therapy for Chronic Fatigue Syndrome: A Randomized Controlled Trial." American Journal of Psychiatry 154 (3): 408–14. https://doi.org/10.1176/ajp.154.3.408.
Diez, D., M. Çetinkaya-Rundel, and C. Barr. 2019. OpenIntro Statistics. OpenIntro. https://leanpub.com/os.
Gelman, A., and J. Hill. 2006. Data Analysis Using Regression and Multilevel/Hierarchical Models. Analytical Methods for Social Research. Cambridge University Press.
Hamermesh, D. S., and A. Parker. 2005. "Beauty in the Classroom: Instructors' Pulchritude and Putative Pedagogical Productivity." Economics of Education Review 24 (4): 369–76. https://doi.org/10.1016/j.econedurev.2004.07.013.
Ismay, C., and A. Y. Kim. 2019. Statistical Inference via Data Science: A ModernDive into R and the Tidyverse. Chapman & Hall/CRC The R Series. CRC Press. https://moderndive.com.
James, G., D. Witten, T. Hastie, and R. Tibshirani. 2021. An Introduction to Statistical Learning: With Applications in R. Springer Texts in Statistics. Springer US. https://link.springer.com/book/10.1007/978-1-0716-1418-1.
Ramsey, F., and D. Schafer. 2002. The Statistical Sleuth: A Course in Methods of Data Analysis. 2nd ed. Duxbury Press.
Rosen, B., and T. H. Jerdee. 1974. "Influence of Sex Role Stereotypes on Personnel Decisions." Journal of Applied Psychology 59: 9–14. https://api.semanticscholar.org/CorpusID:62817155.
Tatem, A. J., C. A. Guerra, P. M. Atkinson, and S. I. Hay. 2004. "Momentous Sprint at the 2156 Olympics?" Nature 431 (7008): 525. https://doi.org/10.1038/431525a.
Wickham, H., M. Çetinkaya-Rundel, and G. Grolemund. 2023. R for Data Science. O'Reilly Media. https://r4ds.hadley.nz.

A Some probability distributions

A.1 R and probability distributions

R knows four different types of functions in the context of probability distributions.


The name of each one of these functions starts with d, p, q or r and ends with an abbreviation of the
respective distribution.
d-functions compute the probability mass/density function, e.g.,

dnorm(0.5)
# [1] 0.3520653

p-functions compute the cumulative distribution function, e.g.,

pnorm(-0.5, mean = -1, sd = 2)


# [1] 0.5987063

q-functions compute quantiles of the distribution, e.g.,

qnorm(0.5, mean = -1)


# [1] -1

r-functions compute realizations of pseudo random variables, e.g.,

rnorm(2, sd = 3)
# [1] -3.3220566 0.2682201
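The four function types fit together: the p- and q-functions are inverses of each other, and the r-functions produce reproducible draws once a seed is set. A small sketch:

# p- and q-functions are inverses of each other
qnorm(pnorm(1.5))   # evaluates to 1.5
# r-draws are reproducible when a seed is set
set.seed(42)
rnorm(2)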

A.2 Discrete distributions

Discrete uniform distribution

Definition A.1. A distribution over the set $S := \{x_1, \dots, x_n\}$ assigning equal weight $\mathrm{P}(\{x_i\}) = \frac{1}{n}$ to each element $x_i$ is called discrete uniform distribution.


Bernoulli distribution

Definition A.2. Let $p \in (0, 1)$ be the probability of success and $S := \{0, 1\}$. Then the probabilities

$$\mathrm{P}(\{k\}) = \begin{cases} p, & k = 1 \\ 1 - p, & k = 0 \end{cases}$$

define the Bernoulli distribution with parameter $p$.

Remark. A random trial with only two possible outcomes is called a Bernoulli random trial. A r.v. $X$ with sample space $S = \{0, 1\}$ and probabilities $\mathrm{P}(X = 1) = p$ and $\mathrm{P}(X = 0) = 1 - p$ is called a Bernoulli random variable with $\mathrm{E}[X] = p$ and $\mathrm{Var}[X] = (1-p)p$.


Binomial distribution

Definition A.3. Let $p \in (0, 1)$ be the probability of success, $n \in \mathbb{N}$ the number of independent Bernoulli trials and $k \in S := \{0, 1, \dots, n\}$ the number of successes (elementary event). Then the probabilities

$$\mathrm{P}_{(n,p)}(\{k\}) = \binom{n}{k} p^k (1-p)^{n-k}, \quad k \in S,$$

define a binomial distribution with parameters $n$ and $p$.

Interpretation

$\mathrm{P}_{(n,p)}(\{k\})$ is the probability of seeing $k$ successes in $n$ Bernoulli trials.

Seeing $k$ successes in a fixed number of $n$ Bernoulli trials can happen in various scenarios. The number of possible scenarios is $\binom{n}{k}$.¹ The probability for one of these scenarios is $p^k(1-p)^{n-k}$. So, the formula for the probabilities is of the form

# of scenarios $\cdot$ P(single scenario)

Expectation and Variance of a r.v. $X$ with binomial distribution are $\mathrm{E}[X] = np$ and $\mathrm{Var}[X] = np(1-p)$, respectively.
In R, we can compute the probability $\mathrm{P}_{(n,p)}(\{k\})$ using the command:

dbinom(k, size = n, prob = p)

¹ $\binom{n}{k} = \frac{n!}{k!(n-k)!}$ is called the binomial coefficient, and $n! = n \cdot (n-1) \cdot (n-2) \cdots 2 \cdot 1$ is the factorial.


The probability of the event $\{X \le k\}$ is the distribution function of the binomial distribution, $\sum_{i=0}^{k} \mathrm{P}_{(n,p)}(\{i\})$, at point $k$ and can be computed in R using the command:

pbinom(k, size = n, prob = p)

Example A.1. The number of successes in 10 independent Bernoulli trials follows a binomial distribution with parameters $n = 10$ and $p = 0.35$.
The probability of 6 successes in 10 independent Bernoulli trials is then equal to

$$\mathrm{P}(\text{6 successes in 10 trials}) = \binom{10}{6} \cdot 0.35^6 \cdot (1 - 0.35)^{10-6} = \frac{10!}{4!\,6!} \cdot 0.35^6 \cdot 0.65^4 = \frac{10 \cdot 9 \cdot 8 \cdot 7}{4 \cdot 3 \cdot 2 \cdot 1} \cdot 0.35^6 \cdot 0.65^4 = 210 \cdot 0.35^6 \cdot 0.65^4 = 0.0689098$$

Using R we obtain

dbinom(6, size = 10, prob = 0.35)


# [1] 0.0689098
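Consistent with the distribution function mentioned above, summing the probability mass function yields the same value as pbinom(); a small check:

# P(X <= 6) for n = 10, p = 0.35, computed two ways
sum(dbinom(0:6, size = 10, prob = 0.35))
pbinom(6, size = 10, prob = 0.35)
# both calls return the same probability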

Geometric distribution

Definition A.4. Let $p \in (0, 1)$ be the probability of success, $(1-p)$ the probability of failure and $n \in S := \mathbb{N}$ the number of independent trials. Then the probabilities

$$\mathrm{P}_p(\{n\}) = (1-p)^{n-1} p, \quad n \in S,$$

define a geometric distribution with parameter $p$.

$\mathrm{P}_p(\{n\})$ is the probability of the first success on the $n$-th trial.

Expectation and Variance of a r.v. $X$ with geometric distribution are $\mathrm{E}(X) = \frac{1}{p}$ and $\mathrm{Var}(X) = \frac{1-p}{p^2}$.

In R, we can compute the probability P𝑝 ({𝑛}) using the command:

dgeom(n-1, prob = p)

Remark. dgeom() uses a different parametrization compared to Definition A.4. Therefore, we have
to use n-1 instead of n.
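A quick check of this parametrization (a sketch with the hypothetical values p = 0.2 and n = 4):

# First success on the 4th trial with success probability 0.2
p <- 0.2
n <- 4
all.equal((1 - p)^(n - 1) * p, dgeom(n - 1, prob = p))  # should be TRUE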


Application

Waiting time until the first success in independent and identically distributed (i.i.d.)
Bernoulli random trials.

Poisson distribution

Definition A.5. Let $\lambda \in \mathbb{R}_+$ be a positive parameter, called rate. Then the probabilities

$$\mathrm{P}_\lambda(\{k\}) = \frac{\lambda^k e^{-\lambda}}{k!}, \quad k \in S := \mathbb{N}_0,$$

where $k!$ denotes the $k$-factorial, define a Poisson distribution with parameter $\lambda$.

Expectation and Variance of a r.v. $X$ with Poisson distribution are $\mathrm{E}[X] = \lambda$ and $\mathrm{Var}[X] = \lambda$.

Interpretation

The Poisson distribution is often useful for modeling the number of rare events in a large
population over a (short) unit of time. The population is assumed to be (mostly-)fixed, and the
units within the population should be independent.
Data, which can be modeled through a Poisson distribution, is also called count data.

In R, we can compute the probability P𝜆 ({𝑘}) using the command:

dpois(k, lambda = lambda)

Negative binomial distribution

Definition A.6. Let $p \in (0, 1)$ be the probability of success, $n \in \mathbb{N}$ the number of independent Bernoulli trials, and $k \le n$ the number of successes. Then the probabilities

$$\mathrm{P}_{(k,p)}(\{n\}) = \binom{n-1}{k-1} p^k (1-p)^{n-k}$$

define a negative binomial distribution with parameters $k$ and $p$.


Interpretation

The negative binomial distribution describes the probability of observing the $k$-th success on the $n$-th trial.

R uses a different parametrization compared to Definition A.6. Therefore, we have to enter the number
of failures x and successes size in x + size Bernoulli trials with the last one being a success.

dnbinom(x = n - k, size = k, prob = p)
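A quick check of this parametrization (a sketch with the hypothetical values k = 3, n = 7 and p = 0.4):

# Probability of the 3rd success on the 7th trial, computed two ways
k <- 3; n <- 7; p <- 0.4
choose(n - 1, k - 1) * p^k * (1 - p)^(n - k)
dnbinom(x = n - k, size = k, prob = p)
# both calls return the same probability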

A.3 Continuous distributions

(Continuous) Uniform distribution

Definition A.7. Let $[a, b]$ be an interval on the real line with $a < b$. The distribution with density function

$$f(x) = \frac{1}{b-a} \cdot \mathbb{1}_{[a,b]}(x) = \begin{cases} \frac{1}{b-a}, & x \in [a,b] \\ 0, & x \notin [a,b] \end{cases}$$

is called (continuous) uniform distribution on the interval $[a, b]$ and we will denote it by $\mathrm{Unif}(a, b)$.

Expectation and Variance of a r.v. $X$ with uniform distribution on the interval $[a, b]$ are $\mathrm{E}[X] = \frac{b+a}{2}$ and $\mathrm{Var}[X] = \frac{(b-a)^2}{12}$.

The function

dunif(x, min = a, max = b)

computes the density function of the uniform distribution.


[Plot: density of the uniform distribution with a = 0 and b = 2.]

Normal distribution

Definition A.8. Let $\mu \in \mathbb{R}$ and $\sigma > 0$. The normal distribution with mean $\mu$ and variance $\sigma^2$ is the continuous distribution on $\mathbb{R}$ with density function

$$f(x) = \frac{1}{\sqrt{2\pi\sigma^2}}\, \mathrm{e}^{-\frac{(x-\mu)^2}{2\sigma^2}}, \quad x \in \mathbb{R},$$

and we will denote it by $\mathcal{N}(\mu, \sigma^2)$.

Remark. A normal distribution with mean $\mu = 0$ and variance $\sigma^2 = 1$ is called standard normal distribution; in symbols, $\mathcal{N}(0, 1)$.
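Standardizing connects a general normal distribution to the standard normal distribution: if $X \sim \mathcal{N}(\mu, \sigma^2)$, then $(X - \mu)/\sigma \sim \mathcal{N}(0, 1)$. A small check in R:

# P(X <= 0.5) for X ~ N(2, 3^2), once directly and once after standardizing
pnorm(0.5, mean = 2, sd = 3)
pnorm((0.5 - 2) / 3)
# both calls return the same probability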

Exponential distribution

Definition A.9. Let $\lambda > 0$ be a parameter called rate. The distribution with density function

$$f(x) = \lambda \mathrm{e}^{-\lambda x}\, \mathbb{1}_{[0,\infty)}(x)$$

is called exponential distribution and we will denote it by $\mathrm{Exp}(\lambda)$.


Expectation and Variance of a r.v. $X$ with distribution $\mathrm{Exp}(\lambda)$ are $\mathrm{E}[X] = \frac{1}{\lambda}$ and $\mathrm{Var}[X] = \frac{1}{\lambda^2}$.

The function

dexp(x, rate = lambda)

computes the density function of the exponential distribution with rate 𝜆.

[Plot: exponential densities for rates λ = 1 and λ = 2.]

Application

Waiting time between two rare events.

Chi-squared distribution

Definition A.10. Let $k > 0$ be a parameter called degree of freedom. The distribution with density function

$$f(x) = \frac{1}{2^{k/2}\,\Gamma(k/2)}\, x^{k/2 - 1}\, \mathrm{e}^{-x/2}\, \mathbb{1}_{[0,\infty)}(x),$$

where $\Gamma(\cdot)$ is the Gamma function, is called chi-squared distribution and we will denote it by $\chi^2(k)$.

Expectation and Variance of a r.v. $X$ with distribution $\chi^2(k)$ are $\mathrm{E}[X] = k$ and $\mathrm{Var}[X] = 2k$.


Application

Used in inferential statistics.

The function

dchisq(x, df = k)

computes the density function of the chi-squared distribution with 𝑘 degrees of freedom.

[Plot: densities of the chi-squared distribution for k = 2, 3 and 7 degrees of freedom.]

F-distribution

Definition A.11. Let $d_1 > 0$ and $d_2 > 0$ be two parameters called degrees of freedom. The distribution with density function

$$f(x) = \frac{\Gamma(d_1/2 + d_2/2)}{\Gamma(d_1/2)\,\Gamma(d_2/2)} \left(\frac{d_1}{d_2}\right)^{d_1/2} x^{d_1/2 - 1} \cdot \left(1 + \frac{d_1 x}{d_2}\right)^{-(d_1+d_2)/2} \mathbb{1}_{[0,\infty)}(x),$$

where $\Gamma(\cdot)$ is the Gamma function, is called F-distribution and we will denote it by $F(d_1, d_2)$.

Expectation and Variance of a r.v. $X$ with distribution $F(d_1, d_2)$ are $\mathrm{E}[X] = \frac{d_2}{d_2 - 2}$, for $d_2 > 2$, and $\mathrm{Var}[X] = \frac{2 d_2^2 (d_1 + d_2 - 2)}{d_1 (d_2 - 2)^2 (d_2 - 4)}$, for $d_2 > 4$.


Application

Used in inferential statistics.

The function

df(x, df1 = df1, df2 = df2)

computes the density function of the F-distribution with 𝑑1 and 𝑑2 degrees of freedom.

[Plot: densities of the F-distribution for (d1, d2) = (2, 1), (5, 1) and (100, 100).]

t distribution

Definition A.12. Let $\nu > 0$ be a parameter called degree of freedom. The distribution with density function

$$f(x) = \frac{\Gamma\!\left(\frac{\nu+1}{2}\right)}{\sqrt{\nu\pi}\,\Gamma\!\left(\frac{\nu}{2}\right)} \left(1 + \frac{x^2}{\nu}\right)^{-(\nu+1)/2},$$

where $\Gamma(\cdot)$ is the Gamma function, is called t-distribution and we will denote it by $t(\nu)$.

Expectation and Variance of a r.v. $X$ with distribution $t(\nu)$ are $\mathrm{E}[X] = 0$ and $\mathrm{Var}[X] = \frac{\nu}{\nu - 2}$, for $\nu > 2$.


Remark. The Student’s 𝑡-distribution is a generalization of the standard normal distribution. Its
density is also symmetric around zero and bell-shaped, but its tails are thicker than the normal
model’s. Therefore, observations are more likely to fall beyond two SDs from the mean than those
under the normal distribution.

[Plot: density of the t-distribution compared with the standard normal density.]

Application

Used in inferential statistics.

The function

dt(x, df = nu)

computes the density function of the t-distribution with 𝜈 degrees of freedom.


[Plot: densities of the t-distribution for df = 1, 2, 5 and 10 compared with the standard normal density.]

Two-dimensional normal distribution

Definition A.13. Let $(X_1, X_2)^\top \in \mathbb{R}^2$ be a random vector. We say that $(X_1, X_2)^\top$ has a two-dimensional normal distribution if the density of the joint distribution of $X_1$ and $X_2$ is given by

$$f(\mathbf{x}) = \frac{1}{\sqrt{(2\pi)^2 \det(\Sigma)}} \exp\left\{-\frac{1}{2}(\mathbf{x} - \mu)^\top \Sigma^{-1} (\mathbf{x} - \mu)\right\}, \quad \mathbf{x} \in \mathbb{R}^2,$$

with $\mu = (\mu_1, \mu_2)^\top \in \mathbb{R}^2$ and

$$\Sigma = \begin{pmatrix} \sigma_1^2 & \rho\sigma_1\sigma_2 \\ \rho\sigma_1\sigma_2 & \sigma_2^2 \end{pmatrix},$$

where $\sigma_1, \sigma_2 > 0$ and $\rho \in (-1, 1)$. The parameters $\mu_i$ and $\sigma_i^2$ are the expectation and variance of the random variable $X_i$, $i \in \{1, 2\}$, respectively. We denote the two-dimensional normal distribution with parameters $\mu$ and $\Sigma$ by $\mathcal{N}_2(\mu, \Sigma)$.

Remark.

1. Doing the matrix-vector multiplication, the density of the two-dimensional normal distribution can be written as
$$f(\mathbf{x}) = f(x_1, x_2) = \frac{1}{\sqrt{(2\pi)^2 \sigma_1^2 \sigma_2^2 (1-\rho^2)}}\, \mathrm{e}^{-\frac{1}{2(1-\rho^2)}\left[\frac{(x_1-\mu_1)^2}{\sigma_1^2} - 2\rho\,\frac{(x_1-\mu_1)(x_2-\mu_2)}{\sigma_1\sigma_2} + \frac{(x_2-\mu_2)^2}{\sigma_2^2}\right]}.$$

2. It holds that each component, $X_1$ and $X_2$, has a one-dimensional normal distribution, with $X_i$ following $\mathcal{N}(\mu_i, \sigma_i^2)$. For example, the density $f_1$ of $X_1$ is given by
$$f_1(x_1) = \frac{1}{\sqrt{2\pi\sigma_1^2}}\, \mathrm{e}^{-\frac{(x_1-\mu_1)^2}{2\sigma_1^2}},$$
which is the density of $\mathcal{N}(\mu_1, \sigma_1^2)$; see Section C.3 for a proof of this result. The case of $X_2$ is analogous.

3. The form of the density from Definition A.13 also generalizes to an n-dimensional normal distribution with parameters $\mu \in \mathbb{R}^n$ and $\Sigma \in \mathbb{S}^n_{++}$, where $\mathbb{S}^n_{++}$ is the set of all symmetric and positive definite matrices of dimension $n \times n$.

We visualize the density of the two-dimensional normal distribution by contour plots. These plots
show the smallest regions containing 50%, 80%, 95%, and 99% of the probability mass and are created
using functions from the ggdensity package.
We start by defining the density function.

f <- function(x, y, mu1 = 0, mu2 = 0, sigma1 = 1, sigma2 = 1, rho = 0) {
  1 / (2 * pi * sigma1 * sigma2 * sqrt(1 - rho^2)) *
    exp(-((x - mu1)^2 / sigma1^2 -
            2 * rho * ((x - mu1) * (y - mu2) / (sigma1 * sigma2)) +
            (y - mu2)^2 / sigma2^2) / (2 * (1 - rho^2)))
}

First, we vary the expectation vector 𝜇. We start with 𝜇 = (0, 0)⊤ , then change 𝜇1 and in the last
step change both.

library(ggdensity)

p1 <- ggplot() +
geom_hdr_fun(fun = f, xlim = c(-3, 9), ylim = c(-7, 4),
fill = "blue") +
xlim(-3,9) + ylim(-7,4) +
labs(title = expression(paste(mu[1],"=0 and ", mu[2], "=0")))

p2 <- ggplot() +
geom_hdr_fun(fun = f, xlim = c(-10, 10), ylim = c(-10, 10),
fill = "blue", args = list(mu1 = 5)) +
xlim(-3,9) + ylim(-7,4) +
labs(title = expression(paste(mu[1],"=5 and ", mu[2], "=0")))


p3 <- ggplot() +
geom_hdr_fun(fun = f, xlim = c(-10, 10), ylim = c(-10, 10),
fill = "blue", args = list(mu1 = 5, mu2 = -3)) +
xlim(-3,9) + ylim(-7,4) +
labs(title = expression(paste(mu[1],"=5 and ", mu[2], "=-3")))

p1 + p2 + p3 +                      # layout defined by the
  plot_layout(guides = 'collect')   # patchwork package

[Contour plots of the two-dimensional normal density for μ = (0, 0)⊤, μ = (5, 0)⊤ and μ = (5, −3)⊤.]

Now, we change the variance of the two components. The first plot shows again the N2 (𝜇, Σ) distri-
bution. Then, we increase both variances to 4, and in the last plot, the variance of 𝑋2 is 9.

p1 <- ggplot() +
geom_hdr_fun(fun = f, xlim = c(-10, 10), ylim = c(-10, 10),
fill = "blue") +
xlim(-10,10) + ylim(-10,10) +
labs(title = expression(paste(sigma[1],"=1 and ", sigma[2], "=1")))

p2 <- ggplot() +
geom_hdr_fun(fun = f, xlim = c(-10, 10), ylim = c(-10, 10),
fill = "blue", args = list(sigma1 = 2, sigma2 = 2)) +
xlim(-10,10) + ylim(-10,10) +
labs(title = expression(paste(sigma[1],"=2 and ", sigma[2], "=2")))

p3 <- ggplot() +
  geom_hdr_fun(fun = f, xlim = c(-10, 10), ylim = c(-10, 10),
               fill = "blue", args = list(sigma1 = 2, sigma2 = 3)) +
  xlim(-10, 10) + ylim(-10, 10) +
  labs(title = expression(paste(sigma[1], "=2 and ", sigma[2], "=3")))

p1 + p2 + p3 +
plot_layout(guides = 'collect')

[Contour plots of the two-dimensional normal density for (σ1, σ2) = (1, 1), (2, 2) and (2, 3).]

In all the previous examples, the parameter $\rho$ was equal to 0. Before we visualize the effect of varying $\rho$, let's show that $\rho$ is the correlation between $X_1$ and $X_2$ for $(X_1, X_2) \sim \mathcal{N}_2(\mu, \Sigma)$ with $\Sigma$ as given in Definition A.13.
To compute the covariance between two random variables $X_1$ and $X_2$, we use the formula $\mathrm{Cov}[X_1, X_2] = \mathrm{E}[X_1 \cdot X_2] - \mathrm{E}[X_1] \cdot \mathrm{E}[X_2]$. As we already know the expected values of $X_1$ and $X_2$, denoted by $\mu_1$ and $\mu_2$ respectively, we only need to determine $\mathrm{E}[X_1 \cdot X_2]$, which is done in Section C.3. From there we get

$$\mathrm{E}[X_1 \cdot X_2] = \rho\sigma_1\sigma_2 + \mu_1\mu_2,$$

which leads to the following covariance

$$\mathrm{Cov}[X_1, X_2] = \mathrm{E}[X_1 \cdot X_2] - \mathrm{E}[X_1] \cdot \mathrm{E}[X_2] = \rho\sigma_1\sigma_2 + \mu_1\mu_2 - \mu_1\mu_2 = \rho\sigma_1\sigma_2.$$

Definition 7.14 then says that

$$\mathrm{Corr}[X_1, X_2] = \frac{\mathrm{Cov}[X_1, X_2]}{\sqrt{\mathrm{Var}[X_1]} \cdot \sqrt{\mathrm{Var}[X_2]}} = \frac{\rho\sigma_1\sigma_2}{\sigma_1\sigma_2} = \rho,$$

which shows that $\rho$ is the correlation between $X_1$ and $X_2$, and describes the linear dependence between the two.
The last figure shows contour plots for the $\mathcal{N}_2(\mu, \Sigma)$ distribution, with $\mu = (0, 0)^\top$, $\Sigma = \begin{pmatrix} 1 & \rho \\ \rho & 1 \end{pmatrix}$ and $\rho \in \{-0.7, -0.2, 0, 0.2, 0.7\}$.

p1 <- ggplot() +
geom_hdr_fun(fun = f, xlim = c(-5, 5), ylim = c(-5, 5),
fill = "blue", args = list(rho = -.7)) +
xlim(-5,5) + ylim(-5,5) +
labs(title = expression(paste(rho,"=-0.7")))

p2 <- ggplot() +
geom_hdr_fun(fun = f, xlim = c(-5, 5), ylim = c(-5, 5),
fill = "blue", args = list(rho = -.2)) +
xlim(-5,5) + ylim(-5,5) +
labs(title = expression(paste(rho,"=-0.2")))

p3 <- ggplot() +
geom_hdr_fun(fun = f, xlim = c(-5, 5), ylim = c(-5, 5),
fill = "blue") +
xlim(-5,5) + ylim(-5,5) +
labs(title = expression(paste(rho,"=0")))

p4 <- ggplot() +
geom_hdr_fun(fun = f, xlim = c(-5, 5), ylim = c(-5, 5),
fill = "blue", args = list(rho = .2)) +
xlim(-5,5) + ylim(-5,5) +
labs(title = expression(paste(rho,"=0.2")))

p5 <- ggplot() +
geom_hdr_fun(fun = f, xlim = c(-5, 5), ylim = c(-5, 5),
fill = "blue", args = list(rho = .7)) +
xlim(-5,5) + ylim(-5,5) +
labs(title = expression(paste(rho,"=0.7")))

((p1 + p2 + p3) / (p4 + p5 + plot_spacer())) +
  plot_layout(guides = 'collect')


[Contour plots of the two-dimensional normal density for ρ = −0.7, −0.2, 0, 0.2 and 0.7.]

B Inference for logistic regression

Birdkeeping and lung cancer¹

A health survey conducted in The Hague, Netherlands from 1972 to 1981 found a link between keeping
pet birds and an increased risk of lung cancer.
To investigate bird keeping as a risk factor, researchers conducted a case-control study of patients
in 1985 at four hospitals in The Hague (population 450,000).
They identified 49 cases of lung cancer among the patients who were registered with a general
practice, who were age 65 or younger and who had resided in the city since 1965. They also selected
98 controls from a population of residents having the same general age structure.
The data is contained in the Sleuth3 package accompanying the book The Statistical Sleuth (2002).

case2002 <- as_tibble(Sleuth3::case2002)


case2002$LC <- relevel(case2002$LC, "NoCancer")
case2002$BK <- relevel(case2002$BK, "NoBird")

The dataset contains the following variables:


LC: whether subject has lung cancer; the reference level is NoCancer
FM: sex of subject
SS: socioeconomic status
BK: indicator for bird keeping - caged birds were kept in the household for more than 6 consecutive
months between the ages of 5 and 14 years prior to diagnosis (cases) or examination (control) - the
reference level is NoBird
AG: age of subject in years
YR: years of smoking prior to diagnosis or examination
CD: average rate of smoking in cigarettes per day

¹ Example taken from Ramsey and Schafer (2002).


case2002
# # A tibble: 147 x 7
# LC FM SS BK AG YR CD
# <fct> <fct> <fct> <fct> <int> <int> <int>
# 1 LungCancer Male Low Bird 37 19 12
# 2 LungCancer Male Low Bird 41 22 15
# 3 LungCancer Male High NoBird 43 19 15
# 4 LungCancer Male Low Bird 46 24 15
# 5 LungCancer Male Low Bird 49 31 20
# 6 LungCancer Male High NoBird 51 24 15
# # i 141 more rows

Exploratory data analysis

We want to visually analyze the impact of the predictor variables (age, smoking (years as well as the
rate), gender, socioeconomic status, and owning birds) on the likelihood of developing lung cancer.
We will create a separate bar plot for each of these variables and color each bar based on the proportions of individuals with and without lung cancer in the corresponding subgroup. Since age, years of smoking, and the rate of smoking are numerical variables, we will need to categorize them to create bar plots.

p_ag <- ggplot(case2002, aes(x = cut_number(AG, 5), fill= LC)) +


geom_bar(position = 'fill') +
scale_fill_brewer(palette = "Accent",
labels = c("no","yes")) +
labs(x = "Age", y = "proportions", fill = "lung cancer")

p_yr <- ggplot(case2002, aes(x = cut_number(YR, 5), fill= LC)) +


geom_bar(position = 'fill') +
scale_fill_brewer(palette = "Accent",
labels = c("no","yes")) +
labs(x = "Years of smoking\n prior to diagnosis",
y = "proportions", fill = "lung cancer")

p_cd <- ggplot(case2002, aes(x = cut_number(CD, 3), fill= LC)) +


geom_bar(position = 'fill') +
scale_fill_brewer(palette = "Accent",
labels = c("no","yes")) +
labs(x = "Average number of\n cigarettes per day",
y = "proportions", fill = "lung cancer")


p_fm <- ggplot(case2002, aes(x = FM, fill= LC)) +


geom_bar(position = 'fill') +
scale_fill_brewer(palette = "Accent",
labels = c("no","yes")) +
labs(x = "Gender",
y = "proportions", fill = "lung cancer")

p_ss <- ggplot(case2002, aes(x = SS, fill= LC)) +


geom_bar(position = 'fill') +
scale_fill_brewer(palette = "Accent",
labels = c("no","yes")) +
labs(x = "Socioeconmic status",
y = "proportions", fill = "lung cancer")

p_bk <- ggplot(case2002, aes(x = BK, fill= LC)) +


geom_bar(position = 'fill') +
scale_fill_brewer(palette = "Accent",
labels = c("no","yes")) +
labs(x = "Keeping Birds",
y = "proportions", fill = "lung cancer")

p_ag + p_yr + p_cd + p_fm + p_ss + p_bk +


plot_layout(axes = "collect", ncol = 2, guides = "collect")


Figure B.1: Proportions of subjects with and without lung cancer by age, years of smoking, average number of cigarettes per day, gender, socioeconomic status and bird keeping.

We observe different proportions across the categories of age (minimal), years of smoking, average number of cigarettes per day and bird ownership, while this is not evident for gender and the socioeconomic status.

Model selection

Based on our exploratory data analysis, we do not anticipate gender and socioeconomic status signif-
icantly contributing to a model that describes the probability of developing lung cancer. However,
owning birds and smoking appear to have an impact. As for the influence of age, we are uncertain.
Let's try to verify our assumptions by fitting a logistic regression model to the data. Remember, the model states that


$$\mathrm{P}(Y_i = 1 \mid \mathbf{x}_i) = \frac{\exp(\beta_0 + \beta_1 x_{1,i} + \cdots + \beta_k x_{k,i})}{1 + \exp(\beta_0 + \beta_1 x_{1,i} + \cdots + \beta_k x_{k,i})}.$$

To select the predictor variables in the model, a hybrid stepwise selection algorithm is used. The
algorithm begins with the intercept model:

model_int <- glm(LC ~ 1, family = binomial, data = case2002)

Now, we can use the step() function on this model and define the scope to include all other variables
in the dataset.

model_step <- step(model_int,
                   scope = ~ FM + SS + BK + AG + YR + CD,
                   direction = "both")
# Start: AIC=189.14
# LC ~ 1
#
# Df Deviance AIC
# + BK 1 172.93 176.93
# + YR 1 173.17 177.17
# + CD 1 179.62 183.62
# <none> 187.13 189.13
# + SS 1 185.81 189.81
# + AG 1 187.13 191.13
# + FM 1 187.13 191.13
#
# Step: AIC=176.93
# LC ~ BK
#
# Df Deviance AIC
# + YR 1 158.11 164.11
# + CD 1 164.36 170.36
# <none> 172.93 176.93
# + FM 1 172.44 178.44
# + AG 1 172.53 178.53
# + SS 1 172.72 178.72
# - BK 1 187.13 189.13
#
# Step: AIC=164.11
# LC ~ BK + YR
#


# Df Deviance AIC
# <none> 158.11 164.11
# + AG 1 156.22 164.22
# + CD 1 156.75 164.75
# + FM 1 157.10 165.10
# + SS 1 158.11 166.11
# - YR 1 172.93 176.93
# - BK 1 173.17 177.17

Let's take a look at the estimated slope parameters $\hat{\beta}_j$, $j \in \{1, 2\}$.

tidy(model_step)
# # A tibble: 3 x 5
# term estimate std.error statistic p.value
# <chr> <dbl> <dbl> <dbl> <dbl>
# 1 (Intercept) -3.18 0.636 -5.00 0.000000582
# 2 BKBird 1.48 0.396 3.73 0.000194
# 3 YR 0.0582 0.0168 3.46 0.000544

Interpretation of the parameter estimates

Assuming all other predictors stay constant,

• the odds ratio of getting lung cancer for bird keepers vs non-bird keepers is exp(1.48) ≈
4.37,

• the odds ratio of getting lung cancer for an additional year of smoking is exp(0.0582) ≈
1.06.
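These odds ratios can be read off the fitted model directly by exponentiating the estimated coefficients; a minimal sketch using the base R extractor coef():

# Odds ratios for bird keeping and one additional year of smoking
exp(coef(model_step))[c("BKBird", "YR")]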

B.1 Testing the slope parameters

Given the results of a fitted model, we want to test hypotheses about the slopes 𝛽𝑖 , as we have done
in the linear regression model.
We want to investigate whether the j-th explanatory variable has an impact on the probability of
success P(𝑌 = 1|x) within the population. Hence, we want to consider the testing problem

𝐻0 ∶ 𝛽𝑗 = 0 𝐻𝐴 ∶ 𝛽𝑗 ≠ 0 .

Remark. As with multiple linear regression, each hypothesis test is conducted with each of the other
variables remaining in the model. Hence, the null hypothesis 𝐻0 ∶ 𝛽𝑗 = 0 is actually:


𝐻0 ∶ 𝛽𝑗 = 0 given all the other variables in the model.

The test statistic construction follows the same basic setup as linear regression using the theoretical
approach. The difference is that we only know the asymptotic distribution of the statistic constructed
this way.

Remark. For inference in the logistic regression model we will only use the asymptotic approach. The
infer package does not support the simulation-based approach for logistic regression. Therefore, one
must utilize the general setup provided by the tidymodels package to employ this approach.

Definition B.1. Let $\hat{\beta}_j(\mathbf{Y}, \mathbf{x})$, $j \in \{1, \dots, k\}$, be the maximum likelihood estimator of the population parameter $\beta_j$ in a logistic regression model and $\mathrm{SE}_{\hat{\beta}_j}(\mathbf{Y}, \mathbf{x})$ the corresponding standard error. Then the test statistic

$$Z_j(\mathbf{Y}, \mathbf{x}) = \frac{\hat{\beta}_j(\mathbf{Y}, \mathbf{x}) - \beta_{j,0}}{\mathrm{SE}_{\hat{\beta}_j}(\mathbf{Y}, \mathbf{x})},$$

which has approximately (for large $n$) a standard normal distribution, is used for testing

$$H_0: \beta_j = \beta_{j,0} \qquad H_A: \beta_j \neq \beta_{j,0},$$

given all the other variables in the model.

Remark. The only tricky bit, which is beyond the scope of this course, is how the standard error $\mathrm{SE}_{\hat{\beta}_j}(\mathbf{Y}, \mathbf{x})$ is calculated.

Most of the time we are interested in testing

𝐻0 ∶ 𝛽𝑗 = 0 𝐻𝐴 ∶ 𝛽𝑗 ≠ 0 ,
given all the other variables in the model, i.e., 𝛽𝑗,0 is assumed to be zero.
In this case the null hypothesis would be rejected at the 𝛼 significance level, if

|𝑧𝑗 (y, x)| > 𝑧1−𝛼/2 ,

where 𝑧𝑗 (y, x) is the observed value of the test statistic 𝑍𝑗 (Y, x) and 𝑧1−𝛼/2 is the 1 − 𝛼/2 quantile
of the standard normal distribution.
Let’s consider the impact of owning a bird on the likelihood of developing lung cancer. Under the
null hypothesis, we assume that keeping a bird does not affect the likelihood. In other words, the test
problem is as follows:

𝐻0 ∶ 𝛽 1 = 0 𝐻𝐴 ∶ 𝛽1 ≠ 0


tidy(model_step)
# # A tibble: 3 x 5
# term estimate std.error statistic p.value
# <chr> <dbl> <dbl> <dbl> <dbl>
# 1 (Intercept) -3.18 0.636 -5.00 0.000000582
# 2 BKBird 1.48 0.396 3.73 0.000194
# 3 YR 0.0582 0.0168 3.46 0.000544

The observed value of the test statistic is equal to:

$$z_1 = \frac{\hat{\beta}_1 - \beta_{1,0}}{\mathrm{SE}_{\hat{\beta}_1}} = \frac{1.48 - 0}{0.396} \approx 3.74.$$

The corresponding p-value is then given by

$$\text{p-value} \approx \mathrm{P}_{H_0}(|Z| \ge 3.73) = 2 \cdot \mathrm{P}_{H_0}(Z > 3.73) = 1.9147977 \times 10^{-4},$$

since

2 * pnorm(3.73, lower.tail = FALSE)


# [1] 0.0001914798

Confidence interval for the slope parameter

Using the z-statistic $\frac{\hat{\beta}_j(\mathbf{Y}, \mathbf{x}) - \beta_j}{\mathrm{SE}_{\hat{\beta}_j}(\mathbf{Y}, \mathbf{x})}$ and applying the idea for constructing confidence intervals presented in Section 12.1 yields the $100(1-\alpha)\%$ asymptotic confidence interval for the slope parameter $\beta_j$:

$$\hat{\beta}_j(\mathbf{Y}, \mathbf{x}) \pm z_{1-\alpha/2} \cdot \mathrm{SE}_{\hat{\beta}_j}(\mathbf{Y}, \mathbf{x}),$$

where $z_{1-\alpha/2}$ is the $1 - \alpha/2$ quantile of the standard normal distribution.


Using the estimation results

tidy(model_step)
# # A tibble: 3 x 5
# term estimate std.error statistic p.value
# <chr> <dbl> <dbl> <dbl> <dbl>
# 1 (Intercept) -3.18 0.636 -5.00 0.000000582
# 2 BKBird 1.48 0.396 3.73 0.000194
# 3 YR 0.0582 0.0168 3.46 0.000544

and the 0.975 quantile of the standard normal distribution

qnorm(0.975)
# [1] 1.959964

we can compute the asymptotic 95% confidence interval $\mathrm{CI}_{\beta_1}$ for $\beta_1$:

$$\mathrm{CI}_{\beta_1} \approx 1.48 \pm 1.96 \cdot 0.396 \approx (0.7, 2.251).$$

Remember, the odds ratio for a one unit change of BK is equal to $e^{\beta_1}$. Hence, an asymptotic 95% confidence interval for the odds ratio is given by

$$e^{\mathrm{CI}} = (e^{0.7}, e^{2.251}) \approx (2.014, 9.497).$$

Using confint.default() will allow us to verify this computation.

confint.default(model_step)
# 2.5 % 97.5 %
# (Intercept) -4.42747020 -1.93284067
# BKBird 0.69964830 2.25145472
# YR 0.02523359 0.09126585
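Exponentiating the interval limits turns this into the asymptotic 95% confidence interval for the odds ratio; a one-line sketch:

# Asymptotic 95% confidence interval for the odds ratio of bird keeping
exp(confint.default(model_step))["BKBird", ]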

The relative risk of developing lung cancer for individuals who own birds and smoke for 10 or 20
years is presented in the following output.

library(effectsize)

# p0: 10 years smoking and no birds


oddsratio_to_riskratio(model_step, p0 = 0.06928929)
# Parameter | Risk Ratio | 95% CI


# -------------------------------------
# (p0) | 0.07 |
# BK [Bird] | 3.54 | [1.91, 6.07]
# YR | 1.06 | [1.03, 1.09]

# p0: 20 years smoking and no birds


oddsratio_to_riskratio(model_step, p0 = 0.1176203)
# Parameter | Risk Ratio | 95% CI
# -------------------------------------
# (p0) | 0.12 |
# BK [Bird] | 3.13 | [1.83, 4.80]
# YR | 1.05 | [1.02, 1.09]


B.2 Checking model conditions

Remember, logistic regression assumes a linear relationship between the logit and the numeric predictor variables:

$$\log\left(\frac{p(\mathbf{x}_i)}{1 - p(\mathbf{x}_i)}\right) = \beta_0 + \sum_{j=1}^{k} \beta_j x_{j,i}.$$

To verify this assumption, we plot each numeric predictor variable against the estimated logit $\log\left(\frac{\hat{p}(\mathbf{x}_i)}{1 - \hat{p}(\mathbf{x}_i)}\right)$.

Since model_step has only one numeric predictor, let’s fit a model where we add the other two
numeric predictor variables AG and CD for illustration.

model_np <- glm(LC ~ BK + YR + AG + CD,
                family = binomial, data = case2002)

Now we can use the augment() function to add the fitted probabilities $\hat{p}_i$ and the logit values to the dataset.

case2002_fit <- augment(
  model_np,
  type.predict = "response") |>
  mutate(
    logit = log(.fitted / (1 - .fitted))
  )

We generate a scatterplot for each numeric predictor against the logit. Additionally, we use
geom_smooth() to add a non-parametric fit to the point cloud. If the fitted function appears to be
linear, we can infer that the linearity assumption may be satisfied.

# age
p_AG <- ggplot(case2002_fit, aes(x = AG, y = logit)) +
geom_point() + geom_smooth()

# cigarettes per day


p_CD <- ggplot(case2002_fit, aes(x = CD, y = logit)) +
geom_point() + geom_smooth()

# years smoking
p_YR <- ggplot(case2002_fit, aes(x = YR, y = logit)) +
geom_point() + geom_smooth()


All three plots are shown in Figure B.2.

Figure B.2: Estimated logit plotted against AG, CD and YR, each with a non-parametric smooth.

It appears that the assumption only applies to YR. For AG, the relationship may also be linear, but
with a slope of approximately zero. The relationship between the logit and CD is evidently non-linear,
likely due to the non-smokers smoking zero cigarettes per day. For smokers who smoke 5 or more
cigarettes per day, the relationship appears to be linear, but with a rather small slope. Therefore, the
two plots once again confirm that a model without AG and CD makes more sense.

Remark.

1. The linearity assumption is obviously met for all categorical predictors. For categorical predic-
tors, the assumption states that the logit will have different mean values for the different levels
of the predictors, which will be the case.


2. In logistic regression, we don't need to check distributional assumptions as we do in linear regression when checking for normality. This is because the response in logistic regression is binary and follows a Bernoulli distribution. The only unknown is the success probability, which we aim to model with logistic regression. As a result, we don't need to consider the residuals for checking distributional assumptions.
3. The residuals provide important information about the quality of the fit. One can visually inspect the residuals to evaluate the goodness of fit. But this is beyond the scope of this course.

C Technical points

This section presents various technical points raised in previous chapters. The purpose of this section
is to provide proofs for each of these points for readers who are interested. Please note that the content
provided here is not considered part of the lecture material.

C.1 Technical points from Chapter 8

Claim

The bias-variance decomposition as given in Equation 8.1 is true.

Proof. We want to prove the bias-variance decomposition:

$$\mathrm{E}\left[\left(Y^* - \hat{f}(x^*)\right)^2\right] = \mathrm{Var}\left[\hat{f}(x^*)\right] + \left[\mathrm{Bias}\left(\hat{f}(x^*)\right)\right]^2 + \mathrm{Var}[\epsilon_0].$$

Therefore, we will add and remove terms to $Y^* - \hat{f}(x^*)$, such that we create quantities with known expectation. We get

$$\begin{aligned}
\mathrm{E}\left[\left(Y^* - \hat{f}(x^*)\right)^2\right]
&= \mathrm{E}\left[\left(Y^* - f_0(x^*) + f_0(x^*) - \mathrm{E}[\hat{f}(x^*)] + \mathrm{E}[\hat{f}(x^*)] - \hat{f}(x^*)\right)^2\right] \\
&= \mathrm{E}\Big[\left(Y^* - f_0(x^*)\right)^2
  + 2\left(Y^* - f_0(x^*)\right)\left(f_0(x^*) - \mathrm{E}[\hat{f}(x^*)] + \mathrm{E}[\hat{f}(x^*)] - \hat{f}(x^*)\right) \\
&\qquad\quad + \left(f_0(x^*) - \mathrm{E}[\hat{f}(x^*)] + \mathrm{E}[\hat{f}(x^*)] - \hat{f}(x^*)\right)^2\Big] \\
&= \mathrm{E}[\epsilon_0^2]
  + 2\,\mathrm{E}\left[\epsilon_0\left(f_0(x^*) - \mathrm{E}[\hat{f}(x^*)] + \mathrm{E}[\hat{f}(x^*)] - \hat{f}(x^*)\right)\right] \\
&\qquad\quad + \mathrm{E}\left[\left(f_0(x^*) - \mathrm{E}[\hat{f}(x^*)] + \mathrm{E}[\hat{f}(x^*)] - \hat{f}(x^*)\right)^2\right].
\end{aligned}$$

Now observe that $\epsilon_0$ is independent from all other random quantities and has expectation zero, i.e., $\mathrm{E}[\epsilon_0] = 0$. Hence,

$$\begin{aligned}
\mathrm{E}\left[\left(Y^* - \hat{f}(x^*)\right)^2\right]
&= \mathrm{E}[\epsilon_0^2]
  + 2\,\mathrm{E}[\epsilon_0]\,\mathrm{E}\left[f_0(x^*) - \mathrm{E}[\hat{f}(x^*)] + \mathrm{E}[\hat{f}(x^*)] - \hat{f}(x^*)\right] \\
&\qquad\quad + \mathrm{E}\Big[\left(f_0(x^*) - \mathrm{E}[\hat{f}(x^*)]\right)^2
  + 2\left(f_0(x^*) - \mathrm{E}[\hat{f}(x^*)]\right)\left(\mathrm{E}[\hat{f}(x^*)] - \hat{f}(x^*)\right)
  + \left(\mathrm{E}[\hat{f}(x^*)] - \hat{f}(x^*)\right)^2\Big] \\
&= \mathrm{E}\left[\left(f_0(x^*) - \mathrm{E}[\hat{f}(x^*)]\right)^2\right]
  + 2\left(f_0(x^*) - \mathrm{E}[\hat{f}(x^*)]\right)\mathrm{E}\left[\mathrm{E}[\hat{f}(x^*)] - \hat{f}(x^*)\right] \\
&\qquad\quad + \mathrm{E}\left[\left(\mathrm{E}[\hat{f}(x^*)] - \hat{f}(x^*)\right)^2\right] + \mathrm{E}[\epsilon_0^2].
\end{aligned}$$

But since $\mathrm{E}[\epsilon_0] = 0$, it follows that $\mathrm{Var}[\epsilon_0] = \mathrm{E}[\epsilon_0^2] - \mathrm{E}[\epsilon_0]^2 = \mathrm{E}[\epsilon_0^2]$. Moreover, $\mathrm{E}\left[\mathrm{E}[\hat{f}(x^*)] - \hat{f}(x^*)\right] = 0$, so the middle term vanishes. Using these results, we get

$$\begin{aligned}
\mathrm{E}\left[\left(Y^* - \hat{f}(x^*)\right)^2\right]
&= \mathrm{E}\left[\left(\mathrm{E}[\hat{f}(x^*)] - f_0(x^*)\right)^2\right]
  + \mathrm{E}\left[\left(\hat{f}(x^*) - \mathrm{E}[\hat{f}(x^*)]\right)^2\right] + \mathrm{Var}[\epsilon_0] \\
&= \left[\mathrm{Bias}\left(\hat{f}(x^*)\right)\right]^2 + \mathrm{Var}\left[\hat{f}(x^*)\right] + \mathrm{Var}[\epsilon_0] \\
&= \mathrm{Var}\left[\hat{f}(x^*)\right] + \left[\mathrm{Bias}\left(\hat{f}(x^*)\right)\right]^2 + \mathrm{Var}[\epsilon_0].
\end{aligned}$$


C.2 Technical points from Chapter 9

The least squares estimates in a simple linear regression model are given by

$$\hat{\beta}_{1,n} = \frac{\sum_{i=1}^{n} (x_i - \bar{x}_n)(y_i - \bar{y}_n)}{\sum_{i=1}^{n} (x_i - \bar{x}_n)^2}, \qquad \hat{\beta}_{0,n} = \bar{y}_n - \hat{\beta}_{1,n}\,\bar{x}_n.$$

Claim

These estimates can be computed using the general formula $\hat{\boldsymbol{\beta}} = (\mathbf{X}^\top\mathbf{X})^{-1}\mathbf{X}^\top\mathbf{y}$ for least squares estimates in linear regression models.

Proof. It holds that

$$\begin{aligned}
\hat{\boldsymbol{\beta}} &= \left(\mathbf{X}^\top\mathbf{X}\right)^{-1}\mathbf{X}^\top\mathbf{y}
= \left(\begin{pmatrix} 1 & \cdots & 1 \\ x_1 & \cdots & x_n \end{pmatrix}\begin{pmatrix} 1 & x_1 \\ \vdots & \vdots \\ 1 & x_n \end{pmatrix}\right)^{-1}\begin{pmatrix} 1 & \cdots & 1 \\ x_1 & \cdots & x_n \end{pmatrix}\mathbf{y} \\
&= \begin{pmatrix} n & \sum_{i=1}^{n} x_i \\ \sum_{i=1}^{n} x_i & \sum_{i=1}^{n} x_i^2 \end{pmatrix}^{-1}\begin{pmatrix} \sum_{i=1}^{n} y_i \\ \sum_{i=1}^{n} x_i y_i \end{pmatrix}
= \frac{1}{n\sum_{i=1}^{n} x_i^2 - \left(\sum_{i=1}^{n} x_i\right)^2}\begin{pmatrix} \sum_{i=1}^{n} x_i^2 & -\sum_{i=1}^{n} x_i \\ -\sum_{i=1}^{n} x_i & n \end{pmatrix}\begin{pmatrix} \sum_{i=1}^{n} y_i \\ \sum_{i=1}^{n} x_i y_i \end{pmatrix}.
\end{aligned}$$

Now observe that

$$(n-1)s_{x,n}^2 = \sum_{i=1}^{n} (x_i - \bar{x}_n)^2 = \sum_{i=1}^{n} \left(x_i^2 - 2x_i\bar{x}_n + \bar{x}_n^2\right) = \sum_{i=1}^{n} x_i^2 - 2\bar{x}_n\sum_{i=1}^{n} x_i + n\bar{x}_n^2 = \sum_{i=1}^{n} x_i^2 - n\bar{x}_n^2 = \sum_{i=1}^{n} x_i^2 - \frac{1}{n}\left(\sum_{i=1}^{n} x_i\right)^2,$$

and

$$(n-1)s_{xy,n} = \sum_{i=1}^{n} (x_i - \bar{x}_n)(y_i - \bar{y}_n) = \sum_{i=1}^{n} \left(x_i y_i - x_i\bar{y}_n - \bar{x}_n y_i + \bar{x}_n\bar{y}_n\right) = \sum_{i=1}^{n} x_i y_i - \bar{y}_n\sum_{i=1}^{n} x_i - \bar{x}_n\sum_{i=1}^{n} y_i + n\bar{x}_n\bar{y}_n = \sum_{i=1}^{n} x_i y_i - n\bar{x}_n\bar{y}_n.$$

Using both results we get

$$\begin{aligned}
\hat{\boldsymbol{\beta}} &= \frac{1}{n\sum_{i=1}^{n} (x_i - \bar{x}_n)^2}\begin{pmatrix} \sum_{i=1}^{n} x_i^2\sum_{i=1}^{n} y_i - \sum_{i=1}^{n} x_i\sum_{i=1}^{n} x_i y_i \\ n\sum_{i=1}^{n} x_i y_i - \sum_{i=1}^{n} x_i\sum_{i=1}^{n} y_i \end{pmatrix} \\
&= \frac{1}{n\sum_{i=1}^{n} (x_i - \bar{x}_n)^2}\begin{pmatrix} \sum_{i=1}^{n} x_i^2\sum_{i=1}^{n} y_i - \sum_{i=1}^{n} x_i\sum_{i=1}^{n} x_i y_i \\ n\sum_{i=1}^{n} (x_i - \bar{x}_n)(y_i - \bar{y}_n) \end{pmatrix} \\
&= \frac{1}{n\sum_{i=1}^{n} (x_i - \bar{x}_n)^2}\begin{pmatrix} n\left(\sum_{i=1}^{n} x_i^2 - n\bar{x}_n^2\right)\bar{y}_n - \left(n\sum_{i=1}^{n} (x_i - \bar{x}_n)(y_i - \bar{y}_n)\right)\bar{x}_n \\ n\sum_{i=1}^{n} (x_i - \bar{x}_n)(y_i - \bar{y}_n) \end{pmatrix} \\
&= \begin{pmatrix} \bar{y}_n - \dfrac{\sum_{i=1}^{n} (x_i - \bar{x}_n)(y_i - \bar{y}_n)}{\sum_{i=1}^{n} (x_i - \bar{x}_n)^2}\,\bar{x}_n \\[2ex] \dfrac{\sum_{i=1}^{n} (x_i - \bar{x}_n)(y_i - \bar{y}_n)}{\sum_{i=1}^{n} (x_i - \bar{x}_n)^2} \end{pmatrix}.
\end{aligned}$$

C.3 Technical points from Section A.3

Claim

Let $(X_1, X_2) \sim \mathcal{N}_2(\mu, \Sigma)$ with $\mu = (\mu_1, \mu_2)^\top \in \mathbb{R}^2$ and $\Sigma = \begin{pmatrix} \sigma_1^2 & \rho\sigma_1\sigma_2 \\ \rho\sigma_1\sigma_2 & \sigma_2^2 \end{pmatrix}$, where $\sigma_1, \sigma_2 > 0$ and $\rho \in (-1, 1)$. Then it holds that the marginal distributions of $X_1$ and $X_2$ are also normal, e.g. the density of $X_1$ is given by

$$f_1(x_1) = \frac{1}{\sqrt{2\pi\sigma_1^2}}\, \mathrm{e}^{-\frac{(x_1-\mu_1)^2}{2\sigma_1^2}},$$

and the expectation of $X_1 \cdot X_2$ is equal to

$$\mathrm{E}[X_1 \cdot X_2] = \rho\sigma_1\sigma_2 + \mu_1\mu_2.$$

Proof. We start by computing the density of $X_1$. The result for $X_2$ is analogous. We use the following form of the joint density

$$f(\mathbf{x}) = f(x_1, x_2) = \frac{1}{\sqrt{(2\pi)^2\sigma_1^2\sigma_2^2(1-\rho^2)}}\, \mathrm{e}^{-\frac{1}{2(1-\rho^2)}\left[\frac{(x_1-\mu_1)^2}{\sigma_1^2} - 2\rho\,\frac{(x_1-\mu_1)(x_2-\mu_2)}{\sigma_1\sigma_2} + \frac{(x_2-\mu_2)^2}{\sigma_2^2}\right]}$$

and integrate with respect to $x_2$ for computing the marginal density of $X_1$. Adding and subtracting $\rho^2\frac{(x_1-\mu_1)^2}{\sigma_1^2}$ inside the bracket (completing the square in $x_2$) gives

$$\begin{aligned}
f_1(x_1) &= \int_{\mathbb{R}} f(x_1, x_2)\,\mathrm{d}x_2 \\
&= \int_{\mathbb{R}} \frac{1}{\sqrt{(2\pi)^2\sigma_1^2\sigma_2^2(1-\rho^2)}}\, \mathrm{e}^{-\frac{1}{2(1-\rho^2)}\left[\frac{(x_1-\mu_1)^2}{\sigma_1^2} - \rho^2\frac{(x_1-\mu_1)^2}{\sigma_1^2} - 2\rho\,\frac{(x_1-\mu_1)(x_2-\mu_2)}{\sigma_1\sigma_2} + \frac{(x_2-\mu_2)^2}{\sigma_2^2} + \rho^2\frac{(x_1-\mu_1)^2}{\sigma_1^2}\right]}\,\mathrm{d}x_2 \\
&= \frac{1}{\sqrt{2\pi\sigma_1^2}}\, \mathrm{e}^{-\frac{1}{2(1-\rho^2)}\left[\frac{(x_1-\mu_1)^2}{\sigma_1^2} - \rho^2\frac{(x_1-\mu_1)^2}{\sigma_1^2}\right]} \int_{\mathbb{R}} \frac{1}{\sqrt{2\pi\sigma_2^2(1-\rho^2)}}\, \mathrm{e}^{-\frac{1}{2(1-\rho^2)}\left[\frac{x_2-\mu_2}{\sigma_2} - \rho\,\frac{x_1-\mu_1}{\sigma_1}\right]^2}\,\mathrm{d}x_2 \\
&= \frac{1}{\sqrt{2\pi\sigma_1^2}}\, \mathrm{e}^{-\frac{1}{2(1-\rho^2)}\left[\frac{(x_1-\mu_1)^2(1-\rho^2)}{\sigma_1^2}\right]} \int_{\mathbb{R}} \frac{1}{\sqrt{2\pi\sigma_2^2(1-\rho^2)}}\, \mathrm{e}^{-\frac{1}{2\sigma_2^2(1-\rho^2)}\left[x_2 - \left(\mu_2 + \rho\frac{\sigma_2}{\sigma_1}(x_1-\mu_1)\right)\right]^2}\,\mathrm{d}x_2 \\
&= \frac{1}{\sqrt{2\pi\sigma_1^2}}\, \mathrm{e}^{-\frac{(x_1-\mu_1)^2}{2\sigma_1^2}},
\end{aligned}$$

since the last integrand is the density of $\mathcal{N}\!\left(\mu_2 + \rho\frac{\sigma_2}{\sigma_1}(x_1-\mu_1),\, \sigma_2^2(1-\rho^2)\right)$ and therefore integrates to one.

To compute $\mathrm{E}[X_1 \cdot X_2]$ we use again the above given form of the joint density $f$. Completing the square in $x_2$ exactly as before, it holds that

$$\begin{aligned}
\mathrm{E}[X_1 \cdot X_2] &= \int_{\mathbb{R}}\int_{\mathbb{R}} x_1 \cdot x_2 \cdot f(\mathbf{x})\,\mathrm{d}x_1\,\mathrm{d}x_2 \\
&= \int_{\mathbb{R}} x_1\, \frac{1}{\sqrt{2\pi\sigma_1^2}}\, \mathrm{e}^{-\frac{(x_1-\mu_1)^2}{2\sigma_1^2}} \int_{\mathbb{R}} x_2\, \frac{1}{\sqrt{2\pi\sigma_2^2(1-\rho^2)}}\, \mathrm{e}^{-\frac{1}{2\sigma_2^2(1-\rho^2)}\left[x_2 - \left(\mu_2 + \rho\frac{\sigma_2}{\sigma_1}(x_1-\mu_1)\right)\right]^2}\,\mathrm{d}x_2\,\mathrm{d}x_1 \\
&= \int_{\mathbb{R}} x_1\, \frac{1}{\sqrt{2\pi\sigma_1^2}}\, \mathrm{e}^{-\frac{(x_1-\mu_1)^2}{2\sigma_1^2}} \left(\rho(x_1-\mu_1)\frac{\sigma_2}{\sigma_1} + \mu_2\right)\mathrm{d}x_1 \\
&= \rho\,\frac{\sigma_2}{\sigma_1}\int_{\mathbb{R}} \left(x_1^2 - \mu_1 x_1\right)\frac{1}{\sqrt{2\pi\sigma_1^2}}\, \mathrm{e}^{-\frac{(x_1-\mu_1)^2}{2\sigma_1^2}}\,\mathrm{d}x_1 + \mu_2\int_{\mathbb{R}} x_1\, \frac{1}{\sqrt{2\pi\sigma_1^2}}\, \mathrm{e}^{-\frac{(x_1-\mu_1)^2}{2\sigma_1^2}}\,\mathrm{d}x_1 \\
&= \rho\,\frac{\sigma_2}{\sigma_1}\left(\mathrm{Var}[X_1] + \mathrm{E}[X_1]^2 - \mathrm{E}[X_1]^2\right) + \mu_2\,\mathrm{E}[X_1] \\
&= \rho\,\frac{\sigma_2}{\sigma_1}\cdot\sigma_1^2 + \mu_2\mu_1 = \rho\sigma_1\sigma_2 + \mu_1\mu_2,
\end{aligned}$$

where the inner integral over $x_2$ equals the mean $\mu_2 + \rho\frac{\sigma_2}{\sigma_1}(x_1-\mu_1)$ of the corresponding normal density. This shows the result.

Index

AIC, 211 Effect size, 297


Empirical
Bayes’ Theorem, 130 correlation coefficient, 103
Bias, 163 mean, 88
Bias-variance decomposition, 163 median, 95
Binomial test, 293 standard deviation, 94
Blinding, 18 variance, 93
Bootstrap Events
algorithm, 270 addition rule, 121
distribution, 272 complement, 118
percentile method, 282 disjoint, 118
standard error method, 283 addition rule, 119
product rule, 124
Central limit theorem, 268
Expected count, 303
Chi-squared statistic, 303
Experiment, 10
Chi-squared test, 305
Exploratory data analysis, 24
Classification
algorithm, 241 Frequencies
accuracy, 243 absolute, 74
problem, 165 joint, 77
Classification problem marginal, 77
k nearest neighbors, 166 relative, 74
Conditional probability, 125
Confidence interval, 278 Independence
asymptotic, 280 events, 123
Confounding, 12 Interaction effect, 222
Confusion matrix, 244 Interquartile range, 96
Contingency table, 77
Continuous distribution Linear regression
density, 137 adjusted R squared value, 193
distribution function, 138 categorical predictors
Cross validation, 166 dummy coding, 196
confidence interval, 317
Distribution Cook’s distance, 331
q-quantile, 140 cross validation, 206
hat matrix, 190


influential observation, 331 Probability measure, 119


intercept, 171 Probability model, 257
leverage, 329
multiple, 187 Random variable, 131
least squares estimates, 190 expected value
normal equation, 176 continuous, 138
R squared value, 184 discrete, 131
residual analysis, 323 variance
residual variance, 315 continuous, 138
root mean squared error, 204 discrete, 133
simple, 170 Random variables
least squares estimates, 175 correlation, 135
slope, 171 independent, 134
Logistic regression uncorrelated, 134
cross validation, 251 Regression
maximum likelihood estimation, 231 residuals, 174
model, 229 Relative risk, 236
Logit function, 228 Residual standard error, 183
ROC curve, 246
Maximum likelihood estimation, 277
Mean squared error, 161 Sample, 6
test, 161 Sample space, 118
training, 161 Sampling bias
Modality convenience sample, 8
bimodal, 92 non-response, 8
multimodal, 92 voluntary-response, 8
unimodal, 92 Sampling error, 259
Multicollinearity, 195 Sampling principles
cluster sampling, 14
Normal distribution, 140 multistage sampling, 15
simple random sampling, 13
Observational study, 10 stratified sampling, 14
Odds, 230 Sensitivity, 240
log, 230 Specificity, 240
ratio, 230 Statistic, 258
sampling distribution, 266
Percentile, 95
standard error, 259
Placebo, 18
Statistical model, 258
Point estimator, 258
Statistical test, 289
bias, 259
alternative hypothesis, 289
Population, 6
null distribution, 292
mean, 88
null hypothesis, 289
Probability distribution
p-value, 290
discrete, 122


power, 297
significance level, 289
test problem, 289
test statistic, 289
type 1 error, 289
type 2 error, 289
Stepwise selection
backward, 213
forward, 212
Subset selection
best, 210
hybrid, 214
Success-failure condition, 281
Supervised learning, 152

t-test, 294
two sample, 320

Unsupervised learning, 152


cluster analysis, 168

Variables
block, 17
categorical, 4
nominal, 4
ordinal, 4
explanatory, 6
numerical, 4
response, 6
treatment, 17

Youden’s J statistic, 245

