
Linear Models with Python

CHAPMAN & HALL/CRC
Texts in Statistical Science Series
Joseph K. Blitzstein, Harvard University, USA
Julian J. Faraway, University of Bath, UK
Martin Tanner, Northwestern University, USA
Jim Zidek, University of British Columbia, Canada

Recently Published Titles

Linear Models with Python


Julian J. Faraway

Introduction to Probability, Second Edition


Joseph K. Blitzstein and Jessica Hwang

Theory of Spatial Statistics


A Concise Introduction
M.N.M van Lieshout

Bayesian Statistical Methods


Brian J. Reich and Sujit K. Ghosh

Sampling
Design and Analysis, Second Edition
Sharon L. Lohr

The Analysis of Time Series


An Introduction with R, Seventh Edition
Chris Chatfield and Haipeng Xing

Time Series
A Data Analysis Approach Using R
Robert H. Shumway and David S. Stoffer

Practical Multivariate Analysis, Sixth Edition


Abdelmonem Afifi, Susanne May, Robin A. Donatello, and Virginia A. Clark

Time Series: A First Course with Bootstrap Starter


Tucker S. McElroy and Dimitris N. Politis

Probability and Bayesian Modeling


Jim Albert and Jingchen Hu

Surrogates
Gaussian Process Modeling, Design, and Optimization for the Applied Sciences
Robert B. Gramacy

For more information about this series, please visit:
https://www.crcpress.com/Chapman--HallCRC-Texts-in-Statistical-Science/book-series/CHTEXSTASCI
Linear Models with Python

Julian J. Faraway
First edition published 2021
by CRC Press
6000 Broken Sound Parkway NW, Suite 300, Boca Raton, FL 33487-2742

and by CRC Press


2 Park Square, Milton Park, Abingdon, Oxon, OX14 4RN

© 2021 Taylor & Francis Group, LLC

CRC Press is an imprint of Taylor & Francis Group, LLC

Reasonable efforts have been made to publish reliable data and information, but the author and publisher cannot assume responsibility for the validity of all materials or the consequences of their use.
The authors and publishers have attempted to trace the copyright holders of all material reproduced
in this publication and apologize to copyright holders if permission to publish in this form has not
been obtained. If any copyright material has not been acknowledged please write and let us know so
we may rectify in any future reprint.

Except as permitted under U.S. Copyright Law, no part of this book may be reprinted, reproduced,
transmitted, or utilized in any form by any electronic, mechanical, or other means, now known or
hereafter invented, including photocopying, microfilming, and recording, or in any information
storage or retrieval system, without written permission from the publishers.

For permission to photocopy or use material electronically from this work, access www.copyright.com or contact the Copyright Clearance Center, Inc. (CCC), 222 Rosewood Drive, Danvers, MA 01923, 978-750-8400. For works that are not available on CCC please contact mpkbookspermissions@tandf.co.uk

Trademark notice: Product or corporate names may be trademarks or registered trademarks, and
are used only for identification and explanation without intent to infringe.

Library of Congress Cataloging-in-Publication Data

Names: Faraway, Julian James, author.


Title: Linear models with Python / Julian J. Faraway.
Description: First edition. | Boca Raton : CRC Press, 2021. | Series:
Chapman & Hall/CRC texts in statistical science | Includes
bibliographical references and index.
Identifiers: LCCN 2020038706 | ISBN 9781138483958 (hardback) | ISBN
9781351053419 (ebook)
Subjects: LCSH: Linear models (Statistics) | Python (Computer program
language)
Classification: LCC QA279.F3695 2021 | DDC 519.50285/5133--dc23
LC record available at https://lccn.loc.gov/2020038706

ISBN: 978-1-138-48395-8 (hbk)


ISBN: 978-1-351-05341-9 (ebk)
Contents

Preface ix

1 Introduction 1
1.1 Before You Start 1
1.2 Initial Data Analysis 2
1.3 When to Use Linear Modeling 6
1.4 History 7

2 Estimation 15
2.1 Linear Model 15
2.2 Matrix Representation 16
2.3 Estimating β 17
2.4 Least Squares Estimation 18
2.5 Examples of Calculating βˆ 19
2.6 Example 19
2.7 Computing Least Squares Estimates 22
2.8 Gauss–Markov Theorem 24
2.9 Goodness of Fit 26
2.10 Identifiability 28
2.11 Orthogonality 31

3 Inference 37
3.1 Hypothesis Tests to Compare Models 37
3.2 Testing Examples 39
3.3 Permutation Tests 44
3.4 Sampling 45
3.5 Confidence Intervals for β 47
3.6 Bootstrap Confidence Intervals 48

4 Prediction 53
4.1 Confidence Intervals for Predictions 53
4.2 Predicting Body Fat 54
4.3 Autoregression 56
4.4 What Can Go Wrong with Predictions? 58

5 Explanation 61
5.1 Simple Meaning 61
5.2 Causality 63
5.3 Designed Experiments 64
5.4 Observational Data 65
5.5 Matching 67
5.6 Covariate Adjustment 70
5.7 Qualitative Support for Causation 71

6 Diagnostics 75
6.1 Checking Error Assumptions 75
6.1.1 Constant Variance 75
6.1.2 Normality 80
6.1.3 Correlated Errors 83
6.2 Finding Unusual Observations 85
6.2.1 Leverage 85
6.2.2 Outliers 87
6.2.3 Influential Observations 91
6.3 Checking the Structure of the Model 93
6.4 Discussion 96

7 Problems with the Predictors 101


7.1 Errors in the Predictors 101
7.2 Changes of Scale 105
7.3 Collinearity 108

8 Problems with the Error 115


8.1 Generalized Least Squares 115
8.2 Weighted Least Squares 117
8.3 Testing for Lack of Fit 121
8.4 Robust Regression 125
8.4.1 M-Estimation 125
8.4.2 High Breakdown Estimators 128

9 Transformation 135
9.1 Transforming the Response 135
9.2 Transforming the Predictors 140
9.3 Broken Stick Regression 140
9.4 Polynomials 142
9.5 Splines 148
9.6 Additive Models 150
9.7 More Complex Models 152
10 Model Selection 155
10.1 Hierarchical Models 156
10.2 Hypothesis Testing-Based Procedures 156
10.3 Criterion-Based Procedures 160
10.4 Sample Splitting 163
10.5 Crossvalidation 167
10.6 Summary 169

11 Shrinkage Methods 173


11.1 Principal Components 173
11.2 Partial Least Squares 184
11.3 Ridge Regression 187
11.4 Lasso 191
11.5 Other Methods 194

12 Insurance Redlining — A Complete Example 197


12.1 Ecological Correlation 197
12.2 Initial Data Analysis 199
12.3 Full Model and Diagnostics 202
12.4 Sensitivity Analysis 204
12.5 Discussion 207

13 Missing Data 211


13.1 Types of Missing Data 211
13.2 Representation and Detection of Missing Values 212
13.3 Deletion 213
13.4 Single Imputation 215
13.5 Multiple Imputation 217
13.6 Discussion 219

14 Categorical Predictors 221


14.1 A Two-Level Factor 221
14.2 Factors and Quantitative Predictors 225
14.3 Interpretation with Interaction Terms 228
14.4 Factors with More Than Two Levels 230
14.5 Alternative Codings of Qualitative Predictors 235

15 One-Factor Models 241


15.1 The Model 241
15.2 An Example 242
15.3 Diagnostics 245
15.4 Pairwise Comparisons 246
15.5 False Discovery Rate 248
16 Models with Several Factors 253
16.1 Two Factors with No Replication 253
16.2 Two Factors with Replication 257
16.3 Two Factors with an Interaction 262
16.4 Larger Factorial Experiments 266

17 Experiments with Blocks 273


17.1 Randomized Block Design 274
17.2 Latin Squares 278
17.3 Balanced Incomplete Block Design 282

A About Python 289

Bibliography 291

Index 295
Preface

This is a book about linear models in statistics. A linear model describes a quanti-
tative response in terms of a linear combination of predictors. You can use a linear
model to make predictions or explain the relationship between the response and the
predictors. Linear models are very flexible and widely used in applications in phys-
ical science, engineering, social science and business. Linear models are part of the
core of statistics and understanding them well is crucial to a broader competence in
the practice of statistics.
This is not an introductory textbook. You will need some basic prior knowledge
of statistics as might be obtained in one or two courses at the university level. You
will need to be familiar with essential ideas such as hypothesis testing, confidence
intervals, likelihood and parameter estimation. You will also need to be competent
in the mathematical methods of calculus and linear algebra. This is not a particularly
theoretical book, as I have preferred intuition over rigorous proof. Nevertheless,
successful statistics requires an appreciation of the principles. It is my hope that the
reader will absorb these through the many examples I present.
This book is written in three languages: English, Mathematics and Python. I
aim to combine these three seamlessly to allow coherent exposition of the practice
of linear modeling. This requires the reader to become somewhat fluent in Python.
This is not a book about learning Python but like any foreign language, one becomes
proficient by practicing it rather than by memorizing the dictionary. The reader is
advised to look elsewhere for a basic introduction to Python, but should not hesitate
to dive into this book and pick it up as you go. I shall try to help. See the Appendix
to get started.
The book’s website can be found at:

https://julianfaraway.github.io/LMP/
This book has an ancestor: Faraway (2014) entitled Linear Models with R. Clearly,
the book you hold now is about Python and not R but it is not an exact translation.
Although I was able to accomplish almost all of the R book in this Python book, I
found reason for variation:
1. Python and R are similar (at least in the way they are used for statistics) but they
make different things easy and difficult. Hence, it is natural to flow along the
Python path for easier ways to accomplish the same tasks.
2. Python is multi-talented, but R was designed to do statistics. R has a very large li-
brary of packages for statistical methods while Python has few. This has restricted

the choice of methods I have presented in this book. One might expect the statis-
tical functionality of Python to grow over time.
If your sole objective is to do statistics, R is more attractive. Yet there are several
reasons why you might prefer Python. You may already know Python and use it for
other tasks. Indeed, it would be unusual for someone to solely do statistics. The data
in this text is already clean and ready to use. In practice, this is rarely the case, and
flexible software for obtaining and manipulating data is essential. You may already
be using Python for this purpose.
Python also has a place at the heart of Machine Learning (ML), but this is a
book about statistics rather than ML. But the aims of these two disciplines overlap
considerably to the extent that any data analyst should become familiar with the ideas
and methods of both. The datasets in this text are small by ML standards. I hope that
a reader coming to this book from an ML background would learn new statistical
perspectives on learning from data.
This book would not have been possible without several key open source Python
packages. I thank the authors and maintainers of these packages for their outstanding
work.
Chapter 1

Introduction

1.1 Before You Start

Statistics starts with a problem, proceeds with the collection of data, continues with
the data analysis and finishes with conclusions. It is a common mistake of inexperi-
enced statisticians to plunge into a complex analysis without paying attention to the
objectives or even whether the data are appropriate for the proposed analysis. As
Einstein said, the formulation of a problem is often more essential than its solution
which may be merely a matter of mathematical or experimental skill.
To formulate the problem correctly, you must:
1. Understand the physical background. Statisticians often work in collaboration
with others and need to understand something about the subject area. Regard this
as an opportunity to learn something new rather than a chore.
2. Understand the objective. Again, often you will be working with a collaborator
who may not be clear about what the objectives are. Beware of “fishing expedi-
tions” — if you look hard enough, you will almost always find something, but
that something may just be a coincidence.
3. Make sure you know what the client wants. You can often do quite different anal-
yses on the same dataset. Sometimes statisticians perform an analysis far more
complicated than the client really needed. You may find that simple descriptive
statistics are all that are needed.
4. Put the problem into statistical terms. This is a challenging step and where ir-
reparable errors are sometimes made. Once the problem is translated into the
language of statistics, the solution is often routine. This is where human intel-
ligence is decidedly superior to artificial intelligence. Defining the problem is
hard to program. That a statistical method can read in and process the data is not
enough. The results of an inapt analysis may be meaningless.
It is important to understand how the data were collected.
1. Are the data observational or experimental? Are the data a sample of convenience
or were they obtained via a designed sample survey? How the data were collected
has a crucial impact on what conclusions can be made.
2. Is there nonresponse? The data you do not see may be just as important as the
data you do see.
3. Are there missing values? This is a common problem that is troublesome and time
consuming to handle.

4. How are the data coded? In particular, how are the categorical variables repre-
sented?
5. What are the units of measurement?
6. Beware of data entry errors and other corruption of the data. This problem is all
too common — almost a certainty in any real dataset of at least moderate size.
Perform some data sanity checks.
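To make the last point concrete, a few pandas commands already catch many common problems. This is only a rough sketch: mydata.csv and df are hypothetical stand-ins for your own data.
import pandas as pd
df = pd.read_csv('mydata.csv')
df.dtypes              # are the column types what you expect?
df.isna().sum()        # how many missing values in each column?
df.duplicated().sum()  # any duplicated rows?
df.describe()          # do the minima and maxima look plausible?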

1.2 Initial Data Analysis

This is a critical step that should always be performed. It is simple but it is vital.
You should make numerical summaries such as means, standard deviations (SDs),
maximum and minimum, correlations and whatever else is appropriate to the spe-
cific dataset. Equally important are graphical summaries. There is a wide variety of
techniques to choose from. For one variable at a time, you can make boxplots, his-
tograms, density plots and more. For two variables, scatterplots are standard while
for even more variables, there are numerous good ideas for display including interac-
tive and dynamic graphics. In the plots, look for outliers, data-entry errors, skewed or
unusual distributions and structure. Check whether the data are distributed according
to prior expectations.
Getting data into a form suitable for analysis by cleaning out mistakes and aber-
rations is often time consuming. It often takes more time than the data analysis itself.
One might consider this the core work of data science. In this book, all the data will
be ready to analyze, but you should realize that in practice this is rarely the case.
Let’s look at an example. The National Institute of Diabetes and Digestive
and Kidney Diseases conducted a study on 768 adult female Pima Indians living
near Phoenix. The following variables were recorded: number of times pregnant,
plasma glucose concentration at 2 hours in an oral glucose tolerance test, diastolic
blood pressure (mmHg), triceps skin fold thickness (mm), 2-hour serum insulin (mu
U/ml), body mass index (weight in kg/(height in m)²), diabetes pedigree function,
age (years) and a test of whether the patient showed signs of diabetes (coded zero if
negative, one if positive). The data may be obtained from the UCI Repository of
machine learning databases at archive.ics.uci.edu/ml.
Base Python has only limited functionality for numerical work. You will surely
need to import some packages before you can accomplish anything. It is common to
load all the packages you will need in a session at the beginning. We start with:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import scipy as sp
import seaborn as sns
import statsmodels.formula.api as smf
You can wait until you need them, but it can be helpful when you share or return
to your work later to have them all listed at the beginning so that anyone can see
which packages are needed. The as pd means we can refer to functions in the pandas
package with the abbreviation pd.
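For example, with this alias in place, pd.read_csv('pima.csv') would read a
comma-separated file into a data frame; without the alias we would have to type the
full module name, pandas.read_csv('pima.csv'). (The file name here is purely
illustrative; the datasets used in this book are loaded from the faraway package
instead, as shown next.)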
Before doing anything else, one should find out the purpose of the study and
more about how the data were collected. However, let’s skip ahead to a look at the
data:
import faraway.datasets.pima
pima = faraway.datasets.pima.load()
pima.head()
pregnant glucose diastolic triceps insulin bmi diabetes age test
0 6 148 72 35 0 33.6 0.627 50 1
1 1 85 66 29 0 26.6 0.351 31 0
2 8 183 64 0 0 23.3 0.672 32 1
3 1 89 66 23 94 28.1 0.167 21 0
4 0 137 40 35 168 43.1 2.288 33 1
Many of the datasets used in this book are supplied in the faraway package. See
the appendix for how to install this package. Any time you want to use one of these
datasets, you will need to import the package containing the data you require and
then load it.
The command pima.head() prints out the first five lines of the data frame. This
is a good way to see what variables we have and what sort of values they take. You
can type pima to see the whole data frame but 768 lines may be more than you want
to examine.
If you want more details about the dataset, you can use:
print(faraway.datasets.pima.DESCR)
We start with some numerical summaries:
pima.describe().round(1)
pregnant glucose diastolic triceps insulin bmi diabetes age
count 768.0 768.0 768.0 768.0 768.0 768.0 768.0 768.0
mean 3.8 120.9 69.1 20.5 79.8 32.0 0.5 33.2
std 3.4 32.0 19.4 16.0 115.2 7.9 0.3 11.8
min 0.0 0.0 0.0 0.0 0.0 0.0 0.1 21.0
25% 1.0 99.0 62.0 0.0 0.0 27.3 0.2 24.0
50% 3.0 117.0 72.0 23.0 30.5 32.0 0.4 29.0
75% 6.0 140.2 80.0 32.0 127.2 36.6 0.6 41.0
max 17.0 199.0 122.0 99.0 846.0 67.1 2.4 81.0

test
count 768.0
mean 0.3
std 0.5
min 0.0
25% 0.0
50% 0.0
75% 1.0
max 1.0
The describe() command is a quick way to get the usual univariate summary in-
formation. We round to one decimal place for compact, easier to read output. At
this stage, we are looking for anything unusual or unexpected, perhaps indicating a
data-entry error. For this purpose, a close look at the minimum and maximum values
of each variable is worthwhile. Starting with pregnant, we see a maximum value
of 17. This is large, but not impossible. However, we then see that the next five
variables have minimum values of zero. No blood pressure is not good for the health
— something must be wrong. Let’s look at the first few sorted values:
pima['diastolic'].sort_values().head()
347    0
494    0
222    0
81     0
78     0
We see that at least the first 5 values are zero. We can count the zeroes:
np.sum(pima['diastolic'] == 0)
35
For one reason or another, the researchers did not obtain the blood pressures of 35
patients. In a real investigation, one would likely be able to question the researchers
about what really happened. Nevertheless, this does illustrate the kind of misunder-
standing that can easily occur. A careless statistician might overlook these presumed
missing values and complete an analysis assuming that these were real observed ze-
roes. If the error was later discovered, they might then blame the researchers for
using zero as a missing value code (not a good choice since it is a valid value for
some of the variables) and not mentioning it in their data description. Unfortunately
such oversights are not uncommon, particularly with datasets of any size or complex-
ity. The statistician bears some share of responsibility for spotting these mistakes.

We set all zero values of the five variables to NaN which is a missing value code
used by Python.
pima.replace({'diastolic': 0, 'triceps': 0, 'insulin': 0,
              'glucose': 0, 'bmi': 0}, np.nan, inplace=True)
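It is worth confirming that the recoding took effect. A minimal check (a sketch using
only the pandas methods already loaded) counts the missing values now recorded in
each column:
pima.isna().sum()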
The variable test is not quantitative but categorical. Such variables are sometimes
also called factors. However, because of the numerical coding, this variable has been
treated as if it were quantitative. It is best to designate such variables as categorical
so that they are treated appropriately. Sometimes people forget this and compute
stupid statistics such as the “average zip code.”
pima['test'] = pima['test'].astype('category')
pima['test'] = pima['test'].cat.rename_categories(
    ['Negative', 'Positive'])
pima['test'].value_counts()
Negative 500
Positive 268
Now that we have cleared up the missing values and coded the data appropriately,
we are ready to do some plots. Perhaps the most well-known univariate plot is the
histogram:
sns.distplot(pima.diastolic.dropna())
as seen in the first panel of Figure 1.1. We see a bell-shaped distribution for the di-
astolic blood pressures centered around 70. The construction of a histogram requires
the specification of the number of bins and their spacing on the horizontal axis. Some
choices can lead to histograms that obscure some features of the data. The seaborn
package chooses the number and spacing of bins given the size and distribution of the
data, but this choice is not foolproof and misleading histograms are possible. Some
experimentation with other choices is sometimes worthwhile. Histograms are rough
and some prefer to use kernel density estimates, which are essentially a smoothed
version of the histogram (see Simonoff (1996) for a discussion of the relative merits
of histograms and kernel estimates). This estimate is shown in the plot as a smooth
curve.
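If you wish to experiment with the binning yourself, one option is a plain matplotlib
histogram where the bins are set directly (a sketch; the choice of 15 bins is arbitrary):
plt.hist(pima.diastolic.dropna(), bins=15)
plt.xlabel('diastolic')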

Figure 1.1 The first panel shows a histogram of the diastolic blood pressures, with
a kernel density estimate superimposed. The second panel shows an index plot of the
sorted values.

A simple alternative is to plot the sorted data against its index:


pimad = pima.diastolic.dropna().sort_values()
sns.lineplot(range(0, len(pimad)), pimad)
The advantage of this is that we can see all the cases individually. We can see the
distribution and possible outliers. We can also see the discreteness in the measure-
ment of blood pressure — values are rounded to the nearest even number and hence
we see the “steps” in the plot.
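We can check this claim about rounding directly. A small sketch, using only operations
already introduced, counts how many of the recorded values are even and how many are odd:
(pima.diastolic.dropna() % 2).value_counts()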
Now we show a couple of bivariate plots, as seen in Figure 1.2:
sns.scatterplot(x='diastolic', y='diabetes', data=pima, s=20)
and
sns.boxplot(x="test", y="diabetes", data=pima)

Figure 1.2 The first panel shows a scatterplot of the diastolic blood pressures
against diabetes function and the second shows boxplots of diabetes function broken
down by test result.

First, we see the standard scatterplot showing two quantitative variables. Second,
we see a side-by-side boxplot suitable for showing a quantitative with a qualitative
variable.
Sometimes we need to introduce a third variable into a bivariate plot. We show
two different ways that the varying test result can be shown in the relationship
between diastolic and diabetes:
sns.scatterplot(x="diastolic", y="diabetes", data=pima,
                style="test", alpha=0.3)
and
sns.relplot(x="diastolic", y="diabetes", data=pima, col="test")


Figure 1.3 Two ways of distinguishing a factor variable in a bivariate scatterplot.

The first plot, shown in Figure 1.3, introduces the third element using the shape of the
plotted point. The second plot uses two panels. Sometimes this is the better option
when crowded plots make different colors or shapes hard to distinguish.
Good graphics are vital in data analysis. They help you avoid mistakes and sug-
gest the form of the modeling to come. They are also important in communicating
your analysis to others. Many in your audience or readership will focus on the graphs.
This is your best opportunity to get your message over clearly and without misunder-
standing. In some cases, the graphics can be so convincing that the formal analysis
becomes just a confirmation of what has already been seen.

1.3 When to Use Linear Modeling


Linear modeling is used for explaining or modeling the relationship between a single
variable Y , called the response, outcome, output, endogenous or dependent variable;
and one or more predictor, input, independent, exogenous or explanatory variables,
X1, . . . , Xp, where p is the number of predictors. We recommend you avoid using the
words independent and dependent variables for X and Y, as these are easily confused
with other uses of these terms. The endogenous/exogenous naming pair is
popular in economics. Regression analysis is another term used for linear modeling
although regressions can also be nonlinear.
When p = 1, it is called simple regression but when p > 1 it is called multiple
regression or sometimes multivariate regression. When there is more than one re-
sponse, then it is called multivariate multiple regression or sometimes (confusingly)
multivariate regression. We will not cover this in this book, although you can just do
separate regressions on each Y .
The response should be a continuous variable but if we are pedantic, all vari-
ables are measured with limited precision in practice and are therefore discrete.
Fortunately, provided the response variable is not measured too coarsely, we can
ignore this objection without much consequence. The explanatory variables can be
continuous, discrete or categorical, although we leave the handling of categorical
explanatory variables to later in the book. Taking the example presented above, a
regression with diastolic and bmi as Xs and diabetes as Y would be a multiple
regression involving only quantitative variables which we tackle first. A regression
with diastolic and test as Xs and bmi as Y would have one predictor that is
quantitative and one that is qualitative, which we will consider later in Chapter 14 on
analysis of covariance. A regression with test as X and diastolic as Y involves
just qualitative predictors — a topic called analysis of variance (ANOVA), although
this would just be a simple two-sample situation. A regression of test as Y on
diastolic and bmi as predictors would involve a qualitative response. A logistic
regression could be used, but this will not be covered in this book.
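To make the first of these concrete, the multiple regression of diabetes on diastolic
and bmi can be fit with the smf.ols function imported earlier. This is only a preview;
least squares fitting is covered properly in the next chapter.
mod = smf.ols('diabetes ~ diastolic + bmi', data=pima).fit()
mod.params
The formula interface drops rows containing missing values, so the observations we
recoded as NaN above are simply omitted from this fit.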
Regression analyses have two main objectives:
1. Prediction of future or unseen responses given specified values of the predictors.
2. Assessment of the effect of, or relationship between, explanatory variables and
the response. We would like to infer causal relationships if possible.
You should be clear on the objective for the given data because some aspects of the
resulting analysis may differ. Regression modeling can also be used in a descriptive
manner to summarize the relationships between the variables. However, most end
users of data have more specific questions in mind and want to direct the analysis
toward a particular set of goals.
It is rare, except in a few cases in the precise physical sciences, to know (or even
suspect) the true model. In most applications, the model is an empirical construct
designed to answer questions about prediction or causation. It is usually not helpful
to think of regression analysis as the search for some true model. The model is a
means to an end, not an end in itself.

1.4 History
In the 18th century, accurate navigation was a difficult problem of commercial and
military interest. Although it is relatively easy to determine latitude from Polaris,
also known as the North Star, finding longitude then was difficult. Various attempts
were made to devise a method using astronomy. Contrary to popular supposition,
the moon does not always show the same face and moves such that about 60% of its
surface is visible at some time.
Tobias Mayer collected data on the locations of various landmarks on the moon,
including the Manilius crater, as they moved relative to the earth. He derived an
equation describing the motion of the moon (called libration) taking the form:
arc = α + β sinang + γ cosang
He wished to obtain values for the three unknowns α, β and γ. The variables arc,
sinang and cosang can be observed using a telescope. A full explanation of the
story behind the data and the derivation of the equation can be found in Stigler
(1986).
Since there are three unknowns, we need only three distinct observations of the
set of three variables to find a unique solution for α, β and γ. Embarrassingly for
Mayer, there were 27 sets of observations available. Astronomical measurements
were naturally subject to some variation and so there was no solution that fit all 27
observations. Let’s take a look at the first few lines of the data:
import faraway.datasets.manilius
manilius = faraway.datasets.manilius.load()
manilius.head()
arc sinang cosang group
0 13.166667 0.8836 -0.4682 1
1 13.133333 0.9996 -0.0282 1
2 13.200000 0.9899 0.1421 1
3 14.250000 0.2221 0.9750 3
4 14.700000 0.0006 1.0000 3
Mayer’s solution was to divide the data into three groups so that observations within
each group were similar in some respect. He then computed the sum of the variables
within each group. We can also do this:
moon3 = manilius.groupby('group').sum()
moon3
arc sinang cosang
group
1 118.133333 8.4987 -0.7932
2 140.283333 -6.1404 1.7443
3 127.533333 2.9777 7.9649
Now there are just three equations in three unknowns to be solved. The solution is:
moon3['intercept'] = [9]*3
np.linalg.solve(moon3[['intercept', 'sinang', 'cosang']],
                moon3['arc'])
array([14.54458591, -1.48982207, 0.13412639])
Hence the computed values of α, β and γ are 14.5, -1.49 and 0.134, respectively. One
might question how Mayer selected his three groups, but this solution does not seem
unreasonable.
Similar problems with more linear equations than unknowns continued to arise
until 1805, when Adrien Marie Legendre published the method of least squares. Suppose
we recognize that the equation is not exact and introduce an error term, ε:

arcᵢ = α + β sinangᵢ + γ cosangᵢ + εᵢ

where i = 1, . . . , 27. Now we find α, β and γ that minimize the sum of the squared
errors, ∑ εᵢ². We will investigate this in much greater detail in the chapter to
follow, but for now we simply present the solution using the smf.ols function from
statsmodels:
mod = smf.ols('arc ~ sinang + cosang', manilius).fit()
mod.params
Intercept 14.561624
sinang -1.504581
cosang 0.091365
We observe that this solution is quite similar to Mayer’s. The least squares solu-
tion is more satisfactory in that it requires no arbitrary division into groups. Carl
Friedrich Gauss claimed to have devised the method of least squares earlier but
without publishing it. At any rate, he did publish in 1809 showing that the method
of least squares was, in some sense, optimal.
For many years, the method of least squares was confined to the physical sciences
where it was used to resolve problems of overdetermined linear equations. The equa-
tions were derived from theory, and least squares was used as a method to fit data to
these equations to estimate coefficients like α, β and γ above. It was not until later in
the 19th century that linear equations (or models) were suggested empirically from
the data rather than from theories of physical science. This opened up the field to the
social and life sciences.
Francis Galton, a cousin of Charles Darwin, was important in this extension of
statistics into social science. He coined the term regression to mediocrity in 1875,
from which the rather peculiar term regression derives. Let's see how this terminology
arose by looking at one of the datasets he collected at the time on the heights of
parents and children in Galton (1886). We load and plot the data as seen in Figure 1.4.
import faraway.datasets.families
families = faraway.datasets.families.load()
sns.scatterplot(x='midparentHeight', y='childHeight',
                data=families, s=20)

Figure 1.4 The height of a child is plotted against a combined parental height
defined as (father's height + 1.08 × mother's height)/2.

We see that midparentHeight, defined as the father’s height plus 1.08 times the
mother’s height divided by two, is correlated with the childHeight, both in inches.
Now we might propose a linear relationship between the two of the form:

childHeight = α + β midparentHeight + ε

We can estimate α and β using smf.ols().


mod = smf.ols('childHeight ~ midparentHeight', families).fit()
mod.params
Intercept 22.636241
midparentHeight 0.637361
For the simple case of a response y and a single predictor x, we can write the equation
in the form:
(y − ȳ)/SDy = r (x − x̄)/SDx
where r is the correlation between x and y. The equation can be expressed in words
as: the response in standard units is the correlation times the predictor in standard
units. We can verify that this produces the same results as above by rearranging the
equation in the form y = α + βx and computing the estimates:
cor = sp.stats.pearsonr(families['childHeight'],
                        families['midparentHeight'])[0]
sdy = np.std(families['childHeight'])
sdx = np.std(families['midparentHeight'])
beta = cor * sdy / sdx
alpha = np.mean(families['childHeight']) - \
        beta * np.mean(families['midparentHeight'])
np.round([alpha, beta], 2)
Now one might naively expect a child whose parents are, for example, one standard
deviation above average in height to also be one standard deviation above average in
height, give or take. This supposition corresponds to setting r = 1 in the equation and
leads to a line which we compute and plot below:
beta1 = sdy / sdx
alpha1 = np.mean(families['childHeight']) - \
         beta1 * np.mean(families['midparentHeight'])
We use lmplot() to display the variables and the least squares line. We do not want
the confidence band, hence the ci=None. To add the second line, we need to specify
the range in the horizontal scale and draw a dashed line connecting the calculated
points at each end.
sns.lmplot('midparentHeight', 'childHeight', families,
           ci=None, scatter_kws={'s': 2})
xr = np.array([64, 76])
plt.plot(xr, alpha1 + xr*beta1, '--')
The result can be seen in Figure 1.4. The lines cross at the point of the averages.
We can see that a child of tall parents is predicted by the least squares line to have
a height which is above average but not quite as tall as the parents, as the dashed
line would have you believe. Similarly children of below average height parents
are predicted to have a height which is still below-average but not quite as short as
the parents. This is why Galton used the phrase “regression to mediocrity” and the
phenomenon is sometimes called the regression effect.
This applies to any (x, y) situation like this. For example, in sports, an athlete
may have a spectacular first season only to do not quite as well in the second season.
Sports writers come up with all kinds of explanations for this but the regression effect
is likely to be the unexciting cause. In the parents and children example, although it
does predict that successive descendants in the family will come closer to the mean,
it does not imply the same of the population in general since random fluctuations will
maintain the variation, so no need to get too pessimistic about mediocrity! In many
other applications of linear modeling, the regression effect is not of interest because
different types of variables are measured. Unfortunately, we are now stuck with the
rather gloomy word of regression thanks to Galton.
Regression methodology developed rapidly with the advent of high-speed com-
puting. Just fitting a regression model used to require extensive hand calculation. As
computing hardware has improved, the scope for analysis has widened. This has led
to an extensive development in the methodology and the scale of problems that can
be tackled.

Exercises
Not all the answers to the questions below can be derived from code illustrated in
this chapter. You may need to resort to internet Python resources.
1. The dataset teengamb concerns a study of teenage gambling in Britain.
(a) Turn the sex variable into a categorical variable with appropriate labels. Count
the number in each category.
(b) Use both the boxplot and the swarmplot functions from seaborn to plot the
status broken down by sex. Contrast the two plotting methods.
(c) Use both the distplot and the countplot functions from seaborn to show
the distributions of the verbal scores. Do not show the smoothed density on the
distplot. Contrast the two plotting methods - which is best here?
(d) Plot the gamble as the response and income as the predictor broken down by
sex. Make two plots, one with a single frame where sex is distinguished by
the color of the point and another where the sexes appear in different frames.
Which plot do you prefer and why?
(e) Construct a summary statistics table of numerical variables. Can you tell which
variable is highly skewed from the table?
2. The dataset uswages is drawn as a sample from the Current Population Survey in
1988.
(a) Construct a subset of the data with only the wage and four geographical vari-
ables.
(b) A weighted mean is given by ∑ wᵢyᵢ / ∑ wᵢ for weights w and data y. Compute
the mean wage in the north-east using this formula.
(c) Compute the mean wage in the north-east using the groupby function from
pandas. This should also give you the mean wage for those not living in the
north-east.
(d) Compute the row sums for just the geographic variables. What value do they
take?
(e) The subset matrix of geographic variables can be called a dummy matrix where
ones and zeroes are used to code a categorical variable. Reconstruct an area
categorical variable which takes the four possible values.
(f) Make a boxplot of the wage broken down by area.
(g) Repeat the previous plot but on a log scale. Which is preferable?
3. The dataset prostate is from a study on 97 men with prostate cancer who were
due to receive a radical prostatectomy.
(a) Use the pairplot function from seaborn to construct an array of scatterplots
of the first four variables.
(b) Compute the correlations of the first four variables.
(c) The lbph variable is on the log scale. Many cases take the minimum value
of this variable. What value of benign prostatic hyperplasia do you think this
represents?
(d) Use the distplot function from seaborn to make a histogram of the ages
using the rug option. Create a version where the bin width is one year. Contrast
the two plots.
(e) Use the melt function from pandas with lpsa as the id variable to produce a
long version of the dataset. Now use relplot to produce a grid of 8 scatterplots
of the data where lpsa is the response.
4. The dataset sat comes from a study entitled “Getting What You Pay For: The
Debate Over Equity in Public School Expenditures.”
(a) Verify that the sum of the verbal and math scores equals the total score.
(b) Compare the distributions of verbal and math scores using jointplot from
seaborn. Are they similar?
(c) Standardize both the verbal and math scores. Plot the standardized scores with
verbal on the x-axis. Plot the y = x line.
(d) Fit a linear model with math as the response and verbal as the predictor. Show
the estimated slope and compare it with the correlation between these two vari-
ables. Comment.
(e) Fit another linear model with the roles of the predictor and response exchanged.
Why is the estimated slope the same? Is the fitted line from this model and the
previous model the same?
(f) Make predictions for the following students. (i) Predict the math score of a
student scoring 2SDs above average on the verbal test. (ii) Predict the verbal
score of a student scoring 2SDs above average on the math test. (iii) Predict
the math score of a student with an average score on the verbal test. (iv) Predict
the math score of a student with no information about their verbal score.
5. The dataset divusa contains data on divorces in the United States from 1920 to
1996.
(a) Make a plot each with lineplot and scatterplot from seaborn. Put the
year on the x-axis and the divorce rate on the y-axis. Compare the two plots.
(b) Use the shift function from pandas to plot divorce rate from the current year
against the divorce rate for the previous year. Does this show that one could
reasonably predict the divorce rate for the following year by using the divorce
rate from the current year?
(c) Fit a linear model with the divorce rate as the response and the year as the
predictor. In what year does the model predict the divorce rate to hit 100%? Is
this a reasonable prediction?
(d) Use the scatterplot function from seaborn to make a plot with femlab on
the x-axis, divorce rate on the y-axis and the color of the point changing with
the year.
References
Akaike, H. (1974). A new look at the statistical model identification. Automatic Control, IEEE
Transactions on 19 (6), 716-723.
Anderson, C. and R. Loynes (1987). The Teaching of Practical Statistics. New York: Wiley.
Anderson, R. and T. Bancroft (1952). Statistical Theory in Research. New York: McGraw-Hill.
Andrews, D. and A. Herzberg (1985). Data: A Collection of Problems from Many Fields for the
Student and Research Worker. New York: Springer-Verlag.
Anscombe, F. J. (1973). Graphs in statistical analysis. The American Statistician 27 (1), 17-21.
Benjamini, Y. and Y. Hochberg (1995). Controlling the false discovery rate: a practical and
powerful approach to multiple testing. Journal of the Royal Statistical Society, Series B
(Methodological) 57, 289-300.
Berkson, J. (1950). Are there two regressions? Journal of the American Statistical Association
45, 165-180.
Box, G., S. Bisgaard, and C. Fung (1988). An explanation and critique of Taguchi's
contributions to quality engineering. Quality and Reliability Engineering International 4, 123-131.
Box, G., W. Hunter, and J. Hunter (1978). Statistics for Experimenters. New York: Wiley.
Burnham, K. P. and D. R. Anderson (2002). Model Selection and Multi-Model Inference: A
Practical Information-Theoretic Approach. New York: Springer.
Cook, J. and L. Stefanski (1994). Simulation-extrapolation estimation in parametric
measurement error models. Journal of the American Statistical Association 89, 1314-1328.
Davies, O. (1954). The Design and Analysis of Industrial Experiments. New York: Wiley.
Davison, A. and D. Hinkley (1997). Bootstrap Methods and their Application. Cambridge:
Cambridge University Press.
de Boor, C. (2002). A Practical Guide to Splines. New York: Springer.
de Jong, S. (1993). SIMPLS: An alternative approach to partial least squares regression.
Chemometrics and Intelligent Laboratory Systems 18, 251-263.
Draper, N. and H. Smith (1998). Applied Regression Analysis (3rd ed.). New York: Wiley.
Efron, B., T. Hastie, I. Johnstone, and R. Tibshirani (2004). Least angle regression. Annals of
Statistics 32, 407-499.
Efron, B. and R. Tibshirani (1993). An Introduction to the Bootstrap. London: Chapman & Hall.
Faraway, J. (1992). On the cost of data analysis. Journal of Computational and Graphical
Statistics 1, 215-231.
Faraway, J. (1994). Order of actions in regression analysis. In P. Cheeseman and W. Oldford
(Eds.), Selecting Models from Data: Artificial Intelligence and Statistics IV, pp. 403-411. New
York: Springer-Verlag.
Faraway, J. (2014). Linear Models with R (2nd ed.). London: Chapman & Hall.
Faraway, J. J. (2016). Does data splitting improve prediction? Statistics and Computing 26 (1-2),
49-60.
Fisher, R. (1936). Has Mendel's work been rediscovered? Annals of Science 1, 115-137.
Frank, I. and J. Friedman (1993). A statistical view of some chemometrics tools. Technometrics
35, 109-135.
Freedman, D. and D. Lane (1983). A nonstochastic interpretation of reported significance levels.
Journal of Business and Economic Statistics 1 (4), 292-298.
Galton, F. (1886). Regression towards mediocrity in hereditary stature. The Journal of the
Anthropological Institute of Great Britain and Ireland 15, 246-263.
Garthwaite, P. (1994). An interpretation of partial least squares. Journal of the American
Statistical Association 89, 122-127.
Gelman, A. (2008). Scaling regression inputs by dividing by two standard deviations. Statistics
in Medicine 27 (15), 2865-2873.
Hamada, M. and J. Wu (2000). Experiments: Planning, Analysis, and Parameter Design
Optimization. New York: Wiley.
Harrell, F. E. (2015). Regression modeling strategies: with applications to linear models, logistic
and ordinal regression, and survival analysis (2nd ed.). New York: Springer.
Herron, M., W. M. Jr., and J. Wand (2008). Voting Technology and the 2008 New Hampshire
Primary. Wm. & Mary Bill Rts. J. 17, 351-374.
Hill, A. B. (1965). The environment and disease: association or causation? Proceedings of the
Royal Society of Medicine 58 (5), 295.
Hsu, J. (1996). Multiple Comparisons Procedures: Theory and Methods. London: Chapman &
Hall.
John, P. (1971). Statistical Design and Analysis of Experiments. New York: Macmillan.
Johnson, M. and P. Raven (1973). Species number and endemism: the Galápagos
Archipelago revisited. Science 179, 893-895.
Johnson, R. (1996). Fitting percentage of body fat to simple body measurements. Journal of
Statistics Education 4 (1), 265-266.
Joliffe, I. (2002). Principal Component Analysis (2nd ed.). New York: Springer-Verlag.
Jones, P. and M. Mann (2004). Climate over past millennia. Reviews of Geophysics 42, 1-42.
Lentner, M. and T. Bishop (1986). Experimental Design and Analysis. Blacksburg, VA: Valley
Book Company.
Little, R. and D. Rubin (2002). Statistical Analysis with Missing Data (2nd ed.). New York: Wiley.
Makridakis, S., E. Spiliotis, and V. Assimakopoulos (2018). The M4 Competition: Results,
findings, conclusion and way forward. International Journal of Forecasting 34 (4), 802-808.
Mazumdar, S. and S. Hoa (1995). Application of a Taguchi method for process enhancement of
an online consolidation technique. Composites 26, 669-673.
Morris, R. and E. Watson (1998). A comparison of the techniques used to evaluate the
measurement process. Quality Engineering 11, 213-219.
Mortimore, P., P. Sammons, L. Stoll, D. Lewis, and R. Ecob (1988). School Matters. Wells:
Open Books.
Partridge, L. and M. Farquhar (1981). Sexual activity and the lifespan of male fruitflies. Nature
294, 580-581.
Raftery, A. E., D. Madigan, and J. A. Hoeting (1997). Bayesian model averaging for linear
regression models. Journal of the American Statistical Association 92 (437), 179-191.
Raghunathan, T. (2015). Missing Data Analysis in Practice. Boca Raton, FL: Chapman & Hall.
Rodriguez, N., S. Ryan, H. V. Kemp, and D. Foy (1997). Post-traumatic stress disorder in
adult female survivors of childhood sexual abuse: A comparison study. Journal of Consulting
and Clinical Psychology 65, 53-59.
Rousseeuw, P. and A. Leroy (1987). Robust Regression and Outlier Detection. New York:
Wiley.
Rousseeuw, P. J. and K. V. Driessen (1999). A fast algorithm for the minimum covariance
determinant estimator. Technometrics 41 (3), 212-223.
Schafer, J. (1997). Analysis of Incomplete Multivariate Data. London: Chapman & Hall.
Scheffé, H. (1959). The Analysis of Variance. New York: Wiley.
Sen, A. and M. Srivastava (1990). Regression Analysis: Theory, Methods and Applications.
New York: Springer-Verlag.
Simonoff, J. (1996). Smoothing Methods in Statistics. New York: Springer-Verlag.
Steel, R. G. and J. Torrie (1980). Principles and Procedures of Statistics, a Biometrical
Approach (2nd ed.). New York: McGraw-Hill.
Steyerberg, E. W. (2009). Clinical prediction models, Volume 381. New York: Springer.
Stigler, S. (1986). The History of Statistics. Cambridge, MA: Belknap Press.
Stolarski, R., A. Krueger, M. Schoeberl, R. McPeters, P. Newman, and J. Alpert (1986).
Nimbus 7 satellite measurements of the springtime Antarctic ozone decrease. Nature 322,
808-811.
Thodberg, H. H. (1993). Ace of Bayes: Application of neural networks with pruning. Technical
Report 1132E, Maglegaardvej 2, DK-4000 Roskilde, Denmark.
Tibshirani, R. (1996). Regression shrinkage and selection via the lasso. Journal of the Royal
Statistical Society, Series B 58, 267-288.
Tukey, J. (1977). Exploratory Data Analysis. Reading, MA: Addison-Wesley.
Weisberg, S. (1985). Applied Linear Regression (2nd ed.). New York: Wiley.
