Modern Regression Techniques Using R
A Practical Guide for Students and Researchers
Apart from any fair dealing for the purposes of research or private study, or criticism or review,
as permitted under the Copyright, Designs and Patents Act, 1988, this publication may
be reproduced, stored or transmitted in any form, or by any means, only with the prior
permission in writing of the publishers, or in the case of reprographic reproduction, in
accordance with the terms of licences issued by the Copyright Licensing Agency. Enquiries
concerning reproduction outside those terms should be sent to the publishers.
ISBN 978-1-84787-902-8
ISBN 978-1-84787-903-5 (pbk)
Preface
In this book we introduce several useful extensions to the basic regression model, without
too much mathematics, but with several pictures and some of the basic references. Not
all possible extensions are covered, but we chose a set that we think is particularly useful
for psychology. We will use the freeware package R, so a secondary purpose of this book
is to introduce some of the facilities in R (R Development Core Team, 2008). It is
command driven, much like the syntax in many other statistics packages such as SPSS (which seems the most
popular package in psychology, so we refer to it occasionally for comparison purposes),
but it is more flexible and has more procedures. Once you get used to it, we hope that
you will find it is easier than its competitors. It is free so we know you will like the price!
While we provide a brief introduction to R, we also provide links to useful books and
websites.
This book is divided into ten chapters. First, we explain the most basic basics of R,
but point readers to where they can find more details. Next, we give an overview of what
we call the basic regression and then briefly describe each of the extensions. Then, we
go through the seven extensions, and finish with a conclusion. Each of these chapters
includes a description and then goes through the analysis of some data.
This document grew out of a regression workshop given to the Legal Psychology group
at Florida International University in 2006, when Dan was on sabbatical there (and
where he is now permanently), and was the basis for a talk and poster at the SARMAC
2007 conference at Bates College, Maine. It was also the basis of Modern Statistical
Methods, a graduate course at University of Sussex. Many thanks to all those who
provided comments!
All of the royalties from this book go to the American Partnership for Eosinophilic
Disorders (www.apfed.com).
See the website for more information.
Happy regressing,
Dan Wright and Kami London
In this book R commands are written in dark bold Courier and R output
is in gray Courier. There is a glossary at the back of the book which provides
brief descriptions of all the commands/functions used in this book, so if something is
unfamiliar look there first. If you want to know more about the function use the online
help facility within R. To do this you should use the help function. For example, for
the function mean, type either help(mean) or ?mean.
We have adopted an example-based approach. Most of the data come from real research
papers. The examples were chosen because we hope that they will be of interest to most
working in the social and behavioral sciences, and also because we were able to access
the data. By providing examples, we hope you can match your own research needs
onto these examples. The data for all these examples and the corresponding code are on
http://www.sagepub.co.uk/wrightandlondon.
There are many books that cover conducting statistics in R. A list of some can be
found at:
http://www.r-project.org/doc/bib/R-books.html
One of our favorites is:
Crawley, M. J. (2005). Statistics: An introduction using R. Chichester, UK: Wiley.
This is not written specifically for social scientists, but it is exceptionally clear. He has also
written The R book (2007), which, at 950 pages, gives a much more detailed treatment.
More detailed readings are given at the end of each chapter. We assume that everybody
has studied some statistics, perhaps one semester of psychology-graduate-statistics, and
so understands the basics of the standard linear regression (covered briefly in Chapter 2).
There are several good background books for statistics, but one stands out above all others
for having the most modest authors:
Wright, D. B. & London, K. (2009). First (and second) steps in statistics (2nd). London:
Sage Publications.
NOTE
Microsoft Word and many other ‘high level’ word processing packages change some
characters (including " and ') to other characters (like “ and ‘), which are not read by R.
Therefore, if using one of these word processing packages we recommend turning off
several of the facilities that automatically change characters from those you type. If
copying code from websites, sometimes line breaks are lost, so you need to be careful
with this. If you are copying and pasting commands, it may be easier to save them in
Notepad or some other ‘low level’ word processing package. The text editor Tinn-R
is designed for R and can be downloaded from http://www.sciviews.org/Tinn-R/ and
http://sourceforge.net/projects/tinn-r.
1
Very Brief Introduction to R
Learning objectives
1. Learning some of the basic R concepts: functions, objects, assigning, packages,
mirrors, CRAN, and how to read data and access packages.
2. Statistical concepts reinforced include looking at data and transforming data; there is also a
detailed discussion of skewness.
3. We introduce you to the bootstrap, which will be used for several examples in this book.
Figure 1.1 The relationship between you and your computer, the CRAN mirrors, and the
statisticians (e.g., Ripley, Efron, Wilcox) who write R packages
CRAN and its mirrors hold most of the packages that you would want, but not all. Statisticians write their own packages for specialist purposes.
Some submit these to CRAN so that others can use them. When a package is sent
to CRAN it gets copied onto all the mirror sites. You can then download packages
from there. For example, there is a package called foreign (R core members et al.,
2008) that allows you to read data from other statistics programs directly into R. If
you type:
install.packages("foreign")
a window like Figure 1.2 opens. We chose the server in Michigan (USA (MI)) since
that is close to where we are preparing this chapter (in Ohio). This should install the
package onto your computer. The package is now on your computer so you may access
it in the future from this computer even if you are not connected to the internet, assuming
it is not erased. However, because authors update their packages frequently, it is worth
reinstalling packages relatively regularly.
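One way to do this (our suggestion; this command is not used elsewhere in the book) is:
update.packages()   # checks CRAN for newer versions of your installed packages and updates them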
If the mirror is not perfectly up-to-date or if you are not connected to the internet it
may not install. You may also get some warning messages.
Although the package is now on your computer (in a folder in the R directory) it is
not active. To make it active type:
library("foreign")
Now you have access to a large number of functions that are used to import and export
data between R and other statistical packages.
Some packages will have been installed when you downloaded R and you will
just need to load these. You will need to download others from CRAN. Some
statisticians, like Rand Wilcox, keep their packages on their own web page. In the
case of Wilcox, he has written a book that effectively acts as both a manual and
teaching resource for his functions. We will use some of his code in one of the
examples and his code can be accessed from the web using the source function.
We have only written ‘statistician’ on the right of Figure 1.1, but there are people
from other disciplines (like computer science and psychology) who write packages
for R. We did not include them because it is primarily statisticians doing the
writing.
If you have not already opened R, it would be good to open it up now because we
will be telling you to type things in throughout the rest of this chapter.
If we create an object called scores containing the numbers 5 to 8,
scores <- c(5,6,7,8)
and then type:
scores
we get:
[1] 5 6 7 8
5:8
[1] 5 6 7 8
seq(5,8)
[1] 5 6 7 8
This function is useful if you want more complex sequences. If you type:
seq(10,30,5)
[1] 10 15 20 25 30
mean(scores)
and we get:
[1] 6.5
The [1] in front of 6.5 is because some functions produce several pieces of
information so their parts are labeled. You can use functions in creating new variables.
For example, you may want to have a variable for how far away each value is from the
mean of the variable. The following command does this:
residscores <- scores - mean(scores)
Typing
residscores
produces
[1] -1.5 -0.5  0.5  1.5
Functions in R work with certain types of objects. While you can take a mean of four
numbers, you cannot take a mean of four people’s names. Names (or string values) need
to be placed in quotation marks so that R does not think they are objects that it should
be able to find. The following creates a variable of four people’s names and shows that
the function mean does not work in this instance.
Simpsons <- c("Homer", "Marge", "Lisa", "Bart")
mean(Simpsons)
[1] NA
Warning message:
argument is not numeric or logical: returning NA in:
mean.default(Simpsons)
The function c also works with strings. Thus, when Maggie arrived at the Simpson
household, the glorious event of childbirth could be written in R as:
Simpsons <- c(Simpsons, "Maggie")
We can identify each member of the Simpsons by writing the variable name with an
index. So, to identify the fourth member of the Simpsons, we write:
Simpsons[4]
[1] "Bart"
When a function is applied to an object it can create another object, which can then be
used in other functions. The basic regression function is lm. It is applied to some data
objects (a response variable and some predictor variables) and a lm.object is created.
This object can then be used in other functions, like plot, summary, and anova.
These are illustrated in several examples throughout this book. Many R functions are
intelligent. For example, the plot function works differently dependent upon what type
of object is placed within it, and we will see this as we go through this book.
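For example, assuming a response y and a predictor x are already in the workspace, the typical flow looks like this (a sketch of the pattern, not an analysis from this book):
reg <- lm(y ~ x)   # fit the model; this creates a lm.object
summary(reg)       # parameter estimates and fit statistics
plot(reg)          # the default diagnostic plots
anova(reg)         # an ANOVA table for the model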
The basic methods in R to read data assume that the data are in a text file in a table-like
form. Figure 1.3 shows data in this format in a Notepad file. It is important to have data
in a text (or ASCII) format, so if your data are in a word processing file you have to use
the SAVE AS option. If the data are in text format and in a table like Figure 1.3, then the
command:
newdata <- read.table("filename", header=T)
assigns the data in "filename" to the object newdata. When using read.table
the "filename" may either be a file on your computer or one on the web. For the
function used in this example, read.spss, the data must be saved on your computer.
The characters <- assign things to each other, and can be used in either direction so
x <- 6+4 and 6+4 -> x both assign the value 10 to x. Often you will have data in
another package and need to import it into R. The package foreign allows data to be
read from other packages, including: SPSS, Minitab, SYSTAT, SAS, and Stata. Since
it seems SPSS is the most popular among academic psychologists we will assume that
you want to import the data from SPSS.
Figure 1.3 A text file for the chile data set as shown within Notepad
Figure 1.4 Pepper Joe's Hot Pepper Heat Scale
The data in this example examine the relationship between the length of a chile and
its heat. LENGTH is in cm and HEAT is in Pepper Joe’s Hot Pepper Heat Scale shown in
Figure 1.4 (in technical papers the Scoville scale is used, but they don’t have a smiling
chile with a shovel waving at you, see www.pepperjoe.com/about/heatscale.html).
The library function loads the package foreign so that it can be used. Note
that \\ are used rather than \ in the read.spss function (and the read.table
function used below). The attach command makes the data file the active data
file, overwriting any other variables that may have the same names. The following
commands read data from c:\temp\chile.sav. The data are available on the book’s web
page (on www.sagepub.co.uk/wrightandlondon) so this file will need to be copied into
your c:\temp folder (if you do not have a c:\temp folder copy it elsewhere and change
the code below accordingly).
library("foreign")
chile <- read.spss("c:\\temp\\chile.sav")
attach(chile)
You may get a warning, but this particular warning is not a problem; these data are
read accurately. You do not need to have SPSS on your computer to read SPSS data files
into R.
The data are also stored on this book's web page as a text file. They can be accessed by:
chile <- read.table("http://www.sagepub.co.uk/wrightandlondon/chile.dat", header=T)
This command takes up multiple lines to print and requires that you type the web address
carefully each time. So that you do not have to write a command that is multiple
lines, it is worth assigning this book's web address to the object webreg so that it need
not be typed each time.
webreg <- "http://www.sagepub.co.uk/wrightandlondon/"
To write the full web address we have to paste together the web page and the file name.
paste(webreg,"chile.dat",sep="")
[1] "http://www.sagepub.co.uk/wrightandlondon/chile.dat"
The default for the paste function is to have one space between each of the objects
that are pasted together. Having a space in a web address will cause the web address
not to be recognized, so we have to tell R that there should be no separation between
the objects. The option sep="" tells the computer this. If we wrote sep="," it would
have placed a comma between each part.
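For example (these lines are ours, just to show the defaults):
paste("wright", "london")            # default separator is a space: "wright london"
paste("wright", "london", sep="")    # no separator: "wrightlondon"
paste("wright", "london", sep=",")   # comma separator: "wright,london"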
Now we can write:
chile <- read.table(paste(webreg,"chile.dat",sep=""), header=T)
The header=T means that the first line in the data file has the variable names
(see Figure 1.3). The T stands for TRUE and it can be written as header=TRUE.
Alternative methods for reading data from the book’s web page are listed in the box
below. If you plan to use this book while disconnected from the internet, there are
instructions on the web page for downloading all the data, functions, and syntax to your
hard drive (or memory stick).
There are many ways to access data from the book’s web page and three are
presented here. The first is to write out the web page within the read.table
command:
chile <- read.table("http://www.sagepub.co.uk/wrightandlondon/chile.dat",
     header=T)
The problem is that this takes a few lines and requires careful typing.
The second is to first assign the web page to an object, like webreg, and then
use the shorter command.
webreg <- "http://www.sagepub.co.uk/wrightandlondon/"
chile <- read.table(paste(webreg,"chile.dat",sep=""), header=T)
This still requires assigning the web address to webreg every time you turn on R
and want to access the book’s web page, but at least the first line can be copied
and pasted from other exercises.
The final option, which will make life easier for people who are fairly comfortable
with computers, is to add the line
into the file called Rprofile.site which was installed when you installed R.
Use the search facility on your computer to find it. If you add this line then
everytime you start R this assignment will be made. Then you would only need
to type:
chile <- read.table(paste(webreg,"chile.dat",sep=""), header=T)
Because this final option requires more computer knowledge than we are
assuming, the second option will be used throughout this book.
After you import the data set (and name it, say chile), then type:
attach(chile)
This attaches the data set to your working environment. ‘Attaching to your working
environment’ is the technical jargon that means you can now access all these variables
by just typing the variable names. To find out the variable names type:
names(chile)
These are the same as in Figure 1.3. While you can access a variable without the file
being attached by typing:
chile$HEAT
this can get cumbersome and offers little advantage if you are working with only one data set at
a time. It is easier not having to re-type the name of the data set each time. Therefore,
in this book we will always attach the data set we are working with. If you are doing
more advanced statistics where you are looking at several data sets, we recommend not
attaching the data but accessing them within each function or using the with function
(see Crawley, 2007). However, for most psychologists’ needs this would add further
complications.
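For example, with the chile data either of the following would give the mean heat without attaching (a small sketch of the alternatives just described):
mean(chile$HEAT)         # name the data set each time
with(chile, mean(HEAT))  # or evaluate the expression within the data set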
When R reads SPSS files, all the variable names are UPPERCASE. This is how SPSS
stores them internally because SPSS is not case sensitive. R is case sensitive so:
heat
produces the word NULL. This means there is no variable heat. The variable is HEAT.
If you want to find out how many cases are in the variable LENGTH, you need to ask
for the length of LENGTH:
length(LENGTH)
We are now going to describe skewness in more depth than is typical in
introductory statistics courses. We do this for two reasons. First, it emphasizes
the importance of looking at your variables and describing their distributions.
This is something that the American Psychological Association states should
be done (Wilkinson et al., 1999). Second, it provides a good way to focus on
how well different transformations work. Statisticians stress the importance of
transformations (e.g., Mosteller & Tukey, 1977), but this stress is often absent from
psychology texts.
We now look at the histogram of this variable. This can be done with hist(LENGTH)
and the result is shown in the left panel of Figure 1.5. It looks fairly skewed. There are
several different definitions of skewness (Groeneveld & Meeden, 1984) and none are in
the base R package, so you have to load another package; the two
most commonly used are e1071 (Dimitriadou et al., 2008) and fBasics (Harrell
et al., 2007). Each provides a function, skewness, which computes the most common measure of
skewness.1 It is:
\[ \text{skewness} = \frac{\sum (x_i - \bar{x})^3}{n \cdot sd(x)^3} \]
1 The skewness function in both of these R packages provides the value for the sample skewness, not the
estimate of the population skewness. In your introductory statistics textbooks there may have been some
discussion of the difference between these with respect to the standard deviation. A better estimate for the
population skewness of a variable x is, in R code:
n <- length(x)
popskew <- sqrt(n*(n-1))/(n-2)*skewness(x)
\[ \text{est. pop. skew} = \frac{\sqrt{n(n-1)}}{n-2} \times \text{sample skew} \]
As with the standard deviation, the difference is usually only slight, so people tend not to worry about this.
For the variable LENGTH the population estimate is 1.196.
Figure 1.5 The histograms for the variable LENGTH, in the left panel, the
transformed variable log(LENGTH), in the middle panel, and transformed variable
log(LENGTH + 2.54), in the right panel
install.packages("e1071")
library(e1071)
skewness(LENGTH)
[1] 1.174819
which is a pretty high skew (anything over 1 is usually considered high). An equation
for standard error of skewness assuming the data are normally distributed is:
\[ se_{\text{skewness}} = \sqrt{\frac{6(n-2)}{(n+1)(n+3)}} \]
This produces 0.256 for these data. Sometimes an approximation, \(\sqrt{6/n}\), is given, and
this yields 0.266, so very similar. This means an approximate 95% confidence interval
for these data is: 1.17 ± 2(.26) = 1.17 ± .52 = (0.65, 1.69). One of the problems with
this approach is the assumption of normality when it is likely that you are interested in
distributions which are not normal.
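In R these calculations can be done directly (our own sketch, not code from the book):
n <- length(LENGTH)
se.skew <- sqrt(6*(n - 2)/((n + 1)*(n + 3)))   # the normal-theory standard error above
skewness(LENGTH) + c(-2, 2)*se.skew            # approximate 95% confidence interval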
A bootstrap sample is one taken from the observed sample where you randomly
choose one item, record its value, and then return this item to the sample. You then
randomly choose a second item, record, and return, and repeat this until you have
a sample as large as the original sample. Some items will be chosen multiple times and
others not at all, because each item is chosen at random from the full sample every time.
You can then calculate whatever statistic you want for this bootstrap sample (e.g., the
mean, the skewness, the F from an ANOVA).
Using a computer you can repeat this procedure thousands of times and record
the relevant statistics for each bootstrap sample. The distribution of the statistics
for these bootstrap samples provides a way of measuring the precision of your
estimates. A rough way to estimate the 95% confidence interval of any statistic
is the middle 95% of the distribution for these bootstrap samples. Statisticians
have devised ways to improve upon this rough approximation to estimate the
confidence intervals and the bias-corrected and accelerated (BCa) method is
used here.
Bootstrapping is a very flexible procedure and is rapidly gaining in popularity.
It can work on many problems where the traditional mathematical approach has
difficulties. Bootstrapping supplements the mathematical approach to statistics
with the brute computing force of being able to create thousands of bootstrap
samples in seconds!
library(boot)
lengthboot <- boot(LENGTH,function(x,i) skewness(x[i]), R=1000)
boot.ci(lengthboot)
CALL :
boot.ci(boot.out = lengthboot)
Intervals :
Level Normal Basic
95% ( 0.644, 1.826 ) ( 0.627, 1.765 )
Level Percentile BCa
95% ( 0.584, 1.723 ) ( 0.714, 1.962 )
Calculations and Intervals on Original Scale
Some BCa intervals may be unstable
Whichever estimates are used, this variable is skewed (skewness is zero for perfectly
symmetric data). Because of this skewness we might want to transform the data; the
natural logarithm (log) is a popular choice for positively skewed data. The middle panel
of Figure 1.5 shows the distribution for this transformed variable is more symmetrical
than the untransformed variables (in the left panel), but now its tail seems drawn out to
the left. It appears negatively skewed and the statistics confirm this.
skewness(log(LENGTH))
[1] -0.4703981
The 95% confidence interval just overlaps with zero using the traditional ±2se
approach, but the bootstrap intervals do not overlap with zero.
Because bootstrap estimates depend on the particular bootstrap samples chosen, if
you run a bootstrap confidence interval several times you may find some of them
do overlap (but most will not). If we add an inch to each chile (or 2.54 cm) and
then log the data, the skewness is near zero (skewness(log(LENGTH+2.54))
produces 0.04). This type of transformation is described in the classic regression book
by Mosteller and Tukey (1977), where the 2.54 is called a starting value. It is also
referred to as a flattening constant, delta, and a Bayesian flat prior, depending on the
situation in which it is used. Figure 1.5 shows the three histograms next to each other.
To tell the computer to print multiple graphs on the same page you have to use the
par(mfrow=c(1,3)) command. This is a tricky command to remember, but it is
often used so is worth memorizing. This tells R that the graph window should have 1
row of graphs and 3 columns of graphs. If you had wanted the graphs in a 2 x 2 grid,
you would have typed: par(mfrow=c(2,2)). When you are done with any of these
multiple-graph-figures it is worth returning this to its default par(mfrow=c(1,1)),
which has just one graph per figure. You will need to re-shape the graph window
to make the figures look like those that are printed in this book. If you right-click
on the window you can copy it as a metafile or a bitmap and paste it into other
documents in packages including Word and PowerPoint. The following code makes
Figure 1.5.
par(mfrow=c(1,3))
hist(LENGTH)
hist(log(LENGTH))
hist(log(LENGTH+2.54))
par(mfrow=c(1,1))
To end a session, you should detach the data set, here detach(chile). To quit,
type q(). This is one of the few R functions that requires no input.
This was an incredibly brief introduction to a very powerful statistics environment.
As shown by the rapidly increasing list of R books on http://www.r-project.org/doc/bib/
R-books.html, this free and flexible environment is growing in popularity.
R functions
• read.spss and read.table: to read SPSS and text files;
• install.packages: to access packages from CRAN;
• library: to make packages active;
• seq and the : operator: to make sequences of numbers;
• par(mfrow=c(7,4)): to print with 7 rows and 4 columns of graphs;
• log: the logarithm function;
• <-: to assign objects to each other;
• c: to concatenate (or combine) objects.
Statistical concepts
• skewness: a measure of a distribution’s symmetry;
• bootstrap: a modern method for estimating the precision of almost anything.
FURTHER READING
Crawley, M. J. (2005). Statistics: An introduction using R. Chichester, UK: Wiley. This is a great
introduction. Although not written for psychologists, it is still excellent and very clear. Professor
Crawley is actually a plant ecologist looking at the interactions between plants and animals, and
has books on things like the ‘Flora of Berkshire’. http://www3.imperial.ac.uk/naturalsciences/
research/statisticsusingr
Crawley, M. J. (2007). The R book. Chichester, UK: Wiley. 950 pages of R and it’s legal! This
is part reference part teaching book. It is more advanced than his 2005 book, not in terms
of statistical knowledge, but in terms of computing skills. http://www.bio.ic.ac.uk/research/
mjcraw/therbook/index.htm
Fox, J. & Anderson, R. (2005). Using the R statistical computing environment to teach social
science statistics courses. http://socserv.mcmaster.ca/jfox/Teaching-with-R.pdf and see also
http://socserv.mcmaster.ca/jfox/Courses/R-course/index.html. John Fox has written much
about R and statistics.
Venables, Smith and the R Development Core Team (2008). An introduction to R. Free on http://
cran.r-project.org/doc/manuals/R-intro.pdf.
An Amazon list on learning R:
http://www.amazon.com/Learn-the-R-statistics-software/lm/244T3243F9I31/
ref=cm_lmt_srch_f_2_rsrsrs0/102-3396071-7592139
The R team recommends various packages for different areas. See:
http://cran.r-project.org/src/contrib/Views/
http://cran.r-project.org/src/contrib/Views/SocialSciences.html
The R help facilities do not tell you about the statistical issues, just how to run the
functions. Much R information can be downloaded from the R web site.
The code in this book was run using R 2.4–2.7, but we will update the code if it stops
working (if someone tells us!). It is best to use the most up-to-date non-beta version if
you are a non-expert.
Note: If using a word processor to type commands, and then pasting them into
R, be careful that the symbols you type are not being changed. For example,
some word processors might change <- to a single character for an arrow,
←, which R will not understand. This facility can be turned off (in Word, from
the Tools/AutoCorrect toolbar). Other problem symbols are ", ', and ellipses
(three dots). The line breaks are also sometimes not copied correctly. WordPad
and Notepad, which have fewer auto-correcting procedures, are often better
to use than the more encompassing word processing packages like Word and
WordPerfect. Alternatively, you can use text editors designed for R like Tinn-R
(http://www.sciviews.org/Tinn-R/ and http://sourceforge.net/projects/tinn-r).
2
The Basic Regression
Learning objectives
1. Describe the simple linear regression.
2. Show some R code relevant to regression.
3. Create variables with normal and other distributions.
4. Write data to files.
5. Review the topics to be covered in the remainder of the book.
The word ‘regression’ gets used by different people in different ways. While it is a very
general procedure, people are usually first introduced to a simple linear regression of
the form:
yi = β0 + β1 xi + ei
This equation could either represent a linear relationship between two variables measured
on different scales or a difference between the means of yi for two groups (where xi could
be coded as xi = 0 for one group, and xi = 1 for the other group – what is often called
dummy variable coding). These two situations are depicted in Figure 2.1 for data created
for illustrative purposes.
Because learning how to use R is an important part of this book, we will list the code
for making all figures and for all analyses (in black). We do this because learning
from examples and adapting the code for your own purposes is one of the best ways
for learning R. In each chapter we will add extra important information about R, but
you should also search yourself using the help facilities within R. To find out what a
function means write help(xxxx), replacing xxxx with the name of the function.
So, help(mean) provides help about the function mean. The command ?mean also
provides help. If you do not know the name of the function you can search the help
facilities with: help.search("concept"). You can also use the search facilities
on the CRAN web page. The help facilities are written for statisticians so sometimes
may seem abstract or too filled with statistical jargon to be easily understood. Books like
this fill in some of these gaps.
Figure 2.1 Scatterplots of y for the two situations described in the text: a linear relationship
with a continuous predictor (left panel) and a difference between two group means (right panel)
In this chapter we cover creating new variables from probability distributions and
we use the rnorm function. So, if you want to know more about this function you
would type help(rnorm) (or ?rnorm) and the computer would tell you that rnorm
creates normally distributed random variables. In the code below the 100 means there
are 100 cases, the first 10 is the population mean, and the second 10 the population
standard deviation. These are the characteristics of the population from which the
variable x1 is created. If you want to switch the order of the arguments within a
function you have to tell R what each means, so rnorm(10,20,5) is the same as
rnorm(mean=20,sd=5,n=10). x2 is created using rbinom(100,1,.5). It says
to create a binomial variable with 100 cases based on 1 flip of a single coin where the
coin has a .5 chance of landing heads (and heads=1, tails = 0). Since a probability cannot
be above 1 or below 0, if we had written rbinom(100,1,1.5) the computer would
have created a variable with 100 missing values (labeled NaN, which stands for ‘Not A
Number’). We mostly keep with the default settings because that is simpler for teaching
purposes and requires less typing (and therefore fewer typos). We have set the random
seed so that the data produced will be the same each time you run this code (though the
numbers may be different with different versions of R). The third variable is called y
and it is a combination of x1 and x2 with additional random normally distributed error.
It also has 100 cases.
set.seed(121)
x1 <- rnorm(100,10,10)
x2 <- rbinom(100,1,.5)
y <- x1 + x2*40 + rnorm(100,0,10)
Next we can run some simple linear regressions between y and the x variables. The
basic regression function in R is lm. lm stands for linear model. lm(y~x1) says
to run the regression yi = β0 + β1x1i + ei . There are more advanced regressions
that will be introduced during this book which build on this syntax, but they
all have the form that the tilde symbol, ~, separates the response variable, here
y, from the variables that you are using to predict it, here x1. This notation
is based on one developed by Wilkinson and Rogers (1973). When we run a
regression it creates a lm.object. Here we run two regressions and store the
results in reg1 and reg2. As will be shown, we can use these objects in other
functions.
reg1 <- lm(y~x1)
reg2 <- lm(y~x2)
par(mfrow=c(1,2))
plot(x1,y)
abline(reg1)
plot(x2,y)
abline(reg2)
par(mfrow=c(1,1))
The abline functions place the regression lines onto the scatterplots. There are a few
functions within R that draw lines onto scatterplots. abline is for straight lines.
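abline can also draw lines that you specify yourself rather than fitted lines (a small aside, not part of the book's example):
abline(a=0, b=1)   # a line with intercept 0 and slope 1
abline(h=0)        # a horizontal line at y = 0
abline(v=10)       # a vertical line at x = 10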
Traditionally, the situation shown in the left panel of Figure 2.1 is called a simple
linear regression, and on the right is a t test. However, both are the same model –
the one depicted by the equation above. There are several assumptions for assessing
these models. Many of the topics discussed in this book are about extensions which
address these assumptions. These two situations are likely to have been dealt with in
your introductory statistics courses, but it is worth repeating them in order to show how
R can estimate them.
The default diagnostic checks for the model, yi = β0 + β1x1i + ei , are shown in
Figure 2.2. Figure 2.2 was made with the plot function. If a lm.object is entered
into the plot function it assumes that you want these four graphs. Because there are
four graphs we have told R to present them in a 2×2 grid.
par(mfrow=c(2,2))
plot(reg1)
The Q–Q (quantile–quantile) plot in Figure 2.2 shows systematic deviation from
normality. If the residuals were normally distributed, we would expect the points in the
Q–Q plot to form a straight line near the diagonal with only non-systematic deviation.
Systematic deviations can arise for several reasons, but because we have made up the data
ourselves we know that the deviations are likely to be due to the model (which includes
just x1) being incomplete. It should also include x2. This suggests that we should try
adding the variable x2, creating what is usually called a multiple regression: yi = β0 +
β1x1i + β2x2i + ei. This model is created and stored with:
reg3 <- lm(y~x1+x2)
1 For comparison, if x2 were treated as a factor the plot command would draw two boxplots. To show this you
Figure 2.2 The four default diagnostic plots from R for a linear regression object, for
the model yi = β0 + β1x1i + ei
Researchers are usually interested in the parameter estimates for these models. In the
code above we have created three regression or lm objects: reg1, reg2, and reg3.
The way R works is to use functions that take information from these objects and, in
the case of plot, creates the plots above, or in the case of summary provides the
parameter estimates (the coefficients), the multiple correlation coefficient, and the like.
So, summary(reg3) produces:
Call:
lm(formula = y ~ x1 + x2)
Residuals:
Min 1Q Median 3Q Max
-26.5775 -6.7035 -0.8574 7.7245 23.5401
Figure 2.3 The diagnostic plots for the regression including both x1 and x2
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 2.6183 1.9887 1.317 0.191
x1 0.8147 0.1047 7.785 7.68e-12 ***
x2 36.9657 2.1490 17.201 < 2e-16 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
\[ R^2_{adj} = 1 - \frac{(1 - R^2)(n - 1)}{n - k - 1} \]
where k is the number of predictor variables. The population values are: yi = 0 + 1.0 x1i +
40 x2i + ei . If you want to purposefully exclude the constant you have to write: reg4 <-
lm(y~x1+x2-1). If you want to add the interaction term (which here means that the
slope for x1 is different for the two groups), it is reg5 <- lm(y~x1+x2+x1:x2)
or reg5 <- lm(y~x1*x2). The x1:x2 means the interaction between x1 and
x2, and x1*x2 means the interaction of x1 and x2 and all the effects nested
within this (so x1*x2*x3 is the same as x1:x2:x3+x1:x2+x1:x3+x2:x3+
x1+x2+x3).
You can see if a nested model fits significantly better than another with the anova
function.2 The model with neither the interaction nor the intercept is compared with the model
that includes them by typing anova(reg4,reg5):
Model 1: y ~ x1 + x2 - 1
Model 2: y ~ x1 + x2 + x1:x2
The second model accounts for more of the variation (RSS is residual sum of squares
which drops by 250), but it uses 2 more degrees of freedom (for the intercept and the
interaction). This difference is nonsignificant. The difference divided by the residual
sum of squares of the first model is the popular effect size, partial eta-squared. Here it
is: 251.9/11285.0 = .02. We could write:
Including the interaction and intercept did not significantly improve the fit of the model,
F (2, 96) = 1.10, p = .34, ηp 2 = .02.
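If you prefer to let R do this arithmetic, something like the following works (a sketch using the objects defined above; the column names are those that anova returns when comparing lm objects):
comp <- anova(reg4, reg5)
comp$"Sum of Sq"[2]/comp$RSS[1]   # partial eta-squared, about .02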
2 The phrase analysis-of-variance (ANOVA) refers to comparing the ratio of two variances. This is the
final step in the popular statistical procedure for comparing means, which is often also called ANOVA.
The ANOVA procedure for comparing means is run in R with a function called aov or with lm (see
Chapter 3).
These data examine whether dissociation (which is high in what used to be called
multiple personality disorder) predicts suggestibility in an eyewitness memory study.
Most of the data in this book are ours. This is because we have these data files handy
and are allowed to disseminate them without copyright issues.4
Data can be read into R in several ways. In Chapter 1 you were shown how to
access them from other files but for small data sets it is often convenient to type the
numbers directly into R. Here we assign the 50 values for each variable to suggest
3 If you have not used matrix notation before, do not worry. This is the only place in the book where we use it
like this. It is useful shorthand, but is not necessary for conceptual understanding.
4 The website that accompanies this book includes other examples. We encourage you to include your own
examples on these pages. Directions for how this should be done are on the website.
and DES (DES stands for dissociative experiences scale). The function c is often used
in R; it stands for concatenation and is used to tell the computer that these numbers
form a set. Notice that with long commands they will stretch over multiple lines. In
many R sources authors place a + at the start of a new line if it is a continuation
of a previous command. We do not. Where it is unclear we indent the continuation
lines.
suggest <- c(12, 2, -2, 5, 10, -13, 10, -3, 4, 9, 13, 6, 18, 12, 14,
-6, 0, 5, 6, 19, -5, 8, 5, 14, -1, -2, 10, 17, 2, 10, -1, 21,
14, 4, 20, 24, 5, -12, 8, 7, 0, 2, 7, -1, 12, 4, 0, 19, 8, -12)
DES <- c(45.00,49.28,38.57,55.71,48.21,14.60,12.86,43.50,27.10,
46.40,46.40,35.00,55.00,52.85,46.40,18.21,36.79,48.93,45.50,
22.10,31.43,32.50,52.14,18.21,22.50,27.14,16.67,14.64,39.64,
16.78,34.28,69.64,38.90,23.20,41.67,62.86,47.00,22.86,31.42,
48.21,18.57,48.21,40.00,35.00,57.85,50.35,41.07,45.00,48.00,49.28)
The values for 50 participants on these two variables are in the same order so that the
fourth participant’s can be accessed with suggest[4] and DES[4].
Next, the simple linear regression is run with the lm function, a lm.object is stored
in reg, and summary is used to print information about this object. The lm function
assumes the variables are in the same order, so that the fourth number in suggest
corresponds to the same person as the fourth number in DES. When a lm.object is
entered into the summary function, R produces the most commonly reported statistics
for a regression. R prints a * to denote p <.05, ** for p <.01, etc. Because many people
find this a ‘poor scientific strategy’ (Meehl, 1978: 817) you may want to turn this facility
off with options(show.signif.stars=FALSE). The words TRUE and FALSE
can be replaced by T and F in most R functions, though it is recommended that you use
TRUE and FALSE to avoid confusion. However, so that commands fit on as few lines
as possible, we sometimes do not.
reg <- lm(suggest ~ DES)
summary(reg)
Call:
lm(formula = suggest ~ DES)
Residuals:
Min 1Q Median 3Q Max
-20.26046 -5.99737 -0.02292 6.03891 15.92422
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) -1.1399 3.4105 -0.334 0.7397
DES 0.1908 0.0838 2.276 0.0273 *
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Because the coefficient for DES (0.1908) is positive, we can tell that in the sample
there is a positive relationship between the dissociation scores and memory suggestibility,
with a multiple R2 of 0.09744. To find Pearson’s r we can take the square root of this:
sqrt(.09744) yields 0.3121538. There are several ways to report this result in a
manuscript including t(48) = 2.28, p = .03, and r = .31, n = 50, p = .03. These are
reporting the same information (being sensible with the sign, if t is negative r should
also be negative) because:
\[ r = \sqrt{\frac{t^2}{t^2 + df}} \]
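As a quick check in R, using the t and df values from the output above:
sqrt(2.276^2/(2.276^2 + 48))   # approximately 0.31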
To interpret a regression properly it is necessary to graph the data. Over the last 30 years
there has been a large increase in the awareness that graphs are important for statistics.
There has also been a large increase in what computers can do to make good graphs
(and also in how they can create bad graphs, Wainer, 1984). Figure 2.4 is just the basic
scatterplot between these variables. We have added our own labels for the x and y axes.
The regression line is plotted using the abline function. It is often useful to add text
to graphs. The text function does this. It requires three arguments: the location on the
x axis and on the y axis and the text to write. The text needs to be in quotation marks.5
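The commands we used for Figure 2.4 are on the book's web page; a minimal sketch of this kind of plot (the axis labels and the text coordinates here are our guesses from the figure) is:
plot(DES, suggest, xlab="Dissociation score (0-100)",
     ylab="Suggestibility (-25 to 25)")
abline(reg)                 # add the fitted regression line
text(25, -10, "r = .31")    # place the correlation where there is space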
After completing your analyses you may want to save the data. At present, there are
just the two variables (DES and suggest) floating around in your current work space.
The only way that R knew they were related was that in the lm and plot functions
they were of the same length and placed within these functions together. To create a data
set of these two variables we can use the column combine function cbind:
dataset1 <- cbind(suggest, DES)
and then write it to a file with write.table:
write.table(dataset1,"c:\\temp\\dataset1.dat")
The \\ are needed when either saving or accessing information from other locations.
The data can be written in the appropriate format for other packages too. We can use
the package foreign (R Core members et al., 2008), which we used in Chapter 1 to read
data from SPSS, to write the data to other packages. If you are using the same computer
as you were for Chapter 1, and if the computer hard drive is not cleaned (which happens
5 You can write text(locator(1),"r =.31") and then click onto the plot where you want the text.
We do not recommend this because if you want to re-make the graph it is nice knowing the exact location
of the text.
Figure 2.4 A scatterplot, with the regression line and the label r = .31, for the Wright and
Livingston-Raper (2001) data; the x axis is the dissociation score (0–100) and the y axis is
suggestibility (−25 to 25)
on many public access machines), you should not need to re-install this package. But, if
you are on a new computer, to access this package type:
install.packages("foreign")
and it will prompt you to choose a mirror site (see Figure 1.2). Once it is installed to
activate all the functions within the package type:
library("foreign")
write.foreign(as.data.frame(dataset1),"c:\\temp\\dataspss.dat",
"c:\\temp\\dataspss.sps",package="SPSS")
6 We refer to variables on the left side of the equal sign (or ~ in R functions) as response variables and those
on the right side as predictor variables and covariates. Sometimes these are referred to as the DV and the IVs,
for dependent and independent variables.
One of the most useful GLMs is the logistic regression, which is appropriate when the response variable is either binary
or a proportion. We briefly examine the most common types of GLMs, and then look in
more detail at an example from juror decision making which uses logistic regression to
estimate the meaning of ‘reasonable doubt’ (Wright & Hall, 2007).
R functions
• help: to learn about functions;
• write.table: writes data to files;
Statistical concepts
• linear regression: a model linear in the parameters (the β values).
• Q–Q plot: a plot to check the normality of residuals.
FURTHER READING
Fox, J. (2002). An R and S-Plus companion to applied regression. Thousand Oaks, CA: Sage
Publishing. This is the book most similar to ours as far as content. This is a very good book.
It assumes more statistical knowledge than we do.
Mosteller, F. & Tukey, J. W. (1977). Data analysis and regression: A second course in statistics.
Reading, MA: Addison-Wesley Publishing Company. This book is a classic textbook. It is more
advanced than Wright and London (2009).
Wright, D. B. & London, K. (2009). First (and second) steps in statistics (2nd Ed.). London: Sage
Publications. Most introductory textbooks cover basic regression. Introductory textbooks vary
in how much statistical knowledge is assumed. This one is on the beginner side of this scale.
3
ANOVA as regression
Learning outcomes
1. Running a oneway ANOVA as a regression;
2. Using different contrasts in R;
3. Learning why you should not categorize variables unless necessary;
4. Learning how R treats factors and numeric variables.
This chapter covers two almost opposite topics. The first is what to do if you have
predictor variables that are categorical. This involves using regression like an ANOVA.
The second is that unless the data really are just categorical you should usually not
treat them as categorical. We use data from two studies. The first example is based on
the classic cognitive dissonance study by Festinger and Carlsmith (1959) showing that
giving people insufficient reward for lying can create dissonance. The second example
is from a recent study by London et al. (in press, Exp. 2). The primary purpose of their
research was to look at the long-term effects of a suggestive interview. Here just the
correct free recall utterances are examined.
In one of the classic studies of social psychology, Festinger and Carlsmith (1959)
had participants spend about one hour putting spools onto a tray and turning square
pegs a quarter rotation. It was designed to be boring and was! After participants finished
this tedious task, the experimenter pretended as if the study was over and gave them a
pretend debriefing. Participants were told that there were two groups in the study and that
they were in the control group who had received no information before the study. They
were told that people in the other group spoke with a confederate (someone working
for the experimenter but pretending to have just taken part in the study as a participant)
who told them that the experiment was enjoyable. At this point, the real study was just
beginning.
There were three groups in Festinger and Carlsmith’s study. There was a control group
who after the ‘debriefing’ were ushered into a waiting room. There were two experimental
groups. For each of these the experimenter explained that the usually reliable confederate
had phoned saying that he could not make it. The experimenter asked if the participant
would help out and tell a female participant who was waiting in the next room that the
experiment was enjoyable. One group was paid $1 and the other was paid $20.1 Most
complied, although a few said they were suspicious and their data were discarded.2 After
the participant either waited in an empty room (control group) or told the confederate that
the boring task was enjoyable, they thought they were done. On leaving the building, they
were informed that the department monitors all experiments and asked if they would fill
out a questionnaire. This in fact was an integral part of the study. It included a question
asking them, on a −5 to +5 scale, how interesting and enjoyable the study (the boring
spools and pegs tasks) was. The prediction from Festinger’s cognitive dissonance theory
is that those paid only $1 were more likely to say the task was enjoyable compared with
the other groups.
Festinger and Carlsmith had 20 people in each condition. We have recreated these
data so they closely resemble their original data.
We need to create a variable for condition where the first 20 values are for the control
group, the next 20 are for the $1 group, and the final 20 are $20 group. The rep function
does this. The each=20 means to write each one of these 20 times.
group <-
as.factor(rep(c("control","$1","$20"),each=20))
We have to tell R that we want it to treat this sequence as a categorical variable, which
in R terminology is a factor.
1 All were asked for this money back at the end of the real experiment. Festinger and Carlsmith (1959: 207)
said all ‘were quite willing’ to do this. Would this replicate?
2 Another participant’s data were discarded because he asked for the female’s phone number and said ‘he would
call her and explain things’ (p. 207) and wanted to stay around until she was done with the experiment so
they could talk, presumably wanting to ‘debrief’ the young lady (who was the actual confederate). After the
actual experiment, participants were debriefed with the female confederate present. The participant slept alone
that night.
Several R functions can be used to explore these data. The first might be to look at
the means and standard deviations. The tapply function can be used to look at the
value of any statistic for a variable broken down by groups. Here, the mean and the sd
(standard deviation) of enjoy are printed for each value of group.
tapply(enjoy,group,mean)
$1 $20 control
1.35 -0.05 -0.45
tapply(enjoy,group,sd)
$1 $20 control
2.158825 1.848897 2.438183
Because group is a factor the plot function makes separate boxplots for each group
(Figure 3.1). We have added labels to the y and x axes. The cex.lab=1.3 makes these
labels 1.3 times larger than their default.
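The plotting command itself is not reproduced above; a sketch of the kind of call that makes Figure 3.1 (the label text is our assumption) is:
plot(group, enjoy, xlab="Condition", ylab="Enjoyment (-5 to +5)", cex.lab=1.3)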
Figure 3.1 Boxplots of the enjoyment ratings for the three conditions ($1, $20, and control)
summary(aov(enjoy ~ group))
            Df  Sum Sq Mean Sq F value  Pr(>F)
group        2  35.733  17.867  3.8221 0.02769 *
Residuals   57 266.450   4.675
This is the standard ANOVA table that you will have been taught to make during
introductory statistics courses. You will probably have also been taught to calculate
the proportion of the total Sums of Squares (SS) accounted for by the model. The total
SS is 35.733 + 266.450 = 302.183. The ratio is 35.733/302.183 = .118. In ANOVA
terminology this is called η2 (eta squared) and it is a common measure of effect size
(and is related to the partial eta squared). Another effect size sometimes reported in
these situations is ω2 (omega squared). It is like η2 but adjusted to take into account the
complexity of the model. An equation for this is:
\[ \omega^2 = \frac{SS_b - (k-1)MS_e}{SS_{total} + MS_e} \]
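Plugging in the sums of squares from the ANOVA table above (a quick hand calculation in R, not a built-in function):
SSb <- 35.733; MSe <- 4.675; SStotal <- 302.183; k <- 3
(SSb - (k - 1)*MSe)/(SStotal + MSe)   # omega-squared, approximately .09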
The same model can be run as a regression with the lm function:
anova2 <- lm(enjoy ~ group)
summary(anova2)
Call:
lm(formula = enjoy ~ group)
Residuals:
Min 1Q Median 3Q Max
-5.35 -1.55 0.05 1.65 3.45
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 1.3500 0.4835 2.792 0.00711 **
group$20 -1.4000 0.6837 -2.048 0.04521 *
groupcontrol -1.8000 0.6837 -2.633 0.01088 *
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.'
0.1 ' ' 1
This output looks very different from the ANOVA output, but a couple of things look the
same. For example, the p value at the end is the same; the multiple R2 is the same as what
we calculated for η2 , and the degrees of freedom for both the model and the residuals
are the same. In fact, the model is the same, just the output looks different. The anova
function produces the ANOVA table from information stored in an lm.object.
anova(anova2)
Response: enjoy
Df Sum Sq Mean Sq F value Pr(>F)
group 2 35.733 17.867 3.8221 0.02769 *
Residuals 57 266.450 4.675
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.'
0.1 ' ' 1
This shows that ANOVA is just a way of describing the results from a regression. The
regression model being evaluated is:
enjoyi = β0 + β1 dummy1i + β2 dummy2i + ei
When the R functions aov and lm encounter a factor like group, they create dummy
variables. These are variables with the value 1 for one of the groups and 0 for all other
cases. When there are k categories, R creates k − 1 dummy variables. Here there are
3 categories so 2 dummy variables are created. The first has the value 1 for the $20
group and 0 for everyone else. The second has the value 1 for the control group and
0 for everyone else. The $1 group has the value 0 for both of these. It is called the
reference category. R has chosen it as the reference category simply because it is first
in its list. These are called contrasts and these are a major focus in books on ANOVA.
To find out how R plans to create dummy variables for a factor use the contrasts
function.
contrasts(group)
$20 control
$1 0 0
$20 1 0
control 0 1
The mean for the reference category will be the estimate for the intercept: 1.35.
The mean for the $20 group will be the intercept plus the estimate for this coefficient
(1.35 − 1.40 = −0.05). The statistics for this coefficient provide a test for whether this
group is different from the reference group. It is, t(57) = 2.05, p = .05. Similarly,
the mean for the control group is 1.35 − 1.80 = −0.45, t(57) = 2.63, p = .01.
This is slightly different from running t tests comparing groups and this can be seen
because the number of degrees of freedom is for the entire sample, not just the
two groups.
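You can verify these group means from the stored coefficients (a small sketch, using the anova2 object from the output above):
coef(anova2)[1]                     # reference ($1) group mean: 1.35
coef(anova2)[1] + coef(anova2)[2]   # $20 group mean: -0.05
coef(anova2)[1] + coef(anova2)[3]   # control group mean: -0.45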
You may want to change how R constructs contrasts. The contrasts function also
allows this to be done. Often you want to compare each group to the control group.
To do this, you need to construct a matrix like the one shown above but with zeroes for
the control row and assign this to the contrasts. The dim function says what dimensions
the contrast matrix has.
[,1] [,2]
$1 1 0
$20 0 1
control 0 0
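One way to construct and assign such a matrix (a sketch; the helper name newcon is ours) is:
newcon <- c(1,0,0, 0,1,0)    # the contrast values, filled column by column
dim(newcon) <- c(3,2)        # dim turns the vector into a 3 x 2 matrix
contrasts(group) <- newcon   # assign it as the contrasts for group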
Now when the regression is run the coefficients compare each group with the control
group.
summary(lm(enjoy~group))
Call:
lm(formula = enjoy ~ group)
Residuals:
Min 1Q Median 3Q Max
-5.35 -1.55 0.05 1.65 3.45
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) -0.4500 0.4835 -0.931 0.3559
group1 1.8000 0.6837 2.633 0.0109 *
group2 0.4000 0.6837 0.585 0.5608
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.'
0.1 ' ' 1
We can see the $1 group is different from the control group, t(57) = 2.63, p = .01,
which we knew from before when the $1 group was the reference category. We now
have a test comparing the control group and the $20 group, which is non-significant,
t(57) = 0.59, p = .56. Note that the overall F value and other statistics are all the
same. It is worth noting that we have done three tests between the different pairs of
groups. When you increase the number of statistics tests you increase the chances
of incorrectly rejecting a true null hypothesis (a type 1 error) and failing to reject
a false null hypothesis (a type 2 error). Therefore you should be cautious when
conducting many tests. Some books suggest controlling the chance of a type 1
error by requiring a lower p value, but this increases the chances of a type 2 error.
This has become a major concern for some modern statistical procedures in, for example,
bioinformatics and brain imaging studies, where you may have hundreds or thousands
of statistical tests.
R has many built in contrasts that you can choose from. This is a large and sometimes
complicated area. It is covered well in most books on ANOVA. It is less important
with the typical regressions because usually the predictor variables are not multi-valued
categorical variables. Sometimes they are ordinal. If you tell R that a variable is ordinal
by group <- as.ordered(group) then the default contrasts in R are polynomial
contrasts (see Chapter 7).
Children, aged 5 to 9 years old, participated in a magic show. Two weeks later they
took part in an exit interview where they were asked about the magic show. Then,
approximately ten months later, they were re-interviewed. London et al. expected most
children to recall less after the delay. They wanted to see whether the delay affected
children of different ages in similar ways. This can be looked at in several ways. First
we look at it by comparing the final amount recalled by different ages, and then we look
at it by the difference in the amount recalled between the two times. In Chapter 4 this
is looked at with a different statistical procedure, the ANCOVA, and then in Chapter 7
generalized additive models are used to explore these data.
The data file has just three variables (age in months, final score, and initial score) and
is stored in a text file. If it were on your computer, say in c:\temp, then the command
would create an object in R called lordex which could then be attached. Alter-
natively, because these data are on the book’s web page they can be directly
accessed with:
lordex <- read.table(paste(webreg,"lordex.dat",sep=""), header=T)
attach(lordex)
names(lordex)
Figure 3.2 Histograms of the untransformed Final and when it is transformed with the
square root, plus .5, transformation
We are going to compare the child’s age and their final recall so it is worth looking at
both of these variables. First we will look at Final. The left panel of Figure 3.2 shows
a histogram for Final. Here is the code for this where we first tell the computer that
we want one row with two graphs.
par(mfrow=c(1,2))
hist(Final,main="Untransformed Final")
It appears positively skewed. Following the procedures used in Chapter 1 we can calculate
skewness and the 95% confidence interval for skewness.
library(e1071)
library(boot)
skewness(Final)
[1] 0.6433149
skewboot <- boot(Final, function(x,i) skewness(x[i]), R=1000)
boot.ci(skewboot, type="bca")
CALL :
boot.ci(boot.out = skewboot, type = "bca")
Intervals :
Level BCa
95% (0.2317, 1.1750)
Calculations and Intervals on Original Scale
In Chapters 6 and 7 methods are described that could be used to analyze, directly,
the untransformed variable but for the methods described in this and the next chapter
it would be inappropriate to model a variable with this degree of skew. A variety of
transformations could be used to lessen this skew (Box & Cox, 1964). In the previous
chapter we used a transformation of the form ln(x + k) to transform the variable x. The
k is a small amount added to each value, sometimes called the starting value. The square
root transformation can also be used to lessen the skew of a variable. Here we tried the
square root (the R function sqrt) of Final + .5. The histogram in the right panel
of Figure 3.2 shows that this transformation reduced the skew.
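The commands that create and plot the transformed variable are not repeated here; they would be something like (the panel title is our assumption):
newfinal <- sqrt(Final + .5)
hist(newfinal, main="Transformed Final")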
skewness(newfinal)
[1] 0.03587077
CALL :
boot.ci(boot.out = skewboot, type = "bca")
Intervals :
Level BCa
95% (-0.3511, 0.4177)
Calculations and Intervals on Original Scale
Next we look at the age variable. When making the previous histograms we let R
choose how wide to have each of the bars (or bins, as they are sometimes called).
There is a lot written on how to choose the number of bars and their widths (Wand,
1997), and within R you can choose one of several algorithms which calculates these.
R, however, will not know that the variable AGEMOS should be divided into the culturally
defined unit of years. Therefore, we have to tell R to do this. We tell R to begin new bars
at 36, 48, 60, etc. The seq(36,120,12) means go from 36 to 120 by 12s (try typing
seq(36,120,12) in R). The resulting graph is Figure 3.3.
hist(AGEMOS, breaks=c(seq(36,120,12)),
xlab="Age in months")
Figure 3.3 A histogram with bin widths of 1 year for the age of children in London et al.
(in press)
Now we are ready to do our analyses and to use age to predict newfinal. The
different analyses will use age defined in four different ways (a sketch of the commands
that create these variables is given after the list).
1. Age in months (AGEMOS) as a numeric variable.
2. Age in years (AGEYRS) as a numeric variable. trunc means truncate, so 5.7 years
becomes 5 years; applied to the age in months divided by 12 it converts the variable into
age in years.
3. Age in years (AGEYRSfac) as a 6-category factor.
4. Age as a 2-category factor (AGE2). The cut function cuts a variable at the break points,
here at the points 0, 6.5, and 10. The 6.5 is used because it splits the children into the
4–6 year-olds and the 7–9 year-olds. This creates a binary variable for age.
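Not all of the commands that create these variables are shown in this version; a sketch, consistent with the output that follows (the exact as.factor and cut calls are our assumptions), is:
AGEYRS <- trunc(AGEMOS/12)           # age in whole years
AGEYRSfac <- as.factor(AGEYRS)       # age in years as a 6-category factor
AGE2 <- cut(AGEYRS, c(0, 6.5, 10))   # 4-6 year-olds versus 7-9 year-olds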
If we use the first two methods, these are simple linear regressions between newfinal
and the age variables. The resulting regressions are:
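The commands that fit these models are not repeated here; given the object names used later, they are presumably of the form:
reg1 <- lm(newfinal ~ AGEMOS)
reg2 <- lm(newfinal ~ AGEYRS)
summary(reg1)    # and summary(reg2) for the second set of output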
Call:
lm(formula = newfinal ~ AGEMOS)
Residuals:
Min 1Q Median 3Q Max
-1.10772 -0.36184 -0.03093 0.36593 1.24272
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 0.18100 0.32698 0.554 0.582
AGEMOS 0.01776 0.00368 4.825 1.40e-05 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.'
0.1 ' ' 1
and
Call:
lm(formula = newfinal ~ AGEYRS)
Residuals:
Min 1Q Median 3Q Max
-1.06293 -0.41094 -0.07864 0.40277 1.14544
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 0.2653 0.3284 0.808 0.423
AGEYRS 0.2150 0.0473 4.545 3.61e-05 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.'
0.1 ' ' 1
The outputs from these regressions are very similar to each other except that the
coefficient estimate for the AGEYRS regression is about 12 times the size of the one
found with the AGEMOS regression. This is expected because of the difference in scales
between years and months. These two models can be plotted on top of a scatterplot of
Figure 3.4 Scatterplots comparing the transformed variable for the amount recalled by
age in months. The left panel is the model with age treated as a numeric variable in
months (black line) and years (gray line). The right panel shows the models with age
treated as a factor (black line) and as a binary variable (gray line)
the data by using the predict and the lines functions.3 The predict function
calculates the predicted values for each of the models, and the lines function allows
these to be plotted on the scatterplot. In order for this to work the variables have to
be sorted according to the x axis variable. This variable is sorted with the command
sort(AGEMOS). The predicted values have to be in the same order. This is done by
telling R to place the predicted values (predict(reg1)) in the order of the age
variable ([order(AGEMOS)]). This has been done in the left panel of Figure 3.4. The
reg1 prediction lines are in black and the reg2 ones are in gray. It is important to
note that the scatterplot is in the transformed units. This is printed on the y axis with
ylab=expression(sqrt(Final + .5)). The black line treats age
in months and the gray line treats it in years. The lines are similar to each other,
although the gray line has a step pattern because all children in the same year band
have the same predicted value. The steps are all the same size. This is an assumption
of this model.
par(mfrow=c(1,2))
plot(AGEMOS,newfinal,ylab=expression(sqrt(Final + .5)),
xlab="Age in months")
lines(sort(AGEMOS),predict(reg1)[order(AGEMOS)],
col="black")
lines(sort(AGEMOS),predict(reg2)[order(AGEMOS)],
col="gray")
These analyses were repeated with age as a 6-category factor and a 2-category factor.
In these cases the lm function runs an ANOVA on the data. We first describe the numeric
output and then make graphs depicting the models (right panel of Figure 3.4).
3 The function abline only draws a straight line that spans the entire width of the plot.
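The command for the 6-category factor model, not repeated here, is presumably:
reg3 <- lm(newfinal ~ AGEYRSfac)
summary(reg3)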
Call:
lm(formula = newfinal ~ AGEYRSfac)
Residuals:
Min 1Q Median 3Q Max
-1.05537 -0.23195 0.07367 0.36799 1.15300
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 0.9391 0.2285 4.109 0.000166 ***
AGEYRSfac5 0.4863 0.3114 1.561 0.125437
AGEYRSfac6 0.7890 0.2891 2.730 0.009017 **
AGEYRSfac7 0.8234 0.2950 2.791 0.007681 **
AGEYRSfac8 0.8395 0.2950 2.846 0.006651 **
AGEYRSfac9 1.3325 0.2891 4.610 3.33e-05 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.'
0.1 ' ' 1
The reference category is the 4-year-olds, so their mean is 0.9391 (on the transformed
square root scale). The 5-year-olds had 0.9391 + 0.4863 = 1.4254, etc. The coefficient
estimates allow the means for each year to be calculated. To show these are correct:
tapply(newfinal,AGEYRSfac,mean)
4 5 6 7 8 9
0.9390518 1.4253104 1.7280871 1.7624776 1.7785640 2.2715376
anova(reg2,reg3)
and we see that the increased complexity of the model does not produce a fit that is
statistically significantly better, F(4, 45) = 0.79, p = .54, ηp2 = .07. The value for ηp2
(partial eta-squared) can be calculated from the above output: the difference between
the residual sums of squares (RSS) for the two models (15.0935 − 14.1003 = .9932)
divided by the RSS of the first model, 15.0935.
The final model uses the dichotomized variable AGE2. Sometimes it may be useful to
dichotomize a variable to present it graphically to friends in a bar, but seldom should it
be used outside of drinking establishments (MacCallum et al., 2002). We present it here
to illustrate what is implied by the dichotomized method. This could be run as a t test,
assuming equal variances, as below.
t.test(newfinal~AGE2,var.equal=T)
Alternatively, it can be run as a regression. The model being evaluated is the same, but
the output is in a different format.
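The command, not shown in this version, is presumably:
reg4 <- lm(newfinal ~ AGE2)
summary(reg4)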
Call:
lm(formula = newfinal ~ AGE2)
Residuals:
Min 1Q Median 3Q Max
-1.24235 -0.54566 -0.07863 0.41824 1.13275
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 1.4301 0.1266 11.300 2.98e-15 ***
AGE2(6.5,10] 0.5194 0.1708 3.041 0.00378 **
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.'
0.1 ' ' 1
The R2 now has dropped to 0.16. This means the model predicts the number of
utterances less well than the other models. The right panel of Figure 3.4 shows
reg3 and reg4. The black line is when the regression treats age as a 6-category
factor. The predicted number of utterances goes up in steps, but the steps can be of
different heights (unlike reg2 where we forced the steps all to be the same height).
Allowing this flexibility means the model has to estimate many more coefficients than
reg2. The increase in fit (the R2 going up slightly) does not justify this increase
in complexity. Chapter 7 covers methods to increase flexibility that do not increase
complexity as much. The model shown with the gray line is when age has been split
into two categories: young and old. The shape of this line shows a problem with the
model. The predicted value of utterances is the same for all the children aged between
4 and 6 years old and is the same for all the children above 6 years of age. It assumes
that there is some large leap in ability between the ages of 6 and 7. Perhaps there is,
but it would be necessary to argue both for the flat parts of the line within the groups
and the sudden shift in order to justify this model. If you had such a complex model
for the development of children’s memories, then the procedures in Chapter 7 would be
appropriate for testing it.
plot(AGEMOS,newfinal,ylab=expression(sqrt(Final + .5)),
xlab="Age in months")
lines(sort(AGEMOS),predict(reg3)[order(AGEMOS)],
col="black")
lines(sort(AGEMOS),predict(reg4)[order(AGEMOS)],
col="gray")
par(mfrow=c(1,1))
The difference variable, Diff, is negatively skewed (−0.81) and while in most situations you would transform it
(rank(Diff) is one possibility), here we will stay with the untransformed variable
because we want to compare the results to those in Chapter 4. The four regressions are
now re-run. To differentiate them from the previous ones they will be labeled dreg1 to
dreg4 (for Diff regression).
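The commands themselves are not reproduced in this version; given the labels, they presumably take the form:
Diff <- Final - Initial      # change in the amount recalled between the two interviews
dreg1 <- lm(Diff ~ AGEMOS)
dreg2 <- lm(Diff ~ AGEYRS)
dreg3 <- lm(Diff ~ AGEYRSfac)
dreg4 <- lm(Diff ~ AGE2)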
Figure 3.5 Scatterplots between the difference between the number of utterances at
the final interview and at the initial interview, with the child’s age in months. The left
panel shows the models for regression treating age in months (black line) and age in
years (gray line) as linear predictors. The right panel shows the models treating age as
a 6-category factor (black line) and with age dichotomized (gray line). Overall, these
show that as age increases the decrease in the amount recalled becomes larger
The summary function can be used to look at these. The coefficients all show that there
is a negative relationship between age and the difference variable. The following creates
Figure 3.5, which is the same as Figure 3.4 for these regressions.
par(mfrow=c(1,2))
plot(AGEMOS,Diff, ylab="Final - Initial",
xlab="Age in months")
lines(sort(AGEMOS),predict(dreg1)[order(AGEMOS)],
col="black")
lines(sort(AGEMOS),predict(dreg2)[order(AGEMOS)],
col="gray")
plot(AGEMOS,Diff,ylab="Final - Initial",
xlab="Age in months")
lines(sort(AGEMOS),predict(dreg3)[order(AGEMOS)],
col="black")
lines(sort(AGEMOS),predict(dreg4)[order(AGEMOS)],
col="gray")
par(mfrow=c(1,1))
The line for treating age as a 6-category factor is odd. This is because the mean
difference goes up slightly between 4 and 5, and between 5 and 6, and then drops a lot
between 6 and 7. The means for the difference between final and initial recall for each
year are:
tapply(Diff,AGEYRSfac,mean)
4 5 6 7 8 9
-2.666667 -1.714286 -1.100000 -7.000000 -6.333333 -6.100000
These results suggest that forgetting is greatest for older kids because the difference in
the amount recalled is greatest for them.4 Using the model with AGEMOS as the example,
the regression is:
summary(dreg1)
Call:
lm(formula = Diff ~ AGEMOS)
Residuals:
Min 1Q Median 3Q Max
-13.8051 -2.2868 0.6827 2.4205 7.6827
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 3.06123 2.43603 1.257 0.21484
AGEMOS -0.08537 0.02742 -3.113 0.00309 **
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.'
0.1 ' ' 1
So as the number of months goes up, the difference becomes more negative. From the
coefficients above, the equation is: Diffi = 3.06 − 0.09 AGEMOSi + ei.
SUMMARY
The purpose of this chapter was to allow you to get used to manipulating variables and
using them in regressions as predictors. The first example showed a one-way ANOVA
both with the aov function and the lm function. The two functions can test the same
model, although the format of the output is different. The two procedures, ANOVA and
regression, developed relatively separately and are often taught as if they are distinct
procedures. It is important to realize that seeing whether group means differ (the purpose
often described for ANOVA) is the same as seeing whether there is an association between
group membership and the means (the purpose often described for regression).
4 The word suggest is in italics because this model is based on a dubious assumption, discussed in Chapter 4.
In the second example the predictor variable was one that could be treated in different
ways. In R it is straightforward to manipulate a variable and move between numeric and
categorical (factor) variable types. The statistical purpose of this example was to stress
that information is usually lost when splitting variables into categories. Here, not much
is lost changing a variable from being measured in months to being measured in years
because, for this example, children's cognitive abilities do not differ hugely from month
to month. But the variable changed greatly when recoded into a young
versus old dichotomous variable.
R functions
• as.factor: reads data as a categorical variable;
• as.ordered: reads data as an ordinal variable;
• tapply: used to print statistics for different groups;
• aov: the function for an ANOVA;
• plot: makes lots of different kinds of figures;
• anova: compares two or more regression objects;
• contrasts: how to find or change contrasts;
• dim: how to find or change the dimensions of data;
• expression: used for writing mathematical expressions;
• cut: used to cut variables into categories;
• sort: sorts a variable from low to high;
• order: finds the order of a variable (useful with lines);
• predict: the predicted values from a regression.
Statistical concepts
• ANOVA: is a type of regression;
• dichotomization: is a poor scientific strategy and should be avoided.
FURTHER READING
The classic reference within psychology on ANOVA being a regression with categorical
variables is:
Cohen, J. (1968). Multiple regression as a general data-analytic system. Psychological Bulletin,
70, 426–443. Here is a good quotation from it:
If you should say to a mathematical statistician that you have discovered that linear
multiple regression analysis and the analysis of variance (and covariance) are identical
systems, he would mutter something like, ‘Of course—general linear model,’ and you might
have trouble maintaining his attention. If you should say this to a typical psychologist, you
would be met with incredulity, or worse. (p. 426)
Learning outcomes
1. Conducting and interpreting ANCOVA.
2. Multiple regression.
3. Producing a complex graph.
4. Writing functions in R.
model 1 yi = β0 + β1 covariatei + ei
and model 2 yi = β0 + β1 covariatei + β2 xi + ei
The difference between the fit of these two models shows how important xi is in predicting
yi after the covariate’s influence has been taken into account. This is often described as
the effect of xi on yi after partialling out the covariate.1
Two situations are used to present ANCOVA. The first is where you are trying to
measure change between two points in time and you want to see if two (or more)
groups differ. We will use the London et al. (in press) data set described in Chapter 3.
The ANCOVA produces a result that looks at odds with what we found in Chapter 3.
This is an example of Lord’s Paradox (1967). The second situation is where you
have experimentally manipulated one variable and you want to see if its effect on
another variable is due to its effect on a third variable. This is called mediation
analysis.
LORD’S PARADOX
Lord (1967) described a fictitious example where two statisticians, faced with the same
data set and the same basic research question, came to different answers. The research
question was whether there are group differences on some measure at time 2 after taking
into account values at time 1. The approaches the statisticians took were an ANOVA
on the differences between the scores (time 2 – time 1) and an ANCOVA on time
2 scores partialling out time 1 scores. These are the two most popular approaches that
psychologists continue to use in this situation. Because both of these approaches appear
to address the same substantive question and yet can produce different conclusions, this
phenomenon has become known as Lord’s Paradox.
Much has been written on Lord’s Paradox in the statistics literature (e.g., Wainer,
1991). Hand (1994) described how Lord’s Paradox is only paradoxical because people
are not precise enough about their hypotheses. Thus, if London et al.'s (in press)
hypothesis was about whether the ages differed in the difference in the amount recalled
at the two points in time, then this would translate into subtracting the amount recalled
at the initial interview from the amount recalled at the final interview, and seeing
whether this difference is associated with the child’s age. This, the ANOVA method,
was advocated by one of Lord’s (1967) statisticians, and is what was done in Chapter 3.
Lord’s other statistician suggested an ANCOVA with the scores from the initial
interview partialled out. Formal comparison of the ANOVA on the change scores
and the ANCOVA have led many statisticians to argue that if you are interested
in whether the grouping variable is causing a difference in the response variable at
time 2, then the ANCOVA approach is usually, but not always, preferred (Wainer, 1991;
Wright, 2006b).
The two approaches can be written as follows:
1 Several other phrases are used for this. ‘Covarying out the covariate’ has the same meaning, but most
dictionaries will not have the word ‘covarying’ in them. ‘After taking the covariate into account’ is okay,
but it does not tell you how the covariate has been taken into account. ‘After controlling for the covariate’
suggests that the researcher has done something to the covariate, and a common reason for doing an ANCOVA
is that the researcher cannot manipulate the covariate. ‘After partialling out the covariate’ is not a perfect
phrase, but it seems the best of these.
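The two equations are not reproduced in this version; from the discussion that follows, they are presumably of the form:
ANOVA approach:   time2i − time1i = β0 + β1 xi + ei   (the coefficient on time1i is fixed at 1)
ANCOVA approach:  time2i = β0 + β1 xi + β2 time1i + ei   (β2 is estimated)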
Viewed in this way it is clear that in one sense (at an algorithmic level) the only difference
between the approaches is that with the ANOVA approach β2 is assumed to be 1, and
with the ANCOVA approach it is estimated. While these are often referred to as the
ANOVA and ANCOVA approaches, as shown in Chapter 3 and repeated here, both of
these approaches can be modeled with the lm function.
In Chapter 3 the number of utterances at the initial interview was subtracted from the
number at the final interview and a regression was run on this difference.
The ANCOVA approach involves predicting the Final from Initial and seeing
whether the age variable is able to help predict Final after partialling out the
Initial scores. We will use the variable Final rather than the transformed variable.
This allows comparisons with the methods used in Chapter 3 and those used later
in Chapter 7. Using either AGEMOS (age in months) or AGEYRS (age in years)
produces nearly identical results. We will use the AGEMOS variable for the regressions,
but use both of them in constructing a graph. Notice that all of these variables are
numeric.
ANCOVAs should be conducted in three steps. First you should see how well the
covariate (or covariates) predicts the response variable. Most of the time the covariate
is associated with the response variable, but this is not always the case. Next you
should see whether adding the other predictor variable increases the fit of the model.
Finally, you should look at the interaction between this variable and the covariate. This
tests whether any effect of the predictor variable depends on the value of the covariate.
You may have several covariates and several other predictor variables, and interactions
among all of these. If you have several covariates and predictor variables each of these
steps will have several parts and you should proceed carefully entering variables (see
Chapter 5 for related issues). These three steps are done with the following commands
(Initial*AGEMOS includes the interaction of Initial and AGEMOS and both of
these variables’ main effects). The summary function is used with each to extract
important statistics.
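The three commands, not repeated here, are presumably:
lm1 <- lm(Final ~ Initial)
lm2 <- lm(Final ~ Initial + AGEMOS)
lm3 <- lm(Final ~ Initial * AGEMOS)
summary(lm1)    # and likewise summary(lm2) and summary(lm3)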
Call:
lm(formula = Final ~ Initial)
Residuals:
Min 1Q Median 3Q Max
-2.6985 -1.7312 -0.2614 1.1294 5.5201
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 1.29404 0.50690 2.553 0.013850 *
Initial 0.21859 0.05818 3.757 0.000457 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Call:
lm(formula = Final ~ Initial + AGEMOS)
Residuals:
Min 1Q Median 3Q Max
-3.4304 -0.8374 -0.4666 0.7103 5.1568
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) -1.72927 1.21740 -1.420 0.16193
Initial 0.10525 0.06901 1.525 0.13379
AGEMOS 0.04441 0.01645 2.699 0.00956 **
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Call:
lm(formula = Final ~ Initial * AGEMOS)
Residuals:
Min 1Q Median 3Q Max
-3.4185 -0.7978 -0.4959 0.7168 5.1251
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) -1.9960325 2.2986626 -0.868 0.390
Initial 0.1476360 0.3163338 0.467 0.643
AGEMOS 0.0478716 0.0301855 1.586 0.119
Initial:AGEMOS -0.0004931 0.0035896 -0.137 0.891
There are several measures that can be used to compare the fit of these models, and
there is more discussion of these in Chapter 5. For now, see that the R2 increases from 0.2237
to 0.3260 to 0.3263 from lm1 to lm2 to lm3. So, the increase is substantial when the
main effect of AGEMOS is added, but is minimal when the interaction is added. Hypothesis
tests of these differences are calculated with the anova function.
anova(lm1,lm2,lm3)
ηp2 values can be calculated from the RSS values, so ηp2 for AGEMOS partialling out
Initial is (27.222)/206.531 = .13. The first comparison shows that model 2 is a
significantly better fit than model 1 (F(1, 48) = 7.14, p = .01, ηp2 = .13). The second
comparison shows that the interaction between AGEMOS and Initial for predicting
Final is non-significant (F(1, 47) = 0.02, p = .89, ηp2 = .00). Thus, lm2 looks the
best of these models. From the output above, the regression equation is:
Finali = −1.73 + 0.11 Initiali + 0.04 AGEMOSi + ei.
This is the basic ANCOVA. This is exciting (for us at least). The coefficient for age
is positive. It shows that as age increases so does the predicted value for recall at the
final interview. This is in the opposite direction than that suggested by the analysis in
Chapter 3. This is an example of Lord’s Paradox.
Some graphs are necessary to understand any data set. Here the lattice library
(Sarkar, 2008) is used to make trellis graphs. Many statisticians think trellis graphs are
incredibly useful. A very simple one is done in Figure 4.1. It draws the scatterplot for
each age-year group.
Figure 4.1 A trellis plot of the relationship between Final and Initial scores for
the six age groups
library(lattice)
AGEYRS <- trunc(AGEMOS/12)
scatmat <- xyplot(Final~Initial | as.factor(AGEYRS))
print(scatmat)
From this graph it is clear that there is less variability for some ages than others. Here
are the standard deviations for each age group.
4 5 6 7 8 9
3.311596 2.299068 2.998148 6.964194 2.147350 3.314949
4 5 6 7 8 9
0.836660 1.772811 2.043961 2.500000 1.201850 2.538591
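The commands that produce these are not shown; presumably they are tapply calls of the form below (which row of output corresponds to the initial and which to the final interview is not indicated above):
tapply(Initial, AGEYRS, sd)
tapply(Final, AGEYRS, sd)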
For many real world variables (like income and reaction times, etc.) it is expected that
as the mean increases so does the standard deviation. As the means for both of these
variables increase with age you would expect the standard deviations to do this also.
This does not happen for the 8 year-olds, so we might expect any regressions for this
group to be unreliable.
More complex figures can also be drawn, and an example is shown in Figure 4.2.
This is included to show some of the possibilities in R. This shows both the ANOVA
on the differences from Chapter 3 and the ANCOVA from this chapter. Complex graphs
like this can take a long time to make, but it does get quicker once you get used to
whichever package you use. You probably do not want to spend days on this graph,
so just copy the code below to see what you get (the code is on the book's web
page). This is a fairly detailed example, so it is included for you to look through and as an
illustration.
The split command creates new variables that are divided into different parts
for each age. So, sexit[[1]] are the initial scores for the youngest year group
and sexit[[2]] are the values for the second youngest group. predmod4 are
the predicted values for the model which allows the slopes to vary for each year
group. If you want parallel slopes replace the Initial*as.factor(AGEYRS) with
Initial+as.factor(AGEYRS).
sexit <- split(Initial,AGEYRS)
sfoll <- split(Final,AGEYRS)
predmod4 <- split(lm(Final~Initial*as.factor(AGEYRS))
$fitted.values,AGEYRS)
plot(Initial,Final,pch=19,cex=.5,col="black",
xlab="Initial interview", ylab="Final interview", xlim=c(0,25),
ylim=c(0,10), cex.lab=1.3, las=1, font.lab=1.5)
When the function abline is given two numbers it treats them as the intercept and slope
of a line y = a + bx. Here the line drawn is y = 0 + 1x, which is a diagonal line through
the origin. This shows where the values of Initial equal those of Final. Next, we
want to draw a line for each of the six age years. The code for (i in 1:6) tells
the computer that you want it to run the remainder of the line (or whatever is included
within { } if it is a longer set of commands) six times and put the numbers 1 to 6
in where i is for each one. So it runs lines(sexit[[1]],predmod4[[1]],
col = "black") and then lines(sexit[[2]],predmod4[[2]], col =
"black"), and so on up to 6. The for function is often used in R to save having
to write similar sets of commands over and over. We have used the lines command
rather than abline because we wanted the lines to only go as far as the observed data
in each group (i.e., not to extrapolate beyond the data). It is worth noting that we have
not worried about using the sort or order functions here as we did in Chapter 3. This
is because the lines are straight. For regressions covered in later chapters sorting would
be necessary.
abline(0,1,lty = 3)
for (i in 1:6) lines(sexit[[i]],predmod4[[i]],
col="black")
The following calculates the year means and then connects them with a thick (lwd=4,
lwd stands for line width and the default is 1) "gray" line. The lend="round" tells R
to make rounded corners where the lines change directions. This option is only necessary
when thick lines are used.
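The commands that calculate and join the age-group means are not reproduced here; given the object names meanexit and meanfoll used below, they presumably take a form like:
meanexit <- tapply(Initial, AGEYRS, mean)   # mean initial recall for each year group
meanfoll <- tapply(Final, AGEYRS, mean)     # mean final recall for each year group
lines(meanexit, meanfoll, col="gray", lwd=4, lend="round")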
Adding text to a graph is important so it is worth repeating how this is done. In R you put in
the x and y coordinates, then in " " you put the text. The argument pos defines whether
it is left, right, top, or bottom justified. Some trial and error is usually required to get the
text where you want it. If you want to include values of variables or odd characters use
the paste or expression functions. These are illustrated in many graphs throughout
this book. The numbers in the arrows function are for the start and the end of the arrow,
length is for the arrow head. You can put in multiple elements in the text command,
as below, just make sure there are the same number of x and y values and objects to plot.
The \n in the text functions tells R to put in a line break.
text(12,10,"Final = Initial",pos=4)
arrows(12,10,10,10,length=.1)
text(c(13,21,12.1,0,0,4.4),c(5.5,4.5,2.2,3.7,1.5,.5),
c("9","7","8","6","5","4"))
points(meanexit,meanfoll,pch=19)
lines(c(14,16.7), c(9.2,9.2),col="gray",lwd=4)
text(16.7,9.2,"Age group means", pos=4)
text(14,8.5,
"Means increase with age\nfor both interviews",pos=4)
text(-.3,9.5,
"Intercepts higher for\nolder children\n (Ancova)",pos=4)
text(14,1,
"Older children further away\nfrom dashed line (Anova).",
pos=4)
What this shows is that for all the groups except for 8-year olds there is a positive
relationship between scores on the two interviews (i.e., all the thin black lines except
the one for 8-year olds have positive slopes). The basic ANCOVA approach, which
assumes the lines are parallel, seems adequate. Further, the intercepts for all these
groups tend to be higher for the older children. This is what the ANCOVA is testing: for
any initial interview score the predicted score at the final interview is higher for older
children.
For the ANOVA approach in Chapter 3, the estimate for the age coefficient was
negative, meaning as age increases the detriment from increased elapsed time also
increases. In other words, the average decrease in the amount reported is larger for
the older children than for the younger children. This is shown by the means for the
older age groups being further away from the dashed line, Finali = Initiali , made with
abline(0,1,lty = 3) in Figure 4.2. With the ANCOVA approach, conducted
Figure 4.2 A scatterplot showing the relationships between final and initial recall for
the 6 age groups from London et al. (in press)
in this chapter, the estimates for age are positive. This means that, controlling for
initial scores, the older children have higher predicted final scores. As these appear
to be the opposite conclusions, it logically must mean that they address different
questions.
Let’s imagine two scenarios of forensic importance which relate to these data. Suppose
a crime has occurred and there are two witnesses: a 5- and an 8-year old. For the first
scenario, suppose that you have limited resources and that you can interview one of the
children soon after the crime, but you have to wait ten months to interview the other child.
You want to get the most information possible from the combination of the two children.
Which child should you interview first? Because the ANOVA says that the decrease
is greatest for older children you would interview the 8-year old first, and wait for
the 5-year old. It is as if the younger child’s recall is less affected by the passage
of time.
For the second scenario, suppose that the two children were both interviewed soon
after the crime and recalled the same amount at the initial interview. One year later you
want to call one of the children to the witness stand during the trial. The ANCOVA
approach showed that, controlling for initial recall, older children do better. Thus, you
would interview the older child. It is as if the older child’s recall is less affected by the
passage of time.
As descriptions of the data, both of these approaches are valid, but they apply
to different scenarios. Each of Lord's original statisticians was correct, but for
different situations. These situations are both fictional, and while we can imagine
situations like them, most scientists will be interested in the causal inference suggested
by each. Because the conclusions are contradictory, both cannot be valid. These
data provide a striking example of Lord’s Paradox. Sometimes the choice of a
statistical procedure may determine which side of the magical p = .05 a solution
may lie on. But here the two methods give large statistically significant effects in
opposite directions! Taken at face value, the ANOVA approach suggests that delay
exerts a more detrimental effect on older children's reports, while the ANCOVA
approach suggests the opposite. Which analytic method accounts more accurately for
these data?
When Lord (1967) introduced this paradox he did not say which approach was
generally preferred. It was not until Rubin’s (1974) model of causal inference was
applied to Lord’s Paradox (Holland & Rubin, 1983) that it was shown when each
of these approaches is more likely to be valid for causal inference. Rubin’s model
is usually described in relation to experimental groups and causation, but it can
also be used when the interest is in a quasi-experimental group, like age (Wainer,
1991). When the variable is quasi-experimental, interest is usually in an association
rather than a cause. Wainer and others have argued that for the ANOVA approach to
be appropriate you have to be able to assume that, as time progresses, people will
recall the same amount of information. The ANCOVA approach does not make this
assumption, so is preferred for inference here since we expect people to remember
less as time elapses (as evident from 125 years of memory research). The ANCOVA
assumes the amount recalled goes down linearly with time. From memory research
we know that the memory decay function is more complex, and this complexity could
be built into the analysis, but for most purposes the linear decay assumption of the
standard ANCOVA is adequate. A more flexible way of exploring these data is discussed
in Chapter 7.
Answer: Usually the ANCOVA approach is preferred to running an ANOVA on the
change or difference scores. See Wainer (1991) and Wright (2006a) for more details.
An alternative way to argue against the ANOVA method, here, is by floor effects
and ‘levels of measurement.’ The floor effects argument is simple. As can be seen in
Figure 4.2, the younger children cannot recall that much less at the second testing than
at the earlier testing because they did not recall that much at the first testing. If you only
recall 1 thing at the first testing, you can only recall 1 less at the 10 month testing. This
is a possible floor effect. There is controversy surrounding how and whether to use 'levels of
measurement' when choosing statistical procedures (Lord, 1953; Velleman & Wilkinson,
1993; Wright & London, 2009). If levels of measurement is taken at face value, and
ANOVA is used, the implicit assumption is that the difference between recalling 2 items at
the initial testing and 1 at the 10 month testing is the same amount (in some psychological
sense) as the difference between recalling 12 and 11 items. This does not seem valid.
Another option would be to use the ratio of these numbers (or other transformations), so
equating a drop from 2 to 1 as the same as from 12 to 6, but this also presents problems.
The ANCOVA model implies a more complex relationship, which is more realistic here.
It is easy to use ‘levels of measurement’ to argue against doing a particular test, but it
is hard to use it to argue for any test. This is one reason why we urge researchers not
to treat ‘levels of measurement’ as a restrictive doctrine, but more as a guide (Wright &
London, 2009).
set.seed(143)                      # so the simulated data can be reproduced
leaflet <- rep(c(0,1),each=50)     # 0 = no leaflet sent, 1 = leaflet sent
fairskin <- rbinom(100,1,.5)       # fair skinned (1) or not (0)
likely <- rbinom(100,10,.20 + .2*leaflet + .2*fairskin)   # believed likelihood of skin cancer (0-10)
plan <- rbinom(100,7,likely/15 + leaflet*.2)              # plans to use sun block (0-7)
We have two research questions. First, does the leaflet we sent out increase the
likelihood that people plan to use sun block? Second, if there is an effect, is any part of
this effect due to increasing how likely it is that people think that they will get cancer? The
first question can be addressed easily with a standard regression. The second question
is more complex and requires that we run a series of regressions, each asking a different
question. These can be done directly with the lm function. Here they are with some
of the relevant output. The summary function on its own would have produced more
output. Including the $coef at the end means just the part relevant to the coefficients
is printed.
summary(lm(likely~leaflet))$coef
summary(lm(plan~leaflet))$coef
summary(lm(plan~likely))$coef
summary(lm(plan~likely+leaflet))$coef
The first regression (which is the same as a t test) shows that the leaflet increases
the likelihood that people think they will get skin cancer (t(98) = 5.23, p < .001)
with the mean increasing from 3.00 to 4.92. The second (also a t test) shows that the
leaflet increases people’s plans to use sun block (t(98) = 7.83, p < .001) with the mean
increasing from 1.60 to 3.68. The third regression shows that the variables likely and
plan are highly associated (we are careful to avoid causal terms here: t(98) = 10.41,
p < .001). The final model shows that after controlling for people’s believed likelihood of
getting cancer, the leaflet still had an effect (t(97) = 5.14, p < .001). This is the standard
ANCOVA and means that at least part of the effect of the leaflet cannot be accounted
for by changing the belief about the likelihood of getting cancer. The question is: can
any of the leaflet effect be accounted for by this belief. In statistical jargon, is likely
a partial mediator of the leaflet effect? The most common test of this is the Sobel
test (MacKinnon et al., 2007). If we let a be the effect of leaflet on likely (1.92,
se = .37), b be the effect of likely on plan after taking into account leaflet (0.45,
se = .06), then the effect associated with the path from leaflet through likely to
plan is a·b, with a standard error of the square root of (sea²·b² + seb²·a² + sea²·seb²).2
The Sobel test is the ratio of these and is usually assumed to be normally distributed.
Here this is:
(1.92*.45)/sqrt(.37^2*.45^2 + .06^2*1.92^2 + .37^2*.06^2)
4.24
It is always a problem calculating statistics in this manner because you can make
typing errors, there will be rounding errors (in the equation we square .06 which gives
us .0036, but because the real value is closer to .057, a better square would be .0032, a
10% difference), and it is not fun typing lots of numbers. So, you could write a function
for this. A simple example is:
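The definition of sobel1 is not shown in this version; a minimal sketch of such a function, using the formula above, might be:
sobel1 <- function(a, b, sea, seb) {
   # Sobel's z: the mediated effect a*b divided by its standard error
   (a*b)/sqrt(sea^2*b^2 + seb^2*a^2 + sea^2*seb^2)
}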
sobel1(1.92,.45,.37,.06)
This still includes rounding errors, so increasing the precision by one digit we get:
sobel1(1.920,.454,.367,.057)
2 The sea²·seb² term is not included in some sources because it is usually very small. Some researchers suggest
subtracting it rather than adding it, but given that it is usually very small, detailed discussion of this is not
warranted.
While this saves one calculation, we would still have to do all the regressions, and if we
still thought p values were of use, we would need to look that up. We might even want to
draw a graph of the effects. This suggests that it might be worth writing a function that
does all this. A function that does this is shown in the box below. You can copy it from
there, or run it with the following command which reads it from the book’s web page:
source("http://www.sagepub.co.uk//wrightandlondon//mediator.R")
This function does what is often called simple mediation. Type mediator with the
name of the experimental variable, the outcome variable, and the mediator variable in
order. For example:
mediator(leaflet,plan,likely)
The output is Sobel's z and the associated p value. Notice that this new figure, z = 4.34, is
very similar to the number found earlier. The difference is that more precise values are
used here so there is less rounding error. The function also produces the graph shown in
Figure 4.3.
The mediator function is shown in the box below. As with the other figures in
this book, to reproduce the figure exactly may require changing the size of the graphics
window. This function is for simple mediator analyses. Variations can be added to it,
for example using bootstrap procedures rather than relying on the asymptotic p value,
which is known to have a fairly large error associated with it (so should be viewed as
the approximate p value). Only two new functions are used: rect and invisible.
rect draws a rectangle if you provide it with the four values that define two opposite
corners of the rectangle. Here it is used to make multiple rectangles. The invisible
function is used so that the mediator function does not print Sobel's z and its associated
p value to the screen to many decimal points, but these values are saved if you type
values <- mediator(x,y,z).
Figure 4.3 The graphical output from the mediator function. It shows that M is a
partial mediator of the effect of X on Y
Here is the mediator function:
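The code for the function is not reproduced in this version. As a rough sketch of what a simple-mediation function of this kind might look like (the name mediator_sketch, the argument names, and the layout of the diagram are our own hypothetical choices; the book's own mediator function can be read in with the source command given above):
mediator_sketch <- function(x, y, med) {
   # the regressions needed for simple mediation
   a.reg <- summary(lm(med ~ x))$coef       # x predicting the mediator
   b.reg <- summary(lm(y ~ med + x))$coef   # mediator and x predicting y
   tot.reg <- summary(lm(y ~ x))$coef       # total effect of x on y
   a <- a.reg[2,1];  sea <- a.reg[2,2]
   b <- b.reg[2,1];  seb <- b.reg[2,2]
   direct <- b.reg[3,1]
   total <- tot.reg[2,1]
   # Sobel's z and its asymptotic p value
   sobelz <- (a*b)/sqrt(sea^2*b^2 + seb^2*a^2 + sea^2*seb^2)
   pval <- 2*pnorm(-abs(sobelz))
   # a rough path diagram in the style of Figure 4.3
   plot(c(0,10), c(0,10), type="n", axes=FALSE, xlab="", ylab="")
   rect(0,0,2,2); rect(8,0,10,2); rect(4,7,6,9)   # boxes for X, Y and M
   text(c(1,9,5), c(1,1,8), c("X","Y","M"))
   arrows(2,1,8,1, length=.1)                     # X to Y (direct effect)
   arrows(2,2,4,7, length=.1); arrows(6,7,8,2, length=.1)
   text(5,1.5, format(direct, digits=3))
   text(2.5,5, format(a, digits=3)); text(7.5,5, format(b, digits=3))
   # return the key statistics without printing them to the screen
   invisible(c(Sobel.z=sobelz, p.value=pval, total=total, direct=direct))
}
Typing mediator_sketch(leaflet, plan, likely) would then draw a diagram similar to Figure 4.3 and invisibly return Sobel's z and its p value.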
SUMMARY
ANCOVA is often taught as an extension to ANOVA. This is a shame because it is
really just a multiple regression where there is an emphasis on the order in which the
predictor variables are input into the model. If it were taught as a way to introduce multiple
regression it would make the similarity between ANCOVA and regression clearer, would
allow people to realize that categorical and/or continuous variables could be used with
either the covariates or the predictor variables, and it would make people doing multiple
regressions more careful about how they describe their results. When marking multiple
regression assignments, one of the most annoying things is when it is not clear that the
person understands that the effect estimated is conditional on all of the other predictor
variables. This will be discussed further in Chapter 5.
There are several different examples of ANCOVA that could have been used to
illustrate the technique. We chose to show an example of Lord's Paradox, because
it stresses the importance of choosing the best statistical procedure, and mediation
analysis, because this is often required when trying to tease apart different effects.
R functions
• xyplot: scatterplots for groups in the lattice library;
• for (i in 1:k): a loop in R;
• \n: line breaks in the text function;
• arrows: prints arrows on graphs;
• $coef: how to access regression coefficients;
• format: controls the number of digits printed;
• source: accesses functions;
• rect: draws rectangles;
• invisible: so returned values are not printed to the screen.
Statistical concepts
• ANCOVA: is a type of multiple regression;
• Lord’s Paradox: when looking at change, usually use ANCOVA;
• Mediation analysis: to test if an effect is due to a mediating variable.
FURTHER READING
The key reference on Lord’s Paradox, and an enjoyable read, is:
Lord, F. M. (1967). A paradox in the interpretation of group comparisons. Psychological Bulletin,
68, 304–305.
Learning outcomes
1. Different methods for selecting a simpler model from a large set of predictors;
• best subset regression
• ridge regression
• the lasso
• principal component regression and partial least squares regression.
2. Deciding how complex a model needs to be to account for the data adequately.
The purpose of this chapter is to describe some techniques for choosing the best set of
predictors (and how large the estimated coefficients should be) for a multiple regression.
This is a classic problem within psychology. It has received much recent attention in
statistics because it is also an important area in data mining and bioinformatics (two hot
areas in statistics). While in data mining and bioinformatics you may have hundreds of
predictor variables, in psychology you usually have no more than about ten. Further,
in data mining and bioinformatics it is often the case that you have no theory that
individuates each of these variables, so you need to use some method to help find a
good set of these. In psychology you usually do have some theory about the individual
predictors. An exception may be education where you may have dozens (or hundreds)
of questions on the typical exam, but item response methods (a form of latent variable
model) are well established and cover this (Bartholomew et al., 2002; Embretson &
Reise, 2000). The techniques described in this section are for exploratory analysis. If
you have some particular set of theories which describes how the variables may relate,
ANCOVA-type methods described in Chapter 4 are probably better suited.
The techniques described in this chapter are designed to examine only main effects.
While there are procedures which also search for interactions among predictor variables,
it is unlikely that many psychology data sets will require searching for interactions
in an exploratory fashion so we do not cover these. It is important to realize that
these techniques are only important when predictor variables are correlated among
themselves (i.e., there is collinearity). If your predictor variables are uncorrelated (like
when they are experimental factors) or have small correlations, then model selection is
less problematic.
1 Box and Cox (1964) recommend examining a series of transformations of the form:
newX = ((oldX + s)^k − 1)/k for k ≠ 0, and newX = log(oldX + s) for k = 0,
where k is the power the variable is taken to, and s is the starting value (usually used to prevent small values
for the original variable becoming large negative values). When k = 0 it has a special form so that the
transformation is continuous. The transformation used here is one of these (k = 0, s = 1). If you wanted to
use the Box-Cox transformation, the box.cox (or bc) function from the car library (Fox, 2002, 2008) will
do this: bc(x,0,1) is the same as log(x+1). Fox (2002) describes functions for searching for values of s
and k which are optimal in one sense (like being the most normally distributed).
Because there are a lot of zeroes in the original variable, there are also a lot of zeroes
in this transformed variable (i.e., log(0 + 1) = 0). This section is divided into four
additional parts: one for each of the techniques. The data are first loaded and the
response variable is transformed with log(x + 1).
A new object with all the predictor variables is created below. This saves having to
re-type all the predictor variables each time that you want to put them all in a model.
cbind means column bind (and rbind means row bind; c will not work because it
would combine all the numbers into one long variable). After doing this, if you want to
refer to all the predictors you just type preds. If you want to refer to a subset of these,
say the first four, you can type preds[,1:4].
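The command is not shown in this version; from the variable names in the correlation matrix below it is presumably:
preds <- cbind(OVER2, OVER3, OVER5, BOND, POSIT, NEG, CONTR, SUP, CONS, AFF)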
The procedures described in this chapter are necessary when the predictor variables
are correlated with each other, which is called collinearity. To examine the correlations we create a
correlation matrix, ptsdcorrmat, and then print it. The option digits=2 tells R that
the minimum number of non-zero leading digits to print in any column is two. If you
do not use this option, or just type cor(preds), the printed correlation matrix has far
more digits for each correlation than is appropriate.
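The commands themselves are not shown; presumably something like:
ptsdcorrmat <- cor(preds)
print(ptsdcorrmat, digits=2)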
OVER2 OVER3 OVER5 BOND POSIT NEG CONTR SUP CONS AFF
OVER2 1.0000 0.4980 -0.15 -0.035 -0.22 0.528 -0.278 -0.3943 0.0088 -0.126
OVER3 0.4980 1.0000 -0.15 0.138 -0.22 0.437 -0.045 -0.3009 -0.0075 0.070
OVER5 -0.1469 -0.1541 1.00 0.189 0.54 -0.208 0.265 0.2551 0.2443 0.197
BOND -0.0346 0.1376 0.19 1.000 0.29 -0.127 0.149 0.0853 0.2505 0.077
POSIT -0.2193 -0.2182 0.54 0.289 1.00 -0.176 0.498 0.4676 0.2618 0.407
NEG 0.5276 0.4369 -0.21 -0.127 -0.18 1.000 -0.283 -0.2233 0.1128 -0.053
CONTR -0.2783 -0.0454 0.26 0.149 0.50 -0.283 1.000 0.5227 0.1132 0.207
SUP -0.3943 -0.3009 0.26 0.085 0.47 -0.223 0.523 1.0000 0.0093 0.153
CONS 0.0088 -0.0075 0.24 0.250 0.26 0.113 0.113 0.0093 1.0000 0.523
AFF -0.1262 0.0703 0.20 0.077 0.41 -0.053 0.207 0.1532 0.5232 1.000
When cor creates a correlation matrix it makes it a square matrix. It has the
same values on the upper triangle of the matrix as on the lower triangle.2 So, the
correlation of OVER2 with OVER3 is the same as OVER3 with OVER2 (both 0.4980).
2 A matrix can be divided into 3 parts: the diagonal (which are the 1.0000s in this matrix), the upper triangle
which is the numbers above and to the right of the diagonal, and the lower triangle which is the numbers below
and to the left of the diagonal.
Also, the value 1.000 is printed on the diagonal. This is because the correlation between
any variable and itself will always be 1.0000. This means some of the information
above is repeated (the correlations on half of the table) and some is unnecessary (the
1.0000s on the diagonal). To make the table more useful you can print more information
in the table, like replacing the 1.0000 on the diagonal with the standard deviations
for each variable (found by typing sd(preds)) and replacing one of the triangles
with another measure of association. For example, Spearman’s correlation is found
by cor(preds,method="spearman"), where spearman is not capitalized.
Alternatively, the confidence intervals or p values, found with the cor.test function
for each pair of variables, could be printed. These functions are illustrated below:
print(sd(preds),digits=2)
OVER2 OVER3 OVER5 BOND POSIT NEG CONTR SUP CONS AFF
3.3 3.1 1.3 3.1 11.0 11.6 14.8 5.9 11.1 3.1
cor.test(OVER2,OVER3)
These values could all be painstakingly re-typed, but luckily R allows the matrices to
be added.
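The command that builds the combined matrix is not shown in this version; presumably it looks something like the following (note that in recent versions of R, sd no longer works on a matrix, so apply(preds, 2, sd) would be needed instead):
ptsdmat <- cor(preds)                       # Pearson correlations
spmat <- cor(preds, method="spearman")      # Spearman correlations
new <- diag(sd(preds)) + upper.tri(spmat)*spmat + lower.tri(ptsdmat)*ptsdmat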
The diag(sd(preds)) says to put the standard deviations on the diagonal. The
upper.tri(spmat)*spmat says to take all values in the upper triangle of spmat
(Spearman’s ρs) and put them into the new matrix, and the lower.tri(ptsdmat)*
ptsdmat does the same for the lower triangle and Pearson’s r. Here is the new matrix:
print(new,digits=2)
OVER2 OVER3 OVER5 BOND POSIT NEG CONTR SUP CONS AFF
OVER2 3.3389 0.5420 -0.25 0.036 -0.22 0.479 -0.312 -0.5444 0.0618 -0.0261
OVER3 0.4980 3.1344 -0.23 0.139 -0.14 0.414 -0.057 -0.3536 0.1098 0.0480
OVER5 -0.1469 -0.1541 1.34 0.261 0.43 -0.160 0.230 0.3568 0.1648 0.0953
BOND -0.0346 0.1376 0.19 3.065 0.29 -0.096 0.155 0.0608 0.3159 0.1061
POSIT -0.2193 -0.2182 0.54 0.289 11.02 -0.229 0.537 0.4818 0.2592 0.3395
NEG 0.5276 0.4369 -0.21 -0.127 -0.18 11.601 -0.362 -0.2958 0.1053 -0.0016
CONTR -0.2783 -0.0454 0.26 0.149 0.50 -0.283 14.841 0.5231 0.1484 0.1119
SUP -0.3943 -0.3009 0.26 0.085 0.47 -0.223 0.523 5.8654 -0.0053 0.1022
CONS 0.0088 -0.0075 0.24 0.250 0.26 0.113 0.113 0.0093 11.1466 0.4478
AFF -0.1262 0.0703 0.20 0.077 0.41 -0.053 0.207 0.1532 0.5232 3.0676
To make the table more useful, correlations above .3 in magnitude have been
underlined. From the number of underlines it is clear that many of the predictor variables
are correlated, and therefore much care is needed when interpreting the individual
regression coefficients. The relationships between these pairs of variables should also be
examined graphically. There is a function, scatterplot.matrix (or spm for short),
in the library car (Fox, 2008). You will need to install this package before loading it.
Histograms are printed on the diagonal with a smooth line to show the distribution
(Figure 5.1).
library(car)
spm(preds)
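The regression with all ten predictors is then fitted (the command is not shown in this version; from the output it is presumably):
summary(lm(ptsd ~ preds))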
Call:
lm(formula = ptsd ~ preds)
Residuals:
Min 1Q Median 3Q Max
-2.06245 -0.92718 0.08414 0.80786 2.80523
Figure 5.1 A scatterplot matrix of the predictor variables. Histograms are on the
diagonal
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 1.737e+00 1.606e+00 1.081 0.2846
predsOVER2 -7.143e-02 5.793e-02 -1.233 0.2230
predsOVER3 1.191e-01 6.449e-02 1.848 0.0703 .
predsOVER5 -1.235e-06 1.332e-06 -0.927 0.3581
predsBOND -8.802e-02 5.512e-02 -1.597 0.1162
predsPOSIT 2.490e-02 2.051e-02 1.214 0.2301
predsNEG 3.593e-02 1.679e-02 2.140 0.0370 *
predsCONTR -8.588e-03 1.316e-02 -0.653 0.5167
predsSUP 4.130e-02 3.274e-02 1.261 0.2127
predsCONS 1.473e-02 1.717e-02 0.857 0.3951
predsAFF 3.635e-02 6.415e-02 0.567 0.5734
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
This shows a multiple R2 of .30. The model with all these variables is complex.
None of the coefficient estimates can be easily interpreted because each is conditional
on the nine others. Thus, you cannot say that NEG is positively related to ptsd from
this output (you would use cor.test(NEG,ptsd) to say this). From the multiple
regression output you have to say it as: ‘NEG is positively related to ptsd after
partialling out OVER2, OVER3, OVER5, BOND, POSIT, CONTR, SUP, CONS, and
AFF.’ It is doubtful that any human could really understand what this would mean
for any theory, particularly as many of the variables are correlated among themselves.
A goal of science is to simplify models like this. One of the standard stepwise methods
(backwards stepwise) is to take one variable away (it would be AFF since it is the least
significant) and then re-examine the model. The main problem with this automated
approach is that it does this automatically based on something as meaningless as
the p value. In the simplest circumstances p values have a difficult meaning (Cohen,
1990, 1994; Dienes, 2008), but here, where they are based on a relationship after
partialling out nine other variables, the p values have essentially no scientific value.
Why take AFF out rather than CONTR or OVER5 or any of them? This is the main reason
methodologists stress avoiding these automated methods, but another reason is that these
methods are not guaranteed to find the best subset, and there are methods which address
this deficiency.
One alternative is to ask, for all of the models with k predictor variables, which
one has the best fit. When the number of variables is very large the number
of possible models to check is 2^k where k is the number of variables. If you have
40 variables, this is (using 2^40 in R) approximately 10^12, or 10 with 11 more zeroes,
which is far too many to deal with. Even if the computer could solve a thousand
models every second, it would take over thirty years. And, if you had 70 variables,
the necessary time is longer than current estimates of the age of the universe.3 Even
smaller numbers of variables can create problems, so statisticians have created clever
algorithms that search for the best fits. Some packages, like SAS, have built-in procedures
for best subset regression (bsr). Others, like SPSS, require additional instructions (http://
distdell4.ad.stat.tamu.edu/spss_1/allpos1.html; syntax for doing this is: http://www.
spsstools.net/Syntax/RegressionRepeatedMeasure/DoAll-SubsetsRegressions.txt).
Within R, the leaps package (Lumley, 2006) is recommended for this procedure.
This will need to be installed (see Chapter 1) and loaded. There are many different
criteria for estimating the fit of the model. Cross-validation is often used for exploring
the fit of different models. This involves fitting a model to a subset of the data, seeing
how the model fits the remaining data, and then doing this for several subsets. This is
discussed later in this chapter. A simpler method, and much more common in psychology,
is to summarize the fit of the model to the entire data set with a single statistic. The
statistics available in leaps include Mallows' Cp, BIC, AIC, R2, and adjusted R2.
3 A recent estimate for the age of the universe from NASA is 1.37 × 10^10 years (http://map.gsfc.nasa.gov/
m_mm/mr_age.html, accessed March 17, 2007), though this requires some assumptions, and some scientific
models do suggest it may be infinite. Some clubs, cults, and religions provide much smaller estimates and
some say it is infinite.
R2 and adjusted R2 are popular in psychology, so we will use these. The first two
commands install and load leaps. You may need to choose a mirror site from which to
download the package.
install.packages("leaps")
library(leaps)
The next command runs a series of regressions which searches for the best fitting
models for each number of predictor variables. So, with 10 potential predictors there can
be anywhere between 0 and 10 predictors in a model. The summary command shows
which variables are to be included at each step. So, if one variable is to be used, it is NEG,
if two are used they are NEG and AFF. If eight variables are to be used then they are all
the variables except CONTR and AFF. So AFF is no longer in the model, despite being
in before. The graph produced is difficult to read if the number of variables gets larger, and it is not visually appealing as it is. At least all of the quotation marks (") should be removed if printing it for a paper or an assignment.
x1 <- regsubsets(preds,ptsd)
summary(x1)
The package leaps has some graph capabilities which we will now look at. We set
it up so that the graphs are in a 2 × 2 grid format with the par(mfrow=c(2,2))
command. The leaps graphs are in the top two panels of Figure 5.2. The bottom panels
take information from the object created by the regsubsets function and present the
information in a more visually appealing manner. The "*" plot printed above has no
scale other than the number of variables in the model. This is fine if all you want to see
is if model x has a better or worse fit than model y, if they have the same number of
predictors, but it does not tell you how much better or allow you to compare between
models with different numbers of predictors. To do this you need to define a measure
of fit. The scale=c("r2") and scale=c("adjr2") are used with plot to show
whether R2 or adjusted R2 should be used. The plot function treats these objects in
special ways (i.e., different from lm.objects).
par(mfrow=c(2,2))
plot(x1,scale=c("r2"))
title("Default for leaps")
plot(x1,scale=c("adjr2"))
title("Default for leaps")
More visually appealing graphs can be constructed by running the leaps function
and plotting parts of the resulting objects. The following commands create two objects,
x2 and x3, which show the best fitting (nbest=1 means the command just stores
the best model) regression model for each number of predictor variables. leaps allows
different measures to determine how to measure fit. For x2 R2 is used and for x3 adjusted
R2 is used. The number of variables in the model, including the intercept, can be found
with x2$size.
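The calls that create x2 and x3 are not printed above; a minimal sketch of what they might look like, assuming the predictor matrix preds and response ptsd used earlier, together with plots of the kind that fill the bottom two panels of Figure 5.2 (the axis labels are our guesses):
x2 <- leaps(preds, ptsd, method="r2", nbest=1)
x3 <- leaps(preds, ptsd, method="adjr2", nbest=1)
# bottom two panels of Figure 5.2: fit against the number of predictors
plot(x2$size - 1, x2$r2, type="b", pch=19,
     xlab="Number of predictors", ylab=expression(R^2))
plot(x3$size - 1, x3$adjr2, type="b", pch=19,
     xlab="Number of predictors", ylab=expression(adj.R^2))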
Figure 5.2 provides information to help choose the number of predictors to have in
your model. The default plots for leaps seem difficult to read. Below these are the plots
for R2 and adjusted R2 which are easier to read. They show the fit statistics extracted
from the leaps objects for the best models for each number of predictor variables.
x2 is a leaps object and because it was made with method="r2" the R2 values can
be extracted with x2$r2. Similarly for adjr2 with x3. We have used x2$size-1
and x3$size-1 for the number of predictors since leaps counts the intercept as one
of the variables. Notice that the R2 values continue to increase each time you add a
variable, but the adjusted R2 value hits a peak. This is why the adjusted value is often
used to decide between models; it goes down when the predictor variable added has no
predictive value. Statisticians say that the statistic, adjusted R2 , penalizes models with
lots of predictor variables.
Figure 5.3 shows the lower right-hand graph from the 2 × 2 grid of Figure 5.2 but in
a more useful way. Notice that the adjusted R2 value is higher with three predictors than
with four predictors. If you were using a forward stepwise method to search for models
and stopped the search if none of the variables not included in the model increased the
adjusted R2, you would stop at the model with just NEG, SUP, and AFF.
Figure 5.2 The default plots for the leaps function and plots showing the number of predictor variables and the R2 and adjusted R2
[Figure 5.3 appears here: the adjusted R2 for the best model plotted against the number of predictors, not including B0]
We have
used expression and paste in the text functions below. These allow you to
put in mathematical expressions and variable values. The pos in the text command
tells R where the text should be placed in relation to the x, y coordinates given
at the start of the command (1 is below, 2 is to the left, 3 is above, and 4 is to
the right).
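The plotting command for Figure 5.3 is not reproduced above; a rough sketch under the same assumptions as before (the annotation text and its coordinates are ours):
plot(x3$size - 1, x3$adjr2, type="b", pch=19,
     xlab=expression(paste("Number of predictors not including ", B[0])),
     ylab=expression(paste("adjusted ", R^2)))
# label the peak; pos=2 places the text to the left of the point
text(x3$size[which.max(x3$adjr2)] - 1, max(x3$adjr2),
     labels=expression(paste("maximum adjusted ", R^2)), pos=2)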
Once you have decided how many predictors you want, you need to find out what they
are, and then evaluate the regression model with them. The which and which.max
commands allow you to find which variables are included in the model that has the
maximum adjusted R2. It is the 1, 2, 4, 6, 8 and A variables (when the function runs out of the digits 1–9 it uses letters), which are: OVER2, OVER3, BOND, NEG, SUP, and AFF.
x3$which[which.max(x3$adjr2),]
1 2 3 4 5 6 7 8 9 A
TRUE TRUE FALSE TRUE FALSE TRUE FALSE TRUE FALSE TRUE
labels(preds[1,])
summary(lm(ptsd ~ preds[,c(1,2,4,6,8,10)]))
Call:
lm(formula = ptsd ~ preds[, c(1, 2, 4, 6, 8, 10)])
Residuals:
Min 1Q Median 3Q Max
-2.01516 -0.96366 -0.03749 0.87906 2.82240
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 0.70057 1.24076 0.565 0.57454
preds[, c(1,2,4,6,8,10)]OVER2 -0.06445 0.05650 -1.141 0.25873
preds[, c(1,2,4,6,8,10)]OVER3 0.08485 0.05692 1.491 0.14156
preds[, c(1,2,4,6,8,10)]BOND -0.05972 0.04857 -1.230 0.22389
preds[, c(1,2,4,6,8,10)]NEG 0.04383 0.01520 2.883 0.00555 **
preds[, c(1,2,4,6,8,10)]SUP 0.03926 0.02719 1.444 0.15420
preds[, c(1,2,4,6,8,10)]AFF 0.08496 0.04810 1.766 0.08268 .
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Notice that most of the p values are above .05. The adjusted R2 is for the model, not
the individual coefficients. Most of the stepwise procedures would continue removing
terms. The decision about how many terms to include should be based on how important
parsimony is for your particular application. We believe parsimony is usually very
important in psychology, so that researchers should usually opt for a simpler model.
While adjusted R2 penalizes complex models, other adjustments have greater penalties
for complex models. The pragmatic approach to statistics, however, stresses that
decisions and conclusions should not be based on any single statistic and that the
researcher should look at various indices and describe the range of possible conclusions
which their data may suggest.
RIDGE REGRESSION
The bsr (best subset regression) method described above either includes a variable or not,
and often the choice of whether to include a variable is based on only a minute difference
in fit. Efron et al. (2004: 409) describe this as ‘overly greedy, impulsively eliminating
covariates which are correlated with’ other covariates. One alternative is a more smooth
transition where the sizes of the regression coefficients are constrained. This is called
ridge regression and it can be solved with a form of least squares regression. Therefore,
it is a technique that has been used for several decades and can be calculated using many
of the general statistics packages (for example, SPSS).
In R there are a few functions for ridge regression. One of the simplest is lm.ridge
in the MASS library (Venables & Ripley, 2002), so we will use that (see also Halvorsen,
2007). It takes the sum of the squared standardized regression estimated coefficients and
constrains them to be only as large as some value k:
Σj β̂j² ≤ k
The value k is one of several measures of the amount of shrinkage. The function
lm.ridge plots the size of the individual coefficients with the amount of shrinkage,
but uses λ (lambda) (which is used in the computation of the ridge regression). As λ goes
up the amount of shrinkage goes up, and k goes down (Hastie et al., 2001: 59). Here is
the R code for Figure 5.4:
library(MASS)
lm.ridge(ptsd~preds,lambda=seq(0,100,by=1))-> x
plot(x)
title("Ridge Regression")
abline(h=0)
abline(v=50,lty=3)
Usually you need to use trial and error to decide the range of the λs to be tested. Here,
seq(0,100,by=1) means going from 0 to 100 in steps of 1 (even steps of 10 or
20 produce smooth enough curves, but the computer is fast enough to have 100 steps).
Note that -> is used to assign the lm object to x rather than the other way around.
There are complex ways to compare the fit of models like this, but for the present
purpose just choose a λ (lambda) for where the coefficients seem relatively stable. The
value lambda = 50 seems about right on Figure 5.4 and we added a vertical line at this
value with the abline function. Here are the coefficient values for lambda = 50:
x$coef[,50]
We are not going to spend much time on ridge regression because we do not think it
is of much use. Historically it is important because it could be solved relatively easily
and so has been available for decades. We presented it as a way to introduce a better
procedure called the lasso. A problem with ridge regression is that all of the variables
are still included in the model. Given that it is better to have simpler models, it would
have been nice if some of these coefficients had dropped out. This occurs with the next
procedure, the lasso.
[Figure 5.4 appears here: the coefficients, t(x$coef), plotted against x$lambda]
Figure 5.4 The ridge regression output for Ayers et al. (2007)
THE LASSO
The lasso works by constraining the sum of the absolute values of standardized estimated
coefficients to some constant, k. In math:
Σj |β̂j| ≤ k
While the difference between this and what is done with ridge regression appears
slight, there are two important consequences. First, ridge regression is computationally
simpler than the lasso so standard least squares techniques can be used to estimate
the coefficients. The lasso is more difficult computationally. Second, in ridge regression
while the individual coefficients shrink and sometimes approach zero, they seldom reach
zero so they are not excluded from the model. With the lasso the coefficients reach
zero and therefore predictor variables do drop out. This means that the lasso leads to a
more parsimonious model than ridge regression. In technical terms, ridge regression is
a method of shrinkage, not model selection, while lasso does both.
The computational difficulty with the lasso has been solved. Efron and colleagues
(2004) have developed an algorithm called least angle regression (lars) that calculates
the lasso solution in about the same time as least squares regression. Therefore, the lasso
should regularly be used instead of ridge regression (although other techniques exist
which may be better suited for particular situations; for example, see Tibshirani et al.,
2005). If k (the shrinkage value in the formula above) is chosen to be too small then
the model may not capture important characteristics of the data. If k is chosen to be
too large then the model may over-fit the data in the sample, providing an inaccurate
representation for the population. As described above cross-validation and measures
of fit like adjusted R2 are often used to assess the fit and to decide how much to
constrain the size of the coefficients. The lars package (Hastie & Efron, 2007) allows
both the computation of lasso coefficient estimates and cross-validation to help the
researcher decide the appropriate amount of shrinkage. As stated above, it is important
that researchers do not rely too much on any single statistic to guide their conclusions, and
again here the researcher must consider the importance of parsimony for the particular
application.
lars can be downloaded in the usual way from CRAN, but it is worth looking at
Hastie’s web page (www-stat.stanford.edu/∼hastie/Papers) for lots of other information
that he has made available.
install.packages("lars")
library("lars")
To illustrate the package we will go through the same example as above (Ayers et al.,
2007).
[Figure 5.5 appears here: the standardized coefficients plotted against |beta|/max|beta|, with the covariate numbers shown on the right]
Figure 5.5 The graph of the lasso solution for the Ayers et al. (2007) data. The vertical
lines show when a variable has been eliminated from the model
The scales are different than on the ridge regression graph, and the lasso graph starts with the coefficients constrained to zero and then moves towards the ordinary least squares solution, but the basic form of the graph is the same. An advantage of the lasso over ridge regression
is that variables actually drop out; their coefficients go to zero. Vertical lines are placed
each time a variable drops out, and in fact, these are the only points that are actually
shown in the default lasso graph (this can be changed). The variable numbers are shown
to the right of the graph.
The function, lars, uses least angle regression to calculate the entire sequence
of lasso coefficients. Lasso is the default. Here, the solution is stored in lasso1,
which is an object, like the lm objects, so it can be used in other functions.
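The lars call that creates lasso1 and draws Figure 5.5 is not printed above; a minimal sketch, again assuming the preds matrix and the ptsd response:
lasso1 <- lars(preds, ptsd)   # type="lasso" is the default
plot(lasso1)                  # the coefficient paths shown in Figure 5.5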
In Figure 5.5 the y-axis is the standardized coefficients and the x-axis is labeled
‘|beta|/max|beta|’ which is for: |βs|/max|βs|. Both of these labels deserve further
explanation.
When using the lasso, it makes sense to remove the constant, and usually to standardize all the variables so that the shrinkage does not penalize some coefficients more than others simply because of their scale (this is true also with ridge regression). The lars package does this
automatically (as does the ridge regression function discussed). The β values are stored
in lasso1$beta. The graph uses the transformed values which can be found with
the following command: scale(lasso1$beta,FALSE,1/lasso1$normx). No
one other than the functions’ authors would be expected to know this level of
information, but that is why they produced help files (help(lars)) and have
a manual.
The ‘|beta|/max|beta|’ on the x-axis ranges from 0 (sum of the |βs| being zero) to 1
(no shrinkage, the OLS unbiased estimates). Rather than using k, which would depend
on the scale of the variables, this amount has been transformed so that it is comparable
across problems. This graph has other useful features. The curves produced by R for
each covariate are in different colors on the screen so can be differentiated. The vertical
lines show each time a covariate is added to the model (or eliminated from the model,
depending on whether you look at it from left-to-right or right-to-left). These are referred
to as steps in the output. On the far right the covariate numbers are shown. With several
covariates not all of them are printed, but the user can figure out which line is for which
covariate by looking at the values.
Normally methods like cross validation are used to decide how much shrinkage should
be used, but for present purposes let's say the point of shrinkage where there are still six
predictor variables looks good. Since the package includes the intercept here, let s=7,
and write:
predict(lasso1,s=7,mode="step",
type="coefficient")$coefficients
These coefficients can then be used. Another possibility is just using two predictor
variables (let s=3).
There is a cross-validation procedure built within lars called cv.lars. Here we
just run the default (usually you vary the number of folds with the option K=). With the
small data set used here for illustration the default, K=10, will do.
Figure 5.6 The graphical output from cv.lars for the Ayers et al. (2007) data
The first command
runs the procedure and produces Figure 5.6. The second calculates which value has the
lowest cv score, and then finds the fraction associated with that, and stores the result
in frac. The coefficients for this fraction are then found using the predict function
but telling it that mode="fraction". Notice that only four predictor variables are
included.
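The two commands described here are not printed above. A minimal sketch follows; the object name cv1 is ours (frac is the name used in the text), and in some versions of lars the component holding the fractions is called fraction rather than index:
cv1 <- cv.lars(preds, ptsd)            # K=10 folds by default; draws Figure 5.6
frac <- cv1$index[which.min(cv1$cv)]   # the fraction with the lowest cv score
predict(lasso1, s=frac, mode="fraction",
        type="coefficient")$coefficients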
There is a big conceptual hurdle in adopting this approach. The ordinary least squares
estimates, from the lm function, are unbiased, but they have higher variance than other
procedures. Hastie et al. (2001) discuss in detail the importance of weighting both bias
and variance when choosing a statistic. Efron et al. (2004) describe a hybrid lars/OLS
procedure, where lars is used to decide which covariates to include and then the
OLS coefficients are reported. This has the advantage that the output will be more
familiar to most psychologists. This is the procedure that Ayers et al. (2007) reported
in their paper, but it falls foul of Efron et al.’s (2004: 409) ‘overly greedy’ criticism
mentioned earlier.
Each principal component is a weighted sum of the predictor variables, of the form C = a1X1 + a2X2 + · · · + akXk, where the a values are called loadings, which are estimated. For interpretability, it
is hoped that many of these loadings are near zero so can be ignored. Sometimes
all of the loadings for later components are small enough that these components can
be ignored without much loss of information. Because of this PCA is used as a data
reduction/simplification technique.
PCA is often likened to exploratory factor analysis (EFA). It has some superficial
similarities with EFA, for example, they both rely on correlations among the X variables
to yield useful results, and correlated variables tend to hang together in both the
components of PCA and the factors of EFA. PCA is preferred by most statisticians. The
difficulty many people have with EFA is that hypothesizing these latent variables and
measuring them, often in the same step, is a dubious scientific approach and should only
be done with great caution. Bartholomew et al. (2002) provide a good treatment of each
and compare them.
It is possible to use the R function princomp for PCA and then enter the resulting
components into a linear regression (lm). Alternatively, the pls library (Wehrens &
Mevik, 2007) allows these two steps to be done with a single function for principal
component regression (pcr), and also provides useful options and an alternative. The
alternative is partial least squares regression (plsr) which combines the procedure into
a single step. Rather than finding the linear combination of X variables with the largest
variance, plsr uses information about both the X variables and their associations with
the response variable. The choice between these alternatives depends on your particular
goals. If you want to reduce the dimensionality of a large number of variables into a
small number of components, and then see how these components predict a response
variable, use pcr. If you want to see how well a combination of a set of variables
can predict a response variable, use plsr. plsr will produce a better prediction
of the response variable, because that is what it is designed to do, but it will often
be more difficult to interpret. The examples below show how these methods yield
different results.
Several algorithms are offered for the plsr approach, but they produce the same
solutions when there is only a single Y variable. We show both the pcr and the plsr
approaches for the Ayers et al. (2007) data. We use the pls library which you will need
to install and load.5
install.packages("pls")
library("pls")
This first set of code runs pcr and produces Figure 5.7 which can be used to help
decide how many components you need (like how you would use a scree plot).6
The values are the cumulative percentage variance accounted for. It is the non-
cumulative ones that go in a normal scree plot. The values for ptsd show how much
of the ptsd variance is accounted for by the components. The first accounts for
nothing (well, 0.004%). The second accounts for 13%, which is a relatively large
amount. If using all 10 components, 30.47% is accounted for. This is the same as
the multiple regression using all the variables. This makes sense since it is the same
information.
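The pcr call that produces the output below is not printed; a minimal sketch, assuming the same variables. The plot command sets up the axes that the lines and points commands printed after the output add to, and it mirrors their indexing of summary(pcr1) (whether summary returns an indexable table depends on the version of pls):
pcr1 <- pcr(ptsd ~ preds, ncomp=10)
summary(pcr1)    # echoes the % variance explained table shown below
plot(1:10, summary(pcr1)[1,], pch=19, ylim=c(0,100),
     xlab="Number of components",
     ylab="Cumulative variance accounted for")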
Data: X dimension: 64 10
Y dimension: 64 1
Fit method: svdpc
Number of components considered: 10
TRAINING: % variance explained
1 comps 2 comps 3 comps 4 comps 5 comps 6 comps 7 comps
X 45.068953 67.21 82.13 92.42 95.83 97.30 98.43
ptsd 0.004225 13.07 17.16 17.91 19.82 20.07 25.39
8 comps 9 comps 10 comps
X 99.36 99.83 100.00
ptsd 27.11 29.24 30.47
lines(1:10,summary(pcr1)[1,],lwd=1.5)
points(1:10,summary(pcr1)[2,],pch=19)
lines(1:10,summary(pcr1)[2,],lwd=1.5)
5 Both of these functions, pcr and plsr, can be accessed as different methods from a function called mvr
(for multivariate regression), but we will keep with pcr and plsr so that the function name indicates what
is being done. The function name mvr indicates that it can be used, generally, for multiple response variables.
6 An odd thing occurs with the pcr function. Every time you use summary(pcr.object) or
summary(plsr1), which we use below, the computer echoes the summary output to the screen. It is likely
the invisible function, used in the mediator function of Chapter 4, could take care of this. Anyway,
this is a minor nuisance rather than a problem but worth mentioning because if you run these functions you
may get more output than is printed here in this book. Throughout the rest of this chapter, when R echoes the
summary, we will not print it in order to save a little paper.
Figure 5.7 The pcr method. The top line shows the cumulative variance accounted
for by each of the components, the same as you would find with a PCA. The bottom line
shows how much these components account for the ptsd variable
text(4,85,"variance of predictors",pos=4,cex=1.2)
text(5,35,"variance of PTSD",pos=4,cex=1.2)
Like other principal components software, R prints the loadings while suppressing those that are small, so that it is easier to see which variables load onto which components. If you type pcr1$projection you get all the loadings.
Because of Figure 5.7, you would probably focus just on the first two components
because none of the others account for much variation in ptsd. Here are these
loadings.
pcr1$loadings
Loadings:
Comp 1 Comp 2 Comp 3 Comp 4 Comp 5 Comp 6 Comp 7 Comp 8 Comp 9 Comp 10
OVER2 -0.241 0.329 -0.245 0.799 -0.336
OVER3 -0.122 -0.237 0.520 -0.445 -0.176 0.647
OVER5 0.107 0.990
BOND -0.103 0.719 0.461 -0.335 -0.373
POSIT -0.459 -0.245 0.128 -0.809 -0.205
NEG 0.306 -0.618 -0.685 -0.124 -0.108
CONTR -0.786 -0.450 0.395 -0.123
SUP -0.205 -0.139 0.896 0.298 -0.146 0.137
CONS -0.148 -0.729 0.520 0.385
AFF -0.106 -0.703 -0.421 -0.543
The following shows that the solution using this approach to pcr is the same as
running princomp (for PCA) and then lm on the components. The principal component
analysis is done and stored in pca1, and the scores for the first two components
are used to predict ptsd. The multiple R2 shows that 13.07% of the variance is
accounted for.
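The commands themselves are not shown; a sketch of what they would look like, assuming princomp is run with its defaults on the preds matrix:
pca1 <- princomp(preds)
summary(lm(ptsd ~ pca1$scores[,1] + pca1$scores[,2]))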
Call:
lm(formula = ptsd ~ pca1$scores[, 1] + pca1$scores[, 2])
Residuals:
Min 1Q Median 3Q Max
-2.28493 -1.03742 0.07687 0.84556 2.46798
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 1.5924224 0.1505042 10.581 1.97e-15 ***
pca1$scores[, 1] -0.0004731 0.0086882 -0.054 0.9568
pca1$scores[, 2] 0.0375420 0.0123959 3.029 0.0036 **
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
The partial least square regression (plsr) procedure is now shown. The syntax works
the same way so the same type of graph is produced in Figure 5.8. Because this procedure
tries to account for variation within the response variable, the first component does this
while still trying to combine the predictor variables.
Figure 5.8 The figure for plsr. The top line shows the cumulative amount of variance in the predictors accounted for by the components. The bottom line shows the amount of variance in ptsd accounted for. One component seems fine
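The plsr call and the first plotting command are not printed above; a minimal sketch under the same assumptions as the pcr example (the points, lines, and text commands below then add to this plot):
plsr1 <- plsr(ptsd ~ preds, ncomp=10)
plot(1:10, summary(plsr1)[1,], pch=19, ylim=c(0,100),
     xlab="Number of components",
     ylab="Cumulative variance accounted for")
lines(1:10, summary(plsr1)[1,], lwd=1.5)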
points(1:10,summary(plsr1)[2,],pch=19)
lines(1:10,summary(plsr1)[2,],lwd=1.5)
text(4.3,85,"variance of predictors",
pos=4,cex=1.2)
text(5,35,"variance of PTSD",pos=4,cex=1.2)
The numbers below are those used in Figure 5.8. Notice that there is a much higher
amount of ptsd variance accounted for than with pcr. It is still less than the amount
from OLS regression including everything. plsr can be seen as a combination of these.
summary(plsr1)
Data: X dimension: 64 10
Y dimension: 64 1
Fit method: kernelpls
Number of components considered: 10
TRAINING: % variance explained
1 comps 2 comps 3 comps 4 comps 5 comps 6 comps 7 comps
X 21.07 36.40 68.99 84.41 92.75 96.82 97.49
ptsd 18.81 21.78 23.75 26.88 28.24 29.41 30.17
8 comps 9 comps 10 comps
X 98.53 99.20 100.00
ptsd 30.34 30.45 30.47
plsr1$loadings
Loadings:
Comp 1 Comp 2 Comp 3 Comp 4 Comp 5 Comp 6 Comp 7 Comp 8 Comp 9 Comp 10
OVER2 0.109 -0.198 -0.424 0.782 -0.728
OVER3 0.275 0.247 0.509 -0.558 0.370 -0.415
OVER5 -0.145 -0.199 -0.418 -0.248
BOND -0.115 -0.140 -0.167 0.418 -0.970 0.407 0.283
POSIT 0.233 0.346 -0.677 0.472 -1.029 0.766 0.112
NEG 0.839 0.210 0.429 -0.487 0.193 -0.105 -0.147 0.164
CONTR 0.796 -1.415 0.405 0.587 -0.299
SUP 0.471 -0.132 0.366 -0.248 -0.489 0.326 -0.271 0.361 -0.283
CONS 0.522 -0.998 -0.300 0.563 -0.232 0.123
AFF 0.219 0.212 -0.730 -0.299 0.283 0.208
This shows that the final part of the plsr procedure is a regression with the
scores.
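The command itself is not printed; a sketch of what it would be, using the scores stored in the plsr1 object:
summary(lm(ptsd ~ plsr1$scores[,1]))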
Call:
lm(formula = ptsd ~ plsr1$scores[, 1])
Residuals:
Min 1Q Median 3Q Max
-2.27623 -0.98499 0.02705 0.85712 2.33879
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 1.59242 0.14428 11.04 2.87e-16 ***
plsr1$scores[, 1] 0.04752 0.01254 3.79 0.000344 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
These automated functions can help guide the researcher, but they should not dictate how
model selection progresses.
The question is: are the standard stepwise approaches learned during your undergraduate courses okay for these two specific situations? For the first situation, with
hundreds of predictors, the traditional approaches are not good, and the alternatives
described here should be used. For the second situation, the traditional approaches may
be okay, but each of the alternatives presented in this chapter has advantages so should
be considered. The best subset regression (bsr) has an obvious advantage in that the
traditional stepwise approaches are not guaranteed to reach the best, in some sense,
set of predictors. The ridge and the lasso regressions have advantages over best subset
regressions because, with correlated predictor variables, some coefficients may become
high and unstable, and the shrinkage used in both of these procedures helps lessen
this. Further, the graphs produced help show unstable coefficients. The lasso has the
advantage over ridge regression in that, as well as shrinking the coefficients, some
coefficients drop out, making the solution simpler. Hastie et al. (2001) describe it as
‘a kind of continuous subset selection’ (p. 64). As described in this chapter, PCR and
PLSR seem different, but in fact principal component analysis is very closely related to
multiple regression. The PCR technique has the two distinct steps: PCA and multiple
regression. This division can be helpful if, for example, you wanted to use the components
in some other analyses. Both of these procedures have the problem that the components
are likely to include lots of variables and so not be very parsimonious if you consider
all these variables, but if you can describe these summaries as meaningful indices in
their own right, then it becomes simpler. This is easier to do with PCR than with PLSR
because you do not need to refer to the response variable in describing the components
for PCR.
An important attribute in the choice of statistical test is the ease of communicating the
results to an audience. The traditional stepwise methods and best subset regressions are
top of our list for communication because they have a long history in psychology, but
PCR also will be easily understood by most psychologists because most of them have
knowledge of PCA. Ridge and lasso regression will be more difficult to describe, but
because the constraints used look simple (Σ β̂j² ≤ k and Σ |β̂j| ≤ k) they should be able
to be explained to most psychology audiences. There have been several extensions to the
lasso, which we did not describe for reasons both of space and their complexity. PLSR
is a newer technique and more difficult to describe.
Statisticians often run Monte Carlo simulation studies where they create data with
known values, and see how well the techniques estimate these values. They
have looked at the behavior of each of these methods. Another way to compare
methods is with cross validation, and in fact cross validation is a method to help
decide how much shrinkage to use in the ridge and lasso, how many variables
to use in best subset regression, and how many components to use in PCR and
PLSR.7 Hastie et al. (2001) review these different procedures and describe how
PCR, PLSR, and the ridge regression behave similarly, and they preferred ridge
regression of these three. They compare the lasso and ridge, but the parsimony of the
lasso makes it a clear winner in our minds. This does not mean the lasso is either
going to find the optimal (in some sense) solution or that it should always be used.
7 It is worth saying that different types of cross validation can be done easily with the different R functions used here. For space reasons we only briefly mention them.
For example, Zhao and Yu (2006) find it has trouble excluding non-predictive
variables which are highly correlated with some predictive ones. The original lasso
paper (Tibshirani, 1996) has been cited hundreds of times and spawned numer-
ous extensions and modifications, so there are already new alternatives with more
to come.
So, which procedure wins? The boring answer is that you should use several different
techniques. Each makes different assumptions and has different purposes. If they all
give a similar solution then there is a good chance that is a good solution. If they give
different answers then you will have to think more about your specific question and the
aims of your research. You may have to choose one technique.
That is the boring answer, which you probably did not want to hear. We know people
do not like the advice ‘try lots and see what you get’, and there are some sound reasons
against trying too many approaches because sometimes researchers may be biased to
choose the answers which suit them best. Therefore, we will give more direct advice,
but with the caveat that it usually is worth trying several techniques. On the basis of
what type of output you get, the simplicity of the solution, and the ease of exposition, we
recommend the lasso and PCR when you have multiple correlated predictor variables,
when you lack any clear theories about the relationships among these, and you wish
to see how these predictors relate to a response variable. Use the lasso when you are
more interested in the response variable, and PCR when you are more interested in the
predictor variables.
R functions
• cbind: combines variables to use as a group;
• cor: for making a correlation matrix;
• cor.test: for statistics correlating two variables;
• diag: for using the diagonal of a matrix;
• lower.tri, upper.tri: to use the lower and upper triangles of a matrix;
• spm: scatterplot matrices;
• regsubsets: for best subset regression;
• which, which.max: to identify which case is the largest in a set;
• lm.ridge: ridge regression;
• lars: least angle regression (for the lasso);
• pcr: principal component regression;
• princomp: principal component analysis;
• plsr: partial least squares regression.
Statistical concepts
• bsr: best subset regression;
• ridge and lasso: two methods that constrain coefficients;
• PCA: principal component analysis;
• PCR and PLS: model simplification based on PCA.
FURTHER READING
Hastie, T., Tibshirani, R. & Friedman, J. (2001). The elements of statistical learning:
Data mining, inference, and prediction. Springer-Verlag: New York. Webpage: http://www-
stat.stanford.edu/∼tibs/ElemStatLearn/. Chapter 3 of this book provides a mathematical
introduction to all of these procedures. Trevor Hastie says the 2nd edition should be out soon.
The procedures within this book have been written up as an R package: Halvorsen, K. (2007)
ElemStatLearn: Data sets, functions and examples from the book: ‘The Elements of Statistical
Learning, Data Mining, Inference, and Prediction’ by Trevor Hastie, Robert Tibshirani and
Jerome Friedman. R package version 0.1–3.
Mevik, B.-H. (2006). The pls package. R News, 6/3, 12–17. Available: http://cran.r-project.org/
doc/Rnews/Rnews_2006-3.pdf. This is a good brief description of PCR and PLSR.
The lasso page, http://www-stat.stanford.edu/∼tibs/lasso.html, has links to various descriptions
of the lasso.
6
Generalized linear models (GLMs)
Learning outcomes
1. To be able to run generalized linear models (GLMs) for response variables that are:
• normally distributed;
• counts (frequencies);
• binary (like a single YES or NO question);
• binomial (like a proportion correct on several binary variables).
2. To be able to present the results of a GLM graphically.
One of the most important advances in statistical modeling during the last 50 years was
begun by Nelder and Wedderburn (1972). They showed that linear regression could be
extended to a larger set of situations, including many that are frequently encountered in
psychology. Requiring the model to be linearly related to the responses and requiring
normally distributed residuals are not appropriate for many research problems. In the
past psychologists often still ran regressions and ANOVAs pretending that their response
variables had these characteristics, but many did so with a guilty feeling. These guilty feelings led to high levels of stress, poor health, and an early death. Not really, but they could justify this behavior, to some extent, because alternatives were not
readily available. The development of the generalized linear model (GLM) means that
alternatives are now available. Different functions can be used to link the predicted
values with a linear combination of the predictor variables. There is a notation used for
GLMs that is worth introducing. The linear combination of the predictor variables (and
this can include variables multiplied by each other and functions of these variables –
it is linear in the β values) is called the model and is denoted with ηi . The predicted
values are denoted with μi . The link function, denoted g(), connects the model and the
predicted values such that g(μi ) = ηi .
The link functions are one of the two key concepts needed to conduct GLMs. In
this chapter three link functions are considered: the identity function; the log function;
and the logit function. These link functions have different error distributions usually
associated with them. The error distribution is the second key concept for GLMs. The
error distributions usually associated with each of the link functions are: normally
distributed errors with the identity link; Poisson distributed errors with the log link;
and binomially distributed errors with the logit link.
The phrase ‘identity function’ in mathematics means a function that maps something
onto itself. For the identity function, μi = ηi = β0 + β1x1i + · · · + βk xki for k predictor variables. The μi are not the observed responses themselves (i.e., not the yi), but predicted values. To get the responses we have to include an error term, so yi = β0 + β1x1i + · · · + βk xki + ei. If we assume normally distributed errors, which is the standard
assumption with the identity link, this is simply the standard linear multiple regression
that has been covered in previous chapters. This is yi = ηi + ei , where it is assumed
that ei ∼ N(0, σ ), which means the ei come from a normal distribution with a mean
of 0 and an unknown standard deviation (σ ) which will be estimated. This is a special
case of GLM. The model part of the regression, the ηi , is linear in the sense that the βs
can be separated from the Xs, and this is why this procedure is called the generalized
linear model. As is the norm for many statistical procedures it is often referred to by its
abbreviation: GLM.
A common situation in psychology is where the dependent variable is a frequency.
For example, this might be how many times a child asks for help in a classroom. If these
occurrences are independent from each other and are based only on a single probability for
each person then often it is reasonable to assume that the data follow a Poisson distribution
and the log link is appropriate: ln(μi ) = ηi with error following a Poisson distribution.
This is ln( yi ) = ηi + ei , where ei ∼ Poisson(λ), which means the residuals are from
a Poisson distribution with an unknown mean of λ. With the Poisson distribution the
standard deviation is the same as the mean. Most of the time in psychology when a
Poisson distribution is used λ is small (<3), and in these cases it has high expected
probabilities for low frequencies and then the expected probabilities decline as the
frequencies increase (i.e., it is positively skewed). Thus, it is expected that most children
ask few questions, but that some may ask lots. Figure 6.1 shows some examples of
Poisson distributions for λ = 1, 2, 5, 10, and 20. As the value of λ reaches 10 and
20 the distribution looks more like a normal distribution, so in these situations people
would often just assume normally distributed errors. The lines(spline(0:40,
dpois(0:40,i))) tells R to draw lines based on a smooth curve called a spline
(covered more in Chapter 7) with x-axis coordinates of 0–40, and y-axis coordinates
corresponding to these values for λ = i. The for function tells R to do this for λ = 1, 2,
5, 10, and 20. The locations of the ‘λ = ’ in the figure were based on trial and error with
the text function. This would be an occasion where the locator(1) function could
have been used. This would allow you to place the text where you wanted on the graph
(try text(locator(1), expression(lambda," = 1"),pos=4)).
plot(0:40,dpois(0:40,1),xlab="Variable",ylab="",ylim=c(0,.5),
col="white")
for (i in c(1,2,5,10,20)) lines(spline(0:40,dpois(0:40,i)))
text(1, .37, expression(lambda," = 1"),pos=4)
text(2.4, .26, expression(lambda," = 2"),pos=4)
text(4, .20, expression(lambda," = 5"), pos=4)
[Figure 6.1 appears here: Poisson distributions for λ = 1, 2, 5, 10, and 20, plotted for values 0–40]
1 In R ‘Poisson’ requires the data to be integer numbers, but putting ‘quasipoisson’ in as the family allows you
to have non-integer values (with this method the dispersions are also allowed to vary, which can be a good
thing, but is more complex, see Hinde & Demétrio, 1998; Wright, 1997).
Logistic regressions are often used for dichotomous variables, and in fact when some
people refer to logistic regression they are actually referring just to this special case.
The variance of the binomial distribution is a simple function of its mean, and, as
with the Poisson regression, this can be useful in some situations (Hinde & Demétrio,
1998).
We will use a very small data set to illustrate the different types of GLM before
describing an example with real data. For illustrative purposes suppose that we have
data on 20 children. The data (from Wright, 2006a) include values from a standardized
intelligence test that is normally distributed with a mean of 0 and a standard deviation
of 1. We want to see how these scores relate with scores from a scale of socializability,
the number of books read, the number correct out of 10 on a math quiz, and whether the
child received detention during the previous year.
The first model we look at is a simple linear regression between social scores and
intelligence scores. This regression can be done with glm or with lm, but we will
use glm for illustration. The defaults for glm are to assume the residuals are normally
distributed and that the link function is the identity function (i.e., none). It shows a
positive and significant relationship between test and social. The output looks a
little different than what you get with the lm function. The dispersion parameter is
not usually mentioned with the standard regression because it is allowed to vary. With
the other GLMs it can be more important because the standard deviation/variance is
often assumed to be a function of the mean. When we wrote out the model with an
error term, we wrote ei ∼ N(0, σ ). The dispersion value (1.36) is the estimate of σ 2 .
The residual sum of squares (24.531) divided by its degrees of freedom (18) is this
value. The estimate for σ is the square root, or 1.17. The sum of squares (listed as
deviance measures) and the coefficient estimates are the same as you would get with
the lm function. Statistics like R2 are not printed, but you can calculate this yourself:
(41.710 − 24.531)/41.710 = 41%.
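The commands that produce the output below are not printed; a minimal sketch, using the object name socreg that appears in the plotting code later in the chapter:
socreg <- glm(social ~ test)
summary(socreg)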
Call:
glm(formula = social ~ test)
Deviance Residuals:
Min 1Q Median 3Q Max
-1.85983 -0.88971 -0.08739 1.08340 1.57728
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) -0.2224 0.2658 -0.836 0.41385
test 0.8743 0.2463 3.550 0.00229 **
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
The following regression shows that there is a negative relationship for test score
in predicting detention. Figure 6.2, described below, helps to show how strong the
relationship is. The usual test statistic is: t(18) = 1.90, p = .06. detent is a binary
variable meaning it is like a single coin flip. Binomial means like a number of coin
flips, so binary is a special case of binomial where it is only a single coin flip.
This distinction is important. Because the variance for the binomial distribution is
a function of its mean the function assumes an appropriate variance for the error
distribution. The glm procedure allows this assumption to be lifted, but this requires
slightly more complex computation (see Venables & Ripley, 2002, for how to do
this in R).
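Again the commands themselves are not printed; a sketch using the name detreg from the plotting code below:
detreg <- glm(detent ~ test, family=binomial)
summary(detreg)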
Call:
glm(formula = detent ~ test, family = binomial)
Deviance Residuals:
Min 1Q Median 3Q Max
-1.4593 -0.8497 -0.3505 0.9032 1.5888
Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) -0.3385 0.5310 -0.637 0.5239
test -1.3430 0.7059 -1.902 0.0571 .
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
R has different ways to run regressions with proportions. Here we have entered
the proportions as a two column matrix where the first column is the number
of correct answers and the second column is the number of incorrect answers.
This is useful in case people have answered different numbers of questions
(see Venables & Ripley, 2002, for discussion). This model shows that test predicts
math scores.
x <- cbind(math,10-math)
mathreg <- glm(x~test,binomial)
summary(mathreg)
Call:
glm(formula = x ~ test, family = binomial)
Deviance Residuals:
Min 1Q Median 3Q Max
-1.7942 -0.6700 0.2121 0.7118 1.3207
Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) -0.3216 0.2021 -1.591 0.112
test 2.7027 0.4037 6.695 2.16e-11 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Finally, test also predicts the number of books. It is assumed the number of books
follows a Poisson distribution.
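A sketch of the commands, using the name bookreg from the plotting code below:
bookreg <- glm(books ~ test, family=poisson)
summary(bookreg)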
Call:
glm(formula = books ~ test, family = poisson)
Deviance Residuals:
Min 1Q Median 3Q Max
-1.4813 -0.7041 -0.2727 0.2818 1.1853
Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) -0.4047 0.3109 -1.302 0.193
test 1.1304 0.1734 6.518 7.12e-11 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
The generalized linear model is important for statistics because it turns non-linear
models into linear ones with the link function. This means that a computer can solve
them relatively easily. The difficulty for most people is trying to conceptualize the models
in terms of logits and logs. It is usually easier to see simplified patterns in data in
graphical format than in numerical format, and we think this is particularly true for
GLMs. Figure 6.2 shows the four graphs which correspond to the four models just
discussed. The predict(glm.object,type="response") tells the computer
to have on the y axis the predicted response values, the μi .
par(mfrow=c(2,2))
plot(test,social)
lines(test,predict(socreg,type="response"))
plot(test,detent)
lines(test,predict(detreg,type="response"))
plot(test,math/10)
lines(test,predict(mathreg,type="response"))
plot(test,books)
lines(test,predict(bookreg,type="response"))
par(mfrow=c(1,1))
Figure 6.2 The predicted values (or probabilities) for four different generalized linear
models are shown by the lines. The observed values are shown with circles
Next the interaction is added, and this also fails to significantly improve the fit of the
model (χ 2 (1) = 0.12, p = .73). We told the computer to use test="Chi". If we
had said test="F" R would print a warning that the F test is not appropriate in this
circumstance.
Figure 6.3 A scatterplot of social with test scores, with the predicted values of social
with test scores from a quadratic regression
Finally, the code below shows that you can include things like polynomial functions
within the glm function. Remember that the ‘linear’ in linear models is in terms of the β
values, not the Xs. Here it includes both the linear and the quadratic (the 2 in the poly
function) polynomial terms. It is non-significant. Figure 6.3 shows that the curve does
not deviate much from the linear.
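The code referred to does not appear above; a minimal sketch, where the object name quadreg is ours:
quadreg <- glm(social ~ poly(test, 2))
summary(quadreg)
# scatterplot with the quadratic predictions, as in Figure 6.3
plot(test, social)
lines(sort(test), predict(quadreg, type="response")[order(test)])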
This example comes from Wright and Hall (2007) who were interested in the effect of a reasonable doubt instruction on juror decision making. Participants read a brief crime summary and
were asked to render a verdict. They were told to tick ‘guilty’ if they believed the
defendant’s guilt was ‘beyond a reasonable doubt.’ They also rated their belief in guilt
on a 0.00 to 1.00 probability scale. Participants were either in a control condition where
they received no further instructions or they were in an experimental condition where
they received a more elaborate instruction which had two predicted effects. First, it was
expected to lower belief in guilt and second it was expected to lower the reasonable doubt
threshold. 172 participants made both a belief in guilt judgment and a binary verdict
(guilty vs. not guilty). It seems plausible to define ‘reasonable doubt’ as the point on the probability scale where there was a 50% chance of rendering a guilty verdict. This is called LD50,
for lethal dose 50%, and comes from medicine where analysts have used it to estimate
the dose of a drug where 50% of the animals would be expected to die. Most medical
testing no longer uses this procedure, but the name has stuck (and serves as a reminder of
how ethical principles for science have evolved). From the estimates of a logistic regres-
sion LD50 is −β0/β1. If run on the two groups separately the confidence interval can be
found with the dose.p function from the MASS library. This is what Wright and Hall
used in their paper, but the confidence intervals could also be found with bootstrapping.
One purpose of this example is to go through in more detail how a researcher would
analyze data. Because there are only three variables, this is much simpler than most
datasets. The first command, read.table, accesses the data and then you must
attach them. The names command shows that the variable names are all in capitals.
GUILTY and FORM are binary variables. For GUILTY, 0 is not guilty and 1 is guilty,
and FORM is either 0 for the control group or 1 for the imagine instruction group.
The variable BELIEF is important because we want to see if it is affected by the condition,
and also we want to see how it relates to verdict. The left panel of Figure 6.4 shows the
default histogram of this variable. It is negatively skewed.
par(mfrow=c(1,2))
hist(BELIEF)
Figure 6.4 The left panel shows a histogram of the untransformed variable BELIEF,
showing a negative skew. The right panel shows a histogram of the variable
transformed with BELIEF^2
To check skewness, we used the skewness function in the library e1071. If you did not
know where a skewness function was you would type help.search("skewness")
and it would find a couple of functions for you. A rough estimate of the standard error of skewness is √(6/n), so an estimate of the 95% confidence interval is presented below this.
library(e1071)
skewness(BELIEF)
[1] -0.7808726
library(boot)
beliefboot <- boot(BELIEF,function(x,i) skewness(x[i]),R=1000)
boot.ci(beliefboot)
CALL :
boot.ci(boot.out = beliefboot)
Intervals :
Level Normal Basic
95% (-1.0507, -0.5155 ) (-1.0504, -0.5083 )
The 95% BCa CI does not overlap with 0 so we can be confident that the variable’s
distribution would be skewed if we had tested an infinite number of people from the
population from which this sample was drawn (they were University of Bristol [UK]
psychology students). Of course ‘an infinite number of people from the population from
which this sample was drawn’ does not exist (there were about 300 of them in any
given year). This is one of the conceptual problems people have to deal with when first
learning about hypothesis testing and confidence intervals (Wright & London, 2009),
but it is worth providing occasional reminders about the difficulties in making inference
with all statistics.
Because an assumption of many statistical tests is that the residuals are normally
distributed, not just symmetrical, it is often worth testing this. The two most used tests
are the Shapiro-Wilk test and the Kolmogorov-Smirnov test. Here, both of these are statistically significant. The Shapiro-Wilk test is usually preferred. The code and output
for both of these tests are:
shapiro.test(BELIEF)
data: BELIEF
W = 0.9271, p-value = 1.291e-07
ks.test(BELIEF,"pnorm")
data: BELIEF
D = 0.5502, p-value < 2.2e-16
alternative hypothesis: two-sided
Warning message:
cannot compute correct p-values with ties in: ks.test
(BELIEF, "pnorm")
When data are negatively skewed people often try squaring the variable, and as
can be seen below this works fairly well. The right panel of Figure 6.4 shows
hist(beliefsq). This transformed variable is fairly symmetric. As can be seen,
the BCa 95% CI overlaps with zero.
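The commands that create the squared variable and produce the bootstrap and regression output below are not printed; a minimal sketch (beliefsq and bboot2 are the names used in the output):
beliefsq <- BELIEF^2
hist(beliefsq)
bboot2 <- boot(beliefsq, function(x,i) skewness(x[i]), R=1000)
boot.ci(bboot2)
summary(lm(beliefsq ~ FORM))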
CALL :
boot.ci(boot.out = bboot2)
Intervals :
Level Normal Basic
95% (-0.2971, 0.1458 ) (-0.2950, 0.1516 )
Warning message:
bootstrap variances needed for studentized intervals in:
boot.ci(bboot2)
Call:
lm(formula = beliefsq ~ FORM)
Residuals:
Min 1Q Median 3Q Max
-0.46176 -0.21176 0.02824 0.17824 0.51573
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 0.46176 0.02572 17.952 <2e-16 ***
FORM -0.07500 0.03638 -2.062 0.0408 *
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
The following code shows that the default t.test in R produces something
very slightly different. If you want t.test to assume equal variances include
var.equal=T.
t.test(beliefsq~FORM)
The split command below creates a variable for the belief squared variable that
is separated by the groups of FORM. This is needed for some procedures, like the
Wilcoxon, and useful for others. The Wilcoxon is a test from the 1950s that still is
often used for ranked data. Here it compares the ranks for the belief variables for the
two groups. Wow, look at that p value … because both authors teach statistics we often
try to find examples like this which are incredibly close to that magical .05. We are far
more pleased than is healthy! It is also interesting because there are different ways to
calculate the Wilcoxon rank sum test (which is equivalent to the Mann-Whitney U).
Here, the continuity correction is used. If it is not used, the p value becomes a tiny
bit smaller (p = .04981). However, R calculates these differently than the original
Wilcoxon paper, than the way it is done in many introductory textbooks (like Wright &
London, 2009), and than even S-Plus. If the other method is used, then p = .0507. Of
course the difference between .049 and .051 should be of no interest to anyone, but
sadly it is.
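The split and Wilcoxon commands are not printed; a minimal sketch (beliefsplit is our name; wilcox.test applies the continuity correction by default):
beliefsplit <- split(beliefsq, FORM)
wilcox.test(beliefsplit[[1]], beliefsplit[[2]])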
Now we can do the logistic regressions. Much of the time conducting statistics will be
preparing the data and preliminary exploratory analyses. We begin with the belief squared
variable to predict the verdict. Next we add the experimental condition. The anova
command compares these two models. It gives us a χ 2 value which we can either look
up in a table, or do as we have done here and let R calculate the associated p value. We
could have written: pchisq(15.351,1,lower.tail=F). Either way, this shows
it is statistically significant, meaning that controlling for beliefsq, FORM has an effect
(8.9e-05 means 0.000089). If we had written anova(reg2,reg3,test="Chi"),
it would have printed the appropriate p value, too.
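The model-fitting commands are not printed; a minimal sketch using the reg2 and reg3 names referred to in the text:
reg2 <- glm(GUILTY ~ beliefsq, family=binomial)
reg3 <- glm(GUILTY ~ beliefsq + FORM, family=binomial)
anova(reg2, reg3)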
1-pchisq(15.351,1)
[1] 8.927366e-05
We then look at the values for this regression, although they are easier to see in
graphical form, which is done below. This does show that there is a large (and predictable)
effect for beliefsq, meaning that if you believe the person is guilty you are more likely
to deliver a guilty verdict. There is also an effect for FORM. Those in the experimental
group have a higher probability of making a guilty verdict once beliefsq has been
controlled for.
summary(reg3)
Call:
glm(formula = GUILTY ~ beliefsq + FORM, family = binomial)
Deviance Residuals:
Min 1Q Median 3Q Max
-2.5267 -0.6269 -0.1827 0.6119 2.1203
Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) -5.6442 0.8822 -6.398 1.58e-10 ***
beliefsq 9.7443 1.4747 6.608 3.90e-11 ***
FORM 1.7436 0.4836 3.605 0.000312 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Next, the model is plotted. Because there are two groups, for drawing the separate
lines it was necessary to split the BELIEF and the prediction variables, and draw these
as two separate lines. The original BELIEF variable is used rather than the transformed
one because this will be in an easier scale for people to understand. The lines
function requires the variables to be ordered. This is done with the order function
in the code below. This records the order of the belief variable for each group and then
the lines function says to plot the curve in this order. A dashed horizontal line at
50% probability of giving a guilty verdict (the LD50 line) and vertical lines where this
LD50 line touches the curves for the two groups have been added using the abline
function. These values (which are also printed with the paste function) are found with:
sqrt(-(reg3$coef[1]/reg3$coef[2])). Figure 6.5 is the resulting graph.
You would usually annotate the graph with words using the text, paste, and/or
expression functions.
plot(BELIEF,predict(reg3,type="response"),ylab="Probability of a guilty verdict")
beliefs <- split(BELIEF,FORM)
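The rest of the plotting code is cut off here; a sketch of the remaining steps described above (preds6, ord0, and ord1 are our names, and the second vertical line for the experimental group is our addition):
preds6 <- split(predict(reg3, type="response"), FORM)
ord0 <- order(beliefs[[1]])
ord1 <- order(beliefs[[2]])
lines(beliefs[[1]][ord0], preds6[[1]][ord0])
lines(beliefs[[2]][ord1], preds6[[2]][ord1], lty=3)
abline(h=.5, lty=2)                                  # the LD50 line
abline(v=sqrt(-(reg3$coef[1]/reg3$coef[2])))         # control group
abline(v=sqrt(-((reg3$coef[1]+reg3$coef[3])/reg3$coef[2])))  # experimental group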
Figure 6.5 A graph showing the predicted probabilities for rendering a guilty verdict
based on the person’s belief in guilt and the condition
It could also be interpreted as the effect only existing for values of BELIEF above
about .4. As several different interpretations are all valid descriptions of the data,
caution is urged in accepting any one of them. This is a general aspect of many statistical
procedures; showing that a model fits the data does not mean that the model is
correct or even good. To show a model is good you should compare how it fits with
alternatives.
anova(reg3,glm(GUILTY~FORM*beliefsq,binomial),
test="Chi")
We will do one final bit of analysis: the standard chi-square test, here between GUILTY
and FORM. First we look at the contingency table and then we run the chisq.test.
The correct=F means Yates’ correction factor is not used. If you use the correction
factor you get χ 2 (1) = 1.92, p = .17.
table(GUILTY,FORM)
FORM
GUILTY 0 1
0 54 44
1 32 42
chisq.test(GUILTY,FORM,correct=F)
This can also be run as a glm (as a log-linear model) but this requires creating variables
for the cells of the table above:
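The code that sets up those cell variables is not in the extract; a sketch consistent with the output below (the variable names imagine and guilty are taken from that output) is:
counts <- c(54, 44, 32, 42)    # cell counts from the table above
imagine <- c(0, 1, 0, 1)       # 0/1 indicator corresponding to the FORM columns
guilty <- c(0, 0, 1, 1)        # 0/1 indicator corresponding to the GUILTY rows
glm(counts ~ imagine + guilty, family=poisson)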
Coefficients:
(Intercept) imagine guilty
3.892e+00 5.414e-16 -2.809e-01
and you get the likelihood ratio Chi-square value, 2.378. The two values are
different because the two statistics are calculated differently: Pearson’s Chi-square is
Σ(Oij − Eij)²/Eij and the likelihood ratio Chi-square is 2 ΣOij ln(Oij/Eij).
mediatorbin(FORM,GUILTY,beliefsq)
produces Figure 6.6. The mediator function in Chapter 4 can be easily altered so
that it can be run for lots of different situations. Preacher and Hayes (2004) have been
producing similar functions for some software packages, so we expect someone will
do this for R in the near future.
SUMMARY
John Hoffmann (2004: viii) writes: ‘We are most fortunate to be living in a time when …’.
It sounds like one of those university entrance essays where they ask you to complete the
phrase and 99% of candidates provide some bland tear-jerk story about how humanity
can save itself from all the evils of Gomorrah. John is not like the 99%: ‘We are most
fortunate to be living in a time when the statistical tools for analyzing regression models
[Figure 6.6: mediation path diagram for the instruction (X), belief in guilt (M), and verdict (Y); the coefficients are given in the caption below]
Figure 6.6 The graph for a mediation analysis for Wright and Hall (2007). The graph
shows that the instruction (X) lowers the belief in guilt (M) with a coefficient of −0.075
(in units of the squared belief variable). The mediator has a positive influence on verdict (Y)
with a coefficient of 9.74. Once the belief has been included in the model, X remains a
significant positive predictor. Because the coefficients are from different types of
regression, a lot of care is needed interpreting them. The Sobel test has a
p value of .051
R functions
• glm: for generalized linear models;
• shapiro.test: to conduct Shapiro-Wilk test;
• ks.test: to conduct Kolmogorov-Smirnov test;
• wilcox.test: to conduct Wilcoxon tests;
Statistical concepts
• GLM: generalized linear model;
• link: how the response is transformed;
• logit: the log odds;
• logistic regression: regression with proportions;
• Bernoulli: the distribution for single coin flips;
• Binomial: the distribution for multiple Bernoulli trials;
• Poisson: a distribution usually assumed for count data.
FURTHER READING
Hoffmann, J. P. (2004) Generalized linear models: An applied approach. Boston, MA: Pearson
Education Inc. This is a good introduction and is at about the right level for graduate psychology
students. We hope he reads (and likes) the final paragraph of this chapter!
7
Regression splines and generalized additive models
Learning outcomes
1. To understand why you would want to use splines.
2. To know how to include different regression splines;
• different degree polynomials; and
• different numbers of knots.
3. To understand the basic concepts of the generalized additive model.
Generalized additive models extend the regression model by combining linear and nonlinear components. The choice of functions, which often comes down to
the type and complexity of the splines, is critical.
There are different types of splines used in statistics. While all are mathematically
complex, we will use one that is simpler than most, is flexible, and fairly easy to
understand in R. It is called a B-spline and is found using the bs function in the library
splines, which is installed as part of the main R distribution. The packages gam (Hastie,
2008) and mgcv (Wood, 2006) are also often used and these are necessary for more
complex models, but not for those usually used in psychology.
The purpose of regression splines is to draw a curve through a scatterplot. One type
of curve that is often used in regressions is a polynomial. Polynomials have different
degrees. A 0 degree polynomial is a constant, yi = β0. A 1 degree polynomial is a
straight line and corresponds to the simple linear regression, yi = β0 + β1xi. A 2 degree
polynomial is a quadratic regression of the form, yi = β0 + β1xi + β2xi². A 3 degree
polynomial is a cubic regression, yi = β0 + β1xi + β2xi² + β3xi³. The intercept, β0,
is found with all these regressions. The number of additional β values estimated is a
measure of the complexity of the model. It is called the number of degrees of freedom
for that variable. So, a 2 degree polynomial has 2 degrees of freedom. Figure 7.1 shows
some data and regressions for polynomials of 0, 1, 2 and 3 degrees.
set.seed(2007)
x <- (1:100)/50
y <- 4 + 2*x - 3*x^2 + x^3 + rnorm(100,0,.3)
par(mfrow=c(1,4))
plot(x,y, main="0 degree", xlab="", ylab="", pch=19, axes=F)
box()
abline(h=mean(y),lwd=1.5)
for (i in 1:3) {
plot(x,y, main=paste(i,"degree"), xlab="", ylab="", pch=19, axes=F)
box()
lines(x,predict(lm(y~poly(x,i))),lwd=1.5)}
par(mfrow=c(1,1))
Two aspects of the code need to be explained. The first is that a new function, poly,
is used. This sets up polynomial contrasts according to the degree that you put in. If you
want a quadratic regression you write lm(y~poly(x,degree=2)) or in shorthand
lm(y~poly(x,2)). The second aspect is that poly does not allow degree=0.
Because of this the first part of the code makes the 0 degree graph and the remaining
three panels are made by the for loop.
The data in Figure 7.1 were made using a cubic equation plus random error so we would
expect the cubic regression in the right panel of Figure 7.1 to perform best. The 0 degree
polynomial (the mean) clearly does not capture the data. The 1 degree polynomial
(straight line) fails to capture many of the high and low points and the 2 degree polynomial
(quadratic) looks similar to the 1 degree polynomial so also fails to capture these points.
The 3 degree (cubic) does appear to capture the data the best. The anova function can
also be used with these models, and the results confirm what is visible in Figure 7.1.
anova(lm(y~1),lm(y~poly(x,1)),lm(y~poly(x,2)),lm(y~poly(x,3)))
Model 1: y ~ 1
Model 2: y ~ poly(x, 1)
Model 3: y ~ poly(x, 2)
Model 4: y ~ poly(x, 3)
Res.Df RSS Df Sum of Sq F Pr(>F)
1 99 18.5660
2 98 12.2077 1 6.3583 68.4588 7.452e-13 ***
3 97 12.1816 1 0.0260 0.2804 0.5977
4 96 8.9163 1 3.2654 35.1576 4.777e-08 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
The problem with polynomial regressions is that they are fairly inflexible and
seldom fit the entire data set well. You can increase the complexity of the polynomial
by increasing its degree, but this has two associated problems. First, there are few
psychological relationships which you would predict would have this complex a function.
Second, polynomials tend to shoot up or down at the ends and to be very influenced by
points at the ends. The solution to this is to add together pieces of these polynomials.
This is the basis of splines.
The default for the splines that we will use estimates 1 curve for the first half of the
data and one for the second half. This could be done by running separate regressions.
For example, for the data in Figure 7.1, we could run a linear regression for the
first half of the data and another for the second half:
par(mfrow=c(1,2))
qreg1 <- lm(y[1:50] ~ x[1:50])      # regression for the first half (implied by the text)
qreg2 <- lm(y[51:100] ~ x[51:100])  # regression for the second half
plot(x,y, main="Piecewise linear", xlab="", ylab="", pch=19, axes=F)
box()
lines(x[1:50],predict(qreg1),lwd=1.5)
lines(x[51:100],predict(qreg2),lwd=1.5)
The problem is these lines do not touch. This suggests that there is some sudden shift
in the relationship between the variables at this point. As shifts like this are unlikely
(which is one reason why we argued against median splits in Chapter 3), it would be an
advantage to make the curves meet. This is what splines do, and is where the mathematics
become complex. The splines not only make the curves meet at a single point, called a
knot, but they make them meet as smoothly as possible. If the two curves are quadratic
then they form a straight line at the knot. If the curves are cubic then they form a quadratic
at the knot. In R there is a package called splines which contains the bs function,
which we will use to generate splines in this chapter. There are other spline functions, but
the bs function will suffice for most purposes.

Figure 7.2 The left panel shows separate linear regressions for the first and second
halves of the data. The right panel shows a cubic spline with a single knot at the
median (the default location for a single knot)

Like the poly function, you need to tell
bs what degree polynomial you have. You also need to tell it either how complex the
spline is with the df = option, or the location of the knots with knots = c(knot1,
knot2,...). A very clever aspect of splines is that each additional curve requires only
one additional df, so that increasing the number of knots does not greatly increase the
complexity of the model in terms of degrees of freedom. This is because the curves are
constrained to fit together smoothly. Thus, if we want two cubic (degree=3) curves
with a single knot it requires only df=4. The right panel of Figure 7.2 shows this:
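The code that drew the right panel is not shown in the extract; a minimal sketch, reusing x and y from Figure 7.1, might be:
library(splines)
plot(x, y, main="Cubic spline, 1 knot", xlab="", ylab="", pch=19, axes=F)
box()
lines(x, predict(lm(y ~ bs(x, df=4))), lwd=1.5)   # df=4 gives a cubic with one knot at the median
par(mfrow=c(1,1))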
The default for bs is degree=3 (cubic) without any knots. If you add one knot by
making df equal to degree +1 then its default is to place the knot at the median of
the variable. If you have two knots then the knots are placed at the first and second
tertiles, etc. If you want to place the knots at different values this can be done with
the knots option. This and other aspects of the bs function are illustrated in the next
example with a small data set. After this, two examples with larger data sets illustrate the
typical use of these splines.
The default smoothing methods are heavily dependent on the data rather than a theory,
which means that with large amounts of high quality data they can be useful exploratory
tools, but with smaller samples (as typical in much psychology research) they need to
be used cautiously. We believe their most valuable uses in psychology are as follows.
1. When you do not want to make any assumptions about the relationship between a covariate
and the response variable. With the typical ANCOVA, it is assumed that the covariate is a
linear predictor. It may be appropriate to relax this assumption using a spline of the covariate.
We call this a GAMCOVA (Wright, 2008).
2. To test if a relationship is not linear.
3. As an exploratory graphical tool to plot curves within a scatterplot to search for patterns, but
be careful, particularly at the ends of the scales, not to over-interpret sudden changes.
We examine these uses in our examples. We are less concerned with the exact form of
the spline than is typically the case in either the statistics literature or the fields with
larger amounts of data.
We will begin with an example to illustrate the concepts. The data come from Thornton
(2007: 22) on university faculty salaries in the US over the past three decades. The data
show the amount of increase/decrease in salary in real terms (i.e., taking into account
inflation) for each year.

Figure 7.3 A scatterplot of the percentage change in real terms of academics’ salaries
by year. The dashed line is for 0%, or no change
The first step is to create a scatterplot and then to add the curves corresponding to different
models to it. Figure 7.3 shows the data.
plot(years,salary,xlab="Year",ylab="Salary change in
real terms",pch=20)
abline(h=0,lty=2)
From Figure 7.3 it is clear that in the late 1970s academic salaries did not keep
pace with the high rate of inflation. Usually time series methods (Chatfield, 2003;
Shumway & Stoffer, 2006) are used to model data like these, particularly if you have
lots of data for each year. Here, regression splines will work well to compare potential
models that may account for the pattern in these data. The models that will be used are
as follows.
The first model, salary1, is just the simple linear regression between the two
variables, the same as lm(salary~years). To make the parallel with the remaining
examples clearer, we will write this as lm(salary~bs(years, degree=1)).
The coefficients from these two models are not the same as if we had run
lm(salary~years), but this is because of the way bs works. For this simple model
bs(years,degree=1) takes the years variable and re-scales it so the values go
between 0 and 1. With more complicated splines it gets more difficult to match the
coefficient estimates onto something meaningful with respect to the original variables.
The bs function re-scales the variable into a set of basis variables that range between 0 and 1,
so the coefficients refer to these basis variables rather than to the original units. Therefore,
the norm is to look at these models graphically and to compare them with anova.
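The commands that create this first model are not shown in the extract; following the formula given above and the object name used later, they would be along these lines:
salary1 <- lm(salary ~ bs(years, degree=1))
summary(salary1)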
Call:
lm(formula = salary ~ bs(years, degree = 1))
Residuals:
Min 1Q Median 3Q Max
-8.2945 -0.9241 -0.0241 0.9511 6.2289
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) -2.782 1.268 -2.194 0.0374 *
bs(years, degree = 1) 4.821 2.001 2.409 0.0234 *
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
The first panel of Figure 7.4 shows this linear model. fitted(salary1)
means the predicted values from this model. predict(salary1) and
salary1$fitted.values will produce the same values.
par(mfrow=c(2,3))
plot(years,salary,xlab="Years",ylab="Salary change",pch = 19)
abline(h=0,lty=2)
lines(years, fitted(salary1))
Figure 7.4 Different models for the change in real terms of academics’ salaries with
years (Thornton, 2007). The top row shows a single linear model, a piecewise linear
model with a knot at the median (1992.5 years), and a piecewise linear model with a
knot at 1986. The bottom row shows quadratic models with: no knots, a knot at the
median, and a knot at 1986
Clearly the single straight line fails to account for the data. It overestimates change in
salary during the 1970s and it is unlikely that the percentage increase in real terms will
continue rising at this rate in the coming decades. The next step is to slightly increase
the complexity of the model. We will do this by assuming the model is fit by two straight
lines joined at a single point. This single point is called a knot. The default within the
bs function is to place the knot at the median.
If you have two knots it will set them at the 33rd and 67th percentiles. If you
have a specific x value where you want to place the knot you can use the knots
option. For example, if you thought a priori there might be a change at 1986
because that was when some changes in contract legislation occurred, you could
write: lm(salary~bs(years,degree=1,knots=1986)). If you want two
knots then you could write, for example, lm(salary~bs(years,degree=1,
knots=c(1980,1986))). The model which assumes that the trajectory of academics’
salaries changed in 1986 is plotted in the third panel of Figure 7.4 and the code
is below.
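The code referred to here is not shown in the extract. A sketch of what it might look like, with salary2 holding the knot at the median (df=2) and salary3 the knot at 1986, is:
salary2 <- lm(salary ~ bs(years, degree=1, df=2))
salary3 <- lm(salary ~ bs(years, degree=1, knots=1986))
plot(years, salary, xlab="Years", ylab="Salary change", pch=19)
abline(h=0, lty=2)
lines(years, fitted(salary3))
# the second panel is drawn in the same way using fitted(salary2), and the three
# models can then be compared with anova(salary1, salary2, salary3)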
The anova function can be used to compare the fit of these three models.
The comparison between the first two models shows a significant improvement,
F(1, 25) = 4.93, p = .04. There is only one further degree of freedom used because
to draw the additional line all you need to know is its slope, since the knot is assumed
to be the median of the years variable. The third model fits better. The residual sum
of squares is lower when we place the knot at 1986 than at the median. There is no
significance test for whether this is better than the second model because the two use the
same degrees of freedom in their models. In other words, they each estimate the same
number of things (the intercept and the slopes of the two lines). Looking at Figure 7.4,
the third model appears to fit well and shows that since 1986 the trend is that academics’
salary increases have been getting smaller in real terms (as academics, our main hope is
that the spline stays above the dashed 0% line).
The next set of models uses 2 degree polynomials (i.e., quadratics). Here is the code to
create each of these models and to graph them in the bottom row of Figure 7.4. When bs
connects curves it does so in the smoothest way possible. For these curves it means that
they connect in a straight line. This makes the mathematics complex, but it means only
one extra degree of freedom is needed for each knot. This is why df=3 is appropriate
for salary5.
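The commands that create these quadratic models are not shown in the extract; a sketch following the naming used in the anova call below is:
salary4 <- lm(salary ~ bs(years, degree=2))              # quadratic, no knots
salary5 <- lm(salary ~ bs(years, degree=2, df=3))        # quadratic, knot at the median
salary6 <- lm(salary ~ bs(years, degree=2, knots=1986))  # quadratic, knot at 1986
# each curve is added to its panel with lines(years, fitted(salaryX))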
anova(salary4,salary5,salary6)
and they can be compared with the linear models. For example,
anova(salary2,salary5)
shows that the quadratic model with one knot at the median (the default location for a
single knot in bs) is not a significant improvement upon the linear model, F(1,24) =
3.00, p = .10. Interestingly, the linear model with the knot at 1986 fits even better than
the quadratic one. This is because the quadratic model, as well as having two quadratic curves,
has the requirement of a smooth transition between them, so, as Figure 7.4 shows, it
does not get very near the outliers at the top of the plot.
Finally, the default spline in the bs function is a cubic curve (so degree=3)
with no knots. So writing lm(salary~bs(years)) produces a cubic function
through the data (the same as lm(salary~poly(years,3))). To add a knot, use
lm(salary~bs(years, df=4)).
1 When comparing models with different splines, allowing more flexibility does not mean that the less complex
model is nested within the more complex model. This is because of the way the splines are constrained: different
contrasts are used rather than just having additional ones (as with polynomial contrasts). This means the models
are not necessarily nested, so care is required using F values to compare models. These difficulties are more
apparent with more complex models than are typical in psychology.
We will go through an example using data from 534 respondents on hourly wages
and several covariates (experience in years, gender and education in years) (Berndt,
1991). There was one outlier with an hourly wage of $44 (z = 6.9) and it was removed,
but the data remained skewed (1.28, se = 0.11). Logging these data removes the skew
(0.05, se = 0.11), so a fairly common approach is to use the logged values as the response
variable and assume that the residuals are normally distributed. These data, with the
outlier removed, can be accessed in the usual way. The package splines is activated
and the variable names are shown.
Suppose the researchers’ main interests are with the experience variable (EXPER),
and whether LNWAGE steadily increases or whether it increases rapidly until some point
(a knot) and then increases but less rapidly. For argument’s sake, let us assume that the
increases are both linear with the logged wages and that placing the knot at the median
of EXPER is okay. The researchers accept that wages increase with education (EDUC)
and believe that the relationship may be nonlinear, and so allow this to be modeled with
a spline. As the variable FEMALE is binary only a single parameter is needed to measure
the difference in earnings between males and females. While categorical variables can
be included within GAMs, the purpose of GAMs is to examine the relationships between
quantitative variables and the response variable. The first model is:
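The model specification itself does not appear in this extract; from the summary output further below it would be:
wage1 <- lm(LNWAGE ~ bs(EXPER, df=1, degree=1) + bs(EDUC, df=4) + FEMALE)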
This is like a normal multiple linear regression for the variables FEMALE and EXPER,
the model fits both as conditionally linear with LNWAGE, but the relationship for EDUC
is more complex.
As with many of the regression procedures discussed in this book we will compare two
models which vary in how much information they use to predict the response variable.
The first one we create will be the simpler. To predict LNWAGE we use a single straight
line for EXPER. This is shown by using the bs function, for B-spline, but telling it
that degree=1 which means to use straight lines and df=1. As explained in the
previous example, df=x refers to total degrees of freedom for the spline. Since it is
the same as the degree of the polynomial (i.e., 1), it means there are no knots and only
a single line is used. We expect the relationship between EDUC and LNWAGE to be
non-linear. For EDUC we use a flexible spline with degree=3 (cubic) and a single knot
(df=4). Cubic splines are often used because when they are placed together it is difficult
to see where the knots lie so that the curve appears very smooth. This is the default for
some of the spline functions (like ns), but for simplicity we will continue to use bs.
The model and some summary information are:
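The summary command itself is not shown in the extract; it would simply be:
summary(wage1)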
Call:
lm(formula = LNWAGE ~ bs(EXPER, df = 1, degree = 1)
+ bs(EDUC, df = 4) + FEMALE)
Residuals:
Min 1Q Median 3Q Max
-2.144173 -0.295100 -0.002809 0.307351 1.144169
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 1.18445 0.34391 3.444 0.000619 ***
bs(EXPER, df = 1,
degree = 1) 0.73050 0.09550 7.649 9.72e-14 ***
bs(EDUC, df = 4)1 0.02196 0.50041 0.044 0.965009
bs(EDUC, df = 4)2 0.38501 0.32881 1.171 0.242163
bs(EDUC, df = 4)3 1.17094 0.36977 3.167 0.001631 **
bs(EDUC, df = 4)4 1.18195 0.34541 3.422 0.000670 ***
FEMALE -0.26411 0.03897 -6.777 3.30e-11 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
anova(wage1)
Response: LNWAGE
Df Sum Sq Mean Sq F value Pr(>F)
bs(EXPER, df = 1,
degree = 1) 1 2.006 2.006 10.163 0.001518 **
bs(EDUC, df = 4) 4 30.537 7.634 38.678 < 2.2e-16 ***
This output tells us that all the variables are significant predictors of LNWAGE, and
that for EDUC this is as a fairly complex curve through the data. As with the previous
example much of this output is difficult to interpret without graphs. When there are
several predictor variables the methods used in the previous figures need to be adapted.
Two methods will be used. In Figure 7.5 the predict function is used, as with Figures
7.1–7.4, but split by gender and with different lines for different amounts of experience.
In Figure 7.6 the gam function (Hastie, 2008) is used. For complex problems it is easier
to use this function rather than lm (or glm for the next example), and it can also be used
with other smoothing functions.
The left and right panels of Figure 7.5 show the relationship between EDUC and
LNWAGE for males and females, respectively. Within each panel five different lines are
used to show different levels of experience. The five lines correspond to the minimum,
first quartile, median, third quartile, and the maximum (basically the 5-point summary,
Tukey, 1977). These are found with:
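The command is not shown in the extract; one way these values might have been computed (experq is the name used in the code below) is:
experq <- quantile(EXPER)   # minimum, quartiles, and maximum by default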
We then use the predict function with the model, wage1, but providing the function
with new values for EXPER, EDUC, and FEMALE. Because EDUC goes from 2 to 18,
a good range to plot values is from 0 to 20. A mini-dataset with these values
for each of the five quantiles for males (who have the value 0 for FEMALE) is created with:
minimales <-
data.frame(EXPER=rep(experq,each=21),EDUC=rep(0:20,5),
FEMALE=rep(0,105))
Figure 7.5 The predicted values from wage1 for different amounts of experience for
males and females
minifemales <-
data.frame(EXPER=rep(experq,each=21),EDUC=rep(0:20,5),
FEMALE=rep(1,105))
par(mfrow=c(1,2))
plot(minimales$EDUC,predict(wage1,minimales),
xlab="Years in education", ylab="Predicted ln(wage)",
pch=19,cex=.5,ylim=c(.5,3))
lines(0:20, predict(wage1,minimales[1:21,]),lwd=.5)
lines(0:20, predict(wage1,minimales[22:42,]),lwd=1)
lines(0:20, predict(wage1,minimales[43:63,]),lwd=1.5)
lines(0:20, predict(wage1,minimales[64:84,]),lwd=2)
lines(0:20, predict(wage1,minimales[85:105,]),lwd=2.5)
text(6,2.5,"More experience")
text(15,1.5,"Less experience")
title("For males")
plot(minifemales$EDUC,predict(wage1,minifemales),
xlab="Years in education", ylab="Predicted ln(wage)",
pch=19,cex=.5,ylim=c(.5,3))
lines(0:20, predict(wage1,minifemales[1:21,]),lwd=.5)
lines(0:20, predict(wage1,minifemales[22:42,]),lwd=1)
lines(0:20, predict(wage1,minifemales[43:63,]),lwd=1.5)
lines(0:20, predict(wage1,minifemales[64:84,]),lwd=2)
lines(0:20, predict(wage1,minifemales[85:105,]),lwd=2.5)
text(6,2.5,"More experience")
text(15,1.2,"Less experience")
title("For females")
par(mfrow=c(1,1))
Making Figure 7.5 is a hassle and easily open to error. If your interest is just in the
shape of the spline for education, an easier method is using the gam function. After
loading the gam package, you run the model, but with the gam function rather than lm:
install.packages("gam")
library(gam)
wage1a <- gam(LNWAGE~bs(EXPER,df=1,degree=1) + bs(EDUC,
df=4) + FEMALE)
The summary function shows that this is the same model as wage1, as shown by
the output from anova(wage1). The residual deviance of wage1a (103.8207) is
the same as the residual sum of squares from wage1 (103.821), and the
difference between the null deviance and residual deviance for wage1a
(145.4278 − 103.8207 = 41.6071) is the same as the total model sum of squares from
wage1: 2.006 + 30.537 + 9.064 = 41.607.
summary(wage1a)
Deviance Residuals:
Min 1Q Median 3Q Max
-2.144173 -0.295100 -0.002809 0.307351 1.144169
Df
(Intercept) 1
bs(EXPER, df = 1, degree = 1) 1
bs(EDUC, df = 4) 4
FEMALE 1
gam is for generalized additive models, and as with the generalized linear models
discussed in Chapter 6, a dispersion parameter is estimated. It is the residual deviance divided
by the residual degrees of freedom (103.8207/526). The reason why we use gam here is
because when a gam.object is used in plot, it shows the shapes of all the predictor
variables. Figure 7.6 shows these. The standard error intervals are shown with dashed
lines by using the option se=T. The code is much shorter than that used to construct
Figure 7.5.
par(mfrow=c(1,3))
plot(wage1a, se=T)
Figure 7.6 shows that as EXPER increases so does LNWAGE. It shows that as EDUC
increases LNWAGE also increases. Finally, as gender moves from male to female,
LNWAGE decreases. Of course, with a variable like GENDER this does not make sense.
If FEMALE was stored as a factor then boxplots would be shown in this final graph.
Try re-running the analyses with FEMALE2 <- as.factor(FEMALE) in the gam
command that created wage1a.
Next, a slightly more complicated spline is used for EXPER. Two straight lines,
connected at a knot at the median of EXPER, are used. This can be run either with
lm or gam. We will use gam since the resulting graphs are easier to construct.
Figure 7.6 The graphs made for the generalized additive model on the log of wages
To change this, the option df=1 should be changed to df=2 with the bs function
for EXPER:
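The command itself is not shown in the extract; based on this description and the axis label in Figure 7.7, it would be something like:
wage2a <- gam(LNWAGE ~ bs(EXPER, df=2, degree=1) + bs(EDUC, df=4) + FEMALE)
summary(wage2a)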
This model can be compared to the previous one with the anova function. To be
consistent with the previous comparisons, the F test option is used.
anova(wage1a,wage2a,test="F")
This allows us to say that the model with a single knot fits significantly better than the
model without a knot, F(1,525) = 32.95, p < .01. The AICs of the two models can also
be compared by looking at the output from the summary functions; AIC drops from
656.68 to 626.23. Remember, AIC and BIC are measures of lack of fit (with a penalty for
model complexity) – the larger their values the worse the model. Similarly, the proportion of total deviance accounted for
by the model increases from (145.4278 − 103.8207)/145.4278 = .286 for wage1a to
(145.4278 − 97.6892)/145.4278 = .328 for wage2a.
To understand the model better it should be graphed. Figure 7.7 shows the results of:
plot(wage2a,se=T)
These show the same basic pattern for EDUC and FEMALE. The difference is for EXPER.
This shows that the log of salary increases rapidly at the beginning, but less so afterwards.
Figure 7.7 The plots from the generalized additive model of wage2a
A caveat is worth adding here, but it applies to most statistical models. It is important
not to interpret this result to mean that the model with one knot is right. Models after
all are just models of reality, and the usual interpretation of this means that they are
never ‘right,’ but they may be very good. Cox (2006: 31) describes this well: ‘the
very word “model” implies idealization. With very few possible exceptions it would
be absurd to think that a mathematical model is an exact representation of a real
system.’
Finally, many readers will be interested in the gender gap in pay. Because an argument
for gender inequality in the past has been that it was due to experience and education,
it is worth using very flexible splines for these variables and seeing whether FEMALE is
still significant.
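The code that fits this model is not shown in the extract; judging from the coefficient labels in the output below it would be something like (the object name wage3 is our own):
wage3 <- lm(LNWAGE ~ bs(EXPER, degree=3, df=4) + bs(EDUC, degree=3, df=4) + FEMALE)
summary(wage3)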
Call:
lm(formula = LNWAGE ~ bs(EXPER, degree = 3, df = 4) + bs(EDUC, degree = 3,
df = 4) + FEMALE)
Residuals:
Min 1Q Median 3Q Max
-2.20144 -0.27374 0.02893 0.28891 1.19476
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 0.99061 0.37215 2.662 0.008010 **
bs(EXPER, degree = 3,
df = 4)1 0.53593 0.14628 3.664 0.000274 ***
bs(EXPER, degree = 3,
df = 4)2 0.82210 0.15968 5.148 3.73e-07 ***
bs(EXPER, degree = 3,
df = 4)3 0.82448 0.25049 3.291 0.001064 **
bs(EXPER, degree = 3,
df = 4)4 0.72980 0.28376 2.572 0.010390 *
bs(EDUC, degree = 3,
df = 4)1 -0.02853 0.50956 -0.056 0.955373
bs(EDUC, degree = 3,
df = 4)2 0.15374 0.36227 0.424 0.671463
bs(EDUC, degree = 3,
df = 4)3 1.00241 0.39159 2.560 0.010753 *
bs(EDUC, degree = 3,
df = 4)4 0.95806 0.37396 2.562 0.010687 *
FEMALE -0.26329 0.03778 -6.970 9.62e-12 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Figure 7.8 The graphical output with cubic splines with a single knot each for EXPER
and EDUC. This shows, even allowing for very flexible main effects between these
variables and LNWAGE, that gender still makes a difference
So, the difference still remains, t(523) = 6.97, p < .001. Figure 7.8 shows the graphical
output:
plot(gam(LNWAGE~bs(EXPER,degree=3,df=4)+
bs(EDUC,degree=3,df=4)+ FEMALE),se=T)
par(mfrow=c(1,1))
In Chapter 6 the generalized linear model was shown to be an extension of the linear
model. Additive models using splines can also be used with generalized models. The glm
function or the gam function can be used with B-splines using bs. The glm function
provides more useful numeric output and the gam function is easier to use for graphs.
Further, the gam function is necessary for complex splines not covered in this book. In
this example the focus will be on graphs, so the gam function will be used.
To illustrate a logistic additive model, data inspired by truth and lie detection using
criteria-based content analysis (CBCA) (Vrij, 2005) will be used. This is a method used
in several countries to try to determine whether a child is telling the truth or a lie when
questioned, usually in connection with cases of child sexual abuse. There are 19 criteria
and a statement can be given a 0, 1, or 2 for each criterion, and these are summed so that
each person can get a score from 0 to 38 with high scores indicating more truthfulness.
One problem with this procedure is that people with more linguistic skills tend to have
higher scores than people with less linguistic skills. Because of this there is assumed to
be a complex relationship between age, CBCA score, and truth.
Suppose there are 1000 statements from people who are 3–22 years old, and these
have CBCA scores and it is known whether the statements are truthful or not. We will
create the data below, so that we know age and truth should be independent (and in the
sample they are). Three GAMs were estimated. The first has just CBCA to predict truth.
This uses the logit link function and assumes binomial variation.
We create the three variables here. Notice that in calculating truth, age is not used;
thus the two are independent in one sense. The coding for cbca looks complex. We were
playing with the coding to make it look like what is suggested in Vrij’s review. Nobody
has done a deception study with 1000 people.
set.seed(406)
age <- runif(1000,3,23)
truth <- rbinom(1000,1,.5)
cbca <- round(-3*truth + 2*truth*log(age)
+ .2*age + rbinom(1000,25,.7))
The following shows that there is not a statistically significant relationship between age
and truth. This is the basic logistic regression (see Chapter 6).
summary(glm(truth~age,family=binomial))
Call:
glm(formula = truth ~ age, family = binomial)
Deviance Residuals:
Min 1Q Median 3Q Max
-1.214 -1.189 1.142 1.165 1.190
Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) 0.102594 0.154826 0.663 0.508
age -0.005776 0.010941 -0.528 0.598
It is worth making sure that there was not an odd relationship between truth and
age, for example younger and older children being truthful, but those in the middle lying.
Figure 7.9 The relationship between truth and age as shown with the default
smoothing spline. There appears to be no relationship
Figure 7.9 shows the graph for this relationship using gam and bs. The spline is two
cubic curves (degree=3 is the default) connected at a single knot at the median of age
(the median being the default). By setting se=T R adds lines for ±se. The end points on
splines tend to have large confidence intervals. There are some methods to adjust for this
(see Wood, 2006). Because this is a logistic model we have written family=binomial
(see Chapter 6).
library(gam)
plot(gam(truth~bs(age,df=4),family=binomial),se=T)
Now we focus on the relationship between cbca and truth. The bs(cbca,
df=4) in the function below means the spline is two cubic curves connected at the
median. For most psychology examples this provides enough flexibility, although for
large data sets you may wish to increase the number of knots by increasing df= to a
higher number.
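The model itself is not printed in this extract; a sketch, using the name gam0 that appears in the anova call later, is:
gam0 <- gam(truth ~ bs(cbca, df=4), family=binomial)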
The top left panel of Figure 7.10 shows that there is a positive association between
cbca and truth. If you are making a graph using the predict function with the
glm function, use type="response" as in Chapter 6, if you want the y-axis to be in
the original scale. Next the age variable is added. We have used the same flexibility in
the spline for this variable.
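Again the command is not shown; a sketch of the second model (gam1 in the anova call below) is:
gam1 <- gam(truth ~ bs(cbca, df=4) + bs(age, df=4), family=binomial)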
Figure 7.10 The basis functions for 3 generalized additive models for the (made-up)
CBCA data. The first row shows the spline between cbca and truth (for gam0);
the scatterplot of cbca with age, and boxplots showing truthful statements had,
overall, higher cbca scores. The second row shows the splines between cbca and
truth, and age and truth (for gam1); and boxplots showing that the distributions
of age for false and true statements are similar. The third row shows the splines for
gam2. The effect of most interest is the interaction. It shows that the diagnostic value
of CBCA scores increases with children’s age
The two models can be compared using the anova function. For these generalized
models, test="Chisq", which is the default, is appropriate.
anova(gam0,gam1)
This shows that including age increases the predictive value for predicting truth, χ2(4) =
48.86, p < .001. The second row of Figure 7.10 shows that, once controlling for cbca, as
age increases the likelihood of the statements being truthful decreases. This conditional
relationship occurs because cbca and age are positively correlated (see the second panel,
top row, of Figure 7.10).
When using some types of splines, and with other complex models, the number
of degrees of freedom will not be a whole number. Degrees of freedom is just a
measure of information so non-integer values present no conceptual difficulties.
With the models generated by bs the degrees of freedom should be whole numbers
(or at least within rounding error).
The main hypothesis is that the interaction improves the fit of the model, so the interaction is now included.
There was an interaction term used in creating the data, so we would expect this to be
significant. If the effect were non-significant it would be a Type 2 error. We created the
interaction, inter <- cbca*age, prior to running the gam function.
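The command for the interaction model is not shown; a sketch (gam2 in the anova call below) is:
inter <- cbca*age
gam2 <- gam(truth ~ bs(cbca, df=4) + bs(age, df=4) + bs(inter, df=4), family=binomial)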
anova(gam1,gam2)
This shows that the interaction is significant: χ2(4) = 24.67, p < .001. It is best to
interpret this while looking at the graphs.
Figure 7.10 is a 3×3 panel graph that illustrates the models in this example. We begin
by telling R to treat truth2 as a categorical variable. This means in the plot commands
below it draws boxplots rather than scatterplots when truth2 is on the x axis. The
se=T adds the standard error lines. The pch="." tells R to make the symbols for the
scatterplot as small as possible. This is often necessary when you have large amounts
of data.
par(mfrow=c(3,3))
truth2 <- factor(truth,label=c("false","true"))
plot(gam0, se=T)
plot(age,cbca,pch=".")
plot(truth2,cbca,ylab="cbca")
plot(gam1, se=T)
plot(truth2,age,ylab="age")
plot(gam2, se = T)
par(mfrow=c(1,1))
The first row shows that there is a positive bivariate relationship between cbca and
truth. The relationship looks like the probability of a statement being truthful goes
up with cbca, mostly for large values of cbca, but that at lower values cbca is not
very diagnostic. The second panel in the first row shows age and cbca are positively
associated and the third panel shows that true statements tend to have higher cbca
scores than false ones. The second row shows the relationship between cbca and truth
continues after controlling for age. age has a strong negative relationship with truth,
but it is important to recall that this is after controlling for cbca. As the boxplots to the
right show, there is not a bivariate relationship between age and truth. The third row
shows the interaction model. The interaction is shown in the final panel of this row and it
is clear that the diagnostic value for cbca increases linearly with age. In other words,
the cbca is only valid for older children (according to these made-up data!).
KERNEL METHODS
There are several smoothing functions, other than B-splines, which are used by
researchers. One popular type is called kernel methods. These calculate a line for
each point along the x axis and join these together. The line is based on the data
within some distance of that point on the x axis. The smoothness of the curve is
dependent on how narrow or wide this region (or kernel) is. The wider the region is
the smoother the curve. These methods are often called locally weighted methods
because only values near that point on the x axis are used in calculating the
line. Sometimes the functions used will weight all the values in the region equally;
sometimes their weight will be dependent on how far they are from the x value.
The procedures can use ordinary least squares or robust (see Chapter 9) methods.
In R the functions lo, loess, and lowess are the most common. Kernel methods
can be mathematically complex.
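As a brief illustration (this code is not from the original text), a locally weighted curve can be drawn through the simulated data from Figure 7.1 with the base R lowess function:
plot(x, y, pch=19)
lines(lowess(x, y, f=2/3), lwd=1.5)   # f sets the width of the local window; larger is smoother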
Figure 7.11 The plot for a generalized additive model with a df = 4 B-spline for age in
months predicting final recall partialling out initial recall. Poisson error is assumed with
the ln link function
The recall data used earlier in the book (the lordex data) can now be re-analyzed
with the tools covered in Chapters 6 and 7. We noted that there was the problem of
floor effects: that somebody who only recalled 1 or 2 items could not decline as much as
someone who initially recalled 8 or 9 items. While floor effects can never be completely
resolved with statistical tools, a Poisson regression works well for count data like these
which have a large skew. In addition, there was worry about whether the relationship
between age in months and the final recall, after accounting for initial recall, really was
linear. It is more likely that memory skills increase rapidly in earlier years and then
level off. Now with GAMs as part of our statistical arsenal we can further explore these
data. Several models were explored, but the model shown in Figure 7.11 fits the data
well. The following code produces this graph. As expected, memory retention increases
rapidly for young children, but this level of improvement does not carry on as the children
get older.
attach(lordex)
gam1 <- gam(Final ~ Initial + bs(AGEMOS,df=4),poisson)
plot(gam1,se=T,terms="bs(AGEMOS, df = 4)",1,pch=19,
xlab="Age in months",ylab="Predicted conditional
recall",cex.lab=1.5,cex.axis=1.3)
SUMMARY
GAMs are useful generalizations of the basic regression models. Their basic building
blocks are the splines used to model the relationships. Like GLMs, they allow different
link functions and distributions which are appropriate for a large amount of data
collected in psychology and other sciences. Further, the additive components allow an
extremely flexible approach to data modelling. There are several extensions to GAMs not
discussed here, like model selection and regularization techniques, multilevel GAMs,
and different types of estimation. Current software allows many different types of curves
to be fit within GAMs, and as algorithms and software advance, these models should
become more flexible and more popular.
The typical psychology dataset and the typical research questions within psychology
dictate how splines/GAMs might be most useful within our discipline. There are three
ways in which we feel they will become more popular. The first is in an ANCOVA context
when you do not wish to make many assumptions about the relationship between the
covariate, which you are trying to partial out, and the response variable. In these cases
you are not primarily interested in the covariate, but in the other predictors, and you do
not want to assume a linear relationship between the covariate and the response variable.
If you knew more about the relationship between the covariate and the response variable,
then a specific non-linear model might be appropriate, but often you do not have this
knowledge (and it is not of primary interest). In these cases we recommend a fairly
flexible spline, like cubic curves (degree=3) connected at one or two knots (so df=4
or df=5). This could be called a GAMCOVA (Wright, 2008). The second way is where
you want to test if the data are consistent with a linear model, or whether some more
flexible model is better. In many undergraduate courses people are taught to try a quadratic
or a cubic term, and these methods are one alternative, but the flexibility of splines makes
them an attractive option. Again, degree=3 is recommended and usually with just a
single knot (df=4). The examples presented in this chapter illustrate these uses. The third
way that these models are often used is graphically to draw a curve through data. How
much flexibility you have is dependent on how much data you have and how smooth
you want the curve to be. Sometimes the kernel methods described in the box above can
be used.
R concepts
• plotting with the predict function including different levels of other variables.
R functions
• axes=F with box(): to make a box around data without axes;
• poly: to make polynomials of different degrees;
• degree: the option for the degree polynomial;
• predict: for making predicted values from regressions;
• bs: for making B-splines;
• df: the option for the spline complexity;
• knots: the option for number or location of knots;
• gam: for generalized additive models.
Statistical concepts
• polynomial regression: using different degree polynomials;
• piecewise polynomials: adding together polynomials;
• splines: adding them together at knots;
• generalized additive model: models with splines (and other functions).
FURTHER READING
Most of the literature tends to be written for those with much statistical background. Two of the
best of these are:
Hastie, T. & Tibshirani, R. (1990) Generalized additive models. London: Chapman and Hall.
The original GAM book and the basis for the gam software. It is more technical than is
appropriate for most psychologists.
Wood, S. N. (2006) Generalized additive models: An introduction with R. Boca Raton,
FL: Chapman & Hall/CRC. Simon Wood has written an extension of the gam package used
here. It is a good alternative to gam, and is very similar in its use. This is a good book, although
a bit technical for most psychologists. http://www.maths.bath.ac.uk/~sw283/
A very introductory piece is:
Wright, D. B. (2008) A new improved analysis of covariance. Psychologist, 21, 225–226.
8
Multilevel models
Learning outcomes
1. To be able to run multilevel linear and generalized multilevel linear models.
2. To understand why ignoring clustering in variables is wrong.
3. To have some practice aggregating variables.
In almost every undergraduate methods course students are told that their statistical
models assume that the data are independent, but they are not told what to do if the data
are not independent. This is because, until about 25 years ago, there were neither the
algorithms nor the computational power for conducting many of the statistics that can
now be estimated quickly on a desktop computer. Before this, there were corrections that
could be applied to non-independent data, but these were cumbersome, problematic, and
inflexible.
A common situation is where the non-independence is due to the data being
clustered. Multilevel modeling (often called hierarchical modeling) takes into account the
clustering. It allows modeling to be conducted simultaneously at the level of the cluster
and at the level of the individual. The model does make assumptions. For example,
most multilevel models assume that, after taking into account the clustering (and any
other variables in the model), the data are independent. Further, for inference the
researcher usually assumes that the units at one level are a random sample of all those
from within the cluster.
Consider the following example from Barth and colleagues (2004). They looked at
the peer relationships of about 600 pupils by the pupils’ race and gender. The pupils
were sampled from 65 different classrooms. The assumption is that the pupils sampled
were representative of the pupils within these classrooms. For illustration, consider their
fourth grade sample (age around 10 years old). For the traditional single level approach,
a researcher might use a regression of the following form:
Peeri = β0 + β1Genderi + β2Racei + ei
where high Peeri values mean problematic relations, Genderi is a dummy variable with
female = 0 and male = 1, and Racei is a dummy variable with Caucasian = 0 and
non-Caucasian = 1 (in R this would be lm(Peer~Gender+Race)). The standard
regression assumes that the ei are independent. Here they are not because there are likely
to be classroom effects: children within the same classroom are more likely to have
similar Peeri scores than children from different classrooms. In fact, in Barth et al. about
12% of the total variation in Peeri could be attributable to classroom level variation. The
standard errors using this traditional approach are likely to be too small which means
that you will often get a significant p value when you should not. Because in psychology
people worry more about Type 1 errors than Type 2 errors (for better or worse), this bias
causes much concern among editors, reviewers, and supervisors, who now often require
authors to use multilevel modeling.1
For multilevel modeling, let the intercepts vary randomly for each classroom. In
notation, let the intercepts be β0j = β0 + uj, where the uj are independent and
normally distributed across the classrooms. uj is a residual or error term but at the cluster
level. The subscript j is for the 65 classrooms. Barth et al. (2004) estimated the following
model:
Peerij = β0j + β1Genderij + β2Raceij + eij
They found a significant gender effect (t(1325) = 4.42) and a significant effect for race
(t(1325) = 5.80) with males and non-Caucasian participants having poorer peer relations
after taking into account each other.
Multilevel modeling can be extended in many ways. First, while it is true that the
pupils in Barth et al.’s study were nested within classrooms, the classrooms were nested
within schools. In their paper, they looked at this using a 3 level model. In addition,
variables that are about the classroom (like characteristics of the teacher) and school
(like the neighborhood’s affluence) could be included, and in their paper some of these
were included. The random part of the model can also be made to incorporate more
aspects of the data. It is common to see if the slopes also vary among the classrooms.
For example, to allow the gender effect to vary among classrooms, let β1j = β1 + vj, where
the vj represent the spread around the central gender effect, β1.
A frequently asked question about multilevel modeling is what types of data can be
used with it. The textbook example is of pupils sampled within classrooms. This seems
to satisfy the criterion that we can imagine pupils as a random sample of all those in the
classroom. However, multilevel models are now often used for any hierarchical data set
regardless of whether the lower level units can be easily thought of as a random sample of
some population. Sometimes researchers apply multilevel modeling to a hierarchical data
set without thinking whether the data meet all the assumptions of multilevel modeling.
There seems to be a belief that using a modern mathematically complex statistic can
rectify any methodological difficulty.
Two examples are used in this chapter. The first has the traditional structure, with
pupils nested within classrooms. We examine the exercise of children nested within classrooms
across several conditions (Hill et al., 2007). We use a multilevel equivalent of ANOVA,
and relate this to Lord’s Paradox and mediation effects (see Chapter 4). The second
example has measurements nested within person. This is the most common example
within medicine and is increasingly used in psychology. Multilevel modeling is rapidly
becoming the preferred method for repeated measures data. The example is of people’s
1 The p values can be too high too, particularly when multilevel models are used to analyze within subject
designs.
memory for own and other race faces (Wright et al., 2003). The response variable is
binary, whether the participant says they have seen the face before or not. Thus, this is
an example of a generalized linear multilevel model.
The format for this example follows Wright (1998) where four approaches are used
to illustrate the strength of multilevel modeling. First, the clustering is ignored (which
is wrong, but useful for comparison). Second, the analysis is done at the cluster level on
aggregate measures. This approach addresses a different question than analysis of the
lower level, and in most cases falls foul of the ecological fallacy (Robinson, 1950), and
is wrong. The third approach treats the cluster as a fixed effect and covaries it out. This
is a limited and cumbersome approach. The final approach is multilevel modeling.
The purpose of this research was to examine whether providing children with a leaflet
based on the ‘theory of planned behavior’ increases children’s exercise (Hill et al., 2007).
Five hundred and three children from 22 different classrooms were sampled. Because
it would not have been practical to have children in the same classrooms in different
conditions, the 22 classrooms were randomly assigned to 4 different conditions (control,
and 3 with leaflets). Children were asked the following question before and after the
intervention:
On average over the last three weeks, I have exercised energetically for at least
30 minutes ______ times per week.
Here we will concentrate on the post-intervention scores. The original exercise variable
was skewed (0.83). We have not calculated a standard error or confidence intervals for
the skewness because these would assume that the data were not clustered. To lessen the
skew .5 was added to each value and the square root was taken (new skewness = −0.10).
This transformed variable will be used.
The data are read directly from the book’s web page. The file has a numeric
variable, wcond, with the values 1–4. This is changed to a factor and given labels.
"L+Quiz" means participants received the leaflet and a quiz and "L+Plan" means
participants received the leaflet and made an exercise plan. The library nlme (Pinheiro
et al., 2008) is loaded. This library has traditionally been the most used within R to
conduct multilevel modeling, although the package used in the next example is seen as
an extension of it and is likely to become more popular. The nlme library may need to
be installed first.
install.packages("nlme")
library("nlme")
The first model is the ordinary least squares regression, which is equivalent to a
oneway ANOVA, so aov(sqw2~cond) produces the same model. The intercept is 1.65,
which is the predicted value for the control group. All the other coefficients are positive,
which shows that the means are all higher for those given a leaflet (alone, with a quiz,
or with a plan). Everything looks significant (or almost significant), but the standard
errors and p values are wrong. Imagine if there happened to be one really sporty class
because of a particular teacher. Whichever condition that class was in might have 20 high
exercise people.
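The model-fitting command producing the output below is not shown; it would be something like (the name model1 is our guess, following the numbering used later in the chapter):
model1 <- lm(sqw2 ~ cond)
summary(model1)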
Call:
lm(formula = sqw2 ~ cond)
Residuals:
Min 1Q Median 3Q Max
-1.19166 -0.31763 0.03136 0.33913 1.45818
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 1.64631 0.04925 33.425 < 2e-16 ***
condLeaftet 0.19316 0.06979 2.768 0.005857 **
condL+quiz 0.13588 0.06926 1.962 0.050318 .
condL+plan 0.25246 0.07127 3.542 0.000434 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Figure 8.1 is made in two parts. The first is here, the basic boxplot. The default for plot
when the first variable is a factor is to draw a boxplot. We used expression(sqrt
(exercise + .5)) to put the mathematical formula along the y axis. The function
expression allows this to be done, and there are a variety of functions that can be
printed (type demo(plotmath) and see Murrell, 2006: 97). The second part of the
graph is the dots for each classroom. These are made in a couple of pages.
plot(cond,sqw2,xlab="Conditions",ylab=expression
(sqrt(exercise + .5)))
The second approach to these data involves calculating aggregate variables for
the classrooms. The aggregate function is used. Its first argument is the variable
to aggregate (sqw1), the second a list of the variable to aggregate by (class),
and finally the function used to summarize the group (here, the mean). The mean for
exercise (mexer) is calculated for each class, and the same procedure is used to get a
variable of the same length corresponding to the condition (mean is used, but all values
within a class are the same). The [,2] tells R to store the mean of these in the variables
mexer and mcond. If [,1] were used it would have stored the classroom number.

Figure 8.1 Boxplots of exercise for the different conditions in Hill et al. (2007).
The large dots are the means for the 22 classrooms
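The aggregation commands themselves are not shown in the extract; a sketch of what they might look like (class and wcond are the variable names used elsewhere in the chapter) is:
mexer <- aggregate(sqw1, list(class), mean)[,2]
mcond <- aggregate(wcond, list(class), mean)[,2]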
The linear regression (model2) looks at the means for the classrooms. Although
R2 = .22 is fairly big, the effect is non-significant because the sample size is small
(the sample size is now the number of classrooms). Further, the effect size does
not correspond to an effect for how the intervention increases a child’s exercise. Making
this conclusion would be the ecological fallacy (Robinson, 1950). It is about classrooms
(and if the class sizes were a lot bigger the correlation would likely increase, what in
psychology is known as the Spearman-Brown prophecy).
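The regression command itself is not shown; from the output it would be:
model2 <- lm(mexer ~ factor(mcond))
summary(model2)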
Call:
lm(formula = mexer ~ factor(mcond))
Residuals:
Min 1Q Median 3Q Max
-0.335601 -0.118965 -0.006939 0.128987 0.313912
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 1.64890 0.07778 21.201 3.51e-14 ***
factor(mcond)2 0.18441 0.11536 1.599 0.1273
factor(mcond)3 0.13449 0.10999 1.223 0.2372
factor(mcond)4 0.24774 0.11536 2.148 0.0456 *
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Now that the aggregate variables have been created the points can be added to Figure 8.1,
using the points command. The points are made fairly big by cex=1.3.
points(mcond,mexer,pch=19,cex=1.3)
The third method is to treat class as a covariate, partialling out its effects like
you would with an ANCOVA. For some multilevel datasets this is an acceptable
alternative, providing the number of classrooms is not large. Here it does not work
because each classroom is assigned to a condition, so classroom and condition are
confounded. This is why the coefficients for cond are not estimated below. This
can happen in other situations also, particularly when there are lots of clusters and
few points per cluster. Wright (1998) describes how this can be a valid approach
to modeling, but that it requires estimating lots of things (and therefore can overfit
the model) and from a philosophical standpoint it is difficult to make inference
about population values. Both of these disadvantages are addressed by the multilevel
approach.
Call:
lm(formula = sqw2 ~ factor(class) + cond)
Residuals:
Min 1Q Median 3Q Max
-1.50345 -0.33972 0.04162 0.33896 1.59050
The fourth and final approach is the multilevel approach, and it can be done with
the lme function which is part of the nlme package. lme stands for linear mixed
effect and nlme stands for non-linear mixed effect. The lme function works like the
lm function, except that you need to tell it what part of the model is random and
what the cluster name is. Here, random = ~1 means make the intercept random.
The |class tells the computer that the children are nested within classrooms. We
have set method="ML". ML stands for maximum likelihood. The default is REML,
which stands for restricted maximum likelihood. There are disagreements about which
of these methods is preferred, but the ML method has the advantage that the change in
log(likelihood) between models can be compared in a similar manner to the sum
of squares in standard ANOVA models. For this reason we keep with method="ML"
throughout this section.
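A sketch of the two models described next (the calls are reconstructed from the description above, so treat them as illustrative rather than the exact code):

# Baseline model: random intercept for class, no predictors.
model4 <- lme(sqw2 ~ 1, random = ~1|class, method = "ML")
# Add the effect of condition and compare the two models.
model4a <- lme(sqw2 ~ cond, random = ~1|class, method = "ML")
anova(model4, model4a)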
Here, model4 is the baseline model and model4a includes the effect of condition.
They are compared, and we see that the difference is non-significant, χ²(3) = 5.50,
p = .14; the three degrees of freedom correspond to the three condition coefficients.
Despite the difference between the models being non-significant, it is still worth
examining the coefficient estimates to see if there are any patterns.
summary(model4a)
Random effects:
Formula: ~1 | class
(Intercept) Residual
StdDev: 0.1310681 0.5392071
Because cond is a factor, R will have used the default contrasts. These
compare the first category with each of the other categories. Therefore, the mean for the
control group is 1.65 and each of the three coefficients shows the difference between the
control group and each of the experimental conditions. In Chapter 3 there was discussion
about contrasts. Hill et al. (2007) had specific a priori contrasts in which they were
interested. These were whether the control group differs from all the leaflet groups;
whether the leaflet only group differs from the other two; and whether the leaflet + quiz
group differs from the leaflet + plan condition. R has several built-in contrasts but it
is often easier to write these in yourself. The following tells R to use the contrasts just
described whenever the variable cond is used.
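One way of writing these contrasts (a sketch; the exact numeric coding used for the published analyses is an assumption):

# Columns: control vs. the three leaflet groups; leaflet-only vs. the two
# other leaflet groups; leaflet+quiz vs. leaflet+plan.
contrasts(cond) <- cbind(c(-3, 1, 1, 1),
                         c( 0,-2, 1, 1),
                         c( 0, 0,-1, 1))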
If the model is re-run with these contrasts, the statistics for the overall fit of the model
like AIC and BIC are exactly the same, but the individual coefficients measure different
contrasts. We can now see that the control group differs from the leaflet conditions,
t(18) = 2.27, p = .04. The df for the t test is 18 because there are 22 classrooms and 4
coefficients estimated.
Random effects:
Formula: ~1 | class
(Intercept) Residual
StdDev: 0.1310681 0.5392071
Recall from Chapter 4 (on ANCOVA) the discussion of using change scores versus a
time 1 score as a covariate. Here, we can increase the power of the comparison by using
the time 1 exercise score (also transformed by the square root of exercise plus .5) in the
model, and then add condition. This is a multilevel ANCOVA. The result is statistically
significant, χ²(3) = 16.67, p < .001. As mentioned in Chapter 4 you should always
look at the interaction between the covariate and any factor. model5b does this, and we
see that adding the interaction does not increase the fit significantly, χ²(3) = 7.33, p = .06,
although the AIC goes down (but the BIC goes up). Given the discussion in Chapter 7 you
might consider whether the relationship between the exercise variables is linear. The
bs function can also be used (for example, bs(sqw1,df=4)), but the model does
not significantly improve. For more complex multilevel GAMs, the gamm function in
Wood’s (2006, section 6.7) mgcv package can be used.
Figure 8.2 shows a scatterplot of the two transformed exercise variables with each
other. We used the jitter function because otherwise there would have been several
dots on the same coordinates, and it would not have been possible to tell how many
were at each coordinate. The jitter function adds a small random error (i.e., a jitter)
to each value so that the dots are not on top of one another.

Figure 8.2 The result from a multilevel model of exercise at time 2 using exercise at time 1 as a covariate. This is based on model5b where there is an interaction between condition and the covariate

We have also let the
color and symbol type be determined by the condition number (we used the original
condition variable, wcond, because it is numeric while cond is a factor). We
have added lines for the predicted values for each condition. The par function at
the start slightly changes the margins so that the top part of the square root symbol
can be seen on the y axis label and the par function at the end returns them to their
default.
par(mar=c(5,5,4,2))
plot(jitter(sqw1), jitter(sqw2),
     xlab=expression(sqrt(pre-exercise + .5)),
     ylab=expression(sqrt(post-exercise + .5)),
     pch=wcond, col=wcond)
legend(3,1.3,c(levels(cond)),pch=1:4,col=1:4)
sexer1 <- split(sqw1,cond)
spred <- split(model5b$fitted[,1],cond)
for (i in 1:4) lines(sexer1[[i]],spred[[i]],col=i)
par(mar=c(5,4,4,2))
If you thought making that graph was a lot of work, a simpler graph, that is useful in
multilevel modeling, can be made with the lattice package. This makes trellis graphs
which are very popular among statisticians. Figure 8.3 shows the default xyplot; a
lot more can be added to it to make it more useful. We only touch upon the graphic
capabilities of R, see Murrell (2006) for more details.
library(lattice)
xyplot(sqw2~sqw1 | class)
The confidence intervals of the different estimates can be found with the intervals
function. Let’s return to the model without using the initial exercise as a covariate.
As shown below, the interval for the first contrast, between the control conditions and
the others, does not overlap with zero and therefore it is statistically significant.
intervals(model5a)
Fixed effects:
lower est. upper
(Intercept) 0.49439856 0.588418074 0.68243759
sqw1 0.66359295 0.715412254 0.76723156
cond1 0.02713989 0.048667999 0.07019611
cond2 -0.01296502 0.018097991 0.04916100
cond3 -0.04820522 0.005679484 0.05956419
attr(,"label")
[1] "Fixed effects:"
Random Effects:
Level: class
lower est. upper
sd((Intercept)) 0.01269164 0.04170216 0.1370249
Figure 8.3 Individual scatterplots for each of the classes comparing post-intervention
exercise with pre-intervention exercise
[1] 0.5842596
varprop(model50,model5a)
[1] 0.6088032
varprop(model50,model5b)
[1] 0.6150124
The typical memory recognition study involves showing a set of stimuli and then at
a later point asking participants whether they recognize several objects as previously
shown. This is usually done by testing people with the originally shown items plus a
set of filler items not previously seen. This is called an old/new memory recognition
procedure and ten years ago the norm would have been to use signal detection
theory (SDT) to separate participants' ability to discriminate old from new faces
(essentially, accuracy) from their bias to say 'old' (Banks, 1970). Because the standard SDT
approach is a form of generalized linear model for each individual (DeCarlo, 1998),
it seems natural to analyze these data with a multilevel generalized linear model with
individual trial nested within participant. Generalized linear multilevel models are
becoming more common in the psychology literature (see Hoffman & Rovine, 2007).
When analyzing memory recognition data it is tempting to model a response
as being correct or incorrect. However, in practice it is often better to model the
actual response (‘old’ or ‘new’) and use the parameter that denotes whether the
object is old or not to estimate accuracy. To see if a variable is associated with
accuracy you should test if the interaction between this variable and whether the
object is old improves the model. Here, for example, it is expected that the White
sample will be more accurate with White faces, and therefore the prediction is
an interaction between the race of the face and whether it was previously seen.
Sometimes it is convenient to report accuracy, and this is shown in Figure 8.5.
These data are from the White English participants in Wright et al. (2003). The data
are accessed from the book’s web page.
Response time measures are usually positively skewed, so this was checked (Figure 8.4).
We have used the paste function in the code below so that the skewness values could
be plotted directly onto the graphs. Of course we could have just written the number, but
this method allows us to re-use the code for other problems (or if we changed one data
point) and it avoids transcription errors.
par(mfrow=c(2,1))
hist(time,xlab="Response time (in msec)",
main="Untransformed variable")
library(e1071)
text(6000,850,paste("skewness =",
format(skewness(time),digits=2)),pos=4)
lntime <- log(time)
hist(lntime,xlab="ln(response time in msec)",
main="Transformed variable")
text(8.5,400,paste("skewness =",
format(skewness(lntime),digits=2)),pos=4)
par(mfrow=c(1,1))
The function lmer works in a similar way to the lme function, but can also
be used for generalized linear multilevel models. There are two main differences.
First, random effects are shown within the formula so (1|partno) tells R that
the intercept is random and that the level 2 indicator is the variable partno.
Thus, lmer(sqw2~1+(1|class), method="ML") will estimate the same as
model4 from the last example. This allows models with multiple random variables and
non-hierarchical models to be estimated. The second difference is that you are allowed
to state the family as with the glm command.

Figure 8.4 Histograms for the response time (in msec) in the top panel and the natural logarithm of response time in the bottom panel. No standard errors or confidence intervals are calculated for skewness because the standard approaches are not appropriate for multilevel data

If it is not stated then normal error with the
identity link function is assumed. Here binomial is used so the computer assumes it
is a logistic regression with binomial error. As with the glm command, there are several
options for this.
With the linear multilevel models you have the choice between REML and ML, and we
have tended to use ML (following Pinheiro & Bates, 2000). For the generalized form,
while maximum likelihood is used it has to be approximated, and there are three methods
listed in the lme4 documentation that can do this: PQL (for penalized quasi-likelihood),
Laplace, and AGQ (for adaptive Gaussian quadrature). Bates and Sarkar (2007) say that
AGQ is the most accurate but the slowest and at the time of writing it was not available in
this package. Laplace is the next most accurate, and following Bates’ recommendation
is used here.2
This first model uses just whether the face has previously been seen to predict
participants’ responses. It should be large and significant because we would hope that
participants are performing above chance.
install.packages("lme4")
library(lme4)
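A sketch of this first model (the formula is taken from the anova listing printed later; current versions of lme4 fit generalized models with glmer, whereas the book's original code used lmer with a family argument):

# Random intercept for each participant; logistic (binomial) model.
model1 <- glmer(saysold ~ faceold + (1|partno), family = binomial)
summary(model1)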
2 As this book went into print, Laplace became the only method for this procedure. They removed the method=
option, so if you run the code as is you get a warning. We have not altered the text here because the option to
do adaptive Gaussian quadrature is likely to be included in future implementations of lme4.
The update function is used below. The . before the ~ means to keep the response
variable the same. The . after the ~ means keep the model the same but you can add
(with a +) or remove (with a -) any variables. The var1:var2 means the interaction
of these. You can also update other aspects of the model. The following is a sequence of
models which add, one at a time: the race of the face, the interaction between the race of
the face and whether it is old, a transformed response time, the time by old interaction, the
time by race of face interaction, and finally the three-way interaction. The significance
of each step can be evaluated with a single anova command.
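A sketch of that sequence (the model names match the listing that follows; the exact calls are assumptions):

model2 <- update(model1, . ~ . + facewhite)                  # race of face
model3 <- update(model2, . ~ . + faceold:facewhite)          # race x old
model4 <- update(model3, . ~ . + lntime)                     # log response time
model5 <- update(model4, . ~ . + faceold:lntime)             # time x old
model6 <- update(model5, . ~ . + facewhite:lntime)           # time x race
model7 <- update(model6, . ~ . + faceold:facewhite:lntime)   # three-way
anova(model1, model2, model3, model4, model5, model6, model7)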
Data:
Models:
model1: saysold ~ faceold + (1 | partno)
model2: saysold ~ faceold + (1 | partno) + facewhite
model3: saysold ~ faceold + (1 | partno) + facewhite +
faceold:facewhite
model4: saysold ~ faceold + (1 | partno) + facewhite + lntime +
faceold:facewhite
model5: saysold ~ faceold + (1 | partno) + facewhite + lntime +
faceold:facewhite +
model6: faceold:lntime
model7: saysold ~ faceold + (1 | partno) + facewhite + lntime +
faceold:facewhite +
model1: faceold:lntime + facewhite:lntime
model2: saysold ~ faceold + (1 | partno) + facewhite + lntime +
faceold:facewhite +
model3: faceold:lntime + facewhite:lntime +
faceold:facewhite:lntime
The spacing on these gets kind of messed up, but it is fairly easy to figure out what the
models are:3 model6 should have all three two-way interactions and model7 should add the three-way interaction.
3 If you have been playing around with other aspects of R, you may be interested that the command
options(width=150) does not help, but this may work for future implementations.
There are lots of different ways to decide on which model looks best. Usually three-
way interactions are difficult to explain, and given that the BIC is higher for model7
than for model6, and higher for model6 than for model5, it seems best to treat model5 as
the most useful.
This model is examined by typing its name:
model5
Random effects:
Groups Name Variance Std.Dev.
partno (Intercept) 0.11544 0.33977
Fixed effects:
Estimate Std. Error z value Pr(>|z|)
(Intercept) -5.8387 1.1447 -5.100 3.39e-07 ***
faceold 10.7439 1.4355 7.485 7.18e-14 ***
facewhite -1.0359 0.1331 -7.780 7.24e-15 ***
lntime 0.6486 0.1454 4.460 8.18e-06 ***
faceold:facewhite 0.9677 0.1742 5.555 2.78e-08 ***
Under Random effects, the variance and standard deviation associated with
partno is listed (.115 and .340, respectively). This is the variability around the intercept.
So, assuming normality, about 95% of the participants' intercepts should be between about −6.5 and
−5.1 (i.e., −5.8 ± 2 × .34).
Psychologists usually focus on the fixed effects, and begin with the
interactions. The significant faceold:facewhite interaction means that there is an own-race bias. These White
participants were more accurate (meaning whether the item was seen was more predictive
of whether they said seen) with White faces than with Black faces. The significance of
the faceold:lntime effect means that response time also predicted accuracy. That
it is negative means longer times were associated with more errors.
Figure 8.5 was made to look at the relationship between response time and accuracy.
Figure 8.5 The probability of a correct response for different response times, based on model 5. Controlling for response time, previously seen White faces are those most accurately recognized

First the predicted probabilities from the estimates of the fixed effects from above
are calculated. The predicted probability is made by taking any value x and finding
e^x/(1 + e^x).
We decided to show the probability of a correct response rather than an old response, so
a variable rightprob is created. faceold is 0 if new and 1 if old, so this flips the
probabilities for the new faces around.
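A sketch of how these values might be computed from the fixed effects of model5 (the helper objects xb and oldprob are ours, not the book's):

cf <- fixef(model5)
xb <- cf["(Intercept)"] + cf["faceold"]*faceold + cf["facewhite"]*facewhite +
      cf["lntime"]*lntime + cf["faceold:facewhite"]*faceold*facewhite +
      cf["faceold:lntime"]*faceold*lntime
oldprob <- exp(xb)/(1 + exp(xb))                            # probability of saying 'old'
rightprob <- faceold*oldprob + (1 - faceold)*(1 - oldprob)  # probability correct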
The following code makes the graph. For the lines command it was necessary to sort
the data for stime and to place the values for srightprob in this order.
par(mfrow=c(1,1))
plot(time, rightprob, pch=20, xlab="Time in msec",
     ylab="Probability of a correct response", cex.lab=1.3)
stime <- split(time,facewhite+2*faceold)
srightprob <- split(rightprob,facewhite+2*faceold)
for (i in 1:4)
lines(sort(stime[[i]]),srightprob[[i]][order(stime[[i]])],col=i,
lwd=1.5)
text(12000,.8,"previously unseen White faces",cex=1.3)
Figure 8.5 is a fairly interesting graph. It shows for all four conditions (new and old
faces, White and Black faces) that as response time increased the probability of a correct
response decreased. Also, the line for White old faces stands out. Controlling for response
time, these are the most accurate.
The next model allows the level of accuracy (the coefficient for the faceold
variable) to vary by participant. The update function is used. First you remove
the random intercept (-(1|partno)) and then you add the random variable for
faceold (+ (faceold|partno)), which also includes the intercept (this can
be done in other ways, too). The anova function shows that this model fits better,
χ²(2) = 23.77, p < .001. The 2 degrees of freedom are for the variance in accuracy and
the correlation between this variability and the intercept variability. The fixed effects
remain approximately the same.
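A sketch of this step (the exact call is an assumption):

model5b <- update(model5, . ~ . - (1|partno) + (faceold|partno))
anova(model5, model5b)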
Data:
Models:
model5: saysold ~ faceold + (1 | partno) + facewhite + lntime +
faceold:facewhite +
model5b: faceold:lntime
model5b
Fixed effects:
Estimate Std. Error z value Pr(>|z|)
(Intercept) -7.5122 1.2431 -6.043 1.51e-09 ***
faceold 13.3075 1.5900 8.370 < 2e-16 ***
facewhite -1.0640 0.1355 -7.850 4.16e-15 ***
lntime 0.8581 0.1578 5.439 5.37e-08 ***
faceold:facewhite 0.9930 0.1761 5.639 1.71e-08 ***
faceold:lntime -1.5012 0.2024 -7.416 1.21e-13 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
The shape of the curves in Figure 8.5 is dependent on the log transformation used,
the response time variable, and the logistic model used in the regression. Without these
the actual model is a straight line. It is unlikely that the relationship is this simple.
Figure 8.6 Generalized additive multilevel models for the relationship between
response time and the probability of responding ‘old’. The assumption in previous
models was that these would be linear (after accounting for the link function and the
initial transformation of the data), which appears true for three of the four conditions
install.packages("mgcv")
library(mgcv)
par(mfrow=c(2,2))
ssaysold <- split(saysold,facewhite+2*faceold)
slntime <- split(lntime,facewhite+2*faceold)
spartno <- split(partno,facewhite+2*faceold)
for (i in 1:4) {
part <- spartno[[i]]
time <- slntime[[i]]
sold <- ssaysold[[i]]
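# The remainder of this loop is not reproduced above. A sketch of what it
# likely contained, assuming a gamm() fit from mgcv with a random intercept
# for each participant and a smooth of log response time, as in Figure 8.6:
  part <- factor(part)
  condgam <- gamm(sold ~ s(time), random = list(part = ~1),
                  family = binomial)
  plot(condgam$gam)
}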
SUMMARY
Multilevel modeling is one of the hot methods now in lots of areas of science, including
psychology. While the traditional example has been with people nested within larger
clusters (so, pupils nested within classrooms), because of the great amount of medical
research with multiple measurements per person, multilevel models with the person as
the higher order level are now common (perhaps more common). Harvey Goldstein, one
of the pioneers of this approach, talks about how there are hierarchies everywhere you
look. Multilevel modeling is now one of the tools expected for social and psychological
scientists.
We will end with a caveat. While multilevel models are now expected to be used
in areas where the hierarchical structure is obvious, more research is necessary to
see how useful they are when the levels are not such clean structures and where the
components at different levels cannot be viewed as some random sample of those at
that level. This was essentially Jacob Cohen’s (1976) criticism of Herb Clark’s (1973)
'language-as-fixed-effect fallacy'. Perhaps some of the resampling techniques (and local
causal inference) will be applied to these situations. As with all statistical procedures, it
is critical to examine the data carefully and consider the alternatives before running any
statistical test.
R functions
• aggregate: to calculate group measures (see also tapply);
• lme: linear mixed effect models;
• nlme: non-linear mixed effect models;
• par(mar=...): for changing a graph's margins;
• lattice: a graphics package within R;
• xyplot: for scatterplots of different groups;
• intervals: prints the confidence intervals of lme/nlme models;
• lmer: generalized linear mixed effect models;
• gamm: generalized additive mixed effect models.
Statistical concepts
• multilevel models: models for clustered data;
• SDT: Signal detection theory.
FURTHER READING
http://www.cmm.bristol.ac.uk/ is the multilevel modelling centre’s web page and has a wealth of
information on the topic.
Goldstein, H. (2003). Multilevel statistical methods (3rd edition). London: Edward Arnold. This is
the bible of multilevel modeling, but is a bit heavy on the statistics.
Hox, J. (2002). Multilevel analysis: Techniques and applications. London: Lawrence Erlbaum
Associates. This book covers lots of topics and would be a good book for a one-term graduate
course for psychologists.
Kreft, I. I. & de Leeuw, J. (1998). Introducing multilevel modeling. London: Sage Publications.
This is one of the clearer introductions to multilevel modeling, focusing on linear models. This
is an excellent book and is more introductory than any of the others.
Singer, J. D. & Willett, J. B. (2003). Applied longitudinal data analysis: Modeling change and
event occurrence. New York: Oxford University Press. This is a book that covers longitudinal
methods. The first half focuses on multilevel models where the individual testing session is
nested within the person. The writing is really clear. They have a book on multilevel modeling
currently in preparation.
Sullivan, L. M., Dukes, K. A. & Losina, E. (1999). An introduction to hierarchical linear
modelling. Statistics in Medicine, 18, 855–888. This review is aimed more towards medical
researchers.
Robust regression
Learning outcomes
1. That the standard approaches to statistical inference are highly influenced by outliers
and lack power under most empirical conditions.
2. That there are several alternative procedures including:
• rank based procedures;
• eliminating outliers;
• M-estimates.
In 1805 Adrien Marie Legendre introduced the idea of minimizing the square of the
residuals: ‘it consists of making the sum of the squares of the errors a minimum … it
prevents the extremes from dominating’ (translation from Stigler, 1986: 13, original
French manuscript printed on p. 58).1 The ease of computing least squares, its
conceptual appeal, and the fact that least squares estimation is well suited for a very
particular (and rare) set of situations mean that this approach has dominated statistics.
Every time we calculate a mean, t test, ANOVA, etc., we are minimizing the sum
of the squared residuals. The least squares approach is one of several possible loss
functions.
The second part of the quote from Legendre deserves further scrutiny. Squaring
a residual means the impact of large residuals will be greater than if, for example,
the absolute value was taken (an approach that actually pre-dated Legendre, but is
computationally difficult so was not widely used until recently). Small residuals have
little impact on least squares, but as their value increases the impact becomes very large.
Large residuals are not as influential for minimizing the sum of absolute values. Another
method is to trim data beyond a certain value. For example, the 20% trimmed mean is the
1 Legendre and 1805 are generally given as the person and date for the introduction of least squares, although
Gauss probably was using it since 1795. Soon after 1805, Gauss did publish a much extended formulation
of least squares. Stigler (1999: 331) concluded that while Gauss may have discovered least squares, it was
Legendre ‘who first put the method within the reach of the common man.’
mean of values excluding the extreme 20% from each end of the scale.2 The R function
mean allows trimming as an option, so mean(xvar,.2) produces the 20% trimmed
mean (and mean(xvar,.5) produces the same value as the median). All three of these
loss functions (least squares, least absolute values, and trimming) can be applied to
estimating any quantitative parameter.
The second part of Legendre’s quotation is wrong; extremes can dominate with least
squares estimation. Robust alternatives lessen the impact of these extremes. Consider
the following three datasets:
1. Set 1: 1, 2, 3, 4, 5, 6, 7, 8, 9, 10
• Mean = 5.5, 95% CI = (3.33, 7.67),
• Standard deviation = 3.03,
• t (9) = 5.75, p < .001.
2. Set 2: 1, 2, 3, 4, 5, 6, 7, 8, 9, 100
• Mean = 14.5, 95% CI = (−7.07, 36.07),
• Standard deviation = 9.54,
• t (9) = 1.52, p = .16.
3. Set 3: 1, 2, 3, 4, 5, 6, 7, 8, 9, 1000
• Mean = 104.5, 95% CI = (−120.59, 329.60),
• Standard deviation = 314.66,
• t (9) = 1.05, p = .32.
As the most extreme point gets larger and larger, the mean goes from 5.5 to 14.5 to 104.5.
This compares with minimizing the sum of least absolute values (which produces the
median for univariate analyses) and the 20% trimmed mean, which both remain at 5.5.
The outlier has less impact with these robust estimators. Depending on your research
question, you may or may not want a single point to have this much impact.
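These figures can be checked in a few lines of R (a sketch; the object name x3 is ours):

x3 <- c(1, 2, 3, 4, 5, 6, 7, 8, 9, 1000)
mean(x3)             # 104.5
mean(x3, trim = .2)  # 5.5, the 20% trimmed mean
median(x3)           # 5.5
t.test(x3)           # t(9) = 1.05, p = .32, 95% CI (-120.59, 329.60)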
How does the outlier impact on hypothesis testing and confidence interval estimation?
Suppose we wanted to test the hypothesis that the data above come from a population
with a mean of 0. A common approach is to calculate a t test, t = x̄√n/sd, and
the corresponding 95% confidence interval, x̄ ± tcrit · sd/√n. Because the outlier
affects the standard deviation even more than the mean (see above), as the outlier
moves away from the null hypothesis it actually makes the statistic less significant.
It makes the estimate much less precise, as is reflected in the confidence intervals.
While this may seem paradoxical, it is not a new discovery (see Fisher, 1925: 112, for
a similar example). It is just that only recently have robust procedures become widely
available.
A topical example showing the effect of removing an outlier was published the week
before the 2004 US presidential election and concerned the number of civilian deaths
in Iraq since the younger George Bush’s war there (Roberts et al., 2004; a more recent
survey is Burnham et al. (2006) where they do not discuss the problem of this outlier).
The most discussed statistic they report is an estimate that 98,000 more civilians died
during the post-invasion occupation than expected, though the 95% confidence interval
2 If you are calculating the trimmed mean, do not simply exclude the extremes and conduct analyses as if
the data were not trimmed. The equations to estimate the standard error (and therefore p values) are different
(Wilcox, 2003a).
was large, from 8,000 to 194,000. This was based on a pre-invasion mortality rate
of 5.0 (per 1,000 per year) with an interval from 3.7 to 6.3. They estimated the post-
invasion mortality rate to be 12.3 with an interval from 1.4 to 23.2, but did not use this
estimate to reach their conclusions. As the interval includes the estimated pre-invasion
mortality rate, it does not provide strong evidence for an increase. Instead they removed
the data of Falluja, where the fighting was most intense. This lowered the estimated
mortality rate to 7.9 but made the interval smaller, from 5.6 to 10.2, thereby providing
better evidence for an increased mortality rate. There is an important political point
raised by this paper, why were the occupying forces not doing more to keep track of the
number of civilian deaths? The US General Tommy Franks reportedly said (Roberts et al.,
2004: 1863): ‘we don’t do body counts,’ apparently as a snub of Geneva Convention
guidelines for an occupying force. The statistical point, that eliminating an outlier can
make the confidence intervals much smaller, raises an ethical concern. Given that the
researchers knew Falluja was going to be an outlier and therefore that they were likely
to exclude it from many of their analyses, and that Falluja was a very (very) dangerous
place for their researchers to be operating, should they have been trying to gather data
from there?
The Iraqi data and the three data sets above are clearly not normally distributed. Many
people believe that: a) psychology data sets are usually normally distributed and b) if
they were not, we would notice any discrepancy large enough to matter. Micceri (1989)
surveyed a large number of psychology data sets. He found none approximated the
normal distribution and most were very un-normal. But would you be able to notice if
a distribution was un-normal enough to matter?

Figure 9.1 A normal distribution in gray and a mixture of two normal distributions (90% with a sd = 1, and 10% with an sd = 10) in black. The tails of these distributions are shown in the upper right hand corner, where the areas are marked as approximately 2.5% under the normal curve and 5.85% under the mixture

The main part of Figure 9.1 shows two
curves. One is normally distributed; one is not. The mixture curve (which is 90% a normal
curve with a standard deviation of 1 and 10% a normal curve with a standard deviation
of 10) looks very similar to the standard normal curve. If you observed data that looked
like this mixture distribution, you would be likely to assume that the normal distribution
assumption had been met. Tukey (1960) showed that these distributions differed in a very
important way. The interest, particularly for null hypothesis significance testing (which
is dominant in psychology), is often in the tails of the distribution. The upper-right hand
corner of Figure 9.1 shows the tails of these distributions. The mixture distribution has
a lot more area under the curve beyond z = 1.96.
Because of this extra area in the tails, and because outliers can affect measures of
precision more than measures of location, the standard least squares procedure is often less able to
identify differences. This led Rand Wilcox to ask the following question as the title of
an American Psychologist paper: ‘How many discoveries have been lost by ignoring
modern statistical methods?’ (1998). In most situations, least squares procedures are
less powerful than methods which are less influenced by outliers, despite what is said
in many textbooks. This means that often people using traditional methods will be
missing significant effects. Wilcox poignantly makes this point in another provocative
title: ‘ANOVA: A paradigm for low power and misleading measures of effect size?’
(1995).
We have mentioned least absolute values and trimmed estimates. Trimmed statistics
are fairly popular. They are conceptually simple and have good properties. Much of the
discussion in Wilcox (2003a) is about trimmed estimates. Statisticians have come up
with other robust loss functions. The most popular of these are M-estimators (there are
also R-, L-, S-, and W-estimators; see Maronna et al., 2006; Wilcox, 2003a, for details).
The largest collection of robust procedures has been written for R/S-Plus (http://cran.
r-project.org/ and http://lib.stat.cmu.edu/S/). The procedures that come with the package
allow robust GLMs (and therefore linear regressions too), ANOVAs, correlations,
principal component analysis, etc. Other packages like SPSS and SYSTAT also include
some M-estimators in some of their procedures.
Robust procedures increase the likelihood of finding a significant result; they are
endorsed by the APA task force on statistics (Wilkinson et al., 1999), and are becoming
more common in the main packages psychologists use. They will become more
popular.
We cover three robust methods. The first is already used by many psychologists:
Spearman’s ρ (rho) correlation which is based on ranked data. The example used to
illustrate this procedure is about crime in neighborhoods in Sussex (UK). We discuss
some problems with this measure, and then go through an example concerning children’s
well-being in several wealthy nations. We use the skipped correlation, which we like for
its conceptual simplicity. It also allows us to direct readers to a very useful collection of
functions on Wilcox’s web page. The final example uses the best made-up data set ever
(Anscombe, 1973) and illustrates M-estimator robust regression.
The crime statistics in Sussex, England, for 2005–6 were recently published in the
local Brighton paper and broken down by neighborhood and type of offence. They are
available on the book’s web page:
We order the data set by theft (in the 9th column) which will be useful for some of the
graphs. This command re-orders the entire data set (columns 1:10) as is done when
ordering a data set with many of the main statistical packages.
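A sketch of that step (the data frame name crime and the use of attach are assumptions; the text states that theft is the 9th column):

crime <- crime[order(crime[,9]), 1:10]   # re-order all 10 columns by theft
attach(crime)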
The left hand panel of Figure 9.2 shows for the different neighborhoods the number of
thefts and drug offences. Clearly there is a positive relationship between the two types
of offences, but one point stands out: ‘Regency’, which corresponds to the center of
Brighton, where people drink, get high, and thieve. Regency is an outlier for both crimes
because it has far more of both types. Similar graphs are made for the data when ranked
and also when logged.
par(mfrow=c(1,3))
plot(theft, Drugs, xlab="Theft offences",ylab="Drug
offences",pch=19,main="Scatterplot of raw data",
cex.lab=1.3)
text(2900,300,"Regency",pos=3,cex=1.3)
plot(rank(theft), rank(Drugs) ,xlim=c(0,300),
ylim=c(0,300), xlab="Rank of theft offences",
ylab="Rank of drug offences",pch=19,
main="Scatterplot of ranked data",cex.lab=1.3)
text(180,270,"Regency",pos=3,cex=1.3)
arrows(230,285, 255,265, length=.07)
plot(log(theft+1), log(Drugs+1), xlab="Log of theft
offences + 1",ylab="Log of drug offences + 1",
pch=19,main="Scatterplot of logged data",cex.lab=1.3)
text(6.7,5.4,"Regency",pos=3,cex=1.3)
par(mfrow=c(1,1))
Spearman’s ρ addresses problems with cases like Regency which are univariate
outliers. By univariate outlier, we mean that the case is an outlier just looking at the drug
variable, and it is an outlier just looking at the theft variable. Spearman’s ρ involves
taking the ranks of each of the variables on their own, and then conducting Pearson’s
r on the ranks. Ranking procedures like this are a popular method for analyzing data
in psychology when you do not believe that the data are normally distributed or
that they are interval (Conover & Iman, 1981). Ranking procedures
are popular, due in part to Siegel (1956 and later editions) providing clear ‘how-to’
descriptions of some of these tests. While there have been advances in ranked-based
procedures since the 1950s (Cliff & Keats, 2002), those described in Siegel remain the
most popular in psychology.
The R function cor.test (used also in Chapter 5) calculates both Pearson’s and
Spearman’s correlations:
cor.test(theft,Drugs)
cor.test(theft,Drugs,method="spearman")
Warning message:
Cannot compute exact p-values with ties in: cor.test.
default(theft, Drugs, method = "spearman")
These statistics are: r = .94, p < .001, n = 261, and ρ = .78, p < .001, n = 261.
The value for Spearman’s is much smaller. Much of this is due to Regency (without this
outlier, r drops to .89). Most psychology journals either recommend or require confidence
intervals to be reported. R does not print a confidence interval for Spearman’s ρ. One
possibility is finding a bootstrap estimate of the confidence interval. The boot function
requires the data to be placed into a single object. When doing bootstrap estimates for the
median and skewness in other chapters only a single variable was being analyzed at
a time. Because two variables are required for Spearman’s ρ the variables Drugs and
theft are combined into the object thevars. The BCa estimate for the interval is
between .71 and .84.
library(boot)
bootspear <- function(x,i)
cor.test(x$Drugs[i],x$theft[i],
method="spearman")$estimate
thevars <- as.data.frame(cbind(Drugs,theft))
spearboot <- boot(thevars,bootspear, R=1000)
boot.ci(spearboot)
CALL :
boot.ci(boot.out = spearboot)
Intervals :
Level Normal Basic
95% ( 0.7185, 0.8427 ) ( 0.7212, 0.8483 )
An alternative is simply to rank the variable and find the confidence interval for Pearson’s
r on the ranks. Most methodologists would prefer the bootstrap estimates, but all these
estimates are similar.
cor.test(rank(theft),rank(Drugs))
sample estimates:
cor
0.7786106
Ranking the variables lessens the impact of any univariate outlier. Thus, Pearson’s
r on raw data is r = .94, but it is very influenced by this one data point and also by a
couple of the other HTDs (the sociologists' abbreviation of havens for thieving druggies).
Ranking the data on both of these variables means that all the wholesome areas that are
squished into the lower left-hand corner of the first panel of Figure 9.2 are spread out in
the second panel, and Regency and other HTDs are pulled in. When the correlation is
run on the ranks, which is Spearman's ρ, you get .78. As can be seen in the output above,
R gives a warning that because there are ties exact p values are not computed. The final
command shows that you get the same correlation when directly calculating Pearson’s
r on the ranked data.
There are a couple of difficulties with Spearman’s ρ. The first is the same as with the
other rank-based procedures. Ranking is a particular transformation and when it is
done all meaning about the distances between adjacent points is lost. You would know
that there were more drug offences in Regency than elsewhere, but you would not know
how much more and there is nothing you could do with these ranks to get back to the
original data. The inference becomes about the ranks of data, and this can make it difficult
to describe the results.
Here, a better alternative to lessen the impact of HTDs might be to take the natural
logarithms (ln) of the variables plus 1 (the +1, which is sometimes called a starting
value, prevents negative infinities for the places with 0 of a particular type of offence;
Mosteller & Tukey, 1977: 91). For these transformed values, r = .76 and ρ = .74, and
Regency no longer stands out (see right panel of Figure 9.2). The ln transformation
is a useful transformation for many positively skewed variables (so is the square root
transformation, though neither of these works with negative values unless a starting value
is added to the original variable; see also the Box-Cox family of transformations: Box & Cox, 1964).
Figure 9.3 shows the line through the scatterplot of the logged data (left panel) and the
back-transformed line through the raw data. The regression on the raw data is shown in
gray, so it can be seen how it is more influenced by the outlier than the logged regression.
Figure 9.3 A scatterplot of the logged data with the logged regression line (left panel),
and a scatterplot of the raw data with the back-transformed predicted values from the
logged regression. The regression on the raw data is shown in black
The code for Figure 9.3 is shown below. Note that the transformations are actually done
within the lm and lines functions.
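A sketch of that code (the axis labels and the color used for the raw-data regression are assumptions):

par(mfrow=c(1,2))
# Left panel: logged data with the logged regression line.
logreg <- lm(log(Drugs+1) ~ log(theft+1))
plot(log(theft+1), log(Drugs+1), xlab="LN of theft offences",
     ylab="LN of drug offences", pch=19)
abline(logreg)
# Right panel: raw data, the back-transformed predictions from the logged
# regression (the data were ordered by theft, so lines() draws a curve),
# and the ordinary regression on the raw data for comparison.
plot(theft, Drugs, xlab="Theft offences", ylab="Drug offences", pch=19)
lines(theft, exp(predict(logreg)) - 1)
abline(lm(Drugs ~ theft), col="gray")
par(mfrow=c(1,1))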
The second problem with Spearman’s ρ, and it also exists for the ln transformation
(and for any transformation that looks only at each variable individually), is that they
just examine univariate outliers. Given that the point of regressions is to examine the
relationships among variables, it would be good to have a technique that can look
for bivariate (and multivariate) outliers and lessen their influence. There are several
techniques that can do this, and we opt to present a conceptually simple one called
the skipped correlation coefficient (Wilcox, 2003b). There are actually a variety of
skipped correlations, but we will use the default described by Wilcox. This is a new
procedure, and while there are more computationally advanced methods, this one
works fairly well; there is an R function for it (scor), and it is described in the next
example.
The United Nations Children’s Fund’s (UNICEF, 2007) report card on children of
21 nations ranked the nations on 7 different attributes, including health and risk. These
data are already ranked from 0 to 20, where 0 means good and 20 means bad. So Sweden
is the point down in the lower left-hand corner, and US is the point in the top right-hand
corner of Figure 9.4. The UK is the one with the worst risk score, preventing the US
from being the worst on both – there have been some criticisms of these data.
The data are on the book’s web page and can be accessed in the usual way:
Figure 9.4 The ranks of health behaviors with risk behaviors for 21 wealthy countries
(UNICEF, 2007). The polygon (i.e., the enclosed shape) in the middle was made by
Wilcox’s (2003b) skipped correlation function in R to show where about half of the
data are. The + shows the mean for the two variables once the outliers are
excluded
cor.test(risks,health)
cor.test(risks,health,method="spearman")
Since the data are already ranked, Pearson’s r and Spearman’s ρ are the same (.51),
which means that according to these statistics there is a positive relationship between
the two, and that the effect, in Cohen’s (1988) terms, is ‘large’. This is an example where
‘large’ is a misnomer. As discussed in Chapter 8, when you have data where each data
point is based on lots of people, the effect sizes tend to be much larger than when based
on their individual constituents (Robinson, 1950).
The problem with these estimates is that while there are no univariate outliers (because
the data are already ranked), there may be bivariate outliers. These are points where the
combined values do not fit the pattern of the rest of the data. If the data were the length of
the right foot and the left foot of a group of people, finding a right foot that is 10 inches
long is not striking, nor is finding a left foot that is 6 inches long, but to find these together
on the same person would be surprising. Figure 9.4 shows a scatterplot between risks
and health. Three of the countries have been labeled because they stand out. This figure
was made with the scor function from Wilcox (2003b). The source command below
accesses his functions from his website. We have no control over this website, so they
may move locations. If the command fails to work try googling ‘Rand Wilcox’ and find
where they are. We will update our web pages accordingly. The scor function is in the
second set of functions accessed, but it calls functions from the first set, so both need to
be accessed.
source("ftp://ftp.usc.edu/pub/wilcox/Rallfunv1.v4")
source("ftp://ftp.usc.edu/pub/wilcox/Rallfunv2.v4")
scor(risks,health,xlab="Risk rank",ylab="Health rank")
$cor
[1] 0.8573386
$test.stat
[1] 7.259896
$crit.05
[1] 2.650510
The scor function makes the scatterplot and you can add in other information to the
plot. It is often worth labeling outliers, and therefore we have done this with the text
function.
text(c(2,8,4),c(15,18,19),c("Poland","Greece",
"Ireland"),pos=4)
Figure 9.4 shows this positive relationship but it also shows that the data points for
Ireland, Greece, and Poland do not fit with this trend. These countries score fairly well
on ‘risks’, but poorly on ‘health’. These outliers will have a large impact on the standard
regression and correlation, and the question is whether you want them to. The move
within statistics is that you should try to lessen their impact, but before automatically
following this belief (which we share), you should ask yourself whether these points are
particularly important and whether there are any explanations for why they are different. If in
fact they are qualitatively different (like someone sneezing in a reaction time task), then
you probably do not want to include them in your sample because they are not part of
your population of interest. Presumably most cognitive theories are not trying to account
for cognition while sneezing. However, often outliers are important, so they are worth
careful examination.
The first step for the skipped correlation is to decide which data points are outliers.
In Figure 9.4 it appears that there are three outliers, but it is worth having a general
rule with good statistical properties. Wilcox uses a complex algorithm based on where
the bulk of the data lie. His scor function makes the plot shown in this figure with the
outliers shown with circles, a polygon containing the bulk of the data, and a + for the
mean of both variables excluding the outliers. It decides the circle points are outliers,
removes these, and runs the correlation (you can use Pearson’s or Spearman’s, here
Pearson’s was used).3 You get rskip = .857, which is much higher than that found with
the standard Pearson correlation (r = .513). While the value of this statistic is the same
as Pearson’s r if you conducted it on the data without the outliers, rskip is a different
statistic from Pearson’s r and you cannot look up significance in the same way. You first
calculate a tskip value with n being the total number in the sample (including the outliers)
with the same equation as used with r:
tskip = rskip √(n − 2) / √(1 − rskip²)
or here:
tskip = .857 √(21 − 2) / √(1 − .857²) = 7.25
This is within rounding error of the value produced above by the scor function. This
statistic does not have the same critical values at the t statistic. Wilcox (2003b) ran
some simulations and came up with an equation for the critical value for α = .05. The
equation is:
tskip-critical = 6.947/n + 2.3197
which for this example yields 6.947/21 + 2.3197 = 2.65. The observed value exceeds
this, so the skipped correlation is statistically significant at α = .05. Wilcox’s scor
function, written for R, produces all these values so you do not have to.
3Although Spearman’s and Pearson’s produce the same value with all the data, they will not when the outliers
• Research question: What is the relationship between food and drink intake for different types
of animal?
• Purpose: To illustrate the rlm function, to work with the split function, and to stress the
importance of graphing your data.
The Anscombe (1973) data set is important in the history of exploratory data analysis
(EDA) because it beautifully illustrates how somebody who does not make friends with
their data can reach very silly conclusions. We use these data to illustrate robust regression
and in particular show when we would expect these methods to make a difference and
when we would expect them not to make a difference.
Robust methods are an incredibly active area of statistical research. Methodologists
agree that robust methods, of some type, should usually be used. Because of this
importance it is tempting to have several chapters on this. But, because of its importance
it also means the people writing many of these functions have made them user-friendly.
After one of us had a lecture course where about a third of the time was spent on robust
methods, the student feedback was: ‘shorten that section, all you do is replace lm with
rlm.’ This was frustrating because at one level robust methods are so important, and the
mathematics behind them are detailed, that they deserve much explanation. However, at
the pragmatic level of the typical psychology-user this feedback was right, so we will
adopt the pragmatic brief approach. This does not mean robust regression is unimportant.
Please try rlm on your own data!
To make this example more concrete, assume that data are food and drink intake for
birds, mammals, insects, and reptiles. The food and drink are in some measure that takes
into account body weight. They are stored on the book’s web page.
If you type anscombe the data are printed in several long columns. To make them
readable on a single screen (assuming the window is open wide enough) type:
cbind(anscombe[1:11,], anscombe[12:22,],
anscombe[23:33,], anscombe[34:44,])
Food Drink Type Food Drink Type Food Drink Type Food Drink Type
1 10 8.04 bird 10 9.14 mammal 10 7.46 insect 8 6.58 reptile
2 8 6.95 bird 8 8.14 mammal 8 6.77 insect 8 5.76 reptile
3 13 7.58 bird 13 8.74 mammal 13 12.74 insect 8 7.71 reptile
4 9 8.81 bird 9 8.77 mammal 9 7.11 insect 8 8.84 reptile
5 11 8.33 bird 11 9.26 mammal 11 7.81 insect 8 8.47 reptile
6 14 9.96 bird 14 8.10 mammal 14 8.84 insect 8 7.04 reptile
7 6 7.24 bird 6 6.13 mammal 6 6.08 insect 8 5.25 reptile
8 4 4.26 bird 4 3.10 mammal 4 5.39 insect 19 12.50 reptile
9 12 10.84 bird 12 9.13 mammal 12 8.15 insect 8 5.56 reptile
10 7 4.82 bird 7 7.26 mammal 7 6.42 insect 8 7.91 reptile
11 5 5.68 bird 5 4.74 mammal 5 5.73 insect 8 6.89 reptile
Pretend that you thought Tukey (1977) was misguided when he suggested that graphing
data was an integral part of analysis, that you thought Tufte’s books (e.g., 2001) on
making clear and informative graphs are boring, and that you thought Murrell (2006)
had wasted his time writing lots of the graphing procedures in R. If you wanted to see
if the animal types were different, you might compare means and standard deviations.
To do this use the tapply function to calculate the means and standard deviations for
the different groups. So, tapply(Food,Type,mean) calculates the mean of Food
for each value of Type. The command below does the mean and standard deviation
(sd) for food and drink for each animal type. The row.names function adds names to
each row for the object meansd.
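A sketch of these commands (the layout of meansd is an assumption):

meansd <- rbind(tapply(Food, Type, mean), tapply(Food, Type, sd),
                tapply(Drink, Type, mean), tapply(Drink, Type, sd))
row.names(meansd) <- c("food mean", "food sd", "drink mean", "drink sd")
meansd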
Having seen these values it would clearly be misguided to run ANOVAs (because the
means are all the same so the F values should be 0 or very near zero), but given your
views on Tukey, Tufte, and Murrell you may still want to:
summary(aov(Food~Type))
summary(aov(Drink~Type))
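The output below comes from regressing drink intake on food intake separately for each animal type. A sketch of those steps (the split step is inferred from the example's stated purpose):

Foodg <- split(Food, Type)    # lists with one element per animal type
Drinkg <- split(Drink, Type)
for (i in 1:4) print(summary(lm(Drinkg[[i]] ~ Foodg[[i]])))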
Call:
lm(formula = Drinkg[[i]] ~ Foodg[[i]])
Residuals:
Min 1Q Median 3Q Max
-1.92127 -0.45577 -0.04136 0.70941 1.83882
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 3.0001 1.1247 2.667 0.02573 *
Foodg[[i]] 0.5001 0.1179 4.241 0.00217 **
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Call:
lm(formula = Drinkg[[i]] ~ Foodg[[i]])
Residuals:
Min 1Q Median 3Q Max
-1.1586 -0.6146 -0.2303 0.1540 3.2411
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 3.0025 1.1245 2.670 0.02562 *
Foodg[[i]] 0.4997 0.1179 4.239 0.00218 **
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Call:
lm(formula = Drinkg[[i]] ~ Foodg[[i]])
Residuals:
Min 1Q Median 3Q Max
-1.9009 -0.7609 0.1291 0.9491 1.2691
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 3.001 1.125 2.667 0.02576 *
Foodg[[i]] 0.500 0.118 4.239 0.00218 **
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Call:
lm(formula = Drinkg[[i]] ~ Foodg[[i]])
Residuals:
Min 1Q Median 3Q Max
-1.751e+00 -8.310e-01 1.258e-16 8.090e-01 1.839e+00
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 3.0017 1.1239 2.671 0.02559 *
Foodg[[i]] 0.4999 0.1178 4.243 0.00216 **
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
There is nothing in this numeric output to show that the relationships are different;
most of the key statistics that people look at are the same for each of the four groups.
Of course, some of you may be thinking that the Tukey-Tufte-Murrell approach may have
some merit, so you decide to look at the scatterplots for the different animal types. There
is a package called lattice (Sarkar, 2008), also used in Chapters 4 and 8, that can do
this type of graph quickly. This package is useful and is described in detail in Murrell
(2006). However, in this book we have relied mostly on the traditional graph methods
since it would take many pages to scratch the surface of lattice's capabilities. But,
just as an example, the following commands produce Figure 9.5.
library(lattice)
xyplot(Drink~Food|Type)
A similar plot using traditional graphs (i.e., the plot function) can be made. The
for (i in 1:4) has R make the plot for each of the four groups. main=
(paste(names(Foodg)[i])) is used to put the names of the groups above
the individual scatterplots in Figure 9.6.
par(mfrow=c(2,2))
for (i in 1:4) {
x <- lm(Drinkg[[i]]~Foodg[[i]])
plot(Foodg[[i]],Drinkg[[i]],main=
(paste(names(Foodg)[i])),xlab="Food intake",
ylab="Drink intake",ylim=c(0,15),xlim=c(0,20))
abline(x)}
par(mfrow=c(1,1))
Figure 9.5 Scatterplots for the four animal types comparing drink intake with food
intake. The graph is made with xyplot from the lattice library and the data were
made by Anscombe (1973)
Two things are obvious from these data. First, the relationship between food and
drink intake is very different for the different types of animals. Second, these data
are meticulously made up so that they have the same means, standard deviations,
correlations, and regression lines, but the relationships are all different. Clearly if someone
had just reported the numeric statistics they would have reached the wrong conclusions.
For those of you who teach undergraduate statistics, we encourage you to use these data
to illustrate the need for graphing. Here we use them for robust regression.
There are several choices of robust regressions. The main way that the regressions
vary is according to the loss function. Earlier in the chapter we talked about least absolute
values, trimming, and M-Estimators. M-estimators are the most common and these are
what are built within the rlm function. The psi option allows different loss functions
and at present it allows ones developed by Huber, Hampel, and Tukey. Each of these has
certain values you can tell the computer to use to specify it. While there are arguments
among statisticians about the best of these, within psychology we feel trying multiple
methods until you get a model you like would be wrong here (we are not in general
against this approach, see the conclusions of model selection in Chapter 5) and we feel
Figure 9.6 The scatterplots from Figure 9.5 remade with the plot command. The
standard linear regression is added to each plot
there are no psychology examples which would lend themselves better to one of these
than another. Therefore, we recommend only running the default (which is the Huber
function).
The question is, how will a robust regression help in modeling the Anscombe data?
For the birds, the data look as if they conform, roughly, to the assumptions of least
squares regression, so we would not expect large differences from a robust regression.
Like we said before, for the user the rlm function works in a very similar way to the
lm function. It is part of the MASS library (Venables & Ripley, 2002), a large collection
of functions for R (and S-Plus) that are used by most R users. The results below are very
similar to those found with the lm procedure. Unlike the lm procedure, the output does
not give you the p values associated with the coefficients. If you find p values useful,
you need to take the t value from the output and the degrees of freedom (calculate
this from df = n − 2 = 9) and use the pt function, which gives the probability associated
with a t value. For this example, to get the probability associated with the food coefficient
for the birds, type: pt(3.3351,9,lower.tail=F)*2. It produces p = .009, so you can write
t(9) = 3.34, p < .01.
library(MASS)
birds <- rlm(Drinkg[[1]]~Foodg[[1]])
summary(birds)
Coefficients:
Value Std. Error t value
(Intercept) 2.9837 1.4476 2.0611
Foodg[[1]] 0.5061 0.1518 3.3351
Correlation of Coefficients:
(Intercept)
Foodg[[1]] -0.9435
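If you would rather not copy the t value by hand, it can be pulled out of the summary object; a minimal sketch using the birds model above:
tval <- summary(birds)$coefficients["Foodg[[1]]", "t value"]   # t value for the food slope
2*pt(tval, 9, lower.tail=FALSE)                                # two-tailed p value with df = 9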
Class Insecta! There are over 2,000 species of praying mantises! Their scatterplots in
Figures 9.5 and 9.6 show a straight line of data with one point off the line. This point will
be influential for the standard regression, but it will make less of an impact for robust
regressions. The regression line does change, but the most noticeable difference in the
output is that the t values have rocketed upwards. This makes sense: because all the
other points lie on a straight line, once the regression lessens the weight of the outlier the
remaining residuals are tiny, so the standard errors shrink and the p values drop accordingly.
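The output below can be reproduced with the analogous call (the object name insects is our own); the mammal and reptile rlm outputs further on are obtained in the same way, with [[3]] and [[4]] in place of [[2]]:
insects <- rlm(Drinkg[[2]]~Foodg[[2]])
summary(insects)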
Coefficients:
Value Std. Error t value
(Intercept) 4.0035 0.0040 990.3355
Foodg[[2]] 0.3457 0.0004 815.8284
Correlation of Coefficients:
(Intercept)
Foodg[[2]] -0.9435
The third group is mammals. Clearly a linear regression is wrong because of the curved
pattern. But the robust regressions that we are describing are still linear. There are not
any large outliers, so we would not expect, and do not find, many differences between
the robust regression and the standard one.
Coefficients:
Value Std. Error t value
(Intercept) 3.0600 1.3113 2.3336
Foodg[[3]] 0.5000 0.1375 3.6375
Correlation of Coefficients:
(Intercept)
Foodg[[3]] -0.9435
What is needed here is a term in the regression for the curve, so we tried a quadratic
regression. From this it is clear we have uncovered how Anscombe (1973) created the
data for this group. A simple quadratic fits these data nearly perfectly.
Call:
lm(formula = Drinkg[[3]] ~ poly(Foodg[[3]], 2))
Residuals:
Min 1Q Median 3Q Max
-0.0013287 -0.0011888 -0.0006294 0.0008741 0.0023776
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 7.5009091 0.0005043 14875 <2e-16 ***
poly(Foodg[[3]], 2)1 5.2440442 0.0016725 3135 <2e-16 ***
poly(Foodg[[3]], 2)2 -3.7116396 0.0016725 -2219 <2e-16 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
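Note that poly uses orthogonal polynomials by default, so the coefficients above are not the intercept, slope, and curvature in the original food-intake units. If you want the quadratic equation on the raw scale (and so see the formula Anscombe presumably used), you could refit with raw polynomials; a sketch (the object name rawquad is ours):
rawquad <- lm(Drinkg[[3]]~poly(Foodg[[3]],2,raw=TRUE))
summary(rawquad)    # intercept, linear, and squared coefficients in the original units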
The final group is reptiles. Like the insects, there is one data point that does not fit
the pattern. However, the pattern for the others is to have the exact same values for food
intake. If we removed the outlying datum, the standard deviation for the remainder of this
group would be 0, and therefore without the outlier there is no correlation or regression
(not a correlation of 0, but no correlation at all). The odd point is a univariate outlier and
a very influential point for the standard regression, but because it lies near the regression
line it does not produce a large residual, so we would expect (and we find) that it is still
weighted highly by this robust regression.
Coefficients:
Value Std. Error t value
(Intercept) 2.9976 1.2570 2.3847
Foodg[[4]] 0.5001 0.1318 3.7955
Correlation of Coefficients:
(Intercept)
Foodg[[4]] -0.9435
Of course this is an odd set.4 It is worth considering how the scor function
would work with the data for each of these animal types. For the birds, scor did
not remove any outliers so produced Pearson’s r (.82). For insects, it does not count
the one wayward point as an outlier, so also produces Pearson’s r. For the mammals,
it identifies one point at the end as an outlier. Because this is near the regression
line, when it is excluded the correlation value drops to .76. For the reptiles, scor
produces an error because the outlier is excluded and this leaves no standard deviation
in food intake. The graphs, though, tell the story. You can add a small jitter to these
points, scor(Drinkg[[4]],jitter(Foodg[[4]])), to allow a correlation to
be calculated, but its value will be near zero since it just compares the drink variable
with random noise.

4 Another data set similar to the reptiles can be made with drink <- c(rep(1,10),k) and food <-
1:11. Let k take different values and look at the correlation: it will be −.5 when k is below 1 and +.5 when
k is above 1 (when k equals 1 there is no variability in drink, so no correlation can be calculated). The size
of k does not affect the correlation, only which side of 1 it is on.
SUMMARY
Robust procedures are recommended by the APA (Wilkinson et al., 1999) for analyzing
data and they increase the chances that you find significant effects. Therefore, if you want
a significant effect (and a lot of psychologists do), use these. If a reviewer complains
that they think you are using some fancy statistic to squeeze out a significant result,
then point them to the APA report (or Wilcox’s books, or this book, or lots of places) as
justification.
While detailed descriptions of different estimators could have been given, the approach
taken here was both simpler and had an educational purpose. We began with Spearman's ρ
because it is very popular. We wanted to stress the difficulty, once you have a statistic based
on ranks, of trying to make sensible quantitative statements in terms of the raw data. There
are more advanced rank based procedures, but they are less popular than other modern
alternatives. Because of this we recommend using a numerical transformation where
possible (or using GLMs if appropriate) because the values can be back-transformed onto
the original scales. Both the main rank-based procedures and the standard transformations
work only on univariate outliers. The second example showed a conceptually simple
method for calculating a correlation that excludes bivariate outliers. The procedure works
in three steps: a) determine which values are outliers and remove these; b) calculate the
correlation on the remaining items; and c) calculate the p value associated with this
correlation. Showing this procedure also allowed us to introduce readers to Wilcox’s
functions. The scor function is relatively new, but it has promise.
In practice, the most useful approach uses the rlm function, so we ended with this.
Like scor it lessens the impact of bivariate outliers because these have the largest
residuals, but unlike scor it only lessens their impact if they are outliers away from
the regression line. This was seen with the reptiles in the final example, where an outlier
near the regression line was still highly influential. When describing robust methods it
is tempting to say that you should use them for everything and that you will no longer need
to worry about outliers or any other oddities in the data. It may seem as if they are a panacea.
This is not true, and it is why we chose the Anscombe (1973) data sets. Looking
at the scatterplots, it is clear that the standard linear regression is only appropriate for one
of the data sets (birds), but the robust regression only addresses the odd pattern in one
of the remaining sets (insects), where there is a large residual. Therefore, as Anscombe
intended, his data sets show the importance of looking at your data graphically before
deciding which numeric procedure to apply.
R functions
• scor: skipped correlation coefficient;
• rlm: robust linear model;
• pt: probability associated with a t value.
Statistical concepts
• loss function: what is minimized when estimating a regression;
• trimmed: removing the tails of the distribution;
• outliers and inference: outliers decrease the chance of significance;
• to Tommy Franks: to ridicule the Geneva Convention∗ ;
• M-estimators: a popular method of robust estimation;
• Anscombe’s data: a neat data set showing the need for graphs.
FURTHER READING
Wilcox, R. R. (1998). How many discoveries have been lost by ignoring modern statistical
methods? American Psychologist, 53, 300–314. This article is written for a general psychology
audience and goes through the reasons why robust methods should be used.
Wilcox, R. R. (2003a). Applying contemporary statistical techniques. Orlando, FL: Academic
Press. Wilcox has written several books at different levels of complexity. This one is an
introductory statistics book, written for intelligent people who have not taken much
statistics.
10
Conclusion – Make Your Data Cool
Learning outcomes
1. To understand regression towards the median.
2. To consider the discovery and dissemination aspects of statistics.
The Oxford English Dictionary lists several definitions for regression. The most used
is to ‘go back.’ One psychology use of this is with hypnotic age regression, which
means mentally revisiting your childhood. In some faiths it can mean going back to
previous lives on distant worlds (probably the most discussed of these is Scientology,
http://sf.irk.ru/www/ot3/otiii-gif.html, but other faiths have beliefs about things like
life-after-death). Within data analysis regression takes on two related but distinct
meanings.
Historically, the first statistical use of regression is ‘regression towards the
mean/median’ which follows the English definition of going back. This was introduced
by Francis Galton (1886), one of the most influential (and controversial) people in early
psychology (Brookes, 2004). Galton had noticed in studies of seeds that small seeds
tended to produce seeds that were slightly larger than themselves, and that large seeds
tended to produce seeds that were slightly smaller than themselves. There was a tendency
for the offspring to move towards middle sized seeds. He found the same thing with
humans, and used this to argue for a theory of heredity (and his views on eugenics).
Given the topic of this book, his paper is of enough historical importance that it is worth
further discussion.
In his Table 1 Galton lists the average height (in inches) of 205 parents and their 928
adult offspring (he adjusts female heights so that there is not a gender effect and describes
numerous methodological issues; see also Wachsmuth et al., 2003). To demonstrate the
concept of regression towards (not 'to', since you would still expect the offspring of
tall parents to be taller than average) the middle, we will create some data. To simplify things,
we look at the simpler (and less enjoyable) case of asexual reproduction where the
offspring receives 100% of a single parent’s genes. There would be some genetic basis
for height and some non-genetic basis. An individual’s height will be determined by
the genetic and the non-genetic causes. Since we created the data ourselves, we know
what proportion of variation we would expect from both of these causes in this sample.
We would expect people’s height based just on genetic factors to be 70 ± 10 inches
and then the environmental factors can alter this by ±10 inches. So genetic factors and
environmental factors should have about equal impacts.
The code below makes these data. We use set.seed(47) so that the data can be
recreated. runif(100,60,80) creates 100 cases from a variable with a uniform
distribution between 60 and 80. We are assuming that there is no correlation between a
parent’s and her child’s environment, so pretend this is part of some bizarre experiment.
set.seed(47)
genoheight <- runif(100,60,80)
parentheight <- genoheight + runif(100,-10,10)
kidheight <- genoheight + runif(100,-10,10)
Below is the code for Figure 10.1, using a variety of functions that we have been
using before (abline(0,1) means a line with a 0 intercept and a slope of one;
abline(v=X) and abline(h=X) draw vertical and horizontal lines at the value X;
and \n within the text command means a line break).
plot(parentheight,kidheight,pch=20,xlab="Parent's height in inches",
ylab="Child's height in inches")
abline(0,1)
abline(h=mean(kidheight),v=mean(parentheight),lty="dashed")
text(55,84,"Most kids of short\nparents above\nthe diagonal",
pos=4)
Figure 10.1 Scatterplot of the parents' heights against their children's heights (in inches). The solid line is the kids = parents diagonal and the dashed lines mark the two means
The correlation and its confidence interval can be found with the cor.test function:
cor.test(parentheight,kidheight)
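If you want to use the numbers elsewhere (for example, in a plot title), they can be extracted from the object that cor.test returns; a small sketch (the object name height.test is ours):
height.test <- cor.test(parentheight,kidheight)
height.test$estimate    # the correlation
height.test$conf.int    # its 95% confidence interval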
Figure 10.1 and the statistical output show that the heights of parents with their
children are related (r = .58, with a 95% CI including .5, which is expected because
each height variable comes half from the genotype variable and half from an environmental/random
variable), but that kids of short parents tend to be taller than their parents (dots above
the diagonal) and that kids of tall parents tend to be shorter than their parents (dots
below the diagonal). This regression towards the mean, median or mediocrity (as it
gets labeled in different places) is now taught as a methodological artifact to avoid,
but in Galton’s day it was used for supporting his theory. Today, the most important
thing to take from it is that most measures are based both on some real value and some
measurement error (in behavioral genetics the non-genetic component includes error),
and therefore if you sample just on the basis of the observed value then you would
expect later measures to shift towards the middle of the distribution. An example is if
an educational researcher was interested in an intervention to improve poor students’
scores. If the researcher gave a test to students and then gave the intervention to the half
of students who performed the worst on this test, it is likely that at re-test this group
would perform better than they had, and that the other group would perform worse than
they had. This, however, could just be the methodological artifact of regression towards
the median.
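A minimal simulation of this educational example (entirely hypothetical numbers, in the spirit of the height data above) shows the artifact appearing with no intervention at all:
set.seed(47)
truescore <- rnorm(200, mean=50, sd=10)    # each student's underlying ability
test1 <- truescore + rnorm(200, sd=10)     # observed score = true score + measurement error
test2 <- truescore + rnorm(200, sd=10)     # re-test with new measurement error
bottom <- test1 < median(test1)            # the half selected for the 'intervention'
mean(test2[bottom] - test1[bottom])        # positive: the selected half appears to improve
mean(test2[!bottom] - test1[!bottom])      # negative: the other half appears to get worse
Even though nothing is done between the two tests, the group selected for low test1 scores moves up towards the middle and the other group moves down, which is exactly the artifact described above.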
In the history of regression Galton’s paper was also important for drawing the
regression equation line. The most common use of the word ‘regression’ within
contemporary data analysis/statistics is as a method to create expected values from
some model and to compare these with observed values. This is the use of ‘regression’
assumed in this book. The basic elements of a regression are: a set of observed values on
the left side of the equal sign, an equal sign, a model based on observed values of other
variables on the right side of the equal sign, and an error term. Because most statistics
can be framed in this way, most statistics can be described as regressions.
Of course, just being able to rename an ANOVA as a special case of regression does
not increase our understanding of either, but as shown in Chapter 3, being able to show
their equivalence and then build on each does. Further, the advances in each of the other
chapters also help to increase our understanding of statistics. The purpose of this book
is not, however, to increase your understanding of statistics for the sake of it, but so that
you can use these techniques for understanding your data.
Statistics has two components: discovering/evaluating patterns in nature and describing
these to an audience. The history of statistics is of a lot of intelligent and insightful people
describing ways to turn your complex and seemingly chaotic data into information that
illuminates your theories of nature. Because this book is on the art of regression, and
therefore really on the art of all statistics, it is fitting that we end with two of our
favorite philosophies of this discovering/disseminating duality of statistics.
The first philosophy is based on many people’s work, but most notably Tukey (1977),
and that is we should focus on the data, or in Rosenthal’s terms we should make
friends with our data (Wright, 2003). The second philosophy is based on Abelson’s
MAGIC (1995) criteria for persuading people about the story your data have to tell. The
first philosophy is about turning data into information, and the second is about using this
information to persuade the reader about the hypotheses under consideration.
The ‘making friends with data’ component is best exemplified by Tukey and
colleagues when discussing exploratory data analysis (EDA). Hoaglin et al. (2000)
discussed the four themes, or four Rs, of EDA: resistance, residuals, re-expression,
and revelation.
These four Rs should be considered whenever you are trying to extract coherent patterns
from data. The hope is that the techniques presented throughout this book will help you
address each of these four Rs.
Figure 10.2 shows how you would use this approach to evaluate a regression. If you
only ask whether the p value is significant, this would be bad. It is worth remembering
that finding a significant p just tells you that your sample size was large enough to detect
the effect. If you only use the p value to evaluate a model you get only the first step on the scale in Figure 10.2. You should
also look at the size of the effect, and with regressions this is often Pearson’s r or some
related statistic. This is better, but we know that r is just a single number and it can be
influenced by outliers and other influential points. You should ask whether the model is
robust. This may involve running robust methods, but it can also be addressed by seeing if
your conclusions would be very different if you just moved a couple of points around.
If your conclusions are dependent on the location of just a couple of points, then you
should be cautious in making these conclusions. As the Anscombe (1973) data sets in
Chapter 9 illustrate, it is vital to look at your data. These are the first four of the five
steps in Figure 10.2. You may find a significant p value, a large r value, have tested that
this is robust, and have looked at the plots. But, if your conclusion implies that apples
do not fall to Earth with gravity, your conclusion is wrong. The final step, to achieve the
coveted ☺ at the end of the scale, requires that your finding fits into a coherent view of
science. This leads to the second philosophy of the discovery/dissemination of statistics.

Figure 10.2 Ways to assess the value of a model (i.e., to make friends with your data)
The second philosophy describes what to do with the information that has carefully
been considered with the four Rs and evaluated with Figure 10.2. Abelson (1995)
argued that the general rules of communication and persuasion should be applied to
describing this information within results sections (see also Wright & Williams, 2003).
He described the MAGIC criteria for a good results section. MAGIC stands for:
Magnitude, Articulation, Generality, Interest, and Credibility. These labels are fairly
self-explanatory.
Abelson’s MAGIC can be applied to these observed patterns to produce an accurate, con-
vincing and clear story. MAGIC is a great acronym because it stresses the amazement that
your readers should experience when reading your results section. One of our colleagues
was giving a talk once and before showing his results he said: ‘these data are really
cool!’ His excitement spread through the audience. Our final advice is to: Make your
data cool!
Statistical concepts
• regression towards the median: scores are based on a true score and error.
• 4 Rs of statistics: Hoaglin et al.'s approach to statistics.
• MAGIC: Abelson's description for making statistics useful.
• smile scale: our way of summarizing an approach to statistics.
FURTHER READING
Abelson, R. P. (1995). Statistics as principled argument. Mahwah, NJ: Lawrence Erlbaum
Associates. This is the ideal summer reading for anybody who plans to teach their first
undergraduate psych-stats course. It allows you to step back from the algorithms and think
about why you are doing statistics.
Glossary of R functions used
in this book
The following is an alphabetical list of most of the R functions and packages used in this
book, as well as some of the important options.
xlim sets the lower and higher values for the x axis
xyplot multiple scatterplots. A trellis graph from the lattice package
ylab the y axis label
ylim sets the lower and higher values for the y axis
The code for running all the analysis is available on the book’s web page. This means
you do not need to retype code. Also, updates, the data, and links are available on the
page. It is: http://www.sagepub.co.uk/wrightandlondon.
References
Abelson, R. P. (1995). Statistics as principled argument. Mahwah, NJ: Lawrence Erlbaum
Associates.
Agresti, A. (2002). Categorical data analysis (2nd ed.). Hoboken, NJ: John Wiley &
Sons.
Anscombe, F. J. (1973). Graphs in statistical analysis. American Statistician, 27, 17–21.
Ayers, S., Wright, D. B. & Wells, N. (2007). Post-traumatic stress in couples after
birth: Association with the couple’s relationship and parent-baby bond. Journal of
Reproductive and Infant Psychology, 25, 40–50.
Banks, W. P. (1970). Signal detection theory and human memory. Psychological Bulletin,
74, 81–99.
Barth, J. M., Dunlap, S. I., Dane, H., Lochman, J. E. & Wells, K. C. (2004). Classroom
environment influences on aggression, peer relations, and academic focus. Journal of
School Psychology, 42, 115–133.
Bartholomew, D. J., Steele, F., Moustaki, I. & Galbraith, J. (2002). The analysis
and interpretation of multivariate data for social scientists. London: Chapman &
Hall/CRC.
Bates, D. & Sarkar, D. (2007). lme4: Linear mixed-effects models using S4 classes.
R package version 0.9975–13.
Berndt, E. R. (1991). The practice of econometrics. New York: Addison-Wesley.
Box, G. E. P. & Cox, D. R. (1964). An analysis of transformations. Journal of the Royal
Statistical Society: B, 26, 211–246.
Brookes, M. (2004). Extreme measures: The dark visions and bright ideas of Francis
Galton. New York: Bloomsbury Publishing.
Burnham, G., Lafta, R., Doocy, S. & Roberts, L. (2006). Mortality after the 2003 invasion
of Iraq: A cross-sectional cluster sample survey. The Lancet, 368, 1421–1428.
Canty, A. & Ripley, B. (2008). boot: Bootstrap R (S-Plus) functions. R package version
1.2–33.
Chatfield, C. (2003). The analysis of time series: An introduction (6th ed.). Chapman &
Hall/CRC.
Clark, H. H. (1973). The language-as-fixed-effect fallacy: A critique of language statistics
in psychological research. Journal of Verbal Learning and Verbal Behavior, 12,
335–359.
Cliff, N. & Keats, J. A. (2002). Ordinal measurement in the behavioral sciences. Mahwah,
NJ: Lawrence Erlbaum Associates.
Cohen, J. (1968). Multiple regression as a general data-analytic system. Psychological
Bulletin, 70, 426–443.
Cohen, J. (1976). Random means random. Journal of Verbal Learning and Verbal
Behavior, 15, 261–262.
Cohen, J. (1983). The cost of dichotomization. Applied Psychological Measurement, 7,
249–253.
Wright, D. B., Boyd, C. E. & Tredoux, C. G. (2003). Inter-racial contact and the own race
bias for face recognition in South Africa and England. Applied Cognitive Psychology,
17, 365–373.
Wright, D. B. & Hall, M. (2007). How a ‘Reasonable Doubt’ instruction affects decisions
of guilt. Basic and Applied Social Psychology, 29, 85–92.
Wright, D. B., Horry, R. & Skagerberg, E. M. (in press). Functions for traditional and
multilevel approaches to signal detection theory. Behavior Research Methods.
Wright, D. B. & Livingston-Raper, D. (2001). Memory distortion and dissociation:
Exploring the relationship in a non-clinical sample. Journal of Trauma and
Dissociation, 3, 97–109.
Wright, D. B. & London, K. (2009). First (and second) steps in statistics (2nd ed.).
London: Sage.
Wright, D. B. & London, K. (in press). Multilevel modelling: Beyond the basic
applications. British Journal of Mathematical and Statistical Psychology.
Wright, D. B. & Williams, S. (2003). Producing bad results sections. Psychologist, 16,
644–648.
Zhao, P. & Yu, B. (2006). On model selection consistency of Lasso. Journal of Machine
Learning Research, 7, 2541–2563.
Index
Note: Entries in bold are R functions. The suffix ‘n’ following locators refers to footnotes
and ‘f’ refers to figures.
abline function 18, 24, 28, 40n, 54, 77, 106
academics' salaries 116–21
AIC (Akaike's Information Criterion) 21, 127, 147
ANCOVA 26, 48–64, 116, 134, 136, 143, 146, 150
  children's recall 50–7
  Lord's Paradox 49
  mediation analysis and writing functions 58–62
  misinterpretations of 58
animals, food and drink intake 173–82
anova 21, 28, 114, 118, 120, 123, 127, 132, 134, 141
ANOVA 26, 27, 29–47, 54, 55, 56, 57, 58, 92
  ANCOVA and 48, 50, 62
  Lord's Paradox 49
Anscombe data sets 173, 174, 178f, 179, 181, 183, 188
B-spline 113, 122, 129, 135f
best subset regression 66, 89
  leaps package 69–76
BIC (Bayesian Information Criterion) 21, 127, 154
binary response variables 109, 129, 140
  see also response variables
binary variables 26, 38, 40f, 96, 101
birds, food and drink intake 174, 179–80, 182
bivariate outliers 27, 170, 172, 183
boot 12, 168
bootstrap estimates 12, 13, 102, 168
bootstrap samples 11–12, 13, 14
boxplots 31, 126, 132f, 133, 141, 142f
children
  truth and lie detection 129–34
  well-being of 170–3
chile heat and length 6–14
clustering 138, 140
cognitive dissonance 29–35
contrasts 33, 34, 46, 114, 145
contrasts 35, 146
covariate 48, 49, 50, 58, 76, 80
CRAN mirror sites 1, 2, 79
crime
  drug offences and thefts in Sussex 165–70
  interviewing children as witnesses 56
  reasonable doubt and 100–9
criteria-based content analysis (CBCA) 129, 130
cross-validation 71, 79, 80, 89
data, saving of 24, 25
degrees of freedom 133
diagnostic plots 19f, 20f
dissociation, and memory suggestibility 22–5
dummy variable coding 16
e1071 10, 11
error distributions 92–3, 96
exploratory data analysis 174, 188
exploratory factor analysis 82, 88
fBasics 10
floor effects 57, 135
foreign 2, 6, 7, 24
functions, and objects 4–14