Longitudinal Data Analysis
R and Epicalc
Epidemiology Unit
Prince of Songkla University, THAILAND
Analysis of Longitudinal Data using
R and Epicalc
Cover image :
ISBN :
Printed in Thailand
Preface
From the theoretical side, this book was written to help new researchers in the health sciences understand the nature of longitudinal studies, the structure of their data and the approach to analysis. Equipped with R, an open-source software environment consisting of a suite of both standard and user-contributed packages, readers are
encouraged to follow the examples on data manipulation, graphing, exploration and
modelling of longitudinal data.
Readers of this book should have some basic epidemiological knowledge, such as
measures in epidemiology (incidence, prevalence etc), types of study designs, bias,
confounding and interaction. Those who feel that they have an inadequate
background should read fundamental text books of epidemiology listed at the end of
this book. Basic data management concepts are also required as the first few
chapters deal with data structure and data manipulation.
Experience in using data entry software, such as EpiInfo and EpiData, may also be
beneficial. Although they can be used for entering longitudinal data, their facilities for managing data of a longitudinal nature are limited. So-called relational
database software, such as Microsoft Access and others, are designed for data
management and manipulation, but cannot do complicated statistical analyses. Data
manipulation is more efficient if the same software is used for both. Documentation
is also simplified since it can be integrated within the command file. R can be used
exclusively for both purposes.
To get the full benefit from this book, readers should be acquainted with R
software. Readers are recommended to work through each section, typing the commands as they go, since the theoretical parts are often followed by an example. The output on the computer screen can be observed and compared with
that in the book and the explanations can then be read in order to integrate the
learning of concepts and practice of data analysis simultaneously.
There are several text books and tutorials written for R available from the Internet.
"Analysis of Epidemiological Data Using R and Epicalc", which can be freely
downloaded from the CRAN website, http://cran.r-project.org/ and also from the
WHO web site, is strongly recommended as preliminary reading:
http://apps.who.int/tdr/svc/publications/training-guideline-publications/analysis-epidemiological-data/
That book not only explains how the functions in R and Epicalc are used but it also
provides the concept of variable management in R, especially how to avoid the
confusion between a free vector and a variable of the same name inside a data
frame. Epicalc also enables the data frame as well as the variables to be labelled, which subsequently leads to more understandable output in tables and graphs. The current
version of Epicalc has been developed to respond to the needs of longitudinal data
analysis in addition to the existing ordinary data exploration features.
Similar to the previous book, Epicalc functions are typed in Italics in contrast to
functions from other R packages, which appear in normal font type. A function will
be briefly explained when it is first used so that readers who have never used Epicalc before can catch up quickly, but this will probably not be enough to substitute for learning from the preceding book.
However, in addition to just following up and waiting for the failure event to occur,
follow-up records allow analysis of transitions. The state of an outcome can be
more than the classical dichotomy (diseased vs. non-diseased). It can be different
states of the disease or health. A transition is the change of state from one point of time to the next. In biology, measurements are mostly taken as continuous data.
Transition in this case may mean the difference in outcome measure between two
adjacent time points.
When the follow-up time is short and the number of variables is small, the data for
a longitudinal study could easily be stored in the so-called 'wide' format. In wide
form, a person appears in only one record, with measurements of the same sets of
variables, usually measured at different times, stored in separate columns. The wide
form has serious limitations when the number of repetitions of the visits is large and
each visit has a relatively large number of measurements. It is also inefficient when
certain persons have only a few visits while others have a large number of visits. In
this situation it is more efficient to store the data in the so-called 'long' format. In
long form, a person can appear in more than one record corresponding to each of
their visits. Measurements for each variable are stored in a single column, with an
additional column denoting time included to distinguish separate visits. When the
data are stored in long form, the number of visits does not need to be the same for
every person.
Finally, as an individual changes his/her exposure and outcome over time, instead
of looking at the status at each time point, one can consider his/her transition from
one point of time to the next and the relationship between the transition of the
outcome and the transition of exposure. While the outcome statuses of an individual
over time are correlated and thus the relationship needs adjustment for this
correlation, the transition from one time point to the next is usually not correlated,
and analysis of the transition (transition modelling) is therefore simpler than the
above two approaches. For a continuous outcome variable, the magnitude of change
is modelled against the exposure variable in the concurrent transition or preceding
state. When the outcome is a dichotomous variable, the transition probability of interest is not confined to failure; transitions can be multi-directional. Modelling of transition probabilities is called transition
modelling or Markov modelling. Markov models predict the probability of the
current outcome from the preceding status. This can also be called auto-regressive modelling, as the outcome is regressed on its own previous value.
Table of Contents
Chapter 1: Data formats _____________________________________________ 8
Chapter 2: Exploration and graphical display ___________________________ 14
Chapter 3: Area under the curve (AUC) ________________________________ 26
Chapter 4: Individual growth rate ____________________________________ 35
Chapter 5: Within subject, across time comparison _______________________ 46
Chapter 6: Analysis of missing records ________________________________ 55
Chapter 7: Modelling longitudinal data ________________________________ 65
Chapter 8: Mixed models ___________________________________________ 75
Chapter 9: Transition models ________________________________________ 84
Solutions to exercises ______________________________________________ 91
Chapter 1: Data formats
Data entry software can issue a warning to the user if a certain value is markedly
different from a preceding value. These consistency checks are difficult to
implement when the data are entered in long form because values of the same
variable, but from different times, are not entered in the same record.
Normalization of data
Some variables, especially baseline characteristics of the subjects, are usually fixed,
for example, date of birth, sex and place of birth. These data should be entered only once. In a database, such a set of data is stored as a table. The baseline table,
which has one record per subject, can be linked with the follow-up data set through
an identification (ID) field. This field is also called a key field. This ID field in the
baseline table must be unique. In other words, there must not be any duplication in
ID in the baseline table. In the follow-up table, ID certainly can be duplicated but
the combination of ID and time of follow-up must be unique since each follow up
of a subject should be recorded only once. A good database design would require
such a unique ID in the baseline table and unique ID+time (a compound key) in the
follow-up table. The database design must also ensure that all the IDs in the follow-up table are present in the baseline table.
The baseline table is sometimes considered as the mother table and correspondingly
the follow-up table is considered as the child table. A follow-up record without
corresponding ID in the baseline table is called an 'orphan' record. It indicates poor
quality control in the data entry system. In order to ensure such integrity (absence
of orphan records), relational database software, such as Microsoft Access, is required. EpiData can also be used to serve this purpose. Such data integrity can be
ensured if and only if the records in the follow-up table can be entered only through
an existing baseline record.
Hierarchical data
Data in which the relationship between the baseline data and the follow-up data has
a hierarchy is called hierarchical data. It is also known as multi-level data because
each follow-up record is considered as level 1, whereas each subject is considered
as level 2. The former is nested within the latter; such data are therefore also called nested data. Apart from longitudinal studies, hierarchical data can also be nested based on social or spatial relationships. For example, subjects can be nested within families and
families can be nested within villages, and so forth.
When upper level variables affect the outcome at the individual level, the variables
are sometimes called contextual determinants. For example, the nutritional status of
a child is not only influenced by his/her immune status but also by the child rearing
behaviours of the family and the hygiene conditions (waste disposal, water supply)
of the community.
Several software packages can be used to analyze hierarchical data. Analysis of this
multi-level data is well covered by a number of packages in R and will be discussed
in subsequent sections.
Examples of longitudinal data in long form
The datasets package in R contains a large number of data sets from longitudinal
studies. All of these are in the long format. The list can be viewed by typing:
> data(package="datasets")
Among these is ChickWeight, a data frame containing 578 rows and 4 columns from an
experiment on the effect of diet on early growth of chicks.
> class(ChickWeight)
[1] "nfnGroupedData" "nfGroupedData" "groupedData" "data.frame"
In addition to being a data frame, the object is also a special kind of data frame
which was modified from an ordinary data frame in order to make it suitable for
analysis using functions from the nlme package. To view the first 6 records you can
type:
> head(ChickWeight)
Grouped Data: weight ~ Time | Chick
weight Time Chick Diet
1 42 0 1 1
2 51 2 1 1
3 59 4 1 1
4 64 6 1 1
5 76 8 1 1
6 93 10 1 1
The dataset actually contains a formula, which models the chick’s weight using
each chick's age (in days).
To view the variable names, classes and descriptions you can type:
> library(epicalc)
> des(ChickWeight)
No. of observations = 578
Variable Class Description
1 weight numeric
2 Time numeric
3 Chick ordered
4 Diet factor
There are 4 variables, of which the third variable, 'Chick', is the identification
variable.
> use(ChickWeight)
> Chick[1:30]
[1] 1 1 1 1 1 1 1 1 1 1 1 1 2 2 2 2 2 2 2 2 2 2 2 2 3 3 3 3 3 3
50 Levels: 18 < 16 < 15 < 13 < 9 < 20 < 10 < 8 < 17 < 19 < 4 < 6 < 11 <
3 < 1 < 12 < 2 < 5 < 14 < 7 < 24 < 30 < 22 < 23 < 27 < 28 < 26 < 25 <
... < 48
The display looks fine but the ordering for the levels is unusual. Now let's do a
cross-tabulation with the 'Time' variable.
> table(Chick, Time)
Time
Chick 0 2 4 6 8 10 12 14 16 18 20 21
18 1 1 0 0 0 0 0 0 0 0 0 0
16 1 1 1 1 1 1 1 0 0 0 0 0
15 1 1 1 1 1 1 1 1 0 0 0 0
13 1 1 1 1 1 1 1 1 1 1 1 1
9 1 1 1 1 1 1 1 1 1 1 1 1
========= remaining lines omitted =========
The sequence of 'Chick' in the rows is not in numeric order. This is because it is an
ordered "factor" class object. It was classed this way in order to fit in with the
structure of "groupedData" required by the nlme package. We will learn about how
to convert this class of object into a normal integer in the next chapter. The other
point to note is that Chick "18", which appears in the first row, has only 2 visits at
times 0 and 2. The data is unbalanced. This is a common finding in longitudinal
data. It will be discussed in detail in subsequent chapters.
In the above example there is only one variable with repeated measures. In reality, a
data set can contain many sets of repeated measurements. As a simple illustration,
study the following commands carefully.
> exposure1 <- c(1:9,NA)
> exposure2 <- 11:20
> exposure3 <- 21:30
> outcome1 <- 101:110
> outcome2 <- 111:120
> outcome3 <- 121:130
> data.wide <- data.frame(ID=letters[1:10], exposure1, exposure2,
exposure3, outcome1, outcome2, outcome3)
> data.wide
ID exposure1 exposure2 exposure3 outcome1 outcome2 outcome3
1 a 1 11 21 101 111 121
2 b 2 12 22 102 112 122
3 c 3 13 23 103 113 123
4 d 4 14 24 104 114 124
5 e 5 15 25 105 115 125
6 f 6 16 26 106 116 126
7 g 7 17 27 107 117 127
8 h 8 18 28 108 118 128
9 i 9 19 29 109 119 129
10 j NA 20 30 110 120 130
Note the missing value for the 'exposure1' variable in the last row (ID = "j"). Now
let's reshape this data frame to long format.
> data1.long <- reshape(data.wide, idvar="ID", varying=list(2:4, 5:7),
v.names=c("exposure", "outcome"), direction="long")
> data1.long
ID time exposure outcome
a.1 a 1 1 101
b.1 b 1 2 102
c.1 c 1 3 103
d.1 d 1 4 104
e.1 e 1 5 105
f.1 f 1 6 106
g.1 g 1 7 107
h.1 h 1 8 108
i.1 i 1 9 109
j.1 j 1 NA 110
a.2 a 2 11 111
b.2 b 2 12 112
c.2 c 2 13 113
d.2 d 2 14 114
e.2 e 2 15 115
f.2 f 2 16 116
g.2 g 2 17 117
h.2 h 2 18 118
i.2 i 2 19 119
j.2 j 2 20 120
========= remaining lines omitted =========
Note that the new 'time' variable in the long format takes the values 1, 2 and 3, corresponding to the suffixes of the 'exposure' and 'outcome' variables in the wide format. Also the new
'exposure' variable in the long format corresponds to the 2nd to 4th variables in the
wide format and the 'outcome' variable in the long format corresponds to the 5th to
7th variables in the wide format. These sets of variables must be matched correctly
in the 'varying' argument. Note also that the value of 'exposure1' for ID "j" is
missing. Now suppose that the exposure and outcome variables are adjacent to each
other in the wide data frame.
> data2.wide <- data.frame(ID=letters[1:10], exposure1, outcome1,
exposure2, outcome2, exposure3, outcome3)
> data2.wide
ID exposure1 outcome1 exposure2 outcome2 exposure3 outcome3
1 a 1 101 11 111 21 121
2 b 2 102 12 112 22 122
3 c 3 103 13 113 23 123
4 d 4 104 14 114 24 124
5 e 5 105 15 115 25 125
6 f 6 106 16 116 26 126
7 g 7 107 17 117 27 127
8 h 8 108 18 118 28 128
9 i 9 109 19 119 29 129
10 j NA 110 20 120 30 130
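The corresponding reshape call for this layout is not printed in the text; a sketch consistent with the description that follows would be:
> data2.long <- reshape(data2.wide, idvar="ID", varying=list(c(2,4,6), c(3,5,7)),
     v.names=c("exposure", "outcome"), direction="long")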
The positions of the variables in the 'varying' list need to be changed accordingly to
match the order of the 'v.names' argument. Note that the row names of the resulting
data frame are formed from the combination of the 'ID' and 'time' variables. They
are therefore unique.
Exercise
In what format (wide or long) is the data frame Theoph provided by R?
Reshape it to the other format. Explain how the variable 'Subject' is arranged in the
new format.
Chapter 2: Exploration and graphical
display
In this chapter, we will go into more details of data frames that have class
"groupedData".
> library(epicalc)
> zap()
> use(Indometh)
> des()
No. of observations = 66
Variable Class Description
1 Subject ordered
2 time numeric
3 conc numeric
There are 66 records and 3 variables, the first of which has class "ordered", with the
other 2 being "numeric". There are no descriptive labels attached to the variables.
> summ()
No. of observations = 66
The minimum value of 'Subject' is 1 and the maximum is 6 but that does not
necessarily mean that there are 6 subjects in total. We have to check with
tabulation.
> tab1(Subject)
Subject :
Frequency Percent Cum. percent
1 11 16.7 16.7
4 11 16.7 33.3
2 11 16.7 50.0
5 11 16.7 66.7
6 11 16.7 83.3
3 11 16.7 100.0
Total 66 100.0 100.0
The table above indicates that there are 6 subjects, each contributing 11 records.
The order of 'Subject' in this table is not sorted from lowest to highest because the
variable is an ordered factor and the levels have been preset to this order.
To make sure that time of measurement of the drug concentration is systematic for
all subjects a cross-tabulation can be carried out.
> table(time, Subject)
Subject
time 1 4 2 5 6 3
0.25 1 1 1 1 1 1
0.5 1 1 1 1 1 1
0.75 1 1 1 1 1 1
1 1 1 1 1 1 1
1.25 1 1 1 1 1 1
2 1 1 1 1 1 1
3 1 1 1 1 1 1
4 1 1 1 1 1 1
5 1 1 1 1 1 1
6 1 1 1 1 1 1
8 1 1 1 1 1 1
Time increases by 0.25 unit until 1.25. Then it increases by a step of 1 until time 8,
except that time 7 is missing. This kind of tabulation is a good routine practice in
longitudinal data exploration. All cells in the table are filled with 1s indicating the
uniqueness for the combination of 'Subject' and 'time'. Note that there is no time 0.
When the data set is small, eyeball scanning on such tabulation may be adequate.
When the data set is large, it may be better to check whether there are any missing
follow up times or duplicated records by typing the following command.
> table(table(time, Subject))
1
66
This command tabulates all values of the above table and finds that there are 66
cells all having a common value of 1.
Duplication of records having the same person with the same time point could be
checked by typing:
> any( table(time, Subject) > 1 )
[1] FALSE
Graphing longitudinal data
There are two main methods for graphing the relationship between concentration
and time for each subject. The first method is to employ trellis plots, which give
one small plotting frame for each subject. The second method is to employ the
epicalc package, which has graphical functions that show all the data in the one
plotting frame.
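The command that produced the first conditional plot described below is not reproduced in the text; it was presumably simply:
> coplot(conc ~ time | Subject, data=Indometh)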
The upper part of the graph indicates the order of the Subject corresponding to the
panels in the plot. The plot is read from left-to-right and bottom-to-top, so the
bottom left panel corresponds to Subject 1 and top right panel corresponds to
Subject 3. To visualize the position of Subject in each panel we can replace the
open circles with the actual Subject number as follows:
> coplot(conc ~ time|Subject, data=Indometh,pch=as.character(Subject))
The coplot function is designed to show the relationship between two variables
conditional on the value of a third variable, in this case subject. Instead of using
coplot, we can use the xyplot function from the lattice package.
> library(lattice)
> xyplot(conc ~ time | Subject, data=Indometh)
> xyplot(conc ~ time | Subject, type="b", data=Indometh)
The last command plots connecting lines in each frame instead of just open circles.
Since the class of the data frame is "groupedData", we can also call the nlme
library, which has a default plot method for this class of data frame.
> library(nlme)
> plot(Indometh)
The result is similar to the xyplot command; however the labels for the X and Y
axes, which are stored as attributes in the data frame, are used here.
> attributes(Indometh)
In order to add the "groupedData" class to an ordinary data frame, we must employ
the groupedData function from the nlme package.
Let's create Sitka.gp from the Sitka data, which comes from the MASS
package.
> library(nlme)
> data(Sitka, package="MASS")
> Sitka.gp <- groupedData(size ~ Time|tree, data=Sitka,
labels=list(x="Time (Days since 1 Jan 1988)", y="Log(Height x
diameter ^2)"))
> plot(Sitka.gp)
In order to show all the subjects in the same plotting frame, let's return to the
Indometh data set. We first think that if there is only one subject, this value could
be simply plotted as a line graph against time.
> plot(conc~time, subset=Subject==1, type="l", data=Indometh)
The above command produces a plot of the pharmacokinetic curve for the first
subject. We can further proceed with the second and third subjects.
> lines(conc~time, subset=Subject==2, type="l", data=Indometh)
> lines(conc~time, subset=Subject==3, type="l", data=Indometh)
Each of the above lines commands adds one line to the existing graph for the
second and the third subjects, respectively. One can repeat the same process until
all six subjects have had their curves displayed.
The problem encountered so far is that the maximum value of the Y axis defined by
the first subject is too low for subsequent subjects. To prevent this, the initial plot
command should include a 'ylim' argument so that subsequent curves with higher
concentrations can still be accommodated. The remaining lines commands can then
follow as above.
Obviously, if there are too many subjects, the command would be too tedious to
run. It may be better to exploit a for loop.
> plot(conc~time, subset=Subject==1, ylim=c(0, max(conc, na.rm=TRUE)),
xlab="", ylab="", type="l", data=Indometh)
> for(i in 2:6) lines(conc~time, subset=Subject==i, col=i,
data=Indometh)
That completes the majority of the requirements for the graph. Readers can further
proceed with putting axis labels, a legend, title, etc.
The above process could be carried out with two epicalc commands.
> use(Indometh)
> followup.plot(id=Subject, time=time, outcome=conc,
main="Pharmacokinetics of Indomethicin")
The resultant graph is more or less the same as the previous commands using the
for loop construct. Note that the colours are automatically chosen based on the
Subject number.
Plots of aggregated values
The examples in the help page of the followup.plot function explore the
Sitka data set and give some ideas for the colour of the lines indicating the
treatment group.
> data(Sitka, package="MASS")
> use(Sitka)
> followup.plot(id=tree, time=Time, outcome=size, by=treat,
main="Growth Curves for Sitka Spruce Trees in 1988")
The control group, represented by solid black lines, tends to have larger trees than
the ones grown in the ozone-enriched chambers. This can be more clearly seen with
the following command.
> aggregate.plot(x=size, by=Time, group=treat)
It is clear that the mean tree size of the ozone group was somewhat smaller at the
start and distinctively smaller at the end of the follow-up period. If the argument
'return.output' is set to TRUE then the numerical results are shown as well.
> aggregate.plot(x=size, by=Time, group=treat, return=TRUE)
grouping time mean.size lower95ci upper95ci
1 control 152 4.166000 3.837764 4.494236
2 ozone 152 4.059630 3.898858 4.220401
3 control 174 4.629600 4.333247 4.925953
4 ozone 174 4.467037 4.310078 4.623996
5 control 201 5.037200 4.760472 5.313928
6 ozone 201 4.849074 4.693651 5.004498
7 control 227 5.438400 5.161541 5.715259
8 ozone 227 5.180926 5.014109 5.347743
9 control 258 5.654400 5.372244 5.936556
10 ozone 258 5.313148 5.145238 5.481058
Dichotomous longitudinal outcome variable
All the above examples have outcome variables on a continuous scale. Let's explore
a data set which has a dichotomous outcome variable.
> data(bacteria, package="MASS")
> use(bacteria)
> des()
No. of observations = 220
Variable Class Description
1 y factor
2 ap factor
3 hilo factor
4 week integer
5 ID factor
6 trt factor
The data set comes from a study testing the presence of the bacteria H. influenzae in
children with otitis media in the Northern Territory of Australia. The outcome is in
the variable 'y' and the follow-up period is represented by the 'week' variable. Some
follow-up times are missing, as shown by the following command.
> table(table(week, ID))
0 1
30 220
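The plot discussed next is not reproduced here; given the later command with 'ap', it was presumably produced with something along these lines, using 'trt' as the grouping variable:
> aggregate.plot(x=y, by=week, group=trt)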
The 95% confidence intervals of the prevalences overlap due to the relatively small
sample sizes in the three treatment groups. We can also use 'ap' (active vs placebo)
as the group variable instead of 'trt'.
> aggregate.plot(x=y, by=week, group=ap)
The prevalence of bacteria in the active treatment group declined steadily until
week 6 when the difference is the highest. The two groups tend to have a closer
prevalence again after 11 weeks.
Exercise
• Explore the Theoph data set again from the datasets package.
• How many subjects are there? How many times does each subject appear?
• Were the subjects assessed on drug levels in exactly the same pattern of
time?
• Plot the concentration of this drug of each individual over time using
followup.plot and coplot.
• Note that from followup.plot, the colours of the lines are all the
same. Why?
• How could we change the color? How can it be more colourful?
• Was the weight of each subject stable?
• Divide the subjects into 2 groups, one below 70kg and one greater than or
equal to 70 kg. Create a variable called 'Wtgr' based on this weight
division and use the followup.plot command to draw a graph similar
to the one on page 20.
• Is there a tendency that the heavier group has a higher level of drug
concentration over time?
Chapter 3: Area under the curve (AUC)
The area under the plasma (serum, or blood) concentration versus time curve
(AUC) has a number of important uses in toxicology, biopharmaceutics and
pharmacokinetics. In pharmacokinetics, drug AUC values can be used to determine
other pharmacokinetic parameters, such as clearance or bioavailability.
The Theoph data set has measurements starting at time zero, which makes it well suited to computing the area under the time-concentration curve (AUC).
> library(epicalc)
> class(Theoph)
[1] "nfnGroupedData" "nfGroupedData" "groupedData" "data.frame"
The data frame has the same class as those from previous chapters.
> use(Theoph)
> class(.data) # same as above
The area under the curve is computed by summing the trapezoids formed by each pair of adjacent time points. From elementary geometry, the area of one such trapezoid is equal to the average height at these two points multiplied by the width of the base. The records of the first subject in the Theoph data frame are shown below.
> .data[Subject==1, c(1,4,5)]
Subject Time conc
1 1 0.00 0.74
2 1 0.25 2.84
3 1 0.57 6.57
4 1 1.12 10.50
5 1 2.02 9.66
6 1 3.82 8.58
7 1 5.10 8.36
8 1 7.03 7.47
9 1 9.05 6.89
10 1 12.12 5.94
11 1 24.37 3.28
For subject 1, the area under the curve of the first trapezoid is (0.25 – 0.00) × (2.84
+ 0.74)/2 = 0.4475 unit. This is then added to all the trapezoids belonging to the
same subject. The second one is (0.57 – 0.25) × (6.57 + 2.84)/2 = 1.5056 and so on.
The final summation result is 148.92.
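For illustration (this command is not part of the original text), the same trapezoidal sum for subject 1 can be reproduced with base R:
> with(Theoph[Theoph$Subject == 1, ],
       sum(diff(Time) * (head(conc, -1) + tail(conc, -1)) / 2))
[1] 148.923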
In epicalc, this computation can be done for each subject using the following
command:
> auc.data <- auc(conc=conc, time=Time, id=Subject)
> auc.data
Subject auc
1 1 148.92305
2 2 91.52680
3 3 99.28650
4 4 106.79630
5 5 121.29440
6 6 73.77555
7 7 90.75340
8 8 88.55995
9 9 86.32615
10 10 138.36810
11 11 80.09360
12 12 119.97750
The function auc is based on the above principle of summation of trapezoids.
We can also compute the AUC one subject at a time, omitting the 'id' argument, since the default value of 'id' is NULL.
> auc(conc=conc[Subject==1], time=Time[Subject==1])
[1] 148.9230
> auc(conc=conc[Subject==2], time=Time[Subject==2])
[1] 91.5268
> auc(conc=conc[Subject==3], time=Time[Subject==3])
[1] 99.2865
The above three lines just confirm the same results as the preceding command.
Since the 'Subject' variable is not ordered numerically, let's create an integer
variable, say 'subject' (small s), that has the same value as 'Subject' (capital S). The
command is:
> subject <- as.integer(Subject)[order(Subject)]
In order to create an integer vector from 'Subject', which is an ordered factor, the values are coerced to integer first and then sorted in ascending order.
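An alternative, not used in the text but equivalent here, is to obtain the subject number directly from the factor labels:
> subject <- as.integer(as.character(Subject))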
> pack()
> class(.data)
[1] "data.frame"
The epicalc command pack has changed the class of .data to a simple data
frame. The original Theoph data frame remains intact. It can be used for more
complicated analyses later.
Now run the auc command once more.
> auc.data <- auc(conc=conc, time=Time, id=subject)
For those who are serious about pharmacokinetic studies, it is advisable to install and load the PK package, which offers more detail and a greater variety of AUC estimates than epicalc. The auc function in epicalc is simply based on the trapezoid summation described above. The auc function from the PK package also gives this value. In addition, it provides a better estimate of the area under the time-concentration curve based on estimating the unobserved concentrations within each time interval. The right tail of the curve can also be extrapolated to infinity to better reflect the total amount of drug that the subject was exposed to. However, the auc function in epicalc allows the user to compute the AUC separately for each subject, making it more convenient for data management.
The auc.data data frame contains AUC for each subject. It will be merged with
other data frames for further analysis.
Medically speaking, the AUC would reflect the speed of redistribution (into
different compartments of the body), destruction and excretion of the individual
subject on the drug. Thus the subject who destroyed/excreted fastest was the 6th
subject and the slowest was the first one. From now on we will try to find the
relationship between AUC and individual characteristics of the subject.
One can check the variability of the values of various variables across subjects
using the following command.
> aggregate(x=.data[,2:5], by=list(subject=subject), FUN="sd")
subject Wt Dose Time conc
1 1 0 0 7.273320 3.034533
2 2 0 0 7.269680 3.027389
3 3 0 0 7.234519 2.684222
4 4 0 0 7.329930 2.921623
5 5 0 0 7.278871 3.537344
6 6 0 0 7.151650 2.180382
7 7 0 0 7.250075 2.485130
8 8 0 0 7.233919 2.455296
9 9 0 0 7.244127 2.716175
10 10 0 0 7.108937 3.050539
11 11 0 0 7.223542 2.552839
12 12 0 0 7.235402 3.499246
The first argument of the function is the data frame to aggregate. Unlike the
aggregate.numeric function, which can apply several statistical summaries
on a single variable after splitting the data into subsets, the aggregate function
can apply only one summary statistic to multiple variables in the data frame.
One may try changing the "by" argument to "list(subject = Subject)" to see the
ordering of the subjects (results omitted). Here, using 'subject' instead of 'Subject'
allows the data to be displayed in ascending subject ID order.
The variables 'Wt' and 'Dose' have zero sd. This means that there is no variation within subjects: since the values of 'Wt' and 'Dose' of the same person do not change at all, their standard deviations are all zero. The standard deviations of 'Time' are all relatively similar, indicating that the times of drawing blood for drug assay were probably set to be synchronized for all subjects. However, they are not exactly the same; the synchronization process was not perfect.
Let's check the variation graphically.
> summ(Time)
obs. mean median s.d. min. max.
132 5.895 3.53 6.93 0 24.65
The first 11 points are in the same vertical line, that is, at time zero. Later on, the timing of blood drawing was not so synchronised. Variation in the time of drawing blood causes the stacks of points to jitter.
Now modify the above aggregate command as follows to obtain the mean
weight and dose for each subject.
> WtDose.data <- aggregate(.data[,2:3], by=list(subject=subject),
FUN="mean")
> WtDose.data
subject Wt Dose
1 1 79.6 4.02
2 2 72.4 4.40
3 3 70.5 4.53
4 4 72.7 4.40
5 5 54.6 5.86
6 6 80.0 4.00
7 7 64.6 4.95
8 8 70.5 4.53
9 9 86.4 3.10
10 10 58.2 5.50
11 11 65.0 4.92
12 12 60.5 5.30
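The command that merges the AUC values with these subject-level means is not shown in the text; judging from the commands that follow, it was presumably something like (the object name theoph1 is hypothetical):
> theoph1 <- merge(auc.data, WtDose.data)
> use(theoph1)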
By default, only records sharing the common field in both data frames are returned.
In this case all of the records are returned and no subject was omitted.
> summ(Wt)
obs. mean median s.d. min. max.
12 69.583 70.5 9.5 54.6 86.4
> summ(Dose)
obs. mean median s.d. min. max.
12 4.626 4.53 0.75 3.1 5.86
> summ(auc)
obs. mean median s.d. min. max.
12 103.807 95.407 23.65 73.776 148.923
All the variables are quite uniformly distributed. Let's create a two-way scatter plot.
> plot(Wt, auc, type="n", xlab="Wt (kg) ", ylab="Area under time-
concentration curve (hour-mg/L)")
> text(Wt, auc, labels=subject)
[Figure: scatter plot of AUC (Area under time-concentration curve, hour-mg/L) against Wt (kg), with points labelled by subject number]
There is a slight negative correlation between AUC and Wt. Heavier persons tended to destroy/excrete the drug faster than lighter ones, causing the drug to have a smaller AUC. One exception is the first subject, who has the highest AUC (remember, 148 units!) and yet was among the heaviest. This person would need special investigation for this outlying behaviour (e.g. perhaps due to disease or genetic make-up).
Now we plot AUC against dose.
> plot(Dose, auc, type="n", xlab="Dose (mg/kg) ", ylab="Area under
time-concentration curve (hour-mg/L)")
> text(Dose, auc, labels=subject)
[Figure: scatter plot of AUC (Area under time-concentration curve, hour-mg/L) against Dose (mg/kg), with points labelled by subject number]
Heavy subjects were more likely to be given lower doses. There are no outliers here since both variables were controlled by the protocol. Such a high correlation indicates a potential confounding situation. We should clarify whether dose or weight had a stronger effect on AUC.
[Figure: scatter plot of Wt (kg) against Dose (mg/kg)]
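The regression whose output fragment appears below is not reproduced in full; from the later commands it was presumably fitted with something like:
> regress.display(lm(auc ~ Dose + Wt), crude=TRUE)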
No. of observations = 12
Since the P values from all the t-tests and F-tests in the crude and adjusted analyses are above 0.05, the conclusion is that neither dose nor weight significantly determines the AUC.
Remember that we have an outlier. Let's exclude it and repeat the analysis.
> data.not1 <- .data[-1,]
> regress.display(lm(auc ~ Dose + Wt, data=data.not1), crude=TRUE)
Linear regression predicting auc
crude coeff.(95%CI) adj. coeff.(95%CI) P(t-test) P(F-test)
Dose 18.02 (3.71,32.32) -23.27 (-142.12,95.59) 0.664 0.023
No. of observations = 11
This model suggests that after excluding the outlier, in the crude analysis, dose had
a significant positive effect on AUC whereas weight had a significant negative
effect. This is judged by their 95% confidence intervals not including zero. In the
adjusted analysis, where both independent variables are included in the model, the
t-test on both variables suggests that they are not significant factors. However, dose,
and not weight, is significant by analysis of variance (F-test). We conclude that
dose (in mg/kg) is more important than patient weight in prolonging high levels of
oral theophylline.
In summary, this chapter gives you an example of computing the area under the
curve (AUC) as the outcome variable. This approach removes the need to do
sophisticated statistical modelling.
Summary
In more advanced approaches, the level of drug in the body at any given time can be modelled as a function of several underlying parameters. The nlme function in the
package of the same name was created to address this problem. Use of that package
is beyond the scope of this book.
Exercise
Read in the Sitka data set from the MASS package. Try to find out whether
ozone exposure reduced the area under the time-size curve. For simplicity, keep
only records without any missing values.
Chapter 4: Individual growth rate
In the preceding chapter, we calculated the area under the time-concentration curve.
This method of analysis is justified in pharmacokinetics as it reflects the ability of a person to destroy and/or excrete the drug. In this chapter we will analyse the Sitka
data set again, comparing the tree growth rates in each treatment group.
> library(epicalc)
> zap()
> data(Sitka, package="MASS")
> use(Sitka)
> des()
No. of observations = 395
Variable Class Description
1 size numeric
2 Time numeric
3 tree integer
4 treat factor
Note that the 'tree' variable is equivalent to 'Subject' in the Indometh and Theoph data sets. Here, its class is "integer". Note that the data frame is not in
"groupedData" format.
> summ()
No. of observations = 395
Variable Class Description
1 size numeric
2 Time numeric
3 tree integer
4 treat factor
Let's check whether there is any duplication of measurement on the same tree at the
same time.
> table(Time, tree)
The output (omitted) is not too extensive for this data set and shows that there are 5
different values for 'Time' and 79 different trees. All cells have counts of 1
indicating no duplication. For larger data sets, the following command may be
better.
> table(table(Time, tree))
1
395
Since each cell in the previous table contains only the number one, the mean for
that cell would be the size of the tree at that point of time.
> tapply(size, list(time=Time, tree=tree), mean)
The orientation of the table is the same. Again, the output is rather large and is
omitted here. Each column represents the size of the tree over time. Let's create
some follow-up plots.
> followup.plot(id=tree, time=Time, outcome=size)
Unlike the follow-up plots of the pharmacokinetic studies, in which there are only 6
lines for Indometh and 12 for Theoph, there are 79 lines in this plot, one for each
tree. These plots are sometimes called "spaghetti-plots" due to the crossing lines.
There is a tendency that trees grown in ozone-rich chambers (red dashed lines) are
smaller than those in the control group (black lines).
Time is days since 1 Jan 1988. However, it is not clear when the experiment started. Unlike the pharmacokinetic studies, which start from a zero level of drug, in the Sitka study the first measurement was taken at some unknown time after the trees had started growing. Therefore the raw AUC may be an invalid outcome measure.
Let's try subtracting the size at the first measurement (day 152) from each tree's subsequent measurements and then calculating the AUC. First we must create a 'visit' index for each tree.
Indexing visits
Let us make sure that the data are properly sorted by 'tree' and 'Time'.
> sortBy(tree, Time)
Next, count the number of records contributed by each tree (in R, this is called 'run
length encoding' since the lengths of each element that appears repeatedly are
encoded into a list). The function is rle.
> list1 <- rle(tree)
> list1
Run Length Encoding
lengths: int [1:79] 5 5 5 5 5 5 5 5 5 5 ...
values : int [1:79] 1 2 3 4 5 6 7 8 9 10 ...
The object 'list1' has two elements, namely 'lengths' and 'values'. The first
element shows that there are 5 visits for each tree. The second gives the identifier of each tree. Note that the function rle takes only an atomic vector as its argument. In this
case we do not have any problem as tree is a vector. If it was a factor, the
corresponding function would be
> lst1 <- rle(as.vector(tree))
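The step that turns these run lengths into a within-tree visit counter is not reproduced here; one way to do it (a sketch) would be:
> visit <- unlist(sapply(list1$lengths, FUN=seq_len))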
The above rle and sapply commands are complicated and not easy to
remember. The epicalc package has a function called markVisits for this
purpose.
> visit <- markVisits(id=tree, time=Time)
This is the required index vector for visits that we can pack into our data frame.
> pack()
Note that marking of the visit may not be well synchronized with the 'Time'
variable for a couple of reasons. Firstly, the exact time may not be repeated, as seen
in the Theoph data. Secondly, data may not be collected at the scheduled time. For
example, if patients are supposed to come weekly but they do not show up in the
second week, then their 3rd week visit will become their second visit and their 4th
week visit will become their third week visit, etc. We can simply check the
consistency between 'Time' and 'visit' as follows.
> table(Time, visit)
visit
Time 1 2 3 4 5
152 79 0 0 0 0
174 0 79 0 0 0
201 0 0 79 0 0
227 0 0 0 79 0
258 0 0 0 0 79
> head(.data, 10)
size Time tree treat visit
1 4.51 152 1 ozone 1
2 4.98 174 1 ozone 2
3 5.41 201 1 ozone 3
4 5.90 227 1 ozone 4
5 6.15 258 1 ozone 5
6 4.24 152 2 ozone 1
7 4.20 174 2 ozone 2
8 4.68 201 2 ozone 3
9 4.92 227 2 ozone 4
10 4.96 258 2 ozone 5
Our current task is to subtract the tree size for each tree at visit 1 from its size at all
subsequent visits. This can be achieved with the following commands.
> tmp <- by(.data, INDICES=tree, FUN=function(x) x$size -
x$size[x$visit==1])
> size.change <- sapply(tmp, as.numeric)
The first command above splits the data frame into each individual tree and
subtracts the tree size at the first visit from the tree sizes at all subsequent visits.
The result is saved into a temporary object, 'tmp'. This object is a type of list and is
not very useful unless it is converted to a vector or matrix using the sapply
function.
We need to convert this matrix into one long vector and integrate it as a variable
into the current data frame.
> size.change <- as.numeric(size.change)
> pack()
Now, let's look at the summary of this new variable.
> summ(size.change)
obs. mean median s.d. min. max.
395 0.747 0.76 0.55 -0.52 2.1
Note that there are a few negative values (size decreased over time) in addition to a
number of zeros corresponding to the first visit, and the remaining positive values
(tree size increased). We temporarily ignore the records with negative tree growth
and compute the AUC of the size differences and then compare the values of this
variable between the two treatment groups.
> auc.tree <- auc(size.change, time=Time, id=tree)
Here the 'tree' variable is used as the subject identification. This vector can now be
merged with a data frame containing a subset of records from the first visit.
> visit1 <- subset(.data, subset=visit==1, select=c("tree", "size",
"treat"))
> auc.visit1 <- merge(auc.tree, visit1)
Before using auc.visit1, since we have made many changes to .data, let's
make a copy of it for future use.
> .data -> Sitka1
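The des() output that follows describes auc.visit1, so the active dataset is presumably switched first (this step is not shown in the text):
> use(auc.visit1)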
> des()
No. of observations = 79
Variable Class Description
1 tree integer
2 auc numeric
3 size numeric
4 treat factor
Surprisingly, the minimum AUC value for the treatment group is negative, which
means one or more trees actually got smaller! To check which one(s) type
> tree[auc <0]
[1] 15
This unlikely finding was perhaps due to a measurement error made at the first visit
of this tree.
More details about how to detect abnormal records where the size became smaller are shown in the examples on the help page of the followup.plot function.
We will omit this tree and then test the hypothesis that the AUC is different
between the two treatment groups.
> use(auc.visit1)
> keepData(.data, subset=auc>0)
> tableStack(auc, by=treat)
control ozone Test stat. P value
auc t-test (76 df) = 1.88 0.064
mean(SD) 93.8 (22.4) 84.4 (19.6)
The conditions satisfy the requirements for a t-test and the difference in AUC is not
statistically significant.
Individual growth rates
Apart from AUC, we can compute and compare the growth rates of trees in the two
treatment groups.
Assuming that tree size is a linear function of time, for each individual tree, the
intercept would be its expected size at time 0 and the coefficient for 'Time' would
be the growth rate. We return to the original Sitka data set and use the function
by to get the coefficients for each individual tree.
> use(Sitka)
> tmp <- by(.data, INDICES=tree, FUN=function(x) lm(size ~ Time,
data=x))
Each element of 'tmp' contains the results of a linear model predicting the tree size
from 'Time' using only the data records of each tree. We then use the sapply
function to extract the coefficients of each model out from the 'tmp' object.
> tree.coef <- sapply(tmp, FUN=coef)
> class(tree.coef)
[1] "matrix"
> dim(tree.coef)
[1] 2 79
This matrix has 2 rows and 79 columns. Our objective here is to create a data frame
containing three variables. One variable must contain the unique tree id. The other
two variables should consist of the initial tree sizes and the growth rates for each
tree, which we have already obtained from the individual linear models.
We can convert the matrix above to a data frame using the as.data.frame
function, but first it needs to be transposed (columns to rows) using the function t.
> tree.growth <- as.data.frame( t(tree.coef) )
> des(tree.growth)
No. of observations = 79
Variable Class Description
1 (Intercept) numeric
2 Time numeric
The names of the variables created from linear modelling should be changed to
something more appropriate. The 'Time' variable represents the individual growth
rates obtained from the linear model.
> names(tree.growth)[2] <- "growth.rate"
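Before the merge below can work, the tree identifiers (currently stored as row names of tree.growth) need to be a column; this step is not shown in the text, but might be:
> tree.growth$tree <- as.integer(rownames(tree.growth))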
This variable will be used to link with the visit1 data frame created previously.
> tree.growth <- merge(tree.growth, visit1)
> use(tree.growth)
> des()
No. of observations = 79
Variable Class Description
1 tree integer
2 (Intercept) numeric
3 growth.rate numeric
4 size numeric
5 treat factor
Now we have a data frame with 79 records, one record for each tree, containing
each tree's own intercept, growth rate and treatment. We can now test the
hypothesis of different growth rates between trees in the two different treatment
groups.
> tableStack(vars=2:3, by=treat, decimal=3)
control ozone Test stat. P value
(Intercept) t-test (77 df) = 1.0213 0.310
mean(SD) 2.122 (1.108) 2.343 (0.783)
At time zero (the intercept), the two groups are not significantly different. However, the growth rate of trees in the ozone (treatment) group (0.012 unit per day) is significantly lower than that of the control group (0.014 unit per day).
Note that in this experiment, we do not know when ozone treatment started. If
treatment was given late in the growth curve, then the validity of using the linearly
increasing tree sizes (from time zero) as the outcome in the models would be in
doubt.
Summary
In summary, without Epicalc, manipulation of data within each subject in
longitudinal data requires several complicated functions, such as rle and sapply
to create an index variable within the same subject, here called 'visit'. The
markVisits function from epicalc can simplify this task.
Measurements from the first visit can be subtracted from the other visits in the same
individual to see the change from baseline. In the Sitka tree example, no baseline
data was given, so the first visit records were used to represent the baseline. The
linear growth rate of each individual can be computed using functions by and
sapply. These two functions, when used together, are very powerful. Analysts of
longitudinal data should get acquainted with them. They will be encountered
extensively in subsequent chapters.
Exercises
Based on the experience of the above examples, check whether there are any
missing records in the Xerop data set. Use the markVisits function to create a
'visit' index which indicates the order of visit for each subject. Check the
consistency of this visit index and the 'time' variable.
Use the Sitka data set to compute the coefficients of a quadratic growth curve for each tree. Determine which components of the growth curve are
significantly predicted by ozone.
Chapter 5: Within subject, across time
comparison
In the previous two chapters, we computed single summary outcome variables, such as the area under the curve (AUC) and the growth rate. By using this strategy, the complications of statistical models for repeated observations on the same subjects can be avoided.
We used the functions by and sapply to create linear models of growth for each
tree. In fact, there are more important applications of these functions – that is,
within subject, across time comparison.
If you had run the example code in the help page for the followup.plot
function, you would have found that some trees became smaller. Whether this is
naturally possible or whether it was due to human error during data collection
and/or data entry is not known. The technical challenge that we are facing is how to
detect the records that have decreasing tree sizes.
> library(epicalc)
> zap()
> data(Sitka, package="MASS")
> use(Sitka)
The data frame is split into subsets for each unique value of 'tree'. The tree size at the first visit in each subset is then subtracted from the size at the second visit and the result stored in a temporary object called 'tmp'.
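The command that produces 'tmp' is not reproduced here; a sketch consistent with the description (assuming the data are sorted by tree and Time, as in Chapter 4) would be:
> sortBy(tree, Time)
> tmp <- by(.data, INDICES=tree, FUN=function(x) x$size[2] - x$size[1])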
> diff2from1 <- sapply(tmp, FUN=as.numeric)
> which(diff2from1 < 0)
2 15
2 15
They were the second and the fifteenth trees. The numbers on the top row are
'names' of the values, which are on the bottom row.
Similarly, one can find records where the tree size at the third visit is smaller than
the size at the second etc.
Lag measurements
A more efficient method is to compare the current tree size at time t of each tree
with its size at time t-1. A lag vector of sizes can be created using the strategy shown
in the example of followup.plot. It is further discussed here.
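The commands that create the 'visit' index and the first lag variable are not reproduced here; following the pattern from Chapter 4 and the lagVar example, they were presumably along these lines:
> visit <- markVisits(id=tree, time=Time)
> pack()
> size.lag.1 <- lagVar(var=size, id=tree, visit=visit)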
The last argument 'lag.unit' has a default value of 1 if omitted. For a lag of 2 type:
> size.lag.2 <- lagVar(var=size, id=tree, visit=visit, lag.unit=2)
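The 'size.next.1' column seen in the output below (the size at the following visit) is presumably created with a negative lag, and the new variables packed into the data frame:
> size.next.1 <- lagVar(var=size, id=tree, visit=visit, lag.unit=-1)
> pack()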
These newly created lags can be seen from the following command.
> head(.data, 10)
size Time tree treat visit size.lag.1 size.lag.2 size.next.1
1 4.51 152 1 ozone 1 NA NA 4.98
2 4.98 174 1 ozone 2 4.51 NA 5.41
3 5.41 201 1 ozone 3 4.98 4.51 5.90
4 5.90 227 1 ozone 4 5.41 4.98 6.15
5 6.15 258 1 ozone 5 5.90 5.41 NA
6 4.24 152 2 ozone 1 NA NA 4.20
7 4.20 174 2 ozone 2 4.24 NA 4.68
8 4.68 201 2 ozone 3 4.20 4.24 4.92
9 4.92 227 2 ozone 4 4.68 4.20 4.96
10 4.96 258 2 ozone 5 4.92 4.68 NA
Note that the first visit has neither a 'size.lag.1' nor 'size.lag.2' value. For
'size.next.1', the value is the tree size at the next (second) visit. At the second visit,
'size.lag.1' is the tree size from the first visit, etc. Now the trees that became smaller
at any point of time can be easily identified.
> .data[which(size.lag.1 > size),]
size Time tree treat visit size.lag.1 size.lag.2 size.next.1
7 4.20 174 2 ozone 2 4.24 NA 4.68
72 4.08 174 15 ozone 2 4.60 NA 4.17
94 4.62 227 19 ozone 4 4.76 3.93 4.64
135 5.32 258 27 ozone 5 5.44 4.70 NA
180 4.60 258 36 ozone 5 4.62 4.42 NA
270 5.02 258 54 ozone 5 5.03 4.55 NA
There are six records, corresponding to six different trees. We can now create a
variable for the change in size between two adjacent visits on the same tree.
> size.change <- size - size.lag.1
> pack()
The records can now be inspected and corrected if needed. A summary of the change
is shown as follows.
> summ(size.change)
obs. mean median s.d. min. max.
316 0.332 0.33 0.18 -0.52 0.87
The leftmost outlying value was probably a serious error in measurement. The upper
part of the graph suggests that there are many missing values. In fact, the statistical
output from the command shows that there are only 316 non-missing values. The
remaining 395-316 = 79 missing records are from the first measurements which did
not have any preceding measurement.
> summ(size.change, by=Time)
The time intervals between measurements are 22, 27, 26 and 31 days, which slightly
increased over time. Noticeable from the graph is that the growth of the trees
actually slowed down over time. This can be seen more clearly with:
> boxplot(size.change ~ Time)
Despite the plots, one must realize that the variable 'size' is the logarithm of the
actual tree size. The untransformed values would in fact show accelerated growth.
In subsequent chapters, we will go deeper into using the change as the main outcome variable rather than the absolute value. Right now let's finish with tracking
changes of a dichotomous outcome over time.
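The commands loading the bacteria data set are not reproduced at this point; presumably, as in Chapter 2:
> zap()
> data(bacteria, package="MASS")
> use(bacteria)
> des()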
Variable Class Description
1 y factor
2 ap factor
3 hilo factor
4 week integer
5 ID factor
6 trt factor
In this data set, the time variable is 'week', an integer, and the subject variable is 'ID',
which is a "factor".
> summ()
There are 220 records. Let's check whether there are any IDs missing in any of the
follow-up periods.
> length(unique(ID))
[1] 50
> table(table(ID))
2 3 4 5
3 5 11 31
There are a total of 50 subjects. Three people came only twice, five people came 3
times, 11 people came four times and 31 came to every follow-up visit.
No person had a duplicate record in any follow-up visit. To assess the total number
of subjects who attended in each week type:
> colSums(table(ID, week))
0 2 4 6 11
50 44 42 40 44
All 50 subjects attended week 0. The numbers declined to 44, 42 and 40 at weeks 2,
4 and 6, respectively. At the final follow-up visit (week 11), 44 persons attended.
It is a good idea to see the change of bacteria status from one week to the next. Let's
start with the change from the first to the second week.
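A 'visit' index for each subject is needed first; presumably it was created with markVisits, as in Chapter 4:
> visit <- markVisits(id=ID, time=week)
> pack()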
> next.y <- lagVar(var=y, id=ID, visit=visit, lag.unit=-1)
> pack()
Before continuing, we keep a copy of the data frame for later use.
> .data -> data1
> head(.data,10)
y ap hilo week ID trt visit next.y
1 y p hi 0 X01 placebo 1 y
2 y p hi 2 X01 placebo 2 y
3 y p hi 4 X01 placebo 3 <NA>
4 y p hi 11 X01 placebo 5 <NA>
5 y a hi 0 X02 drug+ 1 y
6 y a hi 2 X02 drug+ 2 <NA>
7 n a hi 6 X02 drug+ 4 y
8 y a hi 11 X02 drug+ 5 <NA>
9 y a lo 0 X03 drug 1 y
10 y a lo 2 X03 drug 2 y
Note that the first ID, X01, did not attend in week 6, thus the value of 'next.y' for
week 4 is missing. Similarly, the second ID, X02, did not attend in week 4. The
value of 'next.y' for this subject for week 2 is missing. In order to cross-tabulate the
outcome variable, 'y', at visit 1 and visit 2, type:
> keepData(subset=visit==1)
> addmargins(table(y, next.y))
next.y
y n y Sum
n 0 4 4
y 4 36 40
Sum 4 40 44
Out of 4 persons who did not have the bacteria ('y' = "n") in their first visit, all of
them changed to "y". Out of 40 subjects who did have the bacteria ('y' = "y"), 4
persons changed to "n".
> mcnemar.test(table(y, next.y)) # P value = 1
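The commands for the second transition (visit 2 to visit 3) are not reproduced here; presumably the analysis was repeated on the saved copy, along these lines:
> use(data1)
> keepData(subset=visit==2)
> addmargins(table(y, next.y))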
Note that at this transition, the total number of records is now 37, not 50. There were
clearly more people who changed from "y" to "n" across this transition than in the opposite direction. In other words, bacteria tended to disappear in the second transition period (from week 2 to week 4), and this imbalance toward losing the bacteria is statistically significant by McNemar's test.
> mcnemar.test(table(y, next.y))
McNemar's chi-squared = 6.125, df = 1, p-value = 0.01333
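The reshape to wide format that the next paragraph refers to is not shown; a sketch consistent with the column names used below (the object name 'wide' is taken from the later command) might be:
> wide <- reshape(data1[, c("ID", "week", "y")], idvar="ID",
     timevar="week", v.names="y", direction="wide")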
The value of the 'week' variable from the original data set (in long form) became the
suffix for the repeated variables in wide form. The variable 'y.11' appears before 'y.6'
because the first person (X01) did not attend the sixth week appointment. The
outcome value in 'y.6' for this person is therefore <NA>. Then,
> with(wide, addmargins(table(y.0, y.2)))
y.2
y.0 n y Sum
n 0 4 4
y 4 36 40
Sum 4 40 44
The results are exactly the same as what we obtained using the preceding method.
In conclusion, there are two methods for comparing values across time within the
same person. The first method created a 'visit' variable which was modified from
'week' and shifted the values by using the lagVar function. The second method
reshaped the data to wide format before making the comparison. This method is
more straightforward for tabulation but not useful for transition modelling.
Exercise
Read in the Xerop data from Epicalc. Explore the pattern of visiting times. Were
they evenly distributed? Track changing status of respiratory infection (respinfect),
xerop and stunting over the visits.
Chapter 6: Analysis of missing records
The preceding chapter looked at the changing status of the subject. This chapter
pays attention to missing records during follow-up.
Missing records are different from missing values within variables of existing records. For missing records, the analyst should first highlight any pattern to the research team, even though the data themselves are not available. Later, as with the analysis of missing values, the missing records should be checked to see whether they are missing at random or whether there is some underlying cause. Some analysts prefer to impute missing values with their 'best guess'. For a follow-up study focusing on only one outcome variable, with all other variables (such as demographic and clinical prognostic factors) being fixed, and with statistical methods that allow no missing data, such imputation would be cost-effective. However, when more than one variable (especially both a changing exposure and a changing outcome) is being monitored, and the statistical methods can accommodate imperfect data, imputation becomes less important.
Handling missing values is a complicated technique and beyond the scope of this
book. Readers are advised to consult with other sources for more details on this
topic.
Based on the above arguments, data management is the most important technique to
deal with missing records. We will examine methods for identifying, refilling and
highlighting the pattern of missing records.
> use(bacteria) # This data is in the MASS package
> des()
> summ()
No. of observations = 220
There are 220 records. The most important variables for identification of missing
visits are 'ID' and 'week'. Since the class of the 'ID' variable is "factor" the output
from the summ command is not meaningful, particularly the minimum and
maximum values. All we can say is that there are 50 distinct values.
The 'week' variable is an integer and ranges from 0 to 11. Let's view the distribution
more closely.
> table(week)
week
0 2 4 6 11
50 44 42 40 44
The follow-up interval is 2 weeks up until week 6, with a final visit at week 11.
There were 50 children who attended in week 0 (all children in the study attended
the initial visit). Attendance at the subsequent visits ranged from 40 to 44 children.
There is no obvious pattern to the missing records.
We can further tabulate this table to obtain a frequency of total visits for all
children.
> table(table(ID))
2 3 4 5
3 5 11 31
Of the five scheduled visits, three children came twice, five came three times, 11
came four times and 31 attended all five. No child came just once; that is, every
child returned for treatment at least once during the 11-week study. The full vector
of frequencies for 0 up to 5 visits is [0, 0, 3, 5, 11, 31].
Since there are 220 records out of the possible maximum 250 (50 subjects × 5
visits), the overall probability of a child attending at each visit is 220/250 = 0.88.
The question is whether or not the observed distribution is actually random.
If we can prove that the observed data follows some known distribution, then we
can conclude that the missing records have no pattern and are therefore missing at
random.
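The natural candidate distribution here is the binomial with 5 trials and attendance
probability 0.88. The vector 'p' used below is not defined in this excerpt; presumably
it holds these binomial probabilities, for example:
> p <- dbinom(0:5, size=5, prob=0.88)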
With p = 0.88, the probability that a child attends none of the five scheduled visits
is about 0.00002, while the probability of attending all five is more than one half.
These probabilities refer to a single child. With 50 children, the expected numbers of
children attending each possible total number of visits would be:
> 50 * p
[1] 0.00124416 0.04561920 0.66908160 4.90659840 17.99086080 26.38659584
Let's compare these expected numbers with the observed numbers from the
bacteria study.
> data.frame(visits=0:5, p=p, expected=50*p, observed=c(0,0,3,5,11,31))
  visits            p    expected observed
1      0 0.0000248832  0.00124416        0
2      1 0.0009123840  0.04561920        0
3      2 0.0133816320  0.66908160        3
4      3 0.0981319680  4.90659840        5
5      4 0.3598172160 17.99086080       11
6      5 0.5277319168 26.38659584       31
The observed numbers appear fairly close to the expected numbers. In our study, 31
out of 50 children attended all 5 visits. If the distribution of total visits followed a
binomial distribution, with p = 0.88, we would expect this number to be 26. To test
whether the whole vector of observed frequencies fits well with the above expected
frequencies from a binomial distribution, we employ the chi-squared goodness-of-
fit test.
> chisq.test(x=c(0,0,3,5,11,31), p=p)
Chi-squared test for given probabilities
data: c(0, 0, 3, 5, 11, 31)
X-squared = 11.6921, df = 5, p-value = 0.03926
The p-value is below 0.05; however, the warning message that accompanies the test
("Chi-squared approximation may be incorrect") suggests that the test may not be
appropriate, most likely because several of the expected frequencies are less than 5.
Let's view the help page for this function.
> help(chisq.test)
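The help page describes, among other things, the 'simulate.p.value' argument, which
requests a Monte-Carlo p-value that does not rely on large expected frequencies. A
possible call (this step is not shown in the original) would be:
> chisq.test(x=c(0,0,3,5,11,31), p=p, simulate.p.value=TRUE, B=10000)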
Taking the small expected frequencies into account, there is not enough evidence to
conclude that the observed data do not come from a binomial distribution. Thus we
conclude that the missing visits show no systematic pattern and can be regarded as
missing at random.
Let's now check whether treatment ('trt') is associated with missing records. Note
that treatment was fixed for each subject throughout the study, which can be checked
first by visually scanning the cross-tabulation.
> table(ID, trt)
trt
ID placebo drug drug+
X01 4 0 0
X02 0 0 4
X03 0 5 0
X04 5 0 0
X05 5 0 0
X06 0 4 0
===== lines omitted ====
Within each row, only one cell should be greater than zero.
> table(ID, trt) > 0
trt
ID placebo drug drug+
X01 TRUE FALSE FALSE
X02 FALSE FALSE TRUE
X03 FALSE TRUE FALSE
X04 TRUE FALSE FALSE
X05 TRUE FALSE FALSE
X06 FALSE TRUE FALSE
===== lines omitted =====
If no child changed treatment, then the sum of each row should not be more than 1.
> any(rowSums(table(ID, trt) > 0) > 1)
[1] FALSE
No child changed treatment group (or level of encouragement to comply with
treatment) during the study.
At baseline (week 0) we have shown that all children attended for treatment. Let's
explore the treatment allocation for that first week, which is stated in the help page
for the data set as being randomized.
> table(trt[week==0])
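 placebo    drug   drug+
      21      14      15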
The treatment allocation is not completely balanced, but in any case we would like
to see whether this distribution remains more or less the same across the five
scheduled visits.
> table(week, trt)
trt
week placebo drug drug+
0 21 14 15
2 20 13 11
4 18 12 12
6 17 11 12
11 20 12 12
For each subsequent visit (weeks 2 to 11), the numbers of children receiving the three
different treatments appear quite similar, indicating that, in terms of treatment
group, the missing records occur at random. We can test the hypothesis that at each
follow-up visit the distribution is no different from that of the first week by using
the chi-squared test.
> chisq.test(table(trt[week==2]), p=table(trt[week==0])/50,
simulate=TRUE)
> chisq.test(table(trt[week==4]), p=table(trt[week==0])/50,
simulate=TRUE)
> chisq.test(table(trt[week==6]), p=table(trt[week==0])/50,
simulate=TRUE)
> chisq.test(table(trt[week==11]), p=table(trt[week==0])/50,
simulate=TRUE)
All four tests have large P values indicating that for each follow-up visit, the
number of children who attended for treatment was very close to the expected value
set by that at the first visit (week 0).
We can continue exploring the distribution of missing records for each variable in
this manner. However, since the number of variables we can explore at one time is
limited, we cannot investigate whether an interaction between variables exists. For
example, we have shown that the distribution of missing records is neither
associated with week nor with treatment. There is, however, a possibility that non-
attendees (the missing records) occurred earlier in one treatment group compared to
another.
To add missing records into a longitudinal data set, we first create a data frame
containing all the possible combinations of 'ID' and 'week' by reshaping the data to
wide format.
> wide <- reshape(.data, idvar="ID", timevar="week", v.names="y",
direction="wide")
The warning message appears because we did not specify the treatment-related
variables ('ap', 'hilo' and 'trt') in the command. In the wide data frame the values of
these variables are taken from the first record of each ID. Since we have shown that
they are fixed (constant) within each ID, we can safely ignore the warning.
> head(wide)
ap hilo ID trt y.0 y.2 y.4 y.11 y.6
1 p hi X01 placebo y y y y <NA>
5 a hi X02 drug+ y y <NA> y n
9 a lo X03 drug y y y y y
14 p lo X04 placebo y y y y y
19 p lo X05 placebo y y y y y
24 a lo X06 drug y y y y <NA>
Now we can reshape this data frame back to long format. As explained in Chapter 2,
most of the arguments of the function can be omitted, since the data frame was
created by a previous reshape command. The result is a data frame of 250 records,
one for each of the 50 × 5 possible combinations of ID and week.
> long <- reshape(wide, direction="long")
> des(long)
No. of observations = 250
Variable Class Description
1 ap factor
2 hilo factor
3 ID factor
4 trt factor
5 week integer
6 y factor
Note that the order of the variables is alphabetical. This is a side-effect of the
reshape command.
> summ(long)
The 'y' variable contains only 220 non-missing observations, which come from the
original data.
The final step is to add an indicator variable, specifying whether the subject
attended at each corresponding follow up visit. These are in fact the records in
which the 'y' variable is not missing.
> long$attend <- !is.na(long$y)
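The data frame 'bacteria.new' referred to below is not constructed earlier in this
excerpt. Presumably it was produced with Epicalc's addMissingRecords function
(introduced in the Summary and exercises of this chapter), along these lines:
> bacteria.new <- addMissingRecords(.data, ID, week, outcome="y")
> use(bacteria.new)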
> head(bacteria.new, 10)
Note the reordering of the variables and the addition of the 'present' variable, which
indicates whether the record was present in the original dataset.
The output suggests that attendance (and therefore non-attendance) in each week
was not random. Let's run a logistic regression model. The 'week' variable needs to
be converted to a factor first so that a comparison of attendance between the first
visit and each remaining visit can be done.
> week <- factor(week)
> pack()
> glm1 <- glm(attend ~ trt + hilo + week, family = binomial, data =
.data)
> logistic.display(glm1)
Log-likelihood = -81.9821
No. of observations = 250
AIC value = 179.9642
Odds ratios comparing the follow-up weeks to the first week are zero because there
was no missing visit in the first week. The likelihood ratio test, however, confirms
that there is a significant difference in missing records among the visits. The
adjusted odds ratio of compliance ('hilo') is quite different from the crude odds ratio
(1.2 vs 0.74). This is due to the fact that compliance is associated with type of
treatment (trt). In fact, treatment ('trt') is simply a recoding of the 'ap' variable
(active/placebo) and 'hilo' (high/low encouragement to comply with treatment).
Nonetheless, neither variable is statistically significant.
Our final conclusion is that missing records are not a random phenomenon but
increased significantly after week 0. Neither treatment nor compliance is associated
with missingness.
Summary
Missing records are almost unavoidable when the sample size and the number of
follow-up visits are large. When they do occur, it is important to investigate the
reasons and to ensure that they are missing at random. The most effective method is
to fill in the missing visits based on subjects and time of follow-up using
addMissingRecords. By creating a 'missing' indicator variable as the outcome
variable, cross tabulation (tableStack) and logistic regression (glm) can help to
identify determinants of missing records.
Exercise
Load the Xerop data set from Epicalc. This is a dataset from an Indonesian study
on vitamin A deficiency and risk of respiratory infection in 275 children.
• At each scheduled visit, determine how many records are missing.
• Identify and remove the duplicated records based on the combination of
'id' and 'time', then repeat question 1.
• Including the duplicated records that you removed, use Epicalc’s
addMissingRecords function to create a new data frame containing a
complete set of records.
• Was season associated with non-attendance?
• Determine whether or not vitamin A deficiency and/or respiratory
infection preceded non-attendance.
• Find the determinant(s) of the missing records.
Chapter 7: Modelling longitudinal data
In the previous chapters, the relationship between the outcome and the predicting
variables was based on an assumption of independence among the records. Exceptions
include, for example, situations where the data are analysed with methods such as
conditional logistic regression, in which the data are stratified and analysed
conditionally within matched sets.
In longitudinal studies, data on one individual can be measured more than once.
Thus records belonging to one individual appear in the dataset more than once, and
measurements from the same individual are not independent of each other. Analysis
of this type of data with ordinary (generalized) linear models will therefore give
erroneous results, in particular incorrect standard errors. There are three main
choices of modelling here:
Population average models are so called because they only focus on the average
relationship among repeated measures. A number of individuals are measured on
the outcome and exposure repeatedly. Repetitions may arise from measurements of
the same individuals several times or from measurement of the same sets of
variables on several individuals. The models don't take into account the source of
repetition. They just find the average relationship. This relationship comes from
averaging among repeated times and within repeated persons. Such a model is
therefore also called a marginal model. Remember that when we apply the
addmargins function to a table, we add, in the rightmost column, the sum of each
row and, in the bottom row, the sum of each column. The margins thus focus on the
overall effect of rows and columns and ignore what is inside. In marginal models, we are
not able to estimate the individual person outcome but we can still predict the
outcome value of a new subject if the person is given the covariate values. This
prediction is based on the average effects mentioned above. The modelling
technique is called "generalized estimating equations" (GEE), probably because the
final model is based on several generalized linear model equations that share the
same set of coefficients. GEE models require more parameters to be
estimated than the ordinary GLM methodology. The correlation coefficient
structure among the residuals of different rounds of observations must be specified.
The choice of possible correlation structures will be discussed in future sections.
The second choice is random coefficients models, also called 'mixed models' because
they are a mixture of fixed coefficients and random ones. While marginal models predict the
outcome of each person based on an average set of predicting coefficients, mixed
models use both fixed coefficients common to all subjects and random coefficients
specific to each individual in order to predict the outcome of that particular
individual. For a model with only one random coefficient, the random coefficient
would be the variation of intercept of each individual with other coefficients (or
slopes) being fixed. The output of the model should show the standard deviation of
the intercepts. If this is large, it would mean that there is a large level of baseline
variation of the subjects under study. Random coefficients models also allow for
random slopes. A model with random slopes means that different subjects may be
differently affected by the independent variable. This is similar to interaction of
strata in stratified analysis discussed in our previous book1. Finally, while marginal
models focus on correlations among residuals at different times of observation,
random coefficients models are more interested in the correlations between the
random coefficients themselves, for example between the random intercepts and the
random slopes. If the intercepts have a positive correlation with the slopes, the lines
on the upper part of the graph (those with high intercepts) would be steeper than
those in the lower part. When there is only one random coefficient, i.e. a random
intercept, such a correlation is not of main concern.
Occasionally, analysts use conditional fixed effects models. This is an extreme case
where random terms disappear. Comparison is made within the same person or the
same matched set, just like in a matched case-control study. For longitudinal
studies, the outcome of main concern using a fixed effects model is not at any
individual time point but the difference between two time points in the same person.
This is confined to before-after studies, or studies looking at the difference between
two sides of the same organ, such as eyes or kidneys, within the same person. This
type of model is of limited use and will be omitted in future discussion.
Finally, transitional models are focused on transition of states. They are used to
predict the outcome of a set of subjects who share the same previous status as well
as other independent variables. This model thus has two simultaneous interests.
First, it is interested in the effect of the preceding outcome status on the current one
after adjustment for other covariates. Second, it demonstrates the effects of those
covariates after adjustment with the previous outcome. A simple transition model
will look at the effects of a previous outcome in only one or a few preceding
rounds. Autoregressive (AR) models, often employed by economists, may include
more preceding lags, since financial data may have longer-term effects than medical
outcomes. Economic models often further aggregate the outcomes of individual time
points into a 'moving average' to obtain a more stable outcome value.
As a moving average at one time point is highly correlated with its neighbours,
moving-average outcomes are almost always combined with the autoregressive
approach. Together the two components give the name autoregressive moving
average (ARMA) analysis. This approach is rarely used in epidemiology and will not
be discussed further.
___________________________________________________________________
1 Analysis of Epidemiological Data using R and Epicalc
Packages and functions in R that are used for the different types of modelling include
the following:
Modelling approach       Package    Function
Marginal (GEE)           gee        gee
Marginal (GEE)           geepack    geeglm
Mixed (random effects)   MASS       glmmPQL
Mixed (random effects)   nlme       lme
Mixed (random effects)   lme4       lmer
Transitional             stats      glm
Let's try these packages and functions with the longitudinal data that we have
previously explored. Let's use the gee function in the gee package to model the
Sitka dataset.
> library(epicalc)
> data(Sitka, package="MASS")
> use(Sitka)
> library(gee)
> gee.in <- gee(size ~ Time, id=tree, data=.data)
> summary(gee.in)
Model:
Link: Identity
Variance to Mean Relation: Gaussian
Correlation Structure: Independent
Call:
gee(formula = size ~ Time, id = tree, data = .data)
Summary of Residuals:
Min 1Q Median 3Q Max
-2.02609732 -0.37956123 0.06948273 0.41669270 1.30948273
Coefficients:
Estimate Naive S.E. Naive z Robust S.E. Robust z
(Intercept) 2.27324430 0.1768643245 12.85304 0.1003348470 22.65658
Time 0.01268548 0.0008591845 14.76456 0.0003719549 34.10488
Working Correlation
[,1] [,2] [,3] [,4] [,5]
[1,] 1 0 0 0 0
[2,] 0 1 0 0 0
[3,] 0 0 1 0 0
[4,] 0 0 0 1 0
[5,] 0 0 0 0 1
Leaving most arguments to their default values, the link function is 'identity', which
means that the original value of outcome variable is not transformed. This is
applicable to continuous outcome variables in all models. The family is 'gaussian'
by default. The default correlation structure among residuals of different times is
"independent". This assumes that there is no association among residuals of
different time periods, as shown by zero values in the off-diagonal cells of the
working correlation matrix. There are two sets of standard errors produced. The
robust (sandwich) standard errors remain valid even if the working correlation
structure has been misspecified. The naïve ones give the same results as those from
using the glm command.
> summary(glm(size ~ Time))$coefficients
Estimate Std. Error t value Pr(>|t|)
(Intercept) 2.27324430 0.1768643245 12.85304 8.319835e-32
Time 0.01268548 0.0008591845 14.76456 1.473822e-39
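The data frame 'wide' of residuals used below is not created in this excerpt.
Presumably the residuals of the simple regression were reshaped to wide format by
measurement time, along these lines (object and variable names follow the output
shown):
> res <- residuals(glm(size ~ Time))
> wide <- reshape(data.frame(tree, Time, res), idvar="tree", timevar="Time",
    v.names="res", direction="wide")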
> cor(wide[,2:6])
res.152 res.174 res.201 res.227 res.258
res.152 1.0000000 0.9614699 0.9176641 0.8710606 0.8565763
res.174 0.9614699 1.0000000 0.9721038 0.9370675 0.9247401
res.201 0.9176641 0.9721038 1.0000000 0.9653189 0.9494939
res.227 0.8710606 0.9370675 0.9653189 1.0000000 0.9866713
res.258 0.8565763 0.9247401 0.9494939 0.9866713 1.0000000
The correlation coefficients among residuals of different time points are very high.
Thus the assumption of independence is not valid.
We first try the most commonly used correlation structure, “exchangeable”, which
assumes that the correlation between time points is constant and non-zero.
> gee.ex <- gee(size ~ Time, id=tree, data=.data, corstr =
"exchangeable")
> summary(gee.ex)
GEE: GENERALIZED LINEAR MODELS FOR DEPENDENT DATA
gee S-function, version 4.13 modified 98/01/27 (1998)
Model:
Link: Identity
Variance to Mean Relation: Gaussian
Correlation Structure: Exchangeable
Call:
gee(formula = size ~ Time, id = tree, data = .data, corstr =
"exchangeable")
Summary of Residuals:
Min 1Q Median 3Q Max
-2.02609732 -0.37956123 0.06948273 0.41669270 1.30948273
Coefficients:
Estimate Naive S.E. Naive z Robust S.E. Robust z
(Intercept) 2.27324430 0.0880570302 25.81559 0.1003348470 22.65658
Time 0.01268548 0.0002688318 47.18742 0.0003719549 34.10488
Working Correlation
[,1] [,2] [,3] [,4] [,5]
[1,] 1.0000000 0.9020987 0.9020987 0.9020987 0.9020987
[2,] 0.9020987 1.0000000 0.9020987 0.9020987 0.9020987
[3,] 0.9020987 0.9020987 1.0000000 0.9020987 0.9020987
[4,] 0.9020987 0.9020987 0.9020987 1.0000000 0.9020987
[5,] 0.9020987 0.9020987 0.9020987 0.9020987 1.0000000
The coefficients of the regression are exactly the same as those obtained from
specifying an independent correlation structure. However, the standard errors are
smaller.
Since the exchangeable structure forces the working correlation to be constant across
all time lags, which is not what we observed (the correlations decline as the lag
increases), the argument 'corstr' should be changed further.
> gee.ar1 <- gee(size ~ Time, id=tree, data=.data, corstr = "AR-M")
> summary(gee.ar1)
GEE: GENERALIZED LINEAR MODELS FOR DEPENDENT DATA
gee S-function, version 4.13 modified 98/01/27 (1998)
Model:
Link: Identity
Variance to Mean Relation: Gaussian
Correlation Structure: AR-M , M = 1
Call:
gee(formula = size ~ Time, id = tree, data = .data, corstr = "AR-M")
Summary of Residuals:
Min 1Q Median 3Q Max
-1.9059082 -0.2816582 0.1728384 0.5287292 1.3964369
Coefficients:
Estimate Naive S.E. Naive z Robust S.E. Robust z
(Intercept) 2.31297888 0.1123461827 20.58796 0.1003399163 23.05143
Time 0.01199296 0.0004319665 27.76362 0.0003508396 34.18359
Working Correlation
[,1] [,2] [,3] [,4] [,5]
[1,] 1.0000000 0.9460445 0.8950002 0.8467100 0.8010253
[2,] 0.9460445 1.0000000 0.9460445 0.8950002 0.8467100
[3,] 0.8950002 0.9460445 1.0000000 0.9460445 0.8950002
[4,] 0.8467100 0.8950002 0.9460445 1.0000000 0.9460445
[5,] 0.8010253 0.8467100 0.8950002 0.9460445 1.0000000
Now the working correlation is closer to what we observed from the simple
regression. AR-M denotes an autoregressive correlation of order one (M = 1). This
means corr(t, t+n) = [corr(t, t+1)]^n. In our example, the correlation between two
adjacent visits (a lag of one time point, the base on the right-hand side of this
equation) is 0.9460445. When the lag is increased to 2, the correlation is
0.9460445^2 = 0.8950002, and for a lag of 3 it is 0.9460445^3 = 0.8467100, etc. The
correlation coefficients among visits thus slowly decrease as the lag increases. To
speed up this decrease, another argument, 'Mv', can
be specified. For example,
> gee.ar2 <- gee(size ~ Time, id=tree, data=.data, corstr = "AR-M",
Mv=2)
> summary(gee.ar2)
The results are omitted. Both the coefficients and the standard errors are slightly
different from those of gee.ar1, as the working correlations now drop faster with
increasing time lag.
> library(geepack)
> geeglm.ex <- geeglm(size~Time, id=tree, data=.data, corstr =
"exchangeable")
> summary(geeglm.ex)
Call: geeglm(formula = size ~ Time, data = .data, id = tree, corstr =
"exchangeable")
Coefficients:
Estimate Std.err Wald p(>W)
(Intercept) 2.27324430 0.0977722507 540.5813 0
Time 0.01268548 0.0003640436 1214.2462 0
The estimated correlation parameter is 0.904, which is slightly larger than the one
obtained from the gee package using the same correlation structure (gee.ex). The
coefficients and standard errors are also very similar. Slight differences such as these
are seen between results from different software packages, whether open-source or
commercial, and are quite acceptable.
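The next block of output comes from the AR(1) version of this model. The fitting
command does not appear in this excerpt but, judging from the Call shown, it would
presumably be (the object name is an assumption):
> geeglm.ar1 <- geeglm(size ~ Time, id=tree, data=.data, corstr="ar1")
> summary(geeglm.ar1)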
Call:
geeglm(formula = size ~ Time, data = .data, id = tree, corstr = "ar1")
Coefficients:
Estimate Std.err Wald p(>W)
(Intercept) 2.31280103 0.0980443147 556.4571 0
Time 0.01198546 0.0003445955 1209.7360 0
Again, this gives practically the same coefficients and robust standard errors as the
one from the gee package.
Now that we are acquainted with these packages, the next step is to use them for
hypothesis testing. We want to see whether the size of trees is affected by treatment
after adjusting for time.
> geeglm1.ar1 <- geeglm(size ~ Time+treat, id=tree, data=.data,
corstr="ar1")
> summary(geeglm1.ar1)$coefficient
Estimate Std.err Wald p(>W)
(Intercept) 2.46484647 0.1657640661 221.105204 0.0000000
Time 0.01198697 0.0003446439 1209.700571 0.0000000
treatozone -0.22238262 0.1621230081 1.881535 0.1701597
The trees treated with ozone had a non-significantly smaller size throughout the
time of follow up. Unsurprisingly, tree sizes increased over time.
As treatment may have different effects over time we now put in the interaction
term.
> geeglm2.ar1 <- geeglm(size ~ Time*treat, id=tree, data=.data,
corstr="ar1")
> summary(geeglm2.ar1)$coefficient
Estimate Std.err Wald p(>W)
(Intercept) 2.154288609 0.2071115387 108.1930041 0.000000000
Time 0.013501415 0.0005847241 533.1587465 0.000000000
treatozone 0.231863021 0.2331996714 0.9885693 0.320092306
Time:treatozone -0.002219175 0.0007066679 9.8617143 0.001687538
This final model gives a better picture of the ozone effect. The main effect of ozone
is not significant indicating that at Time 0, there was no difference in size between
trees in the two treatment groups. The interaction term is strongly significant and
the negative coefficient indicates that the growth rate of trees in the ozone-treated
group was significantly lower than that of trees in the control group.
You may like to try modelling this data using the gee package. The conclusion
should be the same.
Now let's model dichotomous outcomes using the GEE methodology. Let's return to
the bacteria data set.
> zap()
> data(bacteria, package="MASS")
> use(bacteria)
> infected <- y=="y"
> pack()
> geeglm3.ar1 <- geeglm(infected ~ ap+week, id=ID, data=.data,
corstr = "ar1", family="binomial")
> summary(geeglm3.ar1)
Call:
geeglm(formula = infected ~ ap + week, family = "binomial", data =
.data, id = ID, corstr = "ar1")
Coefficients:
Estimate Std.err Wald p(>W)
(Intercept) 1.6315020 0.31066152 27.580382 1.506995e-07
app 0.8256061 0.48819185 2.859991 9.080800e-02
week -0.1041665 0.03738504 7.763552 5.331103e-03
The output indicates that there is no evidence of a difference in bacterial infection
between those receiving active treatment and those receiving placebo
(p-value = 0.09). On the other hand, the probability of infection decreased over time.
Interaction terms could be explored further, but all covariates remain
non-significant; the results are omitted here.
Results using the gee function from the gee package are similar.
Exercise
Try modelling the respiratory data set from the geepack package using the
methods described in this chapter. Compare the output from functions in the
different packages.
Chapter 8: Mixed models
The previous chapter demonstrated how to use the GEE methodology to model the
Sitka and bacteria data sets. In this chapter we will use mixed modelling
techniques to model the same data sets and compare the results.
> library(epicalc)
> library(MASS)
> use(Sitka)
> glmmPQL1 <- glmmPQL(fixed = size ~ Time * treat, random= ~ 1 | tree,
data=.data, family="gaussian")
> summary(glmmPQL1)
Linear mixed-effects model fit by maximum likelihood
Data: .data
AIC BIC logLik
NA NA NA
Random effects:
Formula: ~1 | tree
(Intercept) Residual
StdDev: 0.6003342 0.1932339
Variance function:
Structure: fixed weights
Formula: ~invwt
Fixed effects: size ~ Time * treat
Value Std.Error DF t-value p-value
(Intercept) 2.1217179 0.15374924 314 13.799860 0.0000
Time 0.0141472 0.00046278 314 30.569975 0.0000
treatozone 0.2216775 0.18596433 77 1.192043 0.2369
Time:treatozone -0.0021385 0.00055975 314 -3.820480 0.0002
Correlation:
(Intr) Time tretzn
Time -0.609
treatozone -0.827 0.504
Time:treatozone 0.504 -0.827 -0.609
Number of Groups: 79
The glmmPQL function comes from the MASS package. It fits generalized linear
mixed models using the penalized quasi-likelihood technique.
The random component is ~1 | tree. The 1 denotes the first coefficient of the model,
the intercept term. The sign | denotes "given" or "clustered by" or "grouped by". So
the random component for these data is the random intercept of each tree. In other
words, the model assumes that all trees share the same coefficients for 'Time' and
'treat' as well as for the interaction term. The only difference is in their intercepts,
which are treated as a random variable (random coefficients), so no separate intercept
parameter needs to be estimated for each tree.
The standard deviation of the random intercepts is 0.6003 units, which is quite large
compared to the residual standard deviation of 0.193. This means that the baseline
(intercept) values of the trees varied considerably compared with the variation of
growth within the same tree.
The fixed effects are 'Time', 'treat' and their interaction. Since the main objective of
the analysis is to compare the size of trees in each group over time, the fixed effects
are more important here than the random effects. The results are not too dissimilar
from those obtained using the GEE methodology: the significant interaction indicates
that the growth rate over time differs between the ozone-treated and control trees.
The correlation section and the standardised residuals are complicated, not very
important, and can be ignored.
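The output below corresponds to a model with a random slope for 'Time'. The fitting
command does not appear in this excerpt; judging from the later reference to
'glmmPQL2', it would presumably be:
> glmmPQL2 <- glmmPQL(fixed = size ~ Time * treat, random = ~ Time | tree,
    data=.data, family="gaussian")
> summary(glmmPQL2)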
Random effects:
Formula: ~Time | tree
Structure: General positive-definite, Log-Cholesky parametrization
StdDev Corr
(Intercept) 0.790968484 (Intr)
Time 0.002487428 -0.649
Residual 0.162608831
Variance function:
Structure: fixed weights
Formula: ~invwt
The random effect is now "~ Time | tree" which allows for a different effect of
'Time' on each tree. From the output, the intercept term has a larger standard
deviation (0.79 compared with 0.6003 in the random intercept model). Individual
slopes are strongly negatively correlated with the intercepts (Corr = -0.649). Within
the same treatment group, trees with the larger initial size tended to have a flatter
growth.
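The object 'lme1' summarised below is not created in this excerpt. Presumably it is
the corresponding random-intercept model fitted with the lme function from the
nlme package (which is loaded automatically when glmmPQL is used), for example:
> lme1 <- lme(fixed = size ~ Time * treat, random = ~ 1 | tree, data=.data)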
> summary(lme1)
Linear mixed-effects model fit by REML
Data: .data
AIC BIC logLik
175.5170 199.3293 -81.75852
Random effects:
Formula: ~1 | tree
(Intercept) Residual
StdDev: 0.6082011 0.1938483
Correlation:
(Intr) Time tretzn
Time -0.606
treatozone -0.827 0.501
Time:treatozone 0.501 -0.827 -0.606
The results from using the lme function are very close to those from using
glmmPQL.
> lme2 <- lme(fixed = size ~ Time*treat, random= ~ Time | tree, data
=.data)
> summary(lme2)
After adding a random time component, the results are very close to those of
'glmmPQL2' in all aspects.
The advantage of using the LME method is the availability of the AIC, BIC and log
likelihood, which imply relative levels of fit. We can therefore use these to compare
different models.
> anova(lme1, lme2)
Model df AIC BIC logLik Test L.Ratio p-value
lme1 1 6 175.5170 199.3293 -81.75852
lme2 2 8 146.1218 177.8714 -65.06088 1 vs 2 33.39528 <.0001
The model 'lme2' has two degrees of freedom more than 'lme1' but it has a much
smaller AIC value. Thus 'lme2' is significantly better than 'lme1'. The random
slope model fits better than one with a random intercept alone for this data set.
Note that the lme function was designed for linear mixed effects modelling. It is
confined to linear models where the outcome variable is on a continuous scale only.
Therefore, there is no "family" argument. The function lme has an option to use
maximum likelihood (ML) or restricted maximum likelihood (REML) methods.
A more recent package called lme4 allows a choice of "family" as well as a more
versatile nesting procedure. The formula syntax is slightly different.
> library(lme4)
> lmer1 <- lmer(size ~ Time*treat + (1|tree), family="gaussian",
data=.data)
> summary(lmer1)
Linear mixed model fit by REML
Formula: size ~ Time * treat + (1 | tree)
Data: .data
AIC BIC logLik deviance REMLdev
175.5 199.4 -81.76 130.2 163.5
Random effects:
Groups Name Variance Std.Dev.
tree (Intercept) 0.369909 0.60820
Residual 0.037577 0.19385
Number of obs: 395, groups: tree, 79
Fixed effects:
Estimate Std. Error t value
(Intercept) 2.1217179 0.1543913 13.742
Time 0.0141472 0.0004619 30.629
treatozone 0.2216775 0.1867409 1.187
Time:treatozone -0.0021385 0.0005587 -3.828
The results are the same as those from using the lme function except that p-values
for the coefficients are not displayed. One can, however, obtain 95% confidence
intervals by drawing a Markov chain Monte Carlo (MCMC) sample and computing a
95% highest posterior density (HPD) interval.
> tmp <- mcmcsamp(lmer1, n=1000)
> HPDinterval(tmp)
$fixef
lower upper
(Intercept) 1.790535519 2.4503748846
Time 0.012564591 0.0155585309
treatozone -0.152775369 0.6803190421
Time:treatozone -0.004096989 -0.0002541869
attr(,"Probability")
[1] 0.95
========= further lines omitted ==========
Both the lower and upper limits of the 95% CI for 'Time' are positive, whereas those
of the interaction term are both negative, indicating the statistical significance of
these two variables. Now we add a random slope component.
> lmer2 <- lmer(size~Time*treat + (Time|tree), family="gaussian",
data=.data)
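The comparison of the two models shown below is presumably produced with the
anova method (the call itself is not shown in the original). Note that anova refits the
models by maximum likelihood, which is why the AIC and log-likelihood values differ
from the REML-based values displayed earlier.
> anova(lmer1, lmer2)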
Df AIC BIC logLik Chisq Chi Df Pr(>Chisq)
lmer1 6 142.201 166.074 -65.100
lmer2 8 114.102 145.933 -49.051 32.099 2 1.071e-07 ***
The conclusion is the same as before. The model with a random slope is
significantly better than that with random intercept alone.
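The output below comes from fitting the bacteria data with glmmPQL using a
random intercept for each child. The preparatory and fitting commands are not
shown in this excerpt; they would presumably resemble the following (the object
name is an assumption):
> zap()
> data(bacteria, package="MASS")
> use(bacteria)
> infected <- y == "y"
> pack()
> glmmPQL3 <- glmmPQL(fixed = infected ~ ap + week, random = ~ 1 | ID,
    data=.data, family="binomial")
> summary(glmmPQL3)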
Random effects:
Formula: ~1 | ID
(Intercept) Residual
StdDev: 1.347478 0.7881903
Variance function:
Structure: fixed weights
Formula: ~invwt
Fixed effects: infected ~ ap + week
Value Std.Error DF t-value p-value
(Intercept) 2.0352357 0.3816667 169 5.332495 0.0000
app 1.0082124 0.5326217 48 1.892924 0.0644
week -0.1450321 0.0390851 169 -3.710677 0.0003
Correlation:
(Intr) app
app -0.485
week -0.536 -0.047
The coefficients are fairly different from those obtained using GEE in the previous
chapter; however, the conclusion is the same. Treatment is no better than placebo for
controlling infection, and the probability of infection decreases with time.
Modelling using the lmer function gives similar results.
> lmer1 <- lmer(infected ~ ap + week + (1|ID), family="binomial",
data=.data)
> summary(lmer1)
Generalized linear mixed model fit by the Laplace approximation
Formula: infected ~ ap + week + (1 | ID)
Data: .data
AIC BIC logLik deviance
206.4 220 -99.2 198.4
Random effects:
Groups Name Variance Std.Dev.
ID (Intercept) 1.4012 1.1837
Number of obs: 220, groups: ID, 50
Fixed effects:
Estimate Std. Error z value Pr(>|z|)
(Intercept) 2.09745 0.41334 5.074 3.89e-07 ***
app 1.07571 0.55431 1.941 0.05230 .
week -0.14440 0.04833 -2.988 0.00281 **
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Note that when the outcome variable is dichotomous, the lmer function also
provides z values and p-values. 95% confidence intervals of the coefficients thus
can be computed.
> coefs <- attr(summary(lmer1), "coefs")
> coefs
Estimate Std. Error z value Pr(>|z|)
(Intercept) 2.0974492 0.41333783 5.074419 3.886823e-07
app 1.0757068 0.55431021 1.940622 5.230411e-02
week -0.1444006 0.04833126 -2.987726 2.810615e-03
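For example, approximate 95% confidence intervals, and the corresponding odds
ratios, could be obtained from this matrix as follows (a sketch, not part of the
original output):
> ci <- cbind(lower = coefs[, "Estimate"] - 1.96 * coefs[, "Std. Error"],
    upper = coefs[, "Estimate"] + 1.96 * coefs[, "Std. Error"])
> exp(ci)   # odds ratios with approximate 95% confidence limits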
One advantage of using lmer in hypothesis testing is the availability of the anova
function.
There are 3 levels of treatment. In order to determine whether or not there is any
treatment effect we can try the following model.
> lmer2 <- lmer(infected ~ trt+week + (1|ID), family="binomial",
data=.data)
> summary(lmer2)
The output indicates that there is not enough evidence to show that treatment (with
or without encouragement to comply) has a significant effect on infection.
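To use the anova function mentioned above for an overall (2 degrees of freedom) test
of the three-level treatment variable, one possibility (a sketch; this comparison is not
shown in the original) is to compare 'lmer2' with a model that omits treatment
altogether:
> lmer0 <- lmer(infected ~ week + (1|ID), family="binomial", data=.data)
> anova(lmer0, lmer2)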
In summary, this chapter has shown that generalized linear mixed effects
modelling, in which the outcome variable may be continuous or dichotomous, can
be performed with glmmPQL in the MASS package, and lmer in the lme4
package with similar results. The lmer command has the advantage of being able
to test whether the random effects should be simple random intercepts or random
slopes as well. It is also useful for testing hypotheses about variables with more than
two levels of exposure. However, package lme4, which contains the lmer
function, is still under development. Some changes are expected in future versions.
Exercise
Load the BMD data from Epicalc. It contains data from a clinical trial of three doses
of a drug which is thought to affect bone density levels in post-menopausal women.
Treatment 1 is the lowest dosage of the drug and Treatment 2 is the highest dosage.
Because it is always preferable to have patients on the lowest effective dosage of a
drug, the interest in this trial is focussed on whether Treatment 1 is significantly
different from Treatments 2 and 3.
Bone mineral densities were measured at the start of the trial, and at 12 and 24
months after the trial commenced.
1. Is there a beneficial treatment effect on bone density at the hip, after correcting
for the covariates given?
2. Anecdotal evidence is that a side-effect of the treatment is a gain in weight (or
increase in BMI). Do these data provide evidence for this theory?
Chapter 9: Transition models
This chapter discusses the last type of modelling of longitudinal data: transition
models.
As the name suggests, transition, the change of status from one time point to the
next, is the centre of interest. We have already explored and analysed the
bacteria data set, where the time points did not increase regularly. In this
chapter, let's try to analyse a more complicated data set – Xerop, which is
concerned with the relationship between vitamin A deficiency and respiratory
infection among children.
> library(epicalc)
> zap()
> data(Xerop)
> use(Xerop)
> des()
(subset)
No. of observations = 1200
Variable Class Description
1 id integer
2 respinfect integer
3 age.month integer
4 xerop integer
5 sex factor
6 ht.for.age integer
7 stunted integer
8 time integer
9 baseline.age integer
10 season factor
> summ()
(subset)
No. of observations = 1200
6 ht.for.age 1200 0.91 1 5.85 -23 25
7 stunted 1200 0.12 0 0.33 0 1
8 time 1200 3.42 3 1.76 1 6
9 baseline.age 1200 -4.05 -3 19.63 -32 44
10 season 1200 2.488 2 0.922 1 4
There are 275 children in the data set. Check for missing and duplicate visits.
> T <- table(id, time)
> sum(T == 0)
[1] 452
> sum(T > 1)
[1] 2
There are 452 missed visits, and there are 2 records in which the combination of 'id'
and 'time' is duplicated. We can use the following command to list the ids of the
duplicated records:
> id[which(duplicated(cbind(id, time)))]
[1] 161013 161013
Now we can list the records of the child whose 'id' is equal to 161013. (The output
has been edited to fit on the page).
> .data[id==161013,]
id respinf age.month xerop sex ht.for.age stunted time base.age
496 161013 0 -1 0 0 2 0 1 -1
497 161013 0 2 0 0 3 0 2 -1
498 161013 0 5 0 0 2 0 3 -1
499 161013 0 8 0 0 3 0 4 -1
500 161013 0 11 0 0 2 0 1 11
501 161013 0 14 1 0 1 0 2 11
The duplicated records are rows 500 and 501 where 'time' of 1 and 2 is repeated
from rows 496 and 497. This is likely to be a human error arising during data entry.
Inspection of the 'age.month' variable, which has a constant increment of 3, would
suggest that the 'time' variable for the duplicated records should be changed to 5
and 6, respectively. This subject was vitamin A deficient ('xerop' = 1) at the last
visit and had a lower height for age but did not yet have stunted growth. Their
baseline age was -1 in the first four visits but noted to be 11 in the last two. We now
have the dilemma of either deleting these two records or changing the times to 5
and 6. Let's choose the first option just for illustration purposes.
> data.new <- .data[-c(500,501),]
> use(data.new)
> anyDuplicated(cbind(id, time)) # final check
[1] 0
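The output below appears to be a crude, cross-sectional comparison of current
respiratory infection with current vitamin A deficiency. The command is not shown
in this excerpt; presumably it was Epicalc's cc function:
> cc(respinfect, xerop)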
OR = 1.55
95% CI = 0.58 3.58
Chi-squared = 1.13 , 1 d.f. , P value = 0.288
Fisher's exact test (2-sided) P value = 0.323
We expect that it should take some amount of time before vitamin A deficiency
could have any effect on infection. To create lag variables for respiratory infection
and vitamin A deficiency we use the lagVar command from epicalc.
> respinfect.lag1 <- lagVar(respinfect, id=id, visit=time, lag.unit=1)
> xerop.lag1 <- lagVar(xerop, id=id, visit=time, lag.unit=1)
> pack()
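The next block of output presumably comes from cross-tabulating current infection
against vitamin A deficiency at the preceding visit (the command itself is not shown):
> cc(respinfect, xerop.lag1)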
OR = 3.13
95% CI = 1.19 7.35
Chi-squared = 8.26 , 1 d.f. , P value = 0.004
Fisher's exact test (2-sided) P value = 0.011
The risk of infection is 3 times higher if the child had vitamin A deficiency in the
preceding visit. We should check whether this association is confounded by
preceding infection status using the Mantel-Haenszel method.
> mhor(respinfect, xerop.lag1, respinfect.lag1)
Stratified analysis by respinfect.lag1
OR lower lim. upper lim. P value
respinfect.lag1 0 3.39 1.1886 8.46 0.01145
respinfect.lag1 1 1.65 0.0305 19.45 0.52669
M-H combined 3.03 1.3348 6.89 0.00534
The adjusted odds ratio (3.03) and the crude odds ratio (3.13) are quite close,
indicating minimal confounding by past infection.
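The fragment of logistic regression output below relates current infection to the
lagged exposures. The model is not shown in this excerpt; judging from the rows
displayed and from 'glm2' further down, it would presumably be:
> glm1 <- glm(respinfect ~ xerop.lag1 + respinfect.lag1, family="binomial",
    data=.data)
> logistic.display(glm1)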
respinfect.lag1: 1 vs 0
1.88 (0.92,3.84) 1.82 (0.88,3.75) 0.104 0.124
Log-likelihood = -237.9799
No. of observations = 855
AIC value = 481.9597
This model suggests that during the transition from previous visit to current visit,
susceptibility to respiratory infection is enhanced by vitamin A deficiency and not
by the presence of the preceding infection.
We can add other putative risk factors to see if the odds ratio changes.
> glm2 <- glm(respinfect ~ xerop + xerop.lag1 + respinfect.lag1 +
season + sex + age.month, family="binomial", data=.data)
> logistic.display(glm2)
The variables 'xerop' and 'sex' are not significant and so we run a new model with
them omitted.
> glm3 <- glm(respinfect ~ xerop.lag1 + respinfect.lag1 + season +
age.month, family="binomial", data=.data)
> logistic.display(glm3, decimal=1)
season: ref.=1
< 0.001
2 4.0 (1.7,9.5) 4.6 (1.9,11.2) < 0.001
3 1.9 (0.8,4.3) 1.9 (0.8,4.6) 0.133
4 1.3 (0.5,3.6) 1.3 (0.5,3.7) 0.588
age.month (cont. var) 0.98 (0.97,0.99) 0.97 (0.96,0.99) <0.001 < 0.001
Log-likelihood = -223.9393
No. of observations = 855
AIC value = 461.8786
Both season and age are significant determinants of respiratory infection. With these
variables in the model, the odds ratio for preceding vitamin A deficiency increases
from 3.1 to 4.2
and the odds ratio of preceding infection slightly decreases from 1.8 to 1.6, which is
again not significant.
In summary, transition models are suitable for cohort studies containing a regular
follow-up interval and changing exposure and outcome status. Transition modelling
is statistically relatively simple but needs careful data management and exploration.
Keeping preceding outcome status in the model ensures that the 'carry-over' effects
are adjusted for and the problem of correlation over time is taken care of.
Demonstrating that preceding exposure is associated with current health outcome
provides stronger logic of causation than what is found in usual cross-sectional
data.
Exercise
Analyse the bacteria data set from the MASS package.
Further Reading
Solutions to exercises
Chapter 1
> library(epicalc)
> des(Theoph) # This dataset has a "lazy loading" attribute
> help(Theoph)
The help page describes the variables in this dataset. The data appear to be in long
format, since there are no repeating variables, and there is a "time" variable. To
confirm this, we can determine if the Subject id is duplicated.
> head(Theoph, 15)
Grouped Data: conc ~ Time | Subject
<environment: R_EmptyEnv>
Subject Wt Dose Time conc
1 1 79.6 4.02 0.00 0.74
2 1 79.6 4.02 0.25 2.84
3 1 79.6 4.02 0.57 6.57
4 1 79.6 4.02 1.12 10.50
5 1 79.6 4.02 2.02 9.66
6 1 79.6 4.02 3.82 8.58
7 1 79.6 4.02 5.10 8.36
8 1 79.6 4.02 7.03 7.47
9 1 79.6 4.02 9.05 6.89
10 1 79.6 4.02 12.12 5.94
11 1 79.6 4.02 24.37 3.28
12 2 72.4 4.40 0.00 0.00
13 2 72.4 4.40 0.27 1.72
14 2 72.4 4.40 0.52 7.91
15 2 72.4 4.40 1.00 8.31
> use(Theoph)
> any(duplicated(Subject))
[1] TRUE
The variable "Time" appears to be inconsistent across subjects, as can be seen from:
> tab1(Time)
Reshaping this data to wide form using this "time" variable is possible, but would
result in a useless data set.
> Theoph.wide <- reshape(Theoph, direction="wide", idvar="Subject",
timevar="Time", v.names="conc")
Chapter 2
> library(epicalc)
> use(Theoph)
> des()
> summ()
> table(Subject)
Subject
6 7 8 11 3 2 4 9 12 10 1 5
11 11 11 11 11 11 11 11 11 11 11 11
For a small data set such as this one, we can easily see that there are 12 subjects,
each having 11 records. For large data sets, the following may be better.
> length(unique(Subject))
[1] 12
> tab1(table(Subject))
table(Subject) :
Frequency Percent Cum. percent
11 12 100 100
Total 12 100 100
Assess the timing of drug measurements using the tab1 and summ commands.
> tab1(Time)
Time :
Frequency Percent Cum. percent
0 12 9.1 9.1
0.25 5 3.8 12.9
0.27 3 2.3 15.2
0.3 2 1.5 16.7
0.35 1 0.8 17.4
0.37 1 0.8 18.2
====== remaining lines omitted ======
> table(Time, Subject) # output not shown
> summ(Time)
[Figure: dot plot of the distribution of Time, 0 to 25 hours]
The jittering of the stacks of points indicates that the time of drawing blood was not
perfectly synchronised for all subjects. It appears as if some attempt was made to
draw the blood at specific intervals for each subject, namely at 15 and 30 minutes,
and then at 1, 2, 3.5, 5, 7, 9, 12 and 24 hours after the start of the study, however
this was not achieved exactly.
> followup.plot(id=Subject, time=Time, outcome=conc, xlab="Time (hrs)",
ylab="Concentration (mg/L)", las=1)
> title(main="Pharmacokinetics of theophylline")
[Figure: Pharmacokinetics of theophylline; Concentration (mg/L) against Time (hrs)]
The concentration rises sharply after the first dose, then drops gradually over time.
> coplot(conc~Time|Subject, panel=lines, type="b", data=Theoph)
[Figure: coplot of conc against Time, conditioned on Subject]
Multicoloured lines can be achieved as follows (graph not shown).
> followup.plot(Subject, Time, conc, line.col="multicolour")
For the examination of the subject's weights over time, we can use the aggregate
command. If the standard deviation of each subject’s weight is zero, then that
would indicate stability.
> aggregate(Wt, by=list(Subject=Subject), FUN=sd)
Subject sd.Wt
1 6 0
2 7 0
3 8 0
4 11 0
5 3 0
6 2 0
7 4 0
8 9 0
9 12 0
10 10 0
11 1 0
12 5 0
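The grouping variable 'Wt.gp' used in the following plots is not created in this
excerpt. Presumably the subjects' weights were dichotomised at 70 kg, for example:
> Wt.gp <- cut(Wt, breaks=c(0, 70, Inf), labels=c("<70 kg", "70+ kg"))
> pack()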
[Figure: Pharmacokinetics of theophylline; conc against Time for the <70 kg and 70+ kg groups]
A comparison may perhaps be better visualised by aggregating the concentrations
for each subject within the two weight groups at suitably chosen time points.
> aggregate.plot(conc, by=Time, group=Wt.gp, lwd=2, lty=c(1,3), las=1)
> title(ylab="Concentration (mg/L)", xlab="Time (hour)")
Note that because the time of drawing blood is not exact for each subject, the
aggregate.plot command will group the time points into 4 "bins" by default.
You may like to experiment with the "bin" arguments to see what effect they have
on the graph.
> aggregate.plot(x=conc, by=Time, group=Wt.gp, bin.method="quantile")
> aggregate.plot(x=conc, by=Time, group=Wt.gp, bin.time=11,
bin.method="fixed")
In order to use the scheduled times, which were 15 minutes, 30 minutes, 1 hour and
then 2, 3, 5, 7, 9, 12 and 24 hours after drug administration, a new vector is needed.
> visit <- markVisits(Subject, Time)
> pack()
> aggregate.plot(conc, by=visit, group=Wt.gp)
In this graph, the distances between the time points do not reflect the actual times in
hours. We need to change the visit times.
> scheduled.visit <- c(0,0.25,0.5,1,2,3,5,7,9,12,24)
> recode(visit, 1:11, scheduled.visit)
> aggregate.plot(conc, by=visit, group=Wt.gp, lwd=2, lty=c(1,3), las=1)
[Figure: aggregated concentration against scheduled visit time for the <70 kg and 70+ kg groups]
Those weighing over 70kg start to have lower theophylline concentrations after 2
hours, but the difference is negligible at the end of the study.
Chapter 3
> zap()
> use(Sitka)
> auc.data <- auc(conc=size, time=Time, id=tree)
> treat.data <- reshape(Sitka, direction="wide", idvar="tree",
timevar="Time", v.names="size")[,1:2]
Trees treated with ozone have a lower AUC on average; however, the difference is
not significant.
Chapter 4
> zap()
> use(Xerop)
> des(); summ()
> table(table(id, time))
0 1 2
452 1196 2
The output indicates that there are 452 missing records; however, there are also 2
duplicates. These must be removed before continuing the analysis.
> id.dup <- id[duplicated(cbind(id, time))]
> .data[id %in% id.dup,]
> keepData(subset=!duplicated(cbind(id,time)))
> sortBy(id, time)
> visit <- markVisits(id, time)
> pack()
> table(time, visit)
The newly created 'visit' variable is not consistent with the 'time' variable due to the
missing records.
> zap()
> use(Sitka)
> tmp <- by(.data, INDICES=tree,
FUN=function(x) lm(size ~ Time+I(Time^2), data=x))
> tree.coef <- sapply(tmp, FUN=coef)
> tree.growth <- as.data.frame( t(tree.coef) )
> use(tree.growth)
> des()
> tableStack(vars=2:4, by=treat, decimal=2)
control ozone Test stat. P value
(Intercept) Ranksum test 0.5728
median(IQR) -0.38 (-3.21,0.53) -1.01 (-2.51,0.16)
Chapter 5
> zap()
> use(Xerop)
> des(); summ()
> table(time)
time
1 2 3 4 5 6
230 214 177 183 195 201
> length(unique(id))
[1] 275
The visit times are not evenly distributed. Of the 275 subjects, only 230 came to the
first visit (time=1). Subsequent visits are imbalanced. The two duplicates are now
removed.
> keepData(subset=!duplicated(cbind(id,time)),
select=c(id, baseline.age, sex, respinfect, xerop, stunted, time))
> Xerop.wide <- reshape(.data, idvar="id", v.names=c("respinfect",
"xerop", "stunted"), timevar="time", direction="wide")
> summ(Xerop.wide)
> table(time)
Note that the first 3 variables ('id', 'baseline.age' and 'sex') all have 275 non-missing
values, since we omitted them from the 'v.names' argument of the reshape command.
The others have varying numbers of missing values, and the frequencies should
match those from the last tabulation of the 'time' variable.
> with(Xerop.wide, addmargins(table(respinfect.1, respinfect.2)))
respinfect.2
respinfect.1 0 1 Sum
0 162 8 170
1 21 2 23
Sum 183 10 193
> with(Xerop.wide, mcnemar.test(table(respinfect.1, respinfect.2)))
The change from visit 2 to visit 3 is not significant, as evidenced by the following
commands.
> with(Xerop.wide, addmargins(table(respinfect.2, respinfect.3)))
respinfect.3
respinfect.2 0 1 Sum
0 148 8 156
1 4 1 5
Sum 152 9 161
> with(Xerop.wide, mcnemar.test(table(respinfect.2, respinfect.3)))
Continuing in this fashion, you will discover that from visit 4 to visit 5, respiratory
infection actually increases. Vitamin A deficiency (xerop) does not change
significantly during any of the transitional periods. Stunting only changes
significantly between visits 3 and 4.
Chapter 6
> zap()
> data(Xerop)
> use(Xerop)
> id.dup <- id[duplicated(cbind(id, time))]
> .data[id %in% id.dup,]
> keepData(subset=!duplicated(cbind(id,time)))
> Xerop.all <- addMissingRecords(.data, id, time,
outcome=c("season","respinfect", "xerop","stunted"))
> use(Xerop.all)
> summ()
> tableStack(vars=season, by=present)
0 1 Test stat. P value
season Chisq. (3 df) = 22 < 0.001
1 92 (20.4) 183 (15.3)
2 126 (27.9) 424 (35.4)
3 136 (30.1) 414 (34.6)
4 98 (21.7) 177 (14.8)
Seasons 1 and 4 had significantly lower attendance rates than the other 2 seasons.
> sortBy(id, time)
> present.next <- lagVar(present, id, time, lag = -1)
> pack()
> logistic.display(glm(present.next ~ xerop+respinfect, data=.data,
family=binomial))
Log-likelihood = -407.6334
No. of observations = 997
AIC value = 821.2667
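The second block of output, based on all 1650 records, presumably comes from
modelling the attendance indicator itself on season and baseline age (the command is
not shown in the excerpt):
> logistic.display(glm(present ~ season + baseline.age, data=.data,
    family=binomial))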
Log-likelihood = -952.3859
No. of observations = 1650
AIC value = 1916.7717
Both season and age at baseline are significant predictors of non-attendance. As age
increases, the chance of attending decreases.
Chapter 7
> library(geepack)
> library(epicalc)
> zap()
> data(respiratory)
> use(respiratory)
> des(); summ()
> table(table(id, visit))
The output is interesting. According to the help page, two centers were involved in
this study. Each center assigned a running number for the id, thus explaining the
duplicates. We need to create a new id variable.
> id2 <- paste(center,id,sep="")
> label.var(id2, "patient ID")
> table(id2, visit)
Now there are no duplicates. This dataset is not exactly in wide or long form. The
outcome is respiratory status, which was measured on the 111 patients at baseline
and at 4 subsequent visits. Thus the outcome is saved in two separate variables, and
we need to reshape the dataset so that it is contained in only one variable. First
create a dataset containing just the baseline records.
> Baseline <- .data[visit==1,]
Next, change the values of the 'visit' variable all to 0 and create a new outcome
variable called 'resp'.
> Baseline$visit <- 0
> Baseline$resp <- Baseline$baseline
Finally, do the same to the original dataset and append them together.
> .data$resp <- .data$outcome
> data <- rbind(Baseline, .data)
> use(data)
> sortBy(id2, visit)
> head(.data,30)
There should now be 555 records, with 111 patients having their respiratory status
measured at 5 visits. Modelling can now proceed.
> resp.ex <- geeglm(outcome~treat+center+age+sex, id=id2, data=.data,
family="binomial", corstr = "exchangeable")
> summary(resp.ex)
Only treatment and center are significant. Patients from center "2", as well as those
given the active treatment, were more likely to have a "good" respiratory status.
Note that the coding of the outcome variable is 1 = good, 0 = poor.
Chapter 8
Chapter 9
> library(geepack)
> library(epicalc)
> zap()
> data(bacteria, package="MASS")
> use(bacteria)
> des(); summ()
> sortBy(ID, week)
> visit <- markVisits(ID, week)
> pack()
> table(week, visit)
Note the differences between the two variables, due to the missing records.
> cc(y, ap)
The risk of infection is 2.3 times higher for the placebo group.
> y.lag1 <- lagVar(y, ID, visit)
> pack()
> mhor(y, ap, y.lag1)
The risk of bacterial infection is still 2.3 times higher in the placebo group after
adjusting for previous infection. We conclude that the active treatment is protective
against the bacteria regardless of whether the child was infected in the preceding
visit or not.
> logistic.display(glm(y ~ y.lag1+ap, family=binomial, data=.data))
Note that the crude odds ratio for 'ap' is not the same as above using the cc
command. This is because the records of the first visit do not have preceding
infection status and are therefore included neither in the adjusted analysis nor in the
logistic regression model. Checking for a confounding effect using the glm command
is thus more accurate than comparing the results from the cc and
mhor commands. In this case there is moderate confounding by previous infection
status.
> logistic.display(glm(y ~ y.lag1+trt, family=binomial, data=.data))
There are three treatment groups now. One group was given no active treatment at
all (placebo). The second and third groups were given the active treatment
and further randomised to receive active encouragement to comply with treatment
(drug+) or not (drug).
The effect (OR) of 'y.lag1' is similar to the model which included 'ap'. However,
while the likelihood ratio test (LR-test) for 'ap' is significant (p=0.038), it is not
for 'trt' (p=0.096). This is because the risks for the "drug" and "drug+" groups
are rather close. The LR-test reports the effect of 'trt' as a whole. It tells us that there
is not sufficient evidence that the three treatment groups have a different risk of
infection after adjusting for preceding infection. The Wald tests of this set of
variables tell a slightly different story. The group receiving treatment with low
compliance (drug) has a significantly lower risk of infection compared to the placebo
group, but treatment with high compliance is not better than placebo (one may
wonder if the data contain wrong coding). However, the odds ratio is still fairly
strong in the protective direction (0.5), with quite a wide 95% CI. We conclude that
the sample size (170 valid subjects) may not be large enough.