Longitudinal Data Analysis
R and Epicalc
Epidemiology Unit
Prince of Songkla University, THAILAND
Analysis of Longitudinal Data using
R and Epicalc
Cover image :
ISBN :
Printed in Thailand
Preface
From the theoretical side, this book was written to help new researchers in the health sciences understand the nature of longitudinal studies, the structure of their data and the approach to analysis. Equipped with R, an open-source software environment consisting of a suite of both standard and user-contributed packages, readers are
encouraged to follow the examples on data manipulation, graphing, exploration and
modelling of longitudinal data.
Readers of this book should have some basic epidemiological knowledge, such as
measures in epidemiology (incidence, prevalence etc), types of study designs, bias,
confounding and interaction. Those who feel that they have an inadequate
background should read fundamental text books of epidemiology listed at the end of
this book. Basic data management concepts are also required as the first few
chapters deal with data structure and data manipulation.
Experience in using data entry software, such as EpiInfo and EpiData, may also be
beneficial. Although they can be used for entering longitudinal data, their facilities for managing data of a longitudinal nature are limited. So-called relational
database software, such as Microsoft Access and others, are designed for data
management and manipulation, but cannot do complicated statistical analyses. Data
manipulation is more efficient if the same software is used for both. Documentation
is also simplified since it can be integrated within the command file. R can be used
exclusively for both purposes.
To get the full benefit from this book, readers should be acquainted with R
software. Readers are recommended to work through each section, typing the commands as they go, since the theoretical parts are often followed by an example. The output on the computer screen can be observed and compared with
that in the book and the explanations can then be read in order to integrate the
learning of concepts and practice of data analysis simultaneously.
There are several text books and tutorials written for R available from the Internet.
"Analysis of Epidemiological Data Using R and Epicalc", which can be freely
downloaded from the CRAN website, http://cran.r-project.org/ and also from the
WHO web site, is strongly recommended as preliminary reading:
http://apps.who.int/tdr/svc/publications/training-guideline-publications/analysis-epidemiological-data/
That book not only explains how the functions in R and Epicalc are used but it also
provides the concept of variable management in R, especially how to avoid the
confusion between a free vector and a variable of the same name inside a data
frame. Epicalc also enables the data frame as well as the variables to be labelled, which subsequently leads to more understandable output in tables and graphs. The current
version of Epicalc has been developed to respond to the needs of longitudinal data
analysis in addition to the existing ordinary data exploration features.
Similar to the previous book, Epicalc functions are typed in Italics in contrast to
functions from other R packages, which appear in normal font type. A function will
be briefly explained when it is first used so that readers who have never used Epicalc before can catch up quickly, but this will probably not be enough to substitute for learning from the preceding book.
However, in addition to just following up and waiting for the failure event to occur,
follow-up records allow analysis of transitions. The state of an outcome can be
more than the classical dichotomy (diseased vs. non-diseased). It can be different
states of the disease or health. A transition is the change of state from one point of time to the next. In biology, measurements are mostly taken as continuous data.
Transition in this case may mean the difference in outcome measure between two
adjacent time points.
When the follow-up time is short and the number of variables is small, the data for
a longitudinal study could easily be stored in the so-called 'wide' format. In wide
form, a person appears in only one record, with measurements of the same sets of
variables, usually measured at different times, stored in separate columns. The wide
form has serious limitations when the number of repetitions of the visits is large and
each visit has a relatively large number of measurements. It is also inefficient when
certain persons have only a few visits while others have a large number of visits. In
this situation it is more efficient to store the data in the so-called 'long' format. In
long form, a person can appear in more than one record corresponding to each of
their visits. Measurements for each variable are stored in a single column, with an
additional column denoting time included to distinguish separate visits. When the
data are stored in long form, the number of visits does not need to be the same for
every person.
Finally, as an individual changes his/her exposure and outcome over time, instead
of looking at the status at each time point, one can consider his/her transition from
one point of time to the next and the relationship between the transition of the
outcome and the transition of exposure. While the outcome statuses of an individual
over time are correlated and thus the relationship needs adjustment for this
correlation, the transition from one time point to the next is usually not correlated,
and analysis of the transition (transition modelling) is therefore simpler than the
above two approaches. For a continuous outcome variable, the magnitude of change
is modelled against the exposure variable in the concurrent transition or preceding
state. When the outcome is a dichotomous variable, the transition probability of interest is not confined to failure; transitions can be multi-directional. Modelling of transition probabilities is called transition
modelling or Markov modelling. Markov models predict the probability of the
current outcome from the preceding status. This can also be called auto-regressive modelling, as the outcome is regressed on its own previous value.
Table of Contents
Chapter 1: Data formats _____________________________________________ 8
Chapter 2: Exploration and graphical display ___________________________ 14
Chapter 3: Area under the curve (AUC) ________________________________ 26
Chapter 4: Individual growth rate ____________________________________ 35
Chapter 5: Within subject, across time comparison _______________________ 46
Chapter 6: Analysis of missing records ________________________________ 55
Chapter 7: Modelling longitudinal data ________________________________ 65
Chapter 8: Mixed models ___________________________________________ 75
Chapter 9: Transition models ________________________________________ 84
Solutions to exercises ______________________________________________ 91
Chapter 1: Data formats
Data entry software can issue a warning to the user if a certain value is markedly
different from a preceding value. These consistency checks are difficult to
implement when the data are entered in long form because values of the same
variable, but from different times, are not entered in the same record.
Normalization of data
Some variables, especially baseline characteristics of the subjects, are usually fixed,
for example, date of birth, sex and place of birth. These data should be entered only once. In a database, such a set of data is stored as a table. The baseline table,
which has one record per subject, can be linked with the follow-up data set through
an identification (ID) field. This field is also called a key field. This ID field in the
baseline table must be unique. In other words, there must not be any duplication in
ID in the baseline table. In the follow-up table, ID certainly can be duplicated but
the combination of ID and time of follow-up must be unique since each follow up
of a subject should be recorded only once. A good database design would require
such a unique ID in the baseline table and unique ID+time (a compound key) in the
follow-up table. The database design must also ensure that all the IDs in the follow-up table are present in the baseline table.
The baseline table is sometimes considered as the mother table and correspondingly
the follow-up table is considered as the child table. A follow-up record without
corresponding ID in the baseline table is called an 'orphan' record. It indicates poor
quality control in the data entry system. In order to ensure such integrity (absence
of orphan records), relational database software, such as Microsoft Access, is required. EpiData can also be used to serve this purpose. Such data integrity can be
ensured if and only if the records in the follow-up table can be entered only through
an existing baseline record.
Hierarchical data
Data in which the relationship between the baseline data and the follow-up data has
a hierarchy is called hierarchical data. It is also known as multi-level data because
each follow-up record is considered as level 1, whereas each subject is considered
as level 2. The former is nested within the latter; such data are therefore also called nested data. Apart from longitudinal studies, hierarchical data can also be nested based on social or spatial relationships. For example, subjects can be nested within families and
families can be nested within villages, and so forth.
When upper level variables affect the outcome at the individual level, the variables
are sometimes called contextual determinants. For example, the nutritional status of
a child is not only influenced by his/her immune status but also by the child rearing
behaviours of the family and the hygiene conditions (waste disposal, water supply)
of the community.
Several software packages can be used to analyze hierarchical data. Analysis of this
multi-level data is well covered by a number of packages in R and will be discussed
in subsequent sections.
Examples of longitudinal data in long form
The datasets package in R contains a large number of data sets from longitudinal
studies. All of these are in the long format. The list can be viewed by typing:
> data(package="datasets")
Among these is ChickWeight, a data frame containing 578 rows and 4 columns from an
experiment on the effect of diet on early growth of chicks.
> class(ChickWeight)
[1] "nfnGroupedData" "nfGroupedData" "groupedData" "data.frame"
In addition to being a data frame, the object is also a special kind of data frame
which was modified from an ordinary data frame in order to make it suitable for
analysis using functions from the nlme package. To view the first 6 records you can
type:
> head(ChickWeight)
Grouped Data: weight ~ Time | Chick
weight Time Chick Diet
1 42 0 1 1
2 51 2 1 1
3 59 4 1 1
4 64 6 1 1
5 76 8 1 1
6 93 10 1 1
The dataset actually contains a formula, which models the chick’s weight using
each chick's age (in days).
To view the variable names, classes and descriptions you can type:
> library(epicalc)
> des(ChickWeight)
No. of observations = 578
Variable Class Description
1 weight numeric
2 Time numeric
3 Chick ordered
4 Diet factor
There are 4 variables, of which the third variable, 'Chick', is the identification
variable.
> use(ChickWeight)
> Chick[1:30]
[1] 1 1 1 1 1 1 1 1 1 1 1 1 2 2 2 2 2 2 2 2 2 2 2 2 3 3 3 3 3 3
50 Levels: 18 < 16 < 15 < 13 < 9 < 20 < 10 < 8 < 17 < 19 < 4 < 6 < 11 <
3 < 1 < 12 < 2 < 5 < 14 < 7 < 24 < 30 < 22 < 23 < 27 < 28 < 26 < 25 <
... < 48
The display looks fine but the ordering for the levels is unusual. Now let's do a
cross-tabulation with the 'Time' variable.
> table(Chick, Time)
Time
Chick 0 2 4 6 8 10 12 14 16 18 20 21
18 1 1 0 0 0 0 0 0 0 0 0 0
16 1 1 1 1 1 1 1 0 0 0 0 0
15 1 1 1 1 1 1 1 1 0 0 0 0
13 1 1 1 1 1 1 1 1 1 1 1 1
9 1 1 1 1 1 1 1 1 1 1 1 1
========= remaining lines omitted =========
The sequence of 'Chick' in the rows is not in numeric order. This is because it is an
ordered "factor" class object. It was classed this way in order to fit in with the
structure of "groupedData" required by the nlme package. We will learn about how
to convert this class of object into a normal integer in the next chapter. The other
point to note is that Chick "18", which appears in the first row, has only 2 visits at
times 0 and 2. The data is unbalanced. This is a common finding in longitudinal
data. It will be discussed in detail in subsequent chapters.
In the above example there is only one variable with repeated measures. In reality, a
data set can contain many sets of repeated measurements. As a simple illustration,
study the following commands carefully.
> exposure1 <- c(1:9,NA)
> exposure2 <- 11:20
> exposure3 <- 21:30
> outcome1 <- 101:110
> outcome2 <- 111:120
> outcome3 <- 121:130
> data.wide <- data.frame(ID=letters[1:10], exposure1, exposure2,
exposure3, outcome1, outcome2, outcome3)
> data.wide
ID exposure1 exposure2 exposure3 outcome1 outcome2 outcome3
1 a 1 11 21 101 111 121
2 b 2 12 22 102 112 122
3 c 3 13 23 103 113 123
4 d 4 14 24 104 114 124
5 e 5 15 25 105 115 125
6 f 6 16 26 106 116 126
7 g 7 17 27 107 117 127
8 h 8 18 28 108 118 128
9 i 9 19 29 109 119 129
10 j NA 20 30 110 120 130
Note the missing value for the 'exposure1' variable in the last row (ID = "j"). Now
let's reshape this data frame to long format.
> data1.long <- reshape(data.wide, idvar="ID", varying=list(2:4, 5:7),
v.names=c("exposure", "outcome"), direction="long")
> data1.long
ID time exposure outcome
a.1 a 1 1 101
b.1 b 1 2 102
c.1 c 1 3 103
d.1 d 1 4 104
e.1 e 1 5 105
f.1 f 1 6 106
g.1 g 1 7 107
h.1 h 1 8 108
i.1 i 1 9 109
j.1 j 1 NA 110
a.2 a 2 11 111
b.2 b 2 12 112
c.2 c 2 13 113
d.2 d 2 14 114
e.2 e 2 15 115
f.2 f 2 16 116
g.2 g 2 17 117
h.2 h 2 18 118
i.2 i 2 19 119
j.2 j 2 20 120
========= remaining lines omitted =========
Note that the new 'time' variable in the long format takes the values 1, 2 and 3, corresponding to the suffixes of the 'exposure' and 'outcome' variables in the wide format. Also the new
'exposure' variable in the long format corresponds to the 2nd to 4th variables in the
wide format and the 'outcome' variable in the long format corresponds to the 5th to
7th variables in the wide format. These sets of variables must be matched correctly
in the 'varying' argument. Note also that the value of 'exposure1' for ID "j" is
missing. Now suppose that the exposure and outcome variables are adjacent to each
other in the wide data frame.
> data2.wide <- data.frame(ID=letters[1:10], exposure1, outcome1,
exposure2, outcome2, exposure3, outcome3)
> data2.wide
ID exposure1 outcome1 exposure2 outcome2 exposure3 outcome3
1 a 1 101 11 111 21 121
2 b 2 102 12 112 22 122
3 c 3 103 13 113 23 123
4 d 4 104 14 114 24 124
5 e 5 105 15 115 25 125
6 f 6 106 16 116 26 126
7 g 7 107 17 117 27 127
8 h 8 108 18 118 28 128
9 i 9 109 19 119 29 129
10 j NA 110 20 120 30 130
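The corresponding reshape call for this layout is not printed in the text; a sketch consistent with the description that follows would be:
> data2.long <- reshape(data2.wide, idvar="ID", varying=list(c(2,4,6), c(3,5,7)),
     v.names=c("exposure", "outcome"), direction="long")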
The positions of the variables in the 'varying' list need to be changed accordingly to
match the order of the 'v.names' argument. Note that the row names of the resulting
data frame are formed from the combination of the 'ID' and 'time' variables. They
are therefore unique.
Exercise
In what format (wide or long) is the data frame Theoph provided by R?
Reshape it to the other format. Explain how the variable 'Subject' is arranged in the
new format.
Chapter 2: Exploration and graphical
display
In this chapter, we will go into more details of data frames that have class
"groupedData".
> library(epicalc)
> zap()
> use(Indometh)
> des()
No. of observations = 66
Variable Class Description
1 Subject ordered
2 time numeric
3 conc numeric
There are 66 records and 3 variables, the first of which has class "ordered", with the
other 2 being "numeric". There are no descriptive labels attached to the variables.
> summ()
No. of observations = 66
The minimum value of 'Subject' is 1 and the maximum is 6 but that does not
necessarily mean that there are 6 subjects in total. We have to check with
tabulation.
> tab1(Subject)
Subject :
Frequency Percent Cum. percent
1 11 16.7 16.7
4 11 16.7 33.3
2 11 16.7 50.0
5 11 16.7 66.7
6 11 16.7 83.3
3 11 16.7 100.0
Total 66 100.0 100.0
The table above indicates that there are 6 subjects, each contributing 11 records.
The order of 'Subject' in this table is not sorted from lowest to highest because the
variable is an ordered factor and the levels have been preset to this order.
To make sure that time of measurement of the drug concentration is systematic for
all subjects a cross-tabulation can be carried out.
> table(time, Subject)
Subject
time 1 4 2 5 6 3
0.25 1 1 1 1 1 1
0.5 1 1 1 1 1 1
0.75 1 1 1 1 1 1
1 1 1 1 1 1 1
1.25 1 1 1 1 1 1
2 1 1 1 1 1 1
3 1 1 1 1 1 1
4 1 1 1 1 1 1
5 1 1 1 1 1 1
6 1 1 1 1 1 1
8 1 1 1 1 1 1
Time increases by 0.25 unit until 1.25. Then it increases by a step of 1 until time 8,
except that time 7 is missing. This kind of tabulation is a good routine practice in
longitudinal data exploration. All cells in the table are filled with 1s indicating the
uniqueness for the combination of 'Subject' and 'time'. Note that there is no time 0.
When the data set is small, eyeball scanning on such tabulation may be adequate.
When the data set is large, it may be better to check whether there are any missing
follow up times or duplicated records by typing the following command.
> table(table(time, Subject))
1
66
This command tabulates all values of the above table and finds that there are 66
cells all having a common value of 1.
Duplication of records having the same person with the same time point could be
checked by typing:
> any( table(time, Subject) > 1 )
[1] FALSE
Graphing longitudinal data
There are two main methods for graphing the relationship between concentration
and time for each subject. The first method is to employ trellis plots, which give
one small plotting frame for each subject. The second method is to employ the
epicalc package, which has graphical functions that show all the data in the one
plotting frame.
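The command that produced the first conditional plot described below is not reproduced in the text; it was presumably simply:
> coplot(conc ~ time | Subject, data=Indometh)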
The upper part of the graph indicates the order of the Subject corresponding to the
panels in the plot. The plot is read from left-to-right and bottom-to-top, so the
bottom left panel corresponds to Subject 1 and top right panel corresponds to
Subject 3. To visualize the position of Subject in each panel we can replace the
open circles with the actual Subject number as follows:
> coplot(conc ~ time|Subject, data=Indometh,pch=as.character(Subject))
The coplot function is designed to show the relationship between two variables
conditional on the value of a third variable, in this case subject. Instead of using
coplot, we can use the xyplot function from the lattice package.
> library(lattice)
> xyplot(conc ~ time | Subject, data=Indometh)
> xyplot(conc ~ time | Subject, type="b", data=Indometh)
The last command plots connecting lines in each frame instead of just open circles.
Since the class of the data frame is "groupedData", we can also call the nlme
library, which has a default plot method for this class of data frame.
> library(nlme)
> plot(Indometh)
The result is similar to the xyplot command; however the labels for the X and Y
axes, which are stored as attributes in the data frame, are used here.
> attributes(Indometh)
In order to add the "groupedData" class to an ordinary data frame, we must employ
the groupedData function from the nlme package.
Let's create Sitka.gp from the Sitka data, which comes from the MASS
package.
> library(nlme)
> data(Sitka, package="MASS")
> Sitka.gp <- groupedData(size ~ Time|tree, data=Sitka,
labels=list(x="Time (Days since 1 Jan 1988)", y="Log(Height x
diameter ^2)"))
> plot(Sitka.gp)
In order to show all the subjects in the same plotting frame, let's return to the
Indometh data set. We first think that if there is only one subject, this value could
be simply plotted as a line graph against time.
> plot(conc~time, subset=Subject==1, type="l", data=Indometh)
The above command produces a plot of the pharmacokinetic curve for the first
subject. We can further proceed with the second and third subjects.
> lines(conc~time, subset=Subject==2, type="l", data=Indometh)
> lines(conc~time, subset=Subject==3, type="l", data=Indometh)
Each of the above lines commands adds one line to the existing graph for the
second and the third subjects, respectively. One can repeat the same process until
all six subjects have had their curves displayed.
The problem encountered so far is that the maximum value of the Y axis defined by
the first subject is too low for subsequent subjects. To prevent this, the initial plot
command should include a 'ylim' argument so that subsequent curves with higher
concentrations can still be accommodated. The remaining lines commands can then
follow as above.
Obviously, if there are too many subjects, the command would be too tedious to
run. It may be better to exploit a for loop.
> plot(conc~time, subset=Subject==1, ylim=c(0, max(conc, na.rm=TRUE)),
xlab="", ylab="", type="l", data=Indometh)
> for(i in 2:6) lines(conc~time, subset=Subject==i, col=i,
data=Indometh)
That completes the majority of the requirements for the graph. Readers can further
proceed with putting axis labels, a legend, title, etc.
The above process could be carried out with two epicalc commands.
> use(Indometh)
> followup.plot(id=Subject, time=time, outcome=conc,
main="Pharmacokinetics of Indomethicin")
The resultant graph is more or less the same as the previous commands using the
for loop construct. Note that the colours are automatically chosen based on the
Subject number.
Plots of aggregated values
The examples in the help page of the followup.plot function explore the
Sitka data set and give some ideas for the colour of the lines indicating the
treatment group.
> data(Sitka, package="MASS")
> use(Sitka)
> followup.plot(id=tree, time=Time, outcome=size, by=treat,
main="Growth Curves for Sitka Spruce Trees in 1988")
The control group, represented by solid black lines, tends to have larger trees than
the ones grown in the ozone-enriched chambers. This can be more clearly seen with
the following command.
> aggregate.plot(x=size, by=Time, group=treat)
It is clear that the mean tree size of the ozone group was somewhat smaller at the
start and distinctively smaller at the end of the follow-up period. If the argument
'return.output' is set to TRUE then the numerical results are shown as well.
> aggregate.plot(x=size, by=Time, group=treat, return=TRUE)
grouping time mean.size lower95ci upper95ci
1 control 152 4.166000 3.837764 4.494236
2 ozone 152 4.059630 3.898858 4.220401
3 control 174 4.629600 4.333247 4.925953
4 ozone 174 4.467037 4.310078 4.623996
5 control 201 5.037200 4.760472 5.313928
6 ozone 201 4.849074 4.693651 5.004498
7 control 227 5.438400 5.161541 5.715259
8 ozone 227 5.180926 5.014109 5.347743
9 control 258 5.654400 5.372244 5.936556
10 ozone 258 5.313148 5.145238 5.481058
Dichotomous longitudinal outcome variable
All the above examples have outcome variables on a continuous scale. Let's explore
a data set which has a dichotomous outcome variable.
> data(bacteria, package="MASS")
> use(bacteria)
> des()
No. of observations = 220
Variable Class Description
1 y factor
2 ap factor
3 hilo factor
4 week integer
5 ID factor
6 trt factor
The data set comes from a study testing the presence of the bacteria H. influenzae in
children with otitis media in the Northern Territory of Australia. The outcome is in
the variable 'y' and the follow-up period is represented by the 'week' variable. Some
follow-up times are missing, as shown by the following command.
> table(table(week, ID))
0 1
30 220
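The plot discussed next is not reproduced here; given the later command with 'ap', it was presumably produced with something along these lines, using 'trt' as the grouping variable:
> aggregate.plot(x=y, by=week, group=trt)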
The 95% confidence intervals of the prevalences overlap due to the relatively small
sample sizes in the three treatment groups. We can also use 'ap' (active vs placebo)
as the group variable instead of 'trt'.
> aggregate.plot(x=y, by=week, group=ap)
The prevalence of bacteria in the active treatment group declined steadily until
week 6 when the difference is the highest. The two groups tend to have a closer
prevalence again after 11 weeks.
Exercise
• Explore the Theoph data set again from the datasets package.
• How many subjects are there? How many times does each subject appear?
• Were the subjects assessed on drug levels in exactly the same pattern of
time?
• Plot the concentration of this drug of each individual over time using
followup.plot and coplot.
• Note that from followup.plot, the colours of the lines are all the
same. Why?
• How could we change the color? How can it be more colourful?
• Was the weight of each subject stable?
• Divide the subjects into 2 groups, one below 70kg and one greater than or
equal to 70 kg. Create a variable called 'Wtgr' based on this weight
division and use the followup.plot command to draw a graph similar
to the one on page 20.
• Is there a tendency that the heavier group has a higher level of drug
concentration over time?
Chapter 3: Area under the curve (AUC)
The area under the plasma (serum, or blood) concentration versus time curve
(AUC) has a number of important uses in toxicology, biopharmaceutics and
pharmacokinetics. In pharmacokinetics, drug AUC values can be used to determine
other pharmacokinetic parameters, such as clearance or bioavailability.
The Theoph data set has measurements starting at time zero, which makes it well suited to computing the area under the time-concentration curve (AUC).
> library(epicalc)
> class(Theoph)
[1] "nfnGroupedData" "nfGroupedData" "groupedData" "data.frame"
The data frame has the same class as those from previous chapters.
> use(Theoph)
> class(.data) # same as above
The area under the curve is computed by summing the trapezoids formed by each pair of adjacent time points. From elementary geometry, the area of one such trapezoid is equal to the average height at these two points multiplied by the width of the base. The records of the first subject in the Theoph data frame are shown below.
> .data[Subject==1, c(1,4,5)]
Subject Time conc
1 1 0.00 0.74
2 1 0.25 2.84
3 1 0.57 6.57
4 1 1.12 10.50
5 1 2.02 9.66
6 1 3.82 8.58
7 1 5.10 8.36
8 1 7.03 7.47
9 1 9.05 6.89
10 1 12.12 5.94
11 1 24.37 3.28
For subject 1, the area under the curve of the first trapezoid is (0.25 – 0.00) × (2.84
+ 0.74)/2 = 0.4475 unit. This is then added to all the trapezoids belonging to the
same subject. The second one is (0.57 – 0.25) × (6.57 + 2.84)/2 = 1.5056 and so on.
The final summation result is 148.92.
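For illustration (this command is not part of the original text), the same trapezoidal sum for subject 1 can be reproduced with base R:
> with(Theoph[Theoph$Subject == 1, ],
       sum(diff(Time) * (head(conc, -1) + tail(conc, -1)) / 2))
[1] 148.923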
In epicalc, this computation can be done for each subject using the following
command:
> auc.data <- auc(conc=conc, time=Time, id=Subject)
> auc.data
Subject auc
1 1 148.92305
2 2 91.52680
3 3 99.28650
4 4 106.79630
5 5 121.29440
6 6 73.77555
7 7 90.75340
8 8 88.55995
9 9 86.32615
10 10 138.36810
11 11 80.09360
12 12 119.97750
The function auc is based on the above principle of summation of trapezoids.
We can also compute the AUC one subject at a time, omitting the 'id' argument, since the default value of 'id' is NULL.
> auc(conc=conc[Subject==1], time=Time[Subject==1])
[1] 148.9230
> auc(conc=conc[Subject==2], time=Time[Subject==2])
[1] 91.5268
> auc(conc=conc[Subject==3], time=Time[Subject==3])
[1] 99.2865
The above three lines just confirm the same results as the preceding command.
Since the 'Subject' variable is not ordered numerically, let's create an integer
variable, say 'subject' (small s), that has the same value as 'Subject' (capital S). The
command is:
> subject <- as.integer(Subject)[order(Subject)]
In order to create an integer vector from 'Subject', which is an ordered factor, the values are coerced to integer first and then sorted in ascending order.
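An alternative, not used in the text but equivalent here, is to obtain the subject number directly from the factor labels:
> subject <- as.integer(as.character(Subject))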
> pack()
> class(.data)
[1] "data.frame"
The epicalc command pack has changed the class of .data to a simple data
frame. The original Theoph data frame remains intact. It can be used for more
complicated analyses later.
Now run the auc command once more.
> auc.data <- auc(conc=conc, time=Time, id=subject)
For those who are serious about pharmacokinetic studies, it is advisable to install and load the PK package, which offers more detail and a greater variety of AUC estimates than epicalc. The auc function in epicalc is simply based on the trapezoid summation described above. The auc function from the PK package also gives this value. In addition, it provides a better estimate of the area under the time-concentration curve based on estimating the unobserved concentrations within each time interval. The right tail of the curve can also be extrapolated to infinity to better reflect the total amount of drug that the subject was exposed to. However, the auc function in epicalc allows the user to compute the AUC separately for each subject, making it more convenient for data management.
The auc.data data frame contains AUC for each subject. It will be merged with
other data frames for further analysis.
Medically speaking, the AUC would reflect the speed of redistribution (into
different compartments of the body), destruction and excretion of the individual
subject on the drug. Thus the subject who destroyed/excreted fastest was the 6th
subject and the slowest was the first one. From now on we will try to find the
relationship between AUC and individual characteristics of the subject.
One can check the variability of the values of various variables across subjects
using the following command.
> aggregate(x=.data[,2:5], by=list(subject=subject), FUN="sd")
subject Wt Dose Time conc
1 1 0 0 7.273320 3.034533
2 2 0 0 7.269680 3.027389
3 3 0 0 7.234519 2.684222
4 4 0 0 7.329930 2.921623
5 5 0 0 7.278871 3.537344
6 6 0 0 7.151650 2.180382
7 7 0 0 7.250075 2.485130
8 8 0 0 7.233919 2.455296
9 9 0 0 7.244127 2.716175
10 10 0 0 7.108937 3.050539
11 11 0 0 7.223542 2.552839
12 12 0 0 7.235402 3.499246
The first argument of the function is the data frame to aggregate. Unlike the
aggregate.numeric function, which can apply several statistical summaries
on a single variable after splitting the data into subsets, the aggregate function
can apply only one summary statistic to multiple variables in the data frame.
One may try changing the "by" argument to "list(subject = Subject)" to see the
ordering of the subjects (results omitted). Here, using 'subject' instead of 'Subject'
allows the data to be displayed in ascending subject ID order.
The variables 'Wt' and 'Dose' have zero sd. This means that there is no variation within subjects: since the values of 'Wt' and 'Dose' of the same person do not change at all, their standard deviations are all zero. The standard deviations of 'Time' are all relatively similar, indicating that the times of drawing blood for drug assay were probably set to be synchronized for all subjects. However, they are not exactly the same; the synchronization process was not perfect.
Let's check the variation graphically.
> summ(Time)
obs. mean median s.d. min. max.
132 5.895 3.53 6.93 0 24.65
The first 11 points are in the same vertical line, that is, at time zero. Later on, the timing of blood drawing was not so synchronised. Variation in the time of drawing blood causes the stacks of points to jitter.
Now modify the above aggregate command as follows to obtain the mean
weight and dose for each subject.
> WtDose.data <- aggregate(.data[,2:3], by=list(subject=subject),
FUN="mean")
> WtDose.data
subject Wt Dose
1 1 79.6 4.02
2 2 72.4 4.40
3 3 70.5 4.53
4 4 72.7 4.40
5 5 54.6 5.86
6 6 80.0 4.00
7 7 64.6 4.95
8 8 70.5 4.53
9 9 86.4 3.10
10 10 58.2 5.50
11 11 65.0 4.92
12 12 60.5 5.30
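The command that merges the AUC values with these subject-level means is not shown in the text; judging from the commands that follow, it was presumably something like (the object name theoph1 is hypothetical):
> theoph1 <- merge(auc.data, WtDose.data)
> use(theoph1)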
By default, only records sharing the common field in both data frames are returned.
In this case all of the records are returned and no subject was omitted.
> summ(Wt)
obs. mean median s.d. min. max.
12 69.583 70.5 9.5 54.6 86.4
> summ(Dose)
obs. mean median s.d. min. max.
12 4.626 4.53 0.75 3.1 5.86
> summ(auc)
obs. mean median s.d. min. max.
12 103.807 95.407 23.65 73.776 148.923
All the variables are quite uniformly distributed. Let's create a two-way scatter plot.
> plot(Wt, auc, type="n", xlab="Wt (kg) ", ylab="Area under time-
concentration curve (hour-mg/L)")
> text(Wt, auc, labels=subject)
[Figure: scatter plot of AUC (Area under time-concentration curve, hour-mg/L) against Wt (kg), with points labelled by subject number]
There is a slight negative correlation between AUC and Wt. Heavier persons tended to destroy/excrete the drug faster than lighter ones, causing the drug to have a smaller AUC. One exception is the first subject, who has the highest AUC (remember, 148 units!) and yet was among the heaviest. This person would need special investigation for this outlying behaviour (e.g. perhaps due to disease or genetic make-up).
Now we plot AUC against dose.
> plot(Dose, auc, type="n", xlab="Dose (mg/kg) ", ylab="Area under
time-concentration curve (hour-mg/L)")
> text(Dose, auc, labels=subject)
[Figure: scatter plot of AUC (Area under time-concentration curve, hour-mg/L) against Dose (mg/kg), with points labelled by subject number]
Heavy subjects were more likely to be given lower doses. There are no outliers here since both variables were controlled by the protocol. Such a high correlation indicates a potential confounding situation. We should clarify whether dose or weight had a stronger effect on AUC.
[Figure: scatter plot of Wt (kg) against Dose (mg/kg)]
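The regression whose output fragment appears below is not reproduced in full; from the later commands it was presumably fitted with something like:
> regress.display(lm(auc ~ Dose + Wt), crude=TRUE)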
No. of observations = 12
Since the P values from all the t-tests and F-tests in the crude and adjusted analyses are above 0.05, the conclusion is that neither dose nor weight significantly determines the AUC.
Remember that we have an outlier. Let's exclude it and repeat the analysis.
> data.not1 <- .data[-1,]
> regress.display(lm(auc ~ Dose + Wt, data=data.not1), crude=TRUE)
Linear regression predicting auc
crude coeff.(95%CI) adj. coeff.(95%CI) P(t-test) P(F-test)
Dose 18.02 (3.71,32.32) -23.27 (-142.12,95.59) 0.664 0.023
No. of observations = 11
This model suggests that after excluding the outlier, in the crude analysis, dose had
a significant positive effect on AUC whereas weight had a significant negative
effect. This is judged by their 95% confidence intervals not including zero. In the
adjusted analysis, where both independent variables are included in the model, the
t-test on both variables suggests that they are not significant factors. However, dose,
and not weight, is significant by analysis of variance (F-test). We conclude that
dose (in mg/kg) is more important than patient weight in prolonging high levels of
oral theophylline.
In summary, this chapter gives you an example of computing the area under the
curve (AUC) as the outcome variable. This approach removes the need to do
sophisticated statistical modelling.
Summary
In more advanced approaches, the level of drug in the body at any given time can be modelled as a function of several underlying parameters. The nlme function in the
package of the same name was created to address this problem. Use of that package
is beyond the scope of this book.
Exercise
Read in the Sitka data set from the MASS package. Try to find out whether
ozone exposure reduced the area under the time-size curve. For simplicity, keep
only records without any missing values.
Chapter 4: Individual growth rate
In the preceding chapter, we calculated the area under the time-concentration curve.
This method of analysis is justified in pharmacokinetics as it reflects the ability of a person to destroy and/or excrete the drug. In this chapter we will analyse the Sitka
data set again, comparing the tree growth rates in each treatment group.
> library(epicalc)
> zap()
> data(Sitka, package="MASS")
> use(Sitka)
> des()
No. of observations = 395
Variable Class Description
1 size numeric
2 Time numeric
3 tree integer
4 treat factor
Note that the 'tree' variable is equivalent to 'Subject' in the Indometh and Theoph data sets. Here, its class is "integer". Note that the data frame is not in
"groupedData" format.
> summ()
No. of observations = 395
Variable Class Description
1 size numeric
2 Time numeric
3 tree integer
4 treat factor
Let's check whether there is any duplication of measurement on the same tree at the
same time.
> table(Time, tree)
The output (omitted) is not too extensive for this data set and shows that there are 5
different values for 'Time' and 79 different trees. All cells have counts of 1
indicating no duplication. For larger data sets, the following command may be
better.
> table(table(Time, tree))
1
395
Since each cell in the previous table contains only the number one, the mean for
that cell would be the size of the tree at that point of time.
> tapply(size, list(time=Time, tree=tree), mean)
The orientation of the table is the same. Again, the output is rather large and is
omitted here. Each column represents the size of the tree over time. Let's create
some follow-up plots.
> followup.plot(id=tree, time=Time, outcome=size)
Unlike the follow-up plots of the pharmacokinetic studies, in which there are only 6
lines for Indometh and 12 for Theoph, there are 79 lines in this plot, one for each
tree. These plots are sometimes called "spaghetti-plots" due to the crossing lines.
There is a tendency that trees grown in ozone-rich chambers (red dashed lines) are
smaller than those in the control group (black lines).
Time is days since 1 Jan 1988. However, it is not clear when the experiment started. Unlike the pharmacokinetic studies, which start from a zero level of drug, in the Sitka study the first measurement was taken at some unknown time after the trees had started growing. Therefore the raw AUC may be an invalid outcome measure.
Let's try subtracting the size at the first measurement (day 152) from each tree's subsequent measurements and then calculating the AUC. First we must create a 'visit' index for each tree.
Indexing visits
Let us make sure that the data are properly sorted by 'tree' and 'Time'.
> sortBy(tree, Time)
Next, count the number of records contributed by each tree (in R, this is called 'run
length encoding' since the lengths of each element that appears repeatedly are
encoded into a list). The function is rle.
> list1 <- rle(tree)
> list1
Run Length Encoding
lengths: int [1:79] 5 5 5 5 5 5 5 5 5 5 ...
values : int [1:79] 1 2 3 4 5 6 7 8 9 10 ...
The object 'list1' has two elements, namely 'lengths' and 'values'. The first
element shows that there are 5 visits for each tree. The second gives the identifier of each tree. Note that the function rle takes only an atomic vector as its argument. In this
case we do not have any problem as tree is a vector. If it was a factor, the
corresponding function would be
> lst1 <- rle(as.vector(tree))
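The step that turns these run lengths into a within-tree visit counter is not reproduced here; one way to do it (a sketch) would be:
> visit <- unlist(sapply(list1$lengths, FUN=seq_len))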
The above rle and sapply commands are complicated and not easy to
remember. The epicalc package has a function called markVisits for this
purpose.
> visit <- markVisits(id=tree, time=Time)
This is the required index vector for visits that we can pack into our data frame.
> pack()
Note that marking of the visit may not be well synchronized with the 'Time'
variable for a couple of reasons. Firstly, the exact time may not be repeated, as seen
in the Theoph data. Secondly, data may not be collected at the scheduled time. For
example, if patients are supposed to come weekly but they do not show up in the
second week, then their 3rd week visit will become their second visit and their 4th
week visit will become their third week visit, etc. We can simply check the
consistency between 'Time' and 'visit' as follows.
> table(Time, visit)
visit
Time 1 2 3 4 5
152 79 0 0 0 0
174 0 79 0 0 0
201 0 0 79 0 0
227 0 0 0 79 0
258 0 0 0 0 79
> head(.data, 10)
size Time tree treat visit
1 4.51 152 1 ozone 1
2 4.98 174 1 ozone 2
3 5.41 201 1 ozone 3
4 5.90 227 1 ozone 4
5 6.15 258 1 ozone 5
6 4.24 152 2 ozone 1
7 4.20 174 2 ozone 2
8 4.68 201 2 ozone 3
9 4.92 227 2 ozone 4
10 4.96 258 2 ozone 5
Our current task is to subtract the tree size for each tree at visit 1 from its size at all
subsequent visits. This can be achieved with the following commands.
> tmp <- by(.data, INDICES=tree, FUN=function(x) x$size -
x$size[x$visit==1])
> size.change <- sapply(tmp, as.numeric)
The first command above splits the data frame into each individual tree and
subtracts the tree size at the first visit from the tree sizes at all subsequent visits.
The result is saved into a temporary object, 'tmp'. This object is a type of list and is
not very useful unless it is converted to a vector or matrix using the sapply
function.
We need to convert this matrix into one long vector and integrate it as a variable
into the current data frame.
> size.change <- as.numeric(size.change)
> pack()
Now, let's look at the summary of this new variable.
> summ(size.change)
obs. mean median s.d. min. max.
395 0.747 0.76 0.55 -0.52 2.1
Note that there are a few negative values (size decreased over time) in addition to a
number of zeros corresponding to the first visit, and the remaining positive values
(tree size increased). We temporarily ignore the records with negative tree growth
and compute the AUC of the size differences and then compare the values of this
variable between the two treatment groups.
> auc.tree <- auc(size.change, time=Time, id=tree)
Here the 'tree' variable is used as the subject identification. This vector can now be
merged with a data frame containing a subset of records from the first visit.
> visit1 <- subset(.data, subset=visit==1, select=c("tree", "size",
"treat"))
> auc.visit1 <- merge(auc.tree, visit1)
Before using auc.visit1, since we have made many changes to .data, let's
make a copy of it for future use.
> .data -> Sitka1
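The des() output that follows describes auc.visit1, so the active dataset is presumably switched first (this step is not shown in the text):
> use(auc.visit1)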
> des()
No. of observations = 79
Variable Class Description
1 tree integer
2 auc numeric
3 size numeric
4 treat factor
Surprisingly, the minimum AUC value for the treatment group is negative, which
means one or more trees actually got smaller! To check which one(s) type
> tree[auc <0]
[1] 15
This unlikely finding was perhaps due to a measurement error made at the first visit
of this tree.
More details about how to detect abnormal records where the size became smaller are shown in the examples on the help page of the followup.plot function.
We will omit this tree and then test the hypothesis that the AUC is different
between the two treatment groups.
> use(auc.visit1)
> keepData(.data, subset=auc>0)
> tableStack(auc, by=treat)
control ozone Test stat. P value
auc t-test (76 df) = 1.88 0.064
mean(SD) 93.8 (22.4) 84.4 (19.6)
The conditions satisfy the requirements for a t-test and the difference in AUC is not
statistically significant.
Individual growth rates
Apart from AUC, we can compute and compare the growth rates of trees in the two
treatment groups.
Assuming that tree size is a linear function of time, for each individual tree, the
intercept would be its expected size at time 0 and the coefficient for 'Time' would
be the growth rate. We return to the original Sitka data set and use the function
by to get the coefficients for each individual tree.
> use(Sitka)
> tmp <- by(.data, INDICES=tree, FUN=function(x) lm(size ~ Time,
data=x))
Each element of 'tmp' contains the results of a linear model predicting the tree size
from 'Time' using only the data records of each tree. We then use the sapply
function to extract the coefficients of each model out from the 'tmp' object.
> tree.coef <- sapply(tmp, FUN=coef)
> class(tree.coef)
[1] "matrix"
> dim(tree.coef)
[1] 2 79
This matrix has 2 rows and 79 columns. Our objective here is to create a data frame
containing three variables. One variable must contain the unique tree id. The other
two variables should consist of the initial tree sizes and the growth rates for each
tree, which we have already obtained from the individual linear models.
We can convert the matrix above to a data frame using the as.data.frame
function, but first it needs to be transposed (columns to rows) using the function t.
> tree.growth <- as.data.frame( t(tree.coef) )
> des(tree.growth)
No. of observations = 79
Variable Class Description
1 (Intercept) numeric
2 Time numeric
The names of the variables created from linear modelling should be changed to
something more appropriate. The 'Time' variable represents the individual growth
rates obtained from the linear model.
> names(tree.growth)[2] <- "growth.rate"
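Before the merge below can work, the tree identifiers (currently stored as row names of tree.growth) need to be a column; this step is not shown in the text, but might be:
> tree.growth$tree <- as.integer(rownames(tree.growth))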
This variable will be used to link with the visit1 data frame created previously.
> tree.growth <- merge(tree.growth, visit1)
> use(tree.growth)
> des()
No. of observations = 79
Variable Class Description
1 tree integer
2 (Intercept) numeric
3 growth.rate numeric
4 size numeric
5 treat factor
Now we have a data frame with 79 records, one record for each tree, containing
each tree's own intercept, growth rate and treatment. We can now test the
hypothesis of different growth rates between trees in the two different treatment
groups.
> tableStack(vars=2:3, by=treat, decimal=3)
control ozone Test stat. P value
(Intercept) t-test (77 df) = 1.0213 0.310
mean(SD) 2.122 (1.108) 2.343 (0.783)
At time zero (the intercept), the two groups are not significantly different. However, the growth rate of trees in the ozone (treatment) group (0.012 unit per day) is significantly lower than that of the control group (0.014 unit per day).
Note that in this experiment, we do not know when ozone treatment started. If
treatment was given late in the growth curve, then the validity of using the linearly
increasing tree sizes (from time zero) as the outcome in the models would be in
doubt.
Summary
In summary, without Epicalc, manipulation of data within each subject in
longitudinal data requires several complicated functions, such as rle and sapply
to create an index variable within the same subject, here called 'visit'. The
markVisits function from epicalc can simplify this task.
Measurements from the first visit can be subtracted from the other visits in the same
individual to see the change from baseline. In the Sitka tree example, no baseline
data was given, so the first visit records were used to represent the baseline. The
linear growth rate of each individual can be computed using functions by and
sapply. These two functions, when used together, are very powerful. Analysts of
longitudinal data should get acquainted with them. They will be encountered
extensively in subsequent chapters.
Exercises
Based on the experience of the above examples, check whether there are any
missing records in the Xerop data set. Use the markVisits function to create a
'visit' index which indicates the order of visit for each subject. Check the
consistency of this visit index and the 'time' variable.
Use the Sitka data set to compute the coefficients of a quadratic growth curve for each tree. Determine which components of the growth curve are
significantly predicted by ozone.
Chapter 5: Within subject, across time
comparison
In the previous two chapters, we computed single summary outcome variables, such as the area under the curve (AUC) and the growth rate. By using this strategy, the complications of statistical models for repeated observations on the same subjects can be avoided.
We used the functions by and sapply to create linear models of growth for each
tree. In fact, there are more important applications of these functions – that is,
within subject, across time comparison.
If you had run the example code in the help page for the followup.plot
function, you would have found that some trees became smaller. Whether this is
naturally possible or whether it was due to human error during data collection
and/or data entry is not known. The technical challenge that we are facing is how to
detect the records that have decreasing tree sizes.
> library(epicalc)
> zap()
> data(Sitka, package="MASS")
> use(Sitka)
The data frame is split into subsets for each unique value of 'tree'. The tree size at the first visit in each subset is then subtracted from the size at the second visit and the result stored in a temporary object called 'tmp'.
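The command that produces 'tmp' is not reproduced here; a sketch consistent with the description (assuming the data are sorted by tree and Time, as in Chapter 4) would be:
> sortBy(tree, Time)
> tmp <- by(.data, INDICES=tree, FUN=function(x) x$size[2] - x$size[1])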
> diff2from1 <- sapply(tmp, FUN=as.numeric)
> which(diff2from1 < 0)
2 15
2 15
They were the second and the fifteenth trees. The numbers on the top row are
'names' of the values, which are on the bottom row.
Similarly, one can find records where the tree size at the third visit is smaller than
the size at the second etc.
Lag measurements
A more efficient method is to compare the current tree size at time t of each tree
with its size at time t-1. A lag vector of sizes can be created using the strategy shown
in the example of followup.plot. It is further discussed here.
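The commands that create the 'visit' index and the first lag variable are not reproduced here; following the pattern from Chapter 4 and the lagVar example, they were presumably along these lines:
> visit <- markVisits(id=tree, time=Time)
> pack()
> size.lag.1 <- lagVar(var=size, id=tree, visit=visit)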
The last argument 'lag.unit' has a default value of 1 if omitted. For a lag of 2 type:
> size.lag.2 <- lagVar(var=size, id=tree, visit=visit, lag.unit=2)
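The 'size.next.1' column seen in the output below (the size at the following visit) is presumably created with a negative lag, and the new variables packed into the data frame:
> size.next.1 <- lagVar(var=size, id=tree, visit=visit, lag.unit=-1)
> pack()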
These newly created lags can be seen from the following command.
> head(.data, 10)
size Time tree treat visit size.lag.1 size.lag.2 size.next.1
1 4.51 152 1 ozone 1 NA NA 4.98
2 4.98 174 1 ozone 2 4.51 NA 5.41
3 5.41 201 1 ozone 3 4.98 4.51 5.90
4 5.90 227 1 ozone 4 5.41 4.98 6.15
5 6.15 258 1 ozone 5 5.90 5.41 NA
6 4.24 152 2 ozone 1 NA NA 4.20
7 4.20 174 2 ozone 2 4.24 NA 4.68
8 4.68 201 2 ozone 3 4.20 4.24 4.92
9 4.92 227 2 ozone 4 4.68 4.20 4.96
10 4.96 258 2 ozone 5 4.92 4.68 NA
Note that the first visit has neither a 'size.lag.1' nor 'size.lag.2' value. For
'size.next.1', the value is the tree size at the next (second) visit. At the second visit,
'size.lag.1' is the tree size from the first visit, etc. Now the trees that became smaller
at any point of time can be easily identified.
> .data[which(size.lag.1 > size),]
size Time tree treat visit size.lag.1 size.lag.2 size.next.1
7 4.20 174 2 ozone 2 4.24 NA 4.68
72 4.08 174 15 ozone 2 4.60 NA 4.17
94 4.62 227 19 ozone 4 4.76 3.93 4.64
135 5.32 258 27 ozone 5 5.44 4.70 NA
180 4.60 258 36 ozone 5 4.62 4.42 NA
270 5.02 258 54 ozone 5 5.03 4.55 NA
There are six records, corresponding to six different trees. We can now create a
variable for the change in size between two adjacent visits on the same tree.
> size.change <- size - size.lag.1
> pack()
The records can now be inspected and corrected if needed. A summary of the change
is shown as follows.
> summ(size.change)
obs. mean median s.d. min. max.
316 0.332 0.33 0.18 -0.52 0.87
The leftmost outlying value was probably a serious error in measurement. The upper
part of the graph suggests that there are many missing values. In fact, the statistical
output from the command shows that there are only 316 non-missing values. The
remaining 395-316 = 79 missing records are from the first measurements which did
not have any preceding measurement.
> summ(size.change, by=Time)
The time intervals between measurements are 22, 27, 26 and 31 days, which slightly
increased over time. Noticeable from the graph is that the growth of the trees
actually slowed down over time. This can be seen more clearly with:
> boxplot(size.change ~ Time)
Despite the plots, one must realize that the variable 'size' is the logarithm of the
actual tree size. The untransformed values would in fact show accelerated growth.
In subsequent chapters, we will go deeper into using the change as the main outcome variable rather than the absolute value. Right now let's finish with tracking
changes of a dichotomous outcome over time.
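The commands loading the bacteria data set are not reproduced at this point; presumably, as in Chapter 2:
> zap()
> data(bacteria, package="MASS")
> use(bacteria)
> des()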
Variable Class Description
1 y factor
2 ap factor
3 hilo factor
4 week integer
5 ID factor
6 trt factor
In this data set, the time variable is 'week', an integer, and the subject variable is 'ID',
which is a "factor".
> summ()
There are 220 records. Let's check whether there are any IDs missing in any of the
follow-up periods.
> length(unique(ID))
[1] 50
> table(table(ID))
2 3 4 5
3 5 11 31
There are a total of 50 subjects. Three people came only twice, five people came 3
times, 11 people came four times and 31 came to every follow-up visit.
No person had a duplicate record in any follow-up visit. To assess the total number
of subjects who attended in each week type:
> colSums(table(ID, week))
0 2 4 6 11
50 44 42 40 44
All 50 subjects attended week 0. The numbers declined to 44, 42 and 40 at weeks 2,
4 and 6, respectively. At the final follow-up visit (week 11), 44 persons attended.
It is a good idea to see the change of bacteria status from one week to the next. Let's
start with the change from the first to the second week.
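A 'visit' index for each subject is needed first; presumably it was created with markVisits, as in Chapter 4:
> visit <- markVisits(id=ID, time=week)
> pack()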
> next.y <- lagVar(var=y, id=ID, visit=visit, lag.unit=-1)
> pack()
Before continuing, we keep a copy of the data frame for later use.
> .data -> data1
> head(.data,10)
y ap hilo week ID trt visit next.y
1 y p hi 0 X01 placebo 1 y
2 y p hi 2 X01 placebo 2 y
3 y p hi 4 X01 placebo 3 <NA>
4 y p hi 11 X01 placebo 5 <NA>
5 y a hi 0 X02 drug+ 1 y
6 y a hi 2 X02 drug+ 2 <NA>
7 n a hi 6 X02 drug+ 4 y
8 y a hi 11 X02 drug+ 5 <NA>
9 y a lo 0 X03 drug 1 y
10 y a lo 2 X03 drug 2 y
Note that the first ID, X01, did not attend in week 6, thus the value of 'next.y' for
week 4 is missing. Similarly, the second ID, X02, did not attend in week 4. The
value of 'next.y' for this subject for week 2 is missing. In order to cross-tabulate the
outcome variable, 'y', at visit 1 and visit 2, type:
> keepData(subset=visit==1)
> addmargins(table(y, next.y))
next.y
y n y Sum
n 0 4 4
y 4 36 40
Sum 4 40 44
Out of 4 persons who did not have the bacteria ('y' = "n") in their first visit, all of
them changed to "y". Out of 40 subjects who did have the bacteria ('y' = "y"), 4
persons changed to "n".
> mcnemar.test(table(y, next.y)) # P value = 1
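The commands for the second transition (visit 2 to visit 3) are not reproduced here; presumably the analysis was repeated on the saved copy, along these lines:
> use(data1)
> keepData(subset=visit==2)
> addmargins(table(y, next.y))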
Note that at this transition, the total number of records is now 37, not 50. There were
clearly more people who changed from "y" to "n" across this transition than in the opposite direction. In other words, bacteria tended to disappear in the second transition period (from week 2 to week 4), and this imbalance toward losing the bacteria is statistically significant by McNemar's test.
> mcnemar.test(table(y, next.y))
McNemar's chi-squared = 6.125, df = 1, p-value = 0.01333
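The reshape to wide format that the next paragraph refers to is not shown; a sketch consistent with the column names used below (the object name 'wide' is taken from the later command) might be:
> wide <- reshape(data1[, c("ID", "week", "y")], idvar="ID",
     timevar="week", v.names="y", direction="wide")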
The value of the 'week' variable from the original data set (in long form) became the
suffix for the repeated variables in wide form. The variable 'y.11' appears before 'y.6'
because the first person (X01) did not attend the sixth week appointment. The
outcome value in 'y.6' for this person is therefore <NA>. Then,
> with(wide, addmargins(table(y.0, y.2)))
y.2
y.0 n y Sum
n 0 4 4
y 4 36 40
Sum 4 40 44
The results are exactly the same as what we obtained using the preceding method.
In conclusion, there are two methods for comparing values across time within the
same person. The first method created a 'visit' variable which was modified from
'week' and shifted the values by using the lagVar function. The second method
reshaped the data to wide format before making the comparison. This method is
more straightforward for tabulation but not useful for transition modelling.
Exercise
Read in the Xerop data from Epicalc. Explore the pattern of visiting times. Were
they evenly distributed? Track changing status of respiratory infection (respinfect),
xerop and stunting over the visits.
Chapter 6: Analysis of missing records
The preceding chapter looked at the changing status of the subject. This chapter
pays attention to missing records during follow-up.
Missing records are different from missing values within variables of existing records. For missing records, the analyst should first highlight any pattern to the research team, even though the data themselves are not available. Later, as with the analysis of missing values, the missing records should be checked to see whether they are missing at random or whether there is some underlying cause. Some analysts prefer to impute missing values with their 'best guess'. For a follow-up study focusing on only one outcome variable, with all other variables (such as demographic and clinical prognostic factors) being fixed, and with statistical methods that allow no missing data, such imputation would be cost-effective. However, when more than one variable (especially both a changing exposure and a changing outcome) is being monitored, and the statistical methods can accommodate imperfect data, imputation becomes less important.
Handling missing values is a complicated technique and beyond the scope of this
book. Readers are advised to consult with other sources for more details on this
topic.
Based on the above arguments, data management is the most important technique to
deal with missing records. We will examine methods for identifying, refilling and
highlighting the pattern of missing records.
> use(bacteria) # This data is in the MASS package
> des()
> summ()
No. of observations = 220
There are 220 records. The most important variables for identification of missing
visits are 'ID' and 'week'. Since the class of the 'ID' variable is "factor" the output
from the summ command is not meaningful, particularly the minimum and
maximum values. All we can say is that there are 50 distinct values.
The 'week' variable is an integer and ranges from 0 to 11. Let's view the distribution
more closely.
> table(week)
week
0 2 4 6 11
50 44 42 40 44
The follow-up interval is 2 weeks up until week 6, with a final visit at week 11.
There were 50 children who attended in week 0 (all children in the study attended
the initial visit). Attendance at the subsequent visits ranged from 40 to 44 children.
There is no obvious pattern to the missing records.
We can further tabulate this table to obtain a frequency of total visits for all
children.
> table(table(ID))
2 3 4 5
3 5 11 31
Of the five scheduled visits, three children came twice, five came three times, 11
came four times and 31 attended all five. No child came just once; that is, every
child returned for treatment at least once during the 11-week study. The full vector
of frequencies for 0 up to 5 visits is [0, 0, 3, 5, 11, 31].
Since there are 220 records out of the possible maximum 250 (50 subjects × 5
visits), the overall probability of a child attending at each visit is 220/250 = 0.88.
The question is whether or not the observed distribution is actually random.
If we can prove that the observed data follows some known distribution, then we
can conclude that the missing records have no pattern and are therefore missing at
random.
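The natural candidate distribution here is the binomial with 5 trials and attendance
probability 0.88. The vector 'p' used below is not defined in this excerpt; presumably
it holds these binomial probabilities, for example:
> p <- dbinom(0:5, size=5, prob=0.88)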
With p = 0.88, the probability that a child attends none of the five scheduled visits
is about 0.00002, while the probability of attending all five is more than one half.
These probabilities refer to a single child. With 50 children, the expected numbers of
children attending each possible total number of visits would be:
> 50 * p
[1] 0.00124416 0.04561920 0.66908160 4.90659840 17.99086080 26.38659584
Let's compare these expected numbers with the observed numbers from the
bacteria study.
> data.frame(visits=0:5, p=p, expected=50*p, observed=c(0,0,3,5,11,31))
  visits            p    expected observed
1      0 0.0000248832  0.00124416        0
2      1 0.0009123840  0.04561920        0
3      2 0.0133816320  0.66908160        3
4      3 0.0981319680  4.90659840        5
5      4 0.3598172160 17.99086080       11
6      5 0.5277319168 26.38659584       31
The observed numbers appear fairly close to the expected numbers. In our study, 31
out of 50 children attended all 5 visits. If the distribution of total visits followed a
binomial distribution, with p = 0.88, we would expect this number to be 26. To test
whether the whole vector of observed frequencies fits well with the above expected
frequencies from a binomial distribution, we employ the chi-squared goodness-of-
fit test.
> chisq.test(x=c(0,0,3,5,11,31), p=p)
Chi-squared test for given probabilities
data: c(0, 0, 3, 5, 11, 31)
X-squared = 11.6921, df = 5, p-value = 0.03926
The p-value is below 0.05; however, the warning message that accompanies the test
("Chi-squared approximation may be incorrect") suggests that the test may not be
appropriate, most likely because several of the expected frequencies are less than 5.
Let's view the help page for this function.
> help(chisq.test)
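The help page describes, among other things, the 'simulate.p.value' argument, which
requests a Monte-Carlo p-value that does not rely on large expected frequencies. A
possible call (this step is not shown in the original) would be:
> chisq.test(x=c(0,0,3,5,11,31), p=p, simulate.p.value=TRUE, B=10000)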
Taking the small expected frequencies into account, there is not enough evidence to
conclude that the observed data do not come from a binomial distribution. Thus we
conclude that the missing visits show no systematic pattern and can be regarded as
missing at random.
Let's now check whether treatment ('trt') is associated with missing records. Note
that treatment was fixed for each subject throughout the study, which can be checked
first by visually scanning the cross-tabulation.
> table(ID, trt)
trt
ID placebo drug drug+
X01 4 0 0
X02 0 0 4
X03 0 5 0
X04 5 0 0
X05 5 0 0
X06 0 4 0
===== lines omitted ====
Within each row, only one cell should be greater than zero.
> table(ID, trt) > 0
trt
ID placebo drug drug+
X01 TRUE FALSE FALSE
X02 FALSE FALSE TRUE
X03 FALSE TRUE FALSE
X04 TRUE FALSE FALSE
X05 TRUE FALSE FALSE
X06 FALSE TRUE FALSE
===== lines omitted =====
If no child changed treatment, then the sum of each row should not be more than 1.
> any(rowSums(table(ID, trt) > 0) > 1)
[1] FALSE
No child changed treatment group (or level of encouragement to comply with
treatment) during the study.
At baseline (week 0) we have shown that all children attended for treatment. Let's
explore the treatment allocation for that first week, which is stated in the help page
for the data set as being randomized.
> table(trt[week==0])
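 placebo    drug   drug+
      21      14      15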
The treatment allocation is not completely balanced, but in any case we would like
to see whether this distribution remains more or less the same across the five
scheduled visits.
> table(week, trt)
trt
week placebo drug drug+
0 21 14 15
2 20 13 11
4 18 12 12
6 17 11 12
11 20 12 12
For each subsequent visit (weeks 2 to 11), the numbers of children receiving the three
different treatments appear quite similar, indicating that, in terms of treatment
group, the missing records occur at random. We can test the hypothesis that at each
follow-up visit the distribution is no different from that of the first week by using
the chi-squared test.
> chisq.test(table(trt[week==2]), p=table(trt[week==0])/50,
simulate=TRUE)
> chisq.test(table(trt[week==4]), p=table(trt[week==0])/50,
simulate=TRUE)
> chisq.test(table(trt[week==6]), p=table(trt[week==0])/50,
simulate=TRUE)
> chisq.test(table(trt[week==11]), p=table(trt[week==0])/50,
simulate=TRUE)
All four tests have large P values indicating that for each follow-up visit, the
number of children who attended for treatment was very close to the expected value
set by that at the first visit (week 0).
We can continue exploring the distribution of missing records for each variable in
this manner. However, since the number of variables we can explore at one time is
limited, we cannot investigate whether an interaction between variables exists. For
example, we have shown that the distribution of missing records is neither
associated with week nor with treatment. There is, however, a possibility that non-
attendees (the missing records) occurred earlier in one treatment group compared to
another.
To add missing records into a longitudinal data set, we first create a data frame
containing all the possible combinations of 'ID' and 'week' by reshaping the data to
wide format.
> wide <- reshape(.data, idvar="ID", timevar="week", v.names="y",
direction="wide")
The warning message appears because we did not specify the treatment-related
variables ('ap', 'hilo' and 'trt') in the command. In the wide data frame the values of
these variables are taken from the first record of each ID. Since we have shown that
they are fixed (constant) within each ID, we can safely ignore the warning.
> head(wide)
ap hilo ID trt y.0 y.2 y.4 y.11 y.6
1 p hi X01 placebo y y y y <NA>
5 a hi X02 drug+ y y <NA> y n
9 a lo X03 drug y y y y y
14 p lo X04 placebo y y y y y
19 p lo X05 placebo y y y y y
24 a lo X06 drug y y y y <NA>
Now we can reshape this data frame back to long format. As explained in Chapter 2,
most of the arguments of the function can be omitted, since the data frame was
created by a previous reshape command. The result is a data frame of 250 records,
one for each of the 50 × 5 possible combinations of ID and week.
> long <- reshape(wide, direction="long")
> des(long)
No. of observations = 250
Variable Class Description
1 ap factor
2 hilo factor
3 ID factor
4 trt factor
5 week integer
6 y factor
Note that the order of the variables is alphabetical. This is a side-effect of the
reshape command.
> summ(long)
The 'y' variable contains only 220 non-missing observations, which come from the
original data.
The final step is to add an indicator variable, specifying whether the subject
attended at each corresponding follow up visit. These are in fact the records in
which the 'y' variable is not missing.
> long$attend <- !is.na(long$y)
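The data frame 'bacteria.new' referred to below is not constructed earlier in this
excerpt. Presumably it was produced with Epicalc's addMissingRecords function
(introduced in the Summary and exercises of this chapter), along these lines:
> bacteria.new <- addMissingRecords(.data, ID, week, outcome="y")
> use(bacteria.new)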
> head(bacteria.new, 10)
Note the reordering of the variables and the addition of the 'present' variable, which
indicates whether the record was present in the original dataset.
The output suggests that attendance (and therefore non-attendance) in each week
was not random. Let's run a logistic regression model. The 'week' variable needs to
be converted to a factor first so that a comparison of attendance between the first
visit and each remaining visit can be done.
> week <- factor(week)
> pack()
> glm1 <- glm(attend ~ trt + hilo + week, family = binomial, data =
.data)
> logistic.display(glm1)
Log-likelihood = -81.9821
No. of observations = 250
AIC value = 179.9642
Odds ratios comparing the follow-up weeks to the first week are zero because there
was no missing visit in the first week. The likelihood ratio test, however, confirms
that there is a significant difference in missing records among the visits. The
adjusted odds ratio of compliance ('hilo') is quite different from the crude odds ratio
(1.2 vs 0.74). This is due to the fact that compliance is associated with type of
treatment (trt). In fact, treatment ('trt') is simply a recoding of the 'ap' variable
(active/placebo) and 'hilo' (high/low encouragement to comply with treatment).
Nonetheless, neither variable is statistically significant.
Our final conclusion is that missing records are not a random phenomenon but
increased significantly after week 0. Neither treatment nor compliance is associated
with missingness.
Summary
Missing records are almost unavoidable when the sample size and the number of
follow-up visits are large. When they do occur, it is important to investigate the
reasons and to ensure that they are missing at random. The most effective method is
to fill in the missing visits based on subjects and time of follow-up using
addMissingRecords. By creating a 'missing' indicator variable as the outcome
variable, cross tabulation (tableStack) and logistic regression (glm) can help to
identify determinants of missing records.
Exercise
Load the Xerop data set from Epicalc. This is a dataset from an Indonesian study
on vitamin A deficiency and risk of respiratory infection in 275 children.
• At each scheduled visit, determine how many records are missing.
• Identify and remove the duplicated records based on the combination of
'id' and 'time', then repeat question 1.
• Including the duplicated records that you removed, use Epicalc’s
addMissingRecords function to create a new data frame containing a
complete set of records.
• Was season associated with non-attendance?
• Determine whether or not vitamin A deficiency and/or respiratory
infection preceded non-attendance.
• Find the determinant(s) of the missing records.
Chapter 7: Modelling longitudinal data
In the previous chapters, the relationship between the outcome and the predicting
variables was based on an assumption of independence among the records. Exceptions
include, for example, situations where the data are analysed with methods such as
conditional logistic regression, in which the data are stratified and analysed
conditionally within matched sets.
In longitudinal studies, data on one individual can be measured more than once.
Thus records belonging to one individual appear in the dataset more than once, and
measurements from the same individual are not independent of each other. Analysis
of this type of data with ordinary (generalized) linear models will therefore give
erroneous results, in particular incorrect standard errors. There are three main
choices of modelling here:
Population average models are so called because they only focus on the average
relationship among repeated measures. A number of individuals are measured on
the outcome and exposure repeatedly. Repetitions may arise from measurements of
the same individuals several times or from measurement of the same sets of
variables on several individuals. The models don't take into account the source of
repetition. They just find the average relationship. This relationship comes from
averaging among repeated times and within repeated persons. Such a model is
therefore also called a marginal model. Remember that when we apply the
addmargins function to a table, we add, in the rightmost column, the sum of each
row and, in the bottom row, the sum of each column. The margins thus focus on the
overall effect of rows and columns and ignore what is inside. In marginal models, we are
not able to estimate the individual person outcome but we can still predict the
outcome value of a new subject if the person is given the covariate values. This
prediction is based on the average effects mentioned above. The modelling
technique is called "generalized estimating equations" (GEE), probably because the
final model is based on several generalized linear model equations that share the
same set of coefficients. GEE models require more parameters to be
estimated than the ordinary GLM methodology. The correlation coefficient
structure among the residuals of different rounds of observations must be specified.
The choice of possible correlation structures will be discussed in future sections.
The second choice is random coefficients models, also called 'mixed models' because
they are a mixture of fixed coefficients and random ones. While marginal models predict the
outcome of each person based on an average set of predicting coefficients, mixed
models use both fixed coefficients common to all subjects and random coefficients
specific to each individual in order to predict the outcome of that particular
individual. For a model with only one random coefficient, the random coefficient
would be the variation of intercept of each individual with other coefficients (or
slopes) being fixed. The output of the model should show the standard deviation of
the intercepts. If this is large, it would mean that there is a large level of baseline
variation of the subjects under study. Random coefficients models also allow for
random slopes. A model with random slopes means that different subjects may be
differently affected by the independent variable. This is similar to interaction of
strata in stratified analysis discussed in our previous book1. Finally, while marginal
models focus on correlations among residuals at different times of observation,
random coefficients models are more interested in the correlations between the
random coefficients themselves, for example between the random intercepts and the
random slopes. If the intercepts have a positive correlation with the slopes, the lines
on the upper part of the graph (those with high intercepts) would be steeper than
those in the lower part. When there is only one random coefficient, i.e. a random
intercept, such a correlation is not of main concern.
Occasionally, analysts use conditional fixed effects models. This is an extreme case
where random terms disappear. Comparison is made within the same person or the
same matched set, just like in a matched case-control study. For longitudinal
studies, the outcome of main concern using a fixed effects model is not at any
individual time point but the difference between two time points in the same person.
This is confined to before-after studies, or studies looking at the difference between
two sides of the same organ, such as eyes or kidneys, within the same person. This
type of model is of limited use and will be omitted in future discussion.
Finally, transitional models are focused on transition of states. They are used to
predict the outcome of a set of subjects who share the same previous status as well
as other independent variables. This model thus has two simultaneous interests.
First, it is interested in the effect of the preceding outcome status on the current one
after adjustment for other covariates. Second, it demonstrates the effects of those
covariates after adjustment with the previous outcome. A simple transition model
will look at the effects of a previous outcome in only one or a few preceding
rounds. Autoregressive (AR) models, often employed by economists, may include
more preceding lags, since financial data may have longer-term effects than medical
outcomes. Economic models often further aggregate the outcomes of individual time
points into a 'moving average' to obtain a more stable outcome value.
As a moving average at one time point is highly correlated with its neighbours,
moving-average outcomes are almost always combined with the autoregressive
approach. Together the two components give the name autoregressive moving
average (ARMA) analysis. This approach is rarely used in epidemiology and will not
be discussed further.
___________________________________________________________________
1 Analysis of Epidemiological Data using R and Epicalc
Packages and functions in R that are used for the different types of modelling include
the following:
Modelling approach       Package    Function
Marginal (GEE)           gee        gee
Marginal (GEE)           geepack    geeglm
Mixed (random effects)   MASS       glmmPQL
Mixed (random effects)   nlme       lme
Mixed (random effects)   lme4       lmer
Transitional             stats      glm
Let's try these packages and functions with the longitudinal data that we have
previously explored. Let's use the gee function in the gee package to model the
Sitka dataset.
> library(epicalc)
> data(Sitka, package="MASS")
> use(Sitka)
> library(gee)
> gee.in <- gee(size ~ Time, id=tree, data=.data)
> summary(gee.in)
Model:
Link: Identity
Variance to Mean Relation: Gaussian
Correlation Structure: Independent
Call:
gee(formula = size ~ Time, id = tree, data = .data)
Summary of Residuals:
Min 1Q Median 3Q Max
-2.02609732 -0.37956123 0.06948273 0.41669270 1.30948273
Coefficients:
Estimate Naive S.E. Naive z Robust S.E. Robust z
(Intercept) 2.27324430 0.1768643245 12.85304 0.1003348470 22.65658
Time 0.01268548 0.0008591845 14.76456 0.0003719549 34.10488
Working Correlation
[,1] [,2] [,3] [,4] [,5]
[1,] 1 0 0 0 0
[2,] 0 1 0 0 0
[3,] 0 0 1 0 0
[4,] 0 0 0 1 0
[5,] 0 0 0 0 1
Leaving most arguments to their default values, the link function is 'identity', which
means that the original value of outcome variable is not transformed. This is
applicable to continuous outcome variables in all models. The family is 'gaussian'
by default. The default correlation structure among residuals of different times is
"independent". This assumes that there is no association among residuals of
different time periods, as shown by zero values in the off-diagonal cells of the
working correlation matrix. There are two sets of standard errors produced. The
robust (sandwich) standard errors remain valid even if the working correlation
structure has been misspecified. The naïve ones give the same results as those from
using the glm command.
> summary(glm(size ~ Time))$coefficients
Estimate Std. Error t value Pr(>|t|)
(Intercept) 2.27324430 0.1768643245 12.85304 8.319835e-32
Time 0.01268548 0.0008591845 14.76456 1.473822e-39
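The data frame 'wide' of residuals used below is not created in this excerpt.
Presumably the residuals of the simple regression were reshaped to wide format by
measurement time, along these lines (object and variable names follow the output
shown):
> res <- residuals(glm(size ~ Time))
> wide <- reshape(data.frame(tree, Time, res), idvar="tree", timevar="Time",
    v.names="res", direction="wide")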
> cor(wide[,2:6])
res.152 res.174 res.201 res.227 res.258
res.152 1.0000000 0.9614699 0.9176641 0.8710606 0.8565763
res.174 0.9614699 1.0000000 0.9721038 0.9370675 0.9247401
res.201 0.9176641 0.9721038 1.0000000 0.9653189 0.9494939
res.227 0.8710606 0.9370675 0.9653189 1.0000000 0.9866713
res.258 0.8565763 0.9247401 0.9494939 0.9866713 1.0000000
The correlation coefficients among residuals of different time points are very high.
Thus the assumption of independence is not valid.
We first try the most commonly used correlation structure, “exchangeable”, which
assumes that the correlation between time points is constant and non-zero.
> gee.ex <- gee(size ~ Time, id=tree, data=.data, corstr =
"exchangeable")
> summary(gee.ex)
GEE: GENERALIZED LINEAR MODELS FOR DEPENDENT DATA
gee S-function, version 4.13 modified 98/01/27 (1998)
Model:
Link: Identity
Variance to Mean Relation: Gaussian
Correlation Structure: Exchangeable
Call:
gee(formula = size ~ Time, id = tree, data = .data, corstr =
"exchangeable")
Summary of Residuals:
Min 1Q Median 3Q Max
-2.02609732 -0.37956123 0.06948273 0.41669270 1.30948273
Coefficients:
Estimate Naive S.E. Naive z Robust S.E. Robust z
(Intercept) 2.27324430 0.0880570302 25.81559 0.1003348470 22.65658
Time 0.01268548 0.0002688318 47.18742 0.0003719549 34.10488
Working Correlation
[,1] [,2] [,3] [,4] [,5]
[1,] 1.0000000 0.9020987 0.9020987 0.9020987 0.9020987
[2,] 0.9020987 1.0000000 0.9020987 0.9020987 0.9020987
[3,] 0.9020987 0.9020987 1.0000000 0.9020987 0.9020987
[4,] 0.9020987 0.9020987 0.9020987 1.0000000 0.9020987
[5,] 0.9020987 0.9020987 0.9020987 0.9020987 1.0000000
The coefficients of the regression are exactly the same as those obtained from
specifying an independent correlation structure. However, the standard errors are
smaller.
Since the exchangeable structure forces the working correlation to be constant across
all time lags, which is not what we observed (the correlations decline as the lag
increases), the argument 'corstr' should be changed further.
> gee.ar1 <- gee(size ~ Time, id=tree, data=.data, corstr = "AR-M")
> summary(gee.ar1)
GEE: GENERALIZED LINEAR MODELS FOR DEPENDENT DATA
gee S-function, version 4.13 modified 98/01/27 (1998)
Model:
Link: Identity
Variance to Mean Relation: Gaussian
Correlation Structure: AR-M , M = 1
Call:
gee(formula = size ~ Time, id = tree, data = .data, corstr = "AR-M")
Summary of Residuals:
Min 1Q Median 3Q Max
-1.9059082 -0.2816582 0.1728384 0.5287292 1.3964369
Coefficients:
Estimate Naive S.E. Naive z Robust S.E. Robust z
(Intercept) 2.31297888 0.1123461827 20.58796 0.1003399163 23.05143
Time 0.01199296 0.0004319665 27.76362 0.0003508396 34.18359
Working Correlation
[,1] [,2] [,3] [,4] [,5]
[1,] 1.0000000 0.9460445 0.8950002 0.8467100 0.8010253
[2,] 0.9460445 1.0000000 0.9460445 0.8950002 0.8467100
[3,] 0.8950002 0.9460445 1.0000000 0.9460445 0.8950002
[4,] 0.8467100 0.8950002 0.9460445 1.0000000 0.9460445
[5,] 0.8010253 0.8467100 0.8950002 0.9460445 1.0000000
Now the working correlation is closer to what we observed from the simple
regression. AR-M denotes an autoregressive correlation of order one (M = 1). This
means corr(t, t+n) = [corr(t, t+1)]^n. In our example, the correlation between two
adjacent visits (a lag of one time point, the base on the right-hand side of this
equation) is 0.9460445. When the lag is increased to 2, the correlation is
0.9460445^2 = 0.8950002, and for a lag of 3 it is 0.9460445^3 = 0.8467100, etc. The
correlation coefficients among visits thus slowly decrease as the lag increases. To
speed up this decrease, another argument, 'Mv', can
be specified. For example,
> gee.ar2 <- gee(size ~ Time, id=tree, data=.data, corstr = "AR-M",
Mv=2)
> summary(gee.ar2)
The results are omitted. Both the coefficients and the standard errors are slightly
different from those of gee.ar1, as the working correlations now drop faster with
increasing time lag.
> library(geepack)
> geeglm.ex <- geeglm(size~Time, id=tree, data=.data, corstr =
"exchangeable")
> summary(geeglm.ex)
Call: geeglm(formula = size ~ Time, data = .data, id = tree, corstr =
"exchangeable")
Coefficients:
Estimate Std.err Wald p(>W)
(Intercept) 2.27324430 0.0977722507 540.5813 0
Time 0.01268548 0.0003640436 1214.2462 0
The estimated correlation parameter is 0.904, which is slightly larger than the one
obtained from the gee package using the same correlation structure (gee.ex). The
coefficients and standard errors are also very similar. Slight differences such as these
are seen between results from different software packages, whether open-source or
commercial, and are quite acceptable.
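The next block of output comes from the AR(1) version of this model. The fitting
command does not appear in this excerpt but, judging from the Call shown, it would
presumably be (the object name is an assumption):
> geeglm.ar1 <- geeglm(size ~ Time, id=tree, data=.data, corstr="ar1")
> summary(geeglm.ar1)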
Call:
geeglm(formula = size ~ Time, data = .data, id = tree, corstr = "ar1")
Coefficients:
Estimate Std.err Wald p(>W)
(Intercept) 2.31280103 0.0980443147 556.4571 0
Time 0.01198546 0.0003445955 1209.7360 0
Again, this gives practically the same coefficients and robust standard errors as the
one from the gee package.
Now that we are acquainted with these packages, the next step is to use them for
hypothesis testing. We want to see whether the size of trees is affected by treatment
after adjusting for time.
> geeglm1.ar1 <- geeglm(size ~ Time+treat, id=tree, data=.data,
corstr="ar1")
> summary(geeglm1.ar1)$coefficient
Estimate Std.err Wald p(>W)
(Intercept) 2.46484647 0.1657640661 221.105204 0.0000000
Time 0.01198697 0.0003446439 1209.700571 0.0000000
treatozone -0.22238262 0.1621230081 1.881535 0.1701597
The trees treated with ozone had a non-significantly smaller size throughout the
time of follow up. Unsurprisingly, tree sizes increased over time.
As treatment may have different effects over time we now put in the interaction
term.
> geeglm2.ar1 <- geeglm(size ~ Time*treat, id=tree, data=.data,
corstr="ar1")
> summary(geeglm2.ar1)$coefficient
Estimate Std.err Wald p(>W)
(Intercept) 2.154288609 0.2071115387 108.1930041 0.000000000
Time 0.013501415 0.0005847241 533.1587465 0.000000000
treatozone 0.231863021 0.2331996714 0.9885693 0.320092306
Time:treatozone -0.002219175 0.0007066679 9.8617143 0.001687538
This final model gives a better picture of the ozone effect. The main effect of ozone
is not significant indicating that at Time 0, there was no difference in size between
trees in the two treatment groups. The interaction term is strongly significant and
the negative coefficient indicates that the growth rate of trees in the ozone-treated
group was significantly lower than that of trees in the control group.
You may like to try modelling this data using the gee package. The conclusion
should be the same.
Now let's model dichotomous outcomes using the GEE methodology. Let's return to
the bacteria data set.
> zap()
> data(bacteria, package="MASS")
> use(bacteria)
> infected <- y=="y"
> pack()
> geeglm3.ar1 <- geeglm(infected ~ ap+week, id=ID, data=.data,
corstr = "ar1", family="binomial")
> summary(geeglm3.ar1)
Call:
geeglm(formula = infected ~ ap + week, family = "binomial", data =
.data, id = ID, corstr = "ar1")
Coefficients:
Estimate Std.err Wald p(>W)
(Intercept) 1.6315020 0.31066152 27.580382 1.506995e-07
app 0.8256061 0.48819185 2.859991 9.080800e-02
week -0.1041665 0.03738504 7.763552 5.331103e-03
The output indicates that there is no evidence of a difference in bacterial infection
between those receiving active treatment and those receiving placebo
(p-value = 0.09). On the other hand, the probability of infection decreased over time.
Interaction terms could be explored further, but all covariates remain
non-significant; the results are omitted here.
Results using the gee function from the gee package are similar.
Exercise
Try modelling the respiratory data set from the geepack package using the
methods described in this chapter. Compare the output from functions in the
different packages.
Chapter 8: Mixed models
The previous chapter demonstrated how to use the GEE methodology to model the
Sitka and bacteria data sets. In this chapter we will use mixed modelling
techniques to model the same data sets and compare the results.
> library(epicalc)
> library(MASS)
> use(Sitka)
> glmmPQL1 <- glmmPQL(fixed = size ~ Time * treat, random= ~ 1 | tree,
data=.data, family="gaussian")
> summary(glmmPQL1)
Linear mixed-effects model fit by maximum likelihood
Data: .data
AIC BIC logLik
NA NA NA
Random effects:
Formula: ~1 | tree
(Intercept) Residual
StdDev: 0.6003342 0.1932339
Variance function:
Structure: fixed weights
Formula: ~invwt
Fixed effects: size ~ Time * treat
Value Std.Error DF t-value p-value
(Intercept) 2.1217179 0.15374924 314 13.799860 0.0000
Time 0.0141472 0.00046278 314 30.569975 0.0000
treatozone 0.2216775 0.18596433 77 1.192043 0.2369
Time:treatozone -0.0021385 0.00055975 314 -3.820480 0.0002
Correlation:
(Intr) Time tretzn
Time -0.609
treatozone -0.827 0.504
Time:treatozone 0.504 -0.827 -0.609
Number of Groups: 79
The glmmPQL function comes from the MASS package. It fits generalized linear
mixed models using the penalized quasi-likelihood technique.
The random component is ~1 | tree. The 1 denotes the first coefficient of the model,
the intercept term. The sign | denotes "given" or "clustered by" or "grouped by". So
the random component for these data is the random intercept of each tree. In other
words, the model assumes that all trees share the same coefficients for 'Time' and
'treat' as well as for the interaction term. The only difference is in their intercepts,
which are treated as a random variable (random coefficients), so no separate intercept
parameter needs to be estimated for each tree.
The standard deviation of the random intercepts is 0.6003 units, which is quite large
compared to the residual standard deviation of 0.193. This means that the baseline
(intercept) values of the trees varied considerably compared with the variation of
growth within the same tree.
The fixed effects are 'Time', 'treat' and their interaction. Since the main objective of
the analysis is to compare the size of trees in each group over time, the fixed effects
are more important here than the random effects. The results are not too dissimilar
from those obtained using the GEE methodology: the significant interaction indicates
that the growth rate over time differs between the ozone-treated and control trees.
The correlation section and the standardised residuals are complicated, not very
important, and can be ignored.
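The output below corresponds to a model with a random slope for 'Time'. The fitting
command does not appear in this excerpt; judging from the later reference to
'glmmPQL2', it would presumably be:
> glmmPQL2 <- glmmPQL(fixed = size ~ Time * treat, random = ~ Time | tree,
    data=.data, family="gaussian")
> summary(glmmPQL2)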
Random effects:
Formula: ~Time | tree
Structure: General positive-definite, Log-Cholesky parametrization
StdDev Corr
(Intercept) 0.790968484 (Intr)
Time 0.002487428 -0.649
Residual 0.162608831
Variance function:
Structure: fixed weights
Formula: ~invwt
The random effect is now "~ Time | tree" which allows for a different effect of
'Time' on each tree. From the output, the intercept term has a larger standard
deviation (0.79 compared with 0.6003 in the random intercept model). Individual
slopes are strongly negatively correlated with the intercepts (Corr = -0.649). Within
the same treatment group, trees with the larger initial size tended to have a flatter
growth.
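The object 'lme1' summarised below is not created in this excerpt. Presumably it is
the corresponding random-intercept model fitted with the lme function from the
nlme package (which is loaded automatically when glmmPQL is used), for example:
> lme1 <- lme(fixed = size ~ Time * treat, random = ~ 1 | tree, data=.data)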
> summary(lme1)
Linear mixed-effects model fit by REML
Data: .data
AIC BIC logLik
175.5170 199.3293 -81.75852
Random effects:
Formula: ~1 | tree
(Intercept) Residual
StdDev: 0.6082011 0.1938483
Correlation:
(Intr) Time tretzn
Time -0.606
treatozone -0.827 0.501
Time:treatozone 0.501 -0.827 -0.606
The results from using the lme function are very close to those from using
glmmPQL.
> lme2 <- lme(fixed = size ~ Time*treat, random= ~ Time | tree, data
=.data)
> summary(lme2)
After adding a random time component, the results are very close to those of
'glmmPQL2' in all aspects.
The advantage of using the LME method is the availability of the AIC, BIC and log
likelihood, which imply relative levels of fit. We can therefore use these to compare
different models.
> anova(lme1, lme2)
Model df AIC BIC logLik Test L.Ratio p-value
lme1 1 6 175.5170 199.3293 -81.75852
lme2 2 8 146.1218 177.8714 -65.06088 1 vs 2 33.39528 <.0001
The model 'lme2' has two degrees of freedom more than 'lme1' but it has a much
smaller AIC value. Thus 'lme2' is significantly better than 'lme1'. The random
slope model fits better than one with a random intercept alone for this data set.
Note that the lme function was designed for linear mixed effects modelling. It is
confined to linear models where the outcome variable is on a continuous scale only.
Therefore, there is no "family" argument. The function lme has an option to use
maximum likelihood (ML) or restricted maximum likelihood (REML) methods.
A more recent package called lme4 allows a choice of "family" as well as a more
versatile nesting procedure. The formula syntax is slightly different.
> library(lme4)
> lmer1 <- lmer(size ~ Time*treat + (1|tree), family="gaussian",
data=.data)
> summary(lmer1)
Linear mixed model fit by REML
Formula: size ~ Time * treat + (1 | tree)
Data: .data
AIC BIC logLik deviance REMLdev
175.5 199.4 -81.76 130.2 163.5
Random effects:
Groups Name Variance Std.Dev.
tree (Intercept) 0.369909 0.60820
Residual 0.037577 0.19385
Number of obs: 395, groups: tree, 79
Fixed effects:
Estimate Std. Error t value
(Intercept) 2.1217179 0.1543913 13.742
Time 0.0141472 0.0004619 30.629
treatozone 0.2216775 0.1867409 1.187
Time:treatozone -0.0021385 0.0005587 -3.828
The results are the same as those from using the lme function except that p-values
for the coefficients are not displayed. One can, however, obtain 95% confidence
intervals by drawing a Markov chain Monte Carlo (MCMC) sample and computing a
95% highest posterior density (HPD) interval.
> tmp <- mcmcsamp(lmer1, n=1000)
> HPDinterval(tmp)
$fixef
lower upper
(Intercept) 1.790535519 2.4503748846
Time 0.012564591 0.0155585309
treatozone -0.152775369 0.6803190421
Time:treatozone -0.004096989 -0.0002541869
attr(,"Probability")
[1] 0.95
========= further lines omitted ==========
Both the lower and upper limits of the 95% CI for 'Time' are positive, whereas those
of the interaction term are both negative, indicating the statistical significance of
these two variables. Now we add a random slope component.
> lmer2 <- lmer(size~Time*treat + (Time|tree), family="gaussian",
data=.data)
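The comparison of the two models shown below is presumably produced with the
anova method (the call itself is not shown in the original). Note that anova refits the
models by maximum likelihood, which is why the AIC and log-likelihood values differ
from the REML-based values displayed earlier.
> anova(lmer1, lmer2)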
Df AIC BIC logLik Chisq Chi Df Pr(>Chisq)
lmer1 6 142.201 166.074 -65.100
lmer2 8 114.102 145.933 -49.051 32.099 2 1.071e-07 ***
The conclusion is the same as before. The model with a random slope is
significantly better than that with random intercept alone.
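The output below comes from fitting the bacteria data with glmmPQL using a
random intercept for each child. The preparatory and fitting commands are not
shown in this excerpt; they would presumably resemble the following (the object
name is an assumption):
> zap()
> data(bacteria, package="MASS")
> use(bacteria)
> infected <- y == "y"
> pack()
> glmmPQL3 <- glmmPQL(fixed = infected ~ ap + week, random = ~ 1 | ID,
    data=.data, family="binomial")
> summary(glmmPQL3)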
Random effects:
Formula: ~1 | ID
(Intercept) Residual
StdDev: 1.347478 0.7881903
Variance function:
Structure: fixed weights
Formula: ~invwt
Fixed effects: infected ~ ap + week
Value Std.Error DF t-value p-value
(Intercept) 2.0352357 0.3816667 169 5.332495 0.0000
app 1.0082124 0.5326217 48 1.892924 0.0644
week -0.1450321 0.0390851 169 -3.710677 0.0003
Correlation:
(Intr) app
app -0.485
week -0.536 -0.047
The coefficients are fairly different from those obtained using GEE in the previous
chapter; however, the conclusion is the same. Treatment is no better than placebo for
controlling infection, and the probability of infection decreases with time.
Modelling using the lmer function gives similar results.
> lmer1 <- lmer(infected ~ ap + week + (1|ID), family="binomial",
data=.data)
> summary(lmer1)
Generalized linear mixed model fit by the Laplace approximation
Formula: infected ~ ap + week + (1 | ID)
Data: .data
AIC BIC logLik deviance
206.4 220 -99.2 198.4
Random effects:
Groups Name Variance Std.Dev.
ID (Intercept) 1.4012 1.1837
Number of obs: 220, groups: ID, 50
Fixed effects:
Estimate Std. Error z value Pr(>|z|)
(Intercept) 2.09745 0.41334 5.074 3.89e-07 ***
app 1.07571 0.55431 1.941 0.05230 .
week -0.14440 0.04833 -2.988 0.00281 **
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Note that when the outcome variable is dichotomous, the lmer function also
provides z values and p-values. 95% confidence intervals of the coefficients thus
can be computed.
> coefs <- attr(summary(lmer1), "coefs")
> coefs
Estimate Std. Error z value Pr(>|z|)
(Intercept) 2.0974492 0.41333783 5.074419 3.886823e-07
app 1.0757068 0.55431021 1.940622 5.230411e-02
week -0.1444006 0.04833126 -2.987726 2.810615e-03
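For example, approximate 95% confidence intervals, and the corresponding odds
ratios, could be obtained from this matrix as follows (a sketch, not part of the
original output):
> ci <- cbind(lower = coefs[, "Estimate"] - 1.96 * coefs[, "Std. Error"],
    upper = coefs[, "Estimate"] + 1.96 * coefs[, "Std. Error"])
> exp(ci)   # odds ratios with approximate 95% confidence limits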
One advantage of using lmer in hypothesis testing is the availability of the anova
function.
There are 3 levels of treatment. In order to determine whether or not there is any
treatment effect we can try the following model.
> lmer2 <- lmer(infected ~ trt+week + (1|ID), family="binomial",
data=.data)
> summary(lmer2)
The output indicates that there is not enough evidence to show that treatment (with
or without encouragement to comply) has a significant effect on infection.
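To use the anova function mentioned above for an overall (2 degrees of freedom) test
of the three-level treatment variable, one possibility (a sketch; this comparison is not
shown in the original) is to compare 'lmer2' with a model that omits treatment
altogether:
> lmer0 <- lmer(infected ~ week + (1|ID), family="binomial", data=.data)
> anova(lmer0, lmer2)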
In summary, this chapter has shown that generalized linear mixed effects
modelling, in which the outcome variable may be continuous or dichotomous, can
be performed with glmmPQL in the MASS package, and lmer in the lme4
package with similar results. The lmer command has the advantage of being able
to test whether the random effects should be simple random intercepts or random
slopes as well. It is also useful for testing hypotheses about variables with more than
two levels of exposure. However, package lme4, which contains the lmer
function, is still under development. Some changes are expected in future versions.
Exercise
Load the BMD data from Epicalc. It contains data from a clinical trial of three doses
of a drug which is thought to affect bone density levels in post-menopausal women.
Treatment 1 is the lowest dosage of the drug and Treatment 2 is the highest dosage.
Because it is always preferable to have patients on the lowest effective dosage of a
drug, the interest in this trial is focussed on whether Treatment 1 is significantly
different from Treatments 2 and 3.
Bone mineral densities were measured at the start of the trial, and at 12 and 24
months after the trial commenced.
1. Is there a beneficial treatment effect on bone density at the hip, after correcting
for the covariates given?
2. Anecdotal evidence is that a side-effect of the treatment is a gain in weight (or
increase in BMI). Do these data provide evidence for this theory?
Chapter 9: Transition models
This chapter discusses the last type of modelling of longitudinal data: transition
models.
As the name suggests, transition, the change of status from one time point to the
next, is the centre of interest. We have already explored and analysed the
bacteria data set, where the time points did not increase regularly. In this
chapter, let's try to analyse a more complicated data set – Xerop, which is
concerned with the relationship between vitamin A deficiency and respiratory
infection among children.
> library(epicalc)
> zap()
> data(Xerop)
> use(Xerop)
> des()
(subset)
No. of observations = 1200
Variable Class Description
1 id integer
2 respinfect integer
3 age.month integer
4 xerop integer
5 sex factor
6 ht.for.age integer
7 stunted integer
8 time integer
9 baseline.age integer
10 season factor
> summ()
(subset)
No. of observations = 1200
6 ht.for.age 1200 0.91 1 5.85 -23 25
7 stunted 1200 0.12 0 0.33 0 1
8 time 1200 3.42 3 1.76 1 6
9 baseline.age 1200 -4.05 -3 19.63 -32 44
10 season 1200 2.488 2 0.922 1 4
There are 275 children in the data set. Check for missing and duplicate visits.
> T <- table(id, time)
> sum(T == 0)
[1] 452
> sum(T > 1)
[1] 2
There are 452 missed visits, and there are 2 records in which the combination of 'id'
and 'time' is duplicated. We can use the following command to list the ids of the
duplicated records:
> id[which(duplicated(cbind(id, time)))]
[1] 161013 161013
Now we can list the records of the child whose 'id' is equal to 161013. (The output
has been edited to fit on the page).
> .data[id==161013,]
id respinf age.month xerop sex ht.for.age stunted time base.age
496 161013 0 -1 0 0 2 0 1 -1
497 161013 0 2 0 0 3 0 2 -1
498 161013 0 5 0 0 2 0 3 -1
499 161013 0 8 0 0 3 0 4 -1
500 161013 0 11 0 0 2 0 1 11
501 161013 0 14 1 0 1 0 2 11
The duplicated records are rows 500 and 501 where 'time' of 1 and 2 is repeated
from rows 496 and 497. This is likely to be a human error arising during data entry.
Inspection of the 'age.month' variable, which has a constant increment of 3, would
suggest that the 'time' variable for the duplicated records should be changed to 5
and 6, respectively. This subject was vitamin A deficient ('xerop' = 1) at the last
visit and had a lower height for age but did not yet have stunted growth. Their
baseline age was -1 in the first four visits but noted to be 11 in the last two. We now
have the dilemma of either deleting these two records or changing the times to 5
and 6. Let's choose the first option just for illustration purposes.
> data.new <- .data[-c(500,501),]
> use(data.new)
> anyDuplicated(cbind(id, time)) # final check
[1] 0
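The output below appears to be a crude, cross-sectional comparison of current
respiratory infection with current vitamin A deficiency. The command is not shown
in this excerpt; presumably it was Epicalc's cc function:
> cc(respinfect, xerop)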
OR = 1.55
95% CI = 0.58 3.58
Chi-squared = 1.13 , 1 d.f. , P value = 0.288
Fisher's exact test (2-sided) P value = 0.323
We expect that it should take some amount of time before vitamin A deficiency
could have any effect on infection. To create lag variables for respiratory infection
and vitamin A deficiency we use the lagVar command from epicalc.
> respinfect.lag1 <- lagVar(respinfect, id=id, visit=time, lag.unit=1)
> xerop.lag1 <- lagVar(xerop, id=id, visit=time, lag.unit=1)
> pack()
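The next block of output presumably comes from cross-tabulating current infection
against vitamin A deficiency at the preceding visit (the command itself is not shown):
> cc(respinfect, xerop.lag1)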
OR = 3.13
95% CI = 1.19 7.35
Chi-squared = 8.26 , 1 d.f. , P value = 0.004
Fisher's exact test (2-sided) P value = 0.011
The risk of infection is 3 times higher if the child had vitamin A deficiency in the
preceding visit. We should check whether this association is confounded by
preceding infection status using the Mantel-Haenszel method.
> mhor(respinfect, xerop.lag1, respinfect.lag1)
Stratified analysis by respinfect.lag1
OR lower lim. upper lim. P value
respinfect.lag1 0 3.39 1.1886 8.46 0.01145
respinfect.lag1 1 1.65 0.0305 19.45 0.52669
M-H combined 3.03 1.3348 6.89 0.00534
The adjusted odds ratio (3.03) and the crude odds ratio (3.13) are quite close,
indicating minimal confounding by past infection.
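The fragment of logistic regression output below relates current infection to the
lagged exposures. The model is not shown in this excerpt; judging from the rows
displayed and from 'glm2' further down, it would presumably be:
> glm1 <- glm(respinfect ~ xerop.lag1 + respinfect.lag1, family="binomial",
    data=.data)
> logistic.display(glm1)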
respinfect.lag1: 1 vs 0
1.88 (0.92,3.84) 1.82 (0.88,3.75) 0.104 0.124
Log-likelihood = -237.9799
No. of observations = 855
AIC value = 481.9597
This model suggests that during the transition from previous visit to current visit,
susceptibility to respiratory infection is enhanced by vitamin A deficiency and not
by the presence of the preceding infection.
We can add other putative risk factors to see if the odds ratio changes.
> glm2 <- glm(respinfect ~ xerop + xerop.lag1 + respinfect.lag1 +
season + sex + age.month, family="binomial", data=.data)
> logistic.display(glm2)
The variables 'xerop' and 'sex' are not significant and so we run a new model with
them omitted.
> glm3 <- glm(respinfect ~ xerop.lag1 + respinfect.lag1 + season +
age.month, family="binomial", data=.data)
> logistic.display(glm3, decimal=1)
season: ref.=1
< 0.001
2 4.0 (1.7,9.5) 4.6 (1.9,11.2) < 0.001
3 1.9 (0.8,4.3) 1.9 (0.8,4.6) 0.133
4 1.3 (0.5,3.6) 1.3 (0.5,3.7) 0.588
age.month (cont. var) 0.98 (0.97,0.99) 0.97 (0.96,0.99) <0.001 < 0.001
Log-likelihood = -223.9393
No. of observations = 855
AIC value = 461.8786
Both season and age are significant determinants of respiratory infection. With these
variables in the model, the odds ratio for preceding vitamin A deficiency increases
from 3.1 to 4.2
and the odds ratio of preceding infection slightly decreases from 1.8 to 1.6, which is
again not significant.
In summary, transition models are suitable for cohort studies containing a regular
follow-up interval and changing exposure and outcome status. Transition modelling
is statistically relatively simple but needs careful data management and exploration.
Keeping preceding outcome status in the model ensures that the 'carry-over' effects
are adjusted for and the problem of correlation over time is taken care of.
Demonstrating that preceding exposure is associated with current health outcome
provides stronger logic of causation than what is found in usual cross-sectional
data.
Exercise
Analyse the bacteria data set from the MASS package.
Further Reading
Solutions to exercises
Chapter 1
> library(epicalc)
> des(Theoph) # This dataset has a "lazy loading" attribute
> help(Theoph)
The help page describes the variables in this dataset. The data appear to be in long
format, since there are no repeating variables, and there is a "time" variable. To
confirm this, we can determine if the Subject id is duplicated.
> head(Theoph, 15)
Grouped Data: conc ~ Time | Subject
<environment: R_EmptyEnv>
Subject Wt Dose Time conc
1 1 79.6 4.02 0.00 0.74
2 1 79.6 4.02 0.25 2.84
3 1 79.6 4.02 0.57 6.57
4 1 79.6 4.02 1.12 10.50
5 1 79.6 4.02 2.02 9.66
6 1 79.6 4.02 3.82 8.58
7 1 79.6 4.02 5.10 8.36
8 1 79.6 4.02 7.03 7.47
9 1 79.6 4.02 9.05 6.89
10 1 79.6 4.02 12.12 5.94
11 1 79.6 4.02 24.37 3.28
12 2 72.4 4.40 0.00 0.00
13 2 72.4 4.40 0.27 1.72
14 2 72.4 4.40 0.52 7.91
15 2 72.4 4.40 1.00 8.31
> use(Theoph)
> any(duplicated(Subject))
[1] TRUE
The variable "Time" appears to be inconsistent across subjects, as can be seen from:
> tab1(Time)
Reshaping this data to wide form using this "time" variable is possible, but would
result in a useless data set.
> Theoph.wide <- reshape(Theoph, direction="wide", idvar="Subject",
timevar="Time", v.names="conc")
Chapter 2
> library(epicalc)
> use(Theoph)
> des()
> summ()
> table(Subject)
Subject
6 7 8 11 3 2 4 9 12 10 1 5
11 11 11 11 11 11 11 11 11 11 11 11
For a small data set such as this one, we can easily see that there are 12 subjects,
each having 11 records. For large data sets, the following may be better.
> length(unique(Subject))
[1] 12
> tab1(table(Subject))
table(Subject) :
Frequency Percent Cum. percent
11 12 100 100
Total 12 100 100
Assess the timing of drug measurements using the tab1 and summ commands.
> tab1(Time)
Time :
Frequency Percent Cum. percent
0 12 9.1 9.1
0.25 5 3.8 12.9
0.27 3 2.3 15.2
0.3 2 1.5 16.7
0.35 1 0.8 17.4
0.37 1 0.8 18.2
====== remaining lines omitted ======
> table(Time, Subject) # output not shown
> summ(Time)
[Figure: dot plot of the distribution of Time, 0 to 25 hours]
The jittering of the stacks of points indicates that the time of drawing blood was not
perfectly synchronised for all subjects. It appears as if some attempt was made to
draw the blood at specific intervals for each subject, namely at 15 and 30 minutes,
and then at 1, 2, 3.5, 5, 7, 9, 12 and 24 hours after the start of the study, however
this was not achieved exactly.
> followup.plot(id=Subject, time=Time, outcome=conc, xlab="Time (hrs)",
ylab="Concentration (mg/L)", las=1)
> title(main="Pharmacokinetics of theophylline")
[Figure: Pharmacokinetics of theophylline; Concentration (mg/L) against Time (hrs)]
The concentration rises sharply after the first dose, then drops gradually over time.
> coplot(conc~Time|Subject, panel=lines, type="b", data=Theoph)
[Figure: coplot of conc against Time, conditioned on Subject]
Multicoloured lines can be achieved as follows (graph not shown).
> followup.plot(Subject, Time, conc, line.col="multicolour")
For the examination of the subject's weights over time, we can use the aggregate
command. If the standard deviation of each subject’s weight is zero, then that
would indicate stability.
> aggregate(Wt, by=list(Subject=Subject), FUN=sd)
Subject sd.Wt
1 6 0
2 7 0
3 8 0
4 11 0
5 3 0
6 2 0
7 4 0
8 9 0
9 12 0
10 10 0
11 1 0
12 5 0
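The grouping variable 'Wt.gp' used in the following plots is not created in this
excerpt. Presumably the subjects' weights were dichotomised at 70 kg, for example:
> Wt.gp <- cut(Wt, breaks=c(0, 70, Inf), labels=c("<70 kg", "70+ kg"))
> pack()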
[Figure: Pharmacokinetics of theophylline; conc against Time for the <70 kg and 70+ kg groups]
A comparison may perhaps be better visualised by aggregating the concentrations
for each subject within the two weight groups at suitably chosen time points.
> aggregate.plot(conc, by=Time, group=Wt.gp, lwd=2, lty=c(1,3), las=1)
> title(ylab="Concentration (mg/L)", xlab="Time (hour)")
Note that because the time of drawing blood is not exact for each subject, the
aggregate.plot command will group the time points into 4 "bins" by default.
You may like to experiment with the "bin" arguments to see what effect they have
on the graph.
> aggregate.plot(x=conc, by=Time, group=Wt.gp, bin.method="quantile")
> aggregate.plot(x=conc, by=Time, group=Wt.gp, bin.time=11,
bin.method="fixed")
In order to use the scheduled times, which were 15 minutes, 30 minutes, 1 hour and
then 2, 3, 5, 7, 9, 12 and 24 hours after drug administration, a new vector is needed.
> visit <- markVisits(Subject, Time)
> pack()
> aggregate.plot(conc, by=visit, group=Wt.gp)
In this graph, the distances between the time points do not reflect the actual times in
hours. We need to change the visit times.
> scheduled.visit <- c(0,0.25,0.5,1,2,3,5,7,9,12,24)
> recode(visit, 1:11, scheduled.visit)
> aggregate.plot(conc, by=visit, group=Wt.gp, lwd=2, lty=c(1,3), las=1)
[Figure: aggregated concentration against scheduled visit time for the <70 kg and 70+ kg groups]
Those weighing over 70kg start to have lower theophylline concentrations after 2
hours, but the difference is negligible at the end of the study.
Chapter 3
> zap()
> use(Sitka)
> auc.data <- auc(conc=size, time=Time, id=tree)
> treat.data <- reshape(Sitka, direction="wide", idvar="tree",
timevar="Time", v.names="size")[,1:2]
Trees treated with ozone have a lower AUC on average; however, the difference is
not significant.
Chapter 4
> zap()
> use(Xerop)
> des(); summ()
> table(table(id, time))
0 1 2
452 1196 2
The output indicates that there are 452 missing records; however, there are also 2
duplicates. These must be removed before continuing the analysis.
> id.dup <- id[duplicated(cbind(id, time))]
> .data[id %in% id.dup,]
> keepData(subset=!duplicated(cbind(id,time)))
> sortBy(id, time)
> visit <- markVisits(id, time)
> pack()
> table(time, visit)
The newly created 'visit' variable is not consistent with the 'time' variable due to the
missing records.
> zap()
> use(Sitka)
> tmp <- by(.data, INDICES=tree,
FUN=function(x) lm(size ~ Time+I(Time^2), data=x))
> tree.coef <- sapply(tmp, FUN=coef)
> tree.growth <- as.data.frame( t(tree.coef) )
> use(tree.growth)
> des()
> tableStack(vars=2:4, by=treat, decimal=2)
control ozone Test stat. P value
(Intercept) Ranksum test 0.5728
median(IQR) -0.38 (-3.21,0.53) -1.01 (-2.51,0.16)
Chapter 5
> zap()
> use(Xerop)
> des(); summ()
> table(time)
time
1 2 3 4 5 6
230 214 177 183 195 201
> length(unique(id))
[1] 275
The visit times are not evenly distributed. Of the 275 subjects, only 230 came to the
first visit (time=1). Subsequent visits are imbalanced. The two duplicates are now
removed.
> keepData(subset=!duplicated(cbind(id,time)),
select=c(id, baseline.age, sex, respinfect, xerop, stunted, time))
> Xerop.wide <- reshape(.data, idvar="id", v.names=c("respinfect",
"xerop", "stunted"), timevar="time", direction="wide")
> summ(Xerop.wide)
> table(time)
Note that the first 3 variables ('id', 'baseline.age' and 'sex') all have 275 non-missing
values, since we omitted them from the 'v.names' argument of the reshape command.
The others have varying numbers of missing values, and the frequencies should
match those from the last tabulation of the 'time' variable.
> with(Xerop.wide, addmargins(table(respinfect.1, respinfect.2)))
respinfect.2
respinfect.1 0 1 Sum
0 162 8 170
1 21 2 23
Sum 183 10 193
> with(Xerop.wide, mcnemar.test(table(respinfect.1, respinfect.2)))
The change from visit 2 to visit 3 is not significant, as evidenced by the following
commands.
> with(Xerop.wide, addmargins(table(respinfect.2, respinfect.3)))
respinfect.3
respinfect.2 0 1 Sum
0 148 8 156
1 4 1 5
Sum 152 9 161
> with(Xerop.wide, mcnemar.test(table(respinfect.2, respinfect.3)))
Continuing in this fashion, you will discover that from visit 4 to visit 5, respiratory
infection actually increases. Vitamin A deficiency (xerop) does not change
significantly during any of the transitional periods. Stunting only changes
significantly between visits 3 and 4.
Chapter 6
> zap()
> data(Xerop)
> use(Xerop)
> id.dup <- id[duplicated(cbind(id, time))]
> .data[id %in% id.dup,]
> keepData(subset=!duplicated(cbind(id,time)))
> Xerop.all <- addMissingRecords(.data, id, time,
outcome=c("season","respinfect", "xerop","stunted"))
> use(Xerop.all)
> summ()
> tableStack(vars=season, by=present)
0 1 Test stat. P value
season Chisq. (3 df) = 22 < 0.001
1 92 (20.4) 183 (15.3)
2 126 (27.9) 424 (35.4)
3 136 (30.1) 414 (34.6)
4 98 (21.7) 177 (14.8)
Seasons 1 and 4 had significantly lower attendance rates than the other 2 seasons.
> sortBy(id, time)
> present.next <- lagVar(present, id, time, lag = -1)
> pack()
> logistic.display(glm(present.next ~ xerop+respinfect, data=.data,
family=binomial))
Log-likelihood = -407.6334
No. of observations = 997
AIC value = 821.2667
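The second block of output, based on all 1650 records, presumably comes from
modelling the attendance indicator itself on season and baseline age (the command is
not shown in the excerpt):
> logistic.display(glm(present ~ season + baseline.age, data=.data,
    family=binomial))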
Log-likelihood = -952.3859
No. of observations = 1650
AIC value = 1916.7717
Both season and age at baseline are significant predictors of non-attendance. As age
increases, the chance of attending decreases.
Chapter 7
> library(geepack)
> library(epicalc)
> zap()
> data(respiratory)
> use(respiratory)
> des(); summ()
> table(table(id, visit))
The output is interesting. According to the help page, two centers were involved in
this study. Each center assigned a running number for the id, thus explaining the
duplicates. We need to create a new id variable.
> id2 <- paste(center,id,sep="")
> label.var(id2, "patient ID")
> table(id2, visit)
Now there are no duplicates. This dataset is not exactly in wide or long form. The
outcome is respiratory status, which was measured on the 111 patients at baseline
and at 4 subsequent visits. Thus the outcome is saved in two separate variables, and
we need to reshape the dataset so that it is contained in only one variable. First
create a dataset containing just the baseline records.
> Baseline <- .data[visit==1,]
Next, change the values of the 'visit' variable all to 0 and create a new outcome
variable called 'resp'.
> Baseline$visit <- 0
> Baseline$resp <- Baseline$baseline
Finally, do the same to the original dataset and append them together.
> .data$resp <- .data$outcome
> data <- rbind(Baseline, .data)
> use(data)
> sortBy(id2, visit)
> head(.data,30)
There should now be 555 records, with 111 patients having their respiratory status
measured at 5 visits. Modelling can now proceed.
> resp.ex <- geeglm(outcome~treat+center+age+sex, id=id2, data=.data,
family="binomial", corstr = "exchangeable")
> summary(resp.ex)
Only treatment and center are significant. Patients from center "2", as well as those
given the active treatment, were more likely to have a "good" respiratory status.
Note that the coding of the outcome variable is 1 = good, 0 = poor.
Chapter 8
Chapter 9
> library(geepack)
> library(epicalc)
> zap()
> data(bacteria, package="MASS")
> use(bacteria)
> des(); summ()
> sortBy(ID, week)
> visit <- markVisits(ID, week)
> pack()
> table(week, visit)
Note the differences between the two variables, due to the missing records.
> cc(y, ap)
The risk of infection is 2.3 times higher for the placebo group.
> y.lag1 <- lagVar(y, ID, visit)
> pack()
> mhor(y, ap, y.lag1)
The risk of bacterial infection is still 2.3 times higher in the placebo group after
adjusting for previous infection. We conclude that the active treatment is protective
against the bacteria regardless of whether the child was infected in the preceding
visit or not.
> logistic.display(glm(y ~ y.lag1+ap, family=binomial, data=.data))
Note that the crude odds ratio for 'ap' is not the same as above using the cc
command. This is because the records of the first visit do not have preceding
infection status and are therefore included neither in the adjusted analysis nor in the
logistic regression model. Checking for a confounding effect using the glm command
is thus more accurate than comparing the results from the cc and
mhor commands. In this case there is moderate confounding by previous infection
status.
> logistic.display(glm(y ~ y.lag1+trt, family=binomial, data=.data))
There are three treatment groups now. One group was given no active treatment at
all (placebo). The second and third groups were given the active treatment
and further randomised to receive active encouragement to comply with treatment
(drug+) or not (drug).
The effect (OR) of 'y.lag1' is similar to the model which included 'ap'. However,
while the likelihood ratio test (LR-test) for 'ap' is significant (p=0.038), it is not
for 'trt' (p=0.096). This is because the risks for the "drug" and "drug+" groups
are rather close. The LR-test reports the effect of 'trt' as a whole. It tells us that there
is not sufficient evidence that the three treatment groups have a different risk of
infection after adjusting for preceding infection. The Wald tests of this set of
variables tell a slightly different story. The group receiving treatment with low
compliance (drug) has a significantly lower risk of infection compared to the placebo
group, but treatment with high compliance is not better than placebo (one may
wonder if the data contain wrong coding). However, the odds ratio is still fairly
strong in the protective direction (0.5), with quite a wide 95% CI. We conclude that
the sample size (170 valid subjects) may not be large enough.