Introduction to
Statistical Methods
for Clinical Trials
Edited by
Thomas D. Cook
David L. DeMets
This book contains information obtained from authentic and highly regarded sources. Reasonable efforts have been made to publish reliable data and
information, but the author and publisher cannot assume responsibility for the validity of all materials or the consequences of their use. The authors and
publishers have attempted to trace the copyright holders of all material reproduced in this publication and apologize to copyright holders if permission
to publish in this form has not been obtained. If any copyright material has not been acknowledged please write and let us know so we may rectify in any
future reprint.
Except as permitted under U.S. Copyright Law, no part of this book may be reprinted, reproduced, transmitted, or utilized in any form by any electronic,
mechanical, or other means, now known or hereafter invented, including photocopying, microfilming, and recording, or in any information storage or
retrieval system, without written permission from the publishers.
For permission to photocopy or use material electronically from this work, please access www.copyright.com (http://www.copyright.com/) or contact
the Copyright Clearance Center, Inc. (CCC), 222 Rosewood Drive, Danvers, MA 01923, 978-750-8400. CCC is a not-for-profit organization that provides
licenses and registration for a variety of users. For organizations that have been granted a photocopy license by the CCC, a separate system of payment
has been arranged.
Trademark Notice: Product or corporate names may be trademarks or registered trademarks, and are used only for identification and explanation
without intent to infringe.
Visit the Taylor & Francis Web site at
http://www.taylorandfrancis.com
and the CRC Press Web site at
http://www.crcpress.com
Contents
List of figures xi
List of tables xv
Preface xix
3 Study Design 75
3.1 Early Phase Trials 76
3.2 Phase III/IV Trials 85
3.3 Non-inferiority Designs 101
3.4 Screening, Prevention, and Therapeutic Designs 106
3.5 Adaptive Designs 109
3.6 Conclusions 112
3.7 Problems 112
5 Randomization 141
5.1 The Role of Randomization 141
5.2 Fixed Randomization Procedures 148
5.3 Treatment- and Response-Adaptive Randomization Procedures 155
5.4 Covariate-Adaptive Randomization Procedures 161
5.5 Summary and Recommendations 165
5.6 Problems 168
References 405
Index 427
List of figures
8.1 Longitudinal data for which equation (8.1) does not hold. 234
8.2 Ramus height of 3 boys measured at 8, 8.5, 9, and 9.5 years
of age. 235
8.3 Ramus height of 3 boys. 238
8.4 Ramus height of all 20 boys. 241
8.5 Marginal and conditional residuals. 245
8.6 Ramus height data, random effects fitted coefficients β̂ + b̂i . 247
8.7 Ramus height data, standardized conditional residuals. 248
8.8 Ramus height data and fitted curves for conditional indepen-
dence model. 249
8.9 Ramus height data and fitted curves for general conditional
correlation model. 250
8.10 Bone density measured at 10 or 11 times per subject. 257
List of tables
2.1 Baseline and three-month LDL levels in TNT for subjects with
values at both time points. 31
2.2 Differences in three-month LDL levels in TNT as a function of
baseline LDL. 35
2.3 Power for Wilcoxon and t-test for LDL change in TNT. 55
2.4 Simple example of interaction between treatment, mortality,
and a nonfatal outcome. 67
2.5 Simple example of interaction between treatment, mortality,
and a nonfatal outcome. 68
2.6 Asymptotic variances for three approaches to the use of
baseline values. 72
2.7 Power for treatment difference in TNT using 6 different
analyses. 72
6.2 A subset of the MedDRA coding system. 191
Preface
hypothesis testing instruments. While inference beyond simple tests of the
primary and secondary hypotheses is clearly essential for a complete under-
standing of the results, we note that virtually all design features of an RCT are
formulated with hypothesis testing in mind. Some of the material, especially in
Chapter 8, Longitudinal Data, and Chapter 9, Quality of Life, is unavoidably
focused on complex model-based inference. Even in the simplest situations,
however, estimation of a “treatment effect” is inherently model-based, depen-
dent on implicit model assumptions, and even the most well-conducted trials are
subject to biases that require that point estimates and confidence intervals
be viewed cautiously. Inference beyond the population enrolled and treated
under the circumstances of a carefully conducted trial is precarious—while it
may be safe to infer that treatment A is superior to treatment B based on the
result of RCTs (a conclusion based on a hypothesis test), it is less so to infer
that the size of the effect seen in an RCT (even if it could be known without
error) would be realized once a treatment is adopted in common practice.
Thus, the third overarching philosophical perspective that we adopt is that
the results of RCTs are best understood through the application of sound
statistical principles, such as intention-to-treat (ITT), followed by interpretation rooted in clinical
and scientific understanding. By this we mean that, while many scientific
questions emerge in the analysis of trial data, a large proportion of these have
no direct statistical answer. Nonetheless, countless “exploratory” analyses are
performed, many of which deviate from sound statistical principles and either
do not contribute to scientific understanding, or are in fact misleading. Our
belief is that researchers, and especially statisticians, need to understand the
inherent limitations of clinical studies and thoughtfully conduct analyses that
best answer those questions for which RCTs are suited.
Chapter 1 introduces the clinical trial as a research method and many of
the key issues that must be understood before the statistical methods can
take on meaning. While this chapter contains very little technical material,
many of the issues have implications for the trial statistician and are critical
for statistical students to understand. Chapter 12, the last chapter, discusses
the importance of the manner in which results of a trial are presented. In
between, there are 10 chapters presenting various statistical topics relevant to
the design, monitoring, and analysis of a clinical trial.
The material presented here is intended as an introductory course that should be accessible to master's degree students and of value to PhD graduate
students. There is more material than might be covered in a one-semester
course and so careful consideration regarding the amount of detail presented
will likely be required.
The editors are grateful to our department colleagues for their contributions,
and to a graduate student, Charlie Casper, who served as editorial assistant
throughout the development of the text. His involvement was instrumental in
its completion. In addition to the editors and contributors, we are grateful for
helpful comments that have been received from Adin-Cristian Andrei, Murray
Clayton, Mary Foulkes, Anastasia Ivanova, and Scott Diegel.
We also note that most of the data analysis and the generation of graphics in
this book was conducted using R (R Development Core Team 2005) statistical
software.
Thomas Cook
David DeMets
Madison, Wisconsin
July, 2007
Author Attribution
In preparing this text, the faculty listed below in the Department of Bio-
statistics and Medical Informatics took responsibility for early drafts of each
chapter, based on their expertise and interest in statistical methodology and
clinical trials. The editors, in addition to contributing to individual chapters,
revised chapters as necessary to provide consistency across chapters and to
mold the individual chapters into a uniform text. Without the contribution
of these faculty to drafting these chapters, this text would not have been
completed in a timely fashion, if at all.
Chapter                                       Authors
 1  Introduction to Clinical Trials           DeMets, Fisher
 2  Defining the Question                     Cook, Casper
 3  Study Design                              DeMets, Chappell, Casper
 4  Sample Size                               Cook
 5  Randomization                             Casper, Chappell
 6  Data Collection and Quality Control       Bechhofer, Feyzi, Cook
 7  Survival Analysis                         Cook, Kim
 8  Longitudinal Data                         Lindstrom, Cook
 9  Quality of Life                           Eickhoff, Koscik
10  Data Monitoring and Interim Analysis      Kim, Cook, DeMets
11  Selected Issues in the Analysis           DeMets, Cook, Roecker
12  Closeout and Reporting                    DeMets, Casper
Contributors
Robin Bechhofer, BA, Researcher
T. Charles Casper, MS, Research Assistant and Graduate Student
Richard Chappell, PhD, Professor of Biostatistics and Statistics
Thomas Cook, PhD, Senior Statistical Scientist
David DeMets, PhD, Professor of Biostatistics and Statistics
Jens Eickhoff, PhD, Statistical Scientist
Jan Feyzi, MS, Researcher
Marian R. Fisher, PhD, Research Professor
Kyungmann Kim, PhD, Professor of Biostatistics
Rebecca Koscik, PhD, Statistical Scientist
Mary Lindstrom, PhD, Professor of Biostatistics
Ellen Roecker, PhD, Senior Statistical Scientist
CHAPTER 1

Introduction to Clinical Trials
Clinical trials have become an essential research tool for the evaluation of the
benefit and risk of new interventions for the treatment or prevention of disease.
Clinical trials represent the experimental approach to clinical research. Take,
for example, the modification of risk factors for cardiovascular disease. Large
observational studies such as the Framingham Heart Study (Dawber et al.
1951) indicated a correlation between high cholesterol, high blood pressure,
smoking, and diabetes with the incidence of cardiovascular disease. Focusing
on high cholesterol, basic researchers sought interventions that would lower
serum cholesterol. While interventions were discovered that lowered choles-
terol, they did not demonstrate a significant reduction in cardiovascular mor-
tality.1 Finally, in 1994, a trial evaluating a member of the statin class of
drugs demonstrated a reduction in mortality (Scandinavian Simvastatin Survival Study 1994). With data from well-controlled clinical trials, an effective
and safe intervention was identified. Sometimes interventions can be adopted
without good evidence and even become widely used. One case was hormone replacement therapy (HRT), which is used to treat symptoms in postmenopausal women and is also known to reduce bone loss in these women, leading to reduced bone fracture rates. HRT also reduces serum cholesterol, leading to the belief that it should also reduce cardiovascular mortality and
morbidity. In addition, large observational studies have shown lower cardiovas-
cular mortality for women using HRT than for those not using HRT (Barrett-
Connor and Grady 1998). These observations led to a widespread use of HRT
for the prevention of cardiovascular mortality and morbidity as well as the
other indications. Subsequently, two trials evaluated the benefits of HRT in
postmenopausal women: one trial in women with existing cardiovascular dis-
ease and a second without any evident disease. The first trial, known as HERS,
demonstrated no benefit and suggested a possible risk of thrombosis (i.e.,
blood clots) (Grady et al. 1998). The second trial, known as the Women’s
Health Initiative, or WHI, demonstrated a harmful effect due to blood clot-
ting and no cardiovascular benefit.2 These trials contradicted evidence derived
from non-randomized trials and led to a rapid decline in the use of HRT for
purposes of reducing cardiovascular disease. HRT is still used when indicated
for short-term symptom relief in postmenopausal women.
1 The Coronary Drug Project Research Group (1975), The Lipid Research Clinics Program
(1979)
2 Writing Group for the Women’s Health Initiative Randomized Controlled Trial (2002)
Incomplete understanding of the biological mechanism of action can some-
times limit the adoption of potentially effective drugs. A class of drugs known
as beta-blockers was known to be effective for lowering blood pressure and re-
ducing mortality in patients suffering a heart attack. Since these drugs lower
blood pressure and lower heart rate, scientists believed these drugs should not
be used in patients with heart failure. In these patients, the heart does not
pump blood efficiently and it was believed that lowering the heart rate and
blood pressure would make the problem worse. Nonetheless, a series of trials
demonstrated convincingly an approximate 30% reduction in mortality.3 An
effective therapy was ignored for a decade or more because of belief in a mech-
anistic theory without clinical evidence. Thus, clinical trials play the critical
role of sorting out effective and safe interventions from those that are not.
The fundamental principles of clinical trials are heavily based on statistical
principles related to experimental design, quality control, and sound analysis.
No analytical methods can rescue a trial with poor experimental design, and the conclusions from a trial with proper design can be invalid if sound analyti-
cal principles are not adhered to. Of course, collection of appropriate and high
quality data is essential. With this heavy reliance on statistical principles, a
statistician must be involved in the design, conduct, and final analysis phases of a trial. A statistician cannot wait until after the data have been
collected to get involved with a clinical trial. The principles presented in this
text are an introduction to important statistical concepts in design, conduct,
and analysis.
In this chapter, we shall briefly describe the background and rationale for
clinical trials, and their relationship to other clinical research designs as well
as defining the questions that clinical trials can best address. For the purposes
of this text, we shall define a clinical trial to be a prospective study evaluating the effect of an intervention in humans. The intervention may be a drug,
biologic (blood, vaccine, and tissue, or other products, derived from living
sources such as humans, animals, and microorganisms), device, procedure, or
genetic manipulation. The trial may evaluate screening, diagnostic, preven-
tion, or therapeutic interventions. Many trials, especially those that attempt
to establish the role of the intervention in the context of current medical
practice, may have a control group. These and other concepts will be further
discussed in this chapter and in more detail in subsequent chapters.
First, the historical evolution of the modern clinical trial is presented fol-
lowed by a discussion of the ethical issues surrounding the conduct of clinical
research. A brief review of various types of clinical research is presented em-
phasizing the unique role that clinical trials play. The rationale, need, and the
timing of clinical trials are discussed. The organizational structure of a clini-
cal trial is key to its success regardless of whether the trial is a single-center
trial or a multicenter trial. All of the key design and conduct issues must be
described in a research plan called a trial protocol.
3 The International Steering Committee on Behalf of the MERIT-HF Study Group (1997),
Krum et al. (2006), Packer et al. (2001)
1.1 History and Background
The era of the modern day clinical trial began in the post–World War II
period, beginning with two trials in the United Kingdom sponsored by the
Medical Research Council (1944). The first of these trials was conducted in
1944 and studied treatments for the common cold. The second trial, conducted
in 1948, evaluated treatments for tuberculosis, comparing streptomycin to
placebo. Hill (1971) incorporated many features of modern clinical trials such
as randomization and a placebo-treated control group into this trial (Medical
Research Council 1948).
In the United States, the era of the modern clinical trial probably began
with the initiation of the Coronary Drug Project (CDP)4 in 1965. The CDP
was sponsored by the National Heart Institute (later expanded to be the Na-
tional Heart, Lung, and Blood Institute or NHLBI), one of the major institutes
in the National Institutes of Health (NIH). This trial compared five different
lipid-lowering drugs to a placebo control in men who had survived a recent
heart attack (myocardial infarction). In this study, all patients also received
the best medical care known at that time. Eligible men were randomized to
receive either one of the five drugs or a placebo. They were followed for the
recurrence of a major cardiovascular event such as death or a second heart
attack. Many of the operational principles developed for this trial are still in
use. Shortly after the CDP began, the NHLBI initiated several other large
clinical trials evaluating modifications of major cardiovascular risk factors
such as blood pressure in the Hypertension Detection and Follow-up Pro-
gram (HDFP Cooperative Group 1982), cholesterol in the Coronary Primary
Prevention Trial (The Lipid Research Clinics Program 1979) and simultane-
ous reduction of blood pressure, cholesterol, and smoking in the Multiple Risk
Factor Intervention Trial (Domanski et al. 2002). These trials, all initiated
within a short period of time, established the clinical trial as an important
tool in the development of treatments for cardiovascular diseases. During this
same period, the NHLBI launched trials studying treatments for blood and
lung diseases. The methods used in the cardiovascular trials were applied to
these trials as well.
In 1973, the National Eye Institute (NEI) also began a landmark clinical
trial, the Diabetic Retinopathy Study (DRS) (Diabetic Retinopathy Study
Research Group 1976). Diabetes is a risk factor for several organ systems
diseases including cardiovascular and eye diseases. Diabetes causes progressive
stages of retinopathy (damage to the retina of the eye), ultimately leading
to severe visual loss or blindness. This trial evaluated a new treatment of
photocoagulation by means of a laser device. Many of the concepts of the
CDP were brought to the DRS by NIH statistical staff. Several other trials
were launched by the NEI using the principles established in the DRS (e.g.,
the Early Treatment Diabetic Retinopathy Study (Cusick et al. 2005)).
Other institutes such as the National Cancer Institute (NCI) of the NIH
4 The Coronary Drug Project Research Group (1975)
aggressively used the clinical trial to evaluate new treatments. The NCI es-
tablished several clinical trial networks, or cancer cooperative groups, orga-
nized by either geographic regions (e.g., the Eastern Cooperative Oncology
Group, or ECOG, the Southwest Oncology Group, or SWOG), disease
areas (e.g., the Pediatric Oncology Group, or POG), or treatment modality
(e.g., the Radiation Therapy Oncology Group, or RTOG). By 1990, most dis-
ease areas were using clinical trials to evaluate new interventions. Perhaps the most recent development was the AIDS Clinical Trials Group (ACTG), which was rapidly formed in the late 1980s to evaluate new treatments to ad-
dress a rapidly emerging epidemic of Acquired Immune Deficiency Syndrome
(AIDS) (DeMets et al. 1995). Many of the fundamental principles of trial de-
sign and conduct developed in the preceding two decades were reexamined
and at times challenged by scientific, medical, patient, and political interest
groups. Needless to say, these principles withstood the scrutiny and challenge.
Most of the trials we have mentioned were sponsored by the NIH in the
U.S. or the Medical Research Council (MRC) in the U.K. Industry-sponsored
clinical trials, especially those investigating pharmaceutical agents, evolved
during the same period of time. Large industry-sponsored phase III outcome
trials were infrequent, however, until the late 1980s and early 1990s. Prior to
1990, most industry-sponsored trials were small dose-finding trials or trials
evaluating a physiological or pharmacology outcome. Occasionally, trials were
conducted and sponsored by industry with collaboration from academia. The Anturane Reinfarction Trial (The Anturane Reinfarction Trial Research
Group 1978) was one such trial comparing a platelet active drug, sulfinpyra-
zone (anturane), to placebo in men following a heart attack. Mortality and
cause-specific mortality were the major outcome measures. By 1990 many
clinical trials in cardiology, for example, were being sponsored and conducted
by the pharmaceutical industry. By 2000, the pharmaceutical industry was
spending $2.5 billion on clinical trials compared to $1.5 billion by the
NIH. In addition, standards for the evaluation of medical devices as well as
medical procedures are increasingly requiring clinical trials as a component in
the assessment of effectiveness and safety.
Thus, the clinical trial has been the primary tool for the evaluation of a
new drug, biologic, device, procedure, nutritional supplement, or behavioral
modification. The success of the trial in providing an unbiased and efficient
evaluation depends on fundamental statistical principles that we shall dis-
cuss in this and following chapters. The development of statistical methods
for clinical trials has been a major research activity for biostatisticians. This
text provides an introduction to these statistical methods but is by no means
comprehensive.
Some of the most basic principles now used in clinical trial design and
analysis can be traced to earlier research efforts. For example, an unplanned
natural experiment examining the effect of lemon juice on scurvy in sailors was conducted in 1600 by Lancaster, a ship's captain for the East Indian
Shipping Company (Bull 1959). The sailors on the ships with lemons on board
were free of scurvy in contrast to those on the other ships without lemons.
In 1721, a smallpox experiment was planned and conducted. Smallpox was
an epidemic that caused suffering and death. The sentences of inmates at the
Newgate prison in Great Britain were commuted if they volunteered for inoc-
ulation. All of those inoculated remained free of smallpox. (We note that this
experiment could not be conducted today on ethical grounds.) In 1747,
Lind (1753) conducted a planned experiment on the treatment of scurvy with
a concurrent control group while on board ship. Of 12 patients with scurvy,
ten patients were given five different treatments, two patients per treatment,
and the other two served as a control with no treatment. The two sailors
given fruit (lemons and oranges) recovered. In 1834, Louis (1834) described
the process of keeping track of outcomes for clinical studies of treatment ef-
fect, and the need to take into consideration the patients’ circumstances (i.e.,
risk factors) and the natural history of the disease.
While Fisher (1926) introduced the concept of randomization for agricul-
tural experiments, randomization was first used for clinical research in 1931
by Amberson Jr., McMahon, and Pinner (1931) to study treatments for tuber-
culosis. As already described, Bradford Hill used randomization in the 1948
MRC tuberculosis trial (Hill 1971).
The Nuremberg Code (1947) set out ten principles for the ethical conduct of research in humans:
1. Voluntary consent
2. Experiment to yield results for good of society
3. Experiment based on current knowledge
4. Experiment to avoid all unnecessary suffering
5. No a priori reason to expect death
6. Risk not exceed importance of problem
7. Protect against remote injury possibilities
8. Conduct by scientifically qualified persons
9. Subject free to end experiment at any time
10. Scientist free to end experiment
6 http://grants.nih.gov/grants/guide/notice-files/not98-084.html
7 http://www.fda.gov/cber/guidelines.htm
Table 1.3 Eight basic elements of informed consent (45 CFR 46.116).
Medical research makes progress using a variety of research designs, and each contributes to the base of knowledge regardless of its limitations. The most
common types of clinical research designs are summarized in Table 1.4. The
simplest, and often used, type is the case report or anecdote—a
physician or scientist makes an astute observation of a single event or a single
patient and gains insight into the nature or the cause of a disease. An exam-
ple might be the observation that the interaction of two drugs causes a life
threatening toxicity. It is often difficult, however, to distinguish the effects of a
treatment from those of the natural history of the disease or many other con-
founding factors. Nevertheless, this unplanned anecdotal observation remains
a useful tool. A particularly important example is a case report that linked a
weight reduction drug with the presence of heart valve problems (Mark et al.
1997).
Epidemiologists seek associations between possible causes or risk factors
and disease. This process is necessary if new therapies are to be developed. To
this end, observational studies are typically conducted using a larger number
of individuals than in the small case report series. Identifying potential risk
factors through observational studies can be challenging, however, and the
scope of such studies is necessarily limited (Taubes 1995).
Observational studies can be grouped roughly into three categories (Ta-
ble 1.4), referred to as retrospective, cross-sectional, and prospective. A case-
control study is a retrospective study in which the researcher collects retro-
spective information on cases, individuals with a disease, and controls, indi-
viduals without the disease. For example, the association between lung cancer
Selection Bias Bias affecting the interventions that a patient may receive
or which individuals are entered into the study.
Publication Bias Studies that have significant (e.g., p < 0.05) results are
more likely to be published than those that are not. Thus, knowledge of
literature results gives a biased view of the effect of an intervention.
Recall Bias Individuals in a retrospective study are asked to recall prior
behavior and exposure. Their memory may be more acute after having
been diagnosed with a disease than the control individuals who do not
have the disease.
Ascertainment Bias Bias that comes from a process where one group of in-
dividuals (e.g., intervention group) is measured more frequently or carefully
than the other group (e.g., control).
two have similar effectiveness. Phase IV trials usually follow patients who
have completed phase III trials to determine if there are long-term adverse
consequences.
Phase III trials are also classified according to the process by which a control
arm is selected. Randomized control trials assign patients to either the new
treatment or the standard by a randomization method, described in Chapter 5.
Non-randomized phase III trials can be of two general types. The historical
control trial compares a group of patients treated with the new drug or device
to a group of patients previously treated with the current standard of care. A
concurrent control trial, by contrast, compares patients treated with the new
treatment to another group of patients treated in the standard manner at the
same time, for example, those treated at another medical facility or clinic. As
will be discussed in Chapter 5, the randomized control trial is considered to be
the gold standard, minimizing or controlling for many of the biases to which
other designs are subject. Trials may be single center or multiple center, and
many phase III trials are now multinational.
Trials may also be classified by the nature of the disease process the exper-
imental intervention is addressing. Screening trials are used to assess whether
screening individuals to identify those at high risk for a disease is beneficial,
taking into account the expense and efforts of the screening process. For ex-
ample, a large cancer screening trial is evaluating the benefits of screening for
prostate, lung, colon, and ovarian cancer (Prorok et al. 2000). These trials
must, by nature, be long term to ascertain disease incidence in the screened
and unscreened populations. Screening trials are conducted under the belief
that there is a beneficial intervention available to at-risk individuals once they
are identified. Primary prevention trials assess whether an intervention strategy in a relatively healthy but at-risk population can reduce the incidence of
the disease. Secondary prevention trials are designed to determine whether a
new intervention reduces the recurrence of the disease in a cohort that has
already been diagnosed with the disease or has experienced an event (e.g.,
heart attack). Therapeutic or acute trials are designed to evaluate an inter-
vention in a patient population where the disease is acute or life threatening.
An example would be a trial that uses a new drug or device that may improve
the function of a heart that has serious irregular rhythms.
[Figure 1.2 diagram: policy board; data monitoring committee; coordinating center; central units (labs, etc.); clinical centers with institutional review boards; subjects.]
Figure 1.2 NIH model. Reprinted with permission from Fisher, Roecker, and DeMets (2001). Copyright © 2001, Drug Information Association.
As shown in Figure 1.2, there are several key functional components. All trials
must have a sponsor or funding agency to pay for the costs of the interven-
tion, data collection, and analysis. Funding agencies often delegate the man-
agement of the trial to a steering committee or executive committee, a small
group composed of individuals from the sponsor and the scientific investiga-
tors. The steering committee is responsible for providing scientific direction
and monitoring the conduct of the trial. The steering committee may appoint
working committees to focus on particular tasks such as recruitment, inter-
vention details, compliance to intervention, outcome assessment, as well as
analysis and publication plans. Steering committees usually have a chair who
serves as the spokesperson for the trial. A network of investigators and clinics is
typically needed to recruit patients, apply the intervention and other required
patient care, and to assess patient outcomes. Clinical sites usually will have
a small staff who dedicate a portion of their time to recruit patients, deliver
the intervention, assess patient responses, and complete data collection forms.
For some trials, one or more central laboratories are needed to measure blood
chemistries in a uniform manner or to evaluate electrocardiograms, x-rays,
eye photographs, or tumor specimens. Figure 1.3 depicts a modification of the
NIH clinical trial model that is often used for industry sponsored trials (Fisher
et al. 2001). The major difference is that the data coordinating center opera-
tion depicted in Figure 1.2 has been divided into a data management center
and a statistical analysis center. The data management center may be internal
to the sponsor or contracted to an outside organization. The statistical anal-
ysis center may also be internal or contracted to an external group, often an
academic-based biostatistics group. As described in Chapter 10, careful mon-
[Figure 1.3 diagram: pharmaceutical industry sponsor; steering committee; regulatory agencies; independent data monitoring committee; clinical centers with institutional review boards; subjects.]
Figure 1.3 Industry-modified NIH model. Reprinted with permission from Fisher, Roecker, and DeMets (2001). Copyright © 2001, Drug Information Association.
The ICH efficacy (E-series) guidelines include:
E1A The Extent of Population Exposure to Assess Clinical Safety for Drugs Intended for Long-Term Treatment of Non-Life-Threatening Conditions
E2A Clinical Safety Data Management: Definitions and Standards for Expedited Reporting (continues through E2E)
E3 Structure and Content of Clinical Study Reports
E4 Dose-Response Information to Support Drug Registration
E5 Ethnic Factors in the Acceptability of Foreign Clinical Data
E6 Good Clinical Practice: Consolidated Guidance
E7 Studies in Support of Special Populations: Geriatrics
E8 General Considerations for Clinical Trials
E9 Statistical Principles for Clinical Trials
E10 Choice of Control Group and Related Issues in Clinical Trials
E11 Clinical Investigations of Medicinal Products in the Pediatric Population
E14 Clinical Evaluation of QT/QTc Interval Prolongation and Proarrhythmic Potential for Non-Antiarrhythmic Drugs
The most relevant of these guidelines for statisticians is ICH-E9, covering statistical principles for clinical trials. The guidelines in this document must be considered by statisticians working in the pharmaceutical, biologic, or medical device industries, as well as those
in academia. The documents may change over time and so they should be
consulted with some regularity. There are other statistical guidelines provided
by the FDA which can also be found on the FDA web site and should be con-
sulted. The statistical principles presented in this text are largely consistent
with those in the ICH documents.
We point out one area of particular interest. As described in Chapter 10,
all trials under the jurisdiction of the FDA must have a monitoring plan. For
some trials involving subjects with a serious disease or a new innovative in-
tervention, an external data monitoring committee may be either required or
highly recommended. An FDA document entitled Guidance for Clinical Trial
16 http://www.fda.gov/cber/ich/ichguid.htm
Sponsors on the Establishment and Operation of Clinical Trial Data Moni-
toring Committees 17 contends that the integrity of the trial is best protected
when the statisticians preparing unblinded data for the DMC are external to
the sponsor and uninvolved in discussions regarding potential changes in trial
design while the trial is ongoing. This is an especially important considera-
tion for critical studies intended to provide definitive evidence of effectiveness.
There are many other guidelines regarding the design, conduct, and analysis
of clinical trials.18
In the chapters that follow, many statistical issues pertinent to the devel-
opment of the protocol will be examined in great detail including basic ex-
perimental design, sample size, randomization procedures, interim analyses,
survival, and longitudinal methods for clinical trials along with other analysis
issues. While critical to the conduct and analysis of the trial, these compo-
nents depend on having a carefully defined question that the trial is intended
to answer, a carefully selected study population, and outcome variables that
appropriately measure the effects of interest. Failure to adequately address
these issues can seriously jeopardize the ultimate success of the trial. Thus,
Chapter 2 provides guidelines for formulating the primary and secondary ques-
tions and translating the clinical questions into statistical ones. In Chapter 3,
we examine designs used in clinical trials, especially for comparative trials.
While the size of early phase studies is important, the success of phase III
or comparative trials relies critically on their sample size. These trials are
intended to enable definitive conclusions to be drawn about the benefits and
risks of an intervention. Claiming a benefit when none exists, a false positive,
would not be desirable since an ineffective therapy might become widely used.
Thus, the false positive claims must be kept to a minimum. Equally important
is that a trial be sensitive enough to find clinically relevant treatment effects
if they exist. This kind of sensitivity is referred to as the power of the trial.
Failure to find a true effect is referred to as a false negative. Statisticians refer
to the false positive as the type I error and the false negative as the type II
error. Chapter 4 will present methods for determining the sample size for a
trial to meet prespecified criteria for both false positive and false negative
rates.
Given that the randomized control clinical trial is the gold standard, the
process of randomization must be done properly. For example, a randomiza-
tion process that mimics the simple tossing of a fair coin is certainly a valid
process but may have undesirable properties. Thus, constrained randomiza-
tion procedures are commonly used in order to ensure balance. Some of these
methods produce balance on the number of participants in each arm through-
surable, and that a valid statistical comparison can be performed. Clinical
relevance requires that the appropriate population be targeted and that ef-
fects of the intervention on the primary outcome reflect true benefit to the
subjects. Measurability requires that we can ascertain the outcome in a clini-
cally relevant and unbiased manner. Finally, we must ensure that the statisti-
cal procedures that are employed make comparisons that answer the clinical
question in an unbiased way without requiring untestable assumptions.
We begin this chapter with a discussion of the statistical framework in
which trials are conducted. First, randomized clinical trials are fundamentally
hypothesis testing instruments and while estimation of treatment effects is an
important component of the analysis, most of the statistical design elements
of a clinical trial—randomization, sample size, interim monitoring strategies,
etc.—are formulated with hypothesis tests in mind and the primary questions
in a trial are framed as tests of specific hypotheses. Second, since the goal of
a trial is to establish a causal link between the interventions employed and
the outcome, inference must be based on well established principles of causal
inference. This discussion provides a foundation for understanding the kinds
of causal questions that RCTs are able to directly answer.
Next, while a trial should be designed to have a single primary question,
invariably there are secondary questions of interest. Specifically, beyond the
primary outcome, there are likely to be a number of secondary questions of
interest ranging from the effect on clinical outcomes such as symptom relief
and adverse events, to quality of life, and possibly economic factors such as
length of hospital stay or requirement for concomitant therapies. In addition,
there is typically a set of clearly defined subgroups in which the primary,
and possibly some of the secondary, outcomes will be assessed. Ultimately,
even when there is clear evidence of benefit as measured by the effect on the
primary outcome, these secondary analyses provide supporting evidence that
can either strengthen or weaken the interpretation of the primary result. They
may also be used to formulate hypotheses for future trials.
Implicit in any trial are assessments of safety. Because many toxic effects
cannot be predicted, a comprehensive set of safety criteria cannot be specified
in advance, and safety assessments necessarily include both prespecified and
unanticipated findings.
Lastly, we outline the principles used in formulating the clinical questions
that the trial is intended to answer in a sound statistical framework, followed
by a discussion of the kinds of clinical outcomes that are commonly used for
evaluating efficacy. In this section we also outline the statistical theory for
the validity of surrogate outcomes. Once the clinical outcome is defined, the
result must be quantified in a manner that is suitable for statistical analysis.
This may be as simple as creating a dichotomous variable indicating clinical
improvement or the occurrence of an adverse event such as death, or it may be
a complex composite of several distinct components. Finally, we must specify
an analysis of the quantitative outcome variables that enables the clinical
question to be effectively answered.
2.1 Statistical Framework
We introduce the statistical framework for randomized trials through the detailed discussion of an example. While randomized clinical trials are the best
available tools for the evaluation of the safety and efficacy of medical in-
terventions, they are limited by one critical element in particular—they are
conducted in human beings. This fact introduces two complications: humans
are inherently complex, responding to interventions in unpredictable and var-
ied ways and, in the context of medical research, unlike laboratory animals,
there is an ethical obligation to grant them personal autonomy. Consequently,
there are many factors that experimenters cannot control and that can intro-
duce complications into the analysis and interpretation of trial results. Some
of these complications are illustrated by the following example.
Table 2.1 Baseline and three-month LDL levels in TNT for subjects with values at both time points.

                              Mean LDL (mg/dL)
  Dose           N         Baseline      3-Month
  80mg          4868         97.3          72.5
  10mg          4898         97.6          99.0
  Difference                  0.3          26.5
  p-value for difference                 < 0.0001
First, from the hypothesis test in Table 2.1, for which the p-value is extremely
small, it is clear that the 80mg dose results in highly statistically significant
additional LDL lowering relative to the 10mg dose. Given that the sample
size is very large, however, differences that are statistically significant may
not be clinically relevant, especially since a possible increase in the risk of
serious adverse side-effects with the higher dose might offset the potential
benefit. It is important, therefore, that the difference be quantified in order to
assess the risk/benefit ratio. The mean difference of 26.5mg/dL is, in fact, a
large, clinically meaningful change, although, as will be discussed later in this
chapter, it should be considered a surrogate outcome and this effect is only
meaningful if it results in a decrease in the risk of adverse outcomes associated
with the progression of cardiovascular disease.
The observed mean difference may have other uses beyond characterizing
risk and benefit. It may be used to compare the effect of the treatment to
that observed in other trials that use other drugs or other doses, or differences
observed in other trials with the same doses but conducted in a different
population. It may also be used by treating physicians to select a dose for
a particular subject or by public health officials who are developing policies
and need to compare the cost effectiveness of various interventions. Thus it is
important not only that we test the hypothesis of interest, but that we produce
a summary measure representing the magnitude of the observed difference,
typically with a corresponding confidence interval. What must be kept in
mind, however, is that a test of the null hypothesis that there is no effect of
treatment is always valid (provided that we have complete data), while an
estimate of the size of the benefit can be problematic for reasons that we
outline below. These drawbacks must be kept in mind when interpreting the
results.
To elaborate on the significance of the observed treatment difference, we
first note that the mean difference of 26.5mg/dL is only one simple summary
measure among a number of possible measures of the effect of the 80mg dose
on three-month LDL. To what extent, then, can we make the following claim?
The effect of the 80mg dose relative to the 10mg dose is to lower LDL an addi-
tional 26.5mg/dL at three months.
Implicit in this claim is that a patient who has been receiving the 10mg dose
for a period of time will be expected to experience an additional lowering of
LDL by 26.5mg/dL when the dose is increased to 80mg. Mathematically, this
claim is based on a particular model for association between dose and LDL.
Specifically,
\[
\text{Three-month LDL} = \beta_0 + \beta_1 z + \varepsilon, \tag{2.1}
\]
where β0 and β1 are coefficients to be estimated, z is zero for a subject receiving the 10mg dose and one for a subject receiving the 80mg dose, and ε is a random error term. Note that the effect of treatment is captured by the parameter β1. The validity of the claim relies upon several factors:
1. the new patient is comparable to the typical subject enrolled in the trial,
[Diagram: a patient population passes through the protocol's entry criteria to form the enrolled subjects, who yield the trial results; inference runs in the reverse direction, from the trial results back to the patient population.]
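To make model (2.1) concrete, here is a minimal R sketch that simulates data loosely patterned on Table 2.1 and fits the model by least squares; the sample size, noise level, and variable names are illustrative assumptions, not the TNT data.

    ## Minimal sketch of fitting model (2.1); the data are simulated and
    ## only loosely patterned on Table 2.1 (not the TNT data).
    set.seed(1)
    n <- 1000                                      # subjects per arm (assumed)
    z <- rep(c(0, 1), each = n)                    # 0 = 10mg dose, 1 = 80mg dose
    ldl3 <- 99 - 26.5 * z + rnorm(2 * n, sd = 25)  # three-month LDL with error

    fit <- lm(ldl3 ~ z)   # least-squares estimates of beta0 (intercept) and beta1 (z)
    coef(fit)
    confint(fit, "z")     # confidence interval for the treatment effect beta1

The estimated coefficient of z plays the role of the 26.5mg/dL mean difference discussed above.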
Binary outcomes
For binary data, we assume that the probability of the event in question
is πj where j = 1, 2 is the treatment group. Therefore, if the number of
subjects in group j is nj , then the number of events in group j, xj , has a
binomial distribution with size nj and probability πj . The null hypothesis is
H0 : π1 = π2 . The observed outcomes for binary data can be summarized as
a 2 × 2 table as follows.
                 Event
            Yes       No      Total
  Group 1   x1     n1 − x1     n1
  Group 2   x2     n2 − x2     n2
  Total     m1       m2        N
There are several choices of tests of H0 for this summary table. The most commonly used are Fisher's exact test and the Pearson chi-square test.
Fisher’s exact test is based on the conditional distribution of x1 under
H0 given the marginal totals, m1 , m2 , n1 , and n2 . Conditionally, x1 has a
hypergeometric distribution,
n1 n2
x1 x
Pr{x1 | n1 , n2 , m1 , m2 } = Px1 = 2 (2.6)
N
m1
independent of the common (unknown) probability π1 = π2 . Using Fisher’s
exact test we reject H0 : π1 ≤ π2 if
X
Pl < α. (2.7)
Pl ≤Pxj
Note that the sum is over all values of x1 for which the probability in (2.6) is no larger than that for the observed value. This test is “exact” in the sense that, because the conditional distribution does not depend on the unknown parameters, the tail probability in equation (2.7) can be computed exactly. Because equation (2.7) uses the exact conditional distribution of x1, the type I error is guaranteed to be at most α. For tables with at least one small marginal total, Fisher's exact test is conservative, and often extremely so. The actual type I error for α = .05 is often on the order of 0.01 (D'Agostino 1990) with a corresponding reduction in power.
The Pearson chi-square test uses the test statistic
\[
X^2 = \frac{(x_1 N - n_1 m_1)^2\, N}{m_1 m_2 n_1 n_2} = \frac{(x_1 - n_1 m_1/N)^2\, N^3}{m_1 m_2 n_1 n_2}.
\]
Note that X² is the score test statistic based on the binomial likelihood (see Appendix A.3), which, in multi-way tables, has the form Σ(E − O)²/E where the sum is over all cells in the summary table, E is the expected count under H0, and O is the observed cell count. Under H0, the conditional expectation of x1 given the total number of events, m1, is m1 n1/N, so X² measures how far x1 is from its conditional expectation. Asymptotically, X² has a chi-square distribution with one degree of freedom (χ²₁). At level α = 0.05, for example, we would reject H0 if X² > 3.84, the 95th percentile of the χ²₁ distribution.
In small samples X² no longer has a chi-square distribution and conventional wisdom suggests that because of this, the Pearson chi-square test should only be used when the expected counts in each cell of the table are at least 5 under H0. This belief has been shown to be incorrect, however (see, for example, Fienberg (1980), Appendix IV and Upton (1982)). Fienberg suggests that the type I error is reasonably well controlled for expected cell counts as low as 1. For some values of π1, n1, and n2 the actual type I error can exceed the nominal level, but usually only by a small amount. The continuity corrected Pearson chi-square test (sometimes referred to as Yates's continuity correction) has the form
\[
X_C^2 = \frac{\left(|x_1 - n_1 m_1/N| - 0.5\right)^2 N^3}{m_1 m_2 n_1 n_2}.
\]
Using the continuity corrected chi-square test approximates Fisher's exact test quite well. An alternative discussed by Upton (1982) is
\[
X_U^2 = \frac{(x_1 N - n_1 m_1)^2 (N - 1)}{m_1 m_2 n_1 n_2}.
\]
While Upton recommends X²U, it is not commonly used and many software packages do not produce it. In moderate to large trials X²U and X² will be virtually identical. Our belief is that X² is the preferred test statistic for 2 × 2 tables.
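As a concrete illustration (the counts below are hypothetical), the following R sketch computes the Pearson X², its continuity-corrected version, and Fisher's exact test for a single 2 × 2 table.

    ## Hypothetical 2x2 table: rows = treatment groups, columns = event yes/no
    tab <- matrix(c(8, 42,    # group 1: x1 = 8 events out of n1 = 50
                    3, 47),   # group 2: x2 = 3 events out of n2 = 50
                  nrow = 2, byrow = TRUE)

    chisq.test(tab, correct = FALSE)  # Pearson X^2
    chisq.test(tab, correct = TRUE)   # continuity-corrected (Yates) version
    fisher.test(tab)                  # exact test based on the hypergeometric (2.6)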
While the 2 × 2 table is a useful summary for binary outcomes, it does not
provide a single summary of the effect of treatment.
There are three commonly used summary statistics for binary outcomes.

Risk Difference The risk difference is the difference in estimates of the event probabilities,
\[
\hat{\Delta} = \hat{\pi}_2 - \hat{\pi}_1 = x_2/n_2 - x_1/n_1.
\]

Risk Ratio The risk ratio (sometimes referred to as the relative risk) is the ratio of the estimated probabilities,
\[
\hat{r} = \frac{\hat{\pi}_2}{\hat{\pi}_1} = \frac{x_2/n_2}{x_1/n_1}.
\]

Odds Ratio The odds ratio is the ratio of the observed odds,
\[
\hat{\psi} = \frac{\hat{\pi}_2/(1 - \hat{\pi}_2)}{\hat{\pi}_1/(1 - \hat{\pi}_1)} = \frac{x_2/(n_2 - x_2)}{x_1/(n_1 - x_1)}.
\]
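All three measures are easily computed from the cell counts; in this sketch the helper function binary.summaries() and the counts are our own illustrative choices, not a standard R function.

    ## Risk difference, risk ratio, and odds ratio from a 2x2 table;
    ## binary.summaries() is an illustrative helper, not a standard R function.
    binary.summaries <- function(x1, n1, x2, n2) {
      p1 <- x1 / n1                     # estimated event probability, group 1
      p2 <- x2 / n2                     # estimated event probability, group 2
      c(risk.difference = p2 - p1,
        risk.ratio      = p2 / p1,
        odds.ratio      = (x2 / (n2 - x2)) / (x1 / (n1 - x1)))
    }
    binary.summaries(x1 = 8, n1 = 50, x2 = 3, n2 = 50)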
There is extensive discussion in the literature regarding the preferred sum-
mary measure.3 Our belief is that the preferred summary measure is the one that most adequately fits the observed data. This view is summarized by Bres-
low and Day (1988), “From a purely empirical viewpoint, the most important
properties of a model are simplicity and goodness of fit to the observed data.
The aim is to be able to describe the main features of the data as succinctly
as possible” (page 58). To illustrate this principle, if baseline variables are
available that predict outcome, then the best summary measure is the one for
which there is the least evidence of interaction between treatment and base-
line risk. For example, suppose that the risk ratio r = π2/π1 is independent of baseline predictors. Then the risk difference is ∆ = π2 − π1 = (r − 1)π1: for subjects with low baseline risk, i.e., small π1, the risk difference is small, and for subjects with large baseline risk, the risk difference is large. In this situation, the risk difference fails to adequately fit the observed data.
Similarly, in this situation, the odds ratio will fail to capture the observed
effect of treatment. Here
\[
\psi = \frac{r\pi_1/(1 - r\pi_1)}{\pi_1/(1 - \pi_1)} = r\,\frac{1 - \pi_1}{1 - r\pi_1}.
\]
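This dependence on the baseline risk is easy to see numerically; in the sketch below the risk ratio is held fixed at r = 1.5 (an arbitrary choice) while the implied odds ratio drifts with π1.

    ## With a constant risk ratio r, the odds ratio psi = r(1 - pi1)/(1 - r*pi1)
    ## varies with the baseline risk pi1 (values chosen arbitrarily)
    r   <- 1.5
    pi1 <- c(0.01, 0.05, 0.10, 0.25, 0.50)
    psi <- r * (1 - pi1) / (1 - r * pi1)
    round(cbind(pi1, pi2 = r * pi1, psi), 3)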
3 See, for example, Davies et al. (1998), Bracken and Sinclair (1998), Deeks (1998), Altman et al. (1998), Taeger et al. (1998), Sackett et al. (1996), Zhang and Yu (1998), Senn (1999), and Cook (2002b), among many others, for discussions of risk ratios versus odds ratios.

Ordinal outcomes
                 Response Level
            0     1     2    ···    K    Total
  Group 1  x10   x11   x12   ···   x1K    n1
  Group 2  x20   x21   x22   ···   x2K    n2
  Total    m0    m1    m2    ···   mK     N
A general test of H0 is the score statistic Σj,k (xjk − Ejk)²/Ejk where, as with the Pearson chi-square test, the sum is over the 2(K + 1) cells of the table and Ejk = nj mk/N is the expected count in cell j, k under H0, given the margins of the table. The difficulty with this general test is that it is not necessarily
sensitive to alternative hypotheses of interest, and the result may be difficult
to interpret. Suppose K = 2 and consider the alternative hypothesis given by
π10 = .3, π11 = .4, π12 = .3, π20 = .2, π21 = .6, and π22 = .2. If treatment
2 is the experimental treatment, then the effect of treatment is to reduce
the number of responses in the best and worst categories, but there is no
net benefit, and this hypothesis is of little interest. We are usually willing
to sacrifice power for this kind of hypothesis in favor of tests with power for
alternatives in which there is clear evidence of benefit.
The alternative hypotheses of greatest interest are those for which, for an
arbitrary level, k, subjects in the experimental group have a lower probability
than control subjects of achieving a response no higher than level k; that is,
alternative hypotheses of the form
H1 : Pr{y2 ≤ k} < Pr{y1 ≤ k}, k = 0, 1, 2, . . . , K − 1, (2.8)
where y1 and y2 are subjects randomly selected from groups 1 and 2, respec-
tively.4
One commonly used family of alternative distributions satisfying conditions
such as (2.8) is defined by
\[
\frac{\Pr\{y_2 \le k\}}{1 - \Pr\{y_2 \le k\}} = \psi\, \frac{\Pr\{y_1 \le k\}}{1 - \Pr\{y_1 \le k\}} \tag{2.9}
\]
for some ψ > 0. The model implied by this condition is sometimes called the
proportional odds model. The binary case we have already discussed can be
considered a special case in which K = 1, and ψ is the odds ratio. Under this
model, all 2 × 2 tables derived by dichotomizing the response based on yj ≤ k
versus yj > k have a common odds ratio ψ. The null hypothesis under this
model is H0 : ψ = 1. It can be shown that the score test for this hypothesis
is the well known Wilcoxon rank-sum test (or Mann-Whitney U-test), which
is available in most statistical analysis software packages. A more detailed
discussion of this test follows under continuous outcomes.
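As a small illustration with invented ordinal data, the Wilcoxon rank-sum test can be applied directly to the category scores, and the common odds ratio ψ in (2.9) can be estimated with a proportional odds fit; here we use polr() from the MASS package as one way to do so.

    ## Invented ordinal responses on levels 0..3
    y1 <- c(0, 0, 1, 1, 1, 2, 2, 3, 3, 3)   # group 1
    y2 <- c(1, 2, 2, 2, 3, 3, 3, 3, 3, 3)   # group 2

    ## Wilcoxon rank-sum test (the score test under the proportional odds
    ## model); with ties, R warns and uses the large-sample approximation
    wilcox.test(y2, y1)

    ## Proportional odds fit: the group coefficient estimates the common
    ## log odds ratio in (2.9), up to polr()'s sign convention
    library(MASS)
    d <- data.frame(y = factor(c(y1, y2), ordered = TRUE),
                    group = rep(c("1", "2"), each = 10))
    polr(y ~ group, data = d)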
Alternatively, one could use the Student t-test, which is a test of H0 : E[y1 ] =
E[y2 ]. Since the normality assumption is clearly violated, especially if K is
small, the t-test may be less powerful than the Wilcoxon rank-sum test. Fur-
thermore, the t-test is based on a location shift model that, again, does not
4 This condition is called (strict) stochastic ordering; the cumulative distribution in one
group is (strictly) either above or below that of the other.
hold because the scale is discrete and the range is likely to be the same in
both groups.
It is less clear which statistic summarizing treatment differences is most
meaningful for ordinal data. The common odds ratio ψ in (2.9) may be a
mathematically natural summary measure; however, the clinical interpretation
of this parameter may not be clear. Because the nonparametric Wilcoxon rank-
sum test is equivalent to the score test, a logical choice might be the difference
in group medians; however, the medians could be the same in both groups if
K is small. The difference in means, while not derived from a natural model
for the effect of treatment, may be the most interpretable summary statistic.
Continuous outcomes
When outcome measures are known to be normally distributed (Gaussian),
hypothesis tests based on normal data (Student t-tests or analysis of variance
(ANOVA)) are optimal and test hypotheses regarding the effect of treatment
on mean responses. If outcomes are not normal, we may rely on the central
limit theorem which ensures the validity of these tests if sample sizes are
sufficiently large. In the non-normal case, however, tests based on ANOVA
are no longer optimal and other kinds of test statistics may be preferred.
Occasionally the statistical analysis specified in a trial protocol will indicate that a test for normality should be performed and that the test statistic used for the analysis be based on the results of the test of normality. Given the imperative that hypothesis tests be prespecified, however, a single, clearly defined test that relies on few assumptions is preferable.
An alternative class of tests comprises the nonparametric or distribution free tests.
The most commonly used nonparametric test for testing the null hypothesis
of no difference between two groups is the Wilcoxon rank-sum test, which is
also equivalent to the Mann-Whitney U -test. The Wilcoxon rank-sum test is
computed by sorting all subjects by their response, irrespective of treatment
group, and assigning to each a rank that is simply an integer indicating their
position in the sort order. For example, the subject with the lowest response
receives a rank of one, the second lowest a rank of two and so on. Because
in practice data are always discrete (outcomes are rarely measured with more
than three or four significant digits), there are likely to be tied observations for
which the sort order is ambiguous. For these observations, the usual convention
is to assign to each the average of the ranks that would have been assigned
to all observations with the same value had the ties been arbitrarily broken.
Specifically, the rank for subject i in group j is
\[
r_{ij} = \sum_{p,q} I(x_{pq} < x_{ij}) + \frac{\sum_{p,q} I(x_{pq} = x_{ij}) + 1}{2}, \tag{2.10}
\]
where the sums are over all observations in both groups (the second sum includes the observation xij itself) and I(·) is the indicator function (taking a value of one if its argument is true, zero otherwise). The
original responses are discarded and the mean ranks are compared. The test
statistic is the sum of the ranks in one group, say j = 1, minus its expectation,
\[
S = \sum_i r_{i1} - \frac{n_1(n_1 + n_2 + 1)}{2}. \tag{2.11}
\]
We reject H0 if
\[
\frac{S^2}{\operatorname{Var}(S)} > \chi^2_{1,\alpha},
\]
where χ²₁,α is the upper 100α percentage point of the chi-square distribution with
1 degree of freedom. In large samples this is essentially equivalent to perform-
ing a t-test using the ranks. In small samples, the variance that is used has a
slightly different form. If samples are sufficiently small, the mean difference in
ranks can be compared to the permutation distribution which is the distribu-
tion obtained by considering all possible permutations of the treatment labels.
Permutation tests of this form are discussed in more detail in Chapter 5, Ran-
domization. We note that if responses are normally distributed, so that the
t-test is optimal, the relative efficiency of the Wilcoxon test is approximately
95% of that of the t-test, so the Wilcoxon test can be used in the normal case
with little loss of power.
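To make the computation concrete, the following minimal sketch (in Python with numpy and scipy, used here purely for illustration; the text itself prescribes no software) computes the midranks of equation (2.10), the centered rank sum S of equation (2.11), and the chi-square form of the test using the standard large-sample variance with no tie correction.

    import numpy as np
    from scipy.stats import rankdata, chi2

    def wilcoxon_rank_sum(x1, x2):
        # Midranks: tied observations receive the average rank, as in (2.10).
        n1, n2 = len(x1), len(x2)
        ranks = rankdata(np.concatenate([x1, x2]))
        # Rank sum in group 1 minus its null expectation, equation (2.11).
        S = ranks[:n1].sum() - n1 * (n1 + n2 + 1) / 2.0
        # Large-sample null variance (tie correction omitted for simplicity).
        var_S = n1 * n2 * (n1 + n2 + 1) / 12.0
        stat = S**2 / var_S
        return stat, chi2.sf(stat, df=1)  # compare to chi-square with 1 df

In practice one would call a library routine such as scipy.stats.mannwhitneyu directly; the sketch is intended only to mirror equations (2.10) and (2.11).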
As noted above, if responses are ordinal, the Wilcoxon rank-sum test is
derived from the score test for proportional odds logistic regression models. If
the response has two levels, the Wilcoxon rank-sum test reduces to the Pearson
chi-square test; the rank-sum test is thus applicable to nearly all univariate
responses, continuous or otherwise.
It can also be shown that the Wilcoxon rank-sum test is equivalent to the
following formulation of the Mann-Whitney U -statistic. Suppose that we have
n1 observations in treatment group 1 and n2 in treatment group 2, and form
all n1 n2 possible pairs of observations (x, y) where x is an observation from
group 1 and y is an observation from group 2. Assign each pair a score, h(x, y),
of 1 if x > y, -1 if x < y, and 0 if x = y. The Mann-Whitney U -statistic is the
mean of these scores:
    U = (1/(n1 n2)) Σ h(x, y),

where the sum extends over all n1 n2 pairs (x, y).
The theory of U -statistics allows a variance to be calculated (since each ob-
servation occurs in multiple terms of the sum, the terms are correlated, and
the variance formula is somewhat complicated).
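The pairwise formulation translates directly into a short computation. In the illustrative sketch below, a vectorized np.sign over all pairwise differences is simply one convenient way to form the n1 n2 scores h(x, y).

    import numpy as np

    def u_statistic(x, y):
        # All pairwise differences between group 1 and group 2 observations.
        diffs = np.asarray(x)[:, None] - np.asarray(y)[None, :]
        # h(x, y) = +1 if x > y, -1 if x < y, 0 if tied; U is the mean score.
        return np.sign(diffs).mean()

Values of U near zero are consistent with the null hypothesis that Pr{X > Y} = Pr{X < Y}.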
The appeal of this formulation is that it results in a somewhat intuitive
interpretation of the U -statistic that is not readily apparent in the rank-sum
formulation. We note that
Eh(x, y) = Pr{X > Y } − Pr{X < Y }
where X and Y are random observations from groups 1 and 2 respectively.
Hence, if the distributions of X and Y are identical, we have that Eh(x, y) = 0.
The null hypothesis, therefore, is H0 : Eh(x, y) = 0 which is equivalent to
H0 : Pr{X > Y } = Pr{X < Y }. Thus, the Mann-Whitney U -test (and,
therefore, the Wilcoxon rank-sum test) tests the null hypothesis that a ran-
dom observation from group 1 is as likely to be greater than as less than a
random observation from group 2. This formulation has an attractive clinical
interpretation—if the null hypothesis is rejected in favor of group 1, then a
subject is more likely to have a better outcome if treated with treatment 1
than with treatment 2. This differs from the interpretation of the t-test. If the
null hypothesis is rejected using the t-test in favor of group 1, we conclude
that the expected outcome (measured on the original scale of the observa-
tions) is greater on treatment 1 than on treatment 2. If the distributions of
outcomes are skewed, the conclusions of the two tests could, in fact, be oppo-
site. That is, mathematically, we could have Pr{X > Y } > Pr{X < Y } yet
have EX < EY .
For normal data, the two null hypotheses are mathematically equivalent and,
under an alternative hypothesis in which the variances are equal, the t-test
has more power; if the variances are unequal, however, the t-test is no longer
optimal, and the Wilcoxon rank-sum test can be more powerful.
The properties of the two tests are summarized below.
• The t-test
– is most powerful when data are known to be normal with equal variances,
– is not optimal when responses are not normal or have unequal variances,
– tests the differences in mean responses, and mean difference can be sum-
marized in the original response units with a confidence interval if de-
sired.
• The Wilcoxon rank-sum test (Mann-Whitney U -test)
– has power near that of the t-test when responses are normal,
– can have greater power when responses are not normal,
– has a natural interpretation in terms of the probability that a subject
will have a better response on one treatment versus another.
While we recommend that the nonparametric test be used in most circum-
stances, there may be situations in which there are compelling reasons to
prefer the t-test.
One context in which the t-test is not appropriate is as follows. Suppose
that a continuous outcome measure such as respiratory or cardiac function is
obtained after a specific period of follow-up, but some subjects are unable to
perform the test either because they have died or the disease has progressed to
the point that they are not physically capable of performing the test. Because
the lack of response is informative regarding the condition of the subject,
removing such subjects from the analysis or using naive imputation based on earlier
observed values is not appropriate. Rather, it may be sensible to consider these
responses to be worse than responses for subjects for whom measurements are
available. If, for example, the response is such that all measured values are pos-
itive and larger values correspond to better function, then imputing zero for
subjects unable to be measured indicates that the responses for these subjects
are worse than the actual measurements. Because the Wilcoxon rank-sum test
only takes the ordering into account, the choice of imputed value is immate-
rial provided that it is smaller than all observed responses. Multiple imputed
values can also be used; for example, zero for subjects alive but unable to per-
form the test, and -1 for subjects who have died. This strategy requires that
the nonparametric Wilcoxon rank-sum test be used. In some cases, if there
is a significant difference in mortality between groups, the difference may be
driven primarily by differences in mortality rather than differences in the out-
come of interest. While this may seem to be a drawback of the method, since
the goal is to assess the effect of treatment on the measure response, in this
case a complete case analysis (based solely on directly observed responses) of
the outcome of interest is meaningless because it compares responses between
quite different subpopulations—those who survived on their respective treat-
ments. There is no obvious, simple solution to the problem of unmeasurable
subjects.
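The following sketch illustrates the imputation strategy just described. The data values, and the particular scores (0 for subjects alive but unable to perform the test, −1 for subjects who died), are hypothetical; any imputed constants below all observed responses yield the same ranks and hence the same test.

    import numpy as np
    from scipy.stats import mannwhitneyu

    def with_worst_rank_imputation(measured, n_unable, n_died):
        # Measured values are positive; 0 and -1 sort below all of them.
        return np.concatenate([measured, np.zeros(n_unable), -np.ones(n_died)])

    grp1 = with_worst_rank_imputation(np.array([55.0, 61.2, 47.8, 52.4]), 1, 1)
    grp2 = with_worst_rank_imputation(np.array([40.1, 38.5, 44.0]), 2, 2)
    # A rank-based test is required; the imputed constants do not affect it.
    print(mannwhitneyu(grp1, grp2, alternative="two-sided"))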
Treatment differences for continuous data can be summarized in several
ways. The most natural may be simply the difference in means along with
an appropriate measure of variability (standard error or confidence interval).
Alternatively, depending on the clinical question, it may be more appropriate
to use the difference in medians, although the variance for the difference in
medians is difficult to derive. (One can use the bootstrap-derived formula for the
variance of the median (Efron 1982).) The Hodges-Lehmann
estimator (Hodges and Lehmann 1963) can be used for estimates of location
shift based on the Wilcoxon rank-sum tests; however, in some situations the
location shift model is not appropriate.
One situation in which the difference in medians may be the most mean-
ingful is when worst scores are imputed for subjects who have died or are
physically unable to be tested. If subjects with imputed low scores make up
less than half of each group, the median scores will not depend on the choice
of imputed value and the group medians will represent the values such that
half the subjects have better responses and half have worse responses. In this
setting, the location shift model on which the Hodges-Lehmann estimate is
based is not appropriate because the imputed low scores are the same for the
two groups—only the proportion of subjects assuming these values is subject
to change.
Example 2.2. Returning to the TNT example from Section 2.1, Figure 2.2
shows normal probability plots (also known as “Q-Q” plots, see, for example,
Venables and Ripley (1999)) of the 3-month changes in LDL for the two treat-
ment groups. Normal probability plots are useful for detecting deviations from
normality in observed data. If the underlying distribution is normal, then we
expect that the points in the normal probability plot will fall on a diagonal
straight line with positive slope. In both treatment groups, the observations
clearly fail to fall on straight lines, which is strong evidence of a lack of
normality. In fact, the upper tails of the two distributions (changes above about
37 mg/dL) are essentially identical, with more extreme values in the upper tail
than would be expected if the distributions were normal.

Figure 2.2 Normal probability (Q-Q) plot of change in LDL at 3 months by treatment
(10 mg versus 80 mg atorvastatin) in the TNT study.
The differences by treatment are large enough that samples of size 15 have
good power to detect the treatment difference at conventional significance
levels. For the sake of this example, we will imagine that the distributions
shown in Figure 2.2 represent the underlying population. Table 2.3 shows
power for samples drawn from the TNT population with 15 subjects per group.

Table 2.3 Power for the Wilcoxon rank-sum test and Student's t-test for change in 3-
month LDL in TNT. Power is estimated using 10,000 samples with 15 subjects per
group.

                              Significance Level
                              0.05      0.01      0.005
    Student t-test            94.0%     86.2%     82.4%
    Wilcoxon rank-sum test    99.2%     96.0%     92.9%
The distributions in each treatment group are sufficiently non-normal that
the Wilcoxon rank-sum test has greater power than the Student t-test for
this population. The t-test requires between 30% and 50% greater sample size
to achieve the same power as the Wilcoxon rank-sum test, depending on the
desired significance level. □
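A simulation of the kind summarized in Table 2.3 is easy to sketch. The skewed distributions below are hypothetical stand-ins for the TNT LDL changes (the trial data themselves are not reproduced here); the structure of the calculation, repeatedly drawing 15 subjects per group and recording rejections by each test, is the point.

    import numpy as np
    from scipy.stats import ttest_ind, mannwhitneyu

    rng = np.random.default_rng(1)
    n, reps, alpha = 15, 10_000, 0.05
    hits_t = hits_w = 0
    for _ in range(reps):
        x = -rng.lognormal(3.0, 0.6, n)  # hypothetical group 1 LDL changes
        y = -rng.lognormal(2.6, 0.6, n)  # hypothetical group 2 LDL changes
        hits_t += ttest_ind(x, y).pvalue < alpha
        hits_w += mannwhitneyu(x, y, alternative="two-sided").pvalue < alpha
    print(f"power: t-test {hits_t/reps:.3f}, Wilcoxon {hits_w/reps:.3f}")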
Stratification
In multi-center, and especially in international, studies, one might expect dif-
ferences in the baseline characteristics of subjects enrolled in different centers
or geographic regions. It can be helpful to conduct stratified analyses that
account for these differences, both to ensure that confounding does not arise
from chance imbalances and to account for the increased variability arising
from these differences, thereby increasing study power.
Each of the hypothesis tests discussed for binary, ordinal, and continuous
data has a stratified counterpart. For continuous data, the counterpart to
the Student t-test is simply a 2-way analysis of variance with treatment and
stratum as factors. The hypothesis test of interest is based on the F -test for
treatment.
Both the Pearson chi-square test and Wilcoxon rank-sum test are based
on statistics of the form O − E where O is the observed result and E is the
expected result under the null hypothesis. For each there is an accompanying
variance, V. The alternative hypothesis of interest is that a nonzero effect
of treatment is common to all strata; in particular, under this alternative we
expect the O − E to be either all positive or all negative. If we let Oi, Ei, and Vi be the values of O,
E, and V within the ith stratum, assuming that strata are independent, we
can construct stratified test statistics of the form
    ( Σi (Oi − Ei) )² / Σi Vi,    (2.12)
that, under H0 , have a chi-square distribution with 1 degree of freedom.
For binary data, using O, E, and V from the Pearson chi-square test, equa-
tion (2.12) is known as the Cochran-Mantel-Haenszel test (or sometimes just
Mantel-Haenszel test). For the Wilcoxon rank-sum test, O is the observed
rank-sum in either the treatment or control group, E is the expected rank
sum for the corresponding group. The stratified version of the Wilcoxon rank-
sum test is usually referred to as the Van Elteren test (Van Elteren 1960).
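A stratified rank test of the form (2.12) can be sketched as follows: each stratum contributes its own O, E, and V from the rank-sum test, and the strata are then combined. The sketch uses the untied variance; production implementations add tie corrections.

    import numpy as np
    from scipy.stats import rankdata, chi2

    def stratified_rank_sum(strata):
        # strata: list of (x1, x2) pairs of arrays, one pair per stratum.
        sum_OE, sum_V = 0.0, 0.0
        for x1, x2 in strata:
            n1, n2 = len(x1), len(x2)
            ranks = rankdata(np.concatenate([x1, x2]))
            O = ranks[:n1].sum()                # observed rank sum, group 1
            E = n1 * (n1 + n2 + 1) / 2.0        # its null expectation
            V = n1 * n2 * (n1 + n2 + 1) / 12.0  # null variance (no ties)
            sum_OE += O - E
            sum_V += V
        stat = sum_OE**2 / sum_V                # equation (2.12)
        return stat, chi2.sf(stat, df=1)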
Figure 2.3 Causal pathway diagram for a valid surrogate outcome: the disease process
affects the clinical outcome through the surrogate outcome, and the intervention acts
on the surrogate. (Figure adapted from Fleming and DeMets (1996))
which is the right hand side of (2.13). Additional restrictions are needed along
with (2.14) to give (⇐), but in many cases such restrictions are reasonable.
Figure 2.4 Reasons for failure of surrogate endpoints. A. The surrogate is not in
the causal pathway of the disease process. B. Of several causal pathways of disease,
the intervention affects only the pathway mediated through the surrogate. C. The
surrogate is not in the pathway of the effect of the intervention or is insensitive to
its effect. D. The intervention has mechanisms for action independent of the disease
process. (Dotted lines indicate mechanisms of action that might exist. Figure adapted
from Fleming and DeMets (1996))
The preceding derivations have all assumed that the variables T and S are
continuous, but the same arguments apply when one or both are discrete. In
fact, in the case of a binary surrogate outcome, (⇐) can be shown to hold
when (2.14) and (2.15) are true (Burzykowski et al. 2005).
2.4.2 Examples
We now describe several cases for which criterion (2.15) was not met. The
Nocturnal Oxygen Therapy Trial (NOTT) (Nocturnal Oxygen Therapy Trial
Group 1980) evaluated the effect of nocturnal versus continuous oxygen sup-
plementation in 203 subjects with advanced chronic obstructive lung disease
(COPD). The primary outcomes were a series of pulmonary function tests in-
cluding forced expiratory volume in one second (FEV1 ), forced vital capacity
(FVC) and functional residual capacity (FRC), and quality of life measures
including the Minnesota Multiphasic Personality Inventory (Dahlstrom et al.
1972), the Sickness Impact Profile (Bergner et al. 1976), and the Profile of
Mood States (Waskow and Parloff 1975). Mortality was one of several sec-
ondary outcomes. At the conclusion of the scheduled follow-up period, none
of the pulmonary function tests demonstrated a difference between the two
oxygen supplementation schedules, nor were there any differences in the qual-
ity of life outcomes. Nonetheless, the mortality rate in the nocturnal oxygen
group was twice that of the continuous oxygen group (Figure 2.5) and statisti-
cally significant at the 0.01 level. While each of the pulmonary function tests
could be potential surrogate outcomes for the clinical outcome of survival,
none captured the beneficial effect of continuous oxygen. If the pulmonary
function had been the sole outcome, then one might have concluded that both
doses of oxygen were equally effective. Obviously, those suffering from COPD
would have been ill served by such a conclusion. Fortunately, the mortality
outcome was also available.
Another important example is the Cardiac Arrhythmia Suppression Trial
(CAST).5 Recall that cardiac arrhythmias are predictive of sudden death.
Drugs were approved on the basis of arrhythmia suppression in a high risk
group of subjects. CAST was conducted on a less severe group of subjects
using mortality as the primary outcome. Prior to randomization, subjects were
required to complete a run-in period and demonstrate that their arrhythmia
could be suppressed by one of the drugs. At the conclusion of the run-in,
subjects responding to the drug were randomized to either an active drug or
a corresponding placebo. Surprisingly, CAST was terminated early because
of a statistically significant increase in mortality for subjects on the active
arrhythmia suppressing drugs. Arrhythmia suppression failed to capture the
effect of the drug on the clinical outcome of all-cause mortality. If CAST had
not used mortality as a primary outcome, many thousands of patients might
have been treated with this class of drugs.
Figure 2.5 Cumulative all-cause mortality in NOTT, by treatment group (continuous
versus nocturnal O2 therapy) over 36 months from randomization. (Figure adapted
from Nocturnal Oxygen Therapy Trial Group (1980))
Heart failure is a disease in which the heart is weakened and unable to ad-
equately pump blood. Heart failure symptoms include dyspnea (shortness of
breath) and fatigue and sufferers are often unable to carry on normal daily
activities. A class of drugs known as inotropes increases cardiac contractility
and improves cardiac function. The PROMISE trial (Packer et al. 1991) was
conducted to determine the long-term effect of one such inotrope, milrinone,
which was known to improve heart function. Over 1,000 subjects
were randomized to either milrinone or placebo, each in addition to the best
available care. In the end, PROMISE was terminated early with a significant
increase in mortality for subjects in the milrinone group compared to placebo.
Finally, AIDS is a disease in which the immune system of those infected
with HIV becomes increasingly compromised until the infected individual
becomes vulnerable to opportunistic infections and other fatal insults. With
the AIDS epidemic raging in the late 1980s, there was
a great deal of pressure to find effective treatments as quickly as possible. The
use of CD4 cell counts as a measure of immune system function appeared to be
an effective alternative to trials assessing long term mortality or other clinical
outcomes. Eventually, additional studies demonstrated that drugs improving
immune function do not necessarily improve mortality or increase the time to
progression of AIDS in subjects with HIV. In addition, treatments were found
that did not improve the immune system’s CD4 cell count but did improve
time to progression of AIDS or survival. CD4 cell count seems not to be a
valid surrogate measure for clinical status (Fleming 1994).
Table 2.4 Simple example of interaction between treatment, mortality, and a nonfatal
outcome. Nonfatal events are not observed for subjects who die. Observable quantities
are shown in bold. Treatment has an effect on the nonfatal outcome but not on
mortality.

                                Treatment A              Treatment B
                             Dead   Alive   Total     Dead   Alive   Total
    Nonfatal outcome   Y       10      25      35       10      20      30
                       N       10      55      65       10      60      70
    Total                      20      80     100       20      80     100
Table 2.5 Simple example of interaction between treatment, mortality, and a nonfatal
outcome. Nonfatal events are not observed for subjects who die. Observable quantities
are shown in bold. Treatment has a subject level effect on mortality, but no net effect
on mortality or the nonfatal outcome.

                                Treatment A              Treatment B
                             Dead   Alive   Total     Dead   Alive   Total
    Nonfatal outcome   Y        5      25      30       10      20      30
                       N       15      55      70       10      60      70
    Total                      20      80     100       20      80     100
Table 2.6 Asymptotic variances for three approaches to the use of baseline values
in the estimation of the treatment difference. We assume that there are m subjects
per treatment group and ρ is the correlation between baseline and follow-up measure-
ments.

    Method                  Variance
    Ignore baseline         2σ²/[m(1 − ρ)]
    Change from baseline    4σ²/m
    Model (2.20)            2σ²(1 + ρ)/m
estimate of γ using model (2.20) always has smaller asymptotic variance than
both of the other estimates. Frison and Pocock (1992) extend this model to
situations with multiple baseline and follow-up measures. Thisted (2006) also
investigates the use of baseline in the context of more general longitudinal
models.
One drawback to the analysis using equation (2.20) is that it assumes that
the observations, (Yi0 , Yi1 )T , are approximately bivariate normal. A nonpara-
metric version of this approach can be implemented by first replacing the
observed Yij by their rank scores (separately for baseline and follow-up) and
performing the analysis based on equation (2.20).
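As an illustration of the three analyses compared in Table 2.6, the sketch below simulates correlated baseline and follow-up measurements and computes all three estimates. It assumes that model (2.20) is the usual linear regression of the follow-up value on baseline and treatment, and it uses the statsmodels package purely for convenience.

    import numpy as np
    import statsmodels.api as sm

    rng = np.random.default_rng(2)
    m, rho, gamma = 200, 0.6, 0.4
    z = np.repeat([0.0, 1.0], m)                   # treatment indicator
    cov = [[1.0, rho], [rho, 1.0]]                 # baseline/follow-up correlation
    y0, y1 = rng.multivariate_normal([0, 0], cov, size=2 * m).T
    y1 = y1 + gamma * z                            # add the treatment effect

    ignore = y1[z == 1].mean() - y1[z == 0].mean()
    change = (y1 - y0)[z == 1].mean() - (y1 - y0)[z == 0].mean()
    fit = sm.OLS(y1, sm.add_constant(np.column_stack([y0, z]))).fit()
    print(ignore, change, fit.params[2])           # model (2.20) estimate

The ranked analyses of the preceding paragraph replace y0 and y1 by their rank scores before the same computations.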
Table 2.7 Power for treatment difference in 3-month LDL for TNT data using sam-
ples of 15 per group and 6 different analyses, based on 100,000 samples.

                                              Significance Level
                                              .05     .01     .005
    Observed Data   Ignore baseline           88.9    74.0    66.4
                    Change from baseline      94.0    85.5    81.1
                    Linear model (2.20)       94.4    86.6    82.8
    Ranked Data     Ignore baseline           91.3    76.1    68.1
                    Change from baseline      98.3    93.0    89.0
                    Linear model (2.20)       96.5    88.9    83.8
2.6 Summary
At all stages of the design, conduct, and analysis of a clinical trial, one must keep in
mind the overall question of interest. In most cases, the question has a form
such as “do subjects treated with agent A have better outcomes than those
treated with agent B?” All design elements from the definition of the primary
outcome and the associated statistical test, to the sample size and interim
monitoring plan, are created with this question in mind. Secondary and sup-
plemental analyses will prove to be important aids to the interpretation of
the final result, but cannot be a valid substitute when the primary question
does not provide a satisfactory result. The foundational principles of causal
inference require that we adhere as much as possible to the intent-to-treat
principle.
2.7 Problems
2.1 Find a clinical trial article in a medical journal. Determine how the in-
vestigators/authors have defined the question. Identify the study popu-
lation. Identify the primary, secondary, and other questions. What kind
of outcome measure(s) was(were) used? What was the approach used to
analyze the data? Is there any part of defining the question you would
have done differently? Explain why or why not.
2.2 Find an example of a clinical trial in which a surrogate outcome was
used but results were later questioned because the surrogate perhaps
did not satisfy the criteria.
CHAPTER 3
Study Design
Once the hypothesis has been formulated, including what outcome variables
will be used to evaluate the effect of the intervention, the next major challenge
that must be addressed is the specification of the experimental design. Getting
the correct design for the question being posed is critical since no amount of
statistical analysis can adjust for an inadequate or inappropriate design. While
the typical clinical trial design is usually simpler than many of the classical
experimental designs available (Cochran and Cox 1957; Cox 1958; Fisher 1925;
Fisher 1935), there are still many design choices in common use (Friedman et al.
1998).
While the concept of a randomized control is relatively new to clinical research,
starting in the 1950s with the first trials sponsored by the Medical Research
Council, there are, of course, examples in history where a control group
was not necessary. Examples include studies of the effectiveness of penicillin
in treating pneumococcal pneumonia and a vaccine for preventing rabies in
dogs. These examples, however, are rare and in clinical practice we must rely
on controlled trials to obtain the best evidence of safety and efficacy for new
diagnostic tests, drugs, biologics, devices, procedures, and behavioral modifi-
cations.
In this chapter, we first describe the early phase trial designs that are used
to obtain critical data with which to properly design the ultimate definitive
trial. As discussed, once a new intervention has been developed, it may have
to go through several stages before the ultimate test for efficacy. For example,
for a drug or biologic one of the first challenges is to determine the maximum
dose that can be given without unacceptable toxicity. This is evaluated in a
phase I trial. A next step is to determine if the new intervention modifies a
risk factor or symptom as desired, and to further assess safety. This may be
accomplished through a phase II trial or a series of phase II studies. Once
sufficient information is obtained about the new intervention, it must be com-
pared to a control or standard intervention to assess efficacy and safety. This
is often referred to as a phase III trial. Trials of an approved treatment with
long-term follow-up of safety and efficacy are often called phase IV trials.
(See Fisher (1999) and Fisher and Moyé (1999) for a description of the U.S.
Food and Drug Administration (FDA) approval process.) While these phase
designations are somewhat arbitrary, they are still useful in thinking about
the progression of trials needed to proceed to the final definitive trial. Once
the classical phase I and II designs have been described, we discuss choices
of control groups, including the randomized control. The rest of the chapter
will focus on randomized control designs, beginning with a discussion of trials
designed to show superiority of the new experimental intervention over a stan-
dard intervention. Other trials are designed to show that, at worst, the new
intervention is not inferior to the standard to within a predetermined margin
of indifference. These trials can be extremely difficult to conduct and inter-
pret, and we discuss some of the associated challenges. Finally, we address
adaptive designs which are intended to allow for design changes in response
to intermediate results of the trial.
where ni is the number of subjects receiving dose d[i] . The posterior distri-
bution of β is proportional to p(β)Ln (β) and its mean or mode can be used
to estimate the MTD. The next subject, or subjects, is then simply assigned
the MTD. That is, each subject receives what is currently estimated to be the
best dose. CRM sample sizes are generally fixed in advance. The final MTD
estimate is merely the dose that would be given to the next subject were
there to be one. Use of a vague prior distribution for β reduces the procedure
to maximum likelihood estimation, which requires algorithmic dose modification
rules for early subjects, when little information is available.
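The posterior computation is easy to sketch on a grid. The one-parameter power model, in which the toxicity probability at dose i is skeleton[i] raised to the power exp(β), and the normal prior used below are common but illustrative choices, not a prescription from the text.

    import numpy as np
    from scipy.stats import norm

    skeleton = np.array([0.05, 0.10, 0.20, 0.35, 0.50])  # prior toxicity guesses
    target = 0.20                                        # target toxicity rate
    grid = np.linspace(-4.0, 4.0, 801)                   # grid over beta
    prior = norm.pdf(grid, 0.0, 1.34)                    # an often-used prior

    def crm_next_dose(dose_idx, tox):
        # dose_idx: indices of doses given so far; tox: 0/1 toxicity outcomes.
        like = np.ones_like(grid)
        for d, y in zip(dose_idx, tox):
            p = skeleton[d] ** np.exp(grid)              # model toxicity probability
            like *= p**y * (1.0 - p) ** (1 - y)
        post = prior * like
        post /= np.trapz(post, grid)                     # normalize the posterior
        beta_hat = np.trapz(grid * post, grid)           # posterior mean of beta
        p_hat = skeleton ** np.exp(beta_hat)             # plug-in toxicity estimates
        return int(np.argmin(np.abs(p_hat - target)))    # dose closest to target

    print(crm_next_dose([0, 0, 1], [0, 0, 1]))           # next assignment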
The CRM shares several advantages with many other model-based designs.
It unifies the design and analysis process—as described above, subjects are
assigned to current best estimates. It is very flexible, allowing predictors,
sub-dose limiting toxicities, and incomplete information to be incorporated.
McGinn et al. (2001) illustrated the latter feature in a trial of radiation dose
escalation in pancreatic cancer. Instead of waiting for full follow-up on a co-
hort of subjects treated at the current dose, they used the time-to-event CRM
(TITE-CRM) method of Cheung and Chappell (2000) to utilize interim out-
comes from subjects with incomplete follow-up to estimate the dose to assign
the next subject. Full follow-up is used at the trial’s end to generate a fi-
nal estimate of the MTD. They note that trial length is shortened without
sacrificing estimation accuracy, though at the cost of logistical complexity.
Bayesian methods can build on the results of previous human exposure, if
any, and in turn provide prior distributions for future studies. A disadvan-
tage of model-based designs compared to algorithmic ones is their lack of
transparency, leading clinicians to think of them as “black box” mechanisms,
yielding unpredictable dose assignments.
A phase I trial’s operating characteristics have ethical and scientific implica-
tions, so the prior distribution must be sufficiently diffuse to allow data-driven
dose changes but strong enough to disallow overly aggressive escalation. The
latter can be restricted by forbidding escalation by more than one dose level
per cohort and by conservatively setting the initial dose at less than the prior
MTD. Therefore, since prior distributions used for the CRM and similar meth-
ods are often chosen based on the operating characteristics of the designs they
produce in addition to, or instead of, scientific grounds, one might use a dif-
ferent prior distribution for the analysis.
Figure 3.2 Regions determined by the numbers of events observed in the first and
second cohorts.

    Region    Action
    A         Reject H0, stop at stage 1
    B         Fail to reject H0, stop at stage 1
    C         Reject H0 after stage 2
    D         Fail to reject H0 after stage 2
The probabilities of each region at the end of each stage, under the null and
the alternative hypotheses, are shown in the table below. The
probability of erroneously rejecting H0 is 0.043 + 0.007 = 0.05. (The calcula-
tions are more complex using a failure time outcome in which subjects who
are accrued in the first stage can continue to be followed during the second.)
Under the alternative hypothesis, there is a 23% probability of stopping and
rejecting H0 after the first stage and a 63% probability of rejecting H0 after
the second stage, yielding 86% power.
Note that the investigators also utilized deterministic curtailment (discussed
in Chapter 10), stopping the trial at a point at which its eventual outcome was
certain. Regions B and D in Figure 3.2 indicate situations in which, during
or after the first stage, they knew that the trial’s outcome would be to fail to
reject the null hypothesis. There is a 41.7% chance under H0 of ending the
trial after fourteen or fewer subjects. The chances of stopping for the same
reason at particular points during the second stage or of stopping near the end
of the second stage because five or more events are unobtainable (so rejecting
H0 is certain) are easily calculated.
    Region    Probability under H0    Probability under H1
              (π = 0.3)               (π = 0.1)
    A         0.007                   0.23
    B         0.417                   0.009
    C         0.043                   0.63
    D         0.533                   0.13

This example shows a situation in which the power, type I error rate, and
maximum sample size for a two-stage design are all about the same as those for
the corresponding single stage trial. The former, however, offers the substantial
possibility of greatly shortening study duration by stopping early, offering
ethical and logistical advantages. □
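The stage-wise probabilities in the table are binomial calculations. The sketch below shows the generic computation for a two-stage rule in which H0 is rejected when few events occur; the stage sizes and cutoffs are hypothetical placeholders, not the exact rules of this example.

    from scipy.stats import binom

    def two_stage_probs(pi, n1=14, a1=0, n2=11, a=4):
        # Reject at stage 1 if stage 1 events <= a1 (region A); otherwise
        # continue, and reject after stage 2 if total events <= a (region C).
        p_reject_1 = binom.cdf(a1, n1, pi)
        p_reject_2 = sum(binom.pmf(k, n1, pi) * binom.cdf(a - k, n2, pi)
                         for k in range(a1 + 1, a + 1))
        return p_reject_1, p_reject_2

    print(two_stage_probs(0.3))  # operating characteristics under H0
    print(two_stage_probs(0.1))  # operating characteristics under H1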
Although the two-stage phase II designs discussed here are simple, many
refinements are possible such as the use of three or more stages. Failure time
outcomes can be used in which case the length of follow-up at each stage
will influence total trial duration (Case and Morgan 2001). Thall and Simon
(1995) discuss Bayesian phase II designs. As was the case with phase I trials,
Bayesian considerations may inform the analysis of a study even when they
did not motivate its design.
As a further refinement on two- (and, in principle, three- or more) stage
trials, Simon (1989) describes designs in which the numbers of subjects in each
stage may be unequal. Subject to fixed type I error rates (denoted by α, usually
specified to be 5% or 10%) and type II error rates (β, usually between 5% and
20%), many designs satisfy the criteria
Pr(Reject H0 |πT = π1 ) ≤ α
and
Pr(Reject H0 |πT = π2 ) ≥ 1 − β
where π1 is the value of πT under H0 and π2 is a possible value under H1 .
Two strategies for choosing subject allocation between two stages are so-called
optimality and the minimax criterion. Both have been widely applied. Opti-
mality minimizes the expected sample size under the null hypothesis, while
minimax refers to a minimization of the maximum sample size in the worst
case. Because these goals conflict, optimal designs tend to have large maximum
sample sizes. Jung et al. (2001) present a graphical method for balancing the
two goals. In practice, equal patient allocation to the two stages often serves as
a simple compromise. Also, the actual sample size achieved at each stage may
deviate slightly from the design. This is particularly true for multi-institutional
phase II trials, in which there may be a delay in ascertaining the total accrual
and in closing a stage after the accrual goal is met.
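The quantities driving the optimal/minimax trade-off are simple binomial sums. The sketch below evaluates a candidate Simon design (stop for futility if r1 or fewer responses among the first n1 subjects; reject H0 if more than r responses among all n), returning its error rates, the probability of early termination (PET), and the expected sample size under H0. The candidate values shown are illustrative.

    from scipy.stats import binom

    def reject_prob(p, r1, n1, r, n):
        # Pass stage 1 (more than r1 responses), then exceed r responses overall.
        return sum(binom.pmf(k, n1, p) * binom.sf(r - k, n - n1, p)
                   for k in range(r1 + 1, n1 + 1))

    def early_stop_and_en(p0, r1, n1, n):
        pet = binom.cdf(r1, n1, p0)         # probability of early termination
        return pet, n1 + (1 - pet) * (n - n1)

    r1, n1, r, n = 1, 12, 5, 35             # candidate design for p0=0.1 vs p1=0.3
    print(reject_prob(0.1, r1, n1, r, n))   # type I error rate
    print(reject_prob(0.3, r1, n1, r, n))   # power
    print(early_stop_and_en(0.1, r1, n1, n))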
3.2 Phase III/IV Trials
The typical phase III trial is the first to definitively establish that the new
intervention has a favorable benefit-to-risk profile. Trials intended to provide
additional efficacy or safety data after the new intervention has been approved
by regulatory agencies are often referred to as phase IV trials. The distinction
is not always clear, however, and we shall focus our discussion on phase III
trials since most of the design issues are similar for phase IV trials.
Historical Controls
One of the earliest trial designs is the historical control study, comparing
the benefits and safety of a new intervention with the experience of subjects
treated earlier using the control. A major motivation for this design is that all
new subjects can receive the new intervention. Cancer researchers once used
this design to evaluate new chemotherapy strategies (Gehan 1984). This is
a natural design for clinicians since they routinely guide their daily practice
based on their experience with the treatment of previous patients. If either
physicians or patients have a strong belief that the new intervention may be
beneficial, they may not want to enroll patients in a trial in which they may
be randomized to an inferior treatment. Thus, the historical control trial can
alleviate these ethical concerns. In addition, since all eligible patients
receive the new intervention, recruitment can be easier and faster, and the
required sample size is roughly half that of a randomized control trial, thereby
reducing costs as well. One of the key assumptions, however, is that the patient
population and the standard of care remain constant during the period in
which such comparisons are being made.
Despite these benefits, there are also many challenges. First, historical con-
trol trials are vulnerable to bias. There are many cases in which interventions
appeared to be effective based on historical controls but later were shown not
to be (Moertel 1984). Byar describes the Veterans Administration Urological
Research Group trial of prostate cancer, comparing survival of estrogen-
treated patients with placebo-treated patients.1 During this trial, there was
a shift in the patient population so that earlier patients were at higher risk.
Estrogen appeared to be effective in the earlier high risk patients but not so in
the patients recruited later who were at lower risk. A historical control study
would have been misleading due to a shift in patient referral patterns. Pocock
(1977b) describes a series of 19 studies that were conducted immediately after
1 Veterans Administration Cooperative Urological Research Group (1967), Byar et al.
(1976)
an earlier study of the same patient population and with the same interven-
tion. In comparing the effects of the intervention from the consecutive trials
in these 19 cases, 4 of the 19 comparisons were “nominally” significant,
suggesting that the same treatment was more effective in one study than in the
other, even though the patients were similar and the interventions were
identical.
A recent trial of chronic heart failure also demonstrates the bias in histori-
cal comparisons (Packer et al. 1996; Carson et al. 2000). The initial PRAISE
trial evaluated a drug, amlodipine, to reduce mortality and morbidity in a
chronic heart failure population. The first trial, referred to as PRAISE I, was
stratified by etiology of the heart failure: ischemic and non-ischemic. Earlier
research suggested that amlodipine should be more effective in the ischemic
population or subgroup. The overall survival comparison between amlodipine
and placebo treated patients, using the log-rank test, resulted in a p-value of
0.07, almost statistically significant. There was a significant interaction be-
tween the ischemic and non-ischemic subgroups, however, with a hazard ratio
of 1.0 for the ischemic subgroup and 0.5 for the non-ischemic subgroup. While
the result in the non-ischemic group alone might have led to its use in clinical
practice, the fact that the result was opposite to that expected persuaded the
investigators to repeat the trial in the non-ischemic subgroup alone. In the
second trial, referred to as PRAISE II, the comparison of the amlodipine and
placebo treated arms resulted in a hazard ratio of essentially 1.0. Of interest
here, as shown in Figure 3.3, is that the placebo treated patients in the sec-
ond trial were significantly superior in survival to the placebo treated patients
in PRAISE I. No adjustment for baseline characteristics could explain this
phenomenon.
In addition, to conduct a historical control trial, historical data must be
available. There are usually two main sources of historical data: a literature
resource or a data bank resource. The literature resource may be subject
to publication bias where only selected trials, usually those with a positive
significant benefit, are published and this bias may be introduced into the
historical comparison. In addition, to conduct a rigorous analysis of the new
intervention, access to the raw data would be necessary and this may not be
available from the literature resources. Thus, researchers often turn to existing
data banks to retrieve data from earlier studies. Quality of data collection
may change over time as well as the definitions used for inclusion criteria and
outcome variables.
Even if data of sufficient quality from earlier studies are available, caution is
still required. Background therapy may have changed over time and diagnostic
criteria may also have changed. For example, the International Classification
of Diseases is revised periodically, and thus apparent increases or decreases in
disease prevalence can occur; the seventh, eighth, and ninth revisions of
the international classification system resulted in a 15% shift in deaths due to
ischemic heart disease. Besides changes in diagnostic criteria and classification
codes, disease prevalence can also change over time as shown in Figure 3.4.
Figure 3.3 PRAISE I vs. PRAISE II: all-cause mortality for placebo groups by study.
Concurrent Controls
The concurrent control trial compares the effect of the new intervention with
the effect of an alternative intervention applied at some other site or clinic.
This design has many of the advantages of the historical control trial but
eliminates some of the sources of bias. The primary advantage is that the in-
vestigator can apply the new intervention to all participants and only half as
many new participants are needed, compared to the randomized control trial.
Thus, recruitment is easier and faster. The biases that affect historical controls
due to changes in definitions or diagnostic criteria, background therapy, and
changing time trends in disease prevalence are minimized if not eliminated.
These types of comparisons are somewhat common in medical care as suc-
cess rates of various institutions in treating patients are often compared and
evaluated.
The key problem with the concurrent control trial is selection bias, arising both
from patients or participants and from the health care provider. Referral patterns
Figure 3.4 Cancer and heart disease deaths. Cancer and heart disease are the leading
causes of death in the United States. For people less than 65, heart disease death rates
declined greatly from 1973 to 1992, while cancer death rates declined slightly. For
people age 65 and older, heart disease remains the leading killer despite a reduction in
deaths from this disease. Because cancer is a disease of aging, longer life expectancies
and fewer deaths from competing causes, such as heart disease, are contributing to
the increase in cancer incidence and mortality for those age 65 and older. Reprinted
with permission from McIntosh (1995).
are not random but based on many factors. These may include whether the
institution is a primary care facility or a tertiary referral center. Patient mix
would be quite different in those two settings. Patients may choose to get
their care from an institution because of its reputation or accessibility. With
the development of large multidisciplinary health care systems, this source of
bias may not be as great as it otherwise might be. Even within such systems,
however, different clinics may have different expertise or interest and select
patients accordingly.
The key to any evaluation of two interventions is that the populations are
comparable at the start of the study. For a concurrent control trial, one would
have to examine the profile of risk factors and demographic factors. There are
many other important factors, however, that may not be available. Even with
a large amount of baseline data, establishing comparability is a challenging
task. In the previous section, the PRAISE I and II example indicated that it was
not possible to explain the difference in survival between the placebo
subgroups in the two back-to-back studies by examining baseline risk factors.
Thus, covariate adjustment cannot be guaranteed to produce valid treatment
comparisons in trials using concurrent controls.
Randomized Control Trials
As indicated earlier, the randomized control trial is viewed as the “gold stan-
dard” for evaluating new interventions. The reason for this, as summarized in
Table 3.2, is that many of the sources of bias present in both the historical and
concurrent control trials are minimized or eliminated (Friedman et al. 1985).
Table 3.3 Possible bias in the estimation of treatment effects for published trials
involving anticoagulation for patients with myocardial infarction as a function of the
control group. The randomized control is the most reliable.
(Schema for the parallel group randomized design: participants who are eligible and
consent are randomized; those who are ineligible or do not consent are dropped.)
There are many clinical trials in various fields that have successfully used
this trial design.2 The parallel control design has the advantage of simplicity
and can give valid answers to one or two primary questions. For example, the
Coronary Drug Project (CDP), one of the early randomized control multi-
center clinical trials, compared several intervention strategies with a placebo
control arm using a parallel design in a population of men who had recently
suffered a heart attack. The primary outcome was mortality with cardiovascu-
lar mortality as a secondary outcome. The intervention strategies used various
drugs that were known to lower serum cholesterol levels but it was not known
whether any could safely lower mortality. The Diabetic Retinopathy Study
(DRS) was a trial to evaluate a new laser treatment in a diabetic popula-
tion to reduce the progression of retinopathy, an eye disease that reduces
visual acuity. The primary outcome was visual acuity and a retinopathy score
which measures disease progression and is based on photographs of the eye
fundus (i.e., back of the eye). The Beta-blocker Heart Attack Trial (BHAT)
was a randomized double-blind parallel design trial in a group of individu-
als having just survived a heart attack, comparing a beta-blocker drug with
a placebo control. Again, mortality was the primary outcome variable. The
Breast Cancer Prevention Trial (P-1) was a cancer prevention trial evaluating
the drug tamoxifen to prevent the occurrence of breast cancer in a population
at risk (Fisher et al. 1998). Here, the primary outcome was disease-free survival;
that is, alive without the occurrence of breast cancer.
2 The International Steering Committee on Behalf of the MERIT-HF Study Group (1997),
Domanski et al. (2002), HDFP Cooperative Group (1982), The Coronary Drug Project
Research Group (1975), The DCCT Research Group: Diabetes Control and Complica-
tions Trial (DCCT) (1986), Diabetic Retinopathy Study Research Group (1976)
Most phase III
trials use this basic design because of its simplicity and utility.
A variation of the parallel design utilizes a run-in period. A schema for this
design is shown in Figure 3.6. The primary departure from the basic parallel
design is that participants are screened into a prerandomization phase to eval-
uate their ability to adhere to the protocol or to one of the interventions under
evaluation. If they cannot adhere to the intervention schedule, for example by
taking a high percentage of the prescribed medication, then their lack of compliance
will affect the sensitivity or power of the trial. If a potential participant can-
not comply with some of the required procedures or evaluations, then that
individual would not be a good candidate for the main trial.
Figure 3.6 Schema for a parallel design with a run-in period: participants who are
screened and consent enter a run-in period; those whose run-in performance is
unsatisfactory are dropped, and the remainder are randomized.
There are many trials that have used a “run-in” period. For example, the
Cardiac Arrhythmia Suppression Trial (CAST) used a run-in phase to deter-
mine if a patient’s cardiac arrhythmia could be suppressed using one of the
three drugs being tested.3 CAST was a trial involving patients with a serious
cardiac arrhythmia, or irregular heartbeat. Individuals with these irregular
heartbeats are known to be at high risk of death, usually suddenly and with-
out warning. Researchers developed a class of drugs that would suppress or
control these irregular heartbeats, on the theory that this would reduce the
risk of sudden death. Not all patients could tolerate these drugs, however,
and in some patients the treatment failed to control the arrhythmia. There-
fore, to improve the efficiency and power of the main CAST trial, patients
were screened to determine both their ability to tolerate these drugs and the
susceptibility of the arrhythmia to pharmacological control. If they met the
screening criteria, they were randomized into the main trial, either to one
of the 3 drugs or to a matching placebo control. Ironically, these drugs were
3 The Cardiac Arrhythmia Suppression Trial (CAST) Investigators (1989)
shown to be harmful in a patient population who had passed the screening
run-in phase with a successful suppression of their arrhythmias.
Another trial, the Nocturnal Oxygen Therapy Trial (NOTT), evaluated the
benefit of giving 24 hours of continuous oxygen supplementation, relative to
giving 12 hours of nocturnal use, in patients suffering from advanced chronic
obstructive pulmonary disease (COPD). Potential participants were entered
into a run-in phase to establish that their disease was sufficiently stable to
ensure that the outcome variables, measures of pulmonary function, could better reflect
the treatment effect. A similar strategy was used in the Intermittent Positive
Pressure Breathing (IPPB) Trial.4
(Schema for a randomized withdrawal design: participants receiving treatment A are
randomized either to continue treatment A or to discontinue it.)
more often to give better evidence that can help maximize benefit and min-
imize long term risk. There are few examples of the randomized withdrawal
design.
Crossover Design
The crossover design has often been used in the early stages of the development
of new therapies to evaluate their effects compared to a standard control.
Crossover designs are commonly used for studies of analgesic and psychotropic
drugs. One of the unique features of this design is that each patient receives
both treatments and, hence, serves as his or her own control. In the simple
two-period crossover design, as depicted in Figure 3.8, the participants are
4 The Intermittent Positive Pressure Breathing Trial Group (1983)
randomly divided into two groups and each participant is exposed to the new
treatment (A) and to the control (B), each for a prespecified period of time. In
Figure 3.8 these time periods are labeled as period I and period II. Group 1
is exposed to treatment A in period I and treatment B in period II, while
group 2 is exposed to treatment B in period I and treatment A in period II.
Figure 3.8 Scheme for the two-period crossover design comparing treatments A and B.

    Group    Period I    Period II
    1        TRT A       TRT B
    2        TRT B       TRT A
The model, according to Brown (1980) and Grizzle (1965), can be written
as
Yijk = µ + πk + φu + λv + ξij + εijk ,
where i = 1, 2, j = 1, . . . , ni (ni is the number of subjects in group i), k =I,II,
u =A,B, and v =A,B. The terms in the model are defined as follows:
Yijk = the measurement for subject j in group i
during period k,
µ = the overall mean,
πk = the effect of period k,
φu = the effect of treatment u,
λv = the carryover effect of treatment v from
period I on the response in period II,
ξij = the subject effect,
εijk = random error.
In addition, E[ξij] = E[εijk] = 0, Var(ξij) = σξ², Var(εijk) = σε², and the ξij
and εijk are assumed to be mutually independent. The carryover effect, λv,
has a value of 0 for all measurements in period I. Its value in period II is of
particular interest and may not be 0. For example, if treatment A cures the
disease in period I, then there is no possibility of a response to treatment B in
period II. The validity and efficiency of the crossover design depends strongly
on whether or not there is any carryover effect. The presentation here follows
Brown (1980).
First, we consider the case in which we assume that λv = 0.⁵ In this
case, we impose the additional constraint that φA +φB = 0, and the difference
between the two treatment effects, δ = φB − φA , can be estimated by
    δ̂co = (1/2)(Ȳ1·2 − Ȳ1·1 + Ȳ2·1 − Ȳ2·2),
5 Actually we need only assume that λA = λB ; however, if carryover effects are present, it
is unlikely that they will be the same for both treatments.
where Ȳi·k is the average measurement of all subjects in group i during period
k. It is easy to check that
    E[δ̂co] = δ  and  Var(δ̂co) = (σε²/2)(1/n1 + 1/n2),

where σε² can be estimated from the within-subject differences between periods with
n1 + n2 − 2 degrees of freedom. As pointed out by Chassan (1970), the efficiency
of the crossover design can be compared to that of the randomized parallel
control design using simple calculations. Let the number of subjects in each
group in the crossover design be n and the number of subjects in each arm in
the parallel design be m. The variance of the estimator in the crossover trial
becomes Var(δ̂co) = σε²/n and the variance of the estimator in the parallel
design experiment will be

    Var(δ̂p) = (2/m)(σξ² + σε²) = 2σε²/[m(1 − ρ)],
where ρ is the correlation between measurements in period I and period II for a
randomly selected subject. The ratio of the variances for the two experiments
will be
    Var(δ̂co)/Var(δ̂p) = (m/(2n))(1 − ρ).
Thus, to achieve the same precision, we require that m = 2n/(1 − ρ) which,
depending on ρ, is at least twice the sample size, and likely much more. An-
other observation made by Chassan (1970) is that if the analysis of the parallel
design experiment was based on change from baseline, then the appropriate
estimator, say δ̂p(bl) , would have variance
    Var(δ̂p(bl)) = (4/m)σε².
This is four times the variance of the estimator for the crossover experiment
when the two have the same sample sizes. Similarly, if the estimate is based
on equation (2.20) from Chapter 2, the variance is 2(1 + ρ)σε²/m, which is
between two and four times Var(δ̂co).
The efficiency results are valid only when λv = 0, however. When this
assumption is not reasonable, adjustments need to be made that ultimately
defeat the purpose of the crossover design. Grizzle (1965) noticed that the hy-
pothesis of no carryover effect can be tested in the simple two-period crossover
design. Letting γ = λB − λA (i.e., the difference in carryover effect between
the two treatments), an unbiased estimator of γ is
    γ̂ = (Ȳ2·1 + Ȳ2·2) − (Ȳ1·1 + Ȳ1·2),
with variance
    Var(γ̂) = (4σξ² + 2σε²)(1/n1 + 1/n2).
We can estimate 4σξ² + 2σε² by estimating the variance of the sum of the two
observations for each individual, with n1 + n2 − 2 degrees of freedom. Once
these estimates are available, we can test the hypothesis of no carryover effect.
If it is determined that a carryover effect is present, then E[δ̂co] ≠ δ. A valid
(unbiased) estimate of the treatment effect is Ȳ2·1 − Ȳ1·1, the estimate from
a parallel group comparison of the period I data, which does not make use of
the period II data, subverting the entire benefit of the crossover design.
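Because all of the estimators above depend only on the four group-by-period means, they can be sketched in a few lines; the array layout (one row per subject, columns for periods I and II) is an assumption made for illustration.

    import numpy as np

    def crossover_estimates(y_grp1, y_grp2):
        # y_grp1: (n1, 2) responses for group 1 (A then B);
        # y_grp2: (n2, 2) responses for group 2 (B then A).
        m1 = y_grp1.mean(axis=0)  # (period I, period II) means for group 1
        m2 = y_grp2.mean(axis=0)
        delta_co = 0.5 * (m1[1] - m1[0] + m2[0] - m2[1])  # crossover estimate
        gamma_hat = (m2[0] + m2[1]) - (m1[0] + m1[1])     # carryover contrast
        delta_p1 = m2[0] - m1[0]  # period I only, for use if carryover is present
        return delta_co, gamma_hat, delta_p1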
When in doubt about the carryover effect, Grizzle (1965) recommends test-
ing γ = 0 at a significance level of 0.1. If this hypothesis is not rejected, δ̂co
can be used. If it is rejected, only information from the first period should be
used. On the other hand, in order to have sufficient power for both this test
and the test for treatment effect, one would need a larger sample size and a
simple parallel design would be a more efficient choice. Thus, when there is
a very strong belief that γ = 0, a crossover design may be used. If there is a
possibility of carryover effect, however, one should avoid the crossover design.
These results are a summary of those presented by Brown (1980). Another
discussion, arriving at many of the same conclusions, is given by Hills and
Armitage (1979). Many other possible designs for crossover trials have been
explored. These involve a larger number of treatments or periods and may
involve various patterns of treatment assignment. Some of these have been
discussed in Koch et al. (1989) and Carriere (1994).
Factorial Designs
The factorial design is the most complex of the designs typically used in clini-
cal trials. Using this design, two or more classes of interventions are evaluated
in the same study compared to the appropriate standard or control for each.
As shown in Figure 3.9 for a two-by-two factorial design, evaluating two inter-
ventions A and B compared to a control, there are four cells, each reflecting
an intervention strategy. Approximately 25% of the participants are in each
cell; that is, 25% receive both interventions AB, 25% receive A and control,
25% receive B and control, and 25% receive both controls. The factorial de-
sign allows researchers to evaluate more than one intervention in the same
participant population, thus reducing costs and increasing efficiency. Further-
more, as in the Physicians Health Study (PHS) example described later in
this section, each treatment comparison may be associated with a different
outcome.
Figure 3.9 Allocation of participants in a two-by-two factorial design.

                Control    Trt B    Total
    Control     N/4        N/4      N/2
    Trt A       N/4        N/4      N/2
    Total       N/2        N/2      N
6 Steering Committee of the Physicians’ Health Study Research Group (1989), Hennekens
et al. (1996)
7 The Alpha-Tocopherol, Beta-Carotene Cancer Prevention Study Group (1994)
8 Writing Group for the Women’s Health Initiative Randomized Controlled Trial (2002),
The Women’s Health Initiative Steering Committee (2004)
In the component comparing estrogen plus progestin, the trial was terminated
early because of an adverse effect on blood clotting and an adverse effect on
breast cancer.9 Osteoporosis was reduced, as measured by the hip fracture
rate, but there was no observed reduction in mortality. The estrogen-alone arm
was also terminated early due to a significant adverse effect on blood clotting,
with no significant reduction in mortality but with a significant reduction in
hip fractures.10
Group-randomization Designs
There are situations in which the intervention of interest cannot be easily ad-
ministered at the individual level. For example, new sanitation practices in a
hospital may prevent certain kinds of infections. It would be difficult to create
a large trial to test this hypothesis if practitioners were required to follow
different procedures for each patient within a hospital. Instead, it would be
easier to treat all patients in a given hospital equally while assigning entire
hospitals to different practices. This is the idea behind group-randomized de-
signs. These designs differ from designs randomizing individuals because the
groups themselves are not created through the experiment, but rather they
arise from relationships among the group members (Murray et al. 2004). This
usually induces a correlation among observations from members of each group.
One example of a group-randomized trial was the Seattle 5 a Day study
(Beresford et al. 2001). In this study, the intervention consisted of a number
of strategies for individual-level and work environment changes. A total of 28
worksites in the Seattle area were randomized, half with intervention and half
without. The primary outcome was the consumption of fruits and vegetables,
as measured by a modified food frequency questionnaire. Other self-reported
measures were used as secondary outcomes. These measures were assessed
at baseline and at 2 years, using independent cross-sectional samples of 125
workers at each worksite. After 2 years, the estimated intervention effect was
0.3 daily servings of fruits and vegetables, which was statistically significant.
A similar trial was known as Teens Eating for Energy and Nutrition at
School (TEENS) (Lytle et al. 2006). The intervention was similar to that in
the 5 a Day study and consisted of classroom-based curricula, newsletters,
and changes in the school food environment in favor of fruits, vegetables, and
other nutritious foods. The group units were 16 schools in Minnesota that were
randomized to either the intervention or the control arm. Outcome measures
included assessments of both the home food environment and the school food
environment. The home food environment was assessed with a one-time parent
survey sent to random subsamples of parents at the end of the study. The
results showed that parents of children enrolled in schools receiving the in-
tervention made significantly healthier choices when grocery shopping. Other
measures derived from the parent survey did not show significant differences.
9 Writing Group for the Women’s Health Initiative Randomized Controlled Trial (2002)
10 The Women’s Health Initiative Steering Committee (2004)
The Trial of Activity in Adolescent Girls (TAAG) (Stevens et al. 2005) also
used group randomization. The design called for 36 middle schools to be ran-
domized to control or an intervention with special provisions for opportunities
for physical activity. The primary goal was to increase the intensity-weighted
minutes of moderate to vigorous physical activity engaged in by girls.
Most group-randomized studies are analyzed using linear mixed-effects mod-
els or related approaches. Models of this type are described in Chapter 8 in
the context of longitudinal data analysis. The group randomization setting is
similar to that of repeated measures data because of the correlation between
measurements within randomized units (i.e., schools, worksites, etc.). When
entire groups are randomized, the notion of “sample size” becomes more com-
plex. Both the number of groups and the number of individuals to be sampled
from within those groups must be determined. The necessary calculations are
described in Chapter 4.
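Where individual-level outcomes are available, such an analysis can be sketched with standard mixed-model software. The following is a minimal illustration only, not the analysis used in the trials above; the data frame and its column names (worksite, arm, servings) are hypothetical.

# Sketch: analysis of a group-randomized trial with a linear mixed model.
# Assumes a long-format data set with one row per sampled worker; the
# column names (worksite, arm, servings) are hypothetical.
import pandas as pd
import statsmodels.formula.api as smf

def grt_mixed_model(df: pd.DataFrame):
    """Fit servings ~ arm with a random intercept for each worksite.

    The random intercept absorbs the within-worksite correlation induced
    by group randomization; the fixed-effect coefficient for arm
    estimates the intervention effect.
    """
    model = smf.mixedlm("servings ~ arm", data=df, groups=df["worksite"])
    return model.fit()

# Usage: result = grt_mixed_model(df); print(result.summary())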
3.3 Non-inferiority Designs
Historically, most trials have been designed to show that a new intervention is
better than or superior to a standard intervention, or that a new intervention
added to standard of care is superior to standard of care alone. There are also
many situations, however, where a new intervention need not be superior to
a standard to be of interest. For example, compared to the standard, the new
intervention may be less toxic, less expensive, or less invasive and thus have
an advantage over the standard, as long as it is not worse than the standard in
clinical effect. Many industry sponsors may also want to show that their drug,
biological agent, or device is not inferior to a leading competitor product. In
cancer or AIDS treatment, a new intervention might have less toxicity and
would be of great interest to physicians and patients as long as the effect on
recurrence and mortality was almost as good as the standard drug regimen.
Trials designed to show that the new intervention is “at least as good as” the
control are known as non-inferiority trials.
Trials involving diseases that have life-threatening or other severe complica-
tions, and for which known effective treatments are available, cannot ethically
be placebo controlled if the placebo would be used in place of a treatment
known to be effective. (Placebo controls can be used if the new treatment is be-
ing used in addition to the existing standard.) Thus, unless the new treatment
is expected to be superior to the current standard, non-inferiority trials must be
conducted. On the other hand, the design and conduct of non-inferiority trials
can be extremely challenging. For example, in superiority trials, deficiencies
in either design or conduct tend to make it more difficult to demonstrate that
the new intervention is superior to control, providing incentives that help en-
sure proper study conduct. For non-inferiority trials, deficiencies can often
serve to attenuate treatment differences. In the extreme case, if no subjects in
either treatment arm comply with the protocol, the two groups will be indis-
tinguishable and non-inferiority based solely on the formal statistical test will
be confirmed (although the result will have no scientific credibility). Based
on this logic, it is commonly believed that sloppy conduct increases the likeli-
hood of showing non-inferiority (in Chapter 11, however, we show that this is
not necessarily the case). Thus, a non-inferiority trial must strive to achieve
quality as good as or better than the corresponding superiority trials.
There are also other challenges in non-inferiority trials. One is the choice of
the control group. This issue is directly addressed by the ICH-E10 guidance
document Choice of Control Group and Related Issues in Clinical Trials.11
In order to demonstrate convincingly that the new intervention is not infe-
rior, the new intervention needs to compete with the best standard available,
not the least effective alternative. The selection of the control is not always
straightforward. Different alternatives may have different risks and benefits,
leading to different levels of compliance. Members of the medical community
may not all agree on the treatment to be considered the best standard. One
concern that regulatory agencies have is that if the least effective alternative
is chosen, then a new intervention might be found to be non-inferior to a less
than optimal alternative or control. A series of such non-inferiority trials could
lead to acceptance of a very ineffective new intervention. This specific concern
leads to an often-used requirement that will be discussed later.
Another challenge results from the fact that it is impossible to show sta-
tistically that two groups are not different. Specifically, any statistic summa-
rizing the difference between groups has an associated variance or confidence
interval. To show absolute equivalence requires that the confidence interval
have width zero and this would require an infinite sample size. Similarly, to
show that one group is strictly non-inferior requires that the confidence inter-
val is strictly to one side of zero (but possibly including zero), in which case
we have essentially shown superiority. To overcome this technical difficulty,
researchers construct a zone of non-inferiority within which the groups are
considered practically equivalent (see Figure 3.10). The challenge, therefore,
is to determine the maximum difference that can be allowed, yet for which the
new treatment would still be considered non-inferior. This maximum value is
referred to as the margin of indifference. The margin of indifference may be
expressed as an absolute difference or a relative difference such as relative risk,
hazard ratio, or odds ratio. (Mathematically, this distinction is artificial in the
sense that relative differences can be expressed as absolute differences on a log
scale.) As shown in Figure 3.10, researchers typically use the confidence inter-
val for an intervention effect to draw inferences and conclusions. Figure 3.10
illustrates that if the confidence interval for the treatment difference excludes
zero, the new intervention is concluded to be either better (case A) or worse
(cases D and E). For the non-inferiority trial, researchers want the confidence
interval not to exclude zero difference, but rather want the upper limit to be
below the margin of indifference (cases B and C). In case B, the estimate of
the intervention effect indicates improvement with the upper limit being less
11 http://www.fda.gov/cber/ich/ichguid.htm
than the margin of indifference. For case C, the intervention effect estimate
indicates a slightly worse outcome but the upper limit is still less than the
margin of indifference.
Figure 3.10 Possible outcomes of a non-inferiority trial. Confidence intervals for the treatment difference (test drug better to the left of zero, standard drug better to the right) illustrate superiority (A), non-inferiority, i.e., equivalence (B and C), inferiority (D and E), and an underpowered trial (F), relative to the zone of non-inferiority defined by the estimated benefit of the standard drug over placebo.
Choosing the value of the margin of indifference is difficult (Gao and Ware,
to appear). The choice must be based on a combination of clinical, ethical, and
statistical factors. The size of the margin of indifference depends on the disease
and on the severity of toxicity or the magnitude of the cost or invasiveness relative to
the degree of benefit from the standard intervention. For example, researchers
may decide that an increase in mortality of 20% can be tolerated if the new
intervention has little to no toxicity or does not require an invasive procedure.
The design would then be based on a margin of indifference corresponding to
a relative risk of 1.2.
Regulatory agencies often take an approach that imposes two requirements
on non-inferiority trials. First, a non-inferiority trial must meet the prespecified
margin of indifference, $\delta$, when comparing, say, a relative risk $RR_{TC}$
for the new treatment ($T$) to the standard ($C$). Second, the researchers must
have relevant data that provide an estimate of the relative risk, $RR_{CP}$, of
the standard to a placebo ($P$). This estimate is typically based on previous
studies, often those used to demonstrate the effectiveness of the standard intervention.
Then, regulatory agencies may infer the relative risk, $RR_{TP}$, of the
new intervention to a placebo control by multiplying the two relative risks,
$RR_{TP} = RR_{TC} \times RR_{CP}$. If, for example, non-inferiority is demonstrated versus
the control, but the control has only minimal efficacy relative to placebo,
$RR_{TP}$ may show that, despite strong evidence of non-inferiority,
the effect against placebo is still relatively weak. Of course, this imputation
makes a key assumption: that the control relative risk is based on data
that are still relevant. This may be difficult to determine, as background therapy,
patient referral patterns, and diagnostic criteria may change over time.
In fact, these arguments are similar to those used in Section 3.2.1
to explain why historical control trials have inherent bias.
To make the inference more precise, following Fleming (2007), we transform
the relative risks to the log scale, letting $\beta_{XY} = \log(RR_{XY})$ and $\sigma^2_{XY} = \mathrm{Var}(\hat\beta_{XY})$. Using this notation, we assume that the effect of $T$ relative to $P$
is given by $\beta_{TP} = \beta_{TC} + \beta_{CP}$. Note that since we are assuming that $C$ is
effective, $\beta_{CP} < 0$ (i.e., $RR_{CP} < 1$).
Now suppose that we wish to ensure that a fraction, $p$, of the effect of $C$
relative to placebo is preserved; that is, $\beta_{TP} < p\beta_{CP}$, or equivalently,
\[ \beta_{TC} < p\beta_{CP} - \beta_{CP} = -(1-p)\beta_{CP}. \tag{3.2} \]
Assuming that $\beta_{CP}$ is known exactly, choose $\delta = -(1-p)\beta_{CP}$. Then, non-inferiority of $T$ relative to $C$ requires that
\[ \hat\beta_{TC} + Z_{1-\alpha/2}\,\sigma_{TC} < -(1-p)\beta_{CP}, \]
i.e., the upper limit of the $1-\alpha$ confidence interval is below $\delta$.
The assumption that $\beta_{CP}$ is known exactly is unrealistic, so we will need to
use an estimate, $\hat\beta_{CP}$, and also consider its variance $\sigma^2_{CP}$, both based on the
results of previous trials. Rewriting (3.2), we require $\beta_{TC} + (1-p)\beta_{CP} < 0$,
so that the criterion for showing non-inferiority is
\[ \hat\beta_{TC} + (1-p)\hat\beta_{CP} + Z_{1-\alpha/2}\left(\sigma^2_{TC} + (1-p)^2\sigma^2_{CP}\right)^{1/2} < 0 \tag{3.3} \]
or
\[ \hat\beta_{TC} + Z_{1-\alpha/2}\,\sigma_{TC} < -(1-p)\hat\beta_{CP} + Z_{1-\alpha/2}\left[\left(\sigma^2_{TC} + (1-p)^2\sigma^2_{CP}\right)^{1/2} - \sigma_{TC}\right], \tag{3.4} \]
so the choice of $\delta$ is the right-hand side of (3.4). We note, however, that
in this case $\delta$ depends on $\sigma^2_{TC}$, which may not be known until the trial is
completed. When this might be of concern, one may simply prespecify the
other parameters, $p$, $\hat\beta_{CP}$, and $\sigma^2_{CP}$, and apply (3.3) once $\hat\beta_{TC}$ and $\sigma^2_{TC}$ are
known.
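As a concrete illustration, criterion (3.3) can be evaluated directly from summary statistics. The sketch below assumes log relative risks and standard errors supplied by the current trial ($\hat\beta_{TC}$) and by historical studies ($\hat\beta_{CP}$); all numeric inputs are illustrative.

# Sketch: the non-inferiority criterion (3.3) on the log relative risk
# scale. Inputs are illustrative: beta_tc_hat and se_tc would come from
# the current trial, beta_cp_hat and se_cp from historical studies.
from math import log, sqrt
from scipy.stats import norm

def preserved_effect_test(beta_tc_hat, se_tc, beta_cp_hat, se_cp,
                          p=0.5, alpha=0.05):
    """True if (3.3) declares non-inferiority while preserving a
    fraction p of the control-versus-placebo effect."""
    z = norm.ppf(1 - alpha / 2)
    lhs = (beta_tc_hat + (1 - p) * beta_cp_hat
           + z * sqrt(se_tc**2 + (1 - p)**2 * se_cp**2))
    return lhs < 0

# Illustrative: RR_TC = 0.95 estimated in the new trial, RR_CP = 0.80
# from historical data; prints True (criterion met) for these inputs.
print(preserved_effect_test(log(0.95), 0.05, log(0.80), 0.04))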
Certainly, following this approach requires assumptions. One is that the
historical estimate of the control effect relative to placebo (e.g., $RR_{CP}$) is still
relevant to the present population and standards of practice. For example,
new background therapies may have been developed and the patient mix may have
changed in important but possibly unknown ways. In addition, the initial trial
or trials were based on a particular set of patients or participants who vol-
unteered and they may be different in important respects from the current
population being studied. Whether these differences exist or not is usually
difficult to determine. This assumption is sometimes referred to as the con-
stancy assumption and it may not hold in all cases. Examples exist where two
trials conducted consecutively with the same control intervention had signifi-
cant differences in the control arms between trials (Packer et al. 1996; Carson
et al. 2000). In some cases, the historical data may not even exist if the control
was established on the basis of an intermediate marker such as lowering blood
pressure, and the next treatment is being evaluated on reducing mortality or
morbidity. Second, the fraction, p, of the initial control-versus-placebo effect to
be maintained (e.g., 50%) is arbitrary but has a major impact on the choice of δ.
Another consideration in setting the value of δ is the magnitude of the
difference or relative difference that would change clinical practice or patient
preference. While the methods described in this section may be of value in
providing guidance, other medical considerations should also be considered
and adjustments made before the trial design is finalized. Nonetheless, it may
be more important to focus on the trial that is being conducted, making sure
that the appropriate or best control intervention is being utilized, that the trial
is being conducted in the best possible way, and that the new treatment or
intervention is compared with the statistically and medically determined value
of δ. The imputation of the new intervention effect compared to placebo, had
a placebo arm been present, while relevant, should not necessarily be the main
focus of the interpretation. Control intervention “creep,” that is, sequentially
choosing weaker and less effective controls, can probably best be prevented
by discussions among trial investigators, sponsors, and, when necessary, the
appropriate regulatory agencies.
An example of a non-inferiority trial is the OPTIMAAL trial, a trial of
a new drug, losartan, versus a standard drug, captopril, for patients with chronic
heart failure (Dickstein et al. 2002). The primary outcome was total mortality.
The margin of indifference was determined to be a relative risk of 1.2. Cough is
a side effect of captopril use and, as a result, many patients are non-adherent.
If the new drug losartan could be shown to be non-inferior to captopril, it might
be a viable alternative treatment. The standard drug captopril had previously
been shown to be superior to placebo with a relative risk, $RR_{CP}$, of 0.805.
As shown in Table 3.4, the relative risk for the losartan–captopril comparison,
$RR_{TC}$, was 1.12, indicating an increase in mortality for patients treated with
the new drug losartan. The imputed relative risk, obtained by multiplying
the two relative risks together, was 0.906. While this imputed
relative risk of 0.906 is favorable, the upper limit of the confidence interval
for the losartan–captopril comparison, 1.26, did not meet the OPTIMAAL
prespecified margin of indifference of 1.2. As a result, the new drug losartan was
rejected as an alternative to captopril. Of course, the prespecified margin of
indifference was a decision based on the judgment of the investigators.
Another example that illustrates important issues is two simultaneous trials
of a new agent, ximelagatran, compared to the standard control of warfarin.
Warfarin is the generally accepted standard to prevent clotting of the blood,
but requires frequent monitoring to maintain the correct dose levels and avoid
either over- or under-clotting. New agents that could provide the same clinical
benefit but without the inconvenience of the intensive monitoring are of great
interest. Ximelagatran had been shown in earlier studies to effectively pre-
vent clotting and two trials, SPORTIF III, conducted largely in Europe, and
SPORTIF V, conducted largely in the U.S., were designed as non-inferiority
trials to assess the ability of ximelagatran to reduce mortality and morbid-
ity.12 The subjects in both trials suffered from atrial fibrillation, increasing the
risk of stroke and thereby the risk of mortality or morbidity. Clearly, warfarin
was the appropriate control and the trials were conducted extremely well,
achieving excellent clotting control.
Two issues arose, however, that are instructive for non-inferiority trials.
The first issue is that the margin of indifference was based on an absolute
difference in event rates, assuming an annual event rate of approximately 3%.
In fact, the annual event rate observed in the trial was half the assumed rate,
raising concerns that δ was in fact too large. The second issue is
that there is little reliable data comparing warfarin to placebo, making the
imputation of the effect of ximelagatran to placebo problematic. In retrospect,
it would have been better to design these trials using a relative scale that
would have automatically accounted for the lower than expected event rate.
Nevertheless, the lack of historical data regarding the clinical effect of warfarin
precludes meaningful imputation of effectiveness relative to placebo. These
issues, combined with an observed increase in serious adverse effects related
to abnormal liver function, led to the decision by the FDA to not approve
ximelagatran based on these trials.
The design and conduct of non-inferiority trials is especially challenging.
Many authors have discussed advantages and challenges of non-inferiority
trials.13 While such trials are necessary, the precise design and methods of
analysis are still not completely determined and more experience is necessary
before such methods become established.
14 Steering Committee of the Physicians’ Health Study Research Group (1989), Multiple
Risk Factor Intervention Trial Research Group (1982), Writing Group for the Women’s
Health Initiative Randomized Controlled Trial (2002), Diabetes Control and Complica-
tions Trial Research Group (1993), The Alpha-Tocopherol, Beta-Carotene Cancer Pre-
vention Study Group (1994), Hypertension Detection and Follow-up Program Coopera-
tive Group (1979), Lipid Research Clinics Program (1984)
15 Beta-Blocker Heart Attack Trial Research Group (1982), MERIT-HF Study Group
(1999), Packer (2000), Bristow et al. (2004), ALLHAT Collaborative Research Group
(2000), Aspirin Myocardial Infarction Study (AMIS) Research Group (1980), The Coro-
nary Drug Project Research Group (1975), Hulley et al. (1998)
than about 20%, perhaps as low as 5–10%. Therefore, large numbers of indi-
viduals must be screened for eligibility and willingness to participate in order
to get the desired number of randomized participants.
Since these individuals are not ill, the trial design must account for the
likelihood that not all of them will comply with the intervention. That is, the
individuals are less likely to take all of their medication, especially if there are
undesirable side effects, and as a result the sensitivity of the trial to detect a
benefit of the intervention will be reduced. In Chapter 4 we discuss sample size
calculations that attempt to take into account the anticipated level of compliance.
Failure to account for non-compliance can be problematic.
Secondary prevention trials are designed to evaluate whether an interven-
tion can prevent the recurrence of the disease in a population who have had
a defining event. For example, we may assess whether drugs such as beta-
blockers reduce the occurrence of a second heart attack in a patient surviving
a heart attack or whether longer term use of tamoxifen following standard
treatment reduces the risk of breast cancer recurrence. Secondary pre-
vention trials require a different recruitment strategy since these individuals
are now identified through an event or occurrence of a disease. Secondary
prevention trials, like primary prevention trials, are often large because the
recurrence rates may be low. On the other hand, there may be a large number
of eligible individuals. While these individuals have been diagnosed with a
disease, they may still not fully comply with the intervention being tested.
Thus, compliance must be considered in the design of the trial.
For both primary and secondary prevention trials, designing the trial to
address the compliance to intervention is critical. As shown in Chapter 4, the
sample size increases nonlinearly with the degree of non-compliance. Thus,
every attempt must be made in the design to maximize compliance. Char-
acteristics of the participant population may be used to identify individuals
likely to have a high degree of compliance. For example, individuals with other
competing risks or diseases may have difficulty with compliance. If individuals
are not able to come to the clinic for evaluation and intervention support, they
may not be able to comply optimally. Once entry criteria and logistical aspects
have been addressed, the sample size must be adjusted to account for the
remaining degree of non-compliance.
As we will discuss further in Chapter 11, the “intent to treat” principle
requires that all participants randomized into the trial must be accounted for
in the analysis. Failure to comply with this principle can lead to serious bias in
the analysis and interpretation of the results. It is not appropriate to remove
participants who do not comply fully with the intervention. Thus, potential
non-compliance must be addressed in the design.
Therapeutic trials evaluate the effectiveness of new treatments or interventions
in reducing mortality and morbidity among patients in the acute phase of their
disease.16 For example, treatments such as clot-busting drugs for a patient
with an evolving heart attack have proven effective in reducing death due to
the heart attack. Recruitment strategies for therapeutic trials depend on access
to patients and their willingness to participate in the trial. The recruitment
pool is generally much smaller than for a primary prevention trial and the
yield is also much less than 100%, probably also in the neighborhood of 20%.
16 The TIMI Research Group (1988), Volberding et al. (1990), The Global Use of Strategies
to Open Occluded Coronary Arteries (GUSTO III) Investigators (1997), Moss et al.
(1996), Cardiac Arrhythmia Suppression Trial II Investigators (1992), The Diabetic
Retinopathy Study Research Group (1978), Fisher et al. (1995), Gruppo Italiano per
lo Studio Della Sopravvivenze Nell’Infarcto Miocardico (GISSI) (1986)
total information content is unchanged. This is achieved at the conclusion of
the trial by decomposing the usual Z statistic into two components: the Z-statistic
at the time of the design modification and a Z-statistic derived from
the post-modification observations.
For simplicity, assume that $X_i \sim N(\theta, 1)$, $i = 1, \ldots, N_0$, where $N_0$ is the
initial total sample size. Let $n$ be the sample size at the time the adjustment
is made and $t$ the information fraction, $t = n/N_0$. The observed treatment
difference is
\[ \hat\theta = \sum_{i=1}^{n} X_i / n, \]
and the interim Z statistic for testing $H_0: \theta = 0$ is
\[ Z^{(n)} = \sum_{i=1}^{n} X_i / \sqrt{n}. \]
With no modification the trial would complete with sample size $N_0$, and a
final test statistic
\[ Z^{(N_0)} = \sum_{i=1}^{N_0} X_i / \sqrt{N_0}
 = \sqrt{\frac{n}{N_0}} \sum_{i=1}^{n} X_i / \sqrt{n} + \sqrt{\frac{N_0 - n}{N_0}} \sum_{i=n+1}^{N_0} X_i / \sqrt{N_0 - n}
 = \sqrt{t}\, Z^{(n)} + \sqrt{1 - t}\, Z^{(N_0 - n)} \tag{3.5} \]
where $Z^{(N_0 - n)}$ is the Z statistic using only the final $N_0 - n$ subjects. Note
that this is a weighted sum of the Z statistics involving the first $n$ and final
$N_0 - n$ subjects, where the weights are the square roots of the corresponding
information fractions.
It can be shown that, provided that the weights remain fixed, we can adjust
the final sample size any way that we like, in particular based on $\hat\theta$, and use
the corresponding Z statistic in place of $Z^{(N_0 - n)}$ in (3.5) and still preserve
the overall type I error rate.
Suppose that, based on $\hat\theta$, we propose a revised total sample size $N$, and at
the end of the trial we use the test statistic
\[ Z^{(N)} = \sqrt{t}\, Z^{(n)} + \sqrt{1 - t}\, Z^{(N - n)}. \]
Then, under an alternative hypothesis $H_1: \theta = \theta_A$, unconditionally,
\[ E Z^{(N)} = \frac{n + \sqrt{(N_0 - n)(N - n)}}{\sqrt{N_0}}\, \theta_A. \]
Thus, if $N > N_0$, we have $E Z^{(N)} > E Z^{(N_0)}$, and the power for $H_1$ will be
increased, although the increase will be smaller than if we had used $N$ as the
planned sample size at the start of the trial. Using this approach, the outcomes
for the final $N - n$ subjects are down-weighted (provided $N > N_0$) so that
the effective information contributed is fixed.
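That the overall level is preserved is easy to check by simulation; the sketch below uses the normal model above with a hypothetical adaptation rule (double the total sample size when the interim mean is positive). Any rule based only on the first-stage data works, because the weights remain fixed.

# Sketch: type I error of the weighted-Z design under H0: theta = 0.
# The adaptation rule here is hypothetical; the level is preserved for
# any rule based on first-stage data because sqrt(t), sqrt(1-t) are fixed.
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(0)
N0, n, alpha, reps = 100, 50, 0.05, 100_000
t = n / N0
crit = norm.ppf(1 - alpha)  # one-sided critical value
rejections = 0
for _ in range(reps):
    x1 = rng.normal(0.0, 1.0, n)              # first-stage data under H0
    z1 = x1.sum() / np.sqrt(n)
    N = 2 * N0 if x1.mean() > 0 else N0       # data-dependent revision
    x2 = rng.normal(0.0, 1.0, N - n)          # second-stage data
    z2 = x2.sum() / np.sqrt(N - n)
    if np.sqrt(t) * z1 + np.sqrt(1 - t) * z2 > crit:
        rejections += 1
print(rejections / reps)  # close to alpha = 0.05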
The choice of $N$ can be subjective, possibly using both prior expectations
regarding the likely effect size and the interim estimate, $\hat\theta$, and its associated
variability. Note that since $Z^{(n)}$ is known, one would likely base the decision
on the conditional expectation of $Z^{(N)}$ given $Z^{(n)}$.
Tsiatis and Mehta (2003) have criticized this method as inefficient.
Others have worried that these trials are prone to bias, since the size of the
sample size adjustment may invite speculation about the size of the treatment
effect in a blinded trial. These response-adaptive designs have been used
(Franciosa et al. 2002; Taylor et al. 2004), but not yet widely, owing to these
concerns. Jennison and Turnbull (2006) suggest that efficient trials can be
designed using common group sequential methods by considering a priori a
number of possible effect sizes.
Lan and Trost (1997) proposed using conditional power to assess
whether the interim results are sufficiently encouraging to justify an
increase in sample size. This concept was further developed by Chen, DeMets,
and Lan (2000). Conditional power is the probability of obtaining a statisti-
cally significant result at the end of the trial, given the current observed trend
in the intervention effect compared to control and given an assumption about
the true but unknown treatment effect for the remainder of the trial. Condi-
tional power will be described in more detail in the data monitoring chapter
(Chapter 10). Let Z(t) be the normalized statistic observed at information
fraction t in the trial. The conditional power is Pr(Z(1) ≥ C|Z(t), θ), where
C is the critical value at the conclusion of the trial, and θ is the standardized
treatment difference. The computational details can be found in Section 10.6.3.
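Under the Brownian motion approximation commonly used for interim monitoring (the computational details are in Section 10.6.3), conditional power has a simple closed form. The sketch below is one common formulation, assuming the drift θ is scaled so that E Z(1) = θ; it is an illustration, not the book's exact derivation.

# Sketch: conditional power Pr(Z(1) >= crit | Z(t), theta) under a
# Brownian motion approximation with drift theta (so E[Z(1)] = theta).
from math import sqrt
from scipy.stats import norm

def conditional_power(z_t, t, theta, crit):
    # B(t) = Z(t) * sqrt(t); given B(t), the increment to B(1) is
    # Normal(theta * (1 - t), 1 - t).
    b_t = z_t * sqrt(t)
    return 1 - norm.cdf((crit - b_t - theta * (1 - t)) / sqrt(1 - t))

# Halfway through the trial with Z(0.5) = 1.0, assuming the current
# trend continues (theta = z_t / sqrt(t)), final critical value 1.96:
print(conditional_power(1.0, 0.5, 1.0 / sqrt(0.5), 1.96))  # about 0.22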
Chen, DeMets, and Lan showed that if the conditional power for the ob-
served trend is greater than 0.50 but less than desired, the sample size can be
increased up to 75% with little practical effect on the overall type I error rate
and with no serious loss of power.
The methods described above have many attractive features and many con-
trol the type I error, as desired. The primary concern with adaptive designs
that adjust sample size based on emerging trends is not technical but rather
logistical. Trials are designed and conducted to minimize as many sources of
bias as possible. If the sample size is adjusted based on emerging trends ac-
cording to one of the methods described, it is straightforward for those with
knowledge of both the size of the adjustment and the adjustment procedure
to obtain an estimate of the emerging treatment difference. Of course, follow-
ing any interim analysis, investigators may have been given clues regarding
the current trend—if the recommendation is to continue, investigators might
assume that the current results have exceeded neither the boundary for ben-
efit nor the boundary for harm. If there is a formal futility boundary, then
additional clues may be given about where the current trend might be. Of
course, the difficulty is that data monitoring committees may have reasons
to recommend that a trial continue even if a boundary for the primary out-
come has been crossed. (Related issues are discussed further in Chapter 10.)
Investigators, therefore, may be able to glean knowledge regarding the range
for current trends whether or not the trial is using any adaptive design. For
adaptive designs modifying sample size based on current trends, however, the
clues may be stronger. The concern is whether this information biases the
conduct of the trial in any way. For example, would investigators alter their
recruitment or the way participants are cared for? If the trial is double-blind,
biases may be less than for non-blinded trials. As of today, there is inadequate
experience with these types of adaptive designs to be confident that bias will
not be introduced. These designs are still somewhat controversial.
3.6 Conclusions
An engineer or architect must finalize the design of a large project before
construction begins, for it is difficult, if not impossible, to correct
a design flaw once construction has begun. The same is true for the
design of a clinical trial. No amount of statistical analysis can correct for a
poor or flawed design. In this chapter, we have discussed a variety of designs,
each of which has an appropriate application. Each design may need to be
modified somewhat to meet the needs of the particular research question. For
the phase III clinical trial, the randomized controlled trial is the most widely
used design. While simple, it can be used to address many research questions
and is very robust, relying on few assumptions beyond the
use of randomization.
A clinical trial should not use a research design that is more complicated
than necessary. In general, the practice should be to keep the design as simple
as possible to achieve the objectives. In recent years, clinical trials have in-
creasingly utilized the factorial design with success, getting two or more ques-
tions answered with only a modest increase in cost and complexity. Greater
use of the factorial design has the potential to improve efficiency in cases for
which it is appropriate. Statistical methods for adaptive designs that al-
low for modification of sample size during the course of the trial based on
interim trends are available, but there remain concerns regarding their practi-
cal implementation. Use of this kind of adaptive design should be considered
with caution.
3.7 Problems
3.1 Suppose we conduct a phase I trial to find the MTD. We use the daily
doses 1g, 1.9g, 2.7g, 3.4g, 4.0g, and 4.6g which correspond to true (un-
known) toxicity levels of 0%, 10%, . . . , 40%, and 50%, respectively. For
this exercise, you may need to use a computer for simulation or, where
possible, direct calculation.
(a) First, consider using the traditional design. The trial begins with
the 1g dose and proceeds according to the algorithm described in
Section 3.1.1; the last dose without excessive toxicity is the MTD
estimate.
i. What is the expected value of the MTD estimate in this case?
ii. What is the expected value of the sample size?
iii. What is the expected number of subjects that would be treated
at or over the 40% toxicity level (i.e., 4g) in this case?
iv. What modifications could you make in the design so that the
expected value of the toxicity of the MTD estimate will be ap-
proximately 1/3?
(b) Next, consider the modified design (allowing steps down in dose),
starting at the third level (2.7g), stopping after 15 patients.
i. What is the expected value of the MTD estimate (use the stopping
dose to estimate the MTD)?
ii. What is the expected number of patients who would be treated
at or over the 40% toxicity level?
(c) Finally, consider using another design with single-patient cohorts.
Start at the first dose level, increasing the dose for the next subject
when a non-toxic response is observed. When a toxic response is
observed, a second stage begins at the previous (lower) dose level. If
two subsequent non-toxic responses are observed, the dose increases.
If a toxic response is observed, the dose is reduced. Stop after a total
of 15 patients.
i. What is the expected value of the MTD estimate (again, use the
stopping dose to estimate the MTD)?
ii. What is the expected number of patients who would be treated
at or over the 40% toxicity level?
(d) How would you derive confidence intervals for the MTD using data
from the three designs above? Demonstrate your confidence intervals’
coverage properties via simulation.
3.2 Consider Fleming’s three-hypothesis formulation for phase II trials with
binary outcomes. Suppose you have a single arm trial where π1 = 0.05,
π2 = 0.25, and the total sample size is 30. Set the level for H0 and H2
each to 0.05, i.e.,
\[ P(\text{reject } H_0 \mid \pi_T = 0.05) \le 0.05 \quad\text{and}\quad P(\text{reject } H_2 \mid \pi_T = 0.25) \le 0.05. \]
(a) Give the three critical regions that would be used at the trial’s con-
clusion. State the probabilities of observing an outcome in each of
the three regions, given that H2 is true (πT = 0.3).
(b) Suppose that, after fifteen patients, only one success is observed.
What is the probability, conditional on the interim results, of reject-
ing H0 , given that H2 is true (πT = 0.3)? What is the probability,
conditional on the interim results, of rejecting H0 , given that H0 is
true (πT = 0.05)?
(c) Before the trial starts, suppose it is determined that another identical
trial, using the same critical regions, will be performed following
this one in the event that no hypothesis is rejected. What is the
probability, given H0 is true (πT = 0.05), that H0 will be rejected?
What is the probability, given H2 is true (πT = 0.25), that H2 will
be rejected? Should the critical regions be adjusted to conserve the
overall error rates? Explain.
3.3 Consider a non-inferiority study in which we are interested in the prob-
abilities of failure πC and πE in the control and experimental arms,
respectively. In particular, we are interested in the following two pairs
of hypotheses: absolute (additive) non-inferiority,
\[ H_0: \pi_E - \pi_C \ge 0.1 \quad\text{versus}\quad H_1: \pi_E - \pi_C < 0.1; \]
and relative (multiplicative) non-inferiority,
\[ H_0: \pi_E/\pi_C \ge 2 \quad\text{versus}\quad H_1: \pi_E/\pi_C < 2. \]
There are 80 subjects in each arm. Assume the sample is large enough
to use the normal approximation.
(a) What test statistic do you suggest for the hypotheses of absolute
non-inferiority?
(b) For a one-sided test at the 0.05 level, show the rejection region on a
plot with axes corresponding to the observed failure proportions in
the two arms.
(c) What test statistic do you suggest for the hypotheses of relative non-
inferiority?
(d) For a one-sided test at the 0.05 level, show the rejection region on a
plot as in the previous case.
(e) Which test do you think will be more likely to reject H0 when πE =
πC = 0.1?
(f) In general, supposing that πE = πC , for what values (approx-
imate) of these probabilities, if any, do you think that your absolute
non-inferiority test will be more likely to reject H0 ? For what values,
if any, do you think that your relative non-inferiority test will be
more likely to reject H0 ?
CHAPTER 4
Sample Size
Figure 4.1 Illustration of the fundamental principle underlying sample size calculations. Curves represent the densities for the mean of N observations from normal distributions with mean either 0 or µ1 for N = 10, 30, 120. The black curves represent the densities for the sample sizes indicated for each panel. The vertical lines indicate the location of the critical value for a (one-sided) test of H0 : µ = 0.
The methods described in this chapter are focused on finding the sample size that ensures
that the trial will have the desired type I and type II errors. Intuitively, this
is equivalent to selecting $N$ so that $C_N$ is properly situated between the true
means under $H_0$ and $H_1$: if $C_N$ is too close to $\mu_1$ we will have high type II
error, and if $C_N$ is too close to 0, we will enroll more subjects than necessary.
which is equivalent to
\[ Z_{1-\alpha} \sqrt{E[\hat V_0 \mid H_1]} + Z_{1-\beta} \sqrt{V_1} = \Delta. \tag{4.1} \]
Equation (4.1) is the most general equation relating effect size, significance
level, and power; however, it does not explicitly involve the sample size. In
order to introduce the sample size into equation (4.1), we need a bit more.
First, while many, if not most, studies have equal sized treatment groups,
occasionally the randomization is unbalanced and we have unequal sized treatment
groups. Let $\xi_j$ be the proportion of subjects in treatment group $j$, so
that $\xi_1 + \xi_2 = 1$ (if there are more than two treatment groups, $\xi_j$ will be the
proportion out of those assigned to the two groups under consideration).
Second, among the cases that we consider, we will have $N E[\hat V_0 \mid H_1] \approx \phi(\bar\gamma)/\xi_1\xi_2$ where $\bar\gamma = \xi_1\gamma_1 + \xi_2\gamma_2$, and $N V_1 = \phi(\gamma_1)/\xi_1 + \phi(\gamma_2)/\xi_2$ for some
function $\phi(\cdot)$; $\phi(\gamma)$ may be regarded as the variance contributed to the test
statistic by a single observation given that the true parameter is $\gamma$. In this
case, equation (4.1) becomes
\[ \Delta\sqrt{N} = Z_{1-\alpha} \sqrt{\phi(\bar\gamma)/\xi_1\xi_2} + Z_{1-\beta} \sqrt{\phi(\gamma_1)/\xi_1 + \phi(\gamma_2)/\xi_2} \tag{4.2} \]
which can be solved for $N$ to find the sample size required to achieve a given
power, or for $1 - \beta$ to find the power for a given sample size. In many cases
$\phi(\gamma_1)/\xi_1 + \phi(\gamma_2)/\xi_2 \approx \phi(\bar\gamma)/\xi_1\xi_2$ and this equation can be reasonably approximated by
\[ \sqrt{\phi(\bar\gamma)/\xi_1\xi_2}\,(Z_{1-\alpha} + Z_{1-\beta}) = \Delta\sqrt{N}. \tag{4.3} \]
Note that using equation (4.3), $N \propto 1/\xi_1\xi_2$, so that if we have an unbalanced
Figure 4.2 Graphical representation of equation (4.2). For this figure we take ξ1 = ξ2 = 1/2.
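To make equation (4.3) concrete, the sketch below solves it for N given a variance function φ(·). The binomial illustration, with φ(π) = π(1 − π) and a risk-difference effect, is an assumption for the example; as printed, (4.3) uses Z1−α, so α would be replaced by α/2 for a two-sided test.

# Sketch: solving equation (4.3) for N given a variance function phi.
# The binomial inputs below are illustrative.
from math import ceil
from scipy.stats import norm

def sample_size(delta, phi_bar, alpha=0.05, beta=0.10, xi1=0.5):
    # (4.3): sqrt(phi_bar/(xi1*xi2)) * (Z_{1-a} + Z_{1-b}) = delta*sqrt(N)
    xi2 = 1 - xi1
    z = norm.ppf(1 - alpha) + norm.ppf(1 - beta)
    return ceil(phi_bar / (xi1 * xi2) * (z / delta) ** 2)

# Binomial illustration: pi1 = 0.4, pi2 = 0.3, phi(pi) = pi*(1 - pi)
# evaluated at the average rate.
pi_bar = (0.4 + 0.3) / 2
print(sample_size(delta=0.4 - 0.3, phi_bar=pi_bar * (1 - pi_bar)))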
Suppose we wish to conduct a study of short term mortality (say within one
month of treatment) so that the time of death is not of concern. Suppose
further that $\pi_1$ and $\pi_2$ are the probabilities of death for the experimental and control
groups, respectively, and we wish to test $H_0: \pi_1 = \pi_2 = \pi$ versus $H_1: \pi_1 \neq \pi_2$.
Let $a$ be the number of deaths in the experimental group and $b$ be the number
of deaths in the control group, so that $a \sim \mathrm{Bin}(n_1, \pi_1)$ and $b \sim \mathrm{Bin}(n_2, \pi_2)$,
where $n_1$ and $n_2$ are known and fixed, so that $\xi_j = n_j/N$, $j = 1, 2$.
If we have the table
4.3.2 Non-adherence
As in the previous section, assessing or accounting for non-adherence is diffi-
cult because there may be associations between adherence, treatment assign-
ment, and outcome. To avoid these difficulties we will assume that the analysis
will be conducted according to the Intent to Treat principle, requiring that
all subjects be analyzed using their assigned treatment groups regardless of
the treatment actually received (or degree of adherence). The Intent to Treat
principle will be discussed in more detail in Chapter 11.
To begin, we will assume that all randomized subjects will receive one or the
other of the treatments under study—no other treatments are available and
no subjects will receive only partial treatment. We will also assume that both
treatments are available to all subjects. In particular, a proportion, p, of subjects
assigned to the control group, referred to as drop-ins, will actually receive
the experimental treatment. For cases in which the experimental treatment
is unavailable to control subjects, we can take p = 0. Similarly, a proportion
q of subjects assigned the experimental treatment, referred to as drop-outs,
will actually receive the control treatment. Since the control treatment will
usually be either no-treatment (placebo) or an approved or otherwise accepted
treatment, we will generally have q > 0.
For simplicity, suppose that $Y_{ij}$ is an observation from subject $i$ in treatment
group $j = 1, 2$ and that $EY_{ij} = \mu_j$. If adherence status is independent of
outcome and the proportion of subjects in the group assigned the experimental
treatment who fail to adhere is $q$, then $EY_{i1} = (1-q)\mu_1 + q\mu_2$.4 Similarly, $EY_{i2} = p\mu_1 + (1-p)\mu_2$. Hence the difference in means is
\[ E[Y_{i1} - Y_{i2}] = ((1-q)\mu_1 + q\mu_2) - (p\mu_1 + (1-p)\mu_2) = (1-p-q)(\mu_1 - \mu_2), \tag{4.10} \]
and the effect of non-adherence of this type is to decrease the expected treatment
difference by a factor of $1 - p - q$. Note that under the null hypothesis
4 If adherence status is not independent of outcome, we may have, for example, that there
is no effect of treatment on non-adherers. If this were the case, then $EY_{i2} = \mu_2$ and
equation (4.10) would not hold.
µ1 = µ2 , so there is no effect on the distribution of the observed treatment
difference.
In the case where there is partial adherence—for example, subjects receive
a fraction of their assigned treatment—equation (4.10) still holds provided
that (1) p and q are the mean proportions of assigned treatment that are re-
ceived in the respective treatment groups and (2) the dose-response is a linear
function of the proportion of treatment actually received. These assumptions
are usually adequate for the purpose of sample size estimation and, in the
absence of detailed information regarding adherence rates and dose responses,
are probably the only reasonable assumptions that can be made. If p and q
are small, then equation (4.10) is likely to be a good approximation, and all
efforts should be made in the design and conduct of the study to ensure that
p and q be made as small as possible. In section 4.4 we will discuss a more
general approach in the context of time-to-event analyses.
In the general setting, based on equation (4.10), assumption (2) becomes
$EU = (1-p-q)\Delta$. Hence, combined with equation (4.9), we can rewrite
equation (4.2) to account for both loss to follow-up and non-adherence, yielding
the general formula
\[ N = \frac{\left( Z_{1-\alpha}\sqrt{\phi(\bar\gamma)/\xi_1\xi_2} + Z_{1-\beta}\sqrt{\phi(\gamma_1)/\xi_1 + \phi(\gamma_2)/\xi_2} \right)^2}{(1-p-q)^2 \Delta^2 (1-r)}. \]
To illustrate the impact of non-adherence and drop-out on sample size,
Table 4.1 shows the required sample sizes for non-adherence and loss to follow-
up rates between 0 and 50% when the required sample size with complete
adherence and follow-up is 1000. Since the adjustment to the sample size
depends on the sum p + q, the results are shown as a function of p + q.
Table 4.1 Impact of non-adherence (p + q) and loss to follow-up (r) on required sample size.
p + q \ r      0    10%    20%    30%    40%    50%
0           1000   1111   1250   1429   1667   2000
10%         1235   1372   1543   1764   2058   2469
20%         1562   1736   1953   2232   2604   3125
30%         2041   2268   2551   2915   3401   4082
40%         2778   3086   3472   3968   4630   5556
50%         4000   4444   5000   5714   6667   8000
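The inflation underlying Table 4.1 is simple to compute directly; a minimal sketch:

# Sketch: inflating a sample size for non-adherence (p + q) and loss to
# follow-up (r) via N / ((1 - p - q)^2 (1 - r)); rounding reproduces
# Table 4.1 when n_ideal = 1000.
def adjusted_n(n_ideal, p_plus_q=0.0, r=0.0):
    return round(n_ideal / ((1 - p_plus_q) ** 2 * (1 - r)))

print(adjusted_n(1000, p_plus_q=0.2, r=0.1))  # 1736, matching Table 4.1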
No Censoring
If there is no censoring, $ET_{ij} = EY_{ij} = 1/\lambda_j$. Hence $\phi(\lambda) = \lambda/ET = \lambda^2$. Thus
if $\bar\lambda = \xi_1\lambda_1 + \xi_2\lambda_2$, equation (4.2) becomes
\[ N = \frac{\left( Z_{1-\alpha}\,\bar\lambda/\sqrt{\xi_1\xi_2} + Z_{1-\beta}\sqrt{\lambda_1^2/\xi_1 + \lambda_2^2/\xi_2} \right)^2}{(\lambda_1 - \lambda_2)^2} \]
and equation (4.3) becomes
\[ N = \frac{(Z_{1-\alpha} + Z_{1-\beta})^2\, \bar\lambda^2}{\xi_1\xi_2 (\lambda_1 - \lambda_2)^2}. \tag{4.11} \]
It is worth noting that these formulas are scale invariant. That is, the sample
size remains unchanged if both $\lambda_1$ and $\lambda_2$ are multiplied by a constant.
This implies that while the treatment difference is specified in terms of the
difference in rates, the sample size in fact depends only on the hazard ratio,
$\lambda_1/\lambda_2$.
This fact is easier to see if we let $U = \log(\hat\lambda_1/\hat\lambda_2)$. Then it is easy to show
that $\phi(\lambda) = 1$, and we have
\[ N = \frac{(Z_{1-\alpha} + Z_{1-\beta})^2}{\xi_1\xi_2 \left[\log(\lambda_1/\lambda_2)\right]^2}. \tag{4.12} \]
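Equation (4.12) gives a one-line calculation that depends on the treatment rates only through the hazard ratio; a minimal sketch with illustrative inputs:

# Sketch: total sample size from equation (4.12) for exponential failure
# times with no censoring. Inputs are illustrative; alpha is one-sided.
from math import log, ceil
from scipy.stats import norm

def n_exponential(hazard_ratio, alpha=0.05, beta=0.10, xi1=0.5):
    z = norm.ppf(1 - alpha) + norm.ppf(1 - beta)
    return ceil(z ** 2 / (xi1 * (1 - xi1) * log(hazard_ratio) ** 2))

print(n_exponential(1.5))  # hazard ratio 1.5, equal allocation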
Common Censoring Time
In the case of a common censoring time, all subjects are followed for a fixed
length of time. This may happen in studies involving subjects experiencing
an acute episode where, after a period of a few weeks or months, the risk of
the event in question will have returned to a background level. In such cases
we may choose to terminate follow-up after a relatively short, predetermined
time. Here we assume that $C_{ij} = C$, a known common value, and we have
\[ ET_{ij} = \int_0^C t\,\lambda_j e^{-\lambda_j t}\,dt + C e^{-\lambda_j C} = \frac{1 - e^{-\lambda_j C}}{\lambda_j}, \tag{4.13} \]
so $\phi(\lambda) = \lambda^2/(1 - e^{-\lambda C})$. Hence, the required sample size will be inflated by
approximately $1/(1 - e^{-\bar\lambda C})$.
It is useful to note that $1 - e^{-\bar\lambda C}$ is approximately the probability that a
subject in the trial will be observed to fail by time $C$, so that $N(1 - e^{-\bar\lambda C})$
is approximately the expected number of failures, $D_\cdot = D_1 + D_2$. Hence, in
this case equation (4.11) will apply, not to the total sample size, but to the
required number of failures, $D_\cdot$. We have now shown that for exponential
observations with a common length of follow-up, the number of failures
required for given type I and type II error rates depends only on the hazard
ratio, independent of the underlying rates and the length of follow-up.
Furthermore, in cases where the hazards change with time but the hazard ratio
$\lambda_2(t)/\lambda_1(t)$ is constant, the time scale can always be transformed so that the
transformed failure times are exponential, and this result will still hold. Later
we will see that this is true regardless of the pattern of enrollment and the
underlying distribution of failure times, provided that the hazard functions are
proportional across all follow-up times.
The case of staggered entry with uniform enrollment is one in which the Cij
are uniformly distributed on the interval [F − R, F ] where R is the length of
the enrollment period and F is the total (maximum) length of follow-up. We
will assume that all subjects are followed to a common termination date.
If a subject is enrolled at time $s \in [0, R]$, then as in equation (4.13), the
expected follow-up time is $(1 - e^{-\lambda_j(F-s)})/\lambda_j$. The overall expected follow-up
time will be
\[ ET_{ij} = \frac{1}{R}\int_0^R \frac{1 - e^{-\lambda_j(F-s)}}{\lambda_j}\,ds = \frac{R\lambda_j - e^{-\lambda_j(F-R)} + e^{-\lambda_j F}}{R\lambda_j^2}, \tag{4.14} \]
so $\phi(\lambda) = R\lambda^3/(R\lambda - e^{-\lambda(F-R)} + e^{-\lambda F})$. Equation (4.3) becomes
\[ N = \frac{(Z_{1-\alpha} + Z_{1-\beta})^2\, R\bar\lambda^3}{\left( R\bar\lambda - e^{-\bar\lambda(F-R)} + e^{-\bar\lambda F} \right) \xi_1\xi_2 (\lambda_2 - \lambda_1)^2}. \]
Note that the expected number of events is
\[ D_\cdot = \left( 1 - \frac{e^{-\bar\lambda(F-R)} - e^{-\bar\lambda F}}{R\bar\lambda} \right) N, \]
so that D· satisfies equation (4.11). Again, the total number of events is de-
termined by the hazard ratio, the allocation ratios, ξj , and the desired type I
and type II error rates, independent of total follow-up time, enrollment rates,
and underlying hazard rates.
Loss to Follow-up
In Section 4.3.1, we provided an adjustment to the sample size to account for
loss to follow-up when no follow-up information is available for subjects who
are lost to follow-up and we can assume that subjects are lost at random. In
survival studies, we typically have some amount of follow-up on all subjects
and loss to follow-up has the effect of simply reducing the length of follow-up
for a subset of subjects. It may be reasonable in many cases to assume that
the time of loss to follow-up follows an exponential distribution with rate η.
In general, we could assume that the rate is treatment specific, but we will
not do that here.
Note that in this setting we still have $\phi(\lambda, \eta) = \lambda/ET_{ij}$. The follow-up
time, $T_{ij}$, however, will be the minimum of the administrative censoring time,
$C_{ij}$, the failure time, $Y_{ij}$, and the time that the subject is lost to follow-up,
$L_{ij}$. Assuming that $L_{ij}$ and $Y_{ij}$ are independent, $\min(L_{ij}, Y_{ij})$ will have
an exponential distribution with rate $\lambda + \eta$. Hence, we can use a version of
equation (4.14), substituting $\exp(-(\lambda_j + \eta)(F - s))$ for $\exp(-\lambda_j(F - s))$ in the
integral. The result is that we have
\[ \phi(\lambda, \eta) = \frac{R\lambda(\lambda + \eta)^2}{R(\lambda + \eta) - e^{-(\lambda+\eta)(F-R)} + e^{-(\lambda+\eta)F}} \]
and the sample size formula is
\[ N = \frac{(Z_{1-\alpha} + Z_{1-\beta})^2\, R\bar\lambda(\bar\lambda + \eta)^2}{\left( R(\bar\lambda + \eta) - e^{-(\bar\lambda+\eta)(F-R)} + e^{-(\bar\lambda+\eta)F} \right) \xi_1\xi_2 (\lambda_2 - \lambda_1)^2}. \]
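These closed-form expressions are straightforward to code; the sketch below computes the required sample size under uniform staggered entry with exponential loss to follow-up, with all numeric inputs purely illustrative.

# Sketch: sample size with uniform enrollment over [0, R], total
# follow-up F, exponential hazards lam1, lam2, and exponential loss to
# follow-up at rate eta, using the formula above.
from math import exp, ceil
from scipy.stats import norm

def phi(lam, eta, R, F):
    denom = (R * (lam + eta) - exp(-(lam + eta) * (F - R))
             + exp(-(lam + eta) * F))
    return R * lam * (lam + eta) ** 2 / denom

def n_staggered(lam1, lam2, eta, R, F, alpha=0.05, beta=0.10, xi1=0.5):
    xi2 = 1 - xi1
    lam_bar = xi1 * lam1 + xi2 * lam2
    z = norm.ppf(1 - alpha) + norm.ppf(1 - beta)
    return ceil(z ** 2 * phi(lam_bar, eta, R, F)
                / (xi1 * xi2 * (lam2 - lam1) ** 2))

# Illustrative: 2 years of enrollment, 4 years total follow-up, hazards
# 0.10 vs 0.15 per year, 2% annual loss to follow-up.
print(n_staggered(0.10, 0.15, 0.02, R=2.0, F=4.0))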
Schoenfeld’s Formula
Perhaps the most useful formula in this setting is due to Schoenfeld (1983).
Schoenfeld’s formula provides the number of subjects with events (failures)
necessary to achieve the desired power as a function of the hazard ratio,
independent of the underlying rates. Schoenfeld derived his formula in the setting
with covariate adjustment; for simplicity we will derive it without
covariate adjustment. We assume that the hazard functions depend on time, $t$,
from treatment start and that $\lambda_1(t) = r\lambda_2(t)$, so that $r$ is the hazard ratio.
Since the formula provides the required number of failures, we consider only
subjects who fail, and will assume that there are no tied failure times. We let
xj be the treatment indicator for the subject failing at time tj : xj = 1 if the
subject is in group 1, and xj = 0 otherwise, j = 1, 2, · · · , D where D is the
total number of failures. Also let nj1 and nj2 be the numbers of subjects at
risk at time tj in groups 1 and 2, respectively.
In order to fit into the framework from Section 4.2, we will use a rescaled
log-rank statistic, equation (7.3), for $H_0: r = 1$,
\[ U = \frac{1}{D\xi_1\xi_2} \sum_{j=1}^{D} \left( x_j - \frac{n_{j1}}{n_{j1} + n_{j2}} \right), \]
where the second term in the sum is the expected value of $x_j$ given $n_{j1}$, $n_{j2}$,
and $H_0$. The variance of $U$ under $H_0$ is
\[ \frac{1}{D^2\xi_1^2\xi_2^2} \sum_{j=1}^{D} \frac{n_{j1}n_{j2}}{(n_{j1} + n_{j2})^2}. \]
and
\[ \mu_{kllj} = 1 - \mu_{kl(3-l)j} - \mu_{kl3j} - \mu_{kl4j}. \]
In many, if not most, situations one will assume that a subject cannot cross
back over to their assigned treatment once they become non-adherent, that is,
$\lambda^*_{k(3-l)l} = 0$; however, this is not required. One may also make other simplifying
assumptions, such as that $\lambda^*_{kl3}(t) = \lambda^*_{k'l'3}(t)$ for all $k, l, k', l'$; that is, loss to
follow-up does not depend on assigned treatment or adherence status. The
transition probabilities are given in the following matrix, $M_{kj}$. The columns
represent the state at time $t_j$ and the rows the state at time $t_{j+1}$:
\[ M_{kj} = \begin{pmatrix} \mu_{k11j} & \mu_{k21j} & 0 & 0 \\ \mu_{k12j} & \mu_{k22j} & 0 & 0 \\ \mu_{k13j} & \mu_{k23j} & 1 & 0 \\ \mu_{k14j} & \mu_{k24j} & 0 & 1 \end{pmatrix} \]
Beginning with $S_{k0}$ defined above, $S_{k(j+1)} = M_{kj} S_{kj}$, $j = 0, 1, 2, \ldots$. For example,
\[ S_{11} = \begin{pmatrix} \mu_{1110} \\ \mu_{1120} \\ \mu_{1130} \\ \mu_{1140} \end{pmatrix}, \qquad
S_{12} = \begin{pmatrix} \mu_{1111}\mu_{1110} + \mu_{1211}\mu_{1120} \\ \mu_{1121}\mu_{1110} + \mu_{1221}\mu_{1120} \\ \mu_{1131}\mu_{1110} + \mu_{1231}\mu_{1120} + \mu_{1130} \\ \mu_{1141}\mu_{1110} + \mu_{1241}\mu_{1120} + \mu_{1140} \end{pmatrix}, \quad \text{etc.} \]
This setup allows us to compute the probability that a given subject will be in
each of the four states, given arbitrary time-dependence of the loss to follow-up
and non-adherence hazards. The degree of accuracy will depend on the step size
$t_j - t_{j-1}$ and the magnitude of the hazards $\lambda^*_{klm}(t)$.
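A sketch of this state propagation, with hypothetical per-step transition probabilities held constant over time for brevity:

# Sketch: propagating the state vector S_{j+1} = M S_j for one treatment
# group. States: (adherent, non-adherent, lost to follow-up, event); the
# last two are absorbing. All transition probabilities are hypothetical.
import numpy as np

M = np.array([
    [0.93, 0.00, 0.0, 0.0],   # remain adherent
    [0.04, 0.95, 0.0, 0.0],   # become / remain non-adherent
    [0.01, 0.01, 1.0, 0.0],   # lost to follow-up (absorbing)
    [0.02, 0.04, 0.0, 1.0],   # event (absorbing)
])                             # columns: state at t_j; rows: state at t_{j+1}

S = np.array([1.0, 0.0, 0.0, 0.0])   # everyone starts adherent
for _ in range(24):                   # e.g., 24 monthly steps
    S = M @ S
print(S)  # state-occupancy probabilities after two years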
The calculation of power requires that we find the expected number of sub-
jects at risk and the expected number of events occurring in each
treatment group at any time during follow-up. This calculation requires as-
sumptions regarding patterns of enrollment and study duration.
To account for staggered entry, we follow Cook (2003), which differs from
the approach of Lakatos and Shih. In what follows, where otherwise unclear,
we refer to time from study start (enrollment of the first subject) as “study
time” and the time from enrollment for individual subjects as “subject time.”
For each treatment group k = 1, 2, we first compute the sequence of state
vectors, Skj , j = 0, 1, . . . out to the maximum potential follow-up time. All
intermediate state vectors are saved as a matrix Qk = [qklj ].
Now fix an enrollment strategy, rkj , j = 1, 2, . . ., the number of subjects
enrolled to group k between time tj−1 and time tj . The implicit assumption
throughout is that the proportion of subjects accrued to each treatment group
is constant over time, i.e., r2j /r1j does not depend on j. We let r·j = r1j + r2j
and ρj = rkj /r·j (the subscript “·” represents summation over the correspond-
ing index).
Then if $N_{kjp}$ is the cumulative number enrolled through time $t_j$, given that
we enroll through time $\tau_p$,
\[ N_{kjp} = \sum_{x=1}^{j \wedge p} r_{kx}, \]
where “$\wedge$” indicates the minimum of the two arguments. Let $n_{kljp}$ be the
number of subjects in state $l$ at time $t_j$ (study time) given that we enroll
through time $\tau_p$. Then
\[ n_{kljp} = \sum_{x=1}^{j \wedge p} r_{kx}\, q_{kl(j-x+1)}. \tag{4.16} \]
At study time $t_j$, with enrollment through $\tau_p$, the expected value of the log-rank statistic is
\[ U_{jp} = \sum_{y=1}^{j \wedge p} r_{\cdot y} \sum_{x=1}^{j-y+1} (d_{1x} - e_x) \tag{4.17} \]
and the expected value of its variance estimator is
\[ V_{jp} = \sum_{y=1}^{j \wedge p} r_{\cdot y} \sum_{x=1}^{j-y+1} v_x. \tag{4.18} \]
For given enrollment time and follow-up time, let $Z^{EXP}_{jp} = U_{jp}/V_{jp}^{1/2}$ be
the expected value of the log-rank Z-score. Since the variance of the observed
Z-score under the specified alternative is approximately one, power (for a one-sided
level $\alpha$ test) can be calculated from the formula
\[ Z_{1-\alpha} + Z_{1-\beta} = Z^{EXP}_{jp}. \]
Power for two-sided tests is obtained by replacing $\alpha$ above by $\alpha/2$.
Note that the quantities computed above depend on both the total enrollment
and the enrollment pattern. If, prior to the start of enrollment, the
pattern of enrollment is held fixed but rescaled to reflect either a proportionate
increase or decrease in the (possibly) time-dependent enrollment rates,
these quantities can be simply rescaled. In particular, for fixed enrollment
and follow-up times, $Z^{EXP}_{jp}$ will change with the square root of the total
sample size. The goal is to find the combination of enrollment pattern, total
enrollment, and total length of follow-up that yields the desired value of
$Z^{EXP}_{jp}$. Clearly these computations are complex and require the use of specialized
software. This algorithm is implemented in software available from
http://www.biostat.wisc.edu/~cook/software.html.
\[ \phi_w = \frac{\sigma^2}{E[m_i/(1 + (m_i - 1)\rho)]}, \]
where the expectation is with respect to the cluster sizes. If the cluster sizes
are fixed, then this is just the average. If the cluster sizes are all equal, it
further reduces to the (common) cluster variance, σ 2 (1 + (m − 1)ρ)/m. Given
assumptions regarding σ 2 and ρ, and the cluster sizes mi , equation (4.3) can be
used to determine the required number of clusters. When there is discretion
regarding the number of observations per cluster, one can select both the
cluster sizes and the number of clusters, weighing the relative costs of each,
to optimize efficiency.
Alternatively, if there are concerns regarding the validity of the assumptions,
we may prefer to give equal weight to each cluster. If, for example, there is
an association between cluster size and outcomes, the weighted estimate will
be biased—giving higher weight to clusters with, say, larger values of the
outcome variable, and lower weight to lower values. In the unweighted case,
φ will simply be the mean cluster variance,
\[ \phi_u = \sigma^2\, E\!\left[ \frac{1 + (m_i - 1)\rho}{m_i} \right]. \]
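For equal cluster sizes, φ reduces to the familiar design-effect form σ²(1 + (m − 1)ρ)/m, and the required number of clusters follows from equation (4.3); the sketch below assumes equal allocation and illustrative inputs.

# Sketch: clusters per arm for equal cluster sizes m, intracluster
# correlation rho, using equation (4.3) with phi = the cluster-mean
# variance. Inputs are illustrative; alpha is one-sided.
from math import ceil
from scipy.stats import norm

def clusters_per_arm(delta, sigma, m, rho, alpha=0.05, beta=0.10):
    z = norm.ppf(1 - alpha) + norm.ppf(1 - beta)
    phi = sigma ** 2 * (1 + (m - 1) * rho) / m   # variance of a cluster mean
    total_clusters = phi / 0.25 * (z / delta) ** 2  # xi1 * xi2 = 1/4
    return ceil(total_clusters / 2)

# Detect a 0.3 SD difference with 50 subjects per cluster and rho = 0.02.
print(clusters_per_arm(delta=0.3, sigma=1.0, m=50, rho=0.02))  # 8 per arm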
Example 4.1. Suppose that $y_{ijk}$ is the response for subject $i$ with assignments
$F_1 = j$ and $F_2 = k$. The responses are assumed independent with $y_{ijk} \sim N(\mu_{jk}, \sigma^2)$. Then the interaction of interest, $\Delta$, can be written as
\[ \Delta = (\mu_{11} - \mu_{21}) - (\mu_{12} - \mu_{22}) \tag{4.19} \]
\[ \phantom{\Delta} = (\mu_{11} - \mu_{12}) - (\mu_{21} - \mu_{22}). \tag{4.20} \]
The test statistic is $U = (\bar y_{11} - \bar y_{12}) - (\bar y_{21} - \bar y_{22})$ with variance
\[ V = \frac{\sigma^2}{N} \left( \frac{1}{\xi_1\zeta_1} + \frac{1}{\xi_1\zeta_2} + \frac{1}{\xi_2\zeta_1} + \frac{1}{\xi_2\zeta_2} \right) = \frac{\sigma^2}{N} \cdot \frac{1}{\zeta_1\zeta_2} \cdot \frac{1}{\xi_1\xi_2}. \]
This expression suggests that the sample size formula can be derived from
(4.2) by letting $\phi(\mu_{j1}, \mu_{j2}) = \sigma^2/\zeta_1\zeta_2$. Hence, we find that
\[ N = \frac{(Z_{1-\alpha} + Z_{1-\beta})^2 \sigma^2}{\Delta^2 \zeta_1\zeta_2 \xi_1\xi_2}. \]
Note that this is the sample size required for main effects using the same $\sigma^2$,
$\xi_j$, and $\Delta$, but multiplied by $1/\zeta_1\zeta_2$. Since $1/\zeta_1\zeta_2 \ge 4$, at least 4 times as many
subjects will be required to detect an interaction of size $\Delta$ as for a main
effect of size $\Delta$. Furthermore, interactions, when they occur, are likely to be
smaller than the corresponding main effect of treatment, further increasing
the required sample size. □
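A quick numeric illustration of the 1/ζ1ζ2 penalty, with hypothetical values:

# Sketch: sample size for a 2x2 factorial interaction versus a main
# effect of the same size, with equal allocation on both factors.
from math import ceil
from scipy.stats import norm

z = norm.ppf(1 - 0.05) + norm.ppf(1 - 0.10)   # one-sided alpha, 90% power
sigma, delta = 1.0, 0.25                       # hypothetical values
xi1xi2, zeta1zeta2 = 0.25, 0.25
n_main = z ** 2 * sigma ** 2 / (delta ** 2 * xi1xi2)
n_interaction = n_main / zeta1zeta2
print(ceil(n_main), ceil(n_interaction))  # interaction needs 4x as many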
Unlike main effects, interactions are scale dependent. That is, many interactions
can be eliminated by transforming the response variable. We illustrate
with another example.
Example 4.2. Suppose that $Y_{jk}$ are binomial with probability $\pi_{jk}$ and size
$n_{jk}$, with $n_{jk}$ as in the previous example. In the example in Section 4.2.1,
the score (Pearson $\chi^2$), Wald, and likelihood ratio tests of $H_0: \pi_1 = \pi_2$ are
all asymptotically equivalent, independent of the parameterization, and give
quite similar results, even in small samples. On the other hand, the interaction
test can be a test of $H_0: \pi_{11} - \pi_{21} = \pi_{12} - \pi_{22}$ (constant risk difference),
$H_0': \pi_{11}/\pi_{21} = \pi_{12}/\pi_{22}$ (constant risk ratio), or $H_0'': \pi_{11}(1-\pi_{21})/(1-\pi_{11})\pi_{21} = \pi_{12}(1-\pi_{22})/(1-\pi_{12})\pi_{22}$ (constant odds ratio). These are all distinct
hypotheses unless $\pi_{11} = \pi_{12}$.
We consider $H_0''$. We let $U(\mathbf{Y}) = \log(Y_{11}(1-Y_{21})/(1-Y_{11})Y_{21}) - \log(Y_{12}(1-Y_{22})/(1-Y_{12})Y_{22})$. Note that
\[ \mathrm{Var}\left[ \log \frac{Y_{1k}(1-Y_{2k})}{(1-Y_{1k})Y_{2k}} \right] \approx \frac{1}{n_{1k}\pi_{1k}} + \frac{1}{n_{1k}(1-\pi_{1k})} + \frac{1}{n_{2k}(1-\pi_{2k})} + \frac{1}{n_{2k}\pi_{2k}} = \frac{1}{N\xi_k} \left( \frac{1}{\zeta_1\pi_{1k}(1-\pi_{1k})} + \frac{1}{\zeta_2\pi_{2k}(1-\pi_{2k})} \right). \]
Hence we can apply equation (4.2) with $\phi(\pi_{1k}, \pi_{2k}) = 1/\zeta_1\pi_{1k}(1-\pi_{1k}) + 1/\zeta_2\pi_{2k}(1-\pi_{2k})$ and $\Delta = \log(\pi_{11}(1-\pi_{21})/(1-\pi_{11})\pi_{21}) - \log(\pi_{12}(1-\pi_{22})/(1-\pi_{12})\pi_{22})$. □
Example 4.3. Suppose $y_{ij}$ are Gaussian with mean $\mu_j$ and variance $\sigma^2$. We
wish to conduct an equivalence trial with equivalence margin $\delta$. The alternative
hypothesis is $H_1: \mu_1 = \mu_2$. Using equation (4.3), we have
\[ N = \frac{(Z_{1-\alpha} + Z_{1-\beta})^2 \sigma^2}{\xi_1\xi_2 (\mu_1 - \mu_2 + \delta)^2} = \frac{(Z_{1-\alpha} + Z_{1-\beta})^2 \sigma^2}{\xi_1\xi_2\, \delta^2}. \]
□
Example 4.4. Suppose $y_{ij}$ are binary with mean $\pi_j$ and we believe that
treatment 1 is superior, with odds ratio $\psi = \pi_1(1-\pi_2)/(1-\pi_1)\pi_2 = 0.9$. If
we expect that the control rate is $\pi_2 = 0.2$, then $\pi_1 = 0.184$. To achieve 90%
power with a type I error of 0.01 would require a sample size of nearly 35,000.
We are willing to accept a result showing that the odds ratio, $\psi$, is less than 1.15.
We take $U(\pi_1, \pi_2) = \log(\pi_2/(1-\pi_2)) - \log(\pi_1/(1-\pi_1))$. Then $\phi(\pi) = 1/\pi(1-\pi)$. Under $H_1$, $\bar\pi = (0.2 + 0.184)/2 = 0.192$, so the required sample size is
\[ N = \frac{(2.57 + 1.28)^2 / 0.192(1 - 0.192)}{(\log(1.15) - \log(0.9))^2 / 4} = 6360. \]
□
4.9 Problems
4.1 Use equation (4.7) to create the following plots.
(a) Fix β = 0.1, ∆ = 0.1, and π0 = 0.4. Plot sample size vs. α for values
of α from 0.001 to 0.1.
(b) Fix α = 0.05, ∆ = 0.1, and π0 = 0.4. Plot sample size vs. β for
values of β from 0.05 to 0.3.
(c) Fix α = 0.05, β = 0.1, and π0 = 0.4. Plot sample size vs. ∆ for
values of ∆ from 0.05 to 0.3.
(d) Fix α = 0.05, β = 0.1, and ∆ = 0.1. Plot sample size vs. π0 for
values of π0 from 0.15 to 0.95.
4.2 Assume that φ(γ̄) = 1 and ξ1 = ξ2 = 1/2.
(a) Using equation (4.3), calculate the required sample size for a two-
sided test (replacing α with α/2) having 90% power. Do this for
several reasonable values of α and ∆.
(b) Compare your results with the corresponding sample sizes for a one-
sided test.
(c) Approximate the actual power for the sample sizes you calculated in
part 4.2(a) (i.e., you need to include the probability of rejecting H0
by observing a value below the lower critical value). Discuss.
4.3 Suppose a situation arises in which the number of subjects enrolled
in a trial is of no ethical concern. The objective of the trial is to test
an experimental drug against standard care. The experimental drug,
however, costs 2.5 times as much per subject as the standard care.
Assuming normally distributed outcomes with equal variances, what is
the optimal sample size for each group if the goal is to minimize the
cost of the trial while still controlling type I and type II errors?
4.4 A common variance stabilizing transformation for binomial observations is the arcsine transformation: $T_j = \sin^{-1}\left(\sqrt{x_j/n_j}\right)$, where $x_j \sim \mathrm{Binom}(n_j, \pi_j)$.
(a) Show that the variance of Tj is asymptotically independent of πj .
(b) If U = T2 − T1 , find the function φ(·) required for equation (4.2).
(c) Derive the sample size formula for the two sample binomial problem
using this transformation.
(d) For the case π1 = .4, π2 = .3, ξ1 = ξ2 = .5, α = .05, and β = .1, find
the required sample size and compare to the numerical example on
page 121.
CHAPTER 5
Randomization
the treatment (equation (2.4)). These ideas, presented in this section, have
been discussed in detail by Kempthorne (1977) and Lachin (1988b), among
others.
5.1.1 Confounding
Suppose we have the following familiar linear regression model:
y = β0 + β1 x + ε,
where y is the outcome of interest, x is a predictor (treatment), β0 and β1
are model parameters, and ε is the error. Now suppose we observe n pairs
(xi, yi), i = 1, . . . , n, and after an appropriate analysis we find that x and y
have significant correlation or that the “effect of x on y” is significant. We
cannot automatically assume that x explains or causes y. To see that this is
so, let the roles of x and y be reversed in the above model. We might see
the same results as before and feel inclined to state that y explains x. A
third possibility is that another characteristic, z, referred to as a confounder,
explains or causes both x and y.
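A small simulation makes the danger concrete. In the sketch below (the data-generating model is purely illustrative), a confounder z drives both x and y; x has no effect on y at all, yet the unadjusted regression slope is far from zero:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 10_000
z = rng.normal(size=n)           # confounder
x = z + rng.normal(size=n)       # "treatment", driven by z
y = 2 * z + rng.normal(size=n)   # outcome, driven by z only (no x effect)

naive = np.polyfit(x, y, 1)[0]                        # unadjusted slope of y on x
X = np.column_stack([np.ones(n), x, z])
adjusted = np.linalg.lstsq(X, y, rcond=None)[0][1]    # slope of x, adjusting for z
print(f"naive slope: {naive:.2f}, adjusted slope: {adjusted:.2f}")
# the naive slope is near 1.0 even though the true effect of x is 0
```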
Subject characteristics that are associated with both treatment and out-
come result in confounding, introducing bias into hypothesis tests and the
estimates of parameters (i.e., treatment effects) when the nature of the asso-
ciation is not correctly adjusted for. Even when the investigator is aware of
the characteristic and it is measurable, adjustment may be problematic. For
example, suppose a study is conducted in which treatment assignment is based
on disease diagnosis and patient characteristics, as is typical of observational
studies. How could one eliminate the possibility of confound-
ing? Baseline variables can be adjusted for in the analysis using a statistical
model, and thus, a model is required. A typical choice is the linear regression
model,
y = bx + zβ + ε,
where x is treatment, z is a vector of covariates, b and β are the model
parameters, and ε is the error, which is assumed to be normal.
The first problem is that it is impossible to include all covariates in z, and
we must assume that the form of the model is correct. Some of the relevant
(confounding) covariates may not even be known or easily measured. Even a
subset of the list of known characteristics may be too large for the model. In
other words, we could have more variables than observations, making statistical
tests impossible. We can select a few covariates to include; however, if we do
not include a sufficient number of covariates, we increase the possibility of an
incorrect estimate of b due to residual confounding. Conversely, including too
many covariates will hurt the precision of estimates. Given that the estimates
of the parameters of interest are strongly influenced by the model building
process, it will be difficult to assess the validity of results. Of course, when
the important characteristics are unknown or not measurable, adjustment is
impossible. In short, when confounding is not accounted for in the alloca-
tion procedure, it is difficult, if not impossible, to adjust for in the analysis.
The goal of the allocation procedure becomes to eliminate the possibility of
confounding. Since we usually cannot render an outcome independent of a
potential confounding characteristic, we require an allocation procedure that
can make the characteristic independent of the treatment assignment. The use
of randomization ensures this independence.
Figure 5.1 Population sampling models, the patient selection process, and the randomization
model for a clinical trial. Modified from Lachin (1988b). (The figure contrasts the
population model, in which n1 patients with y1j ∼ G(y|θ1) and n2 patients with
y2j ∼ G(y|θ2) are sampled from their respective populations, with models in which
n = n1 + n2 patients are enrolled and then randomized.)
The second model in Figure 5.1 illustrates the case in which one in-
vokes the population model after randomization. This model assumes that,
despite having an unspecified population from which patients are drawn us-
ing a method that is not random sampling, the distributional assumptions
can still be used with the final outcome data. This conclusion has no sound
foundation, however, and relies on distributional assumptions that are not
verifiable.
The goal of randomization is to avoid having to invoke the population model
at all by creating a framework for an assumption-free test. This is the third
model shown in Figure 5.1. No assumptions are made about the patient pop-
ulation or the underlying probability distribution. Thus, because we cannot
truly sample randomly from the entire population of interest or know with
certainty the underlying model, randomization is the soundest alternative.
Figure 5.2 Randomization distribution of the mean in one treatment group of size
five randomly drawn from the ten observations shown in circles at the bottom. The
vertical line represents the mean of the allocation corresponding to the solid circles.
where I(·) is the indicator function. We reject the null hypothesis if, for ex-
ample, p < α. There are many possible reference sets. For the unconditional
reference set, we consider all K = 2n possible allocations of the participants
into two groups. The conditional
reference set is used when we consider n1 as
fixed. In this case, K = nn1 .
In the early days of clinical trials, permutation tests were often not com-
putationally feasible. Consider a trial with 50 participants, 25 in each of two
treatment arms. In this case there are over 10^14 possible allocation patterns.
Of course, many clinical trials are much larger. Thus, even today, such enumer-
ation is rarely used. A way of performing permutation tests without enumerat-
ing every possible randomization sequence uses the fact that the distributions
of statistics used for permutation tests are often asymptotically normal un-
der the null hypothesis of no treatment effect. In most cases, especially with
moderate sample sizes, tests based on normal theory approximate permuta-
tion tests very well, as illustrated even for small samples by Figure 5.2. The
mean and variance of these asymptotic distributions depend on the statistic
and also the randomization method employed.
Of course, today computation has become much easier and takes much less
time. There are many statistical software packages that perform a variety of
permutation tests, even with large sample sizes. If possible, one should use a
permutation test, or a good approximation, because it uses no assumptions
beyond randomization.
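A minimal Monte Carlo sketch of such a test (the data and the difference-in-means statistic are illustrative choices; sampling allocations at random is the standard substitute for full enumeration):

```python
import numpy as np

def perm_test(y, treat, n_resamples=20_000, seed=0):
    """Two-sided Monte Carlo permutation p-value for a difference in means."""
    rng = np.random.default_rng(seed)
    y, treat = np.asarray(y, float), np.asarray(treat, bool)
    observed = y[treat].mean() - y[~treat].mean()
    hits = 0
    for _ in range(n_resamples):
        t = rng.permutation(treat)             # re-allocate under H0
        if abs(y[t].mean() - y[~t].mean()) >= abs(observed):
            hits += 1
    return hits / n_resamples

y = [1.2, 0.8, 1.5, 0.3, 0.9, 1.1, 2.0, 1.7, 2.2, 1.4, 2.5, 1.9]
treat = [0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1]
print(perm_test(y, treat))
```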
$$\Pr(\tau = 1 \mid n_1^*, n^*) = \frac{n/2 - n_1^*}{n - n^*}.$$
Because total and group sample sizes are predetermined, there can be no
imbalance in the number of participants in each arm at the end of enrollment.
The exception to this occurs when patients are assigned as they enter the trial
and fewer than n subjects are enrolled. For example, let Q = max(n1 , n2 )/n
be used to measure imbalance. If the desired sample size is 6 but the trial only
enrolls n = 4 participants, the only possible imbalance is Q = 3/4 (the other
possibility, of course, is balance, Q = 1/2). The probability of imbalance is 0.4
and the expected imbalance is 0.6. For arbitrary sample sizes, probabilities
of imbalance correspond to probabilities of the hypergeometric distribution.
The probability of an imbalance greater than or equal to a specific Q can
be calculated using Fisher’s exact test or approximated by a χ2 test (Lachin
1988a).
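The small worked example above can be reproduced directly. A sketch using scipy's hypergeometric distribution (the parameterization below is scipy's, not the book's notation):

```python
from scipy.stats import hypergeom

# n1 = number assigned to arm 1 among the first 4 of the 6 planned assignments,
# 3 of which go to arm 1: hypergeometric with M = 6, n = 3, N = 4.
dist = hypergeom(M=6, n=3, N=4)
p_balance = dist.pmf(2)                  # n1 = 2 -> Q = 1/2
p_imbalance = dist.pmf(1) + dist.pmf(3)  # n1 = 1 or 3 -> Q = 3/4
expected_Q = 0.5 * p_balance + 0.75 * p_imbalance
print(p_imbalance, expected_Q)           # 0.4 and 0.6, matching the text
```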
While imbalance is not usually a problem when using the random alloca-
tion rule, selection bias is a concern. Blackwell and Hodges Jr. (1957) give a
simple way to assess a randomization procedure’s susceptibility to selection
bias in an unblinded study in which there is the possibility that the investi-
gator may, consciously or not, select subjects based on guesses regarding next
treatment assignment. Under this model, it is assumed that the experimenter
always guesses the most probable treatment assignment, conditional on the
past assignments. For any particular realization, the bias factor, F , is defined
as the number of correct guesses minus the number expected by chance alone
(i.e., n/2). If the random allocation rule is used and all participants enter the
trial at once, there can be no selection bias. In the case of staggered entry
(i.e., participants enter one at a time), however, each time we have imbalance
and the following treatment assignment brings us closer to balance, the result
is a correct guess. This will always occur n/2 times and the number expected
by chance alone is n/2. Thus, E[F ] is half of the expected number of times
during accrual that we have balance. Blackwell and Hodges Jr. (1957) show
that this is $2^n/\binom{n}{n/2} - 1$, so that
$$E[F] = \frac{2^{n-1}}{\binom{n}{n/2}} - \frac{1}{2}.$$
It may be surprising that, as n → ∞,
$$\frac{E[F]}{\sqrt{n}} \to C,$$
where C = √(π/32). This implies that, as the sample size increases, the poten-
tial for selection bias increases and can become quite large.
Finally, we consider permutation tests and their relationship to the random-
ization scheme. An example of a permutation test for a trial randomized by
the random allocation rule was given in Section 5.1.4. A popular general fam-
ily of permutation tests is the linear rank family. A linear rank statistic with
centered scores has the form
$$S = \frac{\sum_i (a_i - \bar{a})\,(T_i - E[T_i])}{\sqrt{\operatorname{Var}\!\left(\sum_i (a_i - \bar{a})\,(T_i - E[T_i])\right)}}, \qquad (5.1)$$
where $a_i$ is a score that is a function of the participant responses, $\bar{a} = \frac{1}{n}\sum_i a_i$,
Ti is an indicator that the patient received treatment 1, and E [Ti ] is the
permutational expectation of Ti , based on either the conditional or the un-
conditional reference set. The statistic S is asymptotically standard normal.
For the random allocation rule, E [Ti ] = 1/2. The variance in (5.1) is
$$V = \sum_i (a_i - \bar{a})^2\,\operatorname{Var}(T_i) + \sum_{i \neq j} (a_i - \bar{a})(a_j - \bar{a})\left(E[T_i T_j] - E[T_i]^2\right). \qquad (5.2)$$
Now, $E[T_i T_j]$ is the probability that participants i and j are both assigned to
treatment 1. This is
$$E[T_i T_j] = \frac{\binom{n/2}{2}}{\binom{n}{2}} = \frac{n-2}{4(n-1)}.$$
Using this and the fact that $\operatorname{Var}(T_i) = \frac{1}{2}\left(1 - \frac{1}{2}\right)$ gives
$$V = \frac{n}{4(n-1)} \sum_i (a_i - \bar{a})^2.$$
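This variance formula can be checked by brute force for small n. In the sketch below the scores are arbitrary and the conditional reference set is enumerated exactly:

```python
from itertools import combinations
import numpy as np

a = np.array([3.0, 1.0, 4.0, 1.0, 5.0, 9.0])   # arbitrary scores, n = 6
n = len(a)
c = a - a.mean()                               # centered scores; sum(c) = 0

# Under the random allocation rule every n/2-subset is equally likely.
# Since the scores are centered, sum c_i (T_i - 1/2) equals sum c_i T_i.
stats = [sum(c[list(idx)]) for idx in combinations(range(n), n // 2)]

exact_var = np.var(stats)                      # exact variance over the reference set
formula = n / (4 * (n - 1)) * np.sum(c ** 2)
print(exact_var, formula)                      # these agree
```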
                               Sample Size
               10      30      50      100         150          200
           .60 .344    .200    .119    .035        .011         .004
Imbalance  .70 .109    .016    .003    3 × 10^-5   4 × 10^-7    6 × 10^-9
           .75 .109    .005    .0003   2 × 10^-7   4 × 10^-10   3 × 10^-13
This type of selection bias is eliminated if different block sizes are used and
the physician does not know the possible block sizes. In the situation where
possible block sizes are known, this bias can also be greatly reduced by choos-
ing block sizes randomly. One might expect that selection bias is completely
eliminated in this case. Several assignments could be known, however, if the
total number assigned to one treatment exceeds the total number assigned to
the other by half of the maximum block size.
When permuted-block randomization is used, the permutation test should
take the blocking into account. If blocking is ignored, there is no randomiza-
tion basis for tests. Furthermore, the usual approximate distributions of test
statistics that ignore blocking may not be correct. Matts and Lachin (1988)
discuss the following three examples:
1. 2×2 Table. When the outcome variable is binary, it is common to summarize
the data using a 2 × 2 table as in Table 5.2. The statistic
$$\chi^2_P = \frac{4(ad - bc)^2}{n(a+c)(b+d)}$$
may be used. Based on a population model, under the null hypothesis this
statistic is asymptotically χ2 with one degree of freedom. However, when
permuted-block randomization is assumed, the asymptotic distribution is
not necessarily χ2 . The proper permutation test is performed by creating
a 2 × 2 table for each block. If we use a subscript i to denote the entries in
the table for the ith block, the test statistic is
$$\chi^2_{MH} = \frac{\left(\sum_{i=1}^B (a_i - E[a_i])\right)^2}{\sum_{i=1}^B \operatorname{Var}(a_i)}, \quad \text{where} \quad E[a_i] = \frac{a_i + c_i}{2} \quad \text{and} \quad \operatorname{Var}(a_i) = \frac{(a_i + c_i)(b_i + d_i)}{4(m-1)}.$$
This is the Mantel-Haenszel χ2 statistic, which is based on the hypergeo-
metric distribution. Asymptotically, assuming permuted-block randomiza-
tion and under H0 , its distribution is χ2 with one degree of freedom.
Table 5.2 Binary outcome data by treatment group.

                         Disease
                     Yes     No      Total
Treatment   Tmt 1     a       b       n/2
            Tmt 2     c       d       n/2
            Total    a + c   b + d     n
2. t-test. With a continuous outcome variable, the usual analysis for compar-
ing the two levels of a factor is the familiar ANOVA t-test. If a population
model is assumed, the t-statistic has a t-distribution under the null hypoth-
esis. Of course, the observed value of the statistic will depend on whether
or not a blocking factor is included in the model. If permuted-block ran-
domization is assumed but blocking is not included for testing, the statistic
will not necessarily have the t-distribution under the null hypothesis. If
the given randomization is assumed and blocking is included, the permu-
tation distribution of the resulting statistic will be well approximated by
the t-distribution.
3. Linear Rank Test. Linear rank test statistics have been discussed in previous
sections. The linear rank test statistic that does not take blocks into account
is asymptotically standard normal, assuming either the random allocation
rule or complete randomization, as long as the correct variance term is
used. This is not the case, however, if permuted-block randomization is
assumed. Thus, one must use a version of the statistic that takes the blocks
into account:
$$S = \frac{\sum_{i=1}^m \sum_{j=1}^B w_j (a_{ij} - \bar{a}_{\cdot j})(T_{ij} - 1/2)}{\sqrt{\frac{m}{4(m-1)} \sum_{i=1}^m \sum_{j=1}^B w_j^2 (a_{ij} - \bar{a}_{\cdot j})^2}},$$
where $w_j$ is a weight for block j, $\bar{a}_{\cdot j} = \sum_{i=1}^m a_{ij}/m$, and all other terms
are as in (5.1), with an added j subscript denoting the jth block. In this
case, the scores aij depend solely on the responses in the jth block. This
statistic is asymptotically standard normal under the null hypothesis and
permuted-block randomization.
In each of these situations, if the number of blocks is relatively large, it
can be shown (Matts and Lachin 1988) that the statistic that does not take
blocking into account is approximately equal to 1 − R times the statistic
that does, where R is the intrablock correlation. R can take values in the
interval [−1/(m − 1), 1]. It is probably reasonable to assume that if there is an
TREATMENT- AND RESPONSE-ADAPTIVE PROCEDURES 155
intrablock correlation, then it is positive. Furthermore, the lower bound for the
interval in which R takes values approaches 0 as m gets larger. So, in the more
likely situation that R is positive, the incorrect (without taking blocking into
account) test statistic would be smaller in magnitude than the correct statistic,
resulting in a conservative test. Many argue that its conservativeness makes
using the incorrect test acceptable, but conservativeness reduces efficiency and
power. In addition, for small block sizes, the test may be anticonservative,
inflating the type I error.
Taking the case of UD(0, 1), the probability of a large imbalance can be ap-
proximated by
$$\Pr(Q > R) \approx 2\Phi\!\left((1 - 2R)\sqrt{3n}\right).$$
This design can also be shown to achieve efficiency similar to that of the ran-
dom allocation rule. In particular, UD(0, 1) needs at most 4 more observations
to give essentially the same efficiency as the random allocation rule, which is
perfectly balanced (Wei and Lachin 1988).
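A simulation sketch of UD(0, 1) comparing the empirical imbalance probability with this approximation (the handling of the initial empty urn as a fair coin flip is our assumption, and the normal approximation is rough at moderate n):

```python
import math
import numpy as np
from scipy.stats import norm

def urn_trial(n, alpha=0, beta=1, rng=None):
    """One UD(alpha, beta) trial; returns the imbalance Q = max(n1, n2)/n."""
    rng = rng or np.random.default_rng()
    balls = [alpha, alpha]            # balls favoring arm 0 and arm 1
    n1 = 0
    for _ in range(n):
        total = balls[0] + balls[1]
        p1 = 0.5 if total == 0 else balls[1] / total   # empty urn: fair coin
        arm = 1 if rng.random() < p1 else 0
        n1 += arm
        balls[1 - arm] += beta        # add beta balls of the opposite color
    return max(n1, n - n1) / n

rng = np.random.default_rng(1)
n, reps, r = 50, 20_000, 0.6
sim = np.mean([urn_trial(n, rng=rng) > r for _ in range(reps)])
approx = 2 * norm.cdf((1 - 2 * r) * math.sqrt(3 * n))
print(sim, approx)
```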
The probability of making a correct guess of treatment assignment for par-
ticipant n + 1 is
$$g_{n+1} = \frac{1}{2} + \frac{E[D_n]\,\beta}{2(2\alpha + \beta n)},$$
where E[Dn ] can be computed recursively using the transition probabilities
and the property
$$\Pr(D_n = j) = \Pr(D_n = j \mid D_{n-1} = j-1)\Pr(D_{n-1} = j-1) + \Pr(D_n = j \mid D_{n-1} = j+1)\Pr(D_{n-1} = j+1).$$
It follows that $E[F] = \sum_{i=1}^n g_i - n/2$. Once again, as the sample size increases,
urn randomization is similar to complete randomization and thus the potential
for selection bias becomes increasingly small.
Again, we consider permutation tests. Under complete randomization, all
assignment patterns in the reference set are equally likely. This is not the case
for UD(α, β). In fact, certain assignment patterns may not be possible. For
158 RANDOMIZATION
example, for UD(0, 1) any pattern that assigns the first two participants to
the same treatment is impossible. These complications make the permutation
distributions of test statistics more difficult to calculate. Consider a linear
rank statistic under the UD(α, β). Let $V = \frac{1}{4}\sum_{i=1}^n b_i^2$, where
$$b_i = a_i - \bar{a} - \sum_{j=i+1}^n \frac{[2\alpha + (i-1)\beta]\,\beta\,(a_j - \bar{a})}{[2\alpha + (j-1)\beta]\,[2\alpha + (j-2)\beta]}, \qquad 1 \le i < n,$$
$$b_n = a_n - \bar{a}.$$
If this V is used as the variance term in (5.1), the resulting statistic will be
asymptotically standard normal.
5.4.1 Stratification
The most common method used to enforce covariate balance is called strati-
fication. The permuted-block design can be considered a special case of strat-
ification, where individuals are stratified according to time or the order in
which they entered the trial. Typically, however, the term “stratification” is
used only when blocking is performed based on other attributes. The concept
is simple; trial participants are grouped into strata based on factors of interest
and a separate realization of the randomization rule is applied to each of these
groups. A simple example is depicted in Figure 5.3. Within each stratum, one
may use any randomization procedure that enforces balance; permuted blocks
are the usual choice for randomization within strata. A discussion of stratifi-
cation was given by Meier (1981) and some of his observations are presented
here. Many of the concepts also apply to other forms of covariate-adaptive
randomization.
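As a concrete sketch (the names, data structure, and block size are illustrative), stratified randomization using permuted blocks of size 4 within each stratum might be implemented as follows:

```python
import random

def make_assigner(block_size=4, seed=0):
    rng = random.Random(seed)
    blocks = {}                                   # stratum -> remaining block

    def assign(stratum):
        if not blocks.get(stratum):               # start a fresh permuted block
            block = ["A"] * (block_size // 2) + ["B"] * (block_size // 2)
            rng.shuffle(block)
            blocks[stratum] = block
        return blocks[stratum].pop()

    return assign

assign = make_assigner()
for subject, stratum in enumerate(["male/smoker", "female/nonsmoker",
                                   "male/smoker", "male/smoker"]):
    print(subject, stratum, assign(stratum))
```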
There is wide agreement that stratification should be performed in the anal-
ysis stage of a clinical trial if it is present in the design. The issue that has
been the subject of debate is whether or not allocation should be stratified. As
Gender    Smoke    Stratum
Male      Yes        1
Male      No         2
Female    Yes        3
Female    No         4

Figure 5.3 Diagram of a simple example of stratification. There are 2 factors with 2
levels each: gender (male, female) and smoke (yes, no). This gives a total of 2 × 2 = 4
strata.
5.4.2 Pocock-Simon
Pocock and Simon (1975) developed a method with the primary goal of achiev-
ing balance with respect to a number of covariates in small studies. This ap-
proach could be considered an application of the biased coin design to stratifi-
cation. The covariate values for individuals already assigned treatment, along
with those of the next individual to be assigned treatment, are combined to
calculate a score measuring imbalance. The treatment assignment that results
in the lowest imbalance is given the highest probability.
To make this precise, suppose we have J factors that we would like to
balance in K treatment groups. For each participant to be enrolled in a trial,
let xjk be the number of participants already assigned to treatment k who
have the same level of factor j as the new participant. Thus, we will have an
array of JK integers. Next, define
$$x^t_{jk} = \begin{cases} x_{jk}, & t \neq k, \\ x_{jk} + 1, & t = k, \end{cases}$$
so that $x^t_{jk}$ is the number of participants with the given level of factor j
who would be in treatment group k if the new participant were assigned to
group t. Next, letting D(y1 , . . . , yK ) be a function that measures the degree
of imbalance among K non-negative integers, $d_{jt} = D(x^t_{j1}, \ldots, x^t_{jK})$ is the
imbalance with respect to factor j that would result if the new participant were
assigned to treatment t. We then define a measure that combines imbalance
across all J factors, Gt = G(d1t , . . . , dJt ). This is calculated for each treatment
and then the treatments are ranked according to these values. In other words,
let $G_{(1)} = \min_t G_t$, etc., so that $G_{(1)} \le G_{(2)} \le \cdots \le G_{(K)}$. Then, letting τ be
the assigned treatment, assignment is based on probabilities
$$\Pr(\tau = (t)) = p_t, \quad \text{where } p_1 \ge p_2 \ge \cdots \ge p_K \text{ and } \sum_t p_t = 1.$$
This means that the treatment that leads to the lowest degree of imbalance is
given the highest probability of assignment. Note that if there are tied values
of G, either the ties should be randomly broken when the values are ordered
or the corresponding probabilities should be equal.
The following are possible choices for D:
1. Range: the difference between the highest and lowest numbers. This is
probably the most straightforward choice.
2. Variance: $D(y_1, \ldots, y_K) = \sum_k (y_k - \bar{y})^2/(K-1)$.
3. Upper Limit: e.g., D(y1 , . . . , yK ) = I(Range > U ), for a defined limit U .
The variance or another measure could also be used instead of the range.
For G, a reasonable choice is
$$G(d_1, \ldots, d_J) = \sum_{j=1}^J w_j d_j,$$
where $w_j$ is a weight reflecting the relative importance of factor j.
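A sketch of this computation, with D taken to be the range, G a weighted sum, and deterministic assignment to the least-imbalanced group (i.e., minimization, the case p1 = 1); all function and variable names here are illustrative, and ties would in practice be broken at random:

```python
import numpy as np

def next_assignment(counts, new_levels, weights):
    """counts[j][l][k]: patients at level l of factor j already in group k."""
    J, K = len(counts), len(counts[0][0])
    G = np.zeros(K)
    for t in range(K):                       # candidate assignment t
        for j, w in enumerate(weights):
            x = np.array(counts[j][new_levels[j]], dtype=float)
            x[t] += 1                        # x^t_jk: counts if assigned to t
            G[t] += w * (x.max() - x.min())  # d_jt with D = range
    return int(np.argmin(G))                 # lowest imbalance gets the assignment

# Two factors (each with 2 levels), two groups, equal weights.
counts = [[[3, 1], [2, 2]],                  # factor 1: level counts by group
          [[4, 2], [1, 1]]]                  # factor 2
print(next_assignment(counts, new_levels=(0, 1), weights=(1, 1)))  # -> group 1
```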
The method illustrated in this example (minimization) was used for treat-
ment allocation in the Metoprolol CR/XL Randomised Intervention Trial in
Congestive Heart Failure (MERIT-HF) (MERIT-HF Study Group 1999). A
total of 3991 patients with chronic heart failure were randomized to either
metoprolol CR/XL or placebo. The groups were balanced with respect to ten
factors: site, age, sex, ethnic origin, cause of heart failure, previous acute my-
ocardial infarction, time since last myocardial infarction (for those who had
one), diabetes mellitus, ejection fraction (fraction of blood emptied by the
heart during systole), and NYHA functional class (level of heart failure).
The Pocock-Simon method has advantages over a basic stratified random-
ization, but it still carries many of the same disadvantages. It is most useful
in a small study with a large number of strata (Therneau 1993).
Another randomization scheme called maximum entropy constrained balance
randomization, which is closely related to the Pocock-Simon method, was
proposed by Klotz (1978). The method provides a criterion for selecting the
assignment probabilities, which are allowed to be distinct for each individual.
Specifically, one chooses $p_k$, $k = 1, \ldots, K$, to maximize the entropy,
$$H(p_1, \ldots, p_K) \equiv -\sum_{k=1}^K p_k \ln p_k,$$
subject to a constraint on the expected imbalance.
Example 5.3. Suppose we have the results for a clinical trial in which 12
participants enrolled. The treatment assignment to group 1 or 2 and the rank
of the continuous response for each individual are known and are as in the
following table.
Participant 1 2 3 4 5 6 7 8 9 10 11 12
Assignment 1 2 1 2 1 2 1 2 1 2 1 2
Rank 12 1 11 2 10 3 9 4 8 5 6 7
Method                        p-value
Random Allocation Rule         0.004
Complete                       0.001
Permuted-Block (m = 2)         0.063
Permuted-Block (m = 4)         0.019
Permuted-Block (m = 6)         0.010
Biased Coin (BCD(2/3))         0.007
Urn Design (UD(0,1))           0.002
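The first row of this table can be reproduced exactly. The sketch below assumes the test statistic is the rank-sum of group 1 (for a fixed group size this is equivalent to the centered linear rank statistic) and enumerates the full conditional reference set of 924 allocations:

```python
from itertools import combinations

ranks = [12, 1, 11, 2, 10, 3, 9, 4, 8, 5, 6, 7]
group1 = range(0, 12, 2)                           # assignments alternate 1, 2, 1, 2, ...
observed = sum(ranks[i] for i in group1)           # rank-sum in group 1: 56
center = sum(ranks) / 2                            # its expectation under H0: 39

extreme = sum(
    abs(sum(ranks[i] for i in idx) - center) >= abs(observed - center)
    for idx in combinations(range(12), 6)          # conditional reference set, C(12,6) = 924
)
print(extreme, extreme / 924)                      # 4, 0.00433 -> 0.004 as in the table
```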
5.6 Problems
5.1 Suppose you conduct a two-arm randomized clinical trial with a total
sample size of 12. Assume the outcome is normal with constant variance
and the test statistic to be used is the difference in sample means. As
discussed in this chapter, the allocation that is balanced (6 in each
group) is the most efficient. In this case, that means balanced allocation
leads to the smallest variance of the test statistic. Suppose complete
randomization is to be used.
(a) What is the efficiency of balanced allocation vs. the allocation that
gives 7 in one group and 5 in the other? (Efficiency, in this case, is
defined as the variance of the test statistic for imbalance, divided by
the same for balance.)
(b) Make a table with a column for each possible allocation (including
the balanced case). For each case, give the efficiency of the balanced
allocation and the probability under complete randomization of re-
alizing such an allocation.
(c) What is the median efficiency of the balanced vs. unbalanced de-
signs? What is the mean efficiency?
(d) Make a conjecture about what might happen to the median efficiency
as the sample size increases. Write a computer program to support
your claim and show the results.
5.2 Suppose you are asked to help design a trial with three treatment
groups: a placebo (P), a low dose of a drug (LD), and a high dose
of the same drug (HD). The investigators are primarily interested in
efficacy of the two dose levels, and not in comparing the two levels to
each other. The hypotheses of interest are the following:
H01 : µLD = µP
and
H02 : µHD = µP ,
where µ represents the mean response. You must determine the “op-
timal” proportion of the full sample that should be assigned to each
group. The statistic to be used in each case is the difference in sample
means between the corresponding groups. Assume that the outcomes
are independent and normally distributed with known variance σ 2 . The
total sample size is fixed at n.
(a) Recall that for a single hypothesis test comparing two groups, the
optimal allocation is to put half in each group. One advantage of
this is that it minimizes the variance of the test statistic. If the
investigators are interested only in estimating the two differences in
means, what is the optimal proportion that should be assigned to
each group?
(b) Which (if any) of the assumptions on the outcomes could be relaxed
and still give the same result? Explain.
(c) Another reason for allocating half of the sample to each group in the
single-hypothesis case is to maximize the power of the test. Suppose
the investigators take this point of view and are only interested in
showing efficacy of either treatment. In other words, the goal is to
maximize the probability that at least one of the two null hypotheses
is rejected, given that the alternatives are true (use µLD −µP = 0.5σ
and µHD − µP = σ). Assume that we reject either null hypothesis if
the corresponding standardized difference in means is greater than
1.96. (Note: simulation may be helpful.)
5.3 Consider the example on page 166.
(a) Suppose the observed responses (from smallest to largest) were
{1.16, 1.65, 1.89, 2.34, 2.37, 2.54, 2.74, 2.86, 3.04, 3.50, 3.57, 3.70}. Per-
form a t-test and compare the p-value to the table in the example.
(b) Suppose the randomization method used was permuted-block with
2 blocks of size 6. The p-value in the table for the correct analysis
(taking the blocks into account) is .01, but the p-value calculated by
ignoring blocks (random allocation rule) is .004. Discuss this in the
context of the discussion at the end of Section 5.2.3.
5.4 Consider a stratified trial that uses the permuted-block design with
block size 4 in each stratum. Let N1 and N2 be the number of subjects
in each group at the end of the trial. If the trial stops before the last
block is full (in any stratum) the overall treatment group sizes may
not be balanced. Assume that each possible number (1, 2, 3, or 4) of
subjects in the final block of each stratum has equal probability (1/4).
(a) For one stratum (i.e., no stratification), give the exact distribution
of N1 − N2 . Calculate Var(N1 − N2 ) and E|N1 − N2 | (we know, by
symmetry, that E[N1 − N2 ] = 0).
(b) Calculate E|N1 − N2 | exactly for the case of 2 strata by using the
probabilities from part (5.4(a)). Repeat this procedure for 4 strata.
(c) Assume we can approximate the distribution of N1 − N2 with the
normal distribution, using the variance calculated in part (5.4(a))
(for each stratum). Calculate E|N1 − N2 | for the cases of 1, 2, and 4
strata and compare your result with previous calculations.
(d) Approximate E|N1 − N2 | for the cases of 8, 16, 32, 64, 128, 256, 512,
1024, 2048, and 4096 strata.
(e) Compare these values to the case with no stratification from part
(5.4(a)). Compare them to complete randomization when the total
sample size is 100 (E|N1 − N2 | ≈ 8.0). Discuss.
5.5 For each of the randomization schemes described below, discuss the
properties of the approach. Refer to topics presented in this chapter
such as size of the reference set, balance, selection bias, and testing.
What are your recommendations for the use of these methods? Assume
a total sample size of 12.
(a) Suppose the experimental treatment is not yet available. The in-
vestigators start the trial anyway and assign the first 3 subjects to
placebo. The random allocation rule will then be used for the next 6
subjects and the final 3 subjects will be assigned to the experimental
treatment.
(b) It is determined that a trial will use complete randomization. The
randomization sequence, however, will be generated entirely in ad-
vance and if the sequence happens to be severely imbalanced (defined
as having 10 or more subjects in one group), another sequence will
be generated until an acceptable one is found.
CHAPTER 6

Data Collection and Quality Control
Statisticians have multiple roles in the design, conduct, and analysis of ran-
domized clinical trials. Before the trial starts, the statistician should be in-
volved in protocol development. After the trial starts, the statistician may be
involved in interim analyses. It is important that interim results reflect, as
closely as possible, what is actually occurring at that time in the trial, despite
the fact that the database will not yet have undergone the scrutiny usual in
producing the final analysis dataset. The statistician usually plays a major
role in the final analyses of trial data and in presenting these results to the
broader scientific community.
Clearly, the success of a trial depends on the data that are collected. First
and foremost, data must be collected that both answer the scientific questions
of interest and satisfy regulatory requirements. The statistician must have
sufficient grasp of the scientific questions and familiarity with the relevant
outcome measures to ensure that the results are as unambiguous as possible.
Statisticians are often in the unique position to anticipate data collection and
analysis complications that might arise if particular outcome measures are
selected. Second, the data collected need to be of sufficient quality that the
integrity of the result is assured. The statistician must understand the data
collection process and the limitations of the data collected in order to correctly
characterize the results.
Two fundamental measures of data quality are completeness and accuracy.
The manifestations of poor data quality are bias and increased variability.
Bias is a systematic error that would result in erroneous conclusions given
a sufficiently large sample (in small samples bias may be overwhelmed by
variability). Increased variability decreases power for detecting differences be-
tween treatment groups and increases uncertainty in any treatment differences
that are identified.
While missing data (incompleteness) decrease the effective sample size and
thereby decrease power, the greater danger is the introduction of bias. Analytic
issues involving missing data are dealt with in more detail in Chapter 11, so
here we will only say that, except in very special circumstances, the bias
introduced by missing data cannot be accounted for by analytic methods
without strong, untestable, assumptions. At best its potential impact can only
be quantified via sensitivity analyses.
Non-missing data can be inaccurate as a result of either random or sys-
tematic errors. Random errors can occur, for example, because the person
transferring a value from a diagnostic instrument or source document to a
data collection form misread the initial value. This kind of error is unlikely to
introduce bias because it is not likely to be related to assigned treatment and,
for numeric data for example, is probably as likely to increase as decrease the
value. Conversely, bias may be introduced by a poorly calibrated instrument
if the values it produces tend to be higher or lower than they should. While
this kind of bias may skew the distributions of the values from a given clinical
site, it is not likely to be treatment related and therefore should not introduce
bias into treatment comparisons. The impact of errors in non-missing data is
primarily that of increasing the variability of the responses.
The potentially increased variability introduced by inaccurate non-missing
data must be considered in the light of the overall sampling variability. For
example, the standard deviation for systolic blood pressure measurements re-
peated for the same subject is in the range of 8–9 mmHg, and the population
standard deviation of systolic blood pressure measurements at a particular
visit for all subjects enrolled in a trial may be 15–20 mmHg. Random tran-
scription errors in at most, say, 5% of subjects are likely to increase this
standard deviation by only a few percent and will have minimal impact on
statistical inference. Typical error rates in clinical trials have been shown to
be far less than 5%, probably well below 1%, and perhaps as low as 0.1%
(Califf et al. 1997; Eisenstein et al. 2005). There appears to be little empirical
evidence to support the extensive use of intensive quality control procedures
used in many trials today.
In order to obtain high quality results from randomized clinical trials, close
attention must be paid to the data collection process and to the design of data
collection forms. The primary goal of the data collection process should be to
minimize the potential for bias; hence, the first priority should be complete-
ness. In this chapter, we discuss the process of planning for the collection of
clinical trial data, specific issues that arise for selected types of clinical data,
and issues regarding quality control. We will address problems that can occur
in the data collection process and suggest strategies to help alleviate them.
The study protocol defines the stages of trial participation—entry into the
trial, treatment, and follow-up—and outlines the study tests and procedures to
be performed and the outcomes to be assessed. The specific data elements col-
lected should reflect the objectives specified in the protocol. A brief overview
is given below, with further detail provided in Section 6.2.
Baseline data. Data collected prior to the start of treatment are referred
to as baseline data.
Efficacy outcome data. Efficacy outcome data may be a simple account-
ing of survival status or the occurrence of a morbid event such as a heart
attack, stroke, hospitalization, or cancer recurrence. Other outcome data may
involve results of diagnostic scans or other complex measurements. In some
areas of clinical study, potential outcomes must be reviewed and adjudicated
by experts in the field. Data collection procedures must be developed that are
appropriate to the type, complexity, and assessment frequency of the outcome
measure.
Safety data. Safety data include both serious adverse events (SAEs), dis-
cussed in further detail in Section 6.1.2, and non-serious adverse events or
toxicities. Regulatory reporting requirements for SAEs may influence both
the process and content of data collection. Since potential adverse events fre-
quently cannot be anticipated in advance, adverse event reporting may be
more extensive than other categories of data collected.
Safety data also include laboratory data, based on serum or urine chemistry
values, vital signs, or other physical measurements that are usually collected
throughout the study according to a predetermined schedule.
Follow-up status. Follow-up status or subject study status refers to the
status of the subject with respect to treatment and follow-up. For example,
the subject may be currently on their assigned treatment, off treatment but
still being followed, dead, or lost to follow-up. Depending on the nature of the
trial, there may be other important categories.
Over and above the requirements of an individual protocol, there may be other
factors that influence selection of the data to be collected. For example, there
may be multiple concurrent trials for the same intervention in different popu-
lations or for similar populations in different geographic regions. As discussed
in Chapter 1, approval of a new agent or indication by the U.S. Food and
Drug Administration (FDA) often requires two trials demonstrating a posi-
tive beneficial effect. In these cases, the data collected for the related trials
should be similar, if not identical.
Also, many trialists now attempt to combine all trials of the same inter-
vention, or the same class of interventions, in order to obtain a more precise
estimate of benefit and to investigate rare but important adverse events, es-
pecially SAEs. This type of analysis is often referred to as meta-analysis. In
order to facilitate these analyses, it is desirable for data collection to conform
to certain standards and be as comparable as possible for baseline, primary,
and secondary outcomes, and for adverse events.
6.1.2 Mechanisms of Data Collection
IVRS
A common mechanism used for checking eligibility and randomizing subjects
is an interactive voice response system or IVRS. A well-designed IVRS system
is accurate, efficient, and allows subjects to be enrolled and/or randomized 24
hours a day from clinical sites anywhere in the world. Personnel from clinical
sites interact with the IVRS via telephone, and once they have identified their
site and provided required passwords, respond to a series of questions regard-
ing a potential study participant. Eligibility and other selected baseline data
can be collected immediately. Covariate-adaptive randomization procedures
(see Section 5.4) can also be implemented using IVRS systems, and random-
ization can be easily stratified by region or other prespecified risk factors.
In addition, IVRS systems are often used for other aspects of trial man-
agement, including dose titration in a double-blind study, distribution and
tracking of drug supplies or devices used for the intervention, and collecting
participant information relating to early termination of treatment or follow-
up.
There are many advantages to using an IVRS. First, the trial manage-
ment personnel and the statistician have real-time accounting of recruitment
progress, can monitor site or regional activity more effectively, and ultimately
can terminate recruitment more efficiently. Second, having up-to-date infor-
mation on the number and status of participants in the study, with computer-
generated information on subject identifiers, treatment assignment, and the
date of randomization, facilitates up-to-date monitoring of subject safety (see
Chapter 10).
The traditional method of data collection involves entering data into a study-
specific case report form (CRF) which is a series of paper forms or CRF pages,
typically organized by the type of data being collected. For example, one page
might be an eligibility checklist, another might collect information regarding a
participant’s medical history, and a third page, filled out at the time of study
entry, might record data gathered during a physical exam. CRFs designed for
clinical trials are typically distinct from the source documents, which are usu-
ally part of the general medical record. While paper CRFs are still common,
electronic CRFs, in which the data are entered directly into a computer sys-
tem, are now being used with increasing frequency. Basic principles of CRF
design, discussed in more detail in Section 6.1.3, apply to either medium.
CRFs are typically completed at the clinical site by a study coordinator or
members of the health care team. Completed pages may be collected in person
by monitors from the coordinating center or they may be mailed, faxed, or
submitted via the internet to a central location for inclusion in the study
database. Depending on the technology used, the data may be scanned, keyed,
or uploaded into the database using specialized software. Completed forms
are subject to a variety of quality control procedures, outlined in more detail
in Section 6.3, with queries sent back to the clinical site for clarification or
correction of problems. It is important that the statistician be involved in the
formulation of the quality control procedures, as this will affect the quality of
both interim analyses and the final analysis at the completion of the trial.
Well-designed forms are essential for meeting the goal of obtaining accurate,
complete, and timely clinical trial data. Here we present a set of principles
that are useful in developing and evaluating both paper CRFs and electronic
data collection instruments.
First, the CRF must be clear and easy to use by different groups of people:
those who fill them out, those doing data entry, and ultimately those who
perform the data analyses. Although forms can be accompanied by separate
instruction pages, it is preferable that they be self-explanatory with definitions
included and possible responses clearly indicated.
The package should be logically organized, with the content of each individ-
ual page clearly laid out. Using the same style or format on all pages makes
form completion easier for the clinical team and ultimately leads to better
quality data. The CRF should be parsimonious, focused on data necessary for
analysis. Collecting the same information on multiple pages should be avoided.
If possible, it is advisable to begin with a CRF that has been used for a
previous study, making modifications and improvements based on prior expe-
rience and any specific requirements of the present study. Eventually, certain
classes of CRFs can be perfected and used repeatedly. Familiarity with the
format and content of a form increases the likelihood that the clinical team
will complete it as intended, improving data quality and completeness and
ultimately requiring that less time and effort be spent in quality control.
When a new CRF is created, it should be tested prior to implementation.
While a draft version may appear perfectly clear to the designer, it may be
confusing to those who must complete it or certain clinical realities may not
have been anticipated. A modest amount of pre-study effort can save an enor-
mous amount of time in the long run, as making changes to a CRF once a
trial is underway can create both logistical and analytical problems.
The structure of the CRF should be reviewed with respect to how the pages
are organized. It is important to clearly specify when pages should be com-
pleted at the site and when the data are to be submitted for inclusion in the
central database. Typically pages that are to be completed at the same time
are grouped together (for example, information collected at screening, the ran-
domization visit, the end of treatment visit, etc.). Individual pages should be
clearly labeled and the flow should be logical.
Table 6.1 provides a basic categorization of CRF pages that might be used
for a cardiovascular trial with clinical event endpoints. Further details regard-
ing selected categories of clinical data can be found in Section 6.2.
Table 6.1 CRF pages typical in cardiovascular studies and when they may be com-
pleted.
Study Visit                        Assessments Performed

Screening and baseline only        Inclusion criteria
                                   Exclusion criteria
                                   Demographics
                                   Tobacco and alcohol classifications
                                   Medical and surgical history
                                   Physical examination
                                   Baseline medications
                                   Randomization and first dose information

Both baseline and at selected      Weight and vital signs
follow-up visits                   Cardiovascular assessment
                                   Quality of life (QOL) assessment
                                   ECG results
                                   Serum samples for laboratory testing
                                   Urinalysis

Assessment at each follow-up       Clinic visit and contact summary
visit                              Compliance with intervention
                                   Adverse events
                                   Concomitant medications
                                   Hospitalizations
                                   Clinical endpoints (death, MI, stroke, etc.)

Study completion                   Treatment termination
                                   Final physical examination
                                   Final assessments of QOL or cardiovascular status
                                   End of study date and subject status
While tests and procedures are typically organized by visit, certain types
of follow-up data may be collected on a log-form, rather than a visit-oriented
form. Examples include adverse events and concomitant medications. Log-
forms have multiple rows, allowing for multiple entries, each with associated
start and stop dates and relevant characteristics (severity, action taken, out-
come for adverse events; dose, route of administration, indication for medica-
tions). It may be useful for log-forms to include a check box for “none” so it
can be clear that the data are not missing, but rather that the subject had
nothing to report.
When data are being used for interim analyses, it is important to consider
the submission schedule for different categories of information. If data on a
log-form are not included in the central database until all rows on the page
have been filled, none of the data on a partially filled page will be available
for use in interim analyses. In a study where certain types of medications
are of particular importance for interim monitoring, for example, it might be
preferable to have check boxes at each visit to indicate whether or not the
subject had taken that medication at any time since the last visit. Another
way to increase data availability is to require that log-forms be submitted on
a regular basis (e.g., quarterly), regardless of whether or not they have been
filled.
Header Information
Each CRF page should have a clearly marked space for collecting header in-
formation, with a consistent style used throughout. The header should clearly
identify the study, the CRF page, the participant, and the visit associated
with the data. Information contained in the header is critical for linking data
for a given individual and visit, and is also important for tracking data flow
and completeness.
Participants must be clearly identified on each page of the CRF, including
the unique identifier assigned to that subject for the trial (screening number
and/or randomization number). It may be advisable to include additional
identifying information, such as clinical site and/or subject initials, that can
help identify errors in recording subject IDs.
Each header should also include information about the relevant time point
in the trial. Follow-up visits may be identified by a unique visit number in the
database, or they may be organized by calendar date (or sometimes both).
Provisions should also be made for information collected at unscheduled visits
and at the end of the trial. An example of a case report form header is given
in Figure 6.1.
The credibility of the results of a clinical trial relies on the extent to which
the participants adhere to the intended course of treatment as outlined in the
protocol—poor adherence to assigned treatment can leave the results of the
trial open to criticism. Hence, it is necessary to collect data that enable a
meaningful assessment of adherence to assigned treatment.
Minimally, the start and stop dates of treatment should be collected. Anal-
yses of the timing of (premature) treatment withdrawal can aid in the inter-
pretation of the final results.
Some studies call for “per protocol” analyses in addition to those performed
according to the intention-to-treat principle. These analyses typically involve
analyzing only those events (potential primary events or adverse events) that
occur while the participant is complying with the intervention. We note that
these analyses are vulnerable to bias and the results must be viewed cau-
tiously. The biases inherent in these analyses are discussed in greater detail
in Chapter 11.
Some interventions (for example, a surgical procedure or device implanta-
tion) involve a one-time administration of the treatment, and overall adher-
ence is measured simply by the proportion of subjects receiving their assigned
treatment. For drug treatments, the degree of compliance may be measured
by the percent of pills taken based on the number prescribed, or the amount
of the dose received as a function of the target dose.
Assessing the safety of any new intervention is one of the most important tasks
for a clinical trial. However, it is also the most challenging, because safety is
multi-faceted and not easily defined.
As discussed in Section 6.1.2 there are regulatory requirements for rapid
reporting of serious adverse events. In order to comply with regulations, most
trial sponsors have developed a separate fast-track reporting system for SAEs.
In addition to a brief description of the event (e.g., pneumonia), information
collected usually includes the following:
• the reason the event qualified as an SAE (Did it lead to hospitalization?
Was it life-threatening?),
• the severity of the event (often coded on a scale of mild, moderate, severe),
• was the event thought to be related to the intervention (definitely, possibly,
definitely not),
• the outcome of the event (Was it fully resolved? Were there lingering se-
quelae? Was it fatal?),
• modifications to the study treatment made as a result of the event (change
in dose, temporary hold, permanent discontinuation).
There will also be a detailed narrative that describes treatment and follow-up
of the adverse experience. Information collected in the SAE reporting system,
particularly the coded fields, can be useful for interim safety monitoring.
While other adverse events (AEs) may not qualify as SAEs, they can still be
relevant to the assessment of the risks and benefits of the intervention being
tested. Thus, data on the occurrence of all AEs, especially those that are more
severe or affect adherence, should be collected. Some AEs might be common to
the disease being treated, some might be common to all assigned treatments,
and some attributed to the new treatment. While prior studies might suggest
what side effects are likely to be associated with the new intervention, many
AEs are not sufficiently frequent to be seen in smaller studies. Thus, the AE
reporting system must be prepared for both expected and unexpected
events. It is the latter that create challenges.
For adverse experiences that are anticipated based on a known side-effect
profile of a drug, it may be most effective to present the participant with
a checklist of potential toxicities to be reviewed at each follow-up visit. For
example, a subject being treated for cancer might be asked if they had experi-
enced any nausea and vomiting, peripheral neuropathy, or hair loss since the
last visit, and if so, how severe it was. Some types of anticipated toxicities,
such as decreases in white blood cell counts, might be abstracted by a data
manager based on review of the local lab data and summarized according to
objective criteria such as the Common Toxicity Criteria used in cancer clinical
trials.
A common practice in industry-sponsored trials of new drugs is to ask the
participant an open-ended question soliciting any complaints or problems they
have experienced since the previous visit. These complaints are typically sum-
marized with a brief phrase by the health professional, recorded on a log-form,
and subsequently coded in a systematic way for analysis. A variety of hierar-
chical coding systems such as WHOART, SNOMED CT, and MedDRA have
been developed and used over the years. Coding may either be performed
by trained individuals or through the use of computer-based auto-encoding
systems.
In recent years, there has been a movement towards creating a standard
coding system that can be used worldwide. The MedDRA system (Medical
Dictionary for Regulatory Activities) is considered the new global standard
in medical terminology. This system allows for a detailed level of coding, and
relates the detailed codes to broader high level terms and system organ classes
(e.g., cardiac or gastrointestinal system). Table 6.2 illustrates the hierarchical
nature of the MedDRA coding system focusing on the high level term of “heart
failure.”
Table 6.2 A subset of the MedDRA coding system highlighting the high level term
“Heart failure.”

System Organ Class    High Level Term       Preferred Terms
Cardiac system        Heart failures NEC    Cardiac failure acute
                                            Cardiac failure chronic
                                            Cardiac failure congestive
                                            Cardiogenic shock
For some purposes, the lower level detailed coding may be useful. For the
purposes of comparing different treatment groups, however, this level of coding
is likely to be too fine, since the frequency of the specific event may not
be large enough to detect differences between groups. Examination of the
system organ class or high level term may detect an AE signal, with the finer
breakdown giving additional information. For some studies, preferred terms
from different parts of the hierarchy may be combined based on specifications
given in the protocol or questions posed by a monitoring committee (e.g.,
thrombotic events).
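As a small illustration of how such a rollup might work in practice (the preferred terms are taken from Table 6.2; the event data and dictionary structure are invented):

```python
from collections import Counter

HIGH_LEVEL = {                      # preferred term -> high level term
    "Cardiac failure acute": "Heart failures NEC",
    "Cardiac failure chronic": "Heart failures NEC",
    "Cardiac failure congestive": "Heart failures NEC",
    "Cardiogenic shock": "Heart failures NEC",
}

events = [("Cardiac failure acute", "drug"), ("Cardiogenic shock", "placebo"),
          ("Cardiac failure congestive", "drug")]

# Count events by high level term and treatment arm before comparing groups.
by_hlt = Counter((HIGH_LEVEL[pt], arm) for pt, arm in events)
print(by_hlt)
```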
In addition to the classification of the event, the start and stop dates are
usually recorded, along with a coded severity and the determination by the
clinical investigator of whether the event might be related to the interven-
tion. Although it is a standard component of AE reporting, the assessment
of relatedness to treatment may not be meaningful. The primary treatment
comparison of AE incidence should always be performed without regard to
this assessment, even when the statistical analysis plan calls for a summary
of treatment-related adverse experiences.
There is probably no category of data that has been the cause for as much
confusion as subject follow-up status. The confusion arises from the failure
to make a clear distinction between withdrawal from assigned treatment and
withdrawal from follow-up. A common mistake in trials is to stop collecting
follow-up data when the participant terminates their intervention. The usual
rationale is that, if the participant is no longer receiving the intervention, they
are not subject to either the benefits or the risks of the intervention. This ra-
tionale is inconsistent with the principles underlying randomized controlled
trials, however. The reason for using randomized controlled trials is that by
creating treatment arms that differ solely as a result of either chance or as-
signed treatment, the relative effect of the treatments under study can be as-
certained. When follow-up is curtailed because subjects are no longer adhering
to their assigned treatment, the subjects who remain in each group are prob-
ably not representative of the original treatment groups, and we have poten-
tially introduced a third difference between groups—selection bias—defeating
the purpose of performing a randomized trial at all. This bias can only be
avoided by ensuring that, to the extent possible, all subjects are followed for
the entire protocol defined period regardless of their adherence to assigned
treatment.
Data capturing a subject’s status are sometimes collected by an IVRS sys-
tem in addition to the case report form. While the IVRS data may be available
more rapidly for interim analyses, well-designed CRFs should collect subject
status with respect to both the intervention and follow-up, usually with dis-
tinct pages designated to record “early treatment termination” and “study
termination.”
Early treatment termination refers only to withdrawal from the study inter-
vention. Treatment may be withheld at various times, but it is most important
to record if and when the termination becomes permanent. A field should be
included in the CRF page indicating the date that the subject received their
last treatment, and the reason(s) for treatment termination should be speci-
fied (either by selecting a single primary reason, or choosing all applicable rea-
sons from a specified list). Possible reasons for early termination may include
adverse event, subject request, physician discretion, or death. It is strongly
recommended that there be a single CRF page designated for collecting infor-
mation on permanent treatment discontinuation.
Study termination implies that the subject is no longer being followed. Typ-
ically, the only valid reasons for early study termination are death, withdrawal
of consent, or loss to follow-up. A single CRF should be used to collect data
on termination from study, including the end-of-study reason as well as the
date of death or the date of last visit and/or contact.
Whether or not mortality is a designated study endpoint, the reporting of
deaths is required by regulatory bodies, and this information is critical to the
understanding of the safety of interventions. Many trials have a separate CRF
specifically for death, recording not only the date of death but also
the presumed cause (e.g., fatal heart attack or stroke, death resulting from
a specific type of cancer, or accidental death). Cause of death may also be
adjudicated by the ECC.
In order to minimize the amount of censored survival data, the vital status
of each participant should be recorded systematically at each scheduled visit. If
a subject does not come to the clinic for their scheduled visit, the clinic should
attempt to contact the subject to ascertain whether or not the participant is
alive. All attempts should be made to ensure that information on a subject’s
outcome status is collected. If this is not possible, the subject may still allow
information regarding survival status to be collected at the end of the trial.
A brief discussion of censoring for subjects who are lost to follow-up appears
in Section 11.3.4 on page 362.
It should be clear by this point that the analysis and conclusions drawn from
any trial are only as good as the quality of the data collected. A well-designed
CRF properly focused on the primary questions of the trial, along with pro-
cedures that ensure that data will be collected on all subjects to the extent
possible, are the best guarantors of high quality data. As we have indicated,
some degree of random error can be tolerated with an adequately powered
trial, as long as it does not introduce bias. A good quality control program, however, can help to ensure that procedures are followed and that the opportunity for systematic errors to find their way into the database is minimized.
A data quality control program will have multiple components that depend
on the source of data and mechanism for collection, as well as on the stage
of the data management process. Some quality control measures apply during
the data collection process itself; others are best implemented centrally at the
data management center, while still others tend to be performed prior to data
analysis.
Auditing
In the process of quality control of trial data, choices have to be made regard-
ing the intensity of auditing. One approach, often used by industry sponsored
trials, is to perform on-site, page by page, field by field, auditing of the CRF
data against the source documents. Another approach, more often done by
government sponsored trials, is to conduct central quality control using soft-
ware and statistically based methods, accompanied by sample auditing of trial
sites, and perhaps targeting key trial outcome measures. The first approach is
significantly more costly and may not be cost-effective according to a review
by Califf et al. (1997). Two examples follow in which the cost of intense au-
diting was as much as 30% of the total cost of the trial, yet had little or no
effect on the final results.
In one case, a large multicenter randomized cardiovascular trial was con-
ducted by an academic-based cooperative group to determine if a new inter-
vention would reduce mortality and other morbidity in patients at risk for a
heart attack (The GUSTO Investigators 1993). Prior to submission to regula-
tory agencies for approval, the sponsor of the trial conducted intense on-site
monitoring. Analyses performed before and after the audit did not alter the
numerical results in any appreciable way, even though some errors were dis-
covered.
Another example is provided by a well publicized fraud case (Fisher et al.
1995). A large multicenter breast cancer trial was conducted by an academic
cooperative group to determine which of two strategies was better: a standard radical mastectomy removing the entire breast, or a modified breast-sparing surgery followed by chemotherapy with tamoxifen. One of the centers modified the dates of six subjects during the screening process, presumably so that the breast biopsy used to diagnose the cancer would not have to be repeated to be in compliance with the protocol. Although this anomaly was discovered
and reported by the statistician, a controversy arose and an intense audit was
conducted at several of the clinical sites (Califf et al. 1997). While the audit
discovered errors, they were on average less than 0.1% of the data fields and
there was no meaningful effect on the results. This experience, as well as oth-
ers, suggests that perhaps standard statistical quality control of the incoming
data, described elsewhere in this section, is adequate and cost-effective.
6.3.3 QC of External Laboratories and Review Committees
Central laboratories are another source of possible error. Most high quality
laboratories have internal quality control procedures in which the same spec-
imen may be submitted twice during the same day, or on different days, to
ascertain reproducibility and variability. Results that fall outside those prede-
termined limits may require that the laboratory temporarily halt the process
and recalibrate or correct the deficiencies.
Some trials have submitted duplicate specimens from a clinical site as an
external check of reproducibility. In one study it was discovered that while a
specific external group was the best research laboratory for a particular out-
come, once the trial began, it could not handle large volumes required by the
protocol (The DCCT Research Group: Diabetes Control and Complications
Trial (DCCT) 1986). Statistical analyses of the data received from the labo-
ratory early in the trial uncovered the lack of reproducibility and a change of
laboratories was made.
Adjudication committees reviewing clinical endpoints often have a built-in
process by which two reviewers provide an assessment of the event (e.g., cause
of death), their assessments are compared, and a third person (or perhaps
a larger group) discusses and resolves the discrepancies. For some studies, a
subset of the clinical events are resubmitted through the entire process to
evaluate the reliability of the process and to prompt further discussion and
retraining if necessary.
Examination of Datasets
Examination of the transferred datasets is particularly important at the be-
ginning of the trial in order to promote a better understanding of the variables
and relationships among them. For subsequent data transfers, careful exam-
ination of the data can serve to either verify or correct assumptions about
the data structure and is critical for identifying potential problems with the
incoming data. Data checks may include any or all of the following:
• tabulations of coded data,
• cross-tabulations identifying relationships among key variables,
• range checks for numeric fields and dates,
• comparison between date fields to identify invalid sequences, or
• checks of conventions for recording missing data (for both numeric and
character fields).
Particular attention should be paid to comparing critical information (such as
death, date of death) from different sources. If gaps or inconsistencies in data
fields are identified, the data management center can be alerted or queried.
Careful examination can lead to more informed decisions regarding the appro-
priate algorithms to use for defining derived variables (e.g., which data source
is most reliable for date of death).
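Checks of this kind are straightforward to automate. The following is a minimal sketch of how the examinations listed above might be scripted; the pandas DataFrame and all column names (id, status, age, rand_date, death_date) are hypothetical, not part of any particular trial’s CRF.

```python
# Minimal sketch of automated data checks; all column names are hypothetical.
import pandas as pd

def check_dataset(df: pd.DataFrame) -> list:
    """Return messages describing potential problems in the incoming data."""
    problems = []

    # Range check for a numeric field.
    bad_age = df[(df["age"] < 18) | (df["age"] > 110)]
    problems += [f"subject {s}: age out of range" for s in bad_age["id"]]

    # Comparison between date fields to identify invalid sequences.
    both = df.dropna(subset=["rand_date", "death_date"])
    bad_seq = both[both["death_date"] < both["rand_date"]]
    problems += [f"subject {s}: death precedes randomization" for s in bad_seq["id"]]

    # Cross-source consistency: death coded on one form, date missing elsewhere.
    mismatch = df[(df["status"] == "dead") & df["death_date"].isna()]
    problems += [f"subject {s}: status 'dead' but no death date" for s in mismatch["id"]]

    return problems

# Tabulations of coded data are a one-liner for visual review:
# df["status"].value_counts(dropna=False)
```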
6.4 Conclusions
In this chapter, we have discussed elements of data collection and quality con-
trol with which the statistician must be engaged. Planning and testing of data
collection procedures including the CRF is perhaps the most important task.
If this is done carefully, the need for quality control and auditing at the conclu-
sion of the trial will be decreased substantially. Using or developing standard
data collection procedures is also important to minimize the requirement for
retraining of clinical staff for each new trial. The formulation of definitions
of fields and outcomes is part of the standardization process. Ongoing data quality control is necessary and is now facilitated by most data management
software. Thus, many errors can be detected quickly and corrections made
to collection procedures as necessary. Sample auditing is probably more than
adequate for most trials and 100 percent auditing should be used only for
special circumstances.
CHAPTER 7
Survival Analysis
7.1 Background
The term failure time (or survival time; we will use the terms interchangeably
even though the failure in question may not be fatal) refers to a positive-valued
random variable measured from a common time origin to an event such as
death or other adverse event. The true failure time may be unobserved due to a
censoring event. Censoring can happen through several different mechanisms,
which will be explained later in more detail.
Let $T$ denote the failure time, with probability density function $f(t)$ and cumulative distribution function $F(t) = P(T \le t) = \int_0^t f(u)\,du$. In this chapter we will assume that $F(t)$ is continuous on the positive real line, although for
most results this assumption is not necessary. Associated with $T$ is a survivor function defined as
$$S(t) = 1 - F(t) = P(T > t)$$
and a hazard function defined as
$$\lambda(t) = \lim_{h \to 0} \frac{\Pr\{T \in (t, t+h] \mid T > t\}}{h} = \frac{f(t)}{1 - F(t)} = \frac{f(t)}{S(t)}.$$
The hazard function, $\lambda(t)$, can be thought of as the conditional probability of failing in a small interval following time $t$ per unit of time, given that the subject has not failed up to time $t$. Because $\lambda(t)$ is a probability per unit time, it is time-scale dependent and not restricted to the interval $[0,1]$, potentially taking any non-negative value. We define the cumulative hazard, $\Lambda(t)$, by
$$\Lambda(t) = \int_0^t \lambda(u)\,du = \int_0^t \frac{f(u)\,du}{1 - F(u)} = -\log[1 - F(t)] = -\log S(t).$$
Parameters in parametric models are most often estimated using the method of maximum likelihood (see Appendix A.2 for an overview of maximum likelihood estimation). Suppose that $t_1, t_2, \ldots, t_n$ are an i.i.d. sample from an exponential distribution with parameter $\lambda$. First, we suppose that there is no censoring so that all subjects are observed to fail. The likelihood for $\lambda$ is
$$L = L(\lambda) = \prod_{i=1}^n f(t_i) = \prod_{i=1}^n \lambda \exp(-\lambda t_i).$$
The log-likelihood is
$$\log L = \sum_{i=1}^n (\log \lambda - \lambda t_i) = n \log \lambda - \lambda \sum_{i=1}^n t_i.$$
Example 7.1. Suppose we have the following failure times ($+$ indicates censored observations):
$$9,\ 13,\ 13^+,\ 18,\ 23,\ 28^+,\ 31,\ 34,\ 45^+,\ 48,\ 161^+$$
We have that $\delta_\cdot = 7$ and $y_\cdot = 423$, so $\hat\lambda = 0.016$. Suppose we wish to test $H_0: \lambda = 0.03$. Under $H_0$, $E_{\lambda_0}\delta_\cdot = 0.03 \times 423 = 12.7$. The LR statistic is $2(7 \log(7/12.7) - (7 - 12.7)) = 3.05$. The Wald test statistic is $(7 - 12.7)^2/7 = 4.62$ while the score statistic is $(7 - 12.7)^2/12.7 = 2.55$. In this case, because the variance of $\delta_\cdot$ under $H_0$ is larger than when $\lambda = \hat\lambda$, the score statistic is smaller than the Wald statistic. The LR statistic is between the score and the Wald statistics. In large samples, the three statistics will generally be quite close together. □
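These calculations are easily scripted. The following is a minimal sketch (assuming only numpy) that reproduces the three statistics for the censored exponential model above.

```python
# Sketch reproducing Example 7.1 for censored exponential data.
import numpy as np

times = np.array([9, 13, 13, 18, 23, 28, 31, 34, 45, 48, 161])
event = np.array([1, 1, 0, 1, 1, 0, 1, 1, 0, 1, 0])   # 1 = failure, 0 = censored

d = event.sum()      # total failures, delta. = 7
y = times.sum()      # total exposure, y. = 423
lam_hat = d / y      # MLE, about 0.0165

lam0 = 0.03          # H0: lambda = 0.03
e0 = lam0 * y        # expected failures under H0, about 12.7

lr    = 2 * (d * np.log(d / e0) - (d - e0))   # likelihood ratio, about 3.05
wald  = (d - e0) ** 2 / d                     # about 4.6
score = (d - e0) ** 2 / e0                    # about 2.55
print(lam_hat, lr, wald, score)
```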
and
$$\operatorname{Var}(T) = e^{2\mu + \sigma^2}\left(e^{\sigma^2} - 1\right),$$
where $\theta$ is the vector of model parameters that depend on the parametric family. The derivatives of the log-likelihood are straightforward, although often computationally involved.
Hence, we use
$$\widehat{\operatorname{Var}}(\hat S(\tau_k)) = \hat S^2(\tau_k) \sum_{l=1}^{k} \frac{d_l}{n'_l (n'_l - d_l)}.$$
This variance estimate is known as Greenwood’s formula.
Example 7.2. Suppose that we have data from the first four columns of the following table.

k   τ_k   n_k   d_k   w_k   n'_k   1 − p̂_k   p̂_k   Ŝ(τ_k)   SE(Ŝ(τ_k))
[table entries not recovered]

The five rightmost columns are derived from the others using the formulas previously defined. For example,
$$\hat S(\tau_2) = \hat p_1 \cdot \hat p_2 = \left(1 - \frac{47}{116.5}\right)\left(1 - \frac{5}{51.5}\right) = 0.54$$
$$\widehat{\operatorname{Var}}(\hat S(\tau_2)) = (0.54)^2 \left[ \frac{47}{116.5(116.5 - 47)} + \frac{5}{51.5(51.5 - 5)} \right] = 0.0023.$$
□
Example 7.3. Using the data from Example 7.1, we construct the corresponding table of Kaplan-Meier estimates. Estimates of $S(t)$ for this example are plotted in Figure 7.1. The 95% pointwise confidence band is derived for the Kaplan-Meier estimate of $S(t)$ using the standard error computed from Greenwood’s formula applied to $\log(\hat S(t))$; the result is then transformed back to the original scale. Computation of the confidence band on the log scale is more reliable than directly using the standard errors above. We call the confidence band pointwise because for each time, $t$, there is a 0.95 probability that the band contains the true value of $S(t)$. The probability that $S(t)$ is contained within the band for all values of $t$ is less than 0.95. Simultaneous confidence bands can be constructed with the property that the band contains $S(t)$ for all $t$ within the range of the data with a specified probability (Fleming and Harrington 1991). □
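The life-table quantities for this example are easy to recompute; the following sketch (assuming only numpy) produces the Kaplan-Meier estimate, Greenwood standard errors, and the log-scale pointwise confidence limits described above for the Example 7.1 data.

```python
# Sketch: Kaplan-Meier estimate, Greenwood standard errors, and log-scale
# pointwise confidence limits for the Example 7.1 data.
import numpy as np

times = np.array([9, 13, 13, 18, 23, 28, 31, 34, 45, 48, 161])
event = np.array([1, 1, 0, 1, 1, 0, 1, 1, 0, 1, 0])  # 1 = failure

s, gw = 1.0, 0.0
for t in np.unique(times[event == 1]):
    d = np.sum((times == t) & (event == 1))   # deaths at t
    n = np.sum(times >= t)                    # number at risk just before t
    s *= 1 - d / n
    gw += d / (n * (n - d))                   # Greenwood accumulator
    se = s * np.sqrt(gw)                      # Greenwood's formula
    # 95% limits computed on the log scale, then transformed back
    lo = s * np.exp(-1.96 * np.sqrt(gw))
    hi = min(1.0, s * np.exp(1.96 * np.sqrt(gw)))
    print(f"t={t:3d}  S={s:.3f}  SE={se:.3f}  CI=({lo:.3f}, {hi:.3f})")
```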
Example 7.4. Figure 7.2 shows cumulative mortality, F (t) = 1 − S(t), for
the two treatment arms of the BHAT (Beta-Blocker Heart Attack Trial Re-
search Group 1982) study. Note that the jumps in the estimated survivor
functions are quite small for the first 30 months or so, after which the number
of subjects at risk, nk , is small and each death results in larger jumps. This
behavior is typical, and often provides a visual clue regarding the reliability
of the estimates at large times. Frequently, the curves exhibit erratic behavior
at the most extreme times and these portions of the curves should be inter-
preted cautiously. In the primary publication of BHAT, the survival plot was
truncated at 36 months. □
[Figure 7.1: Kaplan-Meier estimate of survival probability vs. time (days), with 95% pointwise confidence band.]

7.3 Comparison of Survival Distributions
The goal of most randomized trials with survival outcomes is to make di-
rect comparisons of two survival curves. The most general form of the null
hypothesis is H0 : S1 (t) = S2 (t) for all t > 0, although in practice we can
only test this hypothesis over the range of the observed failure times. Tests of
this hypothesis can be parametric—under an assumed parametric model—or
nonparametric in which case the forms of S1 (t) and S2 (t) may be completely
unspecified. We begin with parametric methods.
Figure 7.2 Cumulative mortality (%) vs. time from randomization (months) for the propranolol and placebo arms of BHAT (Beta-Blocker Heart Attack Trial Research Group 1982).
Example 7.6. Returning to the 6–MP acute leukemia example, the common hazard rate is $\hat\lambda_0 = 30/541 = 0.055$. The score test becomes
$$U(\hat\alpha, 0)^T I^{-1} U(\hat\alpha, 0) = (21 - 0.055 \times 182)^2 \times \left( \frac{1}{0.055 \times 359} + \frac{1}{0.055 \times 182} \right) = 17.76.$$
Note that the score test is slightly larger than the square of the Wald test ($3.83^2 = 14.68$). □
$$p(d_{j0}, d_{j1} \mid d_{j\cdot}) = \frac{p(d_{j0}, d_{j1})}{\sum_{u=0}^{d_{j\cdot}} p(u,\, d_{j\cdot} - u)}$$
$$\approx \frac{\binom{n_{j0}}{d_{j0}} \binom{n_{j1}}{d_{j1}} \lambda_0(t_j)^{d_{j0}} \lambda_1(t_j)^{d_{j1}} \Delta t^{d_{j\cdot}}}{\sum_{u=0}^{d_{j\cdot}} \binom{n_{j0}}{u} \binom{n_{j1}}{d_{j\cdot}-u} \lambda_0(t_j)^{u} \lambda_1(t_j)^{d_{j\cdot}-u} \Delta t^{d_{j\cdot}}}$$
$$= \frac{\binom{n_{j0}}{d_{j0}} \binom{n_{j1}}{d_{j1}} \psi^{d_{j1}}}{\sum_{u=0}^{d_{j\cdot}} \binom{n_{j0}}{u} \binom{n_{j1}}{d_{j\cdot}-u} \psi^{d_{j\cdot}-u}}, \tag{7.2}$$
where $\psi = \lambda_1(t_j)/\lambda_0(t_j)$. The distribution defined by (7.2) is known as the non-central hypergeometric distribution and does not depend on the underlying hazard, but only on the hazard ratio $\psi$.
The null hypothesis $H_0: \lambda_1(t) = \lambda_0(t)$ for all $t$ is equivalent to $H_0: \psi = 1$. When this holds, (7.2) simplifies to
$$p(d_{j0}, d_{j1} \mid d_{j\cdot}) = \frac{\binom{n_{j0}}{d_{j0}} \binom{n_{j1}}{d_{j1}}}{\binom{n_{j0}+n_{j1}}{d_{j\cdot}}}.$$
If the true underlying hazard is uniformly larger (or smaller) in group 1 than in group 0, then we expect that $d_{j1}$ will be systematically larger (or smaller) than its expected value under $H_0$. In particular, this holds when we have proportional hazards ($\lambda_1(t)/\lambda_0(t) \ne 1$ does not depend on $t$). This suggests that the test statistic
$$U = \sum_j \left( d_{j1} - n_{j1} \frac{d_{j\cdot}}{n_{j\cdot}} \right), \tag{7.3}$$
which is the overall sum of the observed number of events minus its conditional expectation under $H_0$, is a measure of the degree of departure from $H_0$ suggested by the data. A formal test of $H_0$ requires that we compare $U$ to its standard error under $H_0$.
First we note that the $d_{j1}$ are not independent over the observed failure times; however, it can be shown that they are uncorrelated. Hence, we may use the following test statistic:
$$\frac{\left( \sum_j (d_{j1} - E[d_{j1}]) \right)^2}{\sum_j \operatorname{Var}(d_{j1})} \sim \chi^2_1.$$
This test is usually known as the log-rank test or Mantel-Haenszel test (Mantel 1966).
We also note that, since the distribution of the test statistic is derived under
H0 without reference to a particular alternative hypothesis, the log-rank test is
valid for any alternative hypotheses including those for which the proportional
hazards assumption does not hold, i.e., λ1 (t)/λ0 (t) is not constant. It can be
shown, however, that the log-rank test is optimal for proportional hazards
alternatives, but it has lower power than other tests against non-proportional
hazards alternatives. If we desire higher power for particular non-proportional
hazards alternatives, the test can be modified to improve power by replacing
the sum in (7.3) by a weighted sum, with weights chosen to optimize the test
for the alternatives of interest.
In general, any test of the form
$$\frac{\left( \sum_j w_j (d_{j1} - E[d_{j1}]) \right)^2}{\sum_j w_j^2 \operatorname{Var}(d_{j1})} \sim \chi^2_1$$
is valid under $H_0$; the weights $w_j$ determine the alternatives against which the test is most powerful.
Example 7.7. Returning to the 6–MP example, there are 17 distinct failure times. At the first failure time, day 1, we have the table

            Dead   Alive   Total
Active        0      21      21
Control       2      19      21
              2      40      42

Here, $d_{11} - E d_{11} = 0 - 2 \times 21/42 = -1$ and $\operatorname{Var}(d_{11}) = 2 \times 21 \times 21 \times 40/(42^2 \times 41) = 0.488$. At the second failure time, day 2, we have $d_{21} - E d_{21} = 0 - 2 \times 21/40 = -1.05$ and $\operatorname{Var}(d_{21}) = 2 \times 21 \times 19 \times 38/(40^2 \times 39) = 0.486$. Continuing, we have $\sum_{j=1}^{17} (d_{j1} - E d_{j1}) = 9 - 19.3 = -10.3$ and $\sum_{j=1}^{17} \operatorname{Var}(d_{j1}) = 6.25$. Thus, the chi-square statistic for the log-rank test of $H_0$ is
$$\frac{\left( \sum_j (d_{j1} - E[d_{j1}]) \right)^2}{\sum_j \operatorname{Var}(d_{j1})} = \frac{(-10.3)^2}{6.25} = 16.8,$$
which is similar to the score test under the exponential model. □
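The accumulation over the 17 tables is mechanical. Here is a minimal sketch (plain numpy; the function name and data layout are my own) that computes the log-rank chi-square from one (time, event, group) triple per subject; applied to the full 6–MP data it should reproduce the 16.8 above.

```python
# Sketch of the log-rank test built from the 2x2 table at each failure time.
import numpy as np

def logrank_chisq(times, event, group):
    """Two-sample log-rank chi-square; group is coded 0/1."""
    times, event, group = map(np.asarray, (times, event, group))
    u = 0.0   # sum of observed minus expected, d_j1 - E[d_j1]
    v = 0.0   # sum of hypergeometric variances
    for t in np.unique(times[event == 1]):
        failed = (times == t) & (event == 1)
        at_risk = times >= t
        n, n1 = at_risk.sum(), (at_risk & (group == 1)).sum()
        d, d1 = failed.sum(), (failed & (group == 1)).sum()
        u += d1 - d * n1 / n
        if n > 1:  # hypergeometric variance of d_j1 given the margins
            v += d * (n1 / n) * (1 - n1 / n) * (n - d) / (n - 1)
    return u * u / v
```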
Example 7.8. In the BHAT example, there were 138 and 188 deaths in the
propranolol and placebo groups, respectively. The expected numbers under
the null hypothesis are 164.4 and 161.6 respectively. Note that 138 - 164.4
= -26.4 = -(188-161.6), so the difference between observed and expected is
the same in absolute value, regardless of which group is chosen as group 1.
The variance of the difference is 81.48. Note that this is approximated quite
closely by d.. /4 = (138 + 188)/4 = 81.5. This approximation works well when
the sample sizes and censoring patterns are the same in the two groups, which
is usually the case in randomized trials with 1:1 randomization schemes. The
log-rank chi-square statistic is $26.4^2/81.5 = 8.52$ ($p = 0.0035$), so we have strong evidence of the beneficial effect of propranolol on mortality in this population. □
Note that the conditional likelihood does not depend on the underlying hazard
function.
The partial log-likelihood is the sum of the log conditional likelihoods for all failure times,
$$\log L = \sum_j \left[ z_{j1}\beta - \log\left( \sum_{l=1}^{r_j} e^{z_{jl}\beta} \right) \right].$$
In the case where there are ties, the exact partial likelihood is more complex, and computationally prohibitive in large samples. Two common approximations are the Breslow approximation,
$$\log L \approx \sum_j \left[ s_j\beta - d_j \log\left( \sum_{l=1}^{r_j} e^{z_{jl}\beta} \right) \right],$$
where $D_j$ is the set of indices for the subjects failing at time $t_j$ and $s_j = \sum_{l \in D_j} z_{jl}$.
The score function has the form
$$U(\beta) = \sum_j \left( s_j - d_j \bar z_j \right),$$
where $\bar z_j = \sum_{l=1}^{r_j} z_{jl} e^{z_{jl}\beta} \big/ \sum_{l=1}^{r_j} e^{z_{jl}\beta}$ is a weighted mean of the covariate values for the subjects at risk at time $t_j$, weighted by the hazard ratios $e^{z_{jl}\beta}$. Hence the MLE, $\hat\beta$, satisfies an equation of the form observed $-$ expected $= 0$.
The Fisher information is
$$I(\beta) = -\frac{\partial}{\partial\beta} U(\beta) = \sum_j d_j \left[ \frac{\sum_{l=1}^{r_j} z_{jl}^2 e^{z_{jl}\beta}}{\sum_{l=1}^{r_j} e^{z_{jl}\beta}} - \left( \frac{\sum_{l=1}^{r_j} z_{jl} e^{z_{jl}\beta}}{\sum_{l=1}^{r_j} e^{z_{jl}\beta}} \right)^{\!2} \right] = \sum_j d_j \frac{\sum_{l=1}^{r_j} (z_{jl} - \bar z_j)^2 e^{z_{jl}\beta}}{\sum_{l=1}^{r_j} e^{z_{jl}\beta}}.$$
Point estimates of β̂ along with standard errors are computed in the usual
way, along with the usual battery of tests: score, Wald, and likelihood ratio.
Now consider the special case in which we have a single binary covariate taking values zero or one. Suppose we wish to test $H_0: \beta = 0$ using the score test. Then $s_j$ is the number of subjects failing at time $t_j$ with covariate value 1, which in the notation of Section 7.3.2 is $d_{j1}$. Also, under $H_0$ the hazard ratios $e^{z_{jl}\beta}$ are identically equal to one, and $\bar z_j$ is simply the proportion of subjects at risk for which $z = 1$, which, again in the notation of Section 7.3.2, is $n_{j1}/n_{j\cdot} = E[d_{j1}]/d_{j\cdot}$. Therefore,
$$U(0) = \sum_j (s_j - d_j \bar z_j) = \sum_j \left( d_{j1} - d_j\, n_{j1}/n_{j\cdot} \right).$$
Similarly, evaluating the Fisher information at $\beta = 0$,
$$I(0) = \sum_j d_j \left( \frac{n_{j1}}{n_{j\cdot}} - \frac{n_{j1}^2}{n_{j\cdot}^2} \right) = \sum_j \frac{d_j\, n_{j1} n_{j0}}{n_{j\cdot}^2}. \tag{7.8}$$
Hence, the score test reduces essentially to the log-rank test of Section 7.3.2. Note that the Fisher information based on the Breslow approximation differs from the hypergeometric variance by a factor of $(n_{j\cdot} - 1)/(n_{j\cdot} - d_{j\cdot})$, which is one in the event that there are no ties, but otherwise may be slightly larger than one.
Example 7.9. Revisiting the BHAT example, the estimate of the log hazard ratio from the Cox proportional hazards model is $\hat\beta = -0.326$, corresponding to a hazard ratio of $\exp(-0.326) = 0.722$. The standard error of $\hat\beta$ obtained from the Fisher information matrix is 0.112, so the Wald test yields a chi-square statistic of 8.45, which is quite close to that of the log-rank test in Example 7.8.
We can also consider the effect of baseline covariates on death and compute the hazard ratio for treatment adjusted for baseline covariates. Table 7.1 summarizes the model that includes treatment, age, and sex. We see that the adjusted log hazard ratio for treatment changes from $-0.326$ to $-0.317$ (a clinically and statistically insignificant difference), the effect of age is quite large (HR = 1.58 for a 10-year increase in age), and the effect of sex is not significant by either the Wald test or the likelihood ratio test.
Table 7.1 Cox proportional hazards model for BHAT. Treatment is coded as 1 = propranolol, 0 = placebo, and sex is coded as 1 = female, 0 = male. The likelihood ratio chi-square for the model without the sex term is 51.0 (2 df).

             β         se(β)      Z        p-value
Treatment   -0.3172    0.11212   -2.829    0.0047
Age          0.0463    0.00738    6.269   <0.0001
Sex         -0.0711    0.14986   -0.474    0.64

Likelihood ratio chi-square: 51.2 (3 df), p < 0.001
Next we consider the effect of race, which is coded with four levels: “White,” “Black,” “Asian,” and “Other.” We note from Table 7.2 that there is a highly significant difference between “Black” and “White” ($p < 0.001$); however, more appropriate is a simultaneous test of the effect of race, which cannot be constructed from the coefficients in Table 7.2 without knowledge of the full covariance matrix. Instead, a likelihood ratio test can be constructed from the difference in the likelihood ratio chi-square statistics between the model containing only treatment and age and the model with treatment, age, and race: $64.7 - 51.0 = 13.7$ with 3 degrees of freedom, $p = 0.0033$. □
Table 7.2 Cox proportional hazards model for BHAT. Treatment is coded as 1 = propranolol, 0 = placebo; the reference category for race is “White.”

             β        se(β)    Z        p-value
Treatment   -0.312    0.112   -2.781    0.0054
Age          0.047    0.007    6.432   <0.001
Race
  Black      0.633    0.160    3.962   <0.001
  Asian     -0.128    0.504   -0.254    0.80
  Other      0.178    0.581    0.306    0.761

Likelihood ratio chi-square: 64.7 (5 df), p < 0.001
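Fits like those in Tables 7.1 and 7.2 are produced by standard software. The following is a hedged sketch using the lifelines Python package (one implementation among many, not used in the text); the data file and column names are hypothetical.

```python
# Sketch: Cox proportional hazards fits of the kind shown in Tables 7.1-7.2,
# using the lifelines package; file and column names are hypothetical.
import pandas as pd
from lifelines import CoxPHFitter

df = pd.read_csv("bhat.csv")   # hypothetical: time, death, treatment, age, sex
df = df[["time", "death", "treatment", "age", "sex"]]

cph = CoxPHFitter()
cph.fit(df, duration_col="time", event_col="death")
cph.print_summary()   # coefficients, se(beta), Wald z, p-values

# A simultaneous (likelihood ratio) test for a block of terms, e.g. race,
# is the difference of the two models' LR chi-squares, referred to a
# chi-square with df equal to the number of added coefficients.
```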
7.4.6 Residuals
Similar to ordinary linear regression, examination of residuals can be useful in assessing model assumptions and fit. In the setting of the Cox proportional hazards model, however, the definition of the residuals is not as clear. Four commonly used types of residuals for Cox proportional hazards models are:
Martingale: The martingale residuals are defined as $\hat M_i = \delta_i - \hat\Lambda_0(t_i)\exp(\hat\beta z_i)$, where $\delta_i$ is the outcome for subject $i$ (0 or 1) and $\hat\Lambda_0(t_i)$ is the baseline cumulative hazard at the end of observation for subject $i$. This residual is the difference between the outcome and its expected value, given the length of follow-up.
Deviance: $d_i = \operatorname{sign}(\hat M_i)\sqrt{2[-\hat M_i - \delta_i \log(\delta_i - \hat M_i)]}$. The sum of the squares of the deviance residuals is twice the log partial likelihood ratio, $2(\log L(\hat\beta) - \log L(0))$.
Score: $(z_i - \bar z(t_i))\delta_i - \sum_{j: t_j \le t_i} (z_i - \bar z(t_j))\, e^{\hat\beta z_i} d_j \big/ \sum_{l \in R_j} e^{\hat\beta z_l}$, where $t_1, t_2, \ldots$ are the distinct failure times and $\bar z(t_j) = \sum_{l \in R_j} z_l e^{\hat\beta z_l} \big/ \sum_{l \in R_j} e^{\hat\beta z_l}$. The sum of the score residuals is the score function $U(\hat\beta)$ and hence is zero.
Schoenfeld: Schoenfeld residuals are defined for each event and have the form $z_j - \bar z(t_j)$, where $z_j$ is the covariate value of the subject failing at $t_j$. The Schoenfeld residuals are useful for identifying departures from the proportional hazards assumption.
See Wei (1984), Therneau, Grambsch, and Fleming (1990), Grambsch and
Therneau (1994), and Grambsch, Therneau, and Fleming (1995) (and others)
for procedures making use of residuals for diagnosing lack of fit in proportional
hazards models.
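In software these diagnostics are often packaged as a single test. A sketch continuing the hypothetical lifelines fit above:

```python
# Sketch: Schoenfeld-residual test of proportional hazards (lifelines),
# continuing the hypothetical Cox fit sketched earlier.
from lifelines.statistics import proportional_hazard_test

result = proportional_hazard_test(cph, df, time_transform="rank")
result.print_summary()   # one test per covariate; small p-values flag
                         # departures from proportional hazards
```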
7.6 Summary
Survival analysis may be the most widely applied analytic approach in the analysis of clinical trial data, certainly in disease areas such as cancer and cardiology. The techniques described in this chapter are widely understood and accepted. The primary point of contention regards censored observations. Administrative censoring, i.e., censoring resulting from a subject reaching a predetermined study end, usually satisfies the required condition that it be independent of outcome. Time trends in the underlying risk, so that baseline risk varies with time of enrollment, can induce a dependence leading to bias in Kaplan-Meier estimates; however, in randomized trials, there should be no corresponding bias in the hypothesis tests of interest.
A larger concern regards censoring that is not independent of outcome. Loss to follow-up may be both outcome and treatment dependent, so assessment of study discontinuation rates is important. Even when the rates are comparable among treatment groups, however, there is no guarantee, when rates are high, that bias has not been introduced. Sensitivity analyses such as those described in Chapter 11 (see Scharfstein, Rotnitzky, and Robins (1999), for example) can be helpful in understanding the potential impact of early study discontinuation. In the case of survival outcomes, the analysis is much more complex than the simple example we present.
A common source of dependent censoring (although we prefer the term
“truncation” following Frangakis and Rubin (2002)) is death from causes not
included in the outcome of interest. Frequently, especially in cardiology, deaths
from non-cardiovascular causes are excluded from the primary outcome with
the expectation that the treatment will have no effect, and therefore such
deaths serve to dilute the observed difference, requiring a larger sample size.
While this rationale is appealing, there are methodological difficulties because, as we have pointed out, the parameter of interest is no longer a hazard rate in the usual sense; rather, the additional condition that subjects be alive at a specified time is imposed (see page 227). Since we cannot statistically rule out the possibility of a complex interaction between treatment, death, and the outcome of interest, these comparisons are inherently problematic.
troubling are secondary analyses of nonfatal components in which subjects are
“censored at death.” While these analyses will almost certainly be requested
by investigators in many settings, one should always keep in mind that there
are potential biases lurking in the background.
7.7 Problems
7.1 Show that the exponential distribution is “memoryless”, i.e., if T has
an exponential distribution, Pr{T > t|T > t0 } = Pr{T > t − t0 }.
7.2 Suppose that we have 20 patients, 10 per treatment group, and we
observe the following survival times:
A: 8+, 11+, 16+, 18+, 23, 24, 26, 28, 30, 31
B: 9, 12, 13, 14, 14, 16, 19+, 22+, 23+, 29+
where the ‘+’ indicates a censored observation.
(a) Compute and plot the Kaplan-Meier estimate of survival for the
combined group and for each treatment group.
(b) Test equality of treatments using the log-rank test and the Gehan-
Wilcoxon test.
7.3 Recall that one way of computing the expectation of a positive random variable is by integrating one minus the cumulative distribution function. When dealing with a survival time $T$, this means
$$E[T] = \int_0^\infty S(t)\,dt.$$
This is sometimes referred to as the unrestricted mean life. One quantity
often used to estimate this mean is obtained by using the Kaplan-Meier
estimate for S, giving the total area under the Kaplan-Meier curve.
(a) Show that, in the absence of censoring, the unrestricted mean life
estimate is equal to the sample mean.
(b) Calculate the unrestricted mean life estimate for the combined group
data from the previous exercise.
(c) What would happen if you tried to calculate the estimate for group
B only?
CHAPTER 8
Longitudinal Data
where $i = 1, \ldots, m$ and $m$ is the number of subjects. We can write this in matrix form as
$$y_i = X\beta + \delta_i.$$
Here the δij are not assumed to be i.i.d. within subject and a more com-
plex correlation structure is required to model the within-subject correlation.
Models of this type are called population average models and are discussed
in Section 8.5. The term population average is used because the expectation
function includes only parameters that are common to all subjects (β). The
most common method for estimating the parameters in population average
models is generalized least squares.
The model we will use to begin this chapter is the subject-specific model
$$y_{ij} = \beta_{i1} + \beta_{i2} t_{ij} + e_{ij},$$
or in matrix form,
$$y_i = X_i \beta_i + e_i.$$
The term subject-specific is used because the parameters ($\beta_i$) in the expectation function vary by subject. This allows the fitted values to vary from subject to subject, which means it may still be acceptable to assume the $e_{ij}$ are i.i.d.
The subject-specific model is described in detail in Section 8.2. There are a
number of approaches to estimating parameters in this model including two-
stage/OLS estimation (Section 8.3) and random-effect/maximum-likelihood
estimation (Section 8.4).
In the remaining portion of this chapter, we describe an alternative to maxi-
mum likelihood estimation called restricted maximum likelihood (Section 8.6),
standard errors and testing (Sections 8.7 and 8.8), additional levels of clus-
tering (Section 8.9), and the impact of various types of missing data on the
interpretation of the estimates (Section 8.11).
Figure 8.1 Longitudinal data for which equation (8.1) does not hold. The symbols
represent the true population means, the solid line represents the fitted line for Treat-
ment, and the dashed line represents the fitted line for Control. Fitted lines in panel
(a) are based on model (8.1) and in (b) are based on model (8.2).
Table 8.1 Ramus height of 3 boys measured at 8, 8.5, 9, and 9.5 years of age.
Subject Age Ramus Height
2 8.0 46.4
2 8.5 47.3
2 9.0 47.7
2 9.5 48.4
8 8.0 49.8
8 8.5 50.0
8 9.0 50.3
8 9.5 52.7
10 8.0 45.0
10 8.5 47.0
10 9.0 47.3
10 9.5 48.3
Figure 8.2 Ramus height of 3 boys measured at 8, 8.5, 9, and 9.5 years of age.
for each boy. It seems reasonable to assume a (straight line) linear model for the responses from the $i$th subject: $y_{ij} = \beta_{i1} + \beta_{i2} t_j + e_{ij}$, $j = 1, 2, 3, 4$, $i = 1, 2, 3$, where $y_{ij}$ is the $j$th observation on the $i$th subject and $t_j$ is the $j$th time point (assumed to be the same for all subjects); e.g., $y_{13} = 47.7$ (mm) and $t_2 = 8.5$ (years). We also assume that $E[e_{ij}] = 0$. The models for all of
the observations for subject $i$ are
$$y_{i1} = \beta_{i1} + \beta_{i2} t_1 + e_{i1}$$
$$y_{i2} = \beta_{i1} + \beta_{i2} t_2 + e_{i2}$$
$$y_{i3} = \beta_{i1} + \beta_{i2} t_3 + e_{i3}$$
$$y_{i4} = \beta_{i1} + \beta_{i2} t_4 + e_{i4}.$$
These can be rewritten in matrix form as
$$\begin{pmatrix} y_{i1} \\ y_{i2} \\ y_{i3} \\ y_{i4} \end{pmatrix} = \begin{pmatrix} 1 \\ 1 \\ 1 \\ 1 \end{pmatrix} \beta_{i1} + \begin{pmatrix} t_1 \\ t_2 \\ t_3 \\ t_4 \end{pmatrix} \beta_{i2} + \begin{pmatrix} e_{i1} \\ e_{i2} \\ e_{i3} \\ e_{i4} \end{pmatrix},$$
or equivalently,
$$y_i = \begin{pmatrix} 1 & t_1 \\ 1 & t_2 \\ 1 & t_3 \\ 1 & t_4 \end{pmatrix} \begin{pmatrix} \beta_{i1} \\ \beta_{i2} \end{pmatrix} + e_i = X_i \beta_i + e_i,$$
where the design matrix $X_i$ is the same for all $i$. We will keep the subscript, $i$, for generality. Now that we have notation for the complete error vector for subject $i$, we can write down a more complete distributional assumption:
$$e_i \sim N(0_4,\ \sigma^2 \Lambda_i(\phi)),$$
where $0_4$ is a 4 by 1 vector of zeros and $\Lambda_i(\phi)$ is a 4 by 4 correlation matrix that depends on the parameter vector $\phi$.
The possibility that $\Lambda_i(\phi)$ is not the identity matrix makes this a general linear model. Note that while the variance parameter vector $\phi$ is common among subjects, the subscript $i$ on the matrices $\Lambda_i$ is included to allow for the situation where the observation times vary over subjects (more details to follow). Some possible structures for $\Lambda_i(\phi)$, each illustrated numerically in the sketch following this list, are:
• General correlation matrix:
$$[\Lambda_i(\phi)]_{hk} = [\Lambda_i(\phi)]_{kh} = \begin{cases} \phi_{hk} & \text{for } h \ne k \\ 1 & \text{for } h = k \end{cases}$$
where $[\Lambda_i(\phi)]_{hk}$ is the $h,k$ entry of $\Lambda_i(\phi)$.
• Independence:
$$\Lambda_i(\phi) = I_{4\times 4},$$
where $I_{4\times 4}$ is the $4 \times 4$ identity matrix.
• Equal correlation (compound symmetric, or exchangeable):
$$\Lambda_i(\phi) = \begin{pmatrix} 1 & \phi & \phi & \phi \\ \phi & 1 & \phi & \phi \\ \phi & \phi & 1 & \phi \\ \phi & \phi & \phi & 1 \end{pmatrix}.$$
• Toeplitz (for equally spaced time points):
$$\Lambda_i(\phi) = \begin{pmatrix} 1 & \phi_1 & \phi_2 & \phi_3 \\ \phi_1 & 1 & \phi_1 & \phi_2 \\ \phi_2 & \phi_1 & 1 & \phi_1 \\ \phi_3 & \phi_2 & \phi_1 & 1 \end{pmatrix}.$$
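Here is a small numerical sketch of these structures (numpy/scipy assumed; the value 0.5 and the Toeplitz bands are arbitrary illustrations, not estimates from any data):

```python
# Sketch: building the correlation structures listed above for n_i = 4.
import numpy as np
from scipy.linalg import toeplitz

phi = 0.5   # arbitrary illustration

# Equal correlation (compound symmetric / exchangeable)
exch = np.full((4, 4), phi) + (1 - phi) * np.eye(4)

# Toeplitz with bands (phi1, phi2, phi3) = (0.4, 0.3, 0.2)
toep = toeplitz([1.0, 0.4, 0.3, 0.2])

# AR(1), a common parsimonious special case: corr = phi ** |h - k|
idx = np.arange(4)
ar1 = phi ** np.abs(np.subtract.outer(idx, idx))

# "General" is any symmetric, positive definite matrix with unit diagonal.
```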
For the Ramus height data the coefficients from each fit are the intercept
and the slope. We have two numbers summarizing the data for each subject
(an intercept and a slope) rather than the original 4 (ramus height at each
of 4 time points).
Stage 2 Analyze the estimated coefficients. The most common second stage
analysis is to calculate means and standard errors of the parameters cal-
culated in the first stage. Other options include calculating the area under
the curve, and the value of the curve or its derivatives at prespecified times.
In this case we calculate the means:

    Intercept        age
    33.41667    1.706667

and use them to plot an estimated “typical” or “average” growth curve
(Figure 8.3). We can also calculate standard errors and test hypotheses of interest. The standard errors of the means of the coefficients are correct since they rely only on the independence of the estimates from different boys.

Figure 8.3 Ramus height of 3 boys. The dashed lines correspond to subject fits, the light solid lines connect each subject’s data points, and the heavy solid line corresponds to the mean estimated coefficients: 33.417 + 1.707 × Age.
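The two stages can be written in a few lines. The sketch below uses the Table 8.1 data and ordinary least squares within subject, so the means should match the values quoted above.

```python
# Sketch: two-stage analysis -- per-subject OLS fits, then means and
# standard errors of the fitted coefficients across subjects.
import numpy as np

data = {  # subject id -> (ages, ramus heights), from Table 8.1
    2:  (np.array([8.0, 8.5, 9.0, 9.5]), np.array([46.4, 47.3, 47.7, 48.4])),
    8:  (np.array([8.0, 8.5, 9.0, 9.5]), np.array([49.8, 50.0, 50.3, 52.7])),
    10: (np.array([8.0, 8.5, 9.0, 9.5]), np.array([45.0, 47.0, 47.3, 48.3])),
}

# Stage 1: intercept and slope for each subject
coefs = []
for t, y in data.values():
    X = np.column_stack([np.ones_like(t), t])   # intercept and age columns
    beta_i, *_ = np.linalg.lstsq(X, y, rcond=None)
    coefs.append(beta_i)
coefs = np.array(coefs)                         # one row per subject

# Stage 2: means and standard errors across subjects
mean = coefs.mean(axis=0)                       # approx. (33.417, 1.707)
se = coefs.std(axis=0, ddof=1) / np.sqrt(len(coefs))
print(mean, se)
```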
If the data are likely to be serially correlated (after all, we are assuming a general linear model), there are two options:
1. Assume a specific correlation structure (e.g., AR(1)) and find maximum likelihood estimates,¹ $\hat\beta_i$, $\hat\sigma_i$, and $\hat\phi_i$, for each subject. The log-likelihood for subject $i$ has the multivariate normal (MVN) form
$$-\log(|\sigma_i^2 \Lambda_i(\phi_i)|) - (y_i - X_i\beta_i)^T [\sigma_i^2 \Lambda_i(\phi_i)]^{-1} (y_i - X_i\beta_i).$$
(For a matrix $A$, $|A|$ is the determinant of $A$.) We have added a subscript $i$ to the variance parameters $\sigma$ and $\phi$ since model fitting is done separately for each subject. Because there is only one observation at each time for each subject, a parsimonious correlation structure must be used—the general correlation matrix has too many parameters to estimate in the first stage of two-stage estimation.

¹ See Appendix A.2 for a discussion of likelihood inference.
The maximum likelihood estimator for $\beta_i$ has a closed form (given $\hat\sigma_i$ and $\hat\phi_i$):
$$\hat\beta_i = \left( X_i^T (\hat\sigma_i^2 \Lambda_i(\hat\phi_i))^{-1} X_i \right)^{-1} X_i^T (\hat\sigma_i^2 \Lambda_i(\hat\phi_i))^{-1} y_i.$$
This is also the generalized least squares (GLS) estimate. Furthermore, conditional on $\hat\sigma_i$ and $\hat\phi_i$,
$$\operatorname{Var}(\hat\beta_i) = \left( X_i^T (\hat\sigma_i^2 \Lambda_i(\hat\phi_i))^{-1} X_i \right)^{-1}.$$
Case I: $\Lambda_i = I$
Since Equation 8.3 has the form of a standard linear regression, we can estimate the vector of $\beta_i$'s using OLS (ordinary least squares). With $X = \operatorname{diag}(X_1, X_2, X_3)$, the OLS calculation gives us
$$(X^T X)^{-1} = \begin{pmatrix} (X_1^T X_1)^{-1} & 0 & 0 \\ 0 & (X_2^T X_2)^{-1} & 0 \\ 0 & 0 & (X_3^T X_3)^{-1} \end{pmatrix},$$
so
$$\begin{pmatrix} \hat\beta_1 \\ \hat\beta_2 \\ \hat\beta_3 \end{pmatrix} = \begin{pmatrix} (X_1^T X_1)^{-1} & 0 & 0 \\ 0 & (X_2^T X_2)^{-1} & 0 \\ 0 & 0 & (X_3^T X_3)^{-1} \end{pmatrix} \begin{pmatrix} X_1^T & 0 & 0 \\ 0 & X_2^T & 0 \\ 0 & 0 & X_3^T \end{pmatrix} y = \begin{pmatrix} (X_1^T X_1)^{-1} X_1^T y_1 \\ (X_2^T X_2)^{-1} X_2^T y_2 \\ (X_3^T X_3)^{-1} X_3^T y_3 \end{pmatrix}.$$
So, under independence, the OLS estimates of the $\beta_i$'s are the same whether we do the estimation individually or as one large regression problem. The estimate for the only variance parameter is
$$\hat\sigma^2 = (y - \hat y)^T (y - \hat y)/(nM - pM),$$
where
$$\hat y = \begin{pmatrix} X_1 \hat\beta_1 \\ X_2 \hat\beta_2 \\ X_3 \hat\beta_3 \end{pmatrix}.$$
[Figure 8.4: ramus height (mm) vs. age for the complete data.]
We return to the subset of 3 subjects from the complete data to illustrate the
difference between the marginal and the conditional model. The parameter
estimates for the fixed effects, β̂, can be used to construct marginal fitted
values and marginal residuals. The top panel of Figure 8.5 shows the marginal
fitted values, X i β̂, as the heavy solid line and the residuals associated with
those fitted values for one subject as vertical dotted lines. These residuals
correspond to the marginal errors, y i − X i β, which have variance Σi (α).
They also have high serial correlation since (in this example) a subject’s data
vectors tend to be either entirely above the marginal fitted values or entirely
below.
The bottom panel of Figure 8.5 shows the conditional (subject-specific) fit,
X i β̂ + Z i b̂i , for the same subject as a heavy dashed line. The conditional
residuals, y i − (X i β̂ + Z i b̂i ), which, in the fit to the complete data were
assumed to be independent, are shown as vertical dotted lines. These residuals
do not show the very high positive serial correlation of the marginal residuals.
Figure 8.5 Marginal and conditional residuals. In both panels the heavy solid line
corresponds to the marginal or population average fitted values X i β̂ and the lighter
lines connect the data points for each of the three subjects. In the top panel the dotted
lines are the marginal residuals, y i − X i β̂. In the bottom panel one of the subject-
specific fitted curves is shown as a heavy dashed line and the dotted lines are the
corresponding conditional residuals, y i − (X i β̂ + Z i b̂i ).
8.4.5 Serial Conditional Correlation
When we looked at the residuals from the first stage of the two-stage estimation
of the ramus data, we found substantial correlations. Since they do not seem
to follow any simple structure such as AR(1), we might fit a general, subject-
specific correlation structure.
If we are only interested in the estimate of the mean parameters, there is
little difference between this model (general Λi ) and the first model, assuming
conditional independence. However, the random-effects variance matrices D̂
are very different. The fitted coefficients $\hat\beta + \hat b_i$ for the two models are shown in Figure 8.6. These plots give some evidence supporting the general conditional correlation model, since it is more plausible that the coefficients in (b) arise from a bivariate normal distribution than those in (a).
The standardized conditional residuals, y i − [X i β̂ + Z i b̂i ], divided by their
estimated standard errors, are shown for the two models in Figure 8.7. The
fitted curves, X i β̂+Z i b̂i , for the two models are shown in Figures 8.8 and 8.9.
Notice that the slopes are quite similar among subjects for the general con-
ditional correlation model. We might wish to refit this model with a random
effect solely for the intercept. This fit results in very little change in the log-
likelihood from the model with random slope. (See the first three lines in
Table 8.2 for statistics assessing the overall fit of the three models discussed
thus far.)
where ri (β) = y i − X i β. Table 8.2 shows statistics evaluating the fit of six
different models; the three random effects models discussed in Section 8.4
and three marginal models. The three marginal models differ in the choice of
covariance matrices. Model 4 uses a general covariance matrix with a constant
diagonal so that all observations have the same variance. Model 5 uses a
completely general covariance matrix, allowing for a different variance at each
time, while model 6 uses an AR(1) covariance structure.
We need criteria with which to choose from among these models. Since for
nested models (see Section 8.8.1) the log-likelihood increases with the number
Figure 8.6 Ramus height data, random effects fitted coefficients $\hat\beta + \hat b_i$; slope plotted against intercept for (a) the conditional independence model and (b) the general conditional correlation model.
[Figure 8.7: standardized conditional residuals plotted against fitted values (mm), one panel per model; the conditional independence panel is shown first.]
[Figure 8.8: ramus height (mm) data and fitted curves for the conditional independence model, one panel per subject.]
Figure 8.9 Ramus height data and fitted curves for general conditional correlation
model. Each panel contains data and fitted models for one subject. The solid line is
the marginal fitted curve X i β̂ and does not vary by subject. The dashed lines are
individual fitted curves X i β̂ + Z i b̂i .
total information content. Note that while larger values of the log-likelihood
are desirable, smaller values of AIC and BIC are desirable.
If we are interested only in the mean parameters we might pick marginal
AR(1) since it has small AIC and BIC. If we need estimates of the subject-
level parameters, a random effects model is required, and we would probably
choose the random effects model with general conditional correlation and only
the intercept random. The fitted fixed effects and standard errors do not differ
substantially among the models (Table 8.3).
The subscript i on Λi (φ) indicates that the matrices depend on common pa-
rameter vectors, φ, but are subject specific—usually based on the timing of
Table 8.3 Fixed effects estimates and standard errors for the models listed in Ta-
ble 8.2.
Model Intercept S.E.(Intercept) Slope S.E.(Slope)
then
$$t_{\text{complete}} = (1, 2, 3, 4, 5)^T$$
and
$$\Lambda_{\text{complete}}(\phi) = \begin{pmatrix} 1 & \phi_1 & \phi_2 & \phi_3 & \phi_4 \\ \phi_1 & 1 & \phi_5 & \phi_6 & \phi_7 \\ \phi_2 & \phi_5 & 1 & \phi_8 & \phi_9 \\ \phi_3 & \phi_6 & \phi_8 & 1 & \phi_{10} \\ \phi_4 & \phi_7 & \phi_9 & \phi_{10} & 1 \end{pmatrix},$$
and we would have
$$\Lambda_1(\phi) = \begin{pmatrix} 1 & \phi_1 & \phi_3 \\ \phi_1 & 1 & \phi_6 \\ \phi_3 & \phi_6 & 1 \end{pmatrix}, \qquad
\Lambda_2(\phi) = \begin{pmatrix} 1 & \phi_2 & \phi_3 & \phi_4 \\ \phi_2 & 1 & \phi_8 & \phi_9 \\ \phi_3 & \phi_8 & 1 & \phi_{10} \\ \phi_4 & \phi_9 & \phi_{10} & 1 \end{pmatrix}, \qquad
\Lambda_3(\phi) = \begin{pmatrix} 1 & \phi_5 & \phi_6 \\ \phi_5 & 1 & \phi_8 \\ \phi_6 & \phi_8 & 1 \end{pmatrix}.$$
Note that for this example, we would need more than these three subjects to estimate this correlation structure. A general correlation matrix can be estimated only when the vector $t_{\text{complete}}$ is not too large relative to the individual $t_i$ vectors.
where $\Sigma(\alpha) = \operatorname{Diag}(\Sigma_1(\alpha), \Sigma_2(\alpha), \ldots, \Sigma_M(\alpha))$ and $\hat\beta(\alpha)$ is the generalized least squares (GLS) estimate of $\beta$ given $\alpha$. In other words,
$$\hat\beta = (X^T [\Sigma(\alpha)]^{-1} X)^{-1} X^T [\Sigma(\alpha)]^{-1} y. \tag{8.6}$$
For further details see Diggle, Liang, and Zeger (1994) Chapter 4, Section 5
and McCulloch and Searle (2001), Chapter 1, Section 7 and Chapter 6, Section
6. We note the following regarding REML estimation:
• The restricted log-likelihood is solely a function of α and does not involve
β. Therefore the REML estimate of β is undefined. The GLS estimate of
β (equation 8.6) and the BLUP estimate of bi (equation 8.5), however, can
be evaluated at the REML estimates of α and should perform well.
• The maximum likelihood estimates of α are invariant to any one-to-one
transformation of the fixed effects. Since REML redefines the X i matrices,
however, the REML estimates are not.
• The REML estimates of the variance components are consistent whereas
the maximum likelihood estimates will be too small. The size of this bias
will depend on the particular data.
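As a concrete (hypothetical) illustration, a random intercept-and-slope model can be fit by REML with the MixedLM implementation in statsmodels, one common choice; the long-format DataFrame ramus, with columns subject, age, and height, is assumed here.

```python
# Sketch: REML fit of a random intercept-and-slope model with statsmodels;
# `ramus` is a hypothetical long-format DataFrame (subject, age, height).
import statsmodels.formula.api as smf

model = smf.mixedlm("height ~ age", data=ramus,
                    groups=ramus["subject"], re_formula="~age")
fit = model.fit(reml=True)   # reml=False gives ordinary maximum likelihood
print(fit.summary())         # fixed effects and variance components
```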
8.8 Testing
It is important in model building that we be able to assess whether particular model parameters, either fixed effects or variance components, are necessary or can be dropped. In many cases, testing for the necessity of variance components is essential for finding a model for which reliable estimates of the parameters of interest can be found.
Variance Components
Because variance components must be non-negative, 0 is on the edge of the
corresponding parameter space and the LR tests for the variance components
violate one of the required assumptions. In practice, however, the LR test
works reasonably well (though it tends to be conservative) and no good al-
ternatives exist so the LR test is still recommended for variance components.
See Pinheiro and Bates (2000) Chapter 2, Section 4 for a detailed discussion.
In summary, the LR test is conservative (overestimates p-values) for testing
variance components.
Fixed Effects
The LR test can be quite anti-conservative (p-values are too small) for testing
elements of $\beta$. Instead, we recommend the conditional $F$- and $t$-tests (conditional on $\hat\alpha$) (Pinheiro and Bates 2000).
If the null hypothesis is that $S\beta = m$, where $S$ is $r \times p$, and if we let $\Sigma = \sigma^2 V$, where $V$ is assumed known, and define
$$\hat\sigma^2 = \frac{y^T \left( V^{-1} - V^{-1} X (X^T V^{-1} X)^{-1} X^T V^{-1} \right) y}{N - p},$$
then
$$f = \frac{(S\hat\beta - m)^T \left[ S (X^T V^{-1} X)^{-1} S^T \right]^{-1} (S\hat\beta - m)}{r \hat\sigma^2}$$
has an $F_{r, N-p}$ distribution under the null hypothesis. To calculate the test statistic in practice we must use an estimate for $V$; we would typically use a maximum likelihood or restricted maximum likelihood estimator.
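A direct numpy transcription of this statistic follows (a sketch with $V$ treated as known; in practice the REML estimate of $V$ would be plugged in):

```python
# Sketch: the conditional F statistic for H0: S beta = m, V assumed known.
import numpy as np
from scipy import stats

def conditional_f(y, X, V, S, m):
    N, p = X.shape
    r = S.shape[0]
    Vi = np.linalg.inv(V)
    A = np.linalg.inv(X.T @ Vi @ X)              # (X' V^-1 X)^-1
    beta = A @ X.T @ Vi @ y                      # GLS estimate of beta
    quad = y @ (Vi - Vi @ X @ A @ X.T @ Vi) @ y  # residual quadratic form
    sigma2 = quad / (N - p)
    diff = S @ beta - m
    f = diff @ np.linalg.solve(S @ A @ S.T, diff) / (r * sigma2)
    return f, stats.f.sf(f, r, N - p)            # statistic and p-value
```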
Figure 8.10 Bone density measured at 10 or 11 times per subject. Observation days
are not identical for each subject.
Table 8.4 ANOVA table for bone density example. The numerator and denominator
degrees of freedom are the same for both models.
Table 8.6 Two-stage estimates of time-treatment interaction for bone density data.

             Equation (8.8)               Equation (8.9)
             Value       Standard Error   Value       Standard Error
trt:day      0.0000122   0.0000056        0.0000130   0.0000054
$$y_{ij} = \beta + b_i + e_{ij},$$
where
$$b_i \sim N(0, D_{1\times 1}), \qquad e_{ij} \sim N(0, \sigma^2).$$
If we have an additional level of nesting (say, subjects in a clinical trial are grouped into centers) we simply add an additional term to the model for center. In the following formulation, if we assume there are 15 subjects at each of 4 centers, the index $i$ runs from 1 to the total number of subjects, 60, rather than from 1 to 15. This method of numbering keeps the number of subscripts to a minimum. So we have, for subject $i$ at center $k$,
$$y_{ij} = \beta + b^{(1)}_k + b^{(2)}_i + e_{ij},$$
where the parameters and their distributions are described in Table 8.7.
While having the advantage of simplicity, this is not a very interesting
model for longitudinal data because there are no continuous predictors. We
can generalize the model by adding back in the design matrices. The standard
linear mixed effects model has the familiar form
$$y_i = X_i\beta + Z_i b_i + e_i.$$
Table 8.8 Definitions and distributions for fixed and random effects in a 2-level clustered model with more than one random effect at each level.

Level   Parameter                                      Meaning
0       $\beta$                                        fixed effects
1       $b^{(1)}_k \sim N(0, D^{(1)})$                 random effect vector for center $k$
2       $b^{(2)}_i \sim N(0, D^{(2)})$                 random effect vector for subject $i$ (in center $k$)
3       $e_i \sim N(0, \sigma^2 \Lambda_i(\phi))$      random effect (error) vector for subject $i$ (in center $k$)

would have
$$X_i = Z^{(1)}_i = Z^{(2)}_i = \begin{pmatrix} 1 & 1 \\ 1 & 2 \\ 1 & 3 \\ 1 & 4 \\ 1 & 5 \\ 1 & 6 \end{pmatrix}.$$
Under these assumptions, $\beta$ (in Equation 8.10) is the overall mean intercept and slope, $b^{(1)}_k$, $k = 1, \ldots, 10$, are random intercept and slope effects for center $k$, and $b^{(2)}_i$, $i = 1, \ldots, 60$, are random intercept and slope effects for subject $i$.
8.10.1 Examples
We motivate the GEE methods by first considering simple examples for non-
normal data without repeated measures.
Example 8.1. Suppose that $y_1, y_2, \ldots, y_n$ have binomial distributions with probabilities $p_i$ and sizes $m_i$. Then the log-likelihood has the form
$$\log L_p(Y) = \sum_{i=1}^n \left[ y_i \log\left( \frac{p_i}{1 - p_i} \right) + m_i \log(1 - p_i) + \log\binom{m_i}{y_i} \right]$$
$$= \sum_{i=1}^n \left[ y_i \theta_i - m_i \log(1 + e^{\theta_i}) \right] + C(Y), \tag{8.11}$$
where $\theta_i = \log(p_i/(1 - p_i))$.
where $\tilde X_i(\beta) = \partial \mu_i(\beta)/\partial \beta^T$. Once again, the variance of $\hat\beta_W$ calculated assuming that $W$ is true has the form $(\tilde X^T W^{-1} \tilde X)^{-1}$ and can be seriously biased if $W$ is incorrectly specified. Instead we compute the variance of $\hat\beta_W$ under the true model,
$$\operatorname{Var}(\hat\beta_W) = (\tilde X^T W^{-1} \tilde X)^{-1} \tilde X^T W^{-1} \Sigma W^{-1} \tilde X\, (\tilde X^T W^{-1} \tilde X)^{-1}.$$
Thall and Vail (1990) present data on epileptic seizures for 59 individuals. We will use these data to demonstrate the generalized estimating equation (GEE) approach. The data have the format shown in Table 8.9, where seizures is the number of seizures in the two weeks preceding the visit, tmt is the treatment, progabide (1) or placebo (0), base is the number of seizures in the 8 weeks before randomization to treatment group, age is age in years, id is a subject id, and visit is the visit number (1 to 4).
We fit a GEE model without a main effect for treatment and using a link and variance function derived from the Poisson distribution, that is,
$$E(y_i) = \exp(X_i\beta) = \mu_i(\beta)$$
and
$$\operatorname{Var}(y_{ij}) = E(y_{ij}).$$
We also used a conditional independence correlation matrix. The parameter
of interest in this model is the time-treatment interaction, visit:tmt.
The estimated coefficients and naive (based on independence correlation
structure) and robust standard errors are shown in Table 8.10. Treatment does
not appear to have a significant effect on the number of seizures. The estimate
of the time-treatment interaction coefficient is -0.0119 with a standard error of
0.0650 (a significant negative coefficient would suggest that treatment reduces
the seizure rate).
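A fit of this kind can be sketched with the GEE implementation in statsmodels (one of several available); the DataFrame epilepsy, in the Table 8.9 layout, is hypothetical, as is the exact model formula.

```python
# Sketch: Poisson-family GEE with independence working correlation and
# robust (sandwich) standard errors; `epilepsy` is a hypothetical DataFrame
# in the Table 8.9 layout (seizures, tmt, base, age, id, visit).
import statsmodels.api as sm
import statsmodels.formula.api as smf

model = smf.gee("seizures ~ base + age + visit + visit:tmt",  # no tmt main effect
                groups="id", data=epilepsy,
                family=sm.families.Poisson(),
                cov_struct=sm.cov_struct.Independence())
fit = model.fit()
print(fit.summary())   # robust standard errors are reported by default
```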
8.12 Summary
Longitudinal data analysis is an important tool for the clinical trial statisti-
cian. Nonetheless, it is important to keep in mind that clinical trial results
may be less compelling when based on complicated analyses requiring strong
assumptions. Since the primary outcome analysis is prespecified, it must be as
robust as possible to departures from assumptions. It is also important to be
aware of the effect that missing data might have on the analysis, and select an
analysis that is as robust as possible to the effect of missing data. In clinical trials, a two-stage analysis in which all subjects receive equal weight may often be preferred, both because of its simplicity and its robustness.
CHAPTER 9
Quality of Life
9.1 Defining QoL
Although there is no single definition of QoL, most definitions are rooted
in the World Health Organization’s 1948 definition of health as a “state of
complete physical, mental, and social well-being.”1 Thus, it is generally agreed
that QoL is multi-dimensional, encompassing an individual’s self-report of
functional status over a range of health-related domains that include social,
emotional, psychological, and physical well being (Speith and Harris 1996;
Quittner 1998). Another key feature of QoL outcomes is that they reflect
the individual’s subjective and objective evaluation of how an illness and its
therapy affect the individual functionally across these multiple domains.
9.3.2 Validity
Validity is a term that refers to the degree to which an instrument “does what
it is intended to do” (Carmines and Zeller 1979). A good instrument will
possess adequate construct, criterion, and content validity. Construct validity
refers to the relationship among items and between items and scales. Crite-
rion validity, in contrast, concerns whether the instrument (or scale) shows
an empirical relationship to one or more established measures of that domain.
Finally, content validity refers to the overall degree to which the items reflect
the domain(s) of interest. Comprehensive assessment of any given domain is
key to ensuring adequate content validity. Statistical methods used in establishing the validity of an instrument include, but are not limited to, item response theory, factor analysis, and multitrait-multimethod analysis (Fayers and Machin 2000).
9.3.3 Reliability
Reliability refers to the degree to which an assessment tool will yield consistent
results if given multiple times under similar conditions. Reliability can be
assessed empirically several ways, including test-retest, split-half, alternative
form, and Cronbach’s α coefficient (Carmines and Zeller 1979). The first three
describe a measure’s stability over time while the latter is used to describe
an instrument or scale’s internal reliability (or consistency). The formula for
computing Cronbach’s $\alpha$ coefficient is given by
$$\alpha = \frac{k}{k-1}\left( 1 - \frac{\sum_{i=1}^k s_i^2}{s_T^2} \right),$$
where $k$ is the number of items, $s_i^2$ the variance of the $i$th item, and $s_T^2$ the variance of the total score formed by summing all the items. If all items are identical, then all the $s_i^2$ will be equal and $s_T^2 = k^2 s_i^2$, so that $\alpha = 1$. If, on the other hand, the items are all independent, then $s_T^2 = \sum_{i=1}^k s_i^2$, so that $\alpha = 0$.
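The formula translates directly into code; here is a short sketch, with simulated items illustrating the two limiting cases just described.

```python
# Sketch: Cronbach's alpha for an items matrix (rows = subjects,
# columns = items), computed directly from the formula above.
import numpy as np

def cronbach_alpha(items: np.ndarray) -> float:
    k = items.shape[1]
    item_vars = items.var(axis=0, ddof=1)        # s_i^2
    total_var = items.sum(axis=1).var(ddof=1)    # s_T^2, variance of total score
    return k / (k - 1) * (1 - item_vars.sum() / total_var)

rng = np.random.default_rng(0)
x = rng.normal(size=(200, 5))
print(cronbach_alpha(np.column_stack([x[:, 0]] * 5)))  # identical items: 1.0
print(cronbach_alpha(x))                               # independent items: near 0
```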
Regardless of the method used, the higher the reliability, the less measure-
ment error there is, so a high degree of reliability is desired. A scale or whole
questionnaire may be revised based on the reliability data obtained during
the validation process.
Graphical Summaries
As an initial step, a simple exploratory analysis can provide important in-
sights. Graphical procedures are useful exploratory tools that can reveal trends,
outliers, and patterns of non-compliance (Billingham, Abrams, and Jones
1999). Graphical summaries of longitudinal QoL data include profile plots
of individual subjects and group profile plots. Profile plots of individual QoL
responses help to identify general trends over time and may provide infor-
mation about inter-individual variability. For example, Machin and Weeden
(1998) recommend plotting QoL scores from individual subjects versus the
time from randomization. Profile plots of individual subject’s QoL scores can
be displayed as a set of small graphs or overlaid in one graph. Figure 9.2 shows
the profile plots of the global QoL score of the MLHF questionnaire (see Sec-
tion 9.2) for a subgroup of subjects from the Vesnarinone Trial (VesT) (Cohn
et al. 1998). The global MLHF score is computed as a sum of all 21 individ-
276 QUALITY OF LIFE
ual item scores and ranges from 0 (best QoL) to 105 (worst QoL). The VesT
study was a double-blind, placebo controlled, randomized phase III trial that
assessed mortality, morbidity, and quality of life of subjects with severe heart
failure. A total number of 3833 subjects at 189 study centers were randomized
to receive either 30 mg daily dose of vesnarinone, 60 mg daily dose of vesnar-
inone, or placebo. The MLHF questionnaire was filled out by the subject in
the VesT study at baseline, and at weeks 8, 16, and 26.
Figure 9.2 Profile plots of individual subjects’ global MLHF scores from the VesT
trial. Lower scores indicate better quality of life.
Figure 9.2 suggests that there is considerable variability both between and
within subjects and that there is an overall decrease in the total QoL score
over time. Of course for large trials, examining individual profile plots of all
subjects becomes impractical. In these situations, group summary measures
should be used. For example, if the data are on a continuous scale, the means
with the corresponding confidence intervals can be plotted. If the data are
on a categorical scale, the proportions of responses that fall into particular
categories can be plotted over time. Profile plots of summary measures should
be interpreted with caution when informative drop-out occurs (Billingham,
Abrams, and Jones 1999). Figure 9.3 shows the profile plots of means of the
global MLHF scores of the VesT trial along with the corresponding 95% con-
fidence intervals for the three treatment groups. Ignoring the potential effects
of missing data, Figure 9.3 suggests that all three treatment groups saw an
improvement in QoL during follow-up. At the 8 and 16 week assessments, the
means of the QoL total scores for the 60 mg vesnarinone treated group were
significantly lower than those in the placebo group.
[Figure 9.3: mean MLHF total score vs. time (weeks) for Placebo (N=1283), 30 mg of Vesnarinone (N=1275), and 60 mg of Vesnarinone (N=1275).]
Figure 9.3 Profile plots of means of the global MLHF scores in the VesT trial along
with the corresponding 95% confidence intervals.
Summary Measures
Summary measures can reduce the repeated measures of an individual sub-
ject’s assessment into a single measure. Longitudinal data can be summarized
using a variety of standard summary measures that can be easily computed.
The disadvantage of summarizing repeated QoL assessments into a single mea-
sure is that information about patterns of change over time may be lost. Fur-
thermore, if subjects drop out, the analysis of these measures might lead to
biased results.
The most commonly used summary measure of longitudinal QoL outcomes is the area under the curve (AUC). Letting $T_{ij}$ denote the $j$th time ($j = 1, \ldots, K$) for subject $i$ and $x_{ij}$ the QoL score of the $i$th subject at the $j$th time, the AUC can be computed using the trapezoidal rule:
$$\mathrm{AUC}_i = \frac{1}{2} \sum_{j=2}^{K} \left( x_{ij} + x_{i(j-1)} \right) \left( T_{ij} - T_{i(j-1)} \right).$$
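A sketch of the computation for a single subject follows; the assessment times and scores are hypothetical illustrations on the MLHF scale.

```python
# Sketch: trapezoidal AUC for one subject's QoL series, per the formula above.
import numpy as np

def qol_auc(t: np.ndarray, x: np.ndarray) -> float:
    """Area under the QoL curve by the trapezoidal rule."""
    return 0.5 * np.sum((x[1:] + x[:-1]) * np.diff(t))

t = np.array([0, 8, 16, 26])      # assessment weeks
x = np.array([60, 55, 50, 48])    # hypothetical MLHF global scores
print(qol_auc(t, x))              # equals np.trapz(x, t)
```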
QoL scores missing at individual times can be imputed by interpolation. For example, if an intermediate observation is missing, the AUC could be computed using only the non-missing observations; this is equivalent to imputing the missing score by linear interpolation. If a subject drops out and
follow-up QoL assessments are missing, the corresponding missing QoL scores
could be imputed by extrapolating from the last observations. Alternatively,
the AUC can be considered as censored at the last available QoL assessment.
Censoring, however, can be informative and standard survival analysis tech-
niques can lead to biased estimation (Billingham, Abrams, and Jones 1999). A
bias adjustment procedure has been proposed by Korn (1993) using estimates
of the cumulative distribution function of an interpolated AUC. Other stan-
dard summary measures include the mean, median, or the lowest or highest
values reported. All these summary measures are straightforward to compute
and can be easily interpreted. The summary measure should be chosen to
most directly address the primary questions of the trial.
Table 9.2 Mean and standard deviation of the AUC of the MLHF global score in the VesT study.
Group Mean Standard Deviation
Placebo 1882.79 908.07
30 mg vesnarinone 1871.44 921.40
60 mg vesnarinone 1863.64 917.13
Table 9.2 shows the means and standard deviations for the AUCs of the
MLHF global scores for the three treatment arms of the VesT study (Cohn
et al. 1998). The AUCs were computed from baseline to the 26-week assessment. For subjects who died or underwent heart transplantation before the 26-week assessment, the worst possible MLHF score was imputed for the
missing assessments. The mean AUC in the placebo arm is 11.35 units higher
than the mean AUC in the 30 mg vesnarinone arm and 19.15 units higher
than the mean AUC in the 60 mg vesnarinone arm.
Table 9.3 T-statistic values for comparing "emotional well-being" and "physical well-being" MLHF scores between the placebo and 60 mg vesnarinone arms in the VesT study.
$$y_i = \begin{pmatrix} 0 \\ \mu_0 \end{pmatrix} + \begin{pmatrix} I_q \\ \Lambda_0 \end{pmatrix} \xi_i + \epsilon_i, \qquad (9.2)$$

$$y_i^{(g)} = \begin{pmatrix} 0 \\ \mu_0^{(g)} \end{pmatrix} + \begin{pmatrix} I_q \\ \Lambda_0^{(g)} \end{pmatrix} \xi_i^{(g)} + \epsilon_i^{(g)}. \qquad (9.3)$$
Since in clinical trials the same QoL instrument is administered to all study arms simultaneously, it is reasonable to assume that the measurement properties are invariant across study arms, i.e., $\mu_0^{(1)} = \cdots = \mu_0^{(G)}$ and $\Lambda_0^{(1)} = \cdots = \Lambda_0^{(G)}$. The distribution parameters of the latent variables remain unrestricted
so that differences in QoL between study arms can be directly assessed by
comparing latent variable means and covariances between arms. Since in val-
idated QoL instruments the subscale structure is known, certain elements of
Λ0 are fixed at zero. For example, the MLHF questionnaire consists of an
“emotional well-being” and a “physical well-being” component. The “physi-
cal well-being” component consists of eight items j = 1, · · · , 8, with (index,
j, in parentheses) “rest during day” (1), “walking and climbing stairs” (2),
“working around the house” (3), “going away from home” (4), “sleeping” (5),
“doing things with others” (6), “dyspnea” (7), and “fatigue” (8). The “emo-
tional well-being” component consists of five items j = 9, · · · , 13, with “feeling
burdensome” (9), “feeling a loss of self-control” (10), “worry” (11), “difficulty
concentrating and remembering” (12), and “feeling depressed” (13). There-
fore, the measurement model (9.2) for QoL data obtained from the MLHF
instrument can be written as
$$
\begin{pmatrix} y_{1i}^{(g)} \\ y_{2i}^{(g)} \\ \vdots \\ y_{8i}^{(g)} \\ y_{9i}^{(g)} \\ y_{10i}^{(g)} \\ \vdots \\ y_{13i}^{(g)} \end{pmatrix}
=
\begin{pmatrix} 0 \\ \mu_2 \\ \vdots \\ \mu_8 \\ 0 \\ \mu_{10} \\ \vdots \\ \mu_{13} \end{pmatrix}
+
\begin{pmatrix} 1 & 0 \\ \lambda_2 & 0 \\ \vdots & \vdots \\ \lambda_8 & 0 \\ 0 & 1 \\ 0 & \lambda_{10} \\ \vdots & \vdots \\ 0 & \lambda_{13} \end{pmatrix}
\begin{pmatrix} \xi_{1i}^{(g)} \\ \xi_{2i}^{(g)} \end{pmatrix}
+
\begin{pmatrix} \epsilon_{1i}^{(g)} \\ \epsilon_{2i}^{(g)} \\ \vdots \\ \epsilon_{8i}^{(g)} \\ \epsilon_{9i}^{(g)} \\ \epsilon_{10i}^{(g)} \\ \vdots \\ \epsilon_{13i}^{(g)} \end{pmatrix}. \qquad (9.4)
$$
The $\lambda$'s are unknown slope parameters; the latent variable $\xi_{1i}^{(g)}$ represents the "physical well-being" component and $\xi_{2i}^{(g)}$ the "emotional well-being" component of the $g$th study arm. Larger values of $\xi_{1i}^{(g)}$ and $\xi_{2i}^{(g)}$ indicate a worse health state. The slope parameters $\lambda_j$ represent
the direct effects of the latent variables (“physical well-being” or “emotional
well-being”) on the observed responses of the MLHF questions. Model (9.4)
is illustrated in the path diagram in Figure 9.4. Table 9.4 shows the maxi-
mum likelihood estimates of the latent variable means for the 8-week post-treatment assessment of the VesT study.
Table 9.4 Maximum likelihood estimates (standard errors) of the latent variable means for the 8-week post-treatment assessment of the VesT study.
Figure 9.4 Path diagram for the measurement model for the MLHF instrument. (The diagram shows observed items such as "sleeping," "dyspnea," and "fatigue" loading on the physical well-being factor, and "worry," "difficulty concentrating," and "feeling depressed" loading on the emotional well-being factor.)
A treatment may involve a trade-off, however, between clinical benefit measured by prolonged survival and side effects that might reduce the quality of life. Quality-adjusted survival analysis is a method for combining survival and quality of life into a single measure.
Q-TWiST, “Quality-adjusted Time Without Symptoms of disease and Toxi-
city of treatment”, was introduced by Goldhirsch et al. (1989) as a means to
combine quality of life information and survival analysis. The Q-TWiST anal-
ysis consists of three steps. In the first step, several clinical health states are
defined. In the original application for which Q-TWiST was developed, three
health states were defined (Goldhirsch et al. 1989): time with toxicity (TOX),
time without symptoms and toxicity (TWiST), and time after progression or
relapse (REL). In the second step, Kaplan-Meier curves for the clinical health
transition times are used to partition the area under the overall survival curves (see Zhao and Tsiatis (1997, 1999)). The average time spent in each health state is calculated separately for each treatment, as illustrated in Figure 9.5.
Figure 9.5 Partitioned survival plot: the area under the overall survival curve is divided into the TOX, TWiST, and REL health states (x-axis: time in months; y-axis: survival probability).
In the third step, utility weights between 0 and 1 are assigned to each health
state by considering how valuable a time period with toxicity or relapse is to
each subject. The treatment regimens are then compared using the weighted
sum of the mean durations of each clinical health status calculated in step 2.
For example, in the original application (Goldhirsch et al. 1989), Q-TWiST
was calculated as
$$\text{Q-TWiST} = u_T\,\text{TOX} + \text{TWiST} + u_R\,\text{REL},$$
where $u_T$ and $u_R$ are the utility weights for estimated time with toxicity and estimated time after progression or relapse, respectively. In most situations, the utility weights $u_T$ and $u_R$ are unknown. One approach in these situations
is to perform a threshold utility analysis as illustrated in Figure 9.6. In a
threshold utility analysis, the Q-TWiST treatment effect is evaluated for all
possible combinations of utility weight values.
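A minimal sketch of a threshold utility analysis follows, assuming hypothetical mean durations (in months) for the TOX, TWiST, and REL states in two arms; the grid search reports which treatment has the larger Q-TWiST at each pair of utility weights.

```python
import numpy as np

def q_twist(tox, twist, rel, u_t, u_r):
    """Quality-adjusted survival: u_T * TOX + TWiST + u_R * REL."""
    return u_t * tox + twist + u_r * rel

# Illustrative mean months in each state for two hypothetical treatments
A = dict(tox=4.0, twist=10.0, rel=6.0)
B = dict(tox=2.0, twist=11.0, rel=8.0)

# Threshold utility analysis: which treatment has the larger Q-TWiST
# at each point of a grid over the (u_t, u_r) unit square
for u_t in np.linspace(0.0, 1.0, 6):
    row = ["B" if q_twist(B["tox"], B["twist"], B["rel"], u_t, u_r)
                  > q_twist(A["tox"], A["twist"], A["rel"], u_t, u_r)
           else "A"
           for u_r in np.linspace(0.0, 1.0, 6)]
    print(f"u_t={u_t:.1f}: {' '.join(row)}")
```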
Figure 9.6 Threshold utility analysis: regions of the $(u_T, u_R)$ unit square in which each treatment yields the larger Q-TWiST (e.g., "Treatment B is better" in the upper region).
Some more recent work regarding the joint modeling of QoL scores and
survival is due to Wang, Douglas, and Anderson (2002) and Chi and Ibrahim
(2006, 2007) but is not presented here.
9.7 Summary
In this chapter we have presented some of the basic issues encountered when
using Quality of Life measures as either a primary or a secondary outcome
measure. In either case, the best choice of a QoL instrument is one that is
specific to the needs of the trial. That is, the instrument should be targeted
to a specific disease and its direct consequences, or targeted to specific QoL
dimensions. General, non-specific QoL instruments tend not to be sensitive
to the concerns most relevant in a particular trial and subject population.
Inappropriate instruments may not only be insensitive to the intervention effect but may be so irrelevant or awkward for subjects that they discourage continued participation in the trial.
In addition to selecting a targeted QoL instrument, the analysis for that in-
strument is generally more complex than for other outcome measures. Plans
for analysis must be laid out in advance, especially if this is the primary out-
come measure. Design aspects, including sample size, are more complicated than for continuous, binomial, longitudinal, or survival outcomes and may require simulation methods. Missing data in an instrument can be a common
problem that must be addressed in any analysis plan.
Nonetheless, QoL is an important measure of the effects of an intervention
in a subject population, and should be included as an outcome measure when
appropriate. With proper design and planning, many trials have successfully
utilized QoL instruments.
CHAPTER 10

Data Monitoring and Interim Analysis
The term interim analysis refers to statistical analysis conducted during the
course of a clinical trial for the purpose of monitoring efficacy and safety.
While interim analyses serve a variety of purposes, a primary goal is to determine whether the trial should stop before its planned termination because the superiority of the intervention under study has been clearly established, because the ultimate demonstration of a relevant intervention difference has become unlikely (futility), or because unacceptable adverse effects are apparent. Also, as a
result of interim analyses, the intervention may be modified or elements of
the experimental design, such as subject eligibility criteria, changed. Since all
interventions have the potential for causing harm, there is an ethical obli-
gation to the study participants that a trial not continue beyond the point
at which the potential risks to the study participants outweigh the potential
benefits. On the other hand, if one intervention is conclusively demonstrated
to be substantially superior to another intervention, there is an ethical obli-
gation not to continue to expose subjects to an inferior therapy. Finally, if it
is clear that the study is unlikely to provide definitive answers to the study
question, continued participation in the trial may expose subjects to potential
risk with minimal scientific justification. The statistical issues related to early
stopping because of unexpected adverse side effects are more complex and
sound decision making generally requires a combination of careful statistical
analysis and informed clinical judgment. It is difficult, if not impossible, to
prespecify stopping criteria that address all possible safety issues that might
arise. Consequently, formal stopping rules for safety monitoring are typically
specified for a small number of safety outcomes such as all-cause mortality or
adverse events of particular concern such as liver abnormalities that may be
suspected of being associated with the treatment.
These are complex issues and in order to ensure ethical conduct of clinical
trials, the National Institutes of Health (NIH), the Food and Drug Administra-
tion (FDA), and the International Conference on Harmonization of Technical
Requirements for Registration of Pharmaceuticals for Human Use (ICH) have
issued principles governing data and safety monitoring and guidelines for data
and safety monitoring procedures. We will review general issues in monitoring
of data and safety in Section 10.1. To illustrate the issues in data and safety
monitoring, two examples are introduced in Section 10.2 and will be used in
subsequent sections to illustrate various issues in interim analysis. A typical in-
terim analysis will use one of three general approaches: group sequential tests,
triangular tests, and stochastic curtailment tests. These three approaches will
be described in Sections 10.4, 10.5, and 10.6, respectively. Sequential methods
for interim analysis have implications for statistical inference, such as point estimates, confidence intervals, and the observed significance level. Methods for statistical
inference following sequential tests will be described in Section 10.7.
4 http://grants.nih.gov/grants/guide/notice-files/not99-107.html
5 http://grants.nih.gov/grants/guide/notice-files/NOT-OD-00-038.html
Because the consequences of these analyses affect the interpretation of the trial results,
interim analyses should be carefully planned in advance and described explic-
itly in the protocol. When an interim analysis is planned with the intention of
deciding whether or not to terminate a trial early, this is usually accomplished
by the use of a procedure from one of three general classes known as group
sequential procedures, triangular tests, and stochastic curtailment.
10.2 Examples
Two examples of randomized, controlled trials will be introduced here and will be
used throughout the chapter to illustrate different interim analysis procedures
and issues to consider in early termination of clinical trials.
Table 10.1 Probability of rejection of the null hypothesis as a function of the number of observations and nominal significance level. (See Armitage et al. (1969).)
The first sequential trials frequently enrolled pairs of subjects, one on each
treatment, and evaluated the effectiveness of treatment after each pair. For
example, in the 6-Mercaptopurine in Acute Leukemia trial (Freireich et al.
1963) described in Chapter 7 (Example 7.5), pairs of subjects were enrolled
and one given 6-MP and the other placebo. The preferred treatment from each
pair was the treatment with the longer remission time. The test statistic was
the difference between the number of preferences for 6-MP and the number of
preferences for placebo. The trial stopped after 21 pairs (18 preferring 6-MP
and 3 preferring placebo) when the difference became sufficiently large. Se-
quential designs based on pairing of individual subjects have not been widely
adopted in more recent years for a variety of reasons. First, subjects are typ-
ically paired solely because they are enrolled at the same time. Since paired
subjects may differ widely with respect to underlying risk or prognosis, there
is likely to be a loss of efficiency. Second, especially in large, multicenter trials, pairing is more difficult, and the continual reassessment of the status of enrolled subjects required to implement these kinds of procedures poses serious logistical difficulties. Alternative approaches are necessary to improve
efficiency and enable data monitoring to be conducted in a wide range of
situations.
Before describing the details of data monitoring procedures, we note that
rarely are the statistical tests used in clinical trials based simply on sums of
i.i.d. normal random variables, so interim monitoring procedures are
required that can accommodate a variety of statistical tests. In the general
setting we will consider, we have a sequence of analysis “times”, t1 , t2 , . . . , tK
and a sequence of test statistics S(t1 ), S(t2 ), . . . , S(tK ) so that for Sk defined
in the previous section, Sk = S(tk ). In the cases we have considered thus
far, the “times” represent the number of observations or pairs. While the
tk can be the actual calendar times at which the analyses are performed,
it is usually more mathematically convenient to consider “time” to be on
the information scale (see Appendix A.5 for a more detailed discussion of
information). In the simple case in the previous section, we could simply let
tk = k, the number of observations at the k th analysis. Alternatively, the tk
could represent the information fraction, relative to the complete information
at the planned conclusion of the trial, tk = k/K where K is the maximum
sample size (or information). In this case we have tk ∈ (0, 1].
The test statistics, S(tk ), can arise from a number of different settings,
but for the most part we will focus on those arising from score tests (see
Appendix A.3). For simplicity, first suppose that the data are iid observations
from a single parameter family of distributions, where µ is the parameter of
interest and the null hypothesis is H0 : µ = 0. Let the score function be Uk (µ)
at analysis $k$ and $S_k = -U_k(0)$. The Fisher information is
$$I(\mu) = -E[U_\mu(\mu)],$$
where the subscript $\mu$ indicates differentiation with respect to $\mu$. Note that by expanding $U(\mu)$ in a Taylor series about $\mu = 0$, and using the fact that $E[U(\mu)] = 0$ (see Appendix A.2), we have
$$E[S_k] = -E[U_k(0)] \approx I_k(0)\mu.$$
If $\mu$ is sufficiently small, then $I_k(\mu) \approx I_k(0)$, so we have that
$$S_k \overset{a}{\sim} N(\mu I_k, I_k), \qquad (10.2)$$
where we suppress dependence on $\mu$ and write $I_k$ in place of $I_k(0)$. The case
in which we have a multi-parameter family with density $f(x_1; \mu, \phi)$, where $\phi$ is a nuisance parameter, is more complicated, but the result in (10.2) still holds (see Whitehead (1997) or Whitehead (1978)).
These test statistics often have the form $S(t_k) = \sum_i (y_i(t_k) - E\{y_i(t_k)\})/\phi$, where the sum is over observations, $y_i(t_k)$, in, say, the experimental group available at time $t_k$, the expectation $E\{y_i(t_k)\}$ is computed under the null hypothesis of no difference between groups, and $\phi$ is a scale parameter (and may be known or estimated). Many commonly used tests can be written in this form (Scharfstein et al. 1997) and the methods described in this chapter will apply. Examples include:
• Normal observations: $S(t_k) = \sum_i (y_i - \hat\mu_k)/\hat\sigma^2$, where $\hat\mu_k = (\sum_j x_j + \sum_i y_i)/(n_x + n_y)$ is the overall mean of the observations at time $t_k$. The Fisher information is $I_k = n_x n_y/\{\sigma^2(n_x + n_y)\}$, which does not depend on $\mu$ and, assuming that the proportion $n_x/n_y$ remains constant, is proportional to the total number of observations at $t_k$. $S(t_k)/\sqrt{I_k}$ is the t-statistic.
• Binary observations: $S(t_k) = y_k - n_{yk}\hat p_k$, where $y_k$ is the number of events in the experimental group, $n_{yk}$ is the total number of observations in the experimental group, and $\hat p_k$ is the aggregate event rate over the two groups, ignoring treatment. $I_k$ is $\hat p_k(1 - \hat p_k)\,n_{xk} n_{yk}/(n_{xk} + n_{yk})$. $S(t_k)^2/I_k$ is the Pearson chi-square test statistic, and alternatively, $S(t_k)/\sqrt{I_k}$ has, approximately, a standard normal distribution.
• Failure time observations: $S(t_k) = \sum_j d_{jyk} - (d_{jxk} + d_{jyk})\,n_{jyk}/(n_{jxk} + n_{jyk})$ (equation (7.3) on page 217) is the log-rank statistic (Mantel 1966), where the sum is over distinct failure times, $d_{jxk}$ and $d_{jyk}$ are the numbers of failures, and $n_{jxk}$ and $n_{jyk}$ are the numbers of subjects at risk in the control and experimental groups, respectively, at the $j$th failure time. $I_k$ is given by $\sum_j d_{j\cdot k}(n_{j\cdot k} - d_{j\cdot k})\,n_{jxk} n_{jyk}/\{n_{j\cdot k}^2(n_{j\cdot k} - 1)\}$, where $d_{j\cdot k} = d_{jxk} + d_{jyk}$ and $n_{j\cdot k} = n_{jxk} + n_{jyk}$. (See Tsiatis (1982, 1981) for a proof that the log-rank statistic has the desired structure.)
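As a small illustration of the binary case above, the score statistic and information can be computed directly; the event counts used here are hypothetical.

```python
import numpy as np

def binary_score_and_info(y_events, n_y, x_events, n_x):
    """S(t_k) = y_k - n_yk * p_hat and I_k = p_hat(1-p_hat) n_xk n_yk / (n_xk + n_yk)
    for the two-sample binary comparison in the bullet above."""
    p_hat = (y_events + x_events) / (n_y + n_x)
    S = y_events - n_y * p_hat
    I = p_hat * (1.0 - p_hat) * n_x * n_y / (n_x + n_y)
    return S, I

# illustrative counts: 30/100 events on experimental, 44/100 on control
S, I = binary_score_and_info(30, 100, 44, 100)
print(S, I, S / np.sqrt(I))  # S/sqrt(I) is approximately standard normal
```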
Because the Wilcoxon rank-sum test can be considered the score test for a
proportional odds model (Section 2.3.1), it can also be placed into this frame-
work. In each case, we can define tk to satisfy either tk = Ik or tk = Ik /IK .
Lee and DeMets (1991) discuss monitoring in the longitudinal data setting for
normal data and Gange and DeMets (1996) in the longitudinal data setting
for non-normal data.
We note that each of the above test statistics can be shown to (asymptoti-
cally) satisfy 3 properties:
1. $S(t_1), S(t_2), \dots, S(t_K)$ have a multivariate normal distribution,
2. $E\{S_k\} = t_k\mu$, for some $\mu$, and
3. for $k_1 < k_2$, $\mathrm{Cov}(S(t_{k_1}), S(t_{k_2})) = I_{k_1}(0)$.
Because processes satisfying these three properties behave like sums of iid
random variables, Proschan et al. (2006) refer to a process satisfying these
properties as an S-process (this definition differs slightly from that of Lan and
Zucker (1993)). Specifically, if a process satisfies these properties, for tk1 < tk2 ,
E{S(tk2 ) − S(tk1 )} = µ(tk2 − tk1 ) and S(tk2 ) − S(tk1 ) is independent of S(t)
for t ≤ tk1 . The latter property is sometimes referred to as the independent
increments property.
Another process of interest is the so-called B-value (Lan and Wittes 1988),
$$B(t_k) = S_k/\sqrt{I_K}, \qquad (10.3)$$
where $t_k = I_k/I_K$. In this case, the process $B(t)$ satisfies
B1. $B(t_1), B(t_2), \dots, B(t_K)$ have a multivariate normal distribution,
B2. $E\{B(t_k)\} = t_k\theta$, for some $\theta$, sometimes referred to as the standardized treatment difference, and
B3. for $t_{k_1} < t_{k_2}$, $\mathrm{Cov}(B(t_{k_1}), B(t_{k_2})) = t_{k_1}$.
These properties imply that for $t_{k_1} < t_{k_2}$, $B(t_{k_1})$ and $B(t_{k_2}) - B(t_{k_1})$ are independent and $\mathrm{Var}(B(t_{k_2}) - B(t_{k_1})) = t_{k_2} - t_{k_1}$.
For simplicity, the discussion that follows will generally be framed in terms
of sums of i.i.d. normal random variables. While this makes the formulation
easier, keep in mind the results generally apply to any test statistic satisfying
properties 1, 2, and 3 above.
10.3.3 Repeated Significance Tests
Again letting $S_k = \sum_{i=1}^{k} X_i$, where the $X_i$ are iid $N(\mu, 1)$, a repeated significance test is a procedure in which observations accrue until, for some constant $c > 0$,
$$|S_k| > c\sqrt{k},$$
where $S_k$ is defined in the previous section, or until a maximum sample size, $K$, is reached. Define $k^*$ by
$$k^* = \inf\{k : 1 \le k \le K \text{ and } |S_k| > c\sqrt{k}\}$$
and reject $H_0$ if and only if $|S_{k^*}| > c\sqrt{k^*}$. The probability of rejecting $H_0$ is then
$$\alpha^* = \Pr(|S_k| > c\sqrt{k} \text{ for some } k \le K),$$
which may be controlled by choice of $c$.
Letting $c_k = c\sqrt{k}$, the probability distribution of $k^*$ can be found as follows (see Armitage et al. (1969)). First, note that $S_k$ is observed only if $k^* \ge k$ and, therefore, for $k > 1$, $S_k$ does not have a sampling distribution in the usual sense. We can, however, define a sub-density function $f_k(\cdot\,; \mu)$ with the property that
$$\int_{-\infty}^{\infty} f_k(u; \mu)\,du = \Pr\{k^* \ge k\}.$$
That is, the density function of $S_k$, given that $k^* \ge k$, is $f_k(s; \mu)/\Pr\{k^* \ge k\}$.
Because $S_1 = X_1$, we have
$$f_1(s; \mu) = \phi(s - \mu), \qquad (10.4)$$
where $\phi(\cdot)$ is the standard normal density function. If we write $S_k = S_{k-1} + X_k$, then, because $S_{k-1}$ and $X_k$ are independent, for $k > 1$, $f_k(\cdot\,; \mu)$ is the convolution of the (sub-)densities of $S_{k-1}$ and $X_k$,
$$f_k(s; \mu) = \int_{-c_{k-1}}^{c_{k-1}} f_{k-1}(u; \mu)\,\phi(s - u - \mu)\,du. \qquad (10.5)$$
The probability of rejecting $H_0$ at or before stage $k$ is then
$$P_k(\mu) = 1 - \int_{-c_k}^{c_k} f_k(u; \mu)\,du. \qquad (10.6)$$
(The integral on the right hand side of (10.6) is the probability of not rejecting $H_0$ through stage $k$.) The overall type I error rate is the probability of rejecting $H_0$ when it is true, $P_K(0)$, and the power for $H_1: \mu = \mu_1$ is $P_K(\mu_1)$. Iterative
numerical integration can be used to determine c and the maximum sample
size K so that the overall type I error is α and the power to detect a specified
alternative µ = µ1 is 1 − β. (See Reboussin et al. (2000) for a discussion of
computational issues.) Software for performing these calculations is available
at http://www.biostat.wisc.edu/landemets.
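A sketch of this iterative numerical integration, implementing equations (10.4)-(10.6) on a grid; with c = 2.41 and K = 5 it should reproduce an overall type I error near 0.05 (cf. the Pocock boundary discussed in Section 10.4).

```python
import numpy as np
from scipy.stats import norm

def rejection_prob(c, K, mu=0.0, n=2001):
    """P_K(mu): probability that the repeated significance test
    |S_k| > c*sqrt(k) rejects H0 by stage K, via iterated numerical
    convolution of the sub-densities in (10.4)-(10.5)."""
    s = np.linspace(-c, c, n)          # continuation region at stage 1
    f = norm.pdf(s - mu)               # f_1(s; mu)
    for k in range(2, K + 1):
        ck = c * np.sqrt(k)
        s_new = np.linspace(-ck, ck, n)
        # f_k(s) = integral over (-c_{k-1}, c_{k-1}) of f_{k-1}(u) phi(s-u-mu) du
        kernel = norm.pdf(s_new[:, None] - s[None, :] - mu)
        f = np.trapz(kernel * f[None, :], s, axis=1)
        s = s_new
    return 1.0 - np.trapz(f, s)        # 1 - Pr{k* > K}, as in (10.6)

# with c = 2.41 and K = 5 the overall type I error is close to 0.05
print(rejection_prob(2.41, 5))
```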
10.4 Group Sequential Tests
Repeated significance tests and other early sequential designs described earlier
are formulated assuming that accumulating data are assessed continuously. A
group sequential procedure (GSP) allows monitoring to occur periodically, af-
ter possibly unequal sized groups of observations have accrued. In this section
we describe the most commonly used group sequential approaches.
Consider a two-sample comparison based on the difference $\bar X_A - \bar X_B$, where $\bar X_A$ and $\bar X_B$ denote the sample means for groups A and B, respectively.
Now suppose that we assess the accumulating data after every $2n$ observations ($n$ from each treatment arm) up to a maximum of $K$ interim analyses, or looks, and a maximum of $2nK$ observations. Let $S_k = \sqrt{n/2\sigma^2}\,\sum_{j=1}^{k}(\bar X_{Aj} - \bar X_{Bj})$, where $\bar X_{Aj}$ and $\bar X_{Bj}$ are the means of the observations in the $j$th groups for treatments A and B, respectively. $S_k$ has mean $k\delta\sqrt{n/2\sigma^2} = k\delta^*$ and variance $k$. Using the repeated significance test of Section 10.3.3, we would stop and reject $H_0$ if $|S_k| \ge c\sqrt{k}$.
The constant c can be found using the formula (10.5). Using this procedure,
the nominal significance level required to reject H0 is the same at each k. In
general, however, we may want a procedure in which the nominal significance
level varies over the course of the trial. A more general group sequential proce-
dure can be defined by the number, K, of interim looks and the corresponding
critical values, ck , defined so that we would stop and reject H0 if |Sk | ≥ ck
for some k = 1, 2, . . . , K. We require that the overall type I error rate for the
procedure be controlled at level α, and that the procedure have power 1 − β
for a particular alternative, δ = δ1 .
We now describe two classical group sequential tests, one discussed by
Pocock (1977a)6 and the other by O’Brien and Fleming (1979). The Pocock
(P) group sequential test rejects the null hypothesis the first time that
$$|S_k| \ge c_k = c_P\sqrt{k},$$
6 Note that Pocock does not advocate the use of the Pocock boundary (Pocock and White
1999).
or equivalently
$$\frac{|S_k|}{\sqrt{k}} \ge c_P,$$
i.e., when the standardized partial sum of the accumulating data exceeds a constant critical value $c_P$. The O'Brien-Fleming (OBF) group sequential test rejects $H_0$ the first time that
$$|S_k| \ge c_k = c_{OBF}\sqrt{K},$$
i.e., when the partial sum of the accumulating data exceeds a fixed boundary that is a constant $c_{OBF}$ multiplied by $\sqrt{K}$. Equivalently,
$$\frac{|S_k|}{\sqrt{k}} \ge c_{OBF}\sqrt{K/k}.$$
When $\alpha = 0.05$ and $K = 5$, $c_P = 2.41$ for the Pocock group sequential test and $c_{OBF} = 2.04$ for the O'Brien-Fleming group sequential test (see Figure 10.1). Therefore, when $\alpha = 0.05$ and $K = 5$, the Pocock group sequential test rejects the null hypothesis the first time that
$$|Z_k| = \frac{|S_k|}{\sqrt{k}} \ge 2.41,$$
which corresponds to a constant nominal p-value of 0.0158 at each look. In comparison, the O'Brien-Fleming group sequential test rejects the null hypothesis the first time that
$$|Z_k| = \frac{|S_k|}{\sqrt{k}} \ge 2.04\sqrt{5/k}, \qquad (10.7)$$
which corresponds to a nominal p-value of 0.00000504 at the first look, 0.00125 at the second look, 0.00843 at the third look, 0.0225 at the fourth look, and 0.0413 at the last look. Note, however, that both group sequential tests yield an overall type I error probability of 0.05.
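The nominal p-values quoted above can be verified directly from the boundary definitions; a short sketch:

```python
import numpy as np
from scipy.stats import norm

K, cP, cOBF = 5, 2.41, 2.04
for k in range(1, K + 1):
    p_pocock = 2 * norm.sf(cP)                   # constant at every look
    p_obf = 2 * norm.sf(cOBF * np.sqrt(K / k))   # increases with each look
    print(f"look {k}: Pocock {p_pocock:.4f}, OBF {p_obf:.8f}")
```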
An alternative to these tests, proposed by Haybittle (1971) and Peto et al. (1976), uses a fixed critical value of 3.0 for interim analyses and a fixed sample critical value of 1.96 for the final analysis, for an overall significance level of approximately 0.05. The use of a large critical value for the interim analyses ensures that the resulting group sequential test will have an overall significance level close to, but not exactly, the desired level with no additional adjustment. Precise control of the overall type I error rate can be achieved by adjusting the final critical value once the trial is completed. For example, if 4 equally spaced interim analyses are conducted using a critical value of 3.0, a final critical value of 1.99 will ensure an overall type I error rate of 0.05.
With group sequential designs which allow early termination solely for sta-
tistical significance in the observed treatment differences, there is a reduction
in power relative to the fixed sample test of the same size. Maintenance of
power requires that the maximum sample size be increased by an amount
that depends on the choice of boundary. For given δ1 under the alternative
Figure 10.1 Pocock and O'Brien-Fleming group sequential boundaries ($\alpha = 0.05$, $K = 5$) plotted on the Z-value scale against information fraction.
hypothesis, one can compute the power of the group sequential test as
$$1 - \beta = 1 - \Pr(|S_1| < c_1, \dots, |S_K| < c_K;\ \delta_1^*),$$
where $\delta_1^* = \delta_1\sqrt{n/2\sigma^2}$.
Conversely, given the desired power $1 - \beta$ of the group sequential test, one can determine the value of $\delta_1^* = \tilde\delta$ that satisfies the above equation. The value of $\tilde\delta$ depends on $\alpha$, $K$, $1 - \beta$, and $c = (c_1, c_2, \dots, c_K)$, i.e., $\tilde\delta = \tilde\delta(\alpha, K, c, 1 - \beta)$. The required group size, $\tilde n$, satisfies
$$\tilde\delta = \delta_1\sqrt{\frac{\tilde n}{2}},$$
so one can solve for $\tilde n$ to determine the necessary group size, i.e.,
$$\tilde n = 2\left(\frac{\tilde\delta}{\delta_1}\right)^2.$$
Hence the maximum sample size for the group sequential design becomes
$$2\tilde n K = 4\left(\frac{\tilde\delta\sqrt{K}}{\delta_1}\right)^2.$$
Table 10.2 Inflation factors for Pocock (P) and O’Brien-Fleming (OBF) group se-
quential designs.
K
2 3 4 5
α 1−β P OBF P OBF P OBF P OBF
0.05 .80 1.11 1.01 1.17 1.02 1.20 1.02 1.23 1.03
.90 1.10 1.01 1.15 1.02 1.18 1.02 1.21 1.03
.95 1.09 1.01 1.14 1.02 1.17 1.02 1.19 1.02
0.01 .80 1.09 1.00 1.14 1.01 1.17 1.01 1.19 1.02
.90 1.08 1.00 1.12 1.01 1.15 1.01 1.17 1.01
.95 1.08 1.00 1.12 1.01 1.14 1.01 1.16 1.01
which will in turn determine the necessary group size $n$ on each intervention. The values of the group sequential boundaries, $c_1^L, \dots, c_K^L$ and $c_1^U, \dots, c_K^U$,
can be determined using a variation of equations (10.4) and (10.5) in which
asymmetric limits are used. Emerson and Fleming suggest that H0 and H1 be
treated symmetrically so that $c_k^U = c_k^L = c_k$.
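As an illustrative sketch combining the derivation above with Table 10.2: the maximum sample size is the fixed-sample size inflated by the tabled factor. The fixed-sample formula used here is the standard two-sample normal approximation (not given explicitly in this section), and the values of δ₁ and σ are hypothetical.

```python
from scipy.stats import norm

def max_sample_size(delta1, sigma, alpha, power, inflation):
    """Per-arm maximum sample size for a group sequential design: the
    fixed-sample size for a two-sided level-alpha test with the given
    power, multiplied by the inflation factor from Table 10.2."""
    z_a, z_b = norm.ppf(1 - alpha / 2), norm.ppf(power)
    n_fixed = 2 * (z_a + z_b) ** 2 * sigma ** 2 / delta1 ** 2
    return inflation * n_fixed

# Pocock, K = 5, alpha = 0.05, power = 0.90 (inflation 1.21 from Table 10.2);
# delta1 and sigma are illustrative
print(max_sample_size(delta1=0.5, sigma=1.0, alpha=0.05, power=0.90,
                      inflation=1.21))
```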
An example of Emerson and Fleming boundaries is shown in Figure 10.2. Here $K = 5$ and $c_k = 4.502/\sqrt{k}$, so the upper boundary is similar to the O'Brien-Fleming boundary in equation (10.7). We note that for this boundary, the final critical value is 2.01, which is smaller than the corresponding value from the O'Brien-Fleming boundary of 2.04. This happens because the
possibility of early stopping to accept H0 reduces the overall type I error
rate, effectively “buying back” type I error probability that can then be used
to lower the upper efficacy boundary. This may be considered an undesir-
able feature since, because the decision to stop a trial is complex, involving
both statistical and non-statistical considerations, a data monitoring commit-
tee may choose to allow a trial to continue even though the lower boundary
Figure 10.2 Emerson-Fleming group sequential boundaries ($K = 5$) plotted on the Z-value scale against information fraction, with a reference line at 1.96.
was crossed. If the upper and lower boundaries were constructed as one-sided
(α/2 level) tests without regard to the other boundary, the resulting proce-
dure would be conservative (after accounting for the probability of crossing
the lower boundary, the (one-sided) type I error rate would be less than α/2),
but would not suffer from these difficulties.
The group sequential method for early stopping for the null hypothesis can
be extended to two-sided tests, the so-called inner wedge tests, similar to what
one might obtain by reflecting the top half of Figure 10.2 about the x-axis.
The BHAT trial was one of the first large, randomized trials to employ a
group sequential monitoring plan. BHAT was designed to begin in June 1978
and last 4 years, ending in June 1982, with a mean follow-up of 3 years. An
O’Brien-Fleming monitoring boundary with K = 7 and α = .05, so that
cOBF = 2.063, was chosen with the expectation that analyses would be con-
ducted every six months after the first year, with approximately equal incre-
ments of information between analyses. In October 1981 the data showed a
9.5% mortality rate in the placebo group (183 deaths) and a 7% rate in the
propranolol group (135 deaths), corresponding to a log-rank Z-score of 2.82, well above the critical value for the sixth analysis of $2.063/\sqrt{6/7} = 2.23$. The
details of the statistical aspects of early stopping in this trial are given by
DeMets et al. (1984). The independent Policy and Data Monitoring Board
(PDMB) recommended that the BHAT be terminated in October 1981. The
interim analyses are summarized in Table 10.3.
The target number of deaths (total information) was not prespecified, so
for our purpose, we will assume that an additional 90 deaths, for a total of
408 deaths, would have occurred between the point that the trial was stopped
and the planned conclusion in June 1982.
Figure 10.3 Group sequential monitoring in BHAT. (Adapted from DeMets et al.
(1984).)
BHAT is typical in that, while the boundary was computed under the as-
sumption of equally spaced analyses, it is difficult in practice to enforce a
prespecified monitoring schedule. The boundary values that were used were
based on type I error probabilities computed using equations (10.4) and (10.5),
which assume equal increments of information. The actual overall type I er-
ror, assuming an additional final analysis with 408 total deaths and accounting
for the timing of the interim analyses, is 0.055, so, in this example, the un-
equal spacing results in slightly inflated type I error. We further note that the
accumulated boundary crossing probability at the time of the October 1981
analysis was 0.034. Therefore, in spite of the deviation of the actual timing
of the interim analyses from the planned schedule, the overall type I error
could still have been strictly controlled by adjusting the critical value at the
final analyses slightly upward. That is, while the planned final critical value
was 2.06327, by using a generalization of equation (10.5) that accommodates
unequal increments (equation (10.11) in the next section), we can compute
a new critical value of 2.06329, ensuring that the overall type I error will be
exactly 0.05. Clearly the difference between these critical values is negligible,
and in this case, the unequal spacing does not pose a serious problem.
In addition, note that the recalculated value does not depend on the ob-
served test statistics, but only on the timing of the interim analyses. This idea
can be applied, not only to the final analysis, but can, in principle, be applied
to each interim analysis in turn so that the cumulative type I error rate is
controlled at each stage, regardless of the actual timing of the analyses. This
reasoning led to the development of the alpha spending approach that is the
subject of the next section.
In the group sequential procedures discussed so far, interim looks were equally
spaced and the maximum number was prespecified. In practice, as illustrated
by the BHAT example, it is generally not feasible to know when and how
many times the study will be monitored. Lan and DeMets (1983) proposed
the use of a prespecified alpha spending function that dictates the rate at which
the total type I error probability accumulates as data accrue. Specifically, let
$$\alpha(t) = \Pr\{|S_k| > c_k \text{ for some } t_k \le t\}.$$
First, consider the group sequential procedures from Section 10.4.1. For
K = 5 and α = 0.05, the cumulative probabilities of rejecting H0 at or
before the kth analysis for k = 1, . . . , 5 are given in Table 10.4 and plotted in
Figure 10.4 for the Pocock and O’Brien-Fleming boundaries. (Note that as it is
defined, α(t) is constant between tk and tk+1 ; however, in Figure 10.4 we have
used linear interpolation to suggest that α(t) is a continuous function.) Note
that there is a one-to-one correspondence between α(tk ), k = 1, 2, . . . , K and
the critical values, ck . Therefore, rather than defining the stopping boundary
in terms of the critical values, one could choose an increasing function, α(t),
Figure 10.4 Cumulative probabilities of rejecting H0 for Pocock (solid line) and
O’Brien-Fleming (dotted line) tests with α = 0.05 and K = 5. (Note that linear
interpolation is used to suggest that the cumulative probability is a continuous func-
tion.)
Figure 10.5 Pocock and O'Brien-Fleming type alpha spending functions with $\alpha = 0.05$. (The plot also labels the spending functions $\alpha_3(t) = \alpha t$ and $\alpha_4(t) = \alpha t^{3/2}$.)
7 http://www.biostat.wisc.edu/landemets/
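The exact functional forms of these spending functions do not appear in the surviving text above; the sketch below uses the standard Lan-DeMets forms for the O'Brien-Fleming- and Pocock-type functions, together with the power functions α₃ and α₄ labeled in Figure 10.5.

```python
import numpy as np
from scipy.stats import norm

alpha = 0.05

def alpha1_obf(t):
    # O'Brien-Fleming-type spending: 2(1 - Phi(z_{alpha/2} / sqrt(t)))
    return 2.0 * norm.sf(norm.ppf(1.0 - alpha / 2.0) / np.sqrt(t))

def alpha2_pocock(t):
    # Pocock-type spending: alpha * ln(1 + (e - 1) t)
    return alpha * np.log(1.0 + (np.e - 1.0) * t)

alpha3 = lambda t: alpha * t            # linear spending
alpha4 = lambda t: alpha * t ** 1.5     # alpha * t^(3/2)

for t in [0.2, 0.4, 0.6, 0.8, 1.0]:
    print(f"t={t:.1f}: OBF {alpha1_obf(t):.5f}  Pocock {alpha2_pocock(t):.5f}  "
          f"{alpha3(t):.5f}  {alpha4(t):.5f}")
```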
10.5.2 General Form of the Triangular Test for Continuous Monitoring
In the general case, we will assume that at each interim analysis we observe
the pair$^8$ $(S, V)$, where $V = \mathrm{Var}[S] = I(\mu)$ and $S \sim N(\mu V, V)$.
Given the statistics S and V , the general form of the triangular test is
defined by a continuation region of the form
$$S \in (-a + 3cV,\ a + cV),$$
with the apex of the triangle at $V = a/c$ and $S = 2a$ (see Figure 10.6). If $\alpha/2$ is the probability of crossing the upper boundary when $H_0$ is true, and $1 - \beta$ is the probability of crossing the lower boundary when $H_1$ is true, in the special case where $\alpha/2 = \beta$, it can be shown that
$$a = -2(\log\alpha)/\mu_1 \quad\text{and}\quad c = \mu_1/4.$$
The case in which α/2 6= β is more complex, and we do not present the
computations here. See Whitehead (1997) for details.
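A minimal sketch of the boundary computation in the special case α/2 = β; the value of µ₁ is hypothetical.

```python
import numpy as np

def triangular_boundaries(alpha, mu1):
    """a and c for the symmetric triangular test (alpha/2 = beta):
    continue while -a + 3cV < S < a + cV; apex at (V, S) = (a/c, 2a)."""
    a = -2.0 * np.log(alpha) / mu1
    c = mu1 / 4.0
    return a, c

a, c = triangular_boundaries(alpha=0.05, mu1=0.5)   # mu1 is illustrative
print(a, c, a / c, 2 * a)                           # slopes and apex
```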
Figure 10.6 Stopping boundaries for continuous monitoring based on the triangular test; the continuation region lies between the lower and upper boundaries, which meet at the apex $(V, S) = (a/c, 2a)$. (Adapted from Whitehead (1997).)
8 Whitehead uses “Z” to denote the score statistic, −U (0). For consistency with notation
used elsewhere in this text, we use “S” to denote the score statistic used in the triangular
test.
10.5.3 Triangular Test with Discrete Monitoring
The type I and type II error rates for the triangular test as defined above
are computed under the assumption that data are monitored continuously.
In practice interim analyses take place at discrete times, so an adjustment
is required to maintain the desired error rates. When data are monitored
continuously, because $S$ is a continuous function of $V$, the point at which the upper boundary is crossed, $(V^*, S^*)$, satisfies $S^* = a + cV^*$. On the other hand, when data are monitored periodically, we have $S^* \ge a + cV^*$. The overshoot $R$ is the vertical distance between the final point of the sample path and the continuous boundary, defined as
$$R = S^* - (a + cV^*).$$
A reliable correction to the boundary can be made using the expected amount
of overshoot. Specifically, the continuous stopping criterion
S ≥ a + cV
is replaced by
S ≥ a + cV − A
where
A = E(R; µ),
the expected value of the overshoot at the time of boundary crossing, leading to what is sometimes referred to as the "Christmas tree adjustment" (see Figure 10.7). Whitehead (1997) claims that $A = 0.583\sqrt{V_i - V_{i-1}}$, where $V_i$ is the value of $V$ at analysis $i$ (see also Siegmund (1985)). Because a greater
information difference, Vi − Vi−1 , between consecutive assessments leads to
larger values of A, larger adjustments are required when monitoring is less
frequent.
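A sketch of a single discrete-monitoring look with the overshoot correction; it assumes the correction is applied symmetrically to both boundaries, which narrows the continuation region.

```python
import numpy as np

def triangular_decision(S, V, V_prev, alpha, mu1):
    """One discrete look of the symmetric triangular test with the
    'Christmas tree' correction A = 0.583*sqrt(V - V_prev). The correction
    pulls both boundaries inward, making stopping slightly easier than
    with the continuous-monitoring boundaries."""
    a, c = -2.0 * np.log(alpha) / mu1, mu1 / 4.0
    A = 0.583 * np.sqrt(V - V_prev)   # expected overshoot adjustment
    if S >= a + c * V - A:            # adjusted upper (efficacy) boundary
        return "stop: upper boundary (experimental treatment superior)"
    if S <= -a + 3.0 * c * V + A:     # adjusted lower boundary
        return "stop: lower boundary"
    return "continue"
```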
Figure 10.7 Triangular test boundaries with the "Christmas tree" adjustment for discrete monitoring.
Figure 10.8 Triangular test in MADIT. Once interim monitoring had begun, the monitoring frequency was high enough that the adjustment for discrete monitoring is quite small and does not meaningfully change the result. (Adapted from Moss et al. (1996).)
To make this concept concrete statistically, let S denote a test statistic that
measures the intervention difference and let the sample space, Ω, of S consist
of disjoint regions, an acceptance region, A, and a rejection region, R, such
that at the end of the study
Pr(S ∈ R|H0 ) = α and Pr(S ∈ A|H1 ) = β
where α and β are the type I and II error probabilities. Let t denote the time
of an interim analysis, and let D(t) denote the accumulated data up to time
t. A deterministic curtailment test rejects or accepts the null hypothesis H0 if
Pr(S ∈ R|D(t)) = 1 or Pr(S ∈ A|D(t)) = 1,
respectively, regardless of whether H0 or H1 is true. Note that this procedure
does not affect the type I and II error probabilities. For test statistics of the form $S_n = \sum_{i=1}^{n} x_i$, where the $x_1, x_2, \dots$ are i.i.d., deterministic curtailment
is possible only when the range of the xi is finite (see DeMets and Halperin
(1982)). In Example 10.1, had the analysis used a t-test for difference in mean
failure time, deterministic curtailment would not have been possible, because
the mean failure time for the control group could be arbitrarily large depend-
ing on the timing of the two remaining failures. The use of ranks constrains the
influence of the remaining observations.
The concept of stochastic curtailment and the consequence of its use were
first proposed by Halperin et al. (1982) and further investigated by Lan et al.
(1982). Consider a fixed sample size test of H0 : µ = 0 at a significance level α
with power 1−β to detect the intervention difference µ = µA . The conditional
probability of rejection of H0 , i.e., conditional power, at µ is defined as
PC (µ) = Pr(S ∈ R|D(t); µ).
For some $\gamma_0, \gamma_1 > 1/2$, using a stochastic curtailment test we reject the null hypothesis if
$$P_C(\mu_A) \approx 1 \quad\text{and}\quad P_C(0) > \gamma_0,$$
and accept the null hypothesis (reject the alternative hypothesis) if
$$P_C(0) \approx 0 \quad\text{and}\quad P_C(\mu_A) < 1 - \gamma_1.$$
Lan et al. (1982) established that the type I and II error probabilities are inflated but remain bounded from above by
$$\alpha' = \alpha/\gamma_0 \quad\text{and}\quad \beta' = \beta/\gamma_1.$$
Generally stochastic curtailment tests of this type are quite conservative, and if $\gamma_0 = \gamma_1 = 1$, they become deterministic curtailment.
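Conditional power follows directly from properties B1-B3 of the B-value process; a sketch, with hypothetical interim data (the information fraction t₀ = 0.6 matches Example 10.3, while the value of B(t₀) is assumed for illustration):

```python
import numpy as np
from scipy.stats import norm

def conditional_power(B_t, t, theta, z_crit=1.96):
    """P_C(theta) = Pr{B(1) >= z_crit | B(t)}, using properties B1-B3:
    the increment B(1) - B(t) ~ N(theta*(1 - t), 1 - t), independent of B(t)."""
    return norm.sf((z_crit - B_t - theta * (1.0 - t)) / np.sqrt(1.0 - t))

# evaluated under H0, under a design alternative, and at the current trend
B_t0, t0 = 0.2, 0.6   # illustrative interim data
for theta in [0.0, 3.24, B_t0 / t0]:
    print(f"theta = {theta:.2f}: P_C = {conditional_power(B_t0, t0, theta):.4f}")
```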
Figure 10.9 Conditional power computed using the B-value in Example 10.3. The dotted line represents the unconditional expected value of B(t) (prior to the collection of any study data). The dashed line represents the expected value of B(t) given B(t₀), for t₀ = 0.6.
More than one value of $\theta$ may be consistent with the observed data, and we may want to consider a range of values for $\theta$, creating the following table:

Table 10.5 Parameter for conditional power ($\theta = E[Z_N^*]$).

1. Survival: $\theta = \sqrt{D/4}\,\log(\lambda_C/\lambda_T)$, where $D$ = total events.
[Figure: Z-scale stopping boundaries based on conditional power for θ = 3.24 and θ = 1, at conditional power levels of 5% and 10%, as a function of t.]
slightly. For the boundaries based on θ = 3.24, however, the power loss is well
below 1%.
Note that a key assumption in the calculation of conditional power is as-
sumption B2 on page 297. If this condition fails, then E[B(t)] will not follow
a straight line as shown in Figure 10.9. This might happen, for example, in
a survival trial in which there is a delay in the effect of treatment. Analyses
conducted early in the trial, when all or most subjects have yet to receive
the benefit of treatment, may show little or no systematic difference between
groups. As more subjects have sufficient follow-up, the difference between
groups should emerge and strengthen throughout the remainder of the trial.
Thus, caution is advised in settings where such an effect is a possibility.
The predictive power is obtained by averaging the conditional power over the posterior distribution of $\theta$ given the interim data,
$$PP = \int P_C(\theta)\,\pi(\theta\,|\,B(t))\,d\theta,$$
where $B(t)$ denotes the accumulated data by time $t$ of the interim analysis, and $\pi(\theta\,|\,B(t)) = M(B(t))\,f(B(t); \theta)\,\pi(\theta)$, where $f(B(t); \theta)$ is the likelihood function for $B(t)$, $\pi(\theta)$ is a given prior distribution, and $M(B(t))$ is chosen so that $\pi(\theta\,|\,B(t))$ is a proper probability distribution. The primary difficulty with this approach is that the choice of $\pi(\theta)$ is somewhat subjective.
To illustrate, suppose that a priori $\theta$ is distributed $N(\eta, 1/\nu)$, so that $\eta$ is the prior mean and $\nu$ is the prior precision of $\theta$. Given $B(t_0)$ and using Bayes' theorem, the posterior distribution of $\theta$ is
$$\pi(\theta\,|\,B(t_0)) \sim N\!\left(\frac{B(t_0) + \eta\nu}{\nu + t_0},\ \frac{1}{\nu + t_0}\right).$$
Since $B(1) - B(t_0) \sim N(\theta(1 - t_0),\ 1 - t_0)$, integrating out the unknown $\theta$ we have that
$$\pi(B(1)\,|\,B(t_0)) \sim N\!\left(\frac{B(t_0)(\nu + 1) + \eta\nu(1 - t_0)}{\nu + t_0},\ \frac{(\nu + 1)(1 - t_0)}{\nu + t_0}\right),$$
so clearly the predictive power, $PP(\eta, \nu)$, depends on the prior parameters, $\eta$ and $\nu$.
We note two special cases. If $\nu \to \infty$, so that the prior precision is large, we have $\pi(B(1)\,|\,B(t_0)) \sim N(B(t_0) + \eta(1 - t_0),\ 1 - t_0)$ and predictive power is identical to conditional power when $\theta = \eta$. Conversely, if $\nu \to 0$, the prior variance of $\theta$ is large, so the prior distribution is diffuse and uninformative; we have $\pi(B(1)\,|\,B(t_0)) \sim N(B(t_0)/t_0,\ (1 - t_0)/t_0)$, so predictive power is similar to conditional power using the current estimate $\hat\theta$, except that the variance is larger by a factor of $1/t_0$, reflecting the uncertainty in $\hat\theta$.
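A sketch of predictive power under the normal prior, using the predictive distribution just derived; the interim values are the same hypothetical ones used in the conditional power sketch above.

```python
import numpy as np
from scipy.stats import norm

def predictive_power(B_t0, t0, eta, nu, z_crit=1.96):
    """Pr{B(1) >= z_crit | B(t0)} under the prior theta ~ N(eta, 1/nu),
    using the normal predictive distribution pi(B(1)|B(t0)) above;
    nu -> 0 gives the diffuse prior, nu -> infinity gives conditional
    power with theta = eta."""
    mean = (B_t0 * (nu + 1.0) + eta * nu * (1.0 - t0)) / (nu + t0)
    var = (nu + 1.0) * (1.0 - t0) / (nu + t0)
    return norm.sf((z_crit - mean) / np.sqrt(var))

# illustrative interim data: B(t0) = 0.2 at information fraction t0 = 0.6
print(predictive_power(0.2, 0.6, eta=0.0, nu=0.0))    # diffuse prior
print(predictive_power(0.2, 0.6, eta=3.24, nu=1.0))   # optimistic prior
```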
Example 10.4. Continuing Example 10.3, using the diffuse prior ($\nu = 0$), the predictive power is 2.1%, larger than the conditional power assuming $\theta = \hat\theta$ (0.5%), but still suggesting that a successful trial is unlikely given the current data. If we take $\eta = 3.24$ and $\nu = 1$, so that we have a strong belief that the treatment is beneficial but with uncertainty regarding the size of the effect, then $PP(3.24, 1) = 9.2\%$. In spite of our strong belief, the evidence against $\theta = 3.24$ is sufficiently strong that the predictive power is relatively small. □
Predictive power was part of the rationale for early termination of a trial
in traumatic hemorrhagic shock (Lewis et al. 2001). In this trial, a diffuse
(uninformative) prior distribution was used and at the time that the trial was
stopped, the predictive probability of showing a benefit was extremely small
(0.045%).
where
$$p^\star = \Pr(S_1 < c_1, \dots, S_{k^*-1} < c_{k^*-1},\ S_{k^*} \ge S^\star).$$
Clearly, if k ∗ = 1, pSW is identical to the corresponding fixed sample p-value.
Furthermore, the p-value will be smaller if the test terminates at interim
analysis k − 1 than if it terminates at interim analysis k regardless of the
value of S ∗ . The SW ordering is the only ordering for which more extreme
observations cannot occur at future analysis times. Thus the SW ordering
has the desirable property that the p-value can be computed solely from the
analyses conducted up to the point at which the trial is stopped without regard
for the timing or critical values for planned future analyses.
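Because the stagewise p-value involves only the analyses actually performed, it can be approximated by simulating the S-process under H₀; a Monte Carlo sketch (the boundary values and stopping data are hypothetical, and upper crossings are used, matching the one-sided p⋆ above):

```python
import numpy as np

rng = np.random.default_rng(1)

def stagewise_p(k_star, S_star, c, info, n_sim=200_000):
    """Upper one-sided stagewise-ordering p-value under H0, by simulation.
    c[k] are upper critical values on the S scale, info[k] = I_k; counts
    paths that cross an earlier upper boundary, plus paths that continue
    to stage k* with S_{k*} >= S*."""
    info = np.asarray(info, dtype=float)[:k_star]
    inc_sd = np.sqrt(np.diff(np.concatenate(([0.0], info))))
    S = rng.normal(0.0, inc_sd, size=(n_sim, k_star)).cumsum(axis=1)
    alive = np.ones(n_sim, dtype=bool)
    p = 0.0
    for k in range(k_star - 1):
        crossed = alive & (S[:, k] >= c[k])
        p += crossed.mean()
        alive &= ~crossed
    return p + (alive & (S[:, k_star - 1] >= S_star)).mean()

# hypothetical example: equal information increments, S-scale boundaries
# c_k = 2.41*sqrt(k), trial stopped at look 3 with S* = 4.6
print(stagewise_p(3, 4.6, c=[2.41 * np.sqrt(k) for k in [1, 2, 3]],
                  info=[1.0, 2.0, 3.0]))
```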
The MLE ordering orders observations based on the point estimate µ̂.
Unlike the SW ordering, a more extreme observation can occur at an analysis
time beyond the one at which the trial is stopped. Thus, to compute the p-
value exactly, one needs to know both the timing and critical values for all
planned future analyses. Since these analyses will never be conducted, when
flexible procedures are used, their timing—possibly even their number—can
only be hypothesized at study termination. The p-value is computed as
$$p_{MLE} = \Pr\{S_1 \ge I_1\hat\mu \vee c_1\} + \sum_{k=2}^{K} \Pr\{S_j < c_j,\ j = 1, \dots, k-1,\ S_k \ge I_k\hat\mu \vee c_k\},$$
where "∨" denotes the maximum of its arguments. Under this ordering, there
can exist realizations of the trial data for which the boundary is crossed at an
earlier time, but that are not considered “more extreme.” This phenomenon
is less likely to occur with the MLE ordering than with the LR and score
orderings because boundaries usually cannot be crossed at the earliest analysis times unless the point estimates, $\hat\mu$, are quite large.
The LR ordering is similar to the MLE ordering in that the p-value de-
pends on the timing and critical values at future analyses. The p-value is
computed in a similar fashion to that for the MLE ordering,
$$p_{LR} = \Pr\{S_1 \ge (I_1/I_{k^*})^{1/2} S^* \vee c_1\} + \sum_{k=2}^{K} \Pr\{S_j < c_j,\ j = 1, \dots, k-1,\ S_k \ge (I_k/I_{k^*})^{1/2} S^* \vee c_k\}.$$
In the same way that repeated significance testing increases the overall type I
error, confidence intervals derived without accounting for the sequential proce-
dure will not have the correct coverage probability. Confidence intervals with
the correct coverage probability can be derived using the orderings considered
in the previous section.
In the fixed sample case, confidence intervals can be constructed by inverting hypothesis tests. For example, suppose that $X_1, X_2, \dots, X_n \sim N(\mu, \sigma^2)$. We reject the null hypothesis $H_0: \mu = \mu_0$ if and only if
$$\frac{|\bar X - \mu_0|}{\sqrt{\hat\sigma^2/n}} \ge t_{1-\alpha/2,\,n-1},$$
where $t_{1-\alpha/2,\,n-1}$ is the $100 \times (1 - \alpha/2)$ percentage point of the t-distribution with $n - 1$ degrees of freedom. A $100 \times (1 - \alpha)\%$ confidence interval can be constructed by including all $\mu_0$ for which we do not reject $H_0$ at level $\alpha$. That is, the $100 \times (1 - \alpha)\%$ confidence interval for $\mu$ is the set
$$R = \left\{\mu_0 : \bar X - t_{1-\alpha/2,n-1}\sqrt{\hat\sigma^2/n} < \mu_0 < \bar X + t_{1-\alpha/2,n-1}\sqrt{\hat\sigma^2/n}\right\}$$
$$= \left\{\mu_0 : \frac{|\bar X - \mu_0|}{\sqrt{\hat\sigma^2/n}} < t_{1-\alpha/2,n-1}\right\} = \{\mu_0 : p(\mu_0) > \alpha\},$$
where p(µ0 ) is the (two-sided) p-value function for X̄ under H0 : µ = µ0 .
Combining the last equality above with the p-values defined in the previous
section provides a procedure for finding valid confidence intervals. For a given
ordering, $\mathcal{O}$, and observation, $(k^*, S^*)$, define the (upper) one-sided p-value $p_{\mathcal{O}}(\mu_0) = \Pr\{(k, S) \succeq (k^*, S^*)\,|\,\mu = \mu_0\}$. The $100 \times (1 - \alpha)\%$ confidence region, $R$, is the set of values of $\mu_0$ satisfying
$$\alpha/2 < p_{\mathcal{O}}(\mu_0) < 1 - \alpha/2.$$
Confidence intervals based on the SW ordering were proposed by Tsiatis et al.
(1984). Rosner and Tsiatis (1988) and Emerson and Fleming (1990) investi-
gated the MLE ordering. Chang and O’Brien (1986) consider the LR ordering
for the case of binary observations in phase II trials. Rosner and Tsiatis (1988)
compared properties of all four orderings. Kim and DeMets (1987a) consider
confidence intervals based on the SW ordering for a variety of error spending
functions.
For the SW and MLE orderings, the p-value function is a monotone function
of µ0 so R is guaranteed to be an interval. For the LR ordering, monotonicity
may not strictly hold and it is possible, although unlikely, for R to not be an
interval. For the score ordering, R is frequently not an interval.
Similar to the p-values in the previous section, for the SW ordering, R
does not depend on the future interim analyses. It is typically shifted towards
zero, relative to the fixed sample interval, but in some instances, it can fail to
contain the sample mean. Furthermore, if k ∗ = 1, R coincides with the fixed
sample confidence interval.
A desirable property of a method for generating a confidence interval is that
it should minimize the probability that the interval covers the wrong values
of µ; i.e., Pr(µ0 ∈ I(ST , T )) should be as small as possible when µ0 6= µ. Also
a confidence interval should be as short as possible. When µ is near zero,
both the LR and SW orderings above give similar results. Neither method
dominates uniformly over the parameter space. In general, however, the LR
does better than the MLE ordering. Emerson and Fleming (1990) suggest that
the MLE ordering is competitive when looking at the average length of the
confidence intervals.
Figure 10.11 Bias, $E[\hat\mu] - \mu$, of the estimate following a sequential test, as a function of $\mu$.
10.8 Discussion
Data monitoring is a difficult undertaking and not without controversy. We
close this chapter with a discussion of issues surrounding interim monitoring.
In this chapter we have focused primarily on efficacy monitoring. Because it
is difficult, if not impossible, to prespecify all possible safety issues that might
arise, safety monitoring is less amenable to a rigorous statistical approach. A
few authors have proposed formal monitoring plans explicitly accommodating
safety monitoring (Jennison and Turnbull 1993; Bolland and Whitehead 2000).
[Figure: O'Brien-Fleming, Emerson-Fleming, Pocock, and triangular stopping boundaries on the Z-value scale as a function of information fraction.]
If benefit is demonstrated with respect to mortality or serious morbidity such as heart attack or stroke, the trial should be stopped to allow
both control subjects and future patients access to the more effective treat-
ment as soon as possible. Conversely, if benefit is demonstrated with respect
to a nonfatal, non-serious outcome such as quality of life or exercise tolerance
(even though these outcomes may be invaluable to the individual subject),
then one might argue that there is no ethical imperative to stop, and that
the trial should continue to its planned conclusion in order to both strengthen
the efficacy result and maximize the amount of other information, especially
regarding safety, that is collected.
The former viewpoint suggests that boundaries that satisfy an optimality
criterion such as average sample number (ASN) should be preferred because
they allow stopping as soon as the result is clear, with the minimum expo-
sure of trial subjects to inferior treatments. Conversely, the latter viewpoint
suggests that boundaries should be as conservative as possible while allowing
study termination in cases where the evidence is sufficiently compelling. For
example, Pocock (2005) commends the Haybittle-Peto boundary which re-
quires a nominal p-value less than 0.001 before termination, and also suggests
that interim monitoring not begin too soon when little information regarding
both safety and efficacy is available.
For example, the primary outcome in the WIZARD trial (O’Connor et al.
2003; Cook et al. 2006) was a composite of all-cause mortality, recurrent my-
ocardial infarction (MI), revascularization procedure, or hospitalization for
angina. This outcome was chosen, in part, to increase study power while
maintaining an affordable sample size. Because benefit with respect to revas-
cularization procedures or hospitalizations for angina may not be sufficient
to compel study termination on ethical grounds, interim monitoring was con-
ducted using the composite of all-cause mortality and recurrent MI. (Chen,
DeMets, and Lan (2003) discuss the problem of monitoring using mortality for
interim analyses while using a composite outcome for the final analysis.) Fur-
thermore, even though the overall type I error rate of α = 0.05 was used, the
monitoring boundary used an O’Brien-Fleming type alpha spending function
based on an overall type I error rate of α = 0.01. The resulting critical values
at the planned information fractions of 1/3 and 2/3 were more conservative
than those for the Haybittle-Peto boundary (±4.46 and ±3.15, respectively; Cook et al. 2006). The final critical value for the primary composite outcome was virtually unaffected by the interim monitoring plan.
The ideal probably lies somewhere between these two extremes. For trials
involving mortality or serious morbidity, one should be less willing to allow a
trial to continue past the point at which the result has shown clear and con-
vincing evidence of benefit sufficient to change clinical practice. For diseases
with less serious outcomes, one may be more willing to impose more stringent
stopping criteria in order to gain a fuller understanding of the effect of the
treatment. Regardless, early stopping is a complex decision requiring careful
consideration of many factors, some of which are outlined in the next section.
10.8.4 Symmetry
We close this chapter with two additional comments, beginning with one regarding p-values. Historically, p-values have served two purposes: first
as summary statistics upon which hypothesis tests are based—H0 is rejected if
p is smaller than a prespecified value—and second, as measures of strength of
evidence. The latter use is primarily a result of historical practice rather than
being based on foundational principles. In the fixed sample setting where there
is a monotone relationship between p-values and other measures of evidence
such as likelihood ratios, it is somewhat natural for p-values to play both roles.
In the sequential setting, however, the p-value, either nominal (unadjusted) or
adjusted, is affected by the monitoring procedure, i.e., the sampling distribu-
tion of the nominal p-value is no longer uniform under H0 , and consequently,
the necessary correction, and therefore, the adjusted p-value, depends on the
monitoring procedure. Since there is no theoretical justification for linking
measures of evidence to the monitoring plan, the utility of the p-value as a
measure of strength of evidence is diminished. Therefore, it may be desirable
to introduce an alternative summary to serve as a measure of the strength of evidence. It may be that the likelihood ratio (or equivalently the standardized Z-score) is the best choice, although it
should be clear that it does not necessarily have its usual probabilistic inter-
pretation.
Second, we emphasize once again that traditional interim monitoring is
based on tests of hypotheses and monitoring procedures are constructed to
ensure the validity of these tests. Thus, rejection of H0 in favor of the exper-
imental treatment implies that either the treatment is beneficial, or a type I
error has been made, albeit with a low, controlled, probability. On the other
hand, because, as noted previously, the monitoring procedure affects the sam-
pling distributions of the observed summary statistics, formal inference beyond
the hypothesis test—construction of p-values, point estimates, and confidence
intervals—is difficult. Given that the monitoring procedure provides a valid
test of the hypothesis, and in light of the first comment in the previous paragraph, p-values may be entirely unnecessary, and perhaps undesirable, in trials employing such procedures. Furthermore, as noted in
Chapter 2, direct interpretation of an estimate of treatment effect is problem-
atic because it is not clear that the “true” value of the associated parameter
has meaning beyond an individual trial. This estimate is based on a study population that is neither a random nor a representative sample of the target population. It
is based on volunteers who met entry criteria and are treated under more
rigorously controlled conditions than is likely in common practice. Thus, the
estimate obtained, corrected or not, may not really reflect the population or
circumstances in which the intervention may be used. From this point of view,
one can argue that bias adjustment serves little purpose, especially since, as
shown by Figure 10.11, the bias introduced by sequential monitoring is not
likely to be large compared to the variability. This argument may apply equally
to the construction of confidence intervals. While the coverage probability of
the nominal confidence interval is not quite correct for the “true” parameter,
it is not likely to be misleading. Thus, our position is that there is probably
little to be gained from formal adjustments to point estimates and confidence
intervals and that the unadjusted quantities are generally adequate. This rec-
ommendation may be controversial among those who insist on mathematical
rigor; however, in light of the complexities of clinical trials, with the oppor-
tunity for bias to be introduced at multiple points throughout, it is not clear
that such mathematical refinements serve a real purpose.
10.9 Problems
10.1 For each method of analysis, create an example for which deterministic
curtailment could be used (if such an example exists). If deterministic
curtailment is not possible for the given method, explain.
(a) Single-sample binomial test of π = .5.
(b) Two-sample binomial test of π1 = π2 .
(c) Paired t-test of µ1 = µ2 on normal data.
(d) Two-sample t-test of µ1 = µ2 on normal data.
(e) Log-rank test on two groups of possibly censored survival times (no
ties).
10.2 Suppose you are testing
H0 : µ1 − µ2 = 0 vs.
H1 : µ1 − µ2 ≠ 0,
using two independent groups with normal outcomes and σ = 1. At
the end of the study, which will have 500 patients in each group, you
intend to use a standard two-sample, two-tailed test at level α = .05.
An interim analysis, at which you have only 200 patients in each group,
shows a z-value (standardized test statistic) of 2.6.
(a) Given the interim information, what is the probability of rejecting
H0 at the end of the trial, given H0 is true?
(b) Given the interim information, what is the probability of rejecting
H0 at the end of the trial, given H1 is true (µ1 − µ2 = 2)?
10.3 Take the group sequential boundary implemented by using the usual
critical value (e.g., 1.96 for a two-sided test at α = 0.05) at the trial’s
conclusion and a single conservative value (e.g., 3.00) for interim analy-
ses. Suppose we use this method (with the boundary values given above
in parentheses) for a trial comparing two proportions (i.e., a two-sample
binomial test). There are to be a total of 100 subjects in each arm if
the trial does not stop early.
(a) Suppose, at an interim analysis, 90 subjects have been observed in
each group. In the experimental treatment group we have observed 25
failures, while in the control group we have observed 44. Calculate the
z-statistic. Does its value indicate that the trial should be stopped?
(b) At the end of the trial 30 failures have been observed in the treatment
group and 44 failures have been observed in the control group. What
is the conclusion of the final test?
(c) What is interesting about this example?
(d) What are the advantages and disadvantages of this sequential method?
Would you recommend its use?
CHAPTER 11
Selected Issues in the Analysis
The previous chapters discussed methods in the design and conduct of clinical
trials that minimize bias in the comparison of the experimental and control
treatments. It is also possible, however, to introduce bias into both hypothesis
tests and estimates of treatment effects by flawed analysis of the data. This
chapter addresses issues regarding the choice of analysis population, missing
data, subgroup and interaction analyses, and approaches to the analysis of
multiple outcomes. Methods specific to interim analyses or intended for use
after early termination are discussed in Chapter 10 and will not be covered
here. Since the underlying goal in most of this material is the minimization of
bias in the analysis, we begin with a brief discussion of bias.
subject. Finally, there may be subjects who are unable to tolerate or otherwise
undergo their assigned treatment, and therefore for these subjects the effect
of treatment cannot be defined.
If we can assume that the “true effect of treatment” is well defined, there
are still multiple potential sources of bias in its estimation. One source of bias
discussed in this chapter is the non-adherence1 of subjects to their assigned
treatment. If the response to a drug is dose dependent and some proportion
of subjects in a trial fail to receive the entire dose of the treatment to which
they have been assigned, we might expect that the observed benefit will be less
than what would be observed had all subjects taken the full dose. In any case,
subjects who fail to receive any of their assigned treatment cannot receive
the benefit of the treatment and this will certainly attenuate the observed
treatment difference. Another source of bias, discussed in Chapter 10, is a
sequential monitoring procedure that allows a trial to be terminated prior
to its planned conclusion because of compelling evidence of benefit. These
two sources of bias tend to operate in opposing directions and it may not be
possible to know the true extent or even the direction of the overall bias.
If all randomized subjects in a clinical trial meet all of the inclusion criteria
and none of the exclusion criteria, remain on their assigned treatment, and
are otherwise fully compliant with all study procedures for the duration of the
study with no missing data, then the choice of the analysis population at the
conclusion of the study as well as their duration of follow-up is obvious. This
is rarely, if ever, the case.
The collection of subjects included in a specified analysis, as well as the
period of time during which subjects are observed, is referred to as an analysis
set. Different analysis sets may be defined for different purposes. There is
wide agreement that the most appropriate analysis set for the primary
efficacy analyses of any confirmatory (phase III) clinical trial is the intent-
to-treat population.2 A commonly used alternative is referred to as the “per
protocol” or “evaluable” population, defined as the subset of the subjects
who comply sufficiently with the protocol. Outcomes observed after treatment
discontinuation may also be excluded from the analysis of this population. As
we will see, use of the “per protocol” population can introduce bias of unknown
magnitude and direction and is strongly discouraged.
1 As we note in Section 4.3, we use the term non-adherence to refer to the degree to which
the assigned treatment regimen is followed, rather than non-compliance which can refer,
more generally, to any deviation from the study protocol.
2 http://www.fda.gov/cber/gdlns/ichclinical.txt
11.2.1 The Intent-To-Treat Principle
The Intent-to-Treat (ITT) principle may be the most fundamental principle
underlying the analysis of data from randomized controlled trials. The ITT
principle has two elements: all randomized subjects must be included in the
analysis according to their assigned treatment groups regardless of their ad-
herence to their assigned treatment, and all outcomes must be ascertained
and included regardless of their purported relationship to treatment. The lat-
ter element requires that follow-up continue for subjects even if they have
discontinued their assigned treatment.
ITT is foundational to the experimental nature of randomized controlled
trials. The distinction between RCTs and observational studies is that the
randomization ensures that there are no factors other than chance confounding
the relationship between treatment and outcome—any differences in outcomes
between treatment groups after randomization must be due to either chance or
the effect of treatment. Therefore, properly applied, the ITT principle provides
unbiased hypothesis tests (see Lachin (2000) for examples). This is in marked
contrast to observational studies which are subject to confounding by factors
related to the exposure of interest and the outcome (examples of which were
provided in Chapter 1).
Confounding similar to that found in observational studies may be intro-
duced in common alternatives to an ITT analysis, for example when subjects
or outcomes are excluded from the analysis because of nonadherence to the
assigned treatment (discussed in Section 11.2.2) or because of ineligibility (dis-
cussed in Section 11.2.3). When adherence is related to both the exposure of
interest (assigned treatment) and the outcome, the exclusion of subjects or
outcomes from the analysis based on nonadherence will result in confound-
ing and a biased test of the effect of treatment on outcome. This can easily
occur, for example, when subjects are more likely to discontinue from one
treatment arm because of progression of their disease, which also puts them
at higher risk for the outcome of interest. In general, any comparison of treat-
ment groups within a subgroup defined according to observations made after
randomization violates the ITT principle and is potentially biased (see also
Section 11.4.2). There are circumstances under which such analyses may be
useful, for example to help answer questions related to the mechanism of ac-
tion of the treatment, but the primary assessment of treatment effect should
be the ITT analysis.
Objections to ITT
The primary objection to the use of ITT is that subjects who discontinue
their study treatment can no longer receive the benefit from the treatment, or
that adverse experiences occurring after treatment discontinuation cannot be
attributed to the treatment, and, therefore, including these subjects or events
inappropriately biases the analysis against the treatment. There is merit to
this objection to the extent that the effect of treatment discontinuation is
to attenuate the mean difference between treatment groups. In the presence
of non-adherence, the ITT estimate of the treatment difference is likely to
be smaller than if full adherence were strictly enforced (assuming that full
adherence is possible).
We provide three rebuttals to this objection. First, the goal of ITT is to pro-
duce unbiased hypothesis tests, and not necessarily unbiased estimates of the
treatment effect. Common alternatives to ITT are subject to confounding and
result in biased hypothesis tests. Therefore, the overall conclusion regarding
whether the treatment is effective can be incorrect.
Second, one can argue that an ITT analysis assesses the overall clinical ef-
fectiveness most relevant to the real life use of the therapy. Once a treatment
is approved and in general use, it is likely that adherence to the treatment
will be no better, and is probably worse, than that in an RCT. Thus, the
treatment difference estimated using ITT better reflects the effect that would
be expected in actual use than the effect estimated under full adherence. The
former has been referred to, most commonly in the contraceptive literature,
as use effectiveness, and the latter as method effectiveness (Meier 1991). For ex-
ample, oral contraceptives have been shown to have extremely high method
effectiveness, i.e., when used strictly as prescribed. Their use effectiveness is of-
ten much lower in certain populations because of poor adherence. Alternative
methods with lower method effectiveness may actually more reliably prevent
pregnancy if there is greater use effectiveness. Other authors (Schwartz and
Lellouch 1967) distinguish between pragmatic and explanatory trials. Explana-
tory trials are intended to assess the underlying effect of a therapy, carried out
under “optimal” conditions, whereas pragmatic trials are intended to assess
the effectiveness to be expected in normal medical practice. Application of
ITT corresponds to the pragmatic approach.
Third, just as naive analyses that exclude subjects or events as a function of
subject adherence to assigned treatment produce biased hypothesis tests, they
also produce biased estimates of treatment benefit. Therefore, such analyses
cannot overcome this fundamental objection to the ITT analysis but can only
introduce a different kind of bias.
A second objection to ITT, raised by Rubin (1998), is that when there is
significant nonadherence, the ITT analysis can lack power and that power
can be increased by making use of information regarding adherence. This
argument also has merit, although it is not clear that meaningful
increases in power can be achieved except in extreme cases or under strong,
untestable assumptions. It is likely that cases with sufficiently extreme non-
adherence would lack credibility in the scientific community, regardless of the
analysis approach.
A third objection to ITT is often raised in the setting of non-inferiority or
equivalence trials (see Section 3.3). Because the effect of nonadherence is often
to dilute the observed (ITT) treatment difference and thus make treatments
look more similar, when the goal is to establish that the experimental treat-
ment is similar to the control, nonadherence may work to the advantage of the
experimental treatment. In the extreme case, if no subjects in either group
comply with their assigned treatment, there will be no difference between
groups and non-inferiority will be established. This leads to the notion that
“sloppy” trial conduct, in which little attempt is made to ensure adherence,
especially to the control therapy, may be beneficial to the sponsor by making
it easier to establish non-inferiority. Again, there is merit to the argument that
in this setting the ITT analysis is deficient. The primary difficulty, however,
lies less with the analysis than with the design and conduct of the trial. As
noted, naive analyses based on, for example, the “per protocol” population do
not correctly address the confounding introduced by non-adherence, and sim-
ply introduce additional biases of unknown magnitude and direction. There
is currently no practical alternative to ITT so it is imperative that all efforts
be made to ensure that the protocol is followed as closely as possible.
Additionally, the notion that in non-inferiority/equivalence trials the effect
of nonadherence is to attenuate differences between treatment groups is incor-
rect. A key difference between placebo-controlled trials and non-inferiority tri-
als (and active-controlled trials generally) is that the active control is presumed
to be effective and this fact alters the impact of non-adherence. We illustrate
with a simplified example. Suppose that treatment A is the experimental
treatment and treatment B is the active control. Assume that if untreated, all
subjects have (mean) response y = 0 and that if given treatment A, the mean
response is y = µ and if given treatment B, the mean response is y = µ + ∆.
Suppose further that nonadherence to assigned treatment is independent of
response,3 and that in group A, a fraction, qA , of subjects adhere to treat-
ment A, and in group B, a fraction, qB , adhere to treatment B. Nonadherers
receive no treatment. The ITT analysis then shows the mean treatment difference
(B − A) to be qB(µ + ∆) − qAµ = qB∆ + (qB − qA)µ. If A is
ineffective, µ = 0 and dilution of the true difference ∆ does in fact occur. On
the other hand, if, for example, A and B are equally effective (∆ = 0) then the
ITT difference will depend on the relative adherence between the two groups,
and could be either positive or negative. In practice, the situation is likely to
be far more complex, making an analytic approach extremely difficult if not
impossible.
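To make this algebra concrete, a minimal sketch (Python; the function name and the adherence rates are ours, chosen for illustration) evaluates the expected ITT difference qB(µ + ∆) − qAµ with ∆ = 0:

```python
# Expected ITT difference (B - A) under the simplified non-adherence model
# above: adherers to A respond mu, adherers to B respond mu + delta, and
# non-adherers respond 0, so E[ITT difference] = qB*(mu + delta) - qA*mu.

def itt_difference(mu, delta, qA, qB):
    """Expected ITT treatment difference (B - A)."""
    return qB * (mu + delta) - qA * mu

# With equally effective treatments (delta = 0), any adherence imbalance
# produces a spurious difference whose sign follows the imbalance.
for qA, qB in [(0.9, 0.9), (0.8, 0.95), (0.95, 0.8)]:
    print(qA, qB, itt_difference(mu=1.0, delta=0.0, qA=qA, qB=qB))
```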
Implementation of ITT
Strict adherence to the ITT principle has been formally espoused by regula-
tory authorities at the Food and Drug Administration (FDA), yet remains
incompletely applied in practice. The concept remains difficult for many
clinicians to understand and accept. It is also possible that
an inferior therapy could appear effective if it leads to rapid withdrawal and
subsequent treatment with a better therapy. Supplemental analyses of adher-
3 This assumption is almost certainly false. In trials where it can be assessed, poor adherers
typically also have poorer outcomes. See, for example, the CDP example on page 344.
ence to treatment and use of concomitant therapies are necessary to ensure
the validity and acceptance of the conclusions.
Of course, every subject is free to withdraw their consent to participate
in a clinical trial at any time, making subsequent follow-up difficult or even
impossible. Consideration should be given to including in the consent form a
provision allowing subjects who withdraw consent for study procedures to nev-
ertheless agree to having limited outcome information collected about them.
In double-blind studies, a variation of ITT, sometimes referred to as modified
ITT, is commonly used. Modified ITT excludes subjects who are randomized
but fail to receive study drug. The rationale for modified ITT is that, in a
double-blind trial, neither the subject nor the investigator knows the treatment
assignment, and since no drug is taken, there is no opportunity for the subject
to experience side-effects of the treatment. Therefore, exclusion of the subject
cannot be related to treatment assignment so no confounding is possible. If
in fact the blind is secure, then modified ITT should be considered a valid
alternative to strict ITT. If modified ITT is used, some rudimentary exam-
ination of the data is recommended, for example, ensuring that the number
of exclusions is comparable across treatments. An imbalance is evidence that
the blind may not be secure.
Examples
The Coronary Drug Project (CDP) (The Coronary Drug Project Research
Group 1980) was a randomized placebo-controlled trial conducted to evalu-
ate several cholesterol lowering drugs, including a drug called clofibrate, in
the long-term treatment of coronary heart disease. Overall, the five-year mor-
tality in 1103 men treated with clofibrate was 20.0%, nearly identical to
that of 2789 men given placebo (20.9%), p=0.55 (Table 11.1). Among sub-
jects randomized to clofibrate, the better adherers, defined as those who took
Table 11.1 5-year mortality in the Coronary Drug Project according to adherence.

                      Clofibrate           Placebo
                   N      % Mortality   N      % Mortality
All Randomized     1103   20.0          2789   20.9
< 80% Adherent     357    24.6          882    28.2
≥ 80% Adherent     708    15.0          1813   15.1
80% or more of the protocol prescription during the five-year period, had a
substantially lower five-year mortality than did poor adherers (15.0 vs. 24.6%,
p=0.0001). One might be tempted to take this as evidence that clofibrate is
effective. On the other hand, the mortality difference between adherers and
non-adherers is even greater among those subjects assigned to placebo, 15.1%
and 28.2%, respectively. Because this latter difference cannot be due to the
pharmacologic effect of the placebo, it is most likely due to differences in sub-
ject characteristics, such as disease severity, between good and poor adherers.
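The comparisons in Table 11.1 can be verified with a short sketch. The death counts below are back-calculated from the table's percentages, so the resulting p-values are approximate:

```python
from math import sqrt
from scipy.stats import norm

def two_prop_test(d1, n1, d2, n2):
    """Pooled two-sample z-test for a difference in proportions."""
    p1, p2 = d1 / n1, d2 / n2
    p = (d1 + d2) / (n1 + n2)                      # pooled failure probability
    z = (p1 - p2) / sqrt(p * (1 - p) * (1 / n1 + 1 / n2))
    return z, 2 * norm.sf(abs(z))                  # two-sided p-value

# Clofibrate vs. placebo, all randomized subjects: p is about 0.55
print(two_prop_test(round(0.200 * 1103), 1103, round(0.209 * 2789), 2789))
# Good vs. poor adherers within the clofibrate arm: p is about 0.0001
print(two_prop_test(round(0.150 * 708), 708, round(0.246 * 357), 357))
```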
A second example is from the VA Cooperative Study of Coronary Artery
Bypass Surgery.4 Conducted between 1972 and 1990, this randomized clinical
trial was designed to compare the survival time of subjects assigned optimal
medical therapy to those assigned coronary artery bypass surgery. A total
of 686 subjects were enrolled between 1972 and 1974—354 were assigned to
medical therapy and 332 were assigned to surgical therapy. The Kaplan-Meier
plots of survival by assigned treatment (ITT analysis), shown in Figure 11.1
(from Peduzzi et al. (1991)), suggest that there is no difference by assigned
treatment (p= 0.99). Of those assigned medical therapy, however, 147 even-
tually received coronary artery bypass surgery during the 14 year follow-up
period (after a mean waiting time of 4.8 years), while of those assigned surgery,
only 20 refused surgery. Given the high crossover rate from the medical group
to the surgical group, there was concern that the ITT analysis did not reflect
the true benefit of coronary artery bypass surgery and alternative analyses
were sought.
Peduzzi et al. (1991, 1993) considered several alternatives to ITT and we
comment on two of them here. The first alternative is a “Treatment Received”
analysis. In this analysis subjects are reassigned to treatment groups according
to the treatment eventually received—subjects who underwent bypass surgery
anytime prior to the end of the study are considered in the surgery group and
subjects who never underwent bypass surgery during the study are considered
medical subjects—regardless of their assigned treatment groups. The second
alternative is an “Adherers Only” analysis. This analysis considers subjects
according to their assigned treatment group, but excludes any subjects who
4 The Veterans Administration Coronary Artery Bypass Surgery Cooperative Study Group
(1984)
Figure 11.1 Cumulative survival rates for intention to treat analysis in the VA Co-
operative Study of Coronary Artery Bypass Surgery. (Adapted from Peduzzi et al.
(1991).)
crossed over. The results of these alternative analyses are qualitatively quite
similar (Figure 11.2), and both differ substantially from the ITT analysis.
These results seem to suggest that, in contrast to the ITT analysis, there
is a substantial survival benefit among surgery subjects relative to medical
subjects.
Given the apparent contradiction between these two analyses, which should
we believe? The first consideration is that the effect of nonadherence is to
dilute the effect of treatment so that the observed treatment difference is
smaller on average than the true difference. Unless nonadherence is extremely
large, however, a proportion of the true treatment effect should remain. In
this example, were the “Treatment Received” apparent treatment effect real,
we would expect that the residual effect in the ITT analysis would be much
larger than what we have seen. Second, subjects crossing over to surgery do not
do so immediately. In fact, only about 10% of medical subjects have crossed
over by two years. This would suggest that for the ITT analysis, the early
portion of the Kaplan-Meier curves, before a meaningful number of medical
subjects have crossed over, would be largely free of the influence of crossovers.
During this period, however, the Kaplan-Meier curves by assigned treatment
are virtually indistinguishable. Conversely, the corresponding curves for the
“Treatment Received” analysis diverge immediately. Peduzzi et al. (1993) also
conduct an analysis (not shown here) in which subjects begin in their assigned
groups, but medical subjects are switched into the surgery group at the time
Figure 11.2 Cumulative survival rates for “Treatment Received” and “Adherers
Only” analyses in the VA Cooperative Study of Coronary Artery Bypass Surgery.
(Adapted from Peduzzi et al. (1991).)
of surgery. This analysis yields results virtually identical to the ITT analysis
in Figure 11.1.
Finally, there is a simple model that easily explains most of what we have
observed in the VA study. The underlying idea is straightforward—in order for a med-
ical subject to receive surgery, they must survive long enough to become a
candidate for the procedure. Specifically, the crossover time must be before
both the death time and the end of follow-up. The model we construct is sim-
ilar to one constructed in Peduzzi et al. (1993). Under the null hypothesis and
using the ITT analysis, the combined 14 year cumulative survival is approxi-
mately 45%, so the hazard rate is approximately λM = −log(0.45)/14 = 0.057.
Similarly, the cumulative 14 year rate of crossover for medical subjects is
approximately 55% (“censoring” subjects at death, assuming independence
between death and crossover), so the hazard function for crossover is also ap-
proximately λC = −log(1 − 0.55)/14 = 0.057. We simulate death times with the
common hazard rate of λM , and independent crossover times for the medical
group with hazard rate λC . A medical subject is considered a crossover if the
crossover time precedes the death time and is before 14 years. For this sim-
ulated data, the Kaplan-Meier curves for the ITT analysis correctly show no
difference between groups. The “Treatment Received” and “Adherers Only”
analyses for the simulated data, in which the null hypothesis is true, yield
Kaplan-Meier curves similar to those shown in Figure 11.2. Thus, the naive
alternatives to ITT presented in this example almost certainly yield biased
and misleading results. The ITT analysis is the only credible analysis among
those considered.
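A minimal sketch of this simulation (arbitrary seed; crude 14-year mortality in the reclassified groups in place of full Kaplan-Meier curves) makes the selection effect visible:

```python
import numpy as np

rng = np.random.default_rng(1)                 # arbitrary seed
lam_M = lam_C = 0.057                          # death and crossover hazards
n_med, followup = 354, 14.0                    # medical arm size, years

death = rng.exponential(1 / lam_M, n_med)      # death times under H0
cross = rng.exponential(1 / lam_C, n_med)      # potential crossover times
crossed = cross < np.minimum(death, followup)  # must precede death and year 14
dead = death < followup

print("14-yr mortality, crossovers ('surgical'):    %.2f" % dead[crossed].mean())
print("14-yr mortality, non-crossovers ('medical'): %.2f" % dead[~crossed].mean())
# Although all death times share one hazard (H0 true by construction),
# reclassification by treatment received makes the 'surgical' group
# appear to survive far longer.
```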
11.2.3 Ineligibility
Another reason given for eliminating randomized subjects from the analysis
is that they are found to be ineligible after randomization (Gent and Sackett
1979; Fisher et al. 1990). In most cases, eligibility criteria can be verified
before the subject is randomized. In other cases, validation of entry criteria
may take a few hours or days. Consider, for example, a study of
a treatment for subjects experiencing an acute myocardial infarction (heart
attack or MI). The definition of an MI usually involves the assessment of
chest pain, levels of specific enzymes in the blood, and EKG measurements.
An accurate diagnosis may require several hours of observation and repeated
measurements. On the other hand, to be effective, it may be necessary to
start treatment as soon as possible following the MI. In other cases, errors in
judgment or in the recording of measurements may lead to the subject being
erroneously enrolled and treated. Thus, there may be some subjects who are
enrolled and treated, but later found not to have met the specific criteria for
inclusion. One possible effect of having technically ineligible subjects in the
trial, subjects for whom the treatment is not expected to be beneficial, is to
dilute the observed treatment effect and thereby reduce the power of the trial.
The temptation is to remove such subjects from the analysis.
Examples
A well known example involving subject eligibility is the Anturane Reinfarc-
tion Trial (ART).5 This trial was a well-conducted, multi-center randomized
double-blind placebo controlled trial of a drug, anturane, whose effect is to
inhibit clotting of blood. The hypothesis was that this drug would reduce the
incidence of all-cause death, sudden death, and cardiovascular death in sub-
jects with a recent myocardial infarction. The trial randomized and followed
1629 subjects but a re-evaluation of eligibility identified 71 subjects who did
not meet the entry criteria. The ineligible subjects were divided almost equally,
38 vs. 33, between the two treatment groups. The primary reasons for being
ineligible were that the heart attack did not occur within the window of time
specified in the inclusion/exclusion criteria, that the blood enzyme levels specific
to heart muscle damage were not sufficiently elevated suggesting a very mild
heart attack, and that there were other competing medical conditions.
The authors of the initial ART publication presented mortality data for only
the eligible subjects. The justification for excluding the ineligible subjects was
that the eligibility criteria were prespecified in the protocol and based on
data measured prior to randomization. Further review of the results by the
FDA revealed that the mortality results differed substantially depending on
whether the analysis included all randomized subjects or only those declared
to be eligible (Temple and Pledger 1980). Final data for the ART study are
shown in Table 11.2. The analysis of all randomized subjects with 74 deaths
Table 11.3 Mortality in the Beta-blocker Heart Attack Trial according to eligibility.

                          Mortality
                  N      Propranolol   Placebo
All Randomized    3837   7.2%          9.8%
Eligible          3496   7.3%          9.6%
Ineligible        341    6.7%          11.3%
7 For consistency with Rubin’s terminology, here we will use the term complier to refer to
subjects who adhere to assigned treatment.
We let µCT and µNT denote the mean responses for compliers and never-takers
respectively after receiving treatment T, noting that µNA is not defined. Then,
if the observed proportion of subjects who are compliers is q, we have

$$E[\bar{Y}_B] = q\mu_{CB} + (1-q)\mu_{NB}$$
$$E[\bar{Y}_A] = q\mu_{CA} + (1-q)\mu_{NB}$$

where ȲA and ȲB are the observed means among subjects assigned A and B
respectively. By replacing E[ȲT] with the observed ȲT, an unbiased estimate
of DC = µCA − µCB is

$$\hat{D}_C = \frac{\bar{Y}_A - \bar{Y}_B}{q}. \qquad (11.1)$$
The quantity DC is known as the complier average causal effect (CACE).
The estimate of the CACE in equation (11.1) is a simple moment estimator.
Other more sophisticated estimators are available if one makes distributional
assumptions regarding the three unknown distributions (Heitjan 1999; Rubin
1998).
It is important to note that the CACE estimate in (11.1) cannot “rescue” a
negative result. That is, since the estimate of the CACE is simply a rescaling
of the ITT treatment difference, a test of H0 : DC = 0 using the estimate
in (11.1) is nearly equivalent to the ITT test of H0 : D = 0. Since q in
equation (11.1) is estimated from the data, the test of H0 : DC = 0 would be
slightly less powerful when the variability in this estimate is accounted for.
Heitjan (1999) and Rubin (1998) illustrate how more powerful tests can be
constructed, although either extreme non-adherence or strong distributional
assumptions are required in order to achieve meaningful improvement.
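A sketch of the moment estimator (11.1) on simulated data; the complier fraction, effect size, and outcome distributions below are hypothetical, with compliers in arm A receiving a true effect of 1.0 and never-takers receiving the control response in both arms:

```python
import numpy as np

def cace_estimate(yA, yB, complier_A):
    """Moment estimator (11.1): the ITT difference in means divided by the
    observed complier fraction q."""
    q = np.mean(complier_A)
    return (np.mean(yA) - np.mean(yB)) / q

# Hypothetical trial: 60% compliers; compliers on A gain 1.0, never-takers
# receive the control response in both arms.
rng = np.random.default_rng(0)
n = 1000
complier = rng.random(n) < 0.6
yA = np.where(complier, rng.normal(1.0, 1.0, n), rng.normal(0.0, 1.0, n))
yB = rng.normal(0.0, 1.0, n)
print(cace_estimate(yA, yB, complier))  # close to the true complier effect 1.0
```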
11.3 Missing Data
Despite our best efforts, data are often missing for one or more of the out-
come variables. This may be because a subject did not return for a scheduled
follow-up visit, or that the subject was unable or unwilling to complete the
outcome assessment. Data may also be missing because a clinical site simply
failed to perform the required test or that the equipment necessary to make
the measurement was not available or not functioning properly. Whatever the
reason, missing data are problematic because of the potential for bias to be
introduced either by the exclusion of subjects with missing data, or by incor-
rect assumptions required for application of statistical methods that account
for missing data. For our purposes, we consider data to be missing if the value
that the observation takes is well defined, but is not observable for reasons
such as those suggested above. If an observation is unavailable because the sub-
ject has died, or if its value is nonsensical, such as the six-minute walk distance
for a subject who is a paraplegic or an amputee, we do not consider
this to be missing for the purpose of the discussion that follows. This kind of
“missingness” is more appropriately viewed in the setting of competing risks
and will be discussed in Section 11.3.5.
11.3.1 Terminology
Little and Rubin (2002) proposed terminology that is now commonly used for
classification of missing data according to the missingness mechanism. Missing
data are classified as:
1. Missing Completely at Random (MCAR) if the probability that an observa-
tion, Y , is missing does not depend on values of the missing or non-missing
observations,
2. Missing at Random (MAR) if the probability that Y is missing depends on
non-missing observations but not on the values of the missing observations,
3. Missing Not at Random (MNAR) if the probability that Y is missing de-
pends on the value of Y .
For data that are MCAR or MAR, unbiased analyses can be performed—
the effect of missing data will be solely to decrease power. For data that are
MNAR, unless the missing data mechanism is correctly modeled, unbiased
analyses cannot, in general, be performed. When the missing data are MCAR
or MAR, we also say that the missing-data mechanism is ignorable.
Unfortunately, as with nonadherence, the missing data mechanism is not
identifiable from observed data—that is, there are many potential missing
data models that produce observed data with a given distribution, so there
is no information in the data from which the missing data mechanism can be
ascertained (“you don’t know what you don’t know”). At best, one or more
analyses can be performed using different assumptions regarding the missing-
ness mechanism. Regrettably, often the only analysis performed is one that
assumes that the missingness is ignorable, without assessment of the sen-
sitivity of the conclusion to deviations from this assumption. By considering
a range of potential associations between missingness and response, one can
assess the degree to which the conclusion can be influenced by the missingness
mechanism. This approach is generally referred to as sensitivity analysis. If the
conclusion is largely unchanged for plausible alternatives to MAR, the result
may be considered robust. Otherwise, the conclusion should be interpreted
cautiously and may be misleading. In the following section, we illustrate the
use of sensitivity analyses with a simple example.
11.3.2 Sensitivity Analysis
Suppose that we have randomized N subjects to two treatment groups, a
control group (j = 1) and an experimental group (j = 2), with nj subjects
in group j. Let pj be the true probability of failure, say death, in group j,
and yj be the observed number of deaths in group j. We wish to compare the
incidence of failure between the two groups. Suppose that we have vital status
for mj subjects in each group, so that vital status is missing for nj − mj
subjects in each group.
A simple, hypothetical example is provided in Table 11.4. In this case, the
trial has 170 events observed in the treatment arm and 220 events in the
control arm, a trend that favors treatment. There are also 30 subjects with
missing data on the treatment arm and 10 on the control arm.
The naive analysis, assuming that missing vital status is ignorable, might
use the Pearson chi-square test for the 2 × 2 table and only consider the
subjects with non-missing outcomes. Two additional naive analyses are best-
case and worst-case analyses. For the best-case analysis we perform a Pearson
chi-square test for the 2 × 2 table obtained by assuming that all missing
subjects in the control group are dead, and that all the missing subjects in
the experimental group are alive. The worst-case analysis is performed by
making the opposite assumption. If the results are consistent for the three
analyses, the result is clear. Unless the extent of the missing data is small or
the observed difference is extremely large, it is unlikely that the results will
be consistent.
If we consider only the observed cases, we have that Z = −2.75 (p=0.0059).
The “best-case” is the one in which none of the 30 missing subjects on treat-
ment died whereas all of the 10 missing control subjects died, so Z = −3.87. In the
“worst-case” all of the 30 missing treatment arm subjects died and none of
the 10 missing controls died, and Z = −1.28. In the “worst-case”, the result would
no longer reach statistical significance. On the other hand, the “worst-case”
is probably not plausible, if only because it is unlikely that the missing data
mechanism could differ so dramatically between the two treatment groups.
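These three analyses are easy to reproduce. In the sketch below, the arm sizes of 500 per group are an assumption on our part (Table 11.4 gives the full layout), chosen because they are consistent with the Z values quoted above:

```python
from math import sqrt

def z_two_prop(d1, n1, d2, n2):
    """Pooled z statistic comparing two proportions (treatment minus control)."""
    p = (d1 + d2) / (n1 + n2)
    return (d1 / n1 - d2 / n2) / sqrt(p * (1 - p) * (1 / n1 + 1 / n2))

nT = nC = 500   # assumed arm sizes; they reproduce the quoted Z values
# Observed cases only: 170 deaths among 470 observed vs. 220 among 490
print("observed:   %.2f" % z_two_prop(170, nT - 30, 220, nC - 10))   # -2.75
# Best case: missing controls all die, missing treatment subjects all live
print("best case:  %.2f" % z_two_prop(170, nT, 220 + 10, nC))        # -3.87
# Worst case: missing treatment subjects all die, missing controls all live
print("worst case: %.2f" % z_two_prop(170 + 30, nT, 220, nC))        # -1.28
```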
A more useful assessment examines robustness for a set of plausible devia-
tions from ignorability. To do this, we first propose a model for missingness as
a function of the unobserved outcome (vital status). In this simple example,
let πjy be the probability that a subject in group j with outcome y = 0, 1
is missing. Given the simplicity of the data, this is the most general miss-
ingness model that we can propose. The observed data will depend on three
probabilities for each treatment group:
$$\Pr\{\text{observed to die} \mid \text{group} = j\} = p_j(1-\pi_{j1})$$
$$\Pr\{\text{observed to survive} \mid \text{group} = j\} = (1-p_j)(1-\pi_{j0})$$
$$\Pr\{\text{missing} \mid \text{group} = j\} = p_j\pi_{j1} + (1-p_j)\pi_{j0}$$
Now, we re-parameterize the model as a logistic model, so let

$$\log\frac{p_j}{1-p_j} = \alpha + \beta z_j$$

where z1 = 0 and z2 = 1. Similarly, let πj1 = e^{γj} πj0. The likelihood for the
observed data becomes

$$L(\alpha,\beta) = \prod_{j=1}^{2}\left[\frac{\exp(\alpha+\beta z_j)}{1+\exp(\alpha+\beta z_j)}(1-e^{\gamma_j}\pi_{j0})\right]^{y_j}\left[\frac{1}{1+\exp(\alpha+\beta z_j)}(1-\pi_{j0})\right]^{m_j-y_j}\left[\frac{\exp(\alpha+\beta z_j)}{1+\exp(\alpha+\beta z_j)}e^{\gamma_j}\pi_{j0}+\frac{1}{1+\exp(\alpha+\beta z_j)}\pi_{j0}\right]^{n_j-m_j}$$
$$= \prod_{j=1}^{2} e^{(\alpha+\beta z_j)y_j}\left(1+e^{\alpha+\beta z_j}\right)^{-n_j}\left(1+e^{\alpha+\beta z_j+\gamma_j}\right)^{n_j-m_j}H(\gamma_j,\pi_{j0}).$$

The score functions are

$$U_\alpha(\alpha,\beta) = \frac{\partial\log L(\alpha,\beta)}{\partial\alpha} = \sum_{j=1}^{2}\left[y_j - n_j\frac{e^{\alpha+\beta z_j}}{1+e^{\alpha+\beta z_j}} + (n_j-m_j)\frac{e^{\alpha+\beta z_j+\gamma_j}}{1+e^{\alpha+\beta z_j+\gamma_j}}\right] \qquad (11.2)$$

and

$$U_\beta(\alpha,\beta) = \frac{\partial\log L(\alpha,\beta)}{\partial\beta} = y_2 - n_2\frac{e^{\alpha+\beta}}{1+e^{\alpha+\beta}} + (n_2-m_2)\frac{e^{\alpha+\beta+\gamma_2}}{1+e^{\alpha+\beta+\gamma_2}}. \qquad (11.3)$$
We note several things.
• The score functions, and hence inference regarding α and β, depend on the
missingness mechanism only through γ1 and γ2.
• The first two terms of each of these equations represent the observed minus
expected number of deaths if we had complete ascertainment of vital status.
The third term adjusts this difference for the number of unobserved deaths
that are expected given the proposed missingness model and the parameters
α and β.
• If γ1 = γ2 = 0, the score function reduces to the score function for just
the nonmissing data, and, hence, the missingness is ignorable. In the case
γj → +∞, all missing subjects for treatment j are dead, effectively adding
these nj − mj deaths to yj . Similarly, in the case γj → −∞, all missing
subjects for treatment j are alive, so there is no contribution from the third
term of equations (11.2) and (11.3). Hence γ1 → +∞, γ2 → −∞ coincides
with the “best-case” as described above, and γ1 → −∞, γ2 → +∞ coincides
with the “worst-case.”
The score test for H0 : β = 0 is obtained by first finding the solution,
α̂0, to Uα(α, 0) = 0. Then, under H0, Uβ(α̂0, 0) has mean zero and variance
V(α̂0, 0) = Uβ,β(α̂0, 0) − Uα,β(α̂0, 0)²/Uα,α(α̂0, 0), where Ust = −∂² log L(α, β)/∂s∂t
(see Appendix A.2).
The test statistic for H0 is Z = Uβ(α̂0, 0)/V(α̂0, 0)^{1/2}. The sensitivity of
this test to the missingness mechanism can be ascertained by considering the
values of Z for a range of values of γ1 and γ2. In the case γ1 = γ2 = 0, Z²
is the usual Pearson chi-square statistic for the non-missing data. In the cases
γ1 → +∞, γ2 → −∞ and γ1 → −∞, γ2 → +∞, Z² is the usual Pearson
chi-square statistic for the best-case and worst-case scenarios.
In Figure 11.3, the value of Z is plotted against eγ2 = π21/π20 for sev-
eral values of eγ1 = π11/π10 (note that eγj is the ratio of the missingness
probabilities for deaths relative to survivors in group j). We see, for
example, that if eγ1 = 1/8, so that in group 1 subjects who survive are
eight times more likely to be missing than those who die, then in order
for Z to be less than 1.96, we would need eγ2 > 3.2, so that in group 2,
subjects who die are more than three times as likely to be missing as
those who survive. This requires that in both treatment groups there are
significant deviations from ignorability (γj = 0) and that they are in
opposite directions. The assessment of whether these deviations are plausi-
ble involves clinical judgment and additional knowledge regarding the known
reasons that subjects’ data are missing. It may be that, ultimately, there is
disagreement regarding whether or not this result is compelling. See Kenward
(1998), Scharfstein et al. (1999), and Verbeke et al. (2001) for applications to
more complex models.
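A sketch implementing this score test as a function of (γ1, γ2), using the same assumed counts as before; the function names are ours and the root-finding bracket is arbitrary:

```python
import numpy as np
from scipy.optimize import brentq

def expit(x):
    return 1 / (1 + np.exp(-x))

def sensitivity_z(y, m, n, gamma):
    """Score test Z for H0: beta = 0 under the missingness model
    pi_{j1} = exp(gamma_j) * pi_{j0}; index 0 is group 1 (control, z = 0)
    and index 1 is group 2 (experimental, z = 1)."""
    y, m, n, gamma = map(np.asarray, (y, m, n, gamma))

    def U_alpha(a):                          # score for alpha with beta = 0
        return np.sum(y - n * expit(a) + (n - m) * expit(a + gamma))

    a0 = brentq(U_alpha, -20, 20)            # restricted MLE of alpha
    p, q = expit(a0), expit(a0 + gamma)
    U_beta = y[1] - n[1] * p + (n[1] - m[1]) * q[1]
    # Observed information; with z = (0, 1), I_bb and I_ab coincide.
    I_aa = np.sum(n * p * (1 - p) - (n - m) * q * (1 - q))
    I_bb = I_ab = n[1] * p * (1 - p) - (n[1] - m[1]) * q[1] * (1 - q[1])
    V = I_bb - I_ab ** 2 / I_aa
    return U_beta / np.sqrt(V)

y, m, n = (220, 170), (490, 470), (500, 500)     # assumed data, as above
print(sensitivity_z(y, m, n, gamma=(0.0, 0.0)))  # ignorable: about -2.75
print(sensitivity_z(y, m, n, gamma=(-6.0, 6.0))) # near the worst case, -1.28
```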
11.3.3 Imputation
The “best-case” and “worst-case” scenarios posited in the previous example
illustrate a type of imputation. That is, we impute values for the missing data,
and behave as if they were the observed values. For these simple techniques,
the imputed values do not depend on the model parameters of interest, and
therefore, inference regarding the model parameters can proceed as if the
missing responses were known. This result follows directly from the score
Figure 11.3 Sensitivity analysis for hypothetical result from Table 11.4.
function (equations (11.2) and (11.3)) since, in the cases γj → ±∞, α and β
drop out of the third term in each equation. If other imputation techniques
are used in which the imputed values depend on either the responses or the
non-missing data, correct inference can be performed only if the imputation
method is taken into account.
Note that the likelihood inference performed in the previous example can be
viewed as a kind of imputation. Since inference is based on the likelihood for
the observed data under an assumed model for missingness, inference for the
treatment effect of interest will be correct if the missingness model is correct.
To see how the method used in this example uses a kind of imputation, we can
apply Bayes theorem to compute the probability that a missing observation
is a death. Letting yij be the outcome for subject i in group j,
$$\Pr\{y_{ij}=1 \mid y_{ij}\ \text{missing}\} = \frac{\Pr\{y_{ij}\ \text{missing} \mid y_{ij}=1\}\,\Pr\{y_{ij}=1\}}{\Pr\{y_{ij}\ \text{missing}\}}$$
$$= \frac{\pi_{j1}\,p_j}{\pi_{j1}\,p_j + \pi_{j0}(1-p_j)} = \frac{e^{\alpha+\beta z_j+\gamma_j}}{1+e^{\alpha+\beta z_j+\gamma_j}}. \qquad (11.4)$$
Therefore, the third term in each of equations (11.2) and (11.3) is the expected
number of deaths in the corresponding treatment group among the subjects
with missing responses. Viewing these terms as representing the imputed num-
ber of deaths also provides a simple iterative procedure for estimation of α
and β. Starting with preliminary estimates of α and β (from the non-missing
data, for example), we compute the expected number of deaths among the
missing observations using (11.4), impute these values for the missing out-
comes, and re-estimate α and β. The process is repeated until sufficient accu-
racy is achieved. This is a simple application of the EM algorithm (Dempster
et al. 1977). The EM algorithm is a procedure that alternates between the
E-step (expectation) and the M-step (maximization). Given estimates of the
model parameters, the E-step fills in the missing observations with their ex-
pectations under the estimated model. Given the imputations, for the M-step
we recompute the maximum likelihood estimates of the model parameters
based on complete data. When the observed data likelihood is complex, the
EM algorithm is generally a computationally simpler procedure than direct
maximization of the likelihood, although the convergence rate for the EM al-
gorithm can be slow, depending on the amount of missing data. In the previous
example, only two or three iterations provide adequate accuracy.
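A minimal EM sketch for this model, with the same assumed counts as above; the E-step imputes expected deaths via equation (11.4) and the M-step refits the saturated two-group logistic model:

```python
import numpy as np

def expit(x): return 1 / (1 + np.exp(-x))
def logit(p): return np.log(p / (1 - p))

def em_fit(y, m, n, gamma, iters=25):
    """EM for (alpha, beta): the E-step imputes the expected deaths among
    the n_j - m_j missing subjects using equation (11.4); the M-step is the
    saturated complete-data MLE for the two group failure probabilities."""
    y, m, n, gamma = map(np.asarray, (y, m, n, gamma))
    z = np.array([0.0, 1.0])
    alpha, beta = logit(np.sum(y) / np.sum(m)), 0.0   # start: non-missing data
    for _ in range(iters):
        d = (n - m) * expit(alpha + beta * z + gamma)  # E-step: imputed deaths
        p_hat = (y + d) / n                            # M-step
        alpha, beta = logit(p_hat[0]), logit(p_hat[1]) - logit(p_hat[0])
    return alpha, beta

y, m, n = (220, 170), (490, 470), (500, 500)   # assumed counts, as above
print(em_fit(y, m, n, gamma=(0.0, 0.0)))       # reproduces the observed-case fit
```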
It is also clear from this example that, for finite γj, the imputed observations
depend on the observed data via the parameter estimates. If we impute values
based on the fitted model and then assume that they are known, we are likely
to underestimate the variance in the test statistic and overstate the statistical
significance of the result. By using the variance derived from the observed
data likelihood as we did for the score test in the example, this dependence is
correctly accounted for.
In more complex situations, the likelihood for the observed data, accounting
for the missing data mechanism, can quickly become analytically intractable.
While in some cases it may be relatively straightforward to perform the impu-
tation, obtaining correct variances is usually more difficult. Correct inference
may require a technique such as multiple imputation (see, for example, Rubin
(1987), Rubin (1996), Liu and Gould (2002), or Little and Rubin (2002)).
Using multiple imputation, imputed values are randomly generated for each
subject using the posterior distribution derived using Bayes theorem in a
manner similar to what was done in the simple example above. The desired
test statistic can be computed from the augmented dataset, and the process
repeated many times. The variability among the multiple values of the test statis-
tic can be used to adjust the full-data variance (conditional on the imputed
values) to account for the method of imputation that was used. This differs
from the technique we have used in that we have imputed the expected value
for each observation, whereas in multiple imputation one generates imputed
values as random samples from the entire posterior distribution. Sensitivity
analysis can be conducted in the same way as in the previous example to as-
sess robustness of the result to the potential association between missingness
and response.
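The text does not fix a particular combining rule; the standard choice is Rubin's (1987) rules, sketched here with hypothetical per-imputation estimates and variances:

```python
import numpy as np

def rubin_combine(estimates, variances):
    """Combine M multiply imputed analyses: pooled estimate, total variance
    (within-imputation plus between-imputation), and the resulting Z."""
    est, var = np.asarray(estimates), np.asarray(variances)
    M = len(est)
    qbar = est.mean()                 # pooled point estimate
    W = var.mean()                    # average within-imputation variance
    B = est.var(ddof=1)               # between-imputation variance
    T = W + (1 + 1 / M) * B           # total variance
    return qbar, T, qbar / np.sqrt(T)

# Hypothetical treatment-effect estimates from five imputed datasets
print(rubin_combine([-0.047, -0.051, -0.044, -0.049, -0.046],
                    [0.00033, 0.00034, 0.00033, 0.00034, 0.00033]))
```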
The critical feature in multiple imputation is that the analysis and the
imputation are decoupled so that the same analysis can be applied to datasets
generated by a variety of imputation schemes that are derived from a variety
of missingness models. Liu and Gould (2002) state
Imputation and statistical inference are carried out separately with the MI ap-
proach. The MAR assumption needs to apply only for the imputation step; the
statistical inference will be valid if the imputed values are drawn from a proper
distribution. Since the model used to impute missing values can differ from the
model used for inference, the imputation model can include considerably more
potentially predictive information. . . .
The MI method is generally robust against misspecification of the imputation
model. The imputation method incorporates the variability that arises because of
the uncertainty about the missing values. The analysis results from each imputed
data set also provide a way to assess the influence of the missing observations.
When the estimates vary within a small range, the missing data are unlikely to
influence the conclusions dramatically. Otherwise, if the results from the imputed
data sets differ considerably, the impact of missing data or the imputation model
may be substantial.
Results in subgroups may also be biased when the subgroup is defined using
characteristics or outcomes observed after randomization. The properties of
randomization do not apply for such subgroups, for the same reason they do
not apply in the Treatment Received and Adherers Only analyses discussed in
Section 11.2.2. The fact that such subgroup analyses may be specified in ad-
vance within the study protocol or in a detailed analysis plan, or that there are
no significant baseline imbalances by treatment group within the subgroups,
does not mitigate the potential for bias.
One of the early examples of the problem of interpreting results for sub-
groups defined by a post-randomization event was again demonstrated by the
CDP (The Coronary Drug Project Research Group 1980) and the clofibrate
arm compared to the placebo arm. Recall that, overall, there was no differ-
ence between the clofibrate arm and the placebo arm in the 5-year mortality
(20.0% vs. 20.9%). Nonetheless, there was lower mortality in the clofibrate arm
among subjects with a baseline cholesterol greater than 250 mg/dl (17.5% vs.
20.6%) and among subjects with a fall in cholesterol from baseline to the
last follow-up visit (17.2% vs. 20.7%) but not among those with a rise in cholesterol (22.2% vs.
19.7%). These results are consistent with the assumed mechanism of action of
clofibrate—reducing mortality by lowering cholesterol. Table 11.7 shows the 5-
year mortality results according to both baseline cholesterol level and change
in cholesterol. As shown, there was only modest variation in the placebo arm
mortality results according to baseline cholesterol or change in cholesterol.
There were notable differences in the clofibrate arm, however. A favorable
difference between clofibrate and placebo is observed among subjects with a
low baseline cholesterol level and a subsequent reduction in cholesterol. The
greatest observed effect of clofibrate, however, is among subjects with a high
baseline cholesterol level and a subsequent increase in cholesterol level (15.5%
vs. 21.3% in favor of clofibrate). These results are not consistent with scientific
expectations and are not sensible. Furthermore, those comparisons that con-
sider subgroups based on post-treatment changes are inconsistent with ITT
and should be avoided.
Table 11.7 5-year mortality in the CDP by baseline cholesterol and change.

Baseline                        Clofibrate           Placebo
Cholesterol     Change       N      % Mortality   N      % Mortality
< 250 mg/dl     Fall         295    16.0          614    21.2
< 250 mg/dl     Rise         212    25.5          705    18.7
≥ 250 mg/dl     Fall         385    18.1          762    20.2
≥ 250 mg/dl     Rise         105    15.5          454    21.3
11.5 Multiple Testing Procedures
In its simplest form, a clinical trial is designed to answer a single well-defined
question. For example, the primary goal of the COPERNICUS study (Packer
et al. 2001) was to determine whether the beta-blocker carvedilol reduced all-
cause mortality relative to placebo in subjects with severe heart failure. There
was a single treatment (25 mg carvedilol twice a day), a single comparator
(placebo), and a single outcome (all-cause mortality). In practice, clinical tri-
als tend to be more complex. There may be multiple doses of the study drug,
multiple comparators (active and placebo), or multiple outcomes. In Chap-
ter 2, we discussed primary and secondary questions, noting that the type I
error for the primary question should be strictly controlled. This forces us to
compromise between the number of primary hypothesis tests and the ability
to find differences that exist (i.e., power). Thus, the primary analysis should
be confined to the fewest and most clinically relevant questions.
In some settings, it may also be desirable to control the type I error for
secondary questions. For example, the FDA may request that a testing pro-
cedure be prespecified for secondary outcomes to aid with labeling decisions
that need to be made for any approved drug.
In the previous section we discussed the examination of treatment effect
in subgroups and touched on the multiplicity concerns this raises. The fun-
damental problem is that multiple tests lead to multiple opportunities for
rejecting hypotheses as a result of chance alone. For example, if one uses the
usual 0.05 criterion for statistical significance and performs five tests each at
the 0.05 level, if all the null hypotheses are true, the probability may be as
high as 0.25, depending on the extent of correlation among the tests, that at
least one test reaches statistical significance by chance alone. In this section
we discuss multiple testing procedures for the control of type I error in clini-
cal trials. The repeated testing of an outcome during interim monitoring of a
clinical trial was discussed in Chapter 10 and is not considered further here.
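The 0.25 figure is the Bonferroni upper bound; under independence the probability is slightly smaller, as a two-line check shows:

```python
# Chance of at least one false rejection among five level-0.05 tests
k, alpha = 5, 0.05
print(1 - (1 - alpha) ** k)   # about 0.226 if the tests are independent
print(k * alpha)              # 0.25: the Bonferroni bound for any correlation
```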
Example
Throughout this section we will refer to the following example. Suppose that
we have three treatment groups, high dose, low dose, and placebo, and two
outcomes, all-cause mortality and subject-assessed global status. If we are
interested in comparing both doses of experimental drug to placebo for each
of the two outcomes, we have a total of four elementary hypotheses that are
enumerated in Table 11.8. Table 11.8 also provides hypothetical observed p-
values from the individual hypothesis tests, since the procedures that we will
be discussing are based on these p-values.
Additional Notation
In addition to the four hypotheses in Table 11.8, there are others that are
potentially of interest. For example, we might want to test the hypothesis
that neither dose of drug has an effect on mortality. This hypothesis claims
that both H1 and H2 are true, and would be represented by the intersection
H1 ∩ H2 . Alternatively, the hypothesis that the low dose had no effect on
either outcome would be written as H2 ∩H4 . For this example, there are eleven
intersection hypotheses in addition to the elementary hypotheses H1 , . . . , H4 ,
although it is not likely that all of these would be of interest. In general, if we
have m elementary hypotheses, and I is a subset of {1, 2, . . . , m}, we will let
HI indicate the intersection hypothesis ∩i∈I Hi . The goal of the analysis is to
test all hypotheses of interest while ensuring that the probability of a type I
error is controlled at a prespecified level, α. We will return to this example
after providing some background for multiple testing procedures.
One difficulty that arises is that some of the null hypotheses may be true
and some may be false. Ideally, a testing procedure would reject the false hy-
potheses and not reject the true hypotheses; however, both type I and type II
errors are inevitable and at best we can only control the rate at which errors
occur. To clarify this, some additional terminology is helpful. The global null
hypothesis is the hypothesis, H1 ∩ H2 ∩ . . . ∩ Hm , that asserts that all of
H1 , H2 , . . . , Hm are true. A procedure that rejects the global null hypothesis
with probability at most α when it is true is said to provide weak control
of the familywise error rate (FWER). A procedure that weakly controls the
FWER not only for the global null hypothesis, but also for all intersections,
HI = ∩i∈I Hi , I ⊂ {1, 2, . . . , m}, of hypotheses is said to provide strong control
of the FWER.
11.6 Summary
Clinical trials are inherently complex enterprises, and the results can be diffi-
cult to interpret. In this chapter we have tried to point out some of the issues
that frequently arise, some of which have clearly defined statistical solutions,
while others can be ambiguous and subjective. Because “we don’t know what
we don’t know,” the effects of non-compliance and missing data are not readily
ascertained or accounted for using purely statistical methods. Subjective as-
sessments regarding the degree to which the results are affected may be central
to the ultimate interpretation of the result. That is, if we believe it is plau-
sible that the result that we observe could be an artifact of non-compliance
or missing data, the credibility of the result is in question and no amount of
speculative exploratory analyses can definitively answer the question.
Treatment comparisons adjusted for post-randomization attributes such as
adherence to assigned treatment or the presence or absence of adverse re-
sponses are inherently problematic and should be avoided. Fundamentally
flawed analyses are subject to bias of unknown magnitude and direction and
cannot provide a meaningful assessment of the robustness of the overall result.
“Per-protocol” or “evaluable” analyses do not answer the questions that one’s
instinct might suggest. Unless we are extremely careful, once we leave ITT
behind, we have left the world of experimentation and reentered the world of
observational studies in which confounding is a serious issue. Just as we have
been misled by epidemiological studies, we can be easily misled by non-ITT
analyses.
11.7 Problems
11.1 If the power of a single two-sided test with α = 0.05 is 80%, what is
the power of the same test carried out with a Bonferroni correction for
2 outcomes? For 6 outcomes? (Assume normality.)
11.2 Simulate a set of data as described at the end of Section 11.2.2. Use
the hazard rate given, 0.057, for both death and crossover, starting
with the original allocation of subjects: 354 to medical therapy and
332 to surgical therapy. Plot and compare the two Kaplan-Meier curves
for each type of analysis: ITT, “Treatment Received”, and “Adherers
Only.”
11.3 In each of the following cases, use the EM algorithm described in Sec-
tion 11.3.3 to impute the missing values in Table 11.4, then perform a
Pearson chi-square test and compare the result to the test based on the
score equations, (11.2) and (11.3).
(a) γ1 = γ2 = 0.9
(b) γ1 = −γ2 = 1.5
(c) γ1 = −γ2 = −1.5
CHAPTER 12
Closeout and Reporting
We have seen that before enrollment in a trial begins, the protocol, which
establishes all trial procedures, outcome measures, and analyses, must be in
place. A similar set of procedures must be established to bring the trial project
to a close once subject follow-up is complete. There are multiple tasks to be
completed at the clinics, by the sponsor and study leadership, and by the
statistical or data coordinating center. The literature on these topics is exten-
sive and we will consider only some of the major issues that have important
statistical implications. First, we will discuss important issues in the closeout
of a clinical trial. We will emphasize those challenges commonly encountered
by the statistical center. We will then focus on reporting issues. There are
at least three types of reporting activities: presentation of results at profes-
sional meetings, publication of results in scholarly journals, and regulatory
reporting for product approval. Of these, we will focus on the publication and
presentation aspects. Regulatory aspects were briefly discussed in Chapter 1.
In other cases, new information or concern about a safety issue may emerge
from other studies and further analysis or additional subject follow-up may
be required for the trial.
While archiving a trial is not a task that researchers take on eagerly, it
will not be achieved if left to normal routine activity. Within a few months after
the data collection has been completed, memories of key details about the
trial and the database fade and the task of retrieving a data set for further
analysis can be daunting unless proper documentation has been developed.
Thus, a plan for database archiving must be developed well before the end of
the trial and worked towards during the course of the trial.
The protocol, an annotated copy of the case report form, the documentation
of the database, and the analysis programs should be carefully archived. The
database documentation may be the most challenging, as databases usually
have several components that have to be linked together.
Statistical programs used for the analysis may be written using a statistical
computing language. Software packages change over time, however, and there
is no guarantee that previous programs will run exactly as before; they may
need to be updated. Thus, it is important to document these programs suffi-
ciently, so that a future statistician or programmer could make the appropriate
changes and reproduce the original analyses if necessary.
The medium of storage may also change in our rapidly changing technology
environment. Thus, the statistical center closeout plans should address long
term archiving. Paper will deteriorate over time. Forms, whether paper or elec-
tronic, are now routinely stored on digital media, as are the database and
programs. However, the ability to read digital media may also change.
Thus, if a trial is of sufficient importance, the trial archive may require periodic
updates.
Once the trial is closed and the database finalized, the next task is to present
the results of the trial as quickly as possible while being thorough and accu-
rate. Typically, results are presented in two forums, an oral presentation at a
professional meeting and a scholarly publication in a relevant medical journal.
Sometimes the oral presentation takes place before the scientific publication
and, for other trials, the oral presentation does not occur until shortly after a
paper is published. These decisions are often made by the sponsor of the trial,
the steering committee, the publication committee, or some combination. The
oral presentation must be much shorter and cannot cover as much detail as
the publication. In either case, the amount of statistical analyses that can be
presented will likely be much less than what was conducted. Choices have to
be made regarding which results are most important. This can be a challenge
because all important information that might affect the interpretation of the
trial results must be included. It is important for the statistician to be involved
actively in these decisions so that the results presented accurately reflect the
overall sense of the analyses.
Introduction
In the “Introduction”, the authors should give the scientific background and
describe the rationale for the trial. This should include a very brief review of
recent relevant publications or trials that have addressed the same or similar
questions. The current trial is likely designed to answer some of the questions
left open by previous trials or is testing a new approach to a similar ques-
tion. The introduction should make it clear why the current trial has been
conducted and that it was in fact ethical to do so. For example, the COPER-
NICUS (Packer et al. 2001) trial tested the effect of a beta-blocker drug in
a population with more severe heart failure (New York Heart Association Class IV) than
the two previous beta-blocker trials, MERIT-HF (MERIT-HF Study Group
1999) and CIBIS-II (CIBIS-II Investigators 1999). Those two trials tested a
beta-blocker in a Class II–III heart failure population and demonstrated a
35% reduction in mortality and heart failure hospitalization. Prior to these
trials, use of beta-blockers in this type of heart failure population was be-
lieved to be harmful. Thus, the potential benefit of a beta-blocker in a more
severe, higher risk heart failure population remained an open question and
the COPERNICUS trial was designed to answer that very specific question.
A brief review of this history was included in the COPERNICUS article and
provided the necessary background information to the readers. This review
also signals to the readers that the authors are aware of the previous relevant
trials.
The “Introduction” should also state clearly the specific hypothesis that
is being tested and the outcome measures that will be used to address the
hypothesis. In the COPERNICUS trial, the hypothesis stated was that total
mortality and mortality plus hospitalization for heart failure would be reduced
with the beta-blocker. The trial was designed to compare the best standard
of care with the best standard of care plus a beta-blocker (carvedilol).
This section may also refer to the organizational structure, number of clin-
ical centers, and sponsor of the trial.
Methods Section
The “Methods” section is a synopsis of the trial protocol. The methods section
briefly conveys the essence of the design. Some trials may have a separate
published design paper that can be cited, but otherwise the protocol must be
cited for further details.
In this section, the eligibility criteria for the trial should be concisely de-
scribed. This is often done by listing the main inclusion and exclusion criteria,
following a similar list to that provided in the protocol as a guide. This is im-
portant because a physician reading the paper must understand how the re-
sults pertain to his/her practice. Other researchers may find the results useful
when planning further trials. Similarly, the number, type, and location of the
clinics at which data were collected should be reported to assist in the appli-
cation of the trial results.
In addition, the paper should describe the specific outcome measures that
were used to assess the risks and benefits of the intervention and the frequency
and timing of those measurements. If cause-specific mortality or morbidity is
used as an outcome, then the definition must be included. For example, in
COPERNICUS, hospitalization for heart failure was an outcome of interest.
To qualify as an event, a hospitalization must have been at least 24 hours
long and related to worsening symptoms of heart failure. Treatment in an
emergency room for only a few hours would not be considered an outcome
event. Trial masking or blinding procedures must also be described, if used,
since it is relevant to the interpretation of the results. For example, a trial may
have masked tablets to minimize bias in outcome assessment. If the treatment
or intervention cannot be masked, then the classification of potential events
should be assessed in a masked or double-blind fashion. In some trials, both
the treatment and the outcome adjudication are masked. Such details should
be clearly described.
The randomization process, or allocation of interventions to participants,
must also be described. For example, if the paper states that the intervention
was randomly allocated to participants using a permuted block design with
varying block sizes (2-6), stratified by clinic or region, then the reader has suf-
ficient information with appropriate references to understand the allocation
method. Also the manner in which the allocation was implemented is impor-
tant. For example, the randomization assignments could be provided in sealed
envelopes (which are vulnerable to tampering) or an interactive voice response
system (IVRS) could be used. The paper should mention the particular in-
dividuals or groups responsible for implementation of randomization. These
details can be succinctly described but are important to the interpretation of
the results.
The statistician should take responsibility for the sample size summary. Suf-
ficient detail is needed so that the sample size calculation could be replicated
by an informed reader. The significance level, whether the test was one sided
or two sided, and the power of the design must be specified. In addition, the
event rate or the mean level of response, and its variability, and the size of
the hypothesized treatment effect must be provided. The latter could be ex-
pressed either as a percent change or an absolute change but it is necessary
to be clear which metric is presented. For example, in COPERNICUS, the
paper states that the trial was designed for a two sided 0.05 significance level
with 90% power to detect a 25% change in mortality, assuming an annual
mortality of 20% in the control arm. From this information, we can estimate
the sample size independently. If adjustments were made for non-compliance,
either drop-out or drop-in, these should be stated as well as the effect on the
final sample size. References to the statistical methods should be included.
Inclusion of the sample size methods is important because it tells the reader
the significance level that was used as the criterion for success, whether the
trial had adequate power, whether the assumed control group response rate
was actually observed, and whether the effect size specified was reasonable.
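As an illustration, the reported design parameters are enough to reproduce the required number of deaths using, for example, Schoenfeld's approximation for the log-rank test. A minimal sketch follows; we do not know the exact method the COPERNICUS designers used, so treat this only as a rough independent check:

```python
import math
from scipy.stats import norm

# Design parameters as reported: two-sided alpha, power, and a 25%
# reduction in the hazard of death (hazard ratio 0.75)
alpha, power, hazard_ratio = 0.05, 0.90, 0.75

# Schoenfeld approximation: deaths required for the log-rank test
# under equal (1:1) allocation
z_alpha = norm.ppf(1 - alpha / 2)
z_beta = norm.ppf(power)
deaths = 4 * (z_alpha + z_beta) ** 2 / math.log(hazard_ratio) ** 2
print(f"required deaths: {deaths:.0f}")  # about 508

# The assumed 20% annual control-arm mortality, together with the
# planned accrual and follow-up pattern, converts required deaths
# into a required number of subjects.
```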
The methods for analysis must also be presented. Based on the issues de-
scribed in Chapter 11, the primary outcome measures must be analyzed on
an intent to treat basis where all subjects who are randomized and all events
occurring during the follow up period are accounted for. Any supplementary
analyses that will be presented, such as “per protocol” analyses, must be de-
fined as well. The description of methods also includes the statistical test(s)
that were used. This can be done succinctly by naming the statistical tests
with appropriate references. For example, for time to event outcome measures
such as mortality or death plus hospitalization, the Kaplan-Meier method
might be used to estimate the time-to-event curves and the log-rank test used
to compare them. A simple sentence with references can convey the necessary
information. Generally a summary measure of effect size, accompanied by an
assessment of variability, will be included. This might be a mean difference,
relative risk, odds ratio, or hazard ratio estimate, along with, for example, a
corresponding 95% confidence interval. If missing data are anticipated, then
the protocol should address how this was to be handled and this should also
be summarized briefly in the methods section. If any covariate adjustment
was done, such as adjustment for clinic or baseline risk factors, this should be
stated along with the statistical methods with references and the list of covari-
ates utilized. Enough detail about the analysis plan should be given so that
another researcher with access to the original data would be able to reproduce
the main results. Some trials, especially those used for regulatory purposes,
have a separate analysis plan document that can be cited.
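As a sketch of how compactly these standard analyses can be carried out in practice, the following uses the open-source Python package lifelines (assumed available); the data and variable names are invented for illustration only:

```python
import numpy as np
import pandas as pd
from lifelines import KaplanMeierFitter, CoxPHFitter
from lifelines.statistics import logrank_test

# Invented data: follow-up time (years), event indicator, treatment arm
rng = np.random.default_rng(1)
n = 400
df = pd.DataFrame({"arm": np.repeat([0, 1], n // 2)})
df["time"] = rng.exponential(np.where(df["arm"] == 1, 6.0, 4.5))
df["event"] = (df["time"] <= 3.0).astype(int)  # administrative censoring at 3 years
df["time"] = df["time"].clip(upper=3.0)

# Kaplan-Meier estimate of the time-to-event curve in each arm
for arm, grp in df.groupby("arm"):
    KaplanMeierFitter().fit(grp["time"], grp["event"], label=f"arm {arm}")

# Log-rank test comparing the two curves
a, b = df[df["arm"] == 0], df[df["arm"] == 1]
result = logrank_test(a["time"], b["time"], a["event"], b["event"])
print(f"log-rank p-value: {result.p_value:.3f}")

# Hazard ratio with 95% confidence interval from a Cox model
cph = CoxPHFitter().fit(df, duration_col="time", event_col="event")
print(cph.summary[["exp(coef)", "exp(coef) lower 95%", "exp(coef) upper 95%"]])
```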
Most trials conducted in the United States must now have a monitoring
plan, and trials with serious outcomes such as death or irreversible morbidity
should have an independent data monitoring committee (DMC), as described
in Chapter 10. Membership on the DMC might be reported in an appendix,
for example. In order to protect against false positive claims of treatment
effect, either for benefit or harm, many DMCs use a statistical procedure to
account for interim analyses, and these should be reported with appropriate
references. For example, COPERNICUS used an independent DMC with a
Lan-DeMets alpha spending function of the O’Brien-Fleming type (Lan and
DeMets 1983) with an overall 0.05 significance level. This informs the reader
how the trial was monitored, and guides the interpretation of the primary
results. This is especially important for trials that terminate early for either
benefit or harm. If adjustments to the estimate of the effect size were made
to account for interim monitoring, the specific methods should be referenced.
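The spending function named here has a simple closed form, so a reader can verify how little of the overall significance level is used at early looks. A minimal sketch (the information fractions below are arbitrary illustrations):

```python
from scipy.stats import norm

def obrien_fleming_spend(t, alpha=0.05):
    """Cumulative two-sided alpha spent at information fraction t under
    the Lan-DeMets spending function of the O'Brien-Fleming type."""
    return 2 * (1 - norm.cdf(norm.ppf(1 - alpha / 2) / t ** 0.5))

for t in (0.25, 0.50, 0.75, 1.00):
    print(f"information fraction {t:.2f}: "
          f"cumulative alpha spent {obrien_fleming_spend(t):.5f}")
# Almost no alpha is spent at t = 0.25; the full 0.05 is available at t = 1.
```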
Subgroups that were predefined in the protocol should also be described. If
there were numerous subgroups, indications of which were primary are useful.
If this was not done in the protocol, adjustments made for multiple testing
of many subgroups, as discussed in Chapter 11, must be described. Lack of
any such description may leave the reader to assume that the authors do
not appreciate the issue and may minimize any subgroup results, potentially
leading readers to ignore an important subgroup result.
Results Section
The “Results” section of the scientific paper is typically the primary focus
and must be carefully prepared. This section must not include any editorial
opinions or comments; those should be included in the “Discussion” section.
Here, the results must be presented in a complete, clear, and concise manner.
We must report the results as observed in the trial, not as we might have
wished them to be or after we have “cleaned them up.” We might, within
reason, have alternative results to share in the “Discussion” section.
There are typically more analyses conducted than can be summarized in
the standard scientific publication. Thus, the results presented must be con-
sistent with the goals stated in the protocol and described in the “Methods”
section. Still, difficult choices regarding which analyses to present will have
to be made. It may be that not all of the important results can be presented
in one manuscript. The MERIT-HF trial, for example, presented the two pri-
mary outcome results in two separate papers, published almost a year apart
(MERIT-HF Study Group 1999; Hjalmarson et al. 2000). One reason for this
was that the mortality paper (MERIT-HF Study Group 1999) could be com-
pleted and verified more quickly. The second paper focused on the second
primary outcome of death and heart failure hospitalization and came later
because it took much longer to collect, validate, and adjudicate the cause-
specific hospitalization data (Hjalmarson, Goldstein, Fagerberg, et al. 2000).
By separating the results into two papers, the mortality results could be more
rapidly disseminated and both papers could include more detailed analyses
than if these two outcomes had been combined in one single manuscript.
We will now describe some of the typical, and necessary, results of the trial
that are to be presented. Some of these results may best be presented in a
graphical display, and others as tables. In some cases, either is acceptable. As
we go through the elements, we will comment on those for which we have a
recommendation regarding the style of display.
One of the first tables or figures is a summary of the screening process,
typically shown in the format of the “CONSORT diagram” (Moher et al.
2001). While the subjects enrolled are rarely a representative sample (nor do
they need to be), it is still useful for the reader and practicing physician to see
how many subjects were screened but did not qualify or volunteer for the trial.
There cannot be great detail in this table but it is useful to know how many
subjects were excluded for some of the major entry criteria and the number
of subjects who did not consent. This type of information gives physicians a
sense of how the results might pertain to their own practice. This table should
also clearly present the exact number of subjects randomized. The reader can
judge how close the trial came to the recruitment goal and then knows how
much data to expect for any given variable, based on the accompanying tables.
An example from the MERIT-HF study is shown in Figure 12.1.
The interpretation of trial results may depend heavily on the time frame in
which the trial took place. Medical practice, and even population attributes,
can change over this period. The “Results” section should report important
dates including the dates on which subject accrual began and ended, as well
as the ending date of follow-up.
The next standard table (see Table 12.1) describes the major baseline char-
acteristics, as defined in the protocol, by treatment arm with an indication of
where, if at all, imbalances occurred. This table is important for two primary
reasons. First, it describes in some detail the characteristics of the subjects or
Figure 12.1 CONSORT diagram (Figure 1) from MERIT-HF study report (Hjalmarson et al. 2000). Copyright © 2000 American Medical Association. All rights reserved.
Table 12.1 Baseline table (Table 1) from MERIT-HF study report (Hjalmarson et al. 2000). Copyright © 2000 American Medical Association. All rights reserved.
Once the balance across treatment arms among baseline covariates has been
established, a description of treatment and protocol compliance should be
presented. The reader needs to know how much of the treatment specified
in the protocol was actually received. If a trial demonstrates a statistically
significant positive benefit but the compliance was less than anticipated, the
interpretation is still that the treatment is beneficial. On the other hand, if
a trial comparing standard care plus an experimental treatment to standard
care alone produced a neutral outcome with poor compliance, the result could
be due either to an ineffective experimental treatment or to poor adherence
to assigned treatment.
The publication should also report the relevant concomitant medication
that subjects in each treatment arm received. This can be important if sub-
jects in one of the arms received more of a particular key concomitant
medication or therapy than those in the others. It is ethically mandated that
all subjects receive the best available care. This may result in differences in
concomitant care between treatment groups and these differences may affect
the interpretation of the results. For example, if subjects in the experimen-
tal treatment arm received more of a particular concomitant medication, it
may be difficult to distinguish the effect of the experimental treatment from
that of the concomitant medication. This can be especially problematic in
unblinded trials. Because most concomitant medication is reported on a log-
form (see Chapter 6), it is often of poorer quality and reliability than other
data. Nonetheless, the use of any important concomitant medications should
be reported.
Compliance to the protocol beyond concomitant medication and experimen-
tal medication should also be reported for each treatment arm. For example,
the completeness of subject clinic visits, the percent of subject visits within
the prescribed time window, and the completeness of prescribed procedures
and measurements are important. The number of participants lost to follow-up
should be reported. If protocol compliance is not as expected or not accept-
able, then the results may not be easily understood. Because non-compliance
with the protocol procedures is likely not to happen at random, it is important
that this information be presented.
The primary outcome(s) and major secondary outcomes are the next results
to present. If the trial has identified a single primary outcome, then this result
should be highlighted. If there is more than one primary outcome, all primary
outcomes should be clearly identified as such. Secondary outcomes may be
shown in the same table as the primary outcomes as long as they are clearly
identified.
If the outcomes are summarized by sample means, the tables should indi-
cate, for each treatment arm, the sample size, the mean value, and the stan-
dard deviation or standard error. In addition, the standardized test statistic
or the corresponding p-value (or both) should be shown. Also, it is helpful
to show the estimated treatment difference and the 95% confidence interval.
There has been debate whether p-values or confidence intervals should be
presented in published reports. Our recommendation is that both should be
presented since they convey different information, but we agree with some
authors that the confidence intervals are preferred since they contain more
useful information.
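For a continuous outcome, the recommended summary (estimated difference, confidence interval, and p-value) is straightforward to compute. A minimal sketch with invented data:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(3)
control = rng.normal(0.0, 1.0, 120)    # invented control-arm responses
treated = rng.normal(0.4, 1.0, 120)    # invented treatment-arm responses

diff = treated.mean() - control.mean()
se = np.sqrt(treated.var(ddof=1) / len(treated)
             + control.var(ddof=1) / len(control))
ci = (diff - 1.96 * se, diff + 1.96 * se)      # normal-theory 95% CI
t_stat, p_value = stats.ttest_ind(treated, control, equal_var=False)

print(f"difference {diff:.2f}, "
      f"95% CI ({ci[0]:.2f}, {ci[1]:.2f}), p = {p_value:.4f}")
```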
A time-to-event outcome should include Kaplan-Meier (Kaplan and Meier
1958) survival or mortality plots that also indicate the number of subjects at
risk at selected time points in the follow-up. In addition, a table should also be
included summarizing the number of events in each arm and the statistic used
Table 12.2 Composite outcome table (Table 2) from MERIT-HF study report (Hjalmarson et al. 2000). Copyright © 2000 American Medical Association. All rights reserved.
for comparing the survival curves, typically the log-rank test statistic, and its
corresponding p-value. The table should also include a summary statistic such
as a relative risk, odds ratio, or hazard ratio with a 95% confidence interval.
Examples of these presentations from MERIT-HF (MERIT-HF Study Group
1999) are shown in Table 12.2 and Figure 12.2.
After the primary and secondary outcomes have been presented, most trial
reports present subgroup analyses to assess the consistency of the results
across the range of patients the practicing physician might encounter. The
important point is not that each subgroup result be “statistically significant”
but that the results are generally consistent. While absolute consistency might
seem to be the ideal, it should not be expected. The size of the observed treat-
ment differences can vary considerably because of random variation due to
the smaller sample sizes in most baseline subgroups. Real differences in treat-
ment effect might be present, but it is difficult to distinguish real differences
from random differences. One convenient way to represent subgroup results
is illustrated by Figure 12.3, taken from the MERIT-HF trial (Hjalmarson
et al. 2000). In this figure, each hazard ratio is plotted with a 95% confidence
interval for several standard subgroups of interest in the treatment of heart
failure. The results are remarkably consistent in direction of treatment benefit
with relatively small variations in the size of the effect. Tests for treatment-
by-subgroup interactions can be provided as part of the figure if desired, or
described in the text for subgroups for which a qualitative interaction is of
interest.
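A figure of this style is easy to construct. A minimal matplotlib sketch (the subgroups and estimates below are invented, not the MERIT-HF values):

```python
import matplotlib.pyplot as plt

# Invented subgroup hazard ratios and 95% confidence limits
subgroups = ["All patients", "Age < 65", "Age >= 65", "Male", "Female"]
hr = [0.66, 0.62, 0.70, 0.65, 0.68]
lower = [0.53, 0.45, 0.52, 0.50, 0.44]
upper = [0.81, 0.85, 0.94, 0.84, 1.05]

y = list(range(len(subgroups)))[::-1]          # first subgroup at the top
xerr = [[h - l for h, l in zip(hr, lower)],    # asymmetric error bars
        [u - h for u, h in zip(upper, hr)]]
plt.errorbar(hr, y, xerr=xerr, fmt="s", color="black", capsize=3)
plt.axvline(1.0, linestyle="--", color="gray")  # line of no treatment effect
plt.yticks(y, subgroups)
plt.xlabel("Hazard ratio (95% CI)")
plt.tight_layout()
plt.show()
```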
In order to understand and assess the risk-to-benefit ratio for a new treat-
ment, the publication must present serious adverse events (SAEs) in a table.
An example from MERIT-HF is shown in Table 12.3. SAEs are usually defined
Figure 12.2 Composite outcome figure (Figure 2) from MERIT-HF study report (Hjalmarson et al. 2000) (modified slightly from the original). Copyright © 2000 American Medical Association. All rights reserved.
Figure 12.3 Subgroup analyses figure (Figure 4) from MERIT-HF study report (Hjalmarson et al. 2000). Copyright © 2000 American Medical Association. All rights reserved.
Discussion Section
While the “Methods” section of the paper should present the trial design and
the “Results” section should provide the trial results with little or no inter-
pretation or editorial comment, the “Discussion” section offers the authors
the opportunity to give their interpretation of the results. The authors may
summarize in the opening paragraphs what they believe the results mean. For
some trials, this may be obvious, but for others it may take considerable ef-
fort and insight. The “Discussion” section should comment on how the actual
trial conduct and results compared with what was expected in the protocol
and the planning process. Few trials turn out exactly as expected, so authors
should not be defensive or apologetic. Mistakes and disappointments should
also be discussed.
The “Discussion” section should put the results of the current trial in
the context of previous trials. Each trial contributes important information.
For example, are the results scientifically consistent with previous knowledge
about the treatment and are the results consistent with previous relevant tri-
als? Perhaps the current trial focused on a higher risk group or one of the
subgroups identified in a previous trial. For example, the MERIT-HF and
Table 12.3 Adverse event table (Table 4) from MERIT-HF study report (Hjalmarson et al. 2000). Copyright © 2000 American Medical Association. All rights reserved.
Authorship/Attribution
A clinical trial obviously requires contributions from many individuals, not all
of whom can be authors of the primary or later publications. Two practices
have emerged over the past decades to address this problem. One practice is
to publish manuscripts using a group authorship such as the “Coronary Drug
Project Research Group.”2 Many components of the trial research group, in-
cluding the actual writing team, are then identified in the appendix. This
practice has recently met with some resistance from medical journals, which
prefer to have specific individuals identified and accountable for the contents
of the article. In response, more recent trial reports have identified selected in-
dividuals to be authors, along with the trial research group name. The entire
trial research group is then enumerated as before in the appendix. A principal
investigator with perhaps one or two associates may be listed for each of the
participating clinical centers. Often the study chair, the steering committee,
and other key committees are also identified. The Data Monitoring Commit-
tee, if there was one, should be identified, as well as the statistical center or
data coordinating center. The study statistician might well be one of the lead
authors since that individual has contributed to the trial design and much of
the technical analysis and has participated in the drafting of the primary and
secondary papers. This does, however, require special effort by the statistician,
but this effort is necessary for the publication to be of the highest quality.
It is important that agreement regarding authorship be reached and documented
in the protocol or another planning document before the trial is complete. For
those in academia as well as for government and industry, recognition of scholarly
achievement, as measured by authorship on publications, is important. Allowance
should also be made for those investigators who have either successfully recruited
the most subjects or made extraordinary efforts on key committees. Our experience
is that these issues can be addressed most satisfactorily if done in advance.

1 MERIT-HF Study Group (1999), Hjalmarson, Goldstein, Fagerberg, et al. (2000), CIBIS-II Investigators (1999)
2 The Coronary Drug Project Research Group (1975, 1980)
12.3 Problems
12.1 Find an article in a medical journal that reports the results of a ran-
domized clinical trial. Try to identify the components of a good clinical
trial publication as discussed in this chapter. Discuss which elements
were included by the authors and which are missing or inadequate.
APPENDIX A

Delta Method, ML Theory, and Information
One simple technique commonly used for deriving variance estimators for
functions of random variables is called the delta method. Suppose that a
random variable X has mean µ and variance σ 2 . Suppose further that we
construct a new random variable Y by transforming X, Y = g(X), for a
continuously differentiable function g(·). Then, by Taylor’s theorem, g(X) =
g(µ) + g′(µ)(X − µ) + O((X − µ)²). Ignoring the higher order terms, we have
that

E[g(X)] ≈ g(µ)  and  Var[g(X)] ≈ g′(µ)² σ².
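As a quick numerical illustration of the approximation, take g(X) = log X, for which the delta method gives Var[log X] ≈ σ²/µ². A minimal simulation check (the values of µ and σ are chosen arbitrarily):

```python
import numpy as np

rng = np.random.default_rng(0)
mu, sigma = 10.0, 0.5

# Delta method: Var[log X] is approximately (g'(mu))^2 * sigma^2 = sigma^2 / mu^2
approx = sigma**2 / mu**2

# Simulation check (X stays positive here since sigma is small relative to mu)
x = rng.normal(mu, sigma, size=1_000_000)
print(f"delta method: {approx:.6f}, simulated: {np.log(x).var():.6f}")
```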
The maximum likelihood estimate (MLE) of θ is the value of θ, θ̂, that maxi-
mizes LY(θ), or equivalently, the value of θ for which

UY(θ) = ∂ log LY(θ)/∂θ = 0.   (A.1)
For the cases we will consider, this equation will have a unique solution. The
function UY (θ) is known as the efficient score for θ. It can be shown that
Eθ UY (θ) = 0 where Eθ denotes expectation when Y has distribution fY (y; θ).
An important special case is the one in which fY(y; θ) is part of an expo-
nential family. That is, fY(y; θ) = exp(η(x, θ)T(y) − A(η(x, θ))) where x is a
subject level covariate (possibly vector valued), η(x, θ) is a function of the co-
variate and the parameter θ, and A(η) is a function of η which forces fY(y; θ)
to integrate to one. In this case, A′(η) = E T(y) and the score function has the
form UY(θ) = Σᵢ B(x, θ)(T(y) − A′(η)), where B(x, θ) = ∂η(x, θ)/∂θ. Hence,
the score function is a linear combination of the centered, transformed obser-
vations T(y) − E T(y), and the solution to the score equations (A.1) satisfies
Σᵢ B(x, θ)T(y) = Σᵢ B(x, θ)E T(y).
The expected value of the derivative of −UY(θ) with respect to θ is known
as the Fisher information, I(θ). In the one dimensional case it can be shown
that

I(θ) = −Eθ[∂² log LY(θ)/∂θ²] = Eθ[UY(θ)²] = Varθ[UY(θ)].
In many situations, we cannot compute I(θ) directly, but will need to use an
estimate obtained from the data. It can be shown that under modest regularity
conditions, θ̂ ∼ᵃ N(θ, I⁻¹(θ)), where ∼ᵃ indicates the asymptotic distribution;
i.e., asymptotically θ̂ has a normal distribution with mean θ and variance
I⁻¹(θ). Note that I(θ) is the expected curvature of the log-likelihood function
at the true value of θ. Larger values of I(θ) indicate that the likelihood
function is more sharply peaked, and therefore, estimates of θ are more precise.
These results can be generalized to the case where θ is a p-dimensional
vector with no difficulty. In this case the score, UY(θ), is a p-dimensional
vector of partial derivatives, and the Fisher information, I(θ), is a matrix
of second partial derivatives. The asymptotic covariance matrix of θ̂ is the
matrix inverse I⁻¹(θ).
[Figure A.1: the likelihood function, marking the likelihood given θ = θ̂ and the likelihood given θ = θ₀.]
1. Likelihood Ratio Test (LRT). The test statistic is twice the log-likelihood
ratio:

2 log[LY(θ̂, ν̂)/LY(θ₀, ν̃)] ∼ᵃ χ²ₚ under H₀

where χ²ₚ is the χ² distribution with p degrees of freedom. Figure A.1 il-
lustrates the principle underlying the LRT when there are no nuisance
parameters. Since (θ̂, ν̂) is the MLE, the likelihood evaluated at (θ̂, ν̂) is at
least as large as the likelihood evaluated at (θ₀, ν̃). If the likelihood ratio is
large enough, there is strong evidence from the data that the observations
do not arise from the distribution fY(y; θ₀).
2. Wald test. Since the upper left block of the matrix I(θ)⁻¹ is the asymptotic
covariance matrix of θ̂, we have that

(θ̂ − θ₀)ᵀ(Iθ,θ − Iθ,ν Iν,ν⁻¹ Iν,θ)(θ̂ − θ₀) ∼ᵃ χ²ₚ under H₀,
[Figure A.2: two panels comparing the Wald statistic, (θ̂ − θ₀)²/Var(θ̂), and the score statistic, U²/Var(U), each with 2 log L(θ̂)/L(θ₀).]
Figure A.2 Illustrations showing Wald and score tests as quadratic approximations
to the likelihood ratio test. The solid line represents the log-likelihood function and
the dashed line represents the quadratic approximation.
= π₁(1 − π₁)/n₁ + (π₁ + ∆)(1 − π₁ − ∆)/n₂.

Note that this is identical to the variance obtained directly using the binomial
variance, Var(yᵢ/nᵢ) = πᵢ(1 − πᵢ)/nᵢ. The variance estimate is obtained by
replacing π₁ and ∆ by their estimates.
Therefore, the Wald test statistic for H₀ has form

(y₂/n₂ − y₁/n₁)² / [π̂₁(1 − π̂₁)/n₁ + (π̂₁ + ∆̂)(1 − π̂₁ − ∆̂)/n₂].
The score test is similar. Under H₀, we have

U∆,Y(0, π̃₁) = y₂/π̃₁ − (n₂ − y₂)/(1 − π̃₁)
            = (y₂ − n₂π̃₁)/[π̃₁(1 − π̃₁)]

which is proportional to the observed number of events in group 2 minus
the expected number under H₀. Under H₀, the variance of U∆,Y(0, π̃₁) is
I∆,∆ − I∆,π₁ Iπ₁,π₁⁻¹ Iπ₁,∆ = (1/n₁ + 1/n₂)⁻¹/[π₁(1 − π₁)], so the score test
statistic is

[(y₂ − n₂π̃₁)/(π̃₁(1 − π̃₁))]² π̃₁(1 − π̃₁)(1/n₁ + 1/n₂)
  = (y₂ − n₂(y₁ + y₂)/(n₁ + n₂))²(n₁ + n₂)³ / [n₁n₂(y₁ + y₂)(n₁ + n₂ − y₁ − y₂)]

which is the (uncorrected) Pearson χ² test statistic. It can be shown that the
score statistic can also be written Σ(O − E)²/E, where O and E are as in
equation (A.3).
Note that the Wald test and the score test have similar form; the difference
between them is that the Wald test uses the variance computed under H₁,
while the score test uses the variance computed under H₀. □
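This relationship is easy to see numerically. A minimal sketch for two binomial samples (the counts are invented for illustration):

```python
# Invented data: y events among n subjects in each group
y1, n1 = 30, 200
y2, n2 = 48, 200

p1, p2 = y1 / n1, y2 / n2
delta_hat = p2 - p1                       # estimated risk difference

# Wald statistic: variance evaluated at the unrestricted estimates (H1)
var_h1 = p1 * (1 - p1) / n1 + p2 * (1 - p2) / n2
wald = delta_hat**2 / var_h1

# Score statistic: variance evaluated at the pooled estimate (H0);
# this equals the uncorrected Pearson chi-square statistic
p_pooled = (y1 + y2) / (n1 + n2)
var_h0 = p_pooled * (1 - p_pooled) * (1 / n1 + 1 / n2)
score = delta_hat**2 / var_h0

print(f"Wald: {wald:.3f}, score (Pearson chi-square): {score:.3f}")
```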
In most situations there is no closed form solution for the MLE, θ̂, and finding
the solution requires an iterative procedure. One commonly used method for
finding θ̂ is the Newton-Raphson procedure.
We begin with an initial guess θ̂₀, from which we generate a sequence of
estimates θ̂₁, θ̂₂, θ̂₃, . . . until convergence is achieved. For i = 1, 2, 3, . . ., θ̂ᵢ is
derived from θ̂ᵢ₋₁ by first applying Taylor’s theorem. We have that

UY(θ) = UY(θ̂ᵢ₋₁) + U′Y(θ̂ᵢ₋₁)(θ − θ̂ᵢ₋₁) + higher order terms.

Ignoring the higher order terms, we set the above equal to zero and solve for
θ, yielding

θ̂ᵢ = θ̂ᵢ₋₁ − UY(θ̂ᵢ₋₁)/U′Y(θ̂ᵢ₋₁).
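A minimal sketch of the iteration for a model in which the answer is also available in closed form, so convergence can be checked directly (Poisson counts with mean exp(µ), for which the MLE is µ̂ = log ȳ; the data are simulated for illustration):

```python
import numpy as np

def newton_raphson_poisson(y, mu0=0.0, tol=1e-10, max_iter=50):
    """Newton-Raphson for Poisson data with mean exp(mu):
    score U(mu) = sum(y) - n*exp(mu) and derivative U'(mu) = -n*exp(mu)."""
    n, s = len(y), float(np.sum(y))
    mu = mu0
    for _ in range(max_iter):
        score = s - n * np.exp(mu)
        deriv = -n * np.exp(mu)
        step = -score / deriv          # Newton-Raphson update
        mu += step
        if abs(step) < tol:            # stop when the update is negligible
            break
    return mu

rng = np.random.default_rng(2)
y = rng.poisson(3.5, size=500)
print(newton_raphson_poisson(y), np.log(y.mean()))   # the two should agree
```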
Example A.7. Continuing Example A.6 above, suppose that Tij and yij are
observed for subject i at times t1 and t2 , respectively, t1 < t2 . Note that if
yi1 = 1, then yi2 = 1 and Ti2 = Ti1 . Furthermore, if additional subjects are
enrolled between times t1 and t2 , we can consider all subjects to be observed
at time t1 by letting yi1 = 0 and Ti1 = 0 for these subjects.
Therefore,

UY₂(µ) − UY₁(µ) = Σᵢ (yᵢ₂ − exp(µ)Tᵢ₂) − Σᵢ (yᵢ₁ − exp(µ)Tᵢ₁)
                = Σᵢ [yᵢ₂ − yᵢ₁ − exp(µ)(Tᵢ₂ − Tᵢ₁)].
Under H₀: β = 0, µ̂ = log(Σᵢ yᵢ / Σᵢ Tᵢ), and Uβ,Y(µ̂) = Σᵢ zᵢ(yᵢ − exp(µ̂)Tᵢ),
which is the sum of the observed minus expected under H₀ in the group
zᵢ = 1. The Fisher information for the parameter β has the form given by
the inverse of the upper left entry of the partitioned matrix in equation (A.3).
We have Iβ,β = Iβ,µ = Iµ,β = Σᵢ zᵢ exp(µ̂)ETᵢ = Σᵢ zᵢ(1 − exp(−λ̂Cᵢ)) and
Iµ,µ = Σᵢ (1 − exp(−λ̂Cᵢ)), so

I(β|µ̂) = Iβ,β − I²β,µ/Iµ,µ
        = [Σᵢ zᵢ(1 − exp(−λ̂Cᵢ))][Σᵢ (1 − zᵢ)(1 − exp(−λ̂Cᵢ))] / Σᵢ (1 − exp(−λ̂Cᵢ)).
the log-likelihood is

log LY(µ) = −K(σ², ρ) − Σⱼ (1/2σ²)(yⱼ − µJmⱼ)ᵀ[Imⱼ − (ρ/(1 + mⱼρ))JmⱼJmⱼᵀ](yⱼ − µJmⱼ).
The score function is

UY(µ) = Σⱼ (1/σ²) Jmⱼᵀ[Imⱼ − (ρ/(1 + mⱼρ))JmⱼJmⱼᵀ](yⱼ − µJmⱼ)
      = Σⱼ (1/(σ²(1 + mⱼρ))) Jmⱼᵀ(yⱼ − µJmⱼ).
[Figure: a sample path of a Brownian motion process, W(t) plotted against t.]
References
(1985). Extracorporeal circulation in neonatal respiratory failure: A prospective
randomized study. Pediatrics 76, 479–487.
Beresford, S., B. Thompson, Z. Feng, A. Christianson, D. McLerran, and D. Patrick
(2001). Seattle 5 a day worksite program to increase fruit and vegetable consump-
tion. Preventive Medicine 32, 230–238.
Bergner, M., R. Bobbit, W. Pollard, D. Martin, and B. Gilson (1976). The sickness
impact profile: validation of a health status measure. Med Care 14, 57–67.
Beta-Blocker Heart Attack Trial Research Group (1982). A randomized trial of
propranolol in patients with acute myocardial infarction. Journal of the American
Medical Association 247, 1707.
Billingham, L., K. Abrams, and D. Jones (1999). Methods for the analysis of quality-
of-life and survival data in health technology assessment. Health Technol Assess
3 (10), 1–152.
Billingsley, P. (1995). Probability and Measure. John Wiley & Sons.
Blackwell, D. and J. Hodges Jr. (1957). Design for the control of selection bias. Ann
Math Stat 28, 449–460.
Bolland, K. and J. Whitehead (2000). Formal approaches to safety monitoring of
clinical trials in life-threatening conditions. Statistics in Medicine 19 (21), 2899–
2917.
Bollen, K. (1989). Structural Equations with Latent Variables. New York: Wiley.
Box, G. E. P. and N. R. Draper (1987). Empirical Model-building and Response
Surfaces. John Wiley & Sons.
Bracken, M. B. and J. C. Sinclair (1998). When can odds ratios mislead? Avoidable
systematic error in estimating treatment effects must not be tolerated. [letter;
comment]. BMJ 317 (7166), 1156–7.
Breslow, N. E. and N. E. Day (1980). Statistical Methods in Cancer Research
Volume I: The Analysis of Case-Control Studies. Oxford University Press, Oxford.
Breslow, N. E. and N. E. Day (1988). Statistical Methods in Cancer Research
Volume II: The Design and Analysis of Cohort Studies. Oxford University Press, Oxford.
Bristow, M., L. Saxon, J. Boehmer, S. Krueger, D. Kass, T. DeMarco, P. Car-
son, L. DiCarlo, D. DeMets, B. White, D. DeVries, and A. Feldman (2004).
Cardiac-resynchronization therapy with or without an implantable defibrillator
in advanced chronic heart failure. New England Journal of Medicine 350, 2140.
Brown, B. (1980). The crossover experiment for clinical trials. Biometrics 36, 69.
Bull, J. (1959). The historical development of clinical therapeutic trials. J Chron
Dis 10, 218–248.
Burzykowski, T., G. Molenberghs, and M. Buyse (Eds.) (2005). The Evaluation of
Surrogate Endpoints. New York: Springer.
Byar, D. P., A. M. Herzberg, and W.-Y. Tan (1993). Incomplete factorial designs
for randomized clinical trials. Statistics in Medicine 12, 1629–1641.
Byar, D. P., R. M. Simon, M. D. Friedewald, J. J. Schlesselman, D. L. DeMets,
J. H. Ellenberg, M. H. Gail, and J. H. Ware (1976). Randomized clinical trials,
perspectives on some recent ideas. New England Journal of Medicine 295, 74–80.
Califf, R., S. Karnash, and L. Woodlief (1997). Developing systems for cost-effective
auditing of clinical trials. Controlled Clinical Trials 18, 651–660.
Cardiac Arrhythmia Suppression Trial II Investigators (1992). Effects of the antiar-
rhythmic agent moricizine on survival after myocardial infarction. N Engl J Med
327, 227–233.
Carmines, E. and R. Zeller (1979). Reliability and Validity Assessment Series: Quan-
titative applications in the social sciences, Volume 17. Sage University Press.
Carriere, K. (1994). Crossover designs for clinical trials. Statistics in Medicine 13,
1063–1069.
Carroll, R. J. (2004). Discussion of two important missing data issues. Statistica
Sinica 14 (3), 627–629.
Carroll, R. J. and D. Ruppert (1988). Transformation and Weighting in Regression.
Chapman & Hall Ltd.
Carson, P., C. O’Connor, A. Miller, S. Anderson, R. Belkin, G. Neuberg,
J. Wertheimer, D. Frid, A. Cropp, and M. Packer (2000). Circadian rhythm and
sudden death in heart failure: results from prospective randomized amlodipine
survival trial. J Am Coll Cardiol 36, 541–546.
Case, L. and T. Morgan (2001). Duration of accrual and follow-up for two-stage
clinical trials. Lifetime Data Analysis 7, 21–38.
CASS Principal Investigators and their Associates (1983). Coronary artery surgery
study (CASS): a randomized trial of coronary artery bypass surgery. Circulation
68, 939.
Cella, D., D. Tulsky, G. Gray, B. Sarafin, E. Lin, A. Bonomi, et al. (1993). The
functional assessment of cancer therapy scale: development and validation of the
general measure. Journal of Clinical Oncology 11 (3), 570–579.
Chalmers, T., P. Celano, H. Sacks, and H. Smith Jr. (1983). Bias in treatment
assignment in controlled clinical trials. N Engl J Med 309, 1358–1361.
Chalmers, T., R. Matta, H. Smith, and A. Kunzier (1977). Evidence favoring the
use of anticoagulants in the hospital phase of acute myocardial infarction. N Engl
J Med 297, 1091–1096.
Chang, M. and P. O’Brien (1986). Confidence intervals following group sequential
tests. Controlled Clinical Trials 7, 18–26.
Chang, M. N., A. L. Gould, and S. M. Snapinn (1995). P-values for group sequential
testing. Biometrika 82, 650–654.
Chassan, J. (1970). A note on relative efficiency in clinical trials. J Clin Pharmacol
10, 359–360.
Chen, J., D. DeMets, and K. Lan (2000). Monitoring multiple doses in a com-
bined phase II/III clinical trial: Multivariate test procedures (dropping the loser
designs). Technical Report 155, Department of Biostatistics and Medical Infor-
matics, University of Wisconsin, Madison.
Chen, J., D. DeMets, and K. Lan (2003). Monitoring mortality at interim analyses
while testing a composite endpoint at the final analysis. Controlled Clinical Trials
23, 16–27.
Cheung, K. and R. Chappell (2000). Sequential designs for phase I clinical trials
with late-onset toxicities. Biometrics 56, 1177–1182.
Chi, Y.-Y. and J. G. Ibrahim (2006). Joint models for multivariate longitudinal and
multivariate survival data. Biometrics 62 (2), 432–445.
Chi, Y.-Y. and J. G. Ibrahim (2007). Bayesian approaches to joint longitudinal and
survival models accommodating both zero and nonzero cure fractions. Statistica
Sinica 17, 445–462.
CIBIS-II Investigators (1999). The Cardiac Insufficiency Bisoprolol Study II (CIBIS-
II): a randomised trial. Lancet 353, 1–13.
Cochran, W. and G. Cox (1957). Experimental Designs: 2nd Ed. New York: John
Wiley and Sons.
Cohn, J., S. Goldstein, B. Greenberg, B. Lorell, R. Bourge, B. Jaski, S. Gottlieb,
F. McGrew, D. DeMets, and B. White (1998). A dose-dependent increase in
mortality with vesnarinone among patients with severe heart failure. New England
Journal of Medicine 339, 1810–1816.
Cook, T. D. (2000). Adjusting survival analysis for the presence of non-adjudicated
study events. Controlled Clinical Trials 21 (3), 208–222.
Cook, T. D. (2002a). P-value adjustment in sequential clinical trials. Biometrics
58 (4), 1005–1011.
Cook, T. D. (2002b). Up with odds ratios! a case for the use of odds ratios when
outcomes are common. Academic Emergency Medicine 9, 1430–1434.
Cook, T. D. (2003). Methods for mid-course corrections in clinical trials with survival
outcomes. Statistics in Medicine 22 (22), 3431–3447.
Cook, T. D., R. J. Benner, and M. R. Fisher (2006). The WIZARD trial as a case
study of flexible clinical trial design. Drug Information Journal 40, 345–353.
Cook, T. D. and M. R. Kosorok (2004). Analysis of time-to-event data with incom-
plete event adjudication. JASA 99 (468), 1140–1152.
Coronary Drug Project Research Group (1981). Practical aspects of decision making
in clinical trials: the coronary drug project as a case study. Controlled Clinical
Trials 1, 363–376.
Coumadin Aspirin Reinfarction Study (CARS) Investigators (1997). Randomised
double-blind trial of fixed low-dose warfarin with aspirin after myocardial infarc-
tion. Lancet 350, 389–96.
Cox, D. (1958). Planning of Experiments. New York: John Wiley and Sons.
Cox, D. (1972). Regression models and life tables (with discussion). Journal of the
Royal Statistical Society, Series B 34, 187–220.
Cox, D. R. and D. Oakes (1984). Analysis of Survival Data. Chapman & Hall Ltd.
Crosby, R., R. Kolotkin, and G. Williams (2003). Defining clinically meaningful
change in health-related quality of life. Journal of Clinical Epidemiology 56, 395–
407.
Cui, L., H. M. J. Hung, and S.-J. Wang (1999). Modification of sample size in group
sequential clinical trials. Biometrics 55, 853–857.
Cusick, M., A. Meleth, E. Agron, M. Fisher, G. Reed, F. Ferris III, E. Chew, and
the Early Treatment Diabetic Retinopathy Study Research Group (2005). As-
sociations of mortality and diabetes complications in patients with Type 1 and
Type 2 Diabetes. Early Treatment Diabetic Retinopathy Study Report No. 27.
Diabetes Care 28, 617–625.
Cytel Software Corporation (2000). EaSt–2000: Software for the Design and Interim
Monitoring of Group Sequential Clinical Trials. Cambridge, MA.
D’Agostino, R. B. (1990). Comments on “Yates’s correction for continuity and the
analysis of 2 × 2 contingency tables”. Statistics in Medicine 9, 377–378.
Dahlstrom, W., G. Welsh, and L. Dahlstrom (1972). An MMPI Handbook, Vol I.
University of Minnesota, Minneapolis.
Davies, H. T. O., I. K. Crombie, and M. Tavakoli (1998). When can odds ratios
mislead? BMJ 316, 989–91.
Dawber, T., G. Meadors, and F. Moore (1951). Epidemiological approaches to heart
disease: the Framingham study. Am J Public Health 41, 279–286.
Deeks, J. J. (1998). When can odds ratios mislead? Odds ratios should be used only
in case-control studies and logistic regression analyses. [letter; comment]. BMJ
317 (7166), 1156–7.
DeGruttola, V., P. Clax, D. DeMets, G. Downing, S. Ellenberg, L. Friedman, M. Gail,
R. Prentice, J. Wittes, and S. Zeger (2001). Considerations in the evaluation of
surrogate endpoints in clinical trials: Summary of a National Institutes of Health
workshop. Control Clin Trials 22, 485–502.
DeMets, D., T. Fleming, R. Whitley, et al. (1995). The Data and Safety Moni-
toring Board and Acquired Immune Deficiency Syndrome (AIDS) clinical trials.
Controlled Clinical Trials 16 (6), 408–421.
DeMets, D. and M. Halperin (1982). Early stopping in the two-sample problem for
bounded random variables. Controlled Clinical Trials 3, 1.
DeMets, D., R. Hardy, L. Friedman, and K. Lan. (1984). Statistical aspects of
early termination in the beta-blocker heart attack trial. Control Clin Trials 5 (4),
362–372.
DeMets, D. and J. Ware (1980). Group sequential methods for clinical trials with a
one-sided hypothesis. Biometrika 67, 651–60.
DeMets, D. and J. Ware (1982). Asymmetric group sequential boundaries for mon-
itoring clinical trials. Biometrika 69, 661–663.
DeMets, D. L. and R. M. Califf (2002). Principles from clinical trials relevant to
clinical practice: Part I. Circulation 106 (8), 1115–1121.
DeMets, D. L., S. J. Pocock, and D. G. Julian (1999). The agonizing negative trend
in monitoring of clinical trials. Lancet 354, 1983–88.
Dempster, A., N. Laird, and D. Rubin (1977). Maximum likelihood from incomplete
data via the EM algorithm. Journal of the Royal Statistical Society, Series B 39,
1–38.
Diabetes Control and Complications Trial Research Group (1993). The effect of
intensive treatment of diabetes on the development and progression of long-term
complications in insulin-dependent diabetes mellitus. N Engl J Med 329, 977–986.
Diabetic Retinopathy Study Research Group (1976). Preliminary report on effects
of photocoagulation therapy. Amer J Ophthal 81, 383.
Dickstein, K., J. Kjekshus, the OPTIMAAL Steering Committee, et al. (2002). Ef-
fects of losartan and captopril on mortality and morbidity in high-risk patients
after acute myocardial infarction: the OPTIMAAL randomised trial. The Lancet
360 (9335), 752–760.
Diggle, P. and M. Kenward (1994). Informative drop-out in longitudinal data
analysis. Applied Statistics 43, 49–93.
Diggle, P., K. Liang, and S. Zeger (1994). Analysis of Longitudinal Data. New York:
Oxford Science Publications.
Domanski, M., G. Mitchell, M. Pfeffer, J. Neaton, J. Norman, K. Svendsen,
R. Grimm, J. Cohen, J. Stamler, and for the MRFIT Research Group (2002).
Pulse pressure and cardiovascular disease-related mortality: Follow-up study of
the Multiple Risk Factor Intervention Trial (MRFIT). JAMA 287, 2677–2683.
Ederer, F., T. R. Church, and J. S. Mandel (1993). Sample sizes for prevention trials
have been too small. American Journal of Epidemiology 137 (7), 787–796.
Efron, B. (1971). Forcing a sequential experiment to be balanced. Biometrika 58,
403–417.
Efron, B. (1982). The Jackknife, the Bootstrap and Other Resampling Plans. SIAM
[Society for Industrial and Applied Mathematics].
Efron, B. and D. Feldman (1991). Compliance as an explanatory variable in clinical
trials (with discussion). Journal of the American Statistical Association 86, 9–26.
Eisenstein, E. L., P. W. Lemons II, B. E. Tardiff, K. A. Schulman, M. K. Jolly, and
R. M. Califf (2005). Reducing the costs of phase III cardiovascular clinical trials.
American Heart Journal 149 (3), 482–488.
Ellenberg, S., T. Fleming, and D. DeMets (2002). Data Monitoring Committees in
Clinical Trials: A Practical Perspective. Wiley.
Ellenberg, S. and R. Temple (2000). Placebo-controlled trials and active-control
trials in the evaluation of new treatments. Part 2: Practical issues and specific
cases. Ann Int Med 133, 464–470.
Emerson, S. and T. Fleming (1989). Symmetric group sequential test designs. Bio-
metrics 45, 905–923.
Emerson, S. and T. Fleming (1990). Parameter estimation following group sequential
hypothesis testing. Biometrika 77, 875–892.
Fairbanks, K. and R. Madsen (1982). P values for tests using a repeated significance
test design. Biometrika 69, 69–74.
Fairclough, D. (2002). Design and Analysis of Quality of Life Studies in Clinical
Trials. Chapman and Hall/CRC.
Fairclough, D., H. Peterson, and V. Chang (1998). Why are missing quality of life
data a problem in clinical trials of cancer therapy? Statistics in Medicine 17,
667–677.
Fayers, P. and D. Hand (1997). Factor analysis, causal indicator and quality of life.
Quality of Life Research 8, 130–150.
Fayers, P. and R. Hays (Eds.) (2005). Assessing Quality of Life in Clinical Trials:
Methods and Practice (2nd edition). Oxford University Press.
Fayers, P. and D. Machin (2000). Quality of Life: Assessment, Analysis and Inter-
pretation. Chichester, England: John Wiley & Sons.
Feldman, A. M., M. R. Bristow, W. W. Parmley, et al. (1993). Effects of vesnarinone
on morbidity and mortality in patients with heart failure. New England Journal
of Medicine 329, 149–155.
Fienberg, S. E. (1980). The Analysis of Cross-classified Categorical Data. MIT
Press.
Fisher, B., S. Anderson, C. Redmond, N. Wolmark, D. Wickerham, and W. Cronin
(1995). Reanalysis and results after 12 years of follow-up in a randomized clinical
trial comparing total mastectomy with lumpectomy with or without irradiation
in the treatment of breast cancer. N Engl J Med 333, 1456–1461.
Fisher, B., J. Constantino, D. Wickerham, C. Redmond, M. Kavanah, W. Cronin,
V. Vogel, A. Robidoux, N. Dimitrov, J. Atkins, M. Daly, S. Wieand, E. Tan-
Chiu, L. Ford, N. Wolmark, other National Surgical Adjuvant Breast, and B. P.
Investigators (1998). Tamoxifen for prevention of breast cancer: Report of the
National Surgical Adjuvant Breast and Bowel Project P-1 study. JNCI 90, 1371–
1388.
Fisher, L. (1999). Carvedilol and the Food and Drug Administration (FDA) approval
process: the FDA paradigm and reflections on hypothesis testing. Control Clin
Trials 20 (1), 16–39.
Fisher, L., D. Dixon, J. Herson, R. Frankowski, M. Hearron, and K. Peace (1990).
Intention to treat in clinical trials. In K. Peace (Ed.), Statistical Issues in Drug
Research and Development. Marcel Dekker, Inc., New York.
Fisher, L. and L. Moyé (1999). Carvedilol and the Food and Drug Administration
approval process: an introduction. Control Clin Trials 20, 1–15.
Fisher, L. D. (1998). Self-designing clinical trials. Statistics in Medicine 17, 1551–
1562.
Fisher, M. R., E. B. Roecker, and D. L. DeMets (2001). The role of an independent
statistical analysis center in the industry-modified National Institutes of Health
model. Drug Information Journal 35, 115–129.
Fisher, R. (1925). Statistical Methods for Research Workers. Edinburgh: Oliver and
Boyd.
Fisher, R. (1926). The arrangement of field experiments. J Min Agric G Br 33,
503–513.
Fisher, R. (1935). The Design of Experiments. Edinburgh: Oliver and Boyd.
Fitzmaurice, G. M., N. M. Laird, and J. H. Ware (2004). Applied longitudinal
analysis. Wiley-Interscience.
Fleiss, J. L., J. T. Bigger Jr., M. McDermott, J. P. Miller, T. Moon, A. J. Moss,
D. Oakes, L. M. Rolnitzky, and T. M. Therneau (1990). Nonfatal myocardial
infarction is, by itself, an inappropriate end point in clinical trials in cardiology.
Circulation 81 (2), 684–685.
Fleming, T. (1982). One-sample multiple testing procedures for phase II clinical
trials. Biometrics 38, 143–151.
Fleming, T. (1990). Evaluation of active control trials in aids. Journal of Acquired
Immune Deficiency Syndromes 3 Suppl, S82–87.
Fleming, T. (1992). Evaluating therapeutic interventions: some issues and experi-
ences. Statistical Science 7, 428–441.
Fleming, T. (1994). Surrogate markers in AIDS and cancer trials. Statistics in
Medicine 13, 1423–1435.
Fleming, T. (2000). Design and interpretation of equivalence trials. Amer Heart J
139, S171–S176.
Fleming, T. (2007). Design and interpretation of equivalence trials. Statistics in
Medicine 26, to appear.
Fleming, T. and D. DeMets (1996). Surrogate endpoints in clinical trials: Are we
being misled? Ann Intern Med 125, 605–613.
Fleming, T. R. and D. P. Harrington (1991). Counting Process and Survival Analysis.
New York: Wiley.
Fowler, F. (1995). Improving Survey Questions. Applied Social Research Methods
Series, Volume 38. Sage Publications.
Fowler, F. (2002). Survey Research Methods. Applied Social Research Methods Series
(3rd Edition). Sage Publications.
Franciosa, J. A., A. L. Taylor, J. N. Cohn, et al. (2002). African-American Heart
Failure Trial (A-HeFT): rationale, design, and methodology. Journal of Cardiac
Failure 8 (3), 128–135.
Frangakis, C. E. and D. B. Rubin (2002). Principal stratification in causal inference.
Biometrics 58 (1), 21–29.
Freedman, L., B. Graubard, and A. Schatzkin (1992). Statistical validation of inter-
mediate endpoints for chronic diseases. Statistics in Medicine 11, 167–178.
Freireich, E. J., E. Gehan, E. Frei III, et al. (1963). The effect of 6-mercaptopurine
on the duration of steroid-induced remissions in acute leukemia: a model for eval-
uation of other potentially useful therapy. Blood 21, 699–716.
Friedman, L., C. Furberg, and D. DeMets (1985). Fundamentals of Clinical Trials
(2 ed.). PSG Pub. Co.
Friedman, L. M., C. Furberg, and D. L. DeMets (1998). Fundamentals of Clinical
Trials. Springer-Verlag Inc.
Frison, L. and S. J. Pocock (1992). Repeated measures in clinical trials: Analy-
sis using mean summary statistics and its implications for design. Statistics in
Medicine 11, 1685–1704.
Gabriel, K. R. (1969). Simultaneous test procedures – some theory of multiple
comparisons. The Annals of Mathematical Statistics 40, 224–250.
Gallo, P., C. Chuang-Stein, V. Dragalin, B. Gaydos, M. Krams, and J. Pinheiro
(2006). Adaptive designs in clinical drug development—an executive summary of
the pharma working group. J Biopharm Stat 16, 275–283.
Gange, S. J. and D. L. DeMets (1996). Sequential monitoring of clinical trials with
correlated responses. Biometrika 83, 157–167.
Gao, P. and J. Ware (To appear). Assessing non-inferiority: a combination approach.
Statistics in Medicine.
Garrett, E.-M. (2006). The continual reassessment method for dose-finding studies:
a tutorial. Clinical Trials 3, 57–71.
Gart, J., D. Krewski, P. Lee, R. Tarone, and J. Wahrendorf (Eds.) (1986). Statistical
Methods in Cancer Research 3: the Design and Analysis of Long-Term Animal
Experiments. Oxford University Press, Oxford.
Gehan, E. (1961). The determination of the number of patients required in a prelim-
inary and a follow-up trial of a new chemotherapeutic agent. Journal of Chronic
Diseases 13, 346–353.
Gehan, E. (1984). The evaluation of therapies: historical control studies. Statistics
in Medicine 3, 315–324.
Gent, M. and D. Sackett (1979). The qualification and disqualification of patients
and events in long-term cardiovascular clinical trials. Thromb Haemost 41 (1),
123–134.
Gillen, D. L. and S. S. Emerson (2005). A note on P-values under group sequential
testing and nonproportional hazards. Biometrics 61 (2), 546–551.
Goldberg, P. (2006). Phase 0 trials signal FDA’s new reliance on biomarkers in drug
development. The Cancer Letter 32, 1–2.
Goldhirsch, A., R. Gelber, R. Simes, P. Glasziou, and A. Coates (1989). Costs and
benefits of adjuvant therapy in breast cancer: a quality-adjusted survival analysis.
Journal of Clinical Oncology 7, 36–44.
Gooley, T., P. Martin, L. Fisher, and M. Pettinger (1994). Simulation as a design
tool for phase I/II clinical trials: an example from bone marrow transplantation.
Control Clin Trials 15, 450–462.
Grady, D., W. Applegate, T. Bush, C. Furberg, B. Riggs, and S. Hulley (1998).
Heart and estrogen/progestin replacement study (HERS): Design, methods, and
baseline characteristics. Control Clin Trials 19, 314–335.
Grambsch, P. M. and T. M. Therneau (1994). Proportional hazards tests and diag-
nostics based on weighted residuals. Biometrika 81, 515–526.
Grambsch, P. M., T. M. Therneau, and T. R. Fleming (1995). Diagnostic plots to
reveal functional form for covariates in multiplicative intensity models. Biometrics
51, 1469–1482.
Grizzle, J. (1965). The two-period change-over design and its use in clinical trials.
Biometrics 21, 467.
Grizzle, J. and D. Allen (1969). Analysis of growth and dose response curves. Bio-
metrics 25, 357–382.
Gruppo Italiano per lo Studio Della Sopravvivenze Nell’Infarcto Miocardico (GISSI)
(1986). Effectiveness of intravenous thrombolytic treatment in acute myocardial
infarction. Lancet I, 397–402.
Halperin, J. and Executive Steering Committee, SPORTIF III and V Study In-
vestigators (2003). Ximelagatran compared with warfarin for prevention of
thromboembolism in patients with nonvalvular atrial fibrillation: rationale, objec-
tives, and design of a pair of clinical studies and baseline patient characteristics
(SPORTIF III and V). American Heart Journal 146, 431–438.
Halperin, M., Lan, KKG, N. Johnson, and D. DeMets (1982). An aid to data
monitoring in long-term clinical trials. Controlled Clinical Trials 3, 311–323.
Halperin, M. and J. Ware (1974). Early decision in a censored Wilcoxon two-sample
test for accumulating survival data. Journal of the American Statistical Associa-
tion 69, 414–422.
Hardwick, J., M. Meyer, and Q. Stout (2003). Directed walk designs for dose-
response problems with competing failure modes. Biometrics 59, 229–236.
Hareyama, M., K. Sakata, A. Oouchi, H. Nagakura, M. Shido, M. Someya, and
K. Koito (2002). High-dose-rate versus low-dose-rate intracavitary therapy for
carcinoma of the uterine cervix: a randomized trial. Cancer 94, 117–124.
Harrington, D. and T. Fleming (1982). A class of rank test procedures for censored
survival data. Biometrika 69, 553–566.
Haybittle, J. (1971). Repeated assessment of results in clinical trials of cancer treat-
ment. Brit. J. Radiology 44, 793–797.
HDFP Cooperative Group (1982). Five-year findings of the hypertension detection
and follow-up program. III. Reduction in stroke incidence among persons with
high blood pressure. JAMA 247, 633–638.
Healy, B., L. Campeau, R. Gray, J. Herd, B. Hoogwerf, D. Hunninghake, G. Knat-
terud, W. Stewart, and C. White (1989). Conflict-of-interest guidelines for a
multicenter clinical trial of treatment after coronary-artery bypass-graft surgery.
N Engl J Med 320, 949–951.
Heart Special Project Committee (1998). Organization, review, and administra-
tion of cooperative studies (Greenberg Report). A report from the Heart Special
Project Committee to the National Advisory Heart Council, May 1967. Control
Clin Trials 9, 137–148.
Heitjan, D. F. (1999). Causal inference in a clinical trial: A comparative example.
Controlled Clinical Trials 20, 309–318.
Hennekens, C., J. Buring, J. Manson, M. Stampfer, B. Rosner, N. Cook, C. Belanger,
F. LaMotte, J. Gaziano, P. Ridker, W. Willett, and R. Peto (1996). Lack of effect
of long-term supplementation with beta-carotene on the incidence of malignant
neoplasms and cardiovascular disease. N Engl J Med 334, 1145–1149.
Hill, A. (1971). Principles of Medical Statistics (9 ed.). New York: Oxford University
Press.
Hill, A. B. (1965). The environment and disease: Association or causation? Pro-
ceedings of the Royal Society of Medicine 58, 295–300.
Hills, M. and P. Armitage (1979). The two-period cross-over clinical trial. Br J Clin
Pharmacol 8, 7–20.
Hjalmarson, Å., S. Goldstein, B. Fagerberg, et al. (2000). Effects of controlled-release
metoprolol on total mortality, hospitalizations, and well-being in patients with
heart failure: The metoprolol CR/XL randomized intervention trial in congestive
heart failure (MERIT-HF). JAMA 283, 1295–1302.
Hochberg, Y. (1988). A sharper Bonferroni procedure for multiple tests of significance.
Biometrika 75, 800–802.
Hodges, J. L. and E. L. Lehmann (1963). Estimates of location based on rank tests
(Ref: V42 p1450-1451). The Annals of Mathematical Statistics 34, 598–611.
Holland, P. W. (1986). Statistics and causal inference. Journal of the American
Statistical Association 81, 945–960.
Holm, S. (1979). A simple sequentially rejective multiple test procedure. Scandina-
vian Journal of Statistics 6, 65–70.
Hughes, M. (2002). Commentary: Evaluating surrogate endpoints. Control Clin
Trials 23, 703–707.
Hulley, S., D. Grady, T. Bush, C. Furberg, D. Herrington, B. Riggs, E. Vittinghoff,
and for the Heart and Estrogen/progestin Replacement Study (HERS) Research
Group (1998). Randomized trial of estrogen plus progestin for secondary preven-
tion of coronary heart disease in postmenopausal women. JAMA 280, 605–613.
Hung, H., S. Wang, and R. O’Neill (2005). A regulatory perspective on choice of
margin and statistical inference issue in non-inferiority trials. Journal of AIDS 3,
82–87.
Hung, H., S. Wang, Y. Tsong, J. Lawrence, and R. O’Neill (2003). Some fundamental
issues with non-inferiority testing in active controlled trials. Stat in Med 22, 213–
225.
Hürny, C., J. Bernhard, R. Joss, Y. Willems, F. Cavalli, J. Kiser, K. Brunner,
S. Favre, P. Alberto, A. Glaus, H. Senn, E. Schatzmann, P. Ganz, and U. Metzger
(1992). Feasibility of quality of life assessment in a randomized phase III trial of
small cell lung cancer. Ann Oncol 3, 825–831.
Hypertension Detection and Follow-up Program Cooperative Group (1979). Five-
year findings of the hypertension detection and follow-up program. I. Reduction
in mortality of persons with high blood pressure, including mild hypertension.
JAMA 242, 2562–2571.
Jennison, C. and B. W. Turnbull (1993). Group sequential tests for bivariate re-
sponse: interim analyses of clinical trials with both efficacy and safety endpoints.
Biometrics 49, 741–752.
Jennison, C. and B. W. Turnbull (2000). Group Sequential Methods with Applications
to Clinical Trials. CRC Press Inc.
Jennison, C. and B. W. Turnbull (2006). Efficient group sequential designs when
there are several effect sizes under consideration. Statistics in Medicine 25 (6),
917–932.
Jöreskog, K. and D. Sörbom (1996). LISREL 8: User’s Reference Guide. Chicago:
Scientific Software International.
Jung, S., M. Carey, and K. Kim (2001). Graphical search for two-stage designs for
phase II clinical trials. Control Clinical Trials 22, 367–372.
Kalbfleisch, J. and R. Prentice (1980). The Statistical Analysis of Failure Time
Data. New York: Wiley.
Kaplan, E. L. and P. Meier (1958). Nonparametric estimation from incomplete
observations. Journal of the American Statistical Association 53, 457–481.
Kaplan, R., T. Ganiats, W. Sieber, and J. Anderson (1998). The quality of well-
being scale: critical similarities and differences with SF-36. International Journal
for Quality in Health Care 10, 509–520.
Karrison, T., D. Huo, and R. Chappell (2003). A group sequential, response-adaptive
design for randomized clinical trials. Controlled Clin Trials 24 (5), 506–522.
Kempthorne, O. (1977). Why randomize? Journal of Statistical Planning and Inference
1, 1–25.
Kenward, M. G. (1998). Selection models for repeated measurements with non-
random dropout: An illustration of sensitivity. Statistics in Medicine 17, 2723–
2732.
Kim, K. (1989). Point estimation following group sequential tests. Biometrics 45,
613–617.
Kim, K. and D. DeMets (1987a). Confidence intervals following group sequential
tests in clinical trials. Biometrics 43, 857–864.
Kim, K. and D. DeMets (1987b). Design and analysis of group sequential tests based
on the type I error spending rate function. Biometrika 74, 149–154.
Klein, R., M. Davis, S. Moss, B. Klein, and D. DeMets (1985). The Wisconsin
Epidemiologic Study of Diabetic Retinopathy. A comparison of retinopathy in
younger and older onset diabetic persons. Advances in Experimental Medicine &
Biology 189, 321–335.
Klein, R., B. Klein, K. Linton, and D. DeMets (1991). The Beaver Dam Eye Study:
visual acuity. Ophthalmology 98, 1310–1315.
Klotz, J. (1978). Maximum entropy constrained balanced randomization for clinical
trials. Biometrics 34, 283.
Koch, G., I. Amara, B. Brown, T. Colton, and D. Gillings (1989). A two-period
crossover design for the comparison of two active treatments and placebo. Statis-
tics in Medicine 8, 487–504.
Kolata, G. (1980). FDA says no to Anturane. Science 208, 1130–1132.
Korn, E. (1993). On estimating the distribution function for quality of life in cancer
clinical trials. Biometrika 80, 535–542.
Koscik, R., J. Douglas, K. Zaremba, M. Rock, M. Spalingard, A. Laxova, and P. Far-
rell (2005). Quality of life of children with cystic fibrosis. Journal of Pediatrics
147 (3 Suppl), 64–68.
Krum, H., M. Bailey, W. Meyer, P. Verkenne, H. Dargie, P. Lechat, and S. Anker
(2006). Impact of statin therapy on clinical outcomes in chronic heart failure
patients according to beta-blocker use: Results of CIBIS-II. Cardiology 108, 28–
34.
Lachin, J. (1988a). Properties of simple randomization in clinical trials. Controlled
Clinical Trials 9, 312.
Lachin, J. (1988b). Statistical properties of randomization in clinical trials. Con-
trolled Clinical Trials 9, 289.
Lachin, J. (2000). Statistical considerations in the intent-to-treat principle. Con-
trolled Clin Trials 21, 167.
Lachin, J. M. (2005). Maximum information designs. Clinical Trials 2, 453–464.
Laird, N. and J. Ware (1982). Random effects models for longitudinal data. Bio-
metrics 38, 963–974.
Lakatos, E. (1986). Sample size determination in clinical trials with time-dependent
rates of losses and noncompliance. Controlled Clinical Trials 7, 189.
Lakatos, E. (1988). Sample size based on logrank statistic in complex clinical trials.
Biometrics 44, 229.
Lakatos, E. (2002). Designing complex group sequential survival trials. Statistics in
Medicine 21 (14), 1969–1989.
Lan, K. and D. DeMets (1983). Discrete sequential boundaries for clinical trials.
Biometrika 70, 659–663.
Lan, K., R. Simon, and M. Halperin (1982). Stochastically curtailed testing in long-
term clinical trials. Communications in Statistics, Series C 1, 207–219.
Lan, K. and J. Wittes (1988). The B-value: a tool for monitoring data. Biometrics
44, 579–585.
Lan, K. K. G. and D. C. Trost (1997). Estimation of parameters and sample size
re-estimation. In ASA Proceedings of the Biopharmaceutical Section, pp. 48–51.
American Statistical Association (Alexandria, VA).
Lan, K. K. G. and D. M. Zucker (1993). Sequential monitoring of clinical trials: the
role of information and Brownian motion. Statistics in Medicine 12, 753–765.
Landgraf, J., L. Abetz, and J. Ware (1996). The CHQ User’s Manual 1st ed. Boston,
MA: The Health Institute, New England Medical Center.
LaRosa, J., S. Grundy, D. Waters, C. Shear, P. Barter, J. Fruchart, et al. (2005).
Intensive lipid lowering with atorvastatin in patients with stable coronary disease.
N Engl J Med 352, 1425–1435.
Lawless, J. F. (2003). Statistical Models and Methods for Lifetime Data. John Wiley
& Sons.
Lee, E. T. (1992). Statistical Methods for Survival Data Analysis. John Wiley &
Sons.
Lee, J. W. and D. L. DeMets (1991). Sequential comparison of changes with repeated
measurements data. Journal of the American Statistical Association 86, 757–762.
Lewis, R. J., D. A. Berry, H. Cryer III, N. Fost, R. Krome, G. R. Washington,
J. Houghton, J. W. Blue, R. Bechhofer, T. Cook, and M. Fisher (2001). Moni-
toring a clinical trial conducted under the FDA regulations allowing a waiver of
prospective informed consent: The DCLHb traumatic hemorrhagic shock efficacy
trial. Annals of Emergency Medicine 38 (4), 397–404.
Li, Z. and D. L. DeMets (1999). On the bias of estimation of a Brownian motion
drift following group sequential tests. Statistica Sinica 9, 923–938.
Lin, D., T. Fleming, and V. DeGruttola (1997). Estimating the proportion of treat-
ment effect explained by a surrogate marker. Statistics in Medicine 16, 1515–1527.
Lind, J. (1753). A Treatise of the Scurvy. Edinburgh: University Press (reprint).
Lipid Research Clinics Program (1984). The Lipid Research Clinics Coronary Pri-
mary Prevention Trial results. 1. Reduction in incidence of coronary heart disease.
JAMA 251, 351–364.
Little, R. and T. Raghunathan (1999). On summary measures analysis of the linear
mixed effects model for repeated measures when data are not missing completely
at random. Statistics in Medicine 18, 2465–2478.
Little, R. J. A. and D. B. Rubin (2002). Statistical Analysis with Missing Data.
John Wiley & Sons.
Liu, A. and W. J. Hall (1999). Unbiased estimation following a sequential test.
Biometrika 86, 71–78.
Liu, G. and A. L. Gould (2002). Comparison of alternative strategies for analysis of
longitudinal trials with dropouts. Journal of Biopharmaceutical Statistics 12 (2),
207–226.
Louis, P. (1834). Essays in clinical instruction. London, U.K.: P. Martin.
Love, R. and N. Fost (1997). Ethical and regulatory challenges in a randomized
control trial of adjuvant treatment for breast cancer in Vietnam. J Inv Med 45,
423–431.
Love, R. and N. Fost (2003). A pilot seminar on ethical issues in clinical trials for
cancer researchers in Vietnam. IRB: a Review of Human Subjects Research 25,
8–10.
Lytle, L., M. Kubik, C. Perry, M. Story, A. Birnbaum, and D. Murray (2006). Influ-
encing healthful food choices in school and home environments: results from the
TEENS study. Preventive Medicine 43, 8–13.
Machin, D. and S. Weeden (1998). Suggestions for the presentation of quality of
life data from clinical trials. Statistics in Medicine 17, 711–724.
MADIT Executive Committee (1991). Multicenter automatic defibrillator implan-
tation trial (MADIT): design and clinical protocol. Pacing Clin Electrophysiol 14 (5
Pt 2), 920–927.
Mallinckrodt, C. H., W. S. Clark, R. J. Carroll, and G. Molenberghs (2003). As-
sessing response profiles from incomplete longitudinal clinical trial data under
regulatory considerations. Journal of Biopharmaceutical Statistics 13 (2), 179–
190.
Mantel, N. (1966). Evaluation of survival data and two new rank order statistics
arising in its consideration. Cancer Chemotherapy Reports 50, 163–170.
Marcus, R., E. Peritz, and K. R. Gabriel (1976). On closed testing procedures with
special reference to ordered analysis of variance. Biometrika 63, 655–660.
Mark, E., E. Patalas, H. Chang, R. Evans, and S. Kessler (1997). Fatal pulmonary
hypertension associated with short-term use of fenfluramine and phentermine. N
Engl J Med 337, 602–606.
Matts, J. and J. Lachin (1988). Properties of permuted-block randomization in
clinical trials. Controlled Clinical Trials 9, 327.
McCulloch, C. and S. Searle (2001). Generalized, Linear, and Mixed Models. New
York: Wiley.
McGinn, C., M. Zalupski, I. Shureiqi, et al. (2001). Phase I trial of radiation dose
escalation with concurrent weekly full-dose gemcitabine in patients with advanced
pancreatic cancer. J Clin Oncol 19, 4202–4208.
McIntosh, H. (1995). Stat bite: Cancer and heart disease deaths. J Natl Cancer Inst
87 (16), 1206.
McPherson, C. and P. Armitage (1971). Repeated significance tests on accumulating
data when the null hypothesis is not true. Journal of the Royal Statistical Society,
Series A 134, 15–26.
Medical Research Council (1944). Clinical trial of patulin in the common cold.
Lancet 2, 373–375.
Medical Research Council (1948). Streptomycin treatment of pulmonary tuberculo-
sis. BMJ 2, 769–782.
Meier, P. (1981). Stratification in the design of a clinical trial. Control Clin Trials
1, 355–361.
Meier, P. (1991). Comment on “compliance as an explanatory variable in clinical
trials”. Journal of the American Statistical Association 86, 19–22.
Meinert, C. (1996). Clinical trials dictionary: usage and recommendations. Harbor
Duvall Graphics, Baltimore, MD.
MERIT-HF Study Group (1999). Effect of metoprolol CR/XL in chronic heart fail-
ure: Metoprolol CR/XL randomised intervention trial in congestive heart failure
(MERIT-HF). Lancet 353, 2001.
Mesbah, M., B. Cole, and M. Lee (2002). Statistical Methods for Quality of Life
Studies Design, Measurements, and Analysis. Springer.
Miller, A., C. Baines, T. To, and C. Wall (1992). Canadian National Breast Screening
Study: breast cancer detection and death rates among women aged 40 to 49 years.
Canadian Medical Association Journal 147, 1459–1476.
Moertel, C. (1984). Improving the efficiency of clinical trials: a medical perspective.
Statistics in Medicine 3, 455–465.
Moher, D., K. F. Schulz, D. G. Altman, and the CONSORT Group (2001). The
CONSORT statement: revised recommendations for improving the quality of re-
ports of parallel-group randomized trials. Ann Intern Med 134, 657–662.
Molenberghs, G., M. Buyse, H. Geys, D. Renard, T. Burzykowski, and A. Alonso
(2002). Statistical challenges in the evaluation of surrogate endpoints in random-
ized trials. Control Clin Trials 23, 607–625.
Molenberghs, G., H. Thijs, I. Jansen, C. Beunckens, M. Kenward, C. Mallinckrodt,
and R. Carroll (2004). Analyzing incomplete longitudinal clinical trial data. Bio-
statistics 5, 445–464.
Montori, V. M., G. Permanyer-Miralda, I. Ferreira-González, J. W. Busse,
V. Pacheco-Huergo, D. Bryant, J. Alonso, E. A. Akl, A. Domingo-Salvany,
E. Mills, P. Wu, H. J. Schünemann, R. Jaeschke, and G. H. Guyatt (2005). Va-
lidity of composite end points in clinical trials. BMJ 330 (2), 594–596.
Moss, A., W. Hall, D. Cannom, J. Daubert, S. Higgins, H. Klein, J. Levine, S. Sak-
sena, A. Waldo, D. Wilber, M. Brown, M. Heo, and for the Multicenter Automatic
Defibrillator Implantation Trial Investigators (1996). Improved survival with an
implanted defibrillator in patients with coronary disease at high risk for ventric-
ular arrhythmia. N Engl J Med 335, 1933–1940.
Moyé, L. (1999). End-point interpretation in clinical trials: the case for discipline.
Control Clin Trials 20 (1), 40–49.
Moyé, L. A. (2003). Multiple Analyses in Clinical Trials: Fundamentals for Investi-
gators. New York: Springer-Verlag.
MPS Research Unit (2000). PEST 4: Operating Manual. The University of Reading.
Multiple Risk Factor Intervention Trial Research Group (1982). Multiple Risk Factor
Intervention Trial: Risk factor changes and mortality results. JAMA 248, 1465–
1477.
Murray, D., S. Varnell, and J. Blitstein (2004). Design and analysis of group-
randomized trials: a review of recent methodological developments. Am J Public
Health 94, 423–432.
Muthén, B. and L. Muthén (1998). Mplus User’s Guide. Los Angeles: Muthen &
Muthen.
National Commission for the Protection of Human Subjects of Biomedical and Be-
havioral Research (1979). The Belmont Report: ethical principles and guidelines
for the protection of human subjects of research. Fed Regist 44, 23192–23197.
Neaton, J. D., G. Gray, B. D. Zuckerman, and M. A. Konstam (2005). Key issues in
end point selection for heart failure trials: composite endpoint. Journal of Cardiac
Failure 11 (8), 567–575.
Nocturnal Oxygen Therapy Trial Group (1980). Continuous and nocturnal oxygen
therapy in hypoxemic chronic obstructive lung disease: a clinical trial. Ann Intern
Med 93, 391–398.
O’Brien, P. C. (1984). Procedures for comparing samples with multiple endpoints.
Biometrics 40, 1079–1087.
O’Brien, P. C. and T. R. Fleming (1979). A multiple testing procedure for clinical
trials. Biometrics 35, 549–556.
O’Connor, C. M., M. W. Dunne, M. A. Pfeffer, J. B. Muhlestein, L. Yao, S. Gupta,
R. J. Benner, M. R. Fisher, and T. D. Cook (2003). A double-blind, randomized,
placebo controlled trial of azithromycin for the secondary prevention of coronary
heart disease: the WIZARD trial. JAMA 290 (11), 1459–1466.
Olschewski, M. and M. Schumacher (1990). Statistical analysis of quality of life data
in cancer clinical trials. Statistics in Medicine 9, 749.
Omenn, G., G. Goodman, M. Thornquist, J. Balmes, M. Cullen, A. Glass, J. Keogh,
F. Meyskens, B. Valanis, J. Williams, S. Barnhart, and S. Hammar (1996). Effects
of a combination of beta-carotene and vitamin A on lung cancer and cardiovascular
disease. New Engl J Med 334, 1150–1155.
Omenn, G., G. Goodman, M. Thornquist, J. Grizzle, L. Rosenstock, S. Barnhart,
J. Balmes, M. Cherniack, M. Cullen, A. Glass, et al. (1994). The beta-
carotene and retinol efficacy trial (CARET) for chemoprevention of lung cancer in
high-risk populations: smokers and asbestos-exposed workers. Cancer Research
54, 2038s–2043s.
O’Quigley, J., M. Pepe, and L. Fisher (1990). Continual reassessment method: a
practical design for phase I clinical trials in cancer. Biometrics 46, 33–48.
Packer, M. (2000). COPERNICUS (Carvedilol Prospective Randomized Cumulative
Survival): evaluates the effects of carvedilol on top of ACE inhibition on major cardiac events
in patients with heart failure. European Society of Cardiology Annual Congress.
Packer, M., M. R. Bristow, J. N. Cohn, et al. (1996). The effect of carvedilol on
morbidity and mortality in patients with chronic heart failure. The New England
Journal of Medicine 334 (21), 1349–1355.
Packer, M., J. Carver, R. Rodeheffer, R. Ivanhoe, R. DiBianco, S. Zeldis, G. Hendrix,
W. Bommer, U. Elkayam, M. Kukin, et al. (1991). Effect of oral milrinone on
mortality in severe chronic heart failure. N Engl J Med 325, 1468–1475.
Packer, M., A. Coats, M. Fowler, H. Katus, H. Krum, P. Mohacsi, J. Rouleau,
M. Tendera, A. Castaigne, E. Roecker, M. Schultz, and D. DeMets (2001). Effect
of carvedilol on the survival of patients with severe chronic heart failure. New
England Journal of Medicine 344 (22), 1651–1658.
Packer, M., C. O’Connor, J. Ghali, M. Pressler, P. Carson, R. Belkin, A. Miller,
G. Neuberg, D. Frid, J. Wertheimer, A. Cropp, and D. DeMets (1996). Effect
of amlodipine on morbidity and mortality in severe chronic heart failure. New
England Journal of Medicine 335, 1107–1114.
Pampallona, S. and A. Tsiatis (1994). Group sequential designs for one-sided and
two-sided hypothesis testing with provision for early stopping in favor of the null
hypothesis. Journal of Statistical Planning and Inference 42, 19–35.
Peduzzi, P., K. Detre, J. Wittes, and T. Holford (1991). Intention-to-treat analysis
and the problem of crossovers: an example from the Veterans Affairs randomized
trial of coronary artery bypass surgery. Journal of Thoracic and Cardiovascular
Surgery 101, 481–487.
Peduzzi, P., J. Wittes, K. Detre, and T. Holford (1993). Analysis as-randomized and
the problem of non-adherence: an example from the Veterans Affairs randomized
trial of coronary artery bypass surgery. Statistics in Medicine 12, 1185–1195.
Peto, R., M. Pike, P. Armitage, N. Breslow, D. Cox, S. Howard, N. Mantel,
K. McPherson, J. Peto, and P. Smith (1976). Design and analysis of randomised
clinical trials requiring prolonged observation on each patient: I. Introduction and
design. British Journal of Cancer 34, 585–612.
Pinheiro, J. and D. Bates (2000). Mixed Effects Models in S and S-PLUS. Springer.
Pinheiro, J. C. and D. L. DeMets (1997). Estimating and reducing bias in group
sequential designs with Gaussian independent increment structure. Biometrika
84, 831–845.
Pitt, B. (2005). Low-density lipoprotein cholesterol in patients with stable coronary
heart disease — is it time to shift our goals? The New England Journal of Medicine
352 (14), 1483–1484.
Pocock, S. (1977a). Group sequential methods in the design and analysis of clinical
trials. Biometrika 64, 191–199.
Pocock, S. (1977b). Randomised clinical trials. Br Med J 1, 1661.
Pocock, S. and I. White (1999). Trials stopped early: too good to be true? Lancet
353, 943–944.
Pocock, S. J. (2005). When (not) to stop a clinical trial for benefit. JAMA 294 (17),
2228–2230.
Pocock, S. J., N. L. Geller, and A. A. Tsiatis (1987). The analysis of multiple
endpoints in clinical trials. Biometrics 43, 487–498.
Pocock, S. J. and R. Simon (1975). Sequential treatment assignment with balanc-
ing for prognostic factors in the controlled clinical trial (Corr: V32 p954-955).
Biometrics 31, 103–115.
Prentice, R. (1989). Surrogate endpoints in clinical trials: definition and operational
criteria. Statistics in Medicine 8, 431.
Prorok, P., G. Andriole, R. Bresalier, S. Buys, D. Chia, E. Crawford, R. Fogel,
E. Gelmann, F. Gilbert, et al. (2000). Design of the Prostate, Lung, Colorectal
and Ovarian (PLCO) Cancer Screening Trial. Control Clin Trials 21, 273S–309S.
Proschan, M., K. Lan, and J. Wittes (2006). Statistical Monitoring of Clinical Trials:
A Unified Approach. Statistics for Biology and Health. New York: Springer.
Qu, R. P. and D. L. DeMets (1999). Bias correction in group sequential analysis
with correlated data. Statistica Sinica 9, 939–952.
Quittner, A. (1998). Measurement of quality of life in cystic fibrosis. Curr Opin
Pulm Med 4 (6), 326–331.
Quittner, A., A. Buu, M. Messer, A. Modi, and M. Watrous (2005). Development
and validation of the cystic fibrosis questionnaire in the United States: A health-
related quality-of-life measure for cystic fibrosis. Chest 128, 2347–2354.
R Development Core Team (2005). R: A language and environment for statistical
computing. Vienna, Austria: R Foundation for Statistical Computing. ISBN 3-
900051-07-0.
Rea, L. and R. Parker (1997). Designing and Conducting Survey Research: A Com-
prehensive Guide (2 ed.). Jossey-Bass.
Reboussin, D. M., D. L. DeMets, K. Kim, and K. K. G. Lan (2000). Computations
for group sequential boundaries using the Lan–DeMets spending function method.
Controlled Clinical Trials 21 (3), 190–207.
Rector, T., S. Kubo, and J. Cohn (1993). Validity of the Minnesota Living with
Heart Failure questionnaire as a measure of therapeutic response to enalapril or
placebo. American Journal of Cardiology 71, 1106–1107.
Ridker, P., M. Cushman, M. Stampfer, R. Tracy, and C. Hennekens (1997). Inflam-
mation, aspirin, and the risk of cardiovascular disease in apparently healthy men.
New England Journal of Medicine 336, 973–979.
Rosner, G. and A. Tsiatis (1988). Exact confidence intervals following a group
sequential trial: a comparison of methods. Biometrika 75, 723–730.
Rubin, D. (1974). Estimating causal effects of treatment in randomized and non-
randomized studies. Journal of Educational Psychology 66, 688.
Rubin, D. B. (1987). Multiple Imputation for Nonresponse in Surveys. John Wiley
& Sons.
Rubin, D. B. (1996). Multiple imputation after 18+ years (Pkg: P473-520). Journal
of the American Statistical Association 91, 473–489.
Rubin, D. B. (1998). More powerful randomization-based p-values in double-blind
trials with non-compliance (Pkg: P251-389). Statistics in Medicine 17, 371–385.
Ruffin, J., J. Grizzle, N. Hightower, G. McHardy, H. Shull, and J. Kirsner (1969).
A co-operative double-blind evaluation of gastric “freezing” in the treatment of
duodenal ulcer. N Engl J Med 281, 16–19.
Ryan, L. and K. Soper (2005). Preclinical treatment evaluation. In Encyclopedia of
Biostatistics. John Wiley & Sons, Ltd.
Sackett, D. L., J. J. Deeks, and D. G. Altman (1996). Down with odds ratios.
Evidence-Based Medicine 1 (6), 164–166.
Samsa, G., D. Edelman, M. Rothman, G. Williams, J. Lipscomb, and D. Matchar
(1999). Determining clinically important differences in health status measures: a
general approach with illustration to the health utilities index mark II. Pharma-
coeconomics 15 (2), 141–155.
Scandinavian Simvastatin Survival Study Group (1994). Randomized trial of choles-
terol lowering in 4444 patients with coronary heart disease. Lancet 344, 1383–1389.
Scharfstein, D., A. Tsiatis, and J. Robins (1997). Semiparametric efficiency and its
implication on the design and analysis of group-sequential studies. Journal of the
American Statistical Association 92, 1342–1350.
Scharfstein, D. O., A. Rotnitzky, and J. M. Robins (1999). Adjusting for nonig-
norable drop-out using semiparametric nonresponse models (C/R: P1121-1146).
Journal of the American Statistical Association 94, 1096–1120.
Schluchter, M. (1992). Methods for the analysis of informatively censored longitu-
dinal data. Statistics in Medicine 11, 1861–1870.
Schneiderman, M. (1967). Mouse to man: statistical problems in bringing a drug to
clinical trial. In L. LeCam and J. Neyman (Eds.), Proceedings of the 5th Berkeley
Symposium in Mathematical Statistics and Probability, Volume IV. Berkeley U.
Press, Berkeley.
Schoenfeld, D. (1983). Sample-size formula for the proportional-hazards regression
model. Biometrics 39, 499–503.
Schwartz, D. and J. Lellouch (1967). Explanatory and pragmatic attitudes in ther-
apeutical trials. J. Chron. Dis. 20, 637–648.
Senn, S. (1999). Rare distinction and common fallacy [letter]. eBMJ http://bmj.
com/cgi/eletters/317/7168/1318.
Shen, Y. and L. Fisher (1999). Statistical inference for self-designing clinical trials
with a one-sided hypothesis. Biometrics 55, 190–197.
Shih, J. H. (1995). Sample size calculation for complex clinical trials with survival
endpoints. Controlled Clinical Trials 16, 395–407.
Shopland, D. (Ed.) (1982). The Health Consequences of Smoking: Cancer: A Report
of the Surgeon General. United States Public Health Service, Office on Smoking
and Health.
Sidak, Z. (1967). Rectangular confidence regions for the means of multivariate nor-
mal distributions. Journal of the American Statistical Association 62, 626–633.
Siegmund, D. (1978). Estimation following sequential tests. Biometrika 65, 341–349.
Siegmund, D. (1985). Sequential Analysis: Tests and Confidence Intervals. Springer-
Verlag Inc.
Silverman, W. (1979). The lesson of retrolental fibroplasia. Scientific American 236,
100–107.
Simes, R. J. (1986). An improved Bonferroni procedure for multiple tests of signifi-
cance. Biometrika 73, 751–754.
Simon, R. (1989). Optimal two-stage designs for phase II clinical trials. Controlled
Clinical Trials 10, 1–10.
Simon, R., B. Freidlin, L. Rubinstein, S. Arbuck, J. Collins, and M. Christian (1997).
Accelerated titration designs for phase I clinical trials in oncology. J National
Cancer Inst 89, 1138–1147.
Simon, R., G. Weiss, and D. Hoel (1975). Sequential analysis of binomial clinical
trials. Biometrika 62, 195–200.
Simon, R., J. Wittes, and J. Ellenberg (1985). Randomized phase II clinical trials?
Cancer Treatment Reports 69, 1375–1381.
Sloan, E. P., M. Koenigsberg, D. Gens, et al. (1999). Diaspirin cross-linked
hemoglobin (DCLHb) in the treatment of severe traumatic hemorrhagic shock,
a randomized controlled efficacy trial. JAMA 282 (19), 1857–1863.
Sloan, J. and A. Dueck (2004). Issues for statisticians in conducting analyses and
translating results for quality of life end points in clinical trials. Journal of Bio-
pharmaceutical Statistics 14, 73–96.
Smith, E. L., C. Gilligan, P. E. Smith, and C. T. Sempos (1989). Calcium supple-
mentation and bone in middle-aged women. Am J Clin Nutr 50, 833–842.
Spieth, L. and C. Harris (1996). Assessment of health-related quality of life in
children and adolescents: an integrative review. Journal of Pediatric Psychology
21 (2), 175–193.
Spiegelhalter, D., L. Freedman, and P. Blackburn (1986). Monitoring clinical trials:
conditional or predictive power? Controlled Clinical Trials 7, 8–17.
Steering Committee of the Physicians’ Health Study Research Group (1989). Final
report on the aspirin component of the ongoing Physicians Health Study. N Engl
J Med 321, 129–135.
Stephens, R. (2004). The analysis, interpretation, and presentation of quality of life
data. Journal of Biopharmaceutical Statistics 14, 53–71.
Stevens, J., D. Murray, D. Catellier, P. Hannan, L. Lytle, J. Elder, D. Young,
D. Simons-Morton, and L. Webber (2005). Design of the trial of activity in
adolescent girls (TAAG). Contemporary Clinical Trials 26, 223–233.
Storer, B. (1989). Design and analysis of phase I clinical trials. Biometrics 45,
925–937.
Storer, B. (1990). A sequential phase II/III trial for binary outcomes. Statistics in
Medicine 9, 229–235.
Taeger, D., Y. Sun, and K. Straif (1998). On the use, misuse and interpretation of
odds ratios. eBMJ http://bmj.com/cgi/eletters/316/7136/989.
Taubes, G. (1995). Epidemiology faces its limits. Science 269, 164–169.
Taves, D. (1974). A new method of assigning patients to treatment and control
groups. Clin Pharmacol Ther 15, 443–453.
Taylor, A., S. Ziesche, C. Yancy, et al. (2004). Combination of isosorbide dinitrate
and hydralazine in blacks with heart failure. N Engl J Med 351, 2049–2057.
Temple, R. and S. Ellenberg (2000). Placebo-controlled trials and active-control
trials in the evaluation of new treatments. Part 1: Ethical and scientific issues.
Ann Int Med 133, 455–463.
Temple, R. and G. Pledger (1980). The FDA’s critique of the anturane reinfarction
trial. New England Journal of Medicine 303, 1488.
Thackray, S., K. Witte, A. Clark, and J. Cleland (2000). Clinical trials update:
OPTIME-CHF, PRAISE-2, ALL-HAT. Eur J Heart Fail 2 (2), 209–212.
Thall, P. and R. Simon (1995). Recent developments in the design of phase II clinical
trials. In P. Thall (Ed.), Recent Advances in Clinical Trial Design and Analysis.
Kluwer: New York.
Thall, P. F. and S. C. Vail (1990). Some covariance models for longitudinal count
data with overdispersion. Biometrics 46, 657–671.
The Alpha-Tocopherol, Beta-Carotene Cancer Prevention Study Group (1994). The
effect of vitamin E and beta-carotene on the incidence of lung cancer and other
cancers in male smokers. N Engl J Med 330, 1029–1035.
The Anturane Reinfarction Trial Research Group (1978). Sulfinpyrazone in the
prevention of cardiac death after myocardial infarction. The New England Journal
of Medicine 298, 289.
The Anturane Reinfarction Trial Research Group (1980). Sulfinpyrazone in the
prevention of sudden death after myocardial infarction. The New England Journal
of Medicine 302, 250.
The Beta-Blocker Evaluation of Survival Trial Investigators (2001). A trial of the
beta-blocker bucindolol in patients with advanced chronic heart failure. The New
England Journal of Medicine 344 (22), 1659–1667.
The Cardiac Arrhythmia Suppression Trial (CAST) Investigators (1989). Prelimi-
nary report: effect of encainide and flecainide on mortality in a randomized trial
of arrhythmia suppression after myocardial infarction. The New England Journal
of Medicine 321, 406.
The Coronary Drug Project Research Group (1975). Clofibrate and niacin in coro-
nary heart disease. JAMA 231, 360.
The Coronary Drug Project Research Group (1980). Influence of adherence to treat-
ment and response of cholesterol on mortality in the coronary drug project. New
England Journal of Medicine 303, 1038.
The Diabetes Control and Complications Trial (DCCT) Research Group
(1986). Design and methodologic considerations for the feasibility phase. Diabetes
35, 530–545.
The Diabetic Retinopathy Study Research Group (1978). Photocoagulation treat-
ment of proliferative diabetic retinopathy: the second report of the Diabetic
Retinopathy Study findings. Ophthalmology 85, 82–106.
The Global Use of Strategies to Open Occluded Coronary Arteries (GUSTO III)
Investigators (1997). A comparison of reteplase with alteplase for acute myocardial
infarction. N Engl J Med 337, 1118–1123.
The GUSTO Investigators (1993). An international randomized trial comparing
four thrombolytic strategies for acute myocardial infarction. N Engl J Med 329,
673–682.
The Intermittent Positive Pressure Breathing Trial Group (1983). Intermittent pos-
itive pressure breathing therapy of chronic obstructive pulmonary disease. Annals
of Internal Medicine 99, 612.
The International Steering Committee on Behalf of the MERIT-HF Study Group
(1997). Rationale, design, and organization of the metoprolol CR/XL randomized
intervention trial in heart failure (MERIT-HF). Amer J Cardiol 30, 54J–58J.
The Lipid Research Clinics Program (1979). The coronary primary prevention trial:
design and implementation. J Chronic Dis 32, 609–631.
The TIMI Research Group (1988). Immediate vs. delayed catheterization and an-
gioplasty following thrombolytic therapy for acute myocardial infarction. JAMA
260, 2849–2858.
The Veterans Administration Coronary Artery Bypass Surgery Cooperative Study
Group (1984). Eleven-year survival in the veterans administration randomized
trial of coronary bypass surgery for stable angina. N Engl J Med 311 (21),
1333–1339.
The Women’s Health Initiative Steering Committee (2004). Effects of conjugated
equine estrogen in postmenopausal women with hysterectomy: the Women’s
Health Initiative randomized controlled trial. JAMA 291, 1701–1712.
Therneau, T. M. (1993). How many stratification factors are “too many” to use in
a randomization plan? Controlled Clinical Trials 14, 98–108.
Therneau, T. M., P. M. Grambsch, and T. R. Fleming (1990). Martingale-based
residuals for survival models. Biometrika 77, 147–160.
Thisted, R. A. (2006). Baseline adjustment: Issues for mixed-effect regression models
in clinical trials. In Proceedings of the American Statistical Association, pp. 386–
391. American Statistical Association.
Tsiatis, A. (1981). The asymptotic joint distribution of the efficient scores test for
the proportional hazards model calculated over time. Biometrika 68, 311–315.
Tsiatis, A. (1982). Repeated significance testing for a general class of statistics used
in censored survival analysis. Journal of the American Statistical Association 77,
855–861.
Tsiatis, A., G. Rosner, and C. Mehta (1984). Exact confidence intervals following a
group sequential test. Biometrics 40, 797–803.
Tsiatis, A. A. and C. Mehta (2003). On the inefficiency of the adaptive design for
monitoring clinical trials. Biometrika 90 (2), 367–378.
Upton, G. J. G. (1982). A comparison of alternative tests for the 2 × 2 comparative
trial. Journal of the Royal Statistical Society, Series A: General 145, 86–105.
U.S. Government (1949). Trials of War Criminals before the Nuremberg Military
Tribunals under Control Council Law No. 10, Volume 2. Washington, D.C.: U.S.
Government Printing Office.
Van Elteren, P. (1960). On the combination of independent two-sample tests of
Wilcoxon. Bull Int Stat Inst 37, 351–361.
Venables, W. N. and B. D. Ripley (1999). Modern Applied Statistics with S-PLUS.
Springer-Verlag Inc.
Verbeke, G. and G. Molenberghs (2000). Linear Mixed Models for Longitudinal Data.
Springer-Verlag Inc.
Verbeke, G., G. Molenberghs, H. Thijs, E. Lesaffre, and M. G. Kenward (2001).
Sensitivity analysis for nonrandom dropout: a local influence approach. Biometrics
57 (1), 7–14.
Veterans Administration Cooperative Urological Research Group (1967). Treatment
and survival of patients with cancer of the prostate. Surg Gynecol Obstet 124,
1011–1017.
Volberding, P., S. Lagakos, M. Koch, C. Pettinelli, M. Myers, D. Booth, et al., and
the AIDS Clinical Trials Group of the National Institute of Allergy and Infectious
Diseases (1990). Zidovudine in asymptomatic human immunodeficiency virus
infection: a controlled trial in persons with fewer than 500 CD4-positive cells per
cubic millimeter. N Engl J Med 322, 941–949.
Wald, A. (1947). Sequential Analysis. New York: Wiley.
Wang, C., J. Douglas, and S. Anderson (2002). Item response models for joint
analysis of quality of life and survival. Statistics in Medicine 21 (1), 129–142.
Wangensteen, O., E. Peter, E. Bernstein, A. Walder, H. Sosin, and A. Madsen (1962).
Can physiological gastrectomy be achieved by gastric freezing? Ann Surg 156,
579–591.
Ware, J. (2000). SF-36 health survey update. Spine 25 (24), 3130–3139.
Ware, J. H. (1989). Investigating therapies of potentially great benefit: ECMO (C/R:
P306-340). Statistical Science 4, 298–306.
Waskow, I. and M. Parloff (1975). Psychotherapy Change Measures. U.S. Govern-
ment Printing Office, Washington, D.C.
Wedel, H., D. DeMets, P. Deedwania, B. Fagerberg, S. Goldstein, S. Gottlieb, A. Hjalmarson,
J. Kjekshus, F. Waagstein, J. Wikstrand, and MERIT-HF Study Group (2001).
Challenges of subgroup analyses in multinational clinical trials: experiences from
the MERIT-HF trial. The American Heart Journal 142 (3), 502–511.
Wei, L. (1984). Testing goodness of fit for the proportional hazards model with
censored observations. Journal of the American Statistical Association 79, 649–
652.
Wei, L. and S. Durham (1978). The randomized play-the-winner rule in medical
trials. Journal of the American Statistical Association 73, 840–843.
Wei, L. and J. Lachin (1988). Properties of the urn randomization in clinical trials.
Controlled Clinical Trials 9, 345.
Whitehead, J. (1978). Large sample sequential methods with application to the
analysis of 2 × 2 contingency tables. Biometrika 65, 351–356.
Whitehead, J. (1983). The Design and Analysis of Sequential Clinical Trials. Chich-
ester: Horwood.
Whitehead, J. (1986). On the bias of maximum likelihood estimation following a
sequential test. Biometrika 73, 573–581.
Whitehead, J. (1997). The Design and Analysis of Sequential Clinical Trials (Revised
2nd ed.). Wiley: Chichester.
Whitehead, J. and I. Stratton (1983). Group sequential clinical trials with triangular
continuation regions. Biometrics 39, 227–236.
Willke, R., L. Burke, and P. Erickson (2004). Measuring treatment impact: a review
of patient-reported outcomes and other efficacy endpoints in approved product
labels. Control Clin Trials 25 (6), 535–52.
World Medical Association (1989). World Medical Association Declaration of
Helsinki. Recommendations guiding physicians in biomedical research involving
human subjects. World Medical Association.
Writing Group for the Women’s Health Initiative Randomized Controlled Trial
(2002). Risks and benefits of estrogen plus progestin in healthy postmenopausal
women. JAMA 288, 321–333.
Wu, M. and R. Carroll (1988). Estimation and comparison of changes in the presence
of informative right censoring by modeling the censoring process. Biometrics 44,
175–188.
Wu, M., M. Fisher, and D. DeMets (1980). Sample sizes for long-term medical
trial with time-dependent dropout and event rates. Controlled Clinical Trials 1,
111–121.
Yabroff, K., B. Linas, and K. Shulman (1996). Evaluation of quality of life for diverse
patient populations. Breast Cancer Res Treat 40 (1), 87–104.
Yusuf, S. and A. Negassa (2002). Choice of clinical outcomes in randomized trials
of heart failure therapies: Disease-specific or overall outcomes? American Heart
Journal 143 (1), 22–28.
Zelen, M. (1969). Play the winner rule and the controlled clinical trial. Journal of
the American Statistical Association 64, 131–146.
Zhang, J., H. Quan, J. Ng, and M. Stepanavage (1997). Some statistical methods
for multiple endpoints in clinical trials. Controlled Clinical Trials 18, 204.
Zhang, J. and K. F. Yu (1998). What’s the relative risk? A method of correcting
the odds ratio in cohort studies of common outcomes. JAMA 280, 1690–1691.
Zhao, H. and A. A. Tsiatis (1997). A consistent estimator for the distribution of
quality adjusted survival time. Biometrika 84, 339–348.
Zhao, H. and A. A. Tsiatis (1999). Efficient estimation of the distribution of quality-
adjusted survival time. Biometrics 55 (4), 1101–1107.