
Design and Analysis

of Clinical Trials
Pamela Shaw & Michael Proschan

UW Summer Institute in Statistics for Clinical &


Epidemiological Research (SISCER)
Virtual Course, July 10-12, 2023

1 / 51
Introductions

Pamela Shaw
I Kaiser Permanente Washington Health Research Institute
I Biostatistics Division
I pamela.a.shaw@kp.org

Michael Proschan
I National Institute of Allergy and Infectious Diseases (NIAID)
I Biostatistics Research Branch
I proscham@niaid.nih.gov

2 / 51
Course Outline

Day 1
1. 8:30-8:40 Introductions
2. 8:40-9:30 Choice of primary outcome and analysis
9:30-9:45 Break
3. 9:45-10:30 Randomization
10:30-10:45 Break
4. 10:45-12:00 Sample size/ Power

Day 2
6. 8:30-10:15 Interim monitoring
7. 10:15-10:45 Break
8. 10:45-12:00 Futility

3 / 51
Course Outline (2)

Day 3
1. 8:30-9:40 Handling missing data
9:40-9:55 Break
2. 9:55-10:45 Multiple Comparisons
10:45-11:00 Break
3. 11:00-11:55 Adaptive design
4. 11:55-12:00 Wrap Up

4 / 51
Course overview

Overall aim
That you will gain a set of simple tools and principles that go a long
way towards robust clinical trial design and analysis.

I Emphasis will be on practical application


I Examples will be used throughout
I Key references provided

5 / 51
Lecture 1: Choice of primary outcome and
analysis

6 / 51
Key Features of Randomized Controlled Trial (RCT)

I A Randomized Controlled Trial is a study of a novel


intervention in human subjects where the intervention
assignment is randomized
I A randomized controlled trial (RCT) is the gold standard for
clinical evidence for establishing efficacy
I International Council for Harmonisation (ICH)/FDA Guidance
provide universally adopted guidelines to maintain rigorous
standards for the ethical and scientific integrity of the trial
I ICH E9 Statistical Principles
I https://www.ich.org/page/efficacy-guidelines
I A central pillar to the scientific rigor of the RCT is the choice of a
relevant clinical endpoint that will reliably and efficiently capture
the treatment effect of interest

7 / 51
A few definitions....

8 / 51
Types of randomized studies

Parallel group - subjects randomized to one of k treatments


Cross-over - each subject used as their own control. Patients
receive each treatment sequentially, and the order is randomized
Factorial design - Multiple treatments under study, where each
has a control. So for k treatments, subjects randomized to one of
2^k possible arms in a full factorial design
Cluster randomized - groups instead of individuals are
randomized (e.g., schools, buildings, clinics)
Group sequential designs - Studies with prespecified methods
to analyze data partway through trial for potential early stopping
(Day 2)
Adaptive designs - Studies with pre-specified methods to use
within trial data to change aspects of design (Day 3)
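The 2^k-arm structure of a full factorial design can be enumerated directly. A minimal sketch (the `factorial_arms` helper and the arm labels are hypothetical, not from the course):

```python
from itertools import product

def factorial_arms(treatments):
    """Enumerate the 2^k arms of a full factorial design:
    for each treatment a subject receives either the active
    version or its matching control."""
    options = [(t, f"{t}-control") for t in treatments]
    return [tuple(arm) for arm in product(*options)]

arms = factorial_arms(["A", "B"])
# 2 treatments -> 2^2 = 4 arms, e.g. ("A", "B-control")
```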

9 / 51
Phases of clinical trials
https://www.fda.gov/patients/drug-development-process/step-3-clinical-research

I Phase 1: Evaluates safety and dosage of drug. Generally 20-100


subjects, either healthy or with condition
I Phase 2: Evaluating efficacy and side effects. Up to 300 subjects
with condition
I Phase 3: Evaluating efficacy and monitoring adverse events.
300-3000 with disease condition
Note: these sample size ranges can vary based on disease setting;
e.g., cancer treatment trials tend to be smaller, cancer prevention
trials tend to be larger

10 / 51
The RCT Gold Standard

Key features that contribute to the strength of evidence of the RCT:


I Randomization of the treatment allocation allows for causality to
be established
I A single primary outcome to evaluate efficacy is chosen
I Outcomes and analyses are pre-specified
I Analyses are done as intent-to-treat

11 / 51
Intent-to-treat analyses
An intent-to-treat (ITT) analysis is one where randomized individuals
are analyzed in the group they were randomized to, regardless of
what happens during the trial. Analyze as you randomize!

I Randomization ensures that there are no systematic differences


between the treatment groups
I The exclusion of patients from the analysis on a systematic basis
(e.g., lack of compliance with assigned treatment) can introduce
systematic differences between treatment groups, thereby
biasing the comparison
I Sometimes a modified ITT analysis (mITT) is considered, which
would consider very limited exceptions to ITT
I e.g., a trial participant drops out shortly after randomization and
before any intervention (control or otherwise) was given
I Other exceptions to ITT not widely accepted (more details in
Lecture 2)
12 / 51
Implication of ITT for primary endpoint

I ITT means representing patients in the analysis even if they have


missing data
I Missing data must be imputed for an ITT or IPW approach to be
considered (Day 3 topic)
I Too much missing data will degrade the integrity/acceptability of
the trial results
I A fundamental consideration of primary endpoint is that it be
something that can be reliably obtained on all subjects

13 / 51
Handling missing data in an RCT
Little et al. (2012)

The best way to handle missing data is to prevent it

14 / 51
Preventing missing data

I A well-written, complete and understandable informed consent is


vital: not just to protect ethics but to make sure participants know
what they are getting into
I Poor understanding of trial procedures and expectations for
follow-up will lead to missing data
I Off-treatment does not mean off-study
I Highly burdensome procedures will have drop-out regardless of
how good the informed consent is
I When constructing the endpoint, it is worth considering what is
minimally necessary to obtain the clinically necessary
information to evaluate the treatment
I For any procedures above the minimum, think carefully about
the tradeoff with participant burden

15 / 51
Three-prong Approach to Minimizing Impact of
Missing Data

Design
I Avoid endpoints that are more likely to be missing
I Choose the smallest time frame for primary analysis that still
yields clinically relevant information on treatment effects
I Consider a run-in period to ensure commitment
I Particularly important for long/complicated studies

Conduct
I Make extensive efforts to retain subjects
I Continue follow-up for outcomes even if subject stops treatment
Analysis
I Choose analyses that require minimally problematic assumptions

16 / 51
Considerations for the primary outcome

I Should be measured similarly in both treatment arms


I Less is more (benefits of choosing 1 primary)
I Reliability/feasibility
I Clinical relevance (surrogate endpoints, composite endpoints)
I Primary analysis: Efficiency versus robustness. (phase 1 vs
phase 3)
I In phase 1 avoid type II error, in phase 3 avoid type I error. Phase 2
you are somewhere in between
I Composite outcomes
I Interpretability (Clearly stated estimand)

17 / 51
The Measurement Principle

The process of measurement of the primary outcome should not be


influenced by treatment

I The primary outcome should be measurable in all subjects


I There should be similar monitoring of events in both treatment
arms
I Sometimes violations of the measurement principle can be
subtle
I Example: Suppose a viral vaccine causes mild disease. Then
comparison of viral load between arms may suggest a treatment
benefit, but the point is moot since the vaccine caused the disease

18 / 51
So why choose only 1 endpoint?

Probability of at least one false positive test assuming multiple


independent tests under the null

Number of tests Prob of ≥ 1 significant test


1 0.0500
2 0.0975
3 0.1426
4 0.1855
5 0.2262
6 0.2649
7 0.3017
8 0.3366
9 0.3698
10 0.4013
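The table follows from independence: the probability of at least one false positive among n independent α-level tests under the null is 1 − (1 − α)^n. A one-line check (helper name is ours):

```python
def fwer(n_tests, alpha=0.05):
    """Familywise error rate: probability of at least one
    false positive among n independent tests, each at level
    alpha, when every null hypothesis is true."""
    return 1 - (1 - alpha) ** n_tests

for k in range(1, 11):
    print(k, round(fwer(k), 4))  # reproduces the table above
```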

19 / 51
Maintaining type I error without loss of power

The dominant paradigm:


1. Pick one efficacy outcome as the primary outcome
I Formal hypothesis test maintains α level

2. Consider other endpoints as secondary or exploratory


3. If rigorous standards are sought for more than primary outcome,
consider adjustment for multiple comparisons (Day 3 topic)
4. Rigorous control of type I error generally does not apply when
evaluating safety outcomes in an early phase trial. Generally one
does not want to miss a safety signal (may evaluate several AEs)
5. Type I error control is needed when you win if at least one test is
statistically significant.
I When all tests must be significant (e.g., showing that a combination
drug beats each of its constituents), no multiple comparison
adjustment is needed.

20 / 51
Reliability

I Clinical outcomes that are more variable between patients, such


as those affected by more factors than just the treatment, will
have less power
I Difficult-to-measure quantities will have more missing data (e.g.,
more assay failures)
I This can introduce bias, particularly if, say, lower levels are more
likely to be missing
I If too many values fall below the limit of detection, it will be
difficult to detect arm differences
I If your trial involves a novel assay, having some pilot data will be
important before launching the trial

21 / 51
Efficacy and Safety of Metronidazole for Pulmonary
Multidrug-Resistant Tuberculosis (MDR-TB)
Study NCT00425113

Background
I MDR TB is a difficult to treat disease. Individuals have been
observed to fail first line therapies (Isoniazid and Rifampicin)
I Standard MDR-TB treatment is 18-24 months of 2nd-line
antibiotics
I In vitro data showed that metronidazole is active against
Mycobacterium tuberculosis (MTB) maintained under anaerobic
conditions
I Pre-clinical studies (non-human primates, rabbits) also showed
metronidazole may have unique activity against an anaerobic
sub-population of bacilli in human disease

22 / 51
Design of the Metronidazole for MDR-TB Trial

I A double-blinded RCT with a planned 60 patients with MDR-TB


randomized to one of placebo or 500 mg MTZ for first 8 weeks of
2nd-line TB therapy. 2nd line therapy to continue for 18-24
months.
I Primary outcome was “Changes in TB lesion sizes” at 6 months
I In humans, TB disease characterized by aerobic (cavities) and
anaerobic (caseous necrotic nodules) areas
I Hypothesis was that MTZ would reduce the volume of nodules in
the lung, which would be quantified at baseline and follow-up,
using FDG-PET HRCT
I FDG-PET HRCT was a relatively novel tool for assessing extent
of TB disease

23 / 51
Problematic Primary outcome

I As scans were evaluated on the patients in the trial it became


clear that the primary outcome was not a good measure of
change
I For some patients, volume of lesions decreased because lesions
were reducing in size as patients improved
I For some patients, volume of lesions decreased because lesions
collapsed into cavities as patients got worse
I Number of lesions was discussed as a secondary endpoint, but
the number of lesions could increase or decrease as patients got
better
I Investigators had no choice but to alter the primary and other
outcome measures of the trial
I Changing primary endpoint mid-trial is problematic

24 / 51
Transparency on ClinicalTrials.gov
Study NCT00425113

5 years after trial opened and after study had closed early, primary
outcome was changed
Changes in TB Lesion Sizes Using High Resolution Computed Tomography (HRCT). [
Time Frame: 6 months. ] Lesions were defined as nodules (<2 mm, 2-<4 mm, and
4−10 mm), consolidations, collapse, cavities, fibrosis, bronchial thickening, tree-in-bud
opacities, and ground glass opacities. Each CT was divided into six zones (upper,
middle, and lower zones of the right and left lungs) and independently scored for the
above lesions by three separate radiologists blinded to treatment arm. A fourth
radiologist adjudicated any scores that were widely discrepant among the initial three
radiologists. The HRCT score was determined by visually estimating the extent of the
above lesions in each lung zone as follows: 0=0% involvement; 1= 1-25% involvement;
2=26-50% involvement; 3=51-75% involvement; and 4=76-100% involvement. A
composite score for each lesion was calculated by adding the score for each specific
abnormality in the 6 lung zones and dividing by 6, with the change in composite score
measured at 2 and 6 months compared to baseline. Composite sums of all 10
composite scores are reported.

25 / 51
RCT Example: Effect of Ranitidine on Hyper-IgE
Recurrent Infection (Job’s) Syndrome
NCT00527878

Background
I Hyper-IgE syndrome (HIES) is an immunological disorder
caused by a genetic mutation (STAT3) characterized by recurrent
infections of the ears, sinuses, lungs and skin, and abnormal
levels of the antibody immunoglobulin E (IgE).
I Patients with hyper-IgE syndrome also tend to have skeletal
abnormalities: characteristic face, retained teeth, and recurrent
fractures from minimal trauma
I An early phase RCT was launched in 2007 at NIAID to study
whether ranitidine would reduce infections
I At the time the trial was done, there were only about 76 known cases in the US.

26 / 51
Considerations for an endpoint for this diverse disease
One possibility: A patient-reported score of severity of symptoms.
Problem: Patients with more severe disease are less bothered by mild
to moderate symptoms, while high-functioning patients are bothered
by relatively minor symptoms
Alternative: A numeric score was considered that would capture the
number of new infections
I The number of infections that required new antibiotics was
reported on a quarterly basis, to balance burden and accuracy
(requiring recall over a shorter period)
I Total number in a year is prone to missingness
I Rate of infections per month is a more flexible endpoint
I Disease had many other chronic morbidities (e.g. recurrent
fracture), but ranitidine only expected to affect infections
Final: Primary endpoint chosen was the rate of infections (i.e., average
number per month during the first year). The primary endpoint required
at least 2 of the 4 quarters to give a robust estimate of the yearly rate.
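The endpoint rule can be sketched as follows (a hypothetical helper assuming 3-month quarters; the trial's actual computation may differ):

```python
def infection_rate_per_month(quarterly_counts):
    """Estimate the infection rate per month from quarterly
    counts of new infections (None = missing quarter).
    Requires at least 2 of the 4 quarters to be observed,
    otherwise the endpoint is treated as missing (None)."""
    observed = [c for c in quarterly_counts if c is not None]
    if len(observed) < 2:
        return None
    return sum(observed) / (3 * len(observed))  # 3 months per quarter

rate = infection_rate_per_month([2, None, 1, 3])  # 6 infections / 9 months
```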
27 / 51
HIES Ranitidine Trial Study: Double trouble

I A cross-over design was chosen given the rarity of disease. 20


patients were to be followed on ranitidine and control arm (usual
care) for one year each, in random order
I In a cross-over design, when someone drops out, it is like losing
two subjects, particularly if they drop out during the first period
I For this trial, complete follow-up over two years was essential
I This trial closed early: higher than expected drop-out and a
diminishing interest in Ranitidine

28 / 51
Clinical relevance

I When weighing possible outcomes/endpoints want to consider


the relative seriousness of the different conditions/symptoms the
drug could be affecting
I Also need to consider the mechanisms of action for the
intervention under study and the outcomes expected to have the
biggest change
I Often a trade-off between clinical relevance and power: frequent
less serious events and infrequent serious events
I In some trials it is more practical to observe a surrogate outcome
I In TB trials the short-term endpoint of sputum conversion or
change in first 6 months used in place of the gold standard “cure”
outcome: 6 months after end of therapy need to be disease free

29 / 51
Surrogate Outcome
I Various definitions exist for a surrogate endpoint. Ellenberg and
Hamilton (1989) lay out a general definition: A “Surrogate
endpoint captures an intermediate endpoint on the disease
pathway, which is informative of the true outcome”
I Generally, the point of a surrogate endpoint is to have an
expected reduction in sample size or trial duration, such as when
a rare or distal endpoint is replaced by a more frequent or
proximate endpoint
I In 1989, Prentice laid out conditions for a surrogate outcome
(known as the Prentice criterion), as well as a working definition,
that assumes a treatment Z effect on the true endpoint Y is
completely captured by the surrogate endpoint X
I E(Y | Z, X) = E(Y | X)
I The Prentice criterion has been criticized as impractical, and
various discussions have ensued regarding how to define
and how to validate a surrogate
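The Prentice condition E(Y | Z, X) = E(Y | X) can be illustrated by simulation: when treatment Z affects the true outcome Y only through the surrogate X, the coefficient on Z vanishes once X is adjusted for. An illustrative sketch (numpy assumed; all constants arbitrary):

```python
import numpy as np

rng = np.random.default_rng(1)
n = 20000
z = rng.integers(0, 2, n)            # randomized treatment
x = 1.0 * z + rng.normal(size=n)     # surrogate, affected by treatment
y = 2.0 * x + rng.normal(size=n)     # true outcome depends only on X

# OLS of Y on (1, X, Z): under the Prentice criterion the
# coefficient on Z should be ~0 once X is in the model.
design = np.column_stack([np.ones(n), x, z])
beta, *_ = np.linalg.lstsq(design, y, rcond=None)
# beta[1] recovers the X effect; beta[2] (the Z effect) is near zero
```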
30 / 51
Examples of surrogate endpoints

I Cholesterol, when ultimate goal to reduce cardiovascular events


(e.g., heart attacks and strokes)
I Blood pressure, when ultimate goal to reduce stroke risk
I CD4 or HIV viral load, when ultimate goal to reduce serious
AIDS-related infections or death
I Hemoglobin A1c, when ultimate goal to reduce serious
complications of diabetes

31 / 51
CAST Example: Caution is needed when working with
surrogate endpoints

I Arrhythmias can lead to cardiac arrest, which is fatal a high
percentage of the time
I Given that arrhythmia is on the causal pathway to cardiac arrest
and sudden death, arrhythmia could be considered a surrogate
endpoint for cardiac arrest/sudden death
I In mid-80s to 1990, encainide, flecainide and moricizine were
approved by FDA on basis of their effect on arrhythmias
I Anti-arrhythmia drugs were in broad use at the time of the Cardiac
Arrhythmia Suppression Trial (CAST)
I Led to difficulties in recruitment

32 / 51
Cardiac Arrhythmia Suppression Trial (CAST)
CAST Investigators, 1989

CAST would test hypothesis that suppression of ventricular


premature complexes after a myocardial infarction would improve
survival
I Patients at high risk for death from cardiac arrest were eligible
(recent MI, low ejection fraction)
I Three suppression drugs considered, with matching placebos:
encainide, flecainide and moricizine
I The primary endpoint of the trial was death or cardiac arrest with
resuscitation, either of which was due to arrhythmia
I During titration phase analysis of Holter recordings required to
show that a drug had indeed suppressed arrhythmias adequately
before a patient could be randomized
I Randomization was to the agent that achieved successful
suppression or matching placebo
I Trial launched in June 1987 with 3 year planned recruitment
33 / 51
CAST Results
Ruskin (1989)

I In April 1989 after 1498 patients randomized, the encainide and


flecainide arms were stopped due to higher overall cardiac
mortality and higher mortality due to arrhythmia
I In April 1989 CAST II, a placebo-controlled trial, was launched with
moricizine as only active drug
I Only 277 patients to date had been randomized
I Titration phase now had a blinded placebo
I Early exposure to moricizine was shown to have higher death
rates than the placebo arm
I Anti-arrhythmia drugs were no longer routinely recommended
(Greene et al., 1992)
I Deadly Medicine: Why tens of thousands of heart patients died
in America’s worst drug disaster (Moore, 1995)
34 / 51
Surrogate Endpoints: Controversy continues

I In early phase trials, need biologically motivated intermediate


endpoints
I Many would argue that in large phase III trials, one needs to move
to the target clinical (non-surrogate) endpoint
I Those in the pharmaceutical industry would argue that validated
surrogates would mean smaller, faster, cheaper trials
I Fewer patients are exposed during testing, and beneficial new
medications reach the market faster
I Problem: No universal way to validate a surrogate. Some
advocate meta-analyses (Molenberghs et al., 2002)
I Buyse and Molenberghs (1998); Buyse et al. (2000) reviews
different methods to validate a surrogate, with extension to
meta-analyses

35 / 51
Many examples of misleading surrogates

I Cyclic adenosine monophosphate–enhancing agents, such as milrinone, were


considered a “particularly rational approach to the treatment of chronic heart
failure.” Milrinone was later found to increase mortality by 28% over placebo.
I Estrogen in pre-menopausal women thought to be protective against heart
disease. Hormone replacement therapy used for decades in post-menopausal
women before found to be harmful in the WHI
I High blood sugar in diabetics can lead to bad outcomes. Hemoglobin A1c is used
to monitor diabetes mellitus therapy (short-term effects of treatment). In ACCORD,
oversuppression of HbA1c led to increased mortality (Action to Control
Cardiovascular Risk in Diabetes Study Group, 2008)
I Svensson et al. (2013) give multiple examples of treatment approved based on
surrogate, later found harmful on true outcome
I Even when new drug under consideration is a member of an already
established class, adequate safety cannot be assumed (cerivastatin)
I Demonstrated value for one indication does not necessarily extend to a
related indication (Dronedarone hydrochloride)

36 / 51
Composite outcomes are another way to improve
practicality
A composite outcome combines multiple clinical endpoints. The
general idea behind composite endpoints is to increase power
through an increased event rate

Examples
I Time-to-first of disease progression or death (Progression-free
survival)
I Relapse-free survival
I Major adverse cardiovascular events (MACE)
I Time to first serious AIDS or serious non-AIDS event in the
Strategic Timing of AntiRetroviral Treatment (START) trial
I Time to first of cardiac arrest or arrhythmic death (CAST)
Note: Some composite endpoints are surrogate endpoints
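Deriving a time-to-first composite from component event times is mechanical; a minimal sketch (hypothetical helper, ignoring competing-risks subtleties):

```python
def time_to_first(component_times, censor_time):
    """Derive a composite time-to-first endpoint (e.g.
    progression-free survival) from component event times.
    component_times: times of the component events, with None
    if that event was not observed before censoring.
    Returns (time, event) with event = 1 if any component
    occurred, else 0 (censored at censor_time)."""
    observed = [t for t in component_times if t is not None]
    if observed:
        return min(observed), 1
    return censor_time, 0

# Progression at month 8, no death before censoring at month 24:
# the PFS time is 8 months, with an event.
pfs = time_to_first([8, None], 24)
```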
37 / 51
Considerations for composite endpoints
Neaton et al. (2005)

I The definition of the endpoint should be clearly established a


priori
I Endpoints should have similar seriousness
I In order to interpret the results of the trial, should look at
treatment effect on the individual components of the composite
(secondary endpoints)
I All endpoints should be affected by the drug
I You could decrease power if you expand composite to include
things not affected by treatment just for sake of higher event rate

38 / 51
Example: SOLVD Trial
NEJM 1991, 325: 293-302.

Background
I SOLVD a RCT examining novel treatment for prevention of
mortality/hospitalization in patients with congestive heart failure
(CHF) and weak left ventricle ejection fraction (EF)
I In 1986-89, 2569 patients randomized to enalapril or placebo
I Enalapril found beneficial for mortality (p = 0.0036) and time to
first hospitalization/death (p < 0.0001)

Analysis
I Seek to evaluate treatment effect on subset of 662 diabetic
subjects
I Considered alternative to time to first that considers overall
severity

39 / 51
SOLVD: Results

                  Enalapril (N=319)   Placebo (N=343)   Cox PH HR   Score Test (P-value)
Endpoint           Yes    No           Yes    No
Death              137    182          145    198        0.99        (0.91)
Hospitalization     94    225          148    195        0.60        (< 0.0001)
TTF                174    145          229    114        0.71        (0.0007)

I Treatment arm: 57/94 (61%) hospitalization followed by death


I Placebo arm: 64/148 (43%) hospitalization followed by death
Shaw and Fay severity score test p = 0.07 (Shaw and Fay, 2016)

40 / 51
Alternative to Time-to-First: Prioritized severity score
(Shaw and Fay, 2016; Shaw, 2018)
I General idea: rank individuals according to clinical severity
I Depending on setting, clinical severity could consider two or
more outcomes or event times
I Shaw and Fay (2016) proposed a ranking that considers the
surrogate and the “true” event of interest
I Rank the time to event of interest (death) if it is observed
I Rank time to surrogate event (MI hospitalization) for the survivors
I Surrogate time does not affect clinical severity when event of
interest is observed
I Perform two sample test on clinical severity which incorporates
bivariate survival information
I Resulting test is average of two log-rank tests (aids interpretation)
I Prioritization endpoints have grown in popularity in recent years.
Examples: win ratio (Pocock et al., 2012), Desirability of outcome
ranking (DOOR) (Evans et al., 2015). See review Shaw (2018).
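The pairwise prioritized-comparison idea behind endpoints like the win ratio (Pocock et al., 2012) can be sketched generically. This is an illustrative helper of our own, not the Shaw-Fay rank test, and it ignores censoring and unequal follow-up, which real methods must handle:

```python
def prioritized_wins(treated, control):
    """Compare every (treated, control) pair on a prioritized
    basis: first on the fatal event time, and only if that is
    tied on the non-fatal event time. Each subject is a tuple
    (death_time, hosp_time), with float('inf') meaning the
    event was never observed. Longer times are better."""
    wins = losses = 0
    for t in treated:
        for c in control:
            for k in (0, 1):  # priority order: death, then hospitalization
                if t[k] > c[k]:
                    wins += 1
                    break
                if t[k] < c[k]:
                    losses += 1
                    break
    return wins, losses  # a win ratio would be wins / losses
```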
41 / 51
Choice of primary analysis

I Want to choose an efficient analysis


I Need to consider the interpretation of the parameter for your test
statistic
I If no one understands the method or parameter interpretation,
then unlikely to affect clinical practice
I There may be some trade-off between efficiency and
interpretability.

42 / 51
Common test statistics for a parallel two-arm trial

I Continuous outcome: t-test (assuming unequal variance) is a


common choice
I Note non-parametric tests like Wilcoxon Rank sum test will be
more robust, particularly for modest sample sizes.
I Binary outcome: difference of proportions often of interest; an exact
test will be more robust and often preferred, particularly for small
sample sizes
I Survival outcome: simple log-rank test
I If anticipate missing data, good to consider how your primary test
statistic will be calculated
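The common choices above map directly to standard library calls; a sketch assuming scipy is available (the data here are simulated, not from any trial):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(7)
treatment = rng.normal(loc=1.0, scale=1.0, size=50)
control = rng.normal(loc=0.0, scale=2.0, size=60)

# Welch's t-test: equal_var=False avoids assuming equal
# variances across arms
t_stat, p_value = stats.ttest_ind(treatment, control, equal_var=False)

# Wilcoxon rank-sum (Mann-Whitney U): more robust alternative
u_stat, p_mw = stats.mannwhitneyu(treatment, control,
                                  alternative="two-sided")
```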

43 / 51
Interpretability: Wilcoxon
I There are different ways to interpret a test, and some may be
more relevant than others.
I Example: Wilcoxon rank sum and Mann-Whitney tests are
equivalent.
I Wilcoxon assumes one distribution is shifted relative to the other,
and estimates the size of the shift.
I Mann-Whitney compares (treatment, control) pairs and estimates
the following probability for the outcomes of randomly picked
treatment/control patients (> means better):

P(treatment > control) + (1/2)P(treatment = control)


I The latter is helpful in a COVID-19 trial with an ordinal score like the WHO-8.
Even if the proportional odds assumption is violated, the score test is still
asymptotically equivalent to the Mann-Whitney (Wang and Tian, 2017), so it
still estimates the above probability parameter.
I Another example: Hazard ratio versus restricted mean survival
time (RMST). Many believe RMST is easier to interpret.
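The Mann-Whitney probability parameter above can be estimated from two samples by averaging pairwise comparisons; a minimal sketch (hypothetical helper):

```python
def mann_whitney_prob(treatment, control):
    """Estimate P(treatment > control) + 0.5 * P(treatment = control)
    for randomly picked treatment/control patients (> = better).
    Averages over all (treatment, control) pairs; ties count 1/2."""
    total = 0.0
    for t in treatment:
        for c in control:
            if t > c:
                total += 1.0
            elif t == c:
                total += 0.5
    return total / (len(treatment) * len(control))
```

A value of 0.5 means no difference between arms; values above 0.5 favor treatment.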
44 / 51
Change from baseline
I Frison and Pocock (1992) generalize the following.
I With continuous outcome Y , at least 3 ways to analyze:
I T-test on end-of-study value, treatment effect estimate δ̂ = ȲT − ȲC.
I T-test on change from baseline, treatment effect estimator
δ̂ = (ȲT − X̄T) − (ȲC − X̄C).
I Analysis of covariance (ANCOVA) regression using baseline value
as covariate:
Y = β0 + β1 X + β2 Z + ε,
where Z is the treatment indicator and ε is a random error independent
of X. Treatment effect estimator δ̂ = ȲT − ȲC − β̂1 (X̄T − X̄C).
I Which one is best?
I Assume ANCOVA model is correct.
I Unconditionally (averaged over distribution of X ), all 3 estimate
the same parameter, E(YT ) − E(YC ) because X is baseline
variable, so E(XT ) = E(XC ).
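A small simulation illustrates the comparison: under the ANCOVA model all three estimators are unbiased for the treatment effect δ, and ANCOVA has the smallest variance. This is an illustrative sketch (numpy assumed; constants arbitrary):

```python
import numpy as np

rng = np.random.default_rng(0)
n, beta1, delta, reps = 100, 2.0, 0.5, 2000
est = {"post": [], "change": [], "ancova": []}

for _ in range(reps):
    z = np.repeat([0.0, 1.0], n)              # 1:1 allocation
    x = rng.normal(size=2 * n)                # baseline value X
    y = beta1 * x + delta * z + rng.normal(size=2 * n)
    xt, xc = x[n:].mean(), x[:n].mean()       # arm means of X
    yt, yc = y[n:].mean(), y[:n].mean()       # arm means of Y
    # ANCOVA fit of Y on (1, X, Z) to get the slope beta1-hat
    b = np.linalg.lstsq(np.column_stack([np.ones(2 * n), x, z]),
                        y, rcond=None)[0]
    est["post"].append(yt - yc)
    est["change"].append((yt - xt) - (yc - xc))
    est["ancova"].append(yt - yc - b[1] * (xt - xc))
# All three average ~delta; ANCOVA has the smallest spread.
```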
45 / 51
Change from baseline
I Asymptotically,
I T-test on Y: var(Y) = β1² σX² + σε².
I T-test on Y − X:
var(Y − X) = var{β0 + (β1 − 1)X + ε} = (β1 − 1)² σX² + σε².
I ANCOVA is essentially a t-test on Y − β1 X, and
var(Y − β1 X) = var(ε) = σε². Smallest variance, so best.

Asymptotic power when σX = σY = 1, ρ = cor(X , Y ).


ρ Post Change ANCOVA
0.00 0.50 0.28 0.50
0.20 0.50 0.34 0.52
0.40 0.50 0.43 0.57
0.60 0.50 0.59 0.69
0.80 0.50 0.87 0.90
0.90 0.50 0.99 0.99
0.95 0.50 1.00 1.00

Note: Post is better than change if ρ < 0.50.
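The table entries follow from the variance formulas: with σX = σY = 1, var(Y − X) = 2(1 − ρ) and the ANCOVA residual variance is 1 − ρ². Fixing the post-only noncentrality at z_{α/2} (power 0.50), the change-score and ANCOVA noncentralities scale by 1/√(2(1 − ρ)) and 1/√(1 − ρ²). A stdlib-only sketch (upper-tail approximation, ignoring the negligible lower rejection region):

```python
from math import erf, sqrt

def phi(x):
    """Standard normal CDF."""
    return 0.5 * (1.0 + erf(x / sqrt(2.0)))

def power(rho, method, z_alpha=1.959964):
    """Asymptotic power of the three analyses when
    sigma_X = sigma_Y = 1 and the post-only t-test has
    power 0.50 (noncentrality equal to z_alpha)."""
    ncp = z_alpha                             # post-only, variance 1
    if method == "change":
        ncp /= sqrt(2.0 * (1.0 - rho))        # var(Y - X) = 2(1 - rho)
    elif method == "ancova":
        ncp /= sqrt(1.0 - rho ** 2)           # var(Y - b1*X) = 1 - rho^2
    return phi(ncp - z_alpha)

for rho in (0.0, 0.2, 0.4, 0.6, 0.8, 0.9, 0.95):
    print(rho, round(power(rho, "post"), 2),
          round(power(rho, "change"), 2),
          round(power(rho, "ancova"), 2))   # reproduces the table
```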


46 / 51
Special considerations for cluster RCT
Public Access Defibrillation (PAD) Trial (Hallstrom et al. (2004))
I Cardiac arrest has very low survival probability (10%). Can we
improve survival by putting defibrillators in communities and
letting lay people use them?
I Note: No guarantees because lay people might make mistakes
with defibrillator and fail to call 911.
I Communities (shopping malls, apartment buildings, etc.)
randomized to CPR training of lay people (like managers) or
CPR training of lay people plus defibrillators.
I Primary outcome: number of people saved after cardiac arrest.
I In community-randomized trial, think of community like we think
of individuals in individual-randomized trial. Primary endpoint is
measured in each community (number of saves).
I 993 communities! Most community-randomized trials have only
about 20 communities.
47 / 51
Conclusions

I The choice of a good primary outcome is paramount. Ultimately,


an RCT will be judged a success or failure based on the primary
outcome results
I For RCTs there are rigorous standards for the primary outcome: a
single endpoint with a pre-specified ITT analysis
I Reliability/feasibility of measurement need to be considered
I Surrogate endpoints are often used in early phase trials, but a
definitive trial on the clinical endpoint is required to truly
understand treatment effect (remember CAST!)
I When choosing a primary analysis, robustness, power and
interpretability all come into play

48 / 51
References I
Action to Control Cardiovascular Risk in Diabetes Study Group (2008). Effects of intensive glucose
lowering in type 2 diabetes. New England Journal of Medicine 358, 2545–2559.
Buyse, M. and Molenberghs, G. (1998). Criteria for the validation of surrogate endpoints in
randomized experiments. Biometrics pages 1014–1029.
Buyse, M., Molenberghs, G., Burzykowski, T., Renard, D., and Geys, H. (2000). The validation of
surrogate endpoints in meta-analyses of randomized experiments. Biostatistics 1, 49–67.
Ellenberg, S. S. and Hamilton, J. M. (1989). Surrogate endpoints in clinical trials: cancer. Statistics
in Medicine 8, 405–413.
Evans, S. R., Rubin, D., Follmann, D., Pennello, G., Huskins, W. C., Powers, J. H., Schoenfeld, D.,
Chuang-Stein, C., Cosgrove, S. E., Fowler Jr, V. G., et al. (2015). Desirability of outcome ranking
(DOOR) and response adjusted for duration of antibiotic risk (RADAR). Clinical Infectious Diseases
61, 800–806.
Frison, L. and Pocock, S. J. (1992). Repeated measures in clinical trials: analysis using mean
summary statistics and its implications for design. Statistics in Medicine 11, 1685–1704.
Greene, H. L., Roden, D. M., Katz, R. J., Woosley, R. L., Salerno, D. M., and Henthorn, R. W.
(1992). The cardiac arrhythmia suppression trial: First CAST . . . then CAST-II. Journal of the
American College of Cardiology 19, 894–898.
Hallstrom, A., Ornato, J., Weisfeldt, M., Travers, A., Christenson, J., McBurnie, M., Zalenski, R.,
Becker, L., and Proschan, M. (2004). Public-access defibrillation and survival after
out-of-hospital cardiac arrest. New England Journal of Medicine 351, 637–646.

49 / 51
References II
Little, R., D’Agostino, R., Cohen, M., Dickersin, K., Emerson, S., Farrar, J., Frangakis, C., Hogan, J.,
Molenberghs, G., Murphy, S., and Neaton, J. (2012). The prevention and treatment of missing
data in clinical trials. NEJM 367, 1355–60.
Molenberghs, G., Buyse, M., Geys, H., Renard, D., Burzykowski, T., and Alonso, A. (2002).
Statistical challenges in the evaluation of surrogate endpoints in randomized trials. Controlled
Clinical Trials 23, 607–625.
Moore, T. (1995). Deadly medicine: Why tens of thousands of heart patients died in America’s
worst drug disaster. Simon and Schuster.
Neaton, J. D., Gray, G., Zuckerman, B. D., and Konstam, M. A. (2005). Key issues in end point
selection for heart failure trials: composite end points. Journal of Cardiac Failure 11, 567–575.
Pocock, S. J., Ariti, C. A., Collier, T. J., and Wang, D. (2012). The win ratio: a new approach to the
analysis of composite endpoints in clinical trials based on clinical priorities. European Heart
Journal 33, 176–182.
Ruskin, J. N. (1989). The cardiac arrhythmia suppression trial (CAST).
Shaw, P. A. (2018). Use of composite outcomes to assess risk–benefit in clinical trials. Clinical
Trials 15, 352–358.
Shaw, P. A. and Fay, M. P. (2016). A rank test for bivariate time-to-event outcomes when one event
is a surrogate. Statistics in Medicine 35, 3413–3423.
Svensson, S., Menkes, D., and Lexchin, J. (2013). Surrogate outcomes in clinical trials: A
cautionary tail. JAMA Internal Medicine 173, 611–612.

50 / 51
References III

Wang, Y. and Tian, L. (2017). The equivalence between Mann-Whitney Wilcoxon test and score test
based on the proportional odds model for ordinal responses. In 2017 4th International
Conference on Industrial Economics System and Industrial Security Engineering (IEIS), pages
1–5. IEEE.

51 / 51
Lecture 2: Randomization

1 / 52
Outline

I Basic principles
I Randomization Methods
- simple, permuted block, stratified
I Cluster vs individual designs
I Platform trials
I Adaptive randomization
I Threats to integrity of randomization

2 / 52
Basic definitions (1)

What do we mean by random?

I In everyday speech, we describe a process as random if there
is no discernible pattern
I In statistics, random characterizes a process of selection that is
governed by a known probability rule
- e.g., 2 treatments have equal chance of being assigned
- Random in statistics does not mean haphazard

3 / 52
Basic Definitions (2)

What is random treatment allocation?


By random allocation, we mean that each patient has a known
chance, usually equal chance, of being given each treatment, but the
treatment to be given cannot be predicted. (Altman 1991)
I Randomization is the act of allocating a random treatment
assignment.
I A patient is said to be randomized to a treatment arm or group
when they are assigned to the treatment group using random
allocation
I Clinical trials that use random treatment allocation are referred to
as randomized clinical trials

4 / 52
Random Examples

Flip a coin
- “Heads” and “Tails” have equal chance for a fair coin

Rolling a die
- The numbers 1 through 6 have equal chance of coming up

Draw one ball out of an urn filled with 10 red balls and 10 blue balls
- The chances of drawing a red ball and a blue ball are equal

5 / 52
Random Examples

Everybody pick a random number from this list


0 1 2 3 4 5 6 7 8 9 10

6 / 52
Examples of Randomized Designs

I Parallel 1-1 randomized 2-arm trial


I Parallel k-1 randomized 2-arm trial
I Factorial designs: Two or more treatments given in combination:
AB, aB, Ab, ab
I Crossover trials: every patient gets all treatments under study
I Cluster randomized trials: entire communities are randomized to
receive a treatment (example: anti-smoking campaign for high
schools)

7 / 52
Motivation Behind Randomization

I Randomization tries to ensure that only one factor is different


between two or more study groups.
I Provides basis for valid statistical tests between treatment groups
I Randomization means we can attribute causality, i.e. any
between group difference in outcomes can be attributed to the
treatment
I In truth, randomization does not guarantee causality, but it
increases the likelihood that causality is the main driving factor

8 / 52
Ethics of Randomization

Equipoise – uncertainty about which intervention under study in a
clinical trial would have a better outcome for the participant
- The fundamental principle underlying the ethics of random
treatment allocation

9 / 52
Masking/Blinding: Key Components of Randomization

I Double-blinded trial: the treatment assignment is masked so


neither the investigator nor participants know the treatment
assignment
- Treatment assignments are masked, individuals are blinded
I Single-blinded trial : only one of investigator/participant (usually
investigator) knows the treatment assignment
I Unpredictability of treatment allocation prevents selection bias
- Even when treatment can’t be blinded, it is helpful to have a blinded
randomization process
I Maintaining blind throughout the trial prevents
evaluation/response bias

10 / 52
Ways to Randomize

I Standard ways:
- Computer programs (R, Stata, SAS, REDCap, ...)
- Random number tables
- Online tools (e.g., randomization.com)
I NOT legitimate
- Odd vs even birth dates
- Last digit of the medical record number
- Alternate as patients enroll
I Theoretically legitimate, but not so in practice
- Flipping a coin
- Rolling dice
- Drawing balls (m&ms) out of an urn (bag)

11 / 52
Summary of Important Features of Randomization

I Random Allocation
- Known chance receiving a treatment
- Cannot predict the treatment to be given
- Scheme is reproducible
I Minimizes the risk of selection bias
I In double-blinded trials, no response/evaluation bias
I Similar treatment groups
- Patient characteristics will tend to be balanced across study arms
- Chance baseline imbalances between groups may still occur

12 / 52
Types of Randomization

I Simple
I Blocked Randomization
I Stratified Randomization
I Cluster Randomization
I Baseline Covariate Adaptive Allocation
I Response Adaptive Allocation (using interim data)

13 / 52
Simple Randomization

I Randomize each patient to a treatment with a known probability


- For example, to assign one of (T,C) with equal chance then:
Use a random number generator to generate u in (0,1);
if u < 0.5 assign C; if u >= 0.5 assign T
I Advantage: Simple to conduct
I Advantage: Simple to analyze. The usual two-group tests ( t-test,
Wilcoxon, Fisher’s exact, etc) are appropriate
I Disadvantage: Could have imbalance in # per arm or trends in
group assignment
- No guarantee equal number of heads and tails
- Could have runs of heads or tails
- Could have different distributions of a trait like gender in the
different arms
I Particularly good for large trials

14 / 52
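The u < 0.5 rule above takes only a few lines to implement. A minimal Python sketch (the course's own code is in R; the function name and seed here are illustrative only):

```python
import random

def simple_randomize(n, seed=None):
    """Simple randomization: independently per patient, u < 0.5 -> C, else T."""
    rng = random.Random(seed)
    return ['C' if rng.random() < 0.5 else 'T' for _ in range(n)]

assignments = simple_randomize(24, seed=31415)
# Note: the counts of 'T' and 'C' need not be equal -- the imbalance
# risk noted above, which shrinks as n grows.
```

Archiving the seed makes the scheme reproducible, one of the requirements of proper random allocation.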
Chance of Imbalance Decreases with Sample Size

E.g., suppose 1000 women; expected & “worst case” allocation
across T and C:

15 / 52
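The shrinking chance of a bad split can be checked directly with a normal approximation to the binomial. A Python sketch (the 550/450 and 55/45 cutoffs are illustrative choices, not from the slide):

```python
import math

def prob_imbalance_worse_than(n, k):
    """P(|#assigned to T - n/2| > k) under simple 1:1 randomization,
    using the normal approximation to Binomial(n, 1/2)."""
    sd = math.sqrt(n * 0.25)               # binomial standard deviation
    z = k / sd
    return 2 * (1 - 0.5 * (1 + math.erf(z / math.sqrt(2))))

# With 1000 participants, a split worse than 550/450 is very unlikely,
# while the same 10% relative imbalance is common with only 100:
p_large = prob_imbalance_worse_than(1000, 50)
p_small = prob_imbalance_worse_than(100, 10)
```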
Block Randomization

I Each block contains the desired treatment ratio. For
example: equal numbers of patients assigned to each treatment
within a block
- Sample size 24, block size = 6, 2 study interventions (A & B):
BAABAB AAABBB ABABAB BBABAA
I Exactly balanced after each completed block
I Ensures the number of patients on each arm at any given time is
never far out of balance
- Maintaining balance over time protects against unintended patterns
created by changes in patient population over time
I Good for small and modest sample sizes

16 / 52
Block Randomization (2)

I Block size can be fixed or random


I Variable (random) block sizes add an additional layer of
unpredictability, especially if treatment is not masked
I Does not protect against the possibility of an imbalance in a trait like
gender between the two arms
I Any complication means more ways to make a mistake: Test
algorithm!!!
- Archive code and results (preserve reproducibility)

17 / 52
Issues for Block Randomization
I If blocking is not masked, the sequence can get predictable
Example: Block size 4
ABABBAB? Must be A.
AA?? Must be B B.
I If block too small, unblinding one subject can reveal rest of block
- i.e. if block size is 2, knowing one reveals a second
- Solution: use random block sizes, don’t use block size of 2
I Predictability can lead to selection bias
I Simple solution to selection bias
- Do not reveal blocking mechanism
- Use random block sizes
I Proper analysis would incorporate the blocking used in
randomization, such as a test stratified on the randomization
blocks (Matts and Lachin, 1988)
I This is rarely done
I This is why some have advocated simple randomization for larger
trials: it allows a simpler analysis (Lachin et al., 1988)
18 / 52
Sample Code in R

> library(blockrand)
> set.seed(31415)
> list<-blockrand(24,num.levels=2,
levels=c("T","C"),id.prefix="CCP2-",block.sizes=2:4)
> list
id block.id block.size treatment
1 CCP2-01 1 6 T
2 CCP2-02 1 6 T
3 CCP2-03 1 6 C
...
28 CCP2-28 5 8 T
29 CCP2-29 5 8 T
30 CCP2-30 5 8 C
> table(list$treatment)
C T
15 15

19 / 52
Blocked Randomization Example: Flu Vaccine Dose
Escalation Study

I An early phase I dose-finding study for a flu vaccine candidate


sought to investigate 6 dose levels in a blinded, placebo
controlled study
- Considered single (one nare) and double dose (both nares) of
0.25mg, 0.5 mg, and 1mg administered intranasally
- Primary outcome was safety and tolerability
I Dose cohorts had 5 active and 2 placebo subjects
- Randomized block design, with a block of size 7
- The order of the 5 active and 2 placebo assignments are randomly
permuted for each dose group
- Note: if you want a 5:2 ratio, then block sizes have to be multiples of 7

20 / 52
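The 5:2 dose cohorts described above can be generated by permuting each block of 7. A Python sketch of the idea (the labels 'A'/'P' and function name are illustrative):

```python
import random

def dose_cohort_block(n_active=5, n_placebo=2, rng=random):
    """One permuted block: randomly order 5 active ('A') and 2 placebo ('P')."""
    block = ['A'] * n_active + ['P'] * n_placebo
    rng.shuffle(block)
    return block

rng = random.Random(2023)
cohort = dose_cohort_block(rng=rng)   # one dose cohort of 7 subjects
```

Every completed block preserves the exact 5:2 ratio, which is why block sizes must be multiples of 7.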
Stratified Randomization
I A priori certain factors known to be important predictors of
outcome (e.g. age, gender, diabetes)
I AABB BABA BABA BAAB: a balanced trial of 16, but what if the women
are patients 1, 2, 6, 8, and 16?
I Stratified randomization: Randomize within strata so different
levels of the factor are balanced between treatment groups
I Stratified blocked randomization is a useful way to achieve
balance
- For each subgroup or strata, perform a separate block
randomization

## stratified by sex, 100 per stratum, 2 treatments
library(blockrand)
male <- blockrand(n=100, id.prefix='M',
                  block.prefix='M', stratum='Male')
female <- blockrand(n=100, id.prefix='F',
                    block.prefix='F', stratum='Female')
21 / 52
Considerations for Stratified/Blocked Randomization
I Common choices for strata
- Strong prognostic variables: age, gender, diabetes
- Logistics and politics can motivate stratification by center
I Balance will be defeated if you choose too many strata and wind
up with many incomplete blocks
- Strata add up quickly: 5 age groups, 2 genders, 3 centers = 30
strata
I Stratification should be taken into account in the data analysis
- Blocks commonly ignored due to preference for a simple (easy to
understand) analysis
- Adjusting for strong prognostic variables can help with precision
(Pocock et al., 2002; Tsiatis et al., 2008)
- Not adjusting for stratification variables can result in inflated
standard errors and incorrect nominal confidence interval coverage
(Kahan and Morris, 2012)
- Adjusting for too many factors could be a concern for small trials,
another reason to keep strata # small (Kahan and Morris, 2012)
22 / 52
Stratified Block Randomization Example
Preexposure Prophylaxis Initiative (iPrEx) Trial
REF: NEJM 2010 v363 (27): 2587-2599

I Double-blinded placebo-controlled randomized trial examining


safety and efficacy of a chemoprophylaxis regimen (once-daily
oral FTC–TDF) for HIV prevention
I International multi-center study
- 9 sites: US: Boston, San Francisco; Peru: Iquitos, Lima; Brazil:
São Paulo, Rio de Janeiro (2 sites); Ecuador; Guayaquil; Thailand:
Chiang Mai
- Multiple advantages to achieving balanced allocation by site
I 2499 HIV- men or transgender women were randomly assigned
in blocks of 10, stratified according to site
- Main analysis an unadjusted logrank test for HIV seroconversion

23 / 52
Design consideration: Who/What to Randomize

I Person
- Most common unit of randomization in RCTs
I Provider
- Doctor
- Nursing station
I Locality
- School
- Community
I The sample size is predominantly determined by the number of
randomized units
- This is due to correlation of repeated samples within a
person/doctor/community

24 / 52
Cluster Randomization

I Same ideas as before


I Unit of randomization
- School/Clinic/Hospital/Providers/Community
I Outcome measurement
- Students/Patients
I Need to use special models for analysis when those reporting
outcomes are nested within a cluster, to account for within cluster
correlation
I Best for interventions meant to be implemented at community
level (smoking cessation program) and relatively quick and easy
to assess outcome
- Cost can often be an issue
- In today's world, isolated communities are harder to find

25 / 52
Randomization in Platform Trials

I Platform trials compare multiple intervention arms to the same
control to treat a single disease
- Different from umbrella trials that might be studying multiple
indications for a single drug
I A common strategy is to randomize first to a component [(A,C)
(B,C) (D,C)] and then randomize to arm in that component (Drug
vs Control)
I Randomization probabilities are generally set so that you have
approximately equal sized groups for each drug and control
- From a power standpoint with fixed total sample size, nC/nA =
sqrt(# active arms) is optimal
- But there is often a preference for equal sample sizes

26 / 52
Platform Example

27 / 52
Considerations for Platform Trials
Gold et al. (2022); Berry et al. (2015)

I Attractive in settings where there may be multiple novel


candidates, potentially evolving over time
- COVID 19: Solidarity, Recovery, ACTIV-k
- Cancer
I Analytical Downsides:
- Comparisons between active drugs are a little tricky: unless all
patients are eligible for all components, lumping all exposed patients
is not a randomized comparison
- If adding interventions over time, need to worry about issue of
non-concurrent controls
I Upsides
- Can take more patients into trial.
- Efficient infrastructure

28 / 52
The Danger of Non-concurrent Controls
Dodd et al. (2021)

29 / 52
What is Adaptive Randomization?

I All previously discussed methods of randomization were


examples of fixed allocation schemes
- Order of treatment assignments can be completely determined in
advance of the trial
I Adaptive randomization schemes “adapt” or change according
to characteristics of subjects enrolled in trial
- Sequence of treatment assignments cannot be determined in
advance
- Probability of assigning a new participant a particular treatment can
change over time
- Two major classes: adaptive with respect to baseline
characteristics or with respect to patient outcomes
I Note that not all adaptive trials involve adaptive randomization;
group sequential trials are one example

30 / 52
Baseline Adaptive Schemes (1)

I Biased coin randomization: allocates treatment for the next


participant with a probability that depends on current balance
between arms
- Introduced by Brad Efron, suggested p=2/3 for the arm with fewer
participants
- Benefits: low probability of long runs, maintaining simple coin-flip
type randomization, avoids the potential unmasking problems of
permuted blocks
- Con: statistical analysis less straightforward. Familiar tests lose
their asymptotic normality, and exact inference is recommended
(Markaryan and Rosenberger, 2010)

31 / 52
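Efron's rule is simple to state in code. A minimal Python sketch (p = 2/3 as on the slide; the helper name is ours):

```python
import random

def biased_coin_next(n_A, n_B, p=2/3, rng=random):
    """Efron's biased coin: if the arms are balanced, flip fairly;
    otherwise favor the under-represented arm with probability p."""
    if n_A == n_B:
        return 'A' if rng.random() < 0.5 else 'B'
    lagging, leading = ('A', 'B') if n_A < n_B else ('B', 'A')
    return lagging if rng.random() < p else leading

# Simulate a trial: the arm counts stay close throughout.
rng = random.Random(7)
counts = {'A': 0, 'B': 0}
for _ in range(500):
    counts[biased_coin_next(counts['A'], counts['B'], rng=rng)] += 1
```

Unlike permuted blocks, no assignment is ever fully predictable, yet long runs and large imbalances remain unlikely.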
Baseline Adaptive Schemes (2)
I Dynamic allocation algorithms based on maintaining balance
across multiple important prognostic variables
- Develop an index of imbalance across multiple baseline
covariates
- Minimization: next treatment assignment minimizes current
imbalance
- Other dynamic allocation schemes give the treatment which
minimizes the imbalance a higher probability of assignment
- Benefits: can maintain balance across several prognostic
variables, without worrying about lots of incomplete blocks.
Maintains balance better than stratified permuted block,
particularly in small trials and/or many covariates
- Cons: statistical analysis less straightforward, easy to screw up,
hard to document. Classic problem: what happens if you find an error
in the allocation or a participant's data
32 / 52
Eye-Opening Experience for Minimization

I Genzyme conducted Late Onset Treatment Study (LOTS)


- 90 patients with late-onset Pompe disease
- Primary outcome: 6 minute walk test
- 2:1 allocation to drug/placebo using minimization
I Site
I BL 6 minute walk (≤300m, >300m)
I Forced vital capacity (≤55% pred., >55% pred.)

- One analysis requested by FDA: re-randomization test

33 / 52
Eye-Opening Experience for Minimization

I At the time, FDA was skeptical about minimization, so they


require companies to use a re-randomization test
I Proponents of minimization argue that you can do a
re-randomization test, but it is unnecessary because you get
about same answer as t-test
I Wrong!

34 / 52
ANCOVA p = 0.035 Rerandomization p = 0.06

35 / 52
Eye-Opening Experience for Minimization

I The problem is that minimization severely limits amount of


randomization
I The particular randomization scheme for unequal allocation was
flawed (see Kuznetsova and Tymofyeyev (2012) for how to fix it)
For the statistical geeks:
I Big problem: mean of re-randomization distribution is NOT 0
- It is 0 for standard randomization methods
I Nonzero mean causes loss of efficiency of re-randomization test:
no longer close to t-test even for very large sample sizes

36 / 52
Eye-Opening Experience for Minimization

I For more details on LOTS trial, see Van der Ploeg et al (2010)
NEJM 362, 1396-1406
I For more details about statistical problems minimization caused
see Proschan et al. (2011), and for how to fix them, see
Kuznetsova and Tymofyeyev (2012)
I For more details about mathematics of randomization see:
Rosenberger, W. F., and Lachin, J. M. (2015). Randomization in
clinical trials: theory and practice. John Wiley & Sons.

37 / 52
Response Adaptive Schemes

I Response adaptive allocation: responses of participants


enrolled to date are taken into account when randomizing next
participant
- Relies on the assumptions that response to treatment can be assessed
fairly quickly and that the cohort is not changing over time
I Zelen’s Play the Winner: assigns the same treatment if the previous
patient was a success, and the other treatment otherwise
I Randomized Play the Winner: gives more successful treatment
a higher chance of allocation (but p<1)
I Many other methods, including Bayesian approaches

38 / 52
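Randomized Play the Winner is often described as an urn model: start with one ball per arm; each success adds a ball for that arm, each failure adds a ball for the other arm. A Python sketch of the RPW(1,1) version (class and method names are ours):

```python
import random

class RandomizedPlayTheWinner:
    """RPW(1,1) urn: draw the next arm in proportion to its balls."""
    def __init__(self, seed=None):
        self.urn = {'A': 1, 'B': 1}
        self.rng = random.Random(seed)

    def next_assignment(self):
        total = self.urn['A'] + self.urn['B']
        return 'A' if self.rng.random() < self.urn['A'] / total else 'B'

    def record_outcome(self, arm, success):
        # Success reinforces the same arm; failure reinforces the other.
        other = 'B' if arm == 'A' else 'A'
        self.urn[arm if success else other] += 1
```

Allocation probabilities drift toward the better-performing arm but never reach 1, unlike Zelen's deterministic rule.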
Goals of Response Adaptive Schemes can vary

I Some may seek to relate the probability of assignment to a treatment
arm to its probability of a positive response
- Lots of algorithms; methods vary in whether this is done
deterministically or probabilistically
I Some response adaptive schemes may target increasing power

39 / 52
ECMO Trial: A Cautionary Tale for Response
Adaptive Allocation

I ECMO Trial: study of extracorporeal membrane oxygenator in


newborns suffering from respiratory failure
I Play the winner type algorithm used for treatment allocation
I First baby randomly assigned to active arm and was a success;
2nd baby randomized to control and died; next 9? babies
assigned to experimental ECMO arm and survived
I Trial stopped after 2 more babies non-randomly assigned ECMO
I By chance, control baby was the sickest
I After much controversy, a second trial was launched
I More controversy and debates over methodology and ethics

40 / 52
Challenges of Response Adaptive Schemes:
Analytical properties are hard to decipher
I Some argue these trials are more ethical, because they aim to
maximize the number of people on the better treatment
- There have also been claims of statistical efficiency, but these
have now been shown to be false
I Adaptive allocation designs are difficult to implement without
mistakes or problems with blinding
I Inference for response-adaptive randomization is very
complicated because both the treatment assignment and
responses are correlated (Rosenberger and Lachin, 2015)
I Analytical properties are not well-established, especially of new
designs
I Advice: These methods are controversial and prone to
problems, avoid unless you are an expert and willing to repeat
your trial
41 / 52
Lessons from ECMO If you must use RAR
Proschan and Evans (2020); Chandereng and Chappell (2020)

I Randomization should have a fairly long run-in of standard
randomization before you start the adaptive allocations
I In a multi-arm trial, you can change the probability of being
assigned to an active component, but keep the probability of being
assigned to the control the same
I In trial analysis, need to have methods to adjust for time trends
- Essentially doing a stratified analysis within time buckets

42 / 52
What Randomization scheme is best?

I Depends on the study and resources available


- Currently we would likely never recommend response-adaptive allocation
- Best scheme likely dictated by what is practical given resources,
including programming resources and other infrastructure
I Keep it simple
- Simple randomization: hard to mess up, large trials will be
balanced
- Permuted block randomization: simple, widely used, widely
understood
- Stratified by site: common choice for multi-site trials
- Choose block size(s) appropriate to sample size
- Randomize at last possible second

43 / 52
Maintaining Randomization Integrity

I Fundamental motivation of randomization: create comparable


treatment groups
- Allows causality inference
I To maintain comparability, primary analysis is an intent-to-treat
(ITT) analysis
- All subjects are analyzed according to randomization assignment,
regardless of what treatment they actually get

44 / 52
Flavors of ITT

I ITT analysis
- Analyze according to the study regimen assigned
- Requires models to weight observed outcomes or impute missing ones,
plus sensitivity analyses
- Only analysis which preserves randomization
I Modified ITT (MITT) analysis
- ITT, but only include people who take the first dosage
- In well-implemented trials few people drop out before first dose
- Potentially minor departure from ITT if blinded

45 / 52
Analysis Choices (2)

I Per Protocol Analysis:


- Analysis includes data only from completers/adherers
- Subject to bias: analyzes only the well-behaved and potentially only
the healthiest participants (no adverse events)
- Especially problematic when drop out rates different by treatment
arm

46 / 52
Threats to Randomization Integrity

I Improper masking or blinding


- Bias will creep into data
I Excluding subjects who withdraw from treatment: can lead to
bias
I Drop-out/missing data: breaks randomization without ITT;
weakens treatment result with ITT
- Long trials may need to have a screening period to assess
commitment of subjects before randomization

47 / 52
“Analyze as you randomize”

I Analysis of study results generally should take into account the


method of randomization
- Adjusting for stratification is recommended (to avoid overly wide
confidence intervals)
- Adaptive procedures need to be accounted for
- Ignoring “blocks” is standard and generally considered okay

48 / 52
Summary

I Permuted block randomization often the best


- Stratify on only a few factors, usually one or two
- Choose block size(s) appropriate to sample size
I Randomize smallest independent element at last possible
second
I Masking/blinding is key for preventing bias
I ITT (intent to treat) analysis necessary to preserve
randomization and infer causality, or lack thereof
I Proper documentation as important as proper implementation

49 / 52
Conclusion

I Randomized Studies are the Gold Standard of Clinical


Research
I Randomization to treatments separates clinical trials from all
other studies; don’t muck it up!
I Randomization
- Eliminates selection bias
- Forms basis for statistical tests
- Balances arms with respect to prognostic variables (known and
unknown)

50 / 52
References I
Berry, S. M., Connor, J. T., and Lewis, R. J. (2015). The platform trial: an efficient strategy for
evaluating multiple treatments. Jama 313, 1619–1620.
Chandereng, T. and Chappell, R. (2020). How to do response-adaptive randomization (RAR) if you
really must. Clinical Infectious Diseases 73, 560.
Dodd, L. E., Freidlin, B., and Korn, E. L. (2021). Platform trials—beware the noncomparable control
group. New England Journal of Medicine 384, 1572–1573.
Gold, S. M., Bofill Roig, M., Miranda, J. J., Pariante, C., Posch, M., and Otte, C. (2022). Platform
trials and the future of evaluating therapeutic behavioural interventions. Nature Reviews
Psychology 1, 7–8.
Kahan, B. C. and Morris, T. P. (2012). Improper analysis of trials randomised using stratified blocks
or minimisation. Statistics in medicine 31, 328–340.
Kuznetsova, O. M. and Tymofyeyev, Y. (2012). Preserving the allocation ratio at every allocation
with biased coin randomization and minimization in studies with unequal allocation. Statistics in
Medicine 31, 701–723.
Lachin, J. M., Matts, J. P., and Wei, L. (1988). Randomization in clinical trials: conclusions and
recommendations. Controlled clinical trials 9, 365–374.
Markaryan, T. and Rosenberger, W. F. (2010). Exact properties of Efron's biased coin
randomization procedure. The Annals of Statistics 38, 1546–1567.
Matts, J. P. and Lachin, J. M. (1988). Properties of permuted-block randomization in clinical trials.
Controlled clinical trials 9, 327–344.

51 / 52
References II

Pocock, S. J., Assmann, S. E., Enos, L. E., and Kasten, L. E. (2002). Subgroup analysis, covariate
adjustment and baseline comparisons in clinical trial reporting: current practice and problems.
Statistics in medicine 21, 2917–2930.
Proschan, M., Brittain, E., and Kammerman, L. (2011). Minimize the use of minimization with
unequal allocation. Biometrics 67, 1135–1141.
Proschan, M. and Evans, S. (2020). Resist the temptation of response-adaptive randomization.
Clinical Infectious Diseases 71, 3002–3004.
Rosenberger, W. F. and Lachin, J. M. (2015). Randomization in clinical trials: theory and practice.
John Wiley & Sons.
Tsiatis, A. A., Davidian, M., Zhang, M., and Lu, X. (2008). Covariate adjustment for two-sample
treatment comparisons in randomized clinical trials: a principled yet flexible approach. Statistics
in medicine 27, 4658–4677.

52 / 52
Lecture 3: Sample Size/Power

1 / 45
Outline
Introduction to Power/Sample Size
Introduction to EZ Principle
Where Does The Key Formula Come From?
General EZ Principle and Applications
t-test
Test of Proportions
Survival
Noninferiority
Lack of Reproducibility
Sample Size: Practical Aspects
Treatment Effect
Nuisance Parameters
Sample Size: Estimation
Sample Size: Safety
1 / 45
Introduction to Power/Sample Size

I Clinical trials are the gold standard of evidence.

I Clinical trials use hypothesis testing to choose between null and


alternative hypotheses:

H0 : treatment has no effect


H1 : treatment has an effect.

I We hope to reject H0 and conclude that treatment works.

I α, the probability of falsely rejecting H0 (making a type 1 error)


is set low to avoid approving an ineffective treatment.

I If H0 is rejected, there is strong evidence against the null


hypothesis and in favor of treatment benefit.

2 / 45
Introduction to Power/Sample Size

I But abandoning an effective treatment by failing to reject H0


when H1 is true (making a type 2 error) is also a serious error.

I To be confident we are not making a type 2 error, we should


make β , the probability of a type 2 error, low.

I Equivalently, we should make power , namely


1 − β = P( rejecting H0 when H1 is true), high.

I If power is high and we still do not reject H0 , treatment probably


did not have its intended effect.

3 / 45
Introduction to Power/Sample Size

I See chapter 8 of Proschan (2022).

I Standardized test statistics (z-scores) in clinical trials are often:


I Of form Z = δ̂/se(δ̂), where δ̂ is a treatment effect estimator.
I Approximately N(θ , 1) for large sample sizes, where θ = 0 under
H0 .

I Examples:
I T-statistic: δ̂ = ȲT − ȲC ; se(δ̂) = √(2σ²/n).
I Z-score for proportions: δ̂ = p̂T − p̂C ; se(δ̂) ≈ √(2p(1 − p)/n).
I Z-score for logrank statistic: δ̂ = ∑(Oi − Ei )/∑Vi estimates the log
hazard ratio; se(δ̂) = 1/√(∑Vi ).
I Z-scores for maximum likelihood estimators (MLEs), minimum
variance unbiased estimators, Cox models, etc.

4 / 45
Outline
Introduction to Power/Sample Size
Introduction to EZ Principle
Where Does The Key Formula Come From?
General EZ Principle and Applications
t-test
Test of Proportions
Survival
Noninferiority
Lack of Reproducibility
Sample Size: Practical Aspects
Treatment Effect
Nuisance Parameters
Sample Size: Estimation
Sample Size: Safety
Introduction to EZ Principle

I There is really only one power/sample size formula.

I EZ principle (it's easy!): Power depends on E(Z ), the expected


z-score.

I Parameterize so that large z-scores mean treatment is beneficial.


E.g., may need to change µT − µC to µC − µT .

I For a 2-sided test at α = 0.05 (or a 1-sided test at α = 0.025),


E(Z ) must be:
I 3.24 for 90% power.
I 3.00 for 85% power.
I 2.80 for 80% power.
I We will see justification after some examples.

5 / 45
Introduction to EZ Principle
I Makes checking sample size calculations quick and easy.

I Example: You compare a new treatment to standard treatment


for hepatitis C virus.

I Primary outcome: change in log viral load from baseline. Use t-test.
I Want 80% power for difference δ = 0.5 and you expect σ = 1.25.
I Investigator says you need 50/arm. Is that correct?

I Z = δ̂/√(2σ²/n), δ̂ = ȲC − ȲT .

I Expected z-score is

E(Z) = (µC − µT)/√(2σ²/n) = 0.5/√(2(1.25)²/50) = 2.

6 / 45
Introduction to EZ Principle

I Expected z-score is lower than 2.80.

I The trial is underpowered.

I Investigator: “Sorry, I meant 100 per arm.”

I Check:
δ 0.5
E(Z ) = p =p = 2.828.
2σ 2 /n 2(1.25)2 /100
I Close to 2.80. Sample size is accurate.

7 / 45
Introduction to EZ Principle

I Example: New treatment for hospitalized COVID-19 patients on


mechanical ventilation/ECMO.
I Primary endpoint: 60-day mortality.
I Want 85% power to detect improvement in 60-day mortality from
0.20 to 0.12.
I Statistician reports you need n = 1,000 per arm. Is it correct?
Parameterize so large values are good:

Z = δ̂/se(δ̂) = (p̂C − p̂T)/√(2p(1 − p)/n),   p = (0.20 + 0.12)/2 = 0.16.

I Expected z-score is:

E(Z) = (pC − pT)/√(2p(1 − p)/n) = (0.20 − 0.12)/√(2(0.16)(1 − 0.16)/1000) = 4.880.

8 / 45
Introduction to EZ Principle

I Expected z-score is much greater than 3.00.

I Trial is overpowered.

I Statistician: “My bad. I meant 400 per arm.”

I Check:
E(Z) = (pC − pT)/√(2p(1 − p)/n) = (0.20 − 0.12)/√(2(0.16)(1 − 0.16)/400) = 3.086.

I Still slightly overpowered, but not much because E(Z ) is not too
far from 3.00.

9 / 45
Introduction to EZ Principle

I Before looking at more examples, let’s look at the basis for the
EZ principle.

I This will show us a general formula that allows us to compute, for


any α:
I Sample size for a given treatment effect and power.
I Power for a given treatment effect and sample size.
I Treatment effect for a given sample size and power.

10 / 45
Outline
Introduction to Power/Sample Size
Introduction to EZ Principle
Where Does The Key Formula Come From?
General EZ Principle and Applications
t-test
Test of Proportions
Survival
Noninferiority
Lack of Reproducibility
Sample Size: Practical Aspects
Treatment Effect
Nuisance Parameters
Sample Size: Estimation
Sample Size: Safety
Where Does The Key Formula Come from?

Area 0.025

0 1.96

Under H0

Figure: The standard normal null density for the z-statistic. For a 1-tailed test at
α = 0.025, we reject H0 if Z > 1.96.

11 / 45
Where Does The Key Formula Come from?

Area 0.90

0 1.96 θ

Under H1

Figure: The alternative N(θ , 1) density for Z . For power 0.90, we want the blue
shaded area to be 0.90.

12 / 45
Where Does The Key Formula Come from?

Area 0.90

1.96 − θ 0

Figure: The blue shaded area in Figure 2 equals the blue shaded area to the right of
1.96 − θ under the standard normal curve. For power 0.90, 1.96 − θ = −1.28, so
θ = 1.96 + 1.28 = 3.24.
13 / 45
Outline
Introduction to Power/Sample Size
Introduction to EZ Principle
Where Does The Key Formula Come From?
General EZ Principle and Applications
t-test
Test of Proportions
Survival
Noninferiority
Lack of Reproducibility
Sample Size: Practical Aspects
Treatment Effect
Nuisance Parameters
Sample Size: Estimation
Sample Size: Safety
General EZ Principle and Applications

I Same reasoning applies for different levels of α and β .


I For 2-tailed test at level α and power 1 − β , set

E(Z ) = zα/2 + zβ , (EZ Principle) (1)

where, for 0 < a < 1, za denotes the (1 − a)th quantile of a


standard normal distribution.
I zα/2 = 1.96 for α = 0.05, 2-sided test.
I zβ = 0.84, 1.04, or 1.28 for β = 0.20, 0.15, or 0.10.
I zα/2 + zβ = 2.80, 3.00, or 3.24 for 80%, 85%, or 90% power.
I 1 formula, for sample size, power, or detectable effect.

14 / 45
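The constants 2.80, 3.00, and 3.24 follow directly from formula (1). A quick check using Python's standard library (the course's own calculations are in R; the function name is ours):

```python
from statistics import NormalDist

def ez_target(alpha=0.05, power=0.80):
    """Required E(Z) = z_{alpha/2} + z_beta for a 2-sided test, formula (1)."""
    nd = NormalDist()
    return nd.inv_cdf(1 - alpha / 2) + nd.inv_cdf(power)

# 80%, 85%, 90% power at two-sided alpha = 0.05:
targets = [round(ez_target(power=p), 2) for p in (0.80, 0.85, 0.90)]
```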
General EZ Principle and Applications

area= 1 − β area= β

0 zβ

Figure: The area to the right of zβ is β , so the area to the left of zβ is 1 − β =power.

15 / 45
General EZ Principle and Applications
- Example: return to hepatitis C (HCV) trial.
- Primary outcome: change in log viral load from baseline. T-test.
- Want sample size for 80% power for a 2-sided test at α = 0.05.
- δ = 0.5 and σ = 1.25.

  Z = δ̂/se(δ̂) = (ȲC − ȲT)/√(2σ²/n);  E(Z) = δ/√(2σ²/n) = 0.5/√(2(1.25)²/n).

  E(Z) = zα/2 + zβ  (EZ Principle)

  0.5/√(2(1.25)²/n) = 1.96 + 0.84 = 2.80

  n = 2(1.25)²(2.80)²/(0.5)² = 98.

Need 98/arm.
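The arithmetic on this slide is easy to script. A minimal sketch in Python (the function name is mine; the z-values are the ones on the slide):

```python
def n_per_arm_ttest(delta, sigma, z_alpha2=1.96, z_beta=0.84):
    """Per-arm n solving delta / sqrt(2*sigma**2 / n) = z_alpha2 + z_beta.
    Returns the unrounded n; round up to the next integer in practice."""
    return 2 * sigma**2 * (z_alpha2 + z_beta)**2 / delta**2

# HCV example: delta = 0.5 logs, sigma = 1.25, 80% power, 2-sided alpha = 0.05.
print(round(n_per_arm_ttest(0.5, 1.25)))  # 98
```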
General EZ Principle and Applications

- Suppose you can only recruit 75/arm. What is the detectable effect with 80% power?

  Z = δ̂/se(δ̂) = (ȲC − ȲT)/√(2σ²/n)

  E(Z) = δ/√(2(1.25)²/75) = zα/2 + zβ  (EZ Principle)

  δ/√(2(1.25)²/75) = 1.96 + 0.84 = 2.80

  δ = 2.80 √(2(1.25)²/75) = 0.57.

Detectable effect is 0.57 logs.
General EZ Principle and Applications
- What is power for detecting 0.5 log if you only recruit 75/arm?

  Z = δ̂/se(δ̂) = (ȲC − ȲT)/√(2σ²/n);  E(Z) = δ/√(2σ²/n)

  E(Z) = 0.5/√(2(1.25)²/75) = zα/2 + zβ  (EZ Principle)

  2.449 = 1.96 + zβ

  0.489 = zβ

  Φ(0.489) = Φ(zβ) = 1 − β = power,

  where Φ(0.489) is the N(0,1) distribution function at 0.489 (in R, pnorm(0.489)).

Power is Φ(0.489) ≈ 0.69.
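The last two slides' calculations (detectable effect, then power, with 75/arm) can be checked in a few lines; Φ is computed from `math.erf`, and the helper names are mine:

```python
from math import erf, sqrt

def phi(x):
    """Standard normal CDF, written with the error function."""
    return 0.5 * (1 + erf(x / sqrt(2)))

sigma, n = 1.25, 75
se = sqrt(2 * sigma**2 / n)  # standard error of the difference in means

# Detectable effect with 80% power: delta = (z_alpha/2 + z_beta) * se.
delta_detectable = (1.96 + 0.84) * se
print(round(delta_detectable, 2))  # 0.57

# Power to detect delta = 0.5: Phi(E(Z) - z_alpha/2).
power = phi(0.5 / se - 1.96)
print(round(power, 2))  # 0.69
```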
General EZ Principle and Applications
- Return to COVID-19 example:
- Primary endpoint: 60-day mortality.
- Want 85% power to detect improvement in 60-day mortality from 0.20 to 0.12.

  Z = (p̂C − p̂T)/√(2p(1 − p)/n), with p = (0.20 + 0.12)/2 = 0.16 the average event probability.

  E(Z) = (pC − pT)/√(2p(1 − p)/n) = (0.20 − 0.12)/√(2(0.16)(1 − 0.16)/n).

  (0.20 − 0.12)/√(2(0.16)(1 − 0.16)/n) = zα/2 + zβ = 3  (EZ Principle)

  n ≈ 2(0.16)(0.84)(3)²/(0.08)² = 378.   (2)

Need 378/arm.
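The same calculation as a hedged Python sketch (function name is mine; for 85% power, zα/2 + zβ = 1.96 + 1.04 = 3, as on the slide):

```python
def n_per_arm_props(pC, pT, z_sum=3.0):
    """Per-arm n from (pC - pT) / sqrt(2*p*(1-p)/n) = z_alpha/2 + z_beta,
    with p the average of pC and pT. Round up in practice."""
    p = (pC + pT) / 2
    return 2 * p * (1 - p) * z_sum**2 / (pC - pT)**2

# COVID-19 example: 85% power to detect 0.20 vs 0.12.
print(round(n_per_arm_props(0.20, 0.12)))  # 378
```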
General EZ Principle and Applications

- If you can only get 300/arm, what is power?

  Z = (p̂C − p̂T)/√(2p(1 − p)/n).

  E(Z) = (pC − pT)/√(2p(1 − p)/n) = (0.20 − 0.12)/√(2(0.16)(1 − 0.16)/300) = 2.673

  E(Z) = 2.673 = zα/2 + zβ = 1.96 + zβ  (EZ Principle)

  Φ(2.673 − 1.96) = Φ(zβ) = 1 − β = power.   (3)

Power is approximately Φ(0.713) = pnorm(0.713) = 0.76.
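And the reverse direction, solving for power at 300/arm (again an illustrative sketch with Φ from `math.erf`):

```python
from math import erf, sqrt

def phi(x):
    """Standard normal CDF."""
    return 0.5 * (1 + erf(x / sqrt(2)))

pC, pT, n = 0.20, 0.12, 300
p = (pC + pT) / 2                           # average event probability, 0.16
ez = (pC - pT) / sqrt(2 * p * (1 - p) / n)  # E(Z), about 2.673
power = phi(ez - 1.96)
print(round(power, 2))  # 0.76
```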
General EZ Principle and Applications
- Schoenfeld (1981) derives sample size for survival tests.

  Table: 2×2 table at the ith death.

              Dead      Alive             Total
  Control     Oi        nCi − Oi          nCi
  Treatment   1 − Oi    nTi − (1 − Oi)    nTi
  Total       1         ni − 1            ni

  nCi, nTi = numbers at risk in control, treatment just prior to the ith death.

  Oi = indicator that the ith death is in the control arm.

  Under H0 (no difference in survival), the ith death is equally likely to be from any of the ni = nCi + nTi people at risk.

  Oi is Bernoulli with parameter pi = nCi/ni under H0.
General EZ Principle and Applications

Table: 2×2 table at the ith death.

              Dead      Alive             Total
  Control     Oi        nCi − Oi          nCi
  Treatment   1 − Oi    nTi − (1 − Oi)    nTi
  Total       1         ni − 1            ni

  nCi, nTi = numbers at risk in control, treatment just prior to the ith death.

  Oi = indicator that the ith death is in the control arm.

  Ei = pi = nCi/ni = null expected value of Oi, given marginals.

  Vi = pi(1 − pi) = nCi nTi/ni² = null variance of Oi, given marginals.
General EZ Principle and Applications

- FUN FACT: Each δ̂i = (Oi − Ei)/Vi estimates the log hazard ratio and has variance 1/Vi.
- The optimal weighted average of the δ̂i weights inversely proportionally to the variances, wi = 1/var(δ̂i) = Vi:

  δ̂ = (∑ wi δ̂i)/(∑ wi) = (∑ Vi{(Oi − Ei)/Vi})/(∑ Vi) = {∑ (Oi − Ei)}/(∑ Vi),   (4)

  where all sums run over the d deaths, i = 1, …, d.
- δ̂ estimates the log hazard ratio and var(δ̂) = 1/∑ Vi.
General EZ Principle and Applications
The logrank z-statistic is

  Z = δ̂/√(var(δ̂)) = {∑ (Oi − Ei)}/√(∑ Vi) = δ̂ √(∑ Vi),   (5)

where δ̂ estimates the log hazard ratio and sums run over i = 1, …, d. With 1-1 randomization, Vi ≈ (1/2)(1 − 1/2) = 1/4, so ∑ Vi ≈ d/4 and

  Z ≈ δ̂ √(d/4)

  E(Z) ≈ δ √(d/4),   (6)

where d = number of deaths and δ̂, δ are the estimated and true log hazard ratios.
General EZ Principle and Applications

- For power 1 − β, equate E(Z) to zα/2 + zβ and solve for the number of deaths (events) d:

  E(Z) = δ √(d/4) = zα/2 + zβ  (EZ Principle)

  d = 4(zα/2 + zβ)²/δ²,   (7)

  where δ = log hazard ratio (parameterized so that large hazard ratios show that treatment works).
- Continue the trial until this number of deaths.
General EZ Principle and Applications

- Example: return to the COVID-19 trial, but suppose you use the logrank test instead of the test of proportions.
- Want 85% power to detect a control-to-treatment hazard ratio of 1.333. Set

  Z = {∑ (Oi − Ei)}/√(∑ Vi) ≈ δ̂ √(d/4).

  E(Z) = δ √(d/4) = zα/2 + zβ = 1.96 + 1.04 = 3  (EZ Principle)

  d = 4(3)²/δ² = 4(3)²/{ln(1.333)}² ≈ 436 events.   (8)
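Schoenfeld's event-count formula is one line of code. A sketch (function name is mine):

```python
from math import log

def events_needed(hazard_ratio, z_sum=3.0):
    """Required events d = 4*(z_alpha/2 + z_beta)**2 / (log hazard ratio)**2,
    assuming 1-1 randomization (each V_i roughly 1/4). Round up in practice."""
    return 4 * z_sum**2 / log(hazard_ratio)**2

# 85% power (z_sum = 3) for a control-to-treatment hazard ratio of 1.333.
print(round(events_needed(1.333)))  # 436
```

Note that the driver of power here is the number of events d, not the number of randomized participants.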
General EZ Principle and Applications

- Suppose you get only 350 events (deaths).
- For power to detect a hazard ratio of 1.333,

  E(Z) = ln(1.333) √(350/4) = 2.689

  2.689 = zα/2 + zβ = 1.96 + zβ  (EZ Principle)

  Φ(2.689 − 1.96) = Φ(zβ) = 1 − β = power.   (9)

Power is approximately Φ(0.729) = pnorm(0.729) = 0.77.
General EZ Principle and Applications

- Suppose you get only 350 events (deaths).
- What hazard ratio can be detected with 85% power?

  E(Z) = ln(λ) √(350/4) = zα/2 + zβ = 1.96 + 1.04 = 3  (EZ Principle)

  λ = exp{3/√(350/4)} = 1.378.   (10)

- 85% power for a hazard ratio of 1.378.
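Both 350-event calculations (power for HR 1.333, and the HR detectable with 85% power) follow the same pattern; a sketch with Φ from `math.erf`:

```python
from math import erf, exp, log, sqrt

def phi(x):
    """Standard normal CDF."""
    return 0.5 * (1 + erf(x / sqrt(2)))

d = 350  # events actually obtained

# Power to detect a hazard ratio of 1.333: Phi(ln(HR)*sqrt(d/4) - z_alpha/2).
power = phi(log(1.333) * sqrt(d / 4) - 1.96)
print(round(power, 2))  # 0.77

# Hazard ratio detectable with 85% power: solve ln(HR)*sqrt(d/4) = 3.
hr = exp(3 / sqrt(d / 4))
print(round(hr, 3))  # 1.378
```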
General EZ Principle and Applications

- In noninferiority trials, we are not trying to prove the new treatment (N) is better than the standard treatment (S), but that it is almost as good.
- One application of NI testing: the standard treatment might be onerous (3 injections/day) and the new treatment easier to take (1 pill/day).
- Prefer the new treatment provided it is not worse than standard by more than some small amount known as the noninferiority (NI) margin.
- Let pN and pS be the probability of an event on new and standard treatment.
General EZ Principle and Applications

- NI trials often use 1-sided α = 0.05 and test a nonzero null.
- E.g., if willing to tolerate the new treatment being worse than standard by 0.10 (NI margin = 0.10), test:

  H0: pS − pN < −0.10 versus H1: pS − pN ≥ −0.10.

- Convert to a zero null by:

  H0: pS − pN + 0.10 < 0 versus H1: pS − pN + 0.10 ≥ 0.

- Suppose we want 90% probability of showing noninferiority if the truth is that pN = pS.
- Again use the EZ principle:
General EZ Principle and Applications

Z = (p̂S − p̂N + 0.10)/√[{p̂S(1 − p̂S) + p̂N(1 − p̂N)}/n]   (11)

If pS = pN = p,

  E(Z) ≈ 0.10/√(2p(1 − p)/n) = zα/2 + zβ  (EZ Principle)

  0.10/√(2p(1 − p)/n) = 1.645 + 1.282 = 2.927.

  n = 2p(1 − p)(2.927)²/(0.10)².

If p is expected to be 0.5, n = 429 per arm.
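The NI sample size under pN = pS is the same EZ calculation with a 1-sided α = 0.05 (z = 1.645) and 90% power (z = 1.282); a sketch with my own function name:

```python
from math import ceil

def n_per_arm_ni(margin, p, z_alpha=1.645, z_beta=1.282):
    """Per-arm n for the noninferiority test of proportions, assuming
    pN = pS = p: margin / sqrt(2*p*(1-p)/n) = z_alpha + z_beta."""
    return ceil(2 * p * (1 - p) * (z_alpha + z_beta)**2 / margin**2)

# NI margin 0.10, anticipated event probability 0.5.
print(n_per_arm_ni(0.10, 0.5))  # 429
```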
General EZ Principle and Applications

- One important implication of the EZ principle: inability to replicate results (see, e.g., Goodman (1992); Halsey et al. (2015)).
- Suppose Z = 1.96 and this represents the true treatment effect; i.e., E(Z) = 1.96.
- If we repeat the trial, power is, by the EZ principle,

  E(Z) = 1.96 + zβ
  1.96 = 1.96 + zβ
  0 = zβ
  Φ(0) = Φ(zβ) = 1 − β = power
  0.5 = power   (12)

Only a 50% chance of replicating the result!
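The 50% replication chance can also be seen by simulation: if the observed Z = 1.96 happens to equal the true E(Z), a repeat trial's statistic is roughly N(1.96, 1) and exceeds 1.96 about half the time. A quick Monte Carlo check:

```python
import random

# Simulate repeat trials whose z-statistic is N(1.96, 1) and count how often
# they again reach the 2-sided 0.05 threshold of 1.96.
random.seed(1)
reps = 200_000
hits = sum(random.gauss(1.96, 1) > 1.96 for _ in range(reps))
print(round(hits / reps, 2))  # about 0.5
```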
Sample Size: Practical Aspects

- Sample size depends on the treatment effect and nuisance parameters.
  - Nuisance for the t-test: σ.
  - Nuisance for the test of proportions: pC (or overall p).
- The treatment effect and nuisance parameter are very different.
  - We can specify the treatment effect either as the minimal relevant effect or the anticipated effect based on other studies.
  - We must estimate the nuisance parameter accurately for power calculations.
- Underestimating σ in a t-test, or overestimating pC in a test of proportions when trying to detect a given relative effect (e.g., 25%), will result in an underpowered trial.
Sample Size: Practical Aspects: Treatment Effect

- If treatment has many side effects or is difficult (e.g., several injections a day), then the treatment effect should be large to justify its use.
- If treatment has few side effects (e.g., a diet), even a small effect is worthwhile.
- Dietary Approaches to Stop Hypertension (DASH) trial
  - Compared 3 diets: (1) control, (2) fruits & vegetables, (3) combination of fruits and vegetables and lowfat dairy.
  - Primary endpoint: change in diastolic blood pressure from baseline.
  - Powered for a 2 mmHg difference because even a small effect has public health benefit with few expected side effects.
Sample Size: Practical Aspects: Treatment Effect

- In early phase trials, a type 2 error may be more serious than a type 1 error.
  - A type 2 error ends further testing: a good drug may be abandoned.
  - A type 1 error is not tragic because the definitive test is in phase 3.
- Therefore, want to ensure high power in early phase trials.
- Stack the deck in your favor by:
  - Picking a population especially expected to benefit (might use a run-in phase before randomization to weed out people who cannot tolerate the drug).
  - Using an intermediate outcome that treatment should affect.
Sample Size: Practical Aspects: Nuisance Parameters

- With the t-test, power depends on the standard deviation, σ.
- Err on the side of overestimating σ to avoid an underpowered trial.
- Use estimates based on similar trials, if possible.
- If σ is estimated from an observational study, increase it.
- When the standard deviation is of a change from baseline, use a trial with a similar or longer duration (the standard deviation of a change usually increases with duration).
Sample Size: Practical Aspects: Nuisance Parameters

- Useful formula for the variance of a change from baseline (BL) to end of study (EOS):

  var(YEOS − YBL) = 2σ²(1 − ρ),

  where σ² is the variance at a fixed time (BL or EOS) and ρ = cor(YBL, YEOS).
- E.g., if the variance at a single time is 65 and ρ = 0.80, use var(YEOS − YBL) = 2(65)(1 − 0.80) = 26.
- NOTE: If ρ < 0.5, then you should use YEOS, NOT YEOS − YBL. Even better, use the baseline value as a covariate (also called analysis of covariance, ANCOVA).
- Err on the side of underestimating ρ to avoid an underpowered trial.
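The change-score variance formula, and the ρ < 0.5 rule of thumb that falls out of it, in a short sketch (function name is mine):

```python
def var_change(var_single, rho):
    """var(Y_EOS - Y_BL) = 2*sigma**2*(1 - rho), with var_single = sigma**2
    the variance at a single time and rho = cor(Y_BL, Y_EOS)."""
    return 2 * var_single * (1 - rho)

# Slide example: single-time variance 65, rho = 0.80.
print(round(var_change(65, 0.80), 1))  # 26.0

# With rho < 0.5 the change is MORE variable than a single measurement,
# which is why the slide recommends Y_EOS (or ANCOVA) in that case.
print(var_change(65, 0.30) > 65)  # True
```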
Sample Size: Practical Aspects: Nuisance Parameters

- With a binary endpoint, the nuisance parameter is the control event probability, pC.
- If the treatment effect is a relative reduction (e.g., 25%, so RR = 0.75), err on the side of underestimating pC to avoid an underpowered trial.
- If the estimate of pC comes from an observational study, decrease it! Clinical trial participants tend to be more health conscious and have lower event rates (healthy volunteer effect).
Sample Size: Practical Aspects: Nuisance Parameters

- Sample size is often a negotiation between the principal investigator and the statistician.
- Statistician: "You will need 10,000 people."
- Options:
  - Increase the treatment effect. PI: "A larger effect is unrealistic."
  - Use a different primary endpoint. E.g., add stroke to the composite of coronary heart disease/death. Statistician: "That will work as long as treatment has a similar effect on the added component. Otherwise, you could decrease power."
Sample Size: Estimation

- In early phase, may do a 1-arm trial to get a reasonable estimate of effect. Set sample size to achieve a given accuracy.
- Example: How large does n need to be to estimate the proportion of successes on the new treatment to within 0.15?
- 95% confidence interval: p̂ ± 1.96 √(p(1 − p)/n).
- Set 1.96 √(p(1 − p)/n) = 0.15 and solve for n:
Sample Size: Estimation

n = (1.96)² p(1 − p)/(0.15)² = 341.4756 p(1 − p).   (13)

- If we expect p = 0.3, substitute 0.3 into Equation (13) to get n = 72.
- Could also use p = 0.5 as a worst-case scenario: substituting 0.5 into (13) gives n = 86.
Sample Size: Safety

- In safety studies, want to know the sample size needed to see at least one adverse event (AE) of a given probability.

  P(see at least one AE of probability p) = 1 − P(0 AEs of probability p) = 1 − (1 − p)^n.   (14)

- Example: in a study of 20 people, the probability of at least one AE of probability 0.10 is 1 − (1 − 0.10)^20 = 0.88.
- Confident we will see at least one if the true probability is 0.10.
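Equation (14) as a one-line helper (the name is mine), reproducing the slide's n = 20 example:

```python
def p_at_least_one_ae(p, n):
    """Probability of seeing at least one AE of per-person probability p
    among n participants: 1 - (1 - p)**n."""
    return 1 - (1 - p)**n

print(round(p_at_least_one_ae(0.10, 20), 2))  # 0.88
```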
Summary

- High power is essential for avoiding type 2 errors.
- EZ principle: you need only one formula for sample size/power/detectable effect for a 2-sided test at level α (or 1-sided at α/2). For power 1 − β, set

  E(Z) = zα/2 + zβ.

- For 2-sided α = 0.05, E(Z) must be
  - 3.24 for 90% power.
  - 3.00 for 85% power.
  - 2.80 for 80% power.
- Can apply to any statistic that is asymptotically N(θ, 1).
Summary

- Sample size depends on the treatment effect and nuisance parameters.
- Nuisance parameters are:
  - σ for a continuous outcome using the t-test.
  - pC or overall p for a binary outcome.
- For the t-test, err on the side of overestimating σ.
- For a binary outcome when trying to detect a given relative treatment effect (e.g., 25% reduction in event rate), err on the side of underestimating pC.
References I

Goodman, S. (1992). A comment on replication, p-values and evidence. Statistics in Medicine 11, 875–879.

Halsey, L. G., Curran-Everett, D., Vowler, S. L., and Drummond, G. B. (2015). The fickle p value generates irreproducible results. Nature Methods 12, 179–185.

Proschan, M. A. (2022). Statistical Thinking in Clinical Trials. Chapman and Hall/CRC.

Schoenfeld, D. (1981). The asymptotic properties of nonparametric tests for comparing survival distributions. Biometrika 68, 316–319.
