Principal Stratification Approach to Broken Randomized Experiments: A Case Study of School Choice Vouchers in New York City
Author(s): John Barnard, Constantine E. Frangakis, Jennifer L. Hill, and Donald B. Rubin
Source: Journal of the American Statistical Association, Vol. 98, No. 462 (June 2003), pp. 299-311
Published by: Taylor & Francis, Ltd. on behalf of the American Statistical Association
Stable URL: https://www.jstor.org/stable/30045238

John BARNARD, Constantine E. FRANGAKIS, Jennifer L. HILL, and Donald B. RUBIN

The precarious state of the educational system in the
inner cities of the United States, as well as its potential causes and solutions,
have been popular topics of debate in recent years. Part of the difficulty in
resolving this debate is the lack of solid empirical evidence regarding the true
impact of educational initiatives. The efficacy of so-called "school choice"
programs has been a particularly contentious issue. A current multimillion dollar
program, the School Choice Scholarship Foundation Program in New York, randomized
the distribution of vouchers in an attempt to shed some light on this issue. This
is an important time for school choice, because on June 27, 2002 the U.S. Supreme
Court upheld the constitutionality of a voucher program in Cleveland that provides
scholarships both to secular and religious private schools. Although this study
benefits immensely from a randomized design, it suffers from complications common
to such research with human subjects: noncompliance with assigned "treatments" and
missing data. Recent work has revealed threats to valid estimates of experimental
effects that exist in the presence of noncompliance and missing data, even when the
goal is to estimate simple intention-to-treat effects. Our goal was to create a
better solution when faced with both noncompliance and missing data. This article
presents a model that accommodates these complications that is based on the general
framework of "principal stratification" and thus relies on more plausible
assumptions than standard methodology. Our analyses revealed positive effects on
math scores for children who applied to the program from certain types of schools: those with average test scores below the citywide median. Among these children, the
effects are stronger for children who applied in the first grade and for African-
American children. KEY WORDS: Causal inference; Missing data; Noncompliance;
Pattern mixture models; Principal stratification; Rubin causal model; School
choice.

1. INTRODUCTION

There appears to be a crisis in America's public schools. "More than half of 4th and 8th graders fail to reach the most minimal standard on national tests in reading, math, and science, meaning that they probably have difficulty doing grade-level work" (Education Week 1998). The problem is worse in high-poverty urban areas. For instance, although only 43% of urban fourth-graders achieved a basic level of skill on a National Assessment of Educational Progress (NAEP) reading test, a meager 23% of those in high-poverty urban schools met this standard. One of the most complicated and contentious of educational reforms currently being proposed is school choice. Debates about the equity and potential efficacy of school choice have increased in intensity over the past few years.
Authors making a case for school choice include Cobb (1992), Brandl (1998), and Coulson (1999).

John Barnard is Senior Research Statistician, deCODE Genetics, Waltham, MA 02451 (E-mail: john.barnard@decode.is). Constantine E. Frangakis is Assistant Professor, Department of Biostatistics, Johns Hopkins University, Baltimore, MD 21205 (E-mail: cfrangak@jhsph.edu). Jennifer L. Hill is Assistant Professor, School of International and Public Affairs, Columbia University, New York, NY 10027 (E-mail: jh1030@columbia.edu). Donald B. Rubin is John L. Loeb Professor of Statistics and Chair, Department of Statistics, Harvard University, Cambridge, MA 02138 (E-mail: rubin@stat.harvard.edu). The authors thank the editor, an associate editor, and three reviewers for very helpful comments; David Myers and Paul E. Peterson as principal coinvestigators for this evaluation; and the School Choice Scholarships Foundation (SCSF) for cooperating fully with this evaluation. The work was supported in part by National Institutes of Health (NIH) grant R01 EY 014314-01; National Science Foundation grants SBR 9709359 and DMS 9705158; and by grants from the following foundations: Achelis Foundation, Bodman Foundation, Lynde and Harry Bradley Foundation, Donner Foundation, Milton and Rose D. Friedman Foundation, John M. Olin Foundation, David and Lucile Packard Foundation, Smith Richardson Foundation, and the Spencer Foundation. The authors also thank Kristin Kearns Jordan and other members of the SCSF staff for their cooperation and assistance with data collection, and Daniel Mayer and Julia Kim, from Mathematica Policy Research, for preparing the survey and test score data and answering questions about that data. The methodology, analyses of data, reported findings, and interpretations of findings are the sole responsibility of the authors and are not subject to the approval of SCSF or of any foundation providing support for this research.

A collection of essays that report mainly positive
school choice effects has been published by Peterson and Hassel (1998). Recent
critiques of school choice include those by the Carnegie Foundation for the
Advancement of Teaching (1992), Cookson (1994), Fuller and Elmore (1996), and Levin
(1998). In this article we evaluate a randomized experiment conducted in New York City made possible by the privately funded School Choice Scholarships Foundation (SCSF). The SCSF program provided the first opportunity to examine the question of the potential for improved school performance (as well as parental satisfaction and involvement, school mobility, and racial integration) in private schools versus public schools using a carefully designed and monitored randomized field experiment. Earlier studies were observational in nature and thus subject to selection bias (i.e., nonignorable treatment assignment). Studies finding positive educational benefits from attending private schools include those of Coleman, Hoffer, and Kilgore (1982), Chubb and Moe (1990), and Derek (1997). Critiques of these studies include those of Goldberger and Cain (1982) and Wilms (1985). On June 27, 2002, the U.S. Supreme Court upheld the constitutionality of a voucher program in Cleveland that provides scholarships both to secular and religious private schools. As occurs in most research involving human subjects, however, our study, although carefully implemented, suffered from complications due to missing background and outcome data and also to noncompliance with the randomly assigned treatment. We focus on describing and addressing these complications in our study using a Bayesian approach with the framework of principal stratification (Frangakis and Rubin 2002). We describe the study in Section 2 and summarize its data
complications in Section 3. In Section 4 we place the study in the context of broken randomized experiments, a phrase apparently first coined by Barnard, Du, Hill, and Rubin (1998). We discuss the framework that we use in Section 5. We present our model's structural assumptions in Section 6, and its parametric assumptions in Section 7. We give the main results of the analysis in Section 8 (with some supplementary results in the Appendix). We discuss model building and checking in Section 9, and conclude the article in Section 10.

Table 1. Sample Sizes in the SCSF Program

                                        Randomized block
Family size   Treatment     PMPD      2     3     4     5   Subtotal    Total
Single        Scholarship    353     72    65    82   104        323      676
              Control        353     72    65    82   104        323      676
Multi         Scholarship    147     44    27    75    31        177      324
              Control        147     27    23    54    33        137      284
Total                      1,000                                 960    1,960

2. THE SCHOOL CHOICE SCHOLARSHIPS FOUNDATION PROGRAM

In
February 1997 the SCSF announced that it would provide 1,300 private school scholarships to "eligible" low-income families. Eligibility required that the children, at the time of application, be living in New York City, entering grades 1-5, currently attending a public school, and from families with incomes low enough to qualify for free school lunch. That spring, SCSF received applications from more than 20,000 students. To participate in the lottery that would award the scholarships, a family had to attend a session during which (1) their eligibility was confirmed, (2) a questionnaire of background characteristics was administered to the parents/guardians, and (3) a pretest was administered to the eligible children. The final lottery, held in May 1997, was administered by Mathematica Policy Research (MPR), and the SCSF offered
winning families help in finding placements in private schools. Details of the
design have been described by Hill, Rubin, and Thomas (2000). The final sample
sizes of children are displayed in Table 1. PMPD refers to the randomized design developed for this study (the propensity matched pairs design). This design relies on propensity score matching (Rosenbaum and Rubin 1983), which was used to choose a control group for the families in the first application period, where there were more applicants who did not win the scholarship than could be followed. The "single" and "multi" classifications describe families that have one child or more than one child, respectively, participating in the program. For period 1 and single-child families, Table 2 (taken from Barnard, Frangakis, Hill, and Rubin 2002, table 1.6) compares the balance achieved on background variables with the PMPD and two other possible designs: a simple random sample (RS) and a stratified random sample (STRS) of the same size, from the pool of all potential matching subjects at period 1. For the STRS, the strata are the "applicant's school" (low/high), which indicates whether the applicant child originates from a school that had average test scores below (low) or above (high) the citywide median in the year of application.
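The propensity-matched pairs idea described above can be sketched in a few lines. Everything below is a hedged illustration rather than the study's actual procedure: the covariates, group sizes, and the simple Newton-fitted logistic propensity model are invented stand-ins, and the real PMPD used richer propensity-score methods (Rosenbaum and Rubin 1983).

```python
import numpy as np

rng = np.random.default_rng(0)

# Invented applicant pool: 200 scholarship winners and 600 candidate
# controls, with two background covariates (say, a pretest score and a
# family-income measure), deliberately imbalanced between the groups.
n_t, n_c = 200, 600
x_t = rng.normal([0.3, 0.1], 1.0, size=(n_t, 2))
x_c = rng.normal([0.0, 0.0], 1.0, size=(n_c, 2))

# Step 1: logistic-regression propensity model, fit by Newton-Raphson.
X = np.column_stack([np.ones(n_t + n_c), np.vstack([x_t, x_c])])
z = np.r_[np.ones(n_t), np.zeros(n_c)]        # treatment indicator
beta = np.zeros(X.shape[1])
for _ in range(25):
    p = 1.0 / (1.0 + np.exp(-X @ beta))
    w = p * (1.0 - p)
    beta += np.linalg.solve(X.T @ (w[:, None] * X), X.T @ (z - p))
e = 1.0 / (1.0 + np.exp(-X @ beta))           # estimated propensity scores
e_t, e_c = e[:n_t], e[n_t:]

# Step 2: greedy nearest-neighbor matching on the propensity score,
# without replacement: each treated unit gets the closest unused control.
used = np.zeros(n_c, dtype=bool)
match = {}
for i in np.argsort(-e_t):                    # hardest-to-match cases first
    dist = np.abs(e_c - e_t[i]) + np.where(used, np.inf, 0.0)
    j = int(np.argmin(dist))
    used[j] = True
    match[i] = j
ctrl = x_c[[match[i] for i in range(n_t)]]

# Step 3: balance check in the spirit of Table 2: a two-sample Z statistic
# per covariate between the matched arms.
def balance_z(a, b):
    return (a.mean() - b.mean()) / np.sqrt(a.var(ddof=1) / len(a)
                                           + b.var(ddof=1) / len(b))

for k in range(2):
    print(f"covariate {k}: matched balance Z = {balance_z(x_t[:, k], ctrl[:, k]):.2f}")
```

The core logic is: estimate each family's probability of being in the treated group from background variables, then pair every treated family with the unused control whose estimated propensity is closest; balance on the covariates entering the score then tends to follow, which is what the Z statistics in Table 2 are checking.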
Measures of comparison in Table 2 are Z statistics between the randomized arms. Overall, the PMPD produces better balance in 15 of the 21 variables compared with the RS design. The PMPD's balance was better in 11 variables and worse in 9 variables (1 tie) compared with the STRS, although the gains are generally larger in the former case than in the latter case. The table also demonstrates balance for the application periods 2-5, which were part of a standard randomized block design in which the blocks were each of the four periods, cross-classified by family size and by applicant's school. More generally, the entire experiment is a randomized design where the assignment probabilities are a function of the following design variables: period of application, applicant's school, family size (single child versus multichild), and the estimated propensity scores from the PMPD. (For additional information on the design, see Hill et al. 2000.) Next, we focus on the study's main data complications.

Table 2. Design Comparisons in Balance of Background Variables: Single-Child Families. The Numbers Are Z Statistics From Comparing Observed Values of Variables Between Assignments

                                      Application period 1                  Periods 2-5
                               Simple            Stratified                 Randomized
Variable                       random sample     random sample     PMPD     block
Applicant's school (low/high)    -.98                0              .11       .21
Grade level                     -1.63               -.03           -.39
Pretest read score               -.38               .65             .48     -1.05
Pretest math score               -.51              1.17             .20     -1.37
African-American                 1.80              1.68            1.59      1.74
Mother's education                .16               .14             .09      1.67
In special education              .31              1.66            -.17       .22
In gifted program                 .42             -1.16            -.13
English main language           -1.06               -.02          -1.03      -.44
AFDC                             -.28               .83           -1.57
Food stamps                     -1.08               -.27            .94     -1.31
Mother works                    -1.26               -.30          -1.18       .40
Educational expectations          .50              1.79             .57       .19
Children in household           -1.01             -1.75             .41     -1.02
Child born in U.S.              -1.40               -.69
Length of residence               .42               .71             .66      -.78
Male                              .88              1.22             .76       .53
Income                           -.38               -.62            .74     -1.21
Age as of 4/97                  -1.57               .18            -.47      -.87

3. DATA COMPLICATIONS

The data that we use include the
design variables, the background survey collected at the verification sessions, pretest data, and posttest data collected the following spring. The test scores used as outcomes in our analyses are grade-normed national percentile rankings of reading and math scores from the Iowa Test of Basic Skills (ITBS). The ITBS was used because it is not the test administered in the New York City public school system, which reduces the possibility of teachers in schools with participating children "teaching to the test."

Outcomes were also incomplete. Among the observed outcomes in single-child families, the average (standard deviation) was 23.5 (22.5) percentile rankings for mathematics and 28.1 (23.9) percentile rankings for reading, and unadjusted comparisons between randomized arms do not point to a notable difference. However, in contrast to missingness of background variables, missingness of outcomes that occurs after randomization is not guaranteed to be balanced between the randomized arms. For example, depending on application period and background strata, 18%-27% of the children did not provide posttest scores, and during design periods 2-5, the response rate is higher among scholarship winners (80%) than among the other children (73%). Analyses that would be limited to complete cases for these variables and the variables used to calculate the propensity score would discard more than half of the units. Moreover, standard adjustments for outcome missingness ignore its potential interaction with the other complications and generally make implicit and unrealistic assumptions.

Another complication was noncompliance; attendance at a private school was not perfectly correlated with winning a scholarship. For example, for single-child families, and depending on application period and background strata, 20%-27% of children who won scholarships did not use them (scholarships were in the amount of $1,400, which generally did not fully cover tuition at a private school), and 6%-10% of children who did not win scholarships were sent to private schools nevertheless.

Two additional complications limit our analysis sample. First, no pretest scores were obtained for applicants in kindergarten, because these children most likely had never been exposed to a standardized test; hence considerable time would have been spent instructing them on how to take a test, and there was concern that separating such young children from their guardians in this new environment might lead to behavioral issues. Hence we focus our analyses on the children who applied in first grade or above. Second, we do not yet have complete compliance data for the multichild families. Consequently, all analyses reported in this article are further limited to results for the 1,050 single-child families who were in grades 1-4 at the time of the spring 1997 application process.

4. THE STUDY AS A BROKEN RANDOMIZED EXPERIMENT

The foregoing deviations from the study's protocol clarify
that our experiment does not really randomize attendance, but rather randomizes the "encouragement," using financial support, to attend a private school rather than a public school. Moreover, as in most encouragement studies, interest here focuses not only on the effect of encouragement itself (which will depend on what percentage of people encouraged would actually participate if the voucher program were to be more broadly implemented), but also on the effect of the treatment being encouraged: here, attending private versus public schools. If there were perfect compliance, so that all those encouraged to attend private school actually did so and all those not so encouraged stayed in public school, then the effect being estimated typically would be attributed to private versus public school attendance, rather than simply to the encouragement.

Attempts to reduce missingness of data included requiring attendance at the initial verification sessions and providing financial incentives to attend the follow-up testing sessions. Despite these attempts, missingness of background variables did occur before randomization. In principle, such missingness is also a covariate and so does not directly create imbalance of subjects between randomized arms, although it does create loss in efficiency when, as in this study, background covariates are used in the analysis. For example, for single-child families, depending on application period and background strata, 34%-51% of the children's pretest scores were missing at the time of design planning. Since then, MPR has processed and provided an additional 7% and 14% of the reading and mathematics pretest scores. These scores were not as balanced between arms as when choosing the design (see Table 2), although the difference was not statistically significant at conventional levels. Hence we used all available pretest scores in the final analysis that is conditional on these scores.

We focus on defining and estimating two estimands: the intention-to-treat (ITT) effect of the randomized encouragement on all subjects; and the complier average causal effect (CACE), the effect of the randomized encouragement on all subjects who would comply with their treatment assignment no matter which assignment they would be given (here, the children who would have attended private school if they had won a scholarship and would not have attended had they not won a scholarship). These quantities are defined more formally in Section 8.

In recent years there has been substantial progress in the analysis of encouragement designs, based on building bridges between statistical and econometric approaches to causal inference. In particular, the widely accepted approach in statistics to formulating causal questions is in terms of "potential outcomes." Although this approach has roots dating back to Neyman and Fisher in the context of perfect randomized experiments (Neyman 1923; Rubin 1990), it is generally referred to as Rubin's causal model (Holland 1986) for work extending the framework to observational studies (Rubin 1974, 1977) and including modes of inference other than randomization-based, in particular, Bayesian (Rubin 1978a, 1990). In economics, the technique of instrumental variables, due to Tinbergen (1930) and Haavelmo (1943), has been a main tool of causal inference in the type of nonrandomized studies prevalent in that field. Angrist, Imbens, and Rubin (1996) showed how the approaches can be viewed as completely
compatible, thereby clarifying and strengthening each approach. The result was an interpretation of the instrumental variables technology as a way to approach a randomized experiment that suffers from noncompliance, such as a randomized encouragement design.

In encouragement designs with compliance as the only partially uncontrolled factor, and where there are full outcome data, Imbens and Rubin (1997) extended the Bayesian approach to causal inference of Rubin (1978a) to handle simple randomized experiments with noncompliance, and Hirano, Imbens, Rubin, and Zhou (2000) further extended the approach to handle fully observed covariates. In encouragement designs with more than one partially uncontrolled factor, as with noncompliance and missing outcomes in our study, defining and estimating treatment effects of interest becomes more challenging. Existing methods (e.g., Robins, Greenland, and Hu 1999) are designed for studies that differ from our study in the goals and the degree of control of the aforementioned factors. (For a detailed comparison of such frameworks, see Frangakis and Rubin 2002.) Frangakis and Rubin (1999) studied a more flexible framework for encouragement designs with both noncompliance to treatment and missing outcomes, and showed that for estimation of either the ITT effect or the CACE, one cannot in general obtain valid estimates using standard ITT analyses (i.e., analyses that ignore data on compliance behavior) or standard IV analyses (i.e., those that ignore the interaction between compliance behavior and outcome missingness). They also provided consistent moment-based estimators that can estimate both ITT and CACE under assumptions more plausible than those underlying more standard methods. Barnard et al. (1998) extended that template to allow for missing covariate and multivariate outcome values; they stopped short, however, of introducing specific methodology for this framework. We present a solution to a still more challenging situation in which we have a more complicated form of noncompliance: some children attend private school without receiving the monetary encouragement (thus receiving treatment without having been assigned to it). Under assumptions similar to those of Barnard et al. (1998), we next fully develop a Bayesian framework that yields valid estimates of quantities of interest and also properly accounts for our uncertainty about these quantities.

5. PRINCIPAL STRATIFICATION IN SCHOOL CHOICE AND ROLE FOR CAUSAL INFERENCE

To make explicit the assumptions necessary for valid
causal inference in this study, we first introduce "potential outcomes" (see Rubin 1979; Holland 1986) for all of the posttreatment variables. Potential outcomes for any given variable represent the observable manifestations of this variable under each possible treatment assignment. In particular, if child i in our study (i = 1, ..., n) is to be assigned to treatment z (1 for private school and 0 for public school), we denote the following: D_i(z) for the indicator equal to 1 if the child attends private school and 0 if the child attends public school; Y_i(z) for the potential outcomes if the child were to take the tests; and R_Yi(z) for the indicators equal to 1 if the child takes the tests. We denote by Z_i the indicator equal to 1 for the observed assignment to private school or not, and let D_i = D_i(Z_i), Y_i = Y_i(Z_i), and R_Yi = R_Yi(Z_i) designate the actual type of school, the outcome to be recorded by the test, and whether or not the child takes the test under the observed assignment. The notation for these and the remaining definitions of relevant variables in this study are summarized in Table 3. In our study the outcomes are reading and math test scores, so the dimension of Y_i(z) equals two, although more generally our template allows for repeated measurements over time.

Our design variables are application period, low/high indicators for the applicant's school, grade level, and propensity score. In addition, corresponding to each set of these individual-specific random variables is a variable (vector or matrix), without subscript i, that refers to the set of these variables across all study participants. For example, Z is the vector of treatment assignments for all study participants with ith element Z_i, and X is the matrix of partially observed background variables with ith row X_i.

The variable C_i (see Table 3), the joint vector of treatment receipt under both treatment assignments, is particularly important. Specifically, C_i defines four strata of people: compliers, who take treatment if assigned treatment and take control if assigned control; never takers, who never take the treatment no matter the assignment; always takers, who always take the treatment no matter the assignment; and defiers, who do the opposite of the assignment no matter its value.

Table 3. Notation for the ith Subject

Notation          Specifics                                          General description
Z_i               1 if i assigned treatment;                         Binary indicator of treatment assignment
                  0 if i assigned control
D_i(z)            1 if i receives treatment under assignment z;      Potential outcome formulation of treatment receipt
                  0 if i receives control under assignment z
D_i               D_i(Z_i)                                           Binary indicator of treatment receipt
C_i               c if D_i(0) = 0 and D_i(1) = 1;                    Compliance principal stratum: c = complier;
                  n if D_i(0) = 0 and D_i(1) = 0;                    n = never taker; a = always taker; d = defier
                  a if D_i(0) = 1 and D_i(1) = 1;
                  d if D_i(0) = 1 and D_i(1) = 0
Y_i(z)            (Y_i^(math)(z), Y_i^(read)(z))                     Potential outcomes for math and reading
Y_i               (Y_i^(math)(Z_i), Y_i^(read)(Z_i))                 Math and reading outcomes under observed assignment
R_Yi^(math)(z)    1 if Y_i^(math)(z) would be observed;              Response indicator for Y_i^(math)(z) under assignment z;
                  0 if Y_i^(math)(z) would not be observed           similarly for R_Yi^(read)(z)
R_Yi(z)           (R_Yi^(math)(z), R_Yi^(read)(z))                   Vector of response indicators for Y_i(z)
R_Yi              (R_Yi^(math)(Z_i), R_Yi^(read)(Z_i))               Vector of response indicators for Y_i
W_i               (W_i1, ..., W_iK)                                  Fully observed background and design variables
X_i               (X_i1, ..., X_iQ)                                  Partially observed background variables
R_Xi              (R_Xi,1, ..., R_Xi,Q)                              Vector of response indicators for X_i

These strata are not fully observed, in contrast to the observed strata of actual assignment and attendance (Z_i, D_i). For example, children who are observed to attend private school when winning the lottery are a mixture of compliers (C_i = c) and always takers (C_i = a). Such explicit stratification on C_i dates back at least to the work of Imbens and Rubin (1997) on randomized trials with noncompliance, was generalized and taxonomized by Frangakis and Rubin (2002) to posttreatment variables in partially controlled studies, and is termed a "principal stratification" based on the posttreatment variables.

The principal strata C_i have two important properties. First, they are not affected by the assigned treatment. Second, comparisons of potential outcomes under different assignments within principal strata, called principal effects, are well-defined causal effects (Frangakis and Rubin 2002). These properties make principal stratification a powerful framework for evaluation, because it allows us to explicitly define estimands that better represent the effect of attendance in the presence of noncompliance, and to explore richer and explicit sets of assumptions that allow estimation of these effects under conditions more plausible than the standard ones. Sections 6 and 7 discuss such a set of more flexible assumptions and estimands.

6. STRUCTURAL ASSUMPTIONS

First, we state explicitly our structural assumptions about the data with regard to the causal process, the missing-data mechanism, and the noncompliance structure. These assumptions are expressed without reference to a particular parametric distribution and are the ones that make the estimands of interest identifiable, as also discussed in Section 6.4.

6.1 Stable
Unit Treatment Value Assumption

A standard assumption made in causal analyses is the stable unit treatment value assumption (SUTVA), formalized with potential outcomes by Rubin (1978a, 1980, 1990). SUTVA combines the no-interference assumption (Cox 1958), that one unit's treatment assignment does not affect another unit's outcomes, with the assumption that there are "no versions of treatments." For the no-interference assumption to hold, whether or not one family won a scholarship should not affect another family's outcomes, such as their choice to attend private school or their children's test scores. We expect our results to be robust to the types and degree of deviations from no interference that might be anticipated in this study. To satisfy the "no versions of treatments" assumption, we need to limit the definition of private and public schools to those participating in the experiment. Generalizability of the results to other schools must be judged separately.

6.2 Randomization

We assume that scholarships have been randomly assigned. This implies that

p(Z | Y(1), Y(0), X, W, C, R_Y(0), R_Y(1), R_X, θ) = p(Z | W*, θ) = p(Z | W*),

where W* represents the design variables in W and θ is generic notation for the parameters governing the distribution of all the variables. There is no dependence on θ, because there are no unknown parameters governing the treatment-assignment mechanism. MPR assigned the scholarships by lottery, and the randomization probabilities within levels of applicant's school (low/high) and application period are known.

6.3 Noncompliance Process Assumptions:
Monotonicity and Compound Exclusion

We assume monotonicity, that there are no "defiers"; that is, for all i, D_i(1) ≥ D_i(0) (Imbens and Angrist 1994). In the SCSF program, defiers would be families who would not use a scholarship if they won one, but would pay to go to private school if they did not win a scholarship. It seems implausible that such a group of people exists; therefore, the monotonicity assumption appears to be reasonable for our study.

By definition, the never takers and always takers will participate in the same treatment (control or treatment) regardless of which treatment they are randomly assigned. For this reason, and to facilitate estimation, we assume compound exclusion: the outcomes and missingness of outcomes for never takers and always takers are not affected by treatment assignment. This compound exclusion restriction (Frangakis and Rubin 1999) generalizes the standard exclusion restriction (Angrist et al. 1996; Imbens and Rubin 1997) and can be expressed formally for the distributions as

p(Y(1), R_Y(1) | X^obs, R_X, W, C = n) = p(Y(0), R_Y(0) | X^obs, R_X, W, C = n)

for never takers, and

p(Y(1), R_Y(1) | X^obs, R_X, W, C = a) = p(Y(0), R_Y(0) | X^obs, R_X, W, C = a)

for always takers.

Compound exclusion seems more plausible for never takers than for always takers in this study. Never takers stayed in the public school system whether they won a scholarship or not. Although disappointment about winning a scholarship but still not being able to take advantage of it can exist, it is unlikely to cause notable differences in subsequent test scores or response behaviors. Always takers, on the other hand, might have been in one private school had they won a scholarship or in another if they had not, particularly because those who won scholarships had access to resources to help them find an appropriate private school and had more money (up to $1,400) to use toward tuition. In addition, even if the child had attended the same private school under either treatment assignment, the extra $1,400 in family resources for those who won the scholarship could have had an effect on student outcomes. However, because in our application the estimated percentage of always takers is so small (approximately 9%), and because this estimate is robust, due to the randomization, to relaxing compound exclusion, there is reason to believe that this assumption will not have a substantial impact on the results.

Under the compound exclusion restriction, the ITT comparison of all outcomes Y_i(0) versus Y_i(1) includes the null comparison among the subgroups of never takers and always takers. Moreover, by monotonicity, the compliers (C_i = c) are the only group of children who would attend private school if and only if offered the scholarship. For this reason, we take the CACE (Imbens and Rubin 1997), defined as the comparison of outcomes Y_i(0) versus Y_i(1) among the principal stratum of compliers, to represent the effect of attending public versus private school.

6.4
Latent lgnorability We assume that potential outcomes are independent of
miss.ingness given observed covariates conditional on the compli.ance strata, that
is, p(Ry (0), Ry (l) I Rx, Yt(l), Yt(0), Xobs , W, C, 0) Rx, Xobs= p(Ry (0), Ry (l)
I , W, C, 0). This assumption represents a form of latent ignorability (LI)
(Frangakis and Rubin 1999) in that it conditions on variables that are (at least partially) unobserved or latent, here the compliance principal strata C. We make it here first because it is more plausible than the assumption of standard ignorability (SI) (Rubin 1978a; Little and Rubin 1987), and second, because making it leads to different likelihood inferences. LI is more plausible than SI to the extent that it provides a closer approximation to the missing-data mechanism. The intuition behind this assumption in our study is that for a subgroup of people with the same covariate missing-data patterns R_X, similar values for covariates observed in that pattern X^obs, and the same compliance stratum C, a flip of a coin could determine which of these individuals shows up for a posttest. This is a more reasonable assumption than SI, because it seems quite likely that, for example, compliers would exhibit different attendance behavior for posttests than, say, never takers (even conditional on other background variables). Explorations of raw data from this study across individuals with known compliance status provide empirical support that C is a strong factor in outcome missingness, even when other covariates are included in the model. This fact is also supported by the literature for noncompliance (see, e.g., The Coronary Drug Project Research Group 1980).
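As a toy illustration of the distinction between LI and SI (all numbers below are hypothetical, not from the study), the following sketch simulates a missingness mechanism in which posttest response depends on the latent compliance stratum even after conditioning on an observed covariate, which is exactly the situation LI permits and SI rules out:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 200_000

# Hypothetical strata (0 = never taker, 1 = complier, 2 = always taker),
# with proportions loosely in the spirit of Table 7
C = rng.choice(3, size=n, p=[0.24, 0.67, 0.09])
x = rng.normal(size=n)  # an always-observed covariate

# Latent ignorability: response depends on x AND on the latent stratum C
intercepts = np.array([-0.3, 0.4, 1.0])  # hypothetical: never takers respond least
p_respond = 1.0 / (1.0 + np.exp(-(intercepts[C] + 0.2 * x)))
R = rng.random(n) < p_respond

# Empirical response rates differ by stratum, so a model assuming standard
# ignorability (missingness independent of C given x) would be misspecified
rates = [R[C == k].mean() for k in range(3)]
print([round(r, 2) for r in rates])
```

Under this mechanism, stratifying on x alone cannot reproduce the response behavior; the stratum indicator carries independent information about missingness.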
Regarding improved estimation, when LI (and the preceding structural assumptions) hold but the likelihood is constructed assuming SI, the underlying causal effects are identifiable (alternatively, the posterior distribution with increasing sample size converges to the truth) only if the additional assumption is made that within subclasses of subjects with similar observed variables, the partially missing compliance principal stratum C is not associated with potential outcomes. However, as noted earlier, this assumption is not plausible. Theoretically, only the structural assumptions described earlier are needed to identify the underlying ITT and CACE causal effects (Frangakis and Rubin 1999). Estimation based solely on those identifiability relations in principle requires very large sample size and explicit conditioning (i.e., stratification) on the subclasses defined by the observed part of the covariates, X^obs, and the pattern of missingness of the covariates, R_X, as well as implicit conditioning (i.e., deconvolution of mixtures) on the sometimes missing compliance principal stratum C. This works in large samples because covariate missingness and compliance principal stratum are also covariates (i.e., defined pretreatment), so samples within subclasses defined by X^obs, R_X, and C themselves represent randomized experiments. With our experiment's sample size, however, performing completely separate analyses on all of these strata is not necessarily desirable or feasible. Therefore, we consider more parsimonious modeling approaches, which have the role of assisting, not creating, inference in the sense that results should be robust to different parametric specifications (Frangakis and Rubin 2001, rejoinder).

7. PARAMETRIC PATTERN MIXTURE MODEL

Generally speaking, constrained estimation of separate analyses within missing-data patterns is the motivation behind pattern mixture modeling. Various authors have taken pattern mixture model approaches to missing data, including Little (1993, 1996), Rubin (1978b), and Glynn, Laird, and Rubin (1993). Typically, pattern mixture models partition the data with respect to the missingness of the variables. Here we partition the data with respect to the covariate missing-data patterns R_X, as well as compliance principal strata C, design variables W, and the main covariates X^obs. This represents a partial pattern mixture approach. One argument in favor of this approach is that it focuses the model on the quantities of interest in such a way that parametric specifications for the marginal distributions of R_X, W, and X^obs can be ignored. To
capitalize on the structural assumptions, consider the factorization of the joint distribution for the potential outcomes and compliance strata conditional on the covariates and their missing-data patterns:

p(Y_i(0), Y_i(1), R_{Y,i}(0), R_{Y,i}(1), C_i | W_i, X_i^obs, R_{X,i}, θ)
  = p(C_i | W_i, X_i^obs, R_{X,i}, θ^(C))
  × p(R_{Y,i}(0), R_{Y,i}(1) | W_i, X_i^obs, R_{X,i}, C_i, θ^(R))
  × p(Y_i(0), Y_i(1) | W_i, X_i^obs, R_{X,i}, C_i, θ^(Y)),

where the product in the last line follows by latent ignorability and θ = (θ^(C), θ^(R), θ^(Y)). Note that the response pattern of covariates for each individual is itself a covariate. The parametric specifications for each of these components are described next.

7.1 Compliance Principal Stratum Submodel

The compliance status
model contains two conditional probit models, defined using indicator variables C_i(c) and C_i(n) for whether individual i is a complier or a never taker:

C_i(n) = 1 if C_i(n)* = g_1(W_i, X_i^obs, R_{X,i})' β^(C,1) + V_i ≤ 0

and

C_i(c) = 1 if C_i(n)* > 0 and C_i(c)* = g_0(W_i, X_i^obs, R_{X,i})' β^(C,2) + U_i ≤ 0,

where V_i ~ N(0, 1) and U_i ~ N(0, 1) independently. The link functions, g_0 and g_1, attempt to strike a balance between, on the one hand, including all of the design variables as well as the variables regarded as most important either in predicting compliance or in having interactions with the treatment effect and, on the other hand, maintaining parsimony. The results discussed in Section 8 use a compliance component model whose link function, g_1, is linear in, and fits distinct parameters for, an intercept; applicant's school (low/high); indicators for application period; propensity scores for subjects applying in the PMPD period and propensity scores for the other periods; indicators for grade of the student; ethnicity (1 if the child or guardian identifies herself as African-American, 0 otherwise); an indicator for whether or not the pretest scores of reading and math were available; and the pretest scores (reading and math) for the subjects with available scores. A propensity score for the students not in the PMPD period is not necessary from the design standpoint, and is in fact constant. However, to increase efficiency, and to reduce bias due to missing outcomes, here we include an "estimated propensity score value" for these periods, calculated as the function derived for the propensity score for students in the PMPD period and evaluated at the covariate values for the students in the other periods as well. Also, the foregoing link function g_0 is the same as g_1 except that it excludes the indicators for application period as well as the propensity scores for applicants who did not apply in the PMPD period (i.e., those for whom propensity score was not a design variable). This link function, a more parsimonious version of one we used in earlier models, was more appropriate to fit the relatively small proportion of always takers in this study.
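The two nested probits can be sketched as follows (the design matrix and coefficients below are made up for illustration, not the fitted values; for simplicity we take g_0 = g_1, and the intercepts are chosen so the implied strata proportions land roughly near Table 7's):

```python
import numpy as np

rng = np.random.default_rng(1)
n = 100_000

# Stand-in for the link functions g1 and g0 evaluated at (W_i, X_i^obs, R_X,i):
# an intercept plus two generic covariates
G = np.column_stack([np.ones(n), rng.normal(size=(n, 2))])
beta_c1 = np.array([0.75, 0.30, -0.20])   # illustrative, not the paper's estimates
beta_c2 = np.array([-1.20, 0.10, 0.20])

# First probit: never taker if the latent index is <= 0
latent_n = G @ beta_c1 + rng.standard_normal(n)
never = latent_n <= 0

# Second probit, applied only when the first index is > 0:
# complier if its latent index is <= 0, otherwise always taker
latent_c = G @ beta_c2 + rng.standard_normal(n)
complier = ~never & (latent_c <= 0)
always = ~never & ~complier

props = {s: m.mean() for s, m in
         [("never", never), ("complier", complier), ("always", always)]}
print(props)
```

The sequential structure guarantees that every subject falls in exactly one stratum, mirroring the monotonicity assumption (no defiers).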
Finally, because the pretests were either jointly observed or jointly missing, one indicator for missingness of pretest scores is sufficient.

The prior distributions for the compliance submodel are β^(C,1) ~ N(β_0^(C,1), {σ^(C,1)}² I) and β^(C,2) ~ N(0, {σ^(C,2)}² I) independently, where σ^(C,1) and σ^(C,2) are hyperparameters set at 10, and β_0^(C,1) is a vector of 0s with the exception of the first element, which is set equal to -Φ^{-1}(1/3) {(σ^(C,1))² n^{-1} Σ_i g_{1,i}' g_{1,i} + 1}^{1/2}, where g_{1,i} = g_1(W_i, X_i^obs, R_{X,i}) and n is our sample size. These priors reflect our a priori ignorance about the probability that any individual belongs to each compliance status by approximately setting each of their prior probabilities at 1/3.

7.2 Outcome Submodel

We first describe the marginal distribution for the math outcome Y^(math). To address the pile-up of many scores of 0, we posit the censored model

Y_i^(math)(z) = 0 if Y_i^(math),*(z) ≤ 0, and Y_i^(math)(z) = Y_i^(math),*(z) otherwise,

where

Y_i^(math),*(z) | W_i, X_i^obs, R_{X,i}, C_i ~ N(g_2(W_i, X_i^obs, R_{X,i}, C_i, z)' β^(math), exp{g_3(W_i, X_i^obs, R_{X,i}, C_i, z)' ς^(math)})

for z = 0, 1 and θ^(math) = (β^(math), ς^(math)). Here Y_i^(math),*(0) and Y_i^(math),*(1) are assumed conditionally independent, an assumption that has no effect on inference for our superpopulation parameters of interest (Rubin 1978a).

The results reported in Section 8 use an outcome component model whose outcome mean link function, g_2, is linear in distinct parameters for the following:

1. For the students of the PMPD design: an intercept, applicant's school (low/high), ethnicity, indicators for grade, an indicator for whether or not the pretest scores were available, pretest score values for the subjects with available scores, and propensity score
2. For the students of the other periods: the variables in item 1, along with indicators for application period
3. An indicator for whether or not a person is an always taker
4. An indicator for whether or not a person is a complier
5. For compliers assigned treatment: an intercept, applicant's school (low/high), ethnicity, and indicators for the first three grades (with the variable for the fourth-grade treatment effect as a function of the already-included variables)

For the variance of the outcome component, the link function g_3 includes indicators that saturate the cross-classification of whether or not a person applied in the PMPD period and whether or not the pretest scores were available. This dependence is needed because each pattern conditions on a different set of covariates; that is, X^obs varies from pattern to pattern.

The prior distributions for the outcome submodel are β^(math) ~ N(0, {σ^(math)}² I), where σ^(math) is a hyperparameter set at 10, and exp(ς_k^(math)) ~ inv-χ²(ν, a²) independently for each of the K (in our case, K = 4) subgroups defined by cross-classifying the PMPD/non-PMPD classification and the missing-data indicator for pretest scores, where inv-χ²(ν, a²) refers to the distribution of the inverse of a chi-squared random variable with degrees of freedom ν (set at 3) and scale parameter a² (set at 400 based on preliminary estimates of variances). The sets of values for these and the hyperparameters of the other model components were chosen to satisfy the following criteria: giving posterior standard deviations for the estimands in the neighborhood of the respective standard errors of approximate maximum likelihood estimate fits of similar preliminary likelihood models, giving satisfactory model checks (Sec. 9.2), and producing quick enough mixing of the Bayesian analysis algorithm.

We specify the marginal distribution for the reading outcome Y^(read) in the same way as for the math outcome, with separate mean and variance regression parameters β^(read) and ς^(read). Finally, we allow that, conditionally on W_i, X_i^obs, R_{X,i}, and C_i, the math and reading outcomes at a given assignment, Y_i^(math),*(z) and Y_i^(read),*(z), may be dependent with an unknown correlation, ρ. We set the prior distribution for ρ to be uniform on (-1, 1), independently of the remaining parameters in their prior distributions.

7.3 Outcome Response Submodel

As with the pretests, the outcomes on mathematics and reading were either jointly observed or jointly missing; thus one indicator R_{Y,i}(z) for missingness of outcomes is sufficient for each assignment z = 0, 1. For the submodel on this indicator, we also use a probit specification,

R_{Y,i}(z) = 1 if R_{Y,i}(z)* = g_2(W_i, X_i^obs, R_{X,i}, C_i, z)' β^(R) + ε_i(z) ≥ 0,

where R_{Y,i}(0) and R_{Y,i}(1) are assumed conditionally independent (using the same justification as for the potential outcomes) and where ε_i(z) ~ N(0, 1). The link function of the probit model on the outcome response, g_2, is the same as the link function for the mean of the outcome component. The prior distribution for the outcome response submodel is β^(R) ~ N(0, {σ^(R)}² I), where {σ^(R)}² is a hyperparameter set at 10.

8. RESULTS

All results
herein were obtained from the same Bayesian analysis. We report results for the ITT and CACE estimands, for proportions of compliance principal strata, and for outcome response rates. The reported estimands are not parameters of the model, but rather are functions of parameters and data. The results are reported by applicant's school (low/high) and grade. Both of these variables represent characteristics of children that potentially could be targeted differentially by government policies. Moreover, each was thought to have possible interaction effects with treatment assignment. Except when otherwise stated, the plain numbers are posterior means, and the numbers in parentheses are 2.5 and 97.5 percentiles of the posterior distribution.

8.1 Test Score Results

Here we address two questions:

1. What is the impact of being offered a scholarship on student outcomes, namely, the ITT estimand?
2. What is the impact of attending a private school on student outcomes, namely, the CACE estimand?

The math and reading posttest score outcomes
represent the national percentile rankings within grade. They have been adjusted to correct for the fact that some children were kept behind while others skipped a grade, because students transferring to private schools are hypothesized to be more likely to have been kept behind by those schools. The individual-level causal estimates have also been weighted so that the subgroup causal estimates correspond to the effects for all eligible children belonging to that subgroup who attended a screening session.

8.1.1 Effect of Offering the Scholarship on Mathematics and Reading. We examine the impact of being offered a scholarship (the ITT effect) on posttest scores. The corresponding estimand for individual i is defined as E(Y_i(1) - Y_i(0) | W_i^P, θ), where W_i^P denotes the policy variables grade level and applicant's school (low/high) for that individual. The simulated posterior distribution of the ITT effect is summarized in Table 4. To draw from this distribution, we take the following steps:

Table 4. ITT Effect of Winning the Lottery on Math and Reading Test Scores 1 Year Postrandomization

                       Applicant's school: Low           Applicant's school: High
Grade at application   Reading          Math             Reading          Math
1                      2.3 (-1.3, 5.8)  5.2 (2.0, 8.3)   1.4 (-4.8, 7.2)  5.1 (.1, 10.3)
2                      .5 (-2.6, 3.5)   1.3 (-1.7, 4.3)  -.6 (-6.2, 4.9)  1.0 (-4.3, 6.1)
3                      .7 (-2.7, 4.0)   3.3 (-.5, 7.0)   -.5 (-6.0, 5.0)  2.5 (-3.2, 8.0)
4                      3.0 (-1.1, 7.4)  3.1 (-1.2, 7.2)  1.8 (-4.1, 7.6)  2.3 (-3.3, 7.8)
Overall                1.5 (-.6, 3.6)   3.2 (1.0, 5.4)   .4 (-4.6, 5.2)   2.8 (-1.8, 7.2)

NOTE: Plain numbers are means, and numbers in parentheses are central 95% intervals of the posterior distribution of the effects on
percentile rank.

1. Draw θ and {C_i} from the posterior distribution (see App. A).
2. Calculate the expected causal effect E(Y_i(1) - Y_i(0) | W_i, X_i^obs, C_i, R_{X,i}, θ) for each subject based on the model in Section 7.2.
3. Average the values of step 2 over the subjects in the subclasses of W^P with weights reflecting the known sampling weights of the study population for the target population.

These results indicate posterior distributions with mass primarily (>97.5%) to the right of 0 for the treatment effect on math scores overall for children from low applicant schools, and also for the subgroup of first graders. Each effect indicates an average gain of greater than three percentile points for children who won a scholarship. All of the remaining intervals cover 0. As a more general pattern, estimates of effects are larger for mathematics than for reading and larger for children from low-applicant schools than for children from high-applicant schools.
These results for the ITT estimand using the method of this article can also be contrasted with the results using a simpler method, reported in Table 5. The method used to obtain these results is the same as that used in the initial MPR study (Peterson, Myers, Howell, and Mayer 1999) but limited to the subset of single-child families and separated out by applicant's school (low/high). This method is based on a linear regression of the posttest scores on the indicator of treatment assignment, ignoring compliance data. In addition, the analysis includes the design variables and the pretest scores and is limited to individuals for whom the pretest and posttest scores are known. Separate analyses are run for math and reading posttest scores, and the missingness of such scores is adjusted by inverse probability weighting. Finally, weights are used to make the results for the study participants generalizable to the population of all eligible single-child families who were screened. (For more details of this method, see the appendix of Peterson
et al. 1999.)

Table 5. ITT Effect of Winning the Lottery on Test Scores, Estimated Using the Original MPR Method

                       Applicant's school: Low            Applicant's school: High
Grade at application   Reading           Math             Reading            Math
1                      -1.0 (-7.1, 5.2)  3.2 (-1.7, 8.1)  4.8 (-10.0, 19.6)  2.6 (-15.5, 20.7)
2                      2.1 (-2.6, 6.7)   5.0 (-.8, 10.7)  -3.4 (-16.5, 9.7)  2.7 (-10.3, 15.7)
3                      -.8 (-4.9, 3.2)   2.7 (-3.5, 8.8)  -8.0 (-25.4, 9.3)  4.0 (-17.7, 25.6)
4                      2.0 (-4.0, 8.0)   .3 (-7.3, 7.9)   27.9 (8.0, 47.8)   22.7 (-1.5, 46.8)

NOTE: Plain numbers are point estimates, and parentheses are 95% confidence intervals for the mean effects on percentile rank.

Table 6. Effect of Private School Attendance on Test Scores

                       Applicant's school: Low            Applicant's school: High
Grade at application   Reading           Math             Reading           Math
1                      3.4 (-2.0, 8.7)   7.7 (3.0, 12.4)  1.9 (-7.3, 10.3)  7.4 (.2, 14.6)
2                      .7 (-3.7, 5.0)    1.9 (-2.4, 6.2)  -.9 (-9.4, 7.3)   1.5 (-6.2, 9.3)
3                      1.0 (-4.1, 6.1)   5.0 (-.8, 10.7)  -.8 (-9.5, 7.7)   4.0 (-4.9, 12.5)
4                      4.2 (-1.5, 10.1)  4.3 (-1.6, 10.1) 2.7 (-6.3, 11.3)  3.5 (-4.7, 11.9)
Overall                2.2 (-.9, 5.3)    4.7 (1.4, 7.9)   .6 (-7.1, 7.7)    4.2 (-2.6, 10.9)

NOTE: Plain numbers are means, and parentheses are central 95% intervals of the posterior distribution of the effects on percentile rank.

The results of the new method (Table 4) are generally more
stable than those of the original method (Table 5), which in some cases are not even credible. In the most extreme cases, the original method estimates a 22.7-point gain [95% confidence interval, (-1.5, 46.8)] in mathematics and a 27.9-point gain [95% confidence interval, (8.0, 47.8)] in reading for fourth-grade children from high-applicant schools. More generally, the new method's results display a more plausible pattern in comparing effects in high-applicant versus low-applicant schools and mathematics versus reading.

8.1.2 Effect of Private School Attendance on Mathematics and Reading. We also examine the effect of offering the scholarship when focusing only on the compliers (the CACE). The corresponding estimand for individual i is defined as E(Y_i(1) - Y_i(0) | W_i^P, C_i = c, θ). This analysis defines the treatment as private school attendance (Sec. 6.3). The simulated posterior distribution of the CACE is summarized in Table 6. A draw from this distribution is obtained using steps 1-3 described in Section 8.1.1 for the ITT estimand, with the exception that at step 3 the averaging is restricted to the subjects whose current draw of C_i is "complier." The effects of private school attendance follow a pattern similar to that of the ITT effects, but the posterior means are slightly bigger in absolute value than ITT. The intervals have also grown, reflecting that these effects are for only subgroups of all children, the "compliers," in each cell. As a result, the associated uncertainty for some of these effects (e.g., for fourth-graders applying from high-applicant schools) is large.

8.2 Compliance Principal Strata and Missing Outcomes

Table 7 summarizes the
posterior distribution of the estimands of the probability of being in stratum C as a function of an applicant's school and grade, p(C_i = t | W_i^P, θ). To draw from this distribution, we use step 1 described in Section 8.1.1 and then calculate p(C_i = t | W_i^P, X_i^obs, R_{X,i}, θ) for each subject based on the model of Section 7.1 and average these values as in step 3 of Section 8.1.1. The clearest pattern revealed by Table 7 is that for three out of four grades, children who applied from low-applicant schools are more likely to be compliers or always takers than are children who applied from high-applicant schools.

As stated earlier, theoretically under our structural assumptions, standard ITT analyses or standard IV analyses that use ad hoc approaches to missing data are generally not appropriate for estimating the ITT or CACE estimands when the compliance principal strata have differential response (i.e., outcome missing-data) behaviors.
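Such differential response can be summarized through response odds ratios across strata; a sketch with hypothetical response probabilities (not the study's posterior draws) in one subclass:

```python
import numpy as np

rng = np.random.default_rng(3)

# Hypothetical posterior draws of p(R_i(z) = 1 | C_i = t, W^P, theta) for one
# subclass; columns: never taker, complier at z = 0 (public), complier at
# z = 1 (private), always taker
p = rng.normal(loc=[0.60, 0.66, 0.72, 0.78], scale=0.02, size=(1000, 4))
p = np.clip(p, 0.01, 0.99)

odds = p / (1 - p)
or_public_vs_never = odds[:, 1] / odds[:, 0]    # compliers (public) vs never takers
or_private_vs_always = odds[:, 2] / odds[:, 3]  # compliers (private) vs always takers
or_private_vs_public = odds[:, 2] / odds[:, 1]  # compliers private vs public

# With these hypothetical probabilities, response increases across the order
# never taker < complier (public) < complier (private) < always taker
print(or_public_vs_never.mean().round(2),
      or_private_vs_always.mean().round(2),
      or_private_vs_public.mean().round(2))
```

Each posterior draw of the response probabilities yields one draw of each odds ratio, so the whole posterior of these comparisons comes along for free.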
To evaluate this here, we simulated the posterior distributions of

p(R_i(z) | C_i = t, W^P, θ), z = 0, 1. (1)

To draw from the distributions of the estimands in (1), we used step 1 from Section 8.1.1 and then, for z = 0, 1, for each subject calculated p(R_i(z) | W_i^P, X_i^obs, R_{X,i}, C_i, θ). We then averaged these values over subjects within subclasses defined by the different combinations of W_i^P and C_i. For each draw (across individuals) from the distribution in (1), we calculate the odds ratios (a) between compliers attending public schools and never takers, (b) between compliers attending private schools and always takers, and (c) between compliers attending private schools and compliers attending public schools. The results (omitted) showed that response was increasing in the following order: never takers, compliers attending public schools, compliers attending private schools, and always takers. These results confirm that the compliance principal stratum is an important factor in response both alone and in interaction with assignment.

9. MODEL BUILDING AND CHECKING

This model was built
through a process of fitting and checking a succession of models. Of note in this model are the following features: a censored normal model that accommodates the pile-up of 0s in the outcomes, incorporation of heteroscedasticity, and a multivariate model to accommodate both outcomes simultaneously. It is reassuring from the perspective of model robustness that the results from the last few models fit, including those in the conference proceedings report of Barnard et al. (2002), are consistent with the results from this final model.

Table 7. Proportions of Compliance Principal Strata Across Grade and Applicant's School

                       Applicant's school: Low                   Applicant's school: High
Grade at application   Never taker  Complier    Always taker     Never taker  Complier    Always taker
1                      24.5 (2.9)   67.1 (3.8)  8.4 (2.4)        25.0 (5.0)   69.3 (6.1)  5.7 (3.3)
2                      20.5 (2.7)   69.4 (3.7)  10.1 (2.5)       25.3 (5.1)   67.2 (6.2)  7.5 (3.4)
3                      24.5 (3.2)   65.9 (4.0)  9.6 (2.5)        28.8 (5.6)   64.1 (6.7)  7.1 (3.5)
4                      18.4 (3.3)   72.8 (4.6)  8.8 (3.0)        27.0 (5.5)   66.7 (6.7)  6.3 (3.6)

NOTE: Plain numbers are means, and parentheses are standard deviations of the posterior distribution of the estimands.

9.1 Convergence Checks

Because posterior
distributions were simulated from a Markov chain Monte Carlo algorithm (App. A), it is important to assess its convergence. To do this, we ran three chains from different starting values. To initiate each chain, we set any unknown compliance stratum equal to a Bernoulli draw, with probabilities obtained from moment estimates of the probabilities of being a complier given observed attendance and randomization data (D, Z) alone. Using the initialized compliance strata for each chain, parameters were initialized to values based on generalized linear model estimates of the model components. Each chain was run for 15,000 iterations. At 5,000 iterations, and based on the three chains for each model, we calculated the potential scale-reduction statistic (Gelman and Rubin 1992) for the 250 estimands (parameters and functions of parameters) that serve as building blocks for all other estimands. The results suggested good mixing of the chains (with the maximum potential scale-reduction statistic across parameters 1.04) and provided no evidence against convergence. Inference is based on the remaining 30,000 iterations, combining the three chains.

9.2 Model Checks

We evaluate the influence
of the model presented in Section 7 with six posterior predictive checks, three checks for each of the two outcomes. A posterior predictive check generally involves (a) choosing a discrepancy measure, that is, a function of observed data and possibly of missing data and the parameter vector θ; and (b) computing a posterior predictive p value (PPPV), which is the probability over the posterior predictive distribution of the missing data and θ that the discrepancy measure in a new study drawn with the same θ as in our study would be as or more extreme than in our study (Rubin 1984; Meng 1996; Gelman, Meng, and Stern 1996).

Posterior predictive checks in general, and PPPVs in particular, demonstrate whether the model can adequately preserve features of the data reflected in the discrepancy measure, where the model here includes the prior distribution as well as the likelihood (Meng 1996). As a result, properties of PPPVs are not exactly the same as properties of classical p values under frequency evaluations conditional on the unknown θ, just as they are not exactly the same for frequency evaluations over both levels of uncertainty, that is, the drawing of θ from the prior distribution and the drawing of data from the likelihood given θ. For example, over frequency evaluations of the latter type, a PPPV is stochastically less variable than but has the same mean as the uniform distribution and so tends to be more conservative than a classical p value, although the reverse can be true over frequency evaluations of the first type (Meng 1996). (For more details on the interpretation and properties of PPPVs, see also Rubin 1984 and Gelman et al. 1996.)

Table 8. Posterior Predictive Checks: p Values

        Signal   Noise   Signal to Noise
Math    .34      .79     .32
Read    .40      .88     .39

The posterior predictive discrepancy measures that we choose
here are functions of

A^rep_{p,z} = {Y_i^rep,p : I(R_{Y,i}^rep,p = 1) I(Y_i^rep,p ≠ 0) I(C_i^rep = c) I(Z_i = z) = 1}

for the measures that are functions of data Y_i^rep,p, R_{Y,i}^rep,p, and C_i^rep from a replicated study, and

A^study_{p,z} = {Y_i^p : I(R_{Y,i}^p = 1) I(Y_i^p ≠ 0) I(C_i = c) I(Z_i = z) = 1}

for the measures that are functions of this study's data. Here R_{Y,i}^rep,p is defined so that it equals 1 if Y_i^rep,p is observed and 0 otherwise, and p equals 1 for math outcomes and 2 for reading outcomes. The discrepancy measures, "rep" and "study," that we used for each outcome (p = 1, 2) were (a) the absolute value of the difference between the mean of A_{p,1} and the mean of A_{p,0} ("signal"); (b) the standard error based on a simple two-sample comparison for this difference ("noise"); and (c) the ratio of (a) to (b) ("signal to noise"). Although these measures are not treatment effects, we chose them here to assess whether the model can preserve broad features of signal, noise, and signal-to-noise ratio in the continuous part of the compliers' outcome distributions, which we think can be very influential in estimating the treatment effects of Section 8. More preferable measures might have been the posterior mean and standard deviation for the actual estimands in Section 8 for each replicated dataset, but this would have required a prohibitive amount of computer memory because of the nested structure of that algorithm. In settings such as these, additional future work on choices of discrepancy measures is of interest. PPPVs
for the discrepancy measures that we chose were calculated as the percentage of draws in which the replicated discrepancy measures exceeded the value of the study's discrepancy measure. Extreme values (close to 0 or 1) of a PPPV would indicate a failure of the prior distribution and likelihood to replicate the corresponding measure of location, dispersion, or their relative magnitude and would indicate an undesirable influence of the model in estimation of our estimands. Results from these checks, displayed in Table 8, provide no special evidence for such influences of the model.

10. DISCUSSION

In this article we have
defined the framework for principal stratification in broken randomized experiments to accommodate noncompliance, missing covariate information, missing outcome data, and multivariate outcomes. We make explicit a set of structural assumptions that can identify the causal effects of interest, and we also provide a parametric model that is appropriate for practical implementation of the framework in settings such as ours.

Results from our model in the school choice study do not indicate strong treatment effects for most of the subgroups examined. But we do estimate positive effects (on the order of 3 percentile points for ITT and 5 percentile points for the effect of attendance) on math scores overall for children who applied to the program from low-applicant schools, particularly for first graders. Also, the effects were larger for African-American children than for the remaining children (App. B). Posterior distributions for the CACE estimand, which measures the effect of attendance in private school versus public school, are generally larger than the corresponding ITT effects but are also associated with greater uncertainty. Of importance, because of the missing outcomes, a model like ours is needed for valid estimation even of the ITT effect.

The results from this randomized study are not subject to selection bias in the way that nonrandomized studies in school choice have been. Nevertheless, although we use the CACE, a well-defined causal effect, to represent the effect of attendance of private versus public schools, it is important to remember that the CACE is defined on a subset of the study children (those who would have complied with either assignment) and that for the other children there is no information on such an effect of attendance in this study. Therefore, as with any randomized trial based on a subpopulation, external information, such as background variables, also must be used when generalizing the CACE from compliers to other target groups of children. Also, it is possible that a broader implementation of a voucher program can have a collective effect on the public schools if a large shift of the children who might use vouchers were to have an impact on the quality of learning for children who would stay in public schools (a violation of the no-interference assumption). Because our study contains a small fraction of participants relative to all school children, it cannot provide direct information about any such collective effect, and additional external judgment would need to be used to address this issue. The
larger effects that we estimated for children applying from schools with low versus high past scores are also not, in principle, subject to the usual regression-to-the-mean bias, in contrast to a simple before-after comparison. This is because in our study both types of children are randomized to be offered the scholarship or not, and both types in both treatment arms are evaluated at the same time after randomization. Instead, the differential effect for children from schools with different past scores is evidence supporting the claim that the school voucher program has greater potential benefit for children in lower-scoring schools.

Our results also reveal differences in compliance and missing-data patterns across groups. Differences in compliance indicate that children applying from low-applicant schools are generally more likely to comply with their treatment assignment; this could provide incentives for policy makers to target this subgroup of children. However, this group also exhibits higher levels of always takers, indicating that parents of children attending these poorly performing schools are more likely to get their child into a private school even in the absence of scholarship availability. These considerations would of course have to be balanced with (and indeed might be dwarfed by) concerns about equity. Missing-data patterns reveal that perhaps relatively greater effort is needed in future interventions of this nature to retain study participants who stay in public schools, particularly the public-school students who actually won but did not use a scholarship. The approach presented here also has some limitations. First, we have
presented principal stratification only on a binary con.trolled factor Z and a
binary uncontrolled factor D, and it is of interest to develop principal
stratification for more levels of such factors. Approaches that extend our
framework in that di.rection, including for time-dependent data, have been proposed
(e.g., Frangakis et al. 2002b). Also, the approach of explicitly conditioning on
the patterns of missing covariates that we adopted in the parametric com.ponent of
Section 7 is not as applicable when there are many patterns of missing covariates.
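The scaling problem with explicit pattern conditioning can be made concrete with a toy count (hypothetical numbers, not the study's covariates): with k covariates subject to missingness there are up to 2^k distinct patterns, so the number of pattern-specific strata a parametric component must carry grows rapidly with k:

```python
# Hypothetical sketch: count the distinct covariate-missingness patterns that
# explicit pattern conditioning would have to model as separate strata.
import numpy as np

rng = np.random.default_rng(1)
n, k = 500, 6                              # invented sample and covariate count
x = rng.normal(size=(n, k))
x[rng.random((n, k)) < 0.2] = np.nan       # roughly 20% missing per covariate

# Each row's missingness pattern is a binary vector; there can be up to 2**k
# distinct patterns, most of them sparsely populated when k is large.
patterns = np.isnan(x).astype(int)
unique, counts = np.unique(patterns, axis=0, return_counts=True)
print(len(unique), "distinct patterns observed out of", 2 ** k, "possible")
```

With many covariates most patterns contain only a handful of children, which is why models related to those of D'Agostino and Rubin (2000) become more attractive in such settings.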
In such cases, it would be more appropriate to use models related to those of D'Agostino and Rubin (2000) and integrate them with principal stratification for noncompliance. Moreover, it would also be interesting to investigate results for models fit to these data that allow deviations from the structural assumptions, such as weakened exclusion restrictions (e.g., Hirano et al. 2000; Frangakis, Rubin, and Zhou 2002a), although to ensure robustness of such models, it would be important to first investigate and rely on additional alternative assumptions that would be plausible.

APPENDIX A: COMPUTATIONS

Details are available at
http://biosunOJ.biostat.jhsph.edu/~cfrangak/papers/sc.

Table B.1. ITT Estimand

Grade at                   Applicant school: Low                 Applicant school: High
application  Ethnicity     Reading           Math                Reading            Math
1            AA            3.8 (-2.0, 9.6)   9.0 (4.1, 14.0)     2.3 (-7.3, 10.9)   8.3 (1.1, 15.5)
             Other         2.9 (-3.1, 8.8)   6.3 (.8, 11.8)      1.3 (-8.0, 10.2)   5.9 (-2.2, 13.9)
2            AA            1.1 (-3.6, 5.7)   3.1 (-1.3, 7.5)     -.4 (-9.0, 7.9)    2.9 (-5.0, 10.7)
             Other          .3 (-4.9, 5.4)    .8 (-4.4, 5.9)    -1.5 (-10.5, 7.3)    0 (-8.5, 8.5)
3            AA            1.5 (-4.1, 7.0)   6.3 (.3, 12.2)      -.2 (-9.1, 8.5)    5.6 (-3.2, 14.1)
             Other          .5 (-5.4, 6.4)   3.5 (-3.4, 10.2)   -1.3 (-10.5, 7.9)   2.7 (-7.0, 12.0)
4            AA            4.6 (-1.6, 10.8)  5.5 (-.7, 11.6)     3.0 (-6.0, 11.6)   4.8 (-3.6, 13.1)
             Other         3.7 (-2.6, 10.1)  2.8 (-3.8, 9.3)     2.3 (-7.4, 11.7)   2.0 (-7.0, 11.0)
Overall      AA            2.6 (-1.2, 6.3)   6.0 (2.4, 9.5)      1.0 (-6.9, 8.4)    5.5 (-1.5, 12.2)
             Other         1.7 (-2.3, 5.8)   3.3 (-1.2, 7.7)      .1 (-8.1, 7.9)    2.7 (-5.0, 10.3)

NOTE: Plain numbers are means, and parentheses are central 95% intervals of the posterior distribution of the effects on percentile rank.

Table B.2. CACE Estimand [only the following entries were legible in the source: Grade 1 — AA 6.7 (3.0, 10.4), 6.3 (.8, 11.9); Other .8 (-4.5, 6.1). Overall — AA 4.5 (1.8, 7.2), .8 (-5.1, 6.5); Other .1 (-4.6, 4.6).]

NOTE: Plain numbers are means, and parentheses are central 95% intervals of the posterior distribution of the effects on percentile rank.

APPENDIX B: ETHNIC BREAKDOWN

The
model of Section 7 can be used to estimate the effect of the program on finer strata that may be of interest. For example, to estimate effects stratified by ethnicity (AA for African-American), we obtain the posterior distribution of the causal effect of interest (ITT or CACE) using the same steps described in Section 8.1.1 or 8.1.2, but allowing ethnicity to be part of the definition of W^p. The results for this stratification are reported in Table B.1 for the estimands of ITT and Table B.2 for the estimands of CACE. The results follow patterns similar to those in Section 8. For each combination of grade and applicant school (low/high), however, the effects are more positive on average for the subgroup of African-American children. For both estimands, this leads to 95% intervals that are entirely above 0 for math scores for the following subgroups: African-American first- and third-graders and African-Americans overall from low-applicant schools, African-American first-graders from high-applicant schools, and non-African-American first-graders from low-applicant schools. All intervals for reading covered the null value. This suggests that the positive effects on math scores reported in Section 8 for children originating from low-applicant schools are primarily attributable to gains among the African-American children in this subgroup.

[Received October 2002. Revised October 2002.]

REFERENCES

Angrist, J. D., Imbens, G. W., and Rubin, D. B.
(1996), "Identification of Causal Effects Using Instrumental Variables" (with discussion), Journal of the American Statistical Association, 91, 444-472.
Barnard, J., Du, J., Hill, J. L., and Rubin, D. B. (1998), "A Broader Template for Analyzing Broken Randomized Experiments," Sociological Methods and Research, 27, 285-317.
Barnard, J., Frangakis, C., Hill, J. L., and Rubin, D. B. (2002), "School Choice in NY City: A Bayesian Analysis of an Imperfect Randomized Experiment," in Case Studies in Bayesian Statistics, Vol. V, eds. C. Gatsonis, R. E. Kass, B. Carlin, A. Carriquiry, A. Gelman, I. Verdinelli, and M. West, New York: Springer-Verlag, pp. 3-97.
Brandl, J. E. (1998), Money and Good Intentions Are Not Enough, or Why Liberal Democrats Think States Need Both Competition and Community, Washington, DC: Brookings Institution Press.
Carnegie Foundation for the Advancement of Teaching (1992), School Choice: A Special Report, San Francisco: Jossey-Bass.
Chubb, J. E., and Moe, T. M. (1990), Politics, Markets and America's Schools, Washington, DC: Brookings Institution Press.
Cobb, C. W. (1992), Responsive Schools, Renewed Communities, San Francisco: Institute for Contemporary Studies.
Coleman, J. S., Hoffer, T., and Kilgore, S. (1982), High School Achievement, New York: Basic Books.
Cookson, P. W. (1994), School Choice: The Struggle for the Soul of American Education, New Haven, CT: Yale University Press.
Coulson, A. J. (1999), Market Education: The Unknown History, Bowling Green, OH: Social Philosophy & Policy Center.
Cox, D. R. (1958), Planning of Experiments, New York: Wiley.
D'Agostino, R. B., Jr., and Rubin, D. B. (2000), "Estimating and Using Propensity Scores With Incomplete Data," Journal of the American Statistical Association, 95, 749-759.
Education Week (1998), Quality Counts '98: The Urban Challenge; Public Education in the 50 States, Bethesda, MD: Editorial Projects in Education.
Frangakis, C. E., and Rubin, D. B. (1999), "Addressing Complications of Intention-to-Treat Analysis in the Combined Presence of All-or-None Treatment-Noncompliance and Subsequent Missing Outcomes," Biometrika, 86, 365-380.
---(2001), "Addressing the Idiosyncrasy in Estimating Survival Curves Using Double-Sampling in the Presence of Self-Selected Right Censoring" (with discussion), Biometrics, 57, 333-353.
---(2002), "Principal Stratification in Causal Inference," Biometrics, 58, 20-29.
Frangakis, C. E., Rubin, D. B., and Zhou, X. H. (2002a), "Clustered Encouragement Design With Individual Noncompliance: Bayesian Inference and Application to Advance Directive Forms" (with discussion), Biostatistics, 3, 147-164.
Frangakis, C. E., Brookmeyer, R. S., Varadhan, R., Mahboobeh, S., Vlahov, D., and Strathdee, S. A. (2002b), "Methodology for Evaluating a Partially Controlled Longitudinal Treatment Using Principal Stratification, With Application to a Needle Exchange Program," Technical Report NEP-06-02, Johns Hopkins University, Dept. of Biostatistics.
Fuller, B., and Elmore, R. F. (1996), Who Chooses? Who Loses? Culture, Institutions, and the Unequal Effects of School Choice, New York: Teachers College Press.
Gelfand, A. E., and Smith, A. F. M. (1990), "Sampling-Based Approaches to Calculating Marginal Densities," Journal of the American Statistical Association, 85, 398-409.
Gelman, A., Meng, X.-L., and Stern, H. (1996), "Posterior Predictive Assessment of Model Fitness via Realized Discrepancies" (with discussion), Statistica Sinica, 6, 733-807.
Gelman, A., and Rubin, D. B. (1992), "Inference from Iterative Simulation Using Multiple Sequences" (with discussion), Statistical Science, 7, 457-472.
Glynn, R. J., Laird, N. M., and Rubin, D. B. (1993), "Multiple Imputation in Mixture Models for Nonignorable Nonresponse With Follow-Ups," Journal of the American Statistical Association, 88, 984-993.
Goldberger, A. S., and Cain, G. G. (1982), "The Causal Analysis of Cognitive Outcomes in the Coleman, Hoffer, and Kilgore Report," Sociology of Education, 55, 103-122.
Haavelmo, T. (1943), "The Statistical Implications of a System of Simultaneous Equations," Econometrica, 11, 1-12.
Hastings, W. (1970), "Monte Carlo Sampling Methods Using Markov Chains and Their Applications," Biometrika, 57, 97-109.
Hill, J. L., Rubin, D. B., and Thomas, N. (2000), "The Design of the New York School Choice Scholarship Program Evaluation," in Donald Campbell's Legacy, ed. L. Bickman, Newbury Park, CA: Sage Publications.
Hirano, K., Imbens, G. W., Rubin, D. B., and Zhou, A. (2000), "Assessing the Effect of an Influenza Vaccine in an Encouragement Design," Biostatistics, 1, 69-88.
Holland, P. (1986), "Statistics and Causal Inference," Journal of the American Statistical Association, 81, 945-970.
Imbens, G. W., and Angrist, J. D. (1994), "Identification and Estimation of Local Average Treatment Effects," Econometrica, 62, 467-476.
Imbens, G. W., and Rubin, D. B. (1997), "Bayesian Inference for Causal Effects in Randomized Experiments With Noncompliance," The Annals of Statistics, 25, 305-327.
Levin, H. M. (1998), "Educational Vouchers: Effectiveness, Choice, and Costs," Journal of Policy Analysis and Management, 17, 373-392.
Little, R. J. A. (1993), "Pattern-Mixture Models for Multivariate Incomplete Data," Journal of the American Statistical Association, 88, 125-134.
---(1996), "Pattern-Mixture Models for Multivariate Incomplete Data With Covariates," Biometrics, 52, 98-111.
Little, R. J. A., and Rubin, D. B. (1987), Statistical Analysis With Missing Data, New York: Wiley.
Meng, X. L. (1996), "Posterior Predictive p Values," The Annals of Statistics, 22, 1142-1160.
Neal, D. (1997), "The Effects of Catholic Secondary Schooling on Educational Achievement," Journal of Labor Economics, 15, 98-123.
Neyman, J. (1923), "On the Application of Probability Theory to Agricultural Experiments. Essay on Principles. Section 9," translated in Statistical Science, 5, 465-480.
Peterson, P. E., and Hassel, B. C. (eds.) (1998), Learning from School Choice, Washington, DC: Brookings Institution Press.
Peterson, P. E., Myers, D. E., Howell, W. G., and Mayer, D. P. (1999), "The Effects of School Choice in New York City," in Earning and Learning: How Schools Matter, eds. S. E. Mayer and P. E. Peterson, Washington, DC: Brookings Institution Press.
Robins, J. M., Greenland, S., and Hu, F.-C. (1999), "Estimation of the Causal Effect of a Time-Varying Exposure on the Marginal Mean of a Repeated Binary Outcome" (with discussion), Journal of the American Statistical Association, 94, 687-712.
Rosenbaum, P. R., and Rubin, D. B. (1983), "The Central Role of the Propensity Score in Observational Studies for Causal
Effects," Biometrika, 70, 41-55.
Rubin, D. B. (1974), "Estimating Causal Effects of Treatments in Randomized and Nonrandomized Studies," Journal of Educational Psychology, 66, 688-701.
---(1977), "Assignment to Treatment Groups on the Basis of a Covariate," Journal of Educational Statistics, 2, 1-26.
---(1978a), "Bayesian Inference for Causal Effects: The Role of Randomization," The Annals of Statistics, 6, 34-58.
---(1978b), "Multiple Imputations in Sample Surveys: A Phenomenological Bayesian Approach to Nonresponse," in Proceedings of the Survey Research Methods Section, American Statistical Association, pp. 20-28.
---(1979), "Using Multivariate Matched Sampling and Regression Adjustment to Control Bias in Observational Studies," Journal of the American Statistical Association, 74, 318-328.
---(1980), Comment on "Randomization Analysis of Experimental Data: The Fisher Randomization Test," Journal of the American Statistical Association, 75, 591-593.
---(1984), "Bayesianly Justifiable and Relevant Frequency Calculations for the Applied Statistician," The Annals of Statistics, 12, 1151-1172.
---(1990), "Comment: Neyman (1923) and Causal Inference in Experiments and Observational Studies," Statistical Science, 5, 472-480.
The Coronary Drug Project Research Group (1980), "Influence of Adherence to Treatment and Response of Cholesterol on Mortality in the Coronary Drug Project," New England Journal of Medicine, 303, 1038-1041.
Tinbergen, J. (1930), "Determination and Interpretation of Supply Curves: An Example," in The Foundations of Econometric Analysis, eds. D. Hendry and M. Morgan, Cambridge, U.K.: Cambridge University Press.
Wilms, D. J. (1985), "Catholic School Effect on Academic Achievement: New Evidence From the High School and Beyond Follow-Up Study," Sociology of Education, 58, 98-114.

Comment

Bengt MUTHÉN, Booil JO, and C. Hendricks BROWN

[Bengt Muthén is Professor and Booil Jo is a postdoctoral researcher, Graduate School of Education and Information Studies, University of California Los Angeles, Los Angeles, CA 90095 (E-mail: bmuthen@ucla.edu). C. Hendricks Brown is Professor, Department of Epidemiology and Biostatistics, University of South Florida, Tampa, FL 33620. The research of the first author was supported by National Institute on Alcohol Abuse and Alcoholism grant K02 AA 00230. The research of the second and third authors was supported by National Institute on Drug Abuse and National Institute of Mental Health grant MH40859. The authors thank Chen-Pin Wang for research assistance, Joyce Chappell for graphical assistance, and the members of the Prevention Science Methodology Group and the Fall ED299A class for helpful comments.]

1. INTRODUCTION

The article by Barnard, Frangakis, Hill, and Rubin (BFHR) is timely in that the Department of Education is calling for more randomized studies in educational program evaluation. (See the discussion of the "No Child Left Behind" initiative in, e.g., Slavin 2002.) BFHR can serve as a valuable pedagogical example of a successful, sophisticated statistical analysis of a randomized study. Our commentary is intended to provide additional pedagogical value to benefit the planning and analysis of future studies, drawing on experiences and research within our research group. [The Prevention Science Methodology Group (PSMG; www.psmg.hsc.usf.edu), co-PIs Brown and Muthén, has collaborated over the last 15 years with support from the National Institute of Mental Health and the National Institute on Drug Abuse.]

BFHR provides an exemplary analysis of the data from an imperfect randomized trial that suffers from several complications simultaneously: noncompliance, missing data in outcomes, and missing data in covariates. We are very pleased to see their application of cutting-edge Bayesian methods for dealing with these complexities. In addition, we believe the methodological issues and the results of the study have important implications for the design and analysis of randomized trials in education and for related policy decisions. BFHR provides results of the
New York City school choice experiment based on 1-year achievement outcomes. With the planned addition of yearly follow-up data, growth models can provide an enhanced examination of causal impact. We discuss how such growth modeling can be incorporated and provide a caution that applies to BFHR's use of only one posttest occasion. We also consider the sensitivity of the latent-class ignorability assumption in combination with the assumption of compound exclusion.

2. LONGITUDINAL MODELING ISSUES

BFHR focuses on variation in treatment effect across compliance classes. This part of the commentary considers variation in treatment effect across a different type of class, based on the notion that the private school treatment effect might very well be quite different for children with different achievement development. (Also of interest is potential variation in treatment effects across schools, with respect to both the public school the child originated in and the private school the child was

© 2003 American Statistical Association, Journal of the American Statistical Association, June 2003, Vol. 98, No. 462, Applications and Case Studies. DOI 10.1198/016214503000080