SSRN 4487202
1 Introduction
1.1 Overview
1.2 Pedagogy
Chapter 1
Introduction
Many scientific questions are causal inference questions. Much of science is concerned
with causal inference, namely estimating the effect of a “treatment” on an “outcome”. For that
purpose, a gold-standard method is to run a randomized experiment, where units’ exposure to
treatment is determined randomly. Then, one can compare the average outcome of treated and
untreated units, to unbiasedly estimate the so-called average treatment effect (ATE), namely
the average effect of the treatment in the population of interest.
In the social sciences, many important causal-inference questions are hard or im-
possible to study using a randomized experiment. For instance, one cannot randomly
increase imports from China to some countries and not to other countries, so as to measure
the effect of imports from China on destination-countries’ employment. Similarly, randomly
assigning firms to high- and low-minimum-wage groups to study the minimum wage’s effect on
employment has never been done. Even when they are feasible, randomized experiments
sometimes lack “external validity”: their findings may not be extrapolated from the experimen-
tal sample to the population whose ATE the researcher would like to learn. One reason is that
research ethics requires informed consent to include subjects in an experiment. Then, the
hypothetical minimum wage experiment would unbiasedly estimate the minimum wage’s effect
among firms that enrolled in the experiment, but that effect would probably differ from the
minimum wage’s effect among all firms.
Instead, social scientists often resort to natural experiments. To answer hard causal-
inference questions for which randomized experiments are unfeasible or would lack external
validity, researchers usually rely on so-called natural experiments, where a treatment and a
control group arise naturally, without any intervention from the researcher.1 Typically, natural
experiments used in the social sciences are policy changes. For instance, a US state increases
its minimum wage while the neighboring state does not, thus giving researchers a treatment
group facing a high minimum wage, and a control group facing a lower minimum wage. Natural
experiments often affect an entire region or province, so the findings from studies leveraging
natural experiments typically apply to large and unselected populations, unlike the findings
from randomized experiments.
However, natural experiments are often more complicated than this example suggests. In states that
did not change their minimum wage, policy makers may decide to change it at a later point.
Also, some states may implement large minimum wage increases, while other states implement
smaller increases. This creates lots of treatment variation, which complicates the analysis, and
may preclude researchers from estimating the clean-and-simple effects they were initially hoping
to estimate. The complexity of the majority of natural experiments leveraged by social scientists
to learn causal effects is a central theme of this book.
1.1 Overview
The classical DID design. After presenting the book’s set-up and notation in Chapter
2, Chapter 3 reviews the classical difference-in-differences (DID) design. In that design, the
treatment is a binary variable, the treatment is absorbing, meaning that one cannot switch out
of treatment once treated, and there is no variation in treatment timing: all treated groups are
treated at the same time. In this introduction, we consider a simplified version of this design,
with two groups and two periods: group s switches from untreated to treated from period 1 to
2, and group n is untreated at both dates.
Potential and observed outcomes. Let us also introduce a simplified potential-outcome notation that will be sufficient for the purposes of this introduction (the next chapter introduces the
general potential outcome notation used throughout the book). For g ∈ {s, n} and t ∈ {1, 2},
let Yg,t (0) and Yg,t (1) denote the potential outcomes in g at t without and with treatment, re-
spectively (Neyman, Dabrowska and Speed, 1990; Rubin, 1974). In the minimum wage example,
Yg,t (0) is the employment level that g will have at t with a low minimum wage, and Yg,t (1) is the
employment level that g will have at t with a high minimum wage. Let Yg,t denote the observed
outcome in g at t. If “cell”2 (g, t) is untreated, its observed outcome is its potential outcome
without treatment: Yg,t = Yg,t (0). If (g, t) is treated, its observed outcome is its potential out-
come with treatment: Yg,t = Yg,t (1). Letting Dg,t be an indicator equal to one if (g, t) is treated,
we have the following relationship between the observed and potential outcomes:

Yg,t = Dg,t Yg,t (1) + (1 − Dg,t )Yg,t (0).

Justify the previous equality, by showing that it holds for any value that Dg,t can take.
If Dg,t = 1, the observed outcome Yg,t is equal to Yg,t (1), and the right-hand-side of the previous
display is also equal to Yg,t (1) so the equality holds. If Dg,t = 0, the observed outcome Yg,t is
equal to Yg,t (0), and the right-hand-side of the previous display is also equal to Yg,t (0) so the
equality holds. The unobserved potential outcome of (g, t) is referred to as its counterfactual
outcome.
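The observed-outcome identity can be sketched in a few lines. This is a minimal Python illustration of ours (the book's empirical material is in Stata, and the function name here is made up):

```python
def observed_outcome(d, y0, y1):
    """Switching equation: Y = D*Y(1) + (1 - D)*Y(0).

    d: treatment indicator (0 or 1); y0, y1: potential outcomes."""
    return d * y1 + (1 - d) * y0

# An untreated cell reveals Y(0); a treated cell reveals Y(1).
print(observed_outcome(0, y0=5.0, y1=8.0))  # 5.0
print(observed_outcome(1, y0=5.0, y1=8.0))  # 8.0
```

Either way, only one of the two potential outcomes is revealed; the other is the cell's counterfactual outcome.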
Target parameter, and three possible estimators. We would like to estimate E(Ys,2 (1) −
Ys,2 (0)), the average effect of the treatment in group s at period 2. (s, 2) is the only treated (g, t)
cell, so E(Ys,2 (1) − Ys,2 (0)) is the average treatment effect on the treated (ATT). As group s is
treated in period 2, Ys,2 (1) is just Ys,2 , the observed outcome in s at period 2. On the other hand,
Ys,2 (0), the counterfactual outcome that s would have had at period 2 if it had been untreated,
is unobserved and has to be estimated. To estimate E(Ys,2 (1) − Ys,2 (0)), we could use

Ys,2 − Yn,2 , (1.1)

a treated-versus-control comparison of s’s and n’s outcomes at period 2. However, which group
got treated was not determined randomly, so Yn,2 may not be a good counterfactual of the
outcome that s would have had at period 2 without treatment. Alternatively, we could use

Ys,2 − Ys,1 , (1.2)

a before-after comparison of s’s outcome at periods 1 and 2. However, perhaps s’s outcome
would have changed from period 1 to 2 even without treatment, so Ys,1 may again not be a good
counterfactual. Instead, we could use

DID = Ys,2 − Ys,1 − (Yn,2 − Yn,1 ), (1.3)

the comparison of the period-1-to-2 outcome evolutions of s and n. DID is called a Difference-
in-Differences estimator: the before-after longitudinal difference in (1.2) is combined with the
treated-versus-control cross-sectional difference in (1.1). For DID to be a reliable estimator of
the treatment effect, one assumption has to hold, which one?
The parallel-trends assumption: in the absence of treatment, s’s and n’s expected outcomes would have followed parallel evolutions: E [Ys,2 (0) − Ys,1 (0)] = E [Yn,2 (0) − Yn,1 (0)]. Under that assumption, DID is unbiased for the ATT:

E [DID] =E [Ys,2 (1) − Ys,1 (0)] − E [Yn,2 (0) − Yn,1 (0)]
        =E [Ys,2 (1) − Ys,2 (0)] + E [Ys,2 (0) − Ys,1 (0)] − E [Yn,2 (0) − Yn,1 (0)]
        =E [Ys,2 (1) − Ys,2 (0)] . (1.5)

The first equality follows from the fact that (s, 2) is treated, and all other cells are untreated.
The second equality follows from adding and subtracting Ys,2 (0). The third equality follows
from the parallel-trends assumption. Many unbiasedness proofs in this book have the same
structure as that of (1.5). First, one maps observed outcomes into potential outcomes. Second,
one adds and subtracts the treated group’s missing counterfactual outcome. Third, one invokes
the parallel-trends assumption.
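The three estimators can be contrasted in a small simulation. Below is a Python sketch of ours with made-up numbers (not the book's Stata code): group s starts at a higher level than group n, both share a common +2 trend without treatment, and the true effect is 3. The cross-sectional comparison picks up the level gap, the before-after comparison picks up the common trend, and only DID recovers the effect:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 100_000  # individual-level draws per (group, period) cell

# Untreated outcomes: different group levels, common +2 trend.
y_s1 = 10 + rng.normal(size=n)       # group s, period 1
y_s2_0 = 12 + rng.normal(size=n)     # s's (unobserved) untreated outcome at period 2
y_n1 = 4 + rng.normal(size=n)        # group n, period 1
y_n2 = 6 + rng.normal(size=n)        # group n, period 2

effect = 3.0
y_s2 = y_s2_0 + effect               # observed: s is treated at period 2

cross = y_s2.mean() - y_n2.mean()                 # (1.1): biased by the level gap
before_after = y_s2.mean() - y_s1.mean()          # (1.2): biased by the common trend
did = before_after - (y_n2.mean() - y_n1.mean())  # DID: unbiased under parallel trends
print(f"(1.1): {cross:.1f}  (1.2): {before_after:.1f}  DID: {did:.1f}")
```

With these numbers, (1.1) is close to 9, (1.2) is close to 5, and DID is close to the true effect of 3.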
The parallel-trends assumption is weaker than the assumption underlying the before-after comparison in (1.2), which requires that groups’ average untreated outcomes do not evolve over time.
3 In this book, we use the expressions “pre-trend tests” and “placebo tests” interchangeably.
While useful, pre-trend tests have two limitations. First, while we can test for parallel
trends before treatment, we cannot do so after treatment, because the untreated outcome of s
is no longer observed once s becomes treated. Even if s and n were on parallel trends before the
date when the treatment started, that does not necessarily mean that without treatment, they
would have remained on parallel trends after that date. Accordingly, parallel-trends tests remain
suggestive. Second, a recent literature has shown that tests of parallel trends are sometimes
underpowered. Those tests may fail to detect differential trends between treated and control
groups that are large enough to significantly bias DID estimators.
Relaxations of the parallel-trends assumption. There are at least three instances where
one may want to relax the parallel-trends assumption. First, pre-trend tests may be rejected.
Second, even if pre-trend tests are not rejected, one may worry that they lack power. Third,
even if pre-trend tests are not rejected and one does not worry that they lack power, one may
still worry that in the absence of treatment, treated and control groups could have experienced
differential trends after the date when the treatment started. Thus, an important literature,
reviewed in Chapter 4, has proposed estimators relying on relaxations of the parallel-trends
assumption. First, in Section 4.1 we consider DID estimators with control variables, which rely
on a conditional parallel-trends assumption. Second, in Section 4.2 we consider interactive-
fixed-effects, synthetic-control, and synthetic-DID estimators. Third, in Section 4.3 we consider
estimators relying on a bounded-differential-trends assumption.
Historical notes. Baker, Callaway, Cunningham, Goodman-Bacon and Sant’Anna (2025) exhibit what is, as of now, the earliest known example of a DID analysis. In the 1840s, Ignaz Semmelweis was an assistant physician at the Vienna maternity clinic. At the time, maternal
mortality from childbed fever was very high, especially in one of Vienna’s clinics that was ex-
clusively staffed with physicians. In Vienna’s other clinic, exclusively staffed with midwives,
mortality was lower. Semmelweis conjectured that the gap may be due to hygiene: doctors
routinely performed autopsies before seeing to laboring mothers, without always washing “ca-
daverous particles” from their hands. Midwives, on the other hand, performed no autopsies.
To assess whether contaminated hands caused childbed fever, Semmelweis mandated a simple
protocol of handwashing using chlorinated lime. After the implementation of the hand-washing
protocol, mortality sharply dropped in the doctor-staffed clinic, while mortality did not change in
the midwives-staffed clinic. This DID analysis demonstrated that hand-washing was a simple yet
powerful way of preventing childbed fever (Semmelweis, 1983). Semmelweis’s theory was mocked
by his peers, who refused to admit that their actions were the cause of women’s mortality. In
1865, the increasingly outspoken Semmelweis allegedly suffered a nervous breakdown and was
committed to an asylum by his colleagues, where he died 14 days later in unclear circumstances.
Another famous early DID example is John Snow’s study of whether cholera is transmitted by
air or water, the two leading theories in the 1850s. Snow used a change in the water supply in
one district of London, namely the switch from polluted water taken from the Thames in the
centre of London to a supply of cleaner water taken upriver. He showed that following that
change, cholera outbreaks diminished in that district, while they did not change in neighboring
districts which were still getting their water from central London. As the treated and control
districts had the same air quality, his DID analysis showed that cholera is transmitted by water
(Snow, 1856; Lechner, 2011).
TWFE estimators are ubiquitous, and the majority of TWFE estimators are com-
puted outside of the classical design. In a classical design, the DID estimator is equal to
the treatment coefficient in a two-way fixed effects (TWFE) linear regression of Yg,t , on group
fixed effects, period fixed effects, and Dg,t , the treatment of group g at period t. Motivated by
this fact, researchers have also estimated TWFE regressions in more complicated designs, with
treatments that may be non-absorbing and/or non-binary, and where groups may experience
several treatment changes, at different points in time. Their hope was that there as well, TWFE
was giving them an estimator that only relied on a placebo-testable parallel-trends assumption.
Accordingly, TWFE regressions have become a very commonly-used technique in economics.
de Chaisemartin and D’Haultfœuille (forthc.) conducted a survey of the 100 papers with the
most Google Scholar citations published by the American Economic Review from 2015 to 2019.
Of those, 26 use a TWFE regression to estimate the effect of a treatment on an outcome. By
comparison, 11 of those 100 papers use a randomized experiment (six use a field experiment,
three use a survey experiment, and two use a laboratory experiment), and three use a regression
discontinuity design.4 Of the 26 papers estimating a TWFE regression, only two have a classical
design with an absorbing and binary treatment, and no variation in treatment timing. TWFE
regressions are also very common in political science: Chiu, Lan, Liu and Xu (2023) find that
of all papers published from 2017 to 2022 by the American Political Science Review, the Amer-
ican Journal of Political Science, and the Journal of Politics, 93 estimate a TWFE regression,
and only nine have a classical design. TWFE regressions are also commonly used in sociology,
environmental sciences, and epidemiology.
Outside of the classical design, TWFE estimators can be misleading if the treat-
ment effect varies across groups and over time. In Chapter 5, we show that outside of
the classical design, the parallel-trends assumption is not sufficient to ensure that the TWFE
estimator is unbiased for the ATT. Under parallel trends, the TWFE estimator is unbiased for
a weighted sum of group-and-period specific treatment effects across all treated (g, t) cells. The
weight assigned to the treatment effect of cell (g, t) is not proportional to the population of
cell (g, t), so the TWFE estimator does not estimate the ATT. Perhaps more worryingly, the
TWFE estimator may weight negatively the treatment effects of some (g, t) cells. Then, one
could have that the treatment effect is, say, positive in every group and time period, but the
TWFE coefficient is negative. The TWFE estimator is unbiased if one further assumes that the
treatment effect is constant across groups and over time, but that assumption is likely to fail
in many applications. For instance, the effect of the minimum wage on employment is likely to
differ in states with highly-educated workers, and in states with less-educated workers.
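This sign-reversal phenomenon is easy to exhibit numerically. Below is a Python sketch of ours (the book's empirical material is in Stata): two groups adopt a binary treatment at different dates, the treatment effect is positive in every treated cell, parallel trends hold exactly, yet the TWFE coefficient is negative:

```python
import numpy as np

G, T = 2, 3
# Staggered adoption: group 0 treated in periods 1 and 2, group 1 in period 2 only.
D = np.array([[0., 1., 1.],
              [0., 0., 1.]])
# Treatment effects: positive in every treated (g, t) cell.
tau = np.array([[0., 1., 4.],
                [0., 0., 1.]])
alpha = np.array([0., 5.])           # group fixed effects
gamma = np.array([0., 1., 2.])       # period fixed effects (exact parallel trends)
Y = alpha[:, None] + gamma[None, :] + tau * D

# TWFE regression: Y on group dummies, period dummies (dropping t=0), and D.
X, y = [], []
for g in range(G):
    for t in range(T):
        gdum = [1. if g == k else 0. for k in range(G)]
        tdum = [1. if t == k else 0. for k in range(1, T)]
        X.append(gdum + tdum + [D[g, t]])
        y.append(Y[g, t])
beta = np.linalg.lstsq(np.array(X), np.array(y), rcond=None)[0]
twfe = beta[-1]                      # coefficient on D

att = tau[D == 1].mean()             # average effect across treated cells
print(f"ATT = {att:.2f}, TWFE = {twfe:.2f}")  # ATT = 2.00, TWFE = -0.50
```

In this example the TWFE coefficient equals 1 × τ0,1 − 0.5 × τ0,2 + 0.5 × τ1,2: group 0's period-2 effect receives a negative weight, so a large positive effect in that cell drives the coefficient below zero even though every effect is positive.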
1.2 Pedagogy
Theory. Because the book’s level of technicality may be intimidating to many
applied readers, we have included, throughout the book, questions in blue, to help the reader
make sense of notation, understand concepts, and follow derivations. Those questions do not
bear on topics covered earlier in the book. Rather, they bear on what comes next, just after the
question. A large body of research in cognitive sciences shows that quizzing learners on topics
the instructor has not covered yet, in the spirit of Socrates’ maieutics, improves their learning
more than quizzing them on topics the instructor has just covered. The “blue questions” are an
essential element of the book’s pedagogy. An instructor may ask those questions to students in
class, giving them a couple of minutes to discuss the question with their classroom neighbors.
Students reading the book together in a reading group may discuss the question for a couple
of minutes, before reading the answer. Readers studying the book on their own may stop their
reading for a couple of minutes to think about those questions when they encounter one.
Applications. The methods are illustrated by revisiting several empirical articles. Chapters 3
to 8 each have an empirical running example, used throughout the chapter. Stata datasets
and dofiles to perform the book’s empirical exercises can be obtained by typing ssc desc
cc_xd_didtextbook and then net get cc_xd_didtextbook from within Stata.5 Each chap-
ter includes “green questions”, asking the reader to perform an estimation on the chapter’s
dataset. Reproducing the estimations outlined in the green questions will facilitate the transition from theoretical understanding to practical application. R datasets and code will soon be made available as well. Many green questions leverage user-written Stata packages. To install
them, you need to run ssc install packagename within Stata. Some of those packages have
dependencies (other user-written packages that need to be installed to run the package), which
you will also need to install.
The green questions’ answers display the Stata commands used, thereby introducing the reader to the software’s syntax. Those tools are provided freely by the researchers who create
them, a crucial yet under-appreciated part of the research process. As package developers, we
cannot stress how time-consuming it is to create and maintain a package, and answer users’
questions. As there is not always a one-to-one mapping between the authors of a paper that
proposes an estimator and the authors of a user-written package, we encourage the readers of
this book, whenever they use a user-written package, to cite the package, together with the
paper that has proposed the estimator.
Core and non-core material. This book is admittedly much longer than one may
expect for a book covering “only” DIDs. This reflects the fact that many seemingly unrelated
methods actually have a connection to DIDs (the book for instance discusses interactive-fixed
effects and synthetic control methods, as well as Bartik regressions). Moreover, DID-like methods
are ubiquitous, and are used in a very broad set of applications, so covering comprehensively
all common use-cases is, in and of itself, a big undertaking. Still, to make the book more
approachable, we have divided its material into two categories. Starred material corresponds to
non-core and/or technically more difficult material, which readers may skip when reading the
book for the first time.
Acknowledgements. We are extremely grateful to participants in all the DID mini courses
where this book was used as a reference: their comments and questions have greatly improved the
book. We are grateful to Andrea Albanese, Bocar Ba, Damian Clarke, Bruno Ferman, Yagan
Hazard, David McKenzie, and Matteo Pinna Pintor for their helpful comments. We thank
Romain Angotti, David Arboleda Cárcamo, Diego Ciccia, Felix Knau, Bingxue Li, Mélitine
Malézieux, and Doulo Sow for exceptional research assistance.
Chapter 2

Data, Notation, and Assumptions
We seek to estimate the effect of a treatment on an outcome. For that purpose, we use a panel of
G groups observed at T periods, respectively indexed by g and t. Typically, groups are locations,
like states, counties, or municipalities, but a group could also just be a single individual or firm.
The group-level panel data may be constructed by aggregating an individual-level repeated
cross-section data set at the (g, t) level, defining groups, say, as individuals’ county of birth. The
group-level panel data may also be constructed from a single cross-section dataset, with cohort of
birth playing the role of the time variable. In most of this textbook, we consider estimators that
are not weighted by Ng,t , the population of cell (g, t), and we also assume that the group-level
panel dataset is balanced, meaning that the outcome and treatment of each group are observed at
every period. This is mostly to reduce notational complexity, but we discuss the consequences
of imbalancedness and weighting at the end of Chapters 3 and 6.
Treatment. Let Dg,t denote the treatment of group g at period t, let Dt be the set of values
Dg,t can take at period t (i.e.: its support), let Dg = (Dg,1 , ..., Dg,T ) be a 1 × T vector stacking
the treatments of group g from period 1 to T , and let D = (D1 , ..., DG ) be a vector stacking
the treatments of all groups at every period. D is referred to as the design of a study.
Potential outcomes. For all d = (d1 , ..., dG ) in the support of D, let Yg,t (d1 , ..., dG ) denote
the potential outcome of group g at t if D = d, and let Yg,t = Yg,t (D) denote the observed
outcome of g at t. This notation allows groups’ outcome at t to depend on their current, past,
and future treatments, and on other groups’ current, past, and future treatments.
2.3 Assumptions
SUTVA In most of this book, we implicitly assume that g’s potential outcomes only depend
on g’s treatments, the so-called Stable Unit Treatment Value Assumption (SUTVA, Rubin, 1978):

Yg,t (d1 , ..., dG ) = Yg,t (dg ). (2.1)
(2.1) may fail in the presence of spatial spillovers: a policy is targeted to some locations, but
the effect of treatment spills over onto “nearby” locations. For example, individuals in control
areas can travel to the treated areas and receive treatment. Few papers have attempted to relax
the SUTVA assumption in DID designs, though this is a topic that has started receiving more
attention recently. We will review some of this burgeoning literature in the next chapters.
Assumption NA (No Anticipation) For all g and (d1 , ..., dT ) ∈ D1 × ... × DT , Yg,t (d1 , ..., dT ) =
Yg,t (d1 , ..., dt ).
Assumption NA requires that a group’s current outcome does not depend on its future treat-
ments. It is plausible when treatment’s introduction is hard to anticipate. It is less plausible
when treatment’s introduction is announced saliently ahead of time. Then, researchers some-
times redefine a (g, t) cell as treated if at period t, it has been announced that group g will get
treated in the future.
Initial conditions. The potential outcome notation Yg,t (d1 , ..., dt ) implicitly assumes that
groups’ treatment prior to period one, the first time period in the data, does not affect their
outcome, the so-called “initial conditions” assumption. In many of the applications we will
consider in this textbook, groups cannot be exposed to treatment before period one. Then,
the “initial conditions” assumption is innocuous. When groups might have been exposed to
treatment before period one, this assumption is not innocuous, though very few papers have
attempted to relax it in the DID literature.
Assumption ND (No Dynamic Effects) For all g and (d1 , ..., dt ) ∈ D1 ×...×Dt , Yg,t (d1 , ..., dt ) =
Yg,t (dt ).
Assumption ND requires that a group’s current outcome does not depend on its past treatments,
the so-called no-dynamic-effects or no carry-over-effects hypothesis. When one studies the effect
of a tax on prices, Assumption ND is plausible if prices adjust quickly, while Assumption ND is
implausible if prices are sticky. Under Assumptions NA and ND and with a binary treatment,
each cell (g, t) has two potential outcomes: Yg,t (0) if g is untreated at t, and Yg,t (1) if g is treated
at t. Then, we are back to the standard Neyman-Rubin model of potential outcomes (Neyman
et al., 1990; Rubin, 1974).
Bibliographic notes. Robins (1986) introduced the dynamic potential outcome model we
use in this book. He also introduced Assumption NA, which was later discussed by Abbring and
Van den Berg (2003) in the context of duration models, and by Malani and Reif (2015) in the
context of DID models.
In most of this textbook we will assume that groups are on parallel trends. Our baseline parallel-
trends assumption is as follows. For all t, let 0t denote a vector of t zeros. Yg,t (0t ) is the outcome
of group g at t if it remains untreated from period 1 to t, hereafter referred to as the never-treated
outcome of g at t.
Assumption PT (Parallel trends for the never-treated outcome) For all t ≥ 2, E [Yg,t (0t ) − Yg,t−1 (0t−1 )] does not vary across g.
Assumption PT requires that every group experiences the same expected evolution of its never-
treated potential outcome. Under Assumption ND, Assumption PT reduces to:
For all t ≥ 2, E [Yg,t (0) − Yg,t−1 (0)] does not vary across g. (2.2)
For all t ≥ 2, a telescoping argument yields

E [Yg,t (0t ) − Yg,1 (0)] = Σ_{t′=2}^{t} E [Yg,t′ (0t′ ) − Yg,t′−1 (0t′−1 )] ,

so under Assumption PT the left-hand side is constant across g. Then, let γt = E [Yg,t (0t ) − Yg,1 (0)], and let αg = E [Yg,1 (0)]. One has that

E [Yg,t (0t )] = αg + γt .

Conversely, it is easy to verify that if E [Yg,t (0t )] = αg + γt , then Assumption PT holds. Therefore,
letting

εg,t = Yg,t (0t ) − E [Yg,t (0t )]

denote the deviation of g’s never-treated outcome at period t from its expectation, one has that
Assumption PT holds if and only if the never-treated outcome follows a TWFE model:

Yg,t (0t ) = αg + γt + εg,t , with E [εg,t ] = 0.
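This equivalence is easy to check numerically: under a TWFE model for the expected never-treated outcome, first differences are identical in every group. A minimal Python sketch of ours, with made-up αg and γt :

```python
import numpy as np

alpha = np.array([2.0, -1.0, 0.5])        # group effects alpha_g (made up)
gamma = np.array([0.0, 1.0, 1.5, 3.0])    # period effects gamma_t (made up)

# E[Y_{g,t}(0_t)] = alpha_g + gamma_t, as a G x T matrix.
EY0 = alpha[:, None] + gamma[None, :]

# Expected evolutions E[Y_{g,t}(0_t) - Y_{g,t-1}(0_{t-1})], one row per group.
diffs = np.diff(EY0, axis=1)
print(diffs)
# Every row equals (1.0, 0.5, 1.5): the expected evolution of the
# never-treated outcome does not vary across g, i.e. Assumption PT holds.
```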
Parallel trends for all groups, or parallel trends on average? The parallel-trends con-
dition in Assumption PT is strong, as it requires that all groups experience parallel trends.
Actually, many of the results in this book rely on the weaker condition that on average, treated
and control groups experience parallel trends. For instance, in the classical design we study in
Chapter 3, with G1 treated (Dg = 1) and G0 control groups (Dg = 0), most results hold if
(1/G1 ) Σ_{g:Dg =1} E (Yg,t (0t ) − Yg,t−1 (0t−1 )) = (1/G0 ) Σ_{g:Dg =0} E (Yg,t (0t ) − Yg,t−1 (0t−1 )) , (2.6)
meaning that the average expected evolution of the never-treated outcome is the same across
treated and control groups.
Pre-trend tests. As we will discuss in detail later, Assumptions NA and PT are partly
testable, by checking that on average, treated and control groups follow parallel outcome evolutions before the treated get treated, a so-called pre-trends test.
We now introduce the main statistical assumption underlying the inferential procedures (confi-
dence intervals, tests...) introduced in this book.
Assumption IND (Independent groups) The vectors (Yg,t (d1 , ..., dt ))(d1 ,...,dt )∈D1 ×...×Dt ,1≤t≤T are
mutually independent.
Assumption IND allows groups’ potential outcomes to be serially correlated over time, an im-
portant feature to account for in DID studies (Bertrand et al., 2004). Assumption IND does
not make any assumption on groups’ treatments, because groups’ treatments are implicitly con-
ditioned upon and treated as non-stochastic in most of this book. The next section contains a
thorough discussion of Assumption IND, of the fact that treatments are conditioned upon, and
of the book’s perspective on statistical inference. Readers that plan to skip the book’s starred
sections may skip this section when reading the book for the first time.
The design-based perspective relies on thought experiments involving judgement calls by the researcher. For instance, in a thought experiment where one
imagines that treatments were randomly assigned, would it make more sense to imagine that the
assignment was independent across g, across t, across both dimensions, or across neither? Similarly, in the applications we consider, researchers rarely draw their sample
from a larger population. In fact their study sample often includes all the states or municipalities
of a country. Then, the sampling-based perspective also relies on a thought experiment, where
one imagines that the sample is drawn from a hypothetical infinite super-population.
On the other hand, conditioning on the design affects the unbiasedness of the variance estimators V̂(θ̂) we consider: by the law of total variance, if a variance estimator is unbiased for an estimator’s conditional variance, it is downward biased for the estimator’s unconditional variance:

E [V̂(θ̂)|D] = V (θ̂|D) ⇒ E [V̂(θ̂)] = E [V (θ̂|D)] ≤ V (θ̂). (2.7)

The inequality holds because V (θ̂) = E [V (θ̂|D)] + V (E [θ̂|D]) and the second term is non-negative.
where Ms(g),t is a “macro” stochastic shock common to all counties in the same state as g,
while mg,t is a “micro” stochastic shock specific to county g, assumed to be mean zero and
mean-independent of Ms(g),t . Then, Assumption IND cannot hold unconditionally, as potential
outcomes of different counties in the same state are necessarily correlated. But it can hold
conditional on the macro-shocks Ms(g),t . The weaker version of the parallel-trends assumption
in (2.6) also holds conditional on the state-level shocks if
(1/G1 ) Σ_{g:Dg =1} Ms(g),t = (1/G0 ) Σ_{g:Dg =0} Ms(g),t , (2.8)
meaning that the state-level shocks are on average the same across treated and control counties.1
At what level should one cluster standard errors? The level at which one should assume
independence is often unclear. Since the work of Bertrand, Duflo and Mullainathan (2004), it
1 If one supposes that Assumption IND and (2.6) hold conditional on the macro shocks, all the analysis is
conditional on those shocks. In particular, the treatment effects defined below depend on those shocks: as the
two previous displays show, different state-level shocks may lead to different potential outcomes, and to different
treatment effects. Those conditional effects remain valid measures of the treatment’s effect, they just depend on
the specific shocks that took place over the study period (Deeb and de Chaisemartin, 2019).
has become common practice to make the independence assumption at the level at which the
treatment is assigned, and use standard errors clustered at that level. For instance, with county-
level data but a treatment assigned at the state level, one clusters at the state level. While this
approach is uncontroversial in a design-based approach to inference (Abadie, Athey, Imbens and
Wooldridge, 2023), it might not be the only possibility in the model-based approach we adopt.
Instead, one may use standard errors clustered at the county level. The main motivation for
doing so is that clustering at the state level can lead to non-trivial power losses,2 and a recent
literature has highlighted that power is often a first-order concern in the analysis of natural
experiments (see, e.g., Roth, 2022). The cost of clustering at the county level is that (2.8) could fail, meaning that average state-level shocks
differ across treated and control counties, leading to differential trends. But pre-trend tests can
detect such differential trends, and the analyst may adjust the model, say by controlling for
some covariates as discussed in Chapter 4 below. Instead, clustering standard errors at the state
level slightly changes the parallel-trends assumption (now (2.8) is required to hold in expectation
rather than exactly), but also decreases the precision of the pre-trend tests, which could lead
one to fail to detect pre-trends. To conclude, we believe that due to the panel structure of the
data, allowing for serial correlation across potential outcomes over time is important, so using
clustered standard errors is in order. Then, we recommend clustering either at the level at which
the treatment is assigned, following Bertrand et al. (2004), or at the most disaggregated level at
which one can still construct a panel dataset, following the arguments above.
In a few places, however, the book switches from the conditional perspective to the sampling-based one. Going back and forth between these two conceptual frameworks imposes a cognitive cost on the reader. At the same time, strengthening one’s translation skills
between those two languages can be useful, because both are used in the methodological papers
on DID. When we adopt the sampling-based perspective, we replace Assumption IND by the
slightly stronger requirement that groups’ potential outcomes and treatments are i.i.d.:
Assumption IID (i.i.d. groups) The vectors ((Yg,t (d1 , ..., dt ))(d1 ,...,dt )∈D1 ×...×Dt , Dg,t )1≤t≤T are
i.i.d.
As groups are i.i.d., the g subscript can be dropped. In the sampling-based perspective, ex-
pectations are taken with respect to both the distribution of groups’ potential outcomes and
treatments. To highlight the difference with the conditional expectations introduced above, we
let Eu [·] denote such unconditional expectations. Finally, in the sampling-based perspective,
and thus under Assumption IID, one can show that (2.6) implies
Eu [Yt (0t ) − Yt−1 (0t−1 )|D = 1] = Eu [Yt (0t ) − Yt−1 (0t−1 )|D = 0] , (2.9)
a common way of stating the parallel-trends assumption (see, e.g., Abadie, 2005).
Part II
Chapter 3

The Classical DID Design
Design CLA (Classical DID design) Dg,t = 1{t > T0}Dg, with T0 ≥ 1, Dg ∈ {0, 1} for all g, and min_g Dg = 0 and max_g Dg = 1.
Dg is an indicator equal to 1 for treatment groups, which all become treated at period T0 + 1 and remain treated thereafter, and equal to 0 for control groups, which never become treated.
We require that T0 ≥ 1, meaning that there is at least one pre-treatment period, and that there
is at least one treatment and one control group. Unlike the conditions in, say, Assumption NA,
those in Design CLA only involve observed quantities, and one can directly verify from the data
whether those conditions are satisfied or not. Let G1 = Σ_{g=1}^{G} Dg and G0 = G − G1 respectively denote the number of treated and control groups, let T1 = T − T0 denote the number of treated periods, and let N1 = G1 T1 denote the number of treated (g, t) cells.
TWEA, the US organic chemical industry was lagging behind Germany's. Moser and Voena
(2012) study the effect of the compulsory licensing of German chemical patents on US domestic
invention. For that purpose, they use a balanced panel of the number of patents granted by the
US Patent and Trademark Office per year, for each of the 7 248 organic-chemistry subclasses.
Dg is an indicator for 336 subclasses that received at least one German patent under the TWEA.
The treatment is Dg,t = 1{t > T0 }Dg , with T0 + 1 = 1919.
Dataset used in this chapter. To answer the green questions in this chapter, you need to
use the moser_voena_didtextbook dataset, which contains the following variables:
• patents1900: the number of patents in subclass g in 1900 (that variable will be used in
Chapter 4).
While the original data of Moser and Voena (2012) is an 1875-to-1939 panel, moser_voena_didtextbook starts in 1900, thus yielding a lighter dataset on which some of the commands below take less time to run.

3.1 Target parameters
(g, t)-specific effects. For all t, let 1t denote a vector of t ones. For all (g, t) such that Dg = 1
and t > T0 , let
TEg,t = E [Yg,t − Yg,t (0t )] = E [Yg,t (0T0 , 1t−T0 ) − Yg,t (0t )]
denote the expected effect in cell (g, t) of having been treated rather than untreated for t − T0
periods, from T0 + 1 to t. If we impose Assumption ND, thus ruling out dynamic effects, TEg,t
reduces to
TEg,t = E [Yg,t (1) − Yg,t (0)] .
In the set-up we consider, TEg,t applies to only one group, and can therefore not be consistently
estimated, an issue we return to in Section 3.6 below. Instead, researchers often consider effects
aggregated across all treated groups, that can be consistently estimated.
the average effect of having been treated rather than untreated for t − T0 periods, across all
treated (g, t) cells. If we rule out dynamic effects, ATT reduces to
ATT = (1/(G1 T1)) Σ_{(g,t):Dg,t=1} E[Yg,t(1) − Yg,t(0)].
Average effect of having been treated for ℓ periods. For any ℓ ∈ {1, ..., T1 }, let
ATTℓ = (1/G1) Σ_{g:Dg=1} E[Yg,T0+ℓ(0T0, 1ℓ) − Yg,T0+ℓ(0T0+ℓ)].
ATTℓ is the average effect of having been treated for ℓ periods, across all treated groups and at
period T0 + ℓ. One has
ATT = (1/T1) Σ_{ℓ=1}^{T1} ATTℓ,
so the ATT is just the average of the average effects of having been treated for one, two, ..., and
T1 periods.
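This averaging identity can be checked numerically. A small Python sketch with made-up treatment effects (all names hypothetical; this is a toy illustration, not the book's data):

```python
import numpy as np

rng = np.random.default_rng(0)
G1, T1 = 4, 3                              # 4 treated groups, 3 treated periods
TE = rng.normal(1.0, 0.5, size=(G1, T1))   # made-up effects TE_{g,T0+l}, rows = groups

ATT_l = TE.mean(axis=0)                    # ATT_l: average across treated groups
ATT = TE.mean()                            # ATT: average across all treated (g, t) cells

assert np.isclose(ATT, ATT_l.mean())       # the ATT is the average of the ATT_l
```
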
Can we use ℓ ↦ ATTℓ to learn how treatment effects vary with length of exposure? In many contexts, it would be interesting to learn if treatment effects vary with length
of exposure. Assume that the treatment is costly, and the treatment cost has to be paid at
each period of exposure. Then, if treatment effects do not increase beyond, say, three periods of
exposure, it may be optimal to stop the treatment after three periods. Unfortunately, ℓ ↦ ATTℓ
cannot be used to learn how the treatment effect varies with length of exposure, unless one is
willing to assume that the treatment effect does not vary with calendar time. For instance, if
ATT2 > ATT1 > 0, can we conclude that two periods of exposure to treatment have a larger
effect than one period of exposure?
ATT2 is the effect of two periods of exposure at period T0 + 2, while ATT1 is the effect of one
period of exposure at period T0 + 1. Then, if ATT2 > ATT1 > 0, this difference could be due to
the fact that being treated for two periods has a larger effect than being treated for one period,
but it could also be due to the fact that the effect of one period of exposure is larger at period
T0 + 2 than at period T0 + 1. It is only if one is ready to assume that the effect of one period
of exposure is the same at periods T0 + 1 and T0 + 2 that ATT2 > ATT1 > 0 implies that two periods of exposure have a larger effect than one period of exposure. In this chapter's appendix,
we show this point more formally.
3.2 Two-way fixed effects estimators
Let β̂fe denote the sample coefficient on Dg,t, the treatment in group g at period t, in an OLS
regression of Yg,t , the outcome of group g at period t, on group fixed effects (FEs), period FEs,
and Dg,t :
Yg,t = Σ_{g′=1}^{G} α̂g′ 1{g = g′} + Σ_{t′=1}^{T} γ̂t′ 1{t = t′} + β̂fe Dg,t + ϵ̂g,t, (3.1)
where ϵ̂g,t denotes the regression residual.¹ Hereafter, β̂fe is referred to as the Two-Way Fixed Effects (TWFE) estimator.
equal to 0.724. According to the TWFE regression, this average would have been equal to
0.724 − 0.288 = 0.436 without treatment. Thus, the treatment increased innovation in the
treated subclasses by more than two thirds. Our 0.288 TWFE coefficient is very close to the 0.255 coefficient in Table 2, Column (2) of Moser and Voena (2012). The difference comes from the fact that they use an 1875-to-1939 panel, while we only use years 1900 to 1939.
In Design CLA, β̂fe is a simple DID estimator. In designs with an absorbing and binary treatment and no variation in treatment timing, one can show that β̂fe is equal to the coefficient on Dg,t in a simpler regression of Yg,t on an intercept, the treatment-group indicator Dg, the post-treatment indicator 1{t > T0}, and Dg,t, the interaction of the two indicators. Therefore, β̂fe is a simple DID estimator comparing the average outcome's evolution, before and strictly after T0, in treatment and control groups:
β̂fe = (1/(G1 T1)) Σ_{g:Dg=1, t>T0} Yg,t − (1/(G1 T0)) Σ_{g:Dg=1, t≤T0} Yg,t
 − [(1/(G0 T1)) Σ_{g:Dg=0, t>T0} Yg,t − (1/(G0 T0)) Σ_{g:Dg=0, t≤T0} Yg,t]. (3.2)
Proof of (3.2).∗ By the Frisch-Waugh-Lovell theorem (see the appendix of this chapter for a restatement of this theorem), β̂fe is the coefficient in the regression of Yg,t on the residual from the regression of Dg,t on group and time FEs. Since Dg,t only depends on Dg and 1{t > T0}, this latter regression yields the same residuals as that of Dg,t on Dg and 1{t > T0}. Thus, by applying the Frisch-Waugh-Lovell theorem in the other direction, β̂fe is also the coefficient on Dg,t in a simpler regression of Yg,t on an intercept, the treatment-group indicator Dg, the post-treatment indicator 1{t > T0}, and Dg,t, the interaction of the two indicators. This regression is saturated in (Dg, 1{t > T0}) (see Section 3.1.4 of Angrist and Pischke, 2009, for a definition of a saturated regression), so the regression function is equal to the conditional mean function of Yg,t given (Dg, 1{t > T0}) (see Theorem 3.1.4 of Angrist and Pischke, 2009), and all coefficients are functions of that conditional mean function. In particular, it follows from the ninth unnumbered equation on page 50 of Angrist and Pischke (2009) that β̂fe satisfies (3.2). QED.
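The equality between the TWFE coefficient and the simple DID in (3.2) is algebraic, so it can be verified on any balanced panel. A Python sketch on simulated toy data (the book's exercises use Stata; this simulation and its names are ours):

```python
import numpy as np

rng = np.random.default_rng(1)
G, T, T0 = 6, 5, 2                             # periods indexed 0..4; t >= T0 is "post"
Dg = (np.arange(G) < 3).astype(float)          # first 3 groups treated
g_idx, t_idx = np.meshgrid(np.arange(G), np.arange(T), indexing="ij")
D = (t_idx >= T0) * Dg[:, None]                # D_{g,t} = 1{post} x Dg
Y = g_idx + 0.5 * t_idx + 2.0 * D + rng.normal(size=(G, T))

# TWFE regression: Y on group FEs, period FEs, and D
X = np.column_stack(
    [(g_idx.ravel() == g).astype(float) for g in range(G)]
    + [(t_idx.ravel() == t).astype(float) for t in range(1, T)]  # one period FE dropped
    + [D.ravel()]
)
beta_fe = np.linalg.lstsq(X, Y.ravel(), rcond=None)[0][-1]

# Simple DID of (3.2)
did = (Y[Dg == 1][:, T0:].mean() - Y[Dg == 1][:, :T0].mean()) \
    - (Y[Dg == 0][:, T0:].mean() - Y[Dg == 0][:, :T0].mean())

assert np.isclose(beta_fe, did)                # (3.2) holds exactly in sample
```

Because the identity is exact, the assertion passes for any draw of the outcomes, not just this seed.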
number of patents in subclass g and year t on the treatment group indicator, the post-treatment
indicator, and the twea treatment.
Proof of Theorem 1
E[Yg,t] = E[Yg,t(0t)] + Dg,t E[Yg,t − Yg,t(0t)] = E[Yg,t(0t)] + Dg,t TEg,t. (3.4)
The second equality simply follows from the definition of TEg,t. The first equality is an example of the definitional equalities relating observed and potential outcomes that are commonly used in the causal-inference literature. Justify it by showing that it holds for any value that Dg,t can take.
If Dg,t = 1, the right hand side is equal to E [Yg,t (0t )] + E [Yg,t − Yg,t (0t )] = E [Yg,t ] so the
equality holds. If Dg,t = 0, the left hand side is equal to E [Yg,t (0t )], because Yg,t = Yg,t (0t ) if
Dg,t = 0: in Design CLA, if a (g, t) cell is untreated its observed outcome is its never-treated
outcome because it has never been treated yet. If Dg,t = 0, the right hand side is also equal to
E [Yg,t (0t )] so the equality holds. The equality is true when Dg,t = 1 and when Dg,t = 0, and
Dg,t cannot take another value. Therefore, the equality is always true.
Then,

E[β̂fe]
= (1/(G1 T1)) Σ_{g:Dg=1, t>T0} E[Yg,t] − (1/(G1 T0)) Σ_{g:Dg=1, t≤T0} E[Yg,t] − (1/(G0 T1)) Σ_{g:Dg=0, t>T0} E[Yg,t] + (1/(G0 T0)) Σ_{g:Dg=0, t≤T0} E[Yg,t]
= (1/(G1 T1)) Σ_{g:Dg=1, t>T0} (E[Yg,t(0t)] + TEg,t) − (1/(G1 T0)) Σ_{g:Dg=1, t≤T0} E[Yg,t(0t)] − (1/(G0 T1)) Σ_{g:Dg=0, t>T0} E[Yg,t(0t)] + (1/(G0 T0)) Σ_{g:Dg=0, t≤T0} E[Yg,t(0t)]
= ATT + (1/(G1 T1)) Σ_{g:Dg=1, t>T0} (αg + γt) − (1/(G1 T0)) Σ_{g:Dg=1, t≤T0} (αg + γt) − (1/(G0 T1)) Σ_{g:Dg=0, t>T0} (αg + γt) + (1/(G0 T0)) Σ_{g:Dg=0, t≤T0} (αg + γt)
= ATT.
The first equality follows from (3.2), the fact the design is conditioned upon, and the linearity
of the expectation operator. The second equality follows from (3.4) and the fact that in Design
CLA Dg,t = 0 if Dg = 0 or t ≤ T0 . The third equality follows from the definition of ATT and
(2.3). The last equality follows from the following four facts. First, the average of αg in the first
and second summations cancel each other. Second, the average of αg in the third and fourth
summations cancel each other. Third, the average of γt in the first and third summations cancel
each other. Fourth, the average of γt in the second and fourth summations cancel each other
QED.
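The result E[β̂fe] = ATT can also be checked by Monte Carlo. A Python sketch drawing data from a two-way model with a constant treatment effect (a toy DGP of our own, not the book's application):

```python
import numpy as np

rng = np.random.default_rng(2)
G, T, T0 = 20, 6, 3                    # periods indexed 0..5; periods >= T0 are treated
Dg = (np.arange(G) < 10).astype(float) # 10 treatment groups, 10 control groups
alpha = rng.normal(size=G)             # group effects alpha_g
gamma = rng.normal(size=T)             # period effects gamma_t
TE = 1.5                               # constant treatment effect, so ATT = 1.5

def did_estimate():
    Y = alpha[:, None] + gamma[None, :] + rng.normal(size=(G, T))
    Y[Dg == 1, T0:] += TE              # treated (g, t) cells receive the effect
    return (Y[Dg == 1][:, T0:].mean() - Y[Dg == 1][:, :T0].mean()) \
         - (Y[Dg == 0][:, T0:].mean() - Y[Dg == 0][:, :T0].mean())

mean_est = float(np.mean([did_estimate() for _ in range(2000)]))
# mean_est is close to ATT = 1.5, up to simulation noise
```
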
where “⊥⊥” denotes independence of random variables. For all t ≤ T0, Yg,t = Yg,t(0t): groups' outcome is equal to their untreated outcome until period T0. Then, the previous display implies that

Dg ⊥⊥ (Yg,1, ..., Yg,T0), (3.5)
an equation that only involves observed variables, and can therefore be tested. To test (3.5),
one can run a pooled regression of Yg,t on Dg , for all t ≤ T0 . However, this test is not consistent
against all violations of the independence condition in (3.5). For instance, the test will fail to reject if Yg,t is positively correlated with Dg for some values of t, negatively correlated with Dg for others, and those positive and negative correlations offset each other. Alternatively, one can run T0
regressions of Yg,t on Dg , for all t ≤ T0 . However, one needs to account for multiple testing when
computing the T0 p-values of the coefficients on Dg , which may lead to a low-power test. To
avoid this issue, one can regress Dg on (Yg,1 , ..., Yg,T0 ), and run an F-test that all coefficients are
equal to zero.
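A Python sketch of this last test, on simulated data satisfying (3.5) (all names are ours; we only compute the F-statistic, leaving the comparison to an F quantile to the reader):

```python
import numpy as np

rng = np.random.default_rng(3)
G, T0 = 200, 3
Ypre = rng.normal(size=(G, T0))            # pre-period outcomes (here unrelated to Dg)
Dg = rng.binomial(1, 0.5, size=G).astype(float)

X = np.column_stack([np.ones(G), Ypre])    # intercept + (Y_{g,1}, ..., Y_{g,T0})
rss_u = float(np.linalg.lstsq(X, Dg, rcond=None)[1][0])    # unrestricted RSS
rss_r = float(((Dg - Dg.mean()) ** 2).sum())               # restricted: intercept only

F = ((rss_r - rss_u) / T0) / (rss_u / (G - T0 - 1))
# Under (3.5), F is approximately F(T0, G - T0 - 1); comparing it to that
# distribution's 0.95 quantile (about 2.65 here) gives the joint balance test.
```
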
A test consistent against all violations of (3.5).∗ Even a regression of Dg on (Yg,1 , ..., Yg,T0 )
is still not consistent against all violations of (3.5). For instance, this test could fail to reject if
Yg,1 ×Yg,2 increases the probability that Dg = 1. To obtain a test consistent against all violations
of (3.5), one can use a permutation test based on the Kolmogorov-Smirnov test-statistic:
KSG = sup_{(d, y1, ..., yT0) ∈ S} √G | (1/G) Σ_{g=1}^{G} 1{Dg ≤ d, Yg,1 ≤ y1, ..., Yg,T0 ≤ yT0}
 − (1/G) Σ_{g=1}^{G} 1{Dg ≤ d} × (1/G) Σ_{g=1}^{G} 1{Yg,1 ≤ y1, ..., Yg,T0 ≤ yT0} |,
where S denotes the values taken by (Dg , Yg,1 , ..., Yg,T0 ). The test compares KSG to the (1 − α)-
quantile of its permuted version:
KSG^Π = sup_{(d, y1, ..., yT0) ∈ S} √G | (1/G) Σ_{g=1}^{G} 1{DΠ(g) ≤ d, Yg,1 ≤ y1, ..., Yg,T0 ≤ yT0}
 − (1/G) Σ_{g=1}^{G} 1{DΠ(g) ≤ d} × (1/G) Σ_{g=1}^{G} 1{Yg,1 ≤ y1, ..., Yg,T0 ≤ yT0} |,
where Π is a random permutation over {1, ..., G}, with uniform distribution over the G! possible
permutations. This test has exact size, and unlike the three aforementioned tests it is consistent
against all alternatives (see, e.g., van der Vaart and Wellner, 2023). On the other hand, it is less
straightforward and computationally more costly to implement than those three tests.
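A Python sketch of this permutation test (our own implementation; instead of comparing KSG to the (1 − α)-quantile of its permuted version, we equivalently report a permutation p-value):

```python
import numpy as np

def ks_stat(D, Ypre):
    """sqrt(G) times the sup distance between the joint empirical cdf of
    (Dg, Yg,1, ..., Yg,T0) and the product of the two marginal empirical
    cdfs, the sup being taken over the observed points."""
    pts = np.column_stack([D, Ypre])
    best = 0.0
    for p in pts:
        joint = np.mean((pts <= p).all(axis=1))
        prod = np.mean(D <= p[0]) * np.mean((Ypre <= p[1:]).all(axis=1))
        best = max(best, abs(joint - prod))
    return np.sqrt(len(D)) * best

def ks_perm_pvalue(D, Ypre, n_perm=500, seed=0):
    """Permutation p-value: share of permuted statistics above the observed one."""
    rng = np.random.default_rng(seed)
    obs = ks_stat(D, Ypre)
    return float(np.mean([ks_stat(rng.permutation(D), Ypre) >= obs
                          for _ in range(n_perm)]))

rng = np.random.default_rng(4)
G, T0 = 40, 2
Ypre = rng.normal(size=(G, T0))
D = rng.binomial(1, 0.5, size=G)
p_value = ks_perm_pvalue(D, Ypre)   # typically non-small when (3.5) holds
```
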
In Design CLA, alongside (3.1) researchers often also estimate the following regression:

Yg,t = α̂0 + α̂1 Dg + Σ_{t′=1, t′≠T0}^{T} γ̂t′ 1{t = t′} + Σ_{ℓ=−(T0−1), ℓ≠0}^{T1} β̂ℓfe 1{t = T0 + ℓ}Dg + ϵ̂g,t. (3.6)
If ℓ > 0, 1{t = T0 + ℓ}Dg is an indicator equal to 1 if at t, group g has been treated for ℓ periods.
If ℓ < 0, 1{t = T0 +ℓ}Dg is an indicator equal to 1 if at t, group g will be treated in −ℓ+1 periods.
(3.6) is often referred to as a TWFE event-study (ES) regression, and researchers often plot its coefficients (β̂ℓfe)ℓ≠0 on a so-called ES graph, with ℓ = t − T0, the relative time to treatment onset for the treated groups,³ on the x-axis. The regression has 2T explanatory variables, which is the number of values that the vector of its explanatory variables (Dg, (1{t = t′})t′∈{1,...,T}) can take. Therefore, the regression is saturated in (Dg, (1{t = t′})t′∈{1,...,T}) (see Section 3.1.4 of Angrist
and Pischke, 2009, for a refresher on saturated regressions). Then, its predicted values are just
equal to the conditional mean function of Yg,t given (Dg , (1{t = t′ })t′ ∈{1,...,T } ) (see Theorem 3.1.4
of Angrist and Pischke, 2009), and all its coefficients are functions of that conditional mean
function. In particular, one can show that for all ℓ ≠ 0,

β̂ℓfe = (1/G1) Σ_{g:Dg=1} (Yg,T0+ℓ − Yg,T0) − (1/G0) Σ_{g:Dg=0} (Yg,T0+ℓ − Yg,T0), (3.7)
a simple DID comparing the T0 to T0 + ℓ outcome evolution in treatment and control groups. As
period T0 is the omitted time period in (3.6), all DIDs are relative to that period. Accordingly,
for ℓ ≥ 1, the DIDs in (3.7) consider evolutions from the past to the future, whereas for ℓ ≤ −1,
those DIDs consider evolutions from the future to the past.
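Identity (3.7) can be verified numerically by running the saturated regression (3.6) on simulated data. A Python sketch (toy data; since the identity is algebraic, random outcomes suffice):

```python
import numpy as np

rng = np.random.default_rng(5)
G, T, T0 = 8, 5, 2                         # periods t = 1, ..., 5; T0 = 2, T1 = 3
Dg = (np.arange(G) < 4).astype(float)
Y = rng.normal(size=(G, T))                # any outcomes work: (3.7) is algebraic

g_idx, t_idx = np.meshgrid(np.arange(G), np.arange(1, T + 1), indexing="ij")
t_flat, d_flat = t_idx.ravel(), np.repeat(Dg, T)
ells = [l for l in range(1 - T0, T - T0 + 1) if l != 0]        # l = -1, 1, 2, 3
X = np.column_stack(
    [np.ones(G * T), d_flat]
    + [(t_flat == t).astype(float) for t in range(2, T + 1)]   # period FEs, t = 1 dropped
    + [(t_flat == T0 + l) * d_flat for l in ells]              # 1{t = T0 + l} Dg
)
coef = np.linalg.lstsq(X, Y.ravel(), rcond=None)[0]

for i, l in enumerate(ells):
    did = (Y[Dg == 1, T0 + l - 1] - Y[Dg == 1, T0 - 1]).mean() \
        - (Y[Dg == 0, T0 + l - 1] - Y[Dg == 0, T0 - 1]).mean()
    assert np.isclose(coef[2 + (T - 1) + i], did)   # beta_l^fe equals the DID in (3.7)
```

Which period FE is dropped does not matter here: because the regression is saturated, the interaction coefficients are pinned down by the cell means, relative to period T0.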
Theorem 2 In Design CLA, if Assumptions NA and PT hold, then for all ℓ ∈ {1, ..., T1},

E[β̂ℓfe] = ATTℓ. (3.8)
³ Researchers usually rather define relative time as t − (T0 + 1), with treatment onset corresponding to relative time 0 rather than to relative time 1. We prefer to define relative time as t − T0, to ensure that the ES graph is normalized at 0, and that estimated effects and pre-trends are shown symmetrically around 0.
=ATTℓ .
The first equality follows from (3.7) and the fact the design is conditioned upon, the second equality follows from Design CLA, the third equality follows from adding and subtracting Yg,T0+ℓ(0T0+ℓ), and the fourth equality follows from Assumption PT and the definition of ATTℓ. This proves (3.8). QED.
Theorem 3 In Design CLA, if Assumptions NA and PT hold, and T0 ≥ 2, then for all ℓ ∈ {−1, ..., −(T0 − 1)},

E[β̂ℓfe] = 0. (3.10)
=0.
The first equality follows from (3.7) and the fact the design is conditioned upon, the second equality follows from Design CLA, and the third equality follows from Assumption PT. This proves (3.10). QED.
Doing pre-trend tests correctly. This testing procedure does not account for multiple hypothesis testing. To account for it, one can run an F-test that E[β̂ℓfe] = 0 for all ℓ ∈ {−1, ..., −(T0 − 1)}. Alternatively, one can adjust p-values for multiple testing, using, say, a Bonferroni adjustment. Related to, but different from, the Bonferroni adjustment, one can use the “Sup-t” test proposed by Montiel Olea and Plagborg-Møller (2019), where max_{ℓ∈{−1,...,−(T0−1)}} |β̂ℓfe/σ̂ℓ|, the largest pre-trend t-statistic, is compared to the quantiles of the max of T0 − 1 normal variables with mean 0 and variance equal to the estimated variance of (β̂ℓfe/σ̂ℓ)ℓ∈{−1,...,−(T0−1)}. The F- and Sup-t tests both have asymptotically nominal size under the null. Neither test is universally more powerful than the other. Schematically, the Sup-t test will be more powerful if E[β̂ℓfe] is far from zero for one ℓ but equal to zero for all other ℓs, while the F-test will be more powerful if E[β̂ℓfe] is never very far from zero but slightly different from 0 for several ℓs.
How many pre-trend estimators should one show? When T0 is large, one can potentially
compute many pre-trend estimators, thus assessing if treatment and control groups experienced
parallel trends over a long period. In such cases, one may find that while treated and control
groups were experiencing parallel trends a few periods before treatment, their trends are not
parallel anymore when one moves further back into the past. Should this be a cause of concern?
In other words, how many insignificant pre-trend estimators should one show to convincingly
demonstrate parallel pre-trends? While very natural, this question has received little attention,
so we do not have a good answer, backed by a sound theory. Yet, we would like to suggest the
following simple rule: researchers should not compute many fewer pre-trend estimators than event-study estimators. Consider two researchers. The first one can only compute three pre-trend estimators,
because T0 = 4. The second one can compute more than three pre-trend estimators (T0 > 4),
but while they find that treated and control groups experience roughly parallel trends for three
periods before treatment, their trends become markedly different more than three periods before
treatment. In our opinion, neither researcher should report more than three event-study estimators. The rationale for this recommendation is simple: an estimator of ATTℓ relies on
a parallel-trends assumption from T0 to T0 + ℓ. To support this identifying assumption, the
researcher should show that parallel trends holds from T0 to T0 − ℓ. With our recommendation,
large and significant pre-trend estimators just before treatment are more problematic than large
and significant pre-trend estimators long before treatment, which makes intuitive sense. When a researcher computes many more event-study than pre-trend estimators, their readers and reviewers should factor in the fact that the parallel-trends assumption underlying some of their long-run event-study estimators was not placebo-tested.
be another pre-trend estimator, which should also not significantly differ from zero under Assumptions NA and PT. β̂ℓfe is a “long-difference” pre-trend estimator, which compares treatment- and control-groups' outcome evolution over several periods before the treatment onset. β̂ℓfd instead is a “first-difference” pre-trend estimator, which compares treatment- and control-groups' outcome evolution over consecutive periods before the treatment onset. Accordingly,
β̂ℓfe = Σ_{k=ℓ}^{−1} β̂kfd. (3.12)
As first- and long-difference pre-trend estimators are linearly dependent, the F-test that E[β̂ℓfe] = 0 for all ℓ ∈ {−1, ..., −(T0 − 1)} and the F-test that E[β̂ℓfd] = 0 for all ℓ ∈ {−1, ..., −(T0 − 1)} are equal, so using one or the other will always yield the same results.
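Identity (3.12) can be checked numerically. A Python sketch (here we take β̂kfd, for k ≤ −1, to be the DID of the outcome evolution from period T0 + k + 1 back to T0 + k, consistent with the convention that pre-trend DIDs go from the future to the past):

```python
import numpy as np

rng = np.random.default_rng(6)
G, T0 = 10, 4                         # pre-treatment periods t = 1, ..., T0 = 4
Dg = (np.arange(G) < 5).astype(float)
Y = rng.normal(size=(G, T0))          # columns 0..3 hold periods 1..T0

def did(col_to, col_from):
    """DID of the outcome evolution from column col_from to column col_to."""
    return (Y[Dg == 1, col_to] - Y[Dg == 1, col_from]).mean() \
         - (Y[Dg == 0, col_to] - Y[Dg == 0, col_from]).mean()

for l in range(-1, -T0, -1):          # l = -1, -2, -3
    long_diff = did(T0 + l - 1, T0 - 1)                         # beta_l^fe: T0 to T0 + l
    sum_fd = sum(did(T0 + k - 1, T0 + k) for k in range(l, 0))  # sum of beta_k^fd
    assert np.isclose(long_diff, sum_fd)                        # identity (3.12)
```

The identity is just a telescoping sum, so it holds exactly for any data.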
How to interpret non-zero pre-trend coefficients? Figure 3.1 below displays three common patterns of non-zero pre-trend coefficients, which lead to different interpretations. First, one may have that β̂ℓfe is significantly different from zero for all ℓ < 0, but ℓ ↦ β̂ℓfe is approximately constant, as in Panel A. Then, it follows from (3.12) that β̂−1fd ≠ 0 but β̂ℓfd = 0 for all ℓ ≤ −2: treated and control groups are on parallel trends throughout the pre-treatment period, except from T0 − 1 to T0. This suggests that Assumption PT holds, but Assumption NA fails: at period T0, treatment groups are already affected by their upcoming treatment in the next period. Then, one may just recompute the estimators defined above, redefining the date when the treatment started as T0, or as the date when the treatment was announced. Second, one may have that β̂ℓfe ≈ ℓθ for some real number θ, as in Panel B. This suggests that Assumption PT fails due to differential linear trends between the treated and control groups. Then, one may allow for linear trends in the estimation, see Section 4.1.3.3 for further details. Third, one may have that β̂−1fe is significantly different from zero but β̂ℓfe ≈ 0 for all ℓ ≤ −2, as in Panel C. Then, it follows from (3.12) that β̂−1fd ≈ −β̂−2fd ≠ 0, and β̂ℓfd ≈ 0 for all ℓ ≤ −3. Therefore, such pre-trend coefficients cannot be interpreted as evidence of anticipation effects arising at period T0, and there is no obvious alternative estimation method one can suggest in the face of such pre-trend estimators.
[Figure 3.1: Panels A, B, and C display the three patterns of pre-trend coefficients described above, plotted against relative time ℓ ∈ {−5, ..., 5}; the graphs are not reproduced here.]
14 years effects become very large. Prior to 1919, patenting was low in the treated subclasses.
Then, as argued by Moser and Voena (2012), US firms in those subclasses had to bridge a
large gap to the technological frontier before they could patent their own inventions. Moreover,
patent grants typically occur two to three years after application. All this could explain why
the effects of compulsory licensing take time to emerge. The pre-trend estimates are small, and substantially smaller than the estimated mid- and long-run effects of compulsory licensing. Only one out of 18 pre-trend estimates is individually significant at the 5% level. However, an F-test that all pre-trend coefficients are equal to zero is rejected (p-value < 0.001), thus suggesting modest differential trends between treated and control subclasses.
[Figure: TWFE event-study estimates of the effect of compulsory licensing. Y-axis: effect; x-axis: relative time to year before TWEA, from −18 to 21. The graph is not reproduced here.]
Note: This figure shows the estimated effects of compulsory licensing on patents, as well as pre-trends estimates,
using years 1900 to 1939 of the data from Moser and Voena (2012), and the TWFE event-study regression in
(3.6). Standard errors are clustered at the patent subclass level. 95% confidence intervals are shown in red.
Verify that (3.7) holds for ℓ = 1, by computing in Stata the DID in the right hand side of that
equation.
3.3 Inference on the ATT and on the event-study effects

Depending on how large the sample is, there are several possible approaches to test hypotheses on, or build confidence intervals for, the event-study effects (ATTℓ)ℓ∈{1,...,T1} and the ATT. With many treated and control groups, standard asymptotic inference is possible. If there are few treated groups but many control groups, we cannot consistently estimate the (ATTℓ)ℓ∈{1,...,T1} and the ATT, but we can still perform valid inference, without making parametric assumptions. With few treated and control groups, inference needs to rely either on parametric assumptions or on homogeneity conditions. Finally, we discuss how to determine which of the three aforementioned approaches may be more appropriate and reliable in a specific application. Note that we mostly focus on confidence intervals (CIs), but hypothesis tests can be obtained similarly. Note also that our review of the literature on inference with few treated and/or few control groups is not exhaustive, though we hope it provides a useful starting point for practitioners.
Estimating the variance of β̂ℓfe. Under Assumption IND, groups are independent, so it directly follows from (3.7) that

V(β̂ℓfe) = (1/G1²) Σ_{g:Dg=1} V(Yg,T0+ℓ − Yg,T0) + (1/G0²) Σ_{g:Dg=0} V(Yg,T0+ℓ − Yg,T0). (3.13)
where the first equality is proven in this chapter's appendix. Combined with (3.14) and (3.13), this implies that E[σ̂ℓ²] ≥ V(β̂ℓfe). If treatment effects are homogeneous, the inequality in (3.16) becomes an equality.
Asymptotically conservative CIs for ATTℓ . When both G0 and G1 tend to infinity, (3.7)
and the central limit theorem imply that
(β̂ℓfe − ATTℓ) / V(β̂ℓfe)^{1/2} →d N(0, 1).

Then, as σ̂ℓ² overestimates V(β̂ℓfe), one can show that the CI [β̂ℓfe ± z1−α/2 σ̂ℓ], where z1−α/2 denotes the quantile of order 1 − α/2 of a standard normal distribution, is asymptotically conservative: it includes ATTℓ with probability tending to at least 1 − α. If treatment effects are homogeneous,
this CI is exact. In practice, researchers estimate ATTℓ for more than one value of ℓ. Then, one
can show that a multivariate version of the asymptotic normality result in the previous display
holds, and use that result to derive a joint test that ATTℓ = 0 for all ℓ, or jointly valid confidence
intervals for the effects ATTℓ .
Inference on the ATT. Comparing (3.2) and (3.7), one can see that β̂fe has the same structure as β̂ℓfe, with Yg,T0+ℓ replaced by (1/T1) Σ_{t>T0} Yg,t and Yg,T0 replaced by (1/T0) Σ_{t≤T0} Yg,t. Hence, one can follow the same steps as above to estimate the variance of β̂fe and construct an asymptotically conservative CI for the ATT.
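A Python sketch of such a conservative CI for ATTℓ, using a simple sample-variance plug-in for (3.13) rather than the HC2 estimator (toy data with a known ATTℓ; all names are ours):

```python
import numpy as np

rng = np.random.default_rng(7)
G1, G0 = 30, 60
d1 = 1.0 + rng.normal(size=G1)    # long differences Y_{g,T0+l} - Y_{g,T0}, treated groups
d0 = rng.normal(size=G0)          # same for control groups (true ATT_l = 1 here)

beta = d1.mean() - d0.mean()                            # beta_l^fe, as in (3.7)
var_hat = d1.var(ddof=1) / G1 + d0.var(ddof=1) / G0     # plug-in version of (3.13)
ci = (beta - 1.96 * np.sqrt(var_hat), beta + 1.96 * np.sqrt(var_hat))
# ci is an (asymptotically conservative) 95% confidence interval for ATT_l
```
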
Finite-sample adjustment. The CI above uses normal quantiles, neglecting the randomness in the estimation of V(β̂ℓfe). Bell and McCaffrey (2002) suggest using instead the quantiles of a t-distribution, with degrees-of-freedom (DOF)
3.3.2 When are G0 and G1 large enough to rely on the CIs in Section 3.3.1? A simulation study
The question we ask in this section is a difficult one, which we try to answer with simulations.
Our simulations are inspired by those in Tables 1 and 2 of Imbens and Kolesar (2016), but with different sample sizes and different error distributions.
denotes the deviation of g's never-treated outcome at period t from its expectation. Then, let

ηg,ℓ = Yg,T0+ℓ(0T0+ℓ) − Yg,T0(0T0) − E[Yg,T0+ℓ(0T0+ℓ) − Yg,T0(0T0)]. (3.17)
With this notation, using the same steps as those used to obtain (3.9), and using the fact that
under Assumption PT E(Yg,T0 +ℓ (0T0 +ℓ ) − Yg,T0 (0T0 )) does not depend on g, one can show that
β̂ℓfe = (1/G1) Σ_{g:Dg=1} [Yg,T0+ℓ(0T0, 1ℓ) − Yg,T0+ℓ(0T0+ℓ)] + (1/G1) Σ_{g:Dg=1} ηg,ℓ − (1/G0) Σ_{g:Dg=0} ηg,ℓ. (3.18)
Under the sharp null of no treatment effect (Yg,T0 +ℓ (0T0 , 1ℓ ) − Yg,T0 +ℓ (0T0 +ℓ ) = 0 for all (g, ℓ)),
the previous display simplifies to
β̂ℓfe = (1/G1) Σ_{g:Dg=1} ηg,ℓ − (1/G0) Σ_{g:Dg=0} ηg,ℓ.
If the treatment has no effect, our TWFE ES estimators just compare the average of the errors
ηg,ℓ in the treatment and in the control group, so their distribution depends on G0 , G1 , the
distribution of the errors, and how that distribution differs in the treatment and in the control
groups. For our simulations, it is enough to consider one event-study coefficient, so we drop the
ℓ subscript in the remainder of this section.
Data Generating Process. We consider three distributions for the errors (ηg )g=1,...,G :
3. ηg is drawn from the empirical distribution of the (η̂g,14)g=1,...,G in Moser and Voena (2012), where we let

η̂g,ℓ = Yg,T0+ℓ − Yg,T0 − (1/G0) Σ_{g′:Dg′=0} (Yg′,T0+ℓ − Yg′,T0). (3.19)
Data Generating Process (DGP) 1 follows that in Table 1 of Imbens and Kolesar (2016). DGP2
is similar to that in their Table 2, where they assume that conditional on Dg , ηg /σ(Dg ) follows a
recentered log-normal distribution. However, our DGP should be more adversarial than theirs: assuming that −ηg/σ(0) rather than ηg/σ(0) follows a log-normal in the control group preserves the asymmetry of the log-normal distribution in β̂fe = η̄1 − η̄0. DGP3 is based on the dataset
of Moser and Voena (2012), with errors drawn from the empirical distribution of the estimated
errors (ηbg,14 )g=1,...,G in that application. We choose ℓ = 14, because large treatment effects start
emerging after 14 years of exposure to the TWEA, but results are similar for other values of ℓ.
For each model, we consider both a homoscedastic and a heteroscedastic error distribution. In Models 1 and 2, σ(1) = 1, and either σ(0) = 1/2 (heteroscedastic errors) or σ(0) = 1 (homoscedastic errors).⁵ For Model 3, in the homoscedastic case we randomly allocate groups
(homoscedastic errors).5 For Model 3, in the homoscedastic case we randomly allocate groups
to the control or treatment, whereas in the heteroscedastic case we draw the control group’s
errors from the empirical distribution of (ηbg,14 )g:Dg =0 in Moser and Voena (2012), and we draw
the treatment group’s errors from the empirical distribution of (ηbg,14 )g:Dg =1 . Finally, we consider
eight sample sizes: G1 ∈ {5, 10, 20, 40}, and G0 = G1 or G0 = 4G1 .
1. “HC2-BM” uses the HC2 variance estimator σ̂ℓ², and the critical value from a t-distribution with the degrees-of-freedom adjustment of Bell and McCaffrey (2002);
⁵ With G1 ≤ G0, coverage rates increase with σ(0), so coverage rates with σ(0) > 1 are higher than those obtained in Table 3.1 below.
2. “HC2-∞” uses the HC2 variance estimator σ̂ℓ², and the critical value from a normal distribution;
3. “HC3” uses the HC3 variance estimator recommended for instance by MacKinnon, Nielsen
and Webb (2023), and the critical value from a t-distribution with G0 + G1 − 2 degrees-
of-freedom;
4. “EHW” uses the standard robust Eicker-Huber-White variance estimator, and the critical
value from a normal distribution.
Results. Results in Table 3.1 below show that the HC2-BM CI often has a coverage rate
closer to the 95% nominal value than the other CIs. Yet, with the log-normal DGP, it can still
exhibit non-trivial size distortions, even with relatively large sample sizes. For instance, with
(G0 , G1 ) = (80, 20) its coverage rate is still only equal to 88%. The size distortions we find for
the HC2-BM CI are larger than those in Table 2 of Imbens and Kolesar (2016). This is partly
due to the fact that we consider a more adversarial DGP, and partly due to the fact that for
the specific sample size they consider ((G0 , G1 ) = (3, 27)), the coverage of the HC2-BM CI is
particularly good. In less adversarial DGPs, the HC2-BM CI has close-to-nominal coverage with
as few as five treated and control groups.
Table 3.1: Coverage rate for various DGPs and confidence intervals. (Columns: DGP, G1, G0, then the HC2-BM, HC2-∞, HC3, and EHW coverage rates, under heteroskedasticity and under homoskedasticity; the numeric entries are not reproduced here.)
Recommendations.
1. Like Imbens and Kolesar (2016), we recommend that researchers use HC2
standard errors, with the DOF adjustment of Bell and McCaffrey (2002). The
only instance where they may not follow this recommendation is if G0 and G1 are very
large: then HC2 standard errors with critical values from a normal distribution will also
be reliable, and somewhat surprisingly, the implementation in standard statistical software
of the correction of Bell and McCaffrey (2002) can be computationally costly.
2. When both G0 and G1 are larger than 40, researchers may use HC2-BM CIs,
without assessing the validity of those CIs via simulations. This recommendation
is based on the fact that when min(G0 , G1 ) ≥ 40, the coverage of HC2-BM CIs is always
larger than 0.90 in our simulations, so those CIs never massively undercover. Of course,
we do not claim that their coverage rate could not be lower for other distributions of the
(ηg )g=1,...,G . But this log-normal DGP seems adversarial enough to suggest that our rule of
thumb should work well in many cases. Importantly, this recommendation only applies to
unweighted TWFE regressions: as we will discuss later, with weighting HC2-BM CIs may
require larger numbers of treated and control groups to be reliable.
3. When either G0 or G1 is lower than 40, researchers may run simulations tailored
to their application to assess if HC2-BM CIs have satisfactory coverage in their
data. Our DGPs 1 and 3 show that HC2-BM CIs can still have very good coverage, even with far fewer than 40 treated or control groups. Then, before resorting to the inference
methods described in the next section, researchers should conduct simulations, to assess
if in their data, the coverage of HC2-BM CIs is closer to that in our Models 1 and 3, or
closer to that in our Model 2. To conduct those simulations, they may follow the method
we used to conduct simulations based on the data from Moser and Voena (2012) (see also
Ferman, 2019, for related proposals).
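A Python sketch of such a tailored simulation under the sharp null, in the spirit of our Moser and Voena (2012) exercise (for simplicity we use normal critical values and a plug-in variance as a stand-in for HC2-BM, and made-up residuals in place of the estimated ones):

```python
import numpy as np

rng = np.random.default_rng(8)
G1, G0, n_sim = 10, 40, 2000
eta_hat = rng.standard_normal(500)   # stand-in for the estimated residuals (eta_hat_{g,l})

cover = 0
for _ in range(n_sim):
    e1 = rng.choice(eta_hat, size=G1)          # treated-group errors under the sharp null
    e0 = rng.choice(eta_hat, size=G0)          # control-group errors
    beta = e1.mean() - e0.mean()               # estimator when the true effect is 0
    se = np.sqrt(e1.var(ddof=1) / G1 + e0.var(ddof=1) / G0)
    cover += (beta - 1.96 * se <= 0.0 <= beta + 1.96 * se)
coverage = cover / n_sim
# compare coverage to the 0.95 nominal level before trusting the CI in one's data
```
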
Our answer to that question depends on whether G1 is low but G0 is large, or G1 and G0 are
both low. As a word of caution on what follows, note that this area of research is still active
and no consensus has emerged on it yet.
58 CHAPTER 3. THE CLASSICAL DID DESIGN
Inference problem. There are many applications where G1 , the number of treated groups,
is low. For instance, a well-studied US schooling merit aid program is the HOPE scholarship.
It was implemented in Georgia only, so in a state-level analysis, G1 = 1. We saw earlier that
$$\hat{\beta}^{fe}_{\ell} = \frac{1}{G_1}\sum_{g:D_g=1}\left(Y_{g,T_0+\ell}(1) - Y_{g,T_0+\ell}(0)\right) + \frac{1}{G_1}\sum_{g:D_g=1}\eta_{g,\ell} - \frac{1}{G_0}\sum_{g:D_g=0}\eta_{g,\ell}.$$
If G0 → ∞ but G1 remains finite, the third average in the previous display converges to zero by
the law of large numbers, but the first two averages do not and they remain random. Therefore,
βbℓfe is not consistent anymore.
Approach assuming identically distributed disturbances. Conley and Taber (2011) pro-
pose a method to draw valid inference on ATTℓ , under two conditions. First, they assume that
treatment effects are not random:
$$Y_{g,T_0+\ell}(0_{T_0},1_\ell) - Y_{g,T_0+\ell}(0_{T_0+\ell}) = \mathrm{TE}_{g,T_0+\ell}, \text{ with } \mathrm{TE}_{g,T_0+\ell} \text{ non-random}. \tag{3.21}$$
Does (3.21) still allow for heterogeneous treatment effects?
Yes it does, as TEg,T0 +ℓ may vary with g and T0 + ℓ. Plugging (3.21) into the previous equation
for βbℓfe ,
\begin{align*}
\hat{\beta}^{fe}_{\ell} &= ATT_\ell + \frac{1}{G_1}\sum_{g:D_g=1}\eta_{g,\ell} - \frac{1}{G_0}\sum_{g:D_g=0}\eta_{g,\ell} \tag{3.22}\\
&= ATT_\ell + \frac{1}{G_1}\sum_{g:D_g=1}\eta_{g,\ell} + o_P(1),
\end{align*}
where $o_P(1)$ denotes a random variable tending to 0 in probability. To understand the construction of Conley and Taber (2011), suppose for the moment that the distribution of $\frac{1}{G_1}\sum_{g:D_g=1}\eta_{g,\ell}$ is known and let $q_{\alpha/2}$ (resp. $q_{1-\alpha/2}$) denote its quantile of order $\alpha/2$ (resp. $1-\alpha/2$). Then,
$$P\left(ATT_\ell \in \left[\hat{\beta}^{fe}_{\ell} - q_{1-\alpha/2},\, \hat{\beta}^{fe}_{\ell} - q_{\alpha/2}\right]\right) = P\left(q_{\alpha/2} \le \hat{\beta}^{fe}_{\ell} - ATT_\ell \le q_{1-\alpha/2}\right) \to 1 - \alpha,$$
which implies that $\left[\hat{\beta}^{fe}_{\ell} - q_{1-\alpha/2}, \hat{\beta}^{fe}_{\ell} - q_{\alpha/2}\right]$ is an asymptotically valid CI. Now, obviously, the distribution of $\frac{1}{G_1}\sum_{g:D_g=1}\eta_{g,\ell}$ is unknown. To recover it, the authors impose a second condition, namely:
The distribution of ηg,ℓ does not vary across g. (3.23)
(3.23) is a strengthening of Assumption PT. Under Assumption PT, E[ηg,ℓ ] = 0, so the expec-
tation of ηg,ℓ does not vary across g. (3.23) further requires that the distribution of ηg,ℓ does not
vary across g. Then, for all g in the control group, one has
\begin{align*}
\hat{\eta}_{g,\ell} &= Y_{g,T_0+\ell}(0_{T_0+\ell}) - Y_{g,T_0}(0_{T_0}) - \frac{1}{G_0}\sum_{g':D_{g'}=0}\left(Y_{g',T_0+\ell}(0_{T_0+\ell}) - Y_{g',T_0}(0_{T_0})\right)\\
&= \gamma_{T_0+\ell} - \gamma_{T_0} + \eta_{g,\ell} - \frac{1}{G_0}\sum_{g':D_{g'}=0}\left(\gamma_{T_0+\ell} - \gamma_{T_0} + \eta_{g',\ell}\right)\\
&= \eta_{g,\ell} + o_P(1).
\end{align*}
Therefore, $\hat{\eta}_{g,\ell} \stackrel{P}{\longrightarrow} \eta_{g,\ell}$. Then, the empirical distribution of the $(\hat{\eta}_{g,\ell})_{g:D_g=0}$ is a consistent estimator of the distribution of $\eta_{g,\ell}$, for any treated group g. If $G_1 = 1$, $q_{\alpha/2}$ and $q_{1-\alpha/2}$ are just the quantiles of $\eta_{g,\ell}$, which we can consistently estimate by $\hat{q}_{\alpha/2}$ and $\hat{q}_{1-\alpha/2}$, the quantiles of order $\alpha/2$ and $1-\alpha/2$ of the $(\hat{\eta}_{g,\ell})_{g:D_g=0}$, and
$$\left[\hat{\beta}^{fe}_{\ell} - \hat{q}_{1-\alpha/2},\, \hat{\beta}^{fe}_{\ell} - \hat{q}_{\alpha/2}\right]$$
is an asymptotically valid CI. If $G_1 > 1$, the following algorithm produces consistent estimators of $q_{\alpha/2}$ and $q_{1-\alpha/2}$:
1. For s = 1 to S (large):

(a) Draw (with replacement) a sample of size $G_1$ from $\{g : D_g = 0\}$. Let $(g_1^s, \dots, g_{G_1}^s)$ denote the labels of the corresponding groups.

(b) Compute $\hat{\eta}^s := \frac{1}{G_1}\sum_{i=1}^{G_1}\hat{\eta}_{g_i^s,\ell}$.

2. Compute $\hat{q}_{\alpha/2}$ and $\hat{q}_{1-\alpha/2}$, the empirical quantiles of order $\alpha/2$ and $1-\alpha/2$ of $(\hat{\eta}^1, \dots, \hat{\eta}^S)$.
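A minimal Python sketch of this resampling algorithm follows. It is illustrative, not the authors' implementation: it takes as inputs the long outcome differences of each group, builds the $\hat{\eta}_{g,\ell}$ of the control groups as their long differences minus the control-group average, and inverts the estimated quantiles into a CI.

```python
import random
from statistics import fmean

def conley_taber_ci(treated_diffs, control_diffs, alpha=0.05, S=5000, seed=0):
    """CI for ATT_l from the resampling algorithm above. The *_diffs are
    long differences Y_{g,T0+l} - Y_{g,T0}, one per group."""
    rng = random.Random(seed)
    G1 = len(treated_diffs)
    beta_fe = fmean(treated_diffs) - fmean(control_diffs)
    ctrl_mean = fmean(control_diffs)
    eta_hat = [d - ctrl_mean for d in control_diffs]  # eta_hat_g, controls
    # resample averages of size G1 (with replacement) from the eta_hats
    draws = sorted(fmean(rng.choices(eta_hat, k=G1)) for _ in range(S))
    q_lo = draws[int(alpha / 2 * S)]
    q_hi = draws[int((1 - alpha / 2) * S) - 1]
    return beta_fe - q_hi, beta_fe - q_lo

rng = random.Random(1)
treated = [1.0 + rng.gauss(0, 1), 1.0 + rng.gauss(0, 1)]  # G1 = 2
controls = [rng.gauss(0, 1) for _ in range(60)]           # G0 = 60
lo, hi = conley_taber_ci(treated, controls)
print(lo, hi)
```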
with groups’ size Ng, and the distribution of groups’ size may not be the same in control and
treated groups. We can still perform valid inference if we replace (3.23) by:
ηg,ℓ = σ(Ng ) × ζg,ℓ , where the distribution of ζg,ℓ does not depend on g. (3.24)
Without loss of generality, we can assume that V (ζg,ℓ ) = 1. The function σ(Ng ) is unknown but
we suppose we have a consistent estimator σb (Ng ) of it. For instance, Ferman and Pinto (2019)
assume that σ 2 (Ng ) = A + B/Ng , as is the case if ηg,ℓ is the sum of a group-specific component
and the average of independent individual-specific components. Then, one can consistently
estimate A and B by a regression, within the control group, of $\hat{\eta}_{g,\ell}^2$ on a constant and $1/N_g$.
The following algorithm modifies that from Conley and Taber (2011), so as to still produce
consistent estimators of qα/2 and q1−α/2 in this set-up (to simplify notation, we suppose here
that the treated groups are groups {1, ..., G1 }):
1. Compute $\hat{\zeta}_{g,\ell} = \hat{\eta}_{g,\ell}/\hat{\sigma}(N_g)$ for all g in the control group.

2. For s = 1 to S (large):

(a) Draw (with replacement) a sample of size $G_1$ from $\{g : D_g = 0\}$. Let $(g_1^s, \dots, g_{G_1}^s)$ denote the labels of the corresponding groups.

(b) Compute $\hat{\eta}^s := \frac{1}{G_1}\sum_{i=1}^{G_1}\hat{\sigma}(N_i)\,\hat{\zeta}_{g_i^s,\ell}$.

3. Compute $\hat{q}_{\alpha/2}$ and $\hat{q}_{1-\alpha/2}$, the empirical quantiles of order $\alpha/2$ and $1-\alpha/2$ of $(\hat{\eta}^1, \dots, \hat{\eta}^S)$.
We have considered here that V (ηg,ℓ ) varies with group size but obviously, the same reasoning
would apply with any other variable Xg , provided that we can consistently estimate σ(Xg ).
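The variance-model step can be sketched as follows, under the Ferman and Pinto (2019) specification $\sigma^2(N_g) = A + B/N_g$; the simulated check at the bottom is illustrative, with A = 1 and B = 4 by construction.

```python
import random
from statistics import fmean

def fit_variance_model(eta_hats, sizes):
    """OLS of eta_hat^2 on a constant and 1/N_g (closed form for a single
    regressor): recovers A and B in sigma^2(N) = A + B/N."""
    y = [e * e for e in eta_hats]
    x = [1.0 / n for n in sizes]
    xbar, ybar = fmean(x), fmean(y)
    B = (sum((xi - xbar) * (yi - ybar) for xi, yi in zip(x, y))
         / sum((xi - xbar) ** 2 for xi in x))
    A = ybar - B * xbar
    return A, B

# Check on a simulated control group: eta_g is a group shock plus the
# average of N_g individual shocks, so V(eta_g) = 1 + 4 / N_g here.
rng = random.Random(0)
sizes = [rng.randrange(5, 200) for _ in range(5000)]
etas = [rng.gauss(0, 1) + fmean(rng.gauss(0, 2) for _ in range(n))
        for n in sizes]
A, B = fit_variance_model(etas, sizes)
print(A, B)
```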
Approach assuming normally distributed disturbances. Alternatively, one may assume that the disturbances are normally distributed, with a variance that does not vary across groups:
$$\eta_{g,\ell} \sim \mathcal{N}(0, \sigma_{\eta,\ell}^2), \text{ i.i.d. across } g. \tag{3.25}$$
This assumption may approximately hold if Yg,t is an average over many individuals in cell (g, t), all cells have a similar number of individuals, and we have weak dependence within each cell, see Bester, Conley and Hansen (2011). Then, we estimate $\sigma_{\eta,\ell}^2$ by
$$\hat{\sigma}_{\eta,\ell}^2 = \frac{1}{G_0 - 1}\sum_{g:D_g=0}\hat{\eta}_{g,\ell}^2.$$
It follows from (3.22) and standard properties of Gaussian vectors that under (3.21) and (3.25),7
$$\sqrt{\frac{G_0 G_1}{G_0 + G_1}}\,\frac{\hat{\beta}^{fe}_{\ell} - ATT_\ell}{\sigma_{\eta,\ell}} \sim \mathcal{N}(0, 1), \qquad (G_0 - 1)\,\frac{\hat{\sigma}_{\eta,\ell}^2}{\sigma_{\eta,\ell}^2} \sim \chi^2(G_0 - 1).$$
Therefore,
$$\sqrt{\frac{G_0 G_1}{G_0 + G_1}}\,\frac{\hat{\beta}^{fe}_{\ell} - ATT_\ell}{\hat{\sigma}_{\eta,\ell}} \sim t_{G_0 - 1},$$
a t distribution with $G_0 - 1$ degrees of freedom. Thus, if we let $q_{G_0-1}(1-\alpha/2)$ denote the quantile of order $1-\alpha/2$ of such a distribution,
$$\left[\hat{\beta}^{fe}_{\ell} - \frac{q_{G_0-1}(1-\alpha/2)}{\sqrt{\frac{G_0 G_1}{G_0+G_1}}}\,\hat{\sigma}_{\eta,\ell},\ \hat{\beta}^{fe}_{\ell} + \frac{q_{G_0-1}(1-\alpha/2)}{\sqrt{\frac{G_0 G_1}{G_0+G_1}}}\,\hat{\sigma}_{\eta,\ell}\right]$$
is a CI for ATTℓ with coverage equal to 1 − α. While similar in spirit to Donald and Lang
(2007), this CI slightly differs from theirs: our variance estimator only uses control groups and
periods T0 and T0 + ℓ. This allows for heterogeneous treatment effects, and autocorrelation and
non-stationarity of (εg,1 , ..., εg,T ), since V (ηg,ℓ ) may vary with ℓ.
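This CI is simple enough to compute by hand; here is a hedged Python sketch. The t critical value is passed in by the user (e.g. 2.093 for G0 = 20 and α = 0.05) because the Python standard library has no t quantile function, and all inputs below are illustrative.

```python
from statistics import fmean

def t_based_ci(treated_diffs, control_diffs, t_crit):
    """CI based on the t_{G0-1} result above; t_crit must be the
    1 - alpha/2 quantile of a t distribution with G0 - 1 DOF."""
    G1, G0 = len(treated_diffs), len(control_diffs)
    beta_fe = fmean(treated_diffs) - fmean(control_diffs)
    ctrl_mean = fmean(control_diffs)
    # sigma_hat^2_{eta,l}: average squared eta_hat over control groups,
    # with a G0 - 1 denominator
    sigma2 = sum((d - ctrl_mean) ** 2 for d in control_diffs) / (G0 - 1)
    half = t_crit * sigma2 ** 0.5 / (G0 * G1 / (G0 + G1)) ** 0.5
    return beta_fe - half, beta_fe + half

controls = [0.1, -0.2, 0.3, 0.0, -0.1, 0.2, -0.3, 0.1, 0.0, -0.1,
            0.2, -0.2, 0.1, 0.0, 0.1, -0.1, 0.2, 0.0, -0.2, 0.1]
lo, hi = t_based_ci([2.0, 1.0], controls, t_crit=2.093)
print(lo, hi)  # interval centered at the DID estimate
```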
Approach assuming homogeneous treatment effects. With few treated and control
groups, a second idea is to use a permutation-test approach, following DiCiccio and Romano
(2017).8 Specifically, we maintain (3.21) and (3.23) and add the following constant treatment
7 The term $(G_0 G_1/(G_0 + G_1))^{1/2}$ comes from the fact that under (3.25), we have
$$V(\hat{\beta}^{fe}_{\ell}) = \sigma_{\eta,\ell}^2\left(\frac{1}{G_0} + \frac{1}{G_1}\right) = \frac{G_0 + G_1}{G_0 G_1}\,\sigma_{\eta,\ell}^2.$$
8 See also MacKinnon and Webb (2020), who also consider a permutation-based approach in the context of difference-in-differences. Their test statistic differs from the one we propose below, and does not lead to valid inference asymptotically under heteroskedasticity.
effect assumption:
For all treated g, TEg,T0 +ℓ = ATTℓ . (3.26)
This is a strong condition, but note that it mechanically holds if G1 = 1. Under this condition, we can draw inference on ATTℓ without imposing the normality condition above. To test that ATTℓ = c, we use the fact that under (3.21), (3.23), (3.26) and the null hypothesis,
$$\xi_{g,c} := Y_{g,T_0+\ell} - Y_{g,T_0} - c D_g$$
has the same distribution for all groups, irrespective of whether they belong to the treatment or
to the control group. Letting π denote an arbitrary permutation of {1, ..., G} and $\bar{D}$ denote the proportion of treated groups, this implies that the distribution of
$$T_\pi := \frac{\sum_{g=1}^{G}(D_g - \bar{D})\,\xi_{\pi(g),c}}{\sum_{g=1}^{G}(D_g - \bar{D})^2\,\xi_{\pi(g),c}^2}$$
does not depend on π. Let T denote the test statistic above using π = Id, the identity permutation, and let qπ(1 − α) denote the empirical quantile of order 1 − α of the (|Tπ|)π∈G, where G is the set of permutations of {1, ..., G}. A test of level α of the null hypothesis can be obtained using the critical region {|T| > qπ(1 − α)}. In practice, the set G is often too large to compute Tπ for all π ∈ G. However, the test keeps its properties if we replace G by G′, obtained by drawing at random (without replacement) N − 1 permutations from G \ {Id}, and adding Id to G′. Also, we can obtain a valid CI on ATTℓ by, basically, inverting this test. The following algorithm indicates a fast way to obtain c, the upper bound of this interval (the lower bound can be obtained similarly).
2. Find an initial value c1 (resp. c2 > c1) for which the permutation test of ATTℓ = c1 is not rejected (resp. of ATTℓ = c2 is rejected). Usually, we can choose c1 = βbℓfe and c2 equal to twice the upper bound of the CI based on normality.
Importantly, while these permutation tests and CIs are valid in finite samples under strong
conditions, Theorem 3.1 of DiCiccio and Romano (2017) implies that they are also valid asymp-
totically under, basically, the sole condition that D → p ∈ (0, 1).
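The permutation test can be sketched as follows in Python. The definition of ξg,c used here, a group's long outcome difference minus c·Dg, is our reading of the surrounding derivation, the statistic mirrors the displayed Tπ, and the data are illustrative.

```python
import random

def perm_stat(D, xi):
    """The statistic T_pi for a given assignment of xi values to groups."""
    Dbar = sum(D) / len(D)
    num = sum((d - Dbar) * x for d, x in zip(D, xi))
    den = sum((d - Dbar) ** 2 * x ** 2 for d, x in zip(D, xi))
    return num / den

def permutation_test(D, long_diffs, c, N=999, seed=0):
    """Permutation p-value for the null ATT_l = c, using N - 1 random
    permutations plus the identity."""
    rng = random.Random(seed)
    xi = [y - c * d for d, y in zip(D, long_diffs)]  # xi_{g,c}
    T = abs(perm_stat(D, xi))
    exceed = 1  # the identity permutation is always included
    for _ in range(N - 1):
        perm = xi[:]
        rng.shuffle(perm)  # shuffling xi is equivalent to permuting labels
        exceed += abs(perm_stat(D, perm)) >= T
    return exceed / N

random.seed(2)
D = [1] * 3 + [0] * 12
long_diffs = [1.0 + random.gauss(0, 1) if d else random.gauss(0, 1) for d in D]
p = permutation_test(D, long_diffs, c=0.0)
print(p)
```

Inverting this test over a grid of candidate values c, as in the algorithm above, yields the permutation CI.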
3.3.3.3 Steps that researchers may follow when G0 < 40 or G1 < 40, and simulations tailored to their application indicate that HC2-BM CIs may not have good coverage
1. One should start by testing (3.23), using a “permutation pre-trends” test. With the exception of that of Ferman and Pinto (2019), all the CIs in Sections 3.3.3.1 and 3.3.3.2 rely on (3.23). If T0 ≥ 2, (3.23) is testable, since it implies that for each ℓ ∈ {−1, ..., −(T0 − 1)}, (Yg,T0 +ℓ − Yg,T0 )g:Dg =1 and (Yg,T0 +ℓ − Yg,T0 )g:Dg =0 have the same distribution. Then, we can use a “permutation pre-trends” test to check whether (3.23) holds or not (Bickel, 1969).11
2. If G1 > 1 and G0 ≥ 50, we recommend using the CIs of Conley and Taber (2011). The
cut-off value of 50 for G0 comes from the simulations in Conley and Taber (2011), where
they do not consider G0 smaller than 50. Investigating if their CIs remain valid for G0
smaller than 50 is an interesting avenue for future research.
11
Bickel (1969) suggests using the Kolmogorov-Smirnov two-sample test statistic together with a critical value
obtained by permutation. This test has nominal level in finite samples. Note that this is a test of (3.23) for a
given ℓ. To jointly test (3.23) for all ℓ ∈ {−1, ..., −(T0 − 1)}, one may perform the test for each ℓ separately, and
use a Bonferroni adjustment to obtain the joint test’s p-value.
3. If G1 > 1 and G0 < 50, we recommend one of the two CIs discussed in Section 3.3.3.2
above, depending on whether assuming normal disturbances or homogeneous treatment
effects seems more credible.
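The permutation pre-trends test of footnote 11 can be sketched as follows, for a single placebo horizon; horizons are then combined with a Bonferroni adjustment. This is an illustrative sketch, not the Bickel (1969) implementation, and the data are simulated.

```python
import random

def ks_stat(a, b):
    """Two-sample Kolmogorov-Smirnov statistic: largest gap between ECDFs."""
    grid = sorted(set(a) | set(b))
    return max(abs(sum(x <= t for x in a) / len(a)
                   - sum(x <= t for x in b) / len(b)) for t in grid)

def ks_permutation_pvalue(treated, control, N=999, seed=0):
    """Permutation p-value for equality of the two distributions."""
    rng = random.Random(seed)
    T = ks_stat(treated, control)
    pooled, g1 = treated + control, len(treated)
    exceed = 1  # the identity permutation is always included
    for _ in range(N - 1):
        rng.shuffle(pooled)  # reassign group labels at random
        exceed += ks_stat(pooled[:g1], pooled[g1:]) >= T
    return exceed / N

random.seed(3)
pre_changes_treated = [random.gauss(0, 1) for _ in range(8)]
pre_changes_control = [random.gauss(0, 1) for _ in range(20)]
p = ks_permutation_pvalue(pre_changes_treated, pre_changes_control)
print(p)  # one horizon; Bonferroni: multiply the smallest p by the number of horizons
```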
3.4.1 We can only test for parallel trends before but not after treatment
While Assumptions NA and PT are partly testable, those assumptions are not fully testable:
we can only test for parallel trends before but not after treatment. Even if treated and control
groups are on parallel trends before the date when the treatment starts, that does not necessarily
mean that without treatment, they would have remained on parallel trends after that date.
Accordingly, parallel-trends tests remain suggestive.
Even when a pre-trends test is not rejected, Kahn-Lang and Lang (2020) argue that researchers
should use their contextual knowledge, to assess whether a shock happening at the same time
as or after the treatment could have led to differential trends in the absence of treatment. For
instance, was there an economic recession at the time of treatment that could have affected
treated and control groups differently and led to differential counterfactual trends? In the
compulsory licensing example, researchers could use their contextual knowledge to assess whether the sudden spike in the treatment group’s patenting in 1932, visible in Figure 3.2, could have been due to a shock affecting the treated subclasses and unrelated to the compulsory licensing treatment. To suggestively test whether it is plausible that counterfactual trends would have been different after treatment even if they were parallel before, one can identify covariates that
significantly predict post-treatment outcome trends in the control group, but do not predict
pre-treatment outcome trends. Then, if treated and control groups differ on those covariates, it
could be that treated and controls were on parallel trends before treatment but would still have
experienced differential trends after treatment. Formalizing this proposal, which is related to
one by Kahn-Lang and Lang (2020), is an interesting avenue for future research.
A second, related concern is that policies are rarely implemented in isolation. For instance,
Hoehn-Velasco, Penglase, Pesko and Shahid (2024) show that in the US, unilateral divorce laws,
which we will return to in Chapter 6, were often adopted at the same time as other policies,
in particular legal abortion laws. Then, post-treatment differential trends between treated and
control groups may reflect the total effect of all those treatments, rather than just the effect
of the treatment under consideration. Here again, researchers should document whether other treatments likely to affect the outcome of interest also changed around the time when the treatment of interest changed.
Overall, even when pre-trend tests are not rejected, researchers should present statistical and
qualitative evidence that parallel trends would have continued in the post-treatment period.
Providing sound guidelines as to how to do so is an important avenue for future research: this issue has received little attention despite its importance, as we will reiterate in the book’s conclusion.
he collects the event-study coefficients $(\hat{\beta}^{fe}_{\ell})_{\ell\in\{-(T_0-1),...,-1,1,...,T_1\}}$ and their estimated variance-covariance matrix $\hat{\Sigma}$, and then runs the following simulations. In each simulation draw, he generates coefficients $(\hat{\beta}^{s,fe}_{\ell})_{\ell\in\{-(T_0-1),...,-1,1,...,T_1\}}$ from a normal distribution with variance-covariance matrix $\hat{\Sigma}$, and where the means of the simulated coefficients follow a linear differential trend of slope γ, for some real number γ ̸= 0. Interpret the DGP in those simulations. Does the parallel-trends assumption hold? What is the value of ATTℓ?
For ℓ ∈ {−(T0 − 1), ..., −1}, E(βbℓs,f e ) = γℓ ̸= 0, so parallel trends fails in this DGP. Treated
and control groups experience their own linear trends, and the difference between the linear
trends of treated and control groups is equal to γ. ATTℓ is equal to βbℓfe , the actual value of the
estimated ATTℓ in each paper. In each simulation draw, Roth mimics a pre-trends test, which
is rejected if there is at least one ℓ ∈ {−(T0 − 1), ..., −1} such that |βbℓs,f e | > 1.96σbℓ , where σbℓ
is the estimated standard error of βbℓfe .12 He evaluates the power of the pre-trend tests across
a range of values for γ, until finding the value γ0.5 such that the pre-trends test is rejected
50% of the time. In each of the 12 papers he considers, γ0.5 represents the differential linear
trends between treatment and control groups that have 50% chances of being detected by the
researcher. Finally, he evaluates $\gamma_{0.5}\,\frac{1}{T_1}\sum_{\ell=1}^{T_1}\ell$ and compares that quantity to $\frac{1}{T_1}\sum_{\ell=1}^{T_1}\hat{\beta}^{fe}_{\ell}$. What does that comparison capture?
Under differential linear trends that have 50% chances of being detected, $\gamma_{0.5}\,\frac{1}{T_1}\sum_{\ell=1}^{T_1}\ell$ is the
bias of the event-study estimator of the ATT. It is interesting to compare the magnitude of this
12
As discussed above, this is not a valid procedure to test for pre-trends. However, Roth seeks to reproduce
current practice, and articles in his survey that formally test for pre-trends use that procedure.
potential bias with the magnitude of the actual event-study estimate of the ATT, to assess if
differential trends that have high chances of not being detected by the researcher can account
for a large share of the estimated treatment effect. Results are compelling. His appendix Figure
D1, reproduced below, shows that in 7 papers out of 12, under differential linear trends that
have 50% chances of being detected, the bias in the event-study estimate of the ATT, the green
circle, is no smaller than a half of the actual estimate of the ATT, the blue square. In other
words, in 7 papers out of 12, the authors had 50% chances of failing to detect differential trends
large enough to account for at least a half of their estimated ATT. Based on his findings, Roth
(2022) recommends that practitioners run simulations similar to his, and provides the Stata and
R packages pretrends for that purpose. Thus, researchers can assess the power of pre-trend tests in their application, and whether they could fail to detect differential trends large enough to account for a substantial fraction of their estimated ATT.

Figure 3.3: Power of pre-trend tests in 12 published economics papers: reproduction of Figure D1 in Roth (2022).
[Figure D1: Original Estimates and Bias from Linear Trends for Which Pre-tests Have 50 Percent Power – Average Treatment Effect]
Note: I calculate the linear trend against which conventional pre-tests would reject 50 percent of the time ($\gamma_{0.5}$). The red triangles show the bias that would result from such a trend conditional on passing the pre-test ($E[\hat{\tau} - \tau_* \mid \hat{\beta}_{pre} \in B_{NIS}(\hat{\Sigma})]$); the green circles show the unconditional bias from such a trend ($E[\hat{\tau} - \tau_*]$). As a benchmark, I plot in blue the OLS estimates and 95% CIs from the original paper. All values are normalized by the standard error of the estimated treatment effect and so the OLS treatment effect estimate is positive. The estimand is the average of the treatment effects in all periods after treatment began, $\tau_* = \bar{\tau}$.
Application to the compulsory licensing example. Using the moser_voena_didtextbook
dataset, run the following lines of code:
We could fail to detect differential linear trends of 0.010. Figure 3.4 shows that extrapolating
such differential linear trends throughout the post-treatment period, they can account for all or
almost all of the estimated short-term effects of the treatment, but they can only account for
around 25% of the estimated effects after 14 years of exposure or more.
13
The pretrends command takes a long time to run with more pre-trends estimates, which is why we only
leverage six estimates.
[Figure: event-study estimates; y-axis “Effect”, x-axis “Relative time to year before TWEA”.]
Note: This figure shows the estimated effects of compulsory licensing on patents, using years 1912 to 1939 of the data from Moser and Voena (2012), and the TWFE event-study regression in (3.6). Standard errors are clustered at the patent subclass level. 95% confidence intervals are shown in red. The grey dotted line shows the differential linear trend one would have 50% chances of detecting, in view of the variance-covariance matrix of the pre-trends estimates, computed using the pretrends Stata command.
3.4.3 Pre-trend tests might exacerbate the bias from violations of no anticipation and parallel trends∗
Tests of parallel trends may lead to a pre-testing problem. Parallel trends tests are
often used as a way to decide whether the analysis should be continued, or which specification
should be reported: researchers may add control variables, add group-specific linear trends,
change the definition of their control group, etc., until the parallel-trends test is not rejected. This means that oftentimes, the vector of estimated effects we observe $(\hat{\beta}^{fe}_{\ell})_{\ell\in\{1,...,T_1\}}$ is conditional on values of the pre-trends coefficients $(\hat{\beta}^{fe}_{\ell})_{\ell\in\{-(T_0-1),...,-1\}}$ such that the pre-trends test is not rejected. Let Pub be an indicator equal to 1 when that event is realized, where Pub stands for publishable. If $(\hat{\beta}^{fe}_{\ell})_{\ell\in\{1,...,T_1\}}$ and $(\hat{\beta}^{fe}_{\ell})_{\ell\in\{-(T_0-1),...,-1\}}$ are not independent, we may have that for ℓ ∈ {1, ..., T1},
$$E\left[\hat{\beta}^{fe}_{\ell} \mid \mathrm{Pub} = 1\right] \neq E\left[\hat{\beta}^{fe}_{\ell}\right],$$
so such pre-testing could lead to a bias of βbℓfe , on top of the potential bias that may come
from differential trends. Pre-testing could also lead to distorted inference: the distribution of
βbℓfe conditional on not rejecting the pre-test may differ from its unconditional distribution, and
standard critical values used to construct confidence intervals and tests, which are derived from
the unconditional distribution, may not be valid anymore.
If trends are parallel, pre-testing does not lead to a bias and does not distort in-
ference. Reassuringly, Proposition 1 in Roth (2022) shows that when trends are parallel, this
additional bias is equal to zero: under parallel trends, testing for pre-trends and estimating the
treatment effect only if the pre-trends test is not rejected does not lead to a bias. Furthermore,
as shown by de Chaisemartin and D’Haultfœuille (2024), it readily follows from the Gaussian
correlation inequality (Royen, 2014) that under parallel trends, conditional on not rejecting the
pre-trends test, confidence intervals for treatment effects are conservative. Therefore, pre-testing cannot lead to over-rejection of null hypotheses.
If trends are not parallel, pre-testing might exacerbate the bias of βbℓfe , though this
phenomenon seems modest in practice. Proposition 2 in Roth (2022) shows that if trends
are not parallel, differential trends widen over time, and the estimators of the pre-trends and
actual effects are positively correlated and homoscedastic, then pre-testing leads to a bias which
goes in the same direction as the bias coming from differential trends, thus exacerbating it. In
Figure D1 in Roth (2022), reproduced above, the red triangles represent E(βbℓfe |Pub = 1), while
the green circles represent E(βbℓfe ), under the same pre-trends as above. In practice, does the
potential bias exacerbation phenomenon uncovered by Roth (2022) seem modest or serious?
This bias exacerbation phenomenon is relatively modest in the 12 papers reviewed by the author:
in most cases, the red triangles are close to the green circles.
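The selection mechanism behind this result can be illustrated with a small simulation: a single pre-trend coefficient with mean −γ, a post coefficient with mean +γ, a positive correlation between the two, and a pre-test that passes when the pre-trend coefficient is insignificant. All numbers are illustrative, not taken from Roth (2022).

```python
import random

def pretest_selection(gamma=1.0, rho=0.8, n=200_000, seed=0):
    """Mean of the post coefficient, unconditionally and conditional on
    passing the pre-test, under a differential trend of slope gamma."""
    rng = random.Random(seed)
    total, total_passed, n_passed = 0.0, 0.0, 0
    for _ in range(n):
        shock = rng.gauss(0, 1)
        beta_pre = -gamma + shock                            # mean -gamma (l = -1)
        beta_post = gamma + rho * shock + rng.gauss(0, 0.6)  # mean +gamma
        total += beta_post
        if abs(beta_pre) < 1.96:  # pre-test passed
            total_passed += beta_post
            n_passed += 1
    return total / n, total_passed / n_passed

unconditional, conditional = pretest_selection()
print(unconditional, conditional)  # conditioning on passing raises the bias
```

Passing the pre-test mostly truncates the left tail of the pre coefficient; with positive correlation, that pushes the post coefficient upwards, in the same direction as the trend bias.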
The estimator of ATTℓ proposed by Borusyak, Jaravel and Spiess (2024), Liu, Wang and Xu
(2024), and Gardner (2021) for the binary-and-staggered designs that we will consider in Chapter
6 reduces to βbℓb,l,g in the classical DID designs we consider in this chapter, hence its superscript.
Explain the difference between βbℓfe and βbℓb,l,g .
Both estimators are DID estimators comparing the outcome evolution of treated and control
groups. In βbℓfe , the “before” period in the “before-after” comparison is t = T0 , the last period
before treated groups get treated. Instead, in βbℓb,l,g the “before” period is actually an average of
all pre-treatment periods, from period 1 to T0 . One can show that under Assumptions NA and
PT, βbℓb,l,g is also unbiased for ATTℓ .
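The difference between the two estimators can be made concrete with a toy panel (two pre-periods, so T0 = 2, and one post period). This is an illustrative sketch; in this example the data satisfy exact parallel trends with a treatment effect of 2, so both estimators coincide.

```python
from statistics import fmean

def did_estimators(Y_treated, Y_control, T0):
    """For l = 1: beta_fe uses period T0 as baseline, beta_blg uses the
    average of periods 1..T0. Outcomes are per-group lists, 0-indexed."""
    def contrast(baseline):
        post = fmean(g[T0] for g in Y_treated) - fmean(g[T0] for g in Y_control)
        base = (fmean(baseline(g) for g in Y_treated)
                - fmean(baseline(g) for g in Y_control))
        return post - base
    beta_fe = contrast(lambda g: g[T0 - 1])       # baseline: last pre-period
    beta_blg = contrast(lambda g: fmean(g[:T0]))  # baseline: all pre-periods
    return beta_fe, beta_blg

# T0 = 2 pre-periods (g[0], g[1]) and one post period (g[2]).
treated = [[0.0, 1.0, 4.0], [1.0, 2.0, 5.0]]
controls = [[0.0, 1.0, 2.0], [2.0, 3.0, 4.0]]
b_fe, b_blg = did_estimators(treated, controls, T0=2)
print(b_fe, b_blg)  # → 2.0 2.0
```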
A TWFE ES regression to compute βbℓb,l,g . βbℓfe can be obtained from a TWFE ES regression
with t = T0 as the omitted/reference period. With that in mind, which TWFE ES regression
could we use to estimate (βbℓb,l,g )ℓ∈{1,...,T1 } ?
$$Y_{g,t} = \hat{\alpha}^{b,l,g}_0 + \hat{\alpha}^{b,l,g}_1 D_g + \sum_{t'>T_0}\hat{\gamma}^{b,l,g}_{t'}\,1\{t = t'\} + \sum_{\ell>0}\hat{\beta}^{b,l,g}_{\ell}\,1\{t = T_0 + \ell\}D_g + \hat{\epsilon}_{g,t}. \tag{3.28}$$
One can also show that βbℓb,l,g is numerically equivalent to the coefficients from a regression with FEs for all periods, and interactions of Dg with relative time after, but not before, treatment:
$$Y_{g,t} = \hat{\alpha}^{b,l,g}_0 + \hat{\alpha}^{b,l,g}_1 D_g + \sum_{t'=1}^{T}\hat{\gamma}^{b,l,g}_{t'}\,1\{t = t'\} + \sum_{\ell>0}\hat{\beta}^{b,l,g}_{\ell}\,1\{t = T_0 + \ell\}D_g + \hat{\epsilon}_{g,t}. \tag{3.29}$$
Intuitively, do you expect that V (βbℓb,l,g ) > V (βbℓfe ), V (βbℓb,l,g ) < V (βbℓfe ), or V (βbℓb,l,g ) = V (βbℓfe )?
V (βbℓb,l,g ) ≤ V (βbℓfe ) if errors are not serially correlated. βbℓfe uses groups’ T0 outcome, the
last period before treatment onset, as the baseline outcome. On the other hand, βbℓb,l,g uses their
average outcome from period 1 to T0 . As βbℓb,l,g uses more data than βbℓfe , one may expect the
former estimator to be more precise than the latter. Actually, whether this is the case depends
on the data generating process. Recall that in (2.5), we showed that under Assumptions NA
and PT,
Yg,t (0t ) = αg + γt + εg,t , E[εg,t ] = 0,
with εg,t = Yg,t (0t ) − E(Yg,t (0t )). Then, Borusyak et al. (2024) show that if the treatment effects
are non stochastic and the errors εg,t are independent and identically distributed (i.i.d.) across
both g and t, βbℓb,l,g is the best linear unbiased estimator (BLUE) of ATTℓ . Thus,
$$V\left[\hat{\beta}^{b,l,g}_{\ell}\right] \le V\left[\hat{\beta}^{fe}_{\ell}\right], \tag{3.30}$$
as βbℓfe is also a linear unbiased estimator of ATTℓ . However, this result relies on a strong
assumption, namely that within group g the errors εg,t are uncorrelated over time. This rules
out within-group serial correlations that are likely to be present in many empirical settings, and
that Bertrand et al. (2004) recommend accounting for when conducting inference in DID studies.
V (βbℓb,l,g ) ≥ V (βbℓfe ) if errors follow a random walk. Inequality (3.30) heavily relies on this
no-serial-correlation assumption. Instead of assuming independent errors, assume that in each
group errors follow a random walk: εg,t = εg,t−1 + ug,t , with ug,t i.i.d. This means that in each
group, errors are very strongly positively correlated over time. Then, Harmon (2022) shows that
βbℓfe is the BLUE estimator of ATTℓ , thus implying that
$$V\left[\hat{\beta}^{b,l,g}_{\ell}\right] \ge V\left[\hat{\beta}^{fe}_{\ell}\right]. \tag{3.31}$$
Some intuition for (3.31) goes as follows. Assume that T0 = 2. Then, as treatment effects are assumed to be non-stochastic, the stochastic part of $\hat{\beta}^{b,l,g}_1$ is a difference between independent averages of
\begin{align*}
Y_{g,3}(0_3) - \frac{1}{2}\left(Y_{g,2}(0_2) + Y_{g,1}(0_1)\right) &= \gamma_3 - \frac{1}{2}(\gamma_2 + \gamma_1) + \varepsilon_{g,3} - \frac{1}{2}(\varepsilon_{g,2} + \varepsilon_{g,1})\\
&= \gamma_3 - \frac{1}{2}(\gamma_2 + \gamma_1) + u_{g,3} + \frac{1}{2}u_{g,2},
\end{align*}
where the first equality follows from (2.5) and the second from the random walk assumption.
On the other hand, the stochastic part of $\hat{\beta}^{fe}_1$ is the difference between independent averages of $Y_{g,3}(0_3) - Y_{g,2}(0_2) = \gamma_3 - \gamma_2 + u_{g,3}$, which has a lower variance.
In general, we recommend using βbℓfe , though using one or the other estimator should
not really matter. At the end of the day, our recommendation is to rather use βbℓfe . Note that
under Assumptions NA and PT, βbℓb,l,g and βbℓfe are both unbiased for ATTℓ , so both estimators
should be close and using one or the other should not make a large difference: if they significantly differ, that implies that Assumption NA or PT must fail.
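A small Monte Carlo, under illustrative DGPs with T0 = 2, one post period, and no treatment effect, confirms the two variance rankings discussed above.

```python
import random
from statistics import fmean, pvariance

def estimator_variances(error_model, n_sims=20_000, G=10, seed=0):
    """Sampling variances of beta_fe and beta_blg with T0 = 2 and one
    post period (indices 0, 1, 2), G treated and G control groups."""
    rng = random.Random(seed)
    fe, blg = [], []
    for _ in range(n_sims):
        tr = [error_model(rng) for _ in range(G)]
        co = [error_model(rng) for _ in range(G)]
        fe.append(fmean(g[2] - g[1] for g in tr) - fmean(g[2] - g[1] for g in co))
        blg.append(fmean(g[2] - (g[0] + g[1]) / 2 for g in tr)
                   - fmean(g[2] - (g[0] + g[1]) / 2 for g in co))
    return pvariance(fe), pvariance(blg)

def iid_errors(rng):
    return [rng.gauss(0, 1) for _ in range(3)]

def random_walk_errors(rng):
    eps, out = 0.0, []
    for _ in range(3):
        eps += rng.gauss(0, 1)
        out.append(eps)
    return out

v_fe_iid, v_blg_iid = estimator_variances(iid_errors)
v_fe_rw, v_blg_rw = estimator_variances(random_walk_errors)
print(v_blg_iid < v_fe_iid)  # True: blg is more precise under i.i.d. errors
print(v_blg_rw > v_fe_rw)    # True: fe is more precise under a random walk
```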
[Figure: event-study estimates; y-axis “Effect”, x-axis “Relative time to year before TWEA”.]
Note: This figure shows the estimated effects of compulsory licensing on patents, using years 1900 to 1939 of the data from Moser and Voena (2012), and the TWFE event-study regressions in (3.6) and (3.28). Standard errors are clustered at the patent subclass level. The 95% confidence intervals rely on a normal approximation.
3.6.1 Estimating the correlation between treatment effects and some covariates
Target parameter: best-linear predictor. Assume that one wants to assess if the group-
specific effects of ℓ periods of exposure to treatment TEg,T0 +ℓ are correlated with a K × 1 vector
of time-invariant covariates Xg . Let Xg,k denote the kth coordinate of Xg . We assume that
Xg,1 = 1: the first coordinate of Xg is a constant. For instance, in an analysis at the level of
Chilean communes, one may want to investigate if communes’ effects are correlated with their
poverty rate and/or the share of their population with a college degree. For any matrix A let
$A^T$ denote its transpose. Let
$$\beta_{\ell,X} := \left(\sum_{g:D_g=1} X_g X_g^T\right)^{-1}\sum_{g:D_g=1} X_g\,\mathrm{TE}_{g,T_0+\ell},$$
the coefficients of the best linear predictor of $\mathrm{TE}_{g,T_0+\ell}$ given $X_g$, across treated groups. If K = 2, the coefficient on $X_{g,2}$ is
$$\beta_{\ell,X,2} = \frac{\sum_{g:D_g=1}\left(X_{g,2} - \overline{X}_{.,2}\right)\mathrm{TE}_{g,T_0+\ell}}{\sum_{g:D_g=1}\left(X_{g,2} - \overline{X}_{.,2}\right)^2},$$
where X.,2 is the average of Xg,2 across treated groups. If Xg,2 is binary, βℓ,X,2 further simplifies
to
$$\frac{1}{G_{1,1}}\sum_{g:D_g=1,X_{g,2}=1}\mathrm{TE}_{g,T_0+\ell} - \frac{1}{G_{1,0}}\sum_{g:D_g=1,X_{g,2}=0}\mathrm{TE}_{g,T_0+\ell},$$
the difference between the ATT of treated groups with Xg,2 = 1 and Xg,2 = 0, where Gd,x
denotes the number of groups such that Dg = d, Xg,2 = x.
Estimator when K = 2 and Xg,2 is binary. If K = 2 and Xg,2 is binary, estimating βℓ,X,2
is straightforward: one can just estimate the TWFE ES regression in (3.6) separately among
groups such that Xg,2 = 1 and Xg,2 = 0, and take the difference between the coefficients on 1{t =
T0 + ℓ}Dg in the two regressions. This is equivalent to running a TWFE ES regression of Yg,t on
Dg , Xg,2 , Dg Xg,2 , time FEs, time FEs interacted with Xg,2 , (1{t = T0 + ℓ}Dg )ℓ∈{−(T0 −1),...,T1 },ℓ̸=0 ,
and (1{t = T0 +ℓ}Dg Xg,2 )ℓ∈{−(T0 −1),...,T1 },ℓ̸=0 . The coefficient on 1{t = T0 +ℓ}Dg Xg,2 is numerically
equivalent to the difference between the coefficients on 1{t = T0 + ℓ}Dg in the two separate
regressions. This estimator is unbiased for βℓ,X,2 under the following assumption: for x ∈ {0, 1},
$$E\left[\frac{1}{G_{1,x}}\sum_{g:D_g=1,X_{g,2}=x}\left(Y_{g,t}(0_t) - Y_{g,t-1}(0_{t-1})\right)\right] = E\left[\frac{1}{G_{0,x}}\sum_{g:D_g=0,X_{g,2}=x}\left(Y_{g,t}(0_t) - Y_{g,t-1}(0_{t-1})\right)\right]. \tag{3.32}$$
This assumption requires that treated and control groups with the same value of Xg,2 have
the same average expected outcome evolution without treatment, a conditional parallel-trends
assumption. When Xg takes a small number of values relative to the sample size, a similar
strategy can be used to estimate the differences between the ATT across groups with different
values of Xg . However, this type of estimation strategy is no longer applicable when Xg takes
a large number of values, as is for instance the case if some of its coordinates are continuously
distributed.
Estimator in the general case. In Section 4.1.3.6 of the next chapter, we will discuss
methods to estimate βℓ,X under a conditional parallel-trends assumption, when Xg takes a large
number of values. For now, let us introduce an alternative method which relies on a different
assumption and is extremely simple to implement. Let
$$\hat{\beta}_{\ell,X} := \left(\sum_{g:D_g=1} X_g X_g^T\right)^{-1}\sum_{g:D_g=1} X_g\left(Y_{g,T_0+\ell} - Y_{g,T_0}\right),$$
the coefficients from a regression of treated groups’ outcome evolutions on $X_g$. If (3.33) holds, meaning that untreated outcome evolutions of treated groups are uncorrelated with the non-constant variables in Xg, then one can show that βbℓ,X,k is unbiased for βℓ,X,k. Interestingly,
βbℓ,X can be computed even if there is no control group such that Dg = 0. Thus, under (3.33)
control groups are not necessary to estimate the correlation between treatment effects and some
covariates, a result that seems to have been first noted in Shahn (2023) and in this textbook.
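A minimal sketch of this estimator, for the case where the only non-constant covariate is binary: the regression among treated groups then reduces to a difference in means, and no control groups are used. Data are illustrative.

```python
from statistics import fmean

def beta_lX_binary(long_diffs, x):
    """OLS of treated groups' long differences Y_{g,T0+l} - Y_{g,T0} on
    (1, x) when x is binary: the slope is a difference in means."""
    d1 = [d for d, xi in zip(long_diffs, x) if xi == 1]
    d0 = [d for d, xi in zip(long_diffs, x) if xi == 0]
    intercept = fmean(d0)
    slope = fmean(d1) - fmean(d0)
    return intercept, slope

# Treated groups only: outcome long differences and a binary covariate.
diffs = [2.0, 3.0, 1.0, 0.5]
x = [1, 1, 0, 0]
b0, b1 = beta_lX_binary(diffs, x)
print(b0, b1)  # → 0.75 1.75
```

With a general $X_g$, one would instead run the multivariate OLS of the long differences on $X_g$ in the treated sample.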
Assessing the plausibility of (3.33), the assumption underlying βbℓ,X . First, note that
under Assumption PT, E [Yg,T0 +ℓ (0T0 +ℓ ) − Yg,T0 (0T0 )] is constant so (3.33) holds: (3.33) is weaker
than the strong parallel-trends condition in Assumption PT. Then, if Xg contains only one non-constant variable and this variable is binary, (3.33) reduces to
$$E\left[\frac{1}{G_{1,1}}\sum_{g:D_g=1,X_{g,2}=1}\left(Y_{g,t}(0_t) - Y_{g,t-1}(0_{t-1})\right)\right] = E\left[\frac{1}{G_{1,0}}\sum_{g:D_g=1,X_{g,2}=0}\left(Y_{g,t}(0_t) - Y_{g,t-1}(0_{t-1})\right)\right],$$
a parallel-trends assumption, between treated groups with Xg,2 = 1 and treated groups with
Xg,2 = 0. That condition is neither stronger nor weaker than the conditional parallel-trends
assumption in (3.32). Finally, for ℓ ≤ T0 −1, (3.33) is placebo testable, by regressing Yg,T0 −ℓ −Yg,T0
on Xg in the sample of treated groups, and testing if the coefficient on Xg is equal to zero.
Let
$$v_\ell^2 := \frac{1}{G_1}\sum_{g:D_g=1}\left(\mathrm{TE}_{g,T_0+\ell} - ATT_\ell\right)^2,$$
the variance of the effects (TEg,T0 +ℓ )g:Dg =1 . Assume that treatment effects Yg,T0 +ℓ (0T0 , 1ℓ ) −
Yg,T0 +ℓ (0T0 +ℓ ) are not random, as in (3.21). Then,
$$\eta_{g,\ell} = Y_{g,T_0+\ell}(0_{T_0+\ell}) - E\left[Y_{g,T_0+\ell}(0_{T_0+\ell})\right] - \left(Y_{g,T_0}(0_{T_0}) - E\left[Y_{g,T_0}(0_{T_0})\right]\right).$$
If one further assumes that V (ηg,ℓ ) does not depend on g, a weaker assumption than (3.23), the
assumption made by Conley and Taber (2011), then it readily follows from (3.14) and (3.15)
that
$$\hat{v}_\ell^2 := \frac{G_1 - 1}{G_1}\,\hat{\sigma}_{\ell,1}^2 - \hat{\sigma}_{\ell,0}^2$$
is unbiased for vℓ2 .14 vbℓ2 is essentially a comparison of the variances of the outcome’s long-
differences in the treatment and control groups.15 Omitting the (G1 − 1)/G1 term, which will
often be close to one, one can use, say, the sdtest Stata command to compute vbℓ and test the
null that vℓ = 0.
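A sketch of these computations in Python, assuming $\hat{\sigma}_{\ell,1}^2$ and $\hat{\sigma}_{\ell,0}^2$ are the sample variances of the outcome long differences in the treatment and control groups, which matches the sdtest description; the decile computation reproduces the normal-distribution calculation reported in the next paragraph.

```python
from statistics import NormalDist, variance

def vhat2(treated_diffs, control_diffs):
    """(G1 - 1)/G1 * sigma2_hat_1 - sigma2_hat_0, the treatment-effect
    variance estimator from the text."""
    G1 = len(treated_diffs)
    return (G1 - 1) / G1 * variance(treated_diffs) - variance(control_diffs)

# Deciles implied by the numbers in the text, if effects were normal:
att, v = 0.642, 0.239
d1 = NormalDist(att, v).inv_cdf(0.1)
d9 = NormalDist(att, v).inv_cdf(0.9)
print(round(d1, 3), round(d9, 3))  # → 0.336 0.948
```

A formal test of $v_\ell = 0$ additionally requires comparing the two variances, e.g. with sdtest in Stata, as in the text.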
We strongly reject the null that $v_{14} = 0$. Moreover, $\hat{v}_{14} = 0.239$, which is more than a third of $\hat{\beta}^{fe}_{14} = 0.642$. If the group-specific effects TEg,T0 +ℓ were normally distributed, their first and
14 A similar result can be obtained with random treatment effects: if $Y_{g,T_0+\ell}(0_{T_0},1_\ell) - Y_{g,T_0+\ell}(0_{T_0+\ell})$ is uncorrelated with $\eta_{g,\ell}$, $\hat{v}_\ell^2$ is consistent, as $G_0, G_1 \to \infty$, for
$$\tilde{v}_\ell^2 := \frac{1}{G_1}\sum_{g:D_g=1} E\left[\left(Y_{g,T_0+\ell}(0_{T_0},1_\ell) - Y_{g,T_0+\ell}(0_{T_0+\ell}) - ATT_\ell\right)^2\right],$$
a treatment-effect variance accounting both for heterogeneous treatment effects between groups, and for heterogeneous treatment effects within groups due to the randomness of $Y_{g,T_0+\ell}(0_{T_0},1_\ell) - Y_{g,T_0+\ell}(0_{T_0+\ell})$.
15 A drawback of $\hat{v}_\ell^2$ is that it can be negative. One could consider instead $\max\{0, \hat{v}_\ell^2\}$, but this latter estimator is upward biased.
80 CHAPTER 3. THE CLASSICAL DID DESIGN
last decile would be respectively 0.336 and 0.948, thus indicating a substantial amount of effect
heterogeneity.
Unbiased estimation of vℓ2 relies on stronger assumptions than unbiased estimation of ATTℓ : the
strong parallel-trends condition in Assumption PT instead of the weaker one in (2.6), and the
assumption that V (ηg,ℓ ) does not depend on g. How could you run a pre-trends test of those
assumptions?
By considering $\hat\sigma_{\ell,1}^2 - \hat\sigma_{\ell,0}^2$, for $\ell \le -1$. This is similar in spirit to a standard pre-trends test, except that we compare the variances of the outcome evolutions in the treatment and control groups prior to the treatment, rather than comparing their averages. Using the moser_voena_didtextbook dataset, we run the following line of code:
sdtest diffpatentswrt1918 if year==1904, by(treatmentgroup),
thus giving us a placebo estimator symmetric to our estimator of $v_{14}$. We strongly reject the null that $\sigma_{-14,1}^2 - \sigma_{-14,0}^2 = 0$, thus casting doubt on Assumption PT and the assumption that $V(\eta_{g,\ell})$ does not depend on $g$. However, for every $\ell \le -1$, we find that $\hat\sigma_{\ell,1}^2 - \hat\sigma_{\ell,0}^2 < 0$, with a difference between the two variances that is large and stable over time. This suggests that for $\ell \ge 1$, our estimators $\hat v_\ell^2$ of the treatment effect variance might be downward biased: perhaps treatment effects are even more heterogeneous than suggested by $\hat v_\ell^2$.
Under Assumption NA and the strong parallel-trends condition in Assumption PT, one can
unbiasedly estimate the group-level treatment effects
$(\mathrm{TE}_{g,T_0+\ell})_{g:D_g=1}$. Under the independent groups framework in Assumption IND, an estimator may be consistent if it averages outcomes from a number of groups that goes to infinity when $G \to \infty$. For any value of $G$, $Y_{g,T_0+\ell} - Y_{g,T_0}$ averages the outcomes of only one group, so $\widehat{\mathrm{TE}}_{g,T_0+\ell}$ is not consistent. Accordingly, using the distribution of those estimators to estimate the distribution of treatment effects across groups would be misleading: roughly speaking, one needs to remove the estimation error from the distribution of $(\widehat{\mathrm{TE}}_{g,T_0+\ell})_{g:D_g=1}$. Arellano and Bonhomme (2012) propose a deconvolution method that can
be applied to recover the density of group-specific treatment effects in DID models, though the
resulting estimator relies on the strong assumption that the treatment effects are independent
of the errors (εg,1 , ..., εg,T ) in (2.5), and will often converge at a slow rate (see Bonhomme and
Robin, 2010).
In this section, to simplify the exposition we assume that $T = 2$, $T_0 = T_1 = 1$, and that Assumption ND holds. Also, for all $(d,t) \in \{0,1\}^2$ and any variable $X_{g,t}$, let $\overline{X}^d_t = \frac{1}{G_d}\sum_{g:D_g=d} X_{g,t}$ denote the average of $X_{g,t}$ across groups with $D_g = d$ at period $t$.
With a binary outcome, the parallel-trends assumption can yield problematic predictions. Researchers are often interested in limited dependent variables, which can only take a limited set of values. To fix ideas, let us assume for now that $Y_{g,t}$ is binary. Under Assumption PT, $\overline{Y}^1_1 + \overline{Y}^0_2 - \overline{Y}^0_1$ is supposed to estimate $E[\overline{Y}^1_2(0)]$, the expected average outcome without treatment in the treatment group at period two. This estimator fails to satisfy a desirable property, which one?
This estimator is not guaranteed to be included between zero and one, while by construction, $E[\overline{Y}^1_2(0)]$ must be included between zero and one. For instance, if the average outcome is equal to 0.8 in the treatment group at period one, and the average outcome increases from 0.4 to 0.7 in the control group, then the treatment group's estimated counterfactual outcome at period two is equal to $0.8 + 0.7 - 0.4 = 1.1$, which is not possible.
Assume instead that
$$L^{-1}\left(E\left[\overline{Y}^1_2(0)\right]\right) - L^{-1}\left(E\left[\overline{Y}^1_1(0)\right]\right) = L^{-1}\left(E\left[\overline{Y}^0_2(0)\right]\right) - L^{-1}\left(E\left[\overline{Y}^0_1(0)\right]\right), \qquad (3.34)$$
for a known, strictly increasing function L, taking values in [0, 1]. Note that if L is the identity
function, then (3.34) is equivalent to Assumption PT. With two pre-treatment periods or more,
one can run pre-trend tests of (3.34), comparing the evolution of $L^{-1}(\text{average outcome})$ in the treatment and control groups, before the treatment starts. Show that $E[\overline{Y}^1_2(0)]$ is identified under (3.34).
$$\begin{aligned}
E\left[\overline{Y}^1_2(0)\right] &= L\left(L^{-1}\left(E\left[\overline{Y}^1_2(0)\right]\right)\right)\\
&= L\left(L^{-1}\left(E\left[\overline{Y}^1_1(0)\right]\right) + L^{-1}\left(E\left[\overline{Y}^1_2(0)\right]\right) - L^{-1}\left(E\left[\overline{Y}^1_1(0)\right]\right)\right)\\
&= L\left(L^{-1}\left(E\left[\overline{Y}^1_1(0)\right]\right) + L^{-1}\left(E\left[\overline{Y}^0_2(0)\right]\right) - L^{-1}\left(E\left[\overline{Y}^0_1(0)\right]\right)\right)\\
&= L\left(L^{-1}\left(E\left[\overline{Y}^1_1\right]\right) + L^{-1}\left(E\left[\overline{Y}^0_2\right]\right) - L^{-1}\left(E\left[\overline{Y}^0_1\right]\right)\right). \qquad (3.35)
\end{aligned}$$
Estimation of the ATT under (3.34). From what precedes, a natural estimator of the ATT is
$$\hat\beta^{fe}_{bin} := \overline{Y}^1_2 - L\left(L^{-1}\left(\overline{Y}^1_1\right) + L^{-1}\left(\overline{Y}^0_2\right) - L^{-1}\left(\overline{Y}^0_1\right)\right).$$
One can show that
$$\hat\beta^{fe}_{bin} = L\left(\hat\alpha_0 + \hat\alpha_1 + \hat\alpha_2 + \hat\alpha_3\right) - L\left(\hat\alpha_0 + \hat\alpha_1 + \hat\alpha_2\right), \qquad (3.36)$$
where $(\hat\alpha_0, \hat\alpha_1, \hat\alpha_2, \hat\alpha_3)$ is the maximum likelihood estimator of a binary choice model assuming $P(Y_{g,t}=1) = L(\alpha_0 + D_g\alpha_1 + 1\{t=2\}\alpha_2 + D_g 1\{t=2\}\alpha_3)$. If $L$ is the logistic (resp. normal) cdf, for instance, $(\hat\alpha_0, \hat\alpha_1, \hat\alpha_2, \hat\alpha_3)$ can be obtained by a logit (resp. probit) regression. Such regressions can readily be run with standard statistical software.
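A minimal sketch of the logistic-link computation, using the cell means from the binary-outcome example above; the observed treated-group mean of 0.95 at period two is a hypothetical value added for illustration:

```python
import math

def logit(p):    # L^{-1} for the logistic link
    return math.log(p / (1 - p))

def expit(x):    # L, the logistic cdf
    return 1 / (1 + math.exp(-x))

# Cell means from the numerical example in the text: treated group at 0.8
# in period one; control group rises from 0.4 to 0.7. Y1_2 = 0.95 is a
# hypothetical observed treated-group mean at period two.
Y1_1, Y0_1, Y0_2, Y1_2 = 0.8, 0.4, 0.7, 0.95

# Linear DID counterfactual: can exceed 1 (here 0.8 + 0.7 - 0.4 = 1.1).
linear_cf = Y1_1 + Y0_2 - Y0_1

# Logistic-link counterfactual from (3.35): always stays in (0, 1).
# Here it equals 14/15, approximately 0.933.
logit_cf = expit(logit(Y1_1) + logit(Y0_2) - logit(Y0_1))

att_bin = Y1_2 - logit_cf   # \hat\beta^{fe}_{bin}
print(round(linear_cf, 3), round(logit_cf, 3), round(att_bin, 3))
```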
Non-negative dependent variables. With non-negative dependent variables, one can use a
Poisson regression of Yg,t on an intercept, Dg , 1{t = 2}, and Dg 1{t = 2} to estimate the ATT.
Wooldridge (2023) shows that this estimator relies on (3.34) with $L^{-1}(\cdot) = \ln(\cdot)$, a parallel-trends assumption on the logarithm of the average untreated outcome.
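With $L^{-1} = \ln$, the counterfactual in (3.35) reduces to a ratio adjustment, which one can compute directly (hypothetical cell means; because a saturated Poisson regression fits the four cell means exactly, this should coincide with the regression-based estimate):

```python
# Cell means of a non-negative outcome (hypothetical numbers).
Y1_1, Y1_2 = 5.0, 9.0   # treated group, periods one and two
Y0_1, Y0_2 = 4.0, 6.0   # control group, periods one and two

# With L^{-1} = ln, the counterfactual from (3.35) is
# exp(ln(Y1_1) + ln(Y0_2) - ln(Y0_1)) = Y1_1 * Y0_2 / Y0_1:
# the treated mean is scaled by the control group's growth factor.
cf = Y1_1 * Y0_2 / Y0_1
att_poisson = Y1_2 - cf
print(cf, att_poisson)   # 7.5 1.5
```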
In this section and in the next, to simplify the exposition we assume that T = 2 and T0 = T1 = 1.
Then, because groups can be treated for at most one period of time, Assumption ND can
be imposed without loss of generality. Also, we adopt a sampling-based perspective, where
the G groups we observe are an independent and identically distributed sample from a larger
population, as assumed in Assumption IID. Then, the g subscript can be dropped, the design is
no longer conditioned upon, and we make the following parallel-trends assumption:
$$E_u[Y_2(0) - Y_1(0)\,|\,D=1] = E_u[Y_2(0) - Y_1(0)\,|\,D=0],$$
where we recall that $E_u[\cdot]$ denotes unconditional expectations, that are not conditional on the design.
The parallel-trends assumption is not invariant to functional form. For instance, it may hold in levels but not in logs, and conversely. A practical consequence is that DID estimators can be sensitive to functional form: for instance, they can give opposite results with $Y$ and $\ln(Y)$. An
early example of this phenomenon is Meyer, Viscusi and Durbin (1995). They use increases in
the benefits workers receive when injured that took place in the 1980s in Kentucky and Michigan,
to study the effect of the benefits’ amount on injury duration. While they do not find an effect
on injuries’ duration, they do find an effect on the log of injuries’ duration.
Building up intuition. Consider a simple numerical example with two periods and two groups, where group one is the control group and group two is the treated group, and $Y_{1,t} = t$ while $Y_{2,t} = 2t$. Then, the DID estimator on $Y_{g,t}$ is equal to
$$2 \times 2 - 2 \times 1 - (1 \times 2 - 1 \times 1) = 1:$$
from period one to two, the outcome increases by one more unit in the treatment than in the
estimator on $\ln(Y_{g,t})$ is equal to
$$\ln(4) - \ln(2) - (\ln(2) - \ln(1)) = 0:$$
the outcome increases by 100% in the treatment and in the control group, which might, equally
reasonably, suggest that the treatment has no effect. Which feature of this numerical example
3.7. NON-LINEAR DID 85
makes DID estimators sensitive to functional form? Can you think of another numerical example
where DID estimators would be insensitive to functional form?
In the numerical example, the sensitivity to functional form comes from the very different outcome levels at period one in the treatment and control groups. If the treatment and control group have the same outcome at period one ($Y_{2,1} = Y_{1,1}$), then the DID estimator on $h(Y_{g,t})$ is equal to
$$h(Y_{2,2}) - h(Y_{2,1}) - \left(h(Y_{1,2}) - h(Y_{1,1})\right) = h(Y_{2,2}) - h(Y_{1,2}),$$
an estimator whose sign is the same for any strictly increasing function $h(.)$. This suggests that
sensitivity to functional form is more likely to be a concern when treated and control groups have
very different pre-treatment outcome levels, as is for instance the case in Meyer et al. (1995),
than when their pre-treatment outcome levels are similar.
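The two numerical examples can be sketched in a few lines (the `did` helper is ours, not from the text):

```python
import math

def did(y, f=lambda x: x):
    """DID estimator on f(Y): (f(Y_{2,2}) - f(Y_{2,1})) - (f(Y_{1,2}) - f(Y_{1,1}))."""
    (y11, y12), (y21, y22) = y   # rows: control group 1, treated group 2
    return (f(y22) - f(y21)) - (f(y12) - f(y11))

# Example from the text: Y_{1,t} = t (control), Y_{2,t} = 2t (treated).
diverging = [(1, 2), (2, 4)]
print(did(diverging), did(diverging, math.log))   # 1 vs approximately 0

# Same period-one levels: the sign is invariant to any increasing f.
same_start = [(1, 2), (1, 5)]
print(did(same_start), did(same_start, math.log))   # both positive
```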
3.7.2.2 A necessary and sufficient condition to have parallel-trends for any functional form
Roth and Sant'Anna (2023) show that parallel trends holds
for any strictly increasing function $h(.)$ if and only if, for all $y \in \mathbb{R}$,
$$P_u(Y_2(0) \le y\,|\,D=1) - P_u(Y_1(0) \le y\,|\,D=1) = P_u(Y_2(0) \le y\,|\,D=0) - P_u(Y_1(0) \le y\,|\,D=0): \qquad (3.38)$$
parallel trends holds for any functional form if and only if the cumulative distribution function (cdf) of the untreated outcome follows parallel trends in the treatment and control groups. (3.38) is a strong condition. As shown by Roth and Sant'Anna (2023), this condition also has a testable implication. It implies that
$$P_u(Y_2(0) \le y\,|\,D=1) = P_u(Y_1(0) \le y\,|\,D=1) + P_u(Y_2(0) \le y\,|\,D=0) - P_u(Y_1(0) \le y\,|\,D=0). \qquad (3.39)$$
The right-hand-side of the previous display is the outcome’s cdf at period one in the treatment
group plus the change in the outcome’s cdf in the control group, a function that is identified and
can be estimated from the data, replacing probabilities by sample proportions. If (3.38) holds,
y 7→ Pu (Y1 ≤ y|D = 1) + Pu (Y2 ≤ y|D = 0) − Pu (Y1 ≤ y|D = 0) identifies the cdf of Y2 (0) in
the treatment group. Then, as a cdf is weakly increasing, y 7→ Pu (Y1 ≤ y|D = 1) + Pu (Y2 ≤
y|D = 0) − Pu (Y1 ≤ y|D = 0) should be weakly increasing in y. But because the increasing
function Pu (Y1 ≤ y|D = 0) enters with a negative sign in it, that function may or may not be
increasing, hence the testability of (3.38). Kim and Wooldridge (2024) propose a test, computed
by the ddid (Kim, 2024) Stata package. When the test is rejected, parallel trends is sensitive
to functional form. An alternative way of assessing the plausibility of (3.38) is by testing if the
same condition holds in pre-treatment periods. However, as the null involves functions rather
than scalars, the testing problem is more complicated than with standard pre-trends test, though
one may be able to construct a Kolmogorov-Smirnov or Cramer-Von Mises test. One could also
combine a pre-trends test of (3.38) with the monotonicity test of (3.39).
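A rough sketch of the monotonicity check on simulated data (hypothetical distributions, chosen so that the implied counterfactual cdf is non-monotonic; this is an empirical-cdf diagnostic in the spirit of the test, not the ddid package's procedure):

```python
import numpy as np

rng = np.random.default_rng(1)

def ecdf(sample, grid):
    """Empirical cdf of `sample` evaluated at each point of `grid`."""
    return (sample[:, None] <= grid[None, :]).mean(axis=0)

# Simulated outcomes (hypothetical): the control group's variance shrinks a
# lot between periods, which makes the implied counterfactual cdf decrease
# over part of its domain.
y1_treat = rng.normal(0, 1, 1000)
y1_ctrl = rng.normal(0, 3, 1000)
y2_ctrl = rng.normal(0, 0.3, 1000)

grid = np.linspace(-6, 6, 200)
cf_cdf = ecdf(y1_treat, grid) + ecdf(y2_ctrl, grid) - ecdf(y1_ctrl, grid)

# A genuine cdf must be weakly increasing; violations cast doubt on (3.38).
violations = np.sum(np.diff(cf_cdf) < 0)
print(violations > 0)
```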
Average treatment effects are also sensitive to functional form. First it is important to
realize that sensitivity to functional form is not a unique feature of the parallel-trends assump-
tion. Average treatment effects are also sensitive to functional form. In general, for a non-linear
and strictly increasing function h(.), Eu [h(Y2 (d))|D = 1] ̸= h (Eu [Y2 (d)|D = 1]). Then, one
may have that Eu [Y2 (1) − Y2 (0)|D = 1] and Eu [h(Y2 (1)) − h(Y2 (0))|D = 1] are of a different
sign. Therefore, a positive DID in levels and a negative one in logs either means that parallel
trends is violated for at least one of the two functional forms, or that parallel trends holds for
both functional forms but ATT > 0 while ATTln := Eu (ln (Y2 (1)) − ln (Y2 (0)) |D = 1) < 0.
Suppose first that the researcher's target parameter is
$$\mathrm{ASE} := E_u\left[\frac{Y_2(1) - Y_2(0)}{Y_2(0)}\,\Big|\,D=1\right],$$
the average across all treated groups of the relative outcome change in response to treatment, the so-called average semi-elasticity. Due to the ratio inside the average, this parameter cannot
be estimated using a standard DID. However, using the fact that Y2 (1)/Y2 (0) − 1 ≈ ln (Y2 (1)) −
ln (Y2 (0)) if Y2 (1)/Y2 (0) − 1 is close to zero, ATTln may provide a reasonable approximation of
ASE. Then, the researcher should use a DID estimator on the log outcome. That estimator is
unbiased if parallel trends holds in logs, even if parallel trends fails for other functional forms.
If instead the researcher wants to estimate the treatment's effect on $h(Y)$, $E_u[h(Y_2(1)) - h(Y_2(0))\,|\,D=1]$, for any strictly increasing function $h(.)$, then parallel trends has to hold for any $h(.)$, and sensitivity to functional form is an issue. Such an estimation goal, while much more ambitious
than just trying to estimate ATT and/or ATTln , may be justified: as will become clear later,
meeting this goal is one way to ensure that one can estimate all quantile treatment effects,
namely how the treatment affects the entire distribution of treated groups’ period-two outcome,
rather than just their average outcome. Sensitivity to functional form is also an issue if the
researcher is indifferent between several functional forms, but pre-trend tests cannot detect
functional forms for which parallel trends fails, either because they lack power, or because one is
not ready to assume that parallel trends prior to treatment imply parallel trends post treatment.
One may also worry that pre-testing for the right functional form could bias the DID estimator computed with the functional form selected by the pre-tests, and could distort inference.
Investigating whether this is a legitimate concern in realistic settings is an interesting avenue for
future research.
Defining quantiles and quantile treatment effects. For any $(d,t) \in \{0,1\} \times \{1,2\}$ and for any generic variable $X$, let $x \mapsto F_{X_t|D=d}(x)$ denote the cdf of $X$ in treatment group $d$ at time $t$. For any weakly increasing function $f$, let $f^{-1}$ denote its generalized inverse, $f^{-1}(x) = \inf\{y : f(y) \ge x\}$. Then, for any $\tau \in [0,1]$, $F^{-1}_{X_t|D=d}(\tau)$ is the quantile of order $\tau$ of $X$ in treatment group $d$ at time $t$. For instance, $F^{-1}_{X_t|D=d}(0.5)$ is the median of $X$ in treatment group $d$ at time $t$. Finally, let
$$\mathrm{QTE}(\tau) = F^{-1}_{Y_2(1)|D=1}(\tau) - F^{-1}_{Y_2(0)|D=1}(\tau)$$
denote the quantile treatment effect (QTE) of order $\tau$ in the treatment group at period two. For instance, $\mathrm{QTE}(0.5) = F^{-1}_{Y_2(1)|D=1}(0.5) - F^{-1}_{Y_2(0)|D=1}(0.5)$ is the difference between the median outcome in the treatment group at period two with and without treatment. One has
$$\mathrm{ATT} = \int_0^1 \mathrm{QTE}(\tau)\,d\tau:$$
the ATT is the average of the QTEs across all quantile orders $\tau$.
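That identity can be checked numerically on simulated data (hypothetical potential outcomes; the quantile grid approximates the integral):

```python
import numpy as np

rng = np.random.default_rng(2)
y0 = rng.exponential(1.0, 100000)   # untreated potential outcomes
y1 = y0 ** 0.5 + 1.0                # treated potential outcomes (heterogeneous effects)

taus = np.linspace(0.005, 0.995, 100)   # grid of quantile orders
qte = np.quantile(y1, taus) - np.quantile(y0, taus)

att = y1.mean() - y0.mean()
print(att, qte.mean())   # the two agree closely
```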
Estimating QTEs under a parallel-trends assumption on the cdf. (3.39) shows that
y 7→ FY2 (0)|D=1 (y), the cdf of the treated group’s period-two untreated outcome, is identified
under the cdf parallel-trends condition in (3.38). As a variable's quantile function is just the generalized inverse of its cdf, if $y \mapsto F_{Y_1(0)|D=1}(y) + F_{Y_2(0)|D=0}(y) - F_{Y_1(0)|D=0}(y)$ is weakly increasing, then it follows from (3.39) that
$$F^{-1}_{Y_2(0)|D=1}(\tau) = \left(F_{Y_1|D=1} + F_{Y_2|D=0} - F_{Y_1|D=0}\right)^{-1}(\tau):$$
QTEs are identified under (3.38). Kim and Wooldridge (2024) propose to use
$$\widehat{\mathrm{QTE}}^{cdf}(\tau) = \hat F^{-1}_{Y_2|D=1}(\tau) - \left(\hat F_{Y_1|D=1} + \hat F_{Y_2|D=0} - \hat F_{Y_1|D=0}\right)^{-1}(\tau)$$
to estimate the QTEs. Those estimators are computed by the ddid Stata package.
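A sketch of that estimator on simulated data in which (3.38) holds by construction (hypothetical normal samples; a grid-based generalized inverse; Python rather than the ddid Stata package):

```python
import numpy as np

rng = np.random.default_rng(3)

def ecdf_on(sample, grid):
    """Empirical cdf of `sample` evaluated at each grid point."""
    return (sample[:, None] <= grid[None, :]).mean(axis=0)

# Hypothetical DGP in which (3.38) holds: both groups start from N(0,1) at
# period one; without treatment, both would shift to N(0.5,1) at period two.
# The treated group's observed period-two outcome adds an effect of 1.
y1_t = rng.normal(0.0, 1.0, 5000)   # treated, period 1
y1_c = rng.normal(0.0, 1.0, 5000)   # control, period 1
y2_c = rng.normal(0.5, 1.0, 5000)   # control, period 2
y2_t = rng.normal(1.5, 1.0, 5000)   # treated, period 2 (with treatment)

grid = np.linspace(-4.0, 6.0, 2000)
cf_cdf = ecdf_on(y1_t, grid) + ecdf_on(y2_c, grid) - ecdf_on(y1_c, grid)

def gen_inverse(cdf_vals, tau):
    """Generalized inverse on the grid: first point where the cdf reaches tau."""
    return grid[np.argmax(cdf_vals >= tau)]

tau = 0.5
qte_hat = np.quantile(y2_t, tau) - gen_inverse(cf_cdf, tau)
print(qte_hat)   # close to the true median effect of 1.0
```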
Alternatively, one may instead assume that
$$F^{-1}_{Y_2(0)|D=1}(\tau) - F^{-1}_{Y_1(0)|D=1}(\tau) = F^{-1}_{Y_2(0)|D=0}(\tau) - F^{-1}_{Y_1(0)|D=0}(\tau). \qquad (3.41)$$
Unlike (3.38), (3.41) is not invariant to functional form. On the other hand, (3.38) needs to hold
for all y. Then, one can estimate the full QTE function τ 7→ QTE(τ ), but that estimation relies
on an assumption on the evolution of the entire distribution of the untreated outcome. Instead,
one may only impose (3.41) for specific quantiles (e.g. the median). Then, one can only estimate
specific QTEs, but now estimation only relies on restrictions on the evolution of specific features
of the distribution of the untreated outcome, in the spirit of the standard DID estimator that
only imposes restrictions on the evolution of the average. An additional advantage of only using
the quantile DID estimator at specific quantiles is that then, pre-trend tests are straightforward:
they do not involve functionals, just vectors of quantile regression coefficients similar to the
aforementioned one, but estimated in the pre-treatment periods.
The changes-in-changes (CIC) estimator (Athey and Imbens, 2006) is
$$\widehat{\mathrm{QTE}}^{cic}(\tau) = \hat F^{-1}_{Y_2|D=1}(\tau) - \hat F^{-1}_{Y_2|D=0}\left(\hat F_{Y_1|D=0}\left(\hat F^{-1}_{Y_1|D=1}(\tau)\right)\right).$$
The identifying assumptions rationalizing that estimator are that for all $t \in \{1,2\}$, $Y_t(0) = h_t(U_t)$ for a strictly increasing function $h_t(.)$, and that for all $d \in \{0,1\}$, $U_2|D=d \sim U_1|D=d$, where $\sim$ denotes equality in distribution. Under those assumptions, the $\tau$th quantile of $Y_t(0)$
may not follow the same evolution in the treatment and in the control group. On the other
hand, those assumptions ensure that if τ and τ ′ are such that at period one, the τ th quantile
in the treatment group corresponds to the τ ′ th quantile in the control group, then without
treatment the period-one-to-two change in the treatment group’s τ th quantile would have been
the same as that of the control group’s τ ′ th quantile. Then, under the CIC assumptions, one
has parallel-trends between the treatment and control group quantiles, not for the same quantile
order τ but for quantile orders τ and τ ′ for which treated and control groups’ outcomes are
equal at period one. Like (3.38), the identifying assumptions underlying the CIC estimator
are invariant to functional form, but the CIC assumptions are not parallel-trends assumptions:
for instance they do not imply that the average untreated outcome of the treated and control
groups are on parallel trends. Like (3.38), the CIC assumptions impose restrictions on the
evolution of the untreated outcome’s entire distribution. The CIC estimators are computed
by the fuzzydid (de Chaisemartin, D’Haultfœuille and Guyonvarch, 2019b) and cic (Kranker,
2019) Stata packages, and by the qte (Callaway, 2023) R package.
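A minimal sketch of the CIC computation on simulated data satisfying the CIC assumptions (hypothetical DGP; Python rather than the Stata and R packages above):

```python
import numpy as np

rng = np.random.default_rng(4)

# DGP satisfying the CIC assumptions: Y_t(0) = h_t(U), with U|D=d stable
# over time. Here h_1(u) = u and h_2(u) = u**2 (strictly increasing on u > 0).
u_c = rng.uniform(1.0, 2.0, 5000)   # control-group U draws
u_t = rng.uniform(1.2, 2.2, 5000)   # treated-group U draws
y1_c, y2_c = u_c, u_c ** 2
y1_t = u_t
y2_t = u_t ** 2 + 1.0               # constant treatment effect of 1

def cic_qte(tau, y1t, y1c, y2c, y2t):
    """CIC quantile treatment effect estimate at order tau."""
    q = np.quantile(y1t, tau)            # tau-th quantile, treated, period 1
    rank_in_ctrl = (y1c <= q).mean()     # its rank in the period-1 control dist
    counterfactual = np.quantile(y2c, rank_in_ctrl)
    return np.quantile(y2t, tau) - counterfactual

qte_cic = cic_qte(0.5, y1_t, y1_c, y2_c, y2_t)
print(qte_cic)   # close to the true effect of 1.0
```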
3.8. INSTRUMENTAL-VARIABLE DID ESTIMATORS∗ 91
Motivation. There are instances where one may prefer making a parallel-trends assumption
with respect to an instrument rather than the treatment. For instance, one may be interested in
estimating the price-elasticity of a good. If prices respond to demand shocks, the counterfactual
consumption trends of units experiencing and not experiencing a price change may not be the
same, so a parallel-trends assumption with respect to prices may not hold. But instead, one can
make a parallel-trends assumption with respect to taxes, used as an instrument for prices.
$$E_u[Y_2(D_2(0)) - Y_1(D_1(0))\,|\,Z=1] = E_u[Y_2(D_2(0)) - Y_1(D_1(0))\,|\,Z=0], \qquad (3.42)$$
$$E_u[D_2(0) - D_1(0)\,|\,Z=1] = E_u[D_2(0) - D_1(0)\,|\,Z=0], \qquad (3.43)$$
and
$$D_2(1) \ge D_2(0). \qquad (3.44)$$
(3.42) requires that on average, instrumented and uninstrumented groups have the same outcome
evolutions from period one to two, in the counterfactual where the instrumented group does not
receive the instrument. (3.42) is a “reduced-form” version of the parallel-trends condition in
(2.9). (3.43) requires that on average, instrumented and uninstrumented groups have the same
treatment evolutions from period one to two, in the counterfactual where the instrumented
group does not receive the instrument. Finally, (3.44) is a monotonicity condition, analogous to
that introduced by Imbens and Angrist (1994). It requires that receiving the instrument cannot
decrease groups’ treatment.
$E_u(Y_2(1) - Y_2(0)\,|\,D_2(1) > D_2(0), Z = 1)$, the average treatment effect at period two across compliers in the instrumented group, is identified by a so-called Wald-DID, a ratio whose numerator compares the outcome evolutions of the instrumented and uninstrumented groups, while its denominator compares the treatment evolutions of the two groups. See also Hudson et al. (2015) for a closely-related result. To estimate the Wald-DID, one can replace expectations by sample averages. Equivalently, one can use a so-called 2SLS-TWFE regression of $Y_{g,t}$ on an intercept, $Z_g$, $1\{t=2\}$, and $D_{g,t}$, using $1\{t=2\}Z_g$ as the excluded instrument for $D_{g,t}$.
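The Wald-DID can be sketched as follows on simulated data (hypothetical DGP with a homogeneous treatment effect, so the ratio should recover it):

```python
import numpy as np

rng = np.random.default_rng(5)
G = 20000

z = rng.integers(0, 2, G)   # instrument group indicator
# First stage: the instrument raises period-2 treatment take-up by 0.4.
d1 = np.zeros(G)
d2 = (rng.uniform(size=G) < 0.1 + 0.4 * z).astype(float)
# Outcomes: common trend of 0.3, and a homogeneous treatment effect of 2.
y1 = rng.normal(0.0, 1.0, G)
y2 = y1 + 0.3 + 2.0 * d2 + rng.normal(0.0, 0.1, G)

# Wald-DID: reduced-form DID on the outcome over first-stage DID on the
# treatment, comparing instrumented (Z=1) and uninstrumented (Z=0) groups.
num = (y2 - y1)[z == 1].mean() - (y2 - y1)[z == 0].mean()
den = (d2 - d1)[z == 1].mean() - (d2 - d1)[z == 0].mean()
wald_did = num / den
print(wald_did)   # close to the effect of 2.0
```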
$$\begin{aligned}
&+ E_u\left[Y_2(1) - Y_2(0)\,|\,D_2(0) - D_1(0) = 1, Z=1\right] P_u(D_2(0) - D_1(0) = 1\,|\,Z=1)\\
&- E_u\left[Y_2(1) - Y_2(0)\,|\,D_2(0) - D_1(0) = -1, Z=1\right] P_u(D_2(0) - D_1(0) = -1\,|\,Z=1)\\
&+ E_u\left[Y_2(1) - Y_2(0)\,|\,D_2(0) - D_1(0) = 1, Z=0\right] P_u(D_2(0) - D_1(0) = 1\,|\,Z=0)\\
&- E_u\left[Y_2(1) - Y_2(0)\,|\,D_2(0) - D_1(0) = -1, Z=0\right] P_u(D_2(0) - D_1(0) = -1\,|\,Z=0). \qquad (3.47)
\end{aligned}$$
Under parallel trends on the first stage, the conditions below are sufficient for (3.47) to hold:
$$\forall (z, z', \delta, \delta') \in \{0,1\}^2 \times \{-1,1\}^2,\ E_u[Y_2(1) - Y_2(0)\,|\,D_2(0) - D_1(0) = \delta, Z = z] = E_u[Y_2(1) - Y_2(0)\,|\,D_2(0) - D_1(0) = \delta', Z = z']. \qquad (3.49)$$
(3.48) requires that the LATE of groups treated at period one is constant over time, both in the
instrumented and uninstrumented groups. (3.49) requires that the LATE of groups naturally
switching into treatment over time (D2 (0) − D1 (0) = 1) is the same as the LATE of groups
switching out (D2 (0) − D1 (0) = −1), and that those LATEs are the same in the instrumented
and uninstrumented groups. Conversely, (3.47) may fail if (3.48) or (3.49) fails. Thus, when
combined with (3.46), (3.42) may fail if the treatment effect of groups treated at period one
changes over time, or if switchers in and out have heterogeneous effects, or if switchers have
heterogeneous effects in the instrumented and uninstrumented groups. Then, to have that (3.42)
does not restrict effects’ heterogeneity, (3.46) has to fail: groups have to be on parallel trends
in the counterfactual where they do not receive the instrument, but they have to experience
differential trends in the counterfactual where they remain untreated. Such a scenario might be
hard to rationalize, so we view reduced-form parallel-trends as restricting effects’ heterogeneity.
Let
$$\widetilde{\mathrm{ATT}}{}^{imb}_\ell = \frac{1}{G_{1,T_0+\ell}}\sum_{g:D_g=1,\,O_{g,T_0+\ell}=1} \mathrm{TE}_{g,T_0+\ell}$$
be the analogue of ATTℓ , for the subsample of treatment groups observed at T0 + ℓ. If we ignore
missingness, we estimate a TWFE ES regression in the subsample of (g, t) cells for which Yg,t is
observed. One can show that the coefficient on 1{t = T0 + ℓ}Dg in this naive regression is equal
to
$$\frac{1}{G_{1,T_0+\ell}}\sum_{g:D_g=1,\,O_{g,T_0+\ell}=1} Y_{g,T_0+\ell} - \frac{1}{G_{1,T_0}}\sum_{g:D_g=1,\,O_{g,T_0}=1} Y_{g,T_0} - \left(\frac{1}{G_{0,T_0+\ell}}\sum_{g:D_g=0,\,O_{g,T_0+\ell}=1} Y_{g,T_0+\ell} - \frac{1}{G_{0,T_0}}\sum_{g:D_g=0,\,O_{g,T_0}=1} Y_{g,T_0}\right). \qquad (3.50)$$
Under the parallel-trends assumption in (3.51), the estimator in (3.50) is unbiased for $\widetilde{\mathrm{ATT}}{}^{imb}_\ell$.
However, (3.51) is not an easy-to-rationalize assumption: one could have that Assumption PT
holds, yet (3.51) fails. Intuitively, this is because the average trends in the left- and right-hand
sides of (3.51) mix time trends on the untreated outcomes with composition effects, as missing
groups differ between periods T0 and T0 + ℓ.
3.9. FURTHER TOPICS∗ 95
Alternative estimators. For any ℓ ∈ {1, ..., T1 }, let Dgℓ be an indicator equal to 1 if Dg =
1, Og,T0 = 1, Og,T0 +ℓ = 1, to 0 if Dg = 0, Og,T0 = 1, Og,T0 +ℓ = 1, and missing otherwise, and let
G1,ℓ and G0,ℓ respectively denote the number of groups such that Dgℓ = 1 and Dgℓ = 0. Then, let
$$\mathrm{ATT}^{imb}_\ell = \frac{1}{G_{1,\ell}}\sum_{g:D^\ell_g=1} E\left[Y_{g,T_0+\ell}(0_{T_0},1_\ell) - Y_{g,T_0+\ell}(0_{T_0+\ell})\right]$$
be the analogue of ATTℓ , for the subsample of treatment groups observed at T0 and T0 + ℓ.
Similarly, let
$$\hat\beta^{fe,imb}_\ell = \frac{1}{G_{1,\ell}}\sum_{g:D^\ell_g=1}\left(Y_{g,T_0+\ell} - Y_{g,T_0}\right) - \frac{1}{G_{0,\ell}}\sum_{g:D^\ell_g=0}\left(Y_{g,T_0+\ell} - Y_{g,T_0}\right).$$
3.9.2 Weighting
Weighted TWFE regressions. In many applications, researchers weight their TWFE re-
gression. One can show that the coefficient on 1{t = T0 + ℓ} in the TWFE ES regression in (3.6)
weighted by Wg,t is equal to
$$\hat\beta^{fe,W}_\ell := \sum_{g:D_g=1} W^n_{g,T_0+\ell}Y_{g,T_0+\ell} - \sum_{g:D_g=1} W^n_{g,T_0}Y_{g,T_0} - \left(\sum_{g:D_g=0} W^n_{g,T_0+\ell}Y_{g,T_0+\ell} - \sum_{g:D_g=0} W^n_{g,T_0}Y_{g,T_0}\right),$$
where $W^n_{g,t} = W_{g,t}/\sum_{g':D_{g'}=D_g} W_{g',t}$ are normalized weights summing to one at each date in the
treatment and in the control group. Under the parallel-trends condition in Assumption PT, do we have that $\hat\beta^{fe,W}_\ell$ is unbiased for $\mathrm{ATT}^W_\ell := \sum_{g:D_g=1} W^n_{g,T_0+\ell}\mathrm{TE}_{g,T_0+\ell}$, a weighted version of $\mathrm{ATT}_\ell$?
If the weights are time-varying, $\hat\beta^{fe,W}_\ell$ may not be unbiased for $\mathrm{ATT}^W_\ell$ under the parallel-trends condition in Assumption PT. For unbiasedness, one instead needs a weighted parallel-trends condition:
$$E\left[\sum_{g:D_g=1} W^n_{g,T_0+\ell}Y_{g,T_0+\ell}(0_{T_0+\ell}) - \sum_{g:D_g=1} W^n_{g,T_0}Y_{g,T_0}(0_{T_0})\right] = E\left[\sum_{g:D_g=0} W^n_{g,T_0+\ell}Y_{g,T_0+\ell}(0_{T_0+\ell}) - \sum_{g:D_g=0} W^n_{g,T_0}Y_{g,T_0}(0_{T_0})\right]. \qquad (3.53)$$
On the other hand, if the weights are time-invariant, $\hat\beta^{fe,W}_\ell$ does compare a weighted average of outcome evolutions $Y_{g,T_0+\ell} - Y_{g,T_0}$ in the treatment and in the control group, and it is unbiased for $\mathrm{ATT}^W_\ell$ under Assumption PT. Overall, weighting changes the effect that the regression estimates, and it may change the identifying assumption underlying it if the weights are time-varying.
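A sketch of the weighted DID with time-invariant weights, normalized within the treated and control groups (hypothetical simulated panel):

```python
import numpy as np

rng = np.random.default_rng(6)
G = 500
d = (np.arange(G) < 100).astype(float)   # 100 treated, 400 control groups
w = rng.uniform(1.0, 10.0, G)            # time-invariant weights (e.g. population)
te = rng.normal(1.0, 0.5, G) * d         # heterogeneous treatment effects
y_pre = rng.normal(0.0, 1.0, G)
y_post = y_pre + 0.2 + te + rng.normal(0.0, 0.05, G)   # common trend of 0.2

def wmean(x, weights):
    """Weighted mean, i.e. an average with weights normalized to sum to one."""
    return np.sum(weights * x) / np.sum(weights)

# Weighted DID: weights normalized separately within the treated and the
# control group (here they are the same at both dates, being time-invariant).
beta_w = (wmean(y_post[d == 1], w[d == 1]) - wmean(y_pre[d == 1], w[d == 1])) \
       - (wmean(y_post[d == 0], w[d == 0]) - wmean(y_pre[d == 0], w[d == 0]))

att_w = wmean(te[d == 1], w[d == 1])     # weighted ATT target
print(beta_w, att_w)   # close to each other
```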
Two arguments in favor of weighting. To fix ideas, suppose one has panel data of Malian communes, and one considers weighting the regression by $W_{g,t}$, the population of commune $g$ at period $t$. $\mathrm{ATT}_\ell$ is the average effect of $\ell$ periods of exposure to treatment across Malian communes, where each commune receives the same weight. $\mathrm{ATT}^W_\ell$ is the average effect of $\ell$ periods of exposure to treatment across the Malian population, because each commune is weighted by its population.
A cautionary note on weighting. Weighting also has one drawback: it has been shown,
theoretically and through simulations, that with weighted regressions, confidence intervals re-
lying on asymptotic approximations need larger sample sizes to be reliable (see e.g. Cameron
and Miller, 2015; Carter, Schnepel and Steigerwald, 2017). This is especially true when a few
groups receive a very large weight, as is likely to be the case when one weights by population.
Then, a few groups have a disproportionate amount of influence on the regression coefficients,
and the regression’s effective sample size is much smaller than its nominal sample size. Carter
et al. (2017) propose a measure of the regression’s effective sample size G∗ , computed by the
clusteff Stata package (Lee and Steigerwald, 2018). They recommend relying on asymptotic
approximations only if G∗ is larger than 50. As we saw earlier in this chapter, in the DID con-
text the reliability of asymptotic approximations depends on the numbers of treated and control
groups, not just on the total number of groups. To our knowledge, measures of the effective
number of treated and control groups have not been proposed yet. Doing so is an interesting
avenue for future research.
When the weights are affected by the treatment, we recommend using time-invariant
weights, determined prior to treatment. Sometimes, the weighting variable may be af-
fected by the treatment. Typically, Wg,t is the population of group g at time t, and the treatment
may affect that population, something one can assess by estimating an unweighted TWFE ES
regression, with Wg,t as the outcome variable. When Wg,t is affected by the treatment, (3.53)
might be implausible. Let $W^n_{g,t}(0)$ and $W^n_{g,t}(1)$ denote counterfactual values of $W^n_{g,t}$ without and with treatment. To rationalize (3.53), one for instance needs to assume that
$$E\left[\sum_{g:D_g=1} W^n_{g,T_0+\ell}(0)Y_{g,T_0+\ell}(0_{T_0+\ell}) - \sum_{g:D_g=1} W^n_{g,T_0}(0)Y_{g,T_0}(0_{T_0})\right] = E\left[\sum_{g:D_g=0} W^n_{g,T_0+\ell}(0)Y_{g,T_0+\ell}(0_{T_0+\ell}) - \sum_{g:D_g=0} W^n_{g,T_0}(0)Y_{g,T_0}(0_{T_0})\right], \qquad (3.54)$$
and that
$$E\left[\sum_{g:D_g=1}\left(W^n_{g,T_0+\ell}(1) - W^n_{g,T_0+\ell}(0)\right)Y_{g,T_0+\ell}(0_{T_0+\ell})\right] = 0. \qquad (3.55)$$
While (3.54) can be placebo tested by computing weighted pre-trend estimators, (3.55) cannot
be placebo tested, thus making (3.53) not fully placebo-testable, and at odds with standard DID
logic. Therefore, when the weights are affected by the treatment, we recommend weighting the
regression by time-invariant weights determined before treatment. For instance, one can weight
the regression by Wg,1 , or by Wg,T0 .
Geographic spillovers to control groups close to treated ones. There are many appli-
cations where the effect of the treatment might spill over onto untreated groups geographically
close to a treated group. For example, if a county increases its minimum wage, individuals from
contiguous counties may decide to go work there to benefit from the higher minimum wage.
Similarly, if a coal-fired power plant adopts an emissions-control technology, this will reduce air
pollution in the county where the plant is located, but as air pollution travels this can also reduce
3.9. FURTHER TOPICS∗ 99
air pollution in neighboring counties downwind of the treated one. This leads to a violation of
the SUTVA assumption in (2.1). Then, a treated unit may experience a direct treatment effect, arising from its own treatment, but it may also experience an indirect effect, arising from the treatment of treated units located close to it. The sum of such direct and indirect effects is
sometimes referred to as the treatment’s total effect. If a parallel-trends assumption holds but
SUTVA fails, Clarke (2017) and Butts (2021c) show that a TWFE regression that ignores such
violations will not estimate the average total effect of the treatment across all treated groups.
Instead, this regression will estimate the average total effect, minus the proportion of untreated
groups that are affected by treated groups’ treatment, times the average indirect effect of the
treatment across those “affected” untreated groups. Assuming that both effects are of the same
sign but that the indirect effect is closer to zero than the total effect, the TWFE regression suffers from an attenuation bias. If one is ready to assume that a subset of untreated groups is not affected by treatment groups' treatment, for instance because they are geographically far from all treated groups, then the population can be partitioned into treated groups, affected untreated
groups, and unaffected untreated groups. Under a parallel-trends assumption, Butts (2021c)
shows that a TWFE regression restricting the sample to treated and to unaffected untreated
groups estimates the average total effect of the treatment across all treated groups. This ratio-
nalizes the common approach in applied work of excluding groups located close to the treatment
area. Then, Butts (2021c) also shows that a TWFE regression restricting the sample to affected
untreated groups and to unaffected untreated groups estimates the average indirect effect of the
treatment across all affected untreated groups.
Market spillovers to substitute or complement products. There are also many applications where a treatment affects a product A, and the treatment's effect spills over onto an untreated product B that is a substitute or a complement of A. For instance, product A
might start receiving a subsidy, thus leading producers to decrease its price. This may reduce
demand for and prices of substitute product B. A commonly used strategy in such instances
is to move the analysis at the market rather than at the product level, though this also shifts
the research question to measuring the aggregate, market-level effects of the subsidy. If some
untreated markets are observed and one can reasonably rule out spillovers across markets, then
to estimate the treatment’s direct effect one can use a DID estimator comparing the evolution of
demand for and/or price of product A in treated and untreated markets. Similarly, to estimate
the spillover effect one can use a DID estimator comparing the evolution of demand for and/or
price of product B in treated and untreated markets.
de Chaisemartin and D’Haultfœuille (forthc.) reviewed the 100 most cited papers published by
the American Economic Review from 2015 to 2019, and found that 26 have estimated at least
one TWFE regression. Of those 26 papers, only two have an absorbing and binary treatment
with no variation in treatment timing. This suggests that classical DID designs are more the
exception than the norm, and the results we have seen so far only apply to a minority of the
research designs encountered by social scientists analyzing natural experiments. Accordingly, in
the following chapters we will consider more complex designs. We will show that in such designs,
TWFE regressions rely on much stronger assumptions than in the classical design, which will
lead us to consider more robust estimators. But before considering more complex designs, in the
next chapter we consider relaxations of the parallel-trends assumption in the classical design.
3.11 Appendix∗
One has
$$\begin{aligned}
\mathrm{ATT}_2 &= \frac{1}{G_1}\sum_{g:D_g=1} E\left[Y_{g,T_0+2}(0_{T_0},1,1) - Y_{g,T_0+2}(0_{T_0+2})\right]\\
&= \frac{1}{G_1}\sum_{g:D_g=1} E\left[Y_{g,T_0+2}(0_{T_0},0,1) - Y_{g,T_0+2}(0_{T_0+2})\right] + \frac{1}{G_1}\sum_{g:D_g=1} E\left[Y_{g,T_0+2}(0_{T_0},1,1) - Y_{g,T_0+2}(0_{T_0},0,1)\right],
\end{aligned}$$
while
$$\mathrm{ATT}_1 = \frac{1}{G_1}\sum_{g:D_g=1} E\left[Y_{g,T_0+1}(0_{T_0},1) - Y_{g,T_0+1}(0_{T_0+1})\right].$$
Therefore, $\mathrm{ATT}_2 > \mathrm{ATT}_1$ can arise because
$$\frac{1}{G_1}\sum_{g:D_g=1} E\left[Y_{g,T_0+2}(0_{T_0},1,1) - Y_{g,T_0+2}(0_{T_0},0,1)\right] > 0,$$
meaning that being treated for two periods produces a larger effect than being treated for one period, or
$$\frac{1}{G_1}\sum_{g:D_g=1} E\left[Y_{g,T_0+2}(0_{T_0},0,1) - Y_{g,T_0+2}(0_{T_0+2})\right] > \frac{1}{G_1}\sum_{g:D_g=1} E\left[Y_{g,T_0+1}(0_{T_0},1) - Y_{g,T_0+1}(0_{T_0+1})\right],$$
meaning that the effect of being treated for one period is larger at $T_0+2$ than at $T_0+1$.
We do not give here the most general statement of this theorem, but this version is sufficient for
our purposes. The second equality below, which is less commonly presented, is used at several
places in the book.
Theorem 4 Let θb1 be the coefficient on a scalar variable Xi,1 , in an ordinary least squares
regression of a scalar variable (Yi )1≤i≤n on a vector of K +1 variables (1, Xi,1 , Xi,2 , ..., Xi,K )1≤i≤n .
One has that
$$\hat\theta_1 = \frac{\sum_{i=1}^n \hat u_i Y_i}{\sum_{i=1}^n \hat u_i^2} = \frac{\sum_{i=1}^n \hat u_i Y_i}{\sum_{i=1}^n \hat u_i X_{i,1}},$$
where ubi is the residual from a regression of (Xi,1 )1≤i≤n on (1, Xi,2 , ..., Xi,K )1≤i≤n .
Proof. Let $\hat\theta = (\hat\theta_0, ..., \hat\theta_K)'$ denote the OLS estimator of the regression of $(Y_i)_{1\leq i\leq n}$ on $(X_i)_{1\leq i\leq n}$, with $X_i := (1, X_{i,1}, ..., X_{i,K})'$, and let $(\hat\varepsilon_i)_{1\leq i\leq n}$ be the corresponding residuals. Then,
$$Y_i = \hat\theta_0 + X_{i,1}\hat\theta_1 + \sum_{j=2}^K X_{i,j}\hat\theta_j + \hat\varepsilon_i. \quad (3.56)$$
As a residual, $\hat u_i$ is uncorrelated with the regressors $(1, X_{i,2}, ..., X_{i,K})'$. Hence, for all $j = 2, ..., K$,
$$\sum_{i=1}^n \hat u_i = \sum_{i=1}^n \hat u_i X_{i,j} = 0. \quad (3.57)$$
Also as a residual, $\hat\varepsilon_i$ is uncorrelated with any linear combination of the regressors $X_i$. Since $\hat u_i$ is one such combination,
$$\sum_{i=1}^n \hat u_i \hat\varepsilon_i = 0. \quad (3.58)$$
Multiplying (3.56) by $\hat u_i$, summing over $i$, and using (3.57) and (3.58) yields $\sum_{i=1}^n \hat u_i Y_i = \hat\theta_1 \sum_{i=1}^n \hat u_i X_{i,1}$: the second equality of the theorem follows. To obtain the first, let $\hat X_{i,1}$ denote the prediction of $X_{i,1}$ from the regression of $(X_{i,1})_{1\leq i\leq n}$ on $(1, X_{i,2}, ..., X_{i,K})_{1\leq i\leq n}$, so that $X_{i,1} = \hat X_{i,1} + \hat u_i$. Since $\hat X_{i,1}$ is a linear combination of $(1, X_{i,2}, ..., X_{i,K})$, we obtain, from (3.57),
$$\sum_{i=1}^n \hat u_i X_{i,1} = \sum_{i=1}^n \hat u_i (\hat X_{i,1} + \hat u_i) = \sum_{i=1}^n \hat u_i^2.$$
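Theorem 4 is easy to check numerically. The sketch below (Python with NumPy, on made-up data; purely an illustration, not from the text) verifies that the coefficient on $X_{i,1}$ from the full regression coincides with both ratios in the theorem.

```python
import numpy as np

rng = np.random.default_rng(0)
n, K = 200, 3
X = rng.normal(size=(n, K))          # X[:, 0] plays the role of X_{i,1}
Y = 1.0 + 2.0 * X[:, 0] - 0.5 * X[:, 1] + 0.3 * X[:, 2] + rng.normal(size=n)

# Full regression of Y on (1, X_1, ..., X_K)
Z = np.column_stack([np.ones(n), X])
theta1_full = np.linalg.lstsq(Z, Y, rcond=None)[0][1]

# Residual u_hat from regressing X_1 on (1, X_2, ..., X_K)
W = np.column_stack([np.ones(n), X[:, 1:]])
u_hat = X[:, 0] - W @ np.linalg.lstsq(W, X[:, 0], rcond=None)[0]

# The two expressions in Theorem 4
theta1_a = (u_hat @ Y) / (u_hat @ u_hat)
theta1_b = (u_hat @ Y) / (u_hat @ X[:, 0])

print(theta1_full, theta1_a, theta1_b)  # all three coincide
```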
We prove (3.15), by proving that for any independent but not necessarily identically distributed variables $(U_g)_{g=1,...,G}$,
$$E\left[\frac{1}{G-1}\sum_{g=1}^G \left(U_g - \overline U\right)^2\right] = \frac{1}{G}\sum_{g=1}^G V(U_g) + \frac{1}{G-1}\sum_{g=1}^G \left(E(U_g) - E(\overline U)\right)^2. \quad (3.59)$$
First,
$$\frac{1}{G-1}\sum_{g=1}^G \left(U_g - \overline U\right)^2 = \frac{1}{G-1}\sum_{g=1}^G \left[U_g - E(\overline U) + E(\overline U) - \overline U\right]^2 = \frac{1}{G-1}\sum_{g=1}^G \left[U_g - E(\overline U)\right]^2 - \frac{G}{G-1}\left[\overline U - E(\overline U)\right]^2. \quad (3.60)$$
Moreover, by independence of the $(U_g)_{g=1,...,G}$,
$$\frac{G}{G-1} E\left[\left(\overline U - E(\overline U)\right)^2\right] = \frac{G}{G-1} V(\overline U) = \frac{1}{G(G-1)}\sum_{g=1}^G V(U_g). \quad (3.61)$$
Besides,
$$E\left[\left(U_g - E(\overline U)\right)^2\right] = E\left[\left(U_g - E(U_g)\right)^2\right] + \left(E(U_g) - E(\overline U)\right)^2 + 2\left(E(U_g) - E(\overline U)\right)E\left[U_g - E(U_g)\right] = V(U_g) + \left(E(U_g) - E(\overline U)\right)^2.$$
Taking expectations in (3.60) and combining with (3.61) and the previous display yields (3.59).
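The identity (3.59) can also be illustrated by Monte Carlo. The snippet below (Python/NumPy; the means and variances are arbitrary illustrative choices) compares a simulated estimate of the expected sample variance with the right-hand side of (3.59).

```python
import numpy as np

rng = np.random.default_rng(42)
G, reps = 4, 200_000
means = np.array([0.0, 1.0, 2.0, 5.0])   # E(U_g), illustrative
sds = np.array([1.0, 2.0, 0.5, 1.0])     # sqrt(V(U_g)), illustrative

# reps independent draws of (U_1, ..., U_G), U_g ~ N(means[g], sds[g]^2)
U = rng.normal(means, sds, size=(reps, G))

# Left-hand side of (3.59): E[sample variance of the U_g], estimated by MC
lhs = U.var(axis=1, ddof=1).mean()

# Right-hand side: (1/G) sum_g V(U_g) + (1/(G-1)) sum_g (E(U_g) - E(U_bar))^2
rhs = (sds**2).mean() + ((means - means.mean())**2).sum() / (G - 1)

print(lhs, rhs)  # both close to 6.23
```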
Chapter 4
Alternatives to Parallel Trends
Throughout this chapter, we assume that we are in Design CLA: $D_{g,t} = 1\{t > T_0\}D_g$, as we have assumed in Chapter 3. Unless otherwise noted, we also assume that Assumption ND holds: the treatment does not have dynamic effects. Then, $\mathrm{TE}_{g,t}$ reduces to the ATE in cell $(g, t)$. That restriction is not essential and simplifies the exposition. Unless otherwise indicated, in this section we assume that the data contains only two time periods: $T = 2$, $T_0 = T_1 = 1$.
4.1 TWFE and DID estimators with control variables
Identifying assumption: conditional parallel trends. Let $X_g$ denote a $K \times 1$ vector of control variables. When groups with different values of $X_g$ may experience different trends without treatment, one may want to replace Assumption PT (which, under Assumption ND, reduces to (2.2)) by the following condition.
Assumption CPT (Conditional parallel trends) There exists a function m : x 7→ m(x) such
that E [Yg,2 (0) − Yg,1 (0)] = m(Xg ).
Under Assumption CPT, groups with the same value of their control variables have the same expected evolution of their untreated outcome:
$$X_{g_1} = X_{g_2} \Rightarrow E\left[Y_{g_1,2}(0) - Y_{g_1,1}(0)\right] = E\left[Y_{g_2,2}(0) - Y_{g_2,1}(0)\right].$$
But under Assumption CPT, groups with different values of their control variables may have
different expected evolutions of their untreated outcomes. Thus, Assumption CPT may be more
plausible than Assumption PT. It has for instance been considered by Heckman, Ichimura and
Todd (1997), Blundell et al. (2004), Abadie (2005), and Sant’Anna and Zhao (2020).
Conditional parallel-trends with a linear functional form. One could also strengthen
Assumption CPT, assuming that m : x 7→ m(x) is linear:
Assumption LPT (Conditional parallel trends, with a linear functional form) There exists a
real number γ2 and a K × 1 vector θ such that E [Yg,2 (0) − Yg,1 (0)] = γ2 + Xg′ θ.
With Xg = Xg,1 , Assumption CPT means that groups with the same baseline levels of the
covariates experience parallel trends. With Xg = Xg,2 − Xg,1 , Assumption CPT means that
groups with the same changes of the covariates experience parallel trends. Whether Assumption CPT is more plausible with levels and/or first differences is context-specific, and depends on whether differential trends between groups are more likely to be due to differences in the levels of their covariates, to changes of their covariates, or to both. Still, there is one important difference between controlling for covariates in levels or in first differences. The requirement that
the covariates be unaffected by the treatment is fairly plausible when one controls for Xg,1 : the
treatment is not available yet at period one, so this is only ruling out anticipation effects on
the covariates. On the other hand, when one controls for Xg,2 − Xg,1 , this requirement becomes
stronger: Xg,2 could be affected by the treatment, in which case controlling for Xg,2 − Xg,1 could
lead to a so-called “bad controls” problem (see Section 3.2.3 of Angrist and Pischke, 2009).
Then, researchers controlling for covariates in first difference need to be able to plausibly rule
out treatment effects on covariates.
A TWFE regression with control variables. Assume that one controls for Xg = Xg,1 , and
one is ready to assume that Assumption LPT holds. Propose a TWFE regression to estimate
the ATT.
$$Y_{g,t} = \sum_{g'=1}^G \hat\alpha_{g'} 1\{g = g'\} + \hat\gamma_2 1\{t = 2\} + X_{g,1}'\hat\theta\, 1\{t = 2\} + \hat\beta_X^{fe} D_{g,t} + \hat\epsilon_{g,t}. \quad (4.1)$$
With $T = 2$, the regressions in (4.1) and (3.1) are very similar, except that by including $X_{g,1}'\hat\theta\,1\{t = 2\}$, the one in (4.1) allows groups with different values of $X_{g,1}$ to experience different trends. In designs with a binary treatment and no variation in treatment timing, we have seen that $\hat\beta^{fe}$ is unbiased for the ATT under Assumption PT. Then, it would seem natural to assume that in the regression in (4.1), $\hat\beta_X^{fe}$ is similarly unbiased for the ATT under Assumption LPT. Unfortunately, this is not always the case.
$\hat\beta_X^{fe}$ may be biased if treatment effects are heterogeneous and correlated with the covariates. Assume that $X_{g,1}$ is of dimension one. Let $\mu_{X,1} = \frac{1}{G_1}\sum_{g:D_g=1} X_{g,1}$, $\mu_{X,0} = \frac{1}{G_0}\sum_{g:D_g=0} X_{g,1}$, $\sigma_{X,1}^2 = \frac{1}{G_1}\sum_{g:D_g=1}\left(X_{g,1} - \mu_{X,1}\right)^2$, and $\sigma_{X,0}^2 = \frac{1}{G_0}\sum_{g:D_g=0}\left(X_{g,1} - \mu_{X,0}\right)^2$ respectively denote the average and the variance of the covariate in the treatment and in the control group. Let $w_1 = \frac{G_1\sigma_{X,1}^2}{G_0\sigma_{X,0}^2 + G_1\sigma_{X,1}^2}$, and let $\beta_X$ be the coefficient on $X_{g,1}$ from an unfeasible regression of $\mathrm{TE}_{g,2}$ on $X_{g,1}$ in the treatment group.
Theorem 5 In Design CLA, if Assumptions NA and LPT hold and $X_{g,1}$ is of dimension one,
$$E\left[\hat\beta_X^{fe}\right] = \mathrm{ATT} - w_1\left(\mu_{X,1} - \mu_{X,0}\right)\beta_X = \sum_{g:D_g=1} W_g\, \mathrm{TE}_{g,2}, \quad \text{where } W_g := \frac{1}{G_1} - w_1\left(\mu_{X,1} - \mu_{X,0}\right)\frac{X_{g,1} - \mu_{X,1}}{G_1\sigma_{X,1}^2}.$$
The first equality of Theorem 5, which to our knowledge was first shown in this textbook, shows that if treatment effects are heterogeneous and correlated with the baseline covariate $X_{g,1}$ ($\beta_X \neq 0$), and if $X_{g,1}$ is correlated with the treatment group indicator $D_g$ ($\mu_{X,1} - \mu_{X,0} \neq 0$), then $\hat\beta_X^{fe}$ is biased for the ATT. The proof of that first equality is given in this chapter's appendix. Intuitively, the problem comes from the fact that (4.1) tries to do two things at the same time: estimate $\theta$, the "effect" of $X_{g,1}$ on the evolution of the untreated outcome, and estimate the treatment's effect. As (4.1) is estimated on the full sample of treated and untreated groups, $\hat\theta$ captures not only the correlation between $X_{g,1}$ and the evolution of the untreated outcome, but also the correlation between $X_{g,1}$ and the treatment's effect. Then, $\hat\theta$ is biased for $\theta$, which ultimately biases $\hat\beta_X^{fe}$.
A decomposition of $\hat\beta_X^{fe}$ as a weighted sum of treatment effects where weights sum to one but may not all be positive. The second equality of Theorem 5 shows that $\hat\beta_X^{fe}$
is unbiased for a linear combination of the effects $\mathrm{TE}_{g,t}$ across all treatment groups, where the loadings multiplying the treatment effects sum to one but may not all be positive. Throughout this textbook, linear combinations of effects with loadings summing to one are referred to as "weighted sums of effects", while linear combinations of effects with positive loadings summing to one are referred to as "weighted averages of effects" or "convex combinations of effects". This second equality directly follows from the first equality in the theorem, once noted that $\beta_X$ is a weighted sum of the effects $\mathrm{TE}_{g,2}$, with weights $\frac{X_{g,1} - \mu_{X,1}}{G_1\sigma_{X,1}^2}$ that sum to zero, while the ATT is a weighted average of the same effects with weights $\frac{1}{G_1}$ summing to one. This second equality is a special case of Theorem S4 in the Web Appendix of de Chaisemartin and D'Haultfœuille (2020). That result shows that beyond the special case considered here with a single time-invariant covariate, under a parallel-trends assumption TWFE regressions with covariates always estimate weighted sums of effects where weights sum to one, but where some weights may be negative. Due to the negative weights, one could have that $\mathrm{TE}_{g,t} \geq 0$ for all $(g, t)$ but $E\left[\hat\beta_X^{fe}\right] < 0$: the TWFE regression does not satisfy the so-called no-sign-reversal property, an issue we will return to in the following chapter.
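To make the decomposition concrete, the sketch below (Python/NumPy, with illustrative made-up numbers) builds a noiseless two-period DGP satisfying Assumption LPT in which treatment effects are correlated with the covariate, runs the TWFE regression (4.1), and checks that the coefficient equals a weighted sum of treated groups' effects — with Frisch-Waugh weights summing to one — that differs from the ATT.

```python
import numpy as np

# Noiseless two-period DGP satisfying Assumption LPT:
# Y_g1 = alpha_g,  Y_g2 = alpha_g + gamma_2 + theta * X_g + TE_g * D_g.
# All parameter values below are illustrative, not from the text.
D = np.array([1, 1, 1, 0, 0, 0])
X = np.array([1.0, 2.0, 4.0, 0.0, 1.0, 3.0])
TE = np.array([1.0, 2.0, 3.5, 0.0, 0.0, 0.0])   # effects correlated with X
G = len(D)
alpha, gamma2, theta = np.arange(G, dtype=float), 1.0, 0.5
Y1 = alpha
Y2 = alpha + gamma2 + theta * X + TE * D
ATT = TE[D == 1].mean()

# TWFE regression (4.1): Y_gt on group FEs, 1{t=2}, X_g*1{t=2}, D_gt
Y = np.concatenate([Y1, Y2])
post = np.concatenate([np.zeros(G), np.ones(G)])
fe = np.vstack([np.eye(G), np.eye(G)])
Z = np.column_stack([fe, post, np.tile(X, 2) * post, np.tile(D, 2) * post])
beta_fe = np.linalg.lstsq(Z, Y, rcond=None)[0][-1]

# Frisch-Waugh weights: residual of D on (1, X) across groups
W = np.column_stack([np.ones(G), X])
u = D - W @ np.linalg.lstsq(W, D.astype(float), rcond=None)[0]
w = u[D == 1] / u[D == 1].sum()      # weights on treated effects, sum to one

print(beta_fe, (w * TE[D == 1]).sum(), ATT)
```

In this example the coefficient and the weighted sum coincide, and both differ from the ATT because groups with a larger covariate (and larger effects) receive less weight.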
Linear outcome regression. The TWFE regression with control variables is biased because it tries to do too many things at the same time (jointly estimate the effect of the controls $X_g$ on the untreated outcome's evolution and the treatment's effect). Instead, one can use a two-step estimation procedure, inspired by Heckman et al. (1997), where in a first step one estimates $\theta$ in Assumption LPT, before estimating the treatment effect. In the first step, which regression should we run to estimate $\theta$?
We can regress $Y_{g,2} - Y_{g,1}$ on the covariates, restricting the sample to groups such that $D_g = 0$:
$$Y_{g,2} - Y_{g,1} = \hat\gamma_2^{or} + X_g'\hat\theta^{or} + \hat\epsilon_g, \quad \text{for } g: D_g = 0. \quad (4.2)$$
Let
$$\mathrm{DID}_{X,or} = \frac{1}{G_1}\sum_{g:D_g=1}\left(Y_{g,2} - Y_{g,1} - (\hat\gamma_2^{or} + X_g'\hat\theta^{or})\right).$$
Here is the intuition underlying $\mathrm{DID}_{X,or}$. Under Assumption LPT, groups' outcome evolution without treatment is a linear function of the control variables. We estimate the coefficients of that linear function by regressing control groups' outcome evolution on the covariates. We then compute $\hat\gamma_2^{or} + X_g'\hat\theta^{or}$, groups' predicted outcome evolution without treatment, based on that regression. Finally, $\mathrm{DID}_{X,or}$ subtracts from treated groups' actual outcome evolution their predicted outcome evolution without treatment, to recover their treatment effect.
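The two steps above can be sketched in a few lines (Python/NumPy, on a noiseless illustrative DGP satisfying Assumption LPT, so the estimator recovers the ATT exactly).

```python
import numpy as np

# Stylized noiseless DGP (illustrative values, not from the text)
D = np.array([1, 1, 1, 0, 0, 0])
X = np.array([1.0, 2.0, 4.0, 0.0, 1.0, 3.0])
TE = np.array([1.0, 2.0, 3.5, 0.0, 0.0, 0.0])
gamma2, theta = 1.0, 0.5
dY = gamma2 + theta * X + TE * D          # Y_g2 - Y_g1 under Assumption LPT

# Step 1: regress outcome evolutions on (1, X) in the control group only
Wc = np.column_stack([np.ones((D == 0).sum()), X[D == 0]])
g2hat, thetahat = np.linalg.lstsq(Wc, dY[D == 0], rcond=None)[0]

# Step 2: subtract treated groups' predicted counterfactual evolution
did_or = (dY[D == 1] - (g2hat + thetahat * X[D == 1])).mean()

print(did_or)  # equals the ATT, (1 + 2 + 3.5)/3, up to rounding
```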
Theorem 6 In Design CLA, if Assumptions NA and LPT hold, $E\left(\mathrm{DID}_{X,or}\right) = \mathrm{ATT}$.

Proof of Theorem 6. Under Assumption LPT, one can show that $\hat\gamma_2^{or}$ and $\hat\theta^{or}$ are unbiased for $\gamma_2$ and $\theta$. Then,
$$
\begin{aligned}
E\left(\mathrm{DID}_{X,or}\right) &= \frac{1}{G_1}\sum_{g:D_g=1}\left[E\left(Y_{g,2}(1) - Y_{g,1}(0)\right) - \left(E\left(\hat\gamma_2^{or}\right) + X_g' E\left(\hat\theta^{or}\right)\right)\right] \\
&= \frac{1}{G_1}\sum_{g:D_g=1} E\left(Y_{g,2}(1) - Y_{g,2}(0)\right) + \frac{1}{G_1}\sum_{g:D_g=1}\left[E\left(Y_{g,2}(0) - Y_{g,1}(0)\right) - \left(\gamma_2 + X_g'\theta\right)\right] \\
&= \mathrm{ATT} + \frac{1}{G_1}\sum_{g:D_g=1}\left[\gamma_2 + X_g'\theta - \left(\gamma_2 + X_g'\theta\right)\right] \\
&= \mathrm{ATT}.
\end{aligned}
$$
The second equality follows from adding and subtracting $Y_{g,2}(0)$ and from the unbiasedness of $\hat\gamma_2^{or}$ and $\hat\theta^{or}$. The third equality follows from the definition of the ATT and Assumption LPT. QED.
Propensity-score reweighting. Under Assumption CPT, one can also use propensity-score reweighting to estimate the ATT. In a first step, one regresses the treatment group indicator $D_g$ on $X_g$, using a logit or probit model. Let $\hat p(X_g)$ denote group $g$'s predicted probability of being a treatment group according to this regression, its so-called propensity score. Then, Abadie (2005) proposes to use
$$\mathrm{DID}_{X,ps} := \frac{1}{G_1}\sum_{g:D_g=1}\left(Y_{g,2} - Y_{g,1}\right) - \frac{1}{G_0}\sum_{g:D_g=0}\frac{\hat p(X_g)}{1 - \hat p(X_g)}\,\frac{G_0}{G_1}\left(Y_{g,2} - Y_{g,1}\right)$$
to estimate the ATT. Intuitively, $\mathrm{DID}_{X,ps}$ compares the outcome evolution of treated and control groups, after reweighting control groups. As $x \mapsto x/(1-x)$ is increasing on $(0, 1)$, the reweighting gives more weight to control groups with a larger value of $\hat p(X_g)$, namely to control groups who, based on their $X_g$, have a larger predicted probability of being treated. Thus, the reweighting gives more weight to control groups that "look like" treatment groups. Actually, one can show that the reweighting ensures that asymptotically, the distribution of $X_g$ is the same in the treatment group and in the reweighted control group, so that $\mathrm{DID}_{X,ps}$ is consistent for the ATT. Importantly, $\mathrm{DID}_{X,ps}$ relies on the assumption that the population propensity score $p(X_g)$ be bounded away from one: $p(X_g) \leq 1 - \epsilon$ for some $\epsilon > 0$, the so-called overlap condition. Groups with an estimated propensity score too close to one can be dropped from the estimation, to decrease the estimator's variance. One can for instance follow the popular rule proposed by Crump, Hotz, Imbens and Mitnik (2009) of trimming treatment and control groups with a propensity score larger than 0.9.
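The reweighting can be sketched as follows (Python/NumPy). To keep the snippet dependency-free, the true propensity score is plugged in rather than a logit estimate of it — a simplification made only for this illustration; the DGP is likewise made up.

```python
import numpy as np

# Stylized DGP where unconditional PT fails but conditional PT holds:
# untreated trends depend on X, and treatment take-up also depends on X.
rng = np.random.default_rng(1)
G = 20_000
X = rng.uniform(size=G)
p = 1 / (1 + np.exp(-(4 * X - 2)))         # true propensity score
D = (rng.uniform(size=G) < p).astype(int)
dY = 2 * X + 1.0 * D                        # trend 2X, constant effect: ATT = 1

G1, G0 = D.sum(), (1 - D).sum()
naive = dY[D == 1].mean() - dY[D == 0].mean()
odds = p[D == 0] / (1 - p[D == 0])
did_ps = dY[D == 1].mean() - (odds * dY[D == 0]).sum() / G1

print(naive, did_ps)  # naive is biased upward; did_ps is close to 1
```

Note that $(1/G_0)\sum_{g:D_g=0}\frac{\hat p}{1-\hat p}\frac{G_0}{G_1}(\cdot) = (1/G_1)\sum_{g:D_g=0}\frac{\hat p}{1-\hat p}(\cdot)$, which is what the code computes.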
Doubly-robust estimation. Sant'Anna and Zhao (2020) propose to use a so-called doubly-robust DID estimator, which combines outcome regression and propensity-score reweighting. Specifically, one of their estimators of the ATT is
$$\mathrm{DID}_{X,dr} := \frac{1}{G_1}\sum_{g:D_g=1}\left(Y_{g,2} - Y_{g,1} - (\hat\gamma_2^{or} + X_g'\hat\theta^{or})\right) - \frac{1}{G_0}\sum_{g:D_g=0}\frac{\hat p(X_g)}{1 - \hat p(X_g)}\,\frac{G_0}{G_1}\left(Y_{g,2} - Y_{g,1} - (\hat\gamma_2^{or} + X_g'\hat\theta^{or})\right),$$
where $(\hat\gamma_2^{or}, \hat\theta^{or})$ are the coefficients from the linear regression in (4.2), and where $\hat p(X_g)$ comes from a probit or logit regression of $D_g$ on $X_g$. Sant'Anna and Zhao (2020) show that $\mathrm{DID}_{X,dr}$ is consistent for the ATT if either $m(X_g)$ follows the linear functional form in Assumption LPT, or the parametric model for the propensity score is correctly specified. Thus, $\mathrm{DID}_{X,dr}$ offers some robustness against misspecification, though it still requires that one of the two parametric models be correctly specified: if $m(X_g)$ does not follow the linear functional form in Assumption LPT and the parametric model for $p(X_g)$ is incorrectly specified, $\mathrm{DID}_{X,dr}$ is inconsistent.
Non-parametric estimator when the control variables take a small number of values.
When Xg takes a small number of values relative to the sample size, it is straightforward to esti-
mate ATT fully non-parametrically. For instance, one can estimate m(Xg ) non-parametrically,
by regressing Yg,2 − Yg,1 on FEs for all values that Xg can take, restricting the sample to groups
such that $D_g = 0$. Then, one uses
$$\mathrm{DID}_{X,or\text{-}np} := \frac{1}{G_1}\sum_{g:D_g=1}\left(Y_{g,2} - Y_{g,1} - \hat m(X_g)\right)$$
to estimate the ATT. This strategy can for instance be used to allow for state-specific trends
in a county-level analysis, or industry-specific trends in a firm-level analysis. Then, a full set of
state or industry FEs needs to be included in the regression in (4.2). The resulting estimator
relies on the assumption that counties in the same state (resp. firms in the same sector) would
have experienced the same outcome evolutions without treatment. On the other hand, counties
in different states (resp. firms in different sectors) are allowed to experience different evolutions.
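A sketch of $\mathrm{DID}_{X,or\text{-}np}$ with a discrete covariate — here a hypothetical state index, with a noiseless illustrative DGP in which untreated trends are state-specific:

```python
import numpy as np

# Illustrative noiseless DGP: delta Y(0) depends only on the state index
state = np.array([0, 0, 1, 1, 2, 2, 0, 1, 2, 2])
D     = np.array([1, 0, 1, 0, 1, 0, 0, 0, 1, 0])
trend = np.array([1.0, -0.5, 2.0])            # state-specific trends
TE    = np.array([2.0, 0, 1.0, 0, 3.0, 0, 0, 0, 2.5, 0])
dY = trend[state] + TE * D

# m-hat(x): mean outcome evolution of control groups in state x
m_hat = {x: dY[(D == 0) & (state == x)].mean() for x in np.unique(state)}
did_np = np.mean([dY[g] - m_hat[state[g]] for g in range(len(D)) if D[g] == 1])

print(did_np)  # equals the ATT: (2 + 1 + 3 + 2.5)/4
```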
Non-parametric estimator when the control variables take a large number of values.
Assume that one wants to control for variables taking a large number of values, such as continuous
variables, without imposing functional-form assumptions on m(Xg ) and the propensity score
p(Xg ). To achieve that purpose, a recent literature proposes to use debiased machine learning
(DML) methods, namely the combination of machine-learning (ML) estimators (e.g., Lasso, random forest, neural networks) of $m(X_g)$ and $p(X_g)$, doubly-robust estimation, and cross-fitting (see, e.g., Chang, 2020; Lu et al., 2019, for early applications of DML to DID). For instance, one may use the following algorithm:
1. Randomly split the groups into two subsamples, $I_1$ and $I_2$.
2. Use an ML method to predict $Y_{g,2}(0) - Y_{g,1}(0)$ and $D_g$ given $X_g$ in subsample 1, and let $\hat m_{ml}^{(1)}(X_g)$ and $\hat p_{ml}^{(1)}(X_g)$ denote the corresponding estimators.
3. Compute a doubly-robust ATT estimator in subsample 2:
$$\mathrm{DID}_{X,dr\text{-}ml}^{(2)} := \frac{1}{\#\{g \in I_2, D_g = 1\}}\sum_{g \in I_2, D_g = 1}\left(Y_{g,2} - Y_{g,1} - \hat m_{ml}^{(1)}(X_g)\right) - \frac{1}{\#\{g \in I_2, D_g = 0\}}\sum_{g \in I_2, D_g = 0}\frac{\hat p_{ml}^{(1)}(X_g)}{1 - \hat p_{ml}^{(1)}(X_g)}\,\frac{\#\{g \in I_2, D_g = 0\}}{\#\{g \in I_2, D_g = 1\}}\left(Y_{g,2} - Y_{g,1} - \hat m_{ml}^{(1)}(X_g)\right).$$
4. Reverse the roles of subsamples 1 and 2: compute $\mathrm{DID}_{X,dr\text{-}ml}^{(1)}$, a doubly-robust ATT estimator in subsample 1, based on estimators of $m(X_g)$ and $p(X_g)$ computed in subsample 2.
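The cross-fitting logic can be sketched as follows. In this simplified illustration (Python/NumPy, made-up DGP), a quadratic polynomial fit and a hand-rolled Newton logit stand in for the ML steps — both happen to be correctly specified here — and the final estimate averages the two cross-fitted estimators. This is a sketch of the idea, not the procedure of any specific command.

```python
import numpy as np

rng = np.random.default_rng(2)

def logit_fit(Z, D, iters=30):
    """Newton-Raphson logit of D on Z (stand-in for the ML step)."""
    b = np.zeros(Z.shape[1])
    for _ in range(iters):
        p = 1 / (1 + np.exp(-Z @ b))
        b += np.linalg.solve(Z.T @ ((p * (1 - p))[:, None] * Z), Z.T @ (D - p))
    return b

def dr_att(dY, D, m_hat, p_hat):
    """Doubly-robust ATT estimator on one subsample."""
    n1 = D.sum()
    odds = p_hat[D == 0] / (1 - p_hat[D == 0])
    return (dY - m_hat)[D == 1].mean() - (odds * (dY - m_hat)[D == 0]).sum() / n1

# DGP: conditional PT with quadratic m(X), logistic propensity, ATT = 1
G = 4000
X = rng.uniform(size=G)
D = (rng.uniform(size=G) < 1 / (1 + np.exp(-(2 * X - 1)))).astype(int)
dY = X**2 + D * 1.0 + 0.1 * rng.normal(size=G)

# Step 1: random split into subsamples I1 and I2
idx = rng.permutation(G)
halves = [idx[: G // 2], idx[G // 2:]]

ests = []
for fit, ev in [(0, 1), (1, 0)]:
    f, e = halves[fit], halves[ev]
    # Step 2: fits on one half (polyfit / logit as stand-ins for ML)
    ctrl = f[D[f] == 0]
    m_coef = np.polyfit(X[ctrl], dY[ctrl], 2)
    b = logit_fit(np.column_stack([np.ones(len(f)), X[f]]), D[f])
    # Step 3: doubly-robust estimator on the held-out half
    p_hat = 1 / (1 + np.exp(-(b[0] + b[1] * X[e])))
    ests.append(dr_att(dY[e], D[e], np.polyval(m_coef, X[e]), p_hat))

# Steps 3-4 combined: average the two cross-fitted estimates
att_dml = np.mean(ests)
print(att_dml)  # close to the true ATT of 1
```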
Identifying assumption: common deviations from linear trends. In this section, let us
momentarily assume that T = 5 and T0 = 3: the data contains five periods, and treated groups
become treated at period four. We consider the following assumption.
Assumption CDLT (Common deviations from linear trends) For all t ≥ 2, E [Yg,t (0) − Yg,t−1 (0)] =
γt + λg .
Assumption CDLT allows groups to experience group-specific linear trends λg , but requires that
between each pair of consecutive periods, all groups have the same expected deviation from their
linear trend. Under Assumption CDLT, the standard parallel-trends assumption holds, but for the first-differenced outcome: for all $t \geq 3$,
$$E\left[\left(Y_{g,t}(0) - Y_{g,t-1}(0)\right) - \left(Y_{g,t-1}(0) - Y_{g,t-2}(0)\right)\right] = \gamma_t - \gamma_{t-1},$$
which does not vary across groups.
DID estimators. Under Assumption CDLT, Mora and Reggio (2019) show that one can use
$$\mathrm{DID}_{X,tr\text{-}lin} := \frac{1}{G_1}\sum_{g:D_g=1}\left(Y_{g,4} - Y_{g,3} - (Y_{g,3} - Y_{g,2})\right) - \frac{1}{G_0}\sum_{g:D_g=0}\left(Y_{g,4} - Y_{g,3} - (Y_{g,3} - Y_{g,2})\right) \quad (4.3)$$
to estimate ATT1 . Intuitively, instead of comparing first-differences of the outcome in the
treatment and in the control group, DIDX,tr-lin compares second-differences. To estimate ATT2 ,
one can use
$$\frac{1}{G_1}\sum_{g:D_g=1}\left(Y_{g,5} - Y_{g,3} - 2\left(Y_{g,3} - Y_{g,2}\right)\right) - \frac{1}{G_0}\sum_{g:D_g=0}\left(Y_{g,5} - Y_{g,3} - 2\left(Y_{g,3} - Y_{g,2}\right)\right), \quad (4.4)$$
where Yg,3 − Yg,2 has to be multiplied by two to account for the fact that Yg,5 − Yg,3 is a long
difference that incorporates 2λg , the linear trend of group g. To test Assumption CDLT, one
can compute the following pre-trends estimator:
$$\frac{1}{G_1}\sum_{g:D_g=1}\left(Y_{g,2} - Y_{g,1} - (Y_{g,3} - Y_{g,2})\right) - \frac{1}{G_0}\sum_{g:D_g=0}\left(Y_{g,2} - Y_{g,1} - (Y_{g,3} - Y_{g,2})\right), \quad (4.5)$$
whose expectation is equal to zero under Assumption CDLT.
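The estimators (4.3)-(4.5) can be checked on a noiseless DGP satisfying Assumption CDLT (Python/NumPy, illustrative numbers): each group has its own linear trend, yet the second-difference comparisons recover the effects exactly and the placebo is zero.

```python
import numpy as np

# Noiseless DGP: group-specific linear trends lambda_g plus common period
# shocks gamma_t; groups 0-1 treated from period 4 on. Illustrative values.
D     = np.array([1, 1, 0, 0])
lam   = np.array([1.0, -0.5, 2.0, 0.3])                   # group trends
gamma = np.array([0.0, 0.5, -0.2, 1.0, 0.3])              # gamma_t, t = 1..5
TE    = np.array([[1.0, 1.5], [2.0, 3.0], [0, 0], [0, 0]])  # effects at t = 4, 5
Y = np.zeros((4, 5))
for t in range(5):                      # column t holds period t+1
    Y[:, t] = 10.0 + gamma[: t + 1].sum() + t * lam
Y[:, 3] += TE[:, 0] * D                 # period 4
Y[:, 4] += TE[:, 1] * D                 # period 5

sdiff = Y[:, 3] - Y[:, 2] - (Y[:, 2] - Y[:, 1])          # (4.3) building block
att1 = sdiff[D == 1].mean() - sdiff[D == 0].mean()
long = Y[:, 4] - Y[:, 2] - 2 * (Y[:, 2] - Y[:, 1])       # (4.4) building block
att2 = long[D == 1].mean() - long[D == 0].mean()
plac = Y[:, 1] - Y[:, 0] - (Y[:, 2] - Y[:, 1])           # (4.5) building block
placebo = plac[D == 1].mean() - plac[D == 0].mean()

print(att1, att2, placebo)  # ATT1 = 1.5, ATT2 = 2.25, placebo = 0, up to rounding
```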
TWFE estimators. Alternatively, one could run a TWFE regression with group-specific
linear trends:
$$Y_{g,t} = \sum_{g'=1}^G \hat\alpha_{g'} 1\{g = g'\} + \sum_{t'=1}^T \hat\gamma_{t'} 1\{t = t'\} + \sum_{g'=1}^G t\,\hat\lambda_{g'} 1\{g = g'\} + \hat\beta_{tr\text{-}lin}^{fe} D_{g,t} + \hat\epsilon_{g,t}. \quad (4.6)$$
Is $\hat\beta_{tr\text{-}lin}^{fe}$ unbiased for the ATT under Assumption CDLT? If not, what is the issue with the TWFE regression in (4.6)?
The estimators $\hat\lambda_g$ of the group-specific linear trends are computed using not only groups' outcome evolutions before the treatment, but also their outcome evolutions after the treatment. Assume for instance that the treatment effect increases linearly with length of exposure. This generates a linear trend after treatment which, for treated groups, contaminates $\hat\lambda_g$, the estimator of groups' linear trend without treatment. Then, $\hat\lambda_g$ is biased for $\lambda_g$, which ultimately biases $\hat\beta_{tr\text{-}lin}^{fe}$. With TWFE event-study estimators, further issues arise when introducing group-specific linear trends in the regression, as then the group-specific linear trends can be collinear with the relative-time indicators.
Set up. In this section, let us momentarily assume that one has outcome data at a more disaggregated level than the $(g, t)$ cells, say at the $(i, g, t)$ level, where $i$ indexes, say, individuals, while $g$ could represent, say, municipalities. In each cell, some individuals are eligible for treatment and will therefore be treated if $g$ is a treatment group and $t > T_0$, and some individuals
are ineligible and therefore remain untreated even if g is a treatment group and t > T0 . Let
Xi,g,t denote the eligibility status of individual i in cell (g, t). For instance, one may have that
only females are eligible for treatment, so that Xi,g,t = 1 for females and Xi,g,t = 0 for males.
Then, Di,g,t = 1{t > T0 }Dg Xi,g,t . For x ∈ {0, 1}, let Yx,g,t denote the average outcome across
individuals satisfying Xi,g,t = x in cell (g, t). In the example, Y1,g,t denotes the average outcome
of females in cell (g, t), while Y0,g,t is the average outcome of males. Let Yx,g,t (0) denote the
average outcome without treatment across individuals satisfying Xi,g,t = x.
E [Y1,g,2 (0) − Y0,g,2 (0) − (Y1,g,1 (0) − Y0,g,1 (0))] does not depend on g. (4.7)
(4.7) is a parallel-trends assumption on the difference between the untreated outcomes of the eligible and ineligible subgroups: it requires that all groups experience the same evolution of that difference. (4.7) is neither stronger nor weaker than Assumption PT. On the other hand, (4.7) is weaker than assuming that eligibles and ineligibles are both on parallel trends:
$$E\left[Y_{x,g,2}(0) - Y_{x,g,1}(0)\right] \text{ does not depend on } g,$$
for all $x \in \{0, 1\}$. This may be why researchers often feel that triple-difference estimators rely on a weaker assumption than DIDs.
Under (4.7), the triple-difference estimator
$$\mathrm{TRI\text{-}D} := \frac{1}{G_1}\sum_{g:D_g=1}\left(Y_{1,g,2} - Y_{0,g,2} - (Y_{1,g,1} - Y_{0,g,1})\right) - \frac{1}{G_0}\sum_{g:D_g=0}\left(Y_{1,g,2} - Y_{0,g,2} - (Y_{1,g,1} - Y_{0,g,1})\right)$$
is unbiased for the ATT. Intuitively, TRI-D compares the evolution of the eligible-versus-ineligible difference in treated and control groups from period one to two. To compute TRI-D, one can regress $Y_{i,g,t}$ on an intercept, $D_g$, $X_{i,g,t}$, $1\{t = 2\}$, $D_g X_{i,g,t}$, $D_g 1\{t = 2\}$, $X_{i,g,t} 1\{t = 2\}$, and $D_{i,g,t}$.
Pre-trends test. If another pre-period, period zero, is available, one can run a pre-trends test
of (4.7), by comparing the evolution of the eligible versus ineligible difference in treated and
control groups from period zero to one.
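A sketch of TRI-D (Python/NumPy) on a noiseless illustrative DGP that satisfies (4.7) but violates Assumption PT — each group has its own trend, yet the eligible-minus-ineligible gap evolves identically everywhere:

```python
import numpy as np

# Noiseless DGP satisfying (4.7): group-specific trends b_g (so PT fails),
# but the eligible-minus-ineligible gap evolves by a common kappa.
# All numbers are illustrative.
D     = np.array([1, 1, 0, 0])
b     = np.array([1.0, -2.0, 0.5, 3.0])     # group-specific trends
delta = np.array([0.3, 1.0, -0.4, 0.8])     # baseline eligibility gaps
kappa = 0.7                                  # common gap evolution
TE    = np.array([2.0, 3.0, 0.0, 0.0])       # effects on eligibles at t = 2

Y0 = np.array([[10 + b_g * t for t in (1, 2)] for b_g in b])   # ineligibles
Y1 = (Y0 + delta[:, None] + kappa * np.array([1, 2])
      + np.column_stack([np.zeros(4), TE * D]))                # eligibles

gap_evol = (Y1[:, 1] - Y0[:, 1]) - (Y1[:, 0] - Y0[:, 0])
tri_d = gap_evol[D == 1].mean() - gap_evol[D == 0].mean()

print(tri_d)  # equals the ATT: (2 + 3)/2 = 2.5
```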
More or less credible results with control variables? Let us first clear up a common
misconception. Increasing statistical precision cannot be, in and of itself, a reason to introduce
control variables in a DID analysis: in this chapter's appendix, we show that under the unconditional parallel-trends assumption and a homoscedasticity assumption, the asymptotic variance
of a DID estimator without covariates is always smaller than that of a DID estimator with co-
variates. The only reason to introduce control variables is that doing so allows the researcher to
work under a conditional parallel-trends assumption, which might be more plausible than the
unconditional one, thus increasing the credibility of one’s results. At the same time, introducing
covariates could also diminish results’ credibility. Choosing which control variables to include
gives researchers the possibility to engage in specification searching and p-hacking, as different
controls may lead to different results.3 The recent development of automated machine-learning
based methods for variable selection may attenuate these concerns, without making them dis-
appear: these methods still require that the researcher specify a dictionary of potential controls
to choose from, and different dictionaries can lead to different results.
3. Prespecifying the controls one will use is often not feasible in observational studies where DID estimators are used: there, it is difficult to credibly document when the researcher first accessed the data.
Practical recommendations. In view of the above, we believe that if the pre-trend coeffi-
cients βbℓfe in the ES TWFE regression without controls in (3.6) are precisely estimated and not
significantly different from zero, there may not be a compelling reason to include controls in the
estimation: there is no indication of a violation of the unconditional parallel-trends assumption,
so moving to a conditional assumption may not be warranted. If on the other hand the pre-trend coefficients without controls are significant, large, or imprecisely estimated, there is a more compelling argument for including controls, provided pre-trends become less significant, smaller, or more precisely estimated with covariates. Still, this means that controls are only included after a pre-test has
been conducted. It has been shown in other contexts that pre-testing/model-selection steps can
bias the post-test estimator and distort inference (Leeb and Pötscher, 2003). Assessing if this
issue is quantitatively important in DID estimation is an active area of research (Roth, 2022).
Best linear predictor of treatment effects. Let $\beta_{1,X}$ denote the slope coefficient from the (unfeasible) regression of $\mathrm{TE}_{g,2}$ on an intercept and $X_{g,1}$ in the treatment group, the best linear predictor of treated groups' effects given their covariates. Estimator. Propose an estimator that is unbiased for $\beta_{1,X}$ under Assumption LPT.
Estimating the conditional ATT function? Instead of estimating the best linear predictor
of treated groups’ treatment effects, one may be interested in estimating the conditional ATT
(CATT) function, namely the function mapping groups’ covariates to their treatment effect. If
the CATT function is linear, it coincides with the best linear predictor, but if the CATT function
is not linear the two functions differ. Few estimators of the CATT function under a conditional
parallel-trends assumption have been proposed. An exception is Lu, Nie and Wager (2019), but
the estimators in that paper are not implemented in a Stata or R command.
In this section, we no longer assume that Assumption ND holds: lagged treatments may have
an effect on the current outcome. We also no longer assume that T = 2.
An AR(1) model with fixed effects. A common justification to control for the lagged
outcome is that the untreated outcome may follow an AR(1) model. For instance, assume that
for $t \geq 2$,
$$Y_{g,t}(0_t) = \alpha_g + \gamma_t + \rho Y_{g,t-1}(0_{t-1}) + \varepsilon_{g,t}, \quad (4.8)$$
for some αg , γt , ρ and εg,t satisfying E[εg,t ] = 0. With ρ = 0, (4.8) boils down to the TWFE
model for the never-treated outcome in (2.5), which we saw is equivalent to the parallel-trends
assumption. Then, let us further assume that the effects of the current treatment and of the
lagged outcome on the current outcome are constant: for t ≥ 2, for all (d1 , ..., dt ) in {0, 1}t ,
$$Y_{g,t}(d_1, ..., d_t) = \alpha_g + \gamma_t + d_t\delta + \rho Y_{g,t-1}(d_1, ..., d_{t-1}) + \varepsilon_{g,t}. \quad (4.9)$$
Under the AR(1) model with fixed effects, whether we need to control for the lagged
outcome depends on whether the outcome process has reached a steady-state. For
simplicity, assume that |ρ| < 1 and (4.8) holds for all t ∈ Z, including for negative time periods,
even if we still assume that Yg,t is observed at t = 1, ..., T only. Then, replacing Yg,t−1 (0t−1 ) by
αg + γt−1 + ρYg,t−2 (0t−2 ) + εg,t−1 in (4.8) and iterating for Yg,t−2 (0t−2 ), Yg,t−3 (0t−3 ), ..., we obtain
$$Y_{g,t}(0_t) = \tilde\alpha_g + \tilde\gamma_t + \tilde\varepsilon_{g,t}, \quad (4.11)$$
where $\tilde\alpha_g := \alpha_g/(1-\rho)$, $\tilde\gamma_t := \sum_{k=0}^{\infty}\rho^k\gamma_{t-k}$, and $\tilde\varepsilon_{g,t} := \sum_{k=0}^{\infty}\rho^k\varepsilon_{g,t-k}$ (we assume that the two series converge; this holds for instance if $|\gamma_t| \leq M$ for all $t \in \mathbb{Z}$ and $(\varepsilon_{g,t})_{t\in\mathbb{Z}}$ is stationary).
Moreover, since $E[\tilde\varepsilon_{g,t}] = 0$, (4.11) is equivalent to the TWFE model for the never-treated outcome in (2.5), so the parallel-trends assumption holds unconditionally, and it is unnecessary to control for the lagged outcome to consistently estimate the ATT or the $(\mathrm{ATT}_\ell)_{\ell=1,...,T_1}$. This
conclusion only relies on (4.8), and it does not rely on the assumption that treatment effects
are homogeneous in (4.9). Moreover, this conclusion still holds if we replace ρYg,t−1 (0) by any
linear combination of past potential outcomes, provided that we can invert the corresponding
relationship, as we did above. On the other hand, this conclusion relies on the assumption that
the outcome process started sufficiently long ago so as to assume that it has reached a steady
state where we can omit the initial condition. At the other extreme, assume that T = 2 and
that the outcome process starts at period 1. Then, (4.8) implies
$$Y_{g,2}(0_2) - Y_{g,1}(0_1) = \alpha_g + \gamma_2 + (\rho - 1)Y_{g,1}(0_1) + \varepsilon_{g,2},$$
whose expectation varies across groups, so the unconditional parallel-trends assumption does not hold, and one needs to control for the lagged outcome.
Under an AR(1) model with fixed effects, controlling for the lagged outcome may
lead to a so-called Nickell bias. While identification arguments can motivate controlling for
the lagged outcome, doing so can lead to issues when it comes to estimation. Even if we assume
E[εg,t |Yg,t−1 ] = 0, the TWFE estimator of δ using Yg,t−1 as a control is biased, and inconsistent
when G diverges to +∞ but T is fixed (Nickell, 1981), a realistic asymptotic approximation in
many DID applications where the number of time periods is not very large. Specifically, due to
the group FEs, it follows from the Frisch–Waugh–Lovell theorem that the regression is equivalent to a regression without group FEs and where all variables have been demeaned within groups. However, letting $\overline Y_{g,2:T}$ and $\overline Y_{g,1:T-1}$ respectively denote the average of $Y_{g,t}$ from periods 2 to $T$ and from periods 1 to $T-1$ (and similarly for $D_{g,t}$ and $\varepsilon_{g,t}$), we have, under (4.10),
$$Y_{g,t} - \overline Y_{g,2:T} = \gamma_t - \overline\gamma_{2:T} + \delta\left(D_{g,t} - \overline D_{g,2:T}\right) + \rho\left(Y_{g,t-1} - \overline Y_{g,1:T-1}\right) + \varepsilon_{g,t} - \overline\varepsilon_{g,2:T}.$$
In this equation, which regressor is mechanically correlated with the residual $\varepsilon_{g,t} - \overline\varepsilon_{g,2:T}$?
$Y_{g,t-1} - \overline Y_{g,1:T-1}$ is a function of $(\varepsilon_{g,1}, ..., \varepsilon_{g,T-1})$ and is therefore mechanically correlated with the residual $\varepsilon_{g,t} - \overline\varepsilon_{g,2:T}$, thus biasing the estimators of $\rho$ and $\delta$.
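The Nickell bias is easy to see by simulation. The sketch below (Python/NumPy) drops the treatment and time effects for simplicity — an illustrative simplification — and estimates $\rho$ by the within (demeaned) estimator: even with many groups, the estimate is far below the true value when $T$ is small.

```python
import numpy as np

# AR(1) with group fixed effects: Y_gt = alpha_g + rho * Y_g,t-1 + eps_gt.
# Within estimation of rho is biased when T is small, however large G is.
rng = np.random.default_rng(3)
G, T, rho = 2000, 5, 0.5
alpha = rng.normal(size=G)
Y = np.empty((G, T))
Y[:, 0] = alpha / (1 - rho) + rng.normal(size=G) / np.sqrt(1 - rho**2)  # stationary start
for t in range(1, T):
    Y[:, t] = alpha + rho * Y[:, t - 1] + rng.normal(size=G)

y, ylag = Y[:, 1:], Y[:, :-1]
yd = y - y.mean(axis=1, keepdims=True)        # within-group demeaning
ld = ylag - ylag.mean(axis=1, keepdims=True)
rho_fe = (ld * yd).sum() / (ld * ld).sum()

print(rho_fe)  # well below the true rho = 0.5
```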
Solutions to the Nickell bias. Consistent estimators of the coefficients in (4.10) can be
obtained by estimating the equation in first difference, instrumenting Yg,t −Yg,t−1 by Yg,t−2 and/or
earlier lags of the outcome (see, e.g., Arellano and Bond, 1991), but such regressions sometimes
suffer from weak instrument problems. Alternatively, one may derive analytic expressions of
the coefficients’ asymptotic bias to bias-correct them (see, e.g., Kiviet, 1995).4 However, those
two estimation strategies rely on the homogeneous treatment effect assumption in (4.9), and
extending them to allow for heterogeneous treatment effects is not straightforward. Klosin
(2024) proposes a bias-correction method that allows for heterogeneous treatment effects along
some covariates, but the method does not allow for unrestricted heterogeneity.
In this section, we go back to ruling out dynamic effects and assuming that the data contains
only two time periods: T = 2, T0 = T1 = 1.
Self-selection and the Ashenfelter dip. Sometimes, one might worry that treated groups
choose to get treated after experiencing a negative outcome shock. For instance, Ashenfelter
(1978) finds that US workers choosing to receive a post-schooling training program experience
an earnings drop before doing so, a so-called Ashenfelter’s dip. Then, if outcome shocks are
positively correlated over time, it is more likely that treated units would have kept experiencing
negative shocks if they had not been treated, thus leading to a violation of Assumption PT.
4. If (4.9) holds, then $\mathrm{ATT}_\ell = \delta(1 - \rho^\ell)/(1 - \rho)$, so having consistent estimators of $\delta$ and $\rho$ is sufficient to consistently estimate the $\mathrm{ATT}_\ell$ effects.
Controlling for the baseline outcome amounts to assuming that
$$E\left[Y_{g,2}(0) - Y_{g,1}(0)\right] = m(Y_{g,1}(0)), \quad (4.12)$$
the same condition as in Assumption CPT, with the baseline outcome playing the role of the covariates. Like the covariates in the previous sections, the baseline outcomes $(Y_{g,1}(0))_{g\in\{1,...,G\}}$ are implicitly conditioned upon in what follows, and we treat them as non-stochastic. Is Assumption CPT still a parallel-trends assumption?
Letting $\tilde\gamma(x) = m(x) + x$, and because $Y_{g,1}(0)$ is conditioned upon, (4.12) is equivalent to
$$E\left[Y_{g,2}(0)\right] = \tilde\gamma(Y_{g,1}(0)):$$
groups with the same baseline outcome should have the same expected untreated outcome at period 2. This is equivalent to the conditional independence or ignorability assumption that underlies matching estimators. In fact, comparing the outcome evolutions $Y_{g,2} - Y_{g,1}$ of treated and control groups with the same $Y_{g,1}$ is equivalent to just comparing their $Y_{g,2}$: a DID estimator conditional on $Y_{g,1}$ is actually a matching estimator (Chabé-Ferret, 2015). Accordingly, (4.12) is also similar to a sequential randomization assumption, where one assumes that treatment is randomly assigned conditional on the lagged outcome (Robins, 1986; Bojinov et al., 2021).
Matching on the baseline outcome might lead to spurious results due to mean-
reversion. Mean-reversion phenomena when matching on the baseline outcome were first dis-
cussed in McNemar (1940), and further discussed in Chabé-Ferret (2015) and Daw and Hatfield
(2018). The following toy model is sufficient to convey the intuition. Assume that the treatment
has no effect (Yg,t (1) = Yg,t (0)), and treated groups are such that P (Yg,t (0) = 1) = P (Yg,t (0) =
2) = 1/2, while control groups are such that P (Yg,t (0) = 0) = P (Yg,t (0) = 1) = 1/2, and Yg,2 (0)
and Yg,1 (0) are independent for all g. Thus, for all t, E(Yg,t (0)) = 1.5 for treated groups while
E(Yg,t (0)) = 0.5 for control groups. In this DGP, should a researcher use a DID estimator, or a
matching estimator conditional on the baseline outcome?
For all g, E(Yg,2 − Yg,1 ) = E(Yg,2 (0) − Yg,1 (0)) = 0. Therefore, the expectation of the DID
estimator is equal to zero, so it is unbiased. The matching estimator is equal to
$$\frac{1}{\#\{g: D_g = 1, Y_{g,1} = 1\}}\sum_{g:D_g=1,\,Y_{g,1}=1} Y_{g,2} \;-\; \frac{1}{\#\{g: D_g = 0, Y_{g,1} = 1\}}\sum_{g:D_g=0,\,Y_{g,1}=1} Y_{g,2},$$
because the only common value of Yg,1 for the treated and control groups is 1. Then, as Yg,2 (0)
and Yg,1 (0) are independent, for every treated group g such that Yg,1 = Yg,1 (0) = 1, E(Yg,2 ) =
E(Yg,2 (0)) = 1.5, while for every control group g such that Yg,1 = Yg,1 (0) = 1, E(Yg,2 ) =
E(Yg,2 (0)) = 0.5. Accordingly, the expectation of the matching estimator is equal to 1.5−0.5 = 1:
this estimator is biased. Intuitively, the matching estimator compares the period-two outcomes
of treated groups with a bad period-one shock and control groups with a good period-one shock.
But at period two the treatment groups revert to their higher mean, while control groups revert
to their lower mean, thus generating a spurious positive estimate.
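To make the mean-reversion point concrete, here is a short simulation of the toy model above (an illustrative sketch; the number of groups and the seed are arbitrary choices of ours):

```python
import numpy as np

rng = np.random.default_rng(0)
G = 100_000  # groups per arm; large, so estimates are close to their expectations

# Treatment has no effect: Y_{g,t}(1) = Y_{g,t}(0). Treated groups draw their
# outcome from {1, 2} and control groups from {0, 1}, with equal probabilities,
# independently across the two periods (purely transitory shocks).
y1_treat = rng.integers(1, 3, G).astype(float)
y2_treat = rng.integers(1, 3, G).astype(float)
y1_ctrl = rng.integers(0, 2, G).astype(float)
y2_ctrl = rng.integers(0, 2, G).astype(float)

# DID compares outcome evolutions: unbiased for the (zero) effect here.
did = (y2_treat - y1_treat).mean() - (y2_ctrl - y1_ctrl).mean()

# Matching on the baseline outcome: the only common value of Y_{g,1} is 1,
# i.e. a bad shock for the treated and a good shock for the controls.
match = y2_treat[y1_treat == 1].mean() - y2_ctrl[y1_ctrl == 1].mean()

print(f"DID estimate:      {did:+.3f}")   # close to 0
print(f"Matching estimate: {match:+.3f}") # close to 1, despite a zero effect
```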
Practical recommendations. Of course the previous toy model is very simplistic, but Chabé-
Ferret (2015) and Daw and Hatfield (2018) show that mean-reversion can still happen in more
realistic models, and Bach, Bozio, Guillouzouic and Malgouyres (2023) is a recent striking exam-
ple showing that mean-reversion can lead to a substantial bias in an actual empirical application.
If more than two time periods are available, researchers that want to control for the lagged out-
come should compute a placebo matching estimator, comparing the Yg,T0 outcome of the treated
and controls with the same Yg,T0 −1 outcome. This can help assess if mean reversion will mechan-
ically bias the actual matching estimator. They may also match on less recent outcome lags
than Yg,T0 , and they could also match on several outcome lags: the more lags one matches on,
the less likely it is that treated and controls are matched on transitory shocks, the essence of the
mean-reversion problem. Here, there is an interesting connection with the synthetic control esti-
mator. To reconstruct treated’s counterfactual outcome, that estimator uses a weighted average
of controls that match the treated at all pre-treatment periods. The synthetic control estimator
is consistent when the number of groups and the number of pre-treatment periods go to infinity,
the latter requirement ensuring that treated and controls are not matched on transitory shocks.
124 CHAPTER 4. ALTERNATIVES TO PARALLEL TRENDS
4.1.5 Computation: Stata and R commands to compute DID estimators with controls
Several Stata and R commands compute DID estimators with control variables. We review
several of them thoroughly in Chapter 6. At this stage, we just mention two commands.
The drdid command. DID_{X,or}, DID_{X,ps}, and DID_{X,dr} are computed by the drdid Stata (see Rios-Avila, Sant’Anna and Naqvi, 2021) and R (see Sant’Anna and Zhao, 2022) commands. The
syntax of the Stata command is:
Computing pre-trend and event-study estimators with controls when T > 2. Stata
and R commands computing DID estimators with control variables can generally be used with
more than two time periods. Then, they can compute pre-trends and event-study estimators
with controls, while avoiding the issues of TWFE regressions with controls. Those estimators
are analogous to the two-periods DID estimators with controls discussed in this chapter, except
that to estimate effect ℓ, the covariate-adjusted DID goes from the last period before treatment
T0 to T0 + ℓ, instead of going from period one to two. Similarly, the ℓth pre-trend estimator is a
covariate-adjusted DID from T0 to T0 − ℓ. As we will use it in the empirical example in the next
section, here is the syntax of the did_multiplegt_dyn command, to produce an event-study
graph controlling non-parametrically for some variables taking a small number of values:
did_multiplegt_dyn outcome groupid timeid treatment,
effects(#) placebo(#) trends_nonparam(var_names)
where effects(#) is the number of event-study effects to be estimated, and placebo(#) is the
number of pre-trend estimators.
1900. In this application, the pre-trend estimators in Figure 3.2 lend support to the unconditional
parallel-trends assumption, so there may not be a compelling argument to move away from
TWFE estimators without controls. Of course, when there is evidence that the unconditional
parallel-trends assumption is violated, controlling for some covariates may be appealing.
[Figure: event-study and pre-trend estimates. Y-axis: Effect; X-axis: Relative time to year before TWEA.]
Note: This figure shows the estimated effects of compulsory licensing on patents, as well as pre-trends estimates,
using years 1900 to 1939 of the data from Moser and Voena (2012), and a TWFE event-study regression with a
full set of interactions between subclasses’ number of patents in 1900 and year fixed effects. Standard errors are
clustered at the patent subclass level. 95% confidence intervals are shown in red.
Interactive fixed effects (IFE), synthetic control (SC), and synthetic DID (SD) estimators are
very popular alternatives to DID estimators. We open this section with a list of advantages and
drawbacks of these methods relative to DIDs, and a list of topics on which we think that further research is needed.
1. Applicability. IFE, SC, and SD estimators are not as widely applicable as DIDs, because
they require a large number of pre-treatment periods T0 .
2. Identification. IFE, SC, and SD estimators rely on a factor model for the untreated
outcome that is substantially weaker than the TWFE model underlying DID estimators.
At the same time, IFE, SC, and SD estimators assume that the factor model holds for
a large number of time periods, while DID estimators can be used if the more restrictive
TWFE model holds for a few periods around the treatment-adoption date.
3. Estimation. The SD estimator can sometimes be more precise than the DID estimator.
On the other hand, unlike the DID estimator, IFE, SC, and SD estimators require choosing
tuning parameters, and there is no theoretically justified way of choosing those parameters.
4. Placebo tests. While it is easy to placebo test the parallel-trends and no-anticipation
assumptions underlying DIDs via pre-trend tests, testing the assumptions underlying IFE,
SC, and SD estimators is less straightforward, an issue that has received little attention.
1. Applicability. Proposing estimators relying on a factor model and with proven guarantees
even if T0 is fixed would be a major improvement.5 More simulation studies assessing
the number of pre-treatment periods that are necessary to reliably use IFE, SC, and SD
estimators would also be useful.
3. Placebo tests. Providing placebo tests of the IFE, SC, and SD identifying assumptions
would be useful. As was done for DID, it is also important to run realistic simulations to assess if those placebo tests have enough power to detect violations of identifying assumptions that could lead to substantial biases.
Footnote 5: Imbens, Kallus and Mao (2021) and Brown and Butts (2023) propose such estimators, but while their results allow T0 to be fixed, they require other strong assumptions.
4.2. INTERACTIVE FES, SYNTHETIC CONTROLS, AND SYNTHETIC DID
Disclaimer. The literature on IFE, SC, and SD is huge, and what follows is very far from
being exhaustive: our goal is simply to present the main references in that literature, as well as
the most popular Stata and R commands to compute those estimators. For much more thorough
reviews of this literature, see for instance Abadie (2021) or Arkhangelsky and Imbens (2024).
with E(εg,t ) = 0. Hsiao, Ching and Ki Wan (2012), Gobillon and Magnac (2016), and Xu (2017)
instead assume that the untreated outcome follows a TWFE model augmented with interactive
fixed effects (TWFE-IFE):
\[
Y_{g,t}(0) = \alpha_g + \gamma_t + \sum_{r=1}^{R} \lambda_{g,r} f_{t,r} + \varepsilon_{g,t}, \qquad (4.14)
\]
with E(εg,t ) = 0. (ft,r )r∈{1,...,R} is a vector of period-specific shocks affecting all groups, like
the period FE γt . The key difference between (4.13) and (4.14) is that (4.14) has group-specific
coefficients (λg,r )r∈{1,...,R} in front of the common shocks (ft,r )r∈{1,...,R} , as if some group FEs were
interacted with period FEs. For instance, assume that ft,1 represents the state of the economy
at period t. Then, how would you describe groups with a large positive value of λg,1 ? Groups
with a smaller positive value of λg,1 ? Groups with a negative value of λg,1 ?
Groups that have a large positive value of λg,1 are such that their outcome without treatment
responds very positively to a positive economic shock and very negatively to a negative shock:
their outcome is very sensitive to the economic environment. Groups with a smaller positive
value of λg,1 are less sensitive to the economic environment, and groups with a negative coefficient
are counter-cyclical. Does the parallel-trends assumption hold under (4.14)?
By allowing groups to respond differently to common shocks, (4.14) allows for differential trends
between groups:
\[
E(Y_{g,t}(0) - Y_{g,t-1}(0)) = \gamma_t - \gamma_{t-1} + \sum_{r=1}^{R} \lambda_{g,r}\,(f_{t,r} - f_{t-1,r}).
\]
One could also assume that (4.14) holds with control variables, thus further relaxing the parallel-
trends assumption. (ft,r )r∈{1,...,R} are often called the factors, (λg,r )r∈{1,...,R} are often called the
loadings, and IFE models are also often called factor models.
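The differential trends implied by the factor model can be verified mechanically in a tiny simulated example (an illustrative sketch with R = 1; all numbers are arbitrary choices of ours):

```python
import numpy as np

rng = np.random.default_rng(1)
G, T = 6, 8
alpha = rng.normal(size=G)           # group FEs
gamma = rng.normal(size=T)           # period FEs
f = rng.normal(size=T)               # one common shock, e.g. the business cycle
lam = np.linspace(-1.0, 2.0, G)      # group-specific loadings on that shock

# Untreated outcomes from the TWFE-IFE model (4.14), with R = 1 and no noise.
Y0 = alpha[:, None] + gamma[None, :] + lam[:, None] * f[None, :]

# Each group's trend is gamma_t - gamma_{t-1} + lam_g * (f_t - f_{t-1}):
trends = np.diff(Y0, axis=1)
expected = (gamma[1:] - gamma[:-1])[None, :] + lam[:, None] * (f[1:] - f[:-1])[None, :]
assert np.allclose(trends, expected)

# Groups with different loadings have different trends: parallel trends fails.
print(trends[:, 0].round(2))  # one distinct trend per group
```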
Estimating the (ATTℓ )ℓ∈{1,...,T1 } and the ATT under a TWFE-IFE model. To estimate
the (ATTℓ )ℓ∈{1,...,T1 } and the ATT under (4.14), Xu (2017), building upon Bai (2009), proposes
the following algorithm:6
\[
\sum_{g:D_g=0} \sum_{t=1}^{T} \left( \widehat{\varepsilon}_{g,t} - \sum_{r=1}^{R} \lambda_{g,r} f_{t,r} \right)^2,
\]
see this chapter’s appendix for details on how to solve this minimization problem.
Footnote 6: Hsiao et al. (2012) and Gobillon and Magnac (2016) propose closely-related procedures.
Footnote 7: With covariates Xg,t, one needs to iterate Step (b) and a step where one estimates the coefficients on Xg,t until convergence. See Xu (2017) for details.
2. Then, for each treated group, run a regression of $(Y_{g,t} - \widehat{\gamma}_t)_{t\in\{1,...,T_0\}}$ on an intercept and $(\widehat{f}_{t,r})_{r\in\{1,...,R\},\,t\in\{1,...,T_0\}}$, and let $(\widehat{\alpha}_g, (\widehat{\lambda}_{g,r})_{r\in\{1,...,R\}})_{g:D_g=1}$ denote the coefficients from those $G_1$ regressions.
3. Finally, let
\[
\widehat{\beta}^{\text{ife}}_{\ell} = \frac{1}{G_1}\sum_{g:D_g=1}\left(Y_{g,T_0+\ell} - \left(\widehat{\alpha}_g + \widehat{\gamma}_{T_0+\ell} + \sum_{r=1}^{R}\widehat{\lambda}_{g,r}\widehat{f}_{T_0+\ell,r}\right)\right)
\]
be an estimator of ATTℓ, and let
\[
\widehat{\beta}^{\text{ife}} = \frac{1}{T_1}\sum_{\ell=1}^{T_1}\widehat{\beta}^{\text{ife}}_{\ell}
\]
be an estimator of the ATT.
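The algorithm above can be sketched numerically as follows (a minimal illustration of the imputation logic, not the exact algorithm of Xu (2017) or of any package; the alternating-least-squares loop, the function name, and the toy DGP are our own choices):

```python
import numpy as np

def twfe_ife_impute(Y, treated, T0, R=1, n_iter=300):
    """Event-study estimates under a TWFE-IFE model, by imputation.

    Y: (G, T) outcome matrix; treated: boolean array of length G.
    Treated groups are treated from period T0 (0-indexed) onwards.
    Returns the average treated-minus-imputed gap at periods T0, ..., T-1.
    """
    Yc = Y[~treated]
    T = Y.shape[1]
    L = np.zeros((Yc.shape[0], R))
    F = np.zeros((T, R))
    for _ in range(n_iter):
        # (a) Given the factor part, update group and period FEs on controls.
        fe_resid = Yc - L @ F.T
        alpha0 = fe_resid.mean(axis=1, keepdims=True)
        gamma = (fe_resid - alpha0).mean(axis=0)
        # (b) Given the FEs, update the rank-R factor part via an SVD.
        U, s, Vt = np.linalg.svd(Yc - alpha0 - gamma[None, :], full_matrices=False)
        L, F = U[:, :R] * s[:R], Vt[:R].T

    # Step 2: recover treated groups' intercepts and loadings from their
    # pre-treatment periods, regressing Y - gamma on an intercept and the factors.
    Yt = Y[treated]
    X = np.column_stack([np.ones(T0), F[:T0]])
    coefs = np.linalg.lstsq(X, (Yt[:, :T0] - gamma[:T0]).T, rcond=None)[0]
    alpha_t, lam = coefs[0], coefs[1:]           # shapes (G1,) and (R, G1)

    # Step 3: impute Y(0) for treated cells after T0 and average the gaps.
    imputed = alpha_t[None, :] + gamma[T0:, None] + F[T0:] @ lam
    return (Yt[:, T0:].T - imputed).mean(axis=1)

# Toy check: one factor, a constant treatment effect of 2 from T0 onwards.
rng = np.random.default_rng(2)
G, T, T0, G1 = 60, 20, 15, 10
f = rng.normal(size=T)
lam_g = rng.normal(size=G)
Y = (rng.normal(size=(G, 1)) + rng.normal(size=(1, T))
     + lam_g[:, None] * f[None, :] + 0.1 * rng.normal(size=(G, T)))
treated = np.arange(G) < G1
Y[treated, T0:] += 2.0
print(twfe_ife_impute(Y, treated, T0, R=1).round(2))  # all close to 2
```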
Intuition for the TWFE-IFE estimator. What is the intuition underlying $\widehat{\beta}^{\text{ife}}_{\ell}$? In particular, what are the commonalities and differences between $\widehat{\beta}^{\text{ife}}_{\ell}$ and $\widehat{\beta}^{b,l,g}_{\ell}$?
to impute the unobserved counterfactual outcome Yg,t (0) of treated (g, t) cells. Then, one can
use
\[
Y_{g,t} - \left(\widehat{\alpha}_g + \widehat{\gamma}_t + \sum_{r=1}^{R}\widehat{\lambda}_{g,r}\widehat{f}_{t,r}\right),
\]
the difference between the cell’s observed and imputed outcome, to estimate its treatment effect.
Thus, $\widehat{\beta}^{\text{ife}}_{\ell}$ is similar to $\widehat{\beta}^{b,l,g}_{\ell}$, the imputation estimator of ATTℓ discussed in Chapter 3. The difference between the two estimators is that $\widehat{\beta}^{\text{ife}}_{\ell}$ uses a TWFE-IFE model to impute the missing counterfactual outcome, while $\widehat{\beta}^{b,l,g}_{\ell}$ just uses a TWFE model.
Asymptotic theory for the TWFE-IFE estimator. Liu et al. (2024) show that $\widehat{\beta}^{\text{ife}}_{\ell}$ is consistent when G0, T0, and G1 diverge to +∞. Xu (2017) proposes a parametric bootstrap to estimate its variance. However, to our knowledge the validity of this procedure has not been established yet. In fact, we are not aware of an asymptotic-normality result on $\widehat{\beta}^{\text{ife}}_{\ell}$: the results of Bai (2009), for instance, apply to other parameters.8
How many pre-treatment periods should one have to use a TWFE-IFE estimator?
The asymptotic approximation underlying the TWFE-IFE estimator requires that G0 , T0 , and
G1 diverge to +∞. This is in contrast with the asymptotic approximation underlying the
TWFE estimator, which only requires that G0 and G1 diverge to +∞. Then, one should have a
reasonably large number of pre-treatment periods T0 to reliably use the TWFE-IFE estimator,
but how large should that number be? The simulations in Xu (2017) suggest that with 15
pre-treatment periods, the confidence interval of the TWFE-IFE estimator has close to nominal
coverage. However, those simulations are not calibrated to a real dataset, and there may exist
realistic settings where more pre-treatment periods are needed to have good coverage. Assessing
the coverage of the confidence interval of the TWFE-IFE estimator in simulations calibrated to
real datasets is an interesting avenue for future research.
Weak factors.∗ Asymptotic results for factor models typically rely on a “strong factor as-
sumption”, which requires that the loadings λg,r and factors ft,r have sufficient variation across
g and over t, respectively (Bai, 2009). If that assumption fails, IFE estimators can be biased and
their confidence intervals (CIs) can be size-distorted. This is an issue, as applied researchers
have no way of knowing ex-ante if in their empirical application, factors are strong or weak.
Armstrong, Weidner and Zeleneev (2022) propose IFE estimators and confidence intervals that
remain valid when factors are weak. However, their estimators and confidence intervals have not
been extended yet to a TWFE-IFE imputation estimator allowing for heterogeneous treatment
effects, in the spirit of that of Xu (2017).
How to choose the number of factors? Another difference between the TWFE-IFE and
TWFE estimators is that unlike the latter, the former requires choosing a tuning parameter,
namely the number of factors R. One may use cross-validation to choose R. For instance, one
Footnote 8: Bai and Ng (2021) also propose treatment effect estimators under a factor model for the untreated outcome, and they derive an asymptotic-normality result for their estimator. However, their estimator is not yet computed by a Stata or R command, unlike that of Xu (2017).
estimates the TWFE-IFE model with R = 1 using all periods except period T , one uses the
model to predict the outcome of untreated groups at T , and one computes the model’s mean
squared error (MSE). One repeats the procedure holding out periods T − 1, ..., 1, and one finally
computes the model’s average MSE across all periods. Then, one repeats the same procedure
for R = 2, for R = 3, etc. Finally, one chooses the value of R yielding the lowest MSE. Xu
(2017) finds that the TWFE-IFE estimator with a cross-validated R works well in simulations,
though the paper does not provide a theoretical justification for that procedure.
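A sketch of a cross-validation of this kind (a simplified variant that holds out treated groups' pre-treatment periods one at a time, rather than entire periods; the function names and the toy DGP are our own choices):

```python
import numpy as np

def fit_factors(Yc, R, n_iter=300):
    # Two-way FEs plus a rank-R factor structure on control groups (ALS + SVD).
    L = np.zeros((Yc.shape[0], R))
    F = np.zeros((Yc.shape[1], R))
    for _ in range(n_iter):
        fe_resid = Yc - L @ F.T
        a = fe_resid.mean(axis=1, keepdims=True)
        g = (fe_resid - a).mean(axis=0)
        U, s, Vt = np.linalg.svd(Yc - a - g[None, :], full_matrices=False)
        L, F = U[:, :R] * s[:R], Vt[:R].T
    return g, F

def cv_mse(Y, treated, T0, R):
    # Hold out each pre-treatment period s for the treated groups, estimate their
    # loadings on the remaining pre-periods, and score the prediction at s.
    gamma, F = fit_factors(Y[~treated], R)
    Yt = Y[treated]
    errors = []
    for s in range(T0):
        keep = [t for t in range(T0) if t != s]
        X = np.column_stack([np.ones(len(keep)), F[keep]])
        coefs = np.linalg.lstsq(X, (Yt[:, keep] - gamma[keep]).T, rcond=None)[0]
        pred = gamma[s] + np.concatenate([[1.0], F[s]]) @ coefs
        errors.append(((Yt[:, s] - pred) ** 2).mean())
    return float(np.mean(errors))

# Toy check: a two-factor DGP, where R = 2 should beat R = 1.
rng = np.random.default_rng(3)
G, T, T0 = 60, 20, 15
F_true = rng.normal(size=(T, 2))
L_true = rng.normal(size=(G, 2))
Y = L_true @ F_true.T + 0.05 * rng.normal(size=(G, T))
treated = np.arange(G) < 10
mses = {R: cv_mse(Y, treated, T0, R) for R in (1, 2, 3)}
print(min(mses, key=mses.get))  # the cross-validated choice of R
```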
How to test the TWFE-IFE model? The parallel-trends assumption can be tested via
pre-trend tests. How can one run a similar test of the TWFE-IFE model?
For ℓ ∈ {−1, ..., −(T0 − 1)}, one could define the following pre-trends estimator:
\[
\widehat{\beta}^{\text{ife}}_{\ell} = \frac{1}{G_1}\sum_{g:D_g=1}\left(Y_{g,T_0+\ell} - \left(\widehat{\alpha}_g + \widehat{\gamma}_{T_0+\ell} + \sum_{r=1}^{R}\widehat{\lambda}_{g,r}\widehat{f}_{T_0+\ell,r}\right)\right).
\]
However, this estimator will be mechanically close to zero, even if (4.14) fails. This placebo
compares the actual and imputed outcome of treatment groups before treatment, but treatment
groups’ outcomes at all pre-treatment periods are used to estimate the imputation parameters,
so the actual and imputed outcomes will mechanically be close. Instead, one can recompute the
TWFE-IFE estimator, pretending that the treatment took place at period T0 + 1 − P instead of
T0 + 1 (Liu et al., 2024). Then, for ℓ ∈ {1, ..., P }, $\widehat{\beta}^{\text{ife}}_{\ell}$ is an actual placebo estimator, comparing
treatment groups’ actual and imputed outcomes before treatment, at time periods that were
not used to estimate the imputation parameters. This testing strategy comes with one caveat.
The TWFE-IFE estimator can only be used when the number of pre-treatment periods is large,
but having P placebos reduces to T0 − P the number of pre-treatment periods one can use to
estimate the loadings of the treated groups. This can be an issue if T0 is already not very large.
Liu and Liu, 2022b) and R (see Liu, Wang, Xu, Liu and Liu, 2022a) commands. In Stata, the
command’s syntax is
The synthetic control (SC) estimator. The SC estimator was originally proposed for
settings with only one treated group, so in most of this section we assume that G1 = 1, and that
group G is the treated group. To simplify the exposition, we also assume that T0 + 1 = T : group
G is only treated at period T . Then, the ATT reduces to E(YG,T (1) − YG,T (0)), and we need to
estimate the missing counterfactual outcome YG,T (0). For that purpose, Abadie, Diamond and
Hainmueller (2010), building upon Abadie and Gardeazabal (2003), propose to use
\[
\widehat{Y}_{G,T}(0) = \sum_{g=1}^{G-1} \widehat{w}_g Y_{g,T},
\]
$\widehat{Y}_{G,T}(0)$ is the period-T outcome of the weighted average of control groups whose period 1 to T − 1 outcomes are closest to that of the treatment group, hereafter referred to as the synthetic control. Then,
\[
\widehat{\beta}^{\text{sc}} = Y_{G,T} - \widehat{Y}_{G,T}(0).
\]
Footnote 9: This minimization can be solved easily as an instance of quadratic programming.
$\widehat{\beta}^{\text{sc}}$ is one of the estimators considered by Abadie et al. (2010), sometimes referred to as the canonical SC estimator (Doudchenko and Imbens, 2016). When there is more than one post-treatment period, one can compute event-study SC estimators, by letting $\widehat{\beta}^{\text{sc}}_{\ell} = Y_{G,T_0+\ell} - \widehat{Y}_{G,T_0+\ell}(0)$ for $\ell \in \{1, ..., T_1\}$, where $\widehat{Y}_{G,T_0+\ell}(0) = \sum_{g=1}^{G-1} \widehat{w}_g Y_{g,T_0+\ell}$.
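The SC weights solve a constrained least-squares problem over the simplex. A minimal numeric sketch, using the Frank-Wolfe algorithm as a simple stand-in for the quadratic-programming solvers usually employed (the function name and toy data are our own choices):

```python
import numpy as np

def sc_weights(Y0_pre, y1_pre, n_iter=20000):
    """Weights minimizing ||Y0_pre @ w - y1_pre||^2 over the simplex
    {w >= 0, sum(w) = 1}, via the Frank-Wolfe algorithm.

    Y0_pre: (T0, G-1) pre-treatment outcomes of the control groups;
    y1_pre: (T0,) pre-treatment outcomes of the treated group.
    """
    G0 = Y0_pre.shape[1]
    w = np.full(G0, 1.0 / G0)                    # start from uniform weights
    for k in range(n_iter):
        grad = 2.0 * Y0_pre.T @ (Y0_pre @ w - y1_pre)
        v = np.zeros(G0)
        v[np.argmin(grad)] = 1.0                 # best vertex of the simplex
        w += (2.0 / (k + 2.0)) * (v - w)         # standard Frank-Wolfe step size
    return w

# Toy check: the treated group is an exact convex combination of two controls.
rng = np.random.default_rng(4)
T0, T1 = 10, 3
Y0 = rng.normal(size=(T0 + T1, 4))              # 4 control groups, all periods
y1 = 0.3 * Y0[:, 0] + 0.7 * Y0[:, 1]
y1[T0:] += 5.0                                  # treatment effect of 5 after T0
w = sc_weights(Y0[:T0], y1[:T0])
beta_sc = y1[T0:] - Y0[T0:] @ w                 # event-study SC estimates
print(np.round(w, 2), np.round(beta_sc, 1))     # weights near (0.3, 0.7, 0, 0); effects near 5
```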
Other versions of the SC estimator. Abadie et al. (2010) also consider cases where other
pre-intervention characteristics than the pre-treatment outcomes (Yg,1 , ..., Yg,T −1 ) are used in
the objective function in (4.15), and where not all pre-treatment outcomes are used in that
objective function. However, there is currently little guidance as to how one should choose
the pre-intervention characteristics on which to match the treated and controls. This gives
researchers opportunities to engage in specification searching (Ferman, Pinto and Possebom,
2020). The canonical SC estimator is not subject to that concern.
Intuition underlying the SC estimator. To convey the intuition underlying the SC estimator, let us assume that the TWFE-IFE model in (4.14) holds. If $Y_{g,t}(0) = \alpha_g + \gamma_t + \sum_{r=1}^{R} \lambda_{g,r} f_{t,r} + \varepsilon_{g,t}$ and $\sum_{g=1}^{G-1} \widehat{w}_g Y_{g,t} \approx Y_{G,t}$ for all $t \le T_0$, which relationship might there be between the FEs and loadings of group G and those of the synthetic control?
Then, the fact that the level of the outcome is similar in the treated group and in the synthetic
control might indicate that their FEs are similar:
\[
\sum_{g=1}^{G-1} \widehat{w}_g \alpha_g \approx \alpha_G. \qquad (4.16)
\]
Similarly, the fact that their outcome paths are similar might indicate that they react similarly
to the common shocks (ft,r )r∈{1,...,R} , thus implying that their loadings may also be similar:
\[
\sum_{g=1}^{G-1} \widehat{w}_g \lambda_{g,r} \approx \lambda_{G,r}. \qquad (4.17)
\]
Then,
\[
\begin{aligned}
E\left(\widehat{Y}_{G,T}(0)\right) &= E\left(\sum_{g=1}^{G-1} \widehat{w}_g Y_{g,T}\right)\\
&= E\left(\sum_{g=1}^{G-1} \widehat{w}_g \left(\alpha_g + \gamma_T + \sum_{r=1}^{R} \lambda_{g,r} f_{T,r} + \varepsilon_{g,T}\right)\right)\\
&\approx \alpha_G + \gamma_T + \sum_{r=1}^{R} \lambda_{G,r} f_{T,r}\\
&= E\left(Y_{G,T}(0)\right),
\end{aligned}
\]
where the approximation follows from (4.16) and (4.17), and from omitting the estimation error
in wbg combined with E(εg,T ) = 0.
\[
\sum_{g=1}^{G-1} w^{*}_g \lambda_{g,r} - \lambda_{G,r} \to 0. \qquad (4.19)
\]
Moreover, $w^{*}_g$ should not be too sparse: the number of control groups for which $w^{*}_g > 0$ should not be too low. One may think of the TWFE-IFE model in (4.14), (4.18), and (4.19) as the identifying assumptions underlying the SC estimator.
trend as the treated group. Instead, DID requires that the simple average of control groups has
the same counterfactual trend as the treated group, which is stronger. However, DID estimators
with controls can be consistent in instances where the SC estimator is not. For instance, if
\[
Y_{g,t}(0) = \alpha_g + \gamma_t + \lambda_g\, t + \varepsilon_{g,t},
\]
meaning that the untreated outcome follows a TWFE model with group-specific linear trends,
then a DID estimator with linear trends is always consistent for the ATT, while the SC estimator
is inconsistent if λG does not belong to the convex hull of (λg )g∈{1,...,G−1} , as then (4.19) fails
(Arboleda Cárcamo, 2024). In their Footnote 4, Arkhangelsky, Athey, Hirshberg, Imbens and
Wager (2021) sketch an extension of their SC and SD estimators to models with covariates.
However, this extension does not allow for heterogeneous treatment effects, and it is unclear
whether it is applicable to control variables whose dimensionality grows with the sample size, as
is the case of group-specific linear trends.
If one is ready to assume a TWFE-IFE model, then why not use a TWFE-IFE
estimator? The identifying assumptions underlying the SC estimator ((4.14), (4.18), and
(4.19)) are stronger than those underlying the TWFE-IFE estimator ((4.14)). Yet, there are
two arguments to still use the SC estimator, when it comes to estimation. First, unlike the
TWFE-IFE estimator, the SC estimator does not require taking a stance on the number of
factors. Second, TWFE-IFE estimates the factors and loadings in (4.14), before estimating the
treatment’s effect. However, estimating the factors and loadings may require imposing stronger
assumptions than those needed to estimate the treatment’s effect. Instead, the SC estimator
bypasses the estimation of the factors and loadings, and directly estimates the treatment effect
in a way that accounts for the factor structure in (4.14). This might explain why in simulations,
the SC estimator sometimes performs better under (4.14) than estimators trying to estimate the
factor structure (Arkhangelsky et al., 2021).
Inference with one treated group. With only one treated group, conducting inference is
not straightforward. Abadie et al. (2010) propose a placebo approach where one compares the SC
estimator to the quantiles of the distribution of placebo SC estimators, computed assuming that
the treated group was actually one of the control groups. Typically, this type of randomization
inference procedure can be justified by assuming that the treated group was chosen at random,
but this may not be a plausible assumption in the settings where SC estimators are used.
Instead, Chernozhukov, Wüthrich and Zhu (2021) propose an inference procedure that is valid
under more realistic assumptions, namely if the TWFE-IFE model in (4.14) holds and the error
terms εg,t are stationary and not too serially correlated, and if the number of time periods T
goes to infinity. In the simple example we consider, where group G is only treated at period T ,
a p-value of the sharp null that YG,T (1) − YG,T (0) = 0 can be obtained as follows:
2. Let $\widehat{u}^{\text{sc}}_t = Y_{G,t} - \sum_{g=1}^{G-1} \widetilde{w}_g Y_{g,t}$, and let $\widehat{F}(x) = \frac{1}{T}\sum_{t=1}^{T} 1\{|\widehat{u}^{\text{sc}}_t| \le x\}$.
In Step 1, one solves almost the same minimization problem as in (4.15), except that one finds
the weighted average of control groups whose period 1 to T outcomes are closest to that of the
treatment group. Under the null that YG,T (1)−YG,T (0) = 0, YG,T = YG,T (1) = YG,T (0), so period
T can be used in the computation of the SC weights, as if period T too was a pre-treatment period. Then, we reject the null that YG,T (1) − YG,T (0) = 0 if an SC estimator computed under that null
yields a much larger prediction error at period T , the period where the null is imposed, than at
other periods. Including period T in Step 1 is key: if that period is not included, $|\widehat{u}^{\text{sc}}_T|$ will be mechanically larger than the other residuals, just because the weights were chosen to minimize the squared residuals from period 1 to T − 1 but not at T.
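The steps above can be sketched as follows (a minimal illustration of this placebo-style p-value, using a simple Frank-Wolfe solver for the Step-1 weights; the function names and toy data are our own choices):

```python
import numpy as np

def simplex_lsq(A, b, n_iter=20000):
    # min ||A @ w - b||^2 over the simplex, via Frank-Wolfe.
    w = np.full(A.shape[1], 1.0 / A.shape[1])
    for k in range(n_iter):
        grad = 2.0 * A.T @ (A @ w - b)
        v = np.zeros(A.shape[1])
        v[np.argmin(grad)] = 1.0
        w += (2.0 / (k + 2.0)) * (v - w)
    return w

def sc_sharp_null_pvalue(Y_controls, y_treated):
    """p-value for the sharp null Y_{G,T}(1) - Y_{G,T}(0) = 0.

    Under the null, period T can be treated as a pre-treatment period, so the
    weights are computed on ALL periods 1..T (Step 1); the p-value is the share
    of residuals at least as large, in absolute value, as the period-T one.
    Y_controls: (T, G-1); y_treated: (T,); period T is the last row.
    """
    w = simplex_lsq(Y_controls, y_treated)
    u = y_treated - Y_controls @ w
    return float(np.mean(np.abs(u) >= np.abs(u[-1])))

# Toy check: a large effect at period T yields the smallest possible p-value, 1/T.
rng = np.random.default_rng(5)
T = 20
Yc = rng.normal(size=(T, 5))
yt = Yc[:, :2].mean(axis=1) + 0.05 * rng.normal(size=T)
yt[-1] += 10.0                       # effect of 10 at the treated period
print(sc_sharp_null_pvalue(Yc, yt))  # 0.05, i.e. 1/T
```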
Footnote 10: One can proceed similarly to test the null that $Y_{G,T}(1) - Y_{G,T}(0) = \theta$ for any $\theta \in \mathbb{R}$, except that one minimizes
\[
\sum_{t=1}^{T} \left( Y_{G,t} - \theta\, 1\{t = T\} - \sum_{g=1}^{G-1} w_g Y_{g,t} \right)^2
\]
in Step 1: under the null that $Y_{G,T}(1) - Y_{G,T}(0) = \theta$, $Y_{G,T} - \theta = Y_{G,T}(0)$. Finally, assuming that $Y_{G,T}(1) - Y_{G,T}(0)$ is non-random (and thus equal to the ATT), a $1-\alpha$ level confidence interval for the ATT can be constructed as the union of all $\theta$ such that the test's p-value is larger than or equal to $\alpha$.
Placebo tests with the SC estimator. An intuitive concern with the SC estimator goes
as follows. By construction, the outcome of the synthetic control will match that of the treated
group almost perfectly in the pre-treatment periods. Then, the outcomes of the two groups may
start diverging when the treatment group gets treated, but is this due to the treatment’s effect,
or just to the fact that post-treatment outcomes are not used to compute the SC weights? To
assess whether this concern is legitimate or not, one can recompute the SC estimator, pretending
that the treatment took place at period T0 + 1 − P instead of T0 + 1. If placebo SC estimators at
periods T0 + 1 − P , ..., and T0 are small in comparison to the actual estimators at periods T0 + 1,
T0 + 2, etc., this lends credibility to the SC estimator. As with the TWFE-IFE estimator, a
caveat of this strategy is that having P placebos reduces to T0 − P the number of pre-treatment
periods one can use to estimate the synthetic control weights, which can be an issue if T0 is
already not very large to begin with.
The synthetic DID (SD) estimator. In this section, we no longer assume that there is only
one treated group. In settings with several treated groups, Arkhangelsky et al. (2021) propose an
SD estimator, that combines features of the SC and DID estimators. To simplify the exposition,
let us assume that groups 1 to G0 are the control groups, while groups G0 + 1 to G are the
treated groups. The SD estimator $\widehat{\beta}^{\text{sd}}$ is the coefficient on Dg,t in a TWFE regression of Yg,t on group and period FEs and Dg,t, weighted by $\widehat{w}^{\text{sd}}_g \widehat{\theta}^{\text{sd}}_t$. $\widehat{w}^{\text{sd}}_g = 1/G_1$ for all g ∈ {G0 + 1, ..., G},
(2021) recommend to set at $(G_1 T_1)^{1/4}\widehat{\sigma}$, with $\widehat{\sigma}$ the standard deviation of the first differences $Y_{g,t} - Y_{g,t-1}$ across all control groups in the pre-treatment periods. Similarly, $\widehat{\theta}^{\text{sd}}_t = 1/T_1$ for all $t \in \{T_0+1, ..., T\}$, and $(\widehat{\theta}^{\text{sd}}_0, \widehat{\theta}^{\text{sd}}_1, ..., \widehat{\theta}^{\text{sd}}_{T_0})$ are the minimizers of
\[
\sum_{g=1}^{G_0} \left( \frac{1}{T_1}\sum_{t=T_0+1}^{T} Y_{g,t} - \theta_0 - \sum_{t=1}^{T_0} \theta_t Y_{g,t} \right)^2. \qquad (4.21)
\]
To estimate ATTℓ, one can run a TWFE regression of Yg,t on group and period FEs and the relative time indicators $(1\{t = T_0 + \ell\})_{\ell\in\{1,...,T_1\}}$, weighted by $\widehat{w}^{\text{sd}}_g \widehat{\theta}^{\text{sd}}_t$.
Commonalities and differences with the usual DID estimator and with the SC es-
timator. What is the commonality between $\widehat{\beta}^{\text{sd}}$ and the DID estimator $\widehat{\beta}^{\text{fe}}$? What is the difference between those two estimators? What is the commonality between $\widehat{\beta}^{\text{sd}}$ and the SC estimator? What is the difference between those two estimators?
$\widehat{\beta}^{\text{sd}}$ and $\widehat{\beta}^{\text{fe}}$ are computed using the same TWFE regression, except that to obtain the former, we use weights. Accordingly, one can show that $\widehat{\beta}^{\text{sd}}$ is a weighted DID estimator:
\[
\widehat{\beta}^{\text{sd}} = \frac{1}{G_1 T_1}\sum_{g=G_0+1}^{G}\sum_{t=T_0+1}^{T} Y_{g,t} - \frac{1}{G_1}\sum_{g=G_0+1}^{G}\sum_{t=1}^{T_0} \widehat{\theta}^{\text{sd}}_t Y_{g,t} - \left(\frac{1}{T_1}\sum_{g=1}^{G_0}\sum_{t=T_0+1}^{T} \widehat{w}^{\text{sd}}_g Y_{g,t} - \sum_{g=1}^{G_0}\sum_{t=1}^{T_0} \widehat{w}^{\text{sd}}_g \widehat{\theta}^{\text{sd}}_t Y_{g,t}\right).
\]
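Given the weights, the weighted-DID representation is straightforward to compute; here is a minimal sketch (the helper name and toy data are ours; in practice the weights come from minimizing (4.20) and (4.21) rather than being supplied by hand):

```python
import numpy as np

def sd_weighted_did(Y, G0, T0, w_sd, theta_sd):
    """Weighted DID, given unit weights w_sd (length G0, for the controls) and
    time weights theta_sd (length T0, for the pre-periods, summing to one).
    Y: (G, T) outcomes, controls in rows 0..G0-1, treated groups below them.
    """
    treat_post = Y[G0:, T0:].mean()
    treat_pre = (Y[G0:, :T0] @ theta_sd).mean()
    ctrl_post = w_sd @ Y[:G0, T0:].mean(axis=1)
    ctrl_pre = w_sd @ (Y[:G0, :T0] @ theta_sd)
    return (treat_post - treat_pre) - (ctrl_post - ctrl_pre)

# Under an exact TWFE model with effect tau, any weights summing to one
# recover tau, whatever the (here random) w and theta.
rng = np.random.default_rng(6)
G, T, G0, T0, tau = 8, 6, 5, 4, 1.5
Y = rng.normal(size=(G, 1)) + rng.normal(size=(1, T))
Y[G0:, T0:] += tau
w = rng.dirichlet(np.ones(G0))
theta = rng.dirichlet(np.ones(T0))
print(round(sd_weighted_did(Y, G0, T0, w, theta), 6))  # 1.5
```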
The weights are the product of a group-specific component $\widehat{w}^{\text{sd}}_g$ and a period-specific component $\widehat{\theta}^{\text{sd}}_t$. The group-specific component of the weights, $\widehat{w}^{\text{sd}}_g$, is just $1/G_1$ for the treated groups, and for the control groups it is similar to the SC weights in the previous section, up to two differences. First, by allowing for an intercept $\widehat{w}^{\text{sd}}_0$, the weighted average of control groups’ outcomes may systematically differ from the average of treated group’s outcomes, by $\widehat{w}^{\text{sd}}_0$, but that difference
has to remain stable from period 1 to T0 , thus enforcing the parallel-trends assumption in the pre-
treatment periods. Second, the objective function includes a regularization penalty $\xi^2 T_0 \sum_{g=1}^{G_0} w_g^2$,
to ensure that the minimizers of (4.20) are not too sparse. The period-specific component $\widehat{\theta}^{\text{sd}}_t$ is just $1/T_1$ for the treated periods. For the pre-treatment periods, one finds the weighted average of pre-treatment outcomes that best replicates control groups’ post-treatment outcomes. In the
same way that the SC estimator gives more weight to control groups whose pre-period outcomes
“resemble” that of treated groups, the SD estimator gives more weight to pre-treatment periods
when control groups’ outcomes resemble their post-treatment outcomes.
SC estimator with many treated groups. Arkhangelsky et al. (2021) also consider another
estimator, the coefficient on Dg,t in a TWFE regression of Yg,t on period FEs and Dg,t, weighted by $\widehat{w}^{\text{sc}}_g$, where the weights $\widehat{w}^{\text{sc}}_g$ solve almost the same minimization problem as (4.20), without the
intercept. That second estimator is a close analogue of the SC estimator applied to the average
outcome of the treated groups. The only difference is the penalization in the objective function.
\[
\frac{1}{G_1 T_1}\sum_{g=G_0+1}^{G}\sum_{t=T_0+1}^{T}\sum_{r=1}^{R} \lambda_{g,r} f_{t,r} \;-\; \frac{1}{G_1}\sum_{g=G_0+1}^{G}\sum_{t=1}^{T_0} \theta^{*}_t \sum_{r=1}^{R} \lambda_{g,r} f_{t,r} \;-\; \left(\frac{1}{T_1}\sum_{g=1}^{G_0}\sum_{t=T_0+1}^{T} w^{*}_g \sum_{r=1}^{R} \lambda_{g,r} f_{t,r} \;-\; \sum_{g=1}^{G_0}\sum_{t=1}^{T_0} w^{*}_g \theta^{*}_t \sum_{r=1}^{R} \lambda_{g,r} f_{t,r}\right)
\]
goes to zero faster than $1/\sqrt{G_1 T_1}$. The first condition requires that the control groups should
not be too different from the treated ones: up to an intercept and asymptotically, it should be
possible to perfectly replicate the loadings of the treated group using a convex combination of
control groups. The second condition imposes a similar but weaker requirement on the similarity
of the pre- and post-treatment periods: it should be possible to replicate “not too poorly” the
factors of the post-treatment periods using a weighted average of pre-treatment ones. Finally,
the third condition requires that with the weights $w^{*}_g$ and $\theta^{*}_t$, a weighted DID of $\sum_{r=1}^{R} \lambda_{g,r} f_{t,r}$,
the factor-model part of Yg,t (0), goes to zero, at a faster rate than in the first condition. This last
condition shows that by weighting both the control groups but also the pre-treatment periods,
the SD estimator inherits a kind of double-robustness property. It is consistent either if the
loadings of treated and control groups are “sufficiently similar”, or if the factors of pre- and
post-treatment periods are “sufficiently similar”. Like a related condition underlying the SC
estimator, that third condition can still fail if the untreated outcome follows a TWFE model
with linear trends that differ for the treated and untreated groups.
Asymptotic theory for the SD estimator. Arkhangelsky et al. (2021) show that the SD
estimator is consistent and asymptotically normal. This is a stronger result than those available
for the SC estimator, in part because the authors assume that G0 , T0 , and G1 T1 diverge to +∞,
while the results on the SC estimator we have discussed so far only assume that G0 and/or T0
diverge to +∞. On top of the aforementioned identifying assumptions, their result also relies
on other assumptions, restated informally as follows:
2. The vectors of errors (εg,t )t∈{1,...,T } are independent and identically distributed across
groups, and follow a normal distribution.
Inference. Arkhangelsky et al. (2021) propose to use a block bootstrap, where one draws
groups with replacement from the original sample, one recomputes the SD estimator in each
bootstrap sample, and one finally uses the variance of the estimator across bootstrap samples
to estimate the variance of βbsd . While valid, this procedure might be computationally costly, so
the authors also propose a less computationally intensive jackknife procedure, that is valid but
can yield a conservative variance estimator.
Placebo tests with the SD estimator. As for the SC estimator, it is important to placebo-
test the identifying assumptions of the SD estimator. Otherwise, it could be that the treated
group’s outcomes diverge from their counterfactual in the post-treatment periods, just because
the treated (g, t) cells are precisely those that were not used to construct the counterfactual.
For that purpose, just pretending that the treatment took place at period T0 + 1 − P instead
of T0 + 1 will not work, because then periods T0 + 1 − P to T0 are still used to estimate the
period-specific component of the weights. Instead, one can keep only the first T0 periods, and
compute the SD estimator on this restricted dataset, pretending that treatment took place at
T0 + 1 − P . Again, this testing strategy reduces the number of pre-treatment periods one can
use in the estimation, which may be an issue if T0 was not very large to begin with.
Simulation evidence. On top of the identifying assumptions of the SD estimator, the asymp-
totic result in Arkhangelsky et al. (2021) relies on some strong conditions, like the errors’ nor-
mality assumption, as well as rate conditions on G0 , T0 , and G1 T1 whose applicability may be
hard to gauge in applications. Then, assessing if the resulting confidence intervals have close-to-
nominal coverage in realistic settings is important. The authors conduct two simulation studies,
calibrated to datasets representative of those typically used for panel data studies. Outcomes
are generated from a TWFE-IFE model where the loadings and factors, as well as the errors’
variance-covariance matrix, are estimated from some actual outcomes in these datasets, and the
treatment assignment mechanism is also inspired from actual policies that took place over the
data period (while satisfying the identifying assumptions of the SD estimator). In their first
simulation, G = 50, T = 40, G0 ∈ {40, 49}, and T0 ∈ {30, 39}. In their second simulation,
G = 111, T = 48, G0 = 101, and T0 = 38. In both cases, the authors find that their confidence
intervals generally have good coverage. While those results are encouraging, further simulation
studies could be useful, in particular to assess whether the SD estimator can reliably be used
with less than 30 pre-treatment periods. It is also important to assess the power of placebo tests
of the SD identifying assumptions in realistic simulations, and if violations that those tests do
not have power to detect could lead to substantial biases. Finally, assessing if the SD confidence
intervals have good coverage when errors are not normally distributed could be valuable.
estimators, we now estimate a TWFE-IFE model with two factors. Still, the fact that cross-
validation gives different results in the full sample and in the treated subsample may indicate
that the TWFE-IFE model may not be well suited to this application: the TWFE-IFE model
could be misspecified, with different numbers of factors for the treated and for the controls,
or the treatment-group loadings could be small, leading to a weak-factor problem. Estimate
the TWFE-IFE model with two factors, using the se option to request that standard errors be
computed.
Event-study estimates: IFE and TWFE estimators (effect on the y-axis, years after TWEA on the x-axis).
Note: This figure shows the estimated effects of compulsory licensing on patents, using years 1900 to 1939 of
the data from Moser and Voena (2012), and two estimators: the TWFE event-study regressions in (3.6) and the
TWFE-IFE imputation estimator. Standard errors are clustered at the patent subclass level. The 95% confidence
intervals rely on a normal approximation.
SC estimators of Arkhangelsky et al. (2021). Rather than using the original SC estimator
of Abadie et al. (2010), applied to the average of treated subclasses, we use the SC estimator of
Arkhangelsky et al. (2021). The reason is simply that the confidence intervals attached to the
SC estimator of Abadie et al. (2010) rely on the assumption that the treated units were chosen
at random, which, as shown in Section 3.2.1.1, is clearly implausible in this application. Instead,
the confidence intervals of the SC estimator of Arkhangelsky et al. (2021) rely on the TWFE-IFE
model in (4.14) and a large sample approximation. Install the sdid_event package from SSC,
and use it to compute event-study SC estimators using the moser_voena_didtextbook dataset,
with 200 bootstrap replications.
Event-study estimates: SC and TWFE estimators (effect on the y-axis, years after TWEA on the x-axis).
Note: This figure shows the estimated effects of compulsory licensing on patents, using years 1900 to 1939 of
the data from Moser and Voena (2012), and the TWFE event-study regressions in (3.6) and the synthetic control
estimator of Arkhangelsky et al. (2021). Standard errors are clustered at the patent subclass level. The 95%
confidence intervals rely on a normal approximation.
Event-study estimates: demeaned SC and TWFE estimators (effect on the y-axis, years after TWEA on the x-axis).
Note: This figure shows the estimated effects of compulsory licensing on patents, using years 1900 to 1939 of the
data from Moser and Voena (2012), and the TWFE event-study regressions in (3.6) and the demeaned synthetic
control estimator of Arkhangelsky et al. (2021). Standard errors are clustered at the patent subclass level. The
95% confidence intervals rely on a normal approximation.
Some of the SD estimators are less noisy than the corresponding TWFE estimators, but others are noisier.
noisy. The standard error of the SD estimator of the ATT, which is just the simple average of
the event-study estimators, is equal to 0.043, which is 12% larger than the standard error of the
TWFE estimator of the ATT.
Figure 4.5: TWFE estimators and Synthetic Control estimators, on the data
of Moser and Voena (2012)
Event-study estimates: SD and TWFE estimators (effect on the y-axis, years after TWEA on the x-axis).
Note: This figure shows the estimated effects of compulsory licensing on patents, using years 1900 to 1939 of
the data from Moser and Voena (2012), and the TWFE event-study regressions in (3.6) and the synthetic DID
estimator. Standard errors are clustered at the patent subclass level. The 95% confidence intervals rely on a
normal approximation.
Conclusion. In this application, the TWFE-IFE and SC estimators do not perform very well,
while the SD estimator is similar to, but slightly noisier than, the TWFE estimator. Thus,
there is no compelling argument to move away from the TWFE estimator, all the more so as
the pre-trend estimators on Figure 3.2 suggest that the parallel-trends assumption is plausible.
The TWFE-IFE, SC, and SD estimators remain appealing alternatives in applications where
pre-trend tests indicate a violation of the parallel-trends assumption.
4.3 Bounded differential trends
Bounded differential trends. Rambachan and Roth (2023) propose an alternative relaxation
of the parallel-trends condition. Let us assume that Assumption ND holds, and T = 3, T0 = 2.
Assumption BDT (Bounded differential trends) There is a positive real number M such that
|(1/G1) Σ_{g:Dg=1} E(Yg,3(0) − Yg,2(0)) − (1/G0) Σ_{g:Dg=0} E(Yg,3(0) − Yg,2(0))|
≤ M |(1/G1) Σ_{g:Dg=1} E(Yg,2(0) − Yg,1(0)) − (1/G0) Σ_{g:Dg=0} E(Yg,2(0) − Yg,1(0))|. (4.22)
Assumption BDT allows treated and control groups to experience differential trends, but requires
that their period-two-to-three differential trend be bounded in absolute value by some constant
M times their period-one-to-two differential trend. Thus, period-two-to-three and period-one-
to-two differential trends can be different but should not be too different, where M indexes
how large the difference can be. Note that with M = 0, Assumption BDT is equivalent to
parallel-trends from period 2 to 3. Similarly, if
(1/G1) Σ_{g:Dg=1} E(Yg,2(0) − Yg,1(0)) − (1/G0) Σ_{g:Dg=0} E(Yg,2(0) − Yg,1(0)) = 0,
then Assumption BDT implies parallel trends from period 2 to 3, irrespective of the value of M .
We saw in Chapter 3 that β̂1fe in the TWFE ES regression in (3.6) is a DID comparing the outcome evolution of treated and control groups from period two to three, while β̂−1fe is a DID comparing their outcome evolutions from period two to one. Then, E[β̂1fe] is equal to the ATT plus the differential outcome evolution that treated and controls would have experienced without treatment from period two to three. Under Assumption BDT, that differential evolution is included between −M|E[β̂−1fe]| and M|E[β̂−1fe]|, hence the bounds for the ATT.
where the last equality follows from adding and subtracting Yg,3 (0). Taking expectations and
rearranging,
ATT = E[β̂1fe] − E[(1/G1) Σ_{g:Dg=1} (Yg,3(0) − Yg,2(0)) − (1/G0) Σ_{g:Dg=0} (Yg,3(0) − Yg,2(0))].
Finally, the result follows from Assumption BDT, and the fact that
β̂−1fe = (1/G1) Σ_{g:Dg=1} (Yg,1(0) − Yg,2(0)) − (1/G0) Σ_{g:Dg=0} (Yg,1(0) − Yg,2(0)).
QED.
Estimation and inference. Given M, the lower and upper bounds in Theorem 7 can respectively be estimated by β̂1fe − M|β̂−1fe| and β̂1fe + M|β̂−1fe|. Which condition should hold to have that 0 does not belong to the interval [β̂1fe − M|β̂−1fe|, β̂1fe + M|β̂−1fe|]?
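The bounds, and the answer to the question above, can be checked with a few lines of arithmetic (the estimates below are hypothetical numbers, not taken from any application):

```python
# Estimated bounds on the ATT under Assumption BDT for a given M;
# beta_1 and beta_m1 play the roles of the event-study and pre-trend
# estimates.
def bdt_bounds(beta_1, beta_m1, M):
    half_width = M * abs(beta_m1)
    return beta_1 - half_width, beta_1 + half_width

lo, hi = bdt_bounds(beta_1=0.30, beta_m1=0.05, M=2)  # close to (0.20, 0.40)
# 0 lies outside [lo, hi] if and only if |beta_1| > M * |beta_m1|,
# i.e. if and only if M < |beta_1| / |beta_m1| (here, 6).
```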
Sensitivity analysis. Practitioners may not have a good sense of which value of M they
should choose. Rather than recommending a particular value, Rambachan and Roth (2023)
recommend that they conduct the following sensitivity analysis. Assume that βb1fe is strictly
positive and significantly different from zero. Under parallel trends, researchers would conclude
that the treatment has a positive effect. To assess if that conclusion is robust to plausible
violations of parallel trends, Rambachan and Roth (2023) propose to compute M ∗ , the lowest
value of M such that 0 belongs to the confidence interval of ATT. M ∗ = 5 means that even
under differential trends five times larger from period two to three than from period one to two,
one can still conclude that the treatment had a positive effect: the researcher’s conclusion is
very robust to differential trends. On the other hand, M ∗ = 0.2 means that differential trends
five times smaller from period two to three than from period one to two are enough for the
researcher’s conclusion to break down, thus suggesting that results are not robust to plausible
differential trends.
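A back-of-the-envelope version of M∗ can be computed in closed form if one ignores the estimation error in the pre-trend estimator, which the actual procedure of Rambachan and Roth (2023) does account for (the numbers below are hypothetical):

```python
# Breakdown value: smallest M at which 0 reaches the lower confidence
# bound, i.e. beta_1 - M * |beta_m1| - z * se_1 = 0. This ignores the
# estimation error in beta_m1, unlike the actual honestdid procedure.
def m_star(beta_1, se_1, beta_m1, z=1.96):
    return (beta_1 - z * se_1) / abs(beta_m1)

print(m_star(beta_1=0.30, se_1=0.05, beta_m1=0.04))  # close to 5.05
```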
Generalization to multiple time periods. With more than three time periods, if T0 ≥ 2, Assumption BDT can be generalized as follows: there is a positive real number M such that for all t > T0,
|(1/G1) Σ_{g:Dg=1} E(Yg,t(0) − Yg,t−1(0)) − (1/G0) Σ_{g:Dg=0} E(Yg,t(0) − Yg,t−1(0))|
≤ M max_{t′∈{2,...,T0}} |(1/G1) Σ_{g:Dg=1} E(Yg,t′(0) − Yg,t′−1(0)) − (1/G0) Σ_{g:Dg=0} E(Yg,t′(0) − Yg,t′−1(0))|.
Application to the compulsory licensing example. Figure 3.2 in Chapter 3 suggests that
the compulsory licensing of German patents in 1919 had a large effect on patenting in the US, in
patent subclasses of organic-chemistry where at least one German patent was licensed in 1919.
In 1932, this effect becomes truly very large, but the corresponding estimator is a DID from 1918 to 1932, namely over a 14-year period. Then, one might worry that over such a long period, treated and control subclasses could have experienced different patenting evolutions, even in the absence of compulsory licensing. Therefore, we now conduct the sensitivity analysis proposed by Rambachan and Roth (2023), where β̂1fe is the 1918 to 1932 event-study estimator, and β̂−1fe is the symmetric 1918 to 1904 pre-trend estimator. Using the moser_voena_didtextbook dataset, estimate a TWFE ES regression computing β̂1fe and β̂−1fe, and then run the sensitivity analysis of Rambachan and Roth (2023).
preserve
keep if year==1918|year==1904|year==1932
reghdfe patents reltimeminus14 reltimeplus14 ///
, absorb(year treatmentgroup) cluster(subclass)
honestdid, pre(1) post(2) mvec(2(2)10) coefplot ///
xtitle(M, size(large)) ytitle(95% Robust CI, size(large))
restore
The mvec option is used to indicate the values of M the command should use, and the coefplot
xtitle(M, size(large)) ytitle(95% Robust CI, size(large)) options request that results
be put on a graph, shown in Figure 4.6 below. The figure shows that for the 1932 effect of com-
pulsory licensing to become insignificant, one needs to allow for differential trends post treatment
almost six times larger than the pre-treatment differential trends. This suggests that this effect
is very robust to violations of parallel trends. Intuitively, this is due to the fact that in this application, β̂1fe is almost 15 times larger than β̂−1fe. In applications where event-study estimators are not much larger than pre-trend estimators, the sensitivity analysis of Rambachan and Roth (2023) would indicate a lower robustness to differential pre-trends.
Figure 4.6: 95% robust confidence intervals, as a function of M
Note: This figure shows the sensitivity of the estimated effect of compulsory licensing on US innovation in 1932,
estimated on the data of Moser and Voena (2012), to violations of parallel trends no larger than M times the 1904
to 1918 differential trend between treated and control subclasses. Standard errors are clustered at the patent
subclass level.
|(1/G1) Σ_{g:Dg=1} E(Yg,3(0) − Yg,2(0)) − (1/G0) Σ_{g:Dg=0} E(Yg,3(0) − Yg,2(0))| ≤ M̃.
With
M̃ = M |(1/G1) Σ_{g:Dg=1} E(Yg,2(0) − Yg,1(0)) − (1/G0) Σ_{g:Dg=0} E(Yg,2(0) − Yg,1(0))|.
As the differential pre-trend entering M̃ has to be estimated, M̃ is no longer a tuning parameter chosen ex-ante by the researcher but the product of a first-step estimation.
product of a first-step estimation. Thus, Rambachan and Roth (2023) extend the ideas in Manski
and Pepper (2018), by accounting for the estimation error in the differential pre-trends, which
are a natural benchmark to calibrate the differential post-trends. Note also that Rambachan
and Roth (2023) propose other relaxations of parallel trends than bounded differential trends;
see for instance their so-called “smoothness restrictions”.
4.4 Appendix∗
Following the discussion in Subsection 4.1.3.5, we compare the asymptotic variances of DID estimators of the ATT with and without controls. To do so, we adopt a superpopulation framework
where groups are i.i.d. We thus omit the index g below, and index expectations and variances
by u to indicate that they are not conditional on the design. We also let D denote the treatment
status at period 2, and we let ∆Y (0) and ∆Y correspond respectively to Y2 (0) − Y1 (0) and
Y2 − Y1. Finally, we make a homoskedasticity assumption: Vu(∆Y(0)|D = 0, X) = σ².
Assume that
Eu (∆Y (0)|D = 1, X) = Eu (∆Y (0)|D = 0, X), (4.23)
with p := Eu (D), π(X) := Pu (D = 1|X) and m(X) := Eu [∆Y (0)|X]. Now assume that
an analogue of (2.6). Under (4.24), the asymptotic variance of the standard DID estimator
without covariates is V = Vu (ψ), where
ψ := (D/p − (1 − D)/(1 − p))(∆Y − m) − (D/p) ATT,
meaning that counterfactual outcome trends are mean-independent of D and X. Under (4.25),
the DID estimator with and without covariates are both consistent for the ATT. Under (4.25),
m(X) = m.
ξ := (∆Y − m)(1 − D) (1/(1 − p) − π(X)/((1 − π(X))p)).
VX − V = 2Eu[ψξ] + Eu[ξ²]
= Eu[(∆Y − m)²(1 − D)(1/(1 − p) − π(X)/((1 − π(X))p))(−2/(1 − p) + 1/(1 − p) − π(X)/((1 − π(X))p))]
= σ² Eu[(1 − D)((π(X)/((1 − π(X))p))² − 1/(1 − p)²)]. (4.26)
In the second line we used D(1 − D) = 0 to simplify 2ψξ, while in the third, we used (a − b)(−a − b) = b² − a², with a = 1/(1 − p) and b = π(X)/((1 − π(X))p).
of the algorithm presented in Section 4.2.1. First, let Λ denote the G0 × R matrix with typical
(g, r) element λg,r . Similarly, let F denote the T × R matrix with typical (t, r) element ft,r and
ε̂ be the G0 × T matrix with typical (g, t) element ε̂g,t. Then, we seek to solve
min_{Λ̃,F̃} ∥ε̂ − Λ̃F̃′∥²_F, (4.27)
where ∥·∥F denotes the Frobenius norm, ∥A∥F = trace(A′A)^{1/2}. To ensure identifiability, we assume hereafter that R ≤ p := min(G0, T). The solution to (4.27) is still not unique since for any invertible matrix A, Λ̃F̃′ = (Λ̃A)(F̃A′⁻¹)′. We follow Bai (2009) by imposing that Λ′Λ is diagonal and F′F = T × IR, where IR is the identity matrix of size R.
Let
ε̂ = Σ_{i=1}^p σi ui vi′
denote a singular value decomposition of ε̂. Here σ1 ≥ ... ≥ σp ≥ 0 are the singular values of ε̂ and the (ui)i=1,...,p (resp. the (vi)i=1,...,p) are orthonormal vectors of R^{G0} (resp. R^T). Then, let us
define Λ̂ := [σ1 u1, ..., σR uR]/T^{1/2} and F̂ := T^{1/2}[v1, ..., vR]. By construction, Λ̂′Λ̂ is diagonal and F̂′F̂ = T × IR. Therefore, (F̂, Λ̂) solves (4.27) under the aforementioned constraints.
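This construction is easy to verify numerically; a sketch with numpy, where a random matrix stands in for ε̂:

```python
import numpy as np

# Principal-components solution to (4.27) via the SVD, under the
# normalization of Bai (2009): Lambda'Lambda diagonal, F'F = T * I_R.
rng = np.random.default_rng(0)
G0, T, R = 20, 15, 2
eps = rng.normal(size=(G0, T))          # stand-in for the residual matrix

U, s, Vt = np.linalg.svd(eps, full_matrices=False)  # eps = U @ diag(s) @ Vt
Lam = U[:, :R] * s[:R] / np.sqrt(T)                 # G0 x R loadings
F = np.sqrt(T) * Vt[:R].T                           # T x R factors

# Normalizations hold by construction...
assert np.allclose(F.T @ F, T * np.eye(R))
assert np.allclose(Lam.T @ Lam, np.diag(np.diag(Lam.T @ Lam)))
# ...and Lam @ F.T attains the Eckart-Young bound: the squared Frobenius
# error equals the sum of the discarded squared singular values.
assert np.isclose(np.linalg.norm(eps - Lam @ F.T) ** 2, np.sum(s[R:] ** 2))
```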
Part III
Chapter 5
TWFE estimators outside of the classical design
In the classical design, β̂fe is a simple DID estimator, and it is unbiased for the ATT under partly testable no-anticipation and parallel-trends assumptions. However, the majority of papers
estimating TWFE regressions do so in more complicated designs, with treatments that may be
non-absorbing and/or non-binary, and where groups may experience several treatment changes,
at different points in time. This chapter investigates what β̂fe estimates in such designs.
Dataset used in this chapter. To answer the green questions in this chapter, you need to
use the gentzkowetal_didtextbook dataset, which contains the following variables:
• first_change: the year when a county’s number of daily newspapers changes for the first
time;
No dynamic effects. Throughout this chapter, we assume that the treatment has no dynamic
effects, namely, we maintain Assumption ND. This is consistent with the TWFE regression in
(3.1), where the current treatment Dg,t is one of the independent variables, but the lagged
treatments Dg,t−1 , Dg,t−2 etc. are not part of the independent variables, thus implicitly ruling
out dynamic treatment effects.¹ For instance, in the newspaper example, omitting the lagged treatments implicitly assumes that the number of newspapers available in county g in previous elections no longer affects the turnout rate in election-year t. Importantly, we will allow for dynamic effects in the following chapters.
¹ In a classical design, omitting the lagged treatments does not implicitly rule out dynamic effects, because Dg,t = Dg,t−k whenever Dg,t−k ≠ 0 for some k > 0. Then, the coefficient on Dg,t captures the sum of the effects of the current and past treatment on the outcome.
Note that under Assumption ND and with a binary treatment, the definition of TEg,t above
coincides with the definition we have used so far, which is why we recycle notation. Interpret
TEg,t , in general and in the context of the newspaper example.
TEg,t denotes the expected effect in cell (g, t) of moving the treatment from 0 to Dg,t , scaled by
Dg,t . In other words, TEg,t is the slope of (g, t)’s potential outcome function, from 0 to its actual
treatment Dg,t . In the newspaper example, TEg,t is the difference between the actual turnout
rate in county g and year t and its counterfactual turnout rate without any newspaper, divided
by its number of newspapers. Thus, TEg,t can be interpreted as an effect per newspaper. Let N1
denote the number of (g, t) cells such that Dg,t ̸= 0, namely the number of treated (g, t) cells.
A natural target parameter is
ATT = (1/N1) Σ_{(g,t):Dg,t≠0} TEg,t.
ATT is the average, across all treated cells, of the slope of their potential outcome functions,
from 0 to their actual treatment. In Design CLA, ATT reduces to the ATT parameter we have
considered in previous chapters, which is why we recycle notation.
Let
Wg,t = ûg,t Dg,t / Σ_{(g′,t′):Dg′,t′≠0} ûg′,t′ Dg′,t′,
where ûg,t denotes the sample residual from a regression of Dg,t on group and period FEs. In the
newspaper example, how could you compute the residuals ûg,t ?
ûg,t is the residual from a regression of the number of newspapers in county g and year t on
county and year FEs.
Theorem 8 Suppose Assumptions NA, ND, and PT hold. Then, E[β̂fe] = Σ_{(g,t):Dg,t≠0} Wg,t TEg,t. (5.1)
Theorem 8 says that β̂fe is unbiased for a weighted sum of the treatment effects TEg,t, across all treated (g, t) cells, where the treatment effect of cell (g, t) receives a weight equal to Wg,t. In the newspaper example, β̂fe is unbiased for a weighted sum of the effects of newspapers on turnout across all county×year cells with at least one newspaper.
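On a balanced panel, the residuals ûg,t reduce to two-way demeaning, so the weights can be computed in a few lines (a toy sketch with a made-up design; the twowayfeweights command performs this computation on real data):

```python
import numpy as np

def twfe_weights(D):
    """u: residual of a regression of D on group and period FEs, which
    for a balanced panel is two-way demeaning; W: weights on the treated
    cells in the TWFE decomposition, summing to one."""
    u = D - D.mean(axis=1, keepdims=True) - D.mean(axis=0, keepdims=True) + D.mean()
    W = np.where(D != 0, u * D, 0.0)
    return u, W / W.sum()

# Made-up staggered design: group 0 treated from period 2 on, group 1
# only at period 3.
D = np.array([[0., 1., 1.],
              [0., 0., 1.]])
u, W = twfe_weights(D)
print(W[D != 0])  # close to [1.0, -0.5, 0.5]: the weights vary, and one is negative
```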
It directly follows from the definition of Wg,t that Σ_{(g,t):Dg,t≠0} Wg,t = 1. Therefore, β̂fe is unbiased for a weighted sum of the treatment effects TEg,t, across all treated (g, t) cells, with weights summing to one. As Σ_{(g,t):Dg,t≠0} Wg,t = 1, the average value of the weights Wg,t across the N1 treated cells is equal to 1/N1.
The equality follows from the Frisch-Waugh-Lovell theorem, restated in the appendix of Chapter
3. Moreover, by the first-order conditions attached to an OLS regression, ûg,t is uncorrelated with all the group and time FEs: for all g′,
Σ_{g,t} ûg,t 1{g = g′} = 0 ⇔ Σ_{t=1}^T ûg′,t = 0, (5.4)
and similarly, for all t′,
Σ_{g,t} ûg,t 1{t = t′} = 0 ⇔ Σ_{g=1}^G ûg,t′ = 0. (5.5)
Finally,
E[β̂fe] = Σ_{g,t} ûg,t E[Yg,t] / Σ_{g,t} ûg,t Dg,t
= Σ_{g,t} ûg,t (E[Yg,t(0)] + Dg,t TEg,t) / Σ_{g,t} ûg,t Dg,t
= Σ_{g,t} ûg,t (αg + γt + Dg,t TEg,t) / Σ_{g,t} ûg,t Dg,t
= Σ_{g,t} ûg,t αg / Σ_{g,t} ûg,t Dg,t + Σ_{g,t} ûg,t γt / Σ_{g,t} ûg,t Dg,t + Σ_{g,t} ûg,t Dg,t TEg,t / Σ_{g,t} ûg,t Dg,t
= Σ_{g,t} ûg,t Dg,t TEg,t / Σ_{g,t} ûg,t Dg,t
= Σ_{(g,t):Dg,t≠0} Wg,t TEg,t.
The first equality follows from (5.3) and the fact that the design is conditioned upon, the second equality follows from (5.2), the third equality follows from (2.3), the fifth equality follows from (5.4) and (5.5), and the last equality follows from the definition of Wg,t. QED.
Outside of the classical design, β̂fe may be biased for the ATT under a no-anticipation and a parallel-trends assumption. Note that E[β̂fe] = ATT for all possible values of the treatment effects TEg,t if and only if Wg,t = 1/N1 for all treated (g, t) cells. If Wg,t varies across treated cells, then Wg,t > 1/N1 for some cells, Wg,t < 1/N1 for some other cells, and β̂fe may be biased for the ATT if the treatment effects of cells down- and up-weighted by β̂fe differ. Having that Wg,t = 1/N1 for all treated (g, t) cells is equivalent to having that Dg,t ûg,t is constant across
those cells. Remember that ûg,t is the residual from a regression of Dg,t on group and period
FEs. One can show that the coefficient on the FE for group g in the regression is equal to the
average treatment of group g across periods, denoted Dg,. , while the coefficient on the FE for
period t is equal to the average treatment at period t across groups, denoted D.,t. Then,
ûg,t = Dg,t − Dg,. − D.,t + D.,., (5.6)
where D.,. is the average treatment across groups and periods, and this term ensures that the average of the residuals ûg,t is equal to zero. Then,
Dg,t ûg,t = Dg,t (Dg,t − Dg,. − D.,t + D.,.).
In Design CLA, Dg,t , Dg,. , and D.,t are constant across treated cells, so Dg,t ûg,t is also constant
across treated cells. Dg,t ûg,t is also constant across treated cells in designs with a non-absorbing
binary treatment, without variation in treatment timing, where Dg,t = 1{t ∈ T1 }Dg , and T1
denotes the set of (non-necessarily consecutive) periods where the treated groups are treated.
On the other hand, in designs with variation in treatment timing and/or treatment dose, Dg,t ûg,t
generally varies across treated (g, t) cells, because Dg,t , Dg,. , and/or D.,t vary.
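The contrast between the classical design and staggered timing can be checked numerically (a toy check with made-up designs; in a balanced panel the residual ûg,t equals Dg,t − Dg,. − D.,t + D.,.):

```python
import numpy as np

def resid(D):
    """Residual of D on group and period FEs in a balanced panel:
    D_gt - D_g. - D_.t + D_.."""
    return D - D.mean(axis=1, keepdims=True) - D.mean(axis=0, keepdims=True) + D.mean()

classical = np.array([[0., 1.],
                      [0., 1.],
                      [0., 0.]])      # both treated groups treated at period 2
staggered = np.array([[0., 1., 1.],
                      [0., 0., 1.],
                      [0., 0., 0.]])  # variation in treatment timing

for name, D in [("classical", classical), ("staggered", staggered)]:
    Du = D * resid(D)
    # distinct values of D_gt * u_gt across treated cells:
    print(name, np.unique(np.round(Du[D != 0], 10)))
```

In the classical design a single value appears across treated cells, so all weights equal 1/N1; with staggered timing several values appear, so the weights differ across treated cells.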
Find an assumption which, added to those in Theorem 8, ensures that βbfe is unbiased for the
ATT.
β̂fe is unbiased for the ATT if the treatment effect is constant. Assume that the treatment effect is constant, across groups and over time:
There exists a real number δ such that TEg,t = δ for all (g, t). (5.7)
If (5.7) holds, what is the value of E[β̂fe]?
Plugging (5.7) into Theorem 8, and using the fact that Σ_{(g,t):Dg,t≠0} Wg,t = 1, one has that E[β̂fe] = δ: β̂fe is unbiased for δ, which is also equal to the ATT if the treatment effect is
constant. However, (5.7) is often an implausible assumption. For instance, could it be that the
effect of newspapers varies across counties?
Yes, it seems difficult to rule out ex-ante that possibility. For instance, the effect of newspapers
could be stronger in rural counties, where newspapers may play a larger role in the transmission
of political ideas, than in more urban counties where political ideas may have other ways to
circulate even in the absence of newspapers.
β̂fe is unbiased for the ATT if the weights Wg,t are uncorrelated with the treatment effects TEg,t. Instead of assuming constant treatment effects, let us consider the following assumption:
Σ_{(g,t):Dg,t≠0} (Wg,t − 1/N1)(TEg,t − ATT) = 0. (5.8)
As the average value of Wg,t across treated cells is equal to 1/N1, the condition in the previous
display requires that Wg,t is uncorrelated with TEg,t . If the treatment effect is constant then
(5.8) automatically holds, so (5.8) is weaker than (5.7). (5.8) implies that
Σ_{(g,t):Dg,t≠0} Wg,t TEg,t = ATT Σ_{(g,t):Dg,t≠0} Wg,t + (1/N1) Σ_{(g,t):Dg,t≠0} (TEg,t − ATT) = ATT,
where the second equality follows from Σ_{(g,t):Dg,t≠0} Wg,t = 1 and the definition of ATT. Combined with Theorem 8, this implies that β̂fe is unbiased for the ATT.
Intuitively, (5.8) ensures that the treatment effects that are up- and down-weighted by β̂fe do not systematically differ, so β̂fe is unbiased for the ATT under (5.8).
To simplify, let us momentarily assume that treatment is binary. Then the numerator of Wg,t ,
the part of Wg,t that varies across (g, t) cells, is equal to 1 − Dg,. − D.,t + D.,. . Under which
economic model of selection into treatment could we have that this quantity is correlated with
the treatment effects TEg,t , and thus (5.8) fails?
(5.8) is likely to fail if selection into treatment follows a Roy model. If selection
into treatment follows a Roy selection model where (g, t) cells decide to get treated when their
benefit from treatment is larger than the cost, (5.8) is likely to fail. To see this, note that
1 − Dg,. − D.,t + D.,. is decreasing in Dg,., meaning that β̂fe downweights the treatment effect
of groups with the highest average treatment from period 1 to T . However, in a Roy selection
model, groups with the largest average treatment may be those with the largest treatment effect,
which could lead to a correlation between Wg,t and TEg,t .
Tests of (5.8). While (5.8) may not be very plausible when selection into treatment is based
on a Roy model, there may be other instances where (5.8) is more plausible. As we will see in
Section 6.4 of the next chapter, (5.8) is sometimes testable. For now, we just note that (5.8) can
be suggestively tested, if one observes a proxy variable Pg,t likely to be correlated with TEg,t .
Then, one can test if Wg,t and Pg,t are correlated.
β̂fe may not estimate a convex combination of treatment effects. (5.6) implies that some of the weights Wg,t may be negative, if there are treated (g, t) cells such that ûg,t Dg,t < 0.
In the newspapers example, with negative weights, β̂fe could be estimating something like 3 times the effect of newspapers on turnout in Santa Clara county, minus 2 times the effect in Wayne county. Then, if adding one more newspaper raises turnout by 1 percentage point in Santa Clara county and by 2 percentage points in Wayne county, one would have E[β̂fe] = 3 × 0.01 − (2 × 0.02) = −0.01. E[β̂fe] would be negative, while the effect of newspapers is positive in both counties. This example shows that β̂fe may not satisfy the “no-sign reversal property” (Imbens and Angrist, 1994; Small, Tan, Ramsahai, Lorch and Brookhart, 2017): E[β̂fe] could
for instance be negative, even if the treatment effect is strictly positive in every (g, t). This
phenomenon can only arise when some of the weights Wg,t are negative: when all those weights
are positive, β̂fe does satisfy the no-sign reversal property.
No-sign-reversal and Pareto dominance.∗ Despite its intuitive appeal and its popularity
among applied researchers, the no-sign reversal property is not grounded in statistical decision
theory, unlike other commonly-used criteria to discriminate estimators such as the mean-squared
error. Still, it is connected to the economic concept of Pareto dominance (de Chaisemartin and
D’Haultfœuille, 2023b). If an estimator does not satisfy “no-sign-reversal”, its expectation or its
probability limit could for instance be positive, even if the treatment is Pareto-dominated by
the absence of treatment, meaning that everybody is hurt by the treatment.
β̂fe can be used to test the sharp null of no treatment effect. Finally, while β̂fe may not provide an easily interpretable measure of the treatment’s effect, it can be used to test the
so-called sharp null of no treatment effect (Yg,t (d) − Yg,t (0) = 0 for all (g, t, d)): under that null,
it follows from Theorem 8 that E[β̂fe] = 0.
summarized as follows:
1. Theorem 1 in Chernozhukov et al. (2013) shows that under the assumption that E(Yg,t (0))
does not depend on t, one-way FE regressions, with group FEs but no period FEs, may be
biased for the average treatment effect, but unlike TWFE regressions they always estimate
a convex combination of effects.
2. The equation on p.590 of Blundell and Costa-Dias (2009) is the first decomposition of
a DID estimator as a potentially non-convex weighted sum of treatment effects under
the parallel-trends assumption.3 They consider designs with two groups and two periods,
where exposure to treatment increases more in one group than in the other, and allow for
heterogeneous effects across groups, but they assume constant effects over time. de Chaise-
martin (2011) independently obtains a similar decomposition without assuming constant
effects over time. His decomposition shows that time-varying effects also lead to negative
weights. See also Fricke (2017) for a related result, in designs where the two groups receive
different treatments or different treatment doses at period two.
show that β̂fd can also be decomposed as a weighted sum of TEg,t under Assumptions NA, ND, and PT, with weights W^fd_{g,t} that differ from those in (5.1) when T ≥ 3, but that also sum to one and that may also be negative:
E[β̂fd] = Σ_{(g,t):Dg,t≠0} W^fd_{g,t} TEg,t. (5.11)
This implies that under constant treatment effects, the expectations of β̂fe and β̂fd are equal.
Accordingly, if the two coefficients significantly differ, under Assumptions NA, ND, and PT one
can reject the null that the treatment effect is constant.
The twowayfeweights Stata (see de Chaisemartin, D’Haultfœuille and Deeb, 2019) and R (see Zhang and de Chaisemartin, 2021) commands compute the weights attached to TWFE and FD regressions.
To compute the weights attached to a TWFE regression, the syntax of the Stata command is:
To compute the weights attached to TWFE or FD regressions with control variables, users can
input those variables into the controls option. To suggestively test (5.8), users can specify
the test_random_weights option, inputting variables likely to be correlated with the treatment
effects TEg,t into the option. Then, the command will compute the correlation between the
weights Wg,t and those variables, and test if those correlations are significant.
5.6 Application
Gentzkow et al. (2011) estimate a regression similar to the TWFE one in (3.1), up to two
differences. First, they include state-year FEs as additional control variables. Second, they
estimate the regression in first difference. In our replication, we start by estimating the basic
TWFE regression in (3.1), we then estimate a TWFE regression with state-year FEs, and we
finally replicate the authors’ specification.
β̂fe = 0.0029: according to this regression, one more newspaper would increase turnout by 0.29 percentage points. The coefficient is marginally significant (s.e. = 0.0016). Use the twowayfeweights Stata package to decompose β̂fe, and interpret the results.
Regress turnout on number of newspapers and county, year, and state-year FEs, clustering
standard errors at the county level. Interpret the results.
Regress the change in turnout on the change in the number of newspapers and state-year FEs,
clustering standard errors at the county level. Interpret the results.
average effect.
In the next two chapters, we will focus on two seemingly small departures from the classical
design: designs with an absorbing, binary treatment and variation in treatment timing, and de-
signs with two periods and variability in the treatment dose received by treated groups at period
two. In each case, we will see that TWFE estimators may not estimate the ATT or a convex
combination of effects because they leverage DIDs with a control group that is actually treated.
Then, we will review alternative estimators, which avoid leveraging DIDs with a treated control
group, and are unbiased for averages of (g, t)-specific effects. Finally, in the book’s last chap-
ter, we will combine the insights from the two preceding chapters to propose estimators robust
to heterogeneous effects in general designs, with non-absorbing and/or non-binary treatments.
Importantly, while we have ruled out dynamic effects in this chapter, we will allow for effects of
the lagged treatments on the outcome in the following chapters.
Chapter 6

Designs with variation in treatment timing
Binary and staggered designs. Throughout this chapter, we assume that treatment is
absorbing and binary, as in Chapters 3 and 4, but we assume that there is variation in treatment
timing: treated groups start receiving the treatment at different dates. As a shortcut, we refer
to such treatments as binary and staggered.
Design BST (Binary and staggered design) $D_{g,t} = 1\{t \ge F_g\}$, with $\min_{g: F_g > 1} F_g < \max_g F_g$.
Fg is the first date at which group g becomes treated, and group g remains treated thereafter.
If g never becomes treated over the study period, we let Fg = T + 1. Fg may be equal to 1,
meaning that group g is always treated. $\min_{g: F_g > 1} F_g < \max_g F_g$ requires that among groups
that are untreated at period 1, not all groups get treated at the same period.
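Design BST can be encoded in a few lines. Below is a minimal sketch (the number of groups, the horizon, and the adoption dates are made up for illustration):

```python
import numpy as np

T = 6
# First treatment date F_g for four hypothetical groups:
# F = 1: always treated; F = T + 1 = 7: never treated over the study period.
F = np.array([1, 3, 5, 7])

# Binary, absorbing (staggered) treatment: D_{g,t} = 1{t >= F_g}.
t = np.arange(1, T + 1)
D = (t[None, :] >= F[:, None]).astype(int)

# Treatment is absorbing: within each group, D is weakly increasing in t.
assert (np.diff(D, axis=1) >= 0).all()

# Variation in treatment timing: among groups untreated at period 1,
# not all adopt at the same date (min_{g:F_g>1} F_g < max_g F_g).
assert F[F > 1].min() < F.max()
```

The two assertions check exactly the two defining properties of the design.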
Chapter’s running example: the effect of unilateral divorce laws on divorce rates.
Between 1968 and 1988, 29 US states adopted a unilateral divorce law (UDL), allowing one
spouse to terminate the marriage without the consent of the other. Wolfers (2006), building
upon Friedberg (1998), uses a yearly panel of US states (plus the District of Columbia, hereafter
incorrectly referred to as a state) from 1956 to 1988, to estimate the effects of those laws on
divorce rates. The UDL treatment satisfies Design BST: the treatment is binary, states adopt
UDLs at different dates, and they never repeal those laws. Then, Fg denotes the year when state
g adopts a UDL.
180 CHAPTER 6. DESIGNS WITH VARIATION IN TREATMENT TIMING
Dataset used in this chapter. To answer the green questions in this chapter, you need to
use the wolfers_didtextbook dataset, which contains the following variables:
• cohort: the year when state g adopted treatment (Fg in our notation, except that it is
equal to zero rather than T + 1 for never-treated states);
• early_late_never: a variable equal to 1 for states with an adoption year below the
median, to 2 for states with an adoption year above the median, and to 3 for never
adopters;
• exposurelength: a variable equal to the number of years for which state g has had a UDL
at year t for treated (g, t) cells, and to 0 for untreated (g, t) cells;
• div_rate: the number of divorces per 1,000 people in state g and year t;
• stpop1968: the population of state g in year 1968, the last year before states start adopting
UDLs.
(g, t)-specific effects. Remember that for any integer $k \ge 1$, $1_k$ denotes a vector of k ones.
For all g such that $F_g \le T$, and $t \in \{F_g, ..., T\}$, let
$$\mathrm{TE}_{g,t} = E\left[Y_{g,t} - Y_{g,t}(0_t)\right] = E\left[Y_{g,t}(0_{F_g-1}, 1_{t-(F_g-1)}) - Y_{g,t}(0_t)\right].$$
Interpret TEg,t .
TEg,t is the expected effect, in group g and at period t, of having been treated rather than
untreated from period Fg to t, namely for t − (Fg − 1) periods. In the UDL example, TEg,t is
the effect, in state g and year t, of having been exposed to a UDL for t − (Fg − 1) years.
Average treatment effect on the treated. Letting $N_1$ denote the number of treated (g, t)
cells, a natural aggregated target parameter is
$$\mathrm{ATT} = \frac{1}{N_1} \sum_{(g,t): D_{g,t}=1} \mathrm{TE}_{g,t},$$
the average effect of having been treated rather than untreated for t − (Fg − 1) periods, across
all treated (g, t) cells. This parameter generalizes the ATT parameter introduced in Chapter 3
to the binary-and-staggered designs we consider in this chapter.
(g, ℓ)-specific effects. For all g such that $F_g \le T$, and for $\ell \in \{1, ..., T - (F_g - 1)\}$, let
$$\mathrm{TE}^r_{g,\ell} = E\left[Y_{g,F_g-1+\ell}(0_{F_g-1}, 1_\ell) - Y_{g,F_g-1+\ell}(0_{F_g-1+\ell})\right].$$
Interpret TErg,ℓ .
TErg,ℓ is the expected effect, in group g and at period Fg − 1 + ℓ, of having been treated rather
than untreated from period Fg to Fg − 1 + ℓ, namely for ℓ periods. TErg,ℓ = TEg,Fg −1+ℓ , so TErg,ℓ
is just a convenient notation to index treatment effects with respect to length of exposure to
treatment rather than calendar time.
Average effect of having been treated for ℓ periods. Let $\overline{T} = \max_g F_g - 1$ denote the
last period when there is still at least one untreated group. If there are never-treated groups,
$\overline{T} = T$. Let $\underline{F} = \min_{g: F_g > 1} F_g$ denote the earliest period when a group adopts the treatment.
For instance, if all groups are untreated at periods 1 and 2 and one group becomes treated at
period 3, $\underline{F} = 3$. For any $\ell \in \{1, ..., \overline{T} - (\underline{F} - 1)\}$, let $G_\ell$ denote the number of groups such that
$F_g - 1 + \ell \le \overline{T}$, meaning that those groups reach their ℓth period of exposure to treatment at a
time period where there is still at least one untreated group. Then, let
$$\mathrm{ATT}_\ell = \frac{1}{G_\ell} \sum_{g: F_g \ge 2,\, F_g - 1 + \ell \le \overline{T}} \mathrm{TE}^r_{g,\ell}.$$
ATTℓ is the average effect of having been exposed to treatment for ℓ periods, across all initially
untreated groups that reach ℓ treatment periods before all groups are treated. We restrict
attention to those groups, because we will not be able to propose unbiased DID estimators of
$\mathrm{TE}^r_{g,\ell}$ for initially treated groups and for groups reaching ℓ treatment periods after $\overline{T}$. ATTℓ
generalizes the parameter ATTℓ defined in Chapter 3, to binary-and-staggered designs with
variation in treatment timing. If no group is treated at period one and there are never-treated
groups, $\overline{T} = T$ and one can show that the ATT is a weighted average of the ATTℓs:
$$\mathrm{ATT} = \sum_\ell \frac{G_\ell}{N_1} \mathrm{ATT}_\ell. \tag{6.1}$$
Yes, because for ℓ ≠ ℓ′, ATTℓ and ATTℓ′ do not apply to the same groups: fewer and fewer
groups reach ℓ treatment periods before $\overline{T}$ as ℓ increases. Thus, variations in ATTℓ across ℓ
can come from treatment effects varying with length of exposure or calendar time, but also from
compositional changes if treatment effects vary across groups. Some of the Stata and R packages
discussed in this chapter have options to estimate event-study effects that all apply to the same
groups, thus avoiding such compositional changes.
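The identity in (6.1) can be checked numerically on a toy design. A sketch with made-up (g, t)-specific effects (three groups, T = 3, adoption at periods 2, 3, and never):

```python
import numpy as np

rng = np.random.default_rng(0)
T = 3
F = {1: 2, 2: 3, 3: T + 1}          # group 3 is never treated

treated = [(g, t) for g in F for t in range(1, T + 1) if t >= F[g]]
TE = {cell: rng.normal() for cell in treated}   # arbitrary (g,t)-specific effects
N1 = len(treated)
ATT = sum(TE.values()) / N1

Tbar = max(F.values()) - 1           # last period with an untreated group (= T here)
Fmin = min(v for v in F.values() if v > 1)

# ATT_l: average effect of l exposure periods, over groups with F_g >= 2
# reaching their l-th treated period no later than Tbar.
lhs = 0.0
for l in range(1, Tbar - (Fmin - 1) + 1):       # here l = 1, 2
    G_l = [g for g in F if F[g] >= 2 and F[g] - 1 + l <= Tbar]
    ATT_l = np.mean([TE[(g, F[g] - 1 + l)] for g in G_l])
    lhs += len(G_l) / N1 * ATT_l

assert np.isclose(lhs, ATT)          # equation (6.1)
```

The check holds because no group is treated at period one and a never-treated group exists, the two conditions under which (6.1) was stated.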
$$W_{g,t} = \frac{\hat{u}_{g,t}}{\sum_{(g',t'): D_{g',t'}=1} \hat{u}_{g',t'}},$$
and one has that $D_{g,.} = \frac{T - (F_g - 1)}{T}$, so
$$\hat{u}_{g,t} = D_{g,t} - \frac{T - (F_g - 1)}{T} - D_{.,t} + D_{.,.},$$
for treated cells. In view of the previous display, which groups are the most likely to be such
that some of the weights Wg,t are negative for some t?
Groups that become treated early. In particular, always-treated groups are such that $\hat{u}_{g,t} = D_{.,.} - D_{.,t}$. As $D_{.,t}$ is weakly increasing in t in Design BST, $D_{.,T} > D_{.,.}$, so if there are always-treated
groups, their treatment effect at the last period is always weighted negatively by βbfe . Then, to
mitigate or eliminate the negative weights, one could drop the always-treated groups from the
estimation sample. As those groups are never observed without treatment, their treatment effect
cannot be estimated under no-anticipation and parallel-trends assumptions. Dropping them from
the estimation sample is thus necessary if one would prefer not imposing other assumptions.
Which time periods are the most likely to be such that some of the weights Wg,t are negative
for some g?
The last time periods of the panel, because D.,t is weakly increasing in t in Design BST. Overall,
the long-run treatment effects of early-treated groups are the most likely to be weighted nega-
tively, something that was first noted by Borusyak and Jaravel (2017). When are all the weights
Wg,t likely to be positive?
All the weights are positive if and only if $\frac{T - (F_g - 1)}{T} + D_{.,t} \le 1 + D_{.,.}$ for all (g, t). Accordingly,
all the weights are likely to be positive when there is no group that is treated most of the time,
all the weights are likely to be positive when there is no group that is treated most of the time,
and no time period where most groups are treated. For instance, if a large proportion of groups
are never treated, it is likely that βbfe estimates a convex combination of effects, thus implying
that βbfe satisfies the no-sign reversal property. Even then, βbfe may still be biased for the ATT.
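The weights can be computed directly from the formula above, in the spirit of what twowayfeweights reports. A minimal sketch on a made-up two-group example where a negative weight appears:

```python
import numpy as np

# Two groups, three periods: group 0 adopts at t=2, group 1 at t=3.
D = np.array([[0, 1, 1],
              [0, 0, 1]], dtype=float)

# u_hat: residual of D on group and period FEs (balanced panel):
# u_{g,t} = D_{g,t} - D_{g,.} - D_{.,t} + D_{.,.}
u = D - D.mean(axis=1, keepdims=True) - D.mean(axis=0, keepdims=True) + D.mean()

treated = D == 1
W = u[treated] / u[treated].sum()    # weights on treated (g,t) cells

# Weights sum to one, but the long-run effect of the early-adopting
# group (cell g=0, t=3) receives a negative weight.
assert np.isclose(W.sum(), 1.0)
assert u[0, 2] < 0
```

Here the weights come out as (1, −1/2, 1/2) on cells (0, 2), (0, 3), (1, 3): the early adopter's last-period effect is weighted negatively, as the discussion above predicts.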
Application to the UDL example. Using the wolfers_didtextbook dataset, run the
static TWFE regression of the divorce rate on state and year FEs and the UDL treatment,
weighting the regression by the state’s population and clustering standard errors at the state
level. According to this regression, do UDLs have an effect on divorces?
6.2 Two-way fixed effects estimators
Under Assumptions NA and PT, βbfe estimates a weighted sum of 522 TEg,t . 490 TEg,t receive
a positive weight, and 32 receive a negative weight. Negative weights are small and sum to
−0.026: βbfe estimates an “almost convex” combination of effects. Yet, weights are strongly and
negatively correlated with length of exposure, meaning that βbfe downweights effects of longer
lengths of exposure. Then, βbfe could differ from the ATT if treatment effects vary with length of
exposure to a UDL. For instance, βbfe would overestimate the ATT if treatment effects decrease
with length of exposure.
Redefining the treatment to make the problem go away? With never-treated groups
and no always-treated group, one could redefine the treatment as $\tilde{D}_{g,t} = 1\{t \ge \min_{g'} F_{g'}\}1\{F_g \le T\}$: cell (g, t) is considered as treated if group g is eventually treated and t is after the first period
when a group becomes treated. D̃g,t is a binary and absorbing treatment, without variation in
treatment timing, so it follows from results in the previous chapter that under Assumptions NA
and PT, the coefficient $\tilde{\beta}_{fe}$ in a TWFE regression of Yg,t on group and period FEs and $\tilde{D}_{g,t}$ yields
an unbiased estimator of $\widetilde{\mathrm{ATT}}$, the ATT of $\tilde{D}_{g,t}$ on the outcome. The issue with this strategy is
that there are (g, t) cells with $\tilde{D}_{g,t} = 1$ that are actually untreated, such that the effect of $\tilde{D}_{g,t}$
on the outcome is actually equal to zero for those cells. Then, letting Ñ1 denote the number
$$\hat{\beta}_{fe} = \sum_{g \neq g',\, t < t'} v_{g,g',t,t'}\, \mathrm{DID}_{g,g',t,t'}, \tag{6.2}$$
where DIDg,g′ ,t,t′ is a DID comparing the outcome evolution of two groups g and g ′ from a pre
period t to a post period t′ , and where vg,g′ ,t,t′ are non-negative weights summing to one, with
vg,g′ ,t,t′ > 0 if and only if g’s treatment changes between t and t′ while g ′ ’s treatment does not
change.1 Given that g’s treatment changes between t and t′ , what is the only possible value
of (Dg,t , Dg,t′ )? Given that g ′ ’s treatment does not change between t and t′ , what are the two
possible values of (Dg′ ,t , Dg′ ,t′ )?
(Dg,t , Dg,t′ ) = (0, 1). (Dg′ ,t , Dg′ ,t′ ) = (0, 0), or (Dg′ ,t , Dg′ ,t′ ) = (1, 1). Thus, some of the DIDg,g′ ,t,t′
in (6.2) compare a group switching from untreated to treated between t and t′ to a group
untreated at both dates, while other DIDg,g′ ,t,t′ compare a switching group to a group treated at
1
Goodman-Bacon (2021) actually decomposes βbfe as a weighted average of DIDs between cohorts of groups
becoming treated at the same date, and between periods of time where their treatment remains constant. One
can then further decompose his decomposition, as we do here.
both dates. The negative weights in (5.1) originate from this second type of DIDs.
with
$$\mathrm{DID}_{e,\ell,1,2} = Y_{e,2} - Y_{e,1} - (Y_{\ell,2} - Y_{\ell,1}), \qquad \mathrm{DID}_{\ell,e,2,3} = Y_{\ell,3} - Y_{\ell,2} - (Y_{e,3} - Y_{e,2}).$$
In words, the econometrician’s TWFE coefficient is just the simple average of DIDe,ℓ,1,2 , which
compares the period-1-to-2 fever evolution of patients e and ℓ, and of DIDℓ,e,2,3 , which compares
the period-2-to-3 fever evolution of patients ℓ and e. To his surprise, the econometrician finds
that βbfe > 0. Is it correct to conclude that antibiotics increase fever, or could something else
explain why βbfe > 0?
Antibiotics are slow-acting drugs, which take a few days to reduce symptoms. Due to that,
the fever of the early-treated patient may drop slightly from period one to two, and sharply
from period two to three, as at period three that patient has been receiving antibiotics for two
periods. Similarly, the fever of the late-treated patient may drop slightly from period two to
three. Thus, one may have that DIDe,ℓ,1,2 takes a small negative value, while DIDℓ,e,2,3 takes
a large positive value, and eventually βbfe is positive. Figure 7.2 below shows patients’ actual
(solid lines) and counterfactual (dashed lines) fever evolutions, in a numerical example where
both patients always benefit from the antibiotics treatment, but βbfe is positive.
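The mechanism can be reproduced with a short script. The fever levels and effect sizes below are made up, but they exhibit the phenomenon: both patients always benefit from antibiotics, yet the TWFE coefficient is positive.

```python
import numpy as np

# Effects of antibiotics (all negative: fever always drops):
TE_e2, TE_e3, TE_l3 = -0.5, -3.0, -0.5   # e: 1 then 2 exposure periods; l: 1 period

# Counterfactual fever is flat at 3 for both patients (parallel trends hold exactly).
Y_e = np.array([3.0, 3.0 + TE_e2, 3.0 + TE_e3])   # early-treated patient
Y_l = np.array([3.0, 3.0, 3.0 + TE_l3])           # late-treated patient

DID_el_12 = (Y_e[1] - Y_e[0]) - (Y_l[1] - Y_l[0])
DID_le_23 = (Y_l[2] - Y_l[1]) - (Y_e[2] - Y_e[1])
beta_fe_avg = 0.5 * DID_el_12 + 0.5 * DID_le_23

# Same number from the actual TWFE regression on the 2x3 panel:
Y = np.concatenate([Y_e, Y_l])
g = np.repeat([0, 1], 3)
t = np.tile([1, 2, 3], 2)
D = np.array([0, 1, 1, 0, 0, 1], dtype=float)
X = np.column_stack([np.ones(6), g == 0, t == 2, t == 3, D])
beta_fe = np.linalg.lstsq(X, Y, rcond=None)[0][-1]

assert np.isclose(beta_fe, beta_fe_avg)
assert beta_fe > 0        # positive, although every treatment effect is negative
```

With these numbers DIDe,ℓ,1,2 = −0.5 and DIDℓ,e,2,3 = 2, so βbfe = 0.75 > 0.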
[Figure 7.2: actual (solid lines) and counterfactual (dashed lines) fever evolutions of the early- and late-treated patients over periods t = 1, 2, 3.]
DIDℓ,e,2,3 , on the other hand, compares the period-2-to-3 outcome evolution of group ℓ, that
switches from untreated to treated from period 2 to 3, to the outcome evolution of group e that
is treated at both dates. Accordingly,
$$E[Y_{e,3} - Y_{e,2}] = E[Y_{e,3}(0, 1_2) - Y_{e,2}(0, 1)] = E[Y_{e,3}(0_3) - Y_{e,2}(0_2)] + \mathrm{TE}_{e,3} - \mathrm{TE}_{e,2}. \tag{6.5}$$
$$E[Y_{\ell,3} - Y_{\ell,2}] = E[Y_{\ell,3}(0_2, 1) - Y_{\ell,2}(0_2)] = E[Y_{\ell,3}(0_3) - Y_{\ell,2}(0_2)] + \mathrm{TE}_{\ell,3}. \tag{6.6}$$
Taking the difference between the two previous equations, and using the fact that $E[Y_{e,3}(0_3) - Y_{e,2}(0_2)]$
and $E[Y_{\ell,3}(0_3) - Y_{\ell,2}(0_2)]$ cancel each other out under the parallel-trends assumption,
$$E[\mathrm{DID}_{\ell,e,2,3}] = \mathrm{TE}_{\ell,3} + \mathrm{TE}_{e,2} - \mathrm{TE}_{e,3}. \tag{6.7}$$
As $E[\mathrm{DID}_{e,\ell,1,2}] = \mathrm{TE}_{e,2}$,
$$E\big[\hat{\beta}_{fe}\big] = \frac{1}{2}\mathrm{TE}_{\ell,3} + \mathrm{TE}_{e,2} - \frac{1}{2}\mathrm{TE}_{e,3}. \tag{6.8}$$
In this simple example, the decomposition of βbfe in Theorem 8 reduces to (6.8). The right-hand
side of (6.8) is a weighted sum of the effect of antibiotics on ℓ's fever at period three,
on e's fever at period two, and on e's fever at period three, with weights summing to one, and
where e's effect at period three is weighted negatively. Intuitively, the negative weight comes
from the fact that e is treated at periods two and three, and DIDℓ,e,2,3 , which uses e as a control
group, subtracts its period-three effect out. If TEℓ,3 , TEe,2 , and TEe,3 are all negative, but
TEe,3 is more than three times larger in absolute value than both TEℓ,3 and TEe,2 , for instance
because antibiotics take time to act, $E\big[\hat{\beta}_{fe}\big] > 0$.
Find an assumption on the treatment effects TEg,t such that under that supplementary assump-
tion, the negative weight in the decomposition of βbfe in (6.8) disappears.
βbfe estimates a convex combination of effects if treatment effects do not change over
time. If one is ready to assume that TEe,3 = TEe,2 , (6.7) simplifies to
$$E[\mathrm{DID}_{\ell,e,2,3}] = \mathrm{TE}_{\ell,3}.$$
Then, the negative weight in (6.7) disappears, and βbfe estimates a weighted average of treatment
effects. This extends beyond this simple example: Theorem S2 of the Web Appendix of
de Chaisemartin and D’Haultfœuille (2020) and Equation (16) of Goodman-Bacon (2021) show
that in staggered adoption designs with a binary treatment, βbfe estimates a convex combina-
tion of effects, if TEg,t does not depend on t. This conclusion, however, no longer holds if the
treatment is not binary or the design is not staggered. Moreover, assuming that TEg,t does not
depend on t is often implausible: this requires that treatment effects do not vary with length
of exposure and with calendar time. The decomposition of βbfe under parallel trends and the
assumption that treatment effects do not change over time in Theorem S2 of the Web Appendix
of de Chaisemartin and D’Haultfœuille (2020) can be computed by the twowayfeweights Stata
command, replacing type(feTR) by type(feS).
Leveraging DIDℓ,e,2,3 to estimate the treatment’s effect could lead to a biased estimator, if treat-
ment effects are heterogeneous. But could there be an argument in favor of leveraging DIDℓ,e,2,3 ,
as βbfe does?
$$Y_{g,t} - Y_{g,t}(0_t) = E\left[Y_{g,t} - Y_{g,t}(0_t)\right] = \beta \tag{6.10}$$
for some real number β. The first equality in the previous display requires that treatment effects
are non-stochastic, while the second one requires that they do not vary across groups and time
periods. Then,
$$Y_{g,t} = Y_{g,t}(0_t) + \beta D_{g,t} = \alpha_g + \gamma_t + \beta D_{g,t} + \varepsilon_{g,t}.$$
The first equality follows from (6.10), and the second from the fact that in Chapter 2, we saw that
Assumption PT is equivalent to
$$Y_{g,t}(0_t) = \alpha_g + \gamma_t + \varepsilon_{g,t}, \quad \text{with } E[\varepsilon_{g,t}] = 0.$$
Thus, under Assumption PT and the constant effect condition in (6.10), Yg,t is generated by a
population version of the TWFE regression in (3.1). Under those three assumptions, DIDe,ℓ,1,2
and DIDℓ,e,2,3 are both unbiased for β: DIDℓ,e,2,3 is not a forbidden comparison anymore. Now,
assume that the errors εg,t are homoscedastic and i.i.d., both across g and t. Those are the
assumptions under which OLS estimators are the best linear unbiased estimators of population
regression coefficients, by the Gauss-Markov theorem. Then, one can show that $V\big(\hat{\beta}_{fe}\big) = 0.75 \times V(\mathrm{DID}_{e,\ell,1,2})$: βbfe has a lower variance than DIDe,ℓ,1,2 , as
predicted by the Gauss-Markov theorem. Thus, the reason why βbfe leverages DIDℓ,e,2,3 instead
of just leveraging DIDe,ℓ,1,2 is that doing so may lead to an unbiased estimator with a lower
variance, if treatment effects are constant and errors are i.i.d. and homoscedastic.
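The 0.75 ratio can be verified exactly: with i.i.d. homoscedastic errors, the variance of any linear combination of the six error terms is proportional to its squared coefficient norm. A sketch:

```python
import numpy as np

# Coefficient vectors on the six errors (eps_{e,1..3}, eps_{l,1..3}).
did_el_12 = np.array([-1, 1, 0,  1, -1, 0])   # DID_{e,l,1,2}
did_le_23 = np.array([ 0, 1, -1, 0, -1, 1])   # DID_{l,e,2,3}
beta_fe   = 0.5 * did_el_12 + 0.5 * did_le_23

# Under i.i.d. homoscedastic errors, V(c'eps) = ||c||^2 * sigma^2,
# so the variance ratio is the ratio of squared norms.
ratio = (beta_fe ** 2).sum() / (did_el_12 ** 2).sum()
assert np.isclose(ratio, 0.75)
```

The averaging helps because the two DIDs are positively correlated through the shared period-two observations, yet pooling them still cuts the variance from 4σ² to 3σ².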
The decomposition in (6.2) cannot be used to assess if βbfe estimates a convex com-
bination of effects. Researchers sometimes use the sum of the weights on switchers-versus-
always-treated DID in (6.2) as a diagnostic of the robustness of βbfe to heterogeneous treatment
effects. We do not recommend this diagnostic, for the following reason. Let us first consider an
example similar to that above, but with a third group n that remains untreated from period 1
to 3. In this second example, (6.2) now indicates that βbfe assigns a weight equal to 1/6 to DIDs
comparing a switcher to a group treated at both periods. On the other hand, all the weights in
the decomposition in Theorem 8 are now positive. This phenomenon can also arise in real data
sets. In the data of Stevenson and Wolfers (2006) used by Goodman-Bacon (2021) in his empir-
ical application, if one restricts the sample to states that are not always treated and to the first
ten years of the panel, all the weights in Theorem 8 are positive, but the sum of the weights in
(6.2) on DIDs comparing a switcher to a group treated at both periods is equal to 0.06. Beyond
these examples, one can show that having DIDs comparing a switcher to a group treated at both
periods in (6.2) is necessary but not sufficient to have negative weights in Theorem 8. Similarly,
the sum of the weights on DIDs comparing a switcher to a group treated at both periods in (6.2)
is always larger than the absolute value of the sum of the negative weights in Theorem 8. (6.2)
“overestimates” the negative weights in Theorem 8, because as soon as there are three distinct
treatment dates, there is not a unique way of decomposing βbfe as a weighted average of DIDs,
and there exist other decompositions than (6.2), putting less weight on DIDs using a group
treated at both periods as the control group.2
Computation: Stata and R commands to compute the weights in (6.2). The bacondecomp
Stata (see Goodman-Bacon et al., 2019) and R (see Flack and Edward, 2020) commands com-
pute the DIDg,g′ ,t,t′ entering in (6.2), the weights assigned to them, as well as the sum of the
weights on DIDg,g′ ,t,t′ using a group treated at both periods as the control group. The syntax of
the bacondecomp Stata command is:
$$= (v_{\ell,e,t_1,t_2} - v)\mathrm{DID}_{\ell,e,t_1,t_2} + v\,\mathrm{DID}_{\ell,n,t_0,t_2} + v\,\mathrm{DID}_{e,\ell,t_0,t_1} + (v_{e,n,t_0,t_2} - v)\mathrm{DID}_{e,n,t_0,t_2}. \tag{6.15}$$
Plugging (6.15) into (6.2) will yield a different decomposition of βbfe as a weighted average of DIDs. But the weight
on DIDs using a group treated at both periods as the control group is equal to $v_{\ell,e,t_1,t_2}$ in the left-hand side of
(6.15), and to $v_{\ell,e,t_1,t_2} - v$ in its right-hand side. Accordingly, this new decomposition puts strictly less weight
than (6.2) on DIDs using a group treated at both periods as the control group.
period three. Assume that the early-treated group is chosen at random: with probability 1/2,
the early-treated group is group 1, and with probability 1/2 the early-treated group is group
2. Thus, e and ℓ are now random variables, with P (e = 1) = P (e = 2) = 1/2, and ℓ = 3 − e.
Consider, as in Athey and Imbens (2022), that potential outcomes are non-stochastic. Then,
TEg,t = Yg,t − Yg,t (0t ), and
Yg,1 (Frison and Pocock, 1992; McKenzie, 2012). Roth and Sant’Anna (2023) build upon those
results, and derive an efficient estimator of the ATT in binary-and-staggered designs with T ≥ 3
and randomly assigned treatment dates. Again, the efficient estimator is not a DID or TWFE
estimator: it is a weighted average of cross-sectional comparisons of treated and controls units
controlling for baseline outcomes.
If there are no always-treated groups, all groups are untreated till period $\underline{F} - 1$. Then, for all
$t \le \underline{F} - 1$, $Y_{g,t} = Y_{g,t}(0_t)$ for all g. Then, the previous display implies that
$$(Y_{g,1}, ..., Y_{g,\underline{F}-1}) \perp\!\!\!\perp F_g, \tag{6.16}$$
an equation that only involves observed variables, and can therefore be tested. To test (6.16),
one can for instance run a pooled regression of Yg,t on adoption-cohort FEs for all $t \le \underline{F} - 1$, or
one can regress Fg on $(Y_{g,1}, ..., Y_{g,\underline{F}-1})$ and run an F-test that all coefficients are equal to zero.
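The second test can be sketched with simulated data. In the made-up data-generating process below, adoption dates depend on the first pre-period outcome, so the test should reject:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
G, P = 200, 3                      # 200 groups, 3 pre-adoption periods
Y_pre = rng.normal(size=(G, P))
F_g = 5 + (Y_pre[:, 0] > 0)        # adoption date depends on Y_{g,1}: not random

# F-test that all slopes are zero in the regression of F_g on (Y_{g,1},...,Y_{g,3}).
X = np.column_stack([np.ones(G), Y_pre])
coef, *_ = np.linalg.lstsq(X, F_g, rcond=None)
ssr_u = ((F_g - X @ coef) ** 2).sum()            # unrestricted SSR
ssr_r = ((F_g - F_g.mean()) ** 2).sum()          # restricted SSR (slopes = 0)
F_stat = ((ssr_r - ssr_u) / P) / (ssr_u / (G - P - 1))
p_value = stats.f.sf(F_stat, P, G - P - 1)

assert p_value < 0.01              # random timing is rejected, as it should be
```

With truly random timing, the p-value would instead be uniformly distributed, and rejections would occur at the nominal rate.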
Application to the UDL example. The data start in 1956. Two states had already passed a
UDL by 1956, but of the remaining 49 states, none passed a UDL before 1969. Thus, we observe
Yg,t (0t ), the divorce rate without a UDL, of 49 states for 13 years. If treatment timing is as good
as randomly assigned, we should have $Y_{g,t}(0_t) \perp\!\!\!\perp F_g$, which we can test by regressing Yg,t on FEs
for each adoption cohort, in the subsample of gs untreated in 1956 and t < 1969. Unfortunately,
many adoption cohorts contain only one state, so running an F-test from a regression of Yg,t on
FEs for each value of Fg is infeasible. Instead, using the wolfers_didtextbook dataset, regress
the divorce rate on FEs for each possible value of the early_late_never variable, which groups
together states adopting before the median adoption year, those adopting after that year, and
the never adopters. Weight the regression by stpop1968, and cluster standard errors at the state
level. Can you reject the null that those FEs do not predict the divorce rate? Then, is it the
case that treatment timing is as good as randomly assigned?
In Design BST, to estimate dynamic effects and test the no-anticipation and parallel-trends
assumptions, researchers have often estimated the following TWFE ES regression, which is
similar to that in (3.6) but accommodates groups’ heterogeneous treatment dates:
$$Y_{g,t} = \sum_{g'=1}^{G} \hat{\alpha}_{g'} 1\{g' = g\} + \sum_{t'=1}^{T} \hat{\gamma}_{t'} 1\{t' = t\} + \sum_{\substack{\ell=-K \\ \ell \neq 0}}^{L} \hat{\beta}^{fe}_{\ell} 1\{t = F_g - 1 + \ell\} + \hat{\epsilon}_{g,t}. \tag{6.17}$$
In words, the outcome is regressed on group and period FEs, and relative-time indicators 1{t =
Fg − 1 + ℓ} equal to 1 if at t, group g has been exposed to treatment for ℓ periods. For ℓ ≥ 1,
βbℓfe is supposed to estimate the cumulative effect of ℓ periods of exposure to treatment. For
ℓ ≤ −1, βbℓfe is supposed to be a placebo coefficient testing the parallel-trends assumption, by
comparing the outcome trends of groups that will and will not start receiving the treatment in
|ℓ − 1| periods. Researchers have sometimes estimated a variant of this regression, where the
first and last indicators 1{t = Fg − 1 − K} and 1{t = Fg − 1 + L} are respectively replaced by an
indicator for being at least K periods away from the period before adoption (1{t ≤ Fg − 1 − K})
and an indicator for having been treated for at least L periods (1{t ≥ Fg − 1 + L}). Such
endpoint binning is for instance recommended by Schmidheiny and Siegloch (2023): without it,
the regression implicitly assumes that the treatment no longer has any effect after L periods.
Instead, with endpoint binning the regression assumes that the treatment effect is constant after
L periods, a more plausible assumption.
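A sketch of how the binned relative-time indicators can be built and the regression run. The group structure, effect sizes, and the convention that never-treated units get all-zero indicators are assumptions of this toy example; with homogeneous effects that are genuinely constant after L periods, the binned regression recovers them exactly:

```python
import numpy as np

T, K, L = 6, 2, 2
F = {0: 2, 1: 4, 2: None}          # adoption dates; group 2 never adopts
alpha = {0: 0.0, 1: 1.0, 2: 2.0}
b1, bL = 0.5, 0.8                  # effect of 1 period, and of >= L periods, of exposure

rows, Y = [], []
for g in F:
    for t in range(1, T + 1):
        rel = None if F[g] is None else t - (F[g] - 1)
        pre_bin  = int(rel is not None and rel <= -K)   # 1{t <= F_g - 1 - K}
        m1       = int(rel == -1)
        p1       = int(rel == 1)
        post_bin = int(rel is not None and rel >= L)    # 1{t >= F_g - 1 + L}
        rows.append([1, g == 1, g == 2] + [t == s for s in range(2, T + 1)]
                    + [pre_bin, m1, p1, post_bin])
        # DGP: no anticipation, parallel trends, effect constant after L periods.
        Y.append(alpha[g] + 0.1 * t + b1 * p1 + bL * post_bin)

X = np.array(rows, dtype=float)
coef, *_ = np.linalg.lstsq(X, np.array(Y), rcond=None)
pre_hat, m1_hat, p1_hat, postL_hat = coef[-4:]

assert np.isclose(p1_hat, b1) and np.isclose(postL_hat, bL)
assert abs(pre_hat) < 1e-8 and abs(m1_hat) < 1e-8   # placebos are zero
```

If the true effect kept growing after L periods, the binned coefficients would instead mix effects of different exposure lengths.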
Wolfers (2006). The event-study regression in Figure 6.2 and that in Wolfers (2006) differ on
two dimensions: Wolfers (2006) does not include any placebo indicator for pre-adoption periods,
and he includes post-adoption indicators for bins of two years (one indicator for the adoption
year and the year after that, one indicator for the two following years, etc.). Results seem robust
to those specification choices.
[Figure 6.2: TWFE event-study estimates of the effect of UDLs on the divorce rate. Each panel plots the estimated effect against relative time to the year before the law (from −9 to 15). Standard errors are clustered at the state level. 95% confidence intervals are shown in red.]

$$\overline{K} = \max_{g: F_g \le T,\, F_g > 1} F_g - 1 - 1, \qquad \overline{L} = T - \min_{g: F_g > 1} F_g + 1.$$

$\max_{g: F_g \le T,\, F_g > 1} F_g - 1$ is the largest number of time periods over which a group untreated at the
start of the panel is observed prior to getting treated. Thus, $\overline{K}$ is the largest number of pre-trends
coefficients one can hope to estimate. Similarly, $\overline{L}$ is the largest number of time periods over
which a group untreated at the start of the panel is observed after getting treated. Thus, $\overline{L}$ is the
largest exposure to treatment whose effect one can estimate. The TWFE ES regression with
$K = \overline{K}$ and $L = \overline{L}$ is called the "fully-dynamic" regression. Below, we restate Proposition 1 in
Borusyak et al. (2024).
Proposition 1 If all groups are eventually treated and $K = \overline{K}$ and $L = \overline{L}$, the regressors
in (6.17) are perfectly collinear: if $(\beta_\ell)_{\ell=-\overline{K},...,\overline{L},\,\ell \neq 0}$ solves the OLS problem, so does $(\beta_\ell + \kappa\ell)_{\ell=-\overline{K},...,\overline{L},\,\ell \neq 0}$, for any $\kappa \in \mathbb{R}$.
Hence, the fully-dynamic TWFE ES specification requires never-treated groups. Otherwise some
pre-trends coefficients should be removed, or dynamic effects should be restricted, for instance
by binning at endpoints.
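Proposition 1 can be verified numerically: with all groups eventually treated and the fully-dynamic specification, the design matrix is rank-deficient, and the direction of indeterminacy is exactly the one in the proposition (adding κℓ to each βℓ, absorbed by the group and time FEs). A sketch on a made-up three-cohort panel:

```python
import numpy as np

T = 5
F = [2, 3, 4]                      # all groups eventually treated
K, L = max(F) - 2, T - min(F) + 1  # fully dynamic: K = 2, L = 4
rels = [l for l in range(-K, L + 1) if l != 0]

rows = []
for g, Fg in enumerate(F):
    for t in range(1, T + 1):
        rel = t + 1 - Fg
        rows.append([1, g == 1, g == 2] + [t == s for s in range(2, T + 1)]
                    + [rel == l for l in rels])
X = np.array(rows, dtype=float)

# The regressors are perfectly collinear:
assert np.linalg.matrix_rank(X) < X.shape[1]

# Null direction: beta_l -> beta_l + l, offset by group FE +(F_g - 2)
# and time FE -(t - 1)  (base group has F = 2, base period is t = 1).
z = np.array([0, F[1] - 2, F[2] - 2] + [-(s - 1) for s in range(2, T + 1)] + rels,
             dtype=float)
assert np.allclose(X @ z, 0)
```

The null vector mirrors the proof: the relative-time indicators weighted by ℓ add up to (t + 1) − Fg, which the time and group fixed effects absorb.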
$$\sum_{\substack{\ell=-\overline{K} \\ \ell \neq 0}}^{\overline{L}} 1\{F_g = t - \ell + 1\}\,\ell = \sum_{\ell=-\overline{K}}^{\overline{L}} 1\{F_g = t - \ell + 1\}\,\ell = (t + 1 - F_g)\sum_{\ell=-\overline{K}}^{\overline{L}} 1\{F_g = t - \ell + 1\} = \underbrace{t + 1}_{\text{enters in time FE}}\;\underbrace{-\,F_g}_{\text{enters in group FE}}.$$
QED.
TWFE ES regressions are not robust to heterogeneous effects, and may suffer from
a contamination bias. The result below essentially follows from Proposition 3 in Sun and
Abraham (2021):3
3
(6.18) follows from Proposition 3 in Sun and Abraham (2021), assuming no binning. A slight difference is
that the decomposition in Sun and Abraham (2021) gathers groups that started receiving the treatment at the
same period into cohorts. Their decomposition can be further decomposed, as in Theorem 9.
Theorem 9 In Design BST, under Assumptions NA and PT, and if $L = \overline{L}$ and there are
never-treated groups, then for $\ell \in \{1, ..., \overline{L}\}$,
$$E\big[\hat{\beta}^{fe}_{\ell}\big] = \sum_{g: F_g - 1 + \ell \le T} W^{\ell}_{g,\ell}\, \mathrm{TE}^r_{g,\ell} + \sum_{\substack{\ell'=1 \\ \ell' \neq \ell}}^{\overline{L}} \sum_{g: F_g - 1 + \ell' \le T} W^{\ell}_{g,\ell'}\, \mathrm{TE}^r_{g,\ell'}, \tag{6.18}$$
where $W^{\ell}_{g,\ell}$ and $W^{\ell}_{g,\ell'}$ are weights such that $\sum_{g: F_g - 1 + \ell \le T} W^{\ell}_{g,\ell} = 1$ and $\sum_{g: F_g - 1 + \ell' \le T} W^{\ell}_{g,\ell'} = 0$ for
every $\ell' \in \{1, ..., \overline{L}\}$, $\ell' \neq \ell$.
The first summation in the right-hand side of (6.18) is a weighted sum across groups of the
cumulative effect of ℓ treatment periods, with weights summing to 1 but that may be negative.
This first summation resembles that in the decomposition of βbfe in Theorem 8, and it implies
that βbℓfe may be biased for ATTℓ if the cumulative effect of ℓ treatment periods varies across
groups. Interpret the second summation in the right-hand side of (6.18).
This second summation is a weighted sum, across ℓ′ ̸= ℓ and groups, of the cumulative effect
of ℓ′ treatment periods in group g, with weights summing to 0. This second summation was
not present in the decomposition of βbfe . In view of this second summation, is βbℓfe estimating the
cumulative effect of ℓ treatment periods, or is this coefficient contaminated by other effects?
The presence of this second summation implies that βbℓfe , which is supposed to estimate the cu-
mulative effect of ℓ treatment periods, may in fact be contaminated by the effects of ℓ′ treatment
periods. As $\sum_{g: F_g - 1 + \ell' \le T} W^{\ell}_{g,\ell'} = 0$ for every $\ell'$, this second summation disappears if $\mathrm{TE}^r_{g,\ell'}$ does
not vary across groups, but this is often an implausible assumption: this rules out treatment
effect heterogeneity across groups, but also over time as groups reach their ℓth treatment period
at different points in time. Importantly, Sun and Abraham (2021) show that the negative result
in Theorem 9 is not specific to the fully-dynamic TWFE ES specification: similar negative re-
sults also apply to less flexible ES regressions. This negative result is also not specific to TWFE
ES regressions with never-treated groups: similar negative results also apply to regressions with
no never-treated groups.
where $\eta^{\ell}_{g,t}$ is the residual of the regression of $1\{t = F_g - 1 + \ell\}$ on group and time fixed effects and the indicators $(1\{t = F_g - 1 + \ell'\})_{-\overline{K} \le \ell' \le \overline{L},\, \ell' \notin \{0,\ell\}}$. Besides, from (6.11) and (6.13), we
obtain
$$E[Y_{g,t}] = E[Y_{g,t}(0_t)] + \sum_{\ell'=1}^{\overline{L}} 1\{t = F_g - 1 + \ell'\}\, \mathrm{TE}^r_{g,\ell'}.$$
Hence,
$$E\bigg[\sum_{g,t} \eta^{\ell}_{g,t} Y_{g,t}\bigg] = \sum_{g: F_g - 1 + \ell \le T} \eta^{\ell}_{g,F_g-1+\ell}\, \mathrm{TE}^r_{g,\ell} + \sum_{\substack{\ell'=1 \\ \ell' \neq \ell}}^{\overline{L}} \sum_{g: F_g - 1 + \ell' \le T} \eta^{\ell}_{g,F_g-1+\ell'}\, \mathrm{TE}^r_{g,\ell'}. \tag{6.20}$$
Let $W^{\ell}_{g,\ell'} = \eta^{\ell}_{g,F_g-1+\ell'} \Big/ \sum_{g: F_g - 1 + \ell \le T} \eta^{\ell}_{g,F_g-1+\ell}$ for $(\ell, \ell') \in \{1, ..., \overline{L}\}^2$. Then, by (6.19) and (6.20),
$$E\big[\hat{\beta}^{fe}_{\ell}\big] = \sum_{g: F_g - 1 + \ell \le T} W^{\ell}_{g,\ell}\, \mathrm{TE}^r_{g,\ell} + \sum_{\ell' \neq \ell,\, \ell' > 0}\; \sum_{g: F_g - 1 + \ell' \le T} W^{\ell}_{g,\ell'}\, \mathrm{TE}^r_{g,\ell'}.$$
By definition of $W^{\ell}_{g,\ell}$, $\sum_g W^{\ell}_{g,\ell} = 1$. Moreover, by construction of the residual $\eta^{\ell}_{g,t}$,
$$\sum_{g: F_g - 1 + \ell' \le T} \eta^{\ell}_{g,F_g-1+\ell'} = \sum_{g,t} \eta^{\ell}_{g,t}\, 1\{t = F_g - 1 + \ell'\} = 0.$$
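The construction in the proof above can be checked numerically: η is the residual of the relative-time indicator of interest on all other regressors, and the zero-sum property of the contamination weights is just the OLS orthogonality of that residual. A sketch on a made-up design (three groups, T = 4; the convention that the never-treated group gets all-zero indicators is an assumption of this sketch):

```python
import numpy as np

T = 4
F = {0: 2, 1: 3, 2: None}           # group 2 never treated
cells = [(g, t) for g in F for t in range(1, T + 1)]
rel = {(g, t): (None if F[g] is None else t - (F[g] - 1)) for g, t in cells}

def dummy(l):
    return np.array([rel[c] == l for c in cells], dtype=float)

d1 = dummy(1)                       # indicator of interest: first treated period
others = np.column_stack(
    [np.ones(len(cells))]
    + [np.array([c[0] == g for c in cells], float) for g in (0, 1)]     # group FEs
    + [np.array([c[1] == t for c in cells], float) for t in (2, 3, 4)]  # time FEs
    + [dummy(l) for l in (-1, 2, 3)])  # all other relative-time indicators

coef, *_ = np.linalg.lstsq(others, d1, rcond=None)
eta = d1 - others @ coef            # eta^1_{g,t}

denom = eta[d1 == 1].sum()          # equals ||eta||^2 > 0
W_own = eta[d1 == 1] / denom        # weights on TE^r_{g,1}
W_l2  = eta[dummy(2) == 1] / denom  # contamination weights on TE^r_{g,2}

assert denom > 0
assert np.isclose(W_own.sum(), 1.0)
assert np.isclose(W_l2.sum(), 0.0)  # contamination weights sum to zero
```

The zero-sum contamination weights are harmless only if TEʳ does not vary across groups; otherwise they load heterogeneous effects of other exposure lengths onto the coefficient.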
unlike the static regression, the TWFE ES regression does not have one but several treatment
variables: all the relative-time indicators. To decompose, say, βb1fe , one needs to input the first
relative-time indicator as the treatment variable, while all the other relative-time indicators
need to be inputted in the other_treatments option. If the TWFE ES regression has relative-time
indicators for time periods before treatment adoption, those need to be inputted in the
controls option.
$$\hat{\beta}^{fe}_1 = \frac{1}{2}\mathrm{DID}_{e,\ell,1,2} + \frac{1}{2}\mathrm{DID}_{\ell,e,2,4}. \tag{6.21}$$
Let us momentarily assume that the effect of being exposed to treatment for three periods is
the same as the effect of being exposed for two periods (TErg,3 = TErg,2 ), as is implicitly assumed
by the endpoint binning in the regression. From period two to four, group ℓ goes from zero to
two periods of exposure to treatment, while group e goes from one to three periods of exposure.
Then, one can show that
$$E(\mathrm{DID}_{\ell,e,2,4}) = \mathrm{TE}^r_{\ell,2} - (\mathrm{TE}^r_{e,3} - \mathrm{TE}^r_{e,1}) = \mathrm{TE}^r_{e,1} + \mathrm{TE}^r_{\ell,2} - \mathrm{TE}^r_{e,2}, \tag{6.22}$$
where the second equality follows from our assumption that TErg,3 = TErg,2 . If effects of two
periods of exposure do not vary across groups, the previous display reduces to
$$E(\mathrm{DID}_{\ell,e,2,4}) = \mathrm{TE}^r_{e,1}.$$
Thus, DIDℓ,e,2,4 is a valid estimator of the effect of one period of exposure to treatment under
the assumptions implicitly made by the TWFE ES regression. Then, assume that the errors in
the population version of the TWFE ES regression are homoscedastic and i.i.d., both across g
and t. Then one can show that $V\big(\hat{\beta}^{fe}_1\big) = 0.75 \times V(\mathrm{DID}_{e,\ell,1,2})$, as predicted by the Gauss-Markov
theorem. Thus, the reason why βb1fe leverages DIDℓ,e,2,4 instead of just leveraging DIDe,ℓ,1,2 is that
doing so may lead to an unbiased estimator with a lower variance. But leveraging DIDℓ,e,2,4
might lead to a bias if being exposed to treatment for three periods does not have the same
effect as being exposed for two periods, or if treatment effects vary across groups.
DIDℓ,o,3,4 − DIDℓ,e,2,3 ,
the difference between a DID comparing the late and on-time groups from period three to four,
and a DID comparing the late and early groups from period two to three. From period three to
four, group ℓ goes from zero to one period of exposure to treatment, while group o goes from
one to two periods of exposure. Similarly, ℓ is untreated at periods two and three, while e goes
from one to two periods of exposure. Then, one can show that under Assumptions NA and PT,
DIDℓ,o,3,4 − DIDℓ,e,2,3 is unbiased for
$$\mathrm{TE}^r_{\ell,1} + \mathrm{TE}^r_{o,1} - \mathrm{TE}^r_{e,1} + \mathrm{TE}^r_{e,2} - \mathrm{TE}^r_{o,2}.$$
If effects of one and two periods of exposure do not vary across groups, the previous display
reduces to TErℓ,1 . Again, βb1fe leverages a comparison that is valid if treatment effects are homogeneous
between groups, but that leads it to be contaminated by effects of having been exposed
to treatment for two periods if effects are heterogeneous.
Intuition for the contamination of pre-trend estimators. Consider a simple design with
G = 3, T = 3, an early-treated group that gets treated at t = 2, a late-treated group that gets
treated at t = 3, and a never-treated group. Assume that one estimates a TWFE ES regression
with $K = \overline{K} = 1$ and $L = \overline{L} = 2$, the fully-dynamic specification in this example. Then, one can show that
$$\hat{\beta}_{-1} = \mathrm{DID}_{n,\ell,1,2} + \frac{2}{5}\left[\mathrm{DID}_{e,n,1,2} - \mathrm{DID}_{\ell,n,2,3}\right]$$
$$= Y_{n,2}(0_2) - Y_{n,1}(0) - \big(Y_{\ell,2}(0_2) - Y_{\ell,1}(0)\big) + \frac{2}{5}\Big[Y_{e,2}(0,1) - Y_{e,1}(0) - \big(Y_{n,2}(0_2) - Y_{n,1}(0)\big) - \Big(Y_{\ell,3}(0_2,1) - Y_{\ell,2}(0_2) - \big(Y_{n,3}(0_3) - Y_{n,2}(0_2)\big)\Big)\Big].$$
Then, under Assumptions NA and PT,
$$E\big[\hat{\beta}_{-1}\big] = \frac{2}{5}\left[\mathrm{TE}^r_{e,1} - \mathrm{TE}^r_{\ell,1}\right]:$$
$\hat{\beta}_{-1}$ is contaminated by a weighted sum of the effects of one period of exposure to treatment in the early and late treated groups, with weights that sum to zero. This contamination term disappears if $TE^r_{e,1} = TE^r_{\ell,1}$, as implicitly assumed by the TWFE ES regression. If the population TWFE ES regression is correctly specified, and if its errors are homoscedastic and i.i.d. across $g$ and $t$, $\text{DID}_{n,\ell,1,2}$ and $\hat{\beta}_{-1}$ are both unbiased for $\beta_{-1}$, the coefficient comparing the outcome evolutions of groups that will be treated in one period and groups that will not, and $V(\hat{\beta}_{-1}) = 0.6 \times V(\text{DID}_{n,\ell,1,2})$. Thus, leveraging $\frac{2}{5}\left[\text{DID}_{e,n,1,2} - \text{DID}_{\ell,n,2,3}\right]$ instead of just leveraging $\text{DID}_{n,\ell,1,2}$, the "natural" pre-trends estimator in this example, may lead to an unbiased estimator with a lower variance. But if the TWFE ES regression is misspecified because of heterogeneous treatment effects, the expectation of $\frac{2}{5}\left[\text{DID}_{e,n,1,2} - \text{DID}_{\ell,n,2,3}\right]$ is no longer equal to 0, so $\hat{\beta}_{-1}$ is no longer unbiased for the differential trend of groups that will be treated in one period and groups that will not.
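This contamination can be checked numerically. The sketch below (hypothetical numbers, no noise, parallel trends holding exactly) runs the fully-dynamic TWFE ES regression on the $G = 3$, $T = 3$ design above and recovers $\hat{\beta}_{-1} = \frac{2}{5}(TE^r_{e,1} - TE^r_{\ell,1})$:

```python
import numpy as np

# G = 3 groups: early (treated at t=2), late (treated at t=3), never treated.
# Noiseless data satisfying parallel trends, with heterogeneous effects.
TE_e1, TE_e2, TE_l1 = 1.0, 2.0, 3.0  # hypothetical treatment effects

X, y = [], []
for gi, g in enumerate(["e", "l", "n"]):
    F = {"e": 2, "l": 3, "n": None}[g]  # first treatment period
    for t in (1, 2, 3):
        out = 10.0 * gi + float(t)  # arbitrary additive group and time structure
        if g == "e" and t == 2: out += TE_e1
        if g == "e" and t == 3: out += TE_e2
        if g == "l" and t == 3: out += TE_l1
        rel = None if F is None else t - (F - 1)  # relative time; rel = 0 is baseline
        row = [1.0]
        row += [1.0 if g == h else 0.0 for h in ("e", "l")]    # group FEs
        row += [1.0 if t == s else 0.0 for s in (2, 3)]        # period FEs
        row += [1.0 if rel == r else 0.0 for r in (-1, 1, 2)]  # event-study dummies
        X.append(row); y.append(out)

coef = np.linalg.lstsq(np.array(X), np.array(y), rcond=None)[0]
print(round(coef[5], 6))  # beta_{-1} = (2/5) * (TE_e1 - TE_l1) = -0.8
```

Setting `TE_e1 = TE_l1` makes the pre-trend coefficient exactly zero, even though nothing about the data-generating process changes otherwise.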
Use twowayfeweights to decompose $\hat{\beta}^{fe}_1$, and compute the correlation between the weights and the year variable.⁴ Interpret the results.

⁴ We use the twowayfeweights Stata command, because it has an option to compute the correlation between the weights and other variables.
⁵ Two of the 29 states that pass a UDL have missing divorce rates the year when they pass the law, which is why effects in 27 rather than 29 states enter the decomposition.
Intuitively, researchers hope that $\hat{\beta}^{lp}_\ell$ estimates the average effect of being treated at period $t$ on groups' period-$(t-1+\ell)$ outcome, thus allowing them to use $\ell \mapsto \hat{\beta}^{lp}_\ell$ to estimate $\ell \mapsto ATT_\ell$.
Theorem 10 In Design BST, if $D_{g,1} = 0$ for all $g$, then under Assumptions NA and PT, $\forall \ell \in \{1, ..., T-1\}$ such that $\hat{\beta}^{lp}_\ell$ is well defined,
$$E\left[\hat{\beta}^{lp}_\ell\right] = \sum_{\ell'=1}^{T-(F-1)} \sum_{g:\,\ell \le F_g - 1 + \ell' \le T} W^{lp,\ell}_{g,\ell'}\, TE^r_{g,\ell'},$$
with, for $\ell \in \{2, ..., F\}$,
$$\sum_{\ell'=1}^{T-F+1} \sum_{g:\,\ell \le F_g - 1 + \ell' \le T} W^{lp,\ell}_{g,\ell'} < 1.$$
The proof, omitted here, can be found in de Chaisemartin and D'Haultfœuille (forthc.). Theorem 10 shows that $\hat{\beta}^{lp}_\ell$ estimates a linear combination, across $\ell'$ and $g$, of the effects of $\ell'$ periods of exposure to treatment in group $g$. Accordingly, $\hat{\beta}^{lp}_\ell$ does not estimate an average across groups of the effect of $\ell$ periods of exposure to treatment: like the TWFE ES coefficient $\hat{\beta}^{fe}_\ell$, $\hat{\beta}^{lp}_\ell$ is contaminated by effects of other lengths of exposure. A further issue is that some of the weights may be negative, and for $\ell \ge 2$, some weights are always negative. A last and perhaps even more concerning issue is that for $\ell \in \{2, ..., F\}$, the weights sum to strictly less than
one. This implies that even if there is a, say, positive real number $\delta$ such that $TE^r_{g,\ell'} = \delta$ for all $g$ and $\ell'$, meaning that the treatment effect does not vary with length of exposure or across groups, $E[\hat{\beta}^{lp}_\ell] < \delta$: the local-projection regression is downward biased. This is because the regression is misspecified: it considers groups with $D_{g,t} = 0$ as untreated, whereas some of them may actually have become treated at some point between $t + 1$ and $t - 1 + \ell$. In their empirical application, de Chaisemartin and D'Haultfœuille (forthc.) exhibit an example where some local-projection regression coefficients estimate weighted sums of effects where the sum of the weights is negative. Then, even if the treatment effect is constant across length of exposure and groups, $E[\hat{\beta}^{lp}_\ell]$ could be of a different sign than the treatment effect.
In Design BST, groups can be aggregated into cohorts that start receiving the treatment at the same period. Let $C = \{c \in \{2, ..., T\} : \exists g : F_g = c\}$ denote the set of dates at which at least one group adopts the treatment. $C$ is the set of all adoption cohorts. $F = \min C$ denotes the earliest adoption cohort. For all $c \in C$ and $t \in \{1, ..., T\}$, let $\overline{Y}_{c,t}$ denote the average outcome at period $t$ across groups belonging to cohort $c$. At period $T$, cohort $c$ has been treated for $T - (c-1)$ periods. Callaway and Sant'Anna (2021) and Sun and Abraham (2021) define a first set of parameters of interest as
$$TE^r_{c,\ell} = E\left[\overline{Y}_{c,c-1+\ell}(0_{c-1}, 1_\ell) - \overline{Y}_{c,c-1+\ell}(0_{c-1+\ell})\right],$$
for all $c \in C$ and $\ell \in \{1, ..., T - (c-1)\}$. $TE^r_{c,\ell}$ are cohort-specific event-study effects. Letting $G_c$ denote the number of groups in adoption-cohort $c$, we have that for all $\ell \le T - (F-1)$,
$$ATT_\ell = \sum_{c:\,c-1+\ell \le T} \frac{G_c}{G_\ell}\, TE^r_{c,\ell}. \tag{6.24}$$
6.3.2.1 Estimators
Unbiased estimator of $TE^r_{c,\ell}$. Let $\overline{Y}_{n,t}$ denote the average outcome at period $t$ across groups that remain untreated from period 1 to $T$, hereafter referred to as the never-treated groups, assuming for now that such groups exist. To estimate $TE^r_{c,\ell}$, Callaway and Sant'Anna (2021) and Sun and Abraham (2021) propose
$$\widehat{TE}^{cs,sa}_{c,\ell} = \overline{Y}_{c,c-1+\ell} - \overline{Y}_{c,c-1} - \left(\overline{Y}_{n,c-1+\ell} - \overline{Y}_{n,c-1}\right),$$
a DID estimator comparing the period-$(c-1)$-to-$(c-1+\ell)$ outcome evolution in cohort $c$ and in the never-treated groups $n$.
Theorem 11 In Design BST, under Assumptions NA and PT, for $c \in C$ and $\ell \in \{1, ..., T - (c-1)\}$,
$$E\left[\widehat{TE}^{cs,sa}_{c,\ell}\right] = TE^r_{c,\ell}. \tag{6.25}$$
Thus, $\widehat{TE}^{cs,sa}_{c,\ell}$ is unbiased for $TE^r_{c,\ell}$ under Assumptions NA and PT alone, even if the treatment effect is heterogeneous across groups or over time. Intuitively, why is it that unlike $\hat{\beta}^{fe}$, $\widehat{TE}^{cs,sa}_{c,\ell}$ is robust to heterogeneous treatment effects?
We saw that $\hat{\beta}^{fe}$ is not robust to heterogeneous treatment effects, because it leverages DIDs comparing a group going from untreated to treated to a group treated at both periods. $\widehat{TE}^{cs,sa}_{c,\ell}$ does not leverage such comparisons, as it compares groups going from untreated to treated to groups untreated at both periods.
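The estimator $\widehat{TE}^{cs,sa}_{c,\ell}$ and the aggregation in (6.24) can be sketched in a few lines (hypothetical cohort averages and cohort sizes; noiseless data with a common trend):

```python
def te_cs_sa(Ybar_c, Ybar_n, c, l):
    # DID comparing the period-(c-1)-to-(c-1+l) outcome evolution of cohort c
    # (Ybar_c: period -> average outcome) and of the never-treated (Ybar_n).
    return (Ybar_c[c - 1 + l] - Ybar_c[c - 1]) - (Ybar_n[c - 1 + l] - Ybar_n[c - 1])

def att_l(cohorts, Ybar_n, G, l, T):
    # Aggregates te_cs_sa across cohorts reaching l periods of exposure by T,
    # weighting each cohort c by its number of groups G[c], as in (6.24).
    elig = [c for c in cohorts if c - 1 + l <= T]
    G_l = sum(G[c] for c in elig)
    return sum(G[c] / G_l * te_cs_sa(cohorts[c], Ybar_n, c, l) for c in elig)

# Hypothetical noiseless example: common trend, cohort-specific effects.
Ybar_n = {1: 1.0, 2: 2.0, 3: 3.0, 4: 4.0}
cohorts = {2: {1: 1.0, 2: 2.5, 3: 3.5, 4: 4.5},   # adopts at t=2, effect 0.5
           3: {1: 1.0, 2: 2.0, 3: 4.0, 4: 5.0}}   # adopts at t=3, effect 1.0
G = {2: 1, 3: 3}
print(te_cs_sa(cohorts[2], Ybar_n, 2, 1))  # 0.5
print(att_l(cohorts, Ybar_n, G, 1, 4))     # (1*0.5 + 3*1.0) / 4 = 0.875
```

Because each cohort is compared only to never-treated groups, no restriction on effect heterogeneity across cohorts is needed.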
Unbiased estimator of $ATT_\ell$ and ATT. It directly follows from (6.24) and Theorem 11 that
$$\widehat{ATT}^{cs,sa}_\ell := \sum_{c:\,c-1+\ell \le T} \frac{G_c}{G_\ell}\, \widehat{TE}^{cs,sa}_{c,\ell}$$
is unbiased for $ATT_\ell$. Then, it follows from (6.1) and Theorem 11 that if no group is treated at period one and there are never-treated groups,
$$\widehat{ATT}^{cs,sa} := \sum_\ell \frac{G_\ell}{N_1}\, \widehat{ATT}^{cs,sa}_\ell$$
is unbiased for the ATT.
Proof of Theorem 11.
\begin{align*}
E\left[\widehat{TE}^{cs,sa}_{c,\ell}\right] &= E\left[\overline{Y}_{c,c-1+\ell}(0_{c-1}, 1_\ell) - \overline{Y}_{c,c-1}(0_{c-1}) - \left(\overline{Y}_{n,c-1+\ell}(0_{c-1+\ell}) - \overline{Y}_{n,c-1}(0_{c-1})\right)\right]\\
&= E\left[\overline{Y}_{c,c-1+\ell}(0_{c-1}, 1_\ell) - \overline{Y}_{c,c-1+\ell}(0_{c-1+\ell})\right]\\
&\quad + E\left[\overline{Y}_{c,c-1+\ell}(0_{c-1+\ell}) - \overline{Y}_{c,c-1}(0_{c-1}) - \left(\overline{Y}_{n,c-1+\ell}(0_{c-1+\ell}) - \overline{Y}_{n,c-1}(0_{c-1})\right)\right]\\
&= TE^r_{c,\ell}.
\end{align*}
The second equality follows from adding and subtracting $\overline{Y}_{c,c-1+\ell}(0_{c-1+\ell})$. The third equality follows from Assumption PT. QED
6.3.2.2 Extensions
Pre-trend tests of Assumptions NA and PT. For $c \in \{3, ..., T\}$ and $\ell \in \{1, ..., c-2\}$, let
$$\widehat{TE}^{cs,sa}_{c,-\ell} = \overline{Y}_{c,c-1-\ell} - \overline{Y}_{c,c-1} - \left(\overline{Y}_{n,c-1-\ell} - \overline{Y}_{n,c-1}\right)$$
be a placebo DID estimator comparing the outcome evolution in cohort $c$ and in the never-treated groups, from period $c-1$ to $c-1-\ell$, namely over $\ell$ periods before cohort $c$ got treated. $\widehat{TE}^{cs,sa}_{c,-\ell}$ exactly mimics $\widehat{TE}^{cs,sa}_{c,\ell}$, the estimator of the effect of having been exposed to treatment for $\ell$ periods in cohort $c$. To mimic $\widehat{ATT}^{cs,sa}_\ell$, one may then use
$$\widehat{ATT}^{cs,sa}_{-\ell} := \sum_{c:\,c-1-\ell \ge 1} \frac{G_c}{G^{pl}_{-\ell}}\, \widehat{TE}^{cs,sa}_{c,-\ell},$$
where $G^{pl}_{-\ell} = \sum_{c:\,c-1-\ell \ge 1} G_c$, for any $\ell \ge 1$ such that $G^{pl}_{-\ell} > 0$. Even if $\widehat{ATT}^{cs,sa}_{-\ell}$ perfectly mimics $\widehat{ATT}^{cs,sa}_\ell$, two differences are worth mentioning. First, groups such that $c-1+\ell > T$ are included in $\widehat{ATT}^{cs,sa}_{-\ell}$ but not in $\widehat{ATT}^{cs,sa}_\ell$, because they have not yet reached $\ell$ periods of exposure to treatment at period $T$. Second, groups such that $c-1-\ell < 1$ are included in $\widehat{ATT}^{cs,sa}_\ell$ but not in $\widehat{ATT}^{cs,sa}_{-\ell}$, because we do not observe their outcome evolution over $\ell$ periods before they adopt the treatment. One can show that under Assumptions NA and PT, for $\ell \ge 1$, $E\left[\widehat{ATT}^{cs,sa}_{-\ell}\right] = 0$, so one can reject Assumptions NA and PT if $\widehat{ATT}^{cs,sa}_{-\ell}$ is significantly different from zero. Note that this test of Assumptions NA and PT is robust to heterogeneous treatment effects, unlike the test based on the pre-trends coefficients from a TWFE ES regression.
Using the not-yet-treated as controls. One may use the not-yet-treated rather than the never-treated as the control group, for at least three reasons. First, when there is no never-treated group, the effects $TE^r_{c,\ell}$ can still be estimated, for every $c \ge 2$ and $\ell \ge 1$ such that $c-1+\ell \le T$, where $T$ is the last period when at least one group is still untreated. Without never-treated groups, Sun and Abraham (2021) propose to use the last treated cohort as the control group, but this may result in imprecise estimators when that cohort is small. Second, even when there are never-treated groups, one may worry that such groups are less comparable to groups that get treated at some point, and researchers sometimes prefer to discard them and use only not-yet-but-eventually-treated groups as controls. While doing so can increase the plausibility of the parallel-trends assumption, it can also reduce statistical precision if the never-treated account for a large proportion of the sample. Therefore, we recommend that practice only when dropping the never-treated substantially reduces pre-trend estimates. Third, even if there are never-treated groups and one is fine with keeping them in the analysis, the not-yet-treated is a larger control group, and may lead to more precise estimators. When there are never-treated groups, the only argument against using the not-yet-treated as controls is that the not-yet-treated might be more subject to anticipation effects.
Estimators with covariates. Callaway and Sant'Anna (2021) also propose estimators relying on a conditional parallel-trends assumption, which extend $\text{DID}_{X,dr}$, the parametric doubly-robust estimator with covariates reviewed in Section 4.1, to binary-and-staggered designs. See Section 5 of Ahrens, Chernozhukov, Hansen, Kozbur, Schaffer and Wiemann (2025) for a debiased-machine-learning estimator that extends $\text{DID}_{X,dr\text{-}ml}$ to binary-and-staggered designs.
Sensitivity analysis under bounded differential trends. The approach discussed in Section 4.3, to bound $ATT_\ell$ and derive a confidence interval for it under a bounded differential-trends assumption, may also be used in Design BST, using the estimated effects $\widehat{ATT}^{cs,sa}_\ell$ and placebos $\widehat{ATT}^{cs,sa}_{-\ell}$, and their variance-covariance matrix, as inputs to the procedure. The only caveat is that because $\widehat{ATT}^{cs,sa}_\ell$ does not apply to the exact same groups as the placebo $\widehat{ATT}^{cs,sa}_{-\ell}$, the bounded differential-trends assumption underlying the procedure is a little less appealing here: the differential trends experienced by groups included in the placebo $\widehat{ATT}^{cs,sa}_{-\ell}$ may differ from the differential trends experienced by groups included in the actual estimator $\widehat{ATT}^{cs,sa}_\ell$. One could restrict the estimation sample to groups included in both $\widehat{ATT}^{cs,sa}_{-\ell}$ and $\widehat{ATT}^{cs,sa}_\ell$, though that might lead to power losses.
Testing (5.8).∗ Under Assumptions NA and PT, one can show that for all $g$ such that $2 \le F_g \le T$ and $t \in \{F_g, F_g+1, ..., T\}$,
$$\widehat{TE}_{g,t} = \overline{Y}_{g,t} - \overline{Y}_{g,F_g-1} - \left(\overline{Y}_{n,t} - \overline{Y}_{n,F_g-1}\right)$$
is unbiased for $TE_{g,t}$, group $g$'s treatment effect at period $t$. The estimators $\widehat{TE}_{g,t}$ can be used to test (5.8), a condition, discussed in the previous chapter, which ensures that the TWFE estimator is unbiased for the ATT, by requiring that the weights attached to the TWFE estimator are uncorrelated with treatment effects. If no group is treated at period one, then $TE_{g,t}$ can be unbiasedly estimated for all treated $(g,t)$ cells. Then, the covariance between the weights and the treatment effects can be unbiasedly estimated using
$$\frac{1}{N_1} \sum_{(g,t):\,D_{g,t} \ne 0} W_{g,t}\left(\widehat{TE}_{g,t} - \widehat{ATT}^{cs,sa}\right),$$
and one can reject (5.8) when this estimator significantly differs from zero.
6.3.2.3 Inference
Callaway and Sant'Anna (2021) propose bootstrap confidence intervals (CIs), with asymptotically exact coverage when the number of groups goes to infinity, under the assumption that groups are independent. de Chaisemartin and D'Haultfœuille (2020) and de Chaisemartin and D'Haultfœuille (forthc.) propose analytic CIs that are asymptotically conservative when the number of groups goes to infinity: their coverage rate is at least as large as their nominal coverage rate. Those CIs also rely on the assumption that groups are independent, and they are conservative conditional on the design. They can be especially conservative when many adoption cohorts contain only one group: with a cohort of size one, the variance of $\overline{Y}_{c,c-1+\ell} - \overline{Y}_{c,c-1}$ cannot be unbiasedly estimated and is instead conservatively estimated by $\left(\overline{Y}_{c,c-1+\ell} - \overline{Y}_{c,c-1}\right)^2$ (de Chaisemartin, Ciccia, D'Haultfœuille, Knau, Malézieux and Sow, 2024). To avoid this issue, researchers with many adoption cohorts containing only one group can instead bootstrap those estimators: the resulting CIs may be less conservative. Alternatively, if they are ready to introduce a little measurement error in groups' exact adoption dates and lengths of exposure to treatment, they can coarsen their time variable (e.g., aggregate a daily panel at the weekly level), to ensure that most adoption cohorts have at least two groups.
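For instance, coarsening daily adoption dates into weeks can be done as follows (group names and dates hypothetical):

```python
# Hypothetical daily adoption dates; groups g1 and g2 adopt in the same week,
# so coarsening merges them into one adoption cohort of size two.
adoption_day = {"g1": 5, "g2": 6, "g3": 20}
adoption_week = {g: (d - 1) // 7 + 1 for g, d in adoption_day.items()}
print(adoption_week)  # {'g1': 1, 'g2': 1, 'g3': 3}
```

The cost, as noted above, is a small amount of measurement error in groups' exposure lengths.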
Obtaining the estimators $\widehat{TE}^{cs,sa}_{c,\ell}$ and $\widehat{TE}^{cs,sa}_{c,-\ell}$ from a TWFE ES regression. Sun and Abraham (2021) show that the estimators $\widehat{TE}^{cs,sa}_{c,\ell}$ can be obtained from the following TWFE ES regression:
$$Y_{g,t} = \sum_{c \in C} \hat{\alpha}_c 1\{F_g = c\} + \sum_{t'=1}^{T} \hat{\gamma}_{t'} 1\{t = t'\} + \sum_{c \in C} \sum_{\ell=-(c-2),\,\ell \ne 0}^{T-(c-1)} \widehat{TE}^{cs,sa}_{c,\ell} 1\{F_g = c, t = c-1+\ell\} + \hat{\epsilon}_{g,t}. \tag{6.26}$$
With respect to (6.17), (6.26) has cohort instead of group fixed effects, and the relative-time indicators are interacted with cohort FEs, thus allowing for heterogeneous effects across adoption cohorts. Note that when control variables are included in this regression, it is no longer guaranteed to estimate a convex combination of the cohort-specific effects $TE^r_{c,\ell}$.
Estimators proposed by Sun and Abraham (2021). The estimators proposed by Sun and Abraham (2021) are computed by the eventstudyinteract Stata command (see Sun, 2021). In its syntax, rel_time_list is the list of relative-time indicators one would include in the event-study regression in (6.17), first_treatment is a variable equal to the period when group $g$ got treated for the first time, and controlgroup is an indicator for the control-group observations (e.g., the never-treated). The command has an option to include covariates, by including them in the regression in (6.26). The resulting estimators are not guaranteed to estimate a convex combination of the cohort-specific effects $TE^r_{c,\ell}$.
In the syntax of the csdid Stata command, which computes the estimators of Callaway and Sant'Anna (2021), cohort is equal to the period when a group starts receiving the treatment. Those estimators are also implemented in the Stata functions xthdidregress and hdidregress.
The number passed to the effects option is the number of effects $ATT_\ell$ the user would like to estimate, while the number passed to the placebo option is the number of pre-trend coefficients the user would like to estimate.
Estimators proposed by Dube et al. (2023).∗ The estimators proposed by Dube et al.
(2023) are computed by the lpdid Stata command (Busch and Girardi, 2023). Without control
variables, those estimators are very fast to compute, because they rely on OLS regressions. With
control variables, the command can compute two types of estimators. Without the rw option,
estimators with covariates are very fast to compute, but they assume that treatment effects
do not vary with the covariates, and if that assumption fails they may estimate a non-convex
combination of cohort-specific effects. With the rw option, estimators with covariates remain
valid if treatment effects vary with the covariates, but the command then relies on the bootstrap
for inference, and may therefore be slower. Note that with the rw option, the command also
estimates ATTℓ rather than a “variance-weighted” average of cohort-specific effects.
In all four panels, the placebo estimates are small and individually insignificant. The placebos are smaller in the bottom-left than in the top-right panel. This is because by default, csdid computes first-difference placebos, comparing the outcome evolution of treated and not-yet-treated states, before the treated start receiving the treatment, and between pairs of consecutive periods. On the other hand, eventstudyinteract computes long-difference placebos. csdid computes long-difference placebos when the long option is specified.
Borusyak et al. (2024), Gardner (2021), and Liu et al. (2024) have proposed imputation estimators, which differ from the DID estimators discussed in the previous section.
6.3.3.1 Estimators
The estimators in Borusyak et al. (2024) can be obtained by running a TWFE regression of the outcome on group and time FEs, and FEs for every treated $(g,t)$ cell:
$$Y_{g,t} = \sum_{g'=1}^{G} \hat{\alpha}_{g'} 1\{g = g'\} + \sum_{t'=1}^{T} \hat{\gamma}_{t'} 1\{t = t'\} + \sum_{g',t'} \widehat{TE}^{b,g,l}_{g',t'} 1\{g = g', t = t'\} D_{g,t} + \hat{\epsilon}_{g,t}.$$
Estimators of the cohort-specific effects $TE^r_{c,\ell}$ can then be obtained by averaging the $\widehat{TE}^{b,g,l}_{g,t}$ across groups that started receiving the treatment at period $c$ and $t = c-1+\ell$. One can show that the resulting estimators are unbiased under Assumptions NA and PT, and are therefore robust to heterogeneous treatment effects. Intuitively, why is it that those estimators are robust to heterogeneous treatment effects?
Because the TWFE regression in the previous display does not impose any restriction on treatment-effect heterogeneity, as it has one coefficient per treated $(g,t)$ cell.
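The imputation logic can equivalently be implemented in two steps: fit group and time FEs on the untreated cells only, impute each treated cell's untreated outcome, and difference. A minimal sketch on hypothetical noiseless data (a real application would use the commands discussed below):

```python
import numpy as np

def imputation_te(Y, D):
    # Y, D: (G, T) arrays; D[g, t] = 1 if cell (g, t) is treated. Fits
    # Y = alpha_g + gamma_t by least squares on untreated cells only,
    # imputes Y(0) for treated cells, and returns Y - Y(0)-hat there.
    G, T = Y.shape
    rows, vals = [], []
    for g in range(G):
        for t in range(T):
            if D[g, t] == 0:
                x = np.zeros(G + T); x[g] = 1.0; x[G + t] = 1.0
                rows.append(x); vals.append(Y[g, t])
    coef = np.linalg.lstsq(np.array(rows), np.array(vals), rcond=None)[0]
    Y0_hat = coef[:G][:, None] + coef[G:][None, :]
    return np.where(D == 1, Y - Y0_hat, np.nan)

# Hypothetical noiseless panel: alpha = (0, 1, 2), gamma = (0, 10, 20),
# group 0 treated at t = 1, 2 with effects 5 and 7.
alpha = np.array([0.0, 1.0, 2.0]); gamma = np.array([0.0, 10.0, 20.0])
Y = alpha[:, None] + gamma[None, :]
D = np.zeros((3, 3)); D[0, 1] = D[0, 2] = 1
Y[0, 1] += 5.0; Y[0, 2] += 7.0
print(imputation_te(Y, D)[0, 1:])  # approximately [5., 7.]
```

Because the first step uses untreated cells only, the treatment effects of the treated cells place no constraint on the fitted FEs, which is the source of the robustness to heterogeneity.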
Alternatively, one can run the regression
$$Y_{g,t} = \sum_{c \in C} \hat{\alpha}_c 1\{F_g = c\} + \sum_{t'=1}^{T} \hat{\gamma}_{t'} 1\{t = t'\} + \sum_{c \in C} \sum_{\ell=1}^{T-(c-1)} \widehat{TE}^{b,g,l}_{c,\ell} 1\{F_g = c, t = c-1+\ell\} + \hat{\epsilon}_{g,t}.$$
One can show that the coefficients on $1\{F_g = c, t = c-1+\ell\}$ in this regression are numerically equivalent to the estimators of $TE^r_{c,\ell}$ proposed by Borusyak et al. (2024). Without variation in treatment timing, the equation in the previous display reduces to that in (3.29). Thus, without variation in treatment timing, imputation event-study estimators are equivalent to the event-study coefficients in the TWFE ES regression in (3.28), which, instead of using the period just before treatment as the baseline period, uses the average of all pre-treatment periods. This numerical equivalence also shows that the only difference between the imputation estimators and the TWFE ES regression of Sun and Abraham (2021) is that the one above does not have cohort $\times$ time-to-adoption FEs with which to test the parallel-trends and no-anticipation assumptions.
6.3.3.3 Extensions
Estimators with covariates. To control for covariates, one can just include them in the
imputation first-step. As only untreated observations are used in that first step, this yields
an estimator robust to heterogeneous treatment effects, unlike what happens when one simul-
taneously estimates the coefficients on the covariates and the treatment effect. Similarly, the
imputation procedure can easily be extended to triple-differences, or models with group-specific
linear trends.
Pre-trend tests of Assumptions NA and PT. To test Assumptions NA and PT, Borusyak
et al. (2024) propose to estimate a TWFE regression among all the untreated (g, t) cells, with K
leads of the treatment, and use the leads’ coefficients as placebos. Those pre-trend estimators
are different from those of Sun and Abraham (2021), Callaway and Sant’Anna (2021), and
de Chaisemartin and D’Haultfœuille (forthc.), an issue we return to in our empirical application.
Bibliographic notes. Before Borusyak et al. (2024), Liu et al. (2024) and Gardner (2021) had also proposed numerically-equivalent imputation estimators.⁶ The result showing that imputation estimators are efficient with independent and identically distributed (i.i.d.) errors, which we discuss below, is shown in Borusyak et al. (2024).
6.3.3.4 Inference
Borusyak et al. (2024) propose variance estimators and confidence intervals (CIs) based on an
asymptotic approximation where the number of groups goes to infinity, under the assumption
that groups are independent. Their CIs are conservative conditional on the design. Gardner,
Thakral, Tô and Yap (2024) show that imputation estimators can be cast as generalized-method-
of-moments (GMM) estimators, which gives rise to different variance estimators and CIs. In
simulations, Gardner et al. (2024) find that when cohorts have a small number of groups, the
CIs of Borusyak et al. (2024) undercover while their CIs have closer-to-nominal size.
⁶ We saw in Chapter 4 that even before that, Hsiao et al. (2012), Gobillon and Magnac (2016), and Xu (2017) had proposed a similar strategy to estimate treatment effects under a TWFE model with interactive fixed effects.
The did_imputation Stata command (see Borusyak, 2021) and the didimputation R command (see Butts, 2021b) compute the estimators proposed by Borusyak et al. (2024). In the syntax of the Stata did_imputation command, cohort is a variable equal to the period when group $g$ first got treated. The did2s Stata (Butts and Gardner, 2021) and R (Butts, 2021a) commands compute the estimators proposed by Gardner (2021). The fect Stata (Liu et al., 2022b) and R (Liu et al., 2022a) commands compute the estimators proposed by Liu et al. (2024).
Using the wolfers_didtextbook dataset, compute the same number of event-study and pre-
trend estimators as before, with the did_imputation command. Pay attention to the fact that
with this command, the cohort variable has to be missing for the never-treated groups. Interpret
the results.
To estimate pre-trends, the did_imputation command estimates a TWFE regression among all the untreated $(g,t)$ cells, with $K$ leads of the treatment, and uses the leads' coefficients as the placebos. To be consistent with the other estimations, we run the command with 9 leads. Then, everything is relative to 10 periods prior to treatment, which is why the reference period is set at $t = -10$ in the bottom-right panel.
Note that csdid does not allow one to jointly test whether the placebos in Figure 6.3 are significant: it computes a joint nullity test, but for more disaggregated, cohort-specific placebos.
Figure 6.3: Effects of Unilateral Divorce Laws, using four estimation methods

[Four panels of event-study estimates; x-axis: relative time to year before law ($-9$ to $15$); y-axis: estimated effect ($-1$ to $0.5$).]

Note: This figure shows the estimated effects of Unilateral Divorce Laws on the divorce rate, as well as placebo pre-trends estimates, using the data of Wolfers (2006) and four estimation methods. In the top-left panel, we show estimated effects per the event-study regression in (6.17), with $L = 16$, $K = 9$, and endpoint binning. In the top-right (resp. bottom-left, bottom-right) panel, we show estimated effects per the eventstudyinteract (resp. csdid, did_imputation) Stata commands. All estimations are weighted by states' populations. Standard errors are clustered at the state level. 95% confidence intervals are shown in red.
6.3.4 Comparing the properties of those estimators, and their software implementation
Based on their number of downloads from the SSC repository as of June 2024, it seems that
eventstudyinteract, csdid, did_multiplegt_dyn, and did_imputation currently are the
most commonly used commands for heterogeneity-robust DID estimation. In what follows, we
compare the estimators, variance estimators, and confidence intervals produced by those four
commands, as well as their functionalities.
6.3.4.1 Variance

One may estimate $TE^r_{c,\ell}$ using the not-yet-treated as controls, via
$$\widehat{TE}^{b,h}_{c,\ell} = \sum_{k=1}^{\ell} \left[\overline{Y}_{c,c-1+k} - \overline{Y}_{c,c-1+k-1} - \left(\overline{Y}_{nyt,c-1+k,c-1+k} - \overline{Y}_{nyt,c-1+k,c-1+k-1}\right)\right],$$
where for all $(t', t)$, $\overline{Y}_{nyt,t,t'}$ denotes the average outcome, at period $t'$, of groups not yet treated at $t$. With the not-yet-treated as the control group, the estimator of Callaway and Sant'Anna (2021) compares the $c-1$ to $c-1+\ell$ outcome evolution of cohort $c$ and groups not yet treated
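The sum-of-first-differences construction in the previous display can be sketched as follows (hypothetical noiseless data; `Ybar` maps each group to its period-average outcomes, `F` to its first treated period, `None` for never-treated):

```python
def te_nyt(Ybar, F, c, l):
    # Sums, over k = 1..l, the first difference of cohort c's average outcome,
    # minus that of groups not yet treated at period c-1+k.
    cohort = [g for g in Ybar if F[g] == c]
    total = 0.0
    for k in range(1, l + 1):
        t = c - 1 + k
        nyt = [g for g in Ybar if F[g] is None or F[g] > t]
        d_c = sum(Ybar[g][t] - Ybar[g][t - 1] for g in cohort) / len(cohort)
        d_n = sum(Ybar[g][t] - Ybar[g][t - 1] for g in nyt) / len(nyt)
        total += d_c - d_n
    return total

# Hypothetical data: common trend t, group "a" adopts at 2 (effects 1 then 2),
# "b" adopts at 3 (effect 0.5), "n" never treated.
Ybar = {"a": {1: 1.0, 2: 3.0, 3: 5.0},
        "b": {1: 1.0, 2: 2.0, 3: 3.5},
        "n": {1: 1.0, 2: 2.0, 3: 3.0}}
F = {"a": 2, "b": 3, "n": None}
print(te_nyt(Ybar, F, 2, 2))  # effect of 2 periods of exposure in cohort "a": 2.0
```

Each first difference uses only groups still untreated at that period, so the control group can shrink as $k$ grows.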
Looking at Figure 6.3, is it the case that one heterogeneity-robust estimation method leads to
sizeably more precise estimates than the others?
Application to the UDL example. The confidence interval of the effect of having been
exposed to a UDL for one year is much tighter in the bottom-right panel of Figure 6.3 than
in all other panels: for that effect, the estimator proposed by Borusyak et al. (2024) does lead
to a large precision gain. However, the opposite can hold when one considers other effects.
For instance, the confidence interval of the effect of having been exposed to a UDL for three
years is more than 50% larger per did_imputation than per csdid. Accordingly, the estimators
proposed by Borusyak et al. (2024) do not always lead to precision gains, relative to those
proposed by Sun and Abraham (2021) or Callaway and Sant’Anna (2021).
Precision loss with respect to TWFE estimators. TWFE ES estimators of the event-study effects $ATT_\ell$ are BLUE if $TE^r_{g,\ell}$ is constant across $g$ and the errors $\varepsilon_{g,t}$ in (6.12) are homoscedastic and i.i.d., both across $g$ and $t$. As discussed earlier in this chapter, the forbidden comparisons leveraged by those estimators can lead to a bias if effects are heterogeneous, but can improve precision if effects are homogeneous and errors are homoscedastic and i.i.d. There might therefore be a bias-variance trade-off between TWFE and heterogeneity-robust estimators. Unfortunately, we do not yet have a meta-analysis comparing the variance of TWFE and heterogeneity-robust estimators in a large number of binary and staggered designs. Chiu
et al. (2023) compare TWFE and heterogeneity-robust estimators in 49 political-science articles with a binary treatment, 11 of which have a binary and staggered design, and 38 of which have a binary and non-absorbing design. Their Figure 3 reports the distribution, across the 49 articles, of the ratio of the estimated standard errors of the imputation estimators of Liu et al. (2024) to those of TWFE estimators. It shows that the median ratio is equal to 1.1: in the median application, using a heterogeneity-robust estimator leads to a moderate loss of precision, equivalent to a 10% sample-size reduction. Of course, there is heterogeneity across articles: in some of them, heterogeneity-robust estimators are more precise than TWFE estimators, but in others their standard errors are up to three times larger than those of TWFE estimators.
Egerod and Hollenbach (2024) compute, in simulations, the finite-sample coverage rate of the
confidence intervals (CIs) computed by eventstudyinteract, csdid, did_multiplegt_dyn,
and did_imputation.
Simulation design. The simulations are based on a real data set with the 50 US states, with three cohorts of treated states (early, middle, and late) and some never-treated states, varying the number of states per cohort and the size of the treatment effect. The design is not fixed: in each draw, different states are randomly assigned to the early, middle, and late treated cohorts. By contrast, the asymptotic results underlying the CIs computed by did_multiplegt_dyn and did_imputation are conditional on the design. Thus, the simulations evaluate the performance of those CIs outside of the framework where they have proven guarantees.⁸
Results. Figure 2 in Egerod and Hollenbach (2024) shows that when each cohort has only two
states, thus meaning that there are only six treated states in total, all CIs tend to undercover, to
different degrees. Specifically, when the treatment effect is low, 95% CIs respectively have effec-
tive coverage rates of slightly less than 80% with eventstudyinteract and did_imputation, of
around 85% with did_multiplegt_dyn, and of slightly less than 90% with csdid.

⁸ In their Table 1, de Chaisemartin, Ciccia, D'Haultfœuille, Knau, Malézieux and Sow (2024) find that the CI of did_multiplegt_dyn has close to nominal coverage in simulations based on a real data set where the design is fixed.

Increasing the
treatment effect worsens the coverage of the CIs of eventstudyinteract and did_imputation.
For did_imputation, this could reflect the fact that this CI is valid conditional on the design,
not unconditionally, and the design contributes more to the estimator’s variance when the treat-
ment effect is large. Increasing the treatment effect does not affect the coverage of the CIs of
did_multiplegt_dyn, and improves the coverage of the CIs of csdid, which actually reaches
nominal coverage when the treatment effect is very large. When each cohort has four states,
thus meaning that there are twelve treated states in total, all CIs have close to nominal cov-
erage when the treatment effect is low, and the coverage of the CI of did_imputation (and
to a lesser extent eventstudyinteract and did_multiplegt_dyn) deteriorates slightly when
the treatment effect increases. When each cohort has six states, thus meaning that there are
twenty-four treated states in total, all CIs have close to nominal coverage, irrespective of the
magnitude of the treatment effect.
⁹ When estimating $TE^r_{c,\ell}$, the number of treated groups is just $G_c$, the number of groups in cohort $c$, while the number of control groups is the number of never-treated or not-yet-treated groups. When estimating $ATT_\ell$, the number of treated groups is the sum of $G_c$ across the cohorts for which $TE^r_{c,\ell}$ can be estimated. The number of control groups is the number of never-treated groups, or the average number of not-yet-treated groups across the estimators of $TE^r_{c,\ell}$.
6.3.4.3 Bias
Sensitivity analysis under bounded differential trends. As discussed above, one can construct placebo estimators that closely mimic the actual treatment-effect estimators of Callaway and Sant'Anna (2021) and Sun and Abraham (2021), by comparing the outcome evolutions of (almost) the same groups over the same number of periods, before the treatment onset. This makes those estimators amenable to the estimation approach under bounded differential trends proposed by Rambachan and Roth (2023): this approach requires that placebos be informative of the actual estimators' bias, an easily-rationalizable assumption when the placebos and actual estimators are constructed symmetrically. Building a placebo that would similarly mimic the estimator proposed by Borusyak et al. (2024) is not feasible, because that estimator leverages all pre-treatment periods to estimate the treatment's effect.
Panel A shows that all four commands can be used with an absorbing and binary treatment, while only did_multiplegt_dyn can also handle a non-absorbing and/or non-binary treatment. Panel B shows that did_imputation is the fastest of the four commands, while csdid is the slowest. On a relatively large dataset with 5100 groups and 33 periods,
constructed by duplicating 100 times each state in the dataset of Wolfers (2006), the run time
of csdid is almost 6 times larger than that of did_imputation, that of eventstudyinteract
is 1.71 times larger, and that of did_multiplegt_dyn is 1.40 times larger. Panel C shows that
all commands can be used to estimate the ATT, the event-study effects ATTℓ , and cohort-
specific event-study effects. csdid and did_imputation can be used to obtain other, po-
tentially more disaggregated effects. The estimation of several treatment effects could lead
to a multiple-hypothesis-testing problem: to address it, csdid produces jointly valid confi-
dence intervals, and did_multiplegt_dyn produces a joint test that all event-study effects are
zero. Panel D shows that eventstudyinteract only uses never treated (or the last treated)
as controls, did_imputation only uses not-yet treated, and csdid and did_multiplegt_dyn
can use both groups as controls depending on the options specified. Panel E shows that
eventstudyinteract and did_multiplegt_dyn produce long-difference pre-trend estimators
by default, csdid produces long-difference pre-trends if the long or long2 options are speci-
fied, and did_imputation does not produce long-difference pre-trends. csdid produces a joint
test that cohort-specific event-study pre-trends are all equal to zero. did_multiplegt_dyn and
did_imputation produce a joint test that event-study pre-trends are all equal to zero. Panel F
shows that eventstudyinteract does not compute heterogeneity-robust estimators that con-
trol for covariates: the estimators with covariates computed by that command assume constant
treatment effects. All other commands can control for covariates linearly. Only csdid can con-
trol for any type of time-invariant covariates non-parametrically, while did_multiplegt_dyn and
did_imputation can only do so for discrete covariates. On the other hand, csdid cannot control
for time-varying covariates or allow for group-specific linear trends, while did_multiplegt_dyn
and did_imputation can. Finally, Panel G shows that did_multiplegt_dyn and did_imputation
have built-in options to ensure that all event-study effects apply to the same groups, and to in-
vestigate heterogeneous treatment effects along some covariates.
6.3. HETEROGENEITY-ROBUST ESTIMATORS 229
                                          eventstudyinteract  csdid  did_multiplegt_dyn  did_imputation
Panel A: Applicability
Absorbing and binary D                           Yes           Yes          Yes                Yes
Non-absorbing and/or non-binary D                No            No           Yes                No
Panel F: Heterogeneity-robust estimators, controlling for covariates
Linearly                                         No            Yes          Yes                Yes
Non-param, discrete time-invariant Xs            No            Yes          Yes                Yes
Non-param, continuous time-invariant Xs          No            Yes          No                 No
Time-varying Xs                                  No            No           Yes                Yes
Group-specific linear trends                     No            No           Yes                Yes
6.4.1 Estimating the correlation between treatment effects and some covariates
Assume that one wants to assess if the group-specific effects of ℓ periods of exposure to treatment
TErg,ℓ are correlated with a K × 1 vector of time-invariant covariates Xg , whose first coordinate
is a constant. In this section, we propose a generalization of the method described in Section
3.6.1 to designs with variation in treatment timing.
a two-step regression estimator, and we are not aware of a Stata or R command that estimates
its asymptotic variance.
evolutions of groups reaching ℓ periods of exposure to treatment before T are uncorrelated with the non-constant variables in $X_g$, then de Chaisemartin and D'Haultfœuille (forthc.) show that $\hat{\beta}_{\ell,X,k}$ is unbiased for $\beta_{\ell,X,k}$. The hypothesis $\beta^{\Delta Y(0)}_{k,\ell,X} = 0$ is placebo testable, by regressing $Y_{g,F_g-1-\ell} - Y_{g,F_g-1}$ on $(X_g, (1\{F_g = c\})_{c\in C})$, in the sample of groups such that $F_g - 1 + \ell \leq T$ and $F_g - 1 - \ell \geq 2$, and testing if the coefficient on $X_g$ is equal to zero. The did_multiplegt_dyn Stata and R packages have a predict_het option that can be used to compute $\hat{\beta}_{\ell,X}$ and placebo-test $\beta^{\Delta Y(0)}_{k,\ell,X} = 0$.
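As an illustration, the placebo regression just described can be simulated in a few lines. The data-generating process below is entirely hypothetical; since cohort dummies absorb the intercept, the coefficient on X_g is computed as a within-cohort OLS slope (Frisch-Waugh):

```python
import random
from collections import defaultdict

random.seed(0)
l, T = 2, 8
groups = []
for g in range(300):
    Fg = random.choice([4, 5, 6])      # adoption period, i.e. cohort
    Xg = random.gauss(0, 1)            # time-invariant covariate
    a = random.gauss(0, 1)             # group effect
    # untreated outcomes follow parallel trends, so the placebo slope should be ~0
    Y = {t: a + 0.5 * t + random.gauss(0, 0.3) for t in range(1, T + 1)}
    groups.append((Fg, Xg, Y))

# backward long-difference Y_{g,F_g-1-l} - Y_{g,F_g-1}, with the sample restriction
sample = [(Fg, Xg, Y[Fg - 1 - l] - Y[Fg - 1])
          for Fg, Xg, Y in groups if Fg - 1 + l <= T and Fg - 1 - l >= 2]

# cohort dummies absorb the intercept, so by Frisch-Waugh the coefficient on X_g
# is the OLS slope after demeaning X and the outcome within cohorts
by_cohort = defaultdict(list)
for Fg, Xg, dY in sample:
    by_cohort[Fg].append((Xg, dY))
num = den = 0.0
for obs in by_cohort.values():
    mx = sum(x for x, _ in obs) / len(obs)
    my = sum(v for _, v in obs) / len(obs)
    num += sum((x - mx) * (v - my) for x, v in obs)
    den += sum((x - mx) ** 2 for x, _ in obs)
beta_placebo = num / den
print(round(beta_placebo, 2))  # close to 0 under parallel trends
```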
ment effects of groups with Xg,2 = 0. Specifically, that coefficient estimates a weighted sum of
effects among treated groups with Xg,2 = 1, with weights summing to one, plus a weighted sum
of effects among treated groups with Xg,2 = 0, with weights summing to zero. The same holds
for the coefficient on Dg,t (1 − Xg,2 ): it may be contaminated by the treatment effects of treated
groups with Xg,2 = 1.
Estimating the conditional ATT function? Instead of estimating the best linear predictor
of TErg,ℓ , one may be interested in estimating the function mapping groups’ covariates to the
average of TErg,ℓ . Hatamyar, Kreif, Rocha and Huber (2023) combine insights from Callaway
and Sant’Anna (2021) and Lu et al. (2019) to form an estimator of that function under a
conditional parallel-trends assumption, in designs with variation in treatment timing. However,
their estimators are not implemented yet in a Stata or R command.
denote the sample variance of Yg,c−1+ℓ − Yg,c−1 in cohort c, let G>t denote the number of groups
not yet treated at t, and let
$$\hat{\sigma}^2_{\ell,nyt,c-1+\ell} = \frac{1}{G_{>c-1+\ell}-1}\sum_{g:F_g>c-1+\ell}\left(Y_{g,c-1+\ell} - Y_{g,c-1} - \left(\overline{Y}_{nyt,c-1+\ell,c-1+\ell} - \overline{Y}_{nyt,c-1+\ell,c-1}\right)\right)^2$$
denote the sample variance of Yg,c−1+ℓ − Yg,c−1 among groups not-yet-treated at c − 1 + ℓ. Under
similar assumptions as in Section 3.6, one can show that
$$\frac{G_c - 1}{G_c}\left(\hat{\sigma}^2_{\ell,c} - \hat{\sigma}^2_{\ell,nyt,c-1+\ell}\right)$$
is unbiased for the variance of the group-specific effects of ℓ periods of exposure to treatment in
cohort c. Then, one can aggregate those estimators across cohorts to estimate the variance of
6.5. NON-LINEAR DID 233
the group-specific effects of ℓ periods of exposure to treatment. We are not aware of a Stata or
R package computing those variance estimators in designs with variation in treatment timing.
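The variance estimator above can be sketched on simulated data in which the group-specific effects of cohort c have variance one (a hypothetical data-generating process, not the authors' code):

```python
import random, statistics

random.seed(1)
c, l, sd_eps = 4, 1, 0.5                    # cohort, exposure length, noise sd
t0, t1 = c - 1, c - 1 + l
gamma = {t: 0.3 * t for t in range(1, 9)}   # common time effects (parallel trends)

def long_diff(te):
    # Y_{g,t1} - Y_{g,t0} = common trend + treatment effect + differenced noise
    return (gamma[t1] - gamma[t0]) + te + random.gauss(0, sd_eps) - random.gauss(0, sd_eps)

cohort_c = [long_diff(1 + random.gauss(0, 1)) for _ in range(500)]  # effects have variance 1
not_yet = [long_diff(0.0) for _ in range(500)]                      # groups with F_g > t1

s2_c = statistics.variance(cohort_c)          # sample variance in cohort c
s2_nyt = statistics.variance(not_yet)         # sample variance among not-yet-treated
Gc = len(cohort_c)
var_te_hat = (Gc - 1) / Gc * (s2_c - s2_nyt)  # the variance estimator described above
print(round(var_te_hat, 2))                   # close to the true variance, 1
```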
the distribution of those effects would be misleading: one first needs to apply a deconvolution to
these noisy measures. However, deconvolution techniques rely on strong assumptions, and yield
estimators that often converge at a slow rate.
does not depend on c, for a known, strictly increasing function L taking values in [0, 1]. (6.27) is
a parallel-trends assumption on a monotone transformation of the average untreated outcome.
It generalizes (3.34) to designs with variation in treatment timing. Let us assume for now that
there exist never-treated groups. Then,
$$
\begin{aligned}
E\left[\overline{Y}_{c,t}(0_t)\right] &= L\left(L^{-1}\left(E\left[\overline{Y}_{c,t}(0_t)\right]\right)\right)\\
&= L\left(L^{-1}\left(E\left[\overline{Y}_{c,c-1}(0_{c-1})\right]\right) + L^{-1}\left(E\left[\overline{Y}_{c,t}(0_t)\right]\right) - L^{-1}\left(E\left[\overline{Y}_{c,c-1}(0_{c-1})\right]\right)\right)\\
&= L\left(L^{-1}\left(E\left[\overline{Y}_{c,c-1}(0_{c-1})\right]\right) + L^{-1}\left(E\left[\overline{Y}_{n,t}(0_t)\right]\right) - L^{-1}\left(E\left[\overline{Y}_{n,c-1}(0_{c-1})\right]\right)\right)\\
&= L\left(L^{-1}\left(E\left[\overline{Y}_{c,c-1}\right]\right) + L^{-1}\left(E\left[\overline{Y}_{n,t}\right]\right) - L^{-1}\left(E\left[\overline{Y}_{n,c-1}\right]\right)\right) \qquad (6.28)
\end{aligned}
$$
where the third equality follows from (6.27). Then, for c ∈ C and ℓ ∈ {1, ..., T − (c − 1)} a
natural estimator of TErc,ℓ is
$$\widehat{TE}^{ldv}_{c,\ell} := \overline{Y}_{c,c-1+\ell} - L\left(L^{-1}\left(\overline{Y}_{c,c-1}\right) + L^{-1}\left(\overline{Y}_{n,c-1+\ell}\right) - L^{-1}\left(\overline{Y}_{n,c-1}\right)\right).$$
234 CHAPTER 6. DESIGNS WITH VARIATION IN TREATMENT TIMING
Equivalently,

$$\widehat{TE}^{ldv}_{c,\ell} = L\left(\hat{\alpha} + \hat{\alpha}_c + \hat{\gamma}_{c-1+\ell} + \hat{\beta}_{c,\ell}\right) - L\left(\hat{\alpha} + \hat{\alpha}_c + \hat{\gamma}_{c-1+\ell}\right), \qquad (6.29)$$

where $(\hat{\alpha}, (\hat{\alpha}_c)_{c\in C}, (\hat{\gamma}_t)_{t\in\{2,\dots,T\}}, (\hat{\beta}_{c,\ell})_{c\in C,\,\ell\in\{-(c-2),\dots,-1,1,\dots,T-(c-1)\}})$ is the maximum likelihood estimator of the parameters of

$$P(Y_{g,t} = 1) = L\left(\alpha + \sum_{c\in C}\alpha_c 1\{g \in c\} + \sum_{t'=2}^{T}\gamma_{t'}1\{t = t'\} + \sum_{c\in C}\ \sum_{\substack{\ell=-(c-2)\\ \ell\neq 0}}^{T-(c-1)}\beta_{c,\ell}1\{F_g = c,\, t = c-1+\ell\}\right). \qquad (6.30)$$
If L is the logistic (resp. normal) cdf, for instance, those coefficients can be obtained by a logit
(resp. probit) version of the linear regression proposed by Sun and Abraham (2021). For ℓ < −1,
the coefficients βbc,ℓ can be used to test (6.28).
An advantage of (6.29) is that the corresponding regression simultaneously computes event-study and pre-trend estimators.
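As a concrete illustration, here is TE^ldv_{c,ℓ} computed with L the logistic cdf from four cell means; the numbers are made up for the sketch:

```python
import math

def L(x):          # logistic cdf
    return 1 / (1 + math.exp(-x))

def Linv(p):       # logit, the inverse of L
    return math.log(p / (1 - p))

# hypothetical sample mean outcomes (shares, since the outcome is binary):
Y_c_post, Y_c_pre = 0.62, 0.40   # cohort c at periods c-1+l and c-1
Y_n_post, Y_n_pre = 0.55, 0.50   # never-treated groups at the same periods

te_ldv = Y_c_post - L(Linv(Y_c_pre) + Linv(Y_n_post) - Linv(Y_n_pre))
print(round(te_ldv, 3))  # 0.171
```

Note that the counterfactual is built on the logit scale: the never-treated groups' outcome evolution is added to the cohort's pre-period mean after applying L⁻¹, then mapped back through L.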
Incidental parameters? In non-linear models, including cohort rather than group FEs is
important: including group FEs could lead to an incidental parameter problem (Neyman and
Scott, 1948), which could severely bias the estimator if T , the number of time periods of the
panel, is low. If the number of groups is large relative to the number of cohorts, the estimators
described above will not be subject to that problem.
TWFE regressions have often been used to estimate the effect of regional trade
agreements on trade. The effect of regional trade agreements (RTA) on trade is a question
that has been extensively studied by trade economists. The outcome of interest is Yi,j,t , the
trade flow from country i to j at t, and the treatment Di,j,t is an indicator for whether there is
an RTA between i and j at t. With respect to our previous notation, groups g now correspond
to origin-destination country pairs (i, j). Researchers sometimes estimate TWFE regressions
of ln(Yi,j,t ) on origin-destination-pair FEs, year FEs, and Di,j,t . But following the influential
work of Silva and Tenreyro (2006), they more often estimate a TWFE Poisson regression of
Yi,j,t on origin-destination-pair FEs, year FEs, and Di,j,t , to account for heteroscedasticity, and
also because Yi,j,t can be equal to zero when there is no trade from i to j at t. Motivated by
gravity models of international trade, researchers also often include control variables in their
specification, such as importer-year and exporter-year FEs, and year FEs interacted with an
indicator equal to one when i = j, namely when Yi,j,t actually represents a domestic trade flow.
The RTA treatment is binary and staggered: there is variation in the timing when different
country pairs start having an RTA. Thus, TWFE Poisson regressions of the effect of RTAs on
trade could be biased, if the effect of RTAs is heterogeneous across country pairs or over time.
Heterogeneity-robust TWFE estimators are twice as large as standard TWFE estimators. Nagengast and Yotov (2025) investigate whether findings from TWFE Poisson regressions are robust to allowing for heterogeneous effects. They use the Structural Gravity Database
of the World Trade Organization to measure trade flows between countries, and the Dynamic
Gravity Dataset of the US International Trade Commission to measure RTAs between countries,
and their dates of adoption. They start by estimating the standard TWFE Poisson regression
in the trade literature. Then, they estimate a heterogeneity-robust version of that regression,
where instead of having just one treatment indicator Di,j,t , the regression has cohort× time-
since-RTA-adoption FEs, following Wooldridge (2023). Then, they average the coefficients on
the cohort× time-since-RTA-adoption FEs, to obtain an estimator comparable to the coefficient
on Di,j,t in the standard regression. They also estimate the standard TWFE OLS regression,
and a heterogeneity-robust TWFE OLS regression following Wooldridge (2021). The first (resp.
second, third, fourth) line of Table 6.2 below replicates Column (1) of Table 3 (resp. Column
(2) of Table 3, Column (1) of Table 7, Column (2) of Table 7) of their paper. Allowing for
heterogeneous effects doubles the estimated effect of RTAs on trade, both in the Poisson and in
the OLS regression, and the differences between the estimators are statistically significant at all
conventional levels. On the other hand, using a Poisson or an OLS regression does not change
results much.
Table 6.2: Effect of RTAs on trade
Notes: The table shows estimates of the effect of RTAs on trade, taken from Nagengast and Yotov (2025). The
TWFE Poisson estimate is from their Table 3 Column (1), the heterogeneity-robust TWFE Poisson estimate is
from their Table 3 Column (2), the TWFE OLS estimate is from their Table 7 Column (1), and the heterogeneity-
robust TWFE OLS estimate is from their Table 7 Column (2). Standard errors are shown below the estimates,
between parentheses.
Standard TWFE regressions are downward biased, because RTAs’ effects increase
with length of exposure, and because RTAs’ effects are larger for country pairs
that adopt an RTA early. In Panel (a) of their Figure 3, reproduced in Figure 6.4 below,
Nagengast and Yotov (2025) show heterogeneity-robust estimates of ATTℓ , the average effect of
having had an RTA for ℓ years (“ETWFE” in the figure), as well as standard TWFE Poisson ES
estimators of the same effects (“Dynamic TWFE” in the figure). Heterogeneity-robust estimates
clearly show that effects are increasing with length of exposure, something that does not appear
as clearly with the TWFE Poisson ES estimates.
Figure 6.4: Estimates of RTA effects, by length of exposure (ETWFE and Dynamic TWFE estimates, plotted against periods from treatment onset; reproduced from Panel (a) of Figure 3 in Nagengast and Yotov, 2025).

Figure 6.5: Estimates of RTA effects, by cohort of adoption.
Finally, the authors follow de Chaisemartin and D'Haultfœuille (2020), and compute the weights attached to the standard TWFE estimator. In their Figure 4, reproduced in Figure 6.6 below, they aggregate those weights by length of exposure to an RTA and
by cohort of adoption, and compare them to the weights attached to the heterogeneity-robust
TWFE OLS estimator. Consistent with the results discussed in Section 6.2.1, the TWFE OLS
estimator gives too much weight to short-run effects, and to the effects of late adopters. As
short-run effects are smaller than long-run effects, and late adopters have smaller effects than
early adopters, the TWFE OLS estimator is downward biased. Some weights attached to the
TWFE OLS estimator are negative, but negative weights are small and sum to −0.028 only.
Figure 6.6: Weights associated to the ETWFE and TWFE estimators (reproduced from Figure 4 of Nagengast and Yotov, 2025; the panels plot the weights by periods from treatment onset and by treatment cohort, with separate event-time panels for the 1983–1989 and 1990–1999 cohorts).

Pre-trend tests suggest that the no-anticipation and parallel-trends assumptions are plausible in this application. In their Figure 2, reproduced in Figure 6.7 below, the authors conduct pre-trend tests, in the spirit of those proposed by Borusyak et al. (2024). Pre-trend estimates are much smaller than the event-study estimates in Figure 6.4, most of them are individually insignificant, and they are jointly insignificant (F-test p-value = 0.298). This suggests that the no-anticipation and parallel-trends assumptions are plausible in this application.

Figure 6.7: Pre-trend estimates, reproduced from Figure 2 of Nagengast and Yotov (2025).
Athey and Imbens (2006) extend their CIC estimator of QTEs to absorbing and binary treat-
ments with more than two periods and variation in treatment timing. CIC estimators of QTEs
at the time period when treated groups start receiving the treatment are computed by the
fuzzydid Stata package. We are not aware of a Stata or R package that can be used to com-
pute CIC estimators of QTEs at later periods.12 To our knowledge, the QTE estimators of Kim
and Wooldridge (2024) and the quantile-DID estimators have not been extended to designs with
more than two periods and variation in treatment timing.
12 The cic (Kranker, 2019) Stata command and the qte (Callaway, 2023) R package can only be used with two time periods.
6.6. FURTHER TOPICS∗ 241
6.6.1 You shall never use always-treated as the control group, except maybe if
they have been treated for a long time
Sometimes, the only control group available to the researcher consists of always-treated groups:
some groups are treated at period one, and the groups untreated at period one all become treated
at the same date (see, e.g., Field, 2007). Then, if one wants to compute a DID, the only option is
to compare the outcome evolutions of switchers and always-treated groups. In other cases, there
is variation in the timing at which switchers get treated, but a large proportion of groups are
treated at period one (see, e.g., Ujhelyi, 2014). Then, not leveraging DIDs comparing switchers
and always-treated may lead to a large loss of statistical precision. In this chapter, we have seen
that such DIDs rely on Assumptions NA and PT, and on the assumption that, by the time when
the switchers get treated, the treatment effect of the always-treated groups is constant over time.
This assumption may be implausible if the always-treated started receiving the treatment a few
periods before the switchers get treated. If, on the other hand, the always-treated groups started
receiving the treatment a long time before the switchers get treated, it may be reasonable to
assume that their treatment effect is no longer evolving by that time, and pre-trend tests can
be useful to assess the plausibility of this assumption. All the Stata and R commands reviewed
in this chapter can easily be tweaked to ensure that all or some always-treated groups are used
as controls. For instance, with did_multiplegt_dyn one just needs to redefine the treatment
variable of those groups as equal to zero throughout the panel.
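The tweak just described, recoding the treatment of always-treated groups to zero throughout the panel before calling did_multiplegt_dyn, can be sketched on hypothetical long-format data:

```python
rows = [  # hypothetical long-format panel: group, period, treatment
    {"g": 1, "t": 1, "D": 1}, {"g": 1, "t": 2, "D": 1},   # always treated
    {"g": 2, "t": 1, "D": 0}, {"g": 2, "t": 2, "D": 1},   # switcher
]
always_treated = {g for g in {r["g"] for r in rows}
                  if all(r["D"] == 1 for r in rows if r["g"] == g)}
for r in rows:
    if r["g"] in always_treated:
        r["D"] = 0   # recoded: always-treated now look like never-treated controls
print(sorted(always_treated), [r["D"] for r in rows])  # [1] [0, 0, 0, 1]
```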
Instances where the outcome is observed less frequently than the treatment. There
are circumstances where the outcome is observed less frequently than the treatment. For in-
stance, electoral outcomes are observed only during election years, while treatment may be
observed every year. Then, one could restrict the estimation sample to years where the outcome
is observed, and naively apply some of the event-study estimators reviewed in this chapter. How-
ever, doing so may yield hard-to-interpret event-study effects that conflate effects of different
exposure lengths. To see this, assume that elections take place every other year, on even years.
In the sample restricted to even years, two groups g and g ′ such that g adopted treatment in
year 2k while g ′ adopted treatment in year 2k − 1 will be considered as having both adopted in
2k. Then, estimators of ATT1 will actually average effects of one year of exposure for groups
that adopted during an even year, and of two years of exposure for groups that adopted during
an odd year. A simple solution amounts to estimating effects separately for groups adopting
during an even year, and for groups adopting during an odd year. Then, one may combine in a
single event-study graph effects of odd numbers of years of exposure for groups adopting during
an even year, and effects of even numbers of years of exposure for groups adopting during an
odd year. A tutorial showing how this can be done with the did_multiplegt_dyn Stata and R
commands is available here.
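A two-line computation illustrates the conflation: in the sample restricted to even years, a group adopting in 2k and a group adopting in 2k − 1 are both coded as 2k adopters, yet at the first post-adoption election they have one and two years of exposure, respectively (hypothetical years):

```python
def exposure_at(election_year, adoption_year):
    # years of treatment accumulated by the time of that election (inclusive)
    return election_year - adoption_year + 1

even_adopter, odd_adopter = 2010, 2009   # both coded as 2010 adopters in the
first_election = 2010                    # sample restricted to even (election) years
print(exposure_at(first_election, even_adopter),
      exposure_at(first_election, odd_adopter))  # 1 2
```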
6.6.3 Weighting
With variation in treatment timing, weighting does not lead to further issues beyond those
already mentioned in Section 3.9.2 of Chapter 3.
In applications with variation in treatment timing and where the treatment effect might spill
over onto untreated groups geographically close to a treated group, Butts (2021c) proposes a
method to estimate the average total effect of the treatment across all treated groups, and the
indirect effect of the treatment across all affected untreated groups. Assume that the researcher
is ready to assume that groups located more than x kilometers away from a treated group cannot
be affected by its treatment. The choice of x should be based on context-specific knowledge, and
researchers will typically present sensitivity analyses, where they show that results are robust
to changes in x. Then, let Ag,t be an indicator equal to one if group g is untreated at t but
indirectly affected because a group located less than x kilometers away from g is treated at t.
Then, Butts (2021c) proposes an extension of the imputation estimator of Borusyak et al. (2024),
Gardner (2021), and Liu et al. (2024). First, one fits a TWFE regression of the outcome on
group and time FEs in the sample of untreated and unaffected (g, t) cells, with Dg,t = 0 and
Ag,t = 0. Then, one uses that regression to predict the counterfactual untreated outcome of
treated cells and of untreated but affected cells. Estimates of the total effect of treated cells are
obtained by subtracting their counterfactual from their actual outcome. Similarly, estimates of the indirect effect of untreated but affected cells are obtained by subtracting their counterfactual from their actual outcome.
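The imputation logic can be sketched on a noiseless, hypothetical example. For simplicity, the sketch recovers group and time effects sequentially from clean cells, which coincides with the TWFE fit in this balanced, noiseless case; a real application would fit the TWFE regression on all clean cells jointly:

```python
# true model: Y(g,t) = alpha_g + gamma_t (+ direct effect for the treated group,
# + spillover for the nearby untreated group, from period 3 on)
alpha = {0: 1.0, 1: 2.0, 2: 0.5, 3: -1.0}
gamma = {1: 0.0, 2: 0.3, 3: 0.8, 4: 1.1}
direct, spill = 2.0, 0.5

def y(g, t):
    out = alpha[g] + gamma[t]
    if t >= 3 and g == 0:
        out += direct              # D_{g,t} = 1: treated
    if t >= 3 and g == 1:
        out += spill               # A_{g,t} = 1: untreated but within x km
    return out

clean_groups = [2, 3]              # never treated nor affected (D = A = 0 throughout)
# step 1: time effects, normalized to period 1, from always-clean groups
g_t = {t: sum(y(g, t) - y(g, 1) for g in clean_groups) / 2 for t in [1, 2, 3, 4]}
# step 2: group effects from each group's own clean periods (t = 1, 2)
a_g = {g: sum(y(g, t) - g_t[t] for t in [1, 2]) / 2 for g in range(4)}
# step 3: effects = actual outcome minus imputed counterfactual
tot = y(0, 3) - (a_g[0] + g_t[3])  # total effect on the treated group at t = 3
ind = y(1, 3) - (a_g[1] + g_t[3])  # indirect effect on the affected group at t = 3
print(round(tot, 6), round(ind, 6))
```

In this noiseless example the imputation recovers the direct effect (2.0) and the spillover (0.5) exactly.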
6.7 Appendix∗
$$P(Y_{g,t} = 1) = L\left(\alpha + \sum_{c\in C}\alpha_c 1\{g \in c\} + \sum_{t'=2}^{T}\gamma_{t'}1\{t = t'\} + \sum_{c\in C}\ \sum_{\substack{\ell=-(c-2)\\ \ell\neq 0}}^{T-(c-1)}\beta_{c,\ell}1\{F_g = c,\, t = c-1+\ell\}\right).$$

$$\widehat{TE}^{ldv}_{c,\ell} = \overline{Y}_{c,c-1+\ell} - L\left(L^{-1}\left(\overline{Y}_{c,c-1}\right) + L^{-1}\left(\overline{Y}_{n,c-1+\ell}\right) - L^{-1}\left(\overline{Y}_{n,c-1}\right)\right). \qquad (6.31)$$

$$\widehat{TE}^{ldv}_{c,\ell} = L\left(\hat{\alpha} + \hat{\alpha}_c + \hat{\gamma}_{c-1+\ell} + \hat{\beta}_{c,\ell}\right) - L\left(\hat{\alpha} + \hat{\alpha}_c + \hat{\gamma}_{c-1+\ell}\right).$$
Proof: let C denote the cardinality of C. First, observe that the model is saturated in cohorts
and time: the regressors are not collinear and there are 1 + C + T − 1 + C(T − 1) = (C + 1)T
coefficients, compared to (C + 1) cohorts (C cohorts in C plus the never-treated groups) and
$$\overline{Y}_{n,t} = L\left(\hat{\alpha} + \hat{\gamma}_t\right).$$

As a result,

$$L^{-1}\left(\overline{Y}_{c,c-1}\right) + L^{-1}\left(\overline{Y}_{n,c-1+\ell}\right) - L^{-1}\left(\overline{Y}_{n,c-1}\right) = \left(\hat{\alpha} + \hat{\alpha}_c + \hat{\gamma}_{c-1}\right) + \left(\hat{\alpha} + \hat{\gamma}_{c-1+\ell}\right) - \left(\hat{\alpha} + \hat{\gamma}_{c-1}\right) = \hat{\alpha} + \hat{\alpha}_c + \hat{\gamma}_{c-1+\ell}.$$
The result follows from this equation, (6.31) and (6.32) applied to t = c − 1 + ℓ.
$$\overline{Y}_k := \frac{1}{\sum_{i=1}^{n} 1\{\widetilde{X}_i = x_k\}} \sum_{i:\widetilde{X}_i = x_k} Y_i.$$
Since $\hat{\theta}$ exists, it satisfies the first-order conditions, which can be written, after some manipulations,

$$\sum_{i=1}^{n} X_i \frac{F'(X_i'\hat{\theta})}{F(X_i'\hat{\theta})\left(1 - F(X_i'\hat{\theta})\right)} \left(Y_i - F(X_i'\hat{\theta})\right) = 0.$$
Heterogeneous adoption designs. In most of this chapter, we assume that our data contains
only two time periods, T = 2. We assume that treatment follows a “heterogeneous-adoption
design” (HAD): groups are untreated at period one, some or all groups receive a strictly positive
treatment dose at period two, but the treatment dose can vary across treated groups, with some
groups receiving larger doses than others. The variability in the dose received by treated groups
is the key difference between HADs and the classical designs reviewed in Chapter 3. The period-
two treatment dose could be a discrete variable taking a small number of values, like 1, 2, and 3.
But the period-two treatment could also be a continuously distributed variable, taking as many
different values as there are treated groups.
I.i.d. groups. In this chapter, we replace Assumption IND, which requires that groups are
independent, by the slightly stronger assumption that the groups are an independent and identi-
cally distributed (i.i.d.) sample, drawn from an infinite super-population of groups. Introducing
an infinite super-population of groups is necessary, to allow for a potentially continuous dis-
tribution of the period-two treatment. As groups are assumed i.i.d., we drop the g subscript,
except when we introduce estimators. As different samples of groups lead to different treatments
and potential outcomes, expectations are taken with respect to both the distribution of groups’
potential outcomes and treatments, while in all the unstarred sections that preceded, groups’
248 CHAPTER 7. DESIGNS WITH VARIATION IN TREATMENT DOSE
treatments (the study design D) were implicitly conditioned upon (see starred Section 2.4 for
further discussion). To highlight expectations taken with respect to both the distribution of
groups’ potential outcomes and treatments, we let Eu [·] denote such expectations. Finally, an
estimand refers to a function of the probability distribution of the observed random variables,
namely (Y1 , Y2 , D1 , D2 ) in the simple setting we consider.
Design HAD (Heterogeneous adoption design) D1 = 0, D2 ≥ 0, Eu (D2 ) > 0 and Vu (D2 |D2 > 0) > 0.¹
HADs with stayers or quasi-stayers. Some of the results in this chapter apply to a subset
of HADs, namely HADs with stayers or quasi-stayers.
Design HAD’ (Heterogeneous adoption design with stayers or quasi-stayers) The conditions in Design HAD hold and the support of D2 includes 0.²
The support condition in Design HAD’ holds in two important cases. The first case is when
there are groups whose period-two treatment is equal to zero: Pu (D2 = 0) > 0. Hereafter, those
groups are referred to as stayers: their treatment does not change from period one to two, they
stay untreated. The second case is when there are no groups whose period-two treatment is equal
to zero (Pu (D2 = 0) = 0), but there are groups whose period-two treatment is “very close” to
zero: for any δ > 0, Pu (0 < D2 < δ) > 0. For instance, this condition holds if D2 is continuously
distributed on R+ with a continuous density that is strictly positive at 0. Hereafter, groups with
a period-two treatment close to zero are referred to as quasi-stayers.
Chapter’s running example. Before China joined the World Trade Organization (WTO),
United States (US) imports from China were already subject, since 1980, to the low Normal Trade Relations (NTR) tariff rates reserved for WTO members. However, those rates required uncertain and politically contentious annual renewals. Without renewal, US import tariffs on Chinese
goods would have spiked to higher non-NTR tariff rates. In 2001, when China joined the WTO,
¹ Note that Eu (D2 ) > 0 implies Pu (D2 > 0) > 0, so that Vu (D2 |D2 > 0) is well defined.
² Recall that the support of a random variable A is the smallest closed set C such that P (A ∈ C) = 1.
the US granted Permanent NTR (PNTR) to China. The reform eliminated a potential tariff
spike, equal to the difference between the non-NTR and NTR tariff rates, referred to as the
NTR gap. This NTR gap varies substantially across industries: without NTR renewal, in some
industries there would have been a large increase in tariffs on Chinese imports, while in other
industries the increase would have been smaller. Pierce and Schott (2016) study the effect of
the NTR-gap treatment on US manufacturing employment. They define their treatment Dg,t
as the interaction of industry g’s NTR gap and t being after 2001. Letting for instance t = 1
denote year 2000 and t = 2 denote year 2001, Dg,1 = 0, Dg,2 ≥ 0, and there is variability in
the NTR gap of treated industries: the conditions in Design HAD are met. The NTR gap is
strictly positive in all industries. Therefore, there are no stayers in this application. However,
Figure 2 in Pierce and Schott (2016), reproduced below, suggests that the NTR gap’s density is
strictly positive in the neighborhood of 0: while the NTR gap is strictly positive in all industries,
there are industries where it is close to zero. Therefore, there seem to be quasi-stayers in this application. Below, we show that a statistical test does not reject the hypothesis that there are quasi-stayers.
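The informal idea behind that test, a mass of treatment doses near zero, can be illustrated with a crude descriptive check (hypothetical doses; this is not the formal quasi-stayer test alluded to above):

```python
doses = [0.02, 0.05, 0.40, 0.11, 0.33, 0.03, 0.27, 0.08, 0.50, 0.01]  # hypothetical D2 values
delta = 0.05
share_near_zero = sum(d < delta for d in doses) / len(doses)
print(share_near_zero)  # 0.3
```

A strictly positive share for small delta is consistent with quasi-stayers: strictly positive doses arbitrarily close to zero.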
Figure 7.1: Density of period-two treatment dose in Pierce and Schott (2016). Reproduced from Pierce and Schott (2016), American Economic Review, Vol. 106, No. 7.

Dataset used in this chapter. To answer the green questions in this chapter, you need to use the pierce_schott_didtextbook dataset, which is constructed from the data used by Pierce and Schott (2016) to produce their Table 3. The data used by the authors to produce their other regression tables is proprietary. The pierce_schott_didtextbook dataset contains
• lemp1997, lemp1998, lemp1999, and lemp2000: the industry’s log employment in 1997,
1998, 1999, and 2000;
Fuzzy designs. HADs are often “fuzzy” DID designs (de Chaisemartin, 2011; de Chaisemartin
and D’Haultfœuille, 2018), where the treatment varies at the individual level, and Dg,t is the
treatment rate of individuals in cell (g, t). For instance, in 1996 a new TV channel, called NTV,
was introduced in Russia. At that time, it was the only TV channel in Russia not controlled
by the government. Enikolopov, Petrova and Zhuravskaya (2011) study the effect of having
access to this independent news source on voting behavior, using voting outcomes for the 1,938
Russian subregions in the 1995 and 1999 elections. The authors define their treatment as Dg,t ,
the proportion of the population having access to NTV in region g and year t, hereafter referred
to as the NTV exposure rate. By definition Dg,1 = 0, Dg,2 ≥ 0, and there is variability in the
NTV exposure rate across treated regions: the conditions in Design HAD are met. In the region
with the lowest exposure rate to NTV, this rate is still equal to 27%, so there is no unexposed
(stayer) or almost unexposed (quasi-stayer) region.5
Throughout this chapter, we impose Assumption NA, the no-anticipation assumption. We also
impose Assumption ND, thus ruling out dynamic effects of the treatment. With only two time
periods and no treated unit at period one, Assumption ND is not of essence but imposing it
simplifies notation. Finally, we replace Assumption PT by a different parallel-trends assumption,
better suited to the case with a non-binary treatment and i.i.d. groups that we consider. For any variable
X, let ∆X = X2 − X1 denote the change in X from period one to two.
As D2 = 2, Y2 (2) is the group’s actual, observed outcome, so estimating (Y2 (2) − Y2 (0))/2 only
requires estimating one unobserved outcome, Y2 (0). On the other hand, estimating (Y2 (3) −
Y2 (0))/3 requires estimating two unobserved outcomes, Y2 (3) and Y2 (0). While estimating Y2 (0)
can be achieved under Assumption PTNB, a parallel-trends assumption whose plausibility can be
assessed using a pre-trends test, estimating Y2 (3) requires making assumptions whose plausibility
cannot be assessed using a pre-trends test. This is why we focus on actual-versus-no-treatment
slopes, rather than on counterfactual-versus-no-treatment slopes.
Bounded-slope assumption.∗ Throughout this chapter, we assume that there exists a real
number K such that for all d2 > 0 in the support of D2 , |Y2 (d2 ) − Y2 (0)|/d2 ≤ K almost surely.
This ensures that the expectations of the slopes (Y2 (d2 ) − Y2 (0))/d2 introduced below are well
defined. This condition holds if d2 7→ Y2 (d2 ) is differentiable with a bounded derivative.
CAS(d2 ) is the average of the slopes TE2 , across all groups with treatment dose d2 . Hereafter,
d2 7→ CAS(d2 ) is referred to as the conditional average slope function (CAS).
Assume that CAS(3) < CAS(1). Can we conclude that three doses of treatment generate a lower
return per dose than one dose, thus implying that treatment has diminishing returns? Or could
we have that CAS(3) < CAS(1) with constant or even increasing returns?
By definition,

$$CAS(3) = E_u\left(\left(Y_2(3) - Y_2(0)\right)/3 \,\middle|\, D_2 = 3\right) \quad \text{and} \quad CAS(1) = E_u\left(\left(Y_2(1) - Y_2(0)\right)/1 \,\middle|\, D_2 = 1\right).$$

Therefore, CAS(3) and CAS(1) are averages of different slopes across different populations: CAS(3) averages 0-to-3 slopes across groups that received 3 doses, while CAS(1) averages 0-to-1 slopes across groups that received 1 dose.
Therefore, CAS(3) < CAS(1) could be due to diminishing returns, or it could be due to the fact
that groups that received three doses have lower returns per dose than groups that received one
dose. For instance, one could have that
$$
\begin{aligned}
E_u\!\left(\frac{Y_2(1)-Y_2(0)}{1}\,\middle|\,D_2=3\right) &= E_u\!\left(\frac{Y_2(3)-Y_2(0)}{3}\,\middle|\,D_2=3\right)\\
< E_u\!\left(\frac{Y_2(1)-Y_2(0)}{1}\,\middle|\,D_2=1\right) &= E_u\!\left(\frac{Y_2(3)-Y_2(0)}{3}\,\middle|\,D_2=1\right),
\end{aligned}
$$
7.2. TARGET PARAMETERS 255
in which case the difference between CAS(3) and CAS(1) is entirely driven by the selection bias
(groups receiving one and three doses have different returns per dose), while returns per dose
are constant within those two groups.
Find a supplementary assumption under which one can conclude that treatment has diminishing
returns when CAS(3) < CAS(1). When is that assumption plausible, and when is it implausible?
the average of TE2 across all treated groups. By the law of iterated expectations,
ATT = Eu (TE2 |D2 > 0) = Eu (Eu (TE2 |D2 )|D2 > 0) = Eu (CAS(D2 )|D2 > 0) : (7.2)
ATT is a weighted average of the conditional average slopes Eu (TE2 |D2 = d2 ). For instance, if
half of the treated groups receive one dose of treatment at period two while the other half receive
two doses,
$$ATT = \frac{1}{2}E(TE_2|D_2 = 1) + \frac{1}{2}E(TE_2|D_2 = 2).$$
If D2 is a continuously distributed variable, Eu (TE2 |D2 = d2 ) receives a weight equal to
fD2 |D2 >0 (d2 ), the density of D2 conditional on D2 > 0 evaluated at d2 .
The WATT is a weighted average of treated groups’ slopes, where groups with a larger period-
two treatment receive more weight. For instance, if half of the treated groups receive one dose of
treatment at period two while the other half received two doses, Eu [D2 |D2 > 0] = 1/2 × 1 +
1/2 × 2 = 3/2, and
1 1 1 2 1 2
WATT = E(T E2 |D2 = 1) + E(T E2 |D2 = 2) = E(T E2 |D2 = 1) + E(T E2 |D2 = 2).
2 3/2 2 3/2 3 3
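These ATT and WATT weighting formulas are easy to check numerically. The sketch below (hypothetical doses and slopes, not from the book) reproduces the half-one-dose/half-two-doses example, with per-dose slopes of 1.0 and 0.5:

```python
import numpy as np

def att_watt(doses, slopes):
    """ATT = E[TE2 | D2>0]; WATT reweights slopes proportionally to the dose:
    WATT = E[D2 * TE2 | D2>0] / E[D2 | D2>0]."""
    doses = np.asarray(doses, float)
    slopes = np.asarray(slopes, float)
    treated = doses > 0
    att = slopes[treated].mean()
    watt = (doses[treated] * slopes[treated]).sum() / doses[treated].sum()
    return att, watt

# Half the treated groups get 1 dose with slope 1.0, half get 2 doses with slope 0.5:
att, watt = att_watt([1, 1, 2, 2], [1.0, 1.0, 0.5, 0.5])
print(att, watt)  # ATT = 1/2*1 + 1/2*0.5 = 0.75; WATT = 1/3*1 + 2/3*0.5 = 2/3
```

When the slopes do not vary with the dose, the two parameters coincide, consistent with the uncorrelatedness condition discussed below.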
ATT = WATT if and only if treated groups' slopes are uncorrelated with their
treatment dose:
$$\text{cov}(TE_2, D_2 \mid D_2 > 0) = 0. \qquad (7.3)$$
This condition is similar to, but weaker than, (7.1). Like that previous condition, it is unlikely to
hold in a Roy selection model.
The WATT also has two practical advantages. First, as it is a ratio of expectations rather than
the expectation of a ratio, it is not affected by a small-denominator problem, even if there are
treated groups with a value of D2 close to zero. Second, de Chaisemartin et al. (2022) show that
the WATT is the relevant quantity to consider in a cost-benefit analysis assessing if groups'
period-two treatment D2 is beneficial.
Assume that the outcome is a measure of output, such as agricultural yields or wages, expressed
in monetary units. Assume also that the treatment is costly, with a cost linear in dose, uniform
across groups, and known to the analyst: the cost of giving d doses of treatment is c × d for
some known c > 0. Then, D2 is beneficial relative to a no-treatment counterfactual if and only
if Eu (Y2 (D2 ) − cD2 ) > Eu (Y2 (0)), namely if and only if
WATT > c.
Then, comparing the WATT to the per-unit treatment cost is sufficient to evaluate if changing
the treatment from 0 to D2 was beneficial.
258 CHAPTER 7. DESIGNS WITH VARIATION IN TREATMENT DOSE
7.3.1 Under parallel-trends, βbfe may not identify a convex combination of slopes
In an HAD, the TWFE estimator compares the outcome evolutions of more- and
less-treated groups. Let β̂fe denote the coefficient on Dg,t in a regression of Yg,t on group
FEs, an indicator for period 2, and Dg,t, as defined in (3.1), with T = 2:
$$Y_{g,t} = \sum_{g'=1}^{G} \hat\alpha_{g'} 1\{g = g'\} + \hat\gamma_2 1\{t = 2\} + \hat\beta_{fe} D_{g,t} + \hat\epsilon_{g,t}.$$
The regression of ∆Yg on a constant and Dg,2 is a univariate regression. Therefore, β̂fe is equal
to the sample covariance between ∆Yg and Dg,2, divided by the sample variance of Dg,2:
$$\hat\beta_{fe} = \frac{\widehat{\text{cov}}(\Delta Y_g, D_{g,2})}{\widehat{V}(D_{g,2})}.$$
If Dg,2 were binary, β̂fe would just compare the average outcome evolutions of treated and control
groups. With a non-binary treatment, β̂fe still implements a form of treated-versus-control
comparison. In the numerator of β̂fe, the outcome evolutions of groups such that Dg,2 − D̄.,2 > 0,
where D̄.,2 denotes the sample average of Dg,2, enter with a positive sign, so those groups are used
as "treatment groups". On the other hand, groups such that Dg,2 − D̄.,2 < 0 are used as "control
groups": their outcome evolutions are weighted negatively. Thus, β̂fe compares the outcome
evolutions of more- and less-treated groups.
Under parallel trends, β̂fe may not estimate the ATT, or a convex combination of
slopes. Let βfe denote the probability limit of β̂fe when G → +∞, assuming that (Dg,2, ∆Yg)
has a finite second moment to ensure this probability limit exists.
if D2 is continuously distributed. As d2 ↦ (d2 − Eu(D2))d2 is not constant, (7.2) and (7.6) imply
that β̂fe may not be consistent for the ATT. β̂fe may not even be consistent for a convex combination
of conditional average slopes: some weights in (7.6) are negative if Pu(0 < D2 < Eu(D2)) > 0.
For instance, this condition always holds when there are no stayers.
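As an illustrative sketch (the function and doses are hypothetical), one can compute the normalized weights proportional to (d2 − Eu(D2))d2 discussed above, and check that treated groups with doses below the mean dose receive negative weight:

```python
import numpy as np

def twfe_weights(doses):
    """Weights attached to conditional average slopes among treated groups,
    proportional to (d2 - E[D2]) * d2, normalized to sum to one."""
    d = np.asarray(doses, float)
    w = (d - d.mean()) * d
    w_treated = w[d > 0]
    return w_treated / w_treated.sum()

# No stayers: every group is treated, so groups with 0 < D2 < E[D2] get negative weight.
w = twfe_weights([0.5, 1.0, 2.0, 4.5])  # mean dose = 2.0
print(w)         # first two weights are negative (doses 0.5 and 1.0 are below the mean)
print(w.sum())   # weights sum to one
```

The dose exactly at the mean gets zero weight, and only doses above the mean contribute positively, which is the negative-weights problem described above.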
Comparison with binary-and-staggered designs. Theorem 13 shows that β̂fe may fail
to identify a convex combination of effects, even without variation in treatment timing. In
binary-and-staggered designs, we have seen that time-varying effects could lead β̂fe to estimate
a non-convex combination of effects. Here, treatment effects that vary across groups receiving
different treatment doses could lead β̂fe to estimate a non-convex combination of effects.
Application to the NTR-gap example. To compute the weights attached to β̂fe in the
regression of delta2001 on ntrgap, use the twowayfeweights Stata command, and run:
twowayfeweights delta2001 indusid cons ntrgap ntrgap, type(fdTR)
The fdTR option reflects the fact that β̂fe arises from a first-difference regression of delta2001
on ntrgap. Here the regression is estimated with only two periods and one first difference,
so the regression does not have time FEs, and the time variable input to the command
is just a constant. Finally, to compute weights attached to first-difference regressions with
twowayfeweights, on top of the first-differenced outcome, group id, time id, and first-differenced
treatment variables, one also needs to input the treatment of cell (g, t), hence the command's
fifth argument ntrgap. Here, this fifth argument is redundant, because in an HAD the treatment
and the first-differenced treatment are equal. Interpret the results: does β̂fe estimate a convex
or almost convex combination of slopes of log-employment with respect to the NTR gap?
β̂fe estimates a weighted sum of slopes of log-employment with respect to the NTR gap in 103
industries, where 62 estimated weights are strictly positive, while 41 are strictly negative. The
negative weights sum to −0.32, so β̂fe is far from estimating a convex combination of slopes.
Intuition for the negative weights in Theorem 13. β̂fe compares the outcome evolutions
of more- and less-treated groups. However, less-treated groups may still be treated, in which
case their treatment effect gets differenced out and weighted negatively by β̂fe.
Forbidden comparisons in HADs: a tale of two patients who had a headache. To gain
further intuition, let us consider a simple example, with only two groups m and ℓ, corresponding
to two patients who start having a headache at t = 1. At t = 2, they go see their doctor, who
prescribes two Ibuprofen pills to m, and one Ibuprofen pill to ℓ. Thus m is the more-treated
patient, while ℓ is the less-treated one. An econometrician comes by, and immediately sees some
research potential in this natural experiment. Accustomed to running TWFE regressions, they
regress Yg,t , the pain level of patient g at t, on patient FEs, period FEs, and the treatment
received by patient g at t. With G = 2, one can show that
$$\hat\beta_{fe} = \frac{\Delta Y_m - \Delta Y_\ell}{D_{m,2} - D_{\ell,2}}. \qquad (7.7)$$
As in our example Dm,2 − Dℓ,2 = 2 − 1 = 1, (7.7) simplifies to β̂fe = ∆Ym − ∆Yℓ:
β̂fe is just a simple DID comparing the evolution of the pain of m and ℓ, before and after they
take their Ibuprofen prescription. To their surprise, the econometrician finds that β̂fe > 0: the pain
of patient m, who received more Ibuprofen, decreases less than that of patient ℓ, who received
less Ibuprofen. Is it correct to conclude that Ibuprofen increases pain, or could something else
explain why βbfe > 0?
It could be that the drug reduces the pain of both patients, but that the effect per pill is less than
half as large for patient m as for patient ℓ, so that m's pain decreases less than ℓ's, despite m
receiving two pills while ℓ only received one pill.
pill. This scenario is not completely unlikely: perhaps the reason why the doctor prescribed two
pills to m is because they believed that m would have a lower sensitivity to the drug. This shows
that in HADs, negative weights arise from a new type of forbidden comparisons: comparing the
outcome evolutions of more- and less-treated groups.
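A numeric version of this forbidden comparison, with entirely hypothetical pain evolutions, shows a positive DID coefficient even though both per-pill effects are negative:

```python
def did_two_patients(dYm, dYl, dm=2, dl=1):
    """beta_fe = (dY_m - dY_l) / (D_m - D_l), the two-group DID of (7.7)."""
    return (dYm - dYl) / (dm - dl)

# Parallel trends: without the drug, both pains would stay flat (evolution 0).
# m's per-pill effect is -1 (so -2 over two pills); l's per-pill effect is -3.
# |-1| is less than half of |-3|, the scenario described in the text.
dYm = -2.0   # m's pain evolution: 2 pills * (-1 per pill)
dYl = -3.0   # l's pain evolution: 1 pill * (-3 per pill)
beta_fe = did_two_patients(dYm, dYl)
print(beta_fe)  # 1.0 > 0, even though Ibuprofen reduces both patients' pain
```

The positive coefficient reflects ℓ's treatment effect being subtracted out with a negative weight, not a harmful drug.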
$$\begin{aligned} E_u\big[\hat\beta_{fe}\big] &= E_u\big(Y_{m,2}(0) - Y_{m,1}(0) - (Y_{\ell,2}(0) - Y_{\ell,1}(0))\big) + E_u\big(Y_{m,2}(2) - Y_{m,2}(0) - (Y_{\ell,2}(1) - Y_{\ell,2}(0))\big) \\ &= 2E_u(TE_{m,2}) - E_u(TE_{\ell,2}), \end{aligned}$$
where the last equality follows from the assumption that Eu(∆Ym(0)) = Eu(∆Yℓ(0)), and from
the definitions of TEm,2 and TEℓ,2. The right-hand side of the previous display is a weighted sum
of m and ℓ's treatment effects per dose of Ibuprofen, with weights summing to one, and where
ℓ's effect is weighted negatively. Intuitively, ℓ is also treated at period two, and β̂fe, which uses
ℓ as a control group, subtracts its treatment effect out. If TEm,2 and TEℓ,2 are both negative,
but m's effect is less than half of ℓ's effect in absolute value, Eu[β̂fe] > 0.
The drug could reduce the pain of both patients, but with an effect per pill more than twice as
large for the patient that receives one dose than for the patient that receives two, thus leading
to a positive TWFE coefficient.
Bibliographic notes. The right hand side of (7.7) is a Wald-DID estimator, that may not
estimate a convex combination of treatment effects under a parallel-trends assumption (Blundell
and Costa-Dias, 2009; de Chaisemartin, 2011; de Chaisemartin and D’Haultfœuille, 2018).
Consider again a more-treated group that receives two doses of treatment, and a less-treated
group that receives one dose. Assume that the more-treated group is chosen at random: with
probability 1/2, the more-treated group is group 1, and with probability 1/2 the more-treated
group is group 2. Thus, m and ℓ are now random variables, with Pu(m = 1) = Pu(m = 2) = 1/2,
and ℓ = 3 − m. Assume also that potential
outcomes are non-stochastic. For any random variable X, let Em (X) denote the expectation of
X with respect to the identity of the more-treated group. Then,
$$\begin{aligned} E_m\big[\hat\beta_{fe}\big] &= P_u(m=1)\left[Y_{1,2}(2) - Y_{1,1}(0) - (Y_{2,2}(1) - Y_{2,1}(0))\right] + P_u(m=2)\left[Y_{2,2}(2) - Y_{2,1}(0) - (Y_{1,2}(1) - Y_{1,1}(0))\right] \\ &= \frac{1}{2}\left[Y_{1,2}(2) - Y_{1,1}(0) - (Y_{2,2}(1) - Y_{2,1}(0))\right] + \frac{1}{2}\left[Y_{2,2}(2) - Y_{2,1}(0) - (Y_{1,2}(1) - Y_{1,1}(0))\right] \\ &= \frac{1}{2}\left[Y_{1,2}(2) - Y_{1,2}(1)\right] + \frac{1}{2}\left[Y_{2,2}(2) - Y_{2,2}(1)\right]. \end{aligned}$$
Therefore, β̂fe is unbiased for the average effect of increasing the treatment from one to two
doses at period two, a convex combination of treatment effects.
Groups’ observed outcome at period one is equal to their untreated outcome. Then, the previous
display implies that
Dg,2 ⊥⊥ Yg,1 , (7.8)
an equation that only involves observed variables, and can therefore be tested. To test (7.8),
one can regress Yg,1 on Dg,2 . When several pre-treatment periods are available, one can regress
Dg,2 on all pre-treatment outcomes.
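A simple way to operationalize this test is to regress Yg,1 on a constant and Dg,2 and inspect the slope's t-statistic. The sketch below uses simulated, hypothetical data and a homoskedastic t-statistic (a simplification); the second design builds in selection on the period-one outcome:

```python
import numpy as np

def ols_slope_tstat(y, x):
    """Slope and homoskedastic t-stat from regressing y on a constant and x;
    a minimal version of the test of (7.8), regressing Y_{g,1} on D_{g,2}."""
    x = np.asarray(x, float)
    y = np.asarray(y, float)
    X = np.column_stack([np.ones_like(x), x])
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    resid = y - X @ beta
    n, k = X.shape
    cov = (resid @ resid / (n - k)) * np.linalg.inv(X.T @ X)
    return beta[1], beta[1] / np.sqrt(cov[1, 1])

rng = np.random.default_rng(0)
d2 = rng.uniform(0, 2, 500)
y1_random = rng.normal(size=500)                    # dose as good as randomly assigned
y1_selected = 0.8 * d2 + rng.normal(size=500)       # dose correlated with Y_{g,1}
_, t_rand = ols_slope_tstat(y1_random, d2)
_, t_sel = ols_slope_tstat(y1_selected, d2)
print(t_rand, t_sel)  # the selected design yields a large |t|, rejecting (7.8)
```

With several pre-treatment periods, one would instead regress Dg,2 on all pre-treatment outcomes and run an F-test, as in the NTR-gap application below.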
Application to the NTR-gap example. The data starts in 1997 while treatment starts in
2001, so we observe industries’ employment levels for four years before the PNTR treatment.
Therefore, we regress the NTR-gap treatment on (Yg,1997 , ..., Yg,2000 ), and run an F-test that all
coefficients are equal to zero. Using the pierce_schott_didtextbook dataset, regress ntrgap
on lemp1997, lemp1998, lemp1999, and lemp2000. Do you reject the null that the coefficients
on lemp1997, lemp1998, lemp1999, and lemp2000 are all equal to zero? Accordingly, does the
NTR-gap treatment seem to be as good as randomly assigned?
meaning that the conditional average slopes of groups receiving different doses do not differ. For
instance, (7.9) requires that the average 0-to-1 slope of groups receiving one dose be equal to
the average 0-to-2 slope of groups receiving two doses. Is (7.9) weaker or stronger than (7.1)?
Strictly speaking, (7.9) is neither weaker nor stronger than (7.1). However, one can show that
(7.9) holds if (7.1) holds and if
Thus, (7.9) holds if the treatment’s effect is homogeneous across groups receiving different doses,
as in (7.1), and linear, as in (7.10). To ease exposition, we sometimes refer to (7.9) as a constant-
and-linear-effect assumption, though strictly speaking (7.1) and (7.10) are sufficient but not
necessary for (7.9) to hold.
If (7.9) holds, do we have that β̂fe is consistent for the ATT under Assumptions NA, ND, and
PTNB?
If (7.9) holds,
$$\begin{aligned} \beta_{fe} &= E_u\!\left(\frac{(D_2 - E_u(D_2))D_2}{E_u((D_2 - E_u(D_2))D_2 \mid D_2 > 0)}\, E_u(TE_2 \mid D_2) \,\middle|\, D_2 > 0\right) \\ &= \text{ATT} \times E_u\!\left(\frac{(D_2 - E_u(D_2))D_2}{E_u((D_2 - E_u(D_2))D_2 \mid D_2 > 0)} \,\middle|\, D_2 > 0\right) \\ &= \text{ATT}. \end{aligned}$$
Condition (7.9) is sufficient but not necessary to have ATT = βfe. We have ATT = βfe if and
only if
$$\text{cov}_u\!\left(\frac{(D_2 - E_u(D_2))D_2}{E_u((D_2 - E_u(D_2))D_2 \mid D_2 > 0)},\; E_u(TE_2 \mid D_2) \,\middle|\, D_2 > 0\right) = 0, \qquad (7.11)$$
a condition weaker than (7.9) (see Corollary 1 in de Chaisemartin and D’Haultfœuille, 2020).
de Chaisemartin, Ciccia, D'Haultfœuille and Knau (2024) show that the constant-and-linear-
effect assumption in (7.9) has a testable implication, and is fully testable in HADs with stayers
or quasi-stayers. Let β0 = Eu(∆Y) − βfe Eu(D2) denote the probability limit of the intercept of
the TWFE regression.
Theorem 14 implies that if Assumption PTNB and (7.9) hold, Eu (∆Y |D2 ) has to satisfy a certain
property, which it may or may not satisfy, thus opening the possibility of testing Assumption
PTNB and (7.9). What is this property, and how could you test it?
Point 1 of Theorem 14 shows that if Assumption PTNB and (7.9) hold, then Eu(∆Y|D2) has
to be linear. In practice, Eu(∆Y|D2) could be non-linear. For instance, one may have that
Eu(∆Y|D2) = α0 + α1D2 + α2D2². To test whether Eu(∆Y|D2) is linear, one could regress ∆Y
on, say, an intercept, D2, D2², and D2³, and test that the coefficients on D2² and D2³ are both equal
to zero. However, this test can only detect some, but not all, non-linearities in Eu(∆Y|D2). For
instance, if Eu(∆Y|D2) is a non-linear but non-polynomial function, this test might fail to reject
the null of linearity, even asymptotically.
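The polynomial version of the test can be sketched as follows, using a homoskedastic Wald statistic on hypothetical data (a simplification; in practice one would use robust standard errors):

```python
import numpy as np

def poly_linearity_wald(d2, dy):
    """Regress dY on (1, D2, D2^2, D2^3) and return the Wald statistic for
    H0: the coefficients on D2^2 and D2^3 are zero (~ chi2(2) under H0)."""
    d2 = np.asarray(d2, float)
    dy = np.asarray(dy, float)
    X = np.column_stack([np.ones_like(d2), d2, d2**2, d2**3])
    beta, *_ = np.linalg.lstsq(X, dy, rcond=None)
    resid = dy - X @ beta
    n, k = X.shape
    cov = (resid @ resid / (n - k)) * np.linalg.inv(X.T @ X)
    R = np.zeros((2, k)); R[0, 2] = 1; R[1, 3] = 1  # select the D2^2 and D2^3 coefficients
    b = R @ beta
    return float(b @ np.linalg.solve(R @ cov @ R.T, b))

rng = np.random.default_rng(1)
d2 = rng.uniform(0, 2, 1000)
dy_lin = 1 + 2 * d2 + rng.normal(scale=0.5, size=1000)             # linear CEF
dy_quad = 1 + 2 * d2 + 3 * d2**2 + rng.normal(scale=0.5, size=1000)  # quadratic CEF
w_lin = poly_linearity_wald(d2, dy_lin)
w_quad = poly_linearity_wald(d2, dy_quad)
print(w_lin, w_quad)  # w_quad far exceeds the chi2(2) 5% critical value of 5.99
```

As the text notes, this test only has power against polynomial departures of the chosen order; the Stute test below is consistent against all fixed alternatives.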
Interpreting a failure to reject the linearity of Eu(∆Y|D2), in designs with stayers
or quasi-stayers. In designs with stayers or quasi-stayers, Point 2 of Theorem 14 shows that
under Assumption PTNB, there is an "if and only if" relationship between the constant-and-
linear-effect assumption in (7.9) and the linearity of Eu(∆Y|D2). Therefore, if Eu(∆Y|D2) is
linear then the constant-and-linear-effect assumption holds, thus implying that β̂fe is consistent
for the ATT. This suggests the following estimation rule: in designs with stayers or quasi-stayers,
when a linearity test of Eu(∆Y|D2) and a pre-trends test of Assumption PTNB are not rejected,
one may use β̂fe.
Interpreting a failure to reject the linearity of Eu(∆Y|D2), in designs without quasi-
stayers. If there are no stayers or quasi-stayers, we no longer have an "if and only if" between
(7.9) and Eu(∆Y|D2) = β0 + βfeD2: Eu(∆Y|D2) = β0 + βfeD2 could hold but Eu(TE2|D2) =
βfe + (β0 − γ2)/D2, thus implying that (7.9) fails if β0 − γ2 ≠ 0. Therefore, without stayers or
quasi-stayers, β̂fe may not be consistent for the ATT even if Eu(∆Y|D2) is linear.
In the general designs that we will review in the next chapter, linearity tests may be
less useful to assess the validity of TWFE estimators.∗ Assume one uses a two-period
panel data set to estimate a treatment's effect, but D1 ≠ 0 and Vu(D1) > 0, so the conditions in
Design HAD are not met: the treatment dose varies at period one. Then, the TWFE estimator
is the coefficient on ∆Dg in a regression of ∆Yg on ∆Dg. Letting TEt = (Yt(Dt) − Yt(0))/Dt,
Yt = Yt(0) + DtTEt. If one is ready to assume that the treatment effect is constant over time
(TE2 = TE1), then
∆Y = ∆Y (0) + ∆D × TE2 .
Then, one can show that under Assumption PTNB, if there are stayers or quasi-stayers there is
an “if and only if” relationship between
Eu (∆Y |∆D) = β0 + β fe ∆D
and Eu (TE2 |∆D) = Eu (TE2 |∆D ̸= 0), a condition under which the TWFE estimator is consis-
tent for Eu (TE2 |∆D ̸= 0). However, this “if and only if” relationship only holds if the treatment
effect is constant over time. If the treatment effect varies over time, as is often likely to be the
case, then one might have that Eu (∆Y |∆D) is linear but the TWFE estimator is not consistent
for the ATT or for a convex combination of effects.
Propose a simple method to test that Eu (∆Y |D2 ) is linear when D2 takes a finite number of
values.
null of parallel trends and constant and linear effects, the pre-testing rule we propose cannot
make post-test inference liberal.
Some details on the Stute test.∗ Under the null hypothesis that Eu[∆Y|D2] is linear,
(ε̂lin,g)g=1,...,G, the residuals of the linear regression of ∆Yg on Dg,2, should not be correlated with
any function of D2. Then, consider the so-called cusum process of the residuals:
$$c_G(d) := G^{-1/2}\sum_{g=1}^{G} 1\{D_{g,2} \le d\}\,\hat\varepsilon_{lin,g}.$$
Stute (1997) shows that under the null hypothesis, cG, as a process indexed by d, converges to
a Gaussian process. On the other hand, under the alternative, cG(d) tends to infinity for some
d. Then, one can for instance consider the following Cramér-von Mises test statistic based on
cG(d):
$$\text{CVM} = \frac{1}{G}\sum_{g=1}^{G} c_G^2(D_g).$$
The limiting distribution of CVM under the null is complicated, but Stute et al. (1998) show that
one can approximate it using the wild bootstrap. Specifically, consider i.i.d. random variables
(Vg)g=1,...,G with Eu[Vg] = 0 and Eu[Vg²] = Eu[Vg³] = 1.⁶ Then, let ε̂*lin,g = Vgε̂lin,g and
∆Yg* = α̂lin + β̂linDg,2 + ε̂*lin,g, where α̂lin and β̂lin denote the coefficients from the linear
regression of ∆Yg on Dg,2. Then, letting CVM* denote the bootstrap counterpart of CVM based
on the sample (Dg, ∆Yg*)g=1,...,G, Stute et al. (1998) show that as G → ∞, the conditional
distribution of CVM* tends to the limiting distribution of CVM under the null.
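Putting the pieces together, here is a compact, illustrative implementation of the Cramér-von Mises version of the test with Mammen-type two-point wild-bootstrap multipliers (which satisfy E[V] = 0, E[V²] = E[V³] = 1). It is a sketch on hypothetical data, not the stute_test command:

```python
import numpy as np

def stute_cvm_pvalue(d2, dy, n_boot=199, seed=0):
    """CvM version of the Stute (1997) linearity test, with the
    Stute et al. (1998) wild bootstrap (sketch only)."""
    rng = np.random.default_rng(seed)
    d2 = np.asarray(d2, float)
    dy = np.asarray(dy, float)
    G = len(d2)
    X = np.column_stack([np.ones(G), d2])
    order = np.argsort(d2)

    def cvm(y):
        beta, *_ = np.linalg.lstsq(X, y, rcond=None)
        eps = y - X @ beta
        # cusum process at each observed dose: c_G(d) = G^{-1/2} sum 1{D_g <= d} eps_g
        c = np.cumsum(eps[order]) / np.sqrt(G)
        return float(np.mean(c**2))  # CVM = (1/G) sum_g c_G(D_g)^2

    stat = cvm(dy)
    beta, *_ = np.linalg.lstsq(X, dy, rcond=None)
    fitted = X @ beta
    eps = dy - fitted
    # Mammen multipliers: E[V] = 0, E[V^2] = E[V^3] = 1
    v1, v2 = -(np.sqrt(5) - 1) / 2, (np.sqrt(5) + 1) / 2
    p1 = (np.sqrt(5) + 1) / (2 * np.sqrt(5))
    boot = np.empty(n_boot)
    for b in range(n_boot):
        V = np.where(rng.random(G) < p1, v1, v2)
        boot[b] = cvm(fitted + V * eps)  # re-estimate on the bootstrap sample
    return stat, float(np.mean(boot >= stat))

rng = np.random.default_rng(2)
d2 = rng.uniform(0, 2, 300)
dy_lin = 1 + 2 * d2 + rng.normal(scale=0.3, size=300)   # linear CEF: null holds
dy_nl = 1 + d2**2 + rng.normal(scale=0.3, size=300)     # quadratic CEF: null fails
stat_lin, p_lin = stute_cvm_pvalue(d2, dy_lin)
stat_nl, p_nl = stute_cvm_pvalue(d2, dy_nl)
print(p_lin, p_nl)  # the non-linear design yields a very small p-value
```

Unlike the polynomial regression test, this test is consistent against any fixed non-linear alternative.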
Computation: Stata and R commands to implement the Stute test. The stute_test
Stata (see de Chaisemartin, Ciccia, D'Haultfœuille, Knau and Sow, 2024) and R (see de Chaise-
martin, Ciccia, D'Haultfœuille, Knau and Sow, 2024d) commands implement the Stute test. The
syntax of the Stata command is:
stute_test ∆Y D2.
By default, the command tests that the conditional mean of ∆Y given D2 is linear. To test that
∆Y and D2 are mean independent, as one would do in a pre-trends test, one needs to specify the
order(0) option.
Without stayers, the interpretation of the linearity tests crucially depends on whether there
are quasi-stayers, namely groups with a period-two treatment very close to zero. Therefore,
de Chaisemartin, Ciccia, D'Haultfœuille and Knau (2024) propose tests of the null hypothesis
that there are quasi-stayers. One of their test statistics is QS = D(1),2/(D(2),2 − D(1),2), where
D(1),2 ≤ ... ≤ D(G),2 denote the order statistics of (Dg,2)g=1,...,G. The critical region is Wα :=
{QS > 1/α − 1}. Intuitively, we reject the null if the distance between D(1),2 and 0 is more
than 1/α − 1 times larger than the distance between D(2),2 and D(1),2: then, D(1),2 is too far
from zero for it to be plausible that D(1),2 would converge to 0 if the sample size were to grow
to infinity. That test is asymptotically valid if the density of D2 is strictly positive at 0, and it
has nontrivial local power.⁷
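The quasi-stayers test is a one-line computation. The sketch below (hypothetical doses) rejects when the smallest dose is far from zero relative to the spacing of the two smallest order statistics:

```python
import numpy as np

def quasi_stayer_test(doses, alpha=0.05):
    """Reject the null that there are quasi-stayers when
    QS = D(1) / (D(2) - D(1)) > 1/alpha - 1."""
    d = np.sort(np.asarray(doses, float))
    qs = d[0] / (d[1] - d[0])
    return qs, qs > 1 / alpha - 1

# Smallest dose far from 0 relative to its spacing: reject (QS ~ 2.0/0.1 = 20 > 19).
qs1, rej1 = quasi_stayer_test([2.0, 2.1, 3.0, 4.0])
# Smallest dose close to 0: do not reject (QS ~ 0.01/0.49, far below 19).
qs2, rej2 = quasi_stayer_test([0.01, 0.5, 1.0, 2.0])
print(rej1, rej2)
```

With α = 0.05 the critical value is 1/0.05 − 1 = 19, so only the first design rejects the existence of quasi-stayers.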
A test that Eu (∆Yt |D2 ) is linear for all t is not rejected. Using the pierce_schott_didtextbook
dataset, run a Stute test of linearity of the conditional expectation of delta2001 given ntrgap.
Do you reject linearity?
⁷ If one worries about the positive density assumption, one can use instead the statistic D(1),2²/(D(2),2² − D(1),2²).
Then, the test remains valid if the density of D2 vanishes at 0, provided its derivative is strictly positive at 0.
this application.8 Note however that the pre-trends TWFE estimators are substantially smaller
than the actual TWFE estimators, so it does not seem that pre-trends can fully account for the
estimated treatment effects.
Eu (Yg,t (0) − Yg,2000 (0) − (t − 2000) × (Yg,2000 (0) − Yg,1999 (0))|Dg,2001 ) = γt . (7.12)
Yg,2000 (0) − Yg,1999 (0) captures industry g’s linear trend without treatment. Then, Yg,t (0) −
Yg,2000 (0) − (t − 2000) × (Yg,2000 (0) − Yg,1999 (0)) is g’s deviation from its linear trend from 2000
to t. Therefore, (7.12) requires that industries' deviations from their linear trend are mean
independent of the NTR-gap treatment, which is similar to Assumption CDLT in Chapter
4. Under this assumption, treatment effect estimators can be obtained by regressing, for t ∈
{2001, 2002, 2004, 2005}, Yg,t − Yg,2000 − (t − 2000) × (Yg,2000 − Yg,1999 ) on the NTR-gap treatment.
Similarly, to test that prior to 2001, industries’ deviations from their linear trends are unrelated
to the NTR-gap treatment, one can regress, for t ∈ {1998, 1997}, Yg,t − Yg,1999 − (t − 1999) ×
(Yg,2000 − Yg,1999 ) on Dg,2001 . We now implement this test. For 1998, run:
reg deltalintrend1998 ntrgap, vce(hc2, dfadjust)
Interpret the results: do we still have differential pre-trends once industry-specific linear trends
are accounted for?
⁸ Those findings are at odds with those from Figure 2 in Pierce and Schott (2016). Therein, the authors run
the exact same pre-trend tests as we do, but on the proprietary dataset they use for most of their analysis, and
they do not find statistically significant pre-trends. In the dataset we use, industries are at the four-digit ISIC
level, while in the proprietary dataset industries are defined at a more disaggregated level. It seems that while
disaggregated NTR gaps are uncorrelated with industries' employment pre-trends, the aggregated variables are
correlated, a version of the so-called "ecological inference problem".
The coefficient on ntrgap is small and insignificant (β̂fe = −0.025, 95% CI = [−0.093, 0.044]).
We proceed similarly for 1997, and we also find a small and insignificant coefficient (β̂fe =
−0.046, 95% CI = [−0.136, 0.044]). Thus, it seems we no longer have differential pre-trends
once industry-specific linear trends are accounted for. However, those pre-trend tests test
whether Yg,t − Yg,1999 − (t − 1999) × (Yg,2000 − Yg,1999) and Dg,2001 are uncorrelated for t ∈
{1998, 1997}, while (7.12) is a mean-independence assumption. To assess whether Yg,t − Yg,1999 −
(t − 1999) × (Yg,2000 − Yg,1999) is mean independent of Dg,2001 for t ∈ {1998, 1997}, run:
stute_test deltalintrend1998 ntrgap, order(0) seed(1)
stute_test deltalintrend1997 ntrgap, order(0) seed(1)
Interpret the results: do we reject the null that deltalintrend1998 and deltalintrend1997
are mean independent of ntrgap?
Those tests are not rejected (p-value=0.30 and p-value=0.51, respectively). A joint test pooling
the two years together is also not rejected (p-value =0.47). This lends plausibility to (7.12).
The Stute test is still not rejected with industry-specific linear trends. As it seems
that Assumption PTNB fails but (7.12) holds, we run the Stute test of linearity again, on
Yg,t − Yg,2000 − (t − 2000) × (Yg,2000 − Yg,1999 ), for t ∈ {2001, 2002, 2004, 2005}. None of the four
tests is rejected at the 5% level. A joint test is also not rejected (p-value =0.40).
TWFE estimators are smaller with than without industry-specific linear trends,
but they are still negative and marginally significant. Overall, our tests suggest that
TWFE estimators with industry-specific linear trends might be reliable in this application, or at
least there is no strong, detectable indication that they are not. Those estimators are shown in
Table 7.1 below, together with all the other estimators and tests computed in this replication.
While TWFE estimators with industry-specific linear trends are smaller and less significant than
TWFE estimators without linear trends, the estimated effect in 2004 is significant at the 5%
level, and that in 2002 is significant at the 10% level.
[Table 7.1 about here. Panel A reports effect estimates for 2001, 2002, 2004, and 2005.]
Notes: This table shows estimated effects, on US employment, of eliminating potential tariff spikes on imports
from China (Panels A and D), and pre-trend estimators (Panels B and C). Estimation uses log-employment
data for a panel of 103 US industries from 1997 to 2002 and from 2004 to 2005. Panels A and B show TWFE
regressions. Panels C and D show TWFE regressions with industry-specific linear trends. Some panels also show
p-values of Stute tests of mean independence and linearity.
Motivation. There are at least three instances where one may prefer to avoid the TWFE
estimator, even if pre-trend tests of Assumption PTNB are not rejected. First, the test of the
constant-and-linear-effect assumption in (7.9) may be rejected. Second, even when that test
is not rejected, one may still worry that the power of the test is low. Third, even when that
test is not rejected and one is not concerned with its power, one may be in a design without
stayers or quasi-stayers, in which case (7.9) could fail even when its testable implication holds.
In this section, we review heterogeneity-robust estimators one could then use. First, we restrict
attention to HADs with stayers or quasi-stayers, before considering HADs without stayers or
quasi-stayers.
Eu(∆Y|D2) can be decomposed into groups' counterfactual outcome evolutions without treatment,
γ2, and D2CAS(D2). It directly follows from this decomposition that if γ2 is identified, then the
CAS is identified, and therefore the ATT and WATT are also identified. With that in mind,
we now propose heterogeneity-robust estimators of the CAS, ATT, and WATT, depending on
whether we have stayers, quasi-stayers, or neither stayers nor quasi-stayers.
7.4.1.1 Identification.
Theorem 15 Suppose that we are in Design HAD’ and Assumptions NA, ND and PTNB hold.
Then,
2. If there exists a strictly positive real number η such that Pu(0 < D2 < η) = 0,
$$\text{ATT} = E_u\!\left[\frac{\Delta Y - E_u[\Delta Y \mid D_2 = 0]}{D_2}\,\middle|\,D_2 > 0\right]. \qquad (7.15)$$
3.
$$\text{WATT} = \frac{E_u[\Delta Y \mid D_2 > 0] - E_u[\Delta Y \mid D_2 = 0]}{E_u[D_2 \mid D_2 > 0]}. \qquad (7.16)$$
Theorem 15 readily follows from (7.13), and from the fact that with stayers or quasi-stayers,
Eu [∆Y |D2 = 0], the outcome evolution of untreated groups, identifies γ2 , the counterfactual
outcome evolution that treated groups would have experienced without treatment. Then, d2 ↦
CAS(d2), the ATT, and the WATT are identified by DID estimands comparing the outcome
evolutions of treated and untreated groups, and scaling that comparison by the treatment of
treated groups. Intuitively, why is it that the estimands in Theorem 15 identify average effects
under Assumption PTNB alone, unlike βfe?
βfe may identify a non-convex combination of effects under Assumption PTNB alone because
it leverages forbidden comparisons of the outcome evolutions of more- and less-treated groups.
Instead, the estimands in Theorem 15 only compare treated and untreated groups: weakly
treated groups are not used as control groups by those estimands.
Identification of the ATT with quasi-stayers.∗ Importantly, for the ATT the identification
result in Theorem 15 assumes that there are no quasi-stayers: the dose of treatment received by
treated groups should be bounded from below and cannot be arbitrarily close to zero. Otherwise,
under weak conditions one can show that
$$E_u\!\left[\frac{\Delta Y - E_u[\Delta Y \mid D_2 = 0]}{D_2}\,\middle|\,D_2 > 0\right] = +\infty,$$
meaning that the estimand in (7.15) is not well defined. Intuitively, this is due to the fact that
with quasi-stayers, D2 can be arbitrarily close to zero while ∆Y − Eu[∆Y|D2 = 0] may not be
close to zero. One can show that with quasi-stayers, the ATT is identified by
$$\lim_{\eta \to 0} E_u\!\left[\frac{\Delta Y - E_u[\Delta Y \mid D_2 = 0]}{D_2}\,\middle|\,D_2 > \eta\right].$$
The estimand in the previous display is a limiting estimand. It trims quasi-stayers from the
estimand in (7.15), and lets the trimming go to zero, as in Graham and Powell (2012), who
consider a related estimand with quasi-stayers. Accordingly, with quasi-stayers the ATT is
irregularly identified by a limiting estimand. Then, de Chaisemartin et al. (2022) conjecture
that the ATT cannot be estimated at the √n rate, as in Graham and Powell (2012), and as is
often the case with target parameters identified by limiting estimands. Due to this limitation,
we will not consider estimation of the ATT with quasi-stayers.
To estimate the CAS at a dose d2, one can use
$$\widehat{\text{CAS}}(d_2) := \frac{\frac{1}{G_{1,d_2}}\sum_{g:\, D_{g,2} = d_2} \Delta Y_g - \frac{1}{G_0}\sum_{g:\, D_{g,2} = 0} \Delta Y_g}{d_2},$$
where G1,d2 denotes the number of treated groups such that Dg,2 = d2, and G0 denotes the
number of untreated groups. The numerator of $\widehat{\text{CAS}}(d_2)$ is numerically equivalent to the
coefficient on 1{Dg,2 = d2} in a linear regression of ∆Yg on indicators for all the strictly positive
values that Dg,2 can take.
Estimation of the ATT, if there are no quasi-stayers. To estimate the ATT, if there are
no quasi-stayers, one can follow Theorem 15, and use
$$\widehat{\text{ATT}}_s := \frac{1}{G_1}\sum_{g:\, D_{g,2} > 0} \frac{\Delta Y_g - \frac{1}{G_0}\sum_{g':\, D_{g',2} = 0} \Delta Y_{g'}}{D_{g,2}}.$$
de Chaisemartin et al. (2022) show that this estimator is √G-consistent and asymptotically
normal when G → +∞, and derive its asymptotic variance.
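On noiseless hypothetical data satisfying parallel trends, this estimator recovers the average of treated groups' per-dose effects exactly; the sketch below uses made-up doses and effects:

```python
import numpy as np

def att_stayers(doses, dy):
    """ATT_s-hat: average over treated groups of
    (dY_g - mean dY of stayers) / D_g2."""
    d = np.asarray(doses, float)
    dy = np.asarray(dy, float)
    stayer_mean = dy[d == 0].mean()
    treated = d > 0
    return np.mean((dy[treated] - stayer_mean) / d[treated])

# Stayers pin down the common trend (here 0.5); treated groups add D_g2 * TE_g2.
doses = np.array([0.0, 0.0, 1.0, 2.0])
te = np.array([0.0, 0.0, 1.0, 0.4])   # hypothetical per-dose effects
dy = 0.5 + doses * te                  # parallel trends, no noise
att_hat = att_stayers(doses, dy)
print(att_hat)  # (1.0 + 0.4)/2 = 0.7, the average per-dose effect of treated groups
```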
Estimation of the WATT. To estimate the WATT, one can also follow Theorem 15, and
use
$$\widehat{\text{WATT}}_s := \frac{\frac{1}{G_1}\sum_{g:\, D_{g,2} > 0} \Delta Y_g - \frac{1}{G_0}\sum_{g:\, D_{g,2} = 0} \Delta Y_g}{\frac{1}{G_1}\sum_{g:\, D_{g,2} > 0} D_{g,2}}.$$
The coefficient on the treatment in a 2SLS regression with one binary instrument Zg and no
controls is the Wald ratio
$$\frac{\frac{1}{\#\{g:\, Z_g = 1\}}\sum_{g:\, Z_g = 1} \Delta Y_g - \frac{1}{\#\{g:\, Z_g = 0\}}\sum_{g:\, Z_g = 0} \Delta Y_g}{\frac{1}{\#\{g:\, Z_g = 1\}}\sum_{g:\, Z_g = 1} D_{g,2} - \frac{1}{\#\{g:\, Z_g = 0\}}\sum_{g:\, Z_g = 0} D_{g,2}},$$
where for any set A, #A denotes the number of elements of A, i.e. its cardinality. With
Zg = 1{Dg,2 > 0}, the previous display is equal to $\widehat{\text{WATT}}_s$. Thus, to compute $\widehat{\text{WATT}}_s$, we can
run a 2SLS regression of ∆Yg on a constant and Dg,2, using 1{Dg,2 > 0} as the instrument for
Dg,2. Note that when control variables are included in this regression, it is no longer guaranteed
to unbiasedly estimate the WATT, and it may not even estimate a convex combination of slopes.
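The Wald-ratio characterization makes the estimator a two-line computation. The sketch below uses hypothetical, noiseless data (common trend 0.5, per-dose effects 1.0 and 0.4), and recovers the dose-weighted average of per-dose effects:

```python
import numpy as np

def watt_stayers(doses, dy):
    """WATT_s-hat as the Wald ratio with instrument Z_g = 1{D_g2 > 0}:
    (mean dY treated - mean dY stayers) / (mean D treated - mean D stayers)."""
    d = np.asarray(doses, float)
    dy = np.asarray(dy, float)
    z = d > 0
    return (dy[z].mean() - dy[~z].mean()) / (d[z].mean() - d[~z].mean())

doses = np.array([0.0, 0.0, 1.0, 2.0])
dy = 0.5 + doses * np.array([0.0, 0.0, 1.0, 0.4])  # parallel trends, no noise
watt_hat = watt_stayers(doses, dy)
print(watt_hat)  # dose-weighted average: 1/3 * 1.0 + 2/3 * 0.4 = 0.6
```

Note the contrast with the ATT on the same data (an unweighted average of 1.0 and 0.4, i.e. 0.7): the WATT tilts toward the more heavily dosed group.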
Like $\widehat{\text{ATT}}_s$, $\widehat{\text{WATT}}_s$ is an average of the group-specific-slope estimators
$$\widehat{TE}_{g,2} := \frac{\Delta Y_g - \frac{1}{G_0}\sum_{g':\, D_{g',2} = 0} \Delta Y_{g'}}{D_{g,2}},$$
but unlike $\widehat{\text{ATT}}_s$, $\widehat{\text{WATT}}_s$ downweights $\widehat{TE}_{g,2}$ for "weakly-treated" groups, those with
$D_{g,2} < \frac{1}{G_1}\sum_{g':\, D_{g',2} > 0} D_{g',2}$, and it upweights $\widehat{TE}_{g,2}$ for strongly-treated groups. One has that
$$\widehat{TE}_{g,2} = \frac{Y_{g,2}(D_{g,2}) - Y_{g,1}(0) - \frac{1}{G_0}\sum_{g':\, D_{g',2} = 0} \Delta Y_{g'}(0)}{D_{g,2}} = TE_{g,2} + \frac{\Delta Y_g(0) - \frac{1}{G_0}\sum_{g':\, D_{g',2} = 0} \Delta Y_{g'}(0)}{D_{g,2}},$$
where the second equality follows from adding and subtracting Yg,2(0). Then, if TEg,2 = δ for
some real number δ and Vu(∆Yg(0)|Dg,2) = σ² for some real number σ²,
$$V\!\left(\widehat{TE}_{g,2} \,\middle|\, D_{g,2}\right) = V\!\left(TE_{g,2} + \frac{\Delta Y_g(0) - \frac{1}{G_0}\sum_{g':\, D_{g',2} = 0} \Delta Y_{g'}(0)}{D_{g,2}} \,\middle|\, D_{g,2}\right) = \frac{\sigma^2(1 + 1/G_0)}{D_{g,2}^2}.$$
In this section, our target parameter is the WATT: to our knowledge, with no stayers but some
quasi-stayers, only heterogeneity-robust estimators of the WATT have been proposed so far.
Without stayers, which part of the estimands identifying the WATT in Theorem 15 becomes
difficult to estimate?
Estimation problem. In Theorem 15, the estimand identifying the WATT compares treated
groups’ average outcome evolution to Eu [∆Y |D2 = 0], their counterfactual outcome evolution
without treatment under Assumption PTNB. Without stayers, estimating Eu [∆Y |D2 = 0] is
not straightforward: Pu (D2 = 0) = 0, so no group in the sample is such that Dg,2 = 0, and we
cannot merely compute the sample average of the outcome evolutions of untreated groups.
a higher treatment dose to infer groups’ counterfactual trend without treatment. At the same
time, as h increases, the variance of the estimator decreases, as it estimates groups’ counter-
factual trend without treatment out of a larger sample. This suggests that there might exist
an optimal bandwidth, that trades off the estimator’s bias and variance optimally. de Chaise-
martin, Ciccia, D’Haultfœuille and Knau (2024) leverage results from the regression discontinu-
ity designs (RDDs) and non-parametric estimation literature (see Imbens and Kalyanaraman,
2012; Calonico, Cattaneo and Titiunik, 2014; Calonico, Cattaneo and Farrell, 2018), to propose
an optimal bandwidth that minimizes an asymptotic approximation of the estimator’s mean
squared-error, and a robust confidence interval accounting for the estimator’s first-order bias.
The resulting estimator of the WATT, $\widehat{\text{WATT}}^{qs*}_{\hat h_G}$, converges at the $G^{2/5}$ rate, the standard
univariate non-parametric rate. With stayers, the estimator of the WATT converges at the faster
$G^{1/2}$ rate, so moving from a design with stayers to a design with quasi-stayers comes with a
precision cost. β̂fe also converges at the faster $G^{1/2}$ rate, so moving from β̂fe to the heterogeneity-
robust estimator $\widehat{\text{WATT}}^{qs*}_{\hat h_G}$ also comes with a precision cost. This is a further reason why, in
designs without stayers but with quasi-stayers, researchers might want to run the test of the
constant-and-linear-effect assumption discussed in the previous section, to ensure that using a
heterogeneity-robust estimator is warranted.
$$\widehat{\text{WATT}}^{qs}_h = \frac{\frac{1}{G}\sum_{g=1}^{G} \Delta Y_g - \hat\gamma_h}{\frac{1}{G}\sum_{g=1}^{G} D_{g,2}},$$
with γ̂h the intercept in the local linear regression of ∆Yg on Dg,2, weighting observations by
k(Dg,2/h)/h, for a kernel function k and a bandwidth h > 0.
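A minimal sketch of this estimator follows, with a triangular kernel, a fixed bandwidth, and no bias correction or optimal bandwidth choice, so it is much simpler than did_had; the data are hypothetical and noiseless:

```python
import numpy as np

def local_linear_intercept(d2, dy, h, kernel=lambda u: np.maximum(0, 1 - np.abs(u))):
    """Intercept of the regression of dY on (1, D2) with weights k(D2/h)/h:
    an estimate of the counterfactual trend gamma_2 at D2 = 0."""
    d2 = np.asarray(d2, float)
    dy = np.asarray(dy, float)
    w = kernel(d2 / h) / h
    X = np.column_stack([np.ones_like(d2), d2])
    sw = np.sqrt(w)
    beta, *_ = np.linalg.lstsq(X * sw[:, None], dy * sw, rcond=None)
    return beta[0]

def watt_quasi_stayers(d2, dy, h):
    """WATT-hat^qs_h = (mean dY - gamma_hat_h) / mean D2."""
    return (np.mean(dy) - local_linear_intercept(d2, dy, h)) / np.mean(d2)

# No stayers, but doses near zero (quasi-stayers); gamma_2 = 0.5, constant slope 1.
d2 = np.linspace(0.05, 2, 50)
dy = 0.5 + d2            # noiseless linear dY
watt_qs = watt_quasi_stayers(d2, dy, h=0.5)
print(watt_qs)  # close to 1 on this noiseless linear data
```

The local linear regression extrapolates the groups with the smallest doses to D2 = 0 to estimate γ2, exactly the role played by stayers when they exist.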
Estimator's asymptotic distribution.∗ Let m(d2) = Eu(∆Y|D2 = d2). One can derive the
asymptotic behavior of $\widehat{\text{WATT}}^{qs}_h$ under the conditions below:
3. σ 2 (d) := Vu (∆Y |D2 = d), defined on Supp(D2 ), is continuous at 0 and σ 2 (0) > 0.
Optimal bandwidth and robust confidence interval.∗ Based on (7.18), one can derive a
so-called optimal bandwidth, which, as in RDDs (see Imbens and Kalyanaraman, 2012), min-
imizes the asymptotic mean squared error of $\widehat{\text{WATT}}^{qs}_{h_G}$. Then, inference on the WATT is not
straightforward:
$$\sqrt{G\hat h^*_G}\left(\widehat{\text{WATT}}^{qs*}_{\hat h_G} - \text{WATT}\right)$$
has a first-order bias that needs to be accounted for. However, the general approach for local-
polynomial regressions in Calonico et al. (2018) can be applied here. de Chaisemartin, Ciccia,
D'Haultfœuille and Knau (2024) rely on their results and on their software implementation (see
Calonico, Cattaneo and Farrell, 2019) to:
284 CHAPTER 7. DESIGNS WITH VARIATION IN TREATMENT DOSE
3. compute $\widehat{M}_{\hat h^*_G}$, an estimator of $\hat\gamma_{\hat h^*_G}$'s first-order bias;
With those inputs, de Chaisemartin, Ciccia, D'Haultfœuille and Knau (2024) simply define their estimator of the WATT with quasi-stayers as
$$\widehat{\mathrm{WATT}}^{qs*}_{\hat h^*_G} = \frac{\frac{1}{G}\sum_{g=1}^{G}\Delta Y_g - \hat\gamma_{\hat h^*_G}}{\frac{1}{G}\sum_{g=1}^{G} D_{g,2}},$$
and its bias-corrected confidence interval as
$$\widehat{\mathrm{WATT}}^{qs*}_{\hat h^*_G} + \frac{\widehat{M}_{\hat h^*_G}}{\frac{1}{G}\sum_{g=1}^{G} D_{g,2}} \pm \frac{q_{1-\alpha/2}\sqrt{\widehat{V}_{\hat h^*_G}/(G\hat h^*_G)}}{\frac{1}{G}\sum_{g=1}^{G} D_{g,2}}. \qquad (7.19)$$
Computation: Stata and R commands to compute $\widehat{\mathrm{WATT}}^{qs*}_{\hat h^*_G}$ and confidence intervals for the WATT. The did_had Stata (see de Chaisemartin, Ciccia, D'Haultfœuille, Knau and Sow, 2024b) and R (see de Chaisemartin, Ciccia, D'Haultfœuille, Knau and Sow, 2024a) commands can be used to compute the estimator $\widehat{\mathrm{WATT}}^{qs*}_{\hat h^*_G}$ and confidence intervals for the WATT, in HADs with no stayers but some quasi-stayers. The syntax of the Stata command is:
Importantly, did_had heavily relies on the nprobust package of Calonico et al. (2019), which
should be cited, together with Calonico et al. (2018), whenever did_had is used.
Constant effect of the lowest treatment dose. Results above for designs with stayers
or quasi-stayers can be extended to designs without stayers or quasi-stayers, at the expense
of imposing the following constant-effect assumption. Hereafter, let d = inf Supp(D2 ) be the
infimum of the support of the period-two treatment, so that d = 0 in designs with stayers or
quasi-stayers.
Assumption CELD requires that the effect of receiving the lowest treatment dose d be mean
independent of units’ actual period-two dose D2 . While strong, Assumption CELD is arguably
less strong than Eu [TE2 |D2 = d] =ATT, the constant-and-linear-effect assumption in (7.9). With
data from a second pre-treatment period, can we use a pre-trends test to assess the plausibility
of Assumption CELD?
Contrary to Assumption PTNB, one cannot assess the plausibility of Assumption CELD via a
pre-trends test: if the data contains another pre-period t = 0 where groups are all untreated,
Yg,1 − Yg,0 is an outcome evolution without treatment, which is not the period-one equivalent of
Yg,2 (d) − Yg,2 (0), the effect of the lowest treatment dose.
Define the following counterpart of $\mathrm{TE}_2$:
$$\mathrm{TE}_{2,d} := \frac{Y_2(D_2) - Y_2(d)}{D_2 - d}.$$
Accordingly, we also define the following counterparts of the ATT and WATT:
$$\mathrm{ATT}_d := E\left[\mathrm{TE}_{2,d}\right], \qquad \mathrm{WATT}_d := E\left[\frac{D_2 - d}{E[D_2 - d]}\mathrm{TE}_{2,d}\right].$$
$$\beta^{fe} = \mathrm{ATT}_d. \qquad (7.20)$$
The following theorem shows that under Assumptions PTNB and CELD, (7.20) is equivalent to linearity of $E[\Delta Y|D_2]$: (7.20) is fully testable.
Theorem 16 Suppose that Assumptions PTNB and CELD hold. In Design HAD, (7.20) holds
if and only if E(∆Y |D2 ) = β0 + β fe D2 .
Theorem 17 Suppose that we are in Design HAD and Assumptions PTNB and CELD hold.
Then,
$$\mathrm{WATT}_d = \frac{E[\Delta Y] - E[\Delta Y|D_2 = d]}{E[D_2 - d]}. \qquad (7.21)$$
The proof is similar to that of Theorem 15 and thus omitted. If $P(D_2 = d) > 0$, to estimate $\mathrm{WATT}_d$ we can use an estimator similar to the 2SLS WATT estimator introduced in Section 7.4.1.2: we can run a 2SLS regression of $\Delta Y$ on $D_2$ using $1\{D_2 > d\}$ as the instrument. Similarly,
if $P(D_2 = d) = 0$, to estimate $\mathrm{WATT}_d$ we can use an estimator similar to the WATT estimator with quasi-stayers introduced in Section 7.4.1.3, replacing $D_2$ by $D_2 - d$ in the estimator's definition. We let $\widehat{\mathrm{WATT}}_{d,\hat h^*_G}$ denote that estimator.
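As a minimal numpy sketch (function name ours), the 2SLS estimator for the case $P(D_2 = d) > 0$ reduces to a ratio of sample covariances, because the instrument $1\{D_2 > d\}$ is binary:

```python
import numpy as np

def watt_d_2sls(delta_y, d2, d_min):
    """Sketch of the 2SLS estimator of WATT_d when P(D_2 = d) > 0: with the
    binary instrument Z = 1{D_2 > d}, 2SLS of Delta Y on D_2 reduces to the
    Wald ratio cov(Z, Delta Y) / cov(Z, D_2), the sample analog of
    (E[Delta Y] - E[Delta Y | D_2 = d]) / E[D_2 - d] in (7.21)."""
    z = (d2 > d_min).astype(float)
    num = np.mean(z * delta_y) - np.mean(z) * np.mean(delta_y)  # cov(Z, DY)
    den = np.mean(z * d2) - np.mean(z) * np.mean(d2)            # cov(Z, D2)
    return num / den
```

In sample, this Wald ratio coincides exactly with the plug-in version of (7.21).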
A condition under which the estimand in Theorem 17 identifies the sign of the
WATT.∗ Under Assumption PTNB,
$$E[\Delta Y] - E[\Delta Y|D_2 = d] = E(Y_2(D_2) - Y_2(0)) - E(Y_2(d) - Y_2(0)|D_2 = d).$$
7.4. HETEROGENEITY-ROBUST ESTIMATORS 287
Thus, if $E(Y_2(D_2) - Y_2(0))$ and $E(Y_2(d) - Y_2(0)|D_2 = d)$ are of opposite signs, the estimand in Theorem 17 has the same sign as $E(Y_2(D_2) - Y_2(0))$, and therefore as the WATT.
7.4.2.1 Application to the effect of having access to independent news on voting behavior
Design and data. Remember that in 1996, a new TV channel called NTV was introduced in
Russia. At that time, NTV was the only TV channel in Russia not controlled by the government.
Enikolopov et al. (2011) study the effect of having access to this independent news source on
voting behavior, using voting outcomes for the 1,938 Russian subregions in the 1995 and 1999
elections. After 1996, NTV’s coverage rate is heterogeneous across regions: while a large fraction
of the population receives it in urbanized regions, a smaller fraction receives it in more rural
regions. Yet, in the region with the lowest exposure rate to NTV, this rate is still equal to 27%,
so there is no unexposed (stayer) or almost unexposed (quasi-stayer) region. The authors define
their treatment as Dg,t , the proportion of the population having access to NTV in region g and
year t, hereafter referred to as the NTV exposure rate.
TWFE estimators. In their Table 3, Enikolopov et al. (2011) use βbfe to estimate NTV’s
effect on five outcomes: the share of the electorate voting for the SPS and Yabloko parties, two
opposition parties supported by NTV; the share of the electorate voting for the KPRF and LDPR
parties, two parties not supported by NTV; and electoral turnout. Specifically, they regress those
outcomes on region FEs, an indicator for the 1999 election, and the NTV exposure rate in region
g and period t. As this TWFE regression only has two time periods, βbfe , the coefficient on the
NTV exposure rate, is numerically equivalent to the coefficient on the NTV exposure rate in a
regression of outcomes’ first difference from 1995 to 1999 on a constant and Dg,2 . βbfe = 6.65
(s.e.= 1.40) for the SPS voting rate, and βbfe = 1.84 (s.e.= 0.76) for the Yabloko voting rate.
According to these regressions, increasing the NTV exposure rate from 0 to 100% increases the
share of votes for the SPS and Yabloko opposition parties by 6.65 and 1.84 percentage points,
respectively. βbfe is small and insignificant for the remaining three outcomes.
If one only makes a parallel-trends assumption, βbfe is very far from estimating a
convex combination of effects. We use the twowayfeweights Stata package to compute the
weights in (7.6), a decomposition of the probability limit of βbfe that makes no further assumptions
than the parallel-trends condition in Assumption PTNB. We find that βbfe estimates a weighted
sum of the effects of NTV in 1999 in the 1,938 regions, where 918 estimated weights are strictly
positive, while 1,020 are strictly negative. The negative weights sum to -2.26.
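In a two-period HAD, those weights can also be computed directly: by the decomposition (7.24) in this chapter's appendix, the effect $\mathrm{TE}_{g,2}$ is weighted by $(D_{g,2} - D_{.,2})D_{g,2}$, normalized to sum to one. A minimal numpy sketch (function name ours; the twowayfeweights package handles general designs):

```python
import numpy as np

def had_weights(d2):
    """Sketch of the weights attached to the effects TE_{g,2} by the TWFE
    coefficient in a two-period heterogeneous-adoption design: group g's
    effect is weighted by (D_{g,2} - mean(D_2)) * D_{g,2}, normalized so
    that the weights sum to one. Groups with positive but below-average
    doses receive negative weights."""
    w = (d2 - d2.mean()) * d2
    return w / w.sum()
```

For instance, with doses (0.27, 0.4, 0.6, 0.9), the two below-average doses receive negative weights.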
Estimators assuming a constant effect of the lowest treatment dose are noisier than $\hat\beta^{fe}$, and sometimes take implausible values. Under the assumption that the effect of raising the NTV exposure rate from 0 to $d = 0.27$ is constant across subregions, we can estimate $\mathrm{WATT}_d$. As the NTV exposure rate is continuously distributed, $P(D_2 = d) = 0$, so we use $\widehat{\mathrm{WATT}}_{d,\hat h^*_G}$, an estimator similar to the estimator with quasi-stayers in Section 7.4.1.3, replacing $D_2$ by $D_2 - d$. The bottom panel of Table 7.2 shows $\widehat{\mathrm{WATT}}_{d,\hat h^*_G}$, its standard error, and its bias-corrected 95% confidence interval. For the SPS vote outcome, $\widehat{\mathrm{WATT}}_{d,\hat h^*_G}$ is close to $\hat\beta^{fe}$ but not significantly different from zero, because its standard error is almost three times larger. For the Yabloko vote outcome, $\widehat{\mathrm{WATT}}_{d,\hat h^*_G}$ is nine times larger than $\hat\beta^{fe}$. Its standard error is more than ten times larger than that of $\hat\beta^{fe}$, but the estimate is still significantly different from zero. In view of the low nationwide voting rate for the Yabloko party (3.2 percentage points), the value of $\widehat{\mathrm{WATT}}_{d,\hat h^*_G}$ is implausibly large, which suggests that either Assumption PTNB or Assumption CELD fails. For the KPRF outcome, $\widehat{\mathrm{WATT}}_{d,\hat h^*_G}$ is much more negative than $\hat\beta^{fe}$, and it is significant, even though its standard error is almost four times larger. Finally, for the remaining two outcomes, $\widehat{\mathrm{WATT}}_{d,\hat h^*_G}$ is insignificant, with a standard error much larger than that of $\hat\beta^{fe}$.
$\widehat{\mathrm{WATT}}_{d,\hat h^*_G}$ (by outcome): SPS 4.55; Yabloko 16.76; KPRF -10.45; LDPR -1.51; Turnout -13.63
Notes: This table shows estimated effects of the exposure rate to independent information on voting outcomes in
Russia, using voting data for 1,938 Russian subregions in the 1995 and 1999 elections. In the first line, effects are
estimated using TWFE regressions. In the fourth line, effects are estimated using a parametric heterogeneity-
robust estimator, relying on the assumption that the effect of the lowest treatment dose is constant across regions.
Conclusion. It is only when one uses the TWFE estimator that one can conclude that access
to NTV increases votes for opposition parties. That estimator crucially relies on Assumption
PTNB and on (7.11), two assumptions which cannot be tested. However, the stronger condition
(7.9), which is testable together with Assumption PTNB, is strongly rejected by our Stute test,
thus suggesting that treatment effects are indeed heterogeneous in this application, or that the
parallel-trends condition in Assumption PTNB fails.
7.5 Appendix∗
In Design HAD,
$$\Delta Y_g = \Delta Y_g(0) + D_{g,2}\mathrm{TE}_{g,2} - D_{g,1}\mathrm{TE}_{g,1} = \Delta Y_g(0) + D_{g,2}\mathrm{TE}_{g,2}, \qquad (7.23)$$
with the convention that $D_{g,2}\mathrm{TE}_{g,2} = 0$ if $D_{g,2} = 0$. Justify the second equality of this derivation.
The second equality follows from the fact that $D_{g,1} = 0$ in an HAD. Plugging (7.23) into (7.5) and multiplying numerators and denominators by $\frac{1}{G}$,
$$\hat\beta^{fe} = \frac{\frac{1}{G}\sum_{g=1}^{G}(D_{g,2} - D_{.,2})\Delta Y_g(0)}{\frac{1}{G}\sum_{g=1}^{G}(D_{g,2} - D_{.,2})^2} + \frac{\frac{1}{G}\sum_{g=1}^{G}(D_{g,2} - D_{.,2})D_{g,2}\mathrm{TE}_{g,2}}{\frac{1}{G}\sum_{g=1}^{G}(D_{g,2} - D_{.,2})^2}. \qquad (7.24)$$
It follows from (7.24), the weak law of large numbers and the continuous mapping theorem that
$$\beta^{fe} = \frac{\mathrm{cov}_u(D_2, \Delta Y(0))}{V_u(D_2)} + \frac{E_u((D_2 - E_u(D_2))D_2\mathrm{TE}_2)}{V_u(D_2)}. \qquad (7.25)$$
$$\mathrm{cov}_u(D_2, \Delta Y(0)) = E_u\left((D_2 - E_u(D_2))\Delta Y(0)\right) = E_u\left((D_2 - E_u(D_2))E_u(\Delta Y(0)|D_2)\right) = 0, \qquad (7.26)$$
where the second equality follows from the law of iterated expectations, and the third from Assumption PTNB. Next, it follows from the definition of $\mathrm{TE}_{g,2}$ and the law of iterated expectations that
$$E_u\left((D_2 - E_u(D_2))D_2\mathrm{TE}_2\right) = E_u\left((D_2 - E_u(D_2))(Y_2(D_2) - Y_2(0))\right).$$
The first result follows from plugging the preceding display and (7.26) into (7.25).
$$E_u\left((D_2 - E_u(D_2))D_2\mathrm{TE}_2\right) = E_u\left((D_2 - E_u(D_2))D_2 E_u(\mathrm{TE}_2|D_2)\right) = E_u\left((D_2 - E_u(D_2))D_2 E_u(\mathrm{TE}_2|D_2)\,\middle|\, D_2 > 0\right)P_u(D_2 > 0).$$
Finally,
$$V_u(D_2) = E_u(D_2^2) - E_u(D_2)^2 = E_u\left[(D_2 - E_u(D_2))D_2\right] = E_u\left((D_2 - E_u(D_2))D_2\,\middle|\, D_2 > 0\right)P_u(D_2 > 0).$$
The second result follows from plugging the two preceding displays and (7.26) into (7.25). QED.
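The decomposition (7.24) can be verified numerically on simulated data (simulation design ours):

```python
import numpy as np

# Numerical check of (7.24) in a simulated HAD: the first-difference
# coefficient of Delta Y on D_2 equals the sum of a "trend" term and a
# weighted-effects term, exactly, by bilinearity of the sample covariance.
rng = np.random.default_rng(4)
G = 500
d2 = rng.uniform(0, 1, G)           # period-two doses (D_{g,1} = 0)
dy0 = rng.normal(0, 1, G)           # Delta Y_g(0)
te2 = rng.normal(2, 1, G)           # per-unit effects TE_{g,2}
dy = dy0 + d2 * te2                 # outcome evolutions, as in (7.23)

dev = d2 - d2.mean()
beta_fe = np.mean(dev * dy) / np.mean(dev**2)
trend_term = np.mean(dev * dy0) / np.mean(dev**2)
effect_term = np.mean(dev * d2 * te2) / np.mean(dev**2)
assert abs(beta_fe - (trend_term + effect_term)) < 1e-10
```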
Point 1 of Theorem 14 directly follows from plugging (7.9) into (7.13) and from the fact that if
E(U |V ) = a0 + a1 V then it is equal to the linear regression of U on (1, V ).
Then, if $\beta_0 \neq \gamma_2$, $\lim_{d_2\to 0} |E(\mathrm{TE}_2|D_2 = d_2)| = \infty$, thus contradicting the fact that $|E(\mathrm{TE}_2|D_2)| \leq K$ under the bounded-slope assumption we maintain in this chapter. Therefore, $\beta_0 = \gamma_2$. Then,
equating (7.13) and (7.27) implies that $D_2 E(\mathrm{TE}_2|D_2) = \beta^{fe} D_2$, and dividing by $D_2 > 0$ yields $E(\mathrm{TE}_2|D_2) = \beta^{fe}$. Taking expectations on both sides finally yields $\beta^{fe} = \mathrm{ATT}$. This proves Point 2. QED.
Eu [∆Y |D2 = 0] is well-defined in Design HAD’. Moreover, for all d2 in the support of D2 ,
$$\begin{aligned}
\frac{E_u[\Delta Y|D_2 = d_2] - E_u[\Delta Y|D_2 = 0]}{d_2}
&= \frac{E_u[Y_2(d_2) - Y_1(0)|D_2 = d_2] - E_u[\Delta Y(0)|D_2 = 0]}{d_2} \\
&= \frac{E_u[Y_2(d_2) - Y_2(0)|D_2 = d_2] + E_u[\Delta Y(0)|D_2 = d_2] - E_u[\Delta Y(0)|D_2 = 0]}{d_2} \\
&= E_u\left[\frac{Y_2(d_2) - Y_2(0)}{d_2}\,\middle|\, D_2 = d_2\right].
\end{aligned}$$
Justify each step of this derivation.
The first equality follows from the fact that we are in Design HAD'. The second equality follows from adding and subtracting $Y_2(0)$ in the first conditional expectation. The third equality follows from Assumption PTNB. This proves (7.14). (7.15) follows from (7.2), (7.14), and the law of iterated expectations. Finally, (7.16) follows from (7.4), the previous display, and the law of iterated expectations. QED.
Chapter 8
General designs
Motivation. Of the 26 highly-cited AER papers estimating a TWFE regression in the survey
of de Chaisemartin and D’Haultfœuille (forthc.), two have an absorbing and binary treatment
with no variation in treatment timing, four more have an absorbing and binary treatment with
variation in treatment timing, and two more have a heterogeneous-adoption design. However,
18 have a design that differs from those we have studied so far. Below, we open this chapter by
reviewing some of the designs in those 18 papers: non-absorbing binary treatments, absorbing
treatments with variation in treatment timing and dose, and treatments that vary at baseline.
Yet, we will not analyze in turn each non-binary and/or non-staggered design ever encountered
by a social scientist in their research. Doing so would lead to an encyclopedia, rather than an
already long textbook. Instead, we believe that equipped with the fundamental insights from
the two preceding chapters, we are now able to provide generic results that apply to any design. The only restriction we impose throughout is that $D_{g,t} \geq 0$: the treatment should be non-negative, as is most often the case, at least up to a normalization. Alongside generic results, we
will also discuss what we see as particularly interesting design-specific results, but we will not
exhaust all that can be said on say, non-absorbing binary treatments. We believe that more
research on common non-binary and/or non-staggered designs would be very useful.
Non-absorbing binary treatments. First, social scientists are often interested in the effect
of a non-absorbing binary treatment. For instance, Burgess, Jedwab, Miguel, Morjaria and
Padró i Miquel (2015) study the effect, in Kenya, of sharing the ethnicity of a country’s president,
294 CHAPTER 8. GENERAL DESIGNS
on a district’s volume of public expenditures. Districts can enter and leave the treatment (sharing
the president’s ethnicity) twice over the study period. An interesting special case is when groups
can join and leave treatment once:
When (8.1) holds, groups may get treated and leave treatment once, at heterogeneous dates Fg
and Eg .
Absorbing treatments with variation in treatment timing and dose. Second, social
scientists are often interested in the effect of an absorbing treatment with variation in treatment
timing and dose:
Dg,t = Ig 1{t ≥ Fg } (8.2)
If (8.2) holds, treatment is absorbing but there is variation across groups in the period at which
they start receiving the treatment, and in the dose they receive. For instance, Favara and Imbs
(2015) study the effect, in the US, of financial deregulations conducted during the 1990s, on the
volume of credit and housing prices. Their design almost satisfies (8.2): US states deregulate
at heterogeneous times and with heterogeneous intensities. The only difference is that a small
number of states deregulate more than once over the study period, so strictly speaking the
treatment is not absorbing.
Treatments that vary at baseline. Third, social scientists are often interested in the effect
of treatments whose intensity varies across groups at all time periods, including at period one:
For instance, Gentzkow et al. (2011) study the effect, in the US, of the number of newspapers
in circulation in a county on turnout in presidential elections in that county. In 1868, the first
presidential election used in their analysis, counties’ number of newspapers ranges from 0 to
33. Another example is Fuest, Peichl and Siegloch (2018), who study the effect, in Germany, of
the local business tax rate on wages. In 1993, the first period in their data, municipalities have
business tax rates ranging from 10 to 37 percentage points.
Two common misconceptions about general designs. Before starting our study of general
designs, it is important to clear up two common misconceptions. First, non-absorbing designs
do not only arise when those that receive the treatment self-select in and out of it. In most of
the aforementioned examples, the multiple treatment changes come from laws that are changed
several times or repealed after having been enacted: twists and turns in policy making were not
invented by the current US president. Therefore, while it is important to document the reasons
that led the legislator to further or cancel an initial policy change, parallel-trends assumptions
are not by construction less plausible in non-absorbing designs. Second, treatments continuously
distributed across groups within periods are not necessarily continuously distributed over time
within groups. Specifically, there exist designs where different groups all receive a different
dose (Dg,t ̸= Dg′ ,t for all t and g ̸= g ′ ), but where some groups have the same dose at different
periods (Dg,t = Dg,t′ for some g and t ̸= t′ ). For instance, in Fuest, Peichl and Siegloch (2018),
in any given year the local business tax rate is close to being continuously distributed across
municipalities, but many municipalities have the same tax rate in several years.
Chapter’s roadmap. First, we will see that in general designs, TWFE estimators may be
even less robust to heterogeneous effects than in the previous chapters. Strikingly, even assuming
that treatments are randomly assigned may not be enough to guarantee that they estimate a
convex combination of effects. Second, we will see that heterogeneity-robust estimators can be
extended to general designs, by combining insights from Chapters 6 and 7, and further ensuring
that one compares switchers and stayers with the same baseline treatment.
No dynamic effects. In this section, we assume that the treatment has no dynamic effects,
namely, we maintain Assumption ND. This is consistent with the TWFE regression in (3.1),
where the current treatment Dg,t is one of the independent variables, but the lagged treatments
Dg,t−1 , Dg,t−2 etc. are not part of the independent variables, thus implicitly ruling out dynamic
treatment effects. We will relax Assumption ND in some of the chapter’s next sections.
Two periods. Unless otherwise noted, in this section we also assume that T = 2. Doing so
simplifies the exposition, without much substantive loss. When $T > 2$, for all $k \in \{1, ..., T-1\}$ and $t \in \{k+1, ..., T\}$, let $\hat\beta^{fe}_{k,t}$ denote the TWFE coefficient estimated restricting the sample to periods $t$ and $t-k$. It follows from Theorem 1 in Ishimaru (2021) that $\hat\beta^{fe}$ is a weighted average of the coefficients $\hat\beta^{fe}_{k,t}$ across $k$ and $t$. Therefore, even when $T > 2$, $\hat\beta^{fe}$ is a weighted average of two-period TWFE coefficients. Without dynamic effects, each two-period regression can be analyzed in isolation, because groups' outcomes at those two periods do not depend on their treatments at other periods. Then, once a "causal" decomposition of $\hat\beta^{fe}_{k,t}$ as a weighted sum of treatment effects has been obtained, one can plug it into the decomposition of $\hat\beta^{fe}$ as a weighted average of the $\hat\beta^{fe}_{k,t}$s to finally obtain a "causal" decomposition of $\hat\beta^{fe}$.
Roadmap. In this section, we will show that βbfe is even less robust to heterogeneous treatment
effects in general designs than in binary-and-staggered designs or in heterogeneous-adoption
designs. Unsurprisingly, βbfe may not estimate a convex combination of effects under Assumption
PT, or under a different parallel-trends assumption that may be better suited to general designs.
More surprisingly, βbfe may still not estimate a convex combination of effects even if treatments
are as-good-as randomly assigned.
8.1. STATIC TWO-WAY FIXED EFFECTS ESTIMATOR 297
Recall that in the decomposition of βbfe in Theorem 8, the weights Wg,t depend on the residuals ûg,t
from a regression of Dg,t on group and period FEs. Recall also that ûg,t = Dg,t − Dg,. − D.,t + D.,. ,
where Dg,. is the average treatment of group g across time periods, D.,t is the average treatment
at period t across groups, and D.,. is the average treatment across groups and periods. When
T = 2, the formula for ûg,t simplifies as follows:
$$\begin{aligned}
\hat u_{g,1} &= D_{g,1} - (D_{g,1} + D_{g,2})/2 - D_{.,1} + (D_{.,1} + D_{.,2})/2 = \tfrac{1}{2}\left(D_{g,1} - D_{g,2} - (D_{.,1} - D_{.,2})\right),\\
\hat u_{g,2} &= D_{g,2} - (D_{g,1} + D_{g,2})/2 - D_{.,2} + (D_{.,1} + D_{.,2})/2 = \tfrac{1}{2}\left(D_{g,2} - D_{g,1} - (D_{.,2} - D_{.,1})\right). \qquad (8.4)
\end{aligned}$$
Therefore, ûg,1 = −ûg,2 . Recall that ∆ denotes the first-difference operator, and let ∆D. =
D.,2 − D.,1 . Then, it directly follows from Theorem 8 that
$$E\left[\hat\beta^{fe}\right] = \sum_{g=1}^{G}\sum_{t=1}^{2} \frac{(\Delta D_g - \Delta D_.)D_{g,t}(1\{t=2\} - 1\{t=1\})}{\sum_{g'=1}^{G}(\Delta D_{g'} - \Delta D_.)\Delta D_{g'}}\mathrm{TE}_{g,t}. \qquad (8.5)$$
Assume that $D_{g,t} > 0$ and $\Delta D_g \neq \Delta D_.$ for all $(g,t)$. Then, what proportion of the treatment effects $\mathrm{TE}_{g,t}$ are weighted negatively?
If $D_{g,t} > 0$ for all $(g,t)$ and $\Delta D_g \neq \Delta D_.$, all effects receive a non-zero weight. As the weight on $\mathrm{TE}_{g,1}$ has the opposite sign of the weight on $\mathrm{TE}_{g,2}$, exactly half of the weights are negative: for every $g$ such that $\Delta D_g \neq \Delta D_.$, either $\mathrm{TE}_{g,2}$ or $\mathrm{TE}_{g,1}$ is weighted negatively, a fact first noted by de Chaisemartin and Lei (2021). This feature is specific to two-period TWFE regressions: if $T > 2$, the proportion of effects weighted negatively by $\hat\beta^{fe}$ might differ from one half.
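The sign structure of the weights can be checked numerically, by computing the two-way residuals on simulated doses (simulation ours):

```python
import numpy as np

# Check on simulated doses: with T = 2, the two-way residuals
# u_{g,t} = D_{g,t} - D_{g,.} - D_{.,t} + D_{.,.} satisfy u_{g,1} = -u_{g,2},
# and the weight of TE_{g,t} is proportional to u_{g,t} * D_{g,t}, so with
# D_{g,t} > 0 exactly one of each group's two weights is negative.
rng = np.random.default_rng(1)
d = rng.uniform(0.1, 3.0, size=(50, 2))   # doses D_{g,t} > 0
u = d - d.mean(axis=1, keepdims=True) - d.mean(axis=0) + d.mean()
assert np.allclose(u[:, 0], -u[:, 1])
w = u * d                                  # weight numerators, as in (8.5)
assert np.all(np.sign(w[:, 0]) == -np.sign(w[:, 1]))
```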
Intuitively, why is it that βbfe may weight negatively exactly a half of treatment effects?
The TWFE estimator compares the outcome evolution of groups whose treatment
increases more to the outcome evolution of groups whose treatment increases less.
When $T = 2$, $\hat\beta^{fe}$ is numerically equivalent to $\hat\beta^{fd}$, the coefficient from the first-difference regression of $\Delta Y_g$ on an intercept and $\Delta D_g$. Thus, it follows from standard formulas for coefficients in regressions with one non-constant explanatory variable that
$$\hat\beta^{fe} = \frac{\sum_{g=1}^{G}(\Delta D_g - \Delta D_.)\Delta Y_g}{\sum_{g=1}^{G}(\Delta D_g - \Delta D_.)^2}. \qquad (8.6)$$
When we regress a dependent variable on an intercept and a binary treatment variable, the
coefficient on the treatment compares the average of the dependent variable in the treatment
and control groups. Here, ∆Dg is not binary, but (8.6) still shows that the coefficient on ∆Dg
has a similar “treated-versus-control” interpretation. βbfe gives a positive weight to the outcome
evolution ∆Yg of groups whose treatment change from period one to two is larger than the
average (∆Dg > ∆D. ): those groups are used as “treatment groups” by βbfe . But then, as
$$\Delta Y_g = Y_{g,2}(0) - Y_{g,1}(0) + Y_{g,2} - Y_{g,2}(0) - (Y_{g,1} - Y_{g,1}(0)) = Y_{g,2}(0) - Y_{g,1}(0) + D_{g,2}\mathrm{TE}_{g,2} - D_{g,1}\mathrm{TE}_{g,1},$$
if those groups are treated at period one, their TEg,1 is weighted negatively by βbfe . Similarly, βbfe
gives a negative weight to the outcome evolution ∆Yg of groups whose treatment change from
period one to two is lower than the average (∆Dg < ∆D. ): those groups are used as “control
groups”. But then, if those groups are treated at period two, their TEg,2 is weighted negatively
by βbfe .
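The numerical equivalence between $\hat\beta^{fe}$ and the first-difference coefficient when $T = 2$ can be verified on simulated data (simulation ours):

```python
import numpy as np

# Check on simulated data that with T = 2 the TWFE coefficient equals the
# coefficient from regressing Delta Y_g on an intercept and Delta D_g.
rng = np.random.default_rng(0)
G = 40
d = rng.uniform(0, 2, size=(G, 2))               # doses D_{g,t}
y = rng.normal(size=(G, 2)) + 1.5 * d            # outcomes Y_{g,t}

# TWFE: OLS of Y on group dummies, a period-two indicator, and D.
gfe = np.kron(np.eye(G), np.ones((2, 1)))        # group dummies
per = np.tile(np.array([[0.0], [1.0]]), (G, 1))  # period-two indicator
X = np.hstack([gfe, per, d.reshape(-1, 1)])
beta_fe = np.linalg.lstsq(X, y.reshape(-1), rcond=None)[0][-1]

# First-difference regression slope.
dy, dd = y[:, 1] - y[:, 0], d[:, 1] - d[:, 0]
beta_fd = np.cov(dd, dy)[0, 1] / np.var(dd, ddof=1)
assert abs(beta_fe - beta_fd) < 1e-6
```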
With two groups, βbfe reduces to a Wald-DID estimator. With two groups m and ℓ,
(8.6) reduces to
$$\hat\beta^{fe} = \frac{Y_{m,2} - Y_{m,1} - (Y_{\ell,2} - Y_{\ell,1})}{D_{m,2} - D_{m,1} - (D_{\ell,2} - D_{\ell,1})}. \qquad (8.7)$$
The right hand side of (8.7) is a so-called Wald-DID estimator which compares the outcome
evolution of a group whose treatment increases more to the outcome evolution of a group whose
treatment increases less ($D_{m,2} - D_{m,1} > D_{\ell,2} - D_{\ell,1}$). A Wald-DID can estimate a non-convex combination of treatment effects if treatment effects vary between groups (Blundell and Costa-Dias,
2009) or if they change over time (de Chaisemartin, 2011; de Chaisemartin and D’Haultfœuille,
2018). To see that, consider a simple example where Dm,1 = 2, Dm,2 = 4, Dℓ,1 = 1, and
Dℓ,2 = 2: group m receives two units of treatment at period one and four units at period
two, and group ℓ receives one unit of treatment at period one, and two at period two. Then,
Dm,2 − Dm,1 − (Dℓ,2 − Dℓ,1 ) = 1, and
$$\begin{aligned}
E\left[\hat\beta^{fe}\right] &= E\left[Y_{m,2}(4) - Y_{m,1}(2) - (Y_{\ell,2}(2) - Y_{\ell,1}(1))\right]\\
&= E\left[Y_{m,2}(0) - Y_{m,1}(0) - (Y_{\ell,2}(0) - Y_{\ell,1}(0))\right] + 4\mathrm{TE}_{m,2} - 2\mathrm{TE}_{m,1} - 2\mathrm{TE}_{\ell,2} + \mathrm{TE}_{\ell,1}\\
&= 4\mathrm{TE}_{m,2} - 2\mathrm{TE}_{m,1} - 2\mathrm{TE}_{\ell,2} + \mathrm{TE}_{\ell,1},
\end{aligned}$$
where the last equality follows from Assumption PT.
is a weighted sum of m and ℓ’s treatment effects at periods 1 and 2, with weights summing to
one, and where two effects enter with negative weights. Assuming that treatment effects are
constant across groups but not over time (TEm,t = TEℓ,t := TEt ), the previous display reduces
to
$$E\left[\hat\beta^{fe}\right] = 2\mathrm{TE}_2 - \mathrm{TE}_1,$$
so βbfe still estimates a non-convex combination of effects. Assuming that treatment effects are
constant over time but not across groups (TEg,2 = TEg,1 := TEg ), the previous display reduces
to
$$E\left[\hat\beta^{fe}\right] = 2\mathrm{TE}_m - \mathrm{TE}_\ell,$$
so $\hat\beta^{fe}$ again estimates a non-convex combination of effects.
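The two-group example can be verified numerically (the treatment-effect values below are ours, chosen for illustration):

```python
# Numerical check of the two-group example: D_m = (2, 4), D_l = (1, 2),
# and baseline outcomes on parallel trends (common trend +1).
te_m1, te_m2, te_l1, te_l2 = 1.0, 3.0, 2.0, 0.5
y_m1 = 5.0 + 2 * te_m1          # Y_{m,1}(0) = 5, dose 2
y_m2 = 6.0 + 4 * te_m2          # Y_{m,2}(0) = 6, dose 4
y_l1 = 9.0 + 1 * te_l1          # Y_{l,1}(0) = 9, dose 1
y_l2 = 10.0 + 2 * te_l2         # Y_{l,2}(0) = 10, dose 2

# Wald-DID (8.7); the denominator D_{m,2}-D_{m,1}-(D_{l,2}-D_{l,1}) = 1.
beta_fe = (y_m2 - y_m1 - (y_l2 - y_l1)) / (4 - 2 - (2 - 1))
assert abs(beta_fe - (4 * te_m2 - 2 * te_m1 - 2 * te_l2 + te_l1)) < 1e-12
```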
Parallel trends if groups’ treatment does not change. Instead of Assumption PT, assume
that
$$E\left[Y_{g,2}(D_{g,1}) - Y_{g,1}(D_{g,1})\right] \text{ does not vary across } g. \qquad (8.9)$$
Interpret (8.9).
(8.9) is a parallel-trends assumption in the counterfactual where groups’ treatment does not
change between periods one and two, while Assumption PT is a parallel-trends assumption in
the counterfactual where they remain untreated. In general designs, it might be the case that
very few groups are untreated at period one. Then, it might be more natural to impose a
parallel-trends assumption on Yg,t (Dg,1 ) than on Yg,t (0). For instance, if a third period of data,
period zero, is available, and all groups keep the same treatment from period zero to one, then
Yg,1 (Dg,1 ) − Yg,0 (Dg,1 ) is observed for all groups. Then, one can conduct a pre-trends test of
(8.9), while one cannot conduct a pre-trends test of Assumption PT as Yg,1 (0) − Yg,0 (0) is not
observed for all groups.
$\hat\beta^{fe}$ may not estimate a convex combination of effects under (8.9). If $\Delta D_g \neq 0$, let
$$\mathrm{TE}^{\Delta}_{g,2} := E\left[\frac{Y_{g,2}(D_{g,2}) - Y_{g,2}(D_{g,1})}{D_{g,2} - D_{g,1}}\right]$$
denote the expectation of the slope of group $g$'s potential outcome function at period two, between its period-one and its period-two treatment. Let $W^{\Delta}_g := (\Delta D_g - \Delta D_.)\Delta D_g / \sum_{g'=1}^{G}(\Delta D_{g'} - \Delta D_.)^2$. Then, under (8.9),
$$E\left[\hat\beta^{fe}\right] = \sum_{g:\Delta D_g \neq 0} W^{\Delta}_g \mathrm{TE}^{\Delta}_{g,2}. \qquad (8.10)$$
To fix ideas, assume that ∆D. > 0: the average treatment increases from period one to two.
Then, if there are groups whose treatment increases from period one to two (∆Dg > 0), but
increases less than the average increase in the population ($\Delta D_g - \Delta D_. < 0$), their slope $\mathrm{TE}^{\Delta}_{g,2}$ is weighted negatively by $\hat\beta^{fe}$: $\hat\beta^{fe}$ may still estimate a non-convex combination of effects. Like
all the other theorems in this chapter, Theorem 18 is proven in the appendix of this chapter.
If there is at least one untreated group at period one, Assumption PT and (8.9) imply
that some treatment effects are constant over time. Assume that Assumption PT and (8.9) both hold. Then, there exist real numbers $\gamma_2$ and $\gamma^{\Delta}_2$ such that $E\left[Y_{g,2}(0) - Y_{g,1}(0)\right] = \gamma_2$ and $E\left[Y_{g,2}(D_{g,1}) - Y_{g,1}(D_{g,1})\right] = \gamma^{\Delta}_2$ for all $g$. Now, assume that there exists at least one group $g_0$ that is untreated at period one: $D_{g_0,1} = 0$. Then,
$$\gamma^{\Delta}_2 = E\left[Y_{g_0,2}(D_{g_0,1}) - Y_{g_0,1}(D_{g_0,1})\right] = E\left[Y_{g_0,2}(0) - Y_{g_0,1}(0)\right] = \gamma_2.$$
Therefore, the effect of switching the treatment from 0 to $D_{g,1}$ has to be the same at periods one and two. (8.9) only imposes restrictions on one potential outcome per group, $Y_{g,t}(D_{g,1})$, so that
condition alone does not impose restrictions on treatment effects. However, when combined with
Assumption PT, it implies that some treatment effects have to be constant over time. Then,
to have that (8.9) does not restrict effects’ heterogeneity over time, Assumption PT has to fail:
groups have to be on parallel trends in the counterfactual where they keep the same treatment
in periods one and two, but they have to experience differential trends in the counterfactual
where they remain untreated at both dates. Such a scenario might be hard to rationalize, so we
view (8.9) as “essentially” assuming constant effects over time.
(8.12) requires that all groups have the same expected evolutions of their untreated and treated
outcomes. This condition implies that all groups should experience the same evolution of their
treatment effect from period one to two, but unlike (8.9) it does not imply that the treatment
effect is constant over time. On the other hand, when Dg,t is not binary, assuming parallel trends
for all potential outcomes rather than just for the untreated outcome is not enough to ensure
that βbfe estimates a convex combination of effects.
I.i.d. groups. In this section, to simplify the exposition we momentarily take the sampling-
based perspective and assume that the G groups we observe are a random sample drawn from an
infinite super population. Thus, we replace Assumption IND by Assumption IID and we drop
the $g$ subscript. Instead of $E\left[\hat\beta^{fe}\right]$, the estimand we consider is
$$\beta^{fe} := \frac{\mathrm{cov}_u(D_2 - D_1, Y_2 - Y_1)}{V_u(D_2 - D_1)},$$
the probability limit of $\hat\beta^{fe}$.
The slope St is the treatment’s effect. It is a random variable, that may vary across groups. It
is also indexed by $t$, as the treatment's effect might be time-varying. Under (8.13), one has
$$\Delta Y = Y_2(0) - Y_1(0) + S_2 \times \Delta D + \Delta S \times D_1, \qquad (8.14)$$
with $\Delta S := S_2 - S_1$.
(8.15) requires that $\Delta D \perp\!\!\!\perp Y_2(0) - Y_1(0)$, a condition similar in spirit to the parallel-trends
condition in Assumption PT. But (8.15) also requires that $\Delta D \perp\!\!\!\perp (S_1, S_2)$: groups' treatment change should be independent of the level of their treatment effect at periods one and two.
Requiring that ∆D is independent of the level of those unobservables essentially amounts to
assuming that ∆D is as-good-as randomly assigned.
Omitted variable bias (OVB) formula. Assume that (8.15) holds. Then,
$$\begin{aligned}
\beta^{fe} &= \frac{\mathrm{cov}_u(\Delta D, \Delta Y)}{V_u(\Delta D)}\\
&= \frac{\mathrm{cov}_u(\Delta D, Y_2(0) - Y_1(0)) + \mathrm{cov}_u(\Delta D, S_2 \times \Delta D) + \mathrm{cov}_u(\Delta D, \Delta S \times D_1)}{V_u(\Delta D)}\\
&= \frac{\mathrm{cov}_u(\Delta D, Y_2(0) - Y_1(0)) + E_u((\Delta D)^2 S_2) - E_u(\Delta D)E_u(\Delta D S_2) + \mathrm{cov}_u(\Delta D, \Delta S \times D_1)}{V_u(\Delta D)}\\
&= E_u(S_2) + \frac{\mathrm{cov}_u(\Delta D, \Delta S \times D_1)}{V_u(\Delta D)}, \qquad (8.16)
\end{aligned}$$
where the second equality follows from (8.14) and the fourth follows from (8.15). (8.16) is akin to a standard omitted variable bias (OVB) formula. It shows that $\beta^{fe}$ identifies the average treatment effect at period two $E_u(S_2)$, plus the OVB term
$$\frac{\mathrm{cov}_u(\Delta D, \Delta S \times D_1)}{V_u(\Delta D)}.$$
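A simulation (ours) illustrating (8.16): $\Delta D$ is independent of the unobservables but correlated with $D_1$, and the slopes are time-varying, so $\hat\beta^{fe}$ is biased for $E_u(S_2)$:

```python
import numpy as np

# Simulation illustrating the OVB in (8.16): Delta D is independent of
# (Y_2(0) - Y_1(0), S_1, S_2) but correlated with D_1, and the slopes are
# time-varying, so the first-difference coefficient is biased for E_u(S_2).
rng = np.random.default_rng(2)
n = 200_000
d1 = rng.uniform(0, 2, n)
dd = 0.5 - 0.4 * d1 + rng.normal(0, 0.3, n)  # Delta D correlated with D_1
s1, s2 = 1.0, 3.0                            # homogeneous but time-varying
dy0 = rng.normal(0, 1, n)                    # Y_2(0) - Y_1(0)
dy = dy0 + s2 * dd + (s2 - s1) * d1          # decomposition (8.14)

v = np.var(dd, ddof=1)
beta_fe = np.cov(dd, dy)[0, 1] / v
ovb = np.cov(dd, (s2 - s1) * d1)[0, 1] / v
# Exact in-sample decomposition of cov(dd, dy); the cov(dd, dy0) term is
# zero only up to sampling noise.
assert abs(beta_fe - (s2 + ovb + np.cov(dd, dy0)[0, 1] / v)) < 1e-6
assert abs(beta_fe - s2) > 0.5               # sizeable OVB
```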
Intuition for the OVB formula. Letting $\alpha = E_u(Y_2(0) - Y_1(0))$, and letting $\varepsilon := Y_2(0) - Y_1(0) - \alpha + (S_2 - E_u(S_2))\Delta D + \Delta S \times D_1$, (8.14) is equivalent to
$$\Delta Y = \alpha + E_u(S_2) \times \Delta D + \varepsilon. \qquad (8.17)$$
Interpreting and testing (8.18). (8.18) is a strong condition. For instance, it cannot hold
if D1 and D2 take the same bounded set of values, and the distribution of ∆D|D1 = d is non-
degenerate for all d. If (Dt )t≥1 follows an AR(1) process (namely, Dt+1 = λ0 + λ1 Dt + ϵt+1 for
all t ≥ 1, with (ϵt )t≥1 independent across t), (8.18) can only hold in the knife-edge case where
λ1 = 1, meaning that the process is non-stationary. (8.18) is also a testable condition. How can
one test (8.18)?
For instance, one can regress ∆D on D1 , or one can use a Kolmogorov-Smirnov test similar to
that discussed in Section 3.2.1.1. Then, researchers assuming that ∆D is as good as random to
justify a TWFE regression should start by testing if ∆D and D1 are independent. If they are
not, and if the treatment effect varies over time (S1 ̸= S2 ), the regression may be subject to an
OVB, even if ∆D is independent of the unobservables (Y2 (0) − Y1 (0), S1 , S2 ).
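For instance, the AR(1) case can be simulated (parameters ours); the regression of $\Delta D$ on $D_1$ has slope $\lambda_1 - 1$:

```python
import numpy as np

# Simulated illustration: if D_t follows a stationary AR(1) with
# lambda_1 < 1, the regression of Delta D on D_1 has slope
# lambda_1 - 1 < 0, so Delta D and D_1 are not independent and (8.18) fails.
rng = np.random.default_rng(3)
n, lam0, lam1 = 100_000, 1.0, 0.6
d1 = rng.normal(lam0 / (1 - lam1), 1.0, n)   # draws around the AR(1) mean
d2 = lam0 + lam1 * d1 + rng.normal(0, 1.0, n)
slope = np.cov(d1, d2 - d1)[0, 1] / np.var(d1, ddof=1)
assert abs(slope - (lam1 - 1)) < 0.05        # slope close to lam1 - 1
```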
Even if the treatment paths (D1 , D2 ) are randomly assigned, β fe may identify a
non-convex combination of Eu (S1 ) and Eu (S2 ). As the OVB in the previous display is a
function of ∆S, one may wonder if under reasonable assumptions this OVB can simplify so that
β fe identifies a relatively interpretable causal effect, like a convex combination of the effects at
period one and two. The following result shows this is not necessarily the case, even under the
following, stronger assumption:
(8.19) requires that groups’ treatment paths (D1 , D2 ) be as-good-as random, which is stronger
than assuming that their treatment changes ∆D are as good as random.
Find conditions on the distribution of (D1 , D2 ) under which the weights in Theorem 19 are all
positive.
Cases where one of the two weights in Theorem 19 could be negative. One of the
two weights in Theorem 19 could be negative if D1 and D2 are non-binary, positively correlated,
and have different variances. The weights in Theorem 19 can be estimated, to assess if in a given
application, β fe estimates a convex combination of effects under (8.13) and (8.19).
Adding group FEs can do more harm than good. Under (8.19), estimating a TWFE
regression is not necessary anymore: one can merely regress Y1 on an intercept and D1 to estimate
Eu (S1 ), and one can regress Y2 on an intercept and D2 to estimate Eu (S2 ). Thus, adding group
FEs to the regression is not necessary, and adding those FEs can actually do more harm than
good, by leading the researcher to estimate a non-convex combination of treatment effects.
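A minimal simulation sketch of this point (all design choices and parameter values are ours): when (D_1, D_2) are randomly assigned, two period-by-period regressions recover E_u(S_1) and E_u(S_2) without any fixed effects.

```python
import numpy as np

# Hypothetical data: binary treatments randomly assigned in each period,
# heterogeneous effects S1 (mean 1) and S2 (mean 2).
rng = np.random.default_rng(1)
n = 200_000
d1 = rng.integers(0, 2, n).astype(float)
d2 = rng.integers(0, 2, n).astype(float)
s1 = 1.0 + rng.normal(size=n)
s2 = 2.0 + rng.normal(size=n)
y1 = rng.normal(size=n) + s1 * d1
y2 = rng.normal(size=n) + s2 * d2
b1 = np.polyfit(d1, y1, 1)[0]  # estimates E(S1) = 1
b2 = np.polyfit(d2, y2, 1)[0]  # estimates E(S2) = 2
```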
Results similar to those above apply to such regressions. Rather than invoking a parallel-trends assumption to
justify this regression, one could instead assume that (8.15) holds, meaning that the change
in counties’ number of newspapers is as good as random. Then, we have seen that the first-
difference regression may still be subject to an OVB if that change is correlated to counties’
lagged number of newspapers. Using gentzkowetal_didtextbook, regress changedailies on
lag_numdailies, clustering standard errors at the county level. Interpret the results.
The coefficient on lag_numdailies is negative and significant. Then, the first-difference re-
gression may still be subject to an OVB, even if the change in counties’ number of newspapers
is as good as random. To assess the plausibility of (8.15), one may regress the change in
counties’ number of newspapers on some counties’ characteristics that are unlikely to be af-
fected by that change, in the spirit of a balancing check in a randomized controlled trial.
Using gentzkowetal_didtextbook, regress changedailies on lag_ishare_urb, counties’ lagged
urbanization rate, clustering standard errors at the county level. Interpret the results.
The coefficient on lag_ishare_urb is negative and significant: more-urbanized counties are less
likely to experience an increase in their number of newspapers. This suggests that the change
in counties’ number of newspapers may not be as good as random.
8.1.4 Extensions
TWFE regressions with several treatments. For instance, one may be interested in estimating
separately the effects of laws legalizing marijuana consumption for medical and for recreational
purposes. For every (k, g, t) ∈ {1, ..., K} × {1, ..., G} × {1, ..., T}, let D^k_{g,t} denote the value of
treatment k for group g at period t. To simplify the exposition, we assume that the treatments
are binary. Let β̂^fe denote the coefficient on D^1_{g,t} in a regression of Y_{g,t} on group fixed effects,
period fixed effects, and the vector (D^1_{g,t}, ..., D^K_{g,t}). Let TE^1_{g,t} denote the effect, in cell (g, t),
of moving the first treatment from zero
to one. Then, β̂^fe may not even estimate a convex combination of the effects of that treatment, as some of the weights
W^1_{g,t} may be negative. Moreover, β̂^fe is contaminated by the effects of the other treatments,
a phenomenon very similar to that we discussed in the context of TWFE ES regressions in
Chapter 6. A difference with TWFE ES regressions is that the contamination weights W^{-1}_{g,t}
do not always sum to zero. Accordingly, even if the effects of all treatments are constant, β̂^fe
may still be biased for the first treatment's effect. There are still three special cases where the
contamination weights sum to zero: if K = 2, or if the treatments D^2_{g,t}, ..., D^K_{g,t} are mutually
exclusive, or if for all (g, t), there exists (δ^k_{g,t})_{k=2,...,K} such that

TE^{-1}_{g,t} = Σ_{k=2}^{K} D^k_{g,t} δ^k_{g,t}, (8.20)
The origin of the contamination weights. Consider a simple example with four groups
and two periods. No group is treated at period 1. At period two, group two receives the first
treatment, group three receives the second treatment, and group four receives both treatments.
Then one can show that

β̂^fe = (1/2) (Y_{2,2} − Y_{2,1} − (Y_{1,2} − Y_{1,1})) + (1/2) (Y_{4,2} − Y_{4,1} − (Y_{3,2} − Y_{3,1})).
The second DID in the previous display compares the period-one-to-two outcome evolution of
group 4, that starts receiving the first and second treatments at period 2, to that of group 3,
that only starts receiving the second treatment. If the effect of the second treatment is constant
across groups, that DID unbiasedly estimates the effect of the first treatment. But if the effect
of the second treatment is heterogeneous, that DID is contaminated by the effect of the second
treatment.
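This decomposition can be checked numerically. The sketch below uses made-up outcomes (the numbers are ours, chosen so that parallel trends holds with heterogeneous effects): the OLS coefficient on the first treatment, in a regression on group fixed effects, a period fixed effect, and both treatment dummies, equals the average of the two DIDs in the display.

```python
import numpy as np

# Rows are (g, t) = (1,1), (1,2), (2,1), (2,2), (3,1), (3,2), (4,1), (4,2).
# Hypothetical outcomes: common trend +1, effect 2 of treatment 1 in group 2,
# effect 1.5 of treatment 2 in group 3, combined effect 3 in group 4.
Y = np.array([0, 1, 0, 3, 0, 2.5, 0, 4.0])
#             const g2 g3 g4 t2 D1 D2
X = np.array([[1, 0, 0, 0, 0, 0, 0],
              [1, 0, 0, 0, 1, 0, 0],
              [1, 1, 0, 0, 0, 0, 0],
              [1, 1, 0, 0, 1, 1, 0],
              [1, 0, 1, 0, 0, 0, 0],
              [1, 0, 1, 0, 1, 0, 1],
              [1, 0, 0, 1, 0, 0, 0],
              [1, 0, 0, 1, 1, 1, 1]], dtype=float)
beta = np.linalg.lstsq(X, Y, rcond=None)[0]
did1 = (Y[3] - Y[2]) - (Y[1] - Y[0])  # group 2 vs never-treated group 1
did2 = (Y[7] - Y[6]) - (Y[5] - Y[4])  # group 4 vs group 3
assert np.isclose(beta[5], 0.5 * did1 + 0.5 * did2)  # both equal 1.75
```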
To simplify the discussion, in this section we assume that T = 2 and groups are an i.i.d. sample
drawn from a larger population, thus allowing us to drop the g subscript except when we consider
estimators.
regressions estimate a LATE or even just a convex combination of (g, t)-specific effects. As T = 2
and groups are i.i.d., the probability limit of β̂^fe_{2SLS} is

β^fe_{2SLS} := cov(∆Y, ∆Z) / cov(∆D, ∆Z).
Consider the following two assumptions:

cov(D_2(Z_1) − D_1(Z_1), ∆Z) = 0,  cov(Y_2(D_2(Z_1)) − Y_1(D_1(Z_1)), ∆Z) = 0,

respectively requiring that the instrument change ∆Z be uncorrelated with the counterfactual
trends of the treatment and outcome if the instrument had not changed. Under those assump-
tions, as

∆Y = Y_2(D_2(Z_2)) − Y_1(D_1(Z_1)) = Y_2(D_2(Z_2)) − Y_2(D_2(Z_1)) + Y_2(D_2(Z_1)) − Y_1(D_1(Z_1)),

one can show that

β^fe_{2SLS} = E[W^{2SLS} (Y_2(D_2(Z_2)) − Y_2(D_2(Z_1))) / (D_2(Z_2) − D_2(Z_1))],

where

W^{2SLS} = (∆Z − E(∆Z))(D_2(Z_2) − D_2(Z_1)) / E[(∆Z − E(∆Z))(D_2(Z_2) − D_2(Z_1))],

and we use the convention that 0/0 = 0. The previous display shows that β^fe_{2SLS} may estimate
a non-convex combination of the slopes (Y_2(D_2(Z_2)) − Y_2(D_2(Z_1)))/(D_2(Z_2) − D_2(Z_1)), even if
one further makes a monotonicity assumption such as Z_2 ≥ Z_1 ⇒ D_2(Z_2) ≥ D_2(Z_1).
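A minimal sketch of the probability-limit formula on simulated data (our parameter values, with a constant effect so the ratio of covariances recovers it):

```python
import numpy as np

# First-difference 2SLS coefficient as a ratio of covariances:
# beta = cov(dY, dZ) / cov(dD, dZ).
rng = np.random.default_rng(2)
n = 100_000
dz = rng.normal(size=n)               # instrument change
dd = 0.8 * dz + rng.normal(size=n)    # first stage: treatment change
dy = 1.5 * dd + rng.normal(size=n)    # constant treatment effect of 1.5
beta_2sls = np.cov(dy, dz)[0, 1] / np.cov(dd, dz)[0, 1]  # close to 1.5
```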
In Bartik (shift-share) designs, ∆Z_s is a sector-specific shock, and Q_{s,g} is the share that sector s accounts for in, say, the
employment of location g. For instance, in Autor et al. (2013), ∆Z_s is the change in imports from
China to high-income countries in sector s, and Q_{s,g} is the share that sector s accounts for in the
employment of US commuting zone (CZ) g. Under a parallel-trends assumption, requiring that
the shares Q_{s,g} are uncorrelated to groups' outcome evolutions without treatment, Goldsmith-
Pinkham et al. (2020) show that if the treatment effect is constant then β̂^fe_{2SLS} is consistent for
the constant-effect parameter. As Bartik-TWFE regressions are just a special case of 2SLS-
TWFE regressions, the result in the previous paragraph implies that this conclusion no longer
holds if treatment effects are heterogeneous: then, Bartik regressions may estimate a non-convex
combination of effects. Accordingly, de Chaisemartin and Lei (2021) find that under parallel
trends, β̂^fe_{2SLS} estimates, in Autor et al. (2013), a highly non-convex combination of CZ-and-year-specific
effects of Chinese imports on US employment. Moreover, the weights are correlated
with some CZ characteristics like the offshorability of their employment, which are themselves
likely to be correlated with their employment elasticity to Chinese imports. Instead of assuming
parallel trends, Borusyak et al. (2022) assume that the shocks ∆Zs are as good as random. In
line with the results we discussed in Section 8.1.3, de Chaisemartin and Lei (2021) show that
even if the shocks are as good as random, β̂^fe_{2SLS} may still fail to estimate a convex combination
of treatment effects if those effects change over time, and if ∆Z_s and Z_{s,1} are correlated. When
they revisit Autor et al. (2013), de Chaisemartin and Lei (2021) find that ∆Z_s and Z_{s,1} are very
strongly positively correlated, and that some industry-level characteristics predict ∆Z_s, thus
suggesting that ∆Z_s may not be as good as random.
In this section, we no longer impose Assumption ND and we allow for dynamic effects. We also
no longer assume that T = 2.
In practice, researchers may slightly augment or modify (8.21). They may include treatment
leads in the regression, to test Assumptions NA and PT. They may define the lagged treatments
as equal to 0 at time periods when they are not observed, and estimate the regression in the
full sample. They may also estimate the regression in first difference and without group fixed
effects. Finally, they may include control variables. Results similar to Theorem 20 below apply
to all those variations on (8.21).
Theorem 20 Suppose that Assumptions NA and PT hold, and that for all g and t ≥ K + 1 there
exist real numbers (γ^ℓ_{g,t})_{ℓ∈{0,...,K}} such that for all d in the support of D,
Y_{g,t}(d) = Y_{g,t}(0_t) + Σ_{ℓ=0}^{K} γ^ℓ_{g,t} d_{t−ℓ}. Then, for all ℓ ∈ {0, ..., K},

E[β̂^dl_ℓ] = Σ_{(g,t): D_{g,t−ℓ} ≠ 0, t ≥ K+1} W^{dl,ℓ}_{g,t} γ^ℓ_{g,t} + Σ_{ℓ′=0, ℓ′≠ℓ}^{K} Σ_{(g,t): D_{g,t−ℓ′} ≠ 0, t ≥ K+1} W^{dl,ℓ′}_{g,t} γ^{ℓ′}_{g,t},

where

Σ_{(g,t): D_{g,t−ℓ} ≠ 0, t ≥ K+1} W^{dl,ℓ}_{g,t} = 1,  and  Σ_{(g,t): D_{g,t−ℓ′} ≠ 0, t ≥ K+1} W^{dl,ℓ′}_{g,t} = 0 for all ℓ′ ≠ ℓ.
Theorem 20's key assumption is that

Y_{g,t}(d) = Y_{g,t}(0_t) + Σ_{ℓ=0}^{K} γ^ℓ_{g,t} d_{t−ℓ}, (8.22)
meaning that the functional form of the distributed-lag regression is correctly specified: only the
first K treatment lags affect the outcome, their effect is linear and those lags do not interact.
Even under those strong assumptions, β̂^dl_ℓ, the coefficient on the ℓth treatment lag, estimates
the sum of K + 1 terms. The first term is a weighted sum of the effect of the ℓth treatment lag,
across all (g, t) cells for which that lag is not equal to 0, with weights that sum to one but may
be negative. This term may be biased for the average effect of the ℓth treatment lag, if that
effect varies across (g, t) cells. The remaining K terms are weighted sums of the effects of other
treatment lags, with weights summing to zero. If the effects of the other lags vary across (g, t)
cells, those terms may differ from zero and may contaminate β̂^dl_ℓ.
β̂^dl_0 = −0.0008 (s.e. = 0.0014) and β̂^dl_1 = 0.0050 (s.e. = 0.0015): according to this distributed-lag
TWFE regression, increasing the current number of newspapers insignificantly reduces turnout
by 0.08 percentage points, while increasing its first lag significantly increases turnout by 0.5
percentage points. Use the twowayfeweights Stata package and the other_treatments option
to decompose β̂^dl_0 and β̂^dl_1, and interpret the results.
This section introduces the heterogeneity-robust DID estimators that we proposed in de Chaise-
martin and D’Haultfœuille (forthc.), for designs where the treatment may be non-binary and/or
non-absorbing, and where the outcome may be affected by treatment lags. Those estimators are
computed by the did_multiplegt_dyn Stata and didmultiplegtdyn R commands. The syntax
of the Stata command is described in Chapter 6.
Date of first treatment change. For all g, let Fg = min{t : t ≥ 2, Dg,t ̸= Dg,t−1 } denote
the date at which a group’s treatment changes for the first time. We adopt the convention that
Fg = T + 1 if g’s treatment never changes. In a binary and staggered design, if no group is
treated at t = 1 then Fg reduces to the first date at which g is treated, which is why we use the
same notation as in Chapter 6.
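A minimal sketch of this definition (our implementation, not the packages'):

```python
import numpy as np

# F_g: date of a group's first treatment change, with the convention
# F_g = T + 1 for groups whose treatment never changes.
def first_change(d):
    """d: treatment path (D_g1, ..., D_gT); returns F_g (1-based period)."""
    d = np.asarray(d)
    changed = np.nonzero(d[1:] != d[:-1])[0]
    return int(changed[0]) + 2 if changed.size else len(d) + 1

assert first_change([0, 0, 1, 1]) == 3  # binary staggered: first treated date
assert first_change([2, 2, 2, 2]) == 5  # never changes: T + 1 = 5
assert first_change([1, 3, 0, 0]) == 2  # non-binary, non-staggered
```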
DID estimators applicable in any design where groups do not all experience their
first treatment change at the same date. The estimators below are applicable to any
design where the following condition holds:
Design STAY (Designs with some stayers) ∃(g, g ′ ) such that: (i) Dg,1 = Dg′ ,1 , (ii) Fg ̸= Fg′ .
(i) requires that there exist groups with the same period-one treatment. If groups’ period-one
treatments are i.i.d. draws from a continuous distribution, Dg,1 ̸= Dg′ ,1 for all (g, g ′ ), so (i) fails.
In Section 8.3.4.2, we will extend the estimators below to designs where (i) fails, so (i) is not
really of essence to what follows. (ii) requires that there is heterogeneity in the date at which
groups change treatment for the first time. There are many applications where (ii) holds. Still,
it fails in designs without stayers, where Dg,1 ̸= Dg,2 and Fg = 2 for all g, as will for instance
be the case if Dg,t is the amount of rainfall or the average temperature in location g and year
t: all locations will experience different precipitation levels or temperatures in years one and two.
(ii) also fails if groups all change treatment for the first time at the same date t0 , for instance
due to a universal policy affecting them all: Fg = t0 for all g. In such cases, if all groups are
untreated at period one and receive heterogeneous treatment doses at t0, the design is actually
a heterogeneous adoption design and one can then use the estimators reviewed in Chapter 7.
Assumption PTNC (Parallel trends if groups’ treatment never changes, conditional on their
period-one treatment) ∀(g, g ′ ), if Dg,1 = Dg′ ,1 ∈ D1r , then ∀t ≥ 2,
E[Yg,t (Dg,1,t ) − Yg,t−1 (Dg,1,t−1 )] = E[Yg′ ,t (Dg′ ,1,t ) − Yg′ ,t−1 (Dg′ ,1,t−1 )].
Assumption PTNC requires that if two groups have the same period-one treatment, then they
have the same expected outcome evolution if their treatment never changes. If all groups are
untreated at period 1, D1r = {0}, so Assumption PTNC is equivalent to Assumption PT, the
standard parallel-trends assumption if groups are never treated. Note that Assumption PTNC
restricts only one potential outcome per group, so Assumption PTNC alone does not restrict
groups’ treatment effects.
E[Yg,t (Dg,1,t ) − Yg,t−1 (Dg,1,t−1 )] = E[Yg′ ,t (Dg′ ,1,t ) − Yg′ ,t−1 (Dg′ ,1,t−1 )]. (8.23)
(8.23) is stronger than Assumption PTNC: it requires that all groups, and not just those with the
same period-one treatment, have the same expected evolution if their treatment never changes.
To simplify the remainder of the discussion, let us assume that treatment is binary and there is
at least one group g0 that is untreated at period 1. Then, (8.23) implies that for all groups g
treated at period one and for all t ≥ 2,
E[Y_{g,t}(1_t) − Y_{g,t−1}(1_{t−1})] = E[Y_{g0,t}(0_t) − Y_{g0,t−1}(0_{t−1})],
while Assumption PT, the standard parallel-trends assumption on groups’ never-treated out-
come, implies that
E[Y_{g,t}(0_t) − Y_{g,t−1}(0_{t−1})] = E[Y_{g0,t}(0_t) − Y_{g0,t−1}(0_{t−1})].
Therefore, when combined with Assumption PT, (8.23) implies that for all groups treated at
period one and for all t ≥ 2,

E[Y_{g,t}(1_t) − Y_{g,t}(0_t)] = E[Y_{g,t−1}(1_{t−1}) − Y_{g,t−1}(0_{t−1})]. (8.24)

Interpret (8.24).
(8.24) means that for initially treated groups, the effect of being treated for t periods should
be the same as the effect of being treated for t − 1 periods. By iteration, the effect of being
treated for t periods should be the same as the effect of being treated for one period. This is
an unpalatable restriction: it fails whenever lagged treatments affect the outcome, and it also
rules out time-varying effects. By contrast, when combined with the standard parallel-trends
assumption on groups’ never-treated outcome, Assumption PTNC implies that for all groups
such that Dg,1 = 1 and for all t ≥ 2,
E[Y_{g,t}(1_t) − Y_{g,t}(0_t)] − E[Y_{g,t−1}(1_{t−1}) − Y_{g,t−1}(0_{t−1})] does not vary across g.
This means that the incremental effect of one treatment period should be the same in every
initially treated group. This is a strong restriction, but unlike (8.24), it does not rule out effects
of lagged treatments on the outcome, it allows for time-varying effects, and it also allows for
heterogeneous effects across groups as the initial effect of being treated for one period can differ
across groups.
For all g, let

T_g = max_{g′: D_{g′,1} = D_{g,1}} F_{g′} − 1

denote the last period where there is still a group with the same period-one treatment as g and
whose treatment has not changed since the start of the panel. For any g such that F_g ≤ T_g, and
for any ℓ ∈ {1, ..., T_g − (F_g − 1)}, let
AVSQ_{g,ℓ} = E[Y_{g,F_g−1+ℓ} − Y_{g,F_g−1+ℓ}(D_{g,1}, ..., D_{g,1})]
be the expected difference between group g’s actual outcome at Fg − 1 + ℓ and its counterfactual
“status quo” outcome if its treatment had remained equal to its period-one value from period
one to Fg − 1 + ℓ. We refer to AVSQg,ℓ as the actual-versus-status-quo (AVSQ) event-study (ES)
effect of g at Fg − 1 + ℓ. In a binary and staggered design, if groups are all untreated at period
1, then AVSQg,ℓ reduces to the effect TErg,ℓ that we considered in Chapter 6.
Estimation of the non-normalized actual-versus-status-quo effects. For all (g, t), let

C_{g,t} = {g′ : D_{g′,1} = D_{g,1}, D_{g′,t′} = D_{g′,1} ∀t′ ≤ t}

be the set of groups g′ with the same period-one treatment as g, and which have kept the same
treatment from period 1 to t. Recall that for any set A, #A denotes its number of elements, i.e.
its cardinality. For every g such that F_g ≤ T_g, and every ℓ ∈ {1, ..., T_g − (F_g − 1)}, #C_{g,F_g−1+ℓ} > 0.
To estimate AVSQ_{g,ℓ}, we use

\widehat{AVSQ}_{g,ℓ} = Y_{g,F_g−1+ℓ} − Y_{g,F_g−1} − (1/#C_{g,F_g−1+ℓ}) Σ_{g′∈C_{g,F_g−1+ℓ}} (Y_{g′,F_g−1+ℓ} − Y_{g′,F_g−1}).
Theorem 21 If Assumptions NA and PTNC hold, then for every (g, ℓ) such that 1 ≤ ℓ ≤
T_g − (F_g − 1), E[\widehat{AVSQ}_{g,ℓ}] = AVSQ_{g,ℓ}.
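The DID just described can be sketched on a toy panel. The implementation below is our simplified version (the exact estimator is the one in de Chaisemartin and D'Haultfœuille, forthc., as computed by did_multiplegt_dyn): compare group g's outcome evolution from F_g − 1 to F_g − 1 + ℓ to that of groups with the same period-one treatment whose treatment has not yet changed by F_g − 1 + ℓ.

```python
import numpy as np

def avsq_hat(Y, D, F, g, l):
    """Y, D: (G, T) arrays; F: 1-based first-change dates; g: 0-based index."""
    t0, t1 = F[g] - 2, F[g] - 2 + l  # 0-based indices of F_g - 1 and F_g - 1 + l
    controls = [h for h in range(len(F))
                if D[h, 0] == D[g, 0] and F[h] > F[g] - 1 + l]
    ctrl_evol = np.mean([Y[h, t1] - Y[h, t0] for h in controls])
    return (Y[g, t1] - Y[g, t0]) - ctrl_evol

# Toy example: group 0 becomes treated at period 2 (effect +2 in each treated
# period), group 1 stays untreated; both share a common trend of +1 per period.
D = np.array([[0, 1, 1], [0, 0, 0]])
Y = np.array([[0.0, 3.0, 4.0], [0.0, 1.0, 2.0]])
F = [2, 4]
print(avsq_hat(Y, D, F, 0, 1), avsq_hat(Y, D, F, 0, 2))  # 2.0 2.0
```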
No-crossing condition. Assume that g0 is such that D_{g0,1} = 1, D_{g0,2} = 2, D_{g0,3} = 0. Then,

AVSQ_{g0,2} = E[Y_{g0,3}(1, 2, 0) − Y_{g0,3}(1, 1, 1)]
= E[Y_{g0,3}(1, 2, 0) − Y_{g0,3}(1, 1, 0)] − E[Y_{g0,3}(1, 1, 1) − Y_{g0,3}(1, 1, 0)]
is the difference between the effect of increasing g0 ’s period-2 treatment from 1 to 2, and the
effect of increasing g0 ’s period-3 treatment from 0 to 1. One could have that both effects are
positive but AVSQg0 ,2 is negative. Beyond this example, for all (g, ℓ) such that ℓ periods after
its first treatment change, g has experienced both a treatment strictly below and a treatment
strictly above its period-one treatment, AVSQg,ℓ can be written as a linear combination, with
negative weights, of the effects of increasing different treatment lags. Throughout this section,
we assume away the existence of such (g, ℓ)s.
∀g ∈ {1, ..., G}, either Dg,t ≥ Dg,1 ∀t, or Dg,t ≤ Dg,1 ∀t. (8.25)
(8.25) automatically holds if all groups are untreated at baseline. It also holds automatically
when the treatment is binary, or when groups’ treatment can only change once. When (8.25)
fails, one can discard from the sample all cells (g, t) such that g has experienced both a treatment
strictly below and a treatment strictly above its period-one treatment at some point from period
one to t. The did_multiplegt_dyn command automatically drops those cells. This yields an
unbalanced panel of groups where (8.25) holds by construction, on which the estimators below
can be applied. We impose (8.25) to avoid the notational burden of defining estimators on an
unbalanced panel.
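A sketch of this trimming step (our implementation; did_multiplegt_dyn performs it automatically):

```python
import numpy as np

# Flag the (g, t) cells to keep: a cell is dropped once group g has been both
# strictly above and strictly below its period-one treatment at some point.
def keep_mask(D):
    D = np.asarray(D)
    above = np.maximum.accumulate((D > D[:, [0]]).astype(int), axis=1)
    below = np.maximum.accumulate((D < D[:, [0]]).astype(int), axis=1)
    return (above * below) == 0

D = np.array([[1, 2, 0, 0],   # crosses its baseline dose: drop periods 3 and 4
              [0, 1, 1, 0]])  # never goes below its baseline of 0: keep all
mask = keep_mask(D)
```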
Definition of the non-normalized AVSQ ES effects. Let L = maxg (Tg − (Fg − 1)) denote
the largest ℓ such that AVSQg,ℓ can be estimated for at least one g. Under Design STAY, L ≥ 1.
For every ℓ ∈ {1, ..., L}, let
Sℓ = {g : Fg − 1 + ℓ ≤ Tg }
be the set of groups for which AVSQ_{g,ℓ} can be estimated. For all g such that F_g ≤ T, let

S_g = 1{D_{g,F_g} > D_{g,1}} − 1{D_{g,F_g} < D_{g,1}}

be equal to 1 (resp. −1) for groups whose treatment increases (resp. decreases) at F_g. Then, let
AVSQ_ℓ = (1/#S_ℓ) Σ_{g∈S_ℓ} S_g AVSQ_{g,ℓ}, (8.26)
be the average of Sg AVSQg,ℓ , referred to as non-normalized AVSQ ES effect ℓ. Why is it that
for groups such that Sg = −1, AVSQg,ℓ is multiplied by −1 in the definition of AVSQℓ ?
Under (8.25), for groups with Sg = 1, Dg,t ≥ Dg,1 for all t, so AVSQg,ℓ is the effect of having been
exposed to a weakly higher treatment dose for ℓ periods. Conversely, for groups with Sg = −1,
Dg,t ≤ Dg,1 for all t, so AVSQg,ℓ is the effect of having been exposed to a weakly lower dose for
ℓ periods. Taking the negative of AVSQg,ℓ for those groups ensures that AVSQℓ is an average
effect of having been exposed to a weakly larger dose for ℓ periods. In a binary and staggered
design, AVSQℓ reduces to ATTℓ , the average effect of having been treated rather than untreated
for ℓ periods. Therefore, AVSQℓ generalizes ATTℓ to non-binary and/or non-staggered designs.
Estimation of the non-normalized event-study effects. For every ℓ ∈ {1, ..., L}, let

\widehat{AVSQ}_ℓ = (1/#S_ℓ) Σ_{g∈S_ℓ} S_g \widehat{AVSQ}_{g,ℓ}.

Theorem 21 implies that \widehat{AVSQ}_ℓ is unbiased for AVSQ_ℓ under Assumptions NA and PTNC.
For instance, in a design where (8.1) holds (D_{g,t} = 1{E_g ≥ t ≥ F_g}), for a group g whose
treatment has already stopped by F_g − 1 + ℓ (E_g < F_g − 1 + ℓ), AVSQ_{g,ℓ}
is the effect of having been treated for E_g − (F_g − 1) periods, F_g − 1 + ℓ − E_g periods before the
outcome is measured. Thus, the number and the recency of the treatment periods that generate
AVSQ_{g,ℓ} vary across groups, complicating the interpretation of AVSQ_ℓ. Similarly, with three
periods and three groups such that (D_{1,1} = 0, D_{1,2} = 4, D_{1,3} = 0), (D_{2,1} = 0, D_{2,2} = 2, D_{2,3} = 3),
and (D_{3,1} = 0, D_{3,2} = 0, D_{3,3} = 0), AVSQ_2 is the average of E(Y_{1,3}(0, 4, 0) − Y_{1,3}(0, 0, 0)) and
E(Y_{2,3}(0, 2, 3) − Y_{2,3}(0, 0, 0)). Thus, the magnitude and timing of the treatment increments
generating AVSQ_{g,ℓ} vary across groups, which again complicates the interpretation of AVSQ_ℓ.
More disaggregated effects. If the design is such that the number of treatment paths is
low relative to G, then one may be able to precisely estimate the average of AVSQg,ℓ separately
across all groups with the same path, thus yielding estimates of the average effects of specific
treatment paths. The did_multiplegt_dyn command estimates treatment-path specific event-
study effects when the by_path option is specified. For instance, if (8.1) holds (Dg,t = 1{Eg ≥
t ≥ Fg }) the number of treatment paths may often be low enough for this solution to be practical.
But in more complicated designs, the number of paths may be too large for this solution to be
practical, especially as ℓ increases. In such instances, we still recommend that researchers report
the period-1-to-(F_g − 1 + ℓ) treatment paths and their distribution: this information may be
helpful to interpret \widehat{AVSQ}_ℓ. The did_multiplegt_dyn command reports the paths and their
distribution when the design option is specified.
Alternative estimators. In a binary and staggered design, \widehat{AVSQ}_1 is numerically equivalent
to the DID_M estimator in de Chaisemartin and D'Haultfœuille (2020), and for all ℓ, \widehat{AVSQ}_ℓ is
numerically equivalent to the event-study estimator of the effect of ℓ periods of exposure to
treatment of Callaway and Sant'Anna (2021), using the not-yet treated as controls. Outside of
binary and staggered designs, when all groups are untreated at period one, \widehat{AVSQ}_ℓ is numerically
equivalent to the estimator obtained by redefining the treatment as an indicator equal to one
if group g's treatment has ever changed at t, and then computing the event-study estimator
of ℓ periods of exposure to treatment of Callaway and Sant'Anna (2021) with this binarized
and staggerized treatment. This "binarize and staggerize" idea has for instance been used by
Deryugina (2017) or Krolikowski (2018). When groups' period-one treatment varies, the two
estimators are not equivalent:² \widehat{AVSQ}_ℓ only compares switchers and not-yet-switchers with the
same period-one treatment, whereas the estimator of Callaway and Sant'Anna (2021) applied
to this binarized and staggerized treatment compares switchers and non-switchers with different
period-one treatments. Then, that estimator relies on (8.23), an assumption which, when com-
bined with Assumption PT, essentially rules out effects of lagged treatments on the outcome, as
discussed above.
Normalized AVSQ ES effects. For every g such that F_g ≤ T_g and ℓ ∈ {1, ..., T_g − (F_g − 1)}, let

AVSQ^D_{g,ℓ} = Σ_{k=0}^{ℓ−1} (D_{g,F_g+k} − D_{g,1})

be the difference between the total treatment dose received by group g from F_g to F_g − 1 + ℓ,
and the total treatment dose it would have received in the status-quo counterfactual. Then, let

AVSQ^n_{g,ℓ} = AVSQ_{g,ℓ} / AVSQ^D_{g,ℓ}.
² For instance, East, Miller, Page and Wherry (2023) consider designs where groups' period-one treatment
varies, binarize and staggerize the treatment, and compute the event-study estimators of Callaway and
Sant'Anna (2021).
AVSQ^n_{g,ℓ} is a weighted average of the effects of the current and ℓ − 1 first treatment
lags. For k ∈ {0, ..., ℓ − 1}, let³

s_{g,ℓ,k} = E[Y_{g,F_g−1+ℓ}(D_{g,1,F_g−1}, D_{g,F_g}, ..., D_{g,F_g−1+ℓ−k−1}, D_{g,F_g−1+ℓ−k}, D_{g,1,k})
− Y_{g,F_g−1+ℓ}(D_{g,1,F_g−1}, D_{g,F_g}, ..., D_{g,F_g−1+ℓ−k−1}, D_{g,1}, D_{g,1,k})] / (D_{g,F_g−1+ℓ−k} − D_{g,1})

be the slope of the expected potential outcome function of group g at F_g − 1 + ℓ with respect
to its kth treatment lag (the D_{g,F_g−1+ℓ−k} argument above), when that lag is switched from its status-quo
counterfactual value D_{g,1} to its actual value D_{g,F_g−1+ℓ−k}, whereas all its previous treatments are
held at their actual values, and all its subsequent treatments are held at their status-quo value.
For any k ∈ {0, ..., ℓ − 1}, let

w_{g,ℓ,k} = (D_{g,F_g−1+ℓ−k} − D_{g,1}) / AVSQ^D_{g,ℓ}.
Theorem 22 For every (g, ℓ) such that 1 ≤ ℓ ≤ T_g − (F_g − 1), AVSQ^n_{g,ℓ} = Σ_{k=0}^{ℓ−1} w_{g,ℓ,k} s_{g,ℓ,k}.
Theorem 22 shows that AVSQng,ℓ is a weighted average of the slopes of g’s potential outcome at
Fg − 1 + ℓ with respect to its ℓ − 1 first treatment lags, where for k ∈ {0, ..., ℓ − 1}, the effect
of the kth lag receives a weight proportional to the absolute value of the difference between g’s
kth treatment lag and its status-quo treatment.
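A sketch computing the Theorem 22 weights from one group's treatment path (our implementation, assuming (8.25) holds so all incremental doses share the same sign):

```python
import numpy as np

def theorem22_weights(d, F, l):
    """d: path over periods 1..T; F: F_g; l: horizon. Returns w_{g,l,k}, k = 0..l-1."""
    d = np.asarray(d, dtype=float)
    # The kth lag at date F - 1 + l is the period-(F - 1 + l - k) treatment;
    # its incremental dose is D_{g,F-1+l-k} - D_{g,1}.
    incr = np.array([d[F - 1 + l - k - 1] - d[0] for k in range(l)])
    return incr / incr.sum()  # the denominator is AVSQ^D_{g,l}

# Staggered path with intensity 3 (D_{g,t} = 3 for t >= F_g = 2): equal weights 1/l.
w = theorem22_weights([0, 3, 3, 3], F=2, l=3)
```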
Theorem 22 in some specific designs. For concreteness, we rewrite the result in Theorem
22 in two specific designs. First, in binary-and-staggered designs, Theorem 22 reduces to
AVSQ^n_{g,ℓ} = (1/ℓ) Σ_{k=0}^{ℓ−1} E[Y_{g,F_g−1+ℓ}(0_{F_g−1}, 1_{ℓ−k−1}, 1, 0_k) − Y_{g,F_g−1+ℓ}(0_{F_g−1}, 1_{ℓ−k−1}, 0, 0_k)].
³ We use the convention that for k = 0 and any d, d_k stands for the empty vector. Accordingly, when k = ℓ − 1,
(D_{g,F_g}, ..., D_{g,F_g−1+ℓ−k−1}) also stands for the empty vector. We sometimes refer to a cell's current treatment as
its 0th treatment lag, and use the convention 0/0 = 0.
Then, AVSQng,ℓ is the simple average, across k ranging from 0 to ℓ − 1, of the effect of switching
the kth treatment lag from 0 to 1, holding previous treatments at 1 and subsequent treatments
at 0. Second, if (8.2) holds (Dg,t = Ig 1{t ≥ Fg }), Theorem 22 reduces to
AVSQ^n_{g,ℓ} = (1/ℓ) Σ_{k=0}^{ℓ−1} E[Y_{g,F_g−1+ℓ}(0_{F_g−1}, I_{g,ℓ−k−1}, I_g, 0_k) − Y_{g,F_g−1+ℓ}(0_{F_g−1}, I_{g,ℓ−k−1}, 0, 0_k)] / I_g.
Thus, in staggered designs with group-specific treatment intensities, AVSQng,ℓ is the average,
across k ranging from 0 to ℓ − 1, of the effect of switching the kth lag from 0 to Ig , normalized
by Ig .
For every ℓ ∈ {1, ..., L}, let

AVSQ^D_ℓ = (1/#S_ℓ) Σ_{g∈S_ℓ} |AVSQ^D_{g,ℓ}|.
Note the following relation between the non-normalized and normalized event-study effects:

AVSQ^n_ℓ = AVSQ_ℓ / AVSQ^D_ℓ. (8.27)
AVSQ^n_ℓ is a weighted average of all the AVSQ^n_{g,ℓ} that can be estimated, with weights proportional
to |AVSQ^D_{g,ℓ}|. It follows directly from Theorem 21 that⁴

\widehat{AVSQ}^n_ℓ := (1/#S_ℓ) Σ_{g∈S_ℓ} (|AVSQ^D_{g,ℓ}| / AVSQ^D_ℓ) (\widehat{AVSQ}_{g,ℓ} / AVSQ^D_{g,ℓ})

is unbiased for AVSQ^n_ℓ. As in (8.27), there is the following relationship between the normalized
and non-normalized ES estimators:

\widehat{AVSQ}^n_ℓ = \widehat{AVSQ}_ℓ / AVSQ^D_ℓ.
⁴ In designs where some gs are such that |AVSQ^D_{g,ℓ}| is close to zero, the estimator of the unweighted average
of the AVSQ^n_{g,ℓ}s will suffer from a small-denominator problem, similar to that we already discussed in Chapter 7
when we compared the ATT and the WATT. This is what leads us to consider a weighted average of the AVSQ^n_{g,ℓ}s.
AVSQ^n_ℓ is a weighted average of the effects of the current and ℓ − 1 first treatment
lags, with weights that can be computed. It follows from Theorem 22 that AVSQ^n_ℓ is a
weighted average of the effects of groups' current and ℓ − 1 first treatment lags on their outcome.
The total weight assigned by AVSQ^n_ℓ to the effect of the kth lag (for 0 ≤ k ≤ ℓ − 1) is equal to

w_{ℓ,k} = (1/(#S_ℓ AVSQ^D_ℓ)) Σ_{g∈S_ℓ} |D_{g,F_g−1+ℓ−k} − D_{g,1}|.

Compute w_{ℓ,k} in designs where groups' treatment can only change once, meaning that D_{g,t} =
D_{g,F_g} for all t ≥ F_g.
In such designs, AVSQ^D_{g,ℓ} = Σ_{k=0}^{ℓ−1} (D_{g,F_g+k} − D_{g,1}) = ℓ(D_{g,F_g} − D_{g,1}). Therefore,

AVSQ^D_ℓ = (1/#S_ℓ) Σ_{g∈S_ℓ} ℓ |D_{g,F_g} − D_{g,1}|,
so w_{ℓ,k} = 1/ℓ. Then, AVSQ^n_1 is an effect of the current treatment on the outcome, AVSQ^n_2 is
a weighted average of the effect of the current treatment and of the first treatment lag on the
outcome with weights 1/2, AVSQ^n_3 is a weighted average of the effect of the current treatment
and of the first and second treatment lags on the outcome with weights 1/3, etc. When groups'
treatment can change more than once, we recommend reporting k 7→ wℓ,k , to document which
lags contribute the most to AVSQnℓ . The did_multiplegt_dyn command reports k 7→ wℓ,k when
the normalized_weights option is specified.
Estimating separately the effect of the current and lagged treatments.∗ Researchers
estimating distributed-lag TWFE regressions seek to separately estimate the effect of the current
and lagged treatments on the outcome. By estimating normalized AVSQ ES effects, they can
estimate weighted averages of the effects of the current and lagged treatments on the outcome,
thus fulfilling a related but different estimation goal. For instance, without further assumptions,
one cannot use AVSQn2 to tease out the effect of the current treatment and of its first lag on
the outcome. In a working paper version of de Chaisemartin and D’Haultfœuille (forthc.) (see
Section 4 of de Chaisemartin and D'Haultfœuille, 2021), it is shown that if one assumes that

Y_{g,t}(d) = Y_{g,t}(0_t) + Σ_{l=0}^{K} γ^l_g d_{t−l},

meaning that lags' effects are additively separable, linear, and constant over time, then there is
an invertible linear system relating (γ^l_g)_{l∈{0,...,T_g−F_g}} and (AVSQ_{g,ℓ})_{ℓ∈{1,...,T_g−(F_g−1)}}. Therefore,
(γ^l_g)_{l∈{0,...,T_g−F_g}} can be unbiasedly estimated, and averages of the effects of the current and lagged
treatments on the outcome across groups can also be unbiasedly estimated. Those estimators
allow for heterogeneous effects across groups, unlike distributed-lag TWFE estimators, but they
still rely on strong linearity and separability assumptions. We are not aware of a Stata or R
package computing those estimators.
In this section, we strengthen (8.25) by assuming that groups always have a weakly-larger treat-
ment than their period-one treatment:

D_{g,t} ≥ D_{g,1} ∀(g, t). (8.28)

We impose (8.28) to reduce the notational burden. When it fails, one can just conduct the
cost-benefit analysis separately for groups with S_g = 1 and for groups with S_g = −1.
The parameter we consider is

ACE := Σ_{g:F_g≤T_g} Σ_{ℓ=1}^{T_g−(F_g−1)} AVSQ_{g,ℓ} / Σ_{g:F_g≤T_g} Σ_{ℓ=1}^{T_g−(F_g−1)} (D_{g,F_g−1+ℓ} − D_{g,1}).

As explained below, ACE corresponds to an average cumulative effect per unit of treatment,
whence its name. To motivate this parameter, let us take the perspective of a planner, seeking
to conduct a cost-benefit analysis comparing groups' actual treatments D to the counterfactual
“status-quo” scenario where they would have always kept their period-one treatment. Assume
that the outcome is a measure of output, such as agricultural yields or wages, expressed in
monetary units. Assume also that the treatment is costly, with a cost linear in dose, and known
to the analyst. Then, let cg,ℓ ≥ 0 denote the cost of administering one treatment unit in group
g at period Fg − 1 + ℓ. Assuming that the planner’s discount factor is equal to 1, groups’ actual
treatments are beneficial in monetary terms relative to the status quo, up to period T_g, if and
only if

Σ_{g:F_g≤T_g} Σ_{ℓ=1}^{T_g−(F_g−1)} AVSQ_{g,ℓ} − Σ_{g:F_g≤T_g} Σ_{ℓ=1}^{T_g−(F_g−1)} c_{g,ℓ}(D_{g,F_g−1+ℓ} − D_{g,1}) > 0 ⇔ ACE > c,

where c is the average cost of a treatment unit, weighted by incremental doses.
In binary-and-staggered designs, if no group is treated at period one and there are never-treated
groups, the ACE reduces to a well-known treatment-effect parameter, which one?
In such designs, the ACE's numerator sums treatment effects across all treated (g, t) cells, while
its denominator counts those cells, so the ACE reduces to the ATT. Thus, the ACE generalizes
the ATT to non-binary and/or non-staggered designs.
The ACE is an average cumulative effect per unit of treatment. Let us first consider
a simple example with two groups and four periods, such that D1,1 = 0, D1,2 = 1, D1,3 = 1, and
D_{1,4} = 0, while group 2 is never treated. In this example, the formula for the ACE reduces to

ACE = E[Y_{1,2}(0, 1) − Y_{1,2}(0, 0) + Y_{1,3}(0, 1, 1) − Y_{1,3}(0, 0, 0) + Y_{1,4}(0, 1, 1, 0) − Y_{1,4}(0, 0, 0, 0)] / (1 + 1 + 0)
= (1/2) E[Y_{1,2}(0, 1) − Y_{1,2}(0, 0) + Y_{1,3}(0, 1, 0) − Y_{1,3}(0, 0, 0) + Y_{1,4}(0, 1, 0, 0) − Y_{1,4}(0, 0, 0, 0)]
+ (1/2) E[Y_{1,3}(0, 1, 1) − Y_{1,3}(0, 1, 0) + Y_{1,4}(0, 1, 1, 0) − Y_{1,4}(0, 1, 0, 0)]. (8.29)
The first expectation in (8.29) is the cumulative effect produced by group 1’s period-2 treat-
ment, at periods 2, 3, and 4, relative to the situation where it would have always remained
untreated. The second expectation in (8.29) is the cumulative effect produced by group 1’s
period-3 treatment, at periods 3 and 4, conditional on its period-2 treatment and relative to
the situation where it would have been untreated at periods 3 and 4. Accordingly, ACE is the
average effect of the two treatment doses received by group 1, cumulated across all periods after
each of those two doses is received. A similar interpretation holds beyond this simple example.
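The second equality in (8.29) is an algebraic identity — the cross terms involving Y_{1,3}(0, 1, 0) and Y_{1,4}(0, 1, 0, 0) cancel — which can be checked numerically with arbitrary made-up values for the potential outcomes:

```python
import random

random.seed(0)
# Arbitrary values for each potential outcome appearing in (8.29).
paths = [(0, 1), (0, 0), (0, 1, 1), (0, 0, 0), (0, 1, 0),
         (0, 1, 1, 0), (0, 0, 0, 0), (0, 1, 0, 0)]
Y = {p: random.random() for p in paths}
lhs = (Y[0, 1] - Y[0, 0] + Y[0, 1, 1] - Y[0, 0, 0]
       + Y[0, 1, 1, 0] - Y[0, 0, 0, 0]) / (1 + 1 + 0)
rhs = 0.5 * (Y[0, 1] - Y[0, 0] + Y[0, 1, 0] - Y[0, 0, 0]
             + Y[0, 1, 0, 0] - Y[0, 0, 0, 0]) \
    + 0.5 * (Y[0, 1, 1] - Y[0, 1, 0] + Y[0, 1, 1, 0] - Y[0, 1, 0, 0])
assert abs(lhs - rhs) < 1e-12  # identity: cross terms cancel
```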
For k ∈ {0, ..., ℓ − 1}, let AVSQg,ℓ,k be the numerator of the slope sg,ℓ,k defined above. AVSQg,ℓ,k
is the effect, on the expected potential outcome of group g at Fg − 1 + ℓ, of switching its kth
treatment lag from its status-quo to its actual value, whereas all its previous treatments are held
at their actual values, and all its subsequent treatments are held at their status-quo value. As
$$\mathrm{AVSQ}_{g,\ell} = \sum_{k=0}^{\ell-1}\mathrm{AVSQ}_{g,\ell,k},$$
the ACE can also be decomposed into those elementary $\mathrm{AVSQ}_{g,\ell,k}$ effects.
Average number of time periods over which the effect of a dose is cumulated. The
effect of switching g’s period Fg + k treatment from Dg,1 to Dg,Fg +k is cumulated from period
Fg + k to Tg , namely over Tg − Fg − k + 1 periods. To interpret ACE, one may compute
$$\frac{\sum_{g:F_g\le T_g}\sum_{k=0}^{T_g-F_g}\left(D_{g,F_g+k}-D_{g,1}\right)\left(T_g-F_g-k+1\right)}{\sum_{g:F_g\le T_g}\sum_{k=0}^{T_g-F_g}\left(D_{g,F_g+k}-D_{g,1}\right)},$$
the average number of time periods over which the effect of a dose is cumulated, across all
incremental doses received by switchers over the study period. One may then divide ACE by
this average number of time periods, to get an average effect of being exposed to one dose of
treatment for one period. The did_multiplegt_dyn command reports this average number of
time periods.
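The ratio above can be sketched in code. The following is a minimal illustration (our own, not the did_multiplegt_dyn implementation), assuming a balanced panel with $T_g = T$ for every group, stored as a dict mapping each group to its treatment path; function and variable names are ours.

```python
def first_change(D):
    """F_g: first period (1-indexed) at which the treatment differs from D_{g,1}."""
    for t in range(1, len(D)):
        if D[t] != D[0]:
            return t + 1
    return None  # never-switcher


def avg_cumulation_periods(panel):
    """Average number of periods over which the effect of an incremental dose
    is cumulated, across all doses received by switchers."""
    num = den = 0.0
    for D in panel.values():
        T, Fg = len(D), first_change(D)
        if Fg is None:
            continue
        for k in range(T - Fg + 1):
            dose = D[Fg - 1 + k] - D[0]     # D_{g,F_g+k} - D_{g,1} (lists 0-indexed)
            num += dose * (T - Fg - k + 1)  # dose cumulated over periods F_g+k, ..., T
            den += dose
    return num / den
```

In the two-group example above ($D_{1,\cdot} = (0,1,1,0)$, group 2 never treated), group 1's first dose is cumulated over three periods and its second over two, so the function returns 2.5.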
8.3.3 Inference
de Chaisemartin and D’Haultfœuille (forthc.) propose analytic confidence intervals (CIs) for
the AVSQℓ , AVSQnℓ , and ACE effects, based on asymptotic approximations where the number
of groups goes to infinity, under the assumption that groups are independent. We recommend,
as in Section 3.3.2, that researchers using those CIs with less than 40 switching groups or with
less than 40 control groups perform simulations tailored to their data to assess their coverage
rate.5 Those CIs are conservative conditional on the design. They can be especially conservative
when there are many values of (Dg,1 , Fg ) such that only one group has that value: with only one
group, the variance of the outcome evolution across groups with that value cannot be unbiasedly
estimated. If this issue comes from the fact that Dg,1 takes many different values, then researchers
may treat Dg,1 as a continuous treatment variable and use the estimators proposed in Section
8.3.4.2 below. If this issue comes from the fact that $F_g$ takes many different values, researchers may coarsen their time variable a bit (e.g., aggregate a daily panel to the weekly level), to ensure that most values of $F_g$ are shared by at least two groups. A third possibility is to use bootstrap rather than analytic CIs.
5. When estimating AVSQℓ , the number of treated groups is #Sℓ . The number of control groups is the
cardinality of the union of {g ′ : Dg′ ,1 = Dg,1 , Fg′ > Fg − 1 + ℓ} across all g for which AVSQg,ℓ can be estimated.
In words, the number of control groups is the number of groups used as controls to estimate AVSQg,ℓ for at least
one g.
8.3.4 Extensions
de Chaisemartin and D’Haultfœuille (forthc.) propose pre-trend estimators one can use to test
Assumptions NA and PTNC. For any g : 3 ≤ Fg ≤ Tg and ℓ ∈ {1, ..., min(Tg − (Fg − 1), Fg − 2)},
let
$$\widehat{\mathrm{AVSQ}}_{g,-\ell} = Y_{g,F_g-1-\ell} - Y_{g,F_g-1} - \frac{1}{\#C_{g,\ell}}\sum_{g'\in C_{g,\ell}}\left(Y_{g',F_g-1-\ell} - Y_{g',F_g-1}\right).$$
Like $\widehat{\mathrm{AVSQ}}_{g,\ell}$, $\widehat{\mathrm{AVSQ}}_{g,-\ell}$ compares group $g$ to $C_{g,\ell}$, the groups with the same baseline treatment as $g$ that have not switched treatment yet at $F_g-1+\ell$. But unlike $\widehat{\mathrm{AVSQ}}_{g,\ell}$, it compares those groups' outcome evolutions from period $F_g-1$ to period $F_g-1-\ell$, namely before group $g$'s treatment changes for the first time. Accordingly, $\widehat{\mathrm{AVSQ}}_{g,-\ell}$ assesses if $g$ and its control groups experience the same evolution of their status-quo outcome over $\ell$ periods, the number of periods over which parallel trends has to hold for $\widehat{\mathrm{AVSQ}}_{g,\ell}$ to be unbiased for $\mathrm{AVSQ}_{g,\ell}$. One can show that under Assumptions NA and PTNC,
$$E\left[\widehat{\mathrm{AVSQ}}_{g,-\ell}\right] = 0.$$
Then, let $S_\ell^{pl} = \{g : 1 \le F_g-1-\ell,\; F_g-1+\ell \le T_g\}$ be the set of groups for which $\widehat{\mathrm{AVSQ}}_{g,\ell}$ can be computed ($F_g-1+\ell \le T_g$) and for which $\widehat{\mathrm{AVSQ}}_{g,-\ell}$ can also be computed ($F_g-1-\ell \ge 1 \Leftrightarrow \ell \le F_g-2$), and let
$$\widehat{\mathrm{AVSQ}}_{-\ell} = \frac{1}{\#S_\ell^{pl}}\sum_{g\in S_\ell^{pl}} S_g\,\widehat{\mathrm{AVSQ}}_{g,-\ell}.$$
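As an illustration, a minimal sketch of $\widehat{\mathrm{AVSQ}}_{g,\ell}$ and its placebo counterpart $\widehat{\mathrm{AVSQ}}_{g,-\ell}$, assuming outcomes and treatments are stored as dicts of per-period lists and that `F` maps each group to $F_g$ (with $F_g = T+1$ conventionally for never-switchers). This is our own sketch, not the did_multiplegt_dyn implementation, and all names are ours.

```python
def avsq_hat(g, ell, Y, D, F, placebo=False):
    """DID of group g vs. its controls C_{g,ell}: groups with the same
    baseline treatment that have not yet switched at F_g-1+ell. With
    placebo=True, compares evolutions from F_g-1 to F_g-1-ell instead
    of F_g-1 to F_g-1+ell (periods are 1-indexed, lists 0-indexed)."""
    base = F[g] - 1                              # period F_g - 1
    t = base - ell if placebo else base + ell
    controls = [gp for gp in Y
                if gp != g and D[gp][0] == D[g][0] and F[gp] > base + ell]
    own = Y[g][t - 1] - Y[g][base - 1]
    ctrl = sum(Y[gp][t - 1] - Y[gp][base - 1] for gp in controls) / len(controls)
    return own - ctrl
```

With three groups and three periods, one switcher at period 2, one at period 3, and one never-switcher, both the event-study estimate and the placebo of the later switcher can be computed from the same function.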
The estimators above compare switchers and not-yet-switchers with the same period-one treat-
ment. The challenge with a continuous treatment is that the sample does not contain switchers
and not-yet-switchers with the same period-one treatment. Then, to estimate E[Yg,t (Dg,1,t ) −
Yg,t−1 (Dg,1,t−1 )], a switcher’s outcome evolution in the status-quo counterfactual, we cannot just
use the average t − 1-to-t outcome evolution of not-yet-switchers with the same Dg,1 . To circum-
vent this issue, de Chaisemartin and D’Haultfœuille (forthc.) propose to replace Assumption
PTNC, the parallel-trends assumption on the status-quo outcome, by the following, stronger
condition: for all $t \ge 2$,
$$E[Y_{g,t}(D_{g,1,t}) - Y_{g,t-1}(D_{g,1,t-1})] = \sum_{k=0}^{K}\gamma_{k,t}\,D_{g,1}^{k}, \qquad (8.30)$$
for some integer K. On top of assuming that groups with the same period-one treatment all have
the same counterfactual outcome trends if their treatment does not change, (8.30) also assumes
a functional form, namely a degree-K polynomial, for how those counterfactual outcome trends
vary with $D_{g,1}$. Then, $(\gamma_{0,t}, \dots, \gamma_{K,t})$ can be unbiasedly estimated by regressing $Y_{g,t} - Y_{g,t-1}$ on $(1, D_{g,1}, \dots, D_{g,1}^K)$, in the sample of groups such that $F_g > t$. With those estimators in hand, one can use
$$Y_{g,F_g-1+\ell} - Y_{g,F_g-1} - \sum_{t=F_g}^{F_g-1+\ell}\sum_{k=0}^{K}\widehat{\gamma}_{k,t}\,D_{g,1}^{k}$$
to estimate their AVSQg,ℓ effect. The corresponding estimators are computed by the did_multiplegt_dyn
command, when the continuous(#) option is specified. The option’s argument is K, the poly-
nomial degree assumed by the researcher in (8.30). Note that those estimators are closely related
to the DID estimators with covariates discussed in Chapter 4, with the polynomial in the baseline treatment $(1, D_{g,1}, \dots, D_{g,1}^K)$ playing the role of the covariates. Importantly, and like
some of the DID estimators with covariates discussed in Chapter 4, the estimators discussed in
this section are parametric and rely on the researcher’s choice of a functional form for groups’
outcome evolution under the status-quo counterfactual. A non-parametric estimator could be constructed, leveraging ideas similar to those in de Chaisemartin et al. (2022), a paper discussed below: this is an interesting avenue for future research.
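The two steps just described can be sketched as follows (a minimal illustration of the procedure, not the did_multiplegt_dyn implementation; all names are ours): first estimate the $\gamma_{k,t}$ by OLS among not-yet-switchers, then subtract the fitted status-quo trend from the switcher's observed outcome evolution.

```python
import numpy as np

def gamma_hat(t, K, Y, D1, F):
    """OLS of Y_{g,t} - Y_{g,t-1} on (1, D_{g,1}, ..., D_{g,1}^K),
    among groups not yet switched at t (F_g > t)."""
    gs = [g for g in Y if F[g] > t]
    X = np.array([[D1[g] ** k for k in range(K + 1)] for g in gs])
    dy = np.array([Y[g][t - 1] - Y[g][t - 2] for g in gs])  # lists 0-indexed
    return np.linalg.lstsq(X, dy, rcond=None)[0]

def avsq_hat_continuous(g, ell, K, Y, D1, F):
    """Y_{g,F_g-1+ell} - Y_{g,F_g-1}, minus the fitted polynomial
    status-quo trend cumulated from F_g to F_g-1+ell."""
    poly = np.array([D1[g] ** k for k in range(K + 1)])
    trend = sum(gamma_hat(t, K, Y, D1, F) @ poly
                for t in range(F[g], F[g] + ell))
    return Y[g][F[g] - 2 + ell] - Y[g][F[g] - 2] - trend
```

On toy data where stayers' outcome growth is exactly linear in $D_{g,1}$ (so $K=1$ fits perfectly), the estimator recovers the switcher's deviation from its fitted status-quo trend.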
A joint test that lagged treatments do not affect the outcome and that treatment
effects do not change over time. In this paragraph, we assume that groups’ treatment can
only change once, meaning that Dg,t = Dg,Fg for all t ≥ Fg . If that condition is not met, the
result below still holds, restricting attention to (g, t) cells such that t is strictly before g’s second
treatment change (t < min{t′ : t′ > Fg , Dg,t′ ̸= Dg,t′ −1 }). If Dg,t = Dg,Fg for all t ≥ Fg ,
$$\mathrm{AVSQ}_{g,\ell} = E\left[Y_{g,F_g-1+\ell}(D_{g,1,F_g-1}, D_{g,F_g,\ell}) - Y_{g,F_g-1+\ell}(D_{g,1,F_g-1+\ell})\right].$$
Now, assume that Assumption ND holds: lagged treatments cannot affect the outcome. Then,
$$\mathrm{AVSQ}_{g,\ell} = E\left[Y_{g,F_g-1+\ell}(D_{g,F_g}) - Y_{g,F_g-1+\ell}(D_{g,1})\right].$$
Finally, assume that (8.31) holds: the effect of the current treatment on the outcome does not depend on time. Then, $\mathrm{AVSQ}_{g,\ell}$ does not depend on $\ell$. For any $\ell'$ and any $\ell \in \{1, ..., \ell'\}$, let
$$\mathrm{AVSQ}^{bal}_{\ell,\ell'} = \frac{1}{\#S_{\ell'}}\sum_{g\in S_{\ell'}}\mathrm{AVSQ}_{g,\ell}$$
denote a version of $\mathrm{AVSQ}_\ell$ defined on the same subsample of groups as $\mathrm{AVSQ}_{\ell'}$, thus ensuring that for every $\ell \in \{1, ..., \ell'\}$, $\mathrm{AVSQ}^{bal}_{\ell,\ell'}$ applies to the same set of groups. Then, as Assumption ND and (8.31) imply that $\mathrm{AVSQ}_{g,\ell}$ does not depend on $\ell$, those two assumptions also imply that $\mathrm{AVSQ}^{bal}_{\ell,\ell'}$ does not depend on $\ell$, a testable condition. The corresponding test is computed by the did_multiplegt_dyn command.
A test that lagged treatments do not affect the outcome, in some specific designs.
In designs with a binary treatment and where some groups leave the treatment after having been
previously treated, Liu et al. (2024) propose another test of Assumption ND. Their test amounts
to estimating the average treatment effect across previously treated groups, at time periods
where those groups have left the treatment. Under Assumption ND, this average treatment
effect should be equal to zero. This yields a test of Assumption ND alone, rather than a joint
test of Assumption ND and (8.31), an advantage with respect to the test described in the
previous paragraph. A disadvantage of the test of Liu et al. (2024) is that it can only be used
in designs with a binary treatment and where some treated groups leave the treatment. Their
test is implemented by the fect Stata (Liu et al., 2022b) and R (Liu et al., 2022a) commands.
Assume that one wants to assess if treatment effects are correlated with a K × 1 vector of
time-invariant covariates Xg , whose first coordinate is a constant. In this section, we extend the
method described in Section 6.4.1 to non-binary and/or non-staggered designs.
Target parameter. Let βℓ,X be the coefficient on Xg in an infeasible regression of Sg AVSQg,ℓ , the effect of having been exposed to a weakly higher treatment for ℓ periods in group g, on Xg and indicators for all possible values of (Fg , Dg,1 , Sg ), in the sample of groups such that Fg − 1 + ℓ ≤ Tg .
XgT βℓ,X is the best linear predictor of Sg AVSQg,ℓ given Xg and indicators for all possible values of
(Fg , Dg,1 , Sg ). As in Chapter 6, the coefficient from a regression where indicators for all possible
values of (Fg , Dg,1 , Sg ) are not controlled for would arguably be a more natural target, but βℓ,X
is easier to estimate.
Binary and staggered designs with two consecutive treatments. In Section 3.2 of
their web appendix, de Chaisemartin and D’Haultfœuille (2023a) propose estimators for cases
where one is interested in the effects of several treatments, rather than one. To simplify, they
start by considering the case with two binary and staggered treatments, where groups can only
start receiving the second treatment after they have received the first. This last restriction
holds when the second treatment is a reinforcement of the first. For instance, one may want
to separately estimate the effects of medical and recreational marijuana laws in the US: so far,
states have passed the former before the latter (see Meinhofer, Witman, Hinde and Simon, 2021).
Another example is voter ID laws in the US, where less-strict laws are typically passed before stricter ones (see Cantoni and Pons, 2021). A third example is anti-deforestation policies in
the Amazon rainforest, where plots of land are typically put into a concession, and then some
concessions get certified (see Rico-Straffon, Wang, Panlasigui, Loucks, Swenson and Pfaff, 2023).
Estimating separately the effect of the first treatment. In such designs, estimating
separately the effect of the first treatment is straightforward: one can for instance compute the
estimators in Callaway and Sant’Anna (2021) or de Chaisemartin and D’Haultfœuille (forthc.),
restricting the sample to all (g, t)s that have not received the second treatment. In the marijuana
laws example, to estimate the effect of medical marijuana laws, one can just restrict the sample
to all state×year (g, t) such that state g has not passed a recreational law yet in year t. The
horizon until which effects of the first treatment can be estimated will just be truncated by the
second treatment.
Estimating the effect of receiving at least one of the two treatments. Estimating the
effect of receiving at least one of the two treatments is also straightforward: one can just com-
pute the estimators in Callaway and Sant’Anna (2021) or de Chaisemartin and D’Haultfœuille
(forthc.), redefining the treatment variable as equal to one for all (g, t)s that have received at
least one treatment.
Estimating separately the effect of the second treatment. Estimating separately the
effect of the second treatment is more challenging but can still be achieved, under the assumption
that the effect of one additional period of exposure to the first treatment is the same in every
group. Intuitively, why is it necessary to impose that assumption, if one wants to use a DID to
estimate the incremental effect of the second treatment?
To understand why that assumption is needed, let us go back to the marijuana law example.
Without that assumption, a state passing a recreational law may start experiencing a different
outcome trend than other states that have only passed a medical law, either because of the
recreational law, or because the additional effect of being exposed to the medical law for one
8.3. HETEROGENEITY-ROBUST ESTIMATORS 335
more period is different in that state and in other states. Thus, that assumption is key to
disentangle the effects of the two treatments. Though it is arguably strong, that assumption
is partly testable: it implies that groups that start receiving the first treatment at the same
time should have the same outcome evolutions until they adopt the second treatment. Under
that assumption, one can estimate the additional effect of the second treatment using, say,
the did_multiplegt_dyn command, restricting the sample to the (g, t)s that have received the
first treatment, and including the adoption date of the first treatment in the trends_nonparam
option. The resulting estimators compare the outcome evolution of groups that adopt/do not
adopt the second treatment, and that adopted the first treatment at the same date. When
the number of groups is relatively low (e.g. the 50 US states), there may not be any pair
of groups receiving the first treatment at the same time period. Then, de Chaisemartin and
D’Haultfœuille (2023a) propose two alternative estimators. First, instead of assuming that the
effect of one more treatment period is homogeneous across groups, one may assume that the
effect of the first treatment evolves linearly with the number of periods of exposure, with a
slope that may differ across groups. Under that assumption, one should specify the trends_lin
instead of the trends_nonparam option. Second, one may assume that the effect of one more
treatment period is homogeneous across groups and across time periods. Under that assumption,
one should include indicators for whether g has been exposed to the first treatment for at least
1, 2, etc. periods at period t in the controls option.
Separately estimating the effect of entering and leaving the treatment when groups
can enter and leave treatment once. In Section 1.6 of their web appendix, de Chaisemartin
and D’Haultfœuille (forthc.) show that similar ideas can be used to separately estimate the effect
of entering and leaving the treatment if (8.1) holds (Dg,t = 1{Eg ≥ t ≥ Fg }), by relabeling entry
and exit of treatment as two different treatments. To estimate the effect of joining the treatment,
one can use the did_multiplegt_dyn command, restricting the sample to all (g, t)s that have
not exited the treatment yet. To estimate the effect of leaving the treatment, one should restrict
the sample to all (g, t)s such that g has already been treated at t, define the treatment variable
as an indicator for having left the treatment, and either control non-parametrically for the date
when groups entered the treatment, or allow for group-specific linear trends, or control for
indicators for whether g has been exposed to the first treatment for at least 1, 2, etc. periods
at period t.
8.3.4.7 The initial-conditions problem in designs where the treatment varies at period one∗
When groups receive heterogeneous treatment doses at period one, this may suggest that they
have experienced treatment changes before period one. If potential outcomes can depend on
any treatment lag, even those before period one, unobserved treatment changes that took place
before period one may still affect groups’ outcome over the entirety of the study period, which
could bias the DID estimators introduced above. Pre-trend tests can be used to assess if, in spite
of dynamic effects of treatment changes that took place before the start of the study period, the
parallel-trends assumption underlying those estimators remains plausible. Moreover, if one is
ready to assume that groups’ outcomes can only be affected by their first k treatment lags, then
the estimators introduced above can be recomputed, restricting the sample to groups with a
stable treatment from period 1 to k + 1 and to time periods k + 1 to T . In that subsample, those
estimators remain valid even if treatment changes before period one can affect the outcome. An
issue with this strategy is that it leads to a reduced sample size, and may yield noisy estimators.
To our knowledge, there do not yet exist heterogeneity-robust DID estimators for cases where
one is only willing to make a parallel-trends assumption with respect to an instrument, and
treatment lags can affect the outcome.
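The sample restriction described above, for a researcher willing to assume that only the first $k$ treatment lags matter, can be sketched as a simple filter (our own illustration, with hypothetical names), assuming outcomes and treatments are stored as dicts of per-period lists:

```python
def restrict_to_k_lags(Y, D, k):
    """Keep only groups whose treatment is constant over periods 1..k+1,
    and drop periods 1..k, keeping periods k+1..T. If only the first k
    treatment lags affect outcomes, unobserved treatment changes before
    period one cannot affect the retained outcomes of retained groups."""
    keep = {g for g in D if len(set(D[g][:k + 1])) == 1}
    return ({g: Y[g][k:] for g in keep}, {g: D[g][k:] for g in keep})
```

The cost noted in the text is visible here: groups whose treatment moves early are dropped entirely, and everyone loses the first $k$ periods.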
1.44 percentage point, and the effect is statistically significant (s.e.=0.43 percentage point).
That effect can be estimated for 1,119 out of the 1,195 counties in the data: 34 counties never
experience a change in their number of newspapers, and 42 counties that do experience a change
cannot be matched with a not-yet-switcher with the same number of newspapers at baseline.
Being exposed to a weakly larger number of newspapers for two, three, and four electoral cycles
also significantly increases turnout. Effects increase with exposure length, but one cannot reject
the null that all effects are equal (p-value=0.40). As ℓ increases, effects mechanically apply to
fewer and fewer counties, but the effect after four electoral cycles still applies to 917 counties. Pre-
trend estimates are small and individually and jointly insignificant. However, their confidence
intervals are quite large. While the first pre-trend estimator applies to 906 of the 1,119 counties for which $\widehat{\mathrm{AVSQ}}_1$ is estimated, the fourth pre-trend estimator only applies to 447 of the 917 counties for which $\widehat{\mathrm{AVSQ}}_4$ is estimated. The confidence interval of $\widehat{\mathrm{AVSQ}}_{-4}$ is already quite large, but that of $\widehat{\mathrm{AVSQ}}_{-5}$ is substantially larger, so we have very little power to detect differential
trends over more than five election cycles. This is why we only report four placebo and four
event-study estimators.
[Figure: non-normalized event-study and pre-trend estimates. y-axis: Effect (−.04 to .04); x-axis: Relative time to change in newspapers (−4 to 4).]
Note: This figure shows non-normalized DID estimates of the effect of being exposed to a weakly larger number
of newspapers for ℓ periods on turnout, as well as pre-trends estimates, computed using the data of Gentzkow
et al. (2011) and the did_multiplegt_dyn Stata command. Standard errors are clustered at the county level.
95% confidence intervals are shown in red.
Rerun the previous command, estimating one event-study effect and adding design(0.8,console) at the end. What are the three most common “actual-versus-status-quo” comparisons averaged in $\mathrm{AVSQ}_1$, the effect estimated by $\widehat{\mathrm{AVSQ}}_1$? Rerun the previous command, estimating two event-study effects. What are the three most common “actual-versus-status-quo” comparisons averaged in $\mathrm{AVSQ}_2$, the effect estimated by $\widehat{\mathrm{AVSQ}}_2$? Rerun the previous command, estimating four event-study effects. What are the three most common “actual-versus-status-quo” comparisons averaged in $\mathrm{AVSQ}_4$, the effect estimated by $\widehat{\mathrm{AVSQ}}_4$?
effects_equal(all)
Normalized event-study and pre-trends estimates are shown in Figure 8.2 below. Normalized
event-study estimates are decreasing with ℓ, but one cannot reject the null that all effects are
equal (p-value=0.17). w1,0 = 1: the first event-study estimate is an effect of contemporaneous
newspapers on turnout. w2,0 = 0.48 and w2,1 = 0.52: the second normalized event-study esti-
mate is a weighted average of the effects of contemporaneous newspapers and of the first lag
of newspapers on turnout, with approximately equal weights. w3,0 = 0.35, w3,1 = 0.31, and
w3,2 = 0.33: the third normalized event-study estimate is a weighted average of the effects of
contemporaneous newspapers and of the first and second lag of newspapers, with approximately
equal weights. Finally, w4,0 = 0.28, w4,1 = 0.26, w4,2 = 0.23, and w4,3 = 0.24: the fourth normal-
ized event-study estimate is a weighted average of the effects of contemporaneous newspapers
and of the first, second, and third lag of newspapers, again with approximately equal weights.
Then, the fact that normalized event-study estimates are decreasing with ℓ may suggest that
lagged newspapers have a smaller effect on turnout than contemporaneous newspapers.
[Figure 8.2: normalized event-study and pre-trend estimates. y-axis: Effect (−.01 to .01); x-axis: Relative time to change in newspapers (−4 to 4).]
Note: This figure shows normalized DID estimates of the effect of newspapers on turnout, as well as normalized
pre-trends estimates, computed using the data of Gentzkow et al. (2011) and the did_multiplegt_dyn Stata
command. Standard errors are clustered at the county level. 95% confidence intervals are shown in red.
Can you jointly test the null that the first lagged treatment has no effect on the outcome, and
that treatment effects are constant over time?
As explained in Section 8.3.4.3, this test can be performed in a subsample of groups whose
treatment does not change between Fg and Fg + 1, and for which both the first and second
non-normalized event-study effects can be estimated. Then, the test amounts to assessing if the
first and second non-normalized event-study effects are equal in that subsample. We implement
this test using the did_multiplegt_dyn command, restricting attention to (g, t) cells such that
t < Fg or Dg,Fg = Dg,Fg +1 :
did_multiplegt_dyn prestout cnty90 year numdailies
if year<=first_change|same_treat_after_first_change==1,
effects(2) effects_equal(all) same_switchers graph_off
In that subsample of 512 counties, the estimates of the first and second non-normalized event-
study effects are close and not significantly different (p-value=0.83). This suggests that the first
lag of newspapers does not affect turnout.
8.4 Heterogeneity-robust estimators, ruling out dynamic effects

Allowing for dynamic effects is appealing, but the previous section has shown that under a
placebo-testable parallel-trends condition, one can only estimate event-study effects that average
together effects of many different treatment paths, and may therefore be hard to interpret.
Thus, if the test proposed in Section 8.3.4.3 suggests that lagged treatments actually do not
affect the outcome, one may consider assuming away dynamic effects. Then, one can estimate
effects that are easier to interpret. Moreover, those effects can be estimated under a minimal
parallel-trends assumption over consecutive periods rather than across multiple periods. Finally,
with a non-absorbing treatment, allowing for dynamic effects makes it hard to separately estimate
the effect of groups’ first, second, etc. treatment changes. Instead, ruling out dynamic effects
makes it possible to separately estimate the effect of each treatment change. This makes it
easy to test for instance if the treatment effect changes over time, and this can also lead to
more precise estimators. Therefore, throughout this section we impose Assumption ND. We
start by describing the estimators proposed by de Chaisemartin and D’Haultfœuille (2020) and
de Chaisemartin et al. (2022), before describing alternative estimators.
8.4.1 Design
In this section, we assume that Dg,t takes values in D = {0, ..., d}. If d = 1, Dg,t is binary, but we
allow for a non-binary discrete treatment taking a finite number of values, as is for instance the
case in the newspaper example. The estimators below can be computed whenever the following
condition holds:
Design STAY-C (Designs with stayers between consecutive periods) ∃(g, g ′ ) such that: (i)
Dg,t−1 = Dg′ ,t−1 ; (ii) Dg,t ̸= Dg,t−1 while Dg′ ,t = Dg′ ,t−1 .
Design STAY-C requires that there exists a pair of consecutive time periods (t − 1, t) and a pair
of groups (g, g ′ ) such that g and g ′ have the same treatment at t − 1, g’s treatment changes from
t − 1 to t while g ′ ’s treatment does not change. Hereafter, we refer to groups whose treatment
changes from t − 1 to t as t − 1-to-t switchers, while we refer to groups whose treatment does not
change from t − 1 to t as t − 1-to-t stayers. When T = 2, Design STAY-C is equivalent to Design
STAY, the design in which the estimators in the previous section can be used. When T > 2, the
two designs are no longer equivalent. Without dynamic effects, each pair of consecutive periods
can be analyzed in isolation, because groups’ outcomes at t − 1 and t do not depend on their
treatments at other periods. Then, Design STAY-C requires that one can match a t − 1-to-t
switcher to a t − 1-to-t stayer with the same period-t − 1 treatment, without taking into account
their prior treatment histories. Instead, Design STAY requires that one can match a switcher
to a not-yet-switcher with the same treatment history until the switcher switched. As in the
previous section, (i) fails if the treatment is continuously distributed, but in Section 8.4.8.1 we
will extend the estimators below to designs with a continuous treatment.
∀t ∈ {2, ..., T }, let Dtr = {d : ∃(g, g ′ ) : Dg,t−1 = Dg′ ,t−1 = Dg′ ,t = d ̸= Dg,t } be the set of values of
the lagged treatment Dg,t−1 such that at least one t − 1-to-t switcher and one t − 1-to-t stayer
have Dg,t−1 = d.
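Design STAY-C and the sets just defined lend themselves to a short sketch (our own illustration; names are ours): for each pair of consecutive periods, collect the lagged-treatment values at which a switcher can be matched to a stayer, and check whether any such match exists.

```python
def matched_lagged_values(t, D):
    """Values d of D_{g,t-1} shared by at least one t-1-to-t switcher
    and at least one t-1-to-t stayer (periods 1-indexed, lists 0-indexed)."""
    switchers = {D[g][t - 2] for g in D if D[g][t - 1] != D[g][t - 2]}
    stayers = {D[g][t - 2] for g in D if D[g][t - 1] == D[g][t - 2]}
    return switchers & stayers

def design_stay_c(D):
    """True iff some pair of consecutive periods has a matched switcher,
    i.e. Design STAY-C holds (assumes a balanced panel)."""
    T = len(next(iter(D.values())))
    return any(matched_lagged_values(t, D) for t in range(2, T + 1))
```

For instance, a switcher moving from 0 to 1 is matched if some other group holds its treatment at 0 over the same two periods, but not if all other groups sit at a different lagged value.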
Assumption PTNC-C (Parallel trends if groups’ treatment does not change between consecu-
tive periods, conditional on their lagged treatment) ∀t ∈ {2, ..., T }, ∀(g, g ′ ), if Dg,t−1 = Dg′ ,t−1 ∈
Dtr , then
E[Yg,t (Dg,t−1 ) − Yg,t−1 (Dg,t−1 )] = E[Yg′ ,t (Dg,t−1 ) − Yg′ ,t−1 (Dg,t−1 )].
Assumption PTNC-C requires that if two groups have the same lagged treatment, then they have
the same expected outcome evolution from period t − 1 to t, in the counterfactual where their
treatment does not change from t − 1 to t. Importantly, Assumption PTNC-C only requires that
some groups be on parallel trends over consecutive time periods, not over the entire duration
of the panel. Specifically, because Assumption PTNC-C is conditional on Dg,t−1 , it cannot be
“chained” across pairs of time periods: for instance, under Assumption PTNC-C, two groups g
and g ′ such that Dg,1 = 2, Dg,2 = Dg,3 = 3 and Dg′ ,1 = Dg′ ,2 = Dg′ ,3 = 2 experience parallel
trends from period one to two but not from period two to three, because they have the same
treatment at period one but not at period two. As it restricts only one potential outcome per
group, Assumption PTNC-C alone does not restrict groups’ treatment effects. When combined
with (2.2), the standard parallel-trends assumption on the untreated outcome, does Assumption
PTNC-C imply that the treatment effect should be constant over time?
No, but together the two conditions imply a parallel-trends condition on groups’ treatment
effects:
E[Yg,t (d) − Yg,t−1 (d)] = E[Yg′ ,t (d) − Yg′ ,t−1 (d)]
and
E[Yg,t (0) − Yg,t−1 (0)] = E[Yg′ ,t (0) − Yg′ ,t−1 (0)],
for d ∈ Dtr and (g, g ′ ) such that Dg,t−1 = Dg′ ,t−1 = d. These two equalities imply that
E[Yg,t (d) − Yg,t (0) − (Yg,t−1 (d) − Yg,t−1 (0))] = E[Yg′ ,t (d) − Yg′ ,t (0) − (Yg′ ,t−1 (d) − Yg′ ,t−1 (0))].
Therefore, for all d ∈ Dtr , the effect of changing the treatment from 0 to d should follow the same
evolution from t − 1 to t for all groups g such that Dg,t−1 = d. On the other hand, if one were to
assume that Assumption PTNC-C holds for all groups, as in (8.9), then together with (2.2) the
two conditions would imply that some treatment effects are constant over time. Thus, as in the
previous section, making the parallel-trends assumption conditional on groups’ prior treatment
substantially weakens the restrictions on treatment-effect heterogeneity implicitly imposed by
those parallel-trends assumptions.
Let S = {(g, t) : t ≥ 2, Dg,t ̸= Dg,t−1 , ∃g ′ : Dg′ ,t−1 = Dg′ ,t = Dg,t−1 } denote the set of t − 1-to-t
switchers with the same lagged treatment as at least one $t-1$-to-$t$ stayer. For all $(g, t) \in S$, let
$$\mathrm{TE}^{\Delta}_{g,t} = E\left[\frac{Y_{g,t}(D_{g,t}) - Y_{g,t}(D_{g,t-1})}{D_{g,t} - D_{g,t-1}}\right]$$
denote the expectation of the slope of group $g$'s potential outcome function at period $t$, between its period-$t$ and its period-$t-1$ treatment. Recall that for any set $A$, $\#A$ denotes its number of
elements, i.e. its cardinality. The first target parameter we consider is
$$\mathrm{ATS} = E\left[\frac{1}{\#S}\sum_{(g,t)\in S}\mathrm{TE}^{\Delta}_{g,t}\right].$$
ATS (Average Treatment effect of the Switchers) is the average of the slopes $\mathrm{TE}^{\Delta}_{g,t}$ across all switchers in $S$. In Design HAD and with i.i.d. groups, the ATS reduces to the ATT we considered in Chapter 7. The second target parameter we consider is a weighted average of switchers' slopes $\mathrm{TE}^{\Delta}_{g,t}$:
$$\mathrm{WATS} = E\left[\sum_{(g,t)\in S}\frac{|D_{g,t} - D_{g,t-1}|}{\sum_{(g',t')\in S}|D_{g',t'} - D_{g',t'-1}|}\,\mathrm{TE}^{\Delta}_{g,t}\right].$$
In Design HAD and with i.i.d. groups, the WATS reduces to the WATT we considered in Chapter 7. If $D_{g,t}$ is binary, ATS = WATS, but the two parameters can differ if $D_{g,t}$ is non-binary. One may also be interested in estimating separately the average treatment effect of switchers-in, whose treatment increases ($D_{g,t} > D_{g,t-1}$), and of switchers-out, whose treatment decreases. Accordingly, one can let $S_+ = \{(g,t) : t \ge 2, D_{g,t} > D_{g,t-1}, \exists g' : D_{g',t-1} = D_{g',t} = D_{g,t-1}\}$, define $S_-$ symmetrically for switchers-out, and define $\mathrm{ATS}_+$, $\mathrm{ATS}_-$, $\mathrm{WATS}_+$, and $\mathrm{WATS}_-$ like the ATS and WATS, replacing $S$ by $S_+$ or $S_-$.
8.4.3 Estimators
For all $(g, t) \in S$, let
$$\widehat{\mathrm{TE}}{}^{\Delta}_{g,t} = \frac{Y_{g,t} - Y_{g,t-1} - \frac{1}{\#\{g' : D_{g',t-1} = D_{g',t} = D_{g,t-1}\}}\sum_{g' : D_{g',t-1} = D_{g',t} = D_{g,t-1}}\left(Y_{g',t} - Y_{g',t-1}\right)}{D_{g,t} - D_{g,t-1}}.$$
$\widehat{\mathrm{TE}}{}^{\Delta}_{g,t}$ compares the $t-1$-to-$t$ outcome evolution of group $g$ to the average $t-1$-to-$t$ outcome evolutions of stayers with the same lagged treatment as $g$.
Theorem 23 If Assumptions NA, ND, and PTNC-C hold, then for all (g, t) ∈ S,
$$E\left[\widehat{\mathrm{TE}}{}^{\Delta}_{g,t}\right] = \mathrm{TE}^{\Delta}_{g,t}.$$
Accordingly, $\widehat{\mathrm{ATS}} = \frac{1}{\#S}\sum_{(g,t)\in S}\widehat{\mathrm{TE}}{}^{\Delta}_{g,t}$ is unbiased for the ATS, and
$$\widehat{\mathrm{WATS}} = \sum_{(g,t)\in S}\frac{|D_{g,t} - D_{g,t-1}|}{\sum_{(g',t')\in S}|D_{g',t'} - D_{g',t'-1}|}\,\widehat{\mathrm{TE}}{}^{\Delta}_{g,t}$$
is unbiased for the WATS. One can follow similar steps to construct unbiased estimators of $\mathrm{ATS}_+$, $\mathrm{ATS}_-$, $\mathrm{WATS}_+$, and $\mathrm{WATS}_-$, respectively denoted $\widehat{\mathrm{ATS}}_+$, $\widehat{\mathrm{ATS}}_-$, $\widehat{\mathrm{WATS}}_+$, and $\widehat{\mathrm{WATS}}_-$.
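As an illustration, a minimal sketch of $\widehat{\mathrm{TE}}{}^{\Delta}_{g,t}$, $\widehat{\mathrm{ATS}}$, and $\widehat{\mathrm{WATS}}$ on a toy panel (our own sketch, not the authors' implementation; names are ours), with treatments and outcomes stored as dicts of per-period lists:

```python
def te_hat(g, t, Y, D):
    """Slope estimate: DID of switcher g vs. stayers with the same lagged
    treatment, divided by g's treatment change (periods are 1-indexed)."""
    stayers = [gp for gp in D
               if D[gp][t - 1] == D[gp][t - 2] == D[g][t - 2]]
    dy_ctrl = sum(Y[gp][t - 1] - Y[gp][t - 2] for gp in stayers) / len(stayers)
    return (Y[g][t - 1] - Y[g][t - 2] - dy_ctrl) / (D[g][t - 1] - D[g][t - 2])

def ats_wats_hat(Y, D):
    """ATS-hat: plain average of slopes over S; WATS-hat: dose-weighted."""
    S = [(g, t) for g in D for t in range(2, len(D[g]) + 1)
         if D[g][t - 1] != D[g][t - 2]
         and any(D[gp][t - 1] == D[gp][t - 2] == D[g][t - 2] for gp in D)]
    slopes = {(g, t): te_hat(g, t, Y, D) for (g, t) in S}
    doses = {(g, t): abs(D[g][t - 1] - D[g][t - 2]) for (g, t) in S}
    ats = sum(slopes.values()) / len(S)
    wats = sum(doses[s] * slopes[s] for s in S) / sum(doses.values())
    return ats, wats
```

With a non-binary treatment, the two aggregates differ: a switcher receiving a two-unit dose counts twice as much in $\widehat{\mathrm{WATS}}$ as a one-unit switcher, while both count equally in $\widehat{\mathrm{ATS}}$.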
Bibliographic notes. With a binary treatment, the multi-period DID estimator in Imai and Kim (2021) is numerically equivalent to $\widehat{\mathrm{ATS}}_+$. $\widehat{\mathrm{WATS}}$ is numerically equivalent to the DIDM estimator of de Chaisemartin and D'Haultfœuille (2020).
Let S pl = {(g, t) ∈ S : Dg,t−1 = Dg,t−2 } denote the subsample of t − 1-to-t switchers that are
also t − 2-to-t − 1 stayers. For all (g, t) ∈ S pl , let
$$\widehat{\mathrm{TE}}{}^{\Delta,pl}_{g,t} = \frac{Y_{g,t-1} - Y_{g,t-2} - \frac{1}{\#\{g' : D_{g',t-2} = D_{g',t-1} = D_{g',t} = D_{g,t-1}\}}\sum_{g' : D_{g',t-2} = D_{g',t-1} = D_{g',t} = D_{g,t-1}}\left(Y_{g',t-1} - Y_{g',t-2}\right)}{D_{g,t} - D_{g,t-1}},$$
and let
$$\widehat{\mathrm{ATS}}{}^{pl} = \frac{1}{\#S^{pl}}\sum_{(g,t)\in S^{pl}}\widehat{\mathrm{TE}}{}^{\Delta,pl}_{g,t}.$$
$\widehat{\mathrm{ATS}}{}^{pl}$ compares the $t-2$-to-$t-1$ outcome evolutions of $t-1$-to-$t$ switchers and stayers, restricting attention to $t-2$-to-$t-1$ stayers. One can follow similar steps to define a placebo WATS estimator. If $\widehat{\mathrm{ATS}}{}^{pl}$ is small when compared to $\widehat{\mathrm{ATS}}$, differential trends between switchers and stayers are larger after switchers' treatment changes than before that change, which suggests that $\widehat{\mathrm{ATS}}$ is unlikely to be strongly biased due to a violation of Assumption PTNC-C. Why is it important to restrict attention to $t-2$-to-$t-1$ stayers when computing the placebo?
Otherwise, the placebo estimator could be contaminated by the treatment’s effect: its expec-
tation could differ from zero even if switchers and stayers are on parallel trends, if t − 1-to-t
switchers and stayers have different probabilities of being t − 2-to-t − 1 switchers and/or different
treatment effects.
$\widehat{\mathrm{ATS}}$ and $\widehat{\mathrm{WATS}}$ compare the outcome evolutions of $t-1$-to-$t$ switchers and stayers. But there may be, say, $t-1$-to-$t$ stayers whose treatment changed from $t-2$ to $t-1$. If lagged treatments affect units' current outcome, that change could still affect the $t-1$-to-$t$ outcome evolution of those stayers. This could lead to a violation of the parallel-trends assumption underlying $\widehat{\mathrm{ATS}}$ and $\widehat{\mathrm{WATS}}$. To mitigate this concern, de Chaisemartin et al. (2022) propose the following robustness check. One can recompute $\widehat{\mathrm{ATS}}$ and $\widehat{\mathrm{WATS}}$, restricting, for each pair of consecutive
periods (t − 1, t), the estimation sample to t − 2-to-t − 1 stayers, as in the placebo analysis. They
show that the resulting estimator is robust to dynamic effects up to one treatment lag. Similarly,
if one wants to allow for effects of the first and second treatment lags on the outcome, one just
needs to restrict the estimation sample to t − 3-to-t − 1 stayers. However, the more robustness
to dynamic effects one would like to have, the smaller the estimation sample becomes.
8.4.6 Inference
de Chaisemartin and D’Haultfœuille (2020) propose confidence intervals for the WATS based on
asymptotic approximations where the number of groups goes to infinity, under the assumption
that groups are independent. We recommend, as in Section 3.3.2, that researchers using those
confidence intervals with fewer than 40 switchers or with fewer than 40 stayers perform simulations
tailored to their data to assess their coverage rate.6
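The simulations we have in mind can be as simple as the following sketch: draw many datasets mimicking one's own, with a known true effect, and record how often the nominal 95% interval covers it. This toy version uses a stylized switcher/stayer comparison with a true effect of zero; all names and parameters are illustrative, and in practice the simulated data should mimic one's actual data:

```python
import numpy as np

def ci_coverage(n_switchers, n_stayers, n_sims=2000, seed=0):
    """Toy coverage check for a 95% normal-approximation confidence
    interval, under a true effect of zero. A stand-in for simulations
    tailored to one's actual data."""
    rng = np.random.default_rng(seed)
    covered = 0
    for _ in range(n_sims):
        # group-level outcome evolutions under a zero treatment effect
        d_sw = rng.normal(size=n_switchers)
        d_st = rng.normal(size=n_stayers)
        est = d_sw.mean() - d_st.mean()
        se = np.sqrt(d_sw.var(ddof=1) / n_switchers
                     + d_st.var(ddof=1) / n_stayers)
        covered += abs(est) <= 1.96 * se
    return covered / n_sims
```

With 40 switchers and 40 stayers, coverage is close to 95% in this toy setting; with very few groups it can deteriorate, which is when the tailored simulations matter.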
Borusyak et al. (2024) and Liu et al. (2024) show that with a binary, non-absorbing treatment,
their imputation estimator can still be used if one is ready to rule out dynamic effects. Then,
under (2.2), the imputation estimator is unbiased for the average effect of the treatment across
all treated (g, t) cells such that the fixed effect of group g and the fixed effect of period t can
be estimated in the first-step TWFE regression of the outcome on group and time FEs in the
sample of untreated (g, t) cells. $\widehat{\text{ATS}}$ and the imputation estimator are unbiased for average treatment effects across two different sets of (g, t) cells. This could lead the two estimators to differ if treatment effects are heterogeneous across those two sets. The two estimators also rely on different parallel-trends assumptions. In particular, the imputation estimator assumes that all groups experience the same evolution of their untreated outcome over the entire duration of the panel, while $\widehat{\text{ATS}}$ only requires that switchers and stayers
6 The number of switchers is the number of groups g for which at least one cell (g, t) belongs to S. The number of stayers is the number of groups g such that g belongs to the set of stayers attached to at least one cell (g′, t) in S.
350 CHAPTER 8. GENERAL DESIGNS
experience the same evolution of their untreated outcome from t − 1 to t, a weaker assumption along the time dimension. The parallel-trends assumption underlying $\widehat{\text{ATS}}$, which bears on outcomes under the lagged treatment rather than on untreated outcomes, is neither weaker nor stronger than that underlying the imputation estimator, but again it only requires that some groups are on parallel trends over consecutive time periods.
8.4.8 Extensions
To simplify the exposition, in this section we assume that T = 2: if T > 2, one can just compute
the estimators below for all pairs of consecutive time periods, and then take a weighted average
across pairs of periods.
Intuitively, one can estimate non-parametrically E[Yg,2 (Dg,1 )−Yg,1 (Dg,1 )], and therefore the ATS
and WATS, with a procedure similar to that described in Chapter 4 to control for covariates
in DID estimation, with Dg,1 playing the role of the covariate. First, one estimates a non-
parametric (e.g. series, kernel, lasso) regression of Yg,2 − Yg,1 on Dg,1 among stayers. Second, one
computes switchers’ predicted outcome evolution if their treatment had not changed, based on
that regression. Finally, we subtract from switchers’ outcome evolution their predicted outcome
evolution without treatment, to recover their treatment effect, and we average across switchers.
de Chaisemartin et al. (2022) derive doubly-robust moment conditions identifying the ATS and
WATS, and they propose non-parametric doubly-robust estimators, with data-driven choices of
the tuning parameters used in the first-step estimations. They show that the ATS estimator is $\sqrt{G}$-consistent if switchers cannot experience arbitrarily small treatment changes, while the WATS estimator is always $\sqrt{G}$-consistent. They also show that when switchers cannot experience arbitrarily small treatment changes, under some conditions the asymptotic variance of the WATS estimator is strictly lower than that of the ATS estimator.
not change, and (iii’) that had the same treatments as the switching groups in period t − 1. (ii)
and (ii’) ensure that this estimator will not be subject to an issue affecting TWFE regressions with several treatments. Which issue is that? (ii) and (ii’) ensure that the estimator of the effect of the first treatment is not contaminated by effects of other treatments, unlike the coefficient on the first treatment in a TWFE regression with several treatments. Interestingly, this idea was present as early as Snow (1856): to assess
if cholera is transmitted by air or water, Snow found a treatment group whose water quality
changed while its air quality did not change, and a control group whose water and air quality
did not change. (i’) ensures that the estimator is robust to heterogeneous effects of the first
treatment across groups. Finally, (iii’) ensures that the estimator is robust to heterogeneous
effects over time of all treatments.
de Chaisemartin et al. (2022) extend their estimators to the IV case, where one makes a parallel-
trends assumption with respect to an instrument rather than the treatment. As discussed
earlier, 2SLS-TWFE estimators may not estimate a LATE or a convex combination of treat-
ment effects. Instead, de Chaisemartin et al. (2022) propose an IV-WATS estimator, defined
as the ratio of a reduced-form WATS estimator with Yg,t as the outcome and the instrument
Zg,t as the treatment, divided by a first-stage WATS estimator with Dg,t as the outcome and
Zg,t as the treatment. They show that under a monotonicity condition as in Imbens and An-
grist (1994), and under parallel-trends conditions on E (Yg,2 (Dg,2 (Zg,1 )) − Yg,1 (Dg,1 (Zg,1 ))) and
E (Dg,2 (Zg,1 ) − Dg,1 (Zg,1 )), groups’ outcome and treatment evolutions if their instrument had
not changed, the IV-WATS estimator is consistent for a so-called IV-WATS effect. The IV-WATS
effect is a weighted average of the slopes
$$\frac{Y_{g,2}(D_{g,2}(Z_{g,2})) - Y_{g,2}(D_{g,2}(Z_{g,1}))}{D_{g,2}(Z_{g,2}) - D_{g,2}(Z_{g,1})},$$
across “complier-switchers” such that Dg,2 (Zg,2 ) ≠ Dg,2 (Zg,1 ). Those are the groups whose
instrument switches from period one to two (since Dg,2 (Zg,2 ) ≠ Dg,2 (Zg,1 ) implies Zg,2 ≠ Zg,1 ), and whose treatment responds to that change, like the compliers in Imbens and Angrist (1994). The IV-WATS weights slopes proportionally to Dg,2 (Zg,2 ) − Dg,2 (Zg,1 ), complier-switchers' first-stage effect. Similarly, a reduced-form ATS estimator divided by a first-stage ATS estimator is consistent for a weighted average of the same slopes, with weights proportional to (Dg,2 (Zg,2 ) − Dg,2 (Zg,1 ))/(Zg,2 − Zg,1 ), the slope of complier-switchers' first stage.
Finally, de Chaisemartin et al. (2022) show that the “reduced-form” parallel-trends condition
on E (Yg,2 (Dg,2 (Zg,1 )) − Yg,1 (Dg,1 (Zg,1 ))) restricts treatment effect heterogeneity over time and
across units. Instead, a reduced-form parallel-trends condition conditional on (Zg,1 , Dg,1 ) no
longer restricts treatment-effect heterogeneity over time, though it still restricts it across units.
Accordingly, de Chaisemartin et al. (2022) recommend controlling for (Zg,1 , Dg,1 ) in the IV es-
timation.
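In the two-period case, the structure of the IV-WATS estimator can be sketched as a ratio of two WATS-type estimators. This is a stylized illustration, not the estimator of de Chaisemartin et al. (2022): names are ours, and for brevity the stayer comparison conditions on the period-one instrument only:

```python
import numpy as np

def wats(y1, y2, z1, z2):
    """Two-period WATS-type estimator of the effect of z on y: switchers'
    outcome evolutions net of same-z1 stayers', weighted by |z2 - z1|."""
    switch = z1 != z2
    num = den = 0.0
    for g in np.where(switch)[0]:
        ctrl = (~switch) & (z1 == z1[g])
        if not ctrl.any():
            continue
        dy = (y2[g] - y1[g]) - np.mean(y2[ctrl] - y1[ctrl])
        num += np.sign(z2[g] - z1[g]) * dy   # = |dz| * (dy / dz)
        den += abs(z2[g] - z1[g])
    return num / den

def iv_wats(y1, y2, d1, d2, z1, z2):
    """Reduced-form WATS (y on z) divided by first-stage WATS (d on z)."""
    return wats(y1, y2, z1, z2) / wats(d1, d2, z1, z2)
```

With a first-stage effect of 0.5 and an outcome effect of 3 per unit of treatment, the reduced-form WATS is 1.5 and the ratio recovers 3.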
estimators proposed by Borusyak et al. (2024). You can refer to Chapter 6 for the syntax of the Stata command. The fect Stata (Liu et al., 2022b) and R (Liu et al., 2022a) commands compute the estimators proposed by Liu et al. (2024).
Using gentzkowetal_didtextbook and the did_multiplegt_old command, compute $\widehat{\text{WATS}}$ and $\widehat{\text{WATS}}^{pl}$. Interpret the results. In particular, how does $\widehat{\text{WATS}}$ compare to $\widehat{\text{AVSQ}}_{n1}$?
that newspapers have a non-linear effect on turnout, and that going from zero to one newspaper has a larger effect than going from one to two, etc. Finally, $\widehat{\text{WATS}}^{pl}$ is small and insignificant (−0.0000, s.e. = 0.0023), though its confidence interval is quite wide.
8 $\widehat{\text{WATS}}$ slightly differs from the $\text{DID}_{M}$ estimator in Table 3 of de Chaisemartin and D’Haultfœuille (2020), because it does not control for state-specific trends, and it does not group the number of newspapers above three into one category.
8.5 Heterogeneity-Robust Estimators in Designs Without Stayers?
The heterogeneity-robust DID estimators reviewed earlier can be computed in designs with
stayers. Such designs seem quite common. Of the 26 highly-cited AER papers estimating
a TWFE regression in the survey of de Chaisemartin and D’Haultfœuille (forthc.), there are
12 for which we can replicate at least one TWFE regression estimated in the paper, without
having to preprocess the publicly available data using a software we are not familiar with (e.g.
ArcGIS). For papers for which we can replicate several TWFE regressions, we focus on the
one reported first in the paper. For each of these 12 regressions, we assess if it has stayers:
∃(g, t) : Dg,t = Dg,t−1 , or ∃(g, t) : Zg,t = Zg,t−1 for 2SLS regressions. For regressions with several
treatments or instruments, we assess if at least one treatment or instrument has stayers. We find
that 9 regressions have stayers. Of the three papers that do not have stayers for any treatment or instrument in their main regression, one is Pierce and Schott (2016), a heterogeneous-adoption design with quasi-stayers. Another one is Fetzer (2019), who studies the effect of austerity in
the UK on the propensity to vote for Brexit. While the main austerity measure in the paper’s
Table 1 Column (1) does not have stayers, other austerity measures in that same table have
stayers. Overall, and though our sample is admittedly small, this suggests that we have now
been able to propose heterogeneity-robust DID estimators that can be used in a majority of
the cases where TWFE regressions are used. Yet, designs without stayers exist. They are for
instance, prominent in an important field of research that measures the impact of weather variables on economic or health outcomes, such as agricultural yields (see, e.g., Deschênes and
Greenstone, 2007) or mortality. If Dg,t is the amount of rainfall or the average temperature in
location g and year t, all locations will experience different precipitations or temperatures in
consecutive years: such treatments are continuously distributed across both g and t. We now
review several alternatives to TWFEs in such cases. Throughout, we assume away dynamic
effects to simplify the exposition, though some of the estimators below can be extended to allow
for dynamic effects. We also assume that T = 2, occasionally assuming the existence of a third
period t = 0 when we discuss pre-trend estimators.
When there are quasi-stayers, namely groups that experience arbitrarily small treatment changes,
a first solution is to use them as the control group. The researcher chooses a value h such that
groups for which |Dg,2 − Dg,1 | ≤ h are considered as quasi-stayers. Then, the heterogeneity-
robust DID estimators presented in the previous section can be computed as if the treatment
of those cells had not changed, using for instance the did_multiplegt_old Stata command,
with the threshold_stable_treatment(#) option. The option’s argument is h, the bandwidth
below which the researcher considers that a cell’s treatment did not change. As in Chapter 7,
results from non-parametric estimation can be used to choose that bandwidth optimally, up to
the additional difficulty that in general designs heterogeneity-robust DID estimators control for
the lagged treatment, and therefore the bandwidth has to be chosen conditional on a continuous
control variable (de Chaisemartin, D’Haultfœuille and Vazquez-Bare, 2024). While the did_had
Stata and R packages can be used to compute the optimal bandwidth in heterogeneous adoption
designs, we are not aware of a Stata or R package that can be used to compute that bandwidth
in a general design. In general designs, a further difficulty with this approach concerns the com-
putation of pre-trend estimators. Remember that in the previous section, pre-trend estimators
compared the average of Yg,1 − Yg,0 between period-one-to-two switchers and stayers, restricting
the sample to period-zero-to-one stayers. Here, this implies that pre-trend estimators have to be
computed in the subsample of period-zero-to-one quasi-stayers. A second bandwidth hpl needs
to be chosen, and that second bandwidth depends on the first one, h: the pre-trends estimator will compare Yg,1 − Yg,0 between gs such that |Dg,2 − Dg,1 | > h and gs such that |Dg,2 − Dg,1 | ≤ h, restricting the sample to gs such that |Dg,1 − Dg,0 | ≤ hpl . We are not aware of results from
non-parametric statistics that one could use to jointly choose those two bandwidths optimally.
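In code, the quasi-stayer approach amounts to thresholding treatment changes before running the estimator. The sketch below is deliberately simplified: unlike the estimators of the previous section, it does not condition on the lagged treatment, and the names and bandwidth are illustrative:

```python
import numpy as np

def ats_with_quasi_stayers(y1, y2, d1, d2, h):
    """Treat groups with |d2 - d1| <= h as if their treatment had not
    changed, and use them as the control group, in the spirit of the
    threshold_stable_treatment(#) option."""
    quasi = np.abs(d2 - d1) <= h
    if quasi.all() or not quasi.any():
        return float("nan")
    counterfactual = np.mean((y2 - y1)[quasi])
    effects = ((y2 - y1)[~quasi] - counterfactual) / (d2 - d1)[~quasi]
    return effects.mean()
```

The choice of h drives the bias-variance trade-off: a larger h yields more quasi-stayers, but treats larger treatment changes as if they were none.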
Another, “quick and easy” way of solving the problem is to round the treatment. In a weather application, with temperatures rounded down, say, to the first decimal, the control group becomes locations with the same rounded temperatures at periods one and two. Thus, this
approach is related to, but different from, using quasi-stayers as the control group. Letting r(.) denote the function mapping the raw treatment into its rounded value, this approach implicitly assumes that Yg,t (d) = Yg,t (d′ ) for all (d, d′ ) such that r(d) = r(d′ ). For instance, with temperatures rounded down to the first decimal, agricultural yields can change when average temperatures change from 18.79 to 18.80 degrees, but they cannot change when temperatures change from 18.70 to 18.79, an assumption that may be hard to justify.
sometimes more principled ways to change the treatment’s definition. For instance, there may
be well-controlled laboratory evidence suggesting that a crop’s growth is only impaired when
temperature goes above a threshold. Then, one can redefine the treatment as the number of
days when temperatures exceeded that threshold in location g and year t. With this treatment
definition, commonly used in the literature, we are back to a design with some stayers. Beyond
agricultural yields, this approach may also be applicable to health outcomes. For instance, there
may be a consensus in the medical literature that temperature variations within a range of
“normal” temperatures have no effect on mortality. On the other hand, this approach may not
be applicable to study the effect of temperatures on GDP. There, it seems harder to determine
ex-ante a range of temperatures that would all lead to the same GDP level.
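The redefinition is straightforward to implement. The sketch below counts days above an illustrative threshold and checks whether the resulting design has stayers; the 30-degree threshold and all names are purely for illustration:

```python
import numpy as np

def threshold_treatment(daily_temps, threshold=30.0):
    """daily_temps: (G, T, 365) array of daily temperatures. Returns the
    (G, T) redefined treatment: days above the (illustrative) threshold."""
    return (daily_temps > threshold).sum(axis=2)

def has_stayers(D):
    """True if some group's treatment is unchanged between two
    consecutive periods."""
    return bool((D[:, 1:] == D[:, :-1]).any())
```

With raw daily temperatures the treatment is continuously distributed and no location stays; after thresholding, locations with the same count of hot days in consecutive years become stayers.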
8.6 Appendix∗
Under (8.9), there exists a real number $\gamma_2^{\Delta}$ such that $E\left[Y_{g,2}(D_{g,1}) - Y_{g,1}(D_{g,1})\right] = \gamma_2^{\Delta}$ for all g. One has that, finally,
$$\cdots = \sum_{g=1}^{G} W_{g,t}\, \text{TE}^{\Delta}_{g,2}.$$
QED.
We have
$$\frac{\text{cov}_u(\Delta D, \Delta Y)}{V_u(\Delta D)} = \frac{\text{cov}_u(\Delta D, Y_2(0) - Y_1(0)) + \text{cov}_u(\Delta D, S_2 \times \Delta D) + \text{cov}_u(\Delta D, \Delta S \times D_1)}{\sum_{t'=1}^{2}\left[V_u(D_{t'}) - \text{cov}_u(D_1, D_2)\right]}$$
$$= \sum_{t=1}^{2} \frac{V_u(D_t) - \text{cov}_u(D_1, D_2)}{\sum_{t'=1}^{2}\left[V_u(D_{t'}) - \text{cov}_u(D_1, D_2)\right]}\, E_u(S_t).$$
The first equality follows from (8.14). The second follows from (8.19). QED.
By (8.22) and (6.13), $Y_{g,t} = \alpha_g + \gamma_t + \sum_{\ell=0}^{K} \gamma^{\ell}_{g,t} D_{g,t-\ell} + \varepsilon_{g,t}$, with $E[\varepsilon_{g,t}] = 0$. The rest of the proof is identical to that of Theorem 9, simply replacing $Y_{g,t}(0_{t-\ell'}, 1_{\ell'}) - Y_{g,t}(0_t)$ and $1\{t = F_g - 1 + \ell\}$ by respectively $\gamma^{\ell}_{g,t}$ and $D_{g,t-\ell}$. QED.
$= \text{AVSQ}_{g,\ell}.$
$$= \sum_{k=0}^{\ell-1} E\Big[Y_{g,F_g-1+\ell}(D_{g,1,F_g-1}, D_{g,F_g}, \ldots, D_{g,F_g-1+\ell-k-1}, D_{g,F_g-1+\ell-k}, D_{g,1,k}) - Y_{g,F_g-1+\ell}(D_{g,1,F_g-1}, D_{g,F_g}, \ldots, D_{g,F_g-1+\ell-k-1}, D_{g,1}, D_{g,1,k}) \,\Big|\, D\Big] \Big/ \big(D_{g,F_g-1+\ell-k} - D_{g,1}\big)$$
$$= \sum_{k=0}^{\ell-1} w_{g,\ell,k}\, s_{g,\ell,k}.$$
QED.
We have
$$E\left[Y_{g,t} - Y_{g,t-1} - \frac{1}{\#\{g' : D_{g',t-1} = D_{g',t} = D_{g,t-1}\}} \sum_{g' : D_{g',t-1} = D_{g',t} = D_{g,t-1}} (Y_{g',t} - Y_{g',t-1})\right]$$
Dividing the previous display by $D_{g,t} - D_{g,t-1}$ proves the result. QED.
Chapter 9
Conclusion: Practitioners' Checklist
Let us close this textbook with some recommendations for practitioners wishing to leverage
a potentially complex natural experiment and a DID-like estimator to learn the effect of a
treatment on an outcome:
2. Lay out explicitly the no-anticipation and parallel-trends (or factor-model) assumption
underlying your identification.
(a) Report pre-trend or placebo estimators to assess your identifying assumptions. For your results to be convincing, those pre-trend estimators should be smaller than your actual treatment-effect estimators, and ideally their confidence intervals should be tight enough to rule out differential trends large enough to account for a sizable fraction of your treatment-effect estimators. See Chapter 3 for details.
(b) If your pre-trends are small and precisely estimated with a simple DID estimator, you
may not need to use more complicated estimators, like a DID with control variables
or a synthetic control or synthetic DID estimator. See Chapter 4 for details.
(c) If you use an interactive fixed effects, a synthetic control, or a synthetic DID estimator,
we still recommend conducting a thorough pre-trends/placebo analysis. See Chapter
4 for details.
(d) If you invoke an “as good as random” assumption (random treatment timing, random
treatment dose), you need to conduct balancing checks to substantiate this assump-
tion. For instance, you can regress the treatment timing or dose on all pre-treatment
outcomes: pre-treatment outcomes should not predict the treatment timing or dose
if those are as good as randomly assigned.
(e) Even if those tests are conclusive, acknowledge that they remain suggestive, and
discuss remaining threats to your identification. For instance, are there concomitant
shocks or other policies that could explain why the treated and control groups start
experiencing differential trends after the treated get treated?
(a) Identify your design: is your treatment binary? Is it absorbing? Is there variation in
treatment timing?
i. Most treatments that social scientists are interested in are likely to have effects
that vary across space and over time, so robustness to heterogeneity is a desirable
feature.
ii. Heterogeneity-robust DID estimators are more transparent than TWFE estima-
tors, which makes it easier to communicate a study’s methodology and findings
outside of academic circles.
(c) If several heterogeneity-robust estimators are available for your research design, you
do not need to compute and report all of them: typically, they tend to be close to each
other. If they are not, this is evidence that the no-anticipation and parallel-trends
assumptions underlying those estimators are violated, but pre-trend tests are a more
direct way of testing those assumptions.
(e) If you estimate event-study effects, do not report more effects than pre-trend estima-
tors.
5. Perform inference:
(a) Cluster standard errors, either at the level at which the treatment is assigned, or at
the most disaggregated level at which one can still construct a panel dataset. See
starred Section 2.4 for details.
(b) With at least 40 treated and 40 control groups, you can use confidence intervals
relying on large-sample approximations.
(c) With fewer than 40 treated or fewer than 40 control groups, we recommend that you conduct simulations based on your data to assess whether, in your data, confidence intervals relying on large-sample approximations have close-to-nominal coverage. If those simulations suggest that they do not, see Section 3.3 for alternative inference procedures.
Bibliography
Abadie, A. (2021), ‘Using synthetic controls: Feasibility, data requirements, and methodological
aspects’, Journal of Economic Literature 59(2), 391–425.
Abadie, A., Athey, S., Imbens, G. W. and Wooldridge, J. M. (2023), ‘When should you adjust
standard errors for clustering?’, The Quarterly Journal of Economics 138(1), 1–35.
Abadie, A., Diamond, A. and Hainmueller, J. (2010), ‘Synthetic control methods for comparative
case studies: Estimating the effect of california’s tobacco control program’, Journal of the
American Statistical Association 105(490), 493–505.
Abadie, A., Diamond, A. and Hainmueller, J. (2011), ‘SYNTH: Stata module to implement
Synthetic Control Methods for Comparative Case Studies’, Statistical Software Components,
Boston College Department of Economics.
Abadie, A. and Gardeazabal, J. (2003), ‘The economic costs of conflict: A case study of the
basque country’, American Economic Review 93(1), 113–132.
Abbring, J. H. and Van den Berg, G. J. (2003), ‘The nonparametric identification of treatment
effects in duration models’, Econometrica 71(5), 1491–1517.
Acemoglu, D., Autor, D., Dorn, D., Hanson, G. H. and Price, B. (2016), ‘Import competition and
the great us employment sag of the 2000s’, Journal of Labor Economics 34(S1), S141–S198.
Ahrens, A., Chernozhukov, V., Hansen, C., Kozbur, D., Schaffer, M. and Wiemann, T. (2025),
‘An introduction to double/debiased machine learning’, arXiv preprint arXiv:2504.08324 .
Andrews, I., Roth, J. and Pakes, A. (2023), ‘Inference for linear conditional moment inequalities’,
Review of Economic Studies 90(6), 2763–2791.
Angrist, J. D. (1998), ‘Estimating the labor market impact of voluntary military service using
social security data on military applicants’, Econometrica 66(2), 249–288.
Angrist, J. D. and Pischke, J.-S. (2009), Mostly harmless econometrics: An empiricist’s com-
panion, Princeton university press.
Arellano, M. and Bond, S. (1991), ‘Some tests of specification for panel data: Monte carlo evi-
dence and an application to employment equations’, Review of Economic Studies 58(2), 277–
297.
Arkhangelsky, D., Athey, S., Hirshberg, D. A., Imbens, G. W. and Wager, S. (2021), ‘Synthetic
difference-in-differences’, American Economic Review 111(12), 4088–4118.
Arkhangelsky, D. and Imbens, G. (2024), ‘Causal models for longitudinal and panel data: A
survey’, The Econometrics Journal 27(3), C1–C61.
Arkhangelsky, D., Imbens, G. W., Lei, L. and Luo, X. (2024), ‘Design-robust two-way-fixed-
effects regression for panel data’, Quantitative Economics 15, 999–1034.
Armstrong, T. B., Weidner, M. and Zeleneev, A. (2022), Robust estimation and inference in
panels with interactive fixed effects. arXiv preprint arXiv:2210.06639.
Ashenfelter, O. (1978), ‘Estimating the effect of training programs on earnings’, The Review of
Economics and Statistics 60, 47–57.
Autor, D. H., Dorn, D. and Hanson, G. H. (2013), ‘The china syndrome: Local labor market
effects of import competition in the united states’, American economic review 103(6), 2121–
2168.
Bach, L., Bozio, A., Guillouzouic, A. and Malgouyres, C. (2023), ‘Dividend taxes and the
allocation of capital: Comment’, American Economic Review 113(7), 2048–2052.
Bai, J. (2009), ‘Panel data models with interactive fixed effects’, Econometrica 77(4), 1229–1279.
Bai, J. and Ng, S. (2021), ‘Matrix completion, counterfactuals, and factor analysis of missing
data’, Journal of the American Statistical Association 116(536), 1746–1763.
Baker, A., Callaway, B., Cunningham, S., Goodman-Bacon, A. and Sant’Anna, P. H. (2025),
‘Difference-in-differences designs: A practitioner’s guide’, arXiv preprint arXiv:2503.13323 .
Bell, R. M. and McCaffrey, D. F. (2002), ‘Bias reduction in standard errors for linear regression
with multi-stage samples’, Survey Methodology 28(2), 169–182.
Benatia, D., Bellégo, C., Cuerrier, J. and Dortet-Bernadet, V. (2025), ‘CDID: R module imple-
menting the chained DID estimator’, CRAN.
Bertrand, M., Duflo, E. and Mullainathan, S. (2004), ‘How much should we trust differences-in-
differences estimates?’, The Quarterly Journal of Economics 119(1), 249–275.
Bester, C. A., Conley, T. G. and Hansen, C. B. (2011), ‘Inference with dependent data using
cluster covariance estimators’, Journal of Econometrics 165(2), 137–151.
Bickel, P. J. (1969), ‘A distribution free version of the smirnov two sample test in the p-variate
case’, The Annals of Mathematical Statistics 40(1), 1–23.
Blundell, R., Costa-Dias, M., Meghir, C. and Van Reenen, J. (2004), ‘Evaluating the employment
impact of a mandatory job search program’, Journal of the European Economic Association
2(4), 569–606.
Bojinov, I., Rambachan, A. and Shephard, N. (2021), ‘Panel experiments and dynamic causal
effects: A finite population perspective’, Quantitative Economics 12, 1171–1196.
Borusyak, K., Hull, P. and Jaravel, X. (2022), ‘Quasi-experimental shift-share research designs’,
Review of Economic Studies 89(1), 181–213.
Borusyak, K. and Jaravel, X. (2017), Revisiting event study designs. Working Paper.
Borusyak, K., Jaravel, X. and Spiess, J. (2024), ‘Revisiting event-study designs: robust and
efficient estimation’, Review of Economic Studies p. rdae007.
Bravo, M. C., Roth, J. and Rambachan, A. (2022), ‘Honestdid: Stata module implementing the
honestdid R package’.
URL: https://EconPapers.repec.org/RePEc:boc:bocode:s459138
Brown, N. and Butts, K. (2023), Dynamic treatment effect estimation with interactive fixed
effects and short panels, Technical report, Mimeo.
Burgess, R., Jedwab, R., Miguel, E., Morjaria, A. and Padró i Miquel, G. (2015), ‘The value of
democracy: evidence from road building in kenya’, American Economic Review 105(6), 1817–
51.
Busch, A. and Girardi, D. (2023), ‘LPDID: Stata module implementing Local Projections
Difference-in-Differences (LP-DiD) estimator’, Statistical Software Components, Boston Col-
lege Department of Economics.
Butts, K. (2021b), ‘didimputation: Imputation Estimator from Borusyak, Jaravel, and Spiess
(2021) in R’.
URL: https://cran.r-project.org/web/packages/didimputation/index.html
Caetano, C., Callaway, B., Payne, S. and Rodrigues, H. S. (2022), Difference in differences with
time-varying covariates. arXiv preprint arXiv:2202.02903.
Calonico, S., Cattaneo, M. D. and Farrell, M. H. (2018), ‘On the effect of bias estimation on
coverage accuracy in nonparametric inference’, Journal of the American Statistical Association
113(522), 767–779.
Calonico, S., Cattaneo, M. D. and Titiunik, R. (2014), ‘Robust nonparametric confidence inter-
vals for regression-discontinuity designs’, Econometrica 82(6), 2295–2326.
Cantoni, E. and Pons, V. (2021), ‘Strict id laws don’t stop voters: Evidence from a us nationwide
panel, 2008–2018’, The Quarterly Journal of Economics 136(4), 2615–2660.
Chen, X., Christensen, T. and Kankanala, S. (2024), ‘Adaptive estimation and uniform con-
fidence bands for nonparametric structural functions and elasticities’, Review of Economic
Studies .
Chernozhukov, V., Fernández-Val, I., Hahn, J. and Newey, W. (2013), ‘Average and quantile
effects in nonseparable panel models’, Econometrica 81(2), 535–580.
Chernozhukov, V., Wüthrich, K. and Zhu, Y. (2021), ‘An exact and robust conformal infer-
ence method for counterfactual and synthetic controls’, Journal of the American Statistical
Association 116(536), 1849–1864.
Chiu, A., Lan, X., Liu, Z. and Xu, Y. (2023), ‘What to do (and not to do) with causal
panel analysis under parallel trends: Lessons from a large reanalysis study’, arXiv preprint
arXiv:2309.15983 .
Conley, T. G. and Taber, C. R. (2011), ‘Inference with “difference in differences” with a small
number of policy changes’, The Review of Economics and Statistics 93(1), 113–125.
Crump, R. K., Hotz, V. J., Imbens, G. W. and Mitnik, O. A. (2009), ‘Dealing with limited
overlap in estimation of average treatment effects’, Biometrika 96(1), 187–199.
Daw, J. R. and Hatfield, L. A. (2018), ‘Matching and regression to the mean in difference-in-
differences analysis’, Health services research 53(6), 4138–4156.
de Chaisemartin, C. (2011), Fuzzy differences in differences. Working Paper 2011-10, Center for
Research in Economics and Statistics.
de Chaisemartin, C., Ciccia, D., D’Haultfœuille, X., Knau, F. and Sow, D. (2024), ‘Stute_test:
Stata module to perform stute (1997) linearity test’.
de Chaisemartin, C., Ciccia, D., D’Haultfœuille, X. and Knau, F. (2024), Two-way fixed effects
and differences-in-differences estimators in heterogeneous-adoption designs. arXiv preprint
arXiv:2405.04465.
de Chaisemartin, C., Ciccia, D., D’Haultfœuille, X., Knau, F., Malézieux, M. and Sow, D.
(2024a), ‘Didmultiplegtdyn: R module to estimate event-study difference-in-difference (did)
estimators in designs with multiple groups and periods, with a potentially non-binary treat-
ment that may increase or decrease multiple times’.
URL: https://cran.r-project.org/web/packages/DIDmultiplegtDYN/index.html
de Chaisemartin, C., Ciccia, D., D’Haultfœuille, X., Knau, F., Malézieux, M. and Sow, D.
(2024b), ‘Did_multiplegt_dyn: Stata module to estimate event-study difference-in-difference
(did) estimators in designs with multiple groups and periods, with a potentially non-binary
treatment that may increase or decrease multiple times’.
de Chaisemartin, C., Ciccia, D., D’Haultfœuille, X., Knau, F., Malézieux, M. and Sow, D.
(2024), ‘Event-study estimators and variance estimators computed by the did_multiplegt_dyn
command’.
de Chaisemartin, C., Ciccia, D., D’Haultfœuille, X., Knau, F. and Sow, D. (2024a), ‘Didhad:
R module to estimate the effect of a treatment on an outcome in a heterogeneous-adoption
design with no stayers but some quasi stayers’.
URL: https://cran.r-project.org/web/packages/YatchewTest/index.html
de Chaisemartin, C., Ciccia, D., D’Haultfœuille, X., Knau, F. and Sow, D. (2024b), ‘Didhad:
Stata module to estimate the effect of a treatment on an outcome in a heterogeneous-adoption
design with no stayers but some quasi stayers’.
de Chaisemartin, C., Ciccia, D., D’Haultfœuille, X., Knau, F. and Sow, D. (2024c),
‘Did_multiplegt_stat: Stata module to estimate event-study difference-in-difference (did) es-
timators in designs with multiple groups and periods, with a potentially non-binary treatment
that may increase or decrease multiple times’.
de Chaisemartin, C., Ciccia, D., D’Haultfœuille, X., Knau, F. and Sow, D. (2024d), ‘Stutetest:
Stata module to perform stute (1997) linearity test’.
URL: https://cran.r-project.org/web/packages/StuteTest/index.html
de Chaisemartin, C. and D’Haultfœuille, X. (2020), ‘Two-way fixed effects estimators with het-
erogeneous treatment effects’, American Economic Review 110(9), 2964–2996.
de Chaisemartin, C. and D’Haultfœuille, X. (2024), Under the null of valid specification, pre-tests
cannot make post-test inference liberal. arXiv e-prints arXiv:2407.03725.
de Chaisemartin, C., D’Haultfœuille, X., Pasquier, F., Sow, D. and Vazquez-Bare, G. (2022),
Difference-in-differences for continuous treatments and instruments with stayers. arXiv
preprint arXiv:2201.06898.
de Chaisemartin, C. and Lei, Z. (2021), Are bartik regressions always robust to heterogeneous
treatment effects? Available at SSRN 3802200.
de Chaisemartin, C. and Lei, Z. (2024), ‘Randomly assigned first differences?’, arXiv preprint
arXiv:2411.03208 .
Deeb, A. and de Chaisemartin, C. (2019), Clustering and external validity in randomized con-
trolled trials. arXiv preprint arXiv:1912.01052.
Deryugina, T. (2017), ‘The fiscal cost of hurricanes: Disaster aid versus social insurance’, Amer-
ican Economic Journal: Economic Policy 9(3), 168–98.
Deschênes, O. and Greenstone, M. (2007), ‘The economic impacts of climate change: evidence
from agricultural output and random fluctuations in weather’, American Economic Review
97(1), 354–385.
DiCiccio, C. J. and Romano, J. P. (2017), ‘Robust Permutation Tests for Correlation and Re-
gression Coefficients’, Journal of the American Statistical Association 112, 1211–1220.
Donald, S. G. and Lang, K. (2007), ‘Inference with difference-in-differences and other panel
data’, The review of Economics and Statistics 89(2), 221–233.
Dube, A., Girardi, D., Jorda, O. and Taylor, A. M. (2023), A local projections approach to
difference-in-differences event studies. National Bureau of Economic Research.
East, C. N., Miller, S., Page, M. and Wherry, L. R. (2023), ‘Multigenerational impacts of
childhood access to the safety net: Early life exposure to medicaid and the next generation’s
health’, American Economic Review 113(1), 98–135.
Egerod, B. C. and Hollenbach, F. M. (2024), ‘How many is enough? sample size in staggered
difference-in-differences designs’, OSF Preprint.
Enikolopov, R., Petrova, M. and Zhuravskaya, E. (2011), ‘Media and political persuasion: Evidence from Russia’, American Economic Review 101(7), 3253–3285.
Favara, G. and Imbs, J. (2015), ‘Credit supply and the price of housing’, American Economic
Review 105(3), 958–992.
Ferman, B. (2021), ‘On the properties of the synthetic control estimator with many periods and
many controls’, Journal of the American Statistical Association 116(536), 1764–1772.
Ferman, B. and Pinto, C. (2019), ‘Inference in differences-in-differences with few treated groups
and heteroskedasticity’, Review of Economics and Statistics 101(3), 452–467.
Ferman, B., Pinto, C. and Possebom, V. (2020), ‘Cherry picking with synthetic controls’, Journal
of Policy Analysis and Management 39(2), 510–532.
Fetzer, T. (2019), ‘Did austerity cause Brexit?’, American Economic Review 109(11), 3849–3886.
Field, E. (2007), ‘Entitled to work: Urban property rights and labor supply in Peru’, The
Quarterly Journal of Economics 122(4), 1561–1602.
Friedberg, L. (1998), ‘Did unilateral divorce raise divorce rates? evidence from panel data’, The
American Economic Review 88(3), 608–627.
Frison, L. and Pocock, S. J. (1992), ‘Repeated measures in clinical trials: analysis using mean
summary statistics and its implications for design’, Statistics in Medicine 11(13), 1685–1704.
Fuest, C., Peichl, A. and Siegloch, S. (2018), ‘Do higher corporate taxes reduce wages? micro evidence from Germany’, American Economic Review 108(2), 393–418.
Gardner, J., Thakral, N., Tô, L. T. and Yap, L. (2024), ‘Two-stage differences in differences’.
Gentzkow, M., Shapiro, J. M. and Sinkinson, M. (2011), ‘The effect of newspaper entry and exit
on electoral politics’, American Economic Review 101(7), 2980–3018.
Gobillon, L. and Magnac, T. (2016), ‘Regional policy evaluation: Interactive fixed effects and
synthetic controls’, Review of Economics and Statistics 98(3), 535–551.
Goldsmith-Pinkham, P., Sorkin, I. and Swift, H. (2020), ‘Bartik instruments: What, when, why,
and how’, American Economic Review 110(8), 2586–2624.
Graham, B. S. and Powell, J. L. (2012), ‘Identification and estimation of average partial effects in
“irregular” correlated random coefficient panel data models’, Econometrica 80(5), 2105–2152.
Harmon, N. (2024), ‘DID_STEPWISE: Stata module implementing the chained DID estimator’.
Hatamyar, J., Kreif, N., Rocha, R. and Huber, M. (2023), Machine learning for stag-
gered difference-in-differences and dynamic treatment effect heterogeneity. arXiv preprint
arXiv:2310.11962.
Hoehn-Velasco, L., Penglase, J., Pesko, M. and Shahid, H. (2024), ‘The California effect: The challenges of identifying the impact of social policies during an era of social change’, Available at SSRN 4802701.
Hsiao, C., Ching, H. S. and Ki Wan, S. (2012), ‘A panel data approach for program evaluation:
Measuring the benefits of political and economic integration of Hong Kong with mainland
China’, Journal of Applied Econometrics 27(5), 705–740.
Imai, K. and Kim, I. S. (2021), ‘On the use of two-way fixed effects regression models for causal
inference with panel data’, Political Analysis 29(3), 405–415.
Imbens, G., Kallus, N. and Mao, X. (2021), Controlling for unmeasured confounding in panel
data using minimal bridge functions: From two-way fixed effects to factor models. arXiv
preprint arXiv:2108.03849.
Imbens, G. and Kalyanaraman, K. (2012), ‘Optimal bandwidth choice for the regression discon-
tinuity estimator’, Review of Economic Studies 79(3), 933–959.
Imbens, G. W. and Angrist, J. D. (1994), ‘Identification and estimation of local average treatment
effects’, Econometrica 62(2), 467–475.
Imbens, G. W. and Kolesár, M. (2016), ‘Robust standard errors in small samples: Some practical advice’, Review of Economics and Statistics 98(4), 701–712.
Imbens, G. W., Rubin, D. B. and Sacerdote, B. I. (2001), ‘Estimating the effect of unearned in-
come on labor earnings, savings, and consumption: Evidence from a survey of lottery players’,
American Economic Review 91(4), 778–794.
Imbens, G. and Xu, Y. (2024), LaLonde (1986) after nearly four decades: Lessons learned. arXiv preprint arXiv:2406.00827.
Ishimaru, S. (2021), ‘What do we get from two-way fixed effects regressions? implications from numerical equivalence’, arXiv preprint arXiv:2103.12374.
Jordà, Ò. (2005), ‘Estimation and inference of impulse responses by local projections’, American
Economic Review 95(1), 161–182.
Kim, D. (2024), ‘DDID: STATA ado package for “Difference-in-differences Estimator of Quantile
Treatment Effect on the Treated”’.
Kiviet, J. F. (1995), ‘On bias, inconsistency, and efficiency of various estimators in dynamic
panel data models’, Journal of Econometrics 68(1), 53–78.
Klosin, S. (2024), Dynamic biases of static panel data estimators. arXiv preprint
arXiv:2410.16112.
Kolesár, M. (2023), ‘dfadjust: Degrees of Freedom Adjustment for Robust Standard Errors’.
URL: https://cran.r-project.org/web/packages/dfadjust/index.html
Kranker, K. (2019), ‘CIC: Stata module to implement the Athey and Imbens (2006) Changes-in-
Changes model’, Statistical Software Components, Boston College Department of Economics.
Krolikowski, P. (2018), ‘Choosing a control group for displaced workers’, ILR Review
71(5), 1232–1254.
Lee, C. H. and Steigerwald, D. G. (2018), ‘Inference for clustered data’, The Stata Journal
18(2), 447–460.
Liu, L., Wang, Y. and Xu, Y. (2024), ‘A practical guide to counterfactual estimators for
causal inference with time-series cross-sectional data’, American Journal of Political Science
68(1), 160–176.
Liu, L., Wang, Y., Xu, Y., Liu, Z. and Liu, S. (2022a), ‘Fect: Companion R package to “a practical guide to counterfactual estimators for causal inference with time-series cross-sectional data”’.
Liu, L., Wang, Y., Xu, Y., Liu, Z. and Liu, S. (2022b), ‘Fect: Companion Stata package to “a practical guide to counterfactual estimators for causal inference with time-series cross-sectional data”’.
Lu, C., Nie, X. and Wager, S. (2019), Robust nonparametric difference-in-differences estimation.
arXiv e-prints.
MacKinnon, J. G., Nielsen, M. Ø. and Webb, M. D. (2023), ‘Fast and reliable jackknife and
bootstrap methods for cluster-robust inference’, Journal of Applied Econometrics 38(5), 671–
694.
Manski, C. F. and Pepper, J. V. (2018), ‘How do right-to-carry laws affect crime rates? coping
with ambiguity using bounded-variation assumptions’, Review of Economics and Statistics
100(2), 232–244.
McKenzie, D. (2012), ‘Beyond baseline and follow-up: The case for more T in experiments’, Journal of Development Economics 99(2), 210–221.
Meinhofer, A., Witman, A. E., Hinde, J. M. and Simon, K. (2021), ‘Marijuana liberalization
policies and perinatal health’, Journal of Health Economics 80, 102537.
Meyer, B. D., Viscusi, W. K. and Durbin, D. L. (1995), ‘Workers’ compensation and injury duration: Evidence from a natural experiment’, The American Economic Review 85(3), 322–340.
Mora, R. and Reggio, I. (2019), ‘Alternative diff-in-diffs estimators with several pretreatment
periods’, Econometric Reviews 38(5), 465–486.
Moser, P. and Voena, A. (2012), ‘Compulsory licensing: Evidence from the Trading with the Enemy Act’, American Economic Review 102(1), 396–427.
Neyman, J., Dabrowska, D. M. and Speed, T. P. (1990), ‘On the application of probability theory to agricultural experiments. Essay on principles. Section 9.’, Statistical Science 5, 465–472.
Neyman, J. and Scott, E. L. (1948), ‘Consistent estimates based on partially consistent observations’, Econometrica 16(1), 1–32.
Nickell, S. (1981), ‘Biases in dynamic models with fixed effects’, Econometrica 49, 1417–1426.
Pailañir, D., Clarke, D. and Ciccia, D. (2022), ‘SDID: Stata module to perform synthetic
difference-in-differences estimation, inference, and visualization’, Statistical Software Com-
ponents, Boston College Department of Economics.
Poterba, J. M., Venti, S. F. and Wise, D. A. (1995), ‘Do 401(k) contributions crowd out other
personal saving?’, Journal of Public Economics 58(1), 1–32.
Puhani, P. A. (2012), ‘The treatment effect, the cross difference, and the interaction term in
nonlinear “difference-in-differences” models’, Economics Letters 115(1), 85–87.
Rambachan, A. and Roth, J. (2023), ‘A more credible approach to parallel trends’, Review of
Economic Studies 90(5), 2555–2591.
Rico-Straffon, J., Wang, Z., Panlasigui, S., Loucks, C. J., Swenson, J. and Pfaff, A. (2023),
‘Forest concessions and eco-certifications in the Peruvian Amazon: Deforestation impacts of
logging rights and logging restrictions’, Journal of Environmental Economics and Management
118, 102780.
Rios-Avila, F., Sant’Anna, P. and Callaway, B. (2021), ‘CSDID: Stata module for the estimation
of difference-in-difference models with multiple time periods’.
URL: https://EconPapers.repec.org/RePEc:boc:bocode:s458976
Rios-Avila, F., Sant’Anna, P. H. and Naqvi, A. (2021), ‘DRDID: Stata module for the estimation
of Doubly Robust Difference-in-Difference models’, Statistical Software Components, Boston
College Department of Economics.
URL: https://ideas.repec.org/c/boc/bocode/s458977.html
Robins, J. (1986), ‘A new approach to causal inference in mortality studies with a sustained exposure period-application to control of the healthy worker survivor effect’, Mathematical Modelling 7(9-12), 1393–1512.
Roth, J. (2022), ‘Pretest with caution: Event-study estimates after testing for parallel trends’,
American Economic Review: Insights 4(3), 305–322.
Roth, J. and Sant’Anna, P. H. (2023), ‘When is parallel trends sensitive to functional form?’,
Econometrica 91(2), 737–747.
Roth, J. and Sant’Anna, P. H. (2023), ‘Efficient estimation for staggered rollout designs’, Journal
of Political Economy Microeconomics 1(4), 669–709.
Royen, T. (2014), ‘A simple proof of the Gaussian correlation conjecture extended to some multivariate gamma distributions’, Far East Journal of Theoretical Statistics 48, 139–145.
Rubin, D. B. (1978), ‘Bayesian inference for causal effects: The role of randomization’, The
Annals of Statistics 6, 34–58.
Sant’Anna, P. and Callaway, B. (2021), ‘did: Treatment effects with multiple periods and groups
in R’.
URL: https://cran.r-project.org/web/packages/did/index.html
Sasaki, Y. and Ura, T. (2021), Slow movers in panel data. arXiv preprint arXiv:2110.12041.
Schmidheiny, K. and Siegloch, S. (2023), ‘On event studies and distributed-lags in two-way fixed
effects models: Identification, equivalence, and generalization’, Journal of Applied Economet-
rics 38(5), 695–713.
Semmelweis, I. F. (1983), The etiology, concept, and prophylaxis of childbed fever, number 2, University of Wisconsin Press.
Silva, J. S. and Tenreyro, S. (2006), ‘The log of gravity’, The Review of Economics and Statistics
88(4), 641–658.
Small, D. S., Tan, Z., Ramsahai, R. R., Lorch, S. A. and Brookhart, M. A. (2017), Instrumental
variable estimation with a stochastic monotonicity assumption. Working paper.
Snow, J. (1856), ‘On the mode of communication of cholera’, Edinburgh Medical Journal 1(7), 668.
Solon, G., Haider, S. J. and Wooldridge, J. M. (2015), ‘What are we weighting for?’, Journal of
Human Resources 50(2), 301–316.
Stevenson, B. and Wolfers, J. (2006), ‘Bargaining in the shadow of the law: Divorce laws and
family distress’, The Quarterly Journal of Economics 121(1), 267–288.
Stute, W. (1997), ‘Nonparametric model checks for regression’, The Annals of Statistics 25, 613–
641.
Sun, L. (2020), ‘EVENTSTUDYWEIGHTS: Stata module to estimate the implied weights on the
cohort-specific average treatment effects on the treated (CATTs) (event study specifications)’.
URL: https://ideas.repec.org/c/boc/bocode/s458833.html
Sun, L. and Abraham, S. (2021), ‘Estimating dynamic treatment effects in event studies with
heterogeneous treatment effects’, Journal of Econometrics 225, 175–199.
Ujhelyi, G. (2014), ‘Civil service rules and policy choices: Evidence from US state governments’,
American Economic Journal: Economic Policy 6(2), 338–380.
van der Vaart, A. and Wellner, J. A. (2023), Weak Convergence and Empirical Processes: With
Applications to Statistics, Springer Nature.
Weiss, A. (2024), How much should we trust modern difference-in-differences estimates? Center
for Open Science working paper.
Wolfers, J. (2006), ‘Did unilateral divorce laws raise divorce rates? a reconciliation and new
results’, American Economic Review 96(5), 1802–1820.
Wooldridge, J. (2021), Two-way fixed effects, the two-way mundlak regression, and difference-
in-differences estimators. Available at SSRN 3906345.
Wooldridge, J. M. (2007), ‘Inverse probability weighted estimation for general missing data
problems’, Journal of Econometrics 141(2), 1281–1301.
Xu, Y. (2017), ‘Generalized synthetic control method: Causal inference with interactive fixed
effects models’, Political Analysis 25(1), 57–76.
Yitzhaki, S. (1996), ‘On using linear regressions in welfare economics’, Journal of Business &
Economic Statistics 14(4), 478–486.