Data Analysis For Social Scientists (14.310x)
David G. Khachatrian
September 28, 2019
1 Preamble
This was made a good deal after having taken the course. It will likely not be exhaustive. It may also in-
clude some editorializing: bits of what I believe are relevant observations and/or information I have come
across.
Also, many of the earlier topics of probability and "theoretical" statistics were covered in detail in the
Probability (6.431x) and Fundamentals of Statistics (18.6501x) notes, so those topics will either be skipped
or mentioned very quickly in these notes.
2 Probability Terms
A probability on a sample space S is an assignment of numbers P(A) to events A that satisfies the probability axioms (nonnegativity, normalization, countable additivity), where the collection of events forms a sigma-algebra.
You may want to smooth out your plots. One can use kernel density estimation to achieve this. We
extrapolate from known datapoints xi using a kernel function K:
f̂_h(x) = (1/n) ∑_{i=1}^{n} K_h(x − x_i) = (1/(nh)) ∑_{i=1}^{n} K((x − x_i)/h)
K is a probability density with mean 0, and h > 0 is the bandwidth. Increasing h increases the effect each point x_i has on faraway points (it "smooths/flattens things out"). Common choices for K are the Epanechnikov or Normal kernels (for the Normal kernel, σ² essentially plays the role of the bandwidth parameter).
(Note that kernel density estimation is a distortion added to the visualization of the data).
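The KDE formula above can be sketched directly in a few lines. This is a minimal hand-rolled Gaussian-kernel estimator; the function name `gaussian_kde` and the toy data are my own choices, not from the course:

```python
import numpy as np

def gaussian_kde(x_grid, data, h):
    """Kernel density estimate with a Normal kernel of bandwidth h."""
    # Scaled distances between every grid point and every data point
    u = (x_grid[:, None] - data[None, :]) / h
    K = np.exp(-0.5 * u**2) / np.sqrt(2 * np.pi)  # standard Normal density
    # f_hat(x) = (1/(n*h)) * sum_i K((x - x_i)/h)
    return K.sum(axis=1) / (len(data) * h)

data = np.array([0.0, 0.5, 2.0])
grid = np.linspace(-2, 4, 121)
dens = gaussian_kde(grid, data, h=0.5)
# A density estimate should integrate to approximately 1 over its support
area = dens.sum() * (grid[1] - grid[0])
```

Since K is a density, the mixture f̂_h is itself a density, which the final line checks numerically.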
Sometimes you’ll want to plot the CDF. This is especially helpful to determine the existence of first-order
stochastic dominance (FOSD). X dominates Y to the first order if X's α-quantile is greater than or equal to Y's α-quantile for all α ∈ (0, 1), i.e., F_X^{−1}(α) ≥ F_Y^{−1}(α). This corresponds to Pr[X ≤ k] ≤ Pr[Y ≤ k] for all k ∈ ℝ.
Intuitively, "X is always more likely to yield larger numbers than Y". Visually, the CDF of X is always
"to the right of" the CDF of Y.
4 Auctions
The k'th order statistic of n realizations of a random variable X, denoted X_n^{(k)} (n usually suppressed), is the k'th-smallest of the n draws; it describes the probability law governing the (empirical) k/n-quantile of X. Perhaps better explained in mathematical terms (assuming iid draws): the event {X_n^{(k)} ≤ x} is the event that at least k of the n realizations are ≤ x, so

Pr[X_n^{(k)} ≤ x] = ∑_{j=k}^{n} (n choose j) (F_X(x))^j (1 − F_X(x))^{n−j}

where the j'th term is Pr[exactly j of n realizations of X are ≤ x] × Pr[the other (n − j) realizations are > x].
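The order-statistic CDF above is easy to sanity-check by simulation. A quick sketch assuming iid Uniform(0,1) draws (so F(t) = t); the helper name `order_stat_cdf` is my own:

```python
import numpy as np
from math import comb

def order_stat_cdf(x, n, k, F):
    """P(k'th order statistic of n iid draws <= x): at least k draws land <= x."""
    p = F(x)
    return sum(comb(n, j) * p**j * (1 - p) ** (n - j) for j in range(k, n + 1))

rng = np.random.default_rng(0)
n, k, x = 5, 3, 0.6
draws = np.sort(rng.uniform(size=(100_000, n)), axis=1)
empirical = (draws[:, k - 1] <= x).mean()            # k'th-smallest (1-indexed)
theoretical = order_stat_cdf(x, n, k, lambda t: t)   # Uniform(0,1): F(t) = t
# empirical and theoretical should agree to simulation error
```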
We can compare two scenarios, in both of which there are N potential buyers of a good with distribution of valuations/offers O (and we assume zero transaction costs):
1. The seller fixes a price P_threshold = P_t (which can be chosen optimally if the seller knows both N and the distribution of offers O), and sells at the first offer at or above P_t (so P ≥ P_t).
2. The seller sets up an auction; they sell to the second-highest bid (the (N−1)/N-quantile of realized values), P.
This is the pmf of the hypergeometric distribution (drawing n times without replacement from a pool of s successes and f failures):

p_X(x; s, f, n) = (s choose x)(f choose n − x) / (s + f choose n)
There is a similarity to the binomial in terms of expectation and variance as well. Call the fraction of successes before the first draw p = s/(s + f) and the fraction of failures before the first draw q = f/(s + f) = 1 − p:

E[X] = n · s/(s + f) = np

Var(X) = n · (s/(s + f)) · (f/(s + f)) · (s + f − n)/(s + f − 1) = np(1 − p) · (s + f − n)/(s + f − 1) ≤ np(1 − p)
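The mean and variance formulas above can be checked against the pmf by brute force; a small sketch with my own parameter values (s = 7, f = 5, n = 4):

```python
from math import comb

def hypergeom_pmf(x, s, f, n):
    """P(X = x): x successes in n draws without replacement (s successes, f failures)."""
    return comb(s, x) * comb(f, n - x) / comb(s + f, n)

s, f, n = 7, 5, 4
support = range(max(0, n - f), min(n, s) + 1)
mean = sum(x * hypergeom_pmf(x, s, f, n) for x in support)
var = sum((x - mean) ** 2 * hypergeom_pmf(x, s, f, n) for x in support)
p = s / (s + f)
# mean should equal n*p; var should equal n*p*(1-p)*(s+f-n)/(s+f-1)
```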
5.3 F distribution
If X ∼ χ²_n, Y ∼ χ²_m, and X and Y are independent, then

(X/n) / (Y/m) ∼ F_{n,m}

This is useful when using an F-test to compare how well two models fit relative to one another. (Longer explanation about the F-test to come.)
6 RCT design and power calculations.
The power of a test is the probability of correctly rejecting the null when the alternative is true, i.e., one minus the probability of a false negative.
Say you have N individuals, you set γ fraction of them to the treatment group and the rest as control
(both samples large enough to be able to invoke CLT). You choose significance level α. You assume that the
treatment has a constant effect (of unknown magnitude) τ on the population. That is to say, you assume
that:
X̄_c ∼ N(μ, σ²/((1 − γ)N)),   X̄_t ∼ N(μ + τ, σ²/(γN))

for some shared σ² between the two populations (note: this is another assumption, which you can relax if you have reason to). Now in total, the parameters in our experimental design are: N, γ, α, the power, the effect size τ, and the variance σ².
When you fix all the other parameters, you can calculate the induced value for the final parameter.
γ = 0.5 is always the most "efficient" partition but may not always be reasonable/ethical (e.g., medical
study for new disease treatment that turns out to be really effective). α is usually fixed at a low value to
limit false positives. The higher the N, the better (but usually increasing N is costly).
Usually, you choose an effect size τ 0 such that, if τ < τ 0 , you "may as well not even bother". You usually
also estimate σ2 from previous studies. (Both τ and σ2 could also be estimated from a smaller pilot study
of the intervention under consideration.) From there, you can calculate the final value.
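The "calculate the final value" step above can be sketched as a power calculation. This is a minimal version assuming a two-sided z-test on the difference in means; the function name `power` and the use of Python's `statistics.NormalDist` are my own choices, not from the course:

```python
from math import sqrt
from statistics import NormalDist

def power(tau, sigma, N, gamma, alpha=0.05):
    """Approximate power of a two-sided test of tau = 0 under the design above.

    Assumes Xbar_t - Xbar_c ~ N(tau, sigma^2/(gamma*N) + sigma^2/((1-gamma)*N)).
    """
    se = sigma * sqrt(1 / (gamma * N) + 1 / ((1 - gamma) * N))
    z = NormalDist().inv_cdf(1 - alpha / 2)
    # Probability the test statistic falls outside the critical values,
    # when the statistic is centered at tau/se under the alternative
    return 1 - NormalDist(mu=tau / se).cdf(z) + NormalDist(mu=tau / se).cdf(-z)
```

Consistent with the text: at τ = 0 the "power" collapses to α (only false positives), power grows with N, and γ = 0.5 gives the smallest standard error and hence the most power for fixed N.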
T = (Ȳ_t − Ȳ_c)/√(V_Neyman) ≈ N(0, 1)

where the Neyman variance assumes independence between the populations, i.e., V_Neyman = σ_t²/N_t + σ_c²/N_c.
The slightly less beefy version of matched design is stratified design, where you end up having more
than two people in each "group", and you randomly send γ of them to the treatment arm and the rest to the
control arm.
Stratified design still lowers the variance of the estimated treatment effect, but not as much as matched
design. (Stratified design where you end up only having one stratum basically leads to a "regular" split.)
Clustered design involves combining clusters of participants into just 1 datapoint.
Clustered design increases the variance of the estimated treatment effect, so it’s not recommended if you
can avoid it. Clustered design tends to be by necessity rather than desire. For example: you stage an in-
tervention in different classrooms in a school. Ideally, you’d get information about all the students in each
classroom and compare their treatment effect; in reality, the school can only provide information about
aggregate effects across classes with no per-student breakdown. What would otherwise be ∼ 30 data-
points/students become 1 datapoint/class. (So with lower n and no improvement anywhere else, variance
will go up.)
7 Causality.
Often, we are interested in trying to tease out causality among different factors. A common method is struc-
tural equation modeling (SEM), where we assume a structure between observable variables (e.g test scores)
and latent variables (e.g. intelligence), and then calculate the parameters that link these variables together
based on e.g. the covariance matrices.
This course looks more at the Rubin causal model. We will want to design our experiment so that the
stable unit treatment value assumption (SUTVA) holds. SUTVA assumes that "my choice of treat/no-treat for
Person A does not affect the outcome for Person B, and vice-versa". (An example where SUTVA would
not hold is if you choose parts of a local population to get immunized and others don’t – Person A getting
immunized improves Person B’s prospects regardless of whether B gets immunized.)
Another important aspect is to randomly assign individuals to the different arms of the study. This ensures
no built-in selection bias.
Why? A key idea is that "we could have gotten different outcomes for the same person i if they had been given a different intervention k ≠ j, so in general Yi,j ≠ Yi,k."
For an individual i and a study with A treatment arms, let Yi,j be what you would observe (for whatever it is you’re measuring with this study) if you put person i in arm j. We can then assign people i to arm j; denote the choice of assignment Wi (which equals 0, 1, …, A − 1, depending on which arm Person i is assigned to). Since we don’t have A clones of people, we’ll only be able to observe one branch. For simplicity, let’s have A = 2 ("either you get a treatment or you get a control"). So Yi^(obs) = (Yi,j | Wi = a) = Yi,a.
We want to estimate the effect of treatment, so what would make sense is to use sample estimators for E[Yi,1] − E[Yi,0]. What we can actually compute is the difference in observed group means; add and subtract E[Yi,0 | Wi = 1] to rearrange:

E[Yi,1 | Wi = 1] − E[Yi,0 | Wi = 0] = (E[Yi,1 | Wi = 1] − E[Yi,0 | Wi = 1]) + (E[Yi,0 | Wi = 1] − E[Yi,0 | Wi = 0])
= effect of treatment on the treated + selection bias
What happened here? What does Yi,0 | Wi = 1 mean? This refers to what response the people in the treatment arm would have given if they had instead been in the control arm. If there’s a systematic difference between the groups (their baseline responses would differ, i.e., E[Yi,0 | Wi = 1] − E[Yi,0 | Wi = 0] ≠ 0), selection bias is at play. No good.
How can we prevent selection bias from occurring? The simplest way is to use random assignment (so any
bias decreases with increasing n due to Law of Large Numbers). Another benefit is that if we assume that
there is no selection bias, that means the two groups are essentially "from the same population". So our
conditional estimates Yi,j | Wi = j can describe the whole population and not just "those people with the
characteristics we selected for inclusion in our treatment arm", etc. A double-win!
Note that if the above is true, we can view all of this as a linear model (if we’re comfortable making the
linear model assumptions, e.g. Gaussian homoskedastic noise with no serial correlation). We are trying to
estimate E[Yi | Xi ] with the following:
Yi = β0 + β1 IsTreatmenti + ei
In this case, β 0 estimates the average response for the control group and β 1 estimates the average
marginal benefit from undergoing treatment. (The point estimates for β will correspond to the sample
means/difference of sample means of the two groups.)
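The claim that the point estimates equal the group means can be verified with a tiny simulation. A sketch with my own made-up numbers (true control mean 5, constant treatment effect τ = 2):

```python
import numpy as np

rng = np.random.default_rng(42)
n, tau = 10_000, 2.0
treated = rng.integers(0, 2, size=n)           # random assignment -> no selection bias
y = 5.0 + tau * treated + rng.normal(size=n)   # constant treatment effect tau

# OLS of y on [1, IsTreatment] via least squares
X = np.column_stack([np.ones(n), treated])
b0, b1 = np.linalg.lstsq(X, y, rcond=None)[0]
# b0 equals the control-group sample mean; b1 equals the difference in group means
```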
But we can decide to view things differently. Rather than having sampled randomly from a larger
population, what if the data we collected is our population of interest? Then we’d have a good deal of
information; if there are a treatment arms, we have 1/a of all the information about our population. Let’s
return to just two arms; then we have half of all the information.
So here’s a question: Can we test whether the intervention has any effect on anyone? This is a hypothesis
test:
H0: Yi,0 − Yi,1 = 0 ∀ i
H1: ∃ i s.t. Yi,0 − Yi,1 ≠ 0
Note that this is a pretty intense assumption. We aren’t saying "maybe the mean effect is zero"; we’re
saying "this doesn’t do anything to anyone". Hence why we call it a sharp null hypothesis.
Under the null, we now have all the relevant information and can "fill in" our empty rows. We now need to choose a relevant test statistic: one choice is the absolute difference in sample means, T = |Ȳ_1 − Ȳ_0|.
How do we construct a p-value/measure of significance? We have a measure of Tobs for how our exper-
iment split up individuals Wi . What are all the other T corresponding to different treatment-control splits?
That gives us a range of possible T values. We can then see how extreme Tobs was and calculate the p-value
from that.
To be clear, let’s say you had Nc people in the control arm and Nt in the treatment arm. This means you calculate (Nc + Nt choose Nc) combinations of hypothetical assignments, calculate T for each one, and then compare how extreme the actual assignment’s Tobs was. A combinatorial number of calculations for a test isn’t great
– accordingly, this sort of test is usually only really considered for small sample sizes. And often, one would
use simulations rather than calculate the distribution of T exactly.
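The simulation approach mentioned above can be sketched as a permutation version of the test: under the sharp null the outcomes are fixed, so we just re-randomize the labels. The function name `randomization_pvalue` and the toy data are my own:

```python
import numpy as np

def randomization_pvalue(y, w, n_sims=10_000, seed=0):
    """Randomization test of the sharp null (the treatment does nothing to anyone).

    Under the sharp null, each unit's outcome is the same under any assignment,
    so only the labels w are random; re-randomize them and recompute T.
    """
    rng = np.random.default_rng(seed)

    def stat(labels):
        return abs(y[labels == 1].mean() - y[labels == 0].mean())

    t_obs = stat(w)
    sims = np.array([stat(rng.permutation(w)) for _ in range(n_sims)])
    return (sims >= t_obs).mean()  # fraction of splits at least as extreme

y = np.array([1.2, 0.8, 3.1, 2.9, 3.3, 1.0])
w = np.array([0, 0, 1, 1, 1, 0])  # observed treatment assignment
p = randomization_pvalue(y, w)
```

With 6 units split 3/3 there are only (6 choose 3) = 20 assignments, and only the observed split and its mirror achieve T_obs, so the exact p-value is 2/20 = 0.1; the simulated p should land near that.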
8 Nonparametric comparisons/regressions.
Say you assume a model Y = E[Y | X] + e = g(X) + e, and you want to estimate g without imposing a functional form. We can use a kernel to describe an estimator based on the observations (intuitively, a kernel-weighted estimate of E[Y | X] divided by one of E[1 | X]):

ĝ(x) = ∑_i y_i K((x − x_i)/h) / ∑_i K((x − x_i)/h)
As h → 0, the bias in your estimator goes to 0. As nh → ∞, the variance in your estimator goes to zero.
(So, "the less you smear each point everywhere, the less you’re aiming at the wrong location. And the more
points you have, the less spread you’ll have.")
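The estimator ĝ above (Nadaraya-Watson kernel regression) can be sketched in a few lines; the Gaussian kernel and the sine toy data are my own choices:

```python
import numpy as np

def nadaraya_watson(x_grid, x, y, h):
    """Kernel regression: g_hat(x) is a kernel-weighted average of the y_i."""
    u = (x_grid[:, None] - x[None, :]) / h
    K = np.exp(-0.5 * u**2)  # Gaussian kernel (its normalizing constant cancels)
    return (K * y[None, :]).sum(axis=1) / K.sum(axis=1)

rng = np.random.default_rng(1)
x = np.sort(rng.uniform(0, 2 * np.pi, 300))
y = np.sin(x) + rng.normal(scale=0.3, size=x.size)
grid = np.array([np.pi / 2, np.pi, 3 * np.pi / 2])
g_hat = nadaraya_watson(grid, x, y, h=0.3)
# g_hat should track the true regression function sin(x) at the grid points
```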
You have to choose both your kernel and your bandwidth. If you’ve chosen the kernel, how do you
choose an optimal bandwidth? Cross-validation. You fit to a random subset of points and find the h that
minimizes the sum of squared residuals on the holdout set of points.
Other nonparametric fitting methods include series estimation, spline interpolation, local linear regres-
sion (LOESS).
R² = 1 − (fraction of variance not explained by M) = 1 − ∑_i (y_i − ŷ_i)² / ∑_i (y_i − ȳ)²
As it turns out, these sum of squared residuals follow χ2 distributions with differing degrees of freedom,
so with a bit of finagling, you can construct an F-statistic of your model M and run an F-test to see whether
your R2 is statistically significant. (Whether that’s of practical use depends on your model, the value of R2 ,
etc.)
In general, you can run F-tests based on comparing a "full" model ŷ_f and a "restricted" model ŷ_r with f and r parameters (so n − f and n − r residual degrees of freedom), respectively. Then you can write:

T = (scaled improvement in SSR using the fuller model) / (SSR not explained even by the fuller model)
  = [(SSR_r − SSR_f)/((n − r) − (n − f))] / [SSR_f/(n − f)]
  = [(SSR_r − SSR_f)/(f − r)] / [SSR_f/(n − f)] ∼ F_{f−r, n−f}
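The F-statistic above can be sketched numerically. This example (my own simulated data) compares an intercept-only restricted model (r = 1) against intercept-plus-slope (f = 2); in that special case SSR_r is the total sum of squares, so the F-statistic is just a rescaling of R²:

```python
import numpy as np

rng = np.random.default_rng(7)
n = 200
x = rng.normal(size=n)
y = 1.0 + 2.0 * x + rng.normal(size=n)

def ssr(X, y):
    """Sum of squared residuals from an OLS fit of y on the columns of X."""
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    return ((y - X @ beta) ** 2).sum()

X_r = np.ones((n, 1))                   # restricted: intercept only (r = 1)
X_f = np.column_stack([np.ones(n), x])  # full: intercept + slope     (f = 2)
ssr_r, ssr_f = ssr(X_r, y), ssr(X_f, y)
F = ((ssr_r - ssr_f) / (2 - 1)) / (ssr_f / (n - 2))
R2 = 1 - ssr_f / ssr_r  # ssr_r is the TSS when the restricted model is a constant
# F is equivalently (R2/(f-r)) / ((1-R2)/(n-f)) here
```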
An assumption behind this model is that the difference between the two groups (Bi/β for male vs. female in this example) would have remained stable over time if no intervention had been applied (Ai/α); this is the "parallel trends" assumption behind difference-in-differences. Without this assumption, there is "mixing" between β and γ (which ultimately leads to higher variances due to having correlated features, plus inaccurate point estimates, since γ "steals" some of the change that would have happened over time anyway).
Logarithmic transformation  Assume the correct form between Xi and Yi is Yi = A · X1i^{β1} · X2i^{β2} · e^{ei}. Take the log of both sides to get a linear form: log(Yi) = β0 + β1 log(X1i) + β2 log(X2i) + ei, where β0 = log(A).
Box-Cox transformation  In this case we assume the correct form is Yi = (Xi^T β + ei)^{−1}. Invert to get 1/Yi = Xi^T β + ei.
Note that this is highly reminiscent of generalized linear models using link functions g so that g(E[Y | X]) = X^T β. Though we have not explicitly written a probability distribution for Y | X, we are implying said distribution and are essentially performing the same sort of optimization, which likely lacks a closed-form solution (and so would be solved using, e.g., iteratively reweighted least squares).
If you’re not sure what form Yi takes, you can try to perform a regression on a series expansion of X1i: Yi = ∑_{j=0}^{k} βj X1i^j + ei. This is a non-parametric (distribution-free) method called series regression.
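Series regression is just OLS on a design matrix of powers of the regressor. A sketch with my own toy target cos(2x) (unknown to the fitted model) and k = 4:

```python
import numpy as np

rng = np.random.default_rng(3)
n, k = 400, 4
x = rng.uniform(-1, 1, size=n)
y = np.cos(2 * x) + rng.normal(scale=0.1, size=n)  # unknown "true" form

# Design matrix of powers x^0, x^1, ..., x^k, then ordinary least squares
X = np.vander(x, k + 1, increasing=True)
beta, *_ = np.linalg.lstsq(X, y, rcond=None)
y_hat_at_0 = beta[0]  # fitted value at x = 0 is just the constant term
# cos(0) = 1, so the series fit at 0 should land near 1
```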
10.5 Dummy variables and their uses (e.g., controlling for group fixed effects).
In general, categorical variables that take the form of indicator variables (1 if true, 0 otherwise) are called
dummy variables. It can be worth including dummy variables that capture, for example, the location of an
applicant or the number of applications an individual sent, grouping those from the same location/same
number of applications together. The idea is that such information may contain variability that is not of
interest to you but could affect the parameters of your features of interest.
For example, say you’re interested in the effect of (only) SAT score on future earnings. Perhaps people
with higher SAT scores send more applications, or people with higher SAT scores went to better schools.
You could capture these into separate groups so that the SATi parameter only captures the effects of SAT
score, and not any subsequent or surrounding effects. (I suspect this will greatly diminish the magnitude
of the effect, and more of it would be captured by Collegei – but perhaps not!)
This notion of creating dummy variables to capture the effects of being in a certain group can be called
controlling for group fixed effects.
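Controlling for group fixed effects amounts to adding one dummy column per group (dropping one level to avoid collinearity with the intercept). A simulated sketch of the SAT example; the group structure, effect sizes, and variable names are my own made-up illustration, not course data:

```python
import numpy as np

rng = np.random.default_rng(5)
n = 1_000
school = rng.integers(0, 3, size=n)                  # three schools (groups)
sat = rng.normal(1000, 100, size=n) + 50 * school    # SAT correlated with school
earnings = 20 + 0.01 * sat + np.array([0.0, 5.0, 10.0])[school] + rng.normal(size=n)

dummies = np.eye(3)[school][:, 1:]                   # drop one level (the baseline)
X = np.column_stack([np.ones(n), sat, dummies])
beta, *_ = np.linalg.lstsq(X, earnings, rcond=None)
sat_coef = beta[1]
# With the school dummies included, sat_coef recovers the within-school SAT effect
# (0.01 here); omitting them would let the school effect load onto the SAT coefficient.
```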
10.6 Regression Discontinuity Design
One form of nonlinear transformation of Xi is to partition the range of Xi and record which partition the realized value falls into: S(k0,k1);i = IsFeatureWithinIntervali, etc.
This can be used in circumstances where an intervention’s effect has a discontinuity/"jump" based on
the value of the running variable, a. (When would such a discontinuity exist? e.g., "Candidate only under
consideration if they achieve at least P = 70 points on the exam." There is discontinuity of effect between
P = 69 and P = 70.)
Under such circumstances, a regression discontinuity (RD) design to evaluate causal effects (according to
one’s model of the world) is of interest. As an example: Is there an increase to all-cause mortality that can
be attributed to individuals reaching legal drinking age? Construct the equation:
Yi = β 0 + β 1 Dai + β 2 ai + ei
Yi represents all-cause mortality, ai is an individual’s age, and Dai represents whether they are of legal
drinking age. Note we include both ai and Dai ; in this way, the coefficient of Dai captures the excess change
of all-cause mortality for reaching legal drinking age, while ai captures the (continuous) effect of increasing
age on all-cause mortality.
Note that an RD design might look statistically significant, but you may in fact just be missing the correct
feature (perhaps a nonlinear transformation of obtained data) in your model. Always inspect your data!
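The drinking-age regression above can be sketched with simulated data. The jump size (2), trend (0.5), and age range are my own made-up numbers for illustration:

```python
import numpy as np

rng = np.random.default_rng(11)
n = 2_000
a = rng.uniform(18, 24, size=n)      # running variable (age)
D = (a >= 21).astype(float)          # of legal drinking age?
y = 1.0 + 2.0 * D + 0.5 * a + rng.normal(size=n)  # true jump of 2 at the cutoff

# Fit Y = b0 + b1*D + b2*a: b1 captures the discontinuity, b2 the smooth age trend
X = np.column_stack([np.ones(n), D, a])
b0, b1, b2 = np.linalg.lstsq(X, y, rcond=None)[0]
```

Because the regression includes both D and a, the jump estimate b1 is not contaminated by the continuous trend, matching the discussion above.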
Suppose the true ("long") model is:

Yi = β0 + β1 X1i + β2 X2i + ei

But we don’t have access to X2i or any proxy for it, so we have to fit our model without it:

Yi = α0 + α1 X1i + wi
If we did have both X1i and X2i , we could see the relationship between the two by running an ancillary
(or auxiliary) regression:
X2i = δ0 + δ1 X1i + ξ i
Intuitively, we can expect that if X2i has an effect on Yi (i.e., β2 ≠ 0), its "effect" will have to be "transferred" over to some other part of the partial model (because we’re still forming an unbiased model of Yi). If it’s
uncorrelated with all the other variables, it will transfer into the error term, making wi := β 2 X2i + ei . But
if there is a correlation between features, then that sneaks into the coefficients of the partial model. For
example, say δ1 > 0, i.e., X1i and X2i are positively correlated. Then in our partial model without X2i , the
model can "make up" for the missing variable by increasing |α̂1 |. More specifically, we could break apart
the partial model’s parameter as:
α̂1 = part actually due to X1 + part that X2i snuck in because they’re correlated
= β 1 + δ1 β 2
So there’s an omitted variable bias (OVB) due to the missing X2i , specifically:
OVB = α̂1 − β 1 = δ1 β 2
You can derive this intuition more rigorously by using the expression for α̂1 , a coefficient for a linear
model:
α̂1 = Cov(Yi, X1i) / Var(X1i)
   = Cov(β0 + β1 X1i + β2 X2i + ei, X1i) / Var(X1i)   (plug in full/true model of Yi)
   = β1 + β2 · Cov(X2i, X1i)/Var(X1i) = β1 + δ1 β2   (break apart into separate covariance terms and evaluate)
If dealing with vectors, a similar analysis can be performed with α̂1 = ( X1T X )−1 X1T Y.
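The α̂1 = β1 + δ1β2 result is easy to confirm by simulation; the parameter values below are my own:

```python
import numpy as np

rng = np.random.default_rng(2)
n = 100_000
beta1, beta2, delta1 = 1.0, 3.0, 0.5
x1 = rng.normal(size=n)
x2 = delta1 * x1 + rng.normal(size=n)       # X2 correlated with X1 (delta1 != 0)
y = 2.0 + beta1 * x1 + beta2 * x2 + rng.normal(size=n)

# "Short" regression of y on x1 alone: slope = Cov(y, x1)/Var(x1)
alpha1_hat = np.cov(y, x1)[0, 1] / np.var(x1, ddof=1)
# alpha1_hat lands near beta1 + delta1*beta2 = 2.5, not near beta1 = 1.0
```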
What is all this useful for? (Presumably we don’t have the omitted variables to place in.) This guides our
thinking and makes sure we frame our analysis properly. We should ask ourselves:
1. Are we missing features X2i that are relevant to our target variable Yi ?
2. Would this feature have a strong effect on Yi ? (β 2 ).
3. Would this missing feature likely be strongly correlated with a feature whose effect we are interested
in? (δ1 ; if it’s only correlated with features we have added in to avoid bias in our estimates, then it
shouldn’t be a huge problem)
4. If there is a correlation with both outcome variable and feature of interest, which way does it bias the
parameter associated with our feature of interest?
Aside: This is reminiscent of M-estimation with the appropriate choice of ρ and Q (except M-estimation
is still estimation, with a theoretically derived estimator and no "train-test split"). If you felt comfortable
assuming a probability law for Y and assuming the necessary conditions, you could make asymptotic Nor-
mality guarantees for ŷ based on the choice of ρ/Q.
With completely free rein, you can choose an f that just perfectly memorizes the input data. But intuitively, if you "overlearned" your data, you probably learned some random idiosyncrasies of that sample rather than "the world as a whole". How do you handle this? Impose a regularization penalty: min_{f∈F} ∑_i Loss(y_i, f(x_i)) + λR(f). (Ridge regression: R(f(θ)) = ‖θ‖₂². LASSO regression: R(f(θ)) = ‖θ‖₁.) Also, have a holdout set to assist in parameter tuning, and a separate holdout set only used at the very end to get a sense for how the finished model will perform "in the real world".
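For ridge regression with squared loss, the penalized minimization has a closed form, which makes the shrinkage easy to see. A sketch with my own simulated data (the function name `ridge` is mine):

```python
import numpy as np

def ridge(X, y, lam):
    """Closed-form ridge regression: minimizes ||y - X b||^2 + lam * ||b||^2."""
    d = X.shape[1]
    return np.linalg.solve(X.T @ X + lam * np.eye(d), X.T @ y)

rng = np.random.default_rng(4)
n, d = 50, 10
X = rng.normal(size=(n, d))
y = X @ np.ones(d) + rng.normal(size=n)

b_ols = ridge(X, y, 0.0)    # lam = 0 reduces to plain OLS
b_reg = ridge(X, y, 100.0)  # a heavier penalty shrinks coefficients toward 0
```

In practice λ would be chosen by holdout/cross-validation performance, as described above, rather than fixed by hand.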
All of this might seem ad-hoc. Is any of this legitimate? Remember, our current focus is on prediction, not
estimation. We’re not concerned with whether the model reflects how the world truly generated the data,
which is in the end never directly observable (estimation; the benefit is attributing importance to different
factors). Rather, we only care about the quality of the model’s predictions outside our sample data, which
we can directly observe – after all, we can leave aside data to be "out of sample" and see how it does. If we
were to draw graphical models, causal models would point features to latent variables that then generate the
observed outcomes, while predictive models would point all the features directly to the observed outcomes.
12 Data Visualization for others
You may want to visualize data for yourself ("exploratory data analysis"). However, there is a different set
of considerations when you are visualizing data for others. Such visualizations must be clear, uncluttered,
and convey important information. (By necessity, you can’t put all of the information into a single visual –
but you shouldn’t be lying by omission!)
A starting point for considering one’s visualizations is Tufte’s principles, which emphasize minimalism (in short, "as little ink as is needed to clearly convey your message"). (That may be overboard, but it’s better to start small and build up than to include everything and pare down.)
For tables, the same general considerations apply: include only what’s necessary and be clear. In R, the
package stargazer is a good starting point.
Yi = α0 + α1 X1i + ei

Endogeneity refers to when the error term is correlated with a feature; in this bivariate case, Cov(X1i, ei) ≠ 0. This is bad; it muddies the waters. We would want α̂1 → α1, but if there’s a correlation, α̂1 → α1 + Cov(X1i, ei)/Var(X1i) ≠ α1. Endogeneity can arise in a number of ways:
1. Omitted Variable Bias. (Discussed earlier – but we never gave an answer as to how to deal with it.
We’ll get there now.)
2. Measurement error in the features. Since we assume the measurements are precise, any measurement
error in X1i would go into ei , making the two correlated.
3. Reverse causality. If, rather than X1i causing Yi, Yi in fact causes X1i, then our whole model is incorrectly specified. (In a regular model, Cov(Yi, ei) = Var(ei). If X1i is what should actually be the target variable, it would make sense for Cov(X1i, ei) ∝ Var(ei) ≠ 0.)
This is a bit of a mess for our estimation methods – our estimates β̂ will be both biased and inconsistent.
How can we deal with it? Using instrumental variables (IVs).
Say X1i is an endogenous variable in our linear model. An instrumental variable Zi for X1i satisfies the
following properties:
1. Cov(X1i, Zi) ≠ 0. (There is some relationship between the instrumental variable and the variable it’s
intended to proxy for.)
2. Cov(ei , Zi ) = 0. (The instrumental variable is otherwise unconnected with the other unobserved
features that had been causing the endogeneity with X1i .)
3. Zi is not a causal determiner of Yi . (Zi isn’t actually just a feature that should have been in your model
in the first place. Its effects on Yi will only be through its ability to proxy X1i .)
The IV estimator β̂_IV = Cov(Yi, Zi)/Cov(X1i, Zi) is biased but consistent!
Before we go further: an IV kind of sounds magical. If Zi is correlated with X1i , won’t it have to also be
correlated with ei (since X1i is correlated with ei )? That is to say, wouldn’t Zi have the exact same problem?
ln(wage)i = α + βeduci + ei
One could expect that many other factors affect earnings as well as years of education, for example some measure of innate ability, which would sit in the error term and be correlated with educi.
As it turns out, if someone is born in the fourth quarter of the year, they tend to join school at around 5
3/4 years old, whereas people born in the first quarter start around the age of 6 3/4 (due to school regulations). Similarly, one couldn’t leave school (to, e.g., join an apprenticeship) until they were 16 years old.
So on average, people born in the fourth quarter were "forced" into more schooling than those in the first
quarter. At the same time, one can reasonably assume that the quarter of the year in which you’re born is
not correlated with your innate ability, etc. In this way, Is4thQuarteri can serve as an instrumental variable
for educi .
How can we deal with this in general? One can perform two-stage least squares. Say you have a model Yi = β0 + β1 X1i + β2 X2i + ei, and X1i is our suspected endogenous variable, for which we have obtained instrumental variables Z1 and Z2. Then:
1. First stage: X1i = π0 + π1 Z1 + π2 Z2 + π3 X2i + wi. Get estimators for π and fitted values X̂1i. We include the other exogenous variables to keep any correlations with the other features. Remember: we’re trying to make a proxy for X1i that is uncorrelated with ei; if X1i is correlated with X2i, we should try to keep that behavior in our estimator. (The more we deviate our proxy from its target, the less good a proxy it becomes.)
2. Second stage: Yi = β0 + β1 X̂1i + β2 X2i + ei. This time, our estimator β̂1 is consistent (but still biased).
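The contrast between OLS and the IV estimator β̂_IV = Cov(Y, Z)/Cov(X1, Z) can be shown in a simulation with a single instrument; the confounder `u`, coefficients, and seed are my own:

```python
import numpy as np

rng = np.random.default_rng(9)
n = 200_000
beta1 = 1.5
u = rng.normal(size=n)                  # unobserved confounder
z = rng.normal(size=n)                  # instrument: relevant, unrelated to u
x1 = 0.8 * z + u + rng.normal(size=n)   # endogenous regressor (correlated with u)
y = 2.0 + beta1 * x1 + u + rng.normal(size=n)  # u also drives y -> Cov(x1, e) != 0

ols = np.cov(y, x1)[0, 1] / np.var(x1, ddof=1)   # inconsistent: picks up u's effect
iv = np.cov(y, z)[0, 1] / np.cov(x1, z)[0, 1]    # consistent for beta1
```

Here OLS is biased upward (the confounder raises both x1 and y), while the IV ratio converges to β1.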
If our instrumental variable is a binary/indicator variable, we can estimate β̂_IV quite simply with the Wald estimate:

β̂_Wald = (E[Yi | Zi = 1] − E[Yi | Zi = 0]) / (E[Ai | Zi = 1] − E[Ai | Zi = 0])

where Ai is the endogenous treatment variable.
In general, we can write:
β̂ = ( Z T X )−1 Z T Y
(If there are more instruments than endogenous variables, you can write something similar, just with
some projection matrices.)
If an instrument is meant to compel subjects to get a treatment, the estimator captures the effect of
treatment on those who are in fact compelled to get treatment because of the instrument. This is called a
local average treatment effect (LATE). (Note that this suggests that this does not describe the population as a
whole, only people that would be swayed by the instrument.)
14 Experimental design.
There are a few important things to keep in mind:
14.1 Randomization methods
Ideally, you can stratify your experimental design. Maybe you can do simple random assignment. You may
be forced to cluster, but try to avoid it if possible.
What about cases where "ideal" randomization is not ideal? You can perform a randomized phase-in of
subgroups in your sample over time; those who haven’t phased in can be your point of comparison. You
can randomize around a cutoff to see how important that cutoff actually is. You can set up an encouragement
design and estimate the local average treatment effect (LATE) of the intervention (rather than force people
to join or be excluded from an intervention).