Data Analysis For Social Scientists (14.310x)
David G. Khachatrian
September 28, 2019
1 Preamble
This was made a good deal after having taken the course. It will likely not be exhaustive. It may also in-
clude some editorializing: bits of what I believe are relevant observations and/or information I have come
across.
Also, many of the earlier topics of probability and "theoretical" statistics were covered in detail in the
Probability (6.431x) and Fundamentals of Statistics (18.6501x) notes, so those topics will either be skipped
or mentioned very quickly in these notes.
2 Probability Terms
A probability on a sample space S is an assignment of numbers P(A) to events A that satisfies the probability axioms (nonnegativity, normalization, countable additivity), where the collection of events forms a sigma-algebra.
You may want to smooth out your plots. One can use kernel density estimation to achieve this. We
extrapolate from known datapoints xi using a kernel function K:
f̂_h(x) = (1/n) ∑_{i=1}^{n} K_h(x − x_i) = (1/(nh)) ∑_{i=1}^{n} K((x − x_i)/h)
K is a probability density with mean 0, and h > 0 is the bandwidth. Increasing h increases the effect each point x_i has on faraway points (it "smooths/flattens things out"). Common choices for K are the Epanechnikov or Normal kernels (for the Normal kernel, σ² essentially plays the role of the bandwidth parameter).
(Note that kernel density estimation is a distortion added to the visualization of the data).
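The KDE formula above can be sketched directly in a few lines. This is a minimal hand-rolled Gaussian-kernel estimator; the function name `gaussian_kde` and the toy data are my own choices, not from the course:

```python
import numpy as np

def gaussian_kde(x_grid, data, h):
    """Kernel density estimate with a Normal kernel of bandwidth h."""
    # Scaled distances between every grid point and every data point
    u = (x_grid[:, None] - data[None, :]) / h
    K = np.exp(-0.5 * u**2) / np.sqrt(2 * np.pi)  # standard Normal density
    # f_hat(x) = (1/(n*h)) * sum_i K((x - x_i)/h)
    return K.sum(axis=1) / (len(data) * h)

data = np.array([0.0, 0.5, 2.0])
grid = np.linspace(-2, 4, 121)
dens = gaussian_kde(grid, data, h=0.5)
# A density estimate should integrate to approximately 1 over its support
area = dens.sum() * (grid[1] - grid[0])
```

Since K is a density, the mixture f̂_h is itself a density, which the final line checks numerically.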
Sometimes you’ll want to plot the CDF. This is especially helpful to determine the existence of first-order
stochastic dominance (FOSD). X dominates Y to the first order if X's α-quantile is greater than or equal to Y's α-quantile for all α ∈ (0, 1), i.e., F_X^{−1}(α) ≥ F_Y^{−1}(α). This corresponds to Pr[X ≤ k] ≤ Pr[Y ≤ k] for all k ∈ ℝ.
Intuitively, "X is always more likely to yield larger numbers than Y". Visually, the CDF of X is always
"to the right of" the CDF of Y.
4 Auctions
The k'th order statistic of n realizations of a random variable X, denoted X_n^{(k)} (n usually suppressed), is the k'th-smallest of the n draws; it describes the probability law governing the (empirical) k/n-quantile of X. Perhaps better explained in mathematical terms (assuming iid draws): the event {X_n^{(k)} ≤ x} is the event that at least k of the n realizations are ≤ x, so

Pr[X_n^{(k)} ≤ x] = ∑_{j=k}^{n} (n choose j) (F_X(x))^j (1 − F_X(x))^{n−j}

where the j'th term is Pr[exactly j of n realizations of X are ≤ x] × Pr[the other (n − j) realizations are > x].
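The order-statistic CDF above is easy to sanity-check by simulation. A quick sketch assuming iid Uniform(0,1) draws (so F(t) = t); the helper name `order_stat_cdf` is my own:

```python
import numpy as np
from math import comb

def order_stat_cdf(x, n, k, F):
    """P(k'th order statistic of n iid draws <= x): at least k draws land <= x."""
    p = F(x)
    return sum(comb(n, j) * p**j * (1 - p) ** (n - j) for j in range(k, n + 1))

rng = np.random.default_rng(0)
n, k, x = 5, 3, 0.6
draws = np.sort(rng.uniform(size=(100_000, n)), axis=1)
empirical = (draws[:, k - 1] <= x).mean()            # k'th-smallest (1-indexed)
theoretical = order_stat_cdf(x, n, k, lambda t: t)   # Uniform(0,1): F(t) = t
# empirical and theoretical should agree to simulation error
```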
We can compare two scenarios, in both of which there are N potential buyers of a good with distribution of valuations/offers O (and we assume zero transaction costs):
1. The seller fixes a price P_threshold = P_t (which can be chosen optimally if the seller knows both N and the distribution of offers O), and sells at the first offer at or above P_t (so P ≥ P_t).
2. The seller sets up an auction; they sell to the second-highest bid (the (N−1)/N-quantile of realized values), P.
This is the pmf of the hypergeometric distribution (drawing n times without replacement from a pool of s successes and f failures):

p_X(x; s, f, n) = (s choose x)(f choose n − x) / (s + f choose n)
There is a similarity to the binomial in terms of expectation and variance as well. Call the fraction of successes before the first draw p = s/(s + f) and the fraction of failures before the first draw q = f/(s + f) = 1 − p:

E[X] = n · s/(s + f) = np

Var(X) = n · (s/(s + f)) · (f/(s + f)) · (s + f − n)/(s + f − 1) = np(1 − p) · (s + f − n)/(s + f − 1) ≤ np(1 − p)
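The mean and variance formulas above can be checked against the pmf by brute force; a small sketch with my own parameter values (s = 7, f = 5, n = 4):

```python
from math import comb

def hypergeom_pmf(x, s, f, n):
    """P(X = x): x successes in n draws without replacement (s successes, f failures)."""
    return comb(s, x) * comb(f, n - x) / comb(s + f, n)

s, f, n = 7, 5, 4
support = range(max(0, n - f), min(n, s) + 1)
mean = sum(x * hypergeom_pmf(x, s, f, n) for x in support)
var = sum((x - mean) ** 2 * hypergeom_pmf(x, s, f, n) for x in support)
p = s / (s + f)
# mean should equal n*p; var should equal n*p*(1-p)*(s+f-n)/(s+f-1)
```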
5.3 F distribution
If X ∼ χ²_n, Y ∼ χ²_m, and X and Y are independent, then

(X/n) / (Y/m) ∼ F_{n,m}

This is useful when using an F-test to compare how well two models fit relative to one another. (Longer explanation about the F-test to come.)
6 RCT design and power calculations.
The power of a test is the probability of correctly rejecting the null when the alternative is true, i.e., one minus the probability of a false negative.
Say you have N individuals, you set γ fraction of them to the treatment group and the rest as control
(both samples large enough to be able to invoke CLT). You choose significance level α. You assume that the
treatment has a constant effect (of unknown magnitude) τ on the population. That is to say, you assume
that:
X̄_c ∼ N(μ, σ²/((1 − γ)N)),   X̄_t ∼ N(μ + τ, σ²/(γN))

for some shared σ² between the two populations (note: this is another assumption, which you can relax if you have reason to). Now in total, the parameters in our experimental design are: N, γ, α, the power, the effect size τ, and the variance σ².
When you fix all the other parameters, you can calculate the induced value for the final parameter.
γ = 0.5 is always the most "efficient" partition but may not always be reasonable/ethical (e.g., medical
study for new disease treatment that turns out to be really effective). α is usually fixed at a low value to
limit false positives. The higher the N, the better (but usually increasing N is costly).
Usually, you choose an effect size τ 0 such that, if τ < τ 0 , you "may as well not even bother". You usually
also estimate σ2 from previous studies. (Both τ and σ2 could also be estimated from a smaller pilot study
of the intervention under consideration.) From there, you can calculate the final value.
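The "calculate the final value" step above can be sketched as a power calculation. This is a minimal version assuming a two-sided z-test on the difference in means; the function name `power` and the use of Python's `statistics.NormalDist` are my own choices, not from the course:

```python
from math import sqrt
from statistics import NormalDist

def power(tau, sigma, N, gamma, alpha=0.05):
    """Approximate power of a two-sided test of tau = 0 under the design above.

    Assumes Xbar_t - Xbar_c ~ N(tau, sigma^2/(gamma*N) + sigma^2/((1-gamma)*N)).
    """
    se = sigma * sqrt(1 / (gamma * N) + 1 / ((1 - gamma) * N))
    z = NormalDist().inv_cdf(1 - alpha / 2)
    # Probability the test statistic falls outside the critical values,
    # when the statistic is centered at tau/se under the alternative
    return 1 - NormalDist(mu=tau / se).cdf(z) + NormalDist(mu=tau / se).cdf(-z)
```

Consistent with the text: at τ = 0 the "power" collapses to α (only false positives), power grows with N, and γ = 0.5 gives the smallest standard error and hence the most power for fixed N.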
T = (Ȳ_t − Ȳ_c)/√(V_Neyman) ≈ N(0, 1)

where the Neyman variance assumes independence between the populations, i.e., V_Neyman = σ_t²/N_t + σ_c²/N_c.
The slightly less beefy version of matched design is stratified design, where you end up having more
than two people in each "group", and you randomly send γ of them to the treatment arm and the rest to the
control arm.
Stratified design still lowers the variance of the estimated treatment effect, but not as much as matched
design. (Stratified design where you end up only having one stratum basically leads to a "regular" split.)
Clustered design involves combining clusters of participants into just 1 datapoint.
Clustered design increases the variance of the estimated treatment effect, so it’s not recommended if you
can avoid it. Clustered design tends to be by necessity rather than desire. For example: you stage an in-
tervention in different classrooms in a school. Ideally, you’d get information about all the students in each
classroom and compare their treatment effect; in reality, the school can only provide information about
aggregate effects across classes with no per-student breakdown. What would otherwise be ∼ 30 data-
points/students become 1 datapoint/class. (So with lower n and no improvement anywhere else, variance
will go up.)
7 Causality.
Often, we are interested in trying to tease out causality among different factors. A common method is struc-
tural equation modeling (SEM), where we assume a structure between observable variables (e.g test scores)
and latent variables (e.g. intelligence), and then calculate the parameters that link these variables together
based on e.g. the covariance matrices.
This course looks more at the Rubin causal model. We will want to design our experiment so that the
stable unit treatment value assumption (SUTVA) holds. SUTVA assumes that "my choice of treat/no-treat for
Person A does not affect the outcome for Person B, and vice-versa". (An example where SUTVA would
not hold is if you choose parts of a local population to get immunized and others don’t – Person A getting
immunized improves Person B’s prospects regardless of whether B gets immunized.)
Another important aspect is to randomly assign individuals to the different arms of the study. This ensures
no built-in selection bias.
Why? A key idea is that "we could have gotten different outcomes for the same person i if they had been given a different intervention k ≠ j, so in general Yi,j ≠ Yi,k."
For an individual i and a study with A treatment arms, let Yi,j be what you would observe (for whatever it is you’re measuring with this study) if you put person i in arm j. We can then assign people i to arm j; denote the choice of assignment Wi (which equals 0, 1, …, A − 1, depending on which arm Person i is assigned to). Since we don’t have A clones of people, we’ll only be able to observe one branch. For simplicity, let’s have A = 2 ("either you get a treatment or you get a control"). So Yi^(obs) = (Yi,j | Wi = a) = Yi,a.
We want to estimate the effect of treatment, so what would make sense is to use sample estimators for E[Yi,1] − E[Yi,0]. What we can actually compute is the difference in observed group means; add and subtract E[Yi,0 | Wi = 1] to rearrange:

E[Yi,1 | Wi = 1] − E[Yi,0 | Wi = 0] = (E[Yi,1 | Wi = 1] − E[Yi,0 | Wi = 1]) + (E[Yi,0 | Wi = 1] − E[Yi,0 | Wi = 0])
= effect of treatment on the treated + selection bias
What happened here? What does Yi,0 | Wi = 1 mean? This refers to what response the people in the treatment arm would have given if they had instead been in the control arm. If there’s a systematic difference between the groups (their baseline responses would differ, i.e., E[Yi,0 | Wi = 1] − E[Yi,0 | Wi = 0] ≠ 0), selection bias is at play. No good.
How can we prevent selection bias from occurring? The simplest way is to use random assignment (so any
bias decreases with increasing n due to Law of Large Numbers). Another benefit is that if we assume that
there is no selection bias, that means the two groups are essentially "from the same population". So our
conditional estimates Yi,j | Wi = j can describe the whole population and not just "those people with the
characteristics we selected for inclusion in our treatment arm", etc. A double-win!
Note that if the above is true, we can view all of this as a linear model (if we’re comfortable making the
linear model assumptions, e.g. Gaussian homoskedastic noise with no serial correlation). We are trying to
estimate E[Yi | Xi ] with the following:
Yi = β0 + β1 IsTreatmenti + ei
In this case, β 0 estimates the average response for the control group and β 1 estimates the average
marginal benefit from undergoing treatment. (The point estimates for β will correspond to the sample
means/difference of sample means of the two groups.)
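The claim that the point estimates equal the group means can be verified with a tiny simulation. A sketch with my own made-up numbers (true control mean 5, constant treatment effect τ = 2):

```python
import numpy as np

rng = np.random.default_rng(42)
n, tau = 10_000, 2.0
treated = rng.integers(0, 2, size=n)           # random assignment -> no selection bias
y = 5.0 + tau * treated + rng.normal(size=n)   # constant treatment effect tau

# OLS of y on [1, IsTreatment] via least squares
X = np.column_stack([np.ones(n), treated])
b0, b1 = np.linalg.lstsq(X, y, rcond=None)[0]
# b0 equals the control-group sample mean; b1 equals the difference in group means
```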
But we can decide to view things differently. Rather than having sampled randomly from a larger
population, what if the data we collected is our population of interest? Then we’d have a good deal of
information; if there are a treatment arms, we have 1/a of all the information about our population. Let’s
return to just two arms; then we have half of all the information.
So here’s a question: Can we test whether the intervention has any effect on anyone? This is a hypothesis
test:
H0: Yi,0 − Yi,1 = 0 ∀ i
H1: ∃ i s.t. Yi,0 − Yi,1 ≠ 0
Note that this is a pretty intense assumption. We aren’t saying "maybe the mean effect is zero"; we’re
saying "this doesn’t do anything to anyone". Hence why we call it a sharp null hypothesis.
Under the null, we now have all the relevant information and can "fill in" our empty rows. We now need to choose a relevant test statistic: one choice is the absolute difference in sample means, T = |Ȳ_1 − Ȳ_0|.
How do we construct a p-value/measure of significance? We have a measure of Tobs for how our exper-
iment split up individuals Wi . What are all the other T corresponding to different treatment-control splits?
That gives us a range of possible T values. We can then see how extreme Tobs was and calculate the p-value
from that.
To be clear, let’s say you had Nc people in the control arm and Nt in the treatment arm. This means you calculate (Nc + Nt choose Nc) combinations of hypothetical assignments, calculate T for each one, and then compare how extreme the actual assignment’s Tobs was. A combinatorial number of calculations for a test isn’t great
– accordingly, this sort of test is usually only really considered for small sample sizes. And often, one would
use simulations rather than calculate the distribution of T exactly.
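The simulation approach mentioned above can be sketched as a permutation version of the test: under the sharp null the outcomes are fixed, so we just re-randomize the labels. The function name `randomization_pvalue` and the toy data are my own:

```python
import numpy as np

def randomization_pvalue(y, w, n_sims=10_000, seed=0):
    """Randomization test of the sharp null (the treatment does nothing to anyone).

    Under the sharp null, each unit's outcome is the same under any assignment,
    so only the labels w are random; re-randomize them and recompute T.
    """
    rng = np.random.default_rng(seed)

    def stat(labels):
        return abs(y[labels == 1].mean() - y[labels == 0].mean())

    t_obs = stat(w)
    sims = np.array([stat(rng.permutation(w)) for _ in range(n_sims)])
    return (sims >= t_obs).mean()  # fraction of splits at least as extreme

y = np.array([1.2, 0.8, 3.1, 2.9, 3.3, 1.0])
w = np.array([0, 0, 1, 1, 1, 0])  # observed treatment assignment
p = randomization_pvalue(y, w)
```

With 6 units split 3/3 there are only (6 choose 3) = 20 assignments, and only the observed split and its mirror achieve T_obs, so the exact p-value is 2/20 = 0.1; the simulated p should land near that.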
8 Nonparametric comparisons/regressions.
Say you assume a model Y = E[Y | X] + e = g(X) + e, and you want to estimate g without imposing a functional form. We can use a kernel to describe an estimator based on the observations (intuitively, a kernel-weighted estimate of E[Y | X] divided by one of E[1 | X]):

ĝ(x) = ∑_i y_i K((x − x_i)/h) / ∑_i K((x − x_i)/h)
As h → 0, the bias in your estimator goes to 0. As nh → ∞, the variance in your estimator goes to zero.
(So, "the less you smear each point everywhere, the less you’re aiming at the wrong location. And the more
points you have, the less spread you’ll have.")
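The estimator ĝ above (Nadaraya-Watson kernel regression) can be sketched in a few lines; the Gaussian kernel and the sine toy data are my own choices:

```python
import numpy as np

def nadaraya_watson(x_grid, x, y, h):
    """Kernel regression: g_hat(x) is a kernel-weighted average of the y_i."""
    u = (x_grid[:, None] - x[None, :]) / h
    K = np.exp(-0.5 * u**2)  # Gaussian kernel (its normalizing constant cancels)
    return (K * y[None, :]).sum(axis=1) / K.sum(axis=1)

rng = np.random.default_rng(1)
x = np.sort(rng.uniform(0, 2 * np.pi, 300))
y = np.sin(x) + rng.normal(scale=0.3, size=x.size)
grid = np.array([np.pi / 2, np.pi, 3 * np.pi / 2])
g_hat = nadaraya_watson(grid, x, y, h=0.3)
# g_hat should track the true regression function sin(x) at the grid points
```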
You have to choose both your kernel and your bandwidth. If you’ve chosen the kernel, how do you
choose an optimal bandwidth? Cross-validation. You fit to a random subset of points and find the h that
minimizes the sum of squared residuals on the holdout set of points.
Other nonparametric fitting methods include series estimation, spline interpolation, local linear regres-
sion (LOESS).
R² = 1 − (fraction of variance not explained by M) = 1 − ∑_i (y_i − ŷ_i)² / ∑_i (y_i − ȳ)²
As it turns out, these sum of squared residuals follow χ2 distributions with differing degrees of freedom,
so with a bit of finagling, you can construct an F-statistic of your model M and run an F-test to see whether
your R2 is statistically significant. (Whether that’s of practical use depends on your model, the value of R2 ,
etc.)
In general, you can run F-tests based on comparing a "full" model ŷ_f and a "restricted" model ŷ_r with f and r parameters (so n − f and n − r residual degrees of freedom), respectively. Then you can write:

T = (scaled improvement in SSR using the fuller model) / (SSR not explained even by the fuller model)
  = [(SSR_r − SSR_f)/((n − r) − (n − f))] / [SSR_f/(n − f)]
  = [(SSR_r − SSR_f)/(f − r)] / [SSR_f/(n − f)] ∼ F_{f−r, n−f}
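The F-statistic above can be sketched numerically. This example (my own simulated data) compares an intercept-only restricted model (r = 1) against intercept-plus-slope (f = 2); in that special case SSR_r is the total sum of squares, so the F-statistic is just a rescaling of R²:

```python
import numpy as np

rng = np.random.default_rng(7)
n = 200
x = rng.normal(size=n)
y = 1.0 + 2.0 * x + rng.normal(size=n)

def ssr(X, y):
    """Sum of squared residuals from an OLS fit of y on the columns of X."""
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    return ((y - X @ beta) ** 2).sum()

X_r = np.ones((n, 1))                   # restricted: intercept only (r = 1)
X_f = np.column_stack([np.ones(n), x])  # full: intercept + slope     (f = 2)
ssr_r, ssr_f = ssr(X_r, y), ssr(X_f, y)
F = ((ssr_r - ssr_f) / (2 - 1)) / (ssr_f / (n - 2))
R2 = 1 - ssr_f / ssr_r  # ssr_r is the TSS when the restricted model is a constant
# F is equivalently (R2/(f-r)) / ((1-R2)/(n-f)) here
```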
An assumption behind this model is that the difference between the two groups (Bi/β for male vs. female in this example) would have remained stable over time if no intervention had been applied (Ai/α); this is the "parallel trends" assumption behind difference-in-differences. Without this assumption, there is "mixing" between β and γ (which ultimately leads to higher variances due to having correlated features, plus inaccurate point estimates, since γ "steals" some of the change that would have happened over time anyway).
Logarithmic transformation  Assume the correct form between Xi and Yi is Yi = A · X1i^{β1} · X2i^{β2} · e^{ei}. Take the log of both sides to get a linear form: log(Yi) = β0 + β1 log(X1i) + β2 log(X2i) + ei, where β0 = log(A).
Box-Cox transformation  In this case we assume the correct form is Yi = (Xi^T β + ei)^{−1}. Invert to get 1/Yi = Xi^T β + ei.
Note that this is highly reminiscent of generalized linear models using link functions g so that g(E[Y | X]) = X^T β. Though we have not explicitly written a probability distribution for Y | X, we are implying said distribution and are essentially performing the same sort of optimization, which likely lacks a closed-form solution (and so would be solved using, e.g., iteratively reweighted least squares).
If you’re not sure what form Yi takes, you can try to perform a regression on a series expansion of X1i: Yi = ∑_{j=0}^{k} βj X1i^j + ei. This is a non-parametric (distribution-free) method called series regression.
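Series regression is just OLS on a design matrix of powers of the regressor. A sketch with my own toy target cos(2x) (unknown to the fitted model) and k = 4:

```python
import numpy as np

rng = np.random.default_rng(3)
n, k = 400, 4
x = rng.uniform(-1, 1, size=n)
y = np.cos(2 * x) + rng.normal(scale=0.1, size=n)  # unknown "true" form

# Design matrix of powers x^0, x^1, ..., x^k, then ordinary least squares
X = np.vander(x, k + 1, increasing=True)
beta, *_ = np.linalg.lstsq(X, y, rcond=None)
y_hat_at_0 = beta[0]  # fitted value at x = 0 is just the constant term
# cos(0) = 1, so the series fit at 0 should land near 1
```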
10.5 Dummy variables and their uses (e.g., controlling for group fixed effects).
In general, categorical variables that take the form of indicator variables (1 if true, 0 otherwise) are called
dummy variables. It can be worth including dummy variables that capture, for example, the location of an
applicant or the number of applications an individual sent, grouping those from the same location/same
number of applications together. The idea is that such information may contain variability that is not of
interest to you but could affect the parameters of your features of interest.
For example, say you’re interested in the effect of (only) SAT score on future earnings. Perhaps people
with higher SAT scores send more applications, or people with higher SAT scores went to better schools.
You could capture these into separate groups so that the SATi parameter only captures the effects of SAT
score, and not any subsequent or surrounding effects. (I suspect this will greatly diminish the magnitude
of the effect, and more of it would be captured by Collegei – but perhaps not!)
This notion of creating dummy variables to capture the effects of being in a certain group can be called
controlling for group fixed effects.
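Controlling for group fixed effects amounts to adding one dummy column per group (dropping one level to avoid collinearity with the intercept). A simulated sketch of the SAT example; the group structure, effect sizes, and variable names are my own made-up illustration, not course data:

```python
import numpy as np

rng = np.random.default_rng(5)
n = 1_000
school = rng.integers(0, 3, size=n)                  # three schools (groups)
sat = rng.normal(1000, 100, size=n) + 50 * school    # SAT correlated with school
earnings = 20 + 0.01 * sat + np.array([0.0, 5.0, 10.0])[school] + rng.normal(size=n)

dummies = np.eye(3)[school][:, 1:]                   # drop one level (the baseline)
X = np.column_stack([np.ones(n), sat, dummies])
beta, *_ = np.linalg.lstsq(X, earnings, rcond=None)
sat_coef = beta[1]
# With the school dummies included, sat_coef recovers the within-school SAT effect
# (0.01 here); omitting them would let the school effect load onto the SAT coefficient.
```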
10.6 Regression Discontinuity Design
One form of nonlinear transformation of Xi is to partition the range of Xi and record which partition the realized value falls into: S(k0,k1);i = IsFeatureWithinIntervali, etc.
This can be used in circumstances where an intervention’s effect has a discontinuity/"jump" based on
the value of the running variable, a. (When would such a discontinuity exist? e.g., "Candidate only under
consideration if they achieve at least P = 70 points on the exam." There is discontinuity of effect between
P = 69 and P = 70.)
Under such circumstances, a regression discontinuity (RD) design to evaluate causal effects (according to
one’s model of the world) is of interest. As an example: Is there an increase to all-cause mortality that can
be attributed to individuals reaching legal drinking age? Construct the equation:
Yi = β 0 + β 1 Dai + β 2 ai + ei
Yi represents all-cause mortality, ai is an individual’s age, and Dai represents whether they are of legal
drinking age. Note we include both ai and Dai ; in this way, the coefficient of Dai captures the excess change
of all-cause mortality for reaching legal drinking age, while ai captures the (continuous) effect of increasing
age on all-cause mortality.
Note that an RD design might look statistically significant, but you may in fact just be missing the correct
feature (perhaps a nonlinear transformation of obtained data) in your model. Always inspect your data!
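The drinking-age regression above can be sketched with simulated data. The jump size (2), trend (0.5), and age range are my own made-up numbers for illustration:

```python
import numpy as np

rng = np.random.default_rng(11)
n = 2_000
a = rng.uniform(18, 24, size=n)      # running variable (age)
D = (a >= 21).astype(float)          # of legal drinking age?
y = 1.0 + 2.0 * D + 0.5 * a + rng.normal(size=n)  # true jump of 2 at the cutoff

# Fit Y = b0 + b1*D + b2*a: b1 captures the discontinuity, b2 the smooth age trend
X = np.column_stack([np.ones(n), D, a])
b0, b1, b2 = np.linalg.lstsq(X, y, rcond=None)[0]
```

Because the regression includes both D and a, the jump estimate b1 is not contaminated by the continuous trend, matching the discussion above.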
Suppose the true ("long") model is:

Yi = β0 + β1 X1i + β2 X2i + ei

But we don’t have access to X2i or any proxy for it, so we have to fit our model without it:

Yi = α0 + α1 X1i + wi
If we did have both X1i and X2i , we could see the relationship between the two by running an ancillary
(or auxiliary) regression:
X2i = δ0 + δ1 X1i + ξ i
Intuitively, we can expect that if X2i has an effect on Yi (i.e., β2 ≠ 0), its "effect" will have to be "transferred" over to some other part of the partial model (because we’re still forming an unbiased model of Yi). If it’s
uncorrelated with all the other variables, it will transfer into the error term, making wi := β 2 X2i + ei . But
if there is a correlation between features, then that sneaks into the coefficients of the partial model. For
example, say δ1 > 0, i.e., X1i and X2i are positively correlated. Then in our partial model without X2i , the
model can "make up" for the missing variable by increasing |α̂1 |. More specifically, we could break apart
the partial model’s parameter as:
α̂1 = part actually due to X1 + part that X2i snuck in because they’re correlated
= β 1 + δ1 β 2
So there’s an omitted variable bias (OVB) due to the missing X2i , specifically:
OVB = α̂1 − β 1 = δ1 β 2
You can derive this intuition more rigorously by using the expression for α̂1 , a coefficient for a linear
model:
α̂1 = Cov(Yi, X1i) / Var(X1i)
   = Cov(β0 + β1 X1i + β2 X2i + ei, X1i) / Var(X1i)   (plug in full/true model of Yi)
   = β1 + β2 · Cov(X2i, X1i)/Var(X1i) = β1 + δ1 β2   (break apart into separate covariance terms and evaluate)
If dealing with vectors, a similar analysis can be performed with α̂1 = ( X1T X )−1 X1T Y.
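The α̂1 = β1 + δ1β2 result is easy to confirm by simulation; the parameter values below are my own:

```python
import numpy as np

rng = np.random.default_rng(2)
n = 100_000
beta1, beta2, delta1 = 1.0, 3.0, 0.5
x1 = rng.normal(size=n)
x2 = delta1 * x1 + rng.normal(size=n)       # X2 correlated with X1 (delta1 != 0)
y = 2.0 + beta1 * x1 + beta2 * x2 + rng.normal(size=n)

# "Short" regression of y on x1 alone: slope = Cov(y, x1)/Var(x1)
alpha1_hat = np.cov(y, x1)[0, 1] / np.var(x1, ddof=1)
# alpha1_hat lands near beta1 + delta1*beta2 = 2.5, not near beta1 = 1.0
```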
What is all this useful for? (Presumably we don’t have the omitted variables to place in.) This guides our
thinking and makes sure we frame our analysis properly. We should ask ourselves:
1. Are we missing features X2i that are relevant to our target variable Yi ?
2. Would this feature have a strong effect on Yi ? (β 2 ).
3. Would this missing feature likely be strongly correlated with a feature whose effect we are interested
in? (δ1 ; if it’s only correlated with features we have added in to avoid bias in our estimates, then it
shouldn’t be a huge problem)
4. If there is a correlation with both outcome variable and feature of interest, which way does it bias the
parameter associated with our feature of interest?
Aside: This is reminiscent of M-estimation with the appropriate choice of ρ and Q (except M-estimation
is still estimation, with a theoretically derived estimator and no "train-test split"). If you felt comfortable
assuming a probability law for Y and assuming the necessary conditions, you could make asymptotic Nor-
mality guarantees for ŷ based on the choice of ρ/Q.
With completely free rein, you can choose an f that just perfectly memorizes the input data. But intuitively, if you "overlearned" your data, you probably learned some random idiosyncrasies of that sample rather than "the world as a whole". How do you handle this? Impose a regularization penalty: min_{f∈F} ∑_i Loss(y_i, f(x_i)) + λR(f). (Ridge regression: R(f(θ)) = ‖θ‖₂². LASSO regression: R(f(θ)) = ‖θ‖₁.) Also, have a holdout set to assist in parameter tuning, and a separate holdout set only used at the very end to get a sense for how the finished model will perform "in the real world".
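For ridge regression with squared loss, the penalized minimization has a closed form, which makes the shrinkage easy to see. A sketch with my own simulated data (the function name `ridge` is mine):

```python
import numpy as np

def ridge(X, y, lam):
    """Closed-form ridge regression: minimizes ||y - X b||^2 + lam * ||b||^2."""
    d = X.shape[1]
    return np.linalg.solve(X.T @ X + lam * np.eye(d), X.T @ y)

rng = np.random.default_rng(4)
n, d = 50, 10
X = rng.normal(size=(n, d))
y = X @ np.ones(d) + rng.normal(size=n)

b_ols = ridge(X, y, 0.0)    # lam = 0 reduces to plain OLS
b_reg = ridge(X, y, 100.0)  # a heavier penalty shrinks coefficients toward 0
```

In practice λ would be chosen by holdout/cross-validation performance, as described above, rather than fixed by hand.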
All of this might seem ad-hoc. Is any of this legitimate? Remember, our current focus is on prediction, not
estimation. We’re not concerned with whether the model reflects how the world truly generated the data,
which is in the end never directly observable (estimation; the benefit is attributing importance to different
factors). Rather, we only care about the quality of the model’s predictions outside our sample data, which
we can directly observe – after all, we can leave aside data to be "out of sample" and see how it does. If we
were to draw graphical models, causal models would point features to latent variables that then generate the
observed outcomes, while predictive models would point all the features directly to the observed outcomes.
12 Data Visualization for others
You may want to visualize data for yourself ("exploratory data analysis"). However, there is a different set
of considerations when you are visualizing data for others. Such visualizations must be clear, uncluttered,
and convey important information. (By necessity, you can’t put all of the information into a single visual –
but you shouldn’t be lying by omission!)
A starting point for considering one’s visualizations is Tufte’s principles, which emphasize minimalism (in short, "as little ink as is needed to clearly convey your message"). (That may be overboard, but it’s better to start small and build up than to include everything and pare down.)
For tables, the same general considerations apply: include only what’s necessary and be clear. In R, the
package stargazer is a good starting point.
Yi = α0 + α1 X1i + ei

Endogeneity refers to when the error term is correlated with a feature; in this bivariate case, Cov(X1i, ei) ≠ 0. This is bad; it muddies the waters. We would want α̂1 → α1, but if there’s a correlation, α̂1 → α1 + Cov(X1i, ei)/Var(X1i) ≠ α1. Endogeneity can arise in a number of ways:
1. Omitted Variable Bias. (Discussed earlier – but we never gave an answer as to how to deal with it.
We’ll get there now.)
2. Measurement error in the features. Since we assume the measurements are precise, any measurement
error in X1i would go into ei , making the two correlated.
3. Reverse causality. If, rather than X1i causing Yi, Yi in fact causes X1i, then our whole model is incorrectly specified. (In a regular model, Cov(Yi, ei) = Var(ei). If X1i is what should actually be the target variable, it would make sense for Cov(X1i, ei) ∝ Var(ei) ≠ 0.)
This is a bit of a mess for our estimation methods – our estimates β̂ will be both biased and inconsistent.
How can we deal with it? Using instrumental variables (IVs).
Say X1i is an endogenous variable in our linear model. An instrumental variable Zi for X1i satisfies the
following properties:
1. Cov(X1i, Zi) ≠ 0. (There is some relationship between the instrumental variable and the variable it’s
intended to proxy for.)
2. Cov(ei , Zi ) = 0. (The instrumental variable is otherwise unconnected with the other unobserved
features that had been causing the endogeneity with X1i .)
3. Zi is not a causal determiner of Yi . (Zi isn’t actually just a feature that should have been in your model
in the first place. Its effects on Yi will only be through its ability to proxy X1i .)
The IV estimator β̂_IV = Cov(Yi, Zi)/Cov(X1i, Zi) is biased but consistent!
Before we go further: an IV kind of sounds magical. If Zi is correlated with X1i , won’t it have to also be
correlated with ei (since X1i is correlated with ei )? That is to say, wouldn’t Zi have the exact same problem?
ln(wage)i = α + βeduci + ei
One could expect that many other factors affect earnings as well as years of education, for example some measure of innate ability, which would sit in the error term and be correlated with educi.
As it turns out, if someone is born in the fourth quarter of the year, they tend to join school at around 5
3/4 years old, whereas people born in the first quarter start around the age of 6 3/4 (due to school regulations). Similarly, one couldn’t leave school (to, e.g., join an apprenticeship) until they were 16 years old.
So on average, people born in the fourth quarter were "forced" into more schooling than those in the first
quarter. At the same time, one can reasonably assume that the quarter of the year in which you’re born is
not correlated with your innate ability, etc. In this way, Is4thQuarteri can serve as an instrumental variable
for educi .
How can we deal with this in general? One can perform two-stage least squares. Say you have a model Yi = β0 + β1 X1i + β2 X2i + ei, and X1i is our suspected endogenous variable, for which we have obtained instrumental variables Z1 and Z2. Then:
1. First stage: X1i = π0 + π1 Z1 + π2 Z2 + π3 X2i + wi. Get estimators for π and fitted values X̂1i. We include the other exogenous variables to keep any correlations with the other features. Remember: we’re trying to make a proxy for X1i that is uncorrelated with ei; if X1i is correlated with X2i, we should try to keep that behavior in our estimator. (The more we deviate our proxy from its target, the less good a proxy it becomes.)
2. Second stage: Yi = β0 + β1 X̂1i + β2 X2i + ei. This time, our estimator β̂1 is consistent (but still biased).
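The contrast between OLS and the IV estimator β̂_IV = Cov(Y, Z)/Cov(X1, Z) can be shown in a simulation with a single instrument; the confounder `u`, coefficients, and seed are my own:

```python
import numpy as np

rng = np.random.default_rng(9)
n = 200_000
beta1 = 1.5
u = rng.normal(size=n)                  # unobserved confounder
z = rng.normal(size=n)                  # instrument: relevant, unrelated to u
x1 = 0.8 * z + u + rng.normal(size=n)   # endogenous regressor (correlated with u)
y = 2.0 + beta1 * x1 + u + rng.normal(size=n)  # u also drives y -> Cov(x1, e) != 0

ols = np.cov(y, x1)[0, 1] / np.var(x1, ddof=1)   # inconsistent: picks up u's effect
iv = np.cov(y, z)[0, 1] / np.cov(x1, z)[0, 1]    # consistent for beta1
```

Here OLS is biased upward (the confounder raises both x1 and y), while the IV ratio converges to β1.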
If our instrumental variable is a binary/indicator variable, we can estimate β̂_IV quite simply with the Wald estimate:

β̂_Wald = (E[Yi | Zi = 1] − E[Yi | Zi = 0]) / (E[Ai | Zi = 1] − E[Ai | Zi = 0])

where Ai is the endogenous treatment variable.
In general, we can write:
β̂ = ( Z T X )−1 Z T Y
(If there are more instruments than endogenous variables, you can write something similar, just with
some projection matrices.)
If an instrument is meant to compel subjects to get a treatment, the estimator captures the effect of
treatment on those who are in fact compelled to get treatment because of the instrument. This is called a
local average treatment effect (LATE). (Note that this suggests that this does not describe the population as a
whole, only people that would be swayed by the instrument.)
14 Experimental design.
There are a few important things to keep in mind:
14.1 Randomization methods
Ideally, you can stratify your experimental design. Maybe you can do simple random assignment. You may
be forced to cluster, but try to avoid it if possible.
What about cases where "ideal" randomization is not ideal? You can perform a randomized phase-in of
subgroups in your sample over time; those who haven’t phased in can be your point of comparison. You
can randomize around a cutoff to see how important that cutoff actually is. You can set up an encouragement
design and estimate the local average treatment effect (LATE) of the intervention (rather than force people
to join or be excluded from an intervention).