Lecture 3-1_Introduction to Multiple Regression
Xinwei Ma
Department of Economics
UC San Diego
Spring 2020
• The California data contain test scores and student-to-teacher ratios for 420 school districts. These are survey data, meaning that class size (the student-to-teacher ratio) is not randomized.
• We still believe there is a causal effect of class size (X ) on student performance (Y ), but there
might be other factors (u) affecting both X and Y at the same time.
• For example, a better funded school district is likely to have smaller classes, and attracts more
experienced teachers.
[Causal diagram: expenPerStu affects both stuTeacherRatio and testScore.]
In this case, our zero conditional mean assumption is likely to fail: E[u|X] ≠ 0.
• The above causal diagram applies to many economic datasets. As social scientists, we rarely
have the luxury to conduct randomized experiments, and hence many economic studies rely on
careful analysis of the underlying causal mechanism and advanced econometric methods.
• We call a variable a confounder if it affects both X and Y. We will discuss how to explicitly control for confounding effects later in this class.
To assess the consequence of violating the zero conditional mean assumption, we start
from the model
testScore = β0 + β1 stuTeacherRatio + u,
where the error term includes the variable expenPerStu.
• Now consider the conditional expectation of the error term. More specifically, assume the error
term contains expenPerStu (funding status of a school district).
• If we observe a school district with small classes, then it is likely that this school district is better
funded, and vice versa. Therefore,
E[expenPerStu|stuTeacherRatio is large] < E[expenPerStu|stuTeacherRatio is small].
• Therefore, our estimate, β̂1, will overstate the true causal effect (make it look more negative than it is). That is, β̂1 < β1 < 0 in large samples.
The previous analysis tells us that the slope estimate is biased, and provides qualitative
analysis of this bias.
• These are two models about the relationship between class size and student performance.
• The difference is that the funding status of a school district (i.e., expenPerStu) is absorbed into
the error term in the short regression, while in the long regression, this information is separated
from the error term.
• Separating a variable from the error term may affect the intercept, but this is not relevant for our
purpose, as our primary focus is on the slope parameter.
• The slope parameter remains the same. Remember, the slope parameter represents the causal
effect of class size on student performance, which does not depend on how we model the
variables.
• Key assumption:
E[u_long | stuTeacherRatio, expenPerStu] = 0, but E[u_short | stuTeacherRatio] ≠ 0
© Xinwei Ma 2021
Omitted Variable Bias: An Example
• Because we include both stuTeacherRatio and expenPerStu explicitly in the long regression,
we believe this assumption is more plausible. It is, however, still an assumption, which means it
may be violated in practice.
Consider the short regression, and assume we obtained β̂1,short by regressing testScore
on stuTeacherRatio only. We know that β̂1,short is biased for β1 (i.e., inconsistent).
β̂1,short →p β1 + β2 × Cov[stuTeacherRatio, expenPerStu] / V[stuTeacherRatio],
where β1 < 0, β2 > 0, and Cov[stuTeacherRatio, expenPerStu] / V[stuTeacherRatio] < 0, so the bias term is negative.
Omitted Variable Bias: An Example
We also computed the bias of β̂1,short , which is obtained from running the short
regression
β̂1,short →p β1 + β2 × Cov[stuTeacherRatio, expenPerStu] / V[stuTeacherRatio].
• The term β2 × Cov[stuTeacherRatio, expenPerStu] / V[stuTeacherRatio] is called the omitted variable bias (OVB).
• We are primarily interested in β1, but β2 is of some interest, too. β1 still represents the effect of class size on test score, holding all other factors constant:
∆testScore = β1 ∆stuTeacherRatio + β2 ∆expenPerStu + ∆u
= β1 ∆stuTeacherRatio (∆expenPerStu = 0, ∆u = 0).
• By explicitly including expenPerStu in the equation, we have taken it out of the error term.
• If expenPerStu is a good proxy for a school district’s funding status, this may lead to a more
persuasive estimate of the causal effect of class size, because it is more plausible that the zero
conditional mean assumption holds in our data: E[u|stuTeacherRatio, expenPerStu] = 0.
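The plim formula above can be checked by simulation. The following is a minimal Python sketch (not part of the course's Stata workflow); all numbers are made up, chosen only to match the signs in the example:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 200_000

# Hypothetical data: Cov(stuTeacherRatio, expenPerStu) < 0, beta1 < 0, beta2 > 0.
stu_teacher = rng.normal(20, 2, n)
expen = 6000 - 150 * stu_teacher + rng.normal(0, 300, n)
beta0, beta1, beta2 = 700.0, -1.0, 0.005
test_score = beta0 + beta1 * stu_teacher + beta2 * expen + rng.normal(0, 5, n)

# Short regression slope: regress test_score on stu_teacher only.
b1_short = np.cov(stu_teacher, test_score)[0, 1] / np.var(stu_teacher, ddof=1)

# OVB formula: plim of b1_short is beta1 + beta2 * Cov(X1, X2) / V(X1).
ovb = beta2 * np.cov(stu_teacher, expen)[0, 1] / np.var(stu_teacher, ddof=1)
print(b1_short)        # close to beta1 + ovb, and more negative than beta1
print(beta1 + ovb)
```

With these made-up numbers the short regression slope lands near −1.75 even though the true β1 is −1, reproducing the pattern β̂1 < β1 < 0.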
• In a simple regression approach, we would relate final to missed (number of lectures missed).
• We already know that 100 · β1 is the percentage change in wage when education increases by
one year. 100 · β2 has a similar interpretation (for a one point increase in iq).
• β3 and β4 are harder to interpret, but we can use calculus to get the slope of ln(wage) with
respect to exper:
∂ln(wage)/∂exper = β3 + 2β4 exper
Multiply by 100 to get the percentage effect. (More later.)
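To make the formula concrete, here is a tiny numeric sketch; the coefficient values are invented for illustration only:

```python
# Hypothetical coefficients on exper and exper^2 (made up for illustration).
beta3, beta4 = 0.040, -0.0008

def pct_effect_of_exper(exper):
    """Approximate % change in wage from one more year of experience: 100*(b3 + 2*b4*exper)."""
    return 100 * (beta3 + 2 * beta4 * exper)

print(pct_effect_of_exper(5))    # 3.2: about a 3.2% wage increase at 5 years
print(pct_effect_of_exper(30))   # negative: the marginal effect flips sign at high experience
```

Note the quadratic term makes the marginal effect depend on the level of experience; with these numbers it turns negative after 25 years.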
Y = β0 + β1 X1 + β2 X2 + u,
where
• β0 is the intercept,
• β1 measures the change in Y with respect to X1 , holding other factors (X2 and u) fixed,
• β2 measures the change in Y with respect to X2 , holding other factors (X1 and u) fixed.
• For any values of X1 and X2 in the population, the average unobservable is equal to zero. (The value zero is not important because we have an intercept, β0, in the equation.)
• Other factors, such as “motivation” and “teacher's experience,” are part of u. Motivation is very difficult to measure. Experience is easier:
testScore = β0 + β1 stuTeacherRatio + β2 expenPerStu + β3 teacherExp + u
Y = β0 + β1 X1 + β2 X2 + . . . + βk Xk + u.
Multiple regression allows us to explicitly control for “other factors” and incorporate more
flexible functional forms.
Key assumption for the general multiple regression model: the zero conditional mean
assumption
E[u|X1 , X2 , · · · , Xk ] = 0.
• Provided we are careful, we can make this condition closer to being true by “controlling for”
more variables. In the class size example, we “control for” expenditure per student when
estimating the effect of class size on student performance.
Yi = β0 + β1 Xi1 + β2 Xi2 + · · · + βk Xik + ui.
• Now the regressors have two subscripts: i is the observation number (as always) and the second
subscript labels a particular variable.
E[u|X1 , X2 , · · · , Xk ] = 0.
These are the conditions in the population that determine the parameters. So we use
their sample analogs, which is a method of moments approach to estimation.
We define our estimates as the solution to the following sample moment conditions:
0 = (1/n) Σᵢ₌₁ⁿ (Yi − β̂0 − β̂1 Xi1 − β̂2 Xi2 − · · · − β̂k Xik)
0 = (1/n) Σᵢ₌₁ⁿ Xi1 (Yi − β̂0 − β̂1 Xi1 − β̂2 Xi2 − · · · − β̂k Xik)
0 = (1/n) Σᵢ₌₁ⁿ Xi2 (Yi − β̂0 − β̂1 Xi1 − β̂2 Xi2 − · · · − β̂k Xik)
⋮
0 = (1/n) Σᵢ₌₁ⁿ Xik (Yi − β̂0 − β̂1 Xi1 − β̂2 Xi2 − · · · − β̂k Xik)
The above are k + 1 equations with k + 1 unknowns, which allows us to obtain the
estimates β̂0 , β̂1 , β̂2 , · · · , β̂k .
To obtain expressions for the estimates, β̂0 , β̂1 , β̂2 , · · · , β̂k , we need to use linear algebra
(not required). Fortunately, modern statistical platforms (such as Stata) can compute the
estimates very fast.
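Although the linear algebra is not required, the computation the software performs can be sketched in a few lines. The sample moment conditions are the "normal equations" X'X β̂ = X'Y. A minimal Python illustration on simulated data (all numbers hypothetical):

```python
import numpy as np

rng = np.random.default_rng(1)
n, k = 500, 2
X = np.column_stack([np.ones(n), rng.normal(size=(n, k))])  # intercept plus k regressors
beta = np.array([1.0, 2.0, -3.0])                           # made-up true parameters
Y = X @ beta + rng.normal(size=n)

# The k + 1 sample moment conditions say X'(Y - X beta_hat) = 0,
# i.e. the normal equations X'X beta_hat = X'Y.
beta_hat = np.linalg.solve(X.T @ X, X.T @ Y)

# Check: each moment condition holds at beta_hat (up to floating-point error).
moments = X.T @ (Y - X @ beta_hat) / n
print(beta_hat)
print(moments)
```

The k + 1 moment conditions pin down the k + 1 estimates exactly, which is why the printed moments are numerically zero.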
Mechanics and Interpretation in Multiple Regression
• If we believe that the “long regression” better reflects a causal relationship between class size
and test score (because the zero conditional mean assumption is more plausible after we
explicitly control for funding status of a school district), then the “short regression” is likely to
over-state the effect of class size.
• We first regress final (out of 40 points) on missed (lectures missed out of 32) in a simple
regression analysis. Then we add priGPA (prior GPA) as a control for student ability:
Short regression: final-hat = 26.60 − 0.121 missed.
• The simple regression estimate implies that, say, 10 more missed classes reduce the predicted score by about 1.2 points (out of 40).
• The coefficient on missed actually becomes positive, but it is very small. (Later, we will see it is
not statistically different from zero.)
• The coefficient on priGPA means that one more point on prior GPA (for example, from 2.5 to
3.5) predicts a final exam score that is 3.24 points higher. However, it is unclear if this reflects
any interesting causal relationship.
• If we believe the “long regression” better reflects a causal relationship between attendance and
class performance (because the zero conditional mean assumption is more plausible after we
explicitly control for student ability), then the “short regression” is likely to over-state the effect
of attendance.
Some properties:
• The residuals always sum to zero: Σᵢ₌₁ⁿ ûi = 0. This implies that the sample average of Ŷi equals the sample average of Yi.
• Each regressor has a zero sample correlation (covariance) with the residuals: Σᵢ₌₁ⁿ Xij ûi = 0 for all 1 ≤ j ≤ k. This follows from the sample moment conditions. It implies that Ŷi and ûi are also uncorrelated: Σᵢ₌₁ⁿ Ŷi ûi = 0.
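These properties can be verified numerically. A minimal Python sketch on simulated data (all values hypothetical):

```python
import numpy as np

rng = np.random.default_rng(2)
n = 300
X = np.column_stack([np.ones(n), rng.normal(size=(n, 2))])  # intercept + 2 regressors
Y = X @ np.array([1.0, 0.5, -0.5]) + rng.normal(size=n)

beta_hat = np.linalg.solve(X.T @ X, X.T @ Y)  # OLS via the normal equations
fitted = X @ beta_hat
resid = Y - fitted

print(resid.sum())                # numerically zero: residuals sum to zero
print(X.T @ resid)                # numerically zero: regressors orthogonal to residuals
print(fitted @ resid)             # numerically zero: fitted values orthogonal to residuals
print(Y.mean() - fitted.mean())   # numerically zero: mean of Y equals mean of fitted values
```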
The R-squared is
R² = SSE/SST = 1 − SSR/SST,
where SST, SSE, and SSR are the total, explained, and residual sums of squares:
SST = Σᵢ₌₁ⁿ (Yi − Ȳ)², SSE = Σᵢ₌₁ⁿ (Ŷi − Ȳ)², SSR = Σᵢ₌₁ⁿ ûi².
• R² is a useful summary measure but tells us nothing about causality. Having a “high” R-squared is neither necessary nor sufficient to infer causality.
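As a quick check that the two R² formulas coincide, here is a Python sketch on simulated data (all numbers hypothetical):

```python
import numpy as np

rng = np.random.default_rng(3)
n = 400
X = np.column_stack([np.ones(n), rng.normal(size=(n, 2))])
Y = X @ np.array([2.0, 1.0, -1.0]) + rng.normal(size=n)

beta_hat = np.linalg.solve(X.T @ X, X.T @ Y)   # OLS estimates
fitted = X @ beta_hat
resid = Y - fitted

sst = ((Y - Y.mean()) ** 2).sum()        # total sum of squares
sse = ((fitted - Y.mean()) ** 2).sum()   # explained sum of squares
ssr = (resid ** 2).sum()                 # residual sum of squares

r2_a = sse / sst
r2_b = 1 - ssr / sst
print(r2_a, r2_b)   # the two formulas agree because SST = SSE + SSR
```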
We will illustrate how to use Stata for multiple regression with the dataset
Data-classSize-testScore.dta
This dataset contains information on 420 school districts in California. We will focus on
three variables, testScore, stuTeacherRatio, and expenPerStu.
---------------------------------------------------------------------------------
testScore | Coef. Std. Err. t P>|t| [95% Conf. Interval]
----------------+----------------------------------------------------------------
stuTeacherRatio | -1.763149 .6108781 -2.89 0.004 -2.963933 -.5623648
expenPerStu | .0024835 .0018231 1.36 0.174 -.0011 .006067
_cons | 675.596 19.56124 34.54 0.000 637.1451 714.047
---------------------------------------------------------------------------------
• Stata stores the estimates in an object called e(b). (Technically, this is a 1 × 3 matrix.) To
show the slope estimate, we use disp e(b)[1,1] and disp e(b)[1,2]. We can also access the
intercept estimate by disp e(b)[1,3].
Method 2: Automatic:
• After running the regression, we can use the predict command to compute the fitted values:
predict testScoreHat2
• The above tells Stata to generate all fitted values and store them in a new variable called
testScoreHat2.
We can check if the two approaches agree. Indeed, the two variables, testScoreHat and
testScoreHat2, are identical:
count if testScoreHat == testScoreHat2
Finite-Sample Properties of OLS Estimates
Method 2: Automatic:
• After running the regression, we can use the predict command to compute the residuals:
predict resid2, residual
• The above tells Stata to generate all residuals and store them in a new variable called resid2.
We can check if the two approaches agree. Indeed, the two variables, resid and resid2,
are identical:
count if resid == resid2
• Residuals sum up to zero: Σᵢ₌₁ⁿ ûi = 0
. summ resid
. disp r(sum)
3.055e-06
The final result is very close to zero (it is not exactly zero due to numerical errors)
• The residual and the regressors have zero sample covariance (correlation):
(1/n) Σᵢ₌₁ⁿ ûi Xij = 0, and (1/n) Σᵢ₌₁ⁿ ûi Ŷi = 0
. corr resid stuTeacherRatio expenPerStu testScoreHat
(obs=420)
• Assume a researcher is interested in estimating the effect of class size on test score. She considers four different models (specifications):
testScore = β0 + β1 stuTeacherRatio +u
testScore = β0 + β1 stuTeacherRatio + β2 expenPerStu +u
testScore = β0 + β1 stuTeacherRatio + β3 fracEnglish + u
testScore = β0 + β1 stuTeacherRatio + β2 expenPerStu + β3 fracEnglish + u.
• In general, we report estimation results in a table with each column representing one model
(specification).
Y = β0 + β1 X1 + β2 X2 + · · · + βk Xk + u,
where β0 , β1 , β2 , · · · , βk are the (unknown) population parameters.
• Stating the model formally makes clear that our goal is to estimate the parameters.
• We emphasized the importance of the zero conditional mean assumption when we would like to draw causal conclusions from data.
• Rules out the (extreme) case that one (or more) of the regressors is an exact linear function of
the others.
• If, say, Xi1 is an exact linear function of Xi2 , · · · , Xik in the sample, we say the model suffers
from perfect multicollinearity.
• This is a technical condition which allows us to compute the variance of the OLS estimates.
• Perfect multicollinearity can arise if n < k + 1. That is, if we include more regressors than we have observations.
− This is rarely a concern in practice, because we usually have thousands of observations but only a few regressors.
• Perfect multicollinearity can also arise from an exact linear relationship among the regressors.
− For example, if we include both college (equals 1 for college graduates) and nonCollege (equals 1 for non-college graduates) in our regression, then we will have college + nonCollege = 1, which is a perfect linear relationship.
− This does not prevent us from including nonlinear transformations of a regressor. For example, we can include both exper (experience) and exper².
• Under perfect multicollinearity, there are no unique OLS estimators. Stata and other statistical
packages will indicate a problem.
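The college/nonCollege example can be seen in a few lines of Python: with an intercept, the design matrix is rank deficient, so X'X is singular and the normal equations have no unique solution. (The sample below is made up.)

```python
import numpy as np

# Hypothetical sample of eight people: college graduate indicator and its complement.
college = np.array([1, 0, 1, 1, 0, 0, 1, 0])
non_college = 1 - college               # college + nonCollege = 1 for everyone

# Design matrix with an intercept: the three columns are linearly dependent.
X = np.column_stack([np.ones(8), college, non_college])

print(np.linalg.matrix_rank(X))         # 2, not 3: perfect multicollinearity
print(np.linalg.det(X.T @ X))           # 0 (up to floating point): X'X is singular
```

This is why Stata drops one of the dummies (or the intercept) when it detects such a relationship.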
E[β̂j ] = βj , j = 0, 1, 2, · · · , k.
• This unbiasedness result relies crucially on the zero conditional mean assumption.
• Often the hope is that if our focus is on, say, X1, we can include enough other variables in X2, · · · , Xk to make the zero conditional mean assumption true, or “close” to true.
The unbiasedness result allows for the βj to be any value, including zero.
Under assumptions 1–5, the OLS estimates are consistent and asymptotically normal:
β̂j →p βj,   √n (β̂j − βj) →d N(0, σ²_β̂j),   j = 0, 1, 2, · · · , k,   as n → ∞.
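Consistency is easy to visualize by simulation. A small Python sketch with made-up parameters, showing the OLS slope concentrating around the truth as n grows:

```python
import numpy as np

rng = np.random.default_rng(4)
beta0, beta1 = 1.0, 2.0   # made-up true parameters

# Consistency: the OLS slope gets closer to the truth as the sample size grows.
for n in (50, 5_000, 500_000):
    x = rng.normal(size=n)
    y = beta0 + beta1 * x + rng.normal(size=n)
    b1 = np.cov(x, y)[0, 1] / np.var(x, ddof=1)
    print(n, b1)   # b1 approaches beta1 = 2.0
```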
• In practice, we interpret consistency as “β̂j is close to βj with high probability in large samples.”
• The exact formula of the standard error is complicated. We rely on statistical packages, such as
Stata, for computation.
We emphasize that both consistency and asymptotic normality require the zero
conditional mean assumption.
• In practice, we always have a finite sample, and we interpret consistency as β̂j should not be far
from βj in samples of reasonable size.
The estimation result is (recall that we need the robust option for valid standard errors; otherwise Stata will compute standard errors under homoskedasticity)
testScore-hat = 675.596 − 1.763 stuTeacherRatio + 0.002 expenPerStu
                (18.844)   (0.592)                 (0.002)
---------------------------------------------------------------------------------
| Robust
testScore | Coef. Std. Err. t P>|t| [95% Conf. Interval]
----------------+----------------------------------------------------------------
stuTeacherRatio | -1.763149 .5920629 -2.98 0.003 -2.926949 -.5993493
expenPerStu | .0024835 .0018916 1.31 0.190 -.0012348 .0062018
_cons | 675.596 18.84424 35.85 0.000 638.5545 712.6376
---------------------------------------------------------------------------------
Hypothesis testing with two-sided alternative. Let β̂j be an OLS estimate with standard
error se(β̂j ).
• We use T = (β̂j − c)/se(β̂j) as our test statistic.
• pVal = 2Φ(−|T |), and reject the null hypothesis if pVal < α.
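The two steps above can be carried out by hand. A minimal Python sketch using only the standard library, with the estimate and robust standard error taken from the regression output above:

```python
from math import erf, sqrt

def norm_cdf(x):
    """Standard normal CDF, Phi(x), computed via the error function."""
    return 0.5 * (1 + erf(x / sqrt(2)))

# Estimate and robust SE for stuTeacherRatio, testing H0: beta1 = 0.
beta_hat, se, c = -1.763149, 0.5920629, 0.0

T = (beta_hat - c) / se
pval = 2 * norm_cdf(-abs(T))
print(T)      # about -2.978, matching the Stata output
print(pval)   # about 0.003: reject H0 at the 5% (and 1%) level
```

(Stata uses the t distribution with n − k − 1 degrees of freedom, so its p-value can differ slightly; with 420 observations the difference is negligible.)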
---------------------------------------------------------------------------------
| Robust
testScore | Coef. Std. Err. t P>|t| [95% Conf. Interval]
----------------+----------------------------------------------------------------
stuTeacherRatio | -1.763149 .5920629 (1) (2) (3) (4)
expenPerStu | .0024835 .0018916 1.31 0.190 -.0012348 .0062018
_cons | 675.596 18.84424 35.85 0.000 638.5545 712.6376
---------------------------------------------------------------------------------
• By default, Stata computes t-statistics for the null hypothesis H0 : β1 = 0. Therefore, (1) is
(−1.763149 − 0)/0.5920629 = −2.978 (in Stata: disp -1.763149 / .5920629)
---------------------------------------------------------------------------------
| Robust
testScore | Coef. Std. Err. t P>|t| [95% Conf. Interval]
----------------+----------------------------------------------------------------
stuTeacherRatio | -1.763149 .5920629 -2.98 0.003 -2.926949 -.5993493
expenPerStu | .0024835 .0018916 1.31 0.190 -.0012348 .0062018
_cons | 675.596 18.84424 35.85 0.000 638.5545 712.6376
---------------------------------------------------------------------------------
• Method 1: Compute the t-statistic as (−1.763149 − (−1))/0.5920629 = −1.289, and its absolute value does not exceed the critical value 1.960. Therefore, we do not reject the hypothesis H0 : β1 = −1. You can also compute the p-value and compare it to 0.05.
• Method 2: Use the confidence interval. The null hypothesis, −1, is contained by the 95%
confidence interval, [−2.927, −0.599], and hence we do not reject this hypothesis.
• Method 3: Use the Stata command test stuTeacherRatio == -1 after running the regression.
This command will give an F-statistic (more later) and a p-value of 0.198. Since the p-value is
larger than 0.05, we do not reject the null hypothesis.
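Methods 1 and 2 can be replicated by hand. A Python sketch using only the standard library, with the numbers from the regression output above (the interval below uses the normal critical value 1.96, so it differs slightly from Stata's t-based interval):

```python
from math import erf, sqrt

def norm_cdf(x):
    """Standard normal CDF via the error function."""
    return 0.5 * (1 + erf(x / sqrt(2)))

beta_hat, se = -1.763149, 0.5920629

# Method 1: t-statistic and p-value for H0: beta1 = -1.
T = (beta_hat - (-1)) / se
pval = 2 * norm_cdf(-abs(T))
print(T, pval)   # about -1.289 and 0.197: do not reject at the 5% level

# Method 2: a normal-approximation 95% confidence interval.
ci = (beta_hat - 1.96 * se, beta_hat + 1.96 * se)
print(ci)        # contains -1, so do not reject
```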
---------------------------------------------------------------------------------
| Robust
testScore | Coef. Std. Err. t P>|t| [95% Conf. Interval]
----------------+----------------------------------------------------------------
stuTeacherRatio | -1.763149 .5920629 -2.98 0.003 -2.926949 -.5993493
expenPerStu | .0024835 .0018916 1.31 0.190 -.0012348 .0062018
_cons | 675.596 18.84424 35.85 0.000 638.5545 712.6376
---------------------------------------------------------------------------------
• Method 1: Use the t-statistic or the p-value. The absolute value of the t-statistic exceeds the
critical value 2.576, so that we reject the null hypothesis H0 : β1 = 0. This can also be seen
using the p-value, which is smaller than 0.01.
• Method 2: Re-run the regression with the option level(99), and Stata will compute the 99%
confidence interval. Again, we reject the null hypothesis because the 99% confidence interval
does not contain our null hypothesis, 0.
. reg testScore stuTeacherRatio expenPerStu, robust level(99)
---------------------------------------------------------------------------------
| Robust
testScore | Coef. Std. Err. t P>|t| [99% Conf. Interval]
----------------+----------------------------------------------------------------
stuTeacherRatio | -1.763149 .5920629 -2.98 0.003 -3.295213 -.2310853
expenPerStu | .0024835 .0018916 1.31 0.190 -.0024114 .0073784
_cons | 675.596 18.84424 35.85 0.000 626.8334 724.3587
---------------------------------------------------------------------------------
Hypothesis Testing for an Individual Coefficient
Hypothesis testing with one-sided alternative. Let β̂j be an OLS estimate with standard
error se(β̂j ).
• We use T = (β̂j − c)/se(β̂j) as our test statistic.
---------------------------------------------------------------------------------
| Robust
testScore | Coef. Std. Err. t P>|t| [95% Conf. Interval]
----------------+----------------------------------------------------------------
stuTeacherRatio | -1.763149 .5920629 -2.98 0.003 -2.926949 -.5993493
expenPerStu | .0024835 .0018916 1.31 0.190 -.0012348 .0062018
_cons | 675.596 18.84424 35.85 0.000 638.5545 712.6376
---------------------------------------------------------------------------------
• How to interpret this null hypothesis? A one-unit reduction in the student-to-teacher ratio leads to at least a two-point increase in test score.
• Compute the t-statistic as (−1.763149 − (−2))/0.5920629 = 0.400, which does not exceed the critical value 1.645. Therefore, we do not reject the hypothesis H0 : β1 ≤ −2. You can also compute the p-value and compare it to 0.05.
• Unfortunately, Stata does not compute one-sided confidence intervals or p-values for one-sided
tests.
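Since Stata does not report them, the one-sided p-value is easy to compute by hand. A Python sketch using only the standard library, with the numbers from the regression output above:

```python
from math import erf, sqrt

def norm_cdf(x):
    """Standard normal CDF via the error function."""
    return 0.5 * (1 + erf(x / sqrt(2)))

beta_hat, se = -1.763149, 0.5920629

# H0: beta1 <= -2 against H1: beta1 > -2; reject for large positive T.
T = (beta_hat - (-2)) / se
pval = 1 - norm_cdf(T)          # one-sided (right-tail) p-value
print(T)      # about 0.400
print(pval)   # about 0.34, far above 0.05: do not reject H0
```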
Hypothesis testing with one-sided alternative. Let β̂j be an OLS estimate with standard
error se(β̂j ).
• We use T = (β̂j − c)/se(β̂j) as our test statistic.
---------------------------------------------------------------------------------
| Robust
testScore | Coef. Std. Err. t P>|t| [95% Conf. Interval]
----------------+----------------------------------------------------------------
stuTeacherRatio | -1.763149 .5920629 -2.98 0.003 -2.926949 -.5993493
expenPerStu | .0024835 .0018916 1.31 0.190 -.0012348 .0062018
_cons | 675.596 18.84424 35.85 0.000 638.5545 712.6376
---------------------------------------------------------------------------------
• How to interpret this null hypothesis? Smaller class size does not help improve test score.
• Compute the t-statistic as (−1.763149 − 0)/0.5920629 = −2.980, which falls below the critical value −2.326. Therefore, we reject the hypothesis, and conclude that there is statistical evidence suggesting students perform better in smaller classes. You can also compute the p-value and compare it to 0.01.
• Unfortunately, Stata does not compute one-sided confidence intervals or p-values for one-sided
tests.
EXERCISE. Interpret the hypothesis H0 : β1 ≥ −1, and test with significance level 1%
(i.e., α = 0.01).
The lectures and course materials, including slides, tests, outlines, and similar materials, are protected by U.S. copyright law and by University policy. You may take notes and make copies of course materials for your own use. You may also share those materials with another student who is enrolled in or auditing this course. You may not otherwise reproduce, distribute, or publicly post these materials without permission. If you do so, you may be subject to student conduct proceedings under the UC San Diego Student Code of Conduct.
© Xinwei Ma 2021
x1ma@ucsd.edu