OLS 23-24
Quantitative Methods
OLS and Regression Analysis
Jan Podivinsky
University of Southampton
The linear regression model
example: Using data on class sizes and test scores from different school districts, we want
to estimate the expected effect of reducing class size on test scores.
We can write this relationship as follows:
β1 = ∆TestScore / ∆ClassSize
rearrange to get a sense of the answer to our question:
∆TestScore = β1 ∆ClassSize
adding an intercept and an error term gives the linear regression model:
TestScore = β0 + β1 ClassSize + ϵ
2
Class size effects
3
Causality
4
Causality
District Income
5
The OLS estimator
6
The linear regression problem
the OLS regression line is the solution to the linear regression problem:
min over (b0, b1) of E{(Yi − b0 − b1 Xi)²}
all this says is that we choose values for b0 and b1 such that we get the “line of best
fit”, where “best fit” is defined by minimising the RSS
define RSS = E{(Yi − b0 − b1 Xi)²}. The derivatives of the RSS with respect to the
parameters are:
∂RSS/∂b0 = −2E(Yi − b0 − b1 Xi)
∂RSS/∂b1 = −2E{(Yi − b0 − b1 Xi)Xi}
7
The linear regression problem
setting these derivatives to zero gives the first-order conditions (FOCs):
E(Yi − β0 − β1 Xi) = 0
E{(Yi − β0 − β1 Xi)Xi} = 0
N.B. once we write the FOCs we replace b0 and b1 with β0 and β1 - the parameters which
solve the population regression problem.
the first FOC yields:
β0 = E (Yi ) − β1 E (Xi )
substitute this into the second FOC and rearrange:
the solution to the population regression problem is given by:
β0 = E(Yi) − β1 E(Xi)   (= Ȳ − β1 X̄ in the sample)
β1 = Cov(Yi, Xi) / Var(Xi) = Σ(Yi − Ȳ)(Xi − X̄) / Σ(Xi − X̄)², with the sums running over i = 1, …, N
9
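To make these formulas concrete, here is a minimal numpy sketch (simulated data and illustrative variable names, not the California data) that computes β̂1 as the ratio of the sample covariance to the sample variance, and β̂0 = Ȳ − β̂1 X̄:

```python
# A minimal sketch of the bivariate OLS formulas above; the data are simulated.
import numpy as np

rng = np.random.default_rng(0)
class_size = rng.uniform(15, 30, size=200)                            # hypothetical X_i
test_score = 700 - 2.0 * class_size + rng.normal(0, 10, size=200)     # hypothetical Y_i

x_bar, y_bar = class_size.mean(), test_score.mean()
beta1_hat = np.sum((test_score - y_bar) * (class_size - x_bar)) / np.sum((class_size - x_bar) ** 2)
beta0_hat = y_bar - beta1_hat * x_bar                                 # beta0_hat = Ybar - beta1_hat * Xbar
print(beta0_hat, beta1_hat)
```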
The data and the OLS regression line
10
OLS in matrix form
the linear model with several (say k) explanatory variables is given, for each observation i, by the equation:
Yi = β1 X1i + β2 X2i + · · · + βk Xki + ϵi  (where one of the k regressors is a constant equal to one)
We can write the multiple regression model in matrix form, by defining the following
vectors and matrices:
Let X be an N × k matrix where we have observations on k − 1 explanatory variables (the
kth term is the constant) for N observations.
Let Y be an N × 1 vector of observations on the dependent variable.
Let ϵ be an N × 1 vector of disturbances or errors.
Let β be a k × 1 vector of unknown population parameters that we want to estimate.
11
OLS in matrix form
Y = Xβ + ϵ
12
OLS: criteria for estimates
13
OLS in matrix form: criteria for estimates
e = Y − X β̂, where e = (e1, e2, …, eN)′ is the N × 1 vector of residuals
the criterion is to choose β̂ to minimise the sum of squared residuals, e′e
N.B. this is not the same thing as ee′ - the N × N (variance-covariance) matrix of the residuals
14
Sum of squared residuals (RSS)
e′e = (Y − X β̂)′(Y − X β̂)
    = Y′Y − β̂′X′Y − Y′X β̂ + β̂′X′X β̂
    = Y′Y − 2β̂′X′Y + β̂′X′X β̂
note, moving from the second to the third line we use the fact that the transpose of a
scalar is a scalar, i.e.
Y′X β̂ = (Y′X β̂)′ = β̂′X′Y, since the dimensions (1 × N)(N × k)(k × 1) multiply out to a scalar
15
Finding β̂, the OLS estimator
to find the β̂ that minimises the sum of squared residuals, we need to take the derivative
with respect to β̂:
∂(e′e)/∂β̂ = −2X′Y + 2X′X β̂ = 0
to ensure this is a minimum, we take the derivative with respect to β̂ again; this gives us 2X′X
as long as X has full rank, this is a positive definite matrix (analogous to a positive real
number), so the stationary point is indeed a minimum
16
The normal equations
∂(e′e)/∂β̂ = −2X′Y + 2X′X β̂ = 0
(X′X)β̂ = X′Y
remember that (X ′ X ) and X ′ Y are known from our data, but β̂ is unknown
if the inverse of (X ′ X ) exists, i.e. X is full rank, then pre-multiplying both sides by this
inverse gives us:
(X′X)⁻¹(X′X)β̂ = (X′X)⁻¹X′Y
17
The normal equations
the inverse of (X ′ X ) may not exist, in which case the matrix is called non-invertible or
singular, and is said to be of less than full rank.
there are two possible reasons why this matrix might be non-invertible:
1. If N < k i.e. we have more independent variables than observations, then the matrix is not of
full rank
2. One or more of the independent variables are a linear combination of the other variables, i.e.
perfect multicollinearity
18
The normal equations
I β̂ = (X′X)⁻¹X′Y
β̂ = (X′X)⁻¹X′Y
19
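As an illustration, the following numpy sketch (simulated data; the constant is placed in the first column, though its position does not matter) computes β̂ by solving the normal equations (X′X)β̂ = X′Y directly:

```python
# A minimal sketch of the OLS estimator beta_hat = (X'X)^(-1) X'Y on simulated data.
import numpy as np

rng = np.random.default_rng(1)
N, k = 500, 3
X = np.column_stack([np.ones(N), rng.normal(size=(N, k - 1))])  # N x k, first column is the constant
beta_true = np.array([1.0, 2.0, -0.5])
Y = X @ beta_true + rng.normal(size=N)

# Solving the normal equations is numerically preferable to forming the inverse explicitly.
beta_hat = np.linalg.solve(X.T @ X, X.T @ Y)
print(beta_hat)   # close to beta_true
```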
Properties of OLS estimates
the primary property of OLS estimates is that they satisfy the criterion of minimising the
sum of squared residuals (RSS). But there are other properties that will also be true
recall the normal equations from earlier, and substitute Y = X β̂ + e:
(X′X)β̂ = X′Y
(X′X)β̂ = X′(X β̂ + e)
(X′X)β̂ = (X′X)β̂ + X′e
X′e = 0
20
Properties of OLS estimates
21
Properties of OLS estimates
the observed values of X are uncorrelated with (orthogonal to) the residuals. X′e = 0 implies
that for every column xk of X (i.e. every regressor), xk′ e = 0.
in other words, each regressor has zero sample correlation with the residuals (e).
note: this does not mean that X is uncorrelated with the disturbances (ϵ); we have to
assume this
the residuals represent the “unexplained” variation in Y - if they are not orthogonal to X ,
then more explanation could be squeezed out of X by a different set of coefficients.
22
Properties of OLS estimates
if our regression includes a constant (as it does the way I’ve written X above), then the
following properties also hold.
the sum of the residuals is zero (this is the row of X′e = 0 corresponding to the constant - see two slides above)
put another way, this means the sample mean of the residuals is zero
23
Properties of OLS estimates
the regression hyperplane passes through the means of the observed values (X̄ and Ȳ).
this follows from the fact that ē = 0
Recall that e = Y − X β̂
summing across observations and dividing by N: ē = Ȳ − X̄ β̂ = 0
this implies Ȳ = X̄ β̂, which shows that the regression hyperplane goes through the point
of means of the data
24
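These properties are easy to verify numerically. A small sketch (simulated data, continuing the numpy example above) checks X′e = 0, the zero mean of the residuals, and that the fitted hyperplane passes through the means:

```python
# A quick numerical check (simulated data) of the OLS residual properties listed above.
import numpy as np

rng = np.random.default_rng(2)
N = 500
X = np.column_stack([np.ones(N), rng.normal(size=(N, 2))])      # includes a constant
Y = X @ np.array([1.0, 2.0, -0.5]) + rng.normal(size=N)
beta_hat = np.linalg.solve(X.T @ X, X.T @ Y)
e = Y - X @ beta_hat                                            # residual vector

print(np.allclose(X.T @ e, 0.0))                                # regressors orthogonal to residuals
print(np.isclose(e.mean(), 0.0))                                # residuals sum (and average) to zero
print(np.isclose(Y.mean(), X.mean(axis=0) @ beta_hat))          # hyperplane passes through the means
```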
Gauss-Markov assumptions
note that we know nothing about β̂ except that it satisfies all of the properties discussed
above.
we need to make some assumptions about the true model in order to make any inferences
regarding β (the true population parameters) from β̂ (our estimator of the true
parameters).
these we call the Gauss-Markov assumptions.
26
Gauss-Markov assumptions - set-up
for the assumptions that follow, we will deal with a regression model that is linear in the
parameters (i.e. what we have seen already).
in matrix notation, we have:
Y = Xβ + ϵ
in scalar notation, we have:
Yi = β1 X1i + β2 X2i + · · · + βk Xki + ϵi
27
Gauss-Markov assumptions - set-up
28
Gauss-Markov assumptions
[A1] states that the expected value of the error term is zero (E[ϵi] = 0 for all i), which means that, on
average, the regression line should be correct
29
Gauss-Markov assumptions
30
Gauss-Markov assumptions
[A3] states that all error terms have the same variance (Var(ϵi) = σ² for all i) - we call this homoskedasticity
this is a useful assumption since it implies that no particular value of X carries any more
information about the behaviour of Y than any other
31
Gauss-Markov assumptions
[A4] imposes zero correlation between different error terms (Cov(ϵi, ϵj) = 0 for i ≠ j). this we describe as a case of
no autocorrelation
i.e. knowing something about the disturbance term for one observation tells us nothing
about the disturbance term for any other observation.
32
Gauss-Markov assumptions
from [A3] and [A4], we can write down the variance-covariance matrix of the error terms as:
E[ϵϵ′] = σ² IN , the N × N matrix with σ² in every diagonal entry and zeros everywhere else
33
Gauss-Markov assumptions
from [A2] we have that X and ϵ are independent, which implies, along with the other G-M
assumptions, that:
E[ϵ | X] = 0
this is a much stronger statement than we had before; it means the disturbances average
out to 0 for any value of X
it also means that the matrix of explanatory variables X does not provide any information
about the expected values of the error terms, or how they (co)vary:
E[ϵϵ′ | X] = σ² IN
[A5]: ϵ ∼ N(0, σ² IN )
this is not one of the G-M assumptions, but is useful for inference (i.e. hypothesis testing)
35
Gauss-Markov Theorem
the Gauss-Markov Theorem states that, under assumptions 1-4, there will be no other
linear and unbiased estimator of the β coefficients that has a smaller sampling variance.
in other words, the OLS estimator is the Best Linear Unbiased Estimator (BLUE).
the “Best” part of BLUE refers to the variance of the OLS estimator - it is the smallest
among all linear unbiased estimators
36
Gauss-Markov Theorem
substituting Y = X β + ϵ into β̂ = (X′X)⁻¹X′Y gives:
β̂ = (X′X)⁻¹X′(X β + ϵ)
  = (X′X)⁻¹X′X β + (X′X)⁻¹X′ϵ
  = β + (X′X)⁻¹X′ϵ
37
Gauss-Markov Theorem
taking expectations:
E[β̂] = E[β + (X′X)⁻¹X′ϵ]
     = β + (X′X)⁻¹X′E[ϵ]
     = β
where we move from line four to line five based on [A2], and from five to six based on [A1],
so the OLS estimator is unbiased
38
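Unbiasedness can also be seen by simulation. The sketch below (a Monte Carlo illustration on simulated data, not part of the slides) draws many samples satisfying the Gauss-Markov assumptions and shows that the OLS estimates average out to the true β:

```python
# A small Monte Carlo sketch: under [A1]-[A4], the average of beta_hat over many samples is close to beta.
import numpy as np

rng = np.random.default_rng(3)
beta = np.array([1.0, 2.0])
N, reps = 100, 5000
estimates = np.empty((reps, 2))
for r in range(reps):
    X = np.column_stack([np.ones(N), rng.normal(size=N)])
    Y = X @ beta + rng.normal(size=N)                 # homoskedastic, uncorrelated errors
    estimates[r] = np.linalg.solve(X.T @ X, X.T @ Y)

print(estimates.mean(axis=0))                         # approximately (1.0, 2.0)
```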
Variance-covariance matrix
We have our point estimates (which we just saw are unbiased), but what about our
standard errors etc? We need to derive the variance-covariance matrix of the OLS
estimator, β̂:
Var[β̂] = E[(β̂ − β)(β̂ − β)′] = E[((X′X)⁻¹X′ϵ)((X′X)⁻¹X′ϵ)′]
       = E[(X′X)⁻¹X′ϵϵ′X(X′X)⁻¹]
using the fact that (AB)′ = B′A′, i.e. we can rewrite ((X′X)⁻¹X′ϵ)′ as ϵ′X(X′X)⁻¹
if we assume that X is non-stochastic, then:
Var[β̂] = (X′X)⁻¹X′E[ϵϵ′]X(X′X)⁻¹ = σ²(X′X)⁻¹
note that X is in fact stochastic. The assumption above makes the derivation easier, but the
result does not rely on it
39
Variance-covariance matrix
as we don’t observe the disturbances (ϵi ), we have to use the residuals (ei ) to estimate σ 2
with σ̂ 2 :
σ̂² = e′e / (N − k)
the square root of which is called the standard error of the regression.
40
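Putting the last two slides together, here is a minimal numpy sketch (simulated data) of the conventional variance estimator Var̂[β̂] = σ̂²(X′X)⁻¹ and the resulting standard errors:

```python
# A minimal sketch of sigma2_hat = e'e/(N-k) and Var_hat[beta_hat] = sigma2_hat*(X'X)^(-1); simulated data.
import numpy as np

rng = np.random.default_rng(4)
N, k = 200, 3
X = np.column_stack([np.ones(N), rng.normal(size=(N, k - 1))])
Y = X @ np.array([1.0, 2.0, -0.5]) + rng.normal(scale=2.0, size=N)

beta_hat = np.linalg.solve(X.T @ X, X.T @ Y)
e = Y - X @ beta_hat
sigma2_hat = (e @ e) / (N - k)                        # e'e / (N - k)
var_beta_hat = sigma2_hat * np.linalg.inv(X.T @ X)    # conventional variance-covariance matrix
se = np.sqrt(np.diag(var_beta_hat))                   # standard errors of the coefficients
print(np.sqrt(sigma2_hat), se)                        # standard error of the regression, then the SEs
```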
Goodness of fit: R 2
How do we measure how well the estimated regression model “fits” the data?
Typically we use a measure known as the R²: the proportion of the sample variance of Y
that is explained by the model. Recall (i):
σ̂² = e′e / (N − k)
R² = ESS/TSS = (β̂′X′Y − N Ȳ²) / (Y′Y − N Ȳ²) = 1 − RSS/TSS
41
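For completeness, a short sketch (simulated data) computing R² as 1 − RSS/TSS:

```python
# A minimal sketch of R^2 = 1 - RSS/TSS on simulated data.
import numpy as np

rng = np.random.default_rng(5)
N = 200
X = np.column_stack([np.ones(N), rng.normal(size=N)])
Y = X @ np.array([1.0, 2.0]) + rng.normal(size=N)

beta_hat = np.linalg.solve(X.T @ X, X.T @ Y)
e = Y - X @ beta_hat
rss = e @ e                                 # residual sum of squares
tss = np.sum((Y - Y.mean()) ** 2)           # total sum of squares (about the mean)
print(1 - rss / tss)                        # R^2
```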
Inference in bivariate regression
44
Regression for the California Test Score data
our estimated regression model for the California class size data is
Yi = 698.9 − 2.28Xi + ϵi
the data are the population of California school districts for 1999.
the estimator for the regression slope has sampling variation, i.e. it is a random variable
because the samples it is constructed from contain randomness.
the graphs below help to visualise this sampling variation with a thought experiment
pretend our sample is the population
take smaller sub-samples from the “population”
45
A regression line for a sample of 30
46
A regression line for another sample of 30
47
Different OLS estimates
the estimated regression slopes in the two pictures are different. One is -3.48, the other is
-1.38. Neither one matches the “population” regression slope of -2.28.
the average of the estimates from 10 samples is -2.47.
if we do this very many times we will get -2.28 on average because the OLS regression
slope is an unbiased estimator of the population regression slope
the estimator for the regression slope has sampling variation, i.e. it is a random variable
because the samples it is constructed from contain randomness.
48
Distribution of 100 000 estimates for β (n=30)
49
Standard error
Yi = β̂0 + β̂1 Xi + ei
the standard error of the estimated slope looks similar:
SE(β̂1) = √[ (1/n) · Var(ϵi) / Var(Xi) ]
50
Decomposing sampling variability
51
Conventional and (heteroskedasticity-) robust standard errors
52
Data where the residual variance is unrelated to the regressor (homoskedasticity)
53
Data where the residual variance is related to the regressor (heteroskedasticity)
54
Homoskedasticity versus heteroskedasticity
56
Tests involving multiple coefficients
we may be interested in the hypothesis that having more English 2nd language students and
having more students on free lunches have the same impact on test scores, i.e. in the hypothesis:
H0 : γ1 = γ2 versus H1 : γ1 ≠ γ2
57
The two coefficient t-test
Var(γ̂1 − γ̂2) = Var(γ̂1) + Var(γ̂2) − 2Cov(γ̂1, γ̂2)
one can find Cov(γ̂1, γ̂2) just like you can find the sampling variance for a single
coefficient. The t-statistic is simply:
t_{n−k} = (γ̂1 − γ̂2) / √[ Var(γ̂1) + Var(γ̂2) − 2Cov(γ̂1, γ̂2) ]
58
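A sketch of this test (simulated data with γ1 = γ2 by construction; the covariance term comes from the estimated variance-covariance matrix of the earlier slides):

```python
# A minimal sketch of the t-test for H0: gamma1 = gamma2, using Var_hat[beta_hat] = sigma2_hat*(X'X)^(-1).
import numpy as np

rng = np.random.default_rng(6)
N, k = 300, 3
X = np.column_stack([np.ones(N), rng.normal(size=(N, 2))])
Y = X @ np.array([0.5, 1.0, 1.0]) + rng.normal(size=N)      # gamma1 = gamma2 = 1, so H0 is true

beta_hat = np.linalg.solve(X.T @ X, X.T @ Y)
e = Y - X @ beta_hat
V = (e @ e) / (N - k) * np.linalg.inv(X.T @ X)              # variance-covariance matrix of beta_hat

g1_hat, g2_hat = beta_hat[1], beta_hat[2]
var_diff = V[1, 1] + V[2, 2] - 2 * V[1, 2]                  # Var(g1) + Var(g2) - 2 Cov(g1, g2)
print((g1_hat - g2_hat) / np.sqrt(var_diff))                # compare with t(N - k) critical values
```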
Testing multiple hypotheses at once
we may be interested in the hypothesis that neither the fraction of English learners nor the
fraction of free lunch students has any impact on test scores, or:
H0 : γ1 = 0, γ2 = 0 versus H1 : γ1 ≠ 0 and/or γ2 ≠ 0
we could just use two separate t-tests of the hypotheses that γ1 = 0 and γ2 = 0, which would
be tests of whether the two nulls were individually true. But we may want to know
whether both are true at once
59
Testing joint hypotheses
to test a joint hypothesis, we cannot just combine the single t-statistics. There are two
reasons for this:
As before, the estimated coefficients γ̂1 and γ̂2 will in general be correlated. We need to take
this correlation into account.
Even if this correlation is zero, rejecting the joint hypothesis if either one of the two t-tests
rejects would reject too often under the null hypothesis. Suppose t1 and t2 are your two
(independent) t-statistics. You don't reject it with probability:
Pr(|t1| ≤ 1.96 and |t2| ≤ 1.96) = 0.95 × 0.95 = 0.9025
this means we are rejecting 9.75% of the time (1 − 0.95²), rather than 5% of the time, if
the null hypothesis is true
60
The F -test
in order to test a joint hypothesis, we need to perform an F -test. The F -statistic for the
hypothesis H0 : γ1 = 0, γ2 = 0 has the form
61
The F -test and the t-test
for a test of a single restriction, F = t²
and this has a χ²(1) distribution under the null in large samples. You can always do an F-test instead of a
t-test (but not vice versa).
62
Testing equality of two coefficients
63
Testing equality of two coefficients
64
Testing a joint hypothesis
65
F-test for comparing between a “short” and “long” regression
what if we want to test a regression model with many regressors (the “long” model)
against a regression model with just a few variables (the “short” model)?
use an F-test to do so
Compare the RSS from the unrestricted regression model (RSS_UR) to the RSS from the
restricted one (RSS_R):
F = [ (RSS_R − RSS_UR) / J ] / [ RSS_UR / (N − k) ]
where J is the number of variables to be restricted (i.e. the difference between the number
of regressors in the unrestricted and restricted models) and k is the number of regressors in the
unrestricted model; under the null the statistic has an F(J, N − k) distribution
66
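A sketch of this comparison (simulated data, hypothetical variable names), where the “short” model drops two regressors from the “long” one:

```python
# A minimal sketch of the restricted-vs-unrestricted F-test on simulated data.
import numpy as np

def rss(X, Y):
    """Residual sum of squares from an OLS fit of Y on X."""
    beta_hat = np.linalg.solve(X.T @ X, X.T @ Y)
    e = Y - X @ beta_hat
    return e @ e

rng = np.random.default_rng(7)
N = 400
X_long = np.column_stack([np.ones(N), rng.normal(size=(N, 3))])   # constant + 3 regressors
Y = X_long @ np.array([1.0, 2.0, 0.0, 0.0]) + rng.normal(size=N)  # last two coefficients are zero
X_short = X_long[:, :2]                                           # "short" model drops two regressors

J = X_long.shape[1] - X_short.shape[1]          # number of restrictions
k = X_long.shape[1]                             # regressors in the unrestricted model
F = ((rss(X_short, Y) - rss(X_long, Y)) / J) / (rss(X_long, Y) / (N - k))
print(F)                                        # compare with an F(J, N - k) critical value
```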
Functional form in regression
67
Other forms of the regression model
it is easy to augment the simple regression model to fit the income data better, for
example, by adding a quadratic term in income and estimating:
test scorei = α + β1 incomei + β2 income²i + ei
68
Linear vs. quadratic specification
69
Linear vs. quadratic specification: testing β̂2
70
Linear vs. quadratic specification: testing β̂2
71
How to interpret non-linear regression functions?
in the linear specification, β is the effect of a $1,000 increase in average income on test scores.
in the quadratic specification, it's a bit more difficult: the marginal effect,
∂test score/∂income = β1 + 2β2 income, now depends on the level of income
72
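To see how the quadratic effect varies with the income level, here is a short sketch (simulated data and made-up coefficients, purely illustrative) that estimates the quadratic specification and evaluates β̂1 + 2β̂2 · income at several income levels:

```python
# A minimal sketch of the income-dependent marginal effect in a quadratic specification; simulated data.
import numpy as np

rng = np.random.default_rng(8)
income = rng.uniform(5, 55, size=400)                                  # district income in $1000s (illustrative)
score = 600 + 3.9 * income - 0.04 * income**2 + rng.normal(0, 9, size=400)

X = np.column_stack([np.ones_like(income), income, income**2])
alpha_hat, b1_hat, b2_hat = np.linalg.solve(X.T @ X, X.T @ score)

for inc in (10, 25, 40):
    print(inc, b1_hat + 2 * b2_hat * inc)      # effect of a $1,000 income increase at this income level
```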
What’s linear about linear regression?
OLS regression is often called linear regression. So what’s linear about linear regression?
the regression function is linear in the parameters (α, β1 , β2 , · · · )
we can’t estimate a regression like this by OLS:
Yi = α Ki^β Li^γ + ei
73
The log specification for income
74
Log(income) specification
75
Interpreting the log specification
the simple log specification for income seems to work extremely well in this example, and
often does for similar variables.
the log specification:
test scorei = α + βln(income)i + ei
implies:
∂test score/∂income = β/income
∂test score / (∂income/income) = β
76
The log of income
proportional changes in income are often more reasonable than absolute changes:
a $1000 change is pretty big for a district with income of $15 000 (in terms of economic
impact)
a $1000 change is much smaller for a district with income of $40 000.
comparing a 10% change may be a better suited exercise (so roughly a $1,500 change for the low income
district compared to a $4,000 change for the high income district). This is what the log
specification does.
77
Log derivatives
we know that:
∂ln(x)/∂x = 1/x  ⇒  ∂ln(x) = ∂x/x
from this, it follows that:
∂ln(y)/∂ln(x) = (∂y/y) / (∂x/x) = (∂y/∂x) · (x/y)
which is an elasticity
this means that the log-log regression, ln(y)i = α + β ln(x)i + ei, gives you an elasticity:
β measures the percentage change in y from a 1% change in x
78
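A brief numerical illustration (simulated data with a true elasticity of 0.3) of the log-log regression and its elasticity interpretation:

```python
# A minimal sketch of a log-log regression, whose slope estimates the elasticity of y with respect to x.
import numpy as np

rng = np.random.default_rng(9)
x = rng.uniform(10, 50, size=300)
y = 5.0 * x**0.3 * np.exp(rng.normal(0, 0.05, size=300))     # true elasticity is 0.3

X = np.column_stack([np.ones_like(x), np.log(x)])
alpha_hat, beta_hat = np.linalg.solve(X.T @ X, X.T @ np.log(y))
print(beta_hat)          # approximately 0.3: a 1% increase in x raises y by about beta percent
```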
Controlling for income in the test score data
the quadratic regression evaluated at the mean yields a larger (partial) effect of income on
test scores than the linear regression in this case. Why?
the linear regression puts comparatively more weight on districts with very high incomes
(which the linear specification does not fit well).
80
Partial effects at the mean
81
Making sense of the income results
82
Partial effects at the mean
83
Sub-sample partial effects - linear specification
84
Sub-sample effects: income ∈ (10000, 20000)
85
Sub-sample effects: income ∈ (10000, 20000)
86