OLS 23-24
Quantitative Methods
OLS and Regression Analysis
Jan Podivinsky
University of Southampton
The linear regression model
example: Using data on class sizes and test scores from different school districts, we want
to estimate the expected effect of reducing class size on test scores.
We can write this relationship as follows:
β1 = ∆TestScore / ∆ClassSize
rearrange to get a sense of the answer to our question:
∆TestScore = β1 ∆ClassSize
adding an intercept and an error term gives the linear regression model:
TestScore = β0 + β1 ClassSize + ϵ
2
Class size effects
3
Causality
4
Causality
District Income
5
The OLS estimator
6
The linear regression problem
the OLS regression line is the solution to the linear regression problem:
min over (b0, b1) of E{(Yi − b0 − b1 Xi)²}
all this says is that we choose values for b0 and b1 such that we get the “line of best
fit”, where “best fit” is defined by minimising the RSS
define RSS = E{(Yi − b0 − b1 Xi)²}. The derivatives of the RSS with respect to the
parameters are:
∂RSS/∂b0 = −2E(Yi − b0 − b1 Xi)
∂RSS/∂b1 = −2E{(Yi − b0 − b1 Xi)Xi}
7
The linear regression problem
setting these derivatives to zero gives the first-order conditions (FOCs):
E(Yi − β0 − β1 Xi) = 0
E{(Yi − β0 − β1 Xi)Xi} = 0
N.B. once we write the FOCs we replace b0 and b1 with β0 and β1 - the parameters which
solve the population regression problem.
the first FOC yields:
β0 = E (Yi ) − β1 E (Xi )
substitute this into the second FOC and rearrange:
the solution to the population regression problem is given by:
β0 = E(Yi) − β1 E(Xi)   (= Ȳ − β1 X̄ in the sample)
β1 = Cov(Yi, Xi) / Var(Xi) = Σ(Yi − Ȳ)(Xi − X̄) / Σ(Xi − X̄)², with the sums running over i = 1, …, N
9
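To make these formulas concrete, here is a minimal numpy sketch (simulated data and illustrative variable names, not the California data) that computes β̂1 as the ratio of the sample covariance to the sample variance, and β̂0 = Ȳ − β̂1 X̄:

```python
# A minimal sketch of the bivariate OLS formulas above; the data are simulated.
import numpy as np

rng = np.random.default_rng(0)
class_size = rng.uniform(15, 30, size=200)                            # hypothetical X_i
test_score = 700 - 2.0 * class_size + rng.normal(0, 10, size=200)     # hypothetical Y_i

x_bar, y_bar = class_size.mean(), test_score.mean()
beta1_hat = np.sum((test_score - y_bar) * (class_size - x_bar)) / np.sum((class_size - x_bar) ** 2)
beta0_hat = y_bar - beta1_hat * x_bar                                 # beta0_hat = Ybar - beta1_hat * Xbar
print(beta0_hat, beta1_hat)
```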
The data and the OLS regression line
10
OLS in matrix form
the linear model with several (say k) explanatory variables is given, for each observation i, by the equation:
Yi = β1 X1i + β2 X2i + · · · + βk Xki + ϵi  (where one of the k regressors is a constant equal to one)
We can write the multiple regression model in matrix form, by defining the following
vectors and matrices:
Let X be an N × k matrix where we have observations on k − 1 explanatory variables (the
kth term is the constant) for N observations.
Let Y be an N × 1 vector of observations on the dependent variable.
Let ϵ be an N × 1 vector of disturbances or errors.
Let β be a k × 1 vector of unknown population parameters that we want to estimate.
11
OLS in matrix form
Y = Xβ + ϵ
12
OLS: criteria for estimates
13
OLS in matrix form: criteria for estimates
e = Y − X β̂, where e = (e1, e2, …, eN)′ is the N × 1 vector of residuals
the criterion is to choose β̂ to minimise the sum of squared residuals, e′e
N.B. this is not the same thing as ee′ - the N × N (variance-covariance) matrix of the residuals
14
Sum of squared residuals (RSS)
e′e = (Y − X β̂)′(Y − X β̂)
    = Y′Y − β̂′X′Y − Y′X β̂ + β̂′X′X β̂
    = Y′Y − 2β̂′X′Y + β̂′X′X β̂
note, moving from the second to the third line we use the fact that the transpose of a
scalar is a scalar, i.e.
Y′X β̂ = (Y′X β̂)′ = β̂′X′Y, since the dimensions (1 × N)(N × k)(k × 1) multiply out to a scalar
15
Finding β̂, the OLS estimator
to find the β̂ that minimises the sum of squared residuals, we need to take the derivative
with respect to β̂:
∂(e′e)/∂β̂ = −2X′Y + 2X′X β̂ = 0
to ensure this is a minimum, we take the derivative with respect to β̂ again; this gives us 2X′X
as long as X has full rank, this is a positive definite matrix (analogous to a positive real
number), so the stationary point is indeed a minimum
16
The normal equations
∂(e′e)/∂β̂ = −2X′Y + 2X′X β̂ = 0
(X′X)β̂ = X′Y
remember that (X ′ X ) and X ′ Y are known from our data, but β̂ is unknown
if the inverse of (X ′ X ) exists, i.e. X is full rank, then pre-multiplying both sides by this
inverse gives us:
(X′X)⁻¹(X′X)β̂ = (X′X)⁻¹X′Y
17
The normal equations
the inverse of (X ′ X ) may not exist, in which case the matrix is called non-invertible or
singular, and is said to be of less than full rank.
there are two possible reasons why this matrix might be non-invertible:
1. If N < k i.e. we have more independent variables than observations, then the matrix is not of
full rank
2. One or more of the independent variables are a linear combination of the other variables, i.e.
perfect multicollinearity
18
The normal equations
I β̂ = (X′X)⁻¹X′Y
β̂ = (X′X)⁻¹X′Y
19
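As an illustration, the following numpy sketch (simulated data; the constant is placed in the first column, though its position does not matter) computes β̂ by solving the normal equations (X′X)β̂ = X′Y directly:

```python
# A minimal sketch of the OLS estimator beta_hat = (X'X)^(-1) X'Y on simulated data.
import numpy as np

rng = np.random.default_rng(1)
N, k = 500, 3
X = np.column_stack([np.ones(N), rng.normal(size=(N, k - 1))])  # N x k, first column is the constant
beta_true = np.array([1.0, 2.0, -0.5])
Y = X @ beta_true + rng.normal(size=N)

# Solving the normal equations is numerically preferable to forming the inverse explicitly.
beta_hat = np.linalg.solve(X.T @ X, X.T @ Y)
print(beta_hat)   # close to beta_true
```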
Properties of OLS estimates
the primary property of OLS estimates is that they satisfy the criterion of minimising the
sum of squared residuals (RSS). But there are other properties that will also be true
recall the normal equations from earlier, and substitute Y = X β̂ + e:
(X′X)β̂ = X′Y
(X′X)β̂ = X′(X β̂ + e)
(X′X)β̂ = (X′X)β̂ + X′e
X′e = 0
20
Properties of OLS estimates
21
Properties of OLS estimates
the observed values of X are uncorrelated with (orthogonal to) the residuals. X′e = 0 implies
that for every column xk of X (i.e. every regressor), xk′ e = 0.
in other words, each regressor has zero sample correlation with the residuals (e).
note: this does not mean that X is uncorrelated with the disturbances (ϵ); we have to
assume this
the residuals represent the “unexplained” variation in Y - if they are not orthogonal to X ,
then more explanation could be squeezed out of X by a different set of coefficients.
22
Properties of OLS estimates
if our regression includes a constant (as it does the way I’ve written X above), then the
following properties also hold.
the sum of the residuals is zero (this is the row of X′e = 0 corresponding to the constant - see two slides above)
put another way, this means the sample mean of the residuals is zero
23
Properties of OLS estimates
the regression hyperplane passes through the means of the observed values (X̄ and Ȳ).
this follows from the fact that ē = 0
Recall that e = Y − X β̂
summing across observations and dividing by N: ē = Ȳ − X̄ β̂ = 0
this implies Ȳ = X̄ β̂, which shows that the regression hyperplane goes through the point
of means of the data
24
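These properties are easy to verify numerically. A small sketch (simulated data, continuing the numpy example above) checks X′e = 0, the zero mean of the residuals, and that the fitted hyperplane passes through the means:

```python
# A quick numerical check (simulated data) of the OLS residual properties listed above.
import numpy as np

rng = np.random.default_rng(2)
N = 500
X = np.column_stack([np.ones(N), rng.normal(size=(N, 2))])      # includes a constant
Y = X @ np.array([1.0, 2.0, -0.5]) + rng.normal(size=N)
beta_hat = np.linalg.solve(X.T @ X, X.T @ Y)
e = Y - X @ beta_hat                                            # residual vector

print(np.allclose(X.T @ e, 0.0))                                # regressors orthogonal to residuals
print(np.isclose(e.mean(), 0.0))                                # residuals sum (and average) to zero
print(np.isclose(Y.mean(), X.mean(axis=0) @ beta_hat))          # hyperplane passes through the means
```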
Gauss-Markov assumptions
note that we know nothing about β̂ except that it satisfies all of the properties discussed
above.
we need to make some assumptions about the true model in order to make any inferences
regarding β (the true population parameters) from β̂ (our estimator of the true
parameters).
these we call the Gauss-Markov assumptions.
26
Gauss-Markov assumptions - set-up
for the assumptions that follow, we will deal with a regression model that is linear in the
parameters (i.e. what we have seen already).
in matrix notation, we have:
Y = Xβ + ϵ
in scalar notation, we have:
Yi = β1 X1i + β2 X2i + · · · + βk Xki + ϵi
27
Gauss-Markov assumptions - set-up
28
Gauss-Markov assumptions
[A1] states that the expected value of the error term is zero (E[ϵi] = 0 for all i), which means that, on
average, the regression line should be correct
29
Gauss-Markov assumptions
30
Gauss-Markov assumptions
[A3] states that all error terms have the same variance (Var(ϵi) = σ² for all i) - we call this homoskedasticity
this is a useful assumption since it implies that no particular value of X carries any more
information about the behaviour of Y than any other
31
Gauss-Markov assumptions
[A4] imposes zero correlation between different error terms (Cov(ϵi, ϵj) = 0 for i ≠ j). this we describe as a case of
no autocorrelation
i.e. knowing something about the disturbance term for one observation tells us nothing
about the disturbance term for any other observation.
32
Gauss-Markov assumptions
from [A3] and [A4], we can write down the variance-covariance matrix of the error terms as:
E[ϵϵ′] = σ² IN , the N × N matrix with σ² in every diagonal entry and zeros everywhere else
33
Gauss-Markov assumptions
from [A2] we have that X and ϵ are independent, which implies, along with the other G-M
assumptions, that:
E[ϵ | X] = 0
this is a much stronger statement than we had before; it means the disturbances average
out to 0 for any value of X
it also means that the matrix of explanatory variables X does not provide any information
about the expected values of the error terms, or how they (co)vary:
E[ϵϵ′ | X] = σ² IN
[A5]: ϵ ∼ N(0, σ² IN )
this is not one of the G-M assumptions, but is useful for inference (i.e. hypothesis testing)
35
Gauss-Markov Theorem
the Gauss-Markov Theorem states that, under assumptions 1-4, there will be no other
linear and unbiased estimator of the β coefficients that has a smaller sampling variance.
in other words, the OLS estimator is the Best Linear Unbiased Estimator (BLUE).
the “Best” part of BLUE refers to the variance of the OLS estimator - it is the smallest
among all linear unbiased estimators
36
Gauss-Markov Theorem
substituting Y = X β + ϵ into β̂ = (X′X)⁻¹X′Y gives:
β̂ = (X′X)⁻¹X′(X β + ϵ)
  = (X′X)⁻¹X′X β + (X′X)⁻¹X′ϵ
  = β + (X′X)⁻¹X′ϵ
37
Gauss-Markov Theorem
taking expectations:
E[β̂] = E[β + (X′X)⁻¹X′ϵ]
     = β + (X′X)⁻¹X′E[ϵ]
     = β
where we move from line four to line five based on [A2], and from five to six based on [A1],
so the OLS estimator is unbiased
38
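Unbiasedness can also be seen by simulation. The sketch below (a Monte Carlo illustration on simulated data, not part of the slides) draws many samples satisfying the Gauss-Markov assumptions and shows that the OLS estimates average out to the true β:

```python
# A small Monte Carlo sketch: under [A1]-[A4], the average of beta_hat over many samples is close to beta.
import numpy as np

rng = np.random.default_rng(3)
beta = np.array([1.0, 2.0])
N, reps = 100, 5000
estimates = np.empty((reps, 2))
for r in range(reps):
    X = np.column_stack([np.ones(N), rng.normal(size=N)])
    Y = X @ beta + rng.normal(size=N)                 # homoskedastic, uncorrelated errors
    estimates[r] = np.linalg.solve(X.T @ X, X.T @ Y)

print(estimates.mean(axis=0))                         # approximately (1.0, 2.0)
```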
Variance-covariance matrix
We have our point estimates (which we just saw are unbiased), but what about our
standard errors etc? We need to derive the variance-covariance matrix of the OLS
estimator, β̂:
Var[β̂] = E[(β̂ − β)(β̂ − β)′] = E[((X′X)⁻¹X′ϵ)((X′X)⁻¹X′ϵ)′]
       = E[(X′X)⁻¹X′ϵϵ′X(X′X)⁻¹]
using the fact that (AB)′ = B′A′, i.e. we can rewrite ((X′X)⁻¹X′ϵ)′ as ϵ′X(X′X)⁻¹
if we assume that X is non-stochastic, then:
Var[β̂] = (X′X)⁻¹X′E[ϵϵ′]X(X′X)⁻¹ = σ²(X′X)⁻¹
note that X is in fact stochastic. The assumption above makes the derivation easier, but the
result does not rely on it
39
Variance-covariance matrix
as we don’t observe the disturbances (ϵi ), we have to use the residuals (ei ) to estimate σ 2
with σ̂ 2 :
σ̂² = e′e / (N − k)
the square root of which is called the standard error of the regression.
40
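Putting the last two slides together, here is a minimal numpy sketch (simulated data) of the conventional variance estimator Var̂[β̂] = σ̂²(X′X)⁻¹ and the resulting standard errors:

```python
# A minimal sketch of sigma2_hat = e'e/(N-k) and Var_hat[beta_hat] = sigma2_hat*(X'X)^(-1); simulated data.
import numpy as np

rng = np.random.default_rng(4)
N, k = 200, 3
X = np.column_stack([np.ones(N), rng.normal(size=(N, k - 1))])
Y = X @ np.array([1.0, 2.0, -0.5]) + rng.normal(scale=2.0, size=N)

beta_hat = np.linalg.solve(X.T @ X, X.T @ Y)
e = Y - X @ beta_hat
sigma2_hat = (e @ e) / (N - k)                        # e'e / (N - k)
var_beta_hat = sigma2_hat * np.linalg.inv(X.T @ X)    # conventional variance-covariance matrix
se = np.sqrt(np.diag(var_beta_hat))                   # standard errors of the coefficients
print(np.sqrt(sigma2_hat), se)                        # standard error of the regression, then the SEs
```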
Goodness of fit: R 2
How do we measure how well the estimated regression model “fits” the data?
Typically we use a measure known as the R²: the proportion of the sample variance of Y
that is explained by the model. Recall (i):
σ̂² = e′e / (N − k)
R² = ESS/TSS = (β̂′X′Y − N Ȳ²) / (Y′Y − N Ȳ²) = 1 − RSS/TSS
41
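For completeness, a short sketch (simulated data) computing R² as 1 − RSS/TSS:

```python
# A minimal sketch of R^2 = 1 - RSS/TSS on simulated data.
import numpy as np

rng = np.random.default_rng(5)
N = 200
X = np.column_stack([np.ones(N), rng.normal(size=N)])
Y = X @ np.array([1.0, 2.0]) + rng.normal(size=N)

beta_hat = np.linalg.solve(X.T @ X, X.T @ Y)
e = Y - X @ beta_hat
rss = e @ e                                 # residual sum of squares
tss = np.sum((Y - Y.mean()) ** 2)           # total sum of squares (about the mean)
print(1 - rss / tss)                        # R^2
```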
Inference in bivariate regression
44
Regression for the California Test Score data
our estimated regression model for the California class size data is
Yi = 698.9 − 2.28Xi + ϵi
the data are the population of California school districts for 1999.
the estimator for the regression slope has sampling variation, i.e. it is a random variable
because the samples it is constructed from contain randomness.
the graphs below help to visualise this sampling variation with a thought experiment
pretend our sample is the population
take smaller sub-samples from the “population”
45
A regression line for a sample of 30
46
A regression line for another sample of 30
47
Different OLS estimates
the estimated regression slopes in the two pictures are different. One is -3.48, the other is
-1.38. Neither one matches the “population” regression slope of -2.28.
the average of the estimates from 10 samples is -2.47.
if we do this very many times we will get -2.28 on average because the OLS regression
slope is an unbiased estimator of the population regression slope
the estimator for the regression slope has sampling variation, i.e. it is a random variable
because the samples it is constructed from contain randomness.
48
Distribution of 100 000 estimates for β (n=30)
49
Standard error
Yi = β̂0 + β̂1 Xi + ei
the standard error of the estimated slope looks similar:
SE(β̂1) = √[ (1/n) · Var(ϵi) / Var(Xi) ]
50
Decomposing sampling variability
51
Conventional and (heteroskedasticity-) robust standard errors
52
Data where the residual variance is unrelated to the regressor (homoskedasticity)
53
Data where the residual variance is related to the regressor (heteroskedasticity)
54
Homoskedasticity versus heteroskedasticity
56
Tests involving multiple coefficients
we may be interested in the hypothesis that having more English 2nd language students and
having more students on free lunches have the same impact on test scores, i.e. in the hypothesis:
H0 : γ1 = γ2 versus H1 : γ1 ≠ γ2
57
The two coefficient t-test
Var(γ̂1 − γ̂2) = Var(γ̂1) + Var(γ̂2) − 2Cov(γ̂1, γ̂2)
one can find Cov(γ̂1, γ̂2) just like you can find the sampling variance for a single
coefficient. The t-statistic is simply:
t_{n−k} = (γ̂1 − γ̂2) / √[ Var(γ̂1) + Var(γ̂2) − 2Cov(γ̂1, γ̂2) ]
58
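A sketch of this test (simulated data with γ1 = γ2 by construction; the covariance term comes from the estimated variance-covariance matrix of the earlier slides):

```python
# A minimal sketch of the t-test for H0: gamma1 = gamma2, using Var_hat[beta_hat] = sigma2_hat*(X'X)^(-1).
import numpy as np

rng = np.random.default_rng(6)
N, k = 300, 3
X = np.column_stack([np.ones(N), rng.normal(size=(N, 2))])
Y = X @ np.array([0.5, 1.0, 1.0]) + rng.normal(size=N)      # gamma1 = gamma2 = 1, so H0 is true

beta_hat = np.linalg.solve(X.T @ X, X.T @ Y)
e = Y - X @ beta_hat
V = (e @ e) / (N - k) * np.linalg.inv(X.T @ X)              # variance-covariance matrix of beta_hat

g1_hat, g2_hat = beta_hat[1], beta_hat[2]
var_diff = V[1, 1] + V[2, 2] - 2 * V[1, 2]                  # Var(g1) + Var(g2) - 2 Cov(g1, g2)
print((g1_hat - g2_hat) / np.sqrt(var_diff))                # compare with t(N - k) critical values
```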
Testing multiple hypotheses at once
we may be interested in the hypothesis that neither the fraction of English learners nor the
fraction of free lunch students has any impact on test scores, or:
H0 : γ1 = 0, γ2 = 0 versus H1 : γ1 ≠ 0 and/or γ2 ≠ 0
we could just use two separate t-tests of the hypotheses that γ1 = 0 and γ2 = 0, which would
be tests of whether the two nulls were individually true. But we may want to know
whether both are true at once
59
Testing joint hypotheses
to test a joint hypothesis, we cannot just combine the single t-statistics. There are two
reasons for this:
As before, the estimated coefficients γ̂1 and γ̂2 will in general be correlated. We need to take
this correlation into account.
Even if this correlation is zero, rejecting the joint hypothesis if either one of the two t-tests
rejects would reject too often under the null hypothesis. Suppose t1 and t2 are your two
(independent) t-statistics. You don't reject it with probability:
Pr(|t1| ≤ 1.96 and |t2| ≤ 1.96) = 0.95 × 0.95 = 0.9025
this means we are rejecting 9.75% of the time (1 − 0.95²), rather than 5% of the time, if
the null hypothesis is true
60
The F -test
in order to test a joint hypothesis, we need to perform an F -test. The F -statistic for the
hypothesis H0 : γ1 = 0, γ2 = 0 has the form
61
The F -test and the t-test
for a test of a single restriction, F = t²
and this has a χ²(1) distribution under the null in large samples. You can always do an F-test instead of a
t-test (but not vice versa).
62
Testing equality of two coefficients
63
Testing equality of two coefficients
64
Testing a joint hypothesis
65
F-test for comparing between a “short” and “long” regression
what if we want to test a regression model with many regressors (the “long” model)
against a regression model with just a few variables (the “short” model)?
use an F-test to do so
Compare the RSS from the unrestricted regression model (RSS_UR) to the RSS from the
restricted one (RSS_R):
F = [ (RSS_R − RSS_UR) / J ] / [ RSS_UR / (N − k) ]
where J is the number of variables to be restricted (i.e. the difference between the number
of regressors in the unrestricted and restricted models) and k is the number of regressors in the
unrestricted model; under the null the statistic has an F(J, N − k) distribution
66
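A sketch of this comparison (simulated data, hypothetical variable names), where the “short” model drops two regressors from the “long” one:

```python
# A minimal sketch of the restricted-vs-unrestricted F-test on simulated data.
import numpy as np

def rss(X, Y):
    """Residual sum of squares from an OLS fit of Y on X."""
    beta_hat = np.linalg.solve(X.T @ X, X.T @ Y)
    e = Y - X @ beta_hat
    return e @ e

rng = np.random.default_rng(7)
N = 400
X_long = np.column_stack([np.ones(N), rng.normal(size=(N, 3))])   # constant + 3 regressors
Y = X_long @ np.array([1.0, 2.0, 0.0, 0.0]) + rng.normal(size=N)  # last two coefficients are zero
X_short = X_long[:, :2]                                           # "short" model drops two regressors

J = X_long.shape[1] - X_short.shape[1]          # number of restrictions
k = X_long.shape[1]                             # regressors in the unrestricted model
F = ((rss(X_short, Y) - rss(X_long, Y)) / J) / (rss(X_long, Y) / (N - k))
print(F)                                        # compare with an F(J, N - k) critical value
```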
Functional form in regression
67
Other forms of the regression model
it is easy to augment the simple regression model to fit the income data better, for
example, by adding a quadratic term in income and estimating:
test scorei = α + β1 incomei + β2 income²i + ei
68
Linear vs. quadratic specification
69
Linear vs. quadratic specification: testing β̂2
70
Linear vs. quadratic specification: testing β̂2
71
How to interpret non-linear regression functions?
in the linear specification, β is the effect of a $1,000 increase in average income on test scores.
in the quadratic specification, it's a bit more difficult: the marginal effect,
∂test score/∂income = β1 + 2β2 income, now depends on the level of income
72
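To see how the quadratic effect varies with the income level, here is a short sketch (simulated data and made-up coefficients, purely illustrative) that estimates the quadratic specification and evaluates β̂1 + 2β̂2 · income at several income levels:

```python
# A minimal sketch of the income-dependent marginal effect in a quadratic specification; simulated data.
import numpy as np

rng = np.random.default_rng(8)
income = rng.uniform(5, 55, size=400)                                  # district income in $1000s (illustrative)
score = 600 + 3.9 * income - 0.04 * income**2 + rng.normal(0, 9, size=400)

X = np.column_stack([np.ones_like(income), income, income**2])
alpha_hat, b1_hat, b2_hat = np.linalg.solve(X.T @ X, X.T @ score)

for inc in (10, 25, 40):
    print(inc, b1_hat + 2 * b2_hat * inc)      # effect of a $1,000 income increase at this income level
```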
What’s linear about linear regression?
OLS regression is often called linear regression. So what’s linear about linear regression?
the regression function is linear in the parameters (α, β1 , β2 , · · · )
we can’t estimate a regression like this by OLS:
Yi = α Ki^β Li^γ + ei
73
The log specification for income
74
Log(income) specification
75
Interpreting the log specification
the simple log specification for income seems to work extremely well in this example, and
often does for similar variables.
the log specification:
test scorei = α + βln(income)i + ei
implies:
∂test score/∂income = β/income
∂test score / (∂income/income) = β
76
The log of income
proportional changes in income are often more reasonable than absolute changes:
a $1000 change is pretty big for a district with income of $15 000 (in terms of economic
impact)
a $1000 change is much smaller for a district with income of $40 000.
comparing a 10% change may be a better suited exercise (so roughly a $1,500 change for the low income
district compared to a $4,000 change for the high income district). This is what the log
specification does.
77
Log derivatives
we know that:
∂ln(x)/∂x = 1/x  ⇒  ∂ln(x) = ∂x/x
from this, it follows that:
∂ln(y)/∂ln(x) = (∂y/y) / (∂x/x) = (∂y/∂x) · (x/y)
which is an elasticity
this means that the log-log regression, ln(y)i = α + β ln(x)i + ei, gives you an elasticity:
β measures the percentage change in y from a 1% change in x
78
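A brief numerical illustration (simulated data with a true elasticity of 0.3) of the log-log regression and its elasticity interpretation:

```python
# A minimal sketch of a log-log regression, whose slope estimates the elasticity of y with respect to x.
import numpy as np

rng = np.random.default_rng(9)
x = rng.uniform(10, 50, size=300)
y = 5.0 * x**0.3 * np.exp(rng.normal(0, 0.05, size=300))     # true elasticity is 0.3

X = np.column_stack([np.ones_like(x), np.log(x)])
alpha_hat, beta_hat = np.linalg.solve(X.T @ X, X.T @ np.log(y))
print(beta_hat)          # approximately 0.3: a 1% increase in x raises y by about beta percent
```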
Controlling for income in the test score data
the quadratic regression evaluated at the mean yields a larger (partial) effect of income on
test scores than the linear regression in this case. Why?
the linear regression puts comparatively more weight on districts with very high incomes
(which the linear specification does not fit well).
80
Partial effects at the mean
81
Making sense of the income results
82
Partial effects at the mean
83
Sub-sample partial effects - linear specification
84
Sub-sample effects: income ∈ (10000, 20000)
85
Sub-sample effects: income ∈ (10000, 20000)
86