
ECON6004

Quantitative Methods
OLS and Regression Analysis

Jan Podivinsky

University of Southampton
The linear regression model

ˆ example: Using data on class sizes and test scores from different school districts, we want
to estimate the expected effect of reducing class size on test scores.
ˆ We can write this relationship as follows:

β1 = ∆TestScore / ∆ClassSize
ˆ rearrange to get a sense of the answer to our question:

∆TestScore = β1 ∆ClassSize

(what sign do we expect β1 to take, and why?)


ˆ or we can write in terms of a regression model:

TestScore = β0 + β1 ClassSize + ϵ,

where β0 is the intercept of this straight line, and β1 is the slope. 1


The data

2
Class size effects

ˆ question of interest: Do smaller classes result in better outcomes for students?


ˆ Data: Primary school kids in Californian school districts (n = 420) for year 1999
ˆ Variables:
ˆ dependent/“response” variable: 5th grade test scores (Stanford-9 achievement test,
combined math and reading), district average
ˆ explanatory variable: Student-teacher ratio (STR) = no. of students in the district divided
by no. full-time equivalent teachers

3
Causality

ˆ do smaller classes causally result in better outcomes for students?

[Causal diagram: Class Size → Test Score]

4
Causality

ˆ do smaller classes causally result in better outcomes for students?


ˆ what if something else, a confounding / omitted variable, drives both class size and test
scores?
ˆ an example of this could be average family income within the district.
ˆ through local taxes, richer districts will be able to afford more teachers
ˆ children from wealthier families have, on average, higher test scores

[Causal diagram: Class Size → Test Score, with District Income affecting both Class Size and Test Score]

5
The OLS estimator

ˆ OLS: Ordinary Least Squares: Yi = β0 + β1 Xi + ei


ˆ the OLS estimator chooses the regression coefficients so that the estimated regression line
is as close as possible to the observed data, where closeness is measured by the sum of
the squared “mistakes” made in predicting Y given X
ˆ we can compare the predicted value (Ŷi) with the actual value (Yi) of the dependent
variable; the difference between the actual and predicted values gives the
residual: ei = Yi − Ŷi
ˆ since Ŷi = β̂0 + β̂1 Xi this gives ei = Yi − Ŷi = Yi − (β̂0 + β̂1 Xi )
ˆ rather than minimise the sum of residuals, minimise the sum of squared residuals
(sometimes written as residual sum of squares (RSS)):
RSS = ∑_{i=1}^{N} e_i²
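As a quick illustration of this criterion (a minimal Python/numpy sketch; the toy data and variable names are illustrative, not from the slides), the RSS can be computed directly for any candidate pair (b0, b1):

    import numpy as np

    # toy data (illustrative only): class size X and test score Y
    X = np.array([22.0, 18.5, 25.0, 20.0, 23.5])
    Y = np.array([640.0, 660.0, 625.0, 655.0, 638.0])

    def rss(b0, b1):
        """Residual sum of squares for the candidate line Y = b0 + b1*X."""
        e = Y - (b0 + b1 * X)          # residuals
        return np.sum(e ** 2)

    print(rss(700.0, -2.5))            # RSS for one candidate (b0, b1)

OLS picks the pair (b0, b1) that makes this quantity as small as possible.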

6
The linear regression problem

ˆ the OLS regression line is the solution to the linear regression problem:

(β0, β1) = arg min_{b0, b1} E{(Yi − b0 − b1 Xi)²}

ˆ all this says is that we choose values for b0 and b1 such that we get the “line of best
fit”, where “best fit” is defined by minimising the RSS
ˆ define RSS = E{(Yi − b0 − b1 Xi)²}. The derivatives of the RSS with respect to the
parameters are:

∂RSS/∂b0 = −2E(Yi − b0 − b1 Xi)
∂RSS/∂b1 = −2E{(Yi − b0 − b1 Xi)Xi}

7
The linear regression problem

ˆ So the first order conditions (FOCs) of the minimisation problem are:

E (Yi − β0 − β1 Xi ) = 0
E {(Yi − β0 − β1 Xi )Xi } = 0

N.B. once we write the FOCs we replace b0 and b1 with β0 and β1 - the parameters which
solve the population regression problem.
ˆ the first FOC yields:
β0 = E (Yi ) − β1 E (Xi )
substitute this into the second FOC and rearrange:

E(Yi Xi) − β1 E(Xi²) = β0 E(Xi) = E(Yi)E(Xi) − β1 {E(Xi)}²

E(Yi Xi) − E(Yi)E(Xi) = β1 [E(Xi²) − {E(Xi)}²]

Cov(Yi, Xi) = β1 Var(Xi)


8
The linear regression problem solved

ˆ the solution to the linear regression problem:

(β0, β1) = arg min_{b0, b1} E{(Yi − b0 − b1 Xi)²}

is given by:

β0 = E(Yi) − β1 E(Xi)   (= Ȳ − β1 X̄)

β1 = Cov(Yi, Xi) / Var(Xi) = ∑_{i=1}^{N} (Yi − Ȳ)(Xi − X̄) / ∑_{i=1}^{N} (Xi − X̄)²

N.B. in a sample we replace population expectations with sample averages
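A minimal numpy sketch of this sample analogue (illustrative data and names, not from the slides): the slope is the ratio of the sample covariance to the sample variance of X, and the intercept follows from the means.

    import numpy as np

    X = np.array([22.0, 18.5, 25.0, 20.0, 23.5])   # illustrative regressor
    Y = np.array([640.0, 660.0, 625.0, 655.0, 638.0])

    # sample analogues of Cov(Yi, Xi) and Var(Xi)
    beta1 = np.sum((Y - Y.mean()) * (X - X.mean())) / np.sum((X - X.mean()) ** 2)
    beta0 = Y.mean() - beta1 * X.mean()            # beta0 = Ybar - beta1 * Xbar
    print(beta0, beta1)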

9
The data and the OLS regression line

10
OLS in matrix form

ˆ the linear model with several (say k) explanatory variables is given by the equation:

Yi = β0 + β1 Xi1 + β2 Xi2 + · · · + βk−1 Xi(k−1) + ϵi , i = 1, · · · , N

We can write the multiple regression model in matrix form, by defining the following
vectors and matrices:
ˆ Let X be an N × k matrix containing, for each of the N observations, the k − 1
explanatory variables plus a column of ones for the constant.
ˆ Let Y be an N × 1 vector of observations on the dependent variable.
ˆ Let ϵ be an N × 1 vector of disturbances or errors.
ˆ Let β be a k × 1 vector of unknown population parameters that we want to estimate.

11
OLS in matrix form

ˆ Our regression model would then look like:


       
[ Y1 ]   [ 1  X11  X12  ···  X1(k−1) ] [ β0   ]   [ ϵ1 ]
[ Y2 ]   [ 1  X21  X22  ···  X2(k−1) ] [ β1   ]   [ ϵ2 ]
[ ⋮  ] = [ ⋮   ⋮    ⋮          ⋮     ] [ ⋮    ] + [ ⋮  ]
[ YN ]   [ 1  XN1  XN2  ···  XN(k−1) ] [ βk−1 ]   [ ϵN ]
 (N×1)              (N×k)               (k×1)      (N×1)

which can be written more simply in matrix notation as:

Y = Xβ + ϵ

ˆ the model has a systematic component, X β, and a stochastic (random) component, ϵ

12
OLS: criteria for estimates

ˆ our estimates of the population parameters are referred to as β̂


PN
ˆ we want to find the estimator β̂ that minimizes the sum of squared residuals ( 2
i=1 ei in
scalar notation).
ˆ be careful about distinguishing between disturbances/error terms, ϵ, that refer to things
that cannot be observed and residuals, e, that can be observed.

13
OLS in matrix form: criteria for estimates

ˆ the vector of residuals e is given by:

e = Y − X β̂

ˆ the sum of squared residuals (RSS) is e ′ e:


 
e′e = [ e1  e2  ···  eN ](1×N) [ e1  e2  ···  eN ]′(N×1) = e1² + e2² + ··· + eN²   (a 1×1 scalar)

ˆ N.B. this is not the same thing as ee ′ - the variance-covariance matrix of residuals

14
Sum of squared residuals (RSS)

ˆ we can write the sum of squared residuals as:

e ′ e = (Y − X β̂)′ (Y − X β̂)
= Y ′ Y − β̂ ′ X ′ Y − Y ′ X β̂ + β̂ ′ X ′ X β̂
= Y ′ Y − 2β̂ ′ X ′ Y + β̂ ′ X ′ X β̂

ˆ note, moving from the second to the third line we use the fact that the transpose of a
scalar is a scalar, i.e.
Y′X β̂ = (Y′X β̂)′ = β̂′X′Y,   since Y′X β̂ has dimensions (1×N)(N×k)(k×1) = (1×1)

15
Finding β̂, the OLS estimator

ˆ to find the β̂ that minimises the sum of squared residuals, we need to take the derivative
with respect to β̂:
∂(e′e)/∂β̂ = −2X′Y + 2X′X β̂ = 0
ˆ to ensure this is a minimum we take the derivative of this with respect to β̂ again, this
gives us 2X ′ X
ˆ as long as X has full rank, this is a positive definite matrix (analogous to a positive real
number) and therefore a minimum

16
The normal equations

∂(e′e)/∂β̂ = −2X′Y + 2X′X β̂ = 0

ˆ From this we get the “normal equations”:

(X ′ X )β̂ = X ′ Y

ˆ remember that (X ′ X ) and X ′ Y are known from our data, but β̂ is unknown
ˆ if the inverse of (X ′ X ) exists, i.e. X is full rank, then pre-multiplying both sides by this
inverse gives us:
(X ′ X )−1 (X ′ X )β̂ = (X ′ X )−1 X ′ Y

17
The normal equations

ˆ the inverse of (X ′ X ) may not exist, in which case the matrix is called non-invertible or
singular, and is said to be of less than full rank.
ˆ there are two possible reasons why this matrix might be non-invertible:
1. If N < k i.e. we have more independent variables than observations, then the matrix is not of
full rank
2. One or more of the independent variables are a linear combination of the other variables, i.e.
perfect multicollinearity

18
The normal equations

(X ′ X )−1 (X ′ X )β̂ = (X ′ X )−1 X ′ Y

ˆ we know that by definition, (X′X)−1(X′X) = I, where I in this case is a k × k identity matrix.
ˆ using this in the equation above, we find:

I β̂ = (X ′ X )−1 X ′ Y
β̂ = (X ′ X )−1 X ′ Y
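A hedged numpy sketch of this formula (simulated data; all names are ours, not from the slides): in practice one solves the normal equations (X′X)β̂ = X′Y directly rather than forming the inverse explicitly, which is numerically more stable and gives the same β̂.

    import numpy as np

    rng = np.random.default_rng(0)
    N, k = 100, 3
    X = np.column_stack([np.ones(N), rng.normal(size=(N, k - 1))])  # first column = constant
    Y = X @ np.array([1.0, 2.0, -0.5]) + rng.normal(size=N)         # simulated Y = X*beta + eps

    beta_hat = np.linalg.solve(X.T @ X, X.T @ Y)   # solves (X'X) beta_hat = X'Y
    print(beta_hat)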

19
Properties of OLS estimates

ˆ the primary property of OLS estimators is that they satisfy the criteria of minimising the
sum of squared residuals (RSS). But there are other properties that will also be true
ˆ recall the normal equations from earlier:

(X ′ X )β̂ = X ′ Y

ˆ now substitute in Y = X β̂ + e to get:

(X ′ X )β̂ = X ′ (X β̂ + e)
(X ′ X )β̂ = (X ′ X )β̂ + X ′ e
X ′e = 0

20
Properties of OLS estimates

ˆ what does X ′ e look like?


   
X′e =
[ 1         1         ···  1        ] [ e1 ]
[ X11       X21       ···  XN1      ] [ e2 ]
[ ⋮          ⋮              ⋮       ] [ ⋮  ]
[ X1(k−1)   X2(k−1)   ···  XN(k−1)  ] [ eN ]
              (k×N)                    (N×1)

    [ e1 + e2 + ··· + eN                           ]   [ 0 ]
    [ X11 e1 + X21 e2 + ··· + XN1 eN               ]   [ 0 ]
  = [ ⋮                                            ] = [ ⋮ ]
    [ X1(k−1) e1 + X2(k−1) e2 + ··· + XN(k−1) eN   ]   [ 0 ]
                                         (k×1)          (k×1)

ˆ from X ′ e = 0, we can derive a number of properties
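This orthogonality is easy to verify numerically; a small sketch with simulated data (names and numbers are illustrative only):

    import numpy as np

    rng = np.random.default_rng(0)
    N = 100
    X = np.column_stack([np.ones(N), rng.normal(size=(N, 2))])   # constant + 2 regressors
    Y = X @ np.array([1.0, 2.0, -0.5]) + rng.normal(size=N)

    beta_hat = np.linalg.solve(X.T @ X, X.T @ Y)
    e = Y - X @ beta_hat                 # residuals

    print(X.T @ e)     # ~0 for every column: regressors are orthogonal to the residuals
    print(e.mean())    # ~0 because X includes a constant column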

21
Properties of OLS estimates

ˆ the observed values of X are uncorrelated (orthogonal) with the residuals. X′e = 0 implies
that every column xk of X satisfies xk′ e = 0.
ˆ in other words, each regressor has zero sample correlation with the residuals (e).
ˆ note: this does not mean that X is uncorrelated with the disturbances (ϵ); we have to
assume this
ˆ the residuals represent the “unexplained” variation in Y - if they are not orthogonal to X ,
then more explanation could be squeezed out of X by a different set of coefficients.

22
Properties of OLS estimates

ˆ if our regression includes a constant (as it does the way I’ve written X above), then the
following properties also hold.
ˆ the sum of the residuals is zero (see the first row of X′e two slides back)
ˆ put another way, this means the sample mean of the residuals is zero

23
Properties of OLS estimates

ˆ the regression hyperplane passes through the means of the observed values (X̄ and Ȳ).
ˆ this follows from the fact that ē = 0
ˆ Recall that e = Y − X β̂
ˆ summing across observations and dividing by N: ē = Ȳ − X̄ β̂ = 0, where X̄ is the (row) vector of sample means of the regressors
ˆ this implies Ȳ = X̄ β̂, which shows that the regression hyperplane goes through the point
of means of the data

24
Gauss-Markov assumptions

ˆ note that we know nothing about β̂ except that it satisfies all of the properties discussed
above.
ˆ we need to make some assumptions about the true model in order to make any inferences
regarding β (the true population parameters) from β̂ (our estimator of the true
parameters).
ˆ these we call the Gauss-Markov assumptions.

26
Gauss-Markov assumptions - set-up

ˆ for the assumptions that follow, we will deal with a regression model that is linear in the
parameters (i.e. what we have seen already).
ˆ in matrix notation, we have:
Y = Xβ + ϵ
ˆ in scalar notation, we have:

Yi = β0 + β1 Xi1 + β2 Xi2 + · · · + βk−1 Xi(k−1) + ϵi , i = 1, · · · , N

27
Gauss-Markov assumptions - set-up

ˆ we also require that X is an N × k matrix of full rank, k


ˆ this requirement states that there is no perfect multicollinearity.
ˆ in other words, the columns of X are linearly independent.
ˆ this requirement also states that the number of observations N must be at least as large as the
number of parameters to be estimated, i.e. N ≥ k.
ˆ this requirement is sometimes known as the identification condition

28
Gauss-Markov assumptions

ˆ assumption 1 [A1]: in scalar form - E [ϵi ] = 0, i = 1, · · · , N


ˆ assumption 1 [A1]: in matrix form - E [ϵ] = 0

ˆ [A1] states that the expected value of the error term is zero, which means that, on
average, the regression line should be correct

29
Gauss-Markov assumptions

ˆ assumption 2 [A2]: {ϵ1 , · · · , ϵN } and {X1 , · · · , XN } are independent

30
Gauss-Markov assumptions

ˆ assumption 3 [A3]: in scalar form - Var [ϵi ] = σ 2 , i = 1, · · · , N


ˆ assumption 3 [A3]: in matrix form - Var[ϵ] = σ² IN

ˆ [A3] states that all error terms have the same variance - we call this homoskedasticity
ˆ this is a useful assumption since it implies that no particular value of X carries any more
information about the behaviour of Y than any other

31
Gauss-Markov assumptions

ˆ assumption 4 [A4]: in scalar form - Cov [ϵi , ϵj ] = 0, i, j = 1, · · · , N, i ̸= j

ˆ [A4] imposes zero correlation between different error terms. this we describe as a case of
no autocorrelation
ˆ i.e. knowing something about the disturbance term for one observation tells us nothing
about the disturbance term for any other observation.

32
Gauss-Markov assumptions

ˆ from [A3] and [A4], we can write down the variance-covariance matrix of the error terms as:

E[ϵϵ′] =
[ σ²  0   ···  0  ]      [ 1  0  ···  0 ]
[ 0   σ²  ···  0  ]      [ 0  1  ···  0 ]
[ ⋮   ⋮    ⋱   ⋮  ] = σ² [ ⋮  ⋮   ⋱  ⋮ ] = σ² IN
[ 0   0   ···  σ² ]      [ 0  0  ···  1 ]

ˆ [A3] gives us the diagonals


ˆ [A4] gives us the off-diagonals

33
Gauss-Markov assumptions

ˆ from [A2] we have that X and ϵ are independent, which implies, along with the other G-M
assumptions, that:

E [ϵ|X ] = E [ϵ] = 0 and E [ϵϵ′ |X ] = E [ϵϵ′ ] = σ 2 IN

ˆ this is a much stronger statement than we had before; this means the disturbances average
out to 0 for any value of X
ˆ this means that the matrix of explanatory variables X does not provide any information
about the expected values of the error terms, or how they (co)vary

E[ϵϵ′|X] =
[ σ²  0   ···  0  ]      [ 1  0  ···  0 ]
[ 0   σ²  ···  0  ]      [ 0  1  ···  0 ]
[ ⋮   ⋮    ⋱   ⋮  ] = σ² [ ⋮  ⋮   ⋱  ⋮ ] = σ² IN
[ 0   0   ···  σ² ]      [ 0  0  ···  1 ]

ˆ from [A1] and [A2] we have that E (Y |X ) = E (Y ) = X β 34


An additional assumption

ˆ [A5]: ϵ ∼ N(0, σ 2 IN )
ˆ this is not one of the G-M assumptions, but is useful for inference (i.e. hypothesis testing)

35
Gauss-Markov Theorem

ˆ the Gauss-Markov Theorem states that, under assumptions 1-4, there will be no other
linear and unbiased estimator of the β coefficients that has a smaller sampling variance.
ˆ in other words, the OLS estimator is the Best Linear Unbiased Estimator (BLUE).
ˆ the “Best” part of BLUE relates to the variance of the OLS estimator - it is the smallest
among all linear unbiased estimators

36
Gauss-Markov Theorem

ˆ proof that β̂ is an unbiased estimator of β:


ˆ from earlier we know i.) β̂ = (X ′ X )−1 X ′ Y and ii.) Y = X β + ϵ
ˆ this means that:

β̂ = (X ′ X )−1 X ′ (X β + ϵ)
= (X ′ X )−1 X ′ X β + (X ′ X )−1 X ′ ϵ
= β + (X ′ X )−1 X ′ ϵ

using that (X ′ X )−1 X ′ X = Ik

37
Gauss-Markov Theorem

ˆ this shows immediately that OLS is unbiased:

E[β̂] = E[(X′X)−1X′Y]
     = E[β + (X′X)−1X′ϵ]
     = E[β] + E[(X′X)−1X′ϵ]
     = β + E[(X′X)−1X′ϵ]
     = β + E[(X′X)−1X′]E[ϵ]
     = β

where we move from line four to five based on [A2], and from five to six based on [A1]

38
Variance-covariance matrix

ˆ We have our point estimates (which we just saw are unbiased), but what about our
standard errors etc? We need to derive the variance-covariance matrix of the OLS
estimator, β̂:

Var [β̂] = E [(β̂ − β)(β̂ − β)′ ] = E [((X ′ X )−1 X ′ ϵ) ((X ′ X )−1 X ′ ϵ)′ ]
= E [(X ′ X )−1 X ′ ϵϵ′ X (X ′ X )−1 ]

using the fact that (AB)′ = B ′ A′ . i.e. we can rewrite ((X ′ X )−1 X ′ ϵ)′ as ϵ′ X (X ′ X )−1
ˆ if we assume that X is non-stochastic, then:

E [(β̂ − β)(β̂ − β)′ ] = (X ′ X )−1 X ′ E [ϵϵ′ ]X (X ′ X )−1

ˆ note that X is indeed stochastic. The assumption above makes the proof easier, but the
proof does not rely on this

39
Variance-covariance matrix

ˆ from earlier, we have that the variance-covariance matrix of the disturbances is


E [ϵϵ′ ] = σ 2 I , so we now have:

Var[β̂] = E[(β̂ − β)(β̂ − β)′] = (X′X)−1X′ σ²I X(X′X)−1
        = σ² (X′X)−1X′X(X′X)−1
        = σ² (X′X)−1

ˆ as we don’t observe the disturbances (ϵi), we have to use the residuals (ei) to estimate σ² with σ̂²:

σ̂² = e′e / (N − k)

the square root of which is called the standard error of the regression.
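Continuing the same kind of sketch (simulated data, illustrative names only), σ̂², the variance-covariance matrix σ̂²(X′X)⁻¹ and the coefficient standard errors can be computed in a few lines:

    import numpy as np

    rng = np.random.default_rng(0)
    N, k = 100, 3
    X = np.column_stack([np.ones(N), rng.normal(size=(N, k - 1))])
    Y = X @ np.array([1.0, 2.0, -0.5]) + rng.normal(size=N)

    beta_hat = np.linalg.solve(X.T @ X, X.T @ Y)
    e = Y - X @ beta_hat

    sigma2_hat = (e @ e) / (N - k)                  # e'e / (N - k)
    var_beta = sigma2_hat * np.linalg.inv(X.T @ X)  # sigma^2_hat * (X'X)^-1
    se_beta = np.sqrt(np.diag(var_beta))            # standard errors of beta_hat
    print(np.sqrt(sigma2_hat), se_beta)             # SER and coefficient standard errors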

40
Goodness of fit: R²

ˆ How do we measure how well the estimated regression model “fits” the data?
ˆ Typically we use a measure known as the R²: the proportion of the sample variance of Y
that is explained by the model. Recall i.):

σ̂² = e′e / (N − k)

ˆ and ii.) that ∑_{i=1}^{N} e_i² = e′e = RSS = TSS − ESS
ˆ TSS = Total Sum of Squares = Y′Y − N Ȳ²
ˆ ESS = Explained Sum of Squares = β̂′X′Y − N Ȳ²
ˆ RSS = TSS − ESS = Residual Sum of Squares = Y′Y − β̂′X′Y

R² = ESS/TSS = (β̂′X′Y − N Ȳ²) / (Y′Y − N Ȳ²) = 1 − RSS/TSS
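A short numpy check of these identities (simulated data, illustrative names): TSS, ESS and RSS computed this way satisfy R² = ESS/TSS = 1 − RSS/TSS.

    import numpy as np

    rng = np.random.default_rng(0)
    N, k = 100, 3
    X = np.column_stack([np.ones(N), rng.normal(size=(N, k - 1))])
    Y = X @ np.array([1.0, 2.0, -0.5]) + rng.normal(size=N)

    beta_hat = np.linalg.solve(X.T @ X, X.T @ Y)
    e = Y - X @ beta_hat

    TSS = Y @ Y - N * Y.mean() ** 2       # Y'Y - N*Ybar^2
    RSS = e @ e                           # = Y'Y - beta_hat'X'Y
    ESS = TSS - RSS
    print(ESS / TSS, 1 - RSS / TSS)       # the two expressions for R^2 coincide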
41
Inference in bivariate regression

ˆ standard errors for regression parameters


ˆ how does the standard error for the regression slope parameter compare to that for the sample
average?
ˆ the relationship between the regressors and residuals: homoskedasticity versus
heteroskedasticity
ˆ robust standard errors to account for heteroskedasticity
ˆ testing (t-test, p-values, confidence intervals) works just the same as for sample averages

44
Regression for the California Test Score data

ˆ our estimated regression model for the California class size data is

Yi = 698.9 − 2.28Xi + ϵi

ˆ the data are the population of California school districts for 1999.
ˆ the estimator for the regression slope has sampling variation, i.e. it is a random variable
because the samples it is constructed from contain randomness.
ˆ the graphs below help to visualise this sampling variation with a thought experiment
ˆ pretend our sample is the population
ˆ take smaller sub-samples from the “population”

45
A regression line for a sample of 30

46
A regression line for another sample of 30

47
Different OLS estimates

ˆ the estimated regression slopes in the two pictures are different. One is -3.48, the other is
-1.38. Neither one matches the “population” regression slope of -2.28.
ˆ the average of the estimates from 10 samples is -2.47.
ˆ if we do this very many times we will get -2.28 on average because the OLS regression
slope is an unbiased estimator of the population regression slope
ˆ the estimator for the regression slope has sampling variation, i.e. it is a random variable
because the samples it is constructed from contain randomness.

48
Distribution of 100 000 estimates for β (n=30)

49
Standard error

ˆ the standard error of the sample average is:


SE(Ȳn) = √( Var(Yi) / n )
ˆ The standard error of the estimated slope coefficient in a bivariate sample regression of the form:

Yi = β̂0 + β̂1 Xi + ϵi

looks similar:

SE(β̂1) = √( (1/n) · Var(ϵi) / Var(Xi) )

50
Decomposing sampling variability

ˆ The OLS standard error

SE(β̂) = √( (1/n) · Var(ϵi) / Var(Xi) )

depends on three key elements:
ˆ the inverse of the sample size, n. Larger samples result in more precise (smaller standard
error) estimates
ˆ the amount of variation in the residual ϵi . This replaces Var (Yi ) in the formula for the
standard error of the sample average.
ˆ the inverse of the variation in the regressor Xi . More precise estimates when there is lots of
variation in the regressor.

51
Conventional and (heteroskedasticity-) robust standard errors

ˆ conventional standard errors:

SE(β̂) = √( (1/n) · Var(ϵi) / Var(Xi) )

ˆ robust standard errors:

RSE(β̂) = √( (1/n) · Var{(Xi − E[Xi])ϵi} / [Var(Xi)]² )
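A hedged sketch of both formulas (simulated heteroskedastic data; names and numbers are illustrative), using the residuals in place of the unobserved ϵi:

    import numpy as np

    rng = np.random.default_rng(1)
    n = 500
    X = rng.normal(size=n)
    eps = rng.normal(size=n) * (1 + 0.8 * np.abs(X))     # error variance depends on X
    Y = 2.0 - 1.5 * X + eps

    b1 = np.sum((Y - Y.mean()) * (X - X.mean())) / np.sum((X - X.mean()) ** 2)
    b0 = Y.mean() - b1 * X.mean()
    e = Y - (b0 + b1 * X)                                # residuals stand in for eps

    se_conv = np.sqrt(np.var(e, ddof=1) / np.var(X, ddof=1) / n)
    se_rob = np.sqrt(np.var((X - X.mean()) * e, ddof=1) / np.var(X, ddof=1) ** 2 / n)
    print(se_conv, se_rob)   # the robust SE is typically larger here, reflecting the heteroskedasticity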

52
Data where the residual variance is unrelated to the regressor (homoskedasticity)

53
Data where the residual variance is related to the regressor (heteroskedasticity)

54
Homoskedasticity versus heteroskedasticity

ˆ heteroskedasticity: The dispersion in the residuals is related to the regressor Xi. The
robust standard error allows for this.
ˆ homoskedasticity: The dispersion in the residuals is unrelated to Xi, or put another way,
E(ϵi²|Xi) = Var(ϵi), a constant. In this case:

Var{(Xi − E[Xi])ϵi} = Var(ϵi)Var(Xi)

so that the OLS sampling variance simplifies to:

RSE(β̂) = √( (1/n) · Var{(Xi − E[Xi])ϵi} / [Var(Xi)]² )
        = √( (1/n) · Var(ϵi)Var(Xi) / [Var(Xi)]² )
        = √( (1/n) · Var(ϵi) / Var(Xi) ) = SE(β̂)
55
Inference and testing in multivariate regression

ˆ multivariate regression = regression with more than one regressor


ˆ standard errors and t-tests for a single coefficient are just analogous to the bivariate
regression case.
ˆ New testing problems arise in multivariate regression:
ˆ Testing single hypotheses involving multiple coefficients
ˆ Testing multiple hypotheses at the same time

56
Tests involving multiple coefficients

ˆ consider the (multivariate) regression:

test scorei = α + β class sizei + γ1 % English 2nd languagei + γ2 % free school meali + ei

ˆ we may be interested in the hypothesis that having more English 2nd language students or
more students on free lunches has the same impact on test scores, i.e. in the hypothesis:

H0: γ1 = γ2   versus   H1: γ1 ≠ γ2

ˆ so the test statistic is:


tn−k = (γ̂1 − γ̂2) / SE(γ̂1 − γ̂2)

57
The two coefficient t-test

ˆ we need to find SE(γ̂1 − γ̂2). Recall:

Var(γ̂1 − γ̂2) = Var(γ̂1) + Var(γ̂2) − 2Cov(γ̂1, γ̂2)

ˆ one can find Cov(γ̂1, γ̂2) just like you can find the sampling variance for a single
coefficient. The t-statistic is simply:

tn−k = (γ̂1 − γ̂2) / √( Var(γ̂1) + Var(γ̂2) − 2Cov(γ̂1, γ̂2) )
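A minimal numerical sketch (the point estimates and variance-covariance entries below are made-up numbers, purely for illustration):

    import numpy as np

    # hypothetical estimates and sampling (co)variances for (gamma1_hat, gamma2_hat)
    g1_hat, g2_hat = -0.65, -0.40
    var_g1, var_g2, cov_g12 = 0.031 ** 2, 0.029 ** 2, 0.0002

    se_diff = np.sqrt(var_g1 + var_g2 - 2 * cov_g12)     # SE(gamma1_hat - gamma2_hat)
    t_stat = (g1_hat - g2_hat) / se_diff
    print(t_stat)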

58
Testing multiple hypotheses at once

ˆ in multivariate regression it is sometimes interesting to test multiple hypotheses at once.


Consider again the regression

test scorei = α + β class sizei + γ1 % English 2nd languagei + γ2 % free school meali + ei

ˆ we may be interested in the hypothesis that neither the fraction of English learners nor the
fraction of free lunch students has any impact on test scores, or:

H0: γ1 = 0 and γ2 = 0   versus   H1: γ1 ≠ 0 and/or γ2 ≠ 0

ˆ we could just use two simple t-tests for each hypothesis that γ1 = 0 and γ2 = 0, which
would be a test of whether the two nulls were individually true. But we may want to know
whether both are true at once

59
Testing joint hypotheses

ˆ to test a joint hypothesis, we cannot just combine the single t-statistics. There are two
reasons for this:
ˆ As before, the estimated coefficients γ̂1 and γ̂2 will in general be correlated. We need to take
this correlation into account.
ˆ Even if this correlation is zero, rejecting the joint hypothesis if either one of the two t-tests
rejects would reject too often under the null hypothesis. Suppose t1 and t2 are your two
t-statistics. Under the null, you fail to reject with probability:

Pr(|t1| ≤ 1.96 and |t2| ≤ 1.96) = Pr(|t1| ≤ 1.96) × Pr(|t2| ≤ 1.96)
                                = 0.95² = 0.9025

ˆ this means we are rejecting 9.75% of the time (1 − 0.95² = 0.0975), rather than 5% of the time if
the null hypothesis is true

60
The F -test

ˆ in order to test a joint hypothesis, we need to perform an F -test. The F -statistic for the
hypothesis H0 : γ1 = 0, γ2 = 0 has the form

F = (1/2) · ( t1² + t2² − 2 ρ_{t1,t2} t1 t2 ) / ( 1 − ρ²_{t1,t2} ),

where ρ_{t1,t2} is the correlation of the two t-statistics. Note, if

ˆ ρ_{t1,t2} = 0, we just average the two squared t-statistics.
ˆ ρ_{t1,t2} is large, we want to subtract something from the sum of the two squared t-statistics because if
one t-test rejects under the null, then the second test is more likely to reject as well.
ˆ we compare the F-statistic to a χ²(2) distribution because our test involves 2 restrictions.
Using the appropriate distribution adjusts the rejection region, so we don’t reject too
often under the null.

61
The F -test and the t-test

ˆ note that the F -statistic for a single hypothesis is just

F = t²

and has a χ²(1) distribution under the null. You can always do an F -test instead of a
t-test (but not vice versa).

62
Testing equality of two coefficients

ˆ if we test the hypothesis


H0: γ1 = γ2
in the regression:

test scorei = α + β class sizei + γ1 % English 2nd languagei + γ2 % free school meali + ei

Stata computes an F -test.

63
Testing equality of two coefficients

64
Testing a joint hypothesis

65
F-test for comparing between a “short” and “long” regression

ˆ what if we want to test a regression model with many regressors (the “long” model)
against a regression model with just a few variables (the “short” model)
ˆ use an F -test to do so
ˆ Compare the RSS from the unrestricted regression model (RSS_UR) to the RSS from the
restricted one (RSS_R):

F = [ (RSS_R − RSS_UR) / J ] / [ RSS_UR / (N − K_UR) ] ∼ F(J, N − K_UR)

where J is the number of restrictions (i.e. the difference between the number
of regressors in the unrestricted and restricted models)
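A small sketch of this comparison (all numbers are made up for illustration; scipy is used only to get the p-value from the F(J, N − K_UR) distribution):

    from scipy import stats

    RSS_R, RSS_UR = 152.6, 144.1   # restricted and unrestricted residual sums of squares (illustrative)
    N, K_UR, J = 420, 4, 2         # observations, regressors in the long model, restrictions

    F = ((RSS_R - RSS_UR) / J) / (RSS_UR / (N - K_UR))
    p_value = stats.f.sf(F, J, N - K_UR)   # upper-tail probability under F(J, N - K_UR)
    print(F, p_value)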

66
Functional form in regression

test scorei = α + β1 incomei + ei

67
Other forms of the regression model

ˆ it is easy to augment the simple regression model to fit the income data better, for
example, by adding a quadratic term in income and estimating

test scorei = α + β1 incomei + β2 incomei² + ei

68
Linear vs. quadratic specification

69
Linear vs. quadratic specification: testing β̂2

70
Linear vs. quadratic specification: testing β̂2

71
How to interpret non-linear regression functions?

ˆ with a simple linear regression function, interpretation is easy:

test scorei = α + βincomei + ei

In this case β is the effect of a $1,000 increase in average income on test scores.
ˆ in the quadratic specification, it’s a bit more difficult:

test scorei = α + β1 incomei + β2 incomei² + ei

the effect of a $1,000 increase in average income on test scores is now:


∂test score/∂income = β1 + 2β2 income
so the effect depends on the level of income you look at

72
What’s linear about linear regression?

ˆ OLS regression is often called linear regression. So what’s linear about linear regression?
ˆ the regression function is linear in the parameters (α, β1 , β2 , · · · )
ˆ we can’t estimate a regression like this by OLS:

Yi = α Ki^β Li^γ + ei

ˆ the regression function can be non-linear in the regressors.


ˆ we can still estimate a nonlinear relationship between test scores and income, for example, by
including the square of income.

73
The log specification for income

test scorei = α + βln(income)i + ei

74
Log(income) specification

75
Interpreting the log specification

ˆ the simple log specification for income seems to work extremely well in this example, and
often does for similar variables.
ˆ the log specification:
test scorei = α + βln(income)i + ei
implies:
∂test score/∂income = β/income

β = ∂test score / (∂income/income)

so, here, β is the effect of a relative change in income

76
The log of income

ˆ proportional changes in income are often more reasonable than absolute changes:
ˆ a $1000 change is pretty big for a district with income of $15 000 (in terms of economic
impact)
ˆ a $1000 change is much smaller for a district with income of $40 000.
ˆ comparing a 10% change may be a better suited exercise (so a $100 change for low income
districts compared to a $400 change for high income districts). This is what the log
specification does.

77
Log derivatives

ˆ we know that:
∂ln(x)/∂x = 1/x  ⇒  ∂ln(x) = ∂x/x

ˆ from this, it follows that:

∂ln(y)/∂ln(x) = (∂y/y)/(∂x/x) = (∂y/∂x)·(x/y)
which is an elasticity
ˆ this means that the log-log regression gives you an elasticity:

ln(test score)i = α + βln(income)i + ei

where here, β directly estimates an elasticity

78
Controlling for income in the test score data

Regressor                  (1)         (2)         (3)         (4)
Student-teacher ratio    -2.280***   -0.649*     -0.910**    -0.879**
                         (0.519)     (0.353)     (0.355)     (0.340)
Average income                        1.839***    3.882***
                                     (0.115)     (0.271)
Average income²                                  -0.044***
                                                 (0.005)
ln(Average income)                                            35.616***
                                                             (1.400)
Observations               420         420         420         420
R-squared                  0.051       0.511       0.564       0.570

79
Making sense of the income results

ˆ in the simple linear regression we got a coefficient on income of 1.84


ˆ for the quadratic regression we get
∂test score/∂income = β1 + 2β2 income
                    = 3.882 + 2 × (−0.044) × income
                    = 2.53 at the mean income (income = 15.3)

ˆ the quadratic regression evaluated at the mean yields a larger (partial) effect of income on
test scores than the linear regression in this case. Why?
ˆ the linear regression puts comparatively a lot of weight on districts with very high incomes
(which don’t provide a good fit for the linear regression).
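The arithmetic behind the 2.53 can be checked in one line (coefficients taken from the table above; any small discrepancy is rounding):

    beta1, beta2, mean_income = 3.882, -0.044, 15.3   # from the quadratic column of the table
    print(beta1 + 2 * beta2 * mean_income)            # ≈ 2.54 with these rounded inputs; the slides report 2.53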

80
Partial effects at the mean

81
Making sense of the income results

ˆ partial effect from simple linear regression: 1.84


ˆ partial effect (at the mean) from quadratic regression: 2.53
ˆ for the log specification:

∂test score/∂income = β · ∂ln(income)/∂income = β · (1/income)

ˆ so evaluating the partial effect at income = 15.3, we get:

∂test score/∂income = β · (1/income) = 35.616 × (1/15.3) = 2.33
which is very similar to the effect we obtained from the quadratic specification

82
Partial effects at the mean

83
Sub-sample partial effects - linear specification

84
Sub-sample effects: income ∈ (10000, 20000)

Regressor                  (1)         (2)         (3)         (4)
Student-teacher ratio    -1.305***   -1.269***   -1.270***   -1.269***
                         (0.452)     (0.427)     (0.426)     (0.430)
Average income                        2.348***    1.033
                                     (0.267)     (3.145)
Average income²                                   0.045
                                                 (0.108)
ln(Average income)                                            32.725***
                                                             (3.786)
Observations               280         280         280         280
R-squared                  0.030       0.225       0.226       0.221
Partial Effect of Income     -         2.348       2.425       2.137

85
Sub-sample effects: income ∈ (10000, 20000)

86
