
AgEc 551: Applied/Advanced Econometrics

Lecture 5 – Multiple regression model

Anbes Tenaye
Department of Agricultural Economics
Hawassa University

Last updated Nov. 26, 2022


Outline

• Continues from slide 34 of lecture 4.


• Regressions when X is a binary variable
• Omitted variable bias
• Introduction to multiple linear regression model and OLS

2 / 44
Reminder

Interpretation and prediction:

. reg ahe age

      Source |       SS       df       MS          Number of obs =    7711
-------------+------------------------------      F(  1,  7709) =  230.43
       Model |  23005.7375     1  23005.7375      Prob > F      =  0.0000
    Residual |  769645.718  7709  99.8372964      R-squared     =  0.0290
-------------+------------------------------      Adj R-squared =  0.0289
       Total |  792651.456  7710   102.80823      Root MSE      =  9.9919

         ahe |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
         age |   .6049863   .0398542    15.18   0.000     .5268613    .6831113
       _cons |   1.082275   1.184255     0.91   0.361    -1.239187    3.403737

The regression result gives:

Ŷ = 1.08 + 0.60·age

Predictions:
• A 26-year-old worker is predicted to have average hourly earnings of $16.68 (1.08 + 0.60×26).
• For each additional year of age, predicted average hourly earnings rise by $0.60.
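The same prediction can be computed in Stata from the stored coefficients; a minimal sketch, assuming the regression above has just been run (with the unrounded coefficients the prediction is ≈16.81 rather than the rounded 16.68):

. reg ahe age
. display _b[_cons] + _b[age]*26   // predicted average hourly earnings at age 26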
3 / 44
Regression when X is a binary variable

• A lot of information relevant for econometric analysis is qualitative.


• This information can be summarized with one or multiple binary
variables.
• In econometrics binary variables are typically called dummy variables.
• In defining a dummy variable we must decide which event is assigned
the value one and which is assigned the value 0.
• The name typically indicates the event with value one.
• Female (1=female, 0=male)
• Higher educ (1=college or more, 0=less than college)
• Public transport (1=use public transport to work, 0=do not use public
transport)
• Drug (1=received the drug, 0=received placebo)
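In Stata a dummy is typically created from a raw variable with gen; a minimal sketch (the variable sex and its coding are assumptions for illustration):

* assume a raw variable sex coded 1 = male, 2 = female
gen female = (sex == 2)
label define sexlbl 0 "male" 1 "female"
label values female sexlbl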

4 / 44
Regression when X is a binary variable

The population regression model with the binary variable Di (D=1 if female, D=0 if male) is:

Yi = β0 + β1·Di + ui

When i is a male (D=0) we get:

Yi = β0 + ui  ⟹  E(Yi | D = 0) = β0

while if i is a female (D=1) we get:

Yi = β0 + β1 + ui  ⟹  E(Yi | D = 1) = β0 + β1

Thus β1 = E(Yi | Female) − E(Yi | Male)
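This can be checked directly: regressing Y on the dummy reproduces the difference in group means. A minimal simulated sketch (all names and parameter values are invented for illustration):

clear
set obs 1000
set seed 12345
gen female = (runiform() < 0.5)
gen ahe = 20 - 2.6*female + rnormal(0, 10)  // simulated hourly earnings
reg ahe female                              // slope = mean(female) - mean(male)
ttest ahe, by(female)                       // the same difference of means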

5 / 44
Dummy variables

• The group with an indicator of 0 is the base group, the group against
which comparisons are made.
• It does not matter which group we choose as the base group, but it is important to keep track of which group it is.
• If the two groups do not differ, then β1 is zero.

6 / 44
Example

Data from additional exercise E4.1:

• Data on average hourly earnings from a sample of full-time workers.
• female = 1 if the person is female, female = 0 if the person is male.

. reg ahe female

      Source |       SS       df       MS          Number of obs =    7711
-------------+------------------------------      F(  1,  7709) =  129.46
       Model |  13091.0876     1  13091.0876      Prob > F      =  0.0000
    Residual |  779560.368  7709   101.12341      R-squared     =  0.0165
-------------+------------------------------      Adj R-squared =  0.0164
       Total |  792651.456  7710   102.80823      Root MSE      =  10.056

         ahe |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
      female |  -2.629912   .2311422   -11.38   0.000    -3.083013    -2.17681
       _cons |   20.11387   .1520326   132.30   0.000     19.81584    20.41189

Men are predicted to earn $20.11 per hour on average; women are predicted to earn $2.63 less.

7 / 44
Proportions and percentages as dependent variables

• The proportional change is the change in a variable relative to its


initial value, mathematically, the change divided by the initial value.
• The percentage change is the proportional change in a variable,
multiplied by 100.
• The percentage point change is the difference between two percentages. For example, a rise from 10% to 12% is a change of 2 percentage points, but a 20 percent (proportional) change.

8 / 44
Proportions and percentages as dependent variables

In a dataset on CEO’s where y is annual salary in thousands of dollars and


X is the average return on equity (roe) the following OLS regression line
can be obtained:
salary = β0 + β1·roe + u

• ROE is defined in terms of net income as a percentage of common


equity, thus if roe=10, the average return on equity is 10%.
• The slope parameter β1 measures the change in annual salary, in thousands of dollars, when the return on equity increases by one percentage point.
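For example, if the estimated slope were β̂1 = 18.5 (a hypothetical value for illustration), a one percentage point increase in roe would be predicted to raise annual salary by 18.5 thousand dollars, i.e. $18,500.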

9 / 44
Homoskedasticity

The dummy variable example can shed light on what is meant by


homoskedasticity:
• The definition of homoskedasticity requires the variance of the error term to be independent of X, i.e. it must not depend on female in our example.
• For women the error term (ui ) is the deviation of the i th woman’s
earning from the population mean earnings for women.
• For men the error term (ui ) is the deviation of the i th man’s earning
from the population mean earnings for men.
• Thus the variance of earnings must be the same for men as it is for
women.

10 / 44
The ideal analysis

• The aim of regression is often to identify causality.


• In an ideal randomized controlled experiment the only difference between the "treatment" and "control" groups is the variable you study.
• In observational data there may be a systematic difference between the "treatment" group and the "control" group in one or more variables.
• If those variables are not included in the regression we have omitted
variables.

11 / 44
Omitted variable bias -ZCM assumption

• In the last lecture you saw that E (u|X ) = 0 is important in order for
the OLS estimator to be unbiased.
• The omitted variable is thus important if the omission leads to a violation of the ZCM assumption.
• The bias that arises from such an omission is called omitted variable bias.

12 / 44
Omitted variable bias

Omitted variable bias


For omitted variable bias to occur, the omitted variable Z must satisfy two conditions:
• The omitted variable is correlated with the included regressor (i.e. corr(Z, X) ≠ 0)
• The omitted variable is a determinant of the dependent variable (i.e. Z is part of u)

13 / 44
OVB example

We estimate:

yi = β0 + β1·Xi + ui

while the true model is:

yi = β0 + β1·Xi + β2·Zi + vi

The exclusion of Z leads to a bias in β̂1 whenever Z is a determinant of Y and correlated with X.
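The bias is easy to see in a short simulation; a minimal sketch (all variable names and parameter values are invented for illustration):

clear
set obs 5000
set seed 1
gen Z = rnormal()
gen X = 0.8*Z + rnormal()          // Z is correlated with X
gen Y = 1 + 2*X + 3*Z + rnormal()  // true model: beta1 = 2, beta2 = 3
reg Y X                            // omits Z: slope on X is biased upward
reg Y X Z                          // includes Z: slope on X is close to 2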

14 / 44
Example: corr(Z, X) ≠ 0

The omitted variable (Z) is correlated with X. Example:

wages = β0 + β1·educ + ui,   where ui = γ1·pinc + vi

• Parents' income (pinc) is likely to be correlated with education: college is expensive, and the alternative funding is a loan or a scholarship, which is harder to acquire.

15 / 44
Example: Z is a determinant of Y

The omitted variable is a determinant of the dependent variable. Example:

wages = β0 + β1·educ + ui,   where ui = γ2·MS + vi

• The market situation (MS) is likely to determine wages: workers in firms that are doing well are likely to have higher wages.

16 / 44
Example: Omitted variable bias

The omitted variable is both a determinant of the dependent variable, i.e. corr(Z, Y) ≠ 0, and correlated with the included regressor:

wages = β0 + β1·educ + ui,   where ui = γ3·ability + vi

• Ability: the higher your ability, the "easier" education is for you and the more likely you are to have high education.
• Ability: the higher your ability, the better you are at your job and the higher the wages you get.

17 / 44
How to overcome omitted variable bias

1. Run an ideal randomized controlled experiment.
2. Do a cross tabulation.
3. Include the omitted variable in the regression.

18 / 44
Cross tabulation

One can address omitted variable bias by splitting the data into subgroups.
For example:

                         College graduates    High school graduates
High family income            ȲHFI,C                ȲHFI,H
Medium family income          ȲMFI,C                ȲMFI,H
Low family income             ȲLFI,C                ȲLFI,H

19 / 44
Cross tabulation

• Cross tabulation only provides a difference-of-means analysis; it does not provide a useful estimate of the ceteris paribus effect.
• To quantify the partial effect on Yi of a change in one variable (X1i), holding the other independent variables constant, we need to include the variables we want to hold constant in the model.
• When dealing with multiple independent variables we need the multiple linear regression model.

20 / 44
Multiple linear regression model

21 / 44
Multiple linear regression model

• Multiple linear regression models contain more than one independent variable.
• Multiple regressors are necessary if:
  • You are interested in the ceteris paribus effect of multiple parameters.
  • Y is a polynomial function of X (more in chapter 8).
  • You fear omitted variable bias.

Y X Other variables
Wages Education Experience, Ability
Crop Yield Fertilizer Soil quality, location (sun etc)
Test score STR Average family income

22 / 44
Multiple linear regression model
The general multiple linear regression model for the population can be written as:

Yi = β0 + β1·X1i + β2·X2i + ... + βk·Xki + ui

• The subscript i indicates the i-th of the n observations in the sample.
• The first subscript on X, 1, 2, ..., k, denotes the independent variable number.
• The intercept β0 is the expected value of Y when all the X's equal zero.
• The intercept can be thought of as the coefficient on a regressor, X0i, that equals one for all i.
• The coefficient β1 is the coefficient on X1i, β2 the coefficient on X2i, etc.

23 / 44
Multiple linear regression model

The average relationship between the k independent variables and the dependent variable is given by:

E(Yi | X1i = x1, X2i = x2, ..., Xki = xk) = β0 + β1·x1 + β2·x2 + ... + βk·xk

• β1 is thus the effect on Y of a unit change in X1, holding all other independent variables constant.
• The error term includes all factors other than the X's that influence Y.

24 / 44
Example

To make it more tractable, consider a model with two independent variables. The population model is then:

Yi = β0 + β1·X1i + β2·X2i + ui

Examples:

wagei = β0 + β1·educi + β2·experi + ui
wagei = β0 + β1·experi + β2·IQi + ui

25 / 44
Interpretation of the coefficient

In the two-variable case the predicted value is given by:

Ŷ = β̂0 + β̂1·X1 + β̂2·X2

Thus the predicted change in Y given changes in X1 and X2 is:

ΔŶ = β̂1·ΔX1 + β̂2·ΔX2

Thus if X2 is held fixed, then:

ΔŶ = β̂1·ΔX1

26 / 44
Interpretation of the coefficient
Using data on 526 observations on wage, education and experience, the following output was obtained:

. reg wage educ exper

      Source |       SS       df       MS          Number of obs =     526
-------------+------------------------------      F(  2,   523) =   75.99
       Model |   1612.2545     2  806.127251      Prob > F      =  0.0000
    Residual |  5548.15979   523  10.6083361      R-squared     =  0.2252
-------------+------------------------------      Adj R-squared =  0.2222
       Total |  7160.41429   525  13.6388844      Root MSE      =   3.257

        wage |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
        educ |   .6442721   .0538061    11.97   0.000     .5385695    .7499747
       exper |   .0700954   .0109776     6.39   0.000     .0485297    .0916611
       _cons |  -3.390539   .7665661    -4.42   0.000    -4.896466   -1.884613

Holding experience fixed, another year of education is predicted to increase your wage by 0.64 dollars.

27 / 44
Interpretation of the coefficient

If we want to change more than one independent variable, we simply add the two effects.

Example:

ŵage = −3.39 + 0.64·educ + 0.07·exper

If you increase education by one year and decrease experience by one year, the predicted increase in wage is 0.57 dollars (0.64 − 0.07).
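After the regression, Stata can compute such a combined effect together with its standard error; a minimal sketch:

. reg wage educ exper
. lincom educ - exper    // predicted wage change from +1 year educ, -1 year exper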

28 / 44
Example: Smoking and birthweight

Using the data set birthweight_smoking.dta you can estimate the following regression:

Predicted birthweight = 3432.06 − 253.2·Smoker

If we include the number of prenatal visits:

Predicted birthweight = 3050.5 − 218.8·Smoker + 34.1·nprevist
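A sketch of the corresponding Stata commands (the file and variable names follow the slide but are otherwise assumptions):

use birthweight_smoking.dta, clear
reg birthweight smoker            // simple regression on the smoking dummy
reg birthweight smoker nprevist   // adding the number of prenatal visits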

29 / 44
Example education
The relationship between years of education of male workers and the years of education of the parents:

. reg educ meduc feduc, robust

Linear regression                                  Number of obs =    1129
                                                   F(  2,  1126) =  159.83
                                                   Prob > F      =  0.0000
                                                   R-squared     =  0.2689
                                                   Root MSE      =  2.2595

             |             Robust
        educ |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
       meduc |   .1844065   .0223369     8.26   0.000     .1405798    .2282332
       feduc |   .2208784   .0259207     8.52   0.000     .1700201    .2717368
       _cons |   8.860898   .2352065    37.67   0.000     8.399405     9.32239

• Interpret the coefficient on mother's education.
• What is the predicted difference in education for a person where both parents have 12 years of education and a person where both parents have 16 years of education?
30 / 44
Example education
From Stata:

. display _cons + _b[meduc]*12 + _b[feduc]*12
5.8634189

. display _cons + _b[meduc]*16 + _b[feduc]*16
7.4845585

. display 7.484 - 5.863
1.621

* or
. display _b[meduc]*4 + _b[feduc]*4
1.6211396

(Note: in a Stata expression _cons evaluates to 1, not to the estimated intercept; the intercept is _b[_cons]. The predicted difference is unaffected, because the intercept cancels.)

Or by hand:

0.1844 × (16 − 12) + 0.2209 × (16 − 12) = 1.6212

31 / 44
Multiple linear regression model

Advantages of the MLRM over the SLRM:


• By adding more independent variables (control variables) we can
explicitly control for other factors a↵ecting y.
• More likely that the zero conditional mean assumption holds and thus
more likely to have an unbiased estimator.
• By controlling for more factors, we can explain more of the variation
in y, thus better predictions.
• Can incorporate more general functional forms.

32 / 44
Assumptions of the MLRM

1. (The model is linear in parameters.)
2. Random sampling.
3. Large outliers are unlikely.
4. Zero conditional mean, i.e. the error u has an expected value of zero given any values of the independent variables:

   E(u | X1, X2, ..., Xk) = 0

5. (There is sampling variation in X) and there are no exact linear relationships among the independent variables.

Under these assumptions the OLS estimators are unbiased estimators of the population parameters. In addition, there is the homoskedasticity assumption, which is necessary for OLS to be BLUE.

33 / 44
No exact linear relationships

Perfect collinearity
A situation in which one of the regressors is an exact linear function of the
other regressors.

• Ruling out perfect collinearity is required in order to compute the OLS estimators.


• The variables can be correlated, but not perfectly correlated.
• Typically perfect collinearity arises because of specification mistakes:
  • Mistakenly putting in the same variable measured in different units.
  • The dummy variable trap: including the intercept plus a binary variable for each group (see the sketch below).
  • The sample size is too small compared to the number of parameters (we need at least k+1 observations to estimate k+1 parameters).
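Both mistakes are easy to reproduce; a minimal sketch (assumes a dataset with ahe, female and educ in memory):

gen male = 1 - female
reg ahe female male        // dummy trap: Stata omits one collinear regressor
gen educ_months = educ*12
reg ahe educ educ_months   // same variable in different units: also omitted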

34 / 44
No perfect collinearity

Solving the three first-order conditions for the model with two independent variables gives:

β̂1 = ( σ̂²X2·σ̂Y,X1 − σ̂Y,X2·σ̂X1,X2 ) / ( σ̂²X1·σ̂²X2 − σ̂²X1,X2 )

where σ̂²Xj (j = 1, 2), σ̂Y,Xj and σ̂X1,X2 are empirical variances and covariances. Thus we require that:

σ̂²X1·σ̂²X2 − σ̂²X1,X2 = σ̂²X1·σ̂²X2·(1 − r²X1,X2) ≠ 0

Thus we must have σ̂²X1 > 0, σ̂²X2 > 0 and r²X1,X2 ≠ 1: the sample correlation coefficient between X1 and X2 cannot be one or minus one.

35 / 44
Imperfect collinearity

• Occurs when two or more of the regressors are highly correlated (but not perfectly correlated).
• High correlation makes it hard to estimate the effect of one variable holding the other constant.
• For the model with two independent variables and homoskedastic errors:

  var(β̂1) = (1/n) · ( 1 / (1 − ρ²X1,X2) ) · ( σ²u / σ²X1 )

• The two-variable case illustrates that the higher the correlation between X1 and X2, the higher the variance of β̂1 (illustrated in the sketch below).
• Thus, when multiple regressors are imperfectly collinear, the coefficients on one or more of these regressors will be imprecisely estimated.
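The variance inflation can be seen in a short simulation; a minimal sketch (names and values invented for illustration):

clear
set obs 1000
set seed 2
gen x1 = rnormal()
gen x2 = 0.95*x1 + 0.3*rnormal()   // x1, x2 highly but not perfectly correlated
gen y  = 1 + x1 + x2 + rnormal()
reg y x1 x2                        // note the large standard errors on x1 and x2
corr x1 x2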

36 / 44
Omitted variable bias

The direction of the bias is illustrated in the following formula:

β̂1 →p β1 + ρXu·(σu/σX)    (1)

where ρXu = corr(Xi, ui) and →p denotes convergence in probability. The formula indicates that:

• Omitted variable bias exists even when n is large.
• The larger the correlation between X and the error term, the larger the bias.
• The direction of the bias depends on whether X and u are negatively or positively correlated.

37 / 44
Example bias
Comparing estimates from simple and multiple regression. What is the return to education? Simple regression:

. reg wage educ, robust

Linear regression                                  Number of obs =     935
                                                   F(  1,   933) =   95.65
                                                   Prob > F      =  0.0000
                                                   R-squared     =  0.1070
                                                   Root MSE      =  382.32

             |             Robust
        wage |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
        educ |   60.21428   6.156956     9.78   0.000      48.1312    72.29737
       _cons |   146.9524   80.26953     1.83   0.067    -10.57731    304.4822

Can we give this regression a causal interpretation? What happens if we


include IQ in the regression?

38 / 44
Example bias - two independent variables

Consider the simple regression of Y on X1 (think of regressing wage on education):

Ỹ = β̃0 + β̃1·X1 + v

while the true population model is:

Yi = β0 + β1·X1 + β2·X2 + ui

The relationship between β̃1 and β1 is:

β̃1 = β1 + β2·δ̃1

where δ̃1 comes from the auxiliary regression of X2 on X1: X̂2 = δ̃0 + δ̃1·X1

39 / 44
Example bias - two independent variables

Thus the bias that arises from the omitted variable (in the model with two independent variables) is given by β2·δ̃1, and the direction of the bias can be summarized in the following table:

                    corr(X1, X2) > 0    corr(X1, X2) < 0
β2 > 0              Positive bias       Negative bias
β2 < 0              Negative bias       Positive bias

40 / 44
Comparing estimates from simple and multiple regression:

. reg wage educ IQ, robust

Linear regression                                  Number of obs =     935
                                                   F(  2,   932) =   64.47
                                                   Prob > F      =  0.0000
                                                   R-squared     =  0.1339
                                                   Root MSE      =  376.73

             |             Robust
        wage |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
        educ |   42.05762   6.810074     6.18   0.000     28.69276    55.42247
          IQ |   5.137958   .9266458     5.54   0.000     3.319404    6.956512
       _cons |  -128.8899   93.09396    -1.38   0.167    -311.5879    53.80818

The auxiliary regression of the omitted variable (IQ) on the included regressor (educ) gives δ̃1:

. reg IQ educ, robust

          IQ |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
        educ |   3.533829   .1839282    19.21   0.000     3.172868     3.89479
       _cons |   53.68715   2.545285    21.09   0.000     48.69201     58.6823

This confirms the decomposition β̃1 = β̂1 + β̂2·δ̃1:

β̃1 = 60.214 ≈ 42.057 + 5.137 × 3.533

41 / 44
Bias - multiple independent variables

• Deriving the sign of omitted variable bias when there are more than
two independent variables in the model is more difficult.
• Note that correlation between a single explanatory variable and the
error generally results in all OLS estimators being biased.
• Suppose the true population model is:

  Y = β0 + β1·X1 + β2·X2 + β3·X3 + u

• But we estimate:

  Ỹ = β̃0 + β̃1·X1 + β̃2·X2

• If corr(X1, X3) ≠ 0 while corr(X2, X3) = 0, then β̃2 will also be biased unless corr(X1, X2) = 0.

42 / 44
Bias - multiple independent variables

wage = β0 + β1·educ + β2·exper + β3·abil + u

• People with higher ability tend to have higher education.
• People with higher education tend to have less experience.
• Even if we assume that ability and experience are uncorrelated, β̃2 (the coefficient on experience) is biased.
• We cannot conclude the direction of the bias without further assumptions.

43 / 44
Causation

• Regression analysis can refute a causal relationship, since correlation is necessary for causation...
• ...but it cannot confirm or discover a causal relationship by statistical analysis alone.
• The true population parameter measures the ceteris paribus effect, which holds all other (relevant) factors equal.
• However, it is rarely possible to literally hold all else equal. Possible remedies:
  • "Natural experiments" or "quasi-experiments".
  • Use instruments for unobserved factors.

44 / 44
