Econ 251 PS3 Solutions v2
Problem Set #3
SOLUTIONS
The problem set uses the same STATA file WAGE2.dta as the second problem set. As
a reminder, WAGE2.dta contains the following variables:
wage monthly earnings (in 1976 USD)
hours average weekly hours of work
IQ IQ (intelligence quotient) score
educ years of education
exper years of work experience
age age in years
married =1 if the person is married
black =1 if the person is black
meduc mother’s education
(meaning the education level of the person’s mother)
feduc father’s education
(meaning the education level of the person’s father)
1. (i) Economists usually use log-earnings, rather than earnings, since log-earnings
allow modelling percentage changes in earnings, rather than absolute changes (we’ll
see this very soon in class).
Generate a new variable equal to the natural logarithm of variable wage. Call this new
variable lwage. Note: You do not need to submit anything for part 1.(i).
(ii) Find the sample mean log-wage (lwage) for blacks and non-blacks separately.
Hint: in order to generate variable log-wage type the following command in STATA:
gen lwage=log(wage)
(1 point)
Solution:
The sample mean log-wage (lwage) for blacks is 6.52, while non-blacks’ mean log
earnings is 6.82.
(See the STATA output below.)
. gen lwage=log(wage)
Summary of lwage
=1 if black Mean Std. Dev. Freq.
FYI: after generating variable lwage you may wish to have a look at the new variable
and at variable wage in order to double check that lwage was generated correctly. If
you type browse wage lwage you can view the two variables and check that for each
observation the value of lwage=log(wage). E.g. for observation 1, wage=769 and
lwage=6.645091, which is the value of log(769) displayed to the 6th decimal.
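The same check can be done outside STATA. A minimal Python sketch, assuming only the two values quoted above for observation 1 (wage=769, lwage=6.645091); Python's math.log, like STATA's log(), is the natural logarithm. The variable names are chosen here for illustration.

```python
import math

wage = 769               # wage for observation 1, as quoted above
lwage_stata = 6.645091   # value of lwage displayed by STATA

lwage = math.log(wage)   # natural logarithm of wage
print(round(lwage, 6))
```

If the printed value matches the one shown by browse, lwage was generated correctly for that observation.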
2. Test the hypothesis that the population mean log-wage is equal for black and non-
black men, against the two-sided alternative. Use a significance level α = 0.05.
Hint: Use the STATA command ttest var1, by (var2)
In this particular example, var1 is lwage and var2 is black; hence, what you would need to
type in the command window in STATA is: ttest lwage, by (black)
(i) What is the null hypothesis being tested in terms of the notation used in class? What
is the alternative hypothesis in terms of the notation used in class? Be very precise and
explain what the notation stands for.
(ii) What do you conclude (i.e. do you reject the null at the 5% significance level), and
why?
(iii) Show how the t-statistic of 7.2876 was calculated.
(1 point each, 3 points in total)
Solution:
. ttest lwage, by (black)
Group Obs Mean Std. Err. Std. Dev. [95% Conf. Interval]
STEP 2: look at the p-value for the relevant alternative hypothesis and compare it to the
significance level
The p-value for this alternative hypothesis is given by Pr(|T| > |t|) = 0.0000. A p-value
reported as 0.0000 means that, if the null hypothesis were true, the probability of
observing a test statistic at least as extreme as the one we actually observed (i.e.
t-stat = 7.29) would be smaller than 0.00005, i.e. essentially zero.
We reject the null hypothesis (in favour of the alternative) because the p-value = 0.0000 <
significance level α = 0.05. We conclude that the population mean log-wage differs
significantly for black and non-black men.
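To see why the reported p-value rounds to 0.0000, the decision rule can be sketched in Python. This is an illustration under a stated assumption: it uses the standard normal distribution as a large-sample approximation to the t distribution that STATA uses, so the p-value it produces is approximate rather than STATA's exact figure. The names t_stat, alpha and p_value are chosen here for illustration.

```python
import math

t_stat = 7.2876   # t-statistic reported by STATA for lwage by black
alpha = 0.05      # significance level

# Two-sided p-value under the standard normal approximation (an assumption):
# Pr(|Z| > |t|) = erfc(|t| / sqrt(2))
p_value = math.erfc(abs(t_stat) / math.sqrt(2))

reject_null = p_value < alpha   # decision rule: reject H0 if p-value < alpha
print(p_value, reject_null)
```

The p-value is far below 0.00005, which is why STATA displays it as 0.0000 at four decimals.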
IMPORTANT NOTE
In part 1 of the question we talk about the sample mean log-wage meaning that in this
particular sample of 935 men the average log-wage is different for blacks and non-
blacks. WE NEVER TEST A HYPOTHESIS ABOUT THE SAMPLE MEANS: we
know they are different numbers (in our examples they are 6.82 and 6.52).
In part 2 we talk about the population mean meaning we are referring to the
population from which this sample was drawn, and we are asking the question: based
on our random sample and the sample statistics, do we have enough information to
conclude that the population means of the two groups (blacks and non-blacks) differ?
$$\text{t-stat} = \frac{\widehat{diff} - diff_0}{SE(diff)} = \frac{0.2921 - 0}{SE(diff)} = 7.2876,$$
where
$\widehat{diff}$ is the estimated difference in means (i.e. this is just the difference between the
two sample means = 6.8165 − 6.5244 = 0.2921).
$diff_0$ is the difference between the two population means under the null hypothesis
(recall H0: µo = µ1 ⟺ H0: diff ≡ µo − µ1 = 0, i.e. $diff_0 = 0$).
STATA reports this as “H0: diff=0”.
$SE(diff)$ is the standard error of the difference in means.
3. Test the hypothesis that the population mean years of education (variable educ) is equal
for black and non-black men, against the two-sided alternative. Use a significance level
α = 0.05.
(i) What is the null hypothesis being tested in terms of the notation used in class?
What is the alternative hypothesis in terms of the notation used in class?
(ii) What do you conclude (i.e. do you reject the null at the 5% significance level), and
why?
(iii) Show how the t-statistic of 5.5720 was calculated.
(1 point each, 3 points in total)
Solution:
H0: µo = µ1
H1: µo ≠ µ1
Here µ1 is the population mean years of education for blacks and µo is the population
mean years of education for non-blacks. Again, please note we test a hypothesis about
the population means; we know the sample means are not equal – they are different
numbers – we saw this in problem set #2.
(ii) What do you conclude (i.e. do you reject the null at the 5% significance level), and
why?
Group Obs Mean Std. Err. Std. Dev. [95% Conf. Interval]
We reject the null hypothesis (in favour of the alternative) because the p-value = 0.0000 <
significance level α = 0.05, and conclude that the population mean years of education
differs significantly for black and non-black men.
Solution:
$$\text{t-stat} = \frac{\widehat{diff} - diff_0}{SE(diff)} = \frac{1.17797 - 0}{0.21141} = 5.572,$$
where
$\widehat{diff}$ is the estimated difference in means (i.e. this is just the difference between the
two sample means = 13.61963 − 12.44167 = 1.17796).
$diff_0 = 0$ is the difference between the two population means under the null
hypothesis.
$SE(diff)$ is the standard error of the difference in means.
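The arithmetic behind this t-statistic can be reproduced in a few lines. A minimal Python sketch, assuming only the figures quoted in this solution (difference 1.17797, standard error 0.21141); the variable names are illustrative.

```python
diff_hat = 1.17797   # estimated difference in mean educ (non-black minus black)
diff_null = 0.0      # difference in population means under H0
se_diff = 0.21141    # standard error of the difference, from the STATA output

# t-statistic: (estimate - hypothesised value) / standard error
t_stat = (diff_hat - diff_null) / se_diff
print(round(t_stat, 3))
```

The same three inputs (estimate, hypothesised value, standard error) are all that is needed for the t-statistics in questions 2 and 4 as well.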
4. Test the hypothesis that the population mean years of mother’s education (variable
meduc) is larger for non-black men than for black men. Use a significance level α =
0.05.
(i) What is the null hypothesis being tested in terms of the notation used in class?
What is the alternative hypothesis in terms of the notation used in class?
(ii) What do you conclude (i.e. do you reject the null at the 5% significance level), and
why?
(iii) Show how the t-statistic of 6.6322 was calculated.
(1 point each, 3 points in total)
Solution:
H0: µo = µ1
H1: µo > µ1
Here µ1 is the population mean years of mother’s education for blacks and µo is the
population mean years of mother’s education for non-blacks.
(ii) What do you conclude (i.e. do you reject the null at the 5% significance level), and
why?
Group Obs Mean Std. Err. Std. Dev. [95% Conf. Interval]
We reject the null hypothesis (in favour of the alternative) because the p-
value=0.0000<significance level α = 0.05 and conclude that the population mean years of
mother’s education is significantly larger for non-black than for black men.
$$\text{t-stat} = \frac{\widehat{diff} - diff_0}{SE(diff)} = \frac{1.970896 - 0}{SE(diff)} = 6.6322,$$
where
$\widehat{diff}$ is the estimated difference in means (i.e. this is just the difference between the
two sample means = 10.91029 − 8.939394 = 1.970896).
$diff_0 = 0$ is the difference between the two population means under the null
hypothesis.
$SE(diff)$ is the standard error of the difference in means.
5. Finally, test the hypothesis that the population mean (average) weekly hours of work
(variable hours) is equal to 45 hours, against the two-sided alternative. Use a
significance level α = 0.05.
(i) What is the null hypothesis being tested in terms of the notation used in class?
What is the alternative hypothesis in terms of the notation used in class?
(ii) What do you conclude (i.e. do you reject the null at the 5% significance level), and
why?
(iii) Show how the t-statistic of -4.5314 was calculated.
(1 point each, 3 points in total)
Solution:
H0: µ = 45
H1: µ ≠ 45
(ii) What do you conclude (i.e. do you reject the null at the 5% significance level), and
why?
. ttest hours=45
One-sample t test
Variable Obs Mean Std. Err. Std. Dev. [95% Conf. Interval]
We reject the null hypothesis (in favour of the alternative) because the p-value = 0.0000 <
significance level α = 0.05, and conclude that the population mean weekly
hours of work is significantly different from 45 hours a week.
Solution:
$$\text{t-stat} = \frac{\bar{X} - \mu_0}{s/\sqrt{N}} = \frac{43.92941 - 45}{7.224256/\sqrt{935}} = -4.5314,$$
where
$\bar{X} = 43.92941$ is the sample mean hours of work for everyone in the sample.
$\mu_0 = 45$ is the population mean hours of work under the null hypothesis.
$SE(\bar{X}) = s/\sqrt{N}$ is the standard error of the sample mean of variable hours, which is
calculated as the sample standard deviation of variable hours, s = 7.224256,
divided by the square root of the sample size, N = 935.
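The one-sample calculation can be checked the same way. A minimal Python sketch, assuming only the sample mean, sample standard deviation and sample size quoted in this solution; the variable names are illustrative.

```python
import math

x_bar = 43.92941   # sample mean of hours
mu_0 = 45          # population mean of hours under H0
s = 7.224256       # sample standard deviation of hours
n = 935            # sample size

se = s / math.sqrt(n)          # standard error of the sample mean
t_stat = (x_bar - mu_0) / se   # one-sample t-statistic
print(round(se, 5), round(t_stat, 4))
```

The negative sign simply reflects that the sample mean lies below the hypothesised value of 45 hours.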
This part of the problem set introduces you to running regressions in STATA.
This problem uses STATA file BIG9SALARY.dta. This data was used by O. Baser and
E. Pema (2003) in a paper titled “The Return of Publications for Economics Faculty”.
BIG9SALARY.dta contains data on 223 faculty members of Economics departments in 9
universities in the US (Ohio State, Iowa, Indiana, Purdue, Michigan State, Minnesota,
Michigan, Wisconsin and Illinois) collected in year 1995. The dataset contains the
following variables:
id person identifier
salary total gross annual salary of the faculty member (in 1999 USD)
totpge total number of standardized article pages published
pubindex publication index (the product of number of articles published by
the rank of the journal)
top20phd =1 if Ph.D. was obtained from a top 20 Economics department; 0
otherwise
yearphd year when the Ph.D. was obtained
age age in years
female =1 if female; 0 otherwise
mich =1 for University of Michigan professors; 0 otherwise
(as well as dummies for each of the other universities).
Consider the simple linear regression model relating a faculty member’s annual salary
(salary) to total number of standardized article pages published (totpge):
salary = 𝜷0 + 𝜷1 totpge + u
The error term u represents all factors that affect the outcome variable Y (in our
example, a faculty member’s annual salary), other than the explanatory variable X (in
our case, total number of article pages published).
3. Now use the data in BIG9SALARY.dta to estimate this simple regression model and
answer the following questions:
Hint: in order to estimate the model in STATA, type: reg salary totpge
Solution:
. reg salary totpge
The OLS estimate of the intercept is $66,626.78 (note that STATA calls the intercept
_cons). The intercept is always interpreted as the predicted (average) value of Y when
X=0. In this example this means that a faculty member with zero standardized article
pages published (i.e. with no published articles) has an annual salary
of $66,626.78, on average.
The slope is always interpreted as the estimated effect of a one-unit increase in X on the
average value of Y. In our example this means that every additional article page
increases a faculty member’s salary by $89.66, on average.
(iii) What is the effect of a 100-page increase in total number of standardized article
pages published on annual salary?
(1 point)
Solution:
From part (ii) we found that 1 more article page increases salary by $89.66 (more
precisely, by $89.65586); therefore, 100 more pages will increase salary by
100*($89.65586)=$8965.59. You can calculate this with STATA by typing:
di 100 * 89.65586
8965.586
Notice there is a linear relationship between X (totpge) and Y (salary) – every unit
increase of X (totpge) increases Y (salary) by the same amount – by $89.66.
FYI: You may find it easier to use the formula from class linking the change in Y for
any given change in X:
$$\Delta\hat{y} = \hat{\beta}_1 \Delta X = 89.65586 \times 100 = \$8965.59$$
Solution:
The predicted annual salary of this faculty member, $\hat{Y}_i$, is given by the value on the
regression line corresponding to totpge = 1060.5:
$$\hat{Y}_i = \hat{\beta}_0 + \hat{\beta}_1 X_i = 66626.78 + 89.65586 \times 1060.5 = \$161{,}706.82$$
You can calculate this with STATA by typing
. di 66626.78 + 89.65586*1060.5
161706.82
(v) What is the actual salary of this faculty member? Find the residual for this faculty
member. Does our regression suggest this faculty member is underpaid or overpaid?
Hint: The residual 𝑢̂i is given by the difference between the actual and the predicted salary
for this observation: 𝑢̂i = 𝑌𝑖 − 𝑌̂𝑖 . You calculated the predicted salary in part (iv).
(2 points)
Solution:
The actual salary of this faculty member is $92,083 (this is given by the value of
variable salary for this faculty member).
The residual $\hat{u}_i$ for this observation is given by the difference between the actual and
the predicted salary for this observation:
$$\hat{u}_i = Y_i - \hat{Y}_i = \$92{,}083 - \$161{,}706.82 = -\$69{,}623.82$$
With STATA:
. di 92083 - 161706.82
-69623.82
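Our regression suggests that this faculty member is underpaid: the residual is negative,
i.e. the actual salary is below the salary the regression line predicts for someone with
the same number of published pages.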
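Parts (iv) and (v) can be reproduced together. A minimal Python sketch, assuming only the OLS estimates, the value of totpge and the actual salary quoted in the solutions above; the variable names are illustrative.

```python
beta0_hat = 66626.78    # OLS intercept estimate (_cons)
beta1_hat = 89.65586    # OLS slope estimate on totpge
totpge = 1060.5         # standardized article pages for this faculty member
actual_salary = 92083   # observed salary for this faculty member

# Predicted value on the regression line, then residual = actual - predicted
predicted_salary = beta0_hat + beta1_hat * totpge
residual = actual_salary - predicted_salary

# A negative residual means the actual salary is below the predicted one
underpaid = residual < 0
print(round(predicted_salary, 2), round(residual, 2), underpaid)
```

A positive residual would instead indicate a salary above what the regression line predicts.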
(vi) Now estimate the model separately for the male and female faculty members.
Are the estimates of the intercepts the same? What about the slope estimates? How do you
interpret this?
Do the regression results provide convincing evidence that either group of faculty members
is discriminated against? Explain.
(3 points)
Hint: This question asks you to think about the difference between estimates and population
parameters, and about correlation versus a causal relationship.
To estimate the model separately for women and men type in the command window:
reg salary totpge if female==1
reg salary totpge if female==0
Solution:
The estimation results are presented below. The estimates of the intercepts are
different, suggesting that with no articles published (totpge=0) female faculty
members earn about $10,580 less, on average. However, the slope estimates
suggest that female faculty members are paid about $37 more for every additional
page published.
The regression results suggest that there may be differential payment for each group,
but they do not provide convincing evidence that either male or female faculty
members are discriminated against. First, we only observe a difference in the
estimates of the slopes and intercepts; we still don’t know whether the population
parameters differ. Secondly, our results may not have a causal
interpretation, since we are not sure male and female faculty members are the same,
on average, in all dimensions affecting earnings (e.g. it might be the case that female
faculty members are mainly employed in language schools, where the salary levels are
generally lower).
Ultimately, we need to develop tools for analysing this question at a deeper level.
. reg salary totpge if female==1
-------------+---------------------------------- F(1, 19) = 4.07
Model | 1.7705e+09 1 1.7705e+09 Prob > F = 0.0579
Residual | 8.2608e+09 19 434781062 R-squared = 0.1765
-------------+---------------------------------- Adj R-squared = 0.1332
Total | 1.0031e+10 20 501566396 Root MSE = 20851
------------------------------------------------------------------------------
salary | Coef. Std. Err. t P>|t| [95% Conf. Interval]
-------------+----------------------------------------------------------------
totpge | 122.0529 60.48352 2.02 0.058 -4.540533 248.6464
_cons | 57594.09 6323.063 9.11 0.000 44359.77 70828.42
------------------------------------------------------------------------------
. di 68174.95- 57594.09
10580.86
. di 85.01794- 122.0529
-37.03496
are given by
$$\hat{\beta}_0 = \bar{Y} - \hat{\beta}_1 \bar{X}$$
$$\hat{\beta}_1 = \frac{\sum_{i=1}^{N}(X_i - \bar{X})(Y_i - \bar{Y})}{\sum_{i=1}^{N}(X_i - \bar{X})^2}$$
(7 points)
Solution:
Simplifying by multiplying both sides of each equation by $-\frac{1}{2}$ yields:
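The two equations this step refers to are not reproduced in the text here. As a sketch, assuming the usual least-squares objective (choose the estimates that minimise the sum of squared residuals), the first-order conditions and their simplified forms would read as follows, with the labels (1) and (2) matching the way the derivation below refers to them:

```latex
\begin{align*}
\frac{\partial}{\partial \hat{\beta}_0} \sum_{i=1}^{N} (Y_i - \hat{\beta}_0 - \hat{\beta}_1 X_i)^2
  &= -2 \sum_{i=1}^{N} (Y_i - \hat{\beta}_0 - \hat{\beta}_1 X_i) = 0 \\
\frac{\partial}{\partial \hat{\beta}_1} \sum_{i=1}^{N} (Y_i - \hat{\beta}_0 - \hat{\beta}_1 X_i)^2
  &= -2 \sum_{i=1}^{N} (Y_i - \hat{\beta}_0 - \hat{\beta}_1 X_i) X_i = 0 \\[6pt]
\sum_{i=1}^{N} (Y_i - \hat{\beta}_0 - \hat{\beta}_1 X_i) &= 0 \tag{1} \\
\sum_{i=1}^{N} (Y_i - \hat{\beta}_0 - \hat{\beta}_1 X_i) X_i &= 0 \tag{2}
\end{align*}
```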
Let’s solve these and find the formulae of 𝛽̂0 and 𝛽̂1.
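From (1):
$$\Leftrightarrow \sum Y_i - N\hat{\beta}_0 - \hat{\beta}_1 \sum X_i = 0$$
$$\Leftrightarrow \hat{\beta}_0 = \frac{-\sum Y_i}{-N} + \hat{\beta}_1 \frac{\sum X_i}{-N}$$
$$\Leftrightarrow \hat{\beta}_0 = \bar{Y} - \hat{\beta}_1 \bar{X}$$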
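In order to obtain the formula for $\hat{\beta}_1$, we’re going to substitute the expression for
$\hat{\beta}_0$ into (2). But before this, let’s simplify (2):
$$\sum (Y_i - \hat{\beta}_0 - \hat{\beta}_1 X_i) X_i = 0$$
$$\Leftrightarrow \sum Y_i X_i - \hat{\beta}_0 \sum X_i - \hat{\beta}_1 \sum X_i^2 = 0$$
Substituting $\hat{\beta}_0$ from (1) and dividing and multiplying the second term by $N$ yields:
$$\Leftrightarrow \sum Y_i X_i - N(\bar{Y} - \hat{\beta}_1 \bar{X}) \frac{\sum X_i}{N} - \hat{\beta}_1 \sum X_i^2 = 0$$
Since $\frac{\sum X_i}{N} = \bar{X}$, collecting the terms in $\hat{\beta}_1$ gives:
$$\Leftrightarrow \sum X_i Y_i - N\bar{X}\bar{Y} - \hat{\beta}_1 \left( \sum X_i^2 - N\bar{X}^2 \right) = 0$$
$$\Leftrightarrow \hat{\beta}_1 = \frac{\sum X_i Y_i - N\bar{X}\bar{Y}}{\sum X_i^2 - N\bar{X}^2}$$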
And we’ve already shown in lecture 1 (and in homework 1) that this can be rewritten as:
$$\hat{\beta}_1 = \frac{\sum_{i=1}^{N}(X_i - \bar{X})(Y_i - \bar{Y})}{\sum_{i=1}^{N}(X_i - \bar{X})^2}$$
2. Throughout this class, we are going to be using STATA to calculate regression lines,
but it is a good idea to compute a regression line by hand once.
a) Calculate the OLS estimates of the intercept and slope from a regression of Y
on X, using the following data. Show all your calculations!
Obs. i Xi Yi
1 3 3
2 9 6
3 6 3
b) Check that the point (𝑋̅, 𝑌̅) lies on the regression line, where 𝑋̅ is the sample
average of variable X, and 𝑌̅ is the sample average of variable Y. (1 point)
c) Show that the property you illustrated in part b) also holds in general, i.e. that
the point (𝑋̅, 𝑌̅) always lies on the regression line. Your solution cannot use
the formula for the intercept estimate. (7 points)
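Hint: write the residual for a single observation:
$$\hat{u}_i = Y_i - \hat{Y}_i$$
From here:
$$Y_i = \hat{Y}_i + \hat{u}_i$$
Replace $\hat{Y}_i = \hat{\beta}_0 + \hat{\beta}_1 X_i$:
$$Y_i = \hat{\beta}_0 + \hat{\beta}_1 X_i + \hat{u}_i$$
Then sum over all the data points – from 1 to N:
$$\sum_{i=1}^{N} Y_i = \sum_{i=1}^{N} \hat{\beta}_0 + \hat{\beta}_1 \sum_{i=1}^{N} X_i + \sum_{i=1}^{N} \hat{u}_i$$
$$\Leftrightarrow \sum_{i=1}^{N} Y_i = N\hat{\beta}_0 + \hat{\beta}_1 \sum_{i=1}^{N} X_i$$
(since $\sum_{i=1}^{N} \hat{u}_i = 0$ from the normal equation).
From here, dividing both sides by N: $\bar{Y} = \hat{\beta}_0 + \hat{\beta}_1 \bar{X}$, which means that the point
$(\bar{X}, \bar{Y})$ lies on the regression line.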
OPTIONAL: double-check your calculation for part a) with STATA by following the
steps below:
2) Type in the values of X as var1, Type in the values of Y as var2 in the data editor (or
just copy/paste the numbers from the Word table):
3) Close the Data Editor. You can now see var1 and var2 in the variable list.
5) Regress Y on X by typing:
reg Y X
(Beware that STATA is case sensitive, if you call the variables Y and X (caps) you
should use caps in the regression command as well).
Do you get the same result for the slope and intercept as the one you calculated?
Solution:
a)
Obs. i   Xi   Yi   (Xi − X̄)   (Yi − Ȳ)   (Xi − X̄)²   (Xi − X̄)(Yi − Ȳ)
1        3    3    -3         -1         9           3
2        9    6    3          2          9           6
3        6    3    0          -1         0           0
∑                  0          0          18          9
$\bar{X} = 6$, $\bar{Y} = 4$
$$\sum_{i=1}^{N}(X_i - \bar{X})(Y_i - \bar{Y}) = 9$$
$$\sum_{i=1}^{N}(X_i - \bar{X})^2 = 18$$
Our estimate of the slope parameter is $\hat{\beta}_1 = 9/18 = 0.5$ (we estimate that a 1-unit
increase in X increases the average of Y by 0.5).
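Alternatively, you can use the properties of summations we proved earlier in the
course, which will save you some calculations:
$$\sum_{i=1}^{N}(X_i - \bar{X})(Y_i - \bar{Y}) = \sum_{i=1}^{N} X_i Y_i - N\bar{X}\bar{Y}$$
$$\sum_{i=1}^{N}(X_i - \bar{X})^2 = \sum_{i=1}^{N} X_i^2 - N\bar{X}^2$$
Obs. i   Xi   Yi   XiYi   Xi²
1        3    3    9      9
2        9    6    54     81
3        6    3    18     36
∑                  81     126
$$\sum_{i=1}^{N} X_i Y_i - N\bar{X}\bar{Y} = 81 - 3 \cdot 6 \cdot 4 = 9$$
$$\sum_{i=1}^{N} X_i^2 - N\bar{X}^2 = 126 - 3 \cdot 6^2 = 18$$
$$\hat{\beta}_1 = 9/18 = 0.5, \qquad \hat{\beta}_0 = \bar{Y} - \hat{\beta}_1 \bar{X} = 4 - 6 \cdot 0.5 = 1$$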
. reg Y X
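b)
$\bar{X} = 6$, $\bar{Y} = 4$ – plugging $\bar{X} = 6$ into the equation of the regression line we obtain:
$$\hat{Y} = \hat{\beta}_0 + \hat{\beta}_1 \bar{X} = 1 + 0.5 \cdot 6 = 4 = \bar{Y},$$
so the point $(\bar{X}, \bar{Y})$ lies on the regression line.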
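For readers who want to check parts a) and b) without STATA, the hand calculation can be sketched in plain Python. This is an illustration using only the three data points given in the problem; the variable names (sxy, sxx, residuals) are chosen here and are not part of the problem set.

```python
x = [3, 9, 6]   # values of X from the problem
y = [3, 6, 3]   # values of Y from the problem
n = len(x)

x_bar = sum(x) / n
y_bar = sum(y) / n

# Sums of cross-products and squares of deviations from the sample means
sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
sxx = sum((xi - x_bar) ** 2 for xi in x)

beta1_hat = sxy / sxx                  # OLS slope estimate
beta0_hat = y_bar - beta1_hat * x_bar  # OLS intercept estimate

# Part b): the fitted value at x_bar should equal y_bar
fitted_at_mean = beta0_hat + beta1_hat * x_bar

# Residuals sum to zero, which is the property used in the hint for part c)
residuals = [yi - (beta0_hat + beta1_hat * xi) for xi, yi in zip(x, y)]
print(beta0_hat, beta1_hat, fitted_at_mean, sum(residuals))
```

The residual sum being zero is what drives the general result in part c), so it is worth checking alongside the estimates.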
c) The hint contains the complete solution; I just wanted you to go through it.
See graph below for illustration: