
Econ 251 PS3 Solutions v2

The document provides solutions to questions from Problem Set #3 in Econ 251. It tests hypotheses about population parameters using data from the STATA file WAGE2.dta. Question 1 asks students to find sample mean log-wages for blacks and non-blacks, which are 6.52 and 6.82 respectively. Question 2 tests whether population mean log-wages differ between blacks and non-blacks using a t-test. The null hypothesis of no difference is rejected with a t-statistic of 7.2876. Question 3 tests for differences in population mean years of education and also rejects the null hypothesis, with a t-statistic of 5.5720. Calculations of the t-statistics

Uploaded by

Peter Shang

Econ 251

Problem Set #3

SOLUTIONS

Part I: Testing hypotheses in STATA (13 points in total)


This part of the problem set introduces you to using STATA for testing hypotheses about
the population parameters.
Instructions:
Following each question, please handwrite or type your answers and copy/paste the
STATA output (please, use the ‘copy as picture’ option).

The problem set uses the same STATA file WAGE2.dta as the second problem set. As
a reminder, WAGE2.dta contains the following variables:
wage monthly earnings (in 1976 USD)
hours average weekly hours of work
IQ IQ (intelligence quotient) score
educ years of education
exper years of work experience
age age in years
married =1 if the person is married
black =1 if the person is black
meduc mother’s education
(meaning the education level of the person’s mother)
feduc father’s education
(meaning the education level of the person’s father)

1. (i) Economists usually use log-earnings, rather than earnings, since log-earnings
allow modelling percentage changes in earnings, rather than absolute changes (we’ll
see this very soon in class).
Generate a new variable equal to the natural logarithm of variable wage. Call this new
variable lwage. Note: You do not need to submit anything for part 1.(i).

(ii) Find the sample mean log-wage (lwage) for blacks and non-blacks separately.
Hint: in order to generate variable log-wage type the following command in STATA:
gen lwage=log(wage)
(1 point)
Solution:

The sample mean log-wage (lwage) for blacks is 6.52, while non-blacks’ mean log
earnings is 6.82.
(See the STATA output below.)

. gen lwage=log(wage)

. tab black, sum(lwage)

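            |          Summary of lwage
=1 if black |        Mean   Std. Dev.       Freq.
------------+------------------------------------
          0 |   6.8164865   .41214464         815
          1 |   6.5244342   .39392362         120
------------+------------------------------------
      Total |   6.7790038    .4211439         935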

FYI: after generating variable lwage you may wish to have a look at the new variable
and at variable wage in order to double check that lwage was generated correctly. If
you type browse wage lwage you can view the two variables and check that for each
observation the value of lwage=log(wage). E.g. for observation 1, wage=769 and
lwage=6.645091, which is the value of log(769) displayed to the 6th decimal.

2. Test the hypothesis that the population mean log-wage is equal for black and non-
black men, against the two-sided alternative. Use a significance level α = 0.05.
Hint: Use the STATA command ttest var1, by (var2)
In this particular example, var1 is lwage and var2 is black; hence, what you would need to
type in the command window in STATA is: ttest lwage, by (black)

(i) What is the null hypothesis being tested in terms of the notation used in class? What
is the alternative hypothesis in terms of the notation used in class? Be very precise and
explain what the notation stands for.
(ii) What do you conclude (i.e. do you reject the null at the 5% significance level), and
why?
(iii) Show how the t-statistic of 7.2876 was calculated.
(1 point each, 3 points in total)

Solution:

. ttest lwage, by (black)

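Two-sample t test with equal variances
------------------------------------------------------------------------------
   Group |     Obs        Mean    Std. Err.   Std. Dev.   [95% Conf. Interval]
---------+--------------------------------------------------------------------
       0 |     815    6.816486    .0144368    .4121446    6.788149    6.844824
       1 |     120    6.524434    .0359601    .3939236     6.45323    6.595639
---------+--------------------------------------------------------------------
combined |     935    6.779004    .0137729    .4211439    6.751974    6.806033
---------+--------------------------------------------------------------------
    diff |            .2920523    .0400754                .2134039    .3707006
------------------------------------------------------------------------------
    diff = mean(0) - mean(1)                                      t =   7.2876
Ho: diff = 0                                     degrees of freedom =      933

    Ha: diff < 0                 Ha: diff != 0                 Ha: diff > 0
 Pr(T < t) = 1.0000         Pr(|T| > |t|) = 0.0000          Pr(T > t) = 0.0000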

Here is a step-by-step solution to parts (i) and (ii) of this question.

STEP 1: define the null and the alternative hypotheses

Stated statistically, our null hypothesis is:


H0: µo = µ1 where µ1 is the population mean log-wage for blacks and µo is the
population mean log-wage for non-blacks. THE NULL IS THE HYPOTHESIS WE
WANT TO DISPROVE.
Note that this is equivalent to H0: µo - µ1 = 0. This is how the null hypothesis is written
in STATA: “H0: diff = 0” (i.e. the difference in population mean log-wage for non-
blacks and blacks equals zero).

The null hypothesis is tested against the alternative:


H1: µo ≠ µ1
In STATA this alternative is written as: “Ha: diff != 0” (i.e. the difference in
population means does not equal zero).

STEP 2: look at the p-value for the relevant alternative hypothesis and compare it to the
significance level

The p-value for this alternative hypothesis is given by Pr(|T| > |t|) = 0.0000. STATA rounds
to four decimals, so this means the p-value is smaller than 0.00005: if the null hypothesis
were true, there would be practically no chance of observing a test statistic at least as
extreme as the one we actually observed (i.e. t-stat=7.29).

STEP 3: What do you conclude?


Reject the null if p-value ≤ significance level α
Fail to reject null if p-value > significance level α
All you had to say to get full points in this question is what the null and alternative
are for part (i), and that we reject the null hypothesis (in favour of the alternative) because
the p-value = 0.0000 < significance level α = 0.05. We conclude that the population mean
log-wage differs significantly for black and non-black men.

IMPORTANT NOTE

In question 1 we talk about the sample mean log-wage, meaning that in this
particular sample of 935 men the average log-wage is different for blacks and non-blacks.
WE NEVER TEST A HYPOTHESIS ABOUT THE SAMPLE MEANS: we
know they are different numbers (in our example they are 6.82 and 6.52).

In question 2 we talk about the population mean, meaning we are referring to the
population from which this sample was drawn, and we are asking the question: based
on our random sample and the sample statistics, do we have enough information to
conclude that the population means of the two groups (blacks and non-blacks) differ?

(iii) Show how the t-statistic of 7.2876 was calculated.


(1 point)
Solution:

\[
t\text{-stat} = \frac{\widehat{\mathit{diff}} - \mathit{diff}_0}{SE(\widehat{\mathit{diff}})} = \frac{0.29205 - 0}{0.04008} = 7.287,
\]

where
\(\widehat{\mathit{diff}}\) is the estimated difference in means (i.e. this is just the difference between the
two sample means = 6.8165-6.5244=0.2921).
\(\mathit{diff}_0\) is the difference between the two population means under the null hypothesis
(recall H0: µo = µ1 ⟺ H0: diff≡µo - µ1 = 0, i.e. \(\mathit{diff}_0 = 0\)).
STATA reports this as “Ho: diff = 0”.
\(SE(\widehat{\mathit{diff}})\) is the standard error of the difference in means.
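
The standard error above is read directly off the STATA output. As a cross-check outside STATA, the following Python sketch recomputes it from the group summary statistics, assuming the usual pooled-variance formula for the equal-variance two-sample t test. The function name pooled_t_stat is illustrative and not part of the problem set.

```python
import math

def pooled_t_stat(n0, mean0, sd0, n1, mean1, sd1):
    """Return (diff, se, t) for an equal-variance two-sample t test of H0: diff = 0."""
    # Pooled variance: group variances weighted by their degrees of freedom.
    pooled_var = ((n0 - 1) * sd0**2 + (n1 - 1) * sd1**2) / (n0 + n1 - 2)
    # Standard error of the difference in sample means.
    se = math.sqrt(pooled_var * (1 / n0 + 1 / n1))
    diff = mean0 - mean1
    return diff, se, diff / se

# Summary statistics for lwage copied from the ttest output
# (group 0 = non-black, group 1 = black).
diff, se, t = pooled_t_stat(815, 6.816486, 0.4121446, 120, 6.524434, 0.3939236)
print(f"diff = {diff:.4f}, SE = {se:.4f}, t = {t:.2f}")
```

Up to rounding of the inputs, the result should agree with the values STATA reports (Std. Err. of diff = .0400754, t = 7.2876).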

3. Test the hypothesis that the population mean years of education (variable educ) is equal
for black and non-black men, against the two-sided alternative. Use a significance level
α = 0.05.
(i) What is the null hypothesis being tested in terms of the notation used in class?
What is the alternative hypothesis in terms of the notation used in class?
(ii) What do you conclude (i.e. do you reject the null at the 5% significance level), and
why?
(iii) Show how the t-statistic of 5.5720 was calculated.
(1 point each, 3 points in total)

Solution:
H0: µo = µ1
H1: µo ≠ µ1

Here µ1 is the population mean years of education for blacks and µo is the population
mean years of education for non-blacks. Again, please note we test a hypothesis about
the population means; we know the sample means are not equal (they are different
numbers), as we saw in problem set #2.

(ii) What do you conclude (i.e. do you reject the null at the 5% significance level), and
why?

. ttest educ, by (black)

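Two-sample t test with equal variances
------------------------------------------------------------------------------
   Group |     Obs        Mean    Std. Err.   Std. Dev.   [95% Conf. Interval]
---------+--------------------------------------------------------------------
       0 |     815    13.61963    .0776698    2.217333    13.46718    13.77209
       1 |     120    12.44167    .1586868    1.738326    12.12745    12.75588
---------+--------------------------------------------------------------------
combined |     935    13.46845    .0718383    2.196654    13.32747    13.60943
---------+--------------------------------------------------------------------
    diff |            1.177965    .2114084                .7630741    1.592856
------------------------------------------------------------------------------
    diff = mean(0) - mean(1)                                      t =   5.5720
Ho: diff = 0                                     degrees of freedom =      933

    Ha: diff < 0                 Ha: diff != 0                 Ha: diff > 0
 Pr(T < t) = 1.0000         Pr(|T| > |t|) = 0.0000          Pr(T > t) = 0.0000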

We reject the null hypothesis (in favour of the alternative) because the
p-value = 0.0000 < significance level α = 0.05, and conclude that the population mean years of
education differs significantly for black and non-black men.

(iii) Show how the t-statistic of 5.5720 was calculated.

Solution:

\[
t\text{-stat} = \frac{\widehat{\mathit{diff}} - \mathit{diff}_0}{SE(\widehat{\mathit{diff}})} = \frac{1.17797 - 0}{0.21141} = 5.572,
\]

where
\(\widehat{\mathit{diff}}\) is the estimated difference in means (i.e. this is just the difference between the
two sample means = 13.61963-12.44167=1.17796).
\(\mathit{diff}_0 = 0\) is the difference between the two population means under the null
hypothesis.
\(SE(\widehat{\mathit{diff}})\) is the standard error of the difference in means.

4. Test the hypothesis that the population mean years of mother’s education (variable
meduc) is larger for non-black men than for black men. Use a significance level α =
0.05.
(i) What is the null hypothesis being tested in terms of the notation used in class?
What is the alternative hypothesis in terms of the notation used in class?

(ii) What do you conclude (i.e. do you reject the null at the 5% significance level), and
why?
(iii) Show how the t-statistic of 6.6322 was calculated.
(1 point each, 3 points in total)

Solution:
H0: µo = µ1
H1: µo > µ1

Here µ1 is the population mean years of mother’s education for blacks and µo is the
population mean years of mother’s education for non-blacks.

(ii) What do you conclude (i.e. do you reject the null at the 5% significance level), and
why?

. ttest meduc, by (black)

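Two-sample t test with equal variances
------------------------------------------------------------------------------
   Group |     Obs        Mean    Std. Err.   Std. Dev.   [95% Conf. Interval]
---------+--------------------------------------------------------------------
       0 |     758    10.91029     .099562     2.74112    10.71484    11.10574
       1 |      99    8.939394     .308546    3.069994    8.327095    9.551693
---------+--------------------------------------------------------------------
combined |     857    10.68261    .0973458    2.849756    10.49155    10.87368
---------+--------------------------------------------------------------------
    diff |            1.970896    .2971709                1.387626    2.554166
------------------------------------------------------------------------------
    diff = mean(0) - mean(1)                                      t =   6.6322
Ho: diff = 0                                     degrees of freedom =      855

    Ha: diff < 0                 Ha: diff != 0                 Ha: diff > 0
 Pr(T < t) = 1.0000         Pr(|T| > |t|) = 0.0000          Pr(T > t) = 0.0000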

We reject the null hypothesis (in favour of the alternative) because the p-value for the
one-sided alternative Ha: diff > 0, Pr(T > t) = 0.0000, is smaller than the significance
level α = 0.05. We conclude that the population mean years of
mother’s education is significantly larger for non-black than for black men.

(iii) Show how the t-statistic of 6.6322 was calculated.


(1 point)
Solution:

\[
t\text{-stat} = \frac{\widehat{\mathit{diff}} - \mathit{diff}_0}{SE(\widehat{\mathit{diff}})} = \frac{1.970896 - 0}{0.2971709} = 6.632,
\]

where
\(\widehat{\mathit{diff}}\) is the estimated difference in means (i.e. this is just the difference between the
two sample means = 10.91029-8.939394=1.970896).
\(\mathit{diff}_0 = 0\) is the difference between the two population means under the null
hypothesis.
\(SE(\widehat{\mathit{diff}})\) is the standard error of the difference in means.

5. Finally, test the hypothesis that the population mean (average) weekly hours of work
(variable hours) is equal to 45 hours, against the two-sided alternative. Use a
significance level α = 0.05.
(i) What is the null hypothesis being tested in terms of the notation used in class?
What is the alternative hypothesis in terms of the notation used in class?
(ii) What do you conclude (i.e. do you reject the null at the 5% significance level), and
why?
(iii) Show how the t-statistic of -4.5314 was calculated.
(1 point each, 3 points in total)

Solution:
H0: µ = 45
H1: µ ≠ 45

Here µ is the population mean weekly hours of work.

(ii) What do you conclude (i.e. do you reject the null at the 5% significance level), and
why?

. ttest hours=45

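One-sample t test
------------------------------------------------------------------------------
Variable |     Obs        Mean    Std. Err.   Std. Dev.   [95% Conf. Interval]
---------+--------------------------------------------------------------------
   hours |     935    43.92941    .2362584    7.224256    43.46575    44.39307
------------------------------------------------------------------------------
    mean = mean(hours)                                            t =  -4.5314
Ho: mean = 45                                    degrees of freedom =      934

   Ha: mean < 45               Ha: mean != 45                 Ha: mean > 45
 Pr(T < t) = 0.0000         Pr(|T| > |t|) = 0.0000          Pr(T > t) = 1.0000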

We reject the null hypothesis (in favour of the alternative) because the
p-value = 0.0000 < significance level α = 0.05, and conclude that the population mean weekly
hours of work is significantly different from 45 hours a week.

(iii) Show how the t-statistic of -4.5314 was calculated.

Solution:

\[
t\text{-stat} = \frac{\bar{X} - \mu_0}{s/\sqrt{N}} = \frac{43.92941 - 45}{7.224256/\sqrt{935}} = -4.5314,
\]

where
\(\bar{X}\) is the sample mean hours of work for everyone in the sample.
\(\mu_0 = 45\) is the population mean hours of work under the null hypothesis.
\(SE(\bar{X}) = s/\sqrt{N}\) is the standard error of the sample mean of variable hours, which is
calculated as the ratio of the sample standard deviation of variable hours, s=7.224256,
to the square root of the sample size, N=935.
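
The same arithmetic can be checked outside STATA. The short Python sketch below plugs the numbers from the one-sample output into the formula; the variable names are illustrative only.

```python
import math

# Values copied from the one-sample ttest output for hours.
n = 935
sample_mean = 43.92941
sample_sd = 7.224256
mu_0 = 45  # population mean under the null hypothesis

# Standard error of the sample mean, then the t-statistic.
se = sample_sd / math.sqrt(n)
t_stat = (sample_mean - mu_0) / se
print(f"SE = {se:.7f}, t = {t_stat:.4f}")
```

The standard error should match the Std. Err. column of the output (.2362584) and the t-statistic should match -4.5314, up to rounding.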

Part II: Simple linear regression in STATA (9 points in total)

This part of the problem set introduces you to running regressions in STATA.

This problem uses STATA file BIG9SALARY.dta. This data was used by O. Baser and
E. Pema (2003) in a paper titled “The Return of Publications for Economics Faculty”.
BIG9SALARY.dta contains data on 223 faculty members of Economics departments in 9
universities in the US (Ohio State, Iowa, Indiana, Purdue, Michigan State, Minnesota,
Michigan, Wisconsin and Illinois) collected in year 1995. The dataset contains the
following variables:
id person identifier
salary total gross annual salary of the faculty member (in 1999 USD)
totpge total number of standardized article pages published
pubindex publication index (the product of number of articles published by
the rank of the journal)
top20phd =1 if Ph.D. was obtained from a top 20 Economics department; 0
otherwise
yearphd year when the Ph.D. was obtained
age age in years
female =1 if female; 0 otherwise
mich =1 for University of Michigan professors; 0 otherwise
(as well as dummies for each of the other universities).

Use dataset BIG9SALARY.dta to answer the following questions.

Consider the simple linear regression model relating a faculty member’s annual salary
(salary) to total number of standardized article pages published (totpge):

salary = 𝜷0 + 𝜷1 totpge + u

1. What is the meaning of the error term u?


(1 point)
Solution:

The error term u represents all factors that affect the outcome variable Y (in our
example, a faculty member’s annual salary), other than the explanatory variable X (in
our case, total number of article pages published).

2. Provide an example of a variable (factor) contained in u.


(1 point)
Solution:
Examples of variables in u are: a faculty member’s seniority level (assistant, associate
or full professor), tenure vs. non-tenure track; teaching evaluations; gender;
experience; type of university (e.g. public or private); geographic location, etc.

3. Now use the data in BIG9SALARY.dta to estimate this simple regression model and
answer the following questions:

Hint: in order to estimate the model in STATA, type: reg salary totpge

(i) Interpret the OLS estimate of the intercept.


(2 points)

Solution:
. reg salary totpge

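      Source |       SS       df       MS              Number of obs =     233
-------------+------------------------------           F(1, 231)     =   64.66
       Model |  3.8897e+10     1  3.8897e+10           Prob > F      =  0.0000
    Residual |  1.3896e+11   231   601542208           R-squared     =  0.2187
-------------+------------------------------           Adj R-squared =  0.2153
       Total |  1.7785e+11   232   766608670           Root MSE      =   24526

------------------------------------------------------------------------------
      salary |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
      totpge |   89.65586   11.14946     8.04   0.000     67.68822    111.6235
       _cons |   66626.78   2414.201    27.60   0.000     61870.11    71383.45
------------------------------------------------------------------------------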

The OLS estimate of the intercept is $66,626.78 (note that STATA calls the intercept
_cons). The intercept is always interpreted as the predicted (average) value of Y when
X=0. In this example this means that a faculty member who has zero
standardized article pages published (i.e. has no published articles) has an annual salary
of $66,626.78, on average.

(ii) Interpret the slope (coefficient) on totpge.


(2 points)
Solution:
The OLS estimate of the slope is $89.66 (rounded to the second decimal).
The slope is always interpreted as the estimated effect of a one-unit increase in X on the
average value of Y. In our example this means that every additional article page
increases a faculty member’s salary by $89.66, on average.

(iii) What is the effect of a 100-page increase in total number of standardized article
pages published on annual salary?
(1 point)
Solution:
From part (ii) we found that 1 more article page increases salary by $89.66; therefore,
100 more pages will increase salary by 100*($89.65586)=$8965.59. You can calculate this
with STATA by typing:
di 100 * 89.65586
8965.586
Notice there is a linear relationship between X (totpge) and Y (salary): every unit
increase of X (totpge) increases Y (salary) by the same amount, $89.66.
FYI: You may find it easier to use the formula from class linking the change in Y for
any given change in X:
\[
\Delta\hat{y} = \hat{\beta}_1 \Delta X = 100 \times \$89.65586 = \$8965.59.
\]

(iv) Use STATA command

browse totpge salary if id==330


to display the actual salary and number of published standardized article pages of the
faculty member with identifier number 330 (this is a MSU faculty member, who has the
highest number of published article pages amongst everyone in the sample). What is the
predicted annual salary of this faculty member?
Hint: The predicted annual salary of this faculty member 𝑌̂𝑖 is given by the value on the
regression line corresponding to totpge =1060.5.
(1 point)

Solution:

The predicted annual salary of this faculty member \(\hat{Y}_i\) is given by the value on the
regression line corresponding to totpge =1060.5.
\[
\hat{Y}_i = \hat{\beta}_0 + \hat{\beta}_1 X_i = 66626.78 + 89.65586 \times 1060.5 = \$161{,}706.82
\]
You can calculate this with STATA by typing
. di 66626.78 + 89.65586*1060.5
161706.82

(v) What is the actual salary of this faculty member? Find the residual for this faculty
member. Does our regression suggest this faculty member is underpaid or overpaid?

Hint: The residual 𝑢̂i is given by the difference between the actual and the predicted salary
for this observation: 𝑢̂i = 𝑌𝑖 − 𝑌̂𝑖 . You calculated the predicted salary in part (iv).
(2 points)

Solution:
The actual salary of this faculty member is $92,083 (this is given by the value of
variable salary for this faculty member).
The residual \(\hat{u}_i\) for this observation is given by the difference between the actual and
the predicted salary for this observation:
\[
\hat{u}_i = Y_i - \hat{Y}_i = \$92{,}083 - \$161{,}706.82 = -\$69{,}623.82
\]
With STATA:
. di 92083 - 161706.82
-69623.82
Our regression suggests that this faculty member is underpaid: the residual is negative,
so the actual salary is below the salary predicted for someone with this number of
published pages.
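
Parts (iv) and (v) can also be reproduced outside STATA. The Python sketch below hard-codes the coefficient estimates from the regression output; the helper name predict_salary is illustrative and not part of the problem set.

```python
# OLS estimates copied from the "reg salary totpge" output.
intercept = 66626.78
slope = 89.65586

def predict_salary(totpge):
    """Fitted value on the estimated regression line."""
    return intercept + slope * totpge

actual_salary = 92083            # salary of the faculty member with id 330
predicted = predict_salary(1060.5)
residual = actual_salary - predicted

print(f"predicted = {predicted:.2f}, residual = {residual:.2f}")
# A negative residual means the actual salary lies below the regression line.
print("below the line" if residual < 0 else "on or above the line")
```

The sign of the residual is what drives the underpaid/overpaid reading: a negative residual corresponds to "underpaid" relative to the fitted line.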

(vi) Now estimate the model separately for the male and female faculty members.
Are the estimates of the intercepts the same? What about the slope estimates? How do you
interpret this?
Do the regression results provide convincing evidence that either group of faculty members
is discriminated against? Explain.
(3 points)

Hint: This question asks you to think about the difference between estimates and population
parameters, and about correlation versus causal relationships.
To estimate the model separately for women and men type in the command window:
reg salary totpge if female==1
reg salary totpge if female==0

Solution:
The estimation results are presented below. The estimates of the intercepts are
different, suggesting that with no articles published (totpge=0) female faculty
members earn about $10,500 less, on average. However, the slope estimates
suggest that female faculty members are paid about $37 more for every additional
page published.
The regression results suggest that there may be differential payment for each group,
but they do not provide convincing evidence that either male or female faculty
members are discriminated against. First, we can only see the difference in the
estimates of the slopes and intercepts, while we still don’t know if the population
parameters differ. Secondly, our results may not have a causal
interpretation since we are not sure male and female faculty members are the same,
on average, in all dimensions affecting earnings (e.g. it might be the case that female
faculty members are mainly employed in language schools, where the salary levels are
generally lower).
Ultimately, we need to develop tools for analysing this question at a deeper level.
. reg salary totpge if female==1

Source | SS df MS Number of obs = 21

-------------+---------------------------------- F(1, 19) = 4.07
Model | 1.7705e+09 1 1.7705e+09 Prob > F = 0.0579
Residual | 8.2608e+09 19 434781062 R-squared = 0.1765
-------------+---------------------------------- Adj R-squared = 0.1332
Total | 1.0031e+10 20 501566396 Root MSE = 20851
------------------------------------------------------------------------------
salary | Coef. Std. Err. t P>|t| [95% Conf. Interval]
-------------+----------------------------------------------------------------
totpge | 122.0529 60.48352 2.02 0.058 -4.540533 248.6464
_cons | 57594.09 6323.063 9.11 0.000 44359.77 70828.42
------------------------------------------------------------------------------

. reg salary totpge if female==0

Source | SS df MS Number of obs = 211


-------------+---------------------------------- F(1, 209) = 52.77
Model | 3.2585e+10 1 3.2585e+10 Prob > F = 0.0000
Residual | 1.2906e+11 209 617526916 R-squared = 0.2016
-------------+---------------------------------- Adj R-squared = 0.1978
Total | 1.6165e+11 210 769752285 Root MSE = 24850
------------------------------------------------------------------------------
salary | Coef. Std. Err. t P>|t| [95% Conf. Interval]
-------------+----------------------------------------------------------------
totpge | 85.01794 11.7039 7.26 0.000 61.94511 108.0908
_cons | 68174.95 2634.973 25.87 0.000 62980.42 73369.48
------------------------------------------------------------------------------

. di 68174.95- 57594.09
10580.86

. di 85.01794- 122.0529
-37.03496
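
The first caveat in the solution (estimates versus population parameters) can be made concrete with the standard errors in the two outputs. The Python sketch below is only a rough illustration and not a method used in this problem set: it assumes the male and female subsamples are independent, so that the variance of the difference in slopes is the sum of the two squared standard errors.

```python
import math

# Slope estimates and standard errors copied from the two regression outputs.
slope_f, se_f = 122.0529, 60.48352   # reg salary totpge if female==1
slope_m, se_m = 85.01794, 11.7039    # reg salary totpge if female==0

gap = slope_f - slope_m
# Assumption: independent subsamples, so the variances of the two estimates add.
se_gap = math.sqrt(se_f**2 + se_m**2)
ratio = gap / se_gap
print(f"gap = {gap:.2f}, SE of gap = {se_gap:.2f}, ratio = {ratio:.2f}")
```

Under that assumption, a ratio well below 2 in absolute value indicates that the $37 gap in estimated slopes is small relative to its sampling uncertainty, which is consistent with the caution expressed in the solution.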

Part III: Algebra of OLS (17 points in total)


For this part of the problem set no STATA submission is required.

1. Consider the simple linear regression model


Y = 𝜷0 + 𝜷1X + u.
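Show that the OLS estimators of \(\beta_0\) and \(\beta_1\) minimizing the sum of squared residuals

\[
\min_{\hat{\beta}_0,\,\hat{\beta}_1} \sum_{i=1}^{N} (Y_i - \hat{\beta}_0 - \hat{\beta}_1 X_i)^2
\]

are given by

\[
\hat{\beta}_0 = \bar{Y} - \hat{\beta}_1 \bar{X}
\]

\[
\hat{\beta}_1 = \frac{\sum_{i=1}^{N} (X_i - \bar{X})(Y_i - \bar{Y})}{\sum_{i=1}^{N} (X_i - \bar{X})^2}
\]

Hint: We proved this in class. No need to show that

\[
\sum_{i=1}^{N} (X_i - \bar{X})(Y_i - \bar{Y}) = \sum_{i=1}^{N} X_i Y_i - N \bar{X}\bar{Y}
\]

\[
\sum_{i=1}^{N} (X_i - \bar{X})^2 = \sum_{i=1}^{N} X_i^2 - N \bar{X}^2
\]

as we did this in homework 1.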

(7 points)

Solution:

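First order conditions (FOC):

\[
\{\hat{\beta}_0\}: \quad 2 \sum (Y_i - \hat{\beta}_0 - \hat{\beta}_1 X_i)(-1) = 0 \qquad (1)
\]

\[
\{\hat{\beta}_1\}: \quad 2 \sum (Y_i - \hat{\beta}_0 - \hat{\beta}_1 X_i)(-X_i) = 0 \qquad (2)
\]

Simplifying by multiplying both sides of each equation by \(-\frac{1}{2}\) yields:

\[
\{\hat{\beta}_0\}: \quad \sum (Y_i - \hat{\beta}_0 - \hat{\beta}_1 X_i) = 0 \qquad (1)
\]

\[
\{\hat{\beta}_1\}: \quad \sum \left[ (Y_i - \hat{\beta}_0 - \hat{\beta}_1 X_i) X_i \right] = 0 \qquad (2)
\]

These two expressions are called the normal equations.

Let’s solve these and find the formulae of \(\hat{\beta}_0\) and \(\hat{\beta}_1\).

From (1):

\[
\sum (Y_i - \hat{\beta}_0 - \hat{\beta}_1 X_i) = 0
\iff \sum Y_i - N\hat{\beta}_0 - \hat{\beta}_1 \sum X_i = 0
\]

Solving for \(\hat{\beta}_0\) yields:

\[
\hat{\beta}_0 = \frac{-\sum Y_i}{-N} + \frac{\sum X_i}{-N}\,\hat{\beta}_1
\iff \hat{\beta}_0 = \bar{Y} - \hat{\beta}_1 \bar{X}
\]

In order to obtain the formula for \(\hat{\beta}_1\), we’re going to substitute the expression for
\(\hat{\beta}_0\) into (2). But before this, let’s simplify (2):

\[
\sum (Y_i - \hat{\beta}_0 - \hat{\beta}_1 X_i) X_i = 0
\iff \sum Y_i X_i - \hat{\beta}_0 \sum X_i - \hat{\beta}_1 \sum X_i^2 = 0
\]

Substituting \(\hat{\beta}_0\) from (1) and dividing and multiplying the second term by N yields:

\[
\sum Y_i X_i - N(\bar{Y} - \hat{\beta}_1 \bar{X}) \frac{\sum X_i}{N} - \hat{\beta}_1 \sum X_i^2 = 0
\]

\[
\iff \sum Y_i X_i - N\bar{X}\bar{Y} + N\hat{\beta}_1 \bar{X}^2 - \hat{\beta}_1 \sum X_i^2 = 0
\]

\[
\iff \left( \sum X_i Y_i - N\bar{X}\bar{Y} \right) - \hat{\beta}_1 \left( \sum X_i^2 - N\bar{X}^2 \right) = 0
\]

Solving for \(\hat{\beta}_1\):

\[
\hat{\beta}_1 = \frac{\sum X_i Y_i - N\bar{X}\bar{Y}}{\sum X_i^2 - N\bar{X}^2}
\]

And we’ve already shown in lecture 1 (and in homework 1) that this can be rewritten as:

\[
\hat{\beta}_1 = \frac{\sum_{i=1}^{N} (X_i - \bar{X})(Y_i - \bar{Y})}{\sum_{i=1}^{N} (X_i - \bar{X})^2}
\]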
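
A quick numerical sanity check of this derivation is sketched below in Python. The data are hypothetical and chosen only to exercise the formulas; the check confirms that the two expressions for the slope agree and that the resulting residuals satisfy both normal equations up to floating-point rounding.

```python
# Hypothetical data, chosen only to exercise the formulas.
xs = [1.0, 2.0, 4.0, 7.0, 9.0]
ys = [2.0, 3.5, 4.0, 8.0, 9.5]
n = len(xs)
x_bar = sum(xs) / n
y_bar = sum(ys) / n

# Slope in "raw sums" form and in "deviations from means" form.
slope_raw = (sum(x * y for x, y in zip(xs, ys)) - n * x_bar * y_bar) / (
    sum(x * x for x in xs) - n * x_bar**2
)
slope_dev = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys)) / sum(
    (x - x_bar) ** 2 for x in xs
)
intercept = y_bar - slope_dev * x_bar

# Residuals should satisfy normal equations (1) and (2).
residuals = [y - intercept - slope_dev * x for x, y in zip(xs, ys)]
foc_1 = sum(residuals)
foc_2 = sum(r * x for r, x in zip(residuals, xs))
print(slope_raw, slope_dev, foc_1, foc_2)
```

Both foc_1 and foc_2 should print as zero or as numbers on the order of machine precision.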

2. Throughout this class, we are going to be using STATA to calculate regression lines,
but it is a good idea to compute a regression line by hand once.
a) Calculate the OLS estimates of the intercept and slope from a regression of Y
on X, using the following data. Show all your calculations!

Obs. i Xi Yi
1 3 3
2 9 6
3 6 3

(1 point for each estimate, 2 points in total)

b) Check that the point (𝑋̅, 𝑌̅) lies on the regression line, where 𝑋̅ is the sample
average of variable X, and 𝑌̅ is the sample average of variable Y. (1 point)

c) Show that the property you illustrated in part b) also holds in general, i.e. that
the point (𝑋̅, 𝑌̅) always lies on the regression line. Your solution cannot use
the formula for the intercept estimate. (7 points)

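Hint: write the residual for a single observation:

\[
\hat{u}_i = Y_i - \hat{Y}_i
\]

From here:

\[
Y_i = \hat{Y}_i + \hat{u}_i
\]

Replace \(\hat{Y}_i = \hat{\beta}_0 + \hat{\beta}_1 X_i\):

\[
Y_i = \hat{\beta}_0 + \hat{\beta}_1 X_i + \hat{u}_i
\]

Then sum over all the data points, from 1 to N:

\[
\sum_{i=1}^{N} Y_i = \sum_{i=1}^{N} (\hat{\beta}_0 + \hat{\beta}_1 X_i + \hat{u}_i)
\]

\[
\sum_{i=1}^{N} Y_i = \sum_{i=1}^{N} \hat{\beta}_0 + \sum_{i=1}^{N} \hat{\beta}_1 X_i + \sum_{i=1}^{N} \hat{u}_i
\]

\[
\sum_{i=1}^{N} Y_i = N\hat{\beta}_0 + \hat{\beta}_1 \sum_{i=1}^{N} X_i
\]

(since \(\sum_{i=1}^{N} \hat{u}_i = 0\) from the first normal equation).

From here, dividing both sides by N: \(\bar{Y} = \hat{\beta}_0 + \hat{\beta}_1 \bar{X}\), which means that the point \((\bar{X}, \bar{Y})\) lies on the regression line.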

OPTIONAL: double-check your calculation for part a) with STATA by following the
steps below:

1) Open STATA and select Data/Data editor/Data editor (Edit).

2) Type in the values of X as var1 and the values of Y as var2 in the data editor (or
just copy/paste the numbers from the Word table).

3) Close the Data Editor. You can now see var1 and var2 in the variable list.

4) Rename var1 as X, var2 as Y by typing the following code:


rename var1 X
rename var2 Y
You can now see the variables as X and Y:

5) Regress Y on X by typing:
reg Y X
(Beware that STATA is case sensitive, if you call the variables Y and X (caps) you
should use caps in the regression command as well).

Do you get the same result for the slope and intercept as the one you calculated?

Solution:
a)

Obs. i   Xi   Yi   (Xi − X̄)   (Yi − Ȳ)   (Xi − X̄)²   (Xi − X̄)(Yi − Ȳ)
1         3    3      -3         -1          9              3
2         9    6       3          2          9              6
3         6    3       0         -1          0              0
∑                      0          0         18              9

X̄ = 6, Ȳ = 4

\[
\sum_{i=1}^{N} (X_i - \bar{X})(Y_i - \bar{Y}) = 9
\]

\[
\sum_{i=1}^{N} (X_i - \bar{X})^2 = 18
\]

Our estimate of the slope parameter is \(\hat{\beta}_1 = 9/18 = 0.5\) (we estimate that a 1-unit
increase in X increases the average of Y by 0.5).

Our estimate of the intercept is:

\[
\hat{\beta}_0 = \bar{Y} - \hat{\beta}_1 \bar{X} = 4 - 6 \cdot 0.5 = 1
\]

(we estimate that the average of Y is 1 when X equals zero).

Alternatively, you can use the properties of summations we proved earlier in the
course, which will save you some calculations:

\[
\sum_{i=1}^{N} (X_i - \bar{X})(Y_i - \bar{Y}) = \sum_{i=1}^{N} X_i Y_i - N \bar{X}\bar{Y}
\]

\[
\sum_{i=1}^{N} (X_i - \bar{X})^2 = \sum_{i=1}^{N} X_i^2 - N \bar{X}^2
\]

Obs. i   Xi   Yi   XiYi   Xi²
1         3    3      9     9
2         9    6     54    81
3         6    3     18    36
∑                    81   126

\[
\sum_{i=1}^{N} X_i Y_i - N \bar{X}\bar{Y} = 81 - 3 \cdot 6 \cdot 4 = 9
\]

\[
\sum_{i=1}^{N} X_i^2 - N \bar{X}^2 = 126 - 3 \cdot 6^2 = 18
\]

\[
\hat{\beta}_1 = 9/18 = 0.5, \qquad \hat{\beta}_0 = \bar{Y} - \hat{\beta}_1 \bar{X} = 4 - 6 \cdot 0.5 = 1
\]

The STATA output below confirms our calculations.

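. reg Y X

      Source |       SS       df       MS              Number of obs =       3
-------------+------------------------------           F(1, 1)       =    3.00
       Model |         4.5     1         4.5           Prob > F      =  0.3333
    Residual |         1.5     1         1.5           R-squared     =  0.7500
-------------+------------------------------           Adj R-squared =  0.5000
       Total |           6     2           3           Root MSE      =  1.2247

------------------------------------------------------------------------------
           Y |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
           X |         .5   .2886751     1.73   0.333    -3.167965    4.167965
       _cons |          1   1.870829     0.53   0.687    -22.77113    24.77113
------------------------------------------------------------------------------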
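
For readers without STATA at hand, the hand calculation in part a) can also be reproduced with a few lines of Python. This is a minimal sketch using only the three data points from the question; the variable names are illustrative.

```python
# The three observations from part a).
xs = [3, 9, 6]
ys = [3, 6, 3]
n = len(xs)
x_bar = sum(xs) / n
y_bar = sum(ys) / n

# Sums of cross-products and squares of deviations from the sample means.
sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
sxx = sum((x - x_bar) ** 2 for x in xs)

slope = sxy / sxx
intercept = y_bar - slope * x_bar

# Fitted value at X-bar, which should equal Y-bar.
fitted_at_mean = intercept + slope * x_bar
print(slope, intercept, fitted_at_mean)  # prints 0.5 1.0 4.0
```

The last printed value equals the sample mean of Y, which is the property examined in parts b) and c).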

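b)
X̄ = 6, Ȳ = 4. Plugging X̄ = 6 into the equation of the regression line we obtain:
\[
\hat{Y} = \hat{\beta}_0 + \hat{\beta}_1 \bar{X} = 1 + 0.5 \cdot 6 = 4 = \bar{Y},
\]
so the point (X̄, Ȳ) = (6, 4) lies on the regression line.
c) The hint contains the complete solution; I just wanted you to go through it.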
See graph below for illustration:
