Case 3

The document outlines a case study for applied econometrics focusing on predicting high earners using a dataset. It includes tasks such as data importation, summary statistics, variable checks, and model building using linear probability and logit models. The analysis aims to explore factors influencing annual income and assess the relative status of men and women in the labor market.

Applied Econometrics for Managers: Case #3

High Earner Prediction

1. We will use the “data_case_3.csv” dataset for this session. The variable description is
provided in the “data_doc_case_3.txt” file. Your first task is to import the dataset into R. How many
observations are in the imported dataset? (You should have 30162 observations.)

2. Get the summary statistics of the data. Fill in the following table.

Variable | Number of Observations | Mean | Standard Deviation | Minimum | Maximum | Range
Age      |                        |      |                    |         |         |

3. We want to ensure that there is no coding error in variables education_num and education.
To check this, create a table with education as the row variable and education_num as the column
variable, and fill in the following.
education (rows) \ education_num (columns) | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | 10 | 11 | 12 | 13 | 14 | 15 | 16
10th
11th
12th
1st-4th
5th-6th
7th-8th
9th
Assoc-acdm
Assoc-voc
Bachelors
Doctorate
HS-grad
Masters
Preschool
Prof-school
Some-college

4. How many individuals in the dataset are earning more than $50,000 annually?

5. If we want to know the relative status of men and women in the labor market
from this dataset, what can we do? Complete the following tables. Can we say anything about
the relative status from these tables?

Share of high earners (>$50K annual income) and low earners (<=$50K annual income) among males and females
       | Less than or equal to $50K annual income | Greater than $50K annual income | Total
Female |                                          |                                 | 100%
Male   |                                          |                                 | 100%

Share of males and females among high earners (>$50K annual income) and low earners (<=$50K annual income)
       | Less than or equal to $50K annual income | Greater than $50K annual income
Female |                                          |
Male   |                                          |
Total  | 100%                                     | 100%
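
The two tables above differ only in the denominator: the first divides each cell by its row total (shares within each sex), the second by its column total (shares within each income group). The case itself is to be completed in R; the following is only a minimal Python sketch of the arithmetic, using hypothetical toy records rather than data_case_3.csv, and the helper names `row_shares` and `column_shares` are illustrative.

```python
from collections import Counter

# Hypothetical toy records (sex, income bracket); not taken from data_case_3.csv.
records = [
    ("Female", "<=50K"), ("Female", "<=50K"), ("Female", "<=50K"), ("Female", ">50K"),
    ("Male", "<=50K"), ("Male", "<=50K"), ("Male", ">50K"), ("Male", ">50K"),
]

counts = Counter(records)
sexes = ["Female", "Male"]
brackets = ["<=50K", ">50K"]

def row_shares(counts):
    """Share of each income bracket within a sex (each row sums to 1)."""
    shares = {}
    for s in sexes:
        total = sum(counts[(s, b)] for b in brackets)
        shares[s] = {b: counts[(s, b)] / total for b in brackets}
    return shares

def column_shares(counts):
    """Share of each sex within an income bracket (each column sums to 1)."""
    shares = {}
    for b in brackets:
        total = sum(counts[(s, b)] for s in sexes)
        shares[b] = {s: counts[(s, b)] / total for s in sexes}
    return shares

print(row_shares(counts))
print(column_shares(counts))
```

Row shares compare how likely a woman and a man are to be high earners; column shares also reflect how many women and men are in the sample, so the two tables can tell different stories.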

6. In which occupation is the share of high earners (>$50K annual income) the highest? Among
high earners, which education level accounts for the largest share? Report the highest
share for each of these questions.

7. Create a new variable inc50k which takes the value ‘0’ if an individual earns less than
or equal to $50,000 per year and ‘1’ if an individual earns more than $50,000 per year.
What is its mean?
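
Because inc50k is a 0/1 variable, its mean is simply the share of individuals coded 1, and its sum is the count of high earners. A minimal Python sketch of that point follows; the income labels are hypothetical and may not match the coding in data_case_3.csv.

```python
# Hypothetical raw income labels; the actual coding in data_case_3.csv may differ.
income = ["<=50K", ">50K", "<=50K", "<=50K", ">50K"]

# inc50k is 1 for high earners (>$50,000) and 0 otherwise.
inc50k = [1 if label == ">50K" else 0 for label in income]

# The sum counts the high earners; the mean is their share of the sample.
n_high = sum(inc50k)
mean_inc50k = n_high / len(inc50k)

print(n_high, mean_inc50k)
```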

8. We want to know whether the likelihood of a person being a high earner (>$50,000 annual
income) can be predicted. To answer this, build a linear probability model by regressing
inc50k on age, square of age, sex, race, and education_num. What is the value of the estimated
coefficient for sex? Is it statistically significant? Interpret the other estimated coefficients as well.

9. Now drop education_num from the model in question 8 and add relationship, education,
workclass, occupation, hours_per_week, and capital_gain to that model as explanatory variables.
Explain the results. How has the estimated coefficient for sex changed, and what does that mean?

10. What is the prediction accuracy of the model in question 9? To check this, consider that an
individual will earn more than $50,000 annually with 100% certainty if predicted inc50k is greater
than or equal to 0.5, and will earn less than or equal to $50,000 annually if predicted inc50k is less
than 0.5.
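
The classification rule above can be sketched as follows. This is a minimal Python illustration with hypothetical fitted values and outcomes, not output from the case data; `classify` and `accuracy` are illustrative helper names.

```python
def classify(predicted, threshold=0.5):
    """Map fitted values to 0/1: 1 if the fitted value is >= threshold."""
    return [1 if p >= threshold else 0 for p in predicted]

def accuracy(predicted, actual, threshold=0.5):
    """Share of observations whose predicted class matches the observed inc50k."""
    labels = classify(predicted, threshold)
    hits = sum(1 for yhat, y in zip(labels, actual) if yhat == y)
    return hits / len(actual)

# Hypothetical fitted values. A linear probability model can produce
# values below 0 or above 1; the rule handles them without change.
fitted = [-0.10, 0.20, 0.50, 0.70, 1.05, 0.40]
observed = [0, 0, 1, 0, 1, 1]

print(classify(fitted))
print(accuracy(fitted, observed))
```

A useful benchmark for the resulting accuracy is the share of the majority class, which is the accuracy obtained by predicting the same outcome for everyone.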

11. Now, let’s build a logit model using inc50k as the dependent variable and age, square of
age, sex, race, and education_num as the explanatory variables. What is the value of the estimated
coefficient for sex? What message do the estimated coefficients convey in this case?
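
A logit coefficient is a change in the log-odds, so its sign gives the direction of the effect, while the implied change in probability depends on where the individual starts. The Python sketch below illustrates this with a hypothetical coefficient on a male dummy (0.8 is made up, not an estimate from the case data).

```python
import math

def logistic(z):
    """Logistic function: maps a linear index to a probability."""
    return 1 / (1 + math.exp(-z))

def odds(p):
    return p / (1 - p)

def compare(baseline_index, beta):
    """Return (change in probability, odds ratio) when the index rises by beta."""
    p0 = logistic(baseline_index)
    p1 = logistic(baseline_index + beta)
    return p1 - p0, odds(p1) / odds(p0)

beta_male = 0.8  # hypothetical coefficient on a male dummy

# Two hypothetical baseline indices: a low-probability and a mid-probability individual.
for baseline in (-2.0, 0.0):
    dp, ratio = compare(baseline, beta_male)
    print(baseline, round(dp, 4), round(ratio, 4))
```

The odds ratio equals exp(0.8) at both baselines, whereas the change in probability differs between them.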

12. What is McFadden’s pseudo R-squared for the logit model in question 11? Also, test for
the overall significance of this model.
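
McFadden’s pseudo R-squared is 1 minus the ratio of the fitted model’s log-likelihood to that of an intercept-only model, and the likelihood-ratio statistic for overall significance is twice the difference between the two log-likelihoods. Under the null hypothesis that all slope coefficients are zero, that statistic is asymptotically chi-square with degrees of freedom equal to the number of slope coefficients. A minimal Python sketch, using hypothetical outcomes and hypothetical fitted probabilities (not maximum-likelihood output from the case data):

```python
import math

def log_likelihood(probs, y):
    """Bernoulli log-likelihood of outcomes y given fitted probabilities."""
    return sum(math.log(p) if yi == 1 else math.log(1 - p)
               for p, yi in zip(probs, y))

def null_log_likelihood(y):
    """Log-likelihood of an intercept-only model, whose fitted probability is mean(y)."""
    p_bar = sum(y) / len(y)
    return log_likelihood([p_bar] * len(y), y)

def mcfadden_r2(probs, y):
    """1 - lnL(model) / lnL(intercept only)."""
    return 1 - log_likelihood(probs, y) / null_log_likelihood(y)

def lr_statistic(probs, y):
    """Likelihood-ratio statistic against the intercept-only model."""
    return 2 * (log_likelihood(probs, y) - null_log_likelihood(y))

# Hypothetical outcomes and fitted probabilities, for illustration only.
y = [0, 0, 1, 1]
p_hat = [0.2, 0.3, 0.7, 0.8]

print(mcfadden_r2(p_hat, y))
print(lr_statistic(p_hat, y))
```

The same likelihood-ratio form compares any restricted model nested in a larger one: replace the intercept-only log-likelihood with the restricted model’s, and use the number of added coefficients as the degrees of freedom.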

13. Now drop education_num from the model in question 11 and add relationship, education,
workclass, occupation, hours_per_week, and capital_gain to that model as explanatory variables.
What is the estimated coefficient for sex now? What message do the estimated coefficients convey
in this case?

14. Run a statistical test to determine whether the new explanatory variables included in the
model in question 13 belong to the model or not.

15. Check for prediction accuracy of the model in question 13. To check this, consider that an
individual will earn more than $50,000 annually with 100% certainty if predicted inc50k is greater
than or equal to 0.5, and will earn less than or equal to $50,000 annually if predicted inc50k is less
than 0.5.

16. What is the average partial effect (APE)? Calculate the average partial effects for the model
in question 13. What is the marginal effect for sex? Explain the other marginal effects as well.
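
For a logit model, the partial effect of a continuous regressor on the predicted probability is its coefficient times p(1 - p), evaluated for each individual and then averaged over the sample. For a dummy such as sex, the usual counterpart is the average difference in predicted probability when the dummy is switched from 0 to 1. The Python sketch below illustrates both calculations with hypothetical coefficients and a hypothetical four-person sample; none of the numbers come from the case data, and the model is deliberately smaller than the one in question 13.

```python
import math

def logistic(z):
    """Logistic function: maps a linear index to a probability."""
    return 1 / (1 + math.exp(-z))

# Hypothetical logit coefficients, not estimates from the case data.
beta = {"const": -3.0, "hours": 0.05, "male": 0.8}

# Hypothetical observations: (hours_per_week, male dummy).
sample = [(40, 1), (35, 0), (50, 1), (20, 0)]

def linear_index(hours, male):
    return beta["const"] + beta["hours"] * hours + beta["male"] * male

def ape_continuous():
    """Average over the sample of beta_hours * p * (1 - p)."""
    effects = []
    for hours, male in sample:
        p = logistic(linear_index(hours, male))
        effects.append(beta["hours"] * p * (1 - p))
    return sum(effects) / len(effects)

def ape_dummy():
    """Average change in predicted probability when male switches from 0 to 1."""
    diffs = [logistic(linear_index(hours, 1)) - logistic(linear_index(hours, 0))
             for hours, _ in sample]
    return sum(diffs) / len(diffs)

print(ape_continuous())
print(ape_dummy())
```

When a regressor enters with its square, as age does in this case, its partial effect must account for both terms: the coefficient on age plus twice the coefficient on the square times age, all multiplied by p(1 - p).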
