Take Home Assignment 2
2) Append this data with the remaining geographical data available in geography2.dta.
(1 Mark)
3) Using your appended data, generate a chart that tells us something insightful about the
data. You can choose any variables and any type of chart, e.g. bar, pie, scatterplot. Briefly
state what insights your graph reveals about the data. (5 Marks)
ECON 330 Fall 2023
4) There might be unwanted spaces before or after country names in the variable “country”.
These spaces could adversely affect the merge to be completed in the next question. Please
remove these spaces. What other issues could adversely affect merging results?
(3 Marks)
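A minimal Stata sketch of the cleaning step, assuming “country” is a string variable in the appended dataset currently in memory. The internal-blank cleanup and the duplicates check are added suggestions, not something the question requires.

```stata
* Assumption: "country" is a string variable (strtrim needs Stata 14 or later;
* older versions use trim() instead).
replace country = strtrim(country)     // remove leading and trailing blanks

* Optional: collapse repeated internal blanks, e.g. "South  Africa".
replace country = stritrim(country)

* Optional: check whether country identifies observations uniquely.
duplicates report country
```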
5) Merge this dataset with the dataset infectious_disease.dta (HINT: Merging can be
completed using a variable that is common to both datasets). How many observations are
there in total? Drop all observations that do not match. (5 Marks)
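One possible shape for the merge in Stata, sketched under the assumption that “country” is the common variable and that it identifies a single observation in each file. If that does not hold, an m:1 or 1:m merge would be needed instead.

```stata
* Assumption: "country" is the key and is unique in both datasets.
merge 1:1 country using infectious_disease.dta

tabulate _merge        // 1 = master only, 2 = using only, 3 = matched
count                  // total number of observations after the merge

keep if _merge == 3    // keep only observations that matched
drop _merge
```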
6) Incorrectly measured values in pop95 and pdenpavg are recorded as negative values in the
dataset. Please save these negative values to a separate dataset and drop them from the
original dataset. (2 Marks)
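A hedged sketch of one way to split off the negative values in Stata. The file name negative_values.dta is a hypothetical choice; any name would do.

```stata
* Save observations with a negative pop95 or pdenpavg to a separate file.
* Stata treats missing values as larger than any number, so "< 0" does not
* pick up missing observations.
preserve
keep if pop95 < 0 | pdenpavg < 0
save negative_values.dta, replace    // hypothetical file name
restore

* Drop the same observations from the original dataset.
drop if pop95 < 0 | pdenpavg < 0
```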
8) Variables Mal46p, Mal66p, Mal82p and Mal94p describe the percentage of 1995
population in malaria-affected areas in four different years (1946, 1966, 1982 and 1994).
Generate a variable that categorizes observations in Mal46p (percentage of population)
by low, moderate and high, based on the following groupings: (4 Marks)
a. “Low” if less than or equal to 33 percent of the 1995 population resided in a
malaria-affected area in 1946.
b. “Moderate” if more than 33 percent and less than or equal to 66 percent of the
1995 population resided in a malaria-affected area in 1946.
c. “High” if more than 66 percent of the 1995 population resided in a malaria-
affected area in 1946.
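The groupings above could be coded along the following lines. This is a sketch that assumes Mal46p is stored on a 0 to 100 scale; if it is stored as a proportion between 0 and 1, the cutoffs would be 0.33 and 0.66. The names mal46_cat and malcat are hypothetical.

```stata
* Assumption: Mal46p is measured in percent (0 to 100).
generate byte mal46_cat = .
replace mal46_cat = 1 if Mal46p <= 33
replace mal46_cat = 2 if Mal46p > 33 & Mal46p <= 66
replace mal46_cat = 3 if Mal46p > 66 & !missing(Mal46p)   // keep missings missing

* Attach readable labels to the three categories.
label define malcat 1 "Low" 2 "Moderate" 3 "High"
label values mal46_cat malcat
tabulate mal46_cat, missing
```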
9) The variable “time” captures the point in time when the observation was recorded. Change
the format of this variable so that a current observation of the form “Mon Aug 10 11:06:25
UTC 2015” is converted to “Aug 10 11:06:25 2015”. (2 Marks)
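One way to do this conversion, assuming “time” is a string variable whose values always have the six space-separated parts shown in the example. The new variable name time2 is hypothetical.

```stata
* "Mon Aug 10 11:06:25 UTC 2015" has six words; keep words 2, 3, 4 and 6.
generate time2 = word(time, 2) + " " + word(time, 3) + " " + ///
                 word(time, 4) + " " + word(time, 6)
list time time2 in 1/5
```

The line continuation with three slashes works in a do-file; in the Command window the expression would go on a single line.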
10) Sort the data in descending order of the variable that captures population in 1995.
(1 Mark)
11) What was the lowest magnitude of malaria area (percentage) for every country across the
four years: 1946, 1966, 1982 and 1994? Construct a variable containing this value for
each country. (3 Marks)
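A short sketch of a row-wise minimum in Stata, assuming the data hold one observation per country so that the four malaria variables sit side by side. The name mal_min is hypothetical.

```stata
* rowmin() ignores missing values and returns missing only if all four
* variables are missing for that observation.
egen mal_min = rowmin(Mal46p Mal66p Mal82p Mal94p)
summarize mal_min
```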
Part C
1. To test the effectiveness of a job training program on the subsequent wages of workers,
we specify the model:
$\log(\mathit{wage}) = \beta_0 + \beta_1 \, \mathit{train} + \beta_2 \, \mathit{educ} + \beta_3 \, \mathit{exper} + u$,
where train is a binary variable equal to unity if a worker participated in the program.
Think of the error term $u$ as containing unobserved worker ability. If less able workers
have a greater chance of being selected for the program, and you use an OLS analysis,
what can you say about the likely bias in the OLS estimator of $\beta_1$? (Hint: Refer back to
Chapter 3.) (2 Marks)
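As a reminder of the kind of result the hint points to, the omitted variable bias relation can be written as below. The symbols $\gamma$, $\mathit{abil}$, $e$ and $\tilde{\delta}_1$ are notation introduced here for illustration and do not appear in the question. The relation is exact in the two-regressor case and is used only as an approximation when educ and exper are also in the model.

```latex
\[
u = \gamma \, \mathit{abil} + e,
\qquad
\operatorname{E}(\tilde{\beta}_1) \approx \beta_1 + \gamma \, \tilde{\delta}_1
\]
% \tilde{\beta}_1 : OLS estimator of \beta_1 when ability is left in the error term
% \tilde{\delta}_1: slope from a regression of abil on train
% The sign of the bias is the sign of \gamma times the sign of \tilde{\delta}_1.
```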
2. For a child $i$ living in a particular school district, let $\mathit{voucher}_i$ be a dummy variable equal
to one if a child is selected to participate in a school voucher program, and let $\mathit{score}_i$ be
that child’s score on a subsequent standardized exam. Suppose that the participation
variable, $\mathit{voucher}_i$, is completely randomized in the sense that it is independent of both
observed and unobserved factors that can affect the test score.
(i) If you run a simple regression of $\mathit{score}_i$ on $\mathit{voucher}_i$ using a random sample of size
$n$, does the OLS estimator provide an unbiased estimator of the effect of the
voucher program? (2 Marks)
(ii) Suppose you can collect additional background information, such as family
income, family structure (e.g., whether the child lives with both parents), and
parents’ education levels. Do you need to control for these factors to obtain an
unbiased estimator of the effects of the voucher program? Explain. (2 Marks)
(iii) Why should you include the family background variables in the regression? Is
there a situation in which you would not include the background variables?
(2 Marks)
3. Using the dataset HPRICE2.dta, we are going to explore the implications of using
different functional forms in regression analysis. Let’s start with a simple bivariate
relation between air quality (nox) and housing price (price).
a. Report the units in which each variable is measured. Estimate and interpret the
simple regression of price on nox. (4 Marks)
b. Looking at the density of each variable and its logarithm, the scatterplot (see Figure
1 below), as well as any guidance from Chapter 6, what functional form would
you like to use when specifying this regression model? Write down and estimate
your preferred regression equation. Interpret the OLS estimates. (4 Marks)
d. Produce a graph in Stata that overlays the sample regression functions from our
log-log model and lin-lin model on the scatterplot of price against nox. (Hint: use
the “graph twoway” and “graph function” commands and learn the correct use of the
‘range’ option in the latter.) (5 Marks)
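A sketch of how such an overlay might be built in a do-file. It assumes the log variables are created under the hypothetical names ln_price and ln_nox, and it back-transforms the log-log fit as exp(b0)*x^b1 without any retransformation adjustment, which is a simplification.

```stata
use HPRICE2.dta, clear

* Hypothetical names, chosen to avoid clashing with variables already in the file.
generate ln_price = log(price)
generate ln_nox   = log(nox)

* Lin-lin model: store intercept and slope.
regress price nox
local a0 = _b[_cons]
local a1 = _b[nox]

* Log-log model: store intercept and slope.
regress ln_price ln_nox
local b0 = _b[_cons]
local b1 = _b[ln_nox]

* Plot both fitted functions over the observed range of nox.
summarize nox
local lo = r(min)
local hi = r(max)

twoway (scatter price nox)                                        ///
       (function y = `a0' + `a1'*x, range(`lo' `hi'))             ///
       (function y = exp(`b0')*x^(`b1'), range(`lo' `hi')),       ///
       legend(order(2 "lin-lin" 3 "log-log")) ytitle("price") xtitle("nox")
```

Without the range option, the function plots would be drawn over the default interval from 0 to 1 rather than over the values nox actually takes.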
e. Estimate second-order (quadratic) and third-order polynomials (cubic function) in
nox. Is the cubic term, nox3, statistically significant? Overlay the two estimated
regression functions in a chart like the one in (d). (3 Marks)
f. Do we know the true functional form of price-nox relation in the population?
(1 Mark)
g. If the answer to (f) is ‘No’, are we justified in using a polynomial function in x to
approximate the unknown (potentially non-linear) y-x relation? [Hint: In calculus,
we study functions of single variables y = f(x). There are results in calculus (e.g.,
Taylor theorem or Taylor series expansion) which allow us to approximate
arbitrarily complex non-linear functions, f(x), with a sum of polynomials in x, g(x),
by finding and computing the derivatives of f(x) (e.g., we saw the example of a
linear approximation to the natural-log function in class).] (1 Mark)
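For reference, the Taylor expansion mentioned in the hint, written for a function that is $k$ times differentiable around a point $a$, together with the linear approximation to the natural log. The expansion point $a$ and the order $k$ are notation introduced here.

```latex
\[
f(x) \approx f(a) + f'(a)(x-a) + \frac{f''(a)}{2!}(x-a)^2 + \cdots
             + \frac{f^{(k)}(a)}{k!}(x-a)^k
\]
\[
\log(1+x) \approx x \quad \text{for } x \text{ close to } 0
\]
```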
h. Estimate a tenth-order polynomial in nox and produce a chart showing the estimated
function along with the quadratic function and scatter plot (as done in (d) above).
Describe the second- and tenth-order polynomial function estimates. How and why
are these two regressions different when estimated in the same sample?
(6 Marks)
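One possible way to set up the polynomial regressions and the chart, sketched under the assumption that the powers are stored as new variables nox2 to nox10 and the fitted values as yhat2 and yhat10 (all hypothetical names). With powers this high, Stata may omit some terms because of collinearity.

```stata
use HPRICE2.dta, clear

* Create powers of nox; they are generated in order, so nox2-nox10 is a valid range.
forvalues k = 2/10 {
    generate nox`k' = nox^`k'
}

* Quadratic fit.
regress price nox nox2
predict yhat2

* Tenth-order fit.
regress price nox nox2-nox10
predict yhat10

* Overlay both fitted curves on the scatterplot.
twoway (scatter price nox)            ///
       (line yhat2 nox, sort)         ///
       (line yhat10 nox, sort),       ///
       legend(order(2 "quadratic" 3 "tenth order"))
```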
i. Is there a sense in which you may have “overfit” your sample data by adding too
many non-linear terms in the tenth-order polynomial? (1 Mark)
j. Estimate a regression of price on nox, rooms, dist, crime and stratio. Which of the
explanatory variables has the biggest effect on housing price? Confirm your
answer by computing the standardized beta coefficients. (Hint: use the option
‘beta’ in the regress command) (3 Marks)
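A sketch of the regression with the beta option, followed by an optional cross-check that standardizes every variable and reruns the regression. The z_ prefix is a hypothetical naming choice, and the cross-check assumes no observations are lost to missing values.

```stata
use HPRICE2.dta, clear

* Standardized coefficients appear in the "Beta" column of the output.
regress price nox rooms dist crime stratio, beta

* Cross-check: regress z-scores on z-scores; the slopes should match the betas.
foreach v of varlist price nox rooms dist crime stratio {
    egen z_`v' = std(`v')
}
regress z_price z_nox z_rooms z_dist z_crime z_stratio
```

Each standardized coefficient equals the ordinary coefficient multiplied by the ratio of the sample standard deviation of that regressor to the sample standard deviation of price.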