Multiple Regression Project in Health
This exercise will continue using the lung function study that we used in mini-project 1.
The study description will be restated below.
a. (20 points) Recall that a scatter plot can be used to describe the association between
continuous variables, and a boxplot can be used to visualize differences in continuous
data between groups. Please produce a scatter plot or boxplot, as appropriate, and
comment on the relationships observed between:
b. (5 points) Fit a simple linear regression model with log-transformed FEV as
the outcome variable and smoke as the predictor. What do you conclude about the
relationship between smoke and log-transformed FEV?
Note: please include the p-value when commenting on the significance of the relationship.
c. (5 points) Fit a multiple linear regression model with log-transformed FEV as
the outcome variable and height, age and smoke as the predictors. What do you
conclude about the relationship between smoke and log-transformed FEV?
Note: please include the p-value when commenting on the significance of the relationship.
d. (4 points) Test the overall hypothesis that height, age and smoke, considered
together, are significant predictors of log-transformed FEV. State the null and
alternative hypotheses, the p-value and your conclusion.
e. (6 points) Interpret the three estimated slope coefficients based on the multiple
linear regression in part c.
f. (bonus question) (5 points) Compare the results from parts b and c. Comment on the
difference between the two results and explain why this difference occurs.
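The mechanics behind part b can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the FEV and smoke values below are invented toy numbers, not the lung function study data, and the helper name simple_ols is hypothetical. The sketch shows that with a 0/1 predictor the fitted slope equals the difference in mean log(FEV) between the two groups, so exponentiating it gives a ratio of geometric means.

```python
import math

# Hypothetical toy data, NOT from the lung function study:
# FEV in liters and a 0/1 smoking indicator.
fev = [2.1, 2.4, 2.0, 2.8, 3.1, 3.0, 3.4, 2.9]
smoke = [0, 0, 0, 0, 1, 1, 1, 1]

log_fev = [math.log(v) for v in fev]


def simple_ols(x, y):
    """Return (intercept, slope) of the least-squares line y = a + b*x."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    slope = sxy / sxx
    return mean_y - slope * mean_x, slope


intercept, slope = simple_ols(smoke, log_fev)

# With a 0/1 predictor, the slope is the difference in mean log(FEV).
mean_log_0 = sum(v for v, s in zip(log_fev, smoke) if s == 0) / smoke.count(0)
mean_log_1 = sum(v for v, s in zip(log_fev, smoke) if s == 1) / smoke.count(1)

# Back-transforming: exp(slope) is the ratio of geometric mean FEV
# (group 1 over group 0), which can be read as a percent difference.
ratio = math.exp(slope)
percent_difference = 100 * (ratio - 1)

print(f"slope on the log scale: {slope:.4f}")
print(f"ratio of geometric means: {ratio:.3f} ({percent_difference:.1f}% difference)")
```

The same back-transformation, 100 * (exp(b) - 1), applies to each slope in a multiple regression on a log-transformed outcome, read as the percent change per one-unit increase in that predictor with the other predictors held fixed.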
Exercise 2. The effects of lead exposure on neurological and psychological function in
children (40 points total)
A group of children who lived near a lead smelter in El Paso, Texas, were identified and
their blood levels of lead were measured. The response variable is the Wechsler full-scale
IQ score (IQ). There were 46 children who were exposed to lead (LEAD = 1) based on
their blood-lead levels, and 78 children who were not exposed to lead (LEAD = 0). The
following variables were collected:
IQ: IQ score
LEAD: Lead exposure (0=No, 1=Yes)
SEX: Sex (1=MALE, 0=FEMALE)
AGE: Age of children (in years)
a. (4 points) Based on the first two plots in Figure 3, comment on the relationships between
IQ and AGE and between IQ and LEAD (lead exposure).
b. (4 points) Based on the last two plots in Figure 3, do you see any possible interaction
between AGE and LEAD, or between AGE and SEX? Explain why.
c. (6 points) Table 3 provides regression model summaries for the following four models:
(1) IQ as a function of SEX, LEAD, AGE and three interaction terms: AGE and
SEX, AGE and LEAD, SEX and LEAD.
(2) IQ as a function of SEX, LEAD, AGE and two interaction terms: AGE and
SEX, SEX and LEAD.
(3) IQ as a function of SEX, LEAD, AGE and one interaction term: AGE and SEX.
(4) IQ as a function of LEAD and one interaction term: AGE and SEX.
Assume that all model assumptions are satisfied for each of the models. Which model
would you choose, and why? Considering the goal of the study, LEAD will be kept in the
model regardless of its p-value. Can you further simplify the model you chose? If yes,
how? If no, why not?
Hint: When a higher-order term (i.e. an interaction) is included in the model, each lower-order
term (i.e. each individual variable) involved in it must also be included in the model.
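The rule in the hint can be checked mechanically. The following is a minimal sketch, assuming each model is written as a set of terms in which an interaction is a tuple of the variables it involves; the names MODELS and missing_main_effects are hypothetical and serve only this illustration.

```python
# Terms of the four candidate models listed above; an interaction is
# written as a tuple of the variables it involves.
MODELS = {
    1: {"SEX", "LEAD", "AGE", ("AGE", "SEX"), ("AGE", "LEAD"), ("SEX", "LEAD")},
    2: {"SEX", "LEAD", "AGE", ("AGE", "SEX"), ("SEX", "LEAD")},
    3: {"SEX", "LEAD", "AGE", ("AGE", "SEX")},
    4: {"LEAD", ("AGE", "SEX")},
}


def missing_main_effects(terms):
    """Return the main effects required by the interactions but absent."""
    main_effects = {t for t in terms if isinstance(t, str)}
    required = {v for t in terms if isinstance(t, tuple) for v in t}
    return sorted(required - main_effects)


for number, terms in MODELS.items():
    missing = missing_main_effects(terms)
    status = "respects the hierarchy" if not missing else f"is missing {missing}"
    print(f"Model ({number}) {status}")
```

A model for which the helper returns a non-empty list contains an interaction without one of its individual terms, which is the situation the hint warns against.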
Model (1):
Coefficients (a)
B and Std. Error are unstandardized coefficients; Beta is the standardized coefficient; Lower Bound and Upper Bound give the 95% confidence interval for B.
Term        B         Std. Error  Beta    t        Sig.   Lower Bound  Upper Bound
(Constant)  103.506   6.743               15.349   .000   90.151       116.861
sex         -17.828   7.885       -.605   -2.261   .026   -33.444      -2.212
lead        -6.386    8.721       -.215   -.732    .465   -23.657      10.885
age         -1.078    .631        -.265   -1.708   .090   -2.328       .172
AgeBySex    1.892     .754        .644    2.508    .014   .398         3.386
AgeByLead   -.006     .787        -.002   -.008    .994   -1.565       1.553
SexByLead   3.170     5.642       .095    .562     .575   -8.005       14.344
a. Dependent Variable: iq
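When reading these tables it helps to remember that the t column is the coefficient B divided by its standard error. The sketch below recomputes t for Model (1) from the printed B and Std. Error values; because those inputs are already rounded to three decimals, the recomputed values can differ from the printed ones in the last digit. The names MODEL_1 and t_statistic are hypothetical.

```python
# (B, Std. Error, t) as printed in the Model (1) table above.
MODEL_1 = {
    "(Constant)": (103.506, 6.743, 15.349),
    "sex": (-17.828, 7.885, -2.261),
    "lead": (-6.386, 8.721, -0.732),
    "age": (-1.078, 0.631, -1.708),
    "AgeBySex": (1.892, 0.754, 2.508),
    "AgeByLead": (-0.006, 0.787, -0.008),
    "SexByLead": (3.170, 5.642, 0.562),
}


def t_statistic(b, std_error):
    """t statistic for testing that a single coefficient equals zero."""
    return b / std_error


for term, (b, se, t_printed) in MODEL_1.items():
    t = t_statistic(b, se)
    print(f"{term:<11} t = {t:8.3f}  (table: {t_printed:8.3f})")
```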
Model (2):
Coefficients (a)
B and Std. Error are unstandardized coefficients; Beta is the standardized coefficient; Lower Bound and Upper Bound give the 95% confidence interval for B.
Term        B         Std. Error  Beta    t        Sig.   Lower Bound  Upper Bound
(Constant)  103.526   6.209               16.674   .000   91.230       115.821
sex         -17.829   7.850       -.605   -2.271   .025   -33.375      -2.284
lead        -6.444    4.333       -.217   -1.487   .140   -15.024      2.136
age         -1.080    .573        -.265   -1.884   .062   -2.216       .055
AgeBySex    1.892     .751        .644    2.519    .013   .405         3.379
SexByLead   3.179     5.490       .095    .579     .564   -7.693       14.050
a. Dependent Variable: iq
Model (3):
Coefficients (a)
B and Std. Error are unstandardized coefficients; Beta is the standardized coefficient; Lower Bound and Upper Bound give the 95% confidence interval for B.
Term        B         Std. Error  Beta    t        Sig.   Lower Bound  Upper Bound
(Constant)  102.707   6.029               17.036   .000   90.769       114.645
sex         -16.271   7.353       -.552   -2.213   .029   -30.831      -1.710
lead        -4.464    2.653       -.150   -1.682   .095   -9.718       .790
age         -1.064    .571        -.261   -1.864   .065   -2.195       .067
AgeBySex    1.844     .744        .628    2.477    .015   .370         3.318
a. Dependent Variable: iq
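Because Model (3) contains an AGE-by-SEX interaction, its intercept and AGE slope depend on the value of SEX (1 = male, 0 = female), while the LEAD coefficient is the same for both sexes. The sketch below shows the arithmetic of collapsing the fitted model into one equation per sex, using the B values printed in the table above. It illustrates the mechanics only and is not a model answer; the names B, predict_iq and equation_for are hypothetical.

```python
# Unstandardized coefficients (B) as printed in the Model (3) table above.
B = {"const": 102.707, "sex": -16.271, "lead": -4.464,
     "age": -1.064, "age_by_sex": 1.844}


def predict_iq(sex, lead, age):
    """Fitted IQ from Model (3); sex is 1 for male, 0 for female."""
    return (B["const"] + B["sex"] * sex + B["lead"] * lead
            + B["age"] * age + B["age_by_sex"] * age * sex)


def equation_for(sex):
    """Collapse Model (3) to an intercept, LEAD slope and AGE slope for one sex."""
    return {
        "intercept": B["const"] + B["sex"] * sex,
        "lead": B["lead"],
        "age": B["age"] + B["age_by_sex"] * sex,
    }


for label, sex in (("female", 0), ("male", 1)):
    eq = equation_for(sex)
    print(f"{label}: IQ = {eq['intercept']:.3f} "
          f"{eq['lead']:+.3f}*LEAD {eq['age']:+.3f}*AGE")
```

A call such as predict_iq(0, 1, 8) evaluates the fitted equation for an 8-year-old girl who was exposed to lead.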
Model (4):
Model Summary
d. (3 points) Based on model (3), write the estimated regression equation for males.
e. (4 points) Use the estimated regression equation from part (d) to interpret the effects of
LEAD and AGE for males.
f. (3 points) Based on model (3), write the estimated regression equation for females.
g. (4 points) Use the estimated regression equation from part (f) to interpret the
effects of LEAD and AGE for females.
h. (3 points) Does the effect of lead differ between males and females? Explain why this
is the case.
i. (4 points) Based on the estimated regression equation from part (f), predict the IQ score
of an 8-year-old girl who was exposed to lead.
j. (5 points) For model (2), is there evidence of a significant interaction between AGE and
SEX? State the null and alternative hypotheses, p-value and