Lab2-Markdown XFL (CLEAN)
For this lab exercise, we will use the Carseats data. Make sure you have loaded all the
required packages and the dataset before starting with the questions.
library(ISLR)
library(MASS)
data("Carseats")
#?Carseats #use this command to learn more about the data
1. Let’s start by looking at the relationships between all the variables. Remember, we use
the pairs() command to view the scatter-plot matrix of all the variables. Can you identify
which variables are factors? (There are 3 factor variables.)
[Scatter-plot matrix from pairs(Carseats) with panels for Sales, CompPrice, Income, Advertising, Population, Price, ShelveLoc, Age, Education, Urban, and US.]
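A minimal sketch of this step; the check of which columns are factors is an addition to the original instructions, not part of the lab:
pairs(Carseats)  # scatter-plot matrix of every variable
# One way to confirm which variables are factors
names(Carseats)[sapply(Carseats, is.factor)]
# Expect: "ShelveLoc" "Urban" "US"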
2. Suppose you are interested in predicting the Sales of child car seats. You identify
Price, Advertising, and Age as the key variables. Conduct a linear regression model
looking at this relationship. In other words, estimate:
Sales = β0 + β1 Price + β2 Advertising + β3 Age + ε
If you coded properly, the coefficient for Price should be: -0.058.
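A minimal sketch of the call that produces the summary below (fit1 is just a placeholder object name):
fit1 <- lm(Sales ~ Price + Advertising + Age, data = Carseats)
summary(fit1)  # full regression output, including the Price coefficient of -0.058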
##
## Call:
## lm(formula = Sales ~ Price + Advertising + Age, data = Carseats)
##
## Residuals:
## Min 1Q Median 3Q Max
## -6.6247 -1.5288 0.0148 1.5220 6.2925
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 16.003472 0.718754 22.266 < 2e-16 ***
## Price -0.058028 0.004827 -12.022 < 2e-16 ***
## Advertising 0.123106 0.017095 7.201 3.02e-12 ***
## Age -0.048846 0.007047 -6.931 1.70e-11 ***
## ---
## Signif. codes: 0 ’***’ 0.001 ’**’ 0.01 ’*’ 0.05 ’.’ 0.1 ’ ’ 1
##
## Residual standard error: 2.269 on 396 degrees of freedom
## Multiple R-squared: 0.3595, Adjusted R-squared: 0.3547
## F-statistic: 74.1 on 3 and 396 DF, p-value: < 2.2e-16
3. Using the plot() command, graph the diagnostic plots for the linear regression you just
fit. Make sure you include par(mfrow=c(2,2)) to ensure that all four plots are ordered
properly. Do you think a linear regression model is adequate for this relationship?
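A sketch of this step, assuming the Q.2 model is stored as fit1 (the placeholder name used in the sketch above):
par(mfrow = c(2, 2))  # arrange the four diagnostic plots in a 2x2 grid
plot(fit1)            # Residuals vs Fitted, Q-Q, Scale-Location, Residuals vs Leverage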
[Diagnostic plots for the Q.2 model: Residuals vs Fitted, Q-Q Residuals, Scale-Location, and Residuals vs Leverage; observations 26, 51, 144, 166, and 353 are labelled.]
4. Are all the variables significant? What does being significant really mean? Write down
the interpretation of the Intercept and Age coefficients.
Yes, all variables are significant (p-values less than 0.05). Being significant means that the
coefficient is statistically different from 0. When Price, Advertising, and Age are all 0, Sales is 16.00 on
average (Intercept). As Age increases by 1, Sales is associated with a decline of about 0.049 on
average (Age coefficient).
5. Just for practice, try to perform the linear regression:
Sales = β0 + β1 Price + ε
Then, using abline(), plot the linear regression line. You should be getting a plot
like this:
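A minimal sketch of this step (fit2 is just a placeholder object name):
fit2 <- lm(Sales ~ Price, data = Carseats)
plot(Carseats$Price, Carseats$Sales, xlab = "Price", ylab = "Sales")
abline(fit2)  # add the fitted regression line to the scatter plot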
[Scatter plot of Sales against Price with the fitted regression line added by abline().]
6. Using the predict() command, try to calculate the confidence and prediction intervals
for the regression model in Q.5 when Price is 30, 60, and 90. If you coded this
properly, you should be getting:
Confidence intervals:
Prediction intervals:
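A sketch of the predict() calls, assuming the Q.5 model is stored as fit2 (the placeholder name used above); the interval values themselves are not reproduced here:
new_prices <- data.frame(Price = c(30, 60, 90))
predict(fit2, newdata = new_prices, interval = "confidence")  # confidence intervals
predict(fit2, newdata = new_prices, interval = "prediction")  # prediction intervals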
7. Let’s try to add a few more parameters to our regression model. Let us estimate this
model:
Sales = β0 + β1 log(Price) + β2 Income + β3 Income² + ε
Try to only output the coefficients this time. The solution is provided below. Make
sure your coefficients match the solution.
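A sketch of one way to fit this model and print only the coefficients (fit3 is a placeholder name; the solution values are not reproduced here):
fit3 <- lm(Sales ~ log(Price) + Income + I(Income^2), data = Carseats)
coef(fit3)  # coefficients only, rather than the full summary()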
8. Now, let’s estimate this model:
Sales = β0 + β1 Price + β2 Income + β3 Urban + β4 (Income × Urban) + ε
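A minimal sketch of the call that produces the output below (fit4 is a placeholder name):
fit4 <- lm(Sales ~ Price + Income * Urban, data = Carseats)
summary(fit4)  # Income * Urban expands to Income + Urban + Income:Urban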
##
## Call:
## lm(formula = Sales ~ Price + Income * Urban, data = Carseats)
##
## Residuals:
## Min 1Q Median 3Q Max
## -6.6964 -1.8640 -0.0938 1.6930 7.5875
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 11.968842 0.839695 14.254 < 2e-16 ***
## Price -0.052415 0.005318 -9.856 < 2e-16 ***
## Income 0.023487 0.007811 3.007 0.00281 **
## UrbanYes 1.082502 0.703561 1.539 0.12470
## Income:UrbanYes -0.015931 0.009546 -1.669 0.09594 .
## ---
## Signif. codes: 0 ’***’ 0.001 ’**’ 0.01 ’*’ 0.05 ’.’ 0.1 ’ ’ 1
##
## Residual standard error: 2.507 on 395 degrees of freedom
## Multiple R-squared: 0.2196, Adjusted R-squared: 0.2117
## F-statistic: 27.79 on 4 and 395 DF, p-value: < 2.2e-16
9. Finally, let’s estimate Sales = β0 + β1 Education + ε. Note, Education is formatted
as a numeric variable. In this model, try to convert this variable to a factor. If done
correctly, your regression output should show coefficients for 8 levels of Education (the
lowest level serves as the baseline absorbed into the intercept).
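A sketch of this step, converting Education to a factor in place so the coefficient names match the output below (fit5 is a placeholder name):
Carseats$Education <- as.factor(Carseats$Education)
fit5 <- lm(Sales ~ Education, data = Carseats)
summary(fit5)  # one dummy coefficient per Education level above the baseline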
##
## Call:
## lm(formula = Sales ~ Education, data = Carseats)
##
## Residuals:
## Min 1Q Median 3Q Max
## -8.1317 -1.9725 -0.0588 1.8270 8.7988
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 8.2458 0.4077 20.223 <2e-16 ***
## Education11 -0.7746 0.5766 -1.343 0.1800
## Education12 -0.7781 0.5737 -1.356 0.1758
## Education13 -1.1935 0.5932 -2.012 0.0449 *
## Education14 -1.1706 0.6048 -1.936 0.0536 .
## Education15 -0.1142 0.6228 -0.183 0.8547
## Education16 -1.0014 0.5797 -1.727 0.0849 .
## Education17 -0.7430 0.5737 -1.295 0.1960
## Education18 -0.9693 0.6048 -1.603 0.1098
## ---
## Signif. codes: 0 ’***’ 0.001 ’**’ 0.01 ’*’ 0.05 ’.’ 0.1 ’ ’ 1
##
## Residual standard error: 2.825 on 391 degrees of freedom
## Multiple R-squared: 0.0195, Adjusted R-squared: -0.0005622
## F-statistic: 0.972 on 8 and 391 DF, p-value: 0.4574