
Multiple Linear Regression Part 1

This document summarizes Lecture 7 of the course STAT 3022 Data Analysis. It begins with a recap of simple linear regression and introduces multiple linear regression and polynomial regression. Multiple linear regression adds explanatory variables to produce a better-fitting model that explains more of the variation in the response variable. Polynomial regression is a special case in which polynomial terms of existing predictors are added rather than new predictors. An example using food rating and decor rating to predict dinner price at Italian restaurants in New York City illustrates how to interpret coefficients in a multiple linear regression model. The document concludes with a preview of Assignment 3, which involves hypothesis testing and confidence intervals for slopes in simple linear regression.

Uploaded by

Adesola Adepoju

STAT 3022 DATA ANALYSIS

Lecture #7
Section 3.1: Motivation for Multiple Linear Regression

Course Name: Data Analysis – STAT3022
Course Type: Lecture Notes
Description of Material: (7) Motivation for Multiple Linear Regression
Term/Year Administered: Spring 2022
Instructor: Kazeem Adepoju, Ph.D.
E-mail: kadepoju@umn.edu
Educational Institution: University of Minnesota, Twin Cities
Outline
- Recap of simple linear regression
- Why add additional terms?
- The concept of polynomial regression
- A typical multiple linear regression model
- Preview of Assignment 3
Recap of Simple Linear Regression
New notation:
Sometimes we refer to a regression model in the following format: Y ~ 1 + X
For example: daughterHeight ~ 1 + motherHeight

Implications:
• For each observation: Y = β0 + β1·x + e
• Population model: E(Y | X = x) = β0 + β1·x
• Fitted model: ŷ = β̂0 + β̂1·x

What is the "1"? It stands for the intercept term: E(Y | X = x) = β0×1 + β1·x, and likewise ŷ = β̂0×1 + β̂1·x.

Use the lm() function to obtain the fitted model in R. These two lines of code are equivalent, because R assumes you want an intercept regardless of whether you write the 1 in your formula:

mod = lm(daughterHeight ~ motherHeight, data = Heights)
mod = lm(daughterHeight ~ 1 + motherHeight, data = Heights)
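What lm() computes can be sketched as ordinary least squares on a design matrix that contains the explicit column of ones. Below is a minimal illustration in Python with made-up heights (not the lecture's Heights data); the names and numbers are assumptions for the sketch:

```python
import numpy as np

# Hypothetical mother/daughter heights in inches (illustration only,
# not the Heights data used in the lecture).
mother = np.array([62.0, 63.5, 64.0, 65.5, 66.0, 68.0])
daughter = np.array([63.0, 63.0, 65.0, 66.0, 65.5, 67.5])

# The "1" in Y ~ 1 + X corresponds to an explicit column of ones
# (the intercept term) in the design matrix.
X = np.column_stack([np.ones_like(mother), mother])

# Least-squares estimates (beta0_hat, beta1_hat)
beta_hat, *_ = np.linalg.lstsq(X, daughter, rcond=None)
beta0_hat, beta1_hat = beta_hat
print(beta0_hat, beta1_hat)
```

The column of ones is exactly what R adds automatically, which is why writing the 1 in the formula is optional.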
Counting predictor variables and terms in each model:

Simple linear regression:
Model 1 (1 predictor variable, 2 terms): daughterHeight ~ 1 + motherHeight

Multiple linear regression:
Model 2 (2 predictor variables, 3 terms): daughterHeight ~ 1 + motherHeight + fatherHeight
Model 3 (1 predictor variable, 3 terms): daughterHeight ~ 1 + motherHeight + motherHeight²
Why add additional terms?
To provide a model with a better fit (a much larger R²).

● Polynomial regression: adding polynomial terms of an existing predictor variable, but not new predictor variables.
○ Reason: the response variable may not be linearly related to an explanatory variable.

● Adding other explanatory variables.
○ Reason: the response variable may relate to more than one explanatory variable.

In either case, the model becomes a multiple linear regression once the number of terms reaches 3 (e.g., 1, X1, and X2).
Polynomial Regression
Polynomial regression is a special case of multiple linear regression.

Example: modeling salary (in thousands of dollars) from years of experience for professional workers.

Fitted simple linear model: Salary = 48.5 + 0.9·Experience

Violation of the linearity assumption: the simple linear regression model Salary ~ 1 + Experience cannot capture the trend (curvature) of the data.
Example (continued): what if we add a term Experience²?

Adding the term Experience² helps the model capture the curvature in the data.
Fitting polynomial regression in R:
Example: modeling salary from years of experience for professional workers.

mod2 = lm(Salary ~ 1 + Experience + I(Experience^2), data = dat)
summary(mod2)

What does the I() function do? It protects the expression Experience^2 so that ^ is treated as arithmetic squaring rather than formula syntax.

Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 34.720498 0.828724 41.90 <2e-16 ***
Experience 2.872275 0.095697 30.01 <2e-16 ***
I(Experience^2) -0.053316 0.002477 -21.53 <2e-16 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 2.817 on 140 degrees of freedom


Multiple R-squared: 0.9247, Adjusted R-squared: 0.9236
F-statistic: 859.3 on 2 and 140 DF, p-value: < 2.2e-16

Write down the fitted model: Salary = 34.72 + 2.87·Experience − 0.053·Experience².

How much of the variation in salary is explained by mod2? About 92.5% (Multiple R-squared = 0.9247).


Example: Modeling salary from years of experience for professional workers.
What happens when we omit the I( ) function?
mod3 = lm(Salary ~ 1 + Experience + Experience^2, data = dat)
summary(mod3)

Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 48.50593 1.08810 44.58 <2e-16 ***
Experience 0.88345 0.05158 17.13 <2e-16 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 5.828 on 141 degrees of freedom

Multiple R-squared: 0.6754, Adjusted R-squared: 0.6731
F-statistic: 293.3 on 1 and 141 DF, p-value: < 2.2e-16

Without I(), the ^ operator is interpreted as formula syntax rather than arithmetic, so mod3 is equivalent to fitting a model without the Experience² term.


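The role of I(Experience^2) — building the squared column before fitting — has a direct analogue in any least-squares setup: construct the design matrix [1, x, x²] yourself. A minimal Python sketch with simulated data (not the lecture's dataset; the true coefficients below are assumptions chosen to resemble the slides):

```python
import numpy as np

# Simulated salary data with curvature (illustration only).
# True relationship: 35 + 2.9x - 0.05x^2, plus noise.
rng = np.random.default_rng(0)
experience = np.linspace(0, 30, 60)
salary = 35 + 2.9 * experience - 0.05 * experience**2 + rng.normal(0, 1, 60)

# Analogue of I(Experience^2) in R: build the squared column explicitly,
# so the design matrix has three terms: 1, x, and x^2.
X = np.column_stack([np.ones_like(experience), experience, experience**2])
beta_hat, *_ = np.linalg.lstsq(X, salary, rcond=None)
print(beta_hat)  # roughly (35, 2.9, -0.05): close to the true coefficients
```

Note that the model is still linear in the coefficients; only the columns of the design matrix are nonlinear in Experience, which is why polynomial regression is a special case of multiple linear regression.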
How do we interpret the coefficients in polynomial regression?
Example: modeling salary from years of experience for professional workers.

Interpret the intercept:

• The average salary is about $34,720 (the intercept of mod2) for workers with no experience.

Interpret the slopes of Experience and Experience²:

• In polynomial regression, it is usually not meaningful to interpret the slopes individually in context, since Experience and Experience² cannot change independently of each other.
A typical multiple linear regression model
Multiple linear regression: adding new predictor variables
Example: Italian restaurants in New York City
Using survey results from customers of 168 Italian restaurants in New York City, we will build a regression model that uses restaurant attributes (such as food rating) to predict the price of a typical dinner.

• Price = the price (in $US) of dinner (including one drink & a tip)
• Food = customer rating of the food (out of 30)
• Décor = customer rating of the decoration (out of 30)

We will consider three models:

Model A: Price ~ 1 + Food
Model B: Price ~ 1 + Décor
Model C: Price ~ 1 + Food + Décor
Example: Italian restaurants in New York City
Scatter plot of Price vs. Food

Model A: Price = −17.83 + 2.94·Food, R² = 0.39

Scatter plot of Price vs. Decor

Model B: Price = −1.36 + 2.49·Decor, R² = 0.52
Example: Italian restaurants in New York City
Model C: Price ~ 1 + Food + Décor

modC = lm(Price ~ 1 + Food + Decor, data = italian)
summary(modC)

Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) -24.5002 4.7230 -5.187 6.19e-07 ***
Food 1.6461 0.2615 6.294 2.68e-09 ***
Decor 1.8820 0.1919 9.810 < 2e-16 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 5.788 on 165 degrees of freedom


Multiple R-squared: 0.6167, Adjusted R-squared: 0.6121
F-statistic: 132.7 on 2 and 165 DF, p-value: < 2.2e-16

Write down the fitted model: Price = −24.50 + 1.65·Food + 1.88·Decor.

How much of the variation in Price is explained by modC? About 61.7% (Multiple R-squared = 0.6167).


Multiple linear regression: interpret model coefficients
Example: Italian restaurants in New York City
Price = −24.50 + 1.65·Food + 1.88·Decor

Interpret the intercept:

Technical interpretation: when the food rating and the decoration rating are both 0, the price of the dinner is −$24.50 on average.

Does this interpretation make sense in the context? No: ratings of 0 fall outside the range of the data, and a negative price is impossible, so the intercept has no practical interpretation here.
Multiple linear regression: interpret model coefficients
Example: Italian restaurants in New York City
Price = −24.50 + 1.65·Food + 1.88·Decor

Interpret the slope of Food in the context:

• Holding Decor constant, when the food rating increases by 1, the price of the dinner increases by $1.65 on average.

Based on Model C, what is the predicted price difference between Restaurant (1) and Restaurant (2)?

                Food  Decor
Restaurant (1)   15    20     Price1 = −24.50 + 1.65×15 + 1.88×20
Restaurant (2)   16    20     Price2 = −24.50 + 1.65×16 + 1.88×20

Price2 − Price1 = 1.65 = the estimated slope of Food

When you do not hold the other variable (Decor) constant:
When the food rating increases by 1, the price increases by $2.94 on average.
Model A: Price = −17.83 + 2.94·Food.
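The arithmetic for the two restaurants can be checked directly. This small Python snippet plugs both restaurants into the Model C coefficients from the slides:

```python
# Predicted prices from Model C for the two restaurants in the table;
# the coefficients are the fitted values reported in the slides.
def predict_price(food, decor):
    return -24.50 + 1.65 * food + 1.88 * decor

p1 = predict_price(15, 20)  # Restaurant (1)
p2 = predict_price(16, 20)  # Restaurant (2): Food one point higher, Decor fixed
print(p2 - p1)  # 1.65 (up to floating-point rounding): the slope of Food
```

Because Decor is held fixed at 20, the only thing that changes between the two predictions is the one-point increase in Food, so the difference is exactly the Food coefficient.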
Multiple linear regression: interpret model coefficients
Example: Italian restaurants in New York City
Price = −24.50 + 1.65·Food + 1.88·Decor

Interpret the slope of Decor:

• Holding Food constant, when the decoration rating increases by 1, the price of the dinner increases by $1.88 on average.

When you do not hold the other variable (Food) constant:
When the decoration rating increases by 1, the price increases by $2.49 on average.
Model B: Price = −1.36 + 2.49·Decor.


Preview of Assignment 3

2.12 Inference for slope, again. A regression model was fit to 82 data cases and the resulting sample slope, β̂1, was 5.3, with a standard error of 2.8. Assume that the conditions for a simple linear model, including normality, are reasonable for this situation.
a. Test the hypothesis that β1 = 0.
b. Construct a 95% confidence interval for β1.
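One way to carry out parts (a) and (b) from the reported summary statistics, sketched in Python (assuming scipy is available; the numbers come straight from the exercise):

```python
from scipy import stats

n, slope, se = 82, 5.3, 2.8   # values given in Exercise 2.12
df = n - 2                    # simple linear regression uses n - 2 df

# (a) two-sided t-test of H0: beta1 = 0
t_stat = slope / se
p_value = 2 * stats.t.sf(abs(t_stat), df)

# (b) 95% confidence interval for beta1
t_crit = stats.t.ppf(0.975, df)
ci = (slope - t_crit * se, slope + t_crit * se)
print(t_stat, p_value, ci)
```

Since the interval contains 0, the test fails to reject H0 at the 5% level, which matches the p-value being above 0.05.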
Assignment 3 Preview Cont'd
Racial animus. Professor Seth Stephens-Davidowitz studies the level of racial animus across
different areas in America by measuring the percent of Google search queries that include
racially charged language. In his work, he finds the percentage of Google searches that
included the “n” word (with or without an s at the end). He then applies a sophisticated
algorithm, but for our purposes we will work with a measurement, Animus, that is derived
from his numbers and is scaled to be between 0 (low racial animus) and 250 (high racial
animus). The file RacialAnimus has data for each of the 196 media markets (we can think of
a media market as a city and the surrounding area) on Animus and other variables. We wish
to explore a linear model using Animus as the response variable. One variable of interest is
ObamaKerry, which is the percentage of the vote that Barack Obama won in 2008 minus the
percentage of the vote that John Kerry won in 2004.
a. Produce a scatterplot of the data. Does there appear to be a linear relationship? Explain.
b. Test the hypothesis that there is a linear relationship between ObamaKerry and Animus.
Does this test confirm your suspicions from part (a)? Explain.
Assignment 3 Preview Cont'd
2.16 Textbook prices. Exercise 1.44 examined data on the price and
number of pages for a random sample of 30 textbooks from the Cal
Poly campus bookstore. The data are stored in the file TextPrices and
appear in Table 1.5.
a. Perform a significance test to address the students’ question of
whether the number of pages is a useful predictor of a textbook’s
price. Report the hypotheses, test statistic, and P-value, along with
your conclusion.
b. Find and interpret a 95% confidence interval for the slope of this
regression model. Also explain what this slope coefficient means in
the context of these data.
Assignment 3 Preview Cont'd
2.24 Metabolic rates. Use the file MetabolicRate to examine the
linear relationship between the log (base 10) of metabolic rate and
log (base 10) of body size for a sample of caterpillars.
a. Fit a least squares regression line for predicting LogMrate from
LogBodySize. What is the equation of the regression line?
b. Is the slope parameter significantly different from zero? Justify
your answer.
c. Find and interpret the ANOVA table for partitioning the variability
in the transformed metabolic rates.
d. Calculate the ratio of the model sum of squares and the total sum
of squares. Provide an interpretation for this statistic.
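Part (d) of 2.24 asks for the ratio of the model sum of squares to the total sum of squares. On any dataset that ratio reproduces R², as this Python sketch with made-up numbers (not the MetabolicRate data) illustrates:

```python
import numpy as np

# Hypothetical nearly linear data (illustration only).
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.1, 3.9, 6.2, 8.1, 9.8])

# Fit the least-squares line and compute fitted values.
X = np.column_stack([np.ones_like(x), x])
beta, *_ = np.linalg.lstsq(X, y, rcond=None)
fitted = X @ beta

# SSModel / SSTotal is the fraction of variability explained: R^2.
ss_total = np.sum((y - y.mean())**2)
ss_model = np.sum((fitted - y.mean())**2)
r_squared = ss_model / ss_total
print(r_squared)  # close to 1 for this nearly linear data
```

This is the same quantity that summary() reports as Multiple R-squared, so the interpretation is "the proportion of variation in the response explained by the model."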
Assignment 3 Preview Cont'd

2.42 Dementia: regression intervals. Use the data in the file LewyDLBad to continue the analysis of the relationship between MMSE and APC from Exercise 2.17.
a. What would you predict the MMSE score to be for someone with an APC measurement of ___?
b. Provide a 95% confidence interval for mean MMSE when APC is ___.
c. Provide a 95% prediction interval for MMSE when APC is ___.
d. Tania wants to know why the interval from part (c) is wider than the interval from part (b). She doesn't want to see a bunch of mathematics; she wants to understand the reason. What would you say to her?
