Interactive Lecture Notes 12-Regression Analysis
Describing and assessing the significance of relationships between variables is very important
in research. We will first learn how to do this in the case when the two variables are
quantitative. Quantitative variables have numerical values that can be ordered according to
those values.
Main idea
We wish to study the relationship between two quantitative variables.
The first step in examining the relationship is to use a graph, a scatterplot, to display the
relationship. We will look for an overall pattern and see if there are any departures from this
overall pattern.
If a linear relationship appears to be reasonable from the scatterplot, we will take the next step
of finding a model (an equation of a line) to summarize the relationship. The resulting equation
may be used for predicting the response for various values of the explanatory variable. If
certain assumptions hold, we can assess the significance of the linear relationship and make
some confidence intervals for our estimations and predictions.
Let's begin with an example that we will carry throughout our discussions.
Graphing the Relationship: Restaurant Bill vs Tip
How well does the size of a restaurant bill predict the tip the server receives? Below are the
bills and tips from six different restaurant visits in dollars.
Bill ($)  41  98  25  85  50  73
Tip ($)    8  17   4  12   5  14
Response (dependent) variable y = tip amount (in dollars).
Explanatory (independent) variable x = bill amount (in dollars).
Step 1: Examine the data graphically with a scatterplot.
Add the points to the scatterplot below:
[Blank scatterplot for plotting tip (y) against bill (x)]
The least squares regression line will be written as ŷ = b0 + b1x, where b0 is the y-intercept and b1 is the slope.
Goal:
To find a line that is “close” to the data points, that is, find the “best fitting” line.
How?
What do we mean by best?
One measure of how well a line fits is to look at the
“observed errors” in prediction.
Observed errors = observed y − predicted y = y − ŷ,
and these observed errors are called residuals.
The calculation table has column totals Σx = 372 and Σy = 60, so the sample means are
x̄ = 372/6 = 62 and ȳ = 60/6 = 10.
Slope Estimate: b1 = Σ(x − x̄)(y − ȳ) / Σ(x − x̄)² = 666/3900 = 0.17077
y-intercept Estimate: b0 = ȳ − b1x̄ = 10 − 0.17077(62) = −0.5877
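A minimal R sketch of this hand computation, using the bill and tip data above (the variable names here are ours):

x <- c(41, 98, 25, 85, 50, 73)   # bills
y <- c(8, 17, 4, 12, 5, 14)      # tips
Sxy <- sum((x - mean(x)) * (y - mean(y)))   # 666
Sxx <- sum((x - mean(x))^2)                 # 3900
b1 <- Sxy / Sxx                  # slope estimate: 0.17077
b0 <- mean(y) - b1 * mean(x)     # intercept estimate: -0.58769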
Note: The 5th dinner guest in the sample had a bill of $50 and the observed tip was $5.
The residual for this guest is e = y − ŷ = 5 − (−0.5877 + 0.17077(50)) = 5 − 7.95 = −2.95 dollars.
You found the residual for one observation. You could compute the residual for each
observation. The following table shows each residual.
ŷ = −0.5877 + 0.17077x

x = bill   y = tip   ŷ (predicted value)   e = y − ŷ (residual)   e² (squared residual)
41         8         6.41                  1.59                   2.52
98         17        16.15                 0.85                   0.72
25         4         3.68                  0.32                   0.10
85         12        13.93                 −1.93                  3.73
50         5         7.95                  −2.95                  8.70
73         14        11.88                 2.12                   4.49
Totals:                                    ≈ 0                    20.26
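The residual column and its totals can be checked in R (continuing the sketch above):

yhat <- b0 + b1 * x   # predicted tips
e <- y - yhat         # residuals
sum(e)                # approximately 0 (a property of least squares)
sum(e^2)              # sum of squared residuals, about 20.27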
Some pictures: [scatterplots illustrating linear relationships of various directions and strengths, omitted]
Tips Example:
r = 0.921
Interpretation: There is a strong positive linear association between the size of the bill and the size of the tip.
The square of the correlation, r², is
sometimes presented as a percent. It can be shown that the square of the correlation is related
to the sums of squares that arise in regression.
The responses (the amounts of tip) in the data set are not all the same; they do vary. We
would measure the total variation in these responses as SSTO = Σ(y − ȳ)² (the last column
total in the calculation table that we said we would use later).
Part of the reason why the amount of tip varies is that there is a linear relationship
between amount of tip and amount of bill, and the study included different amounts of bill.
When we found the least squares regression line, there was still some small variation remaining
in the responses around the line. This amount of variation that is not accounted for by the linear
relationship is called the SSE (sum of squared errors).
The amount of variation that is accounted for by the linear relationship is called the sum of
squares due to the model (or regression), denoted by SSM (or sometimes as SSR).
So we have: SSTO = SSM + SSE
It can be shown that
r² = SSM/SSTO = (SSTO − SSE)/SSTO
= the proportion of total variability in the responses that can be explained by the linear
relationship with the explanatory variable x.
Note: The value of r² and these sums of squares are summarized in an ANOVA table that is
standard output from computer packages when doing regression.
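As a check on these ideas with the tips data, the sums of squares can be computed directly in R (continuing the sketch above; the values match the ANOVA output shown later):

SSTO <- sum((y - mean(y))^2)   # total variation: 134
SSE  <- sum(e^2)               # unexplained variation: 20.268
r2   <- 1 - SSE / SSTO         # 0.84875, about 84.9%
cor(x, y)^2                    # the squared correlation, same value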
Measuring Strength and Direction for Exam 2 vs Final
SSTO =
SSE =
So the squared correlation coefficient for our exam scores regression is:
r² = (SSTO − SSE)/SSTO =
Interpretation:
We accounted for ______% of the variation in ______.
Nonlinear relationships
Detecting Outliers and their influence on regression results.
Dangers of Extrapolation (predicting outside the range of your data)
Dangers of combining groups inappropriately (Simpson’s Paradox)
Correlation does not prove causation
R Regression Analysis for Bill vs Tips
Let’s look at the R output for our Bill and Tip data.
We will see that many of the computations are done for us.
Call:
lm(formula = Tip ~ Bill, data = Tips)
Residuals:
1 2 3 4 5 6
1.5862 0.8523 0.3185 -1.9277 -2.9508 2.1215
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) -0.58769 2.41633 -0.243 0.81980
Bill 0.17077 0.03604 4.738 0.00905 **
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Correlation “Matrix”
Bill Tip
Bill 1.0000000 0.9212755
Tip 0.9212755 1.0000000
ANOVA Table
Response: Tip
          Df  Sum Sq Mean Sq F value   Pr(>F)
Bill       1 113.732 113.732  22.446 0.009052 **
Residuals 4 20.268 5.067
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
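All of the output above can be reproduced with a few lines of R. This is a sketch; the lm() call in the output implies the data sit in a data frame named Tips with columns Bill and Tip:

Tips <- data.frame(Bill = c(41, 98, 25, 85, 50, 73),
                   Tip  = c(8, 17, 4, 12, 5, 14))
fit <- lm(Tip ~ Bill, data = Tips)
summary(fit)   # coefficients, standard errors, t-tests
cor(Tips)      # the correlation "matrix"
anova(fit)     # the ANOVA table with the sums of squares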
Inference in Linear Regression Analysis
The material covered so far focuses on using the data for a sample to graph and describe the
relationship. The slope and intercept values we have computed are statistics; they are
estimates of the underlying true relationship for the larger population.
Next we turn to making inferences about the relationship for the larger population. Here is a
nice summary to help us distinguish between the regression line for the sample and the
regression line for the population.
To do formal inference, we think of our b0 and b1 as estimates of the unknown parameters β0
and β1. Below we have the somewhat statistical way of expressing the underlying model that
produces our data:

y = β0 + β1x + ε, where ε is N(0, σ)
This statistical model for simple linear regression assumes that for each value of x, the observed
values of the response (the population of y values) are normally distributed, varying around
some true mean (that may depend on x in a linear way) with a standard deviation σ that does
not depend on x. This true mean is sometimes expressed as E(Y) = β0 + β1x. The
components and assumptions regarding this statistical model are shown visually below.
[Figure: normal curves with common standard deviation σ centered on the true regression line E(Y) = β0 + β1x, plotted against x]
The ε represents the true error term. These would be the deviations of a particular value of the
response y from the true regression line. As these are the deviations from the mean, these
error terms should have a normal distribution with mean 0 and constant standard deviation σ.
Now, we cannot observe these ε's. However, we will be able to use the estimated (observable)
errors, namely the residuals, to come up with an estimate of the standard deviation σ and to
check the conditions about the true errors.
So what have we done, and where are we going?
1. Estimate the regression line based on some data. DONE!
2. Measure the strength of the linear relationship with the correlation. DONE!
3. Use the estimated equation for predictions. DONE!
4. Assess if the linear relationship is statistically significant.
5. Provide interval estimates (confidence intervals) for our predictions.
6. Understand and check the assumptions of our model.
We have already discussed the descriptive goals of 1, 2, and 3. For the inferential goals of 4 and
5, we will need an estimate of the unknown standard deviation σ in regression:

s = √(SSE/(n − 2)) = √(Σ(y − ŷ)²/(n − 2))
Note: Why n − 2? We lose two degrees of freedom because two parameters, b0 and b1, had to be estimated from the data before the residuals could be computed.
From Summary: Residual standard error: 2.251 on 4 degrees of freedom.
Or from ANOVA: s = √MSE = √5.067 ≈ 2.251.
Response: Tip
Df Sum Sq Mean Sq F value Pr(>F)
Bill 1 113.732 113.732 22.446 0.009052 **
Significant Linear Relationship?
Consider the following hypotheses: H0: β1 = 0 versus Ha: β1 ≠ 0.
What happens if the null hypothesis is true? Then β1 = 0, the true mean response E(Y) = β0 is the same for every x, and x is of no use in predicting y.
There are a number of ways to test this hypothesis. One way is through a t-test statistic (think
about why it is a t and not a z test). The general form for a t test statistic is:

t = (sample estimate − null value) / standard error of the estimate

We have our sample estimate for β1: it is b1. And we have the null value of 0. So we need the
standard error for b1. We could “derive” it, using the idea of sampling distributions (think about
the population of all possible b1 values if we were to repeat this procedure over and over many
times). Here is the result:

s.e.(b1) = s / √Σ(x − x̄)²
This t-statistic could be modified to test a variety of hypotheses about the population slope (different null values for β1).
Try It!
Significant Relationship between Bill and Tip?
Is there a significant (non-zero) linear relationship between the total cost of a restaurant
bill and the tip that is left? (Is the bill a useful linear predictor for the tip?)
That is, test H0: β1 = 0 versus Ha: β1 ≠ 0 using a 5% level of significance.
t = b1 / s.e.(b1) = 0.17077/0.03604 = 4.738 with df = n − 2 = 4, giving a p-value of 0.00905.
Since 0.00905 < 0.05, we reject H0 and conclude there is a significant (non-zero) linear relationship between bill and tip.
Think about it:
Based on the results of the previous t-test conducted at the 5% significance level, do you think a
95% confidence interval for the true slope β1 would contain the value of 0?
(No: because H0: β1 = 0 was rejected at the 5% level, the 95% confidence interval for β1 will not contain 0.)
Coefficients:
             Estimate Std. Error t value Pr(>|t|)
(Intercept) -0.58769    2.41633  -0.243  0.81980
Bill         0.17077    0.03604   4.738  0.00905 **
Response: Tip
Df Sum Sq Mean Sq F value Pr(>F)
Bill 1 113.732 113.732 22.446 0.009052 **
Residuals 4 20.268 5.067
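R can answer the confidence interval question directly; confint() computes b1 ± t* s.e.(b1) with df = n − 2 = 4 (continuing with the fitted model fit from the sketch above):

confint(fit, level = 0.95)
# the interval for Bill is 0.17077 +/- 2.776(0.03604), about (0.071, 0.271),
# which does not contain 0, agreeing with the rejected t-test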
Predicting for Individuals versus Estimating the Mean
Consider the relationship between the bill and tip …
Least squares regression line (or estimated regression function): ŷ = −0.5877 + 0.17077x
We also have: s = 2.251
How would you predict the tip for Barb, who had a $50 restaurant bill? ŷ = −0.5877 + 0.17077(50) = $7.95.
How would you estimate the mean tip for all customers who had a $50 restaurant bill? With the same value: ŷ = −0.5877 + 0.17077(50) = $7.95.
So our estimate for predicting a future observation and for estimating the mean response are
found using the same least squares regression equation. What about their standard errors?
(We would need the standard errors to be able to produce an interval estimate.)
Construct a 95% prediction interval for the tip from an individual customer who had a $50 bill (x = 50):
ŷ ± t* s.e.(pred) = 7.95 ± 2.776 √(s² + [s.e.(fit)]²) = 7.95 ± 2.776(2.47) ≈ 7.95 ± 6.86, or about ($1.09, $14.81).
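Both interval estimates come from predict() in R (a sketch, continuing with fit from above):

new <- data.frame(Bill = 50)
predict(fit, new, interval = "prediction")  # PI for one customer's tip
predict(fit, new, interval = "confidence")  # CI for the mean tip at a $50 bill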
Checking Assumptions in Regression
Let’s recall the statistical way of expressing the underlying model that produces our data:
yi = β0 + β1xi + εi, where the εi are independent N(0, σ).
Thus there are four essential technical assumptions required for inference in linear regression:
(1) the mean of y is a linear function of x; (2) the errors are independent; (3) the errors are
normally distributed; and (4) the errors have the same standard deviation σ for every x.
Now, we cannot observe these ε's. However, we will be able to use the estimated (observable)
errors, namely the residuals, to come up with an estimate of the standard deviation σ and to
check the conditions about the true errors.
So how can we check these assumptions with our data and estimated model?
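A minimal sketch of the usual residual checks in R (continuing with fit from above):

plot(fitted(fit), resid(fit))   # want random scatter around 0 with constant spread
abline(h = 0)
qqnorm(resid(fit))              # a roughly straight line suggests normal errors
qqline(resid(fit))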
Now, if we saw …
Let's turn to one last full regression problem
that includes checking assumptions.
            mean        sd   n
foot    27.78125  1.549701  32
height  71.68750  3.057909  32
Call:
lm(formula = foot ~ height, data = heightfoot)
Residuals:
     Min       1Q   Median       3Q      Max
-1.74925 -0.81825  0.07875  0.58075  2.25075

Coefficients:
            Estimate Std. Error t value Pr(>|t|)
(Intercept)  0.25313    4.33232   0.058    0.954
height       0.38400    0.06038   6.360 5.12e-07 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Correlation Matrix
            foot    height
foot   1.0000000 0.7577219
height 0.7577219 1.0000000
Analysis of Variance Table
Response: foot
           Df Sum Sq Mean Sq F value    Pr(>F)
height      1 42.744  42.744  40.446 5.124e-07 ***
Residuals  30 31.705   1.057
c. Give the equation of the least squares regression line for predicting foot length from height.
ŷ = 0.25313 + 0.38400x, where x = height (inches) and ŷ = predicted foot length (cm).
d. Suppose Max is 70 inches tall and has a foot length of 28.5 centimeters. Based on the least
squares regression line, what is the value of the prediction error (residual) for Max? Show
all work.
ŷ = 0.25313 + 0.38400(70) = 27.133 cm, so the residual is e = y − ŷ = 28.5 − 27.133 = 1.367 cm.
Conclusion:
f. Calculate a 95% confidence interval for the average foot length for all college men who are
70 inches tall. (Just clearly plug in all numerical values.)
ŷ ± t* s.e.(fit) = 27.133 ± 2.042 × 1.028 × √(1/32 + (70 − 71.6875)²/(31 × 3.057909²)),
where s = √1.057 ≈ 1.028 and t* = 2.042 for df = 30.
Does this plot support the conclusion that the linear regression model is appropriate?
Yes No
Explain:
Regression

Linear Regression Model
  Population version:
    Mean: E(Y) = β0 + β1x
    Individual: yi = β0 + β1xi + εi, where εi is N(0, σ)
  Sample version:
    Mean: ŷ = b0 + b1x
    Individual: yi = b0 + b1xi + ei

Parameter Estimators
  b1 = SXY/SXX = Σ(x − x̄)(y − ȳ) / Σ(x − x̄)²
  b0 = ȳ − b1x̄

Residuals
  e = y − ŷ = observed y − predicted y

Standard Error of the Sample Slope
  s.e.(b1) = s/√SXX = s/√Σ(x − x̄)²

Confidence Interval for β1
  b1 ± t* s.e.(b1), df = n − 2

t-Test for β1
  To test H0: β1 = 0, use t = b1/s.e.(b1), df = n − 2
  (or F = MSREG/MSE, df = 1, n − 2)

Confidence Interval for the Mean Response
  ŷ ± t* s.e.(fit), df = n − 2
  where s.e.(fit) = s √(1/n + (x − x̄)²/SXX)

Prediction Interval for an Individual Response
  ŷ ± t* s.e.(pred), df = n − 2
  where s.e.(pred) = √(s² + [s.e.(fit)]²)