Chapter 5 - Regression
Sometimes we want to do more than measure the strength of a relationship; we want to use one
variable to help predict (or explain) the other.
Response variable
The variable that measures the outcome of a study or the variable that we would like to
explain or predict.
Plotted on the y-axis.
Explanatory variable
The variable that may help to explain, predict, or influence changes in the response
variable.
Plotted on the x-axis.
Example – For each of the following, identify the response and explanatory variables.
(a) Student volunteers at a university drank different numbers of cans of beer. Thirty
minutes later, a police officer measured their blood alcohol content.
(b) The results of the National Student Loan Survey include data on the amount of debt of
recent graduates, their current income, and how stressed they feel about college debt.
(c) Number of times a student accessed the website for a course = __________
Person            1   2   3   4   5   6   7   8   9  10
Waist (inches)   32  33  33  34  36  38  39  41  41  44
Body Fat (%)      6   6  10  12  16  21  22  27  32  33
(b) Construct a scatterplot of the data. Describe the nature of the relationship, i.e., give the
form, direction, and strength of the relationship.
Regression Line
The line that best* fits the scatterplot (or best models the relationship between 𝑥 and 𝑦).
(* What is meant by best? See the end of these notes.)
Example Continued
(c) Use the information below to calculate the equation of the regression line.
(d) Graph the regression line on your scatterplot. How well do you think the line fits the data
(or models the relationship between 𝑥 and 𝑦)?
(f) Predict the body fat percent for people having waist measurements of 33, 40, and 50
inches.
(g) Would you trust the accuracy of all 3 predictions made in Part (f)? Why or why not?
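The calculations in Parts (c) and (f) can be checked numerically. The sketch below is one way to do it, assuming the usual least-squares formulas (slope = 𝑟·s_y/s_x, intercept = ȳ − slope·x̄). The summary information these notes refer to is not reproduced here, so the code recomputes everything from the table. The names `correlation`, `fit_line`, and `predict` are illustrative only.

```python
from statistics import mean, stdev

# Data from the table above (Persons 1-10).
waist = [32, 33, 33, 34, 36, 38, 39, 41, 41, 44]    # x, inches
body_fat = [6, 6, 10, 12, 16, 21, 22, 27, 32, 33]   # y, percent


def correlation(xs, ys):
    """Pearson correlation r from standardized x and y values."""
    x_bar, y_bar = mean(xs), mean(ys)
    s_x, s_y = stdev(xs), stdev(ys)
    products = [((x - x_bar) / s_x) * ((y - y_bar) / s_y) for x, y in zip(xs, ys)]
    return sum(products) / (len(xs) - 1)


def fit_line(xs, ys):
    """Return (slope, intercept) of the least-squares regression line."""
    slope = correlation(xs, ys) * stdev(ys) / stdev(xs)
    intercept = mean(ys) - slope * mean(xs)
    return slope, intercept


slope, intercept = fit_line(waist, body_fat)


def predict(x):
    """Predicted body fat % for a waist of x inches."""
    return intercept + slope * x


print(f"y-hat = {intercept:.3f} + {slope:.3f} x")
for w in (33, 40, 50):
    # Flag waist values outside the observed range of 32 to 44 inches.
    note = "" if min(waist) <= w <= max(waist) else "  (outside observed range)"
    print(f"waist {w}: predicted body fat {predict(w):.1f}%{note}")
```

The flag on the 50-inch prediction is relevant to Part (g): that waist lies beyond every waist in the data, so the line is being used where no observations support it.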
Residual
The signed vertical distance from the regression line to a point.
A residual represents the prediction error between the observed (or actual) 𝑦 value and
the predicted 𝑦 value.
When a point is
o above the regression line it has a positive residual. The line under-predicts the
actual y value.
o below the regression line it has a negative residual. The line over-predicts the
actual y value.
Residual = (Observed 𝑦 value) – (Predicted 𝑦 value) = 𝑦 − 𝑦̂
Example Continued
(h) Calculate the residuals for Person #2 and Person #9. Is the regression line over-
predicting or under-predicting the actual body fat % in these two cases?
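As a rough check on Part (h), the residual formula can be applied directly to the table. This is a minimal sketch, assuming the line is the least-squares line fitted to the ten data points (slope computed as Sxy/Sxx, which is equivalent to 𝑟·s_y/s_x). The helper name `residual` is hypothetical.

```python
waist = [32, 33, 33, 34, 36, 38, 39, 41, 41, 44]    # x, inches
body_fat = [6, 6, 10, 12, 16, 21, 22, 27, 32, 33]   # y, percent

n = len(waist)
x_bar = sum(waist) / n
y_bar = sum(body_fat) / n

# Least-squares slope and intercept.
s_xy = sum((x - x_bar) * (y - y_bar) for x, y in zip(waist, body_fat))
s_xx = sum((x - x_bar) ** 2 for x in waist)
slope = s_xy / s_xx
intercept = y_bar - slope * x_bar


def residual(person):
    """Observed y minus predicted y for a 1-based person number."""
    x, y = waist[person - 1], body_fat[person - 1]
    return y - (intercept + slope * x)


for person in (2, 9):
    e = residual(person)
    # Positive residual: point above the line, so the line under-predicts.
    verdict = "under-predicts" if e > 0 else "over-predicts"
    print(f"Person {person}: residual = {e:.2f}, line {verdict}")
```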
Coefficient of Determination
Question – Why is there variation among the 𝑦 values, i.e., why do different people have
different body fat %?
There could be many reasons, but we group them into two major categories.
𝑦 values vary because body fat % is related to waist.
(The larger a person’s waist, the larger their body fat % tends to be.)
𝑦 values vary because body fat % is related to other factors.
(This could explain why person 2 and person 3 have the same waist but different
body fat %. Similarly for person 8 and person 9.)
Statisticians measure the percent of variation in the 𝑦 values that is due to each category.
𝑟² =
o the coefficient of determination
o the percent of the variation in the 𝑦 values that is due to the relationship with 𝑥
o the percent of the variation in the 𝑦 values that can be explained by the
regression model
Example Continued
(i) What percent of the variation in the body fat % can be explained by the regression
model? What percent remains unexplained?
Notes
0 ≤ 𝑟² ≤ 1
If our model (that uses 𝑥 to predict 𝑦) is a good model, then “most” of the variation in the
𝑦 values should be explained by 𝑥, i.e., 𝑟² should be “close” to 1.
The closer 𝑟² is to 1, the better the regression line is for modeling the relationship
between 𝑥 and 𝑦 and the better it is for predicting 𝑦 by using 𝑥.
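The "percent of variation explained" idea can be made concrete with the waist and body fat data. The sketch below computes the coefficient of determination two ways: as 1 − (unexplained variation)/(total variation), and as the square of the correlation. For a least-squares line the two agree. The variable names (`sse`, `sst`, `r_squared`) are assumptions made for illustration.

```python
waist = [32, 33, 33, 34, 36, 38, 39, 41, 41, 44]    # x, inches
body_fat = [6, 6, 10, 12, 16, 21, 22, 27, 32, 33]   # y, percent

n = len(waist)
x_bar = sum(waist) / n
y_bar = sum(body_fat) / n

s_xy = sum((x - x_bar) * (y - y_bar) for x, y in zip(waist, body_fat))
s_xx = sum((x - x_bar) ** 2 for x in waist)
slope = s_xy / s_xx
intercept = y_bar - slope * x_bar

# Total variation in y, and the part left over after using the line.
sst = sum((y - y_bar) ** 2 for y in body_fat)
sse = sum((y - (intercept + slope * x)) ** 2 for x, y in zip(waist, body_fat))

r_squared = 1 - sse / sst            # fraction of variation explained
r = s_xy / (s_xx * sst) ** 0.5       # correlation between x and y

print(f"r^2 from variation: {r_squared:.4f}")
print(f"r^2 from squaring r: {r ** 2:.4f}")
print(f"explained: {100 * r_squared:.1f}%, unexplained: {100 * (1 - r_squared):.1f}%")
```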
Example – As soon as a bottle of soda is opened, it begins to lose its carbonation. Fourteen 12-
ounce bottles of cola were obtained, and each was assigned a randomly selected time period (in
hours). Each bottle was opened and allowed to stand at room temperature. The carbonation (y)
in each bottle was measured after the prescribed time period (x). Summaries of the data appear
below.
                 Mean    Std. Dev.
Time (hours)     0.614   0.390
Carbonation      2.671   0.891
Correlation: –0.744
(c) In the context of this problem, give an interpretation of the slope of the regression line.
(e) After sitting for 2 hours, one particular bottle had an actual carbonation of 0.300.
Calculate the residual for the bottle. Would the regression line under-predict or over-
predict the actual amount of carbonation for the bottle?
(f) What percent of the variation in the y values can be explained by the regression model?
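When only summary statistics are available, as in this example, the regression line can still be found. The following sketch assumes the standard formulas slope = 𝑟·s_y/s_x and intercept = ȳ − slope·x̄, which may be written differently on the course formula sheet. It works through the slope in Part (c), the residual in Part (e), and 𝑟² in Part (f). The function name `predict` is illustrative only.

```python
# Summary statistics from the table above.
x_bar, s_x = 0.614, 0.390    # time (hours): mean, standard deviation
y_bar, s_y = 2.671, 0.891    # carbonation: mean, standard deviation
r = -0.744                   # correlation between time and carbonation

slope = r * s_y / s_x
intercept = y_bar - slope * x_bar


def predict(hours):
    """Predicted carbonation after the bottle has stood open for `hours`."""
    return intercept + slope * hours


observed = 0.300                       # actual carbonation of the 2-hour bottle
residual = observed - predict(2)       # observed minus predicted
r_squared = r ** 2

print(f"y-hat = {intercept:.3f} + ({slope:.3f}) x")
print(f"predicted at 2 hours: {predict(2):.3f}, residual: {residual:.3f}")
print(f"variation explained: {100 * r_squared:.1f}%")
```

A negative slope means predicted carbonation falls as the standing time increases, and a negative residual means the line over-predicts that bottle.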
Question: How do we determine whether one line fits a scatterplot better than another?
Answer:
We use the method of least squares to assign a rating, the sum of squared errors (SSE), to each line.
The line that achieves the best rating (i.e., the smallest SSE) is the regression line.
SSE = Σ (residual)² = Σ (𝑦 − 𝑦̂)²
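To see the rating in action, the sketch below scores two lines against the waist and body fat data from earlier in the chapter. One is the least-squares line. The other, 𝑦̂ = −60 + 2.1𝑥, is a hypothetical line drawn "by eye" and is not from the notes; it is included only for comparison. The function name `sse` is likewise an assumption.

```python
waist = [32, 33, 33, 34, 36, 38, 39, 41, 41, 44]    # x, inches
body_fat = [6, 6, 10, 12, 16, 21, 22, 27, 32, 33]   # y, percent


def sse(intercept, slope, xs, ys):
    """Sum of squared residuals for the line y-hat = intercept + slope * x."""
    return sum((y - (intercept + slope * x)) ** 2 for x, y in zip(xs, ys))


# Least-squares line computed from the data.
n = len(waist)
x_bar = sum(waist) / n
y_bar = sum(body_fat) / n
s_xy = sum((x - x_bar) * (y - y_bar) for x, y in zip(waist, body_fat))
s_xx = sum((x - x_bar) ** 2 for x in waist)
ls_slope = s_xy / s_xx
ls_intercept = y_bar - ls_slope * x_bar

sse_ls = sse(ls_intercept, ls_slope, waist, body_fat)

# Hypothetical competing line, chosen for illustration only.
sse_alt = sse(-60.0, 2.1, waist, body_fat)

print(f"SSE, least-squares line: {sse_ls:.2f}")
print(f"SSE, alternative line:   {sse_alt:.2f}")
```

Any other choice of intercept and slope gives an SSE at least as large as that of the least-squares line, which is what "best" means in the definition of the regression line.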