Chapter 8
Camm, Cochran, Fry, Ohlmann, Business Analytics, 5 th Edition. © 2024 Cengage Group. All Rights Reserved.
May not be scanned, copied or duplicated, or posted to a publicly accessible website, in whole or in part.
Chapter Contents
8.1 Simple Linear Regression Model
8.2 Least Squares Method
8.3 Assessing the Fit of the Simple Linear Regression Model
8.4 The Multiple Linear Regression Model
8.5 Inference and Linear Regression
8.6 Categorical Independent Variables
8.7 Modeling Nonlinear Relationships
8.8 Model Fitting
8.9 Big Data and Linear Regression
8.10 Prediction with Linear Regression
Summary
The simple linear regression model is

𝑦 = β₀ + β₁𝑥 + ε

where
• 𝑦 is the dependent variable.
• 𝑥 is the independent variable.
• β₀ and β₁ are referred to as the population parameters.
• ε is the error term. It accounts for the variability in 𝑦 that cannot be explained by the linear relationship between 𝑥 and 𝑦.
𝑒ᵢ = 𝑦ᵢ − ŷᵢ is referred to as the 𝑖th residual: the error made in estimating the value of 𝑦ᵢ, where ŷᵢ = b₀ + b₁𝑥ᵢ is the predicted value of 𝑦ᵢ.
The regression model is valid only over the experimental region, defined as
the range of values of the independent variables in the data used to estimate
the model.
Extrapolation, the prediction of the value of the dependent variable outside the
experimental region, is risky and should be avoided unless we have empirical
evidence dictating otherwise.
The experimental region for the Butler Trucking data is from 50 to 100 miles.
• Any prediction of travel time for a driving distance less than 50 miles or greater than 100 miles is not a reliable estimate.
• Thus, for this model the estimate of b₀ (the predicted travel time for a distance of zero miles) is meaningless.
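The idea of restricting predictions to the experimental region can be sketched as follows. The coefficient values passed in below are hypothetical, not the textbook's estimates; only the 50–100 mile region comes from the discussion above:

```python
# Guarding against extrapolation: only predict inside the experimental
# region.  b0 and b1 here are hypothetical values, not the textbook's.
def predict_travel_time(b0, b1, miles, region=(50, 100)):
    """Return b0 + b1*miles, or None when miles lies outside the region."""
    lo, hi = region
    if not (lo <= miles <= hi):
        return None   # extrapolation: not a reliable estimate
    return b0 + b1 * miles

inside = predict_travel_time(1.3, 0.07, 75)    # within 50-100 miles
outside = predict_travel_time(1.3, 0.07, 30)   # below the region
```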
The value of the total sum of squares, SST = Σ(𝑦ᵢ − 𝑦̄)², is a measure of the error that results from using the sample mean 𝑦̄ to predict the values of the dependent variable.
The coefficient of determination, r² = SSR/SST = 1 − SSE/SST, can only assume values between 0 and 1 and is used to evaluate the goodness of fit for the estimated regression equation.
A perfect fit exists when ŷᵢ is identical to 𝑦ᵢ for every observation, so that all residuals 𝑒ᵢ = 0.
• In such a case, SSE = 0, SSR = SST, and r² = 1.
Poorer fits between 𝑦ᵢ and ŷᵢ result in larger values of SSE and lower r² values.
• The poorest fit happens when SSE = SST, SSR = 0, and r² = 0.
Thus, we can conclude that 66.41% of the variability in the values of travel time
can be explained by the linear relationship between the miles traveled and
travel time. See notes for Excel instructions.
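As a sketch of how the least squares estimates and the coefficient of determination are computed outside of Excel, here is a minimal Python version; the miles/hours data are made up for illustration, not the Butler Trucking sample:

```python
# Minimal least squares fit and r^2 for simple linear regression,
# on hypothetical data (not the textbook sample).
def fit_simple_ols(x, y):
    n = len(x)
    x_bar, y_bar = sum(x) / n, sum(y) / n
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    b1 = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y)) / sxx
    b0 = y_bar - b1 * x_bar
    return b0, b1

def coefficient_of_determination(x, y, b0, b1):
    y_bar = sum(y) / len(y)
    sst = sum((yi - y_bar) ** 2 for yi in y)                       # total SS
    sse = sum((yi - (b0 + b1 * xi)) ** 2 for xi, yi in zip(x, y))  # error SS
    return 1 - sse / sst                                           # r^2

miles = [50, 60, 70, 80, 90, 100]   # hypothetical driving distances
hours = [4.1, 5.0, 5.8, 6.4, 7.3, 8.1]
b0, b1 = fit_simple_ols(miles, hours)
r2 = coefficient_of_determination(miles, hours, b0, b1)
```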
The multiple linear regression model is

𝑦 = β₀ + β₁𝑥₁ + β₂𝑥₂ + ⋯ + βq𝑥q + ε

where
• ε is the error term that accounts for the variability in 𝑦 that cannot be explained by the linear effect of the q independent variables.
• β₀, β₁, β₂, …, βq are the parameters of the model.
The estimated multiple linear regression equation is

ŷ = b₀ + b₁𝑥₁ + b₂𝑥₂ + ⋯ + bq𝑥q

where
• ŷ is a point estimate of the mean value of 𝑦 for a given set of values of the independent variables 𝑥₁, 𝑥₂, …, 𝑥q.
• A simple random sample is used to compute the sample statistics b₀, b₁, b₂, …, bq that are used as estimates of the population parameters β₀, β₁, β₂, …, βq.
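A minimal sketch of computing the point estimates b₀, b₁, …, bq with NumPy's least squares routine; the two-variable data below are hypothetical, not the textbook sample:

```python
import numpy as np

# Hypothetical data: x1 = miles, x2 = deliveries, y = travel time (hours).
X = np.array([[50.0, 2], [60.0, 3], [70.0, 2], [80.0, 4],
              [90.0, 3], [100.0, 4]])
y = np.array([4.2, 5.3, 5.6, 7.0, 7.1, 8.4])

# Prepend a column of ones so the first estimate is the intercept b0.
X_design = np.column_stack([np.ones(len(y)), X])
b, *_ = np.linalg.lstsq(X_design, y, rcond=None)

y_hat = X_design @ b      # point estimates y-hat for each observation
residuals = y - y_hat     # with an intercept term, these sum to ~0
```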
For the Butler Trucking example, the estimated multiple linear regression equation is

ŷ = b₀ + b₁𝑥₁ + b₂𝑥₂

where
• ŷ = estimated mean travel time
• 𝑥₁ = distance traveled (miles)
• 𝑥₂ = number of deliveries
The SST, SSE, and coefficient of determination (denoted R² in multiple linear regression) are computed as we saw for simple linear regression.
The validity of inferences depends on two conditions concerning the error term ε:
1. For any given combination of values of the independent variables 𝑥₁, 𝑥₂, …, 𝑥q, the population of potential error terms ε is normally distributed with a mean of 0 and a constant variance.
2. The values of ε are statistically independent.
[Residual plots illustrating violations of the conditions: a nonlinear pattern, non-constant spread, non-normal residuals, and non-independent residuals.]
Both residual plots show valid conditions for inference. In Excel, select the
Residual Plots option in the Residuals area of the Regression dialog box.
Using α = 0.05, the p-values of 0.000 in the output indicate that we can reject H₀: β₁ = 0 and H₀: β₂ = 0.
Hence, both parameters are statistically significant.
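The test statistic behind such p-values can be sketched for the simple regression case using the textbook formula t = b₁/s_b₁; the miles/hours data here are hypothetical:

```python
import math

def t_statistic_b1(x, y):
    """t = b1 / s_b1 for testing H0: beta1 = 0 in simple regression."""
    n = len(x)
    x_bar, y_bar = sum(x) / n, sum(y) / n
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    b1 = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y)) / sxx
    b0 = y_bar - b1 * x_bar
    sse = sum((yi - (b0 + b1 * xi)) ** 2 for xi, yi in zip(x, y))
    s = math.sqrt(sse / (n - 2))    # standard error of the estimate
    s_b1 = s / math.sqrt(sxx)       # estimated standard deviation of b1
    return b1 / s_b1

miles = [50, 60, 70, 80, 90, 100]   # hypothetical data
hours = [4.1, 5.0, 5.8, 6.4, 7.3, 8.1]
t = t_statistic_b1(miles, hours)
# Compare |t| with the critical value t_{alpha/2, n-2}; a large |t|
# (equivalently a small p-value) leads us to reject H0: beta1 = 0.
```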
A review of the residuals for the current model with miles traveled () and the
number of deliveries () as independent variables reveals that driving on the
highway during afternoon rush hour affects the total travel time (*see notes.)
There are four major variable selection procedures we can use to find the best
estimated regression equation for a set of independent variables (*see notes):
1. Stepwise Regression
2. Forward Selection
3. Backward Elimination
4. Best-Subsets Regression
The first three procedures are iterative: one independent variable at a time is added or deleted. They offer no guarantee that the best model will be found.
In the fourth procedure, all possible subsets of the independent variables are
evaluated.
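The forward selection idea can be sketched as a greedy loop that repeatedly adds whichever remaining variable raises R² the most. This is an illustration of the concept only; textbook implementations typically use p-value entry criteria rather than the raw R² gain used here, and the data are synthetic:

```python
import numpy as np

def fit_r2(cols, y):
    """R^2 of an OLS fit of y on the given columns plus an intercept."""
    X = np.column_stack([np.ones(len(y))] + cols)
    b, *_ = np.linalg.lstsq(X, y, rcond=None)
    sse = np.sum((y - X @ b) ** 2)
    sst = np.sum((y - y.mean()) ** 2)
    return 1 - sse / sst

def forward_selection(X, y, min_gain=0.01):
    """Greedily add the predictor that raises R^2 most; stop when the
    best available gain falls below min_gain."""
    remaining = list(range(X.shape[1]))
    chosen, best = [], 0.0
    while remaining:
        scored = [(fit_r2([X[:, k] for k in chosen] + [X[:, j]], y), j)
                  for j in remaining]
        new_best, j = max(scored)
        if new_best - best < min_gain:
            break
        chosen.append(j)
        remaining.remove(j)
        best = new_best
    return chosen, best

# Synthetic example: column 0 drives y, columns 1-2 are pure noise.
rng = np.random.default_rng(7)
x_signal = rng.normal(size=40)
x_noise = rng.normal(size=(40, 2))
y = 2.0 * x_signal + 0.1 * rng.normal(size=40)
X = np.column_stack([x_signal, x_noise])
selected, r2 = forward_selection(X, y)
```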
The calculation of confidence and prediction intervals uses matrix algebra and requires specialized statistical software.
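That matrix algebra can be sketched directly with NumPy on hypothetical data: each interval is the point estimate plus or minus a t multiplier times one of the standard errors computed below.

```python
import numpy as np

# Hypothetical design matrix: intercept column, miles, deliveries.
X = np.column_stack([np.ones(6),
                     [50.0, 60, 70, 80, 90, 100],   # x1: miles
                     [2.0, 3, 2, 4, 3, 4]])         # x2: deliveries
y = np.array([4.2, 5.3, 5.6, 7.0, 7.1, 8.4])

XtX_inv = np.linalg.inv(X.T @ X)
b = XtX_inv @ X.T @ y                  # least squares coefficients
resid = y - X @ b
n, p = X.shape
s = np.sqrt(resid @ resid / (n - p))   # standard error of the estimate

x0 = np.array([1.0, 75, 3])            # a new point inside the region
h0 = x0 @ XtX_inv @ x0                 # leverage of the new point
se_mean = s * np.sqrt(h0)              # for the confidence interval
se_pred = s * np.sqrt(1 + h0)          # for the prediction interval
# Intervals: (x0 @ b) +/- t_{alpha/2, n-p} * se_mean (or se_pred).
```

The prediction interval is always at least as wide as the confidence interval because it also accounts for the variability of an individual observation around the mean.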