
Business Analytics, 5e

Chapter 8 – Linear Regression

Camm, Cochran, Fry, Ohlmann, Business Analytics, 5 th Edition. © 2024 Cengage Group. All Rights Reserved.
May not be scanned, copied or duplicated, or posted to a publicly accessible website, in whole or in part.
Chapter Contents
8.1 Simple Linear Regression Model
8.2 Least Squares Method
8.3 Assessing the Fit of the Simple Linear Regression Model
8.4 The Multiple Linear Regression Model
8.5 Inference and Linear Regression
8.6 Categorical Independent Variables
8.7 Modeling Nonlinear Relationships
8.8 Model Fitting
8.9 Big Data and Linear Regression
8.10 Prediction with Linear Regression
Summary



Learning Objectives (1 of 2)
After completing this chapter, you will be able to:
LO 8-1 Construct an estimated simple linear regression model that estimates
how a dependent variable is related to an independent variable.
LO 8-2 Construct an estimated multiple linear regression model that estimates
how a dependent variable is related to multiple independent variables.
LO 8-3 Compute and interpret the estimated coefficient of determination for a
linear regression model.
LO 8-4 Assess whether the conditions necessary for valid inference in a least
squares linear regression model are satisfied and test hypotheses
about the parameters.
LO 8-5 Test hypotheses about the parameters of a linear regression model and
interpret the results of these hypothesis tests.



Learning Objectives (2 of 2)

LO 8-6 Compute and interpret confidence intervals for the parameters of a
linear regression model.
LO 8-7 Use dummy variables to incorporate categorical independent variables
in a linear regression model and interpret the associated estimated
regression parameters.
LO 8-8 Use a quadratic regression model, a piecewise linear regression model,
and interaction between independent variables to account for
curvilinear relationships between independent variables and the
dependent variable in a regression model and interpret the estimated
parameters.
LO 8-9 Use an estimated linear regression model to predict the value of the
dependent variable given values of the independent variables.



Introduction
Managerial decisions are often based on the relationship between variables.
Regression analysis is a statistical procedure that uses data to develop an
equation showing how the variables are related.
• Dependent (or response) variable is the variable being predicted.
• Independent (or predictor) variables (or features) are variables used to
predict the value of the dependent variable.
• Simple linear regression is a form of regression analysis in which a
single (“simple”) independent variable, x, is used to develop a “linear”
(straight-line) relationship with the dependent variable, y.
• Multiple linear regression is a more general form of regression analysis
involving two or more independent variables.



8.1 Simple Linear Regression Model

The simple linear regression model is an equation that describes how the
dependent variable y is related to the independent variable x and an error term ε:

y = β0 + β1x + ε

Where
y is the dependent variable.
x is the independent variable.
β0 and β1 are referred to as the population parameters.
ε is the error term. It accounts for the variability in y that cannot be
explained by the linear relationship between x and y.



8.1 Estimated Simple Linear Regression Equation
The estimated simple linear regression equation is described as follows:

ŷ = b0 + b1x

Where ŷ is the point estimator of E(y), the mean of y for a given value of x.
b0 is the point estimator of β0 and the y-intercept of the regression.
The y-intercept is the estimated value of the dependent variable y when
the independent variable x is equal to 0.
b1 is the point estimator of β1 and the slope of the regression.
The slope is the estimated change in the value of the dependent variable y
that is associated with a one-unit increase in the independent variable x.



8.1 The Estimation Process in Simple Linear Regression

The estimation of β0 and β1 is a statistical process much like the
estimation of the population mean described in Chapter 7.
β0 and β1 are the unknown parameters of interest, and b0 and b1 are the
sample statistics used to estimate the parameters.
The flow chart to the right provides a summary of the estimation
process for simple linear regression.


8.2 Least Squares Method
The least squares method is a procedure for using sample data to find the
estimated linear regression equation by minimizing the sum of squared
residuals (see notes for the b0 and b1 equations):

min Σ(yi − ŷi)²

Where
yi − ŷi is referred to as the ith residual: the error made in estimating the
value of the dependent variable for the ith observation.
xi and yi are the values of the independent and dependent variables for the
ith observation.
ŷi is the predicted value of the dependent variable for the ith observation.
n is the total number of observations.



8.2 The Butler Trucking Company Example
We use a sample of 10 randomly selected driving assignments made by the
Butler Trucking Company to build a scatter chart depicting the relationship
between the travel time (in hours) and the miles traveled.
DATAfile: butler
Because the scatter chart
shows a positive linear
relationship, we choose the
simple linear regression model
to represent the relationship
between travel time (y) and
miles traveled (x).



8.2 Regression Equation for Butler Trucking Co.
Computer software produces the following simple linear regression equation:

ŷ = 1.2739 + 0.0678x

The slope and intercept of the regression are b1 = 0.0678 and b0 = 1.2739.

Thus, we estimate that if the length of
a driving assignment were 1 mile
longer, the mean travel time would be
0.0678 hours or ~4 minutes longer.
Also, if the length of a driving
assignment were 0 miles, the mean
travel time would be 1.2739 hours or
~76 minutes. See notes for Excel.
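The least squares estimates above can be reproduced directly from the ten Butler Trucking observations shown later in this chapter. A minimal sketch in pure Python (no statistics library assumed):

```python
# Least squares estimates for the Butler Trucking data:
# b1 = sum((x - xbar)(y - ybar)) / sum((x - xbar)^2), b0 = ybar - b1 * xbar
miles = [100, 50, 100, 100, 50, 80, 75, 65, 90, 90]
hours = [9.3, 4.8, 8.9, 6.5, 4.2, 6.2, 7.4, 6.0, 7.6, 6.1]

n = len(miles)
x_bar = sum(miles) / n
y_bar = sum(hours) / n

s_xy = sum((x - x_bar) * (y - y_bar) for x, y in zip(miles, hours))
s_xx = sum((x - x_bar) ** 2 for x in miles)

b1 = s_xy / s_xx           # estimated slope
b0 = y_bar - b1 * x_bar    # estimated y-intercept

print(round(b1, 4), round(b0, 4))  # 0.0678 1.2739
```

The rounded results match the estimated equation ŷ = 1.2739 + 0.0678x reported by the software.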



8.2 Experimental Region and Extrapolation

The regression model is valid only over the experimental region, defined as
the range of values of the independent variables in the data used to estimate
the model.
Extrapolation, the prediction of the value of the dependent variable outside the
experimental region, is risky and should be avoided unless we have empirical
evidence dictating otherwise.
The experimental region for the Butler Trucking data is from 50 to 100 miles.
• Any prediction of the travel time made for a driving distance less
than 50 miles or greater than 100 miles is not a reliable estimate.
• Thus, for this model the estimate of the y-intercept b0 (the mean travel
time for a 0-mile assignment) is meaningless.



8.2 Estimating Travel Time for the Butler Trucking Co.
We can use the estimated model for the Butler Trucking Company example,
and the known values for miles traveled for a driving assignment to estimate
the mean travel time in hours.
For example, the first driving assignment in the data set has a value for miles
traveled of x1 = 100 and a value for travel time of y1 = 9.3 hours.
The mean travel time for this driving assignment is estimated to be
ŷ1 = 1.2739 + 0.0678(100) = 8.0539 hours
The resulting residual of the estimate is
y1 − ŷ1 = 9.3 − 8.0539 = 1.2461 hours
The next slide shows the calculations for the 10 observations in the data set.



8.2 Predicted Travel Time and Residuals
Driving Assignment i | xi = Miles Traveled | yi = Travel Time (hours) | ŷi | yi − ŷi | (yi − ŷi)²
1 | 100 | 9.3 | 8.0539 | 1.2461 | 1.5528
2 | 50 | 4.8 | 4.6639 | 0.1361 | 0.0185
3 | 100 | 8.9 | 8.0539 | 0.8461 | 0.7159
4 | 100 | 6.5 | 8.0539 | -1.5539 | 2.4146
5 | 50 | 4.2 | 4.6639 | -0.4639 | 0.2152
6 | 80 | 6.2 | 6.6979 | -0.4979 | 0.2479
7 | 75 | 7.4 | 6.3589 | 1.0411 | 1.0839
8 | 65 | 6.0 | 5.6809 | 0.3191 | 0.1018
9 | 90 | 7.6 | 7.3759 | 0.2241 | 0.0502
10 | 90 | 6.1 | 7.3759 | -1.2759 | 1.6279
Totals | | 67.0 | 67.0000 | 0.0000 | SSE = 8.0288



8.3 The Sums of Squares
The value of the sum of squares due to error (SSE) is a measure of the error
that results from using the ŷi values to predict the yi values.

SSE = Σ(yi − ŷi)²

The value of the total sum of squares (SST) is a measure of the error that
results from using the sample mean ȳ to predict the yi values.

SST = Σ(yi − ȳ)²

The value of the sum of squares due to regression (SSR) is a measure of
how much the ŷi values deviate from the sample mean ȳ.

SSR = Σ(ŷi − ȳ)²

The relationship between these three sums of squares is

SST = SSR + SSE



8.3 Total Sum of Squares for the Butler Trucking Co.
Driving Assignment i | xi = Miles Traveled | yi = Travel Time (hours) | yi − ȳ | (yi − ȳ)²
1 | 100 | 9.3 | 2.6 | 6.76
2 | 50 | 4.8 | -1.9 | 3.61
3 | 100 | 8.9 | 2.2 | 4.84
4 | 100 | 6.5 | -0.2 | 0.04
5 | 50 | 4.2 | -2.5 | 6.25
6 | 80 | 6.2 | -0.5 | 0.25
7 | 75 | 7.4 | 0.7 | 0.49
8 | 65 | 6.0 | -0.7 | 0.49
9 | 90 | 7.6 | 0.9 | 0.81
10 | 90 | 6.1 | -0.6 | 0.36
Totals | | 67.0 | 0 | SST = 23.90



8.3 Coefficient of Determination
The ratio SSR/SST is called the coefficient of determination, denoted by r².

r² = SSR/SST

The coefficient of determination can only assume values between 0 and 1 and
is used to evaluate the goodness of fit for the estimated regression equation.
A perfect fit exists when ŷi is identical to yi for every observation, so that all
residuals yi − ŷi = 0.
• In such a case, SSE = 0, SSR = SST, and r² = 1.
Poorer fits between ŷi and yi result in larger values of SSE and lower r² values.
• The poorest fit happens when SSR = 0, SSE = SST, and r² = 0.



8.3 Goodness of Fit for the Butler Trucking Co.
From our previous calculations for the sum of squares due to error, we already
know that

SSE = 8.0288

Similar calculations for the total sum of squares reveal that

SST = 23.90

Because of the sum of squares relationship, we can write

r² = SSR/SST = (SST − SSE)/SST = (23.90 − 8.0288)/23.90 = 0.6641

Thus, we can conclude that 66.41% of the variability in the values of travel time
can be explained by the linear relationship between the miles traveled and
travel time. See notes for Excel instructions.
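The decomposition above can be checked numerically with the Butler data and the rounded fitted equation ŷ = 1.2739 + 0.0678x; small rounding differences in the last digit of SSE are expected:

```python
# Decompose the variability in travel time: SST = SSR + SSE, r^2 = SSR / SST.
miles = [100, 50, 100, 100, 50, 80, 75, 65, 90, 90]
hours = [9.3, 4.8, 8.9, 6.5, 4.2, 6.2, 7.4, 6.0, 7.6, 6.1]

y_bar = sum(hours) / len(hours)
y_hat = [1.2739 + 0.0678 * x for x in miles]   # rounded fitted equation

sse = sum((y - yh) ** 2 for y, yh in zip(hours, y_hat))   # ~8.0288
sst = sum((y - y_bar) ** 2 for y in hours)                # 23.90
ssr = sst - sse
r2 = ssr / sst

print(round(r2, 4))  # 0.6641
```

About 66.41% of the variability in travel time is explained by the linear relationship with miles traveled.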



8.4 Multiple Linear Regression Model

variable 𝑦 is related to the independent variables and an error term .


The multiple linear regression model describes how the dependent

Where

is the error term that accounts for the variability in 𝑦 that cannot be
are the parameters of the model.

explained by the linear effect of the independent variables.


The coefficient (with represents the change in the mean value of that
corresponds to a one unit increase in the independent variable , holding the
values of all other independent variables in the model constant.



8.4 The Estimation Process in Multiple Regression
The estimated multiple linear regression equation is

ŷ = b0 + b1x1 + b2x2 + ⋯ + bqxq

Where:
ŷ is a point estimate of the mean value of y for a
given set of values of the independent
variables x1, x2, …, xq.
A simple random sample is used
to compute the sample statistics
b0, b1, …, bq that are used as estimates of β0, β1, …, βq.



8.4 Least Squares Method and Multiple Regression
The least squares method uses the sample data to provide the values of the
sample statistics b0, b1, …, bq that minimize the sum of squared errors between
the yi and the ŷi:

min Σ(yi − ŷi)²

Where
yi is the value of the dependent variable for the ith observation.
ŷi is the predicted value of the dependent variable for the ith
observation.
Because the formulas for the regression coefficients involve the use of matrix
algebra, we rely on computer software packages to perform the calculations.
The emphasis will be on how to interpret the computer output rather than on
how to make the multiple regression computations.
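To give a sense of what the software computes, the least squares coefficients solve the normal equations (XᵀX)b = Xᵀy. The sketch below uses a small hypothetical data set (not the Butler data) whose response is an exact linear function of two predictors, so least squares recovers the generating coefficients exactly:

```python
# Solve the normal equations (X'X) b = X'y for a two-predictor model.
# Hypothetical illustration data: y = 2 + 3*x1 + 0.5*x2 exactly.
x1 = [1, 2, 3, 4, 5]
x2 = [5, 3, 8, 1, 4]
y = [2 + 3 * a + 0.5 * b for a, b in zip(x1, x2)]

X = [[1.0, a, b] for a, b in zip(x1, x2)]   # design matrix with intercept column
p = 3

# Form X'X and X'y.
xtx = [[sum(row[i] * row[j] for row in X) for j in range(p)] for i in range(p)]
xty = [sum(row[i] * yi for row, yi in zip(X, y)) for i in range(p)]

# Gaussian elimination with partial pivoting.
for col in range(p):
    pivot = max(range(col, p), key=lambda r: abs(xtx[r][col]))
    xtx[col], xtx[pivot] = xtx[pivot], xtx[col]
    xty[col], xty[pivot] = xty[pivot], xty[col]
    for r in range(col + 1, p):
        f = xtx[r][col] / xtx[col][col]
        xtx[r] = [a - f * b for a, b in zip(xtx[r], xtx[col])]
        xty[r] -= f * xty[col]

# Back substitution yields b0, b1, b2.
coef = [0.0] * p
for i in reversed(range(p)):
    coef[i] = (xty[i] - sum(xtx[i][j] * coef[j] for j in range(i + 1, p))) / xtx[i][i]

print([round(c, 6) for c in coef])  # [2.0, 3.0, 0.5]
```

In practice, packages use numerically more stable decompositions than this direct solve, which is why we rely on software output.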



8.4 Multiple Regression with Two Independent Variables
DATAfile: butlerwithdeliveries
We add a second independent variable, the number of deliveries made per
driving assignment, which also contributes to the total travel time.
The estimated multiple linear regression equation with two independent variables is

ŷ = b0 + b1x1 + b2x2

Where
ŷ = estimated mean travel time
x1 = distance traveled (miles)
x2 = number of deliveries
The SSE, SST, SSR, and coefficient of determination (denoted R² in multiple linear
regression) are computed as we saw for simple linear regression.



8.4 Butler Trucking Co. and Multiple Regression
The estimated multiple linear regression equation (see notes for Excel
instructions), after rounding the sample coefficients to four decimal places, has
slope coefficients b1 = 0.0672 for miles traveled and b2 = 0.69 for deliveries.
• For a fixed number of deliveries, the mean travel time is expected to
increase by 0.0672 hours (~4 minutes) when the distance traveled
increases by 1 mile.
• For a fixed distance traveled, the mean travel time is expected to increase
by 0.69 hours (~41 minutes) for each additional delivery.
• The interpretation of the estimated y-intercept is not meaningful because it
results from extrapolation.
• We can now explain a larger share of the variability in total travel time than
with the simple linear regression model.



8.4 Butler Trucking Co. Excel Regression Output



8.4 Graph of the Multiple Linear Regression Equation
With two independent variables x1 and x2, we now generate a predicted value of
ŷ for every combination of values of x1 and x2.
• Instead of a regression
line, we now create a 3-D
regression plane.
The graph of the estimated
regression plane shows the
seventh driving assignment for
the Butler Trucking Company
example.
*See notes for details on the
interpretation of the graph.



8.5 Conditions for Valid Inference in Regression
Given a multiple linear regression model expressed as

y = β0 + β1x1 + β2x2 + ⋯ + βqxq + ε

the least squares method is used to develop estimates of the model
parameters, resulting in the estimated multiple linear regression equation.

The validity of inferences depends on two conditions about the error term ε:
1. For any given combination of values of the independent variables
x1, x2, …, xq, the population of potential error terms ε is normally
distributed with a mean of 0 and a constant variance.
2. The values of ε are statistically independent.



8.5 Illustration of the Conditions for Valid Inference

The mean value E(y) changes linearly
according to the specific value of x
considered, and so the mean error is
zero at each value of x.
The error term ε, and hence the
dependent variable y, are normally
distributed with the same variance
at each value of x.
The specific value of the error term
at any particular point depends on
whether the actual value of y is
greater or less than E(y).



8.5 Scatter Chart of the Residuals
A simple scatter chart of the residuals is an extremely effective method for
assessing whether the error term conditions are violated.
The example to the right displays a random error pattern for a scatter chart of
residuals versus the predicted values of the dependent variable.
For proper inference, the scatter chart must exhibit a random pattern with
• residuals centered around zero,
• a constant spread of the residuals
throughout, and
• residuals symmetrically distributed
with the values near zero occurring
more frequently than those outside.



8.5 Common Error Term Violations

Common violations visible in residual plots include:
• Nonlinear pattern
• Non-constant spread
• Non-normal residuals
• Non-independent residuals



8.5 Excel Residual Plots for the Butler Trucking Co.
Residuals vs. Miles Residuals vs. Deliveries

Both Residual plots show valid conditions for inference. In Excel, select the
Residual Plots option in the Residuals area of the Regression dialog box.



8.5 Scatter Chart of Residuals vs. Predicted Variable

A scatter chart of the residuals against the predicted values is also
commonly used.
The scatter chart to the right for the Butler Trucking Company data
shows valid conditions for inference.
See notes to create the data and this chart in Excel.



8.5 t Test for Individual Significance
In a multiple regression model with q independent variables, for each parameter
βj (j = 1, …, q), we use a t test to test the hypotheses

H0: βj = 0 versus Ha: βj ≠ 0

The test statistic follows a t distribution with n − q − 1 degrees of freedom:

t = bj / s_bj

Where s_bj is the estimated standard deviation of the regression coefficient bj.

If the p-value ≤ α, we reject H0 and conclude that there is a linear relationship
between the dependent variable y and the independent variable xj.
Statistical software will generally report a p-value for each test statistic.
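The t-ratio itself is just the coefficient divided by its standard error. A minimal sketch, with hypothetical values standing in for the estimates and the critical value a regression output would supply:

```python
# t test for an individual regression parameter: t = b_j / s_bj,
# compared with a t critical value on n - q - 1 degrees of freedom.
b_j = 0.0672      # hypothetical estimated coefficient
s_bj = 0.0048     # hypothetical standard error of the coefficient
t_crit = 2.365    # hypothetical t value for alpha/2 at the relevant df

t_ratio = b_j / s_bj
reject_h0 = abs(t_ratio) > t_crit   # reject H0: beta_j = 0 if |t| exceeds the critical value

print(round(t_ratio, 2), reject_h0)  # 14.0 True
```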



8.5 Individual t Tests for the Butler Trucking Co.
Example
The multiple regression output for the Butler Trucking Company example shows
the t-ratio calculations based on the estimated coefficients and their standard
errors for the Miles and Deliveries variables.
Calculation of the t-ratios provides the test statistics for the hypotheses
involving the parameters β1 and β2, also provided by the computer output.

Using α = 0.05, the p-values of 0.000 in the output indicate that we can reject
H0: β1 = 0 and H0: β2 = 0.
Hence, both parameters are statistically significant.



8.5 Testing Regression Coefficients with
Confidence Intervals

Confidence intervals can be used to test whether each of the regression
parameters βj is equal to zero.
To test that βj is zero (i.e., there is no linear relationship between xj and y) at
some predetermined level of significance α, first build a confidence interval at
the (1 − α) confidence level.
If the resulting confidence interval does not contain zero, we conclude that βj
differs from zero at the predetermined level of significance.
The multiple regression output for the Butler Trucking Company example
shows each regression coefficient's confidence intervals at the 95% and 99%
confidence levels.
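The interval check can be sketched as follows; the estimate, standard error, and critical value are hypothetical stand-ins for the values a regression output would report:

```python
# Test H0: beta_j = 0 at significance level alpha by checking whether the
# (1 - alpha) confidence interval b_j +/- t_crit * s_bj contains zero.
b_j = 0.69        # hypothetical coefficient estimate
s_bj = 0.18       # hypothetical standard error
t_crit = 2.365    # hypothetical critical value for the chosen alpha

lower = b_j - t_crit * s_bj
upper = b_j + t_crit * s_bj
significant = not (lower <= 0.0 <= upper)   # zero outside the interval => reject H0

print(round(lower, 4), round(upper, 4), significant)  # 0.2643 1.1157 True
```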



8.5 Addressing Nonsignificant Independent Variables
If practical experience dictates that a nonsignificant independent variable xj is
related to the dependent variable y, the independent variable should be left in
the model.
If the model sufficiently explains the dependent variable y without the
nonsignificant independent variable xj, then consider rerunning the regression
without that variable.
At times, the estimates of the other regression coefficients and their p-values
may change considerably when we remove the nonsignificant independent
variable from the model.
The appropriate treatment of the inclusion or exclusion of the y-intercept when
b0 is not statistically significant may require special consideration (*see notes.)



8.5 Multicollinearity
Multicollinearity refers to the correlation among the independent variables in
multiple regression analysis (*see notes.)
• Multicollinearity increases the standard errors of the regression estimates
b0, b1, …, bq and of the predicted values of the dependent variable, so that
inference based on these estimates is less precise than it should be.
In t tests for the significance of individual parameters, it is possible to conclude
that a parameter associated with one of the multicollinear independent
variables is not significantly different from zero even when the independent
variable actually has a strong relationship with the dependent variable.
Multicollinearity is not a concern when there is little correlation
among the independent variables.



8.5 Multicollinearity in the Butler Trucking Co. Data
DATAfile: butlerwithgasconsumption
The regression output to the right has
miles driven and gasoline
consumption as independent
variables.
The two variables are highly
correlated, as gasoline consumption
increases with total miles driven, with
a correlation coefficient of 0.9572.
Because of this multicollinearity, the
regression coefficient for gasoline
consumption is not statistically
significant.



8.6 Dummy Variables
Thus far, the regression examples we have considered involved quantitative
independent variables such as distance traveled, gas consumption, and number
of deliveries.
Often, we must work with categorical independent variables, such as:
• gender (male, female)
• method of payment (cash, credit card, check)
To add a two-level categorical independent variable into a regression model,
such as whether a driver takes the highway during the afternoon rush hour in
the Butler Trucking Company problem, we define a dummy variable that equals
1 if the assignment involves highway driving during the afternoon rush hour
and 0 otherwise.



8.6 Effect of Afternoon Rush Hour on Travel Time

A review of the residuals for the current model with miles traveled (x1) and the
number of deliveries (x2) as independent variables reveals that driving on the
highway during the afternoon rush hour affects the total travel time (*see notes.)



8.6 Regression Output with Highway Rush Hour Variable
DATAfile: butlerhighway
Excel regression output for the
Butler Trucking Company
regression model including the
independent variables:
• miles traveled (x1)
• number of deliveries (x2)
• highway rush hour (x3)
All independent variables are
significant, and they explain
(R² ≈ 0.884) about 88.4% of the
total travel time variability.



8.6 Interpreting the Parameters for the Butler Example

The model estimates that travel time increases by:


1. 0.0672 hours (about 4 minutes) for every increase of 1 mile traveled,
holding constant the number of deliveries and whether the driver uses
the highway during afternoon rush hour.
2. 0.6735 hours (about 40 minutes) for every delivery, holding constant the
number of miles traveled and whether the driver uses the highway
during afternoon rush hour.
3. 0.9980 hours (about 60 minutes) if the driver uses the highway during
afternoon rush hour, holding constant the number of miles traveled and
the number of deliveries.



8.6 More Complex Categorical Variables
If an independent categorical variable has k levels, k − 1 dummy variables are
required, with each dummy variable being coded as 0 or 1.
Consider the situation faced by a manufacturer of vending machines that sells its
products to three sales territories: regions A, B, and C.
To code the sales regions in a regression model that explains the dependent variable
(y), number of units sold, we need to define two dummy variables as follows:

Sales Region | x1 | x2
A | 0 | 0
B | 1 | 0
C | 0 | 1

The regression equation relating the expected number of units sold to the sales
region can be written as

E(y) = β0 + β1x1 + β2x2
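The k − 1 dummy coding for the three sales regions can be sketched directly; the region labels follow the example, and the helper function name is illustrative:

```python
# Encode a three-level categorical variable (regions A, B, C) with k - 1 = 2
# dummy variables: x1 = 1 for region B, x2 = 1 for region C, and region A is
# the baseline level coded (0, 0).
def region_dummies(region):
    return (1 if region == "B" else 0, 1 if region == "C" else 0)

regions = ["A", "B", "C", "B"]
print([region_dummies(r) for r in regions])  # [(0, 0), (1, 0), (0, 1), (1, 0)]
```

Using only two dummies avoids perfect multicollinearity with the intercept column, which is why k dummies are never used for k levels.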



8.6 Interpretation of the Parameters for a Categorical
Variable with Three Levels
To interpret β0, β1, and β2 in the sales territory example, consider the following
variations of the regression equation:

Region A (x1 = 0, x2 = 0): E(y) = β0
Region B (x1 = 1, x2 = 0): E(y) = β0 + β1
Region C (x1 = 0, x2 = 1): E(y) = β0 + β2

Thus, the regression parameters are interpreted as follows.
β0 is the mean or expected value of sales for region A.
β1 is the difference between the mean number of units sold in region B and the
mean number of units sold in region A.
β2 is the difference between the mean number of units sold in region C and the
mean number of units sold in region A.



8.7 Modeling Nonlinear Relationships
A manager at Reynolds, Inc., a manufacturer of industrial scales, wants to
investigate the relationship between length of employment (x) and the number
of electronic laboratory scales sold (y) for a sample of 123 salespeople.
DATAfile: reynolds
An estimated simple linear regression
equation can be fit to these data.

However, the scatter diagram indicates a
curvilinear relationship between
the length of time employed and
the number of units sold.



8.7 A Curvilinear Pattern in the Reynolds Data
The pattern in the scatter chart of residuals against the predicted values of the
dependent variable suggests that a curvilinear relationship may provide a
better fit to the data.
We may wish to consider an
alternative to simple linear
regression if we have a practical
reason to suspect a curvilinear
relationship.
For example, a salesperson who
has been employed for a long
time may eventually become
burned out and less efficient.



8.7 A Quadratic Regression Model
To account for the curvilinear relationship, we add an independent variable,
MonthSq, as the square of the number of months the salesperson has been
with the firm. See notes for Excel.
The following equation describes a quadratic regression model:

y = β0 + β1x + β2x² + ε

The regression output produces the estimated quadratic regression equation
for the Reynolds problem in the form ŷ = b0 + b1x + b2x².



8.7 Interpreting a Quadratic Regression Equation
If the estimated parameters b1 and b2 corresponding to the linear term and the
squared term have the same sign, the estimated dependent variable ŷ is
a) increasing over the experimental range of x when b1 > 0 and b2 > 0, or
b) decreasing over the experimental range of x when b1 < 0 and b2 < 0.
If, on the other hand, the estimated parameters b1 and b2 corresponding to the
linear term and the squared term have different signs, ŷ has
c) a maximum over the experimental range of x when b1 > 0 and b2 < 0, or
d) a minimum over the experimental range of x when b1 < 0 and b2 > 0.
In the case of the Reynolds data, we can use calculus to demonstrate that the
maximum sales occur at x = −b1/(2b2).
Substituting this value of x into the estimated equation gives the maximum sales.


8.7 Types of Quadratic Regression Models



8.7 Interaction Between Independent Variables
An interaction is a relationship between the dependent variable and one
independent variable that is different at various values of a second independent
variable.
If the original data set consists of observations of y and two independent
variables x1 and x2, we can incorporate an interaction term x1x2 into the
estimated multiple linear regression equation in the following manner:

ŷ = b0 + b1x1 + b2x2 + b3x1x2

When an interaction term between two variables is present, we cannot study
the relationship between one independent variable and the dependent variable
independently of the other variable.
See notes and next slide for an Excel application that uses the DATAFile: tyler.



8.7 Regression Output for the Tyler Personal Care Example



8.7 Piecewise Linear Regression Model
A piecewise linear regression model is a type of interaction with a dummy
variable that allows fitting nonlinear relationships as two linear regressions
joined at the value of x at which the relationship between x and y changes.
• The value of the independent variable, x(k), at which the relationship
between x and y changes is called a knot, or breakpoint.
• A dummy variable xk is added to the model such that xk = 1 if x > x(k)
and xk = 0 otherwise.

Then, the following estimated regression equation is fit:

ŷ = b0 + b1x + b2(x − x(k))xk


8.7 Piecewise Linear Regression Model for Reynolds Data
We observe that below some value of Months Employed the relationship with
Sales appears to be positive and linear, whereas the relationship becomes
negative and linear for the remaining observations. See notes for Excel.
As shown in the previous slide, we
add a dummy variable to the model
with a knot at the value of Months
Employed where the relationship changes.
The regression output, shown in the
next slide, produces an estimated
regression equation in which
all independent variables are significant.



8.7 Piecewise Regression Output for Reynolds Data



8.8 Variable Selection Procedures

There are four major variable selection procedures we can use to find the best
estimated regression equation for a set of independent variables (*see notes):
1. Stepwise Regression
2. Forward Selection
3. Backward Elimination
4. Best-Subsets Regression
The first three procedures are iterative; one independent variable at a time is
added or deleted, and there is no guarantee that the best model will be found.
In the fourth procedure, all possible subsets of the independent variables are
evaluated.



8.8 Overfitting
Overfitting generally results from creating an overly complex regression model to
explain idiosyncrasies in the sample data.
An overfit model will overperform on the sample data used to fit the model and
underperform on other data from the population.
To avoid overfitting a model:
• Use only real and meaningful independent variables.
• Only use complex models when you have reasonable expectations about them.
• Use variable selection procedures only for guidance.
• If you have sufficient data, consider cross-validation, in which you assess the
model on data other than the sample data used to generate the model.
• One example of cross-validation is the holdout method, which divides the
data set between a training set and a validation set.
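The holdout method can be sketched with a simple shuffled split; the 70/30 proportion and the 100-observation stand-in data are illustrative choices:

```python
import random

# Holdout cross-validation: split the data into a training set used to fit the
# model and a validation set used to assess it on observations not used in fitting.
random.seed(42)                      # reproducible shuffle for the sketch
data = list(range(100))              # stand-in for 100 observations
random.shuffle(data)

cut = int(0.7 * len(data))           # illustrative 70/30 split
train, validation = data[:cut], data[cut:]

print(len(train), len(validation))   # 70 30
```

Shuffling before splitting matters when the data are ordered (e.g., by date), so that both sets are representative of the population.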



8.9 Inference and Very Large Samples
Virtually all regression
coefficients will be
statistically significant if the
sample is sufficiently large.
DATAfile: largecredit
The regression output to the
right shows a modest R² for a
data set with a very large
number of observations.
All the regression
coefficients are significant.
See notes for details.



8.9 Model Selection
When dealing with large samples, it is often difficult to discern the most
appropriate model.
• If developing a regression model for explanatory purposes, the practical
significance of the estimated regression coefficients should be considered
when interpreting the model and considering which variables to keep.
• If developing a regression model to make future predictions, selecting the
independent variables to include in the model should be based on the
predictive accuracy of observations that have not been used to train it.
For example, the credit card data set could be split into two data sets:
1. a training data set, and
2. a validation data set.



8.10 Prediction with Linear Regression
In addition to the point estimate, there are two types of interval estimates
associated with the regression equation:
• A confidence interval is an interval estimate of the mean value of y given
values of the independent variables x1, x2, …, xq.
• A prediction interval is an interval estimate of an individual value of y given
values of the independent variables x1, x2, …, xq.

The calculation of confidence and prediction intervals uses matrix algebra
and requires specialized statistical software.



8.10 Prediction of New Routes for Butler Trucking Co.
Predicted Values and 95% Confidence Intervals and Prediction
Intervals for 10 New Butler Trucking Routes (DATAfile: butler)
Assignment | Miles | Deliveries | Predicted Value | 95% CI Half-Width (±) | 95% PI Half-Width (±)
301 | 105 | 3 | 9.25 | 0.193 | 1.645
302 | 60 | 4 | 6.92 | 0.112 | 1.637
303 | 95 | 5 | 9.96 | 0.173 | 1.642
304 | 100 | 1 | 7.54 | 0.225 | 1.649
305 | 40 | 3 | 4.88 | 0.177 | 1.643
306 | 80 | 3 | 7.57 | 0.108 | 1.637
307 | 65 | 4 | 7.25 | 0.103 | 1.637
308 | 55 | 3 | 5.89 | 0.124 | 1.638
309 | 95 | 2 | 7.89 | 0.175 | 1.643
310 | 95 | 3 | 8.58 | 0.154 | 1.641



Summary
• In this chapter, we showed how linear regression analysis is used to determine how a
dependent variable is related to one or more independent variables.
• We used sample data and the least squares method to develop the estimated simple linear
regression equation, interpreted its coefficients, and presented the coefficient of
determination as a measure of its goodness of fit.
• We then extended our discussion to multiple independent variables, reviewed how to
use Excel to find the estimated multiple linear regression equation and to build
prediction and confidence intervals, and examined the ramifications of multicollinearity.
• We discussed the necessary conditions for the linear regression model and its associated
error term to conduct valid inference for regression.
• We showed how to incorporate categorical independent variables into a regression model
and discussed how to fit nonlinear relationships.
• Finally, we discussed various variable selection procedures, the problem of overfitting, and
the implication of big data on regression analysis.
