
Chapter 8
Trendlines and Regression Analysis


Learning Objectives
After studying this chapter, you will be able to:

• Explain the purpose of regression analysis and provide examples in business.
• Use a scatter chart to identify the type of relationship between two variables.
• List the common types of mathematical functions used in predictive modeling.
• Use the Excel Trendline tool to fit models to data.
• Explain how least-squares regression finds the best-fitting regression model.
• Use Excel functions to find least-squares regression coefficients.
• Use the Excel Regression tool for both single and multiple linear regressions.
• Interpret the regression statistics of the Excel Regression tool.
• Interpret significance of regression from the Excel Regression tool output.
• Draw conclusions for tests of hypotheses about regression coefficients.
• Interpret confidence intervals for regression coefficients.
• Calculate standard residuals.
• List the assumptions of regression analysis and describe methods to verify them.
• Explain the differences in the Excel Regression tool output for simple and multiple linear regression models.
• Apply a systematic approach to build good regression models.
• Explain the importance of understanding multicollinearity in regression models.
• Build regression models for categorical data using dummy variables.
• Test for interactions in regression models with categorical variables.
• Identify when curvilinear regression models are more appropriate than linear models.

Many applications of business analytics involve modeling relationships between one or more independent variables and some dependent variable. For example, we might wish to predict the level of sales based on the price we set, or extrapolate a trend into the future. As other examples, a company may wish to predict sales based on the U.S. GDP (gross domestic product) and the 10-year treasury bond rate to capture the influence of the business cycle,¹ or a marketing researcher might want to predict the intent of buying a particular automobile model based on a survey that measured consumer attitudes toward the brand, negative word-of-mouth, and income level.²

Trendlines and regression analysis are tools for building such models and predicting future results. Our principal focus is to gain a basic understanding of how to use and interpret trendlines and regression models, statistical issues associated with interpreting regression analysis results, and practical issues in using trendlines and regression as tools for making and evaluating decisions.

Modeling Relationships and Trends in Data

Understanding both the mathematics and the descriptive properties of different functional relationships is important in building predictive analytical models. We often begin by creating a chart of the data to understand it and choose the appropriate type of functional relationship to incorporate into an analytical model. For cross-sectional data, we use a scatter chart; for time-series data, we use a line chart.

Common types of mathematical functions used in predictive analytical models include the following:

• Linear function: y = a + bx. Linear functions show steady increases or decreases over the range of x. This is the simplest type of function used in predictive models. It is easy to understand, and over small ranges of values, it can approximate behavior rather well.
• Logarithmic function: y = ln(x). Logarithmic functions are used when the rate of change in a variable increases or decreases quickly and then levels out, such as with diminishing returns to scale. Logarithmic functions are often used in marketing models where constant percentage increases in advertising, for instance, result in constant, absolute increases in sales.
• Polynomial function: y = ax² + bx + c (second order—quadratic function), y = ax³ + bx² + dx + e (third order—cubic function), and so on. A second-order polynomial is parabolic in nature and has only one hill or valley; a third-order polynomial has one or two hills or valleys. Revenue models that incorporate price elasticity are often polynomial functions.

¹James R. Morris and John P. Daley, Introduction to Financial Models for Management and Planning (Boca Raton, FL: Chapman & Hall/CRC, 2009): 257.
²Alvin C. Burns and Ronald F. Bush, Basic Marketing Research Using Microsoft Excel Data Analysis, 2nd ed. (Upper Saddle River, NJ: Prentice Hall, 2008): 450.

• Power function: y = axᵇ. Power functions define phenomena that increase at a specific rate. Learning curves that express improving times in performing a task are often modeled with power functions having a > 0 and b < 0.
• Exponential function: y = abˣ. Exponential functions have the property that y rises or falls at constantly increasing rates. For example, the perceived brightness of a lightbulb grows at a decreasing rate as the wattage increases. In this case, a would be a positive number and b would be between 0 and 1. The exponential function is often defined as y = aeˣ, where b = e, the base of natural logarithms (approximately 2.71828).
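To make the algebraic forms concrete, the following minimal Python sketch evaluates each family at a single point; all coefficient values are purely illustrative, not drawn from any data set:

```python
import numpy as np

x = 2.0  # an illustrative input value

linear      = 3.0 + 1.5 * x                 # y = a + bx
logarithmic = 13.0 * np.log(x) + 39.0       # y = b ln(x) + a
quadratic   = 0.5 * x**2 - 2.0 * x + 6.0    # y = ax^2 + bx + c (second order)
power       = 4.0 * x**(-0.3)               # y = a x^b with a > 0, b < 0 (learning curve)
exponential = 5.0 * 1.2**x                  # y = a b^x; b = e gives y = a e^x

print(linear, logarithmic, quadratic, power, exponential)
```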

The Excel Trendline tool provides a convenient method for determining the best-fitting functional relationship among these alternatives for a set of data. First, click the chart to which you wish to add a trendline; this will display the Chart Tools menu. Select the Chart Tools Design tab, and then click Add Chart Element from the Chart Layouts group. From the Trendline submenu, you can select one of the options (Linear is the most common) or More Trendline Options. If you select More Trendline Options, you will get the Format Trendline pane in the worksheet (see Figure 8.1). A simpler way of doing all this is to right-click on the data series in the chart and choose Add trendline from the pop-up menu—try it! Select the radio button for the type of functional relationship you wish to fit to the data. Check the boxes for Display Equation on chart and Display R-squared value on chart. You may then close the Format Trendline pane. Excel will display the results on the chart you have selected; you may move the equation and R-squared value for better readability by dragging them to a different location. To clear a trendline, right-click on it and select Delete.
R² (R-squared) is a measure of the "fit" of the line to the data. The value of R² will be between 0 and 1. The larger the value of R², the better the fit. We will discuss this further in the context of regression analysis.
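Although Excel reports R² for you, it can be computed from first principles. Here is a minimal Python sketch, using illustrative data, that fits a linear trendline and derives R² as the proportion of variation explained:

```python
import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.1, 3.9, 6.2, 8.1, 9.8])  # illustrative data

b, a = np.polyfit(x, y, 1)               # slope b and intercept a of the linear trendline
y_hat = a + b * x

ss_res = np.sum((y - y_hat) ** 2)        # variation left unexplained by the line
ss_tot = np.sum((y - y.mean()) ** 2)     # total variation around the mean
r_squared = 1 - ss_res / ss_tot          # closer to 1 means a better fit

print(f"y = {a:.3f} + {b:.3f}x, R^2 = {r_squared:.4f}")
```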
Trendlines can be used to model relationships between variables and understand how the dependent variable behaves as the independent variable changes. For example, the demand-prediction models that we introduced in Chapter 1 (Examples 1.9 and 1.10) would generally be developed by analyzing data.

Figure 8.1: Excel Format Trendline Pane

Example 8.1 Modeling a Price-Demand Function

A market research study has collected data on sales volumes for different levels of pricing of a particular product. The data and a scatter diagram are shown in Figure 8.2 (Excel file Price-Sales Data). The relationship between price and sales clearly appears to be linear, so a linear trendline was fit to the data. The resulting model is

sales = 20,512 − 9.5116 × price

This model can be used as the demand function in other marketing or financial analyses.
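As a quick sketch of such downstream use, multiplying the fitted demand function by price gives a revenue model that is quadratic in price—the kind of second-order polynomial noted earlier. The price levels below are purely illustrative:

```python
def demand(price):
    """Fitted price-demand function from Example 8.1: sales = 20,512 - 9.5116 * price."""
    return 20512 - 9.5116 * price

def revenue(price):
    """Revenue = price x demand(price); a second-order polynomial in price."""
    return price * demand(price)

for p in (500, 1000, 1500):  # illustrative price levels only
    print(f"price={p}: demand={demand(p):,.0f}, revenue={revenue(p):,.0f}")
```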

Trendlines are also used extensively in modeling trends over time—that is, when the
variable x in the functional relationships represents time. For example, an analyst for an
airline needs to predict where fuel prices are going, and an investment analyst would want
to predict the price of stocks or key economic indicators.

Example 8.2 Predicting Crude Oil Prices

Figure 8.3 shows a chart of historical data on crude oil prices on the first Friday of each month from January 2006 through June 2008 (data are in the Excel file Crude Oil Prices). Using the Trendline tool, we can try to fit the various functions to these data (here x represents the number of months starting with January 2006). The results are as follows:

exponential: y = 50.49e^(0.021x), R² = 0.664
logarithmic: y = 13.02 ln(x) + 39.60, R² = 0.382
polynomial (second order): y = 0.130x² − 2.399x + 68.01, R² = 0.905
polynomial (third order): y = 0.005x³ − 0.111x² + 0.648x + 59.497, R² = 0.928
power: y = 45.96x^0.0169, R² = 0.397

The best-fitting model is the third-order polynomial, shown in Figure 8.4.
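The Trendline tool performs these fits internally. For readers who want to replicate the comparison outside Excel, here is a minimal Python sketch, under the assumption of a synthetic monthly series standing in for the Crude Oil Prices data:

```python
import numpy as np
from scipy.optimize import curve_fit

# Synthetic stand-in for the monthly price series (x = month number starting at 1)
x = np.arange(1, 31, dtype=float)
y = 60 + 0.004 * x**3 + np.random.default_rng(1).normal(0, 3, x.size)

def r_squared(y, y_hat):
    return 1 - np.sum((y - y_hat) ** 2) / np.sum((y - y.mean()) ** 2)

# Polynomial fits of increasing order (R^2 can only increase with the order)
for order in (2, 3):
    coeffs = np.polyfit(x, y, order)
    print(f"order {order}: R^2 = {r_squared(y, np.polyval(coeffs, x)):.3f}")

# Exponential fit y = a e^(bx) via nonlinear least squares
(a, b), _ = curve_fit(lambda t, a, b: a * np.exp(b * t), x, y, p0=(50.0, 0.02))
print(f"exponential: R^2 = {r_squared(y, a * np.exp(b * x)):.3f}")
```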

Figure 8.2: Price-Sales Data and Scatter Diagram with Fitted Linear Function

Figure 8.3: Chart of Crude Oil Prices

Be cautious when using polynomial functions. The R² value will continue to increase as the order of the polynomial increases; that is, a third-order polynomial will provide a better fit than a second-order polynomial, and so on. Higher-order polynomials will generally not be very smooth and will be difficult to interpret visually. Thus, we don't recommend going beyond a third-order polynomial when fitting data. Use your eye to make a good judgment!
Of course, the proper model to use depends on the scope of the data. As the chart shows, crude oil prices were relatively stable until early 2007 and then began to increase rapidly. By including the early data, the long-term functional relationship might not adequately express the short-term trend. For example, fitting a model to only the data beginning with January 2007 yields these models:

exponential: y = 50.56e^(0.044x), R² = 0.969
polynomial (second order): y = 0.121x² + 1.232x + 53.48, R² = 0.968
linear: y = 3.548x + 45.76, R² = 0.944

Figure 8.4: Polynomial Fit of Crude Oil Prices

The difference in prediction can be significant. For example, predicting the price 6 months after the last data point (x = 36) yields $172.24 for the third-order polynomial fit with all the data and $246.45 for the exponential model with only the recent data. Thus, the analyst must be careful to select the proper amount of data for the analysis. The question then becomes one of choosing the best assumptions for the model. Is it reasonable to assume that prices would increase exponentially, or perhaps at a slower rate, such as with the linear model fit? Or would they level off and start falling? Clearly, factors other than historical trends would enter into this choice. As we now know, oil prices plunged in the latter half of 2008; thus, all predictive models are risky.
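As a quick check, these predictions follow from evaluating each fitted model at x = 36 (small differences from the quoted values reflect rounding of the displayed coefficients):

y = 0.005(36)³ − 0.111(36)² + 0.648(36) + 59.497 ≈ 233.28 − 143.86 + 23.33 + 59.50 ≈ 172.2
y = 50.56e^(0.044 × 36) = 50.56e^1.584 ≈ 246.4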

Simple Linear Regression

Regression analysis is a tool for building mathematical and statistical models that characterize relationships between a dependent variable (which must be a ratio variable and not categorical) and one or more independent, or explanatory, variables, all of which are numerical (but may be either ratio or categorical).

Two broad categories of regression models are used often in business settings: (1) regression models of cross-sectional data and (2) regression models of time-series data, in which the independent variables are time or some function of time and the focus is on predicting the future. Time-series regression is an important tool in forecasting, which is the subject of Chapter 9.
A regression model that involves a single independent variable is called simple regression. A regression model that involves two or more independent variables is called multiple regression. In the remainder of this chapter, we describe how to develop and analyze both simple and multiple regression models.

Simple linear regression involves finding a linear relationship between one independent variable, X, and one dependent variable, Y. The relationship between two variables can assume many forms, as illustrated in Figure 8.5. The relationship may be linear or nonlinear, or there may be no relationship at all. Because we are focusing our discussion on linear regression models, the first thing to do is to verify that the relationship is linear, as in Figure 8.5(a). We would not expect to see the data line up perfectly along a straight line; we simply want to verify that the general relationship is linear. If the relationship is clearly nonlinear, as in Figure 8.5(b), then alternative approaches must be used, and if no relationship is evident, as in Figure 8.5(c), then it is pointless to even consider developing a linear regression model.
To determine if a linear relationship exists between the variables, we recommend that
you create a scatter chart that can show the relationship between variables visually.

Figure 8.5: Examples of Variable Relationships: (a) Linear; (b) Nonlinear; (c) No relationship

Example 8.3 Home Market Value Data

The market value of a house is typically related to its size. In the Excel file Home Market Value (see Figure 8.6), data obtained from a county auditor provides information about the age, square footage, and current market value of houses in a particular subdivision. We might wish to investigate the relationship between the market value and the size of the home. The independent variable, X, is the number of square feet, and the dependent variable, Y, is the market value.

Figure 8.7 shows a scatter chart of the market value in relation to the size of the home. In general, we see that higher market values are associated with larger house sizes and the relationship is approximately linear. Therefore, we could conclude that simple linear regression would be an appropriate technique for predicting market value based on house size.

Figure 8.6: Portion of Home Market Value

Figure 8.7: Scatter Chart of Market Value versus Home Size

Finding the Best-Fitting Regression Line

The idea behind simple linear regression is to express the relationship between the dependent and independent variables by a simple linear equation, such as

market value = a + b × square feet

where a is the y-intercept and b is the slope of the line. If we draw a straight line through the data, some of the points will fall above the line, some will fall below it, and a few might fall on the line itself. Figure 8.8 shows two possible straight lines that pass through the data. Clearly, you would choose A as the better-fitting line over B because all the points are closer to the line and the line appears to be in the middle of the data. The only difference between the lines is the value of the slope and intercept; thus, we seek to determine the values of the slope and intercept that provide the best-fitting line.

Figure 8.8: Two Possible Regression Lines

Example 8.4 Using Excel to Find the Best Regression Line

When using the Trendline tool for simple linear regression in the Home Market Value example, be sure the linear function option is selected (it is the default option when you use the tool). Figure 8.9 shows the best-fitting regression line. The equation is

market value = $32,673 + $35.036 × square feet

The value of the regression line can be explained as follows. Suppose we wanted to estimate the home market value for any home in the population from which the sample data were gathered. If all we knew were the market values, then the best estimate of the market value for any home would simply be the sample mean, which is $92,069. Thus, no matter if the house has 1,500 square feet or 2,200 square feet, the best estimate of market value would still be $92,069. Because the market values vary from about $75,000 to more than $120,000, there is quite a bit of uncertainty in using the mean as the estimate. However, from the scatter chart, we see that larger homes tend to have higher market values. Therefore, if we know that a home has 2,200 square feet, we would expect the market value estimate to be higher than for one that has only 1,500 square feet. For example, the estimated market value of a home with 2,200 square feet would be

market value = $32,673 + $35.036 × 2,200 = $109,752

whereas the estimated value for a home with 1,500 square feet would be

market value = $32,673 + $35.036 × 1,500 = $85,227

The regression model explains the differences in market value as a function of the house size and provides better estimates than simply using the average of the sample data.

One important caution: it is dangerous to extrapolate a regression model outside the ranges covered by the observations. For instance, if you want to predict the market value of a house that has 3,000 square feet, the results may or may not be accurate, because the regression model estimates did not use any observations greater than 2,400 square feet. We cannot be sure that a linear extrapolation will hold and should not use the model to make such predictions.

Figure 8.9: Best-Fitting Simple Linear Regression Line

We can find the best-fitting line using the Excel Trendline tool (with the linear option
chosen), as described earlier in this chapter.

Least-Squares Regression

The mathematical basis for the best-fitting regression line is called least-squares regression. In regression analysis, we assume that the values of the dependent variable, Y, in the sample data are drawn from some unknown population for each value of the independent variable, X. For example, in the Home Market Value data, the first and fourth observations come from a population of homes having 1,812 square feet; the second observation comes from a population of homes having 1,914 square feet; and so on.

Because we are assuming that a linear relationship exists, the expected value of Y is β₀ + β₁X for each value of X. The coefficients β₀ and β₁ are population parameters that represent the intercept and slope, respectively, of the population from which a sample of observations is taken. The intercept is the mean value of Y when X = 0, and the slope is the change in the mean value of Y as X changes by one unit.

Thus, for a specific value of X, we have many possible values of Y that vary around the mean. To account for this, we add an error term, ε (the Greek letter epsilon), to the mean. This defines a simple linear regression model:

Y = β₀ + β₁X + ε    (8.1)

However, because we don't know the entire population, we don't know the true values of β₀ and β₁. In practice, we must estimate these as best we can from the sample data. Define b₀ and b₁ to be estimates of β₀ and β₁. Thus, the estimated simple linear regression equation is

Ŷ = b₀ + b₁X    (8.2)

Let Xᵢ be the value of the independent variable of the ith observation. When the value of the independent variable is Xᵢ, then Ŷᵢ = b₀ + b₁Xᵢ is the estimated value of Y for Xᵢ.

One way to quantify the relationship between each point and the estimated regression equation is to measure the vertical distance between them, as illustrated in Figure 8.10.

Figure 8.10: Measuring the Errors in a Regression Model (the errors e₁ and e₂ are the vertical distances between the observed values Y₁, Y₂ and the estimated values Ŷ₁, Ŷ₂ for individual observations)

We can think of these differences, eᵢ, as the observed errors (often called residuals) associated with estimating the value of the dependent variable using the regression line. Thus, the error associated with the ith observation is:

eᵢ = Yᵢ − Ŷᵢ    (8.3)

The best-fitting line should minimize some measure of these errors. Because some errors will be negative and others positive, we might take their absolute value or simply square them. Mathematically, it is easier to work with the squares of the errors. Adding the squares of the errors, we obtain the following function:

∑ᵢ₌₁ⁿ eᵢ² = ∑ᵢ₌₁ⁿ (Yᵢ − Ŷᵢ)² = ∑ᵢ₌₁ⁿ (Yᵢ − [b₀ + b₁Xᵢ])²    (8.4)

If we can find the best values of the slope and intercept that minimize the sum of squares (hence the name "least squares") of the observed errors eᵢ, we will have found the best-fitting regression line. Note that Xᵢ and Yᵢ are the values of the sample data and that b₀ and b₁ are unknowns in equation (8.4). Using calculus, we can show that the solution that minimizes the sum of squares of the observed errors is

b₁ = [∑ᵢ₌₁ⁿ XᵢYᵢ − n X̄ Ȳ] / [∑ᵢ₌₁ⁿ Xᵢ² − n X̄²]    (8.5)

b₀ = Ȳ − b₁X̄    (8.6)

Although the calculations for the least-squares coefficients appear to be somewhat complicated, they can easily be performed on an Excel spreadsheet. Even better, Excel has built-in capabilities for doing this. For example, you may use the functions INTERCEPT(known_y's, known_x's) and SLOPE(known_y's, known_x's) to find the least-squares coefficients b₀ and b₁.

Example 8.5 Using Excel Functions to Find Least-Squares Coefficients

For the Home Market Value Excel file, the range of the dependent variable Y (market value) is C4:C45; the range of the independent variable X (square feet) is B4:B45. The function INTERCEPT(C4:C45, B4:B45) yields b₀ = 32,673 and SLOPE(C4:C45, B4:B45) yields b₁ = 35.036, as we saw in Example 8.4. The slope tells us that for every additional square foot, the market value increases by $35.036.

We may use the Excel function TREND(known_y's, known_x's, new_x's) to estimate Y for any value of X; for example, for a house with 1,750 square feet, the estimated market value is TREND(C4:C45, B4:B45, 1750) = $93,986.
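For readers working outside Excel, formulas (8.5) and (8.6) can be applied directly. The following Python sketch uses placeholder arrays standing in for the Square Feet and Market Value columns:

```python
import numpy as np

# Placeholder arrays standing in for the Square Feet (X) and Market Value (Y) columns
X = np.array([1812.0, 1914.0, 1842.0, 1812.0, 1836.0])
Y = np.array([90000.0, 104400.0, 93300.0, 91000.0, 101900.0])

n = len(X)
b1 = (np.sum(X * Y) - n * X.mean() * Y.mean()) / (np.sum(X**2) - n * X.mean()**2)  # formula (8.5)
b0 = Y.mean() - b1 * X.mean()                                                      # formula (8.6)

print(f"intercept b0 = {b0:,.3f}, slope b1 = {b1:.3f}")
# np.polyfit(X, Y, 1) returns the same slope and intercept, mirroring Excel's SLOPE and INTERCEPT
```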

We could stop at this point, because we have found the best-fitting line for the observed data. However, there is a lot more to regression analysis from a statistical perspective, because we are working with sample data—and usually rather small samples—which we know have a lot of variation as compared with the full population. Therefore, it is important to understand some of the statistical properties associated with regression analysis.

Simple Linear Regression with Excel

Regression-analysis software tools available in Excel provide a variety of information about the statistical properties of regression analysis. The Excel Regression tool can be used for both simple and multiple linear regressions. For now, we focus on using the tool just for simple linear regression.

From the Data Analysis menu in the Analysis group under the Data tab, select the Regression tool. The dialog box shown in Figure 8.11 is displayed. In the box for the Input Y Range, specify the range of the dependent variable values. In the box for the Input X Range, specify the range for the independent variable values. Check Labels if your data range contains a descriptive label (we highly recommend using this). You have the option of forcing the intercept to zero by checking Constant is Zero; however, you will usually not check this box because adding an intercept term allows a better fit to the data. You also can set a Confidence Level (the default of 95% is commonly used) to provide confidence intervals for the intercept and slope parameters. In the Residuals section, you have the option of including a residuals output table by checking the boxes for Residuals, Standardized Residuals, Residual Plots, and Line Fit Plots. Residual Plots generates a chart for each independent variable versus the residual, and Line Fit Plots generates a scatter chart with the values predicted by the regression model included (however, creating a scatter chart with an added trendline is visually superior to what this tool provides). Finally, you may also choose to have Excel construct a normal probability plot for the dependent variable, which transforms the cumulative probability scale (vertical axis) so that the graph of the cumulative normal distribution is a straight line. The closer the points are to a straight line, the better the fit to a normal distribution.
Figure 8.12 shows the basic regression analysis output provided by the Excel Regression tool for the Home Market Value data. The output consists of three sections: Regression Statistics (rows 3–8), ANOVA (rows 10–14), and an unlabeled section at the bottom (rows 16–18) with other statistical information. The least-squares estimates of the slope and intercept are found in the Coefficients column in the bottom section of the output.

Figure 8.11: Excel Regression Tool Dialog

Figure 8.12: Basic Regression Analysis Output for Home Market Value Example

In the Regression Statistics section, Multiple R is another name for the sample correlation coefficient, r, which was introduced in Chapter 4. Values of r range from −1 to 1, where the sign is determined by the sign of the slope of the regression line. A Multiple R value greater than 0 indicates positive correlation; that is, as the independent variable increases, the dependent variable does also. A value less than 0 indicates negative correlation—as X increases, Y decreases. A value of 0 indicates no correlation.
R-squared (R²) is called the coefficient of determination. Earlier we noted that R² is a measure of how well the regression line fits the data; this value is also provided by the Trendline tool. Specifically, R² gives the proportion of variation in the dependent variable that is explained by the independent variable of the regression model. The value of R² is between 0 and 1. A value of 1.0 indicates a perfect fit, and all data points lie on the regression line, whereas a value of 0 indicates that no relationship exists. Although we would like high values of R², it is difficult to specify a "good" value that signifies a strong relationship because this depends on the application. For example, in scientific applications such as calibrating physical measurement equipment, R² values close to 1 would be expected; in marketing research studies, an R² of 0.6 or more is considered very good; however, in many social science applications, values in the neighborhood of 0.3 might be considered acceptable.
Adjusted R Square is a statistic that modifies the value of R² by incorporating the sample size and the number of explanatory variables in the model. Although it does not give the actual percent of variation explained by the model as R² does, it is useful when comparing this model with other models that include additional explanatory variables. We discuss it more fully in the context of multiple linear regression later in this chapter.
Standard Error in the Excel output is the variability of the observed Y-values from the predicted values (Ŷ). This is formally called the standard error of the estimate, S_YX. If the data are clustered close to the regression line, then the standard error will be small; the more scattered the data are, the larger the standard error.
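For reference, the standard error of the estimate for a simple linear regression with n observations is computed from the residuals as follows (a standard formula; Excel reports only the result):

S_YX = √[ ∑ᵢ₌₁ⁿ (Yᵢ − Ŷᵢ)² / (n − 2) ]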

Example 8.6 Interpreting Regression Statistics for Simple Linear Regression

After running the Excel Regression tool, the first things to look for are the values of the slope and intercept, namely, the estimates b₁ and b₀ in the regression model. In the Home Market Value example, we see that the intercept is 32,673, and the slope (coefficient of the independent variable, Square Feet) is 35.036, just as we had computed earlier. In the Regression Statistics section, R² = 0.5347. This means that approximately 53% of the variation in Market Value is explained by Square Feet. The remaining variation is due to other factors that were not included in the model. The standard error of the estimate is $7,287.72. If we compare this to the standard deviation of the market value, which is $10,553, we see that the variation around the regression line ($7,287.72) is less than the variation around the sample mean ($10,553). This is because the independent variable in the regression model explains some of the variation.

Regression as Analysis of Variance

In Chapter 7, we introduced analysis of variance (ANOVA), which conducts an F-test to determine whether variation due to a particular factor, such as the differences in sample means, is significantly greater than that due to error. ANOVA is commonly applied to regression to test for significance of regression. For a simple linear regression model, significance of regression is simply a hypothesis test of whether the regression coefficient β₁ (slope of the independent variable) is zero:

H₀: β₁ = 0
H₁: β₁ ≠ 0    (8.7)

If we reject the null hypothesis, then we may conclude that the slope of the independent variable is not zero and, therefore, is statistically significant in the sense that it explains some of the variation of the dependent variable around the mean. Similar to our discussion in Chapter 7, you needn't worry about the mathematical details of how F is computed, or even its value, especially since the tool does not provide the critical value for the test. What is important is the value of Significance F, which is the p-value for the F-test. If Significance F is less than the level of significance (typically 0.05), we would reject the null hypothesis.

Example 8.7 Interpreting Significance of Regression

For the Home Market Value example, the ANOVA test is shown in rows 10–14 in Figure 8.12. Significance F, that is, the p-value associated with the hypothesis test

H₀: β₁ = 0
H₁: β₁ ≠ 0

is essentially zero (3.798 × 10⁻⁸). Therefore, assuming a level of significance of 0.05, we must reject the null hypothesis and conclude that the slope, the coefficient for Square Feet, is not zero. This means that home size is a statistically significant variable in explaining the variation in market value.

Testing Hypotheses for Regression Coefficients

Rows 17–18 of the Excel output, in addition to specifying the least-squares coefficients, provide additional information for testing hypotheses associated with the intercept and slope. Specifically, we may test the null hypothesis that β₀ or β₁ equals zero. Usually, it makes little sense to test or interpret the hypothesis that β₀ = 0 unless the intercept has a significant physical meaning in the context of the application. For simple linear regression, testing the null hypothesis H₀: β₁ = 0 is the same as the significance of regression test that we described earlier.

The t-test for the slope is similar to the one-sample test for the mean that we described in Chapter 7. The test statistic is

t = (b₁ − 0) / (standard error)    (8.8)

and is given in the column labeled t Stat in the Excel output. Although the critical value of the t-distribution is not provided, the output does provide the p-value for the test.
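To see how t Stat and its p-value are connected, here is a brief Python sketch using scipy; the slope and its standard error are taken from the Excel output in Figure 8.12, and n = 42 follows from the data range used in Example 8.5:

```python
from scipy import stats

b1, se_b1 = 35.03637258, 5.16738385  # slope and its standard error from the Excel output
n = 42                               # observations in the Home Market Value data (rows 4-45)

t_stat = (b1 - 0) / se_b1            # formula (8.8) with a hypothesized slope of zero

# Two-tailed p-value from the t-distribution with n - 2 degrees of freedom
p_value = 2 * stats.t.sf(abs(t_stat), df=n - 2)
print(f"t = {t_stat:.4f}, p = {p_value:.3e}")  # about 6.78 and 3.8e-08
```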

Example 8.8 Interpreting Hypothesis Tests for Regression Coefficients

For the Home Market Value example, note that the value of t Stat is computed by dividing the coefficient by the standard error using formula (8.8). For instance, t Stat for the slope is 35.03637258/5.16738385 = 6.780292234. Because Excel does not provide the critical value with which to compare the t Stat value, we may use the p-value to draw a conclusion. Because the p-values for both coefficients are essentially zero, we would conclude that neither coefficient is statistically equal to zero. Note that the p-value associated with the test for the slope coefficient, Square Feet, is equal to the Significance F value. This will always be true for a regression model with one independent variable because it is the only explanatory variable. However, as we shall see, this will not be the case for multiple regression models.

Confidence Intervals for Regression Coefficients

Confidence intervals (Lower 95% and Upper 95% values in the output) provide information about the unknown values of the true regression coefficients, accounting for sampling error. They tell us what we can reasonably expect to be the ranges for the population intercept and slope at a 95% confidence level.

We may also use confidence intervals to test hypotheses about the regression coefficients. For example, in Figure 8.12, we see that neither confidence interval includes zero; therefore, we can conclude that β₀ and β₁ are statistically different from zero. Similarly, we can use them to test the hypotheses that the regression coefficients equal some value other than zero. For example, to test the hypotheses

H₀: β₁ = B₁
H₁: β₁ ≠ B₁

we need only check whether B₁ falls within the confidence interval for the slope. If it does not, then we reject the null hypothesis; otherwise, we fail to reject it.

Example 8.9 Interpreting Confidence Intervals for Regression Coefficients

For the Home Market Value data, a 95% confidence interval for the intercept is [14,823, 50,523]. Similarly, a 95% confidence interval for the slope is [24.59, 45.48]. Although the regression model is Ŷ = 32,673 + 35.036X, the confidence intervals suggest a bit of uncertainty about predictions using the model. Thus, although we estimated that a house with 1,750 square feet has a market value of 32,673 + 35.036(1,750) = $93,986, if the true population parameters are at the extremes of the confidence intervals, the estimate might be as low as 14,823 + 24.59(1,750) = $57,855 or as high as 50,523 + 45.48(1,750) = $130,113. Narrower confidence intervals provide more accuracy in our predictions.

Residual Analysis and Regression Assumptions

Recall that residuals are the observed errors, which are the differences between the actual values and the estimated values of the dependent variable using the regression equation. Figure 8.13 shows a portion of the residual table generated by the Excel Regression tool. The residual output includes, for each observation, the predicted value using the estimated regression equation, the residual, and the standard residual. The residual is simply the difference between the actual value of the dependent variable and the predicted value, or Yᵢ − Ŷᵢ. Figure 8.14 shows the residual plot generated by the Excel tool. This chart is actually a scatter chart of the residuals with the values of the independent variable on the x-axis.

Figure 8.13: Portion of Residual Output

Figure 8.14: Residual Plot for Square Feet

Standard residuals are residuals divided by their standard deviation. Standard residuals describe how far each residual is from its mean in units of standard deviations (similar to a z-value for a standard normal distribution). Standard residuals are useful in checking assumptions underlying regression analysis, which we will address shortly, and to detect outliers that may bias the results. Recall that an outlier is an extreme value that is different from the rest of the data. A single outlier can make a significant difference in the regression equation, changing the slope and intercept and, hence, how they would be interpreted and used in practice. Some consider a standardized residual outside of ±2 standard deviations as an outlier. A more conservative rule of thumb would be to consider outliers outside of a ±3 standard deviation range. (Commercial software packages have more sophisticated techniques for identifying outliers.)
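A minimal Python sketch of this screening rule, using placeholder residual values:

```python
import numpy as np

residuals = np.array([-6159.13, 2500.0, -1200.0, 800.0, -3100.0, 32600.0])  # placeholder values

standard_residuals = residuals / residuals.std(ddof=1)  # divide by the residuals' standard deviation

# Flag points beyond 2 (or, more conservatively, 3) standard deviations as potential outliers
outliers = np.where(np.abs(standard_residuals) > 2)[0]
print(standard_residuals.round(2), "potential outlier indices:", outliers)
```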

Example 8.10 Interpreting Residual Output

For the Home Market Value data, the first observation has a market value of $90,000 and the regression model predicts $96,159.13. Thus, the residual is 90,000 − 96,159.13 = −$6,159.13. The standard deviation of the residuals can be computed as 7,198.299. By dividing the residual by this value, we have the standardized residual for the first observation. The value of −0.8556 tells us that the first observation is about 0.85 standard deviation below the regression line. If we check the values of all the standardized residuals, you will find that the value of the last data point is 4.53, meaning that the market value of this home, having only 1,581 square feet, is more than 4 standard deviations above the predicted value and would clearly be identified as an outlier. (If you look back at Figure 8.7, you may have noticed that this point appears to be quite different from the rest of the data.) You might question whether this observation belongs in the data, because the house has a large value despite a relatively small size. The explanation might be an outdoor pool or an unusually large plot of land. Because this value will influence the regression results and may not be representative of the other homes in the neighborhood, you might consider dropping this observation and recomputing the regression model.

Checking Assumptions

The statistical hypothesis tests associated with regression analysis are predicated on some key assumptions about the data.

1. Linearity. This is usually checked by examining a scatter diagram of the data or examining the residual plot. If the model is appropriate, then the residuals should appear to be randomly scattered about zero, with no apparent pattern. If the residuals exhibit some well-defined pattern, such as a linear trend, a parabolic shape, and so on, then there is good evidence that some other functional form might better fit the data.

2. Normality of errors. Regression analysis assumes that the errors for each individual value of X are normally distributed, with a mean of zero. This can be verified either by examining a histogram of the standard residuals and inspecting for a bell-shaped distribution or by using more formal goodness-of-fit tests. It is usually difficult to evaluate normality with small sample sizes. However, regression analysis is fairly robust against departures from normality, so in most cases this is not a serious issue.

3. Homoscedasticity. The third assumption is homoscedasticity, which means that the variation about the regression line is constant for all values of the independent variable. This can also be evaluated by examining the residual plot and looking for large differences in the variances at different values of the independent variable. Caution should be exercised when looking at residual plots. In many applications, the model is derived from limited data, and multiple observations for different values of X are not available, making it difficult to draw definitive conclusions about homoscedasticity. If this assumption is seriously violated, then techniques other than least squares should be used for estimating the regression model.

4. Independence of errors. Finally, residuals should be independent for each value of the independent variable. For cross-sectional data, this assumption is usually not a problem. However, when time is the independent variable, this is an important assumption. If successive observations appear to be correlated—for example, by becoming larger over time or exhibiting a cyclical type of pattern—then this assumption is violated. Correlation among successive observations over time is called autocorrelation and can be identified by residual plots having clusters of residuals with the same sign. Autocorrelation can be evaluated more formally using a statistical test based on a measure called the Durbin–Watson statistic. The Durbin–Watson statistic is

D = ∑ᵢ₌₂ⁿ (eᵢ − eᵢ₋₁)² / ∑ᵢ₌₁ⁿ eᵢ²    (8.9)

This is a ratio of the squared differences in successive residuals to the sum of the squares of all residuals. D will range from 0 to 4. When successive residuals are positively autocorrelated, D will approach 0. Critical values of the statistic have been tabulated based on the sample size and number of independent variables that allow you to conclude that there is either evidence of autocorrelation or no evidence of autocorrelation or the test is inconclusive. For most practical purposes, values below 1 suggest autocorrelation; values above 1.5 and below 2.5 suggest no autocorrelation; and values above 2.5 suggest negative autocorrelation. This can become an issue when using regression in forecasting, which we discuss in the next chapter. Some software packages compute this statistic; however, Excel does not.

Figure 8.15: Histogram of Standard Residuals
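Because Excel does not compute D, it is easy to obtain from the residual output; here is a small Python sketch implementing formula (8.9), with illustrative residuals:

```python
import numpy as np

def durbin_watson(residuals):
    """Formula (8.9): squared successive differences over the residual sum of squares."""
    e = np.asarray(residuals, dtype=float)
    return np.sum(np.diff(e) ** 2) / np.sum(e ** 2)

# Illustrative residuals with clustered signs; values near 2 suggest no autocorrelation,
# values near 0 suggest positive autocorrelation
print(round(durbin_watson([1.2, 0.8, 1.1, -0.4, -1.0, -0.7]), 3))
```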

When assumptions of regression are violated, then statistical inferences drawn from the hypothesis tests may not be valid. Thus, before drawing inferences about regression models and performing hypothesis tests, these assumptions should be checked. However, other than linearity, these assumptions are not needed solely for model fitting and estimation purposes.

Example 8.11 Checking Regression Assumptions for the Home Market Value Data

Linearity: The scatter diagram of the market value data appears to be linear; looking at the residual plot in Figure 8.14 also confirms no pattern in the residuals.

Normality of errors: Figure 8.15 shows a histogram of the standard residuals for the market value data. The distribution appears to be somewhat positively skewed (particularly with the outlier) but does not appear to be a serious departure from normality, particularly as the sample size is small.

Homoscedasticity: In the residual plot in Figure 8.14, we see no serious differences in the spread of the data for different values of X, particularly if the outlier is eliminated.

Independence of errors: Because the data are cross-sectional, we can assume that this assumption holds.

Multiple Linear Regression

Many colleges try to predict student performance as a function of several characteristics. In the Excel file Colleges and Universities (see Figure 8.16), suppose that we wish to predict the graduation rate as a function of the other variables—median SAT, acceptance rate, expenditures/student, and percent in the top 10% of their high school class. It is logical to propose that schools with students who have higher SAT scores, a lower acceptance rate, a larger budget, and a higher percentage of students in the top 10% of their high school classes will tend to retain and graduate more students.

Figure 8.16: Portion of Excel File Colleges and Universities
A linear regression model with more than one independent variable is called a multiple linear regression model. Simple linear regression is just a special case of multiple linear regression. A multiple linear regression model has the form:

Y = β₀ + β₁X₁ + β₂X₂ + ⋯ + βₖXₖ + ε    (8.10)

where

Y is the dependent variable,
X₁, …, Xₖ are the independent (explanatory) variables,
β₀ is the intercept term,
β₁, …, βₖ are the regression coefficients for the independent variables, and
ε is the error term.

Similar to simple linear regression, we estimate the regression coefficients—called partial regression coefficients—b₀, b₁, b₂, …, bₖ, and then use the model:

Ŷ = b₀ + b₁X₁ + b₂X₂ + ⋯ + bₖXₖ    (8.11)

to predict the value of the dependent variable. The partial regression coefficients represent the expected change in the dependent variable when the associated independent variable is increased by one unit while the values of all other independent variables are held constant.

For the college and university data, the proposed model would be

Graduation% = b₀ + b₁ SAT + b₂ ACCEPTANCE + b₃ EXPENDITURES + b₄ TOP10% HS

Thus, b₂ would represent an estimate of the change in the graduation rate for a unit increase in the acceptance rate while holding all other variables constant.
As with simple linear regression, multiple linear regression uses least squares to estimate the intercept and slope coefficients that minimize the sum of squared error terms over all observations. The principal assumptions discussed for simple linear regression also hold here. The Excel Regression tool can easily perform multiple linear regression; you need to specify only the full range for the independent variable data in the dialog. One caution when using the tool: the independent variables in the spreadsheet must be in contiguous columns. So, you may have to manually move the columns of data around before applying the tool.
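For comparison outside Excel, the same least-squares fit can be obtained with ordinary least squares in Python; the rows below are placeholders standing in for the Colleges and Universities columns:

```python
import numpy as np

# Placeholder rows standing in for [Median SAT, Acceptance Rate, Expenditures/Student, Top 10% HS]
X = np.array([
    [1315.0, 0.22, 26636.0, 85.0],
    [1220.0, 0.53, 17653.0, 69.0],
    [1240.0, 0.38, 19436.0, 72.0],
    [1176.0, 0.65, 15000.0, 51.0],
    [1360.0, 0.17, 33800.0, 90.0],
])
y = np.array([93.0, 80.0, 83.0, 72.0, 95.0])  # placeholder graduation rates

# Prepend a column of ones so the first fitted coefficient is the intercept b0
A = np.column_stack([np.ones(len(X)), X])
coeffs, _, _, _ = np.linalg.lstsq(A, y, rcond=None)
print("b0, b1, ..., b4:", coeffs.round(4))
```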

The results from the Regression tool are in the same format as we saw for simple linear regression. However, some key differences exist. Multiple R and R Square (or R²) are called the multiple correlation coefficient and the coefficient of multiple determination, respectively, in the context of multiple regression. They indicate the strength of association between the dependent and independent variables. Similar to simple linear regression, R² explains the percentage of variation in the dependent variable that is explained by the set of independent variables in the model.

The interpretation of the ANOVA section is quite different from that in simple linear regression. For multiple linear regression, ANOVA tests for significance of the entire model. That is, it computes an F-statistic for testing the hypotheses

H₀: β₁ = β₂ = ⋯ = βₖ = 0
H₁: at least one βⱼ is not 0

The null hypothesis states that no linear relationship exists between the dependent and any of the independent variables, whereas the alternative hypothesis states that the dependent variable has a linear relationship with at least one independent variable. If the null hypothesis is rejected, we cannot conclude that a relationship exists with every independent variable individually.
The multiple linear regression output also provides information to test hypotheses about each of the individual regression coefficients. Specifically, we may test the null hypothesis that β₀ (the intercept) or any βᵢ equals zero. If we reject the null hypothesis that the slope associated with independent variable i is zero, H₀: βᵢ = 0, then we may state that independent variable i is significant in the regression model; that is, it contributes to reducing the variation in the dependent variable and improves the ability of the model to better predict the dependent variable. However, if we cannot reject H₀, then that independent variable is not significant and probably should not be included in the model. We see how to use this information to identify the best model in the next section.

Finally, for multiple regression models, a residual plot is generated for each independent variable. This allows you to assess the linearity and homoscedasticity assumptions of regression.

Example 8.12 Interpreting Regression Results for the Colleges and Universities Data

The multiple regression results for the college and university data are shown in Figure 8.17. From the Coefficients section, we see that the model is:

Graduation% = 17.92 + 0.072 SAT − 24.859 ACCEPTANCE − 0.000136 EXPENDITURES − 0.163 TOP10% HS

The signs of some coefficients make sense; higher SAT scores and lower acceptance rates suggest higher graduation rates. However, we might expect that larger student expenditures and a higher percentage of top high school students would also positively influence the graduation rate. Perhaps the problem occurred because some of the best students are more demanding and change schools if their needs are not being met, some entrepreneurial students might pursue other interests before graduation, or there is sampling error. As with simple linear regression, the model should be used only for values of the independent variables within the range of the data.

The value of R² (0.53) indicates that 53% of the variation in the dependent variable is explained by these independent variables. This suggests that other factors not included in the model, perhaps campus living conditions, social opportunities, and so on, might also influence the graduation rate.

From the ANOVA section, we may test for significance of regression. At a 5% significance level, we reject the null hypothesis because Significance F is essentially zero. Therefore, we may conclude that at least one slope is statistically different from zero. Looking at the p-values for the independent variables in the last section, we see that all are less than 0.05; therefore, we reject the null hypothesis that each partial regression coefficient is zero and conclude that each of them is statistically significant.

Figure 8.18 shows one of the residual plots from the Excel output. The assumptions appear to be met, and the other residual plots (not shown) also validate these assumptions. The normal probability plot (also not shown) does not suggest any serious departures from normality.

Figure 8.17: Multiple Regression Results for Colleges and Universities Data

Figure 8.18: Residual Plot for Top 10% HS Variable

Analytics in Practice: Using Linear Regression and Interactive Risk Simulators to Predict Performance at ARAMARK³

ARAMARK is a leader in professional services, providing award-winning food services, facilities management, and uniform and career apparel to health care institutions, universities and school districts, stadiums and arenas, and businesses around the world. Headquartered in Philadelphia, ARAMARK has approximately 255,000 employees serving clients in 22 countries.

ARAMARK's Global Risk Management Department (GRM) needed a way to determine the statistical relationships between key business metrics (e.g., employee tenure, employee engagement, a trained workforce, account tenure, service offerings) and risk metrics (e.g., OSHA rate, workers' compensation rate, customer injuries) to understand the impact of these risks on the business. GRM also needed a simple tool that field operators and the risk management team could use to predict the impact of business decisions on risk metrics before those decisions were implemented. Typical questions they would want to ask were, What would happen to our OSHA rate if we increased the percentage of part-time labor? and How could we impact turnover if operations improved safety performance?

ARAMARK maintains extensive historical data. For example, the Global Risk Management group keeps track of data such as OSHA rates, slip/trip/fall rates, injury costs, and level of compliance with safety standards; the Human Resources department monitors turnover and percentage of part-time labor; the Payroll department keeps data on average wages; and the Training and Organizational Development department collects data on employee engagement. Excel-based linear regression was used to determine the relationships between the dependent variables (such as OSHA rate, slip/trip/fall rate, claim cost, and turnover) and the independent variables (such as the percentage of part-time labor, average wage, employee engagement, and safety compliance).

Although the regression models provided the basic analytical support that ARAMARK needed, the GRM team used a novel approach to implement the models for use by their clients. They developed "Interactive Risk Simulators," which are simple online tools that allowed users to manipulate the values of the independent variables in the regression models using interactive sliders that correspond to the business metrics and instantaneously view the values of the dependent variables (the risk metrics) on gauges similar to those found on the dashboard of a car.

Figure 8.19 illustrates the structure of the simulators. The gauges are updated instantly as the user adjusts the sliders, showing how changes in the business environment affect the risk metrics. This visual representation made the models easy to use and understand, particularly for nontechnical employees.

GRM sent out more than 200 surveys to multiple levels of the organization to assess the usefulness of Interactive Risk Simulators. One hundred percent of respondents answered "Yes" to "Were the simulators easy to use?" and 78% of respondents answered "Yes" to "Would these simulators be useful in running your business and helping you make decisions?" The deployment of Interactive Risk Simulators to the field has been met with overwhelming positive response and recognition from leadership within all lines of business, including frontline managers, food-service directors, district managers, and general managers.
³The author expresses his appreciation to John Toczek, Manager of Decision Support and Analytics at ARAMARK Corporation.

Figure 8.19: Structure of an Interactive Risk Simulator (Inputs: Independent Variables → Regression Models → Outputs: Dependent Variables)

Building Good Regression Models

In the colleges and universities regression example, all the independent variables were found to be significant by evaluating the p-values of the regression analysis. This will not always be the case and leads to the question of how to build good regression models that include the "best" set of variables.

Figure 8.20 shows a portion of the Excel file Banking Data, which provides data acquired from banking and census records for different zip codes in the bank's current market. Such information can be useful in targeting advertising for new customers or for choosing locations for branch offices. The data show the median age of the population, median years of education, median income, median home value, median household wealth, and average bank balance.

Figure 8.21 shows the results of regression analysis used to predict the average bank balance as a function of the other variables. Although the independent variables explain more than 94% of the variation in the average bank balance, you can see that at a 0.05 significance level, the p-values indicate that both Education and Home Value do not appear to be significant. A good regression model should include only significant independent variables. However, it is not always clear exactly what will happen when we add or remove variables from a model; variables that are (or are not) significant in one model may (or may not) be significant in another. Therefore, you should not consider dropping all insignificant variables at one time, but rather take a more structured approach.
Figure 8.20: Portion of Banking Data

Figure 8.21: Regression Analysis Results for Banking Data

Adding an independent variable to a regression model will always result in R² equal to or greater than the R² of the original model. This is true even when the new independent variable has little true relationship with the dependent variable. Thus, trying to maximize R² is not a useful criterion. A better way of evaluating the relative fit of different models is to use adjusted R². Adjusted R² reflects both the number of independent variables and the sample size and may either increase or decrease when an independent variable is added or dropped, thus providing an indication of the value of adding or removing independent variables in the model. An increase in adjusted R² indicates that the model has improved.
This suggests a systematic approach to building good regression models:

1. Construct a model with all available independent variables. Check for signifi-
cance of the independent variables by examining the p-values.
2. Identify the independent variable having the largest p-value that exceeds the
chosen level of significance.
3. Remove the variable identified in step 2 from the model and evaluate adjusted
R². (Don't remove all variables with p-values that exceed α at the same time,
but remove only one at a time.)
4. Continue until all variables are significant.

In essence, this approach seeks to find a significant model that has the highest adjusted R².
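This iterative procedure is also easy to automate outside of Excel. The following Python sketch (an illustration only, not part of the Excel workflow described in the text) uses the statsmodels library to apply the same drop-one-variable-at-a-time logic; the file and column names in the usage comment are hypothetical stand-ins for the Banking Data file. Replacing the p-value test with the |t| < 1 rule described below gives the t-statistic variant.

```python
import pandas as pd
import statsmodels.api as sm

def backward_eliminate(X: pd.DataFrame, y: pd.Series, alpha: float = 0.05):
    """Drop the least significant variable, one at a time,
    until every remaining p-value is at or below alpha."""
    X = sm.add_constant(X)
    while True:
        model = sm.OLS(y, X).fit()
        pvalues = model.pvalues.drop("const")     # never drop the intercept
        if pvalues.empty or pvalues.max() <= alpha:
            return model                          # all remaining variables significant
        X = X.drop(columns=pvalues.idxmax())      # remove exactly one variable, then refit

# Hypothetical usage with the banking data:
# df = pd.read_excel("Banking Data.xlsx")
# result = backward_eliminate(df[["Age", "Education", "Income", "Home Value", "Wealth"]],
#                             df["Average Bank Balance"])
# print(result.summary())
```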

Example 8.13 Identifying the Best Regression Model


We will apply the preceding approach to the Banking Data example. The first step is to identify the variable with the largest p-value exceeding 0.05; in this case, it is Home Value, and we remove it from the model and rerun the Regression tool. Figure 8.22 shows the results after removing Home Value. Note that the adjusted R² has increased slightly, whereas the R²-value decreased slightly because we removed a variable from the model. All the p-values are now less than 0.05, so this now appears to be the best model. Notice that the p-value for Education, which was larger than 0.05 in the first regression analysis, dropped below 0.05 after Home Value was removed. This phenomenon often occurs when multicollinearity (discussed in the next section) is present and emphasizes the importance of not removing all variables with large p-values from the original model at the same time.

Figure 8.22: Regression Results without Home Value

Another criterion used to determine if a variable should be removed is the t-statistic. If |t| < 1, then the standard error will decrease and adjusted R² will increase if the variable is removed. If |t| > 1, then the opposite will occur. In the banking regression results, we see that the t-statistic for Home Value is less than 1; therefore, we expect the adjusted R² to increase if we remove this variable. You can follow the same iterative approach outlined before, except using t-values instead of p-values.
These approaches using the p-values or t-statistics may involve considerable experimentation to identify the best set of variables that result in the largest adjusted R². For large numbers of independent variables, the number of potential models can be overwhelming. For example, there are 2¹⁰ = 1,024 possible models that can be developed from a set of 10 independent variables. This can make it difficult to effectively screen out insignificant variables. Fortunately, automated methods—stepwise regression and best subsets—exist that facilitate this process.

Correlation and Multicollinearity


As we have learned previously, correlation, a numerical value between -1 and +1, mea-
sures the linear relationship between pairs of variables. The higher the absolute value
of the correlation, the greater the strength of the relationship. The sign simply indicates
whether variables tend to increase together (positive) or not (negative). Therefore, ex-
amining correlations between the dependent and independent variables, which can be
done using the Excel Correlation tool, can be useful in selecting variables to include in a
multiple regression model because a strong correlation indicates a strong linear relation-
ship. However, strong correlations among the independent variables can be problematic.
This can potentially signify a phenomenon called multicollinearity, a condition occurring
when two or more independent variables in the same regression model contain high levels
of the same information and, consequently, are strongly correlated with one another and
can predict each other better than the dependent variable. When significant multicollinear-
ity is present, it becomes difficult to isolate the effect of one independent variable on the
dependent variable, and the signs of coefficients may be the opposite of what they should
be, making it difficult to interpret regression coefficients. Also, p-values can be inflated,
resulting in the conclusion not to reject the null hypothesis for significance of regression
when it should be rejected.
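As an aside for readers working outside of Excel, the full correlation matrix can be computed in a single call; this is only an illustrative sketch with a hypothetical file name:

```python
import pandas as pd

df = pd.read_excel("Banking Data.xlsx")    # hypothetical file name
corr = df.corr(numeric_only=True)          # pairwise correlations, like Excel's Correlation tool
print(corr.round(4))
```

Pairs of independent variables whose entries exceed roughly 0.7 in absolute value are the ones to scrutinize for multicollinearity, as discussed next.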

Some experts suggest that correlations between independent variables exceeding an absolute value of 0.7 may indicate multicollinearity. However, multicollinearity is best measured using a statistic called the variance inflation factor (VIF) for each independent variable. More-sophisticated software packages usually compute these; unfortunately, Excel does not.

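VIFs are easy to obtain elsewhere, however. The following hedged Python sketch uses the statsmodels function variance_inflation_factor; the banking file and column names are hypothetical:

```python
import pandas as pd
import statsmodels.api as sm
from statsmodels.stats.outliers_influence import variance_inflation_factor

def vif_table(X: pd.DataFrame) -> pd.Series:
    """Variance inflation factor for each independent variable in X."""
    Xc = sm.add_constant(X)  # the intercept should be present when computing VIFs
    vifs = [variance_inflation_factor(Xc.values, i) for i in range(Xc.shape[1])]
    return pd.Series(vifs, index=Xc.columns).drop("const")

# Hypothetical usage; VIFs above roughly 5 to 10 are commonly read as
# signaling serious multicollinearity:
# df = pd.read_excel("Banking Data.xlsx")
# print(vif_table(df[["Age", "Education", "Income", "Home Value", "Wealth"]]))
```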
Example 8.14 Identifying Potential Multicollinearity


Figure 8.23 shows the correlation matrix for the variables in the Colleges and Universities data. You can see that SAT and Acceptance Rate have moderate correlations with the dependent variable, Graduation%, but the correlations of Expenditures/Student and Top 10% HS with Graduation% are relatively low. The strongest correlation, however, is between two independent variables: Top 10% HS and Acceptance Rate. However, the value of −0.6097 does not exceed the recommended threshold of 0.7, so we can likely assume that multicollinearity is not a problem here (a more advanced analysis using VIF calculations does indeed confirm that multicollinearity does not exist).

In contrast, Figure 8.24 shows the correlation matrix for all the data in the banking example. Note that large correlations exist between Education and Home Value and also between Wealth and Income (in fact, the variance inflation factors do indicate significant multicollinearity). If we remove Wealth from the model, the adjusted R² drops to 0.9201, but we discover that Education is no longer significant. Dropping Education and leaving only Age and Income in the model results in an adjusted R² of 0.9202. However, if we remove Income from the model instead of Wealth, the adjusted R² drops to only 0.9345, and all remaining variables (Age, Education, and Wealth) are significant (see Figure 8.25). The R²-value for the model with these three variables is 0.936.

Practical Issues in Trendline and Regression Modeling


Example 8.14 clearly shows that it is not easy to identify the best regression model simply by examining p-values. It often requires some experimentation and trial and error. From a practical perspective, the independent variables selected should make some sense in attempting to explain the dependent variable (i.e., you should have some reason to believe that changes in the independent variable will cause changes in the dependent variable even though causation cannot be proven statistically). Logic should guide your model development.

Figure 8.23: Correlation Matrix for Colleges and Universities Data

Figure 8.24: Correlation Matrix for Banking Data

Figure 8.25: Regression Results for Age, Education, and Wealth as Independent Variables

In many applications, behavioral, economic, or physical theory might suggest that certain variables should belong in a model. Remember that additional variables do contribute to a higher R² and, therefore, help to explain a larger proportion of the variation. Even though a variable with a large p-value is not statistically significant, the large p-value could simply be the result of sampling error, and a modeler might wish to keep the variable.
Good modelers also try to have as simple a model as possible—an age-old principle
known as parsimony—with the fewest number of explanatory variables that will provide
an adequate interpretation of the dependent variable. In the physical and management sci-
ences, some of the most powerful theories are the simplest. Thus, a model for the banking
data that includes only age, education, and wealth is simpler than one with four variables;
because of the multicollinearity issue, there would be little to gain by including income
in the model. Whether the model explains 93% or 94% of the variation in bank deposits
would probably make little difference. Therefore, building good regression models relies
as much on experience and judgment as it does on technical analysis.
One issue that one often faces in using trendlines and regression is overfitting the
model. It is important to realize that sample data may have unusual variability that is dif-
ferent from the population; if we fit a model too closely to the sample data we risk not
fitting it well to the population in which we are interested. For instance, in fitting the crude
oil prices in Example 8.2, we noted that the R²-value will increase if we fit higher-order
polynomial functions to the data. While this might provide a better mathematical fit to the
sample data, doing so can make it difficult to explain the phenomena rationally. The same
thing can happen with multiple regression. If we add too many terms to the model, then
the model may not adequately predict other values from the population. Overfitting can be
mitigated by using good logic, intuition, physical or behavioral theory, and parsimony as
we have discussed.

Regression with Categorical Independent Variables

Some data of interest in a regression study may be ordinal or nominal. This is common when
including demographic data in marketing studies, for example. Because regression analysis
requires numerical data, we could include categorical variables by coding the variables. For
example, if one variable represents whether an individual is a college graduate or not, we
might code No as 0 and Yes as 1. Such variables are often called dummy variables.
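As an illustrative sketch (hypothetical file and column names; standard pandas and statsmodels calls), coding and using such a dummy variable might look like this:

```python
import pandas as pd
import statsmodels.api as sm

df = pd.read_excel("Employee Salaries.xlsx")        # hypothetical file name
df["MBA"] = df["MBA"].map({"No": 0, "Yes": 1})      # dummy-code the Yes/No column

X = sm.add_constant(df[["Age", "MBA"]])
model = sm.OLS(df["Salary"], X).fit()
print(model.params)  # the MBA coefficient shifts the intercept for MBA holders
```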

Example 8.15 A Model with Categorical Variables


The Excel file Employee Salaries, shown in Figure 8.26, provides salary and age data for 35 employees, along with an indicator of whether or not the employees have an MBA (Yes or No). The MBA indicator variable is categorical; thus, we code it by replacing No by 0 and Yes by 1.

If we are interested in predicting salary as a function of the other variables, we would propose the model

Y = β0 + β1X1 + β2X2 + ε

where

Y = salary
X1 = age
X2 = MBA indicator (0 or 1)

After coding the MBA indicator column in the data file, we begin by running a regression on the entire data set, yielding the output shown in Figure 8.27. Note that the model explains about 95% of the variation, and the p-values of both variables are significant. The model is

salary = 893.59 + 1,044.15 × age + 14,767.23 × MBA

Thus, a 30-year-old with an MBA would have an estimated salary of

salary = 893.59 + 1,044.15 × 30 + 14,767.23 × 1 = $46,985.32

This model suggests that having an MBA increases the salary of this group of employees by almost $15,000. Note that by substituting either 0 or 1 for MBA, we obtain two models:

No MBA: salary = 893.59 + 1,044.15 × age
MBA: salary = 15,660.82 + 1,044.15 × age

The only difference between them is the intercept. The models suggest that the rate of salary increase with age is the same for both groups. Of course, this may not be true; individuals with MBAs might earn relatively higher salaries as they get older. In other words, the slope of Age may depend on the value of MBA.

Figure 8.26: Portion of Excel File Employee Salaries

Figure 8.27: Initial Regression Model for Employee Salaries

An interaction occurs when the effect of one variable (i.e., the slope) is dependent on another variable. We can test for interactions by defining a new variable as the product of the two variables, X3 = X1 × X2, and testing whether this variable is significant, leading to an alternative model.
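A sketch of this test, under the same hypothetical salary data assumed earlier, simply adds the product column and examines its p-value:

```python
import pandas as pd
import statsmodels.api as sm

df = pd.read_excel("Employee Salaries.xlsx")        # hypothetical file name
df["MBA"] = df["MBA"].map({"No": 0, "Yes": 1})
df["Interaction"] = df["Age"] * df["MBA"]           # X3 = X1 * X2

X = sm.add_constant(df[["Age", "MBA", "Interaction"]])
model = sm.OLS(df["Salary"], X).fit()
print(model.pvalues)  # a significant Interaction term means the slope differs by group
```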

Example 8.16 Incorporating Interaction Terms in a Regression Model


For the Employee Salaries example, we define an interaction term as the product of age (X1) and MBA (X2) by defining X3 = X1 × X2. The new model is

Y = β0 + β1X1 + β2X2 + β3X3 + ε

In the worksheet, we need to create a new column (called Interaction) by multiplying MBA by Age for each observation (see Figure 8.28). The regression results are shown in Figure 8.29.

From Figure 8.29, we see that the adjusted R² increases; however, the p-value for the MBA indicator variable is 0.33, indicating that this variable is not significant. Therefore, we drop this variable and run a regression using only age and the interaction term. The results are shown in Figure 8.30. Adjusted R² increased slightly, and both age and the interaction term are significant. The final model is

salary = 3,323.11 + 984.25 × age + 425.58 × MBA × age

The models for employees with and without an MBA are:

No MBA: salary = 3,323.11 + 984.25 × age + 425.58(0) × age = 3,323.11 + 984.25 × age
MBA: salary = 3,323.11 + 984.25 × age + 425.58(1) × age = 3,323.11 + 1,409.83 × age

Here, we see that salary depends not only on whether an employee holds an MBA but also on age; this model is more realistic than the original one.

Figure 8.28: Portion of Employee Salaries Modified for Interaction Term

Figure 8.29: Regression Results with Interaction Term

Figure 8.30: Final Regression Model for Salary Data

Categorical Variables with More Than Two Levels


When a categorical variable has only two levels, as in the previous example, we coded the levels as 0 and 1 and added a new variable to the model. However, when a categorical variable has k > 2 levels, we need to add k − 1 additional variables to the model.
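As a hedged illustration (hypothetical file and column names), pandas can generate the k − 1 dummy columns directly; the omitted level becomes the baseline:

```python
import pandas as pd
import statsmodels.api as sm

df = pd.read_excel("Surface Finish.xlsx")           # hypothetical file name
# drop_first=True keeps k - 1 = 3 dummies; tool type A becomes the baseline
dummies = pd.get_dummies(df["Tool Type"], prefix="Type", drop_first=True, dtype=float)

X = sm.add_constant(pd.concat([df[["RPM"]], dummies], axis=1))
model = sm.OLS(df["Surface Finish"], X).fit()
print(model.params)  # each dummy coefficient is an offset from the type-A intercept
```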

Example 8.17 A Regression Model with Multiple Levels of Categorical Variables


The Excel file Surface Finish provides measurements of the surface finish of 35 parts produced on a lathe, along with the revolutions per minute (RPM) of the spindle and one of four types of cutting tools used (see Figure 8.31). The engineer who collected the data is interested in predicting the surface finish as a function of RPM and type of tool.

Intuition might suggest defining a dummy variable for each tool type; however, doing so will cause numerical instability in the data and cause the regression tool to crash. Instead, we will need k − 1 = 3 dummy variables corresponding to three of the levels of the categorical variable. The level left out will correspond to a reference, or baseline, value. Therefore, because we have k = 4 levels of tool type, we will define a regression model of the form

Y = β0 + β1X1 + β2X2 + β3X3 + β4X4 + ε

where

Y = surface finish
X1 = RPM
X2 = 1 if tool type is B and 0 if not
X3 = 1 if tool type is C and 0 if not
X4 = 1 if tool type is D and 0 if not

Note that when X2 = X3 = X4 = 0, then, by default, the tool type is A. Substituting these values for each tool type into the model, we obtain:

Tool type A: Y = β0 + β1X1 + ε
Tool type B: Y = β0 + β1X1 + β2 + ε
Tool type C: Y = β0 + β1X1 + β3 + ε
Tool type D: Y = β0 + β1X1 + β4 + ε

For a fixed value of RPM (X1), the coefficients corresponding to the dummy variables represent the difference between the surface finish using that tool type and the baseline using tool type A.

To incorporate these dummy variables into the regression model, we add three columns to the data, as shown in Figure 8.32. Using these data, we obtain the regression results shown in Figure 8.33. The resulting model is

surface finish = 24.49 + 0.098 RPM − 13.31 type B − 20.49 type C − 26.04 type D

Almost 99% of the variation in surface finish is explained by the model, and all variables are significant. The models for each individual tool are

Tool A: surface finish = 24.49 + 0.098 RPM − 13.31(0) − 20.49(0) − 26.04(0) = 24.49 + 0.098 RPM
Tool B: surface finish = 24.49 + 0.098 RPM − 13.31(1) − 20.49(0) − 26.04(0) = 11.18 + 0.098 RPM
Tool C: surface finish = 24.49 + 0.098 RPM − 13.31(0) − 20.49(1) − 26.04(0) = 4.00 + 0.098 RPM
Tool D: surface finish = 24.49 + 0.098 RPM − 13.31(0) − 20.49(0) − 26.04(1) = −1.55 + 0.098 RPM

Note that the only differences among these models are the intercepts; the slopes associated with RPM are the same. This suggests that we might wish to test for interactions between the type of cutting tool and RPM; we leave this to you as an exercise.

Figure 8.31: Portion of Excel File Surface Finish

Figure 8.32: Data Matrix for Surface Finish with Dummy Variables

Figure 8.33: Surface Finish Regression Model Results

Regression Models with Nonlinear Terms

Linear regression models are not appropriate for every situation. A scatter chart of the
data might show a nonlinear relationship, or the residuals for a linear fit might result in a
­nonlinear pattern. In such cases, we might propose a nonlinear model to explain the rela-
tionship. For instance, a second-order polynomial model would be

Y = β0 + β1X + β2X² + ε
Sometimes, this is called a curvilinear regression model. In this model, β1 represents the linear effect of X on Y, and β2 represents the curvilinear effect. However, although this
model appears to be quite different from ordinary linear regression models, it is still linear
in the parameters (the betas, which are the unknowns that we are trying to estimate). In
other words, all terms are a product of a beta coefficient and some function of the data,
which are simply numerical values. In such cases, we can still apply least squares to esti-
mate the regression coefficients.
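A minimal sketch (assuming hypothetical beverage-sales column names) makes this concrete: the squared term is just another column, and ordinary least squares fits the model unchanged.

```python
import pandas as pd
import statsmodels.api as sm

df = pd.read_excel("Beverage Sales.xlsx")           # hypothetical file name
df["TempSq"] = df["Temperature"] ** 2               # the curvilinear term

X = sm.add_constant(df[["Temperature", "TempSq"]])
model = sm.OLS(df["Sales"], X).fit()                # still linear in the betas
print(model.params)
```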
Curvilinear regression models are also often used in forecasting when the indepen-
dent variable is time. This and other applications of regression in forecasting are discussed
in the next chapter.

Example 8.18 Modeling Beverage Sales Using Curvilinear Regression


The Excel file Beverage Sales provides data on the sales of cold beverages at a small restaurant with a large outdoor patio during the summer months (see Figure 8.34). The owner has observed that sales tend to increase on hotter days. Figure 8.35 shows linear regression results for these data. The U-shape of the residual plot (a second-order polynomial trendline was fit to the residual data) suggests that a linear relationship is not appropriate.

To apply a curvilinear regression model, add a column to the data matrix by squaring the temperatures. Now, both temperature and temperature squared are the independent variables. Figure 8.36 shows the results for the curvilinear regression model. The model is:

sales = 142,850 − 3,643.17 × temperature + 23.3 × temperature²

Note that the adjusted R² has increased significantly from the linear model and that the residual plots now show more random patterns.

Figure 8.34: Portion of Excel File Beverage Sales

Figure 8.35: Linear Regression Results for Beverage Sales

Figure 8.36: Curvilinear Regression Results for Beverage Sales

Advanced Techniques for Regression Modeling using XLMiner

XLMiner is an Excel add-in for data mining that accompanies Analytic Solver Platform.
Data mining is the subject of Chapter 10 and includes a wide variety of statistical proce-
dures for exploring data, including regression analysis. The regression analysis tool in
XLMiner has some advanced options not available in Excel's Regression tool,
which we discuss in this section.
Best-subsets regression evaluates either all possible regression models for a set of independent variables or the best subsets of models for a fixed number of independent variables. It helps you to find the best model based on the adjusted R². Best-subsets regression evaluates models using a statistic called Cp (Mallows's Cp), which estimates the bias introduced in the estimates of the responses by having an underspecified model (a model with important predictors missing). If Cp is much greater than k + 1 (the number of independent variables plus 1), there is substantial bias. The full model always has Cp = k + 1. If all models except the full model have large Cp values, it suggests that important predictor variables are missing. Models with a minimum Cp value, or with Cp less than or at least close to k + 1, are good models to consider.
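To make the Cp computation concrete, here is an illustrative exhaustive-search sketch in Python (assuming pandas and statsmodels, with hypothetical column names in the usage comment); it evaluates every subset against the full model's residual mean square:

```python
from itertools import combinations
import pandas as pd
import statsmodels.api as sm

def best_subsets(X: pd.DataFrame, y: pd.Series) -> pd.DataFrame:
    """Fit every subset of predictors; report adjusted R-squared and Cp."""
    n = len(y)
    full = sm.OLS(y, sm.add_constant(X)).fit()
    mse_full = full.ssr / (n - X.shape[1] - 1)         # residual mean square, full model
    rows = []
    for k in range(1, X.shape[1] + 1):
        for cols in combinations(X.columns, k):
            fit = sm.OLS(y, sm.add_constant(X[list(cols)])).fit()
            cp = fit.ssr / mse_full - n + 2 * (k + 1)  # good subsets have Cp near k + 1
            rows.append({"variables": cols, "adj_R2": fit.rsquared_adj, "Cp": cp})
    return pd.DataFrame(rows).sort_values("adj_R2", ascending=False)

# Hypothetical usage with the banking data:
# df = pd.read_excel("Banking Data.xlsx")
# print(best_subsets(df[["Age", "Education", "Income", "Home Value", "Wealth"]],
#                    df["Average Bank Balance"]).head())
```

Exhaustive enumeration like this is practical only for modest numbers of variables, which is exactly why XLMiner offers the alternative selection procedures described next.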
XLMiner offers five different procedures for selecting the best subsets of variables.
Backward Elimination begins with all independent variables in the model and deletes one
at a time until the best model is identified. Forward Selection begins with a model having
no independent variables and successively adds one at a time until no additional variable
makes a significant contribution. Stepwise Selection is similar to Forward Selection ex-
cept that at each step, the procedure considers dropping variables that are not statistically
significant. Sequential Replacement replaces variables sequentially, retaining those that
improve performance. These options might terminate with a different model. Exhaustive
Search looks at all combinations of variables to find the one with the best fit, but it can be
time consuming for large numbers of variables.

Example 8.19 Using XLMiner for Regression


We will use the Banking Data example. After installation, XLMiner will appear as a new tab in the Excel ribbon; the XLMiner ribbon is shown in Figure 8.37. To use the basic regression tool, click the Predict button in the Data Mining group and choose Multiple Linear Regression. The first of two dialogs will then be displayed, as shown in Figure 8.38. First, enter the data range (including headers) in the box near the top and check the box First row contains headers. All the variables will be listed in the left pane (Variables in input data). Select the independent variables and move them using the arrow button to the Input variables pane; then select the dependent variable and move it to the Output variable pane, as shown in the figure. Click Next. The second dialog, shown in Figure 8.39, will appear. Select the output options and check the Summary report box. However, before clicking Finish, click the Best subsets button. In the dialog shown in Figure 8.40, check the box at the top and choose the selection procedure. Click OK and then click Finish in the Step 2 dialog.

XLMiner creates a new worksheet with an "Output Navigator" that allows you to click on hyperlinks to see various portions of the output (see Figure 8.41). The regression model and ANOVA output are shown in Figure 8.42; note that this is the same as the output shown in Figure 8.21. The Best subsets results appear below the ANOVA output, as shown in Figure 8.43. RSS is the residual sum of squares, or the sum of squared deviations between the predicted and actual values of the dependent variable. Probability is a quasi-hypothesis test that a given subset is acceptable; if this is less than 0.05, you can rule out that subset. Note that the model with 5 coefficients (including the intercept) is the only one that has a Cp value less than k + 1 = 5, and its adjusted R² is the largest. If you click Choose Subset, XLMiner will create a new worksheet with the results for this model, which is the same as we found in Figure 8.22; that is, the model without the Home Value variable.

Figure 8.37: XLMiner Ribbon

Figure 8.38: XLMiner Linear Regression Dialog, Step 1 of 2

Figure 8.39: XLMiner Linear Regression Dialog, Step 2 of 2

Figure 8.40: XLMiner Best Subset Dialog

Figure 8.41: XLMiner Output Navigator

Figure 8.42: XLMiner Regression Output

Figure 8.43: XLMiner Best Subsets Results

XLMiner also provides cross-validation—a process of using two sets of sample data;
one to build the model (called the training set), and the second to assess the model’s per-
formance (called the validation set). This will be explained in Chapter 10 when we study
data mining in more depth, but is not necessary for standard regression analysis.
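A minimal sketch of the idea, with a hypothetical file name and column names: hold out part of the sample, fit the model on the remainder, and judge it by its out-of-sample error.

```python
import pandas as pd
import statsmodels.api as sm

df = pd.read_excel("Banking Data.xlsx")                  # hypothetical file name
train = df.sample(frac=0.7, random_state=1)              # training set
valid = df.drop(train.index)                             # validation set

cols = ["Age", "Education", "Income", "Wealth"]
model = sm.OLS(train["Average Bank Balance"],
               sm.add_constant(train[cols])).fit()

pred = model.predict(sm.add_constant(valid[cols]))
rmse = ((pred - valid["Average Bank Balance"]) ** 2).mean() ** 0.5
print(rmse)  # lower out-of-sample error suggests better generalization
```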

Key Terms

Autocorrelation
Best-subsets regression
Coefficient of determination (R²)
Coefficient of multiple determination
Cross-validation
Curvilinear regression model
Dummy variables
Exponential function
Homoscedasticity
Interaction
Least-squares regression
Linear function
Logarithmic function
Multicollinearity
Multiple correlation coefficient
Multiple linear regression
Overfitting
Parsimony
Partial regression coefficient
Polynomial function
Power function
R² (R-squared)
Regression analysis
Residuals
Significance of regression
Simple linear regression
Standard error of the estimate, SYX
Standard residuals

Problems and Exercises

1. Each worksheet in the Excel file LineFit Data contains a set of data that describes a functional relationship between the dependent variable y and the independent variable x. Construct a line chart of each data set, and use the Add Trendline tool to determine the best-fitting functions to model these data sets.

2. A consumer products company has collected some data relating monthly demand to the price of one of its products:

   Price    Demand
   $11      2,100
   $13      2,020
   $17      1,980
   $19      1,875

   What type of model would best represent these data? Use the Trendline tool to find the best among the options provided.

3. Using the data in the Excel file Demographics, determine if a linear relationship exists between unemployment rates and cost of living indexes by constructing a scatter chart. Visually, do there appear to be any outliers? If so, delete them and then find the best-fitting linear regression line using the Excel Trendline tool. What would you conclude about the strength of any relationship? Would you use regression to make predictions of the unemployment rate based on the cost of living?

4. Using the data in the Excel file Weddings, construct scatter charts to determine whether any linear relationship appears to exist between (1) the wedding cost and attendance, (2) the wedding cost and the value rating, and (3) the couple's income and wedding cost, only for the weddings paid for by the bride and groom. Then find the best-fitting linear regression lines using the Excel Trendline tool for each of these charts.

5. Using the data in the Excel file Student Grades, construct a scatter chart for midterm versus final exam grades and add a linear trendline. What is the regression model? If a student scores 70 on the midterm, what would you predict her grade on the final exam to be?

6. Using the results of fitting the Home Market Value regression line in Example 8.4, compute the errors associated with each observation using formula (8.3) and construct a histogram.

7. Set up an Excel worksheet to apply formulas (8.5) and (8.6) to compute the values of b0 and b1 for the data in the Excel file Home Market Value and verify that you obtain the same values as in Examples 8.4 and 8.5.

8. The managing director of a consulting group has the following monthly data on total overhead costs and professional labor hours to bill to clients:⁴

   Overhead Costs    Billable Hours
   $365,000          3,000
   $400,000          4,000
   $430,000          5,000
   $477,000          6,000
   $560,000          7,000
   $587,000          8,000

   a. Develop a trendline to identify the relationship between billable hours and overhead costs.
   b. Interpret the coefficients of your regression model. Specifically, what does the fixed component of the model mean to the consulting firm?
   c. If a special job requiring 1,000 billable hours that would contribute a margin of $38,000 before overhead was available, would the job be attractive?

9. Using the Excel file Weddings, apply the Excel Regression tool using the wedding cost as the dependent variable and attendance as the independent variable.
   a. Interpret all key regression results, hypothesis tests, and confidence intervals in the output.
   b. Analyze the residuals to determine if the assumptions underlying the regression analysis are valid.
   c. Use the standard residuals to determine if any possible outliers exist.
   d. If a couple is planning a wedding for 175 guests, how much should they budget?

10. Using the Excel file Weddings, apply the Excel Regression tool using the wedding cost as the dependent variable and the couple's income as the independent variable, only for those weddings paid for by the bride and groom.
   a. Interpret all key regression results, hypothesis tests, and confidence intervals in the output.
   b. Analyze the residuals to determine if the assumptions underlying the regression analysis are valid.
   c. Use the standard residuals to determine if any possible outliers exist.
   d. If a couple makes $70,000 together, how much would they probably budget for the wedding?

11. Using the data in the Excel file Demographics, apply the Excel Regression tool using unemployment rate as the dependent variable and cost of living index as the independent variable.
   a. Interpret all key regression results, hypothesis tests, and confidence intervals in the output.
   b. Analyze the residuals to determine if the assumptions underlying the regression analysis are valid.
   c. Use the standard residuals to determine if any possible outliers exist.

12. Using the data in the Excel file Student Grades, apply the Excel Regression tool using the midterm grade as the independent variable and the final exam grade as the dependent variable.
   a. Interpret all key regression results, hypothesis tests, and confidence intervals in the output.
   b. Analyze the residuals to determine if the assumptions underlying the regression analysis are valid.
   c. Use the standard residuals to determine if any possible outliers exist.

13. The Excel file National Football League provides various data on professional football for one season.
   a. Construct a scatter diagram for Points/Game and Yards/Game in the Excel file. Does there appear to be a linear relationship?
   b. Develop a regression model for predicting Points/Game as a function of Yards/Game. Explain the statistical significance of the model.
   c. Draw conclusions about the validity of the regression analysis assumptions from the residual plot and standard residuals.

14. A deep-foundation engineering contractor has bid on a foundation system for a new building housing the world headquarters for a Fortune 500 company. A part of the project consists of installing 311 auger cast piles. The contractor was given bid information for cost-estimating purposes, which consisted of the estimated depth of each pile; however, actual drill footage of each pile could not be determined exactly until construction was performed. The Excel file Pile Foundation contains the estimates and actual pile lengths after the project was completed. Develop a linear regression model to estimate the actual pile length as a function of the estimated pile lengths. What do you conclude?

15. The Excel file Concert Sales provides data on sales dollars and the number of radio, TV, and newspaper ads promoting the concerts for a group of cities. Develop simple linear regression models for predicting sales as a function of the number of each type of ad. Compare these results to a multiple linear regression model using both independent variables. Examine the residuals of the best model for regression assumptions and possible outliers.

16. Using the data in the Excel file Home Market Value, develop a multiple linear regression model for estimating the market value as a function of both the age and size of the house. Predict the value of a house that is 30 years old and has 1,800 square feet, and one that is 5 years old and has 2,800 square feet.

17. The Excel file Cereal Data provides a variety of nutritional information about 67 cereals and their shelf location in a supermarket. Use regression analysis to find the best model that explains the relationship between calories and the other variables. Investigate the model assumptions and clearly explain your conclusions. Keep in mind the principle of parsimony!

18. The Excel file Salary Data provides information on current salary, beginning salary, previous experience (in months) when hired, and total years of education for a sample of 100 employees in a firm.
   a. Develop a multiple regression model for predicting current salary as a function of the other variables.
   b. Find the best model for predicting current salary using the t-value criterion.

19. The Excel file Credit Approval Decisions provides information on credit history for a sample of banking customers. Use regression analysis to identify the best model for predicting the credit score as a function of the other numerical variables. For the model you select, conduct further analysis to check for significance of the independent variables and for multicollinearity.

20. Using the data in the Excel file Freshman College Data, identify the best regression model for predicting the first year retention rate. For the model you select, conduct further analysis to check for significance of the independent variables and for multicollinearity.

21. The Excel file Major League Baseball provides data on the 2010 season.
   a. Construct and examine the correlation matrix. Is multicollinearity a potential problem?
   b. Suggest an appropriate set of independent variables that predict the number of wins by examining the correlation matrix.
   c. Find the best multiple regression model for predicting the number of wins. How good is your model? Does it use the same variables you thought were appropriate in part (b)?

22. The Excel file Golfing Statistics provides data for a portion of the 2010 professional season for the top 25 golfers.
   a. Find the best multiple regression model for predicting earnings/event as a function of the remaining variables.
   b. Find the best multiple regression model for predicting average score as a function of the other variables except earnings and events.

23. Use the p-value criterion to find a good model for predicting the number of points scored per game by football teams using the data in the Excel file National Football League.

24. The State of Ohio Department of Education has a mandated ninth-grade proficiency test that covers writing, reading, mathematics, citizenship (social studies), and science. The Excel file Ohio Education Performance provides data on success rates (defined as the percent of students passing) in school districts in the greater Cincinnati metropolitan area along with state averages.
   a. Suggest the best regression model to predict math success as a function of success in the other subjects by examining the correlation matrix; then run the regression tool for this set of variables.

   b. Develop a multiple regression model to predict math success as a function of success in all other subjects using the systematic approach described in this chapter. Is multicollinearity a problem?
   c. Compare the models in parts (a) and (b). Are they the same? Why or why not?

25. A national homebuilder builds single-family homes and condominium-style townhouses. The Excel file House Sales provides information on the selling price, lot cost, type of home, and region of the country (M = Midwest, S = South) for closings during one month.
   a. Develop a multiple regression model for sales price as a function of lot cost and type of home without any interaction term.
   b. Determine if an interaction exists between lot cost and type of home and find the best model. What is the predicted price for either a single-family home or a townhouse with a lot cost of $30,000?

26. For the House Sales data described in Problem 25, develop a regression model for selling price as a function of lot cost and region, incorporating an interaction term. What would be the predicted price for a home in either the South or the Midwest with a lot cost of $30,000? How do these predictions compare to the overall average price in each region?

27. For the Excel file Auto Survey,
   a. Find the best regression model to predict miles/gallon as a function of vehicle age and mileage.
   b. Using your result from part (a), add the categorical variable Purchased to the model. Does this change your result?
   c. Determine whether any significant interaction exists between the Vehicle Age and Purchased variables.

28. Cost functions are often nonlinear with volume because production facilities are often able to produce larger quantities at lower rates than smaller quantities.⁵ Using the following data, apply simple linear regression, and examine the residual plot. What do you conclude? Construct a scatter chart and use the Excel Trendline feature to identify the best type of curvilinear trendline that maximizes R².

   Units Produced    Costs
   500               $12,500
   1,000             $25,000
   1,500             $32,500
   2,000             $40,000
   2,500             $45,000
   3,000             $50,000

29. The Helicopter Division of Aerospatiale is studying assembly costs at its Marseilles plant.⁶ Past data indicate the following labor hours per helicopter:

   Helicopter Number    Labor Hours
   1                    2,000
   2                    1,400
   3                    1,238
   4                    1,142
   5                    1,075
   6                    1,029
   7                    985
   8                    957

   Using these data, apply simple linear regression, and examine the residual plot. What do you conclude? Construct a scatter chart and use the Excel Trendline feature to identify the best type of curvilinear trendline that maximizes R².

30. For the Excel file Cereal Data, use XLMiner and best subsets with backward selection to find the best model.

31. Use XLMiner and best subsets with stepwise selection to find the best model for predicting points per game for the National Football League data (see Problem 23).

⁴Modified from Charles T. Horngren, George Foster, and Srikant M. Datar, Cost Accounting: A Managerial Emphasis, 9th ed. (Englewood Cliffs, NJ: Prentice Hall, 1997): 371.
⁵Horngren, Foster, and Datar, Cost Accounting: A Managerial Emphasis, 9th ed.: 349.
⁶Horngren, Foster, and Datar, Cost Accounting: A Managerial Emphasis, 9th ed.: 349.

Case: Performance Lawn Equipment

In reviewing the PLE data, Elizabeth Burke noticed that defects received from suppliers have decreased (worksheet Defects After Delivery). Upon investigation, she learned that in 2010, PLE experienced some quality problems due to an increasing number of defects in materials received from suppliers. The company instituted an initiative in August 2011 to work with suppliers to reduce these defects, to more closely coordinate deliveries, and to improve materials quality through reengineering supplier production policies. Elizabeth noted that the program appeared to reverse an increasing trend in defects; she would like to predict what might have happened had the supplier initiative not been implemented and how the number of defects might further be reduced in the near future.

In meeting with PLE's human resources director, Elizabeth also discovered a concern about the high rate of turnover in its field service staff. Senior managers have suggested that the department look closer at its recruiting policies, particularly to try to identify the characteristics of individuals that lead to greater retention. However, in a recent staff meeting, HR managers could not agree on these characteristics. Some argued that years of education and grade point averages were good predictors. Others argued that hiring more mature applicants would lead to greater retention. To study these factors, the staff agreed to conduct a statistical study to determine the effect that years of education, college grade point average, and age when hired have on retention. A sample of 40 field service engineers hired 10 years ago was selected to determine the influence of these variables on how long each individual stayed with the company. Data are compiled in the Employee Retention worksheet.

Finally, as part of its efforts to remain competitive, PLE tries to keep up with the latest in production technology. This is especially important in the highly competitive lawn-mower line, where competitors can gain a real advantage if they develop more cost-effective means of production. The lawn-mower division therefore spends a great deal of effort in testing new technology. When new production technology is introduced, firms often experience learning, resulting in a gradual decrease in the time required to produce successive units. Generally, the rate of improvement declines until the production time levels off. One example is the production of a new design for lawn-mower engines. To determine the time required to produce these engines, PLE produced 50 units on its production line; test results are given on the worksheet Engines in the database. Because PLE is continually developing new technology, understanding the rate of learning can be useful in estimating future production costs without having to run extensive prototype trials, and Elizabeth would like a better handle on this.

Use techniques of regression analysis to assist her in evaluating the data in these three worksheets and reaching useful conclusions. Summarize your work in a formal report with all appropriate results and analyses.
