Business Analytics, 2nd Edition
Chapter 8
Trendlines and Regression Analysis
Learning Objectives
After studying this chapter, you will be able to:
• Explain the purpose of regression analysis and provide examples in business.
• Use a scatter chart to identify the type of relationship between two variables.
• List the common types of mathematical functions used in predictive modeling.
• Use the Excel Trendline tool to fit models to data.
• Explain how least-squares regression finds the best-fitting regression model.
• Use Excel functions to find least-squares regression coefficients.
• Use the Excel Regression tool for both single and multiple linear regressions.
• Interpret the regression statistics of the Excel Regression tool.
• Interpret significance of regression from the Excel Regression tool output.
• Draw conclusions for tests of hypotheses about regression coefficients.
• Interpret confidence intervals for regression coefficients.
• Calculate standard residuals.
• List the assumptions of regression analysis and describe methods to verify them.
• Explain the differences in the Excel Regression tool output for simple and multiple linear regression models.
• Apply a systematic approach to build good regression models.
• Explain the importance of understanding multicollinearity in regression models.
• Build regression models for categorical data using dummy variables.
• Test for interactions in regression models with categorical variables.
• Identify when curvilinear regression models are more appropriate than linear models.
234 Chapter 8 Trendlines and Regression Analysis
Understanding both the mathematics and the descriptive properties of different functional
relationships is important in building predictive analytical models. We often begin by
creating a chart of the data to understand it and choose the appropriate type of functional
relationship to incorporate into an analytical model. For cross-sectional data, we use a
scatter chart; for time hyphenate as adjective for data series data we use a line chart.
Common types of mathematical functions used in predictive analytical models include linear, logarithmic, polynomial, power, and exponential functions.
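Several of these functional forms (for example, linear, exponential, and power) can be fit by ordinary least squares, often after a logarithmic transformation. A minimal NumPy sketch with illustrative data (not from the text):

```python
import numpy as np

# Illustrative data with a roughly exponential shape (not from the text)
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.1, 4.3, 8.4, 16.1, 32.5])

def r_squared(y, y_hat):
    ss_res = np.sum((y - y_hat) ** 2)
    ss_tot = np.sum((y - np.mean(y)) ** 2)
    return 1 - ss_res / ss_tot

# Linear: y = b0 + b1*x
b1, b0 = np.polyfit(x, y, 1)
r2_linear = r_squared(y, b0 + b1 * x)

# Exponential: y = a*exp(b*x), fit as ln y = ln a + b*x (requires y > 0)
b_exp, ln_a = np.polyfit(x, np.log(y), 1)
r2_exp = r_squared(y, np.exp(ln_a) * np.exp(b_exp * x))

# Power: y = a*x^b, fit as ln y = ln a + b*ln x (requires x, y > 0)
b_pow, ln_a_pow = np.polyfit(np.log(x), np.log(y), 1)
r2_pow = r_squared(y, np.exp(ln_a_pow) * x ** b_pow)
```

For this sample, the exponential form fits best, which a scatter chart of the data would also suggest.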
The Excel Trendline tool provides a convenient method for determining the best-fitting
functional relationship among these alternatives for a set of data. First, click the chart to
which you wish to add a trendline; this will display the Chart Tools menu. Select the Chart
Tools Design tab, and then click Add Chart Element from the Chart Layouts group. From
the Trendline submenu, you can select one of the options (Linear is the most common) or
More Trendline Options…. If you select More Trendline Options, you will get the Format Trendline pane in the worksheet (see Figure 8.1). A simpler way of doing all this is to right-click on the data series in the chart and choose Add Trendline from the pop-up menu—try it! Select the radio button for the type of functional relationship you wish to fit to the data.
Check the boxes for Display Equation on chart and Display R-squared value on chart. You
may then close the Format Trendline pane. Excel will display the results on the chart you
have selected; you may move the equation and R-squared value for better readability by
dragging them to a different location. To clear a trendline, right-click on it and select Delete.
R2 (R-squared) is a measure of the “fit” of the line to the data. The value of R2 will be between 0 and 1. The larger the value of R2, the better the fit. We will discuss this further in the context of regression analysis.
Trendlines can be used to model relationships between variables and understand
how the dependent variable behaves as the independent variable changes. For example,
the demand-prediction models that we introduced in Chapter 1 (Examples 1.9 and 1.10)
would generally be developed by analyzing data.
Figure 8.1
Trendlines are also used extensively in modeling trends over time—that is, when the
variable x in the functional relationships represents time. For example, an analyst for an
airline needs to predict where fuel prices are going, and an investment analyst would want
to predict the price of stocks or key economic indicators.
Figure 8.2
Figure 8.3
Be cautious when using polynomial functions. The R2 value will continue to increase
as the order of the polynomial increases; that is, a third-order polynomial will provide
a better fit than a second-order polynomial, and so on. Higher-order polynomials will
generally not be very smooth and will be difficult to interpret visually. Thus, we don’t
recommend going beyond a third-order polynomial when fitting data. Use your eye to
make a good judgment!
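The monotone improvement in R2 with polynomial order is easy to demonstrate; a sketch with made-up data (not the oil-price series from the figures):

```python
import numpy as np

rng = np.random.default_rng(1)
x = np.linspace(0, 10, 25)
y = 3.0 + 2.0 * x + rng.normal(scale=4.0, size=x.size)  # truly linear plus noise

def poly_r2(x, y, order):
    coeffs = np.polyfit(x, y, order)
    y_hat = np.polyval(coeffs, x)
    return 1 - np.sum((y - y_hat) ** 2) / np.sum((y - y.mean()) ** 2)

# R-squared never decreases as the polynomial order rises, even though the
# underlying relationship is linear; higher orders just chase the noise
r2 = [poly_r2(x, y, k) for k in (1, 2, 3, 4)]
```

This is exactly why a rising R2 alone does not justify a higher-order polynomial.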
Of course, the proper model to use depends on the scope of the data. As the chart
shows, crude oil prices were relatively stable until early 2007 and then began to increase
rapidly. By including the early data, the long-term functional relationship might not ad-
equately express the short-term trend. For example, fitting a model to only the data begin-
ning with January 2007 yields these models:
Figure 8.4
The difference in prediction can be significant. For example, predicting the price 6 months after the last data point (x = 36) yields $172.24 for the third-order polynomial fit with all the data and $246.45 for the exponential model with only the recent data. Thus, the analyst must be careful to select the proper amount of data for the analysis. The question then becomes one of choosing the best assumptions for the
model. Is it reasonable to assume that prices would increase exponentially or perhaps
at a slower rate, such as with the linear model fit? Or, would they level off and start
falling? Clearly, factors other than historical trends would enter into this choice. As
we now know, oil prices plunged in the latter half of 2008; thus, all predictive models
are risky.
Regression analysis is a tool for building mathematical and statistical models that char-
acterize relationships between a dependent variable (which must be a ratio variable and
not categorical) and one or more independent, or explanatory, variables, all of which are
numerical (but may be either ratio or categorical).
Two broad categories of regression models are used often in business settings:
(1) regression models of cross-sectional data and (2) regression models of time-series
data, in which the independent variables are time or some function of time and the focus is
on predicting the future. Time-series regression is an important tool in forecasting, which
is the subject of Chapter 9.
A regression model that involves a single independent variable is called simple
regression. A regression model that involves two or more independent variables is called
multiple regression. In the remainder of this chapter, we describe how to develop and ana-
lyze both simple and multiple regression models.
Simple linear regression involves finding a linear relationship between one indepen-
dent variable, X, and one dependent variable, Y. The relationship between two variables
can assume many forms, as illustrated in Figure 8.5. The relationship may be linear or
nonlinear, or there may be no relationship at all. Because we are focusing our discussion
on linear regression models, the first thing to do is to verify that the relationship is linear,
as in Figure 8.5(a). We would not expect to see the data line up perfectly along a straight
line; we simply want to verify that the general relationship is linear. If the relationship is
clearly nonlinear, as in Figure 8.5(b), then alternative approaches must be used, and if no
relationship is evident, as in Figure 8.5(c), then it is pointless to even consider developing
a linear regression model.
To determine if a linear relationship exists between the variables, we recommend that
you create a scatter chart that can show the relationship between variables visually.
Figure 8.5  Examples of Variable Relationships: (a) Linear, (b) Nonlinear, (c) No Relationship
Figure 8.6
Figure 8.7
Figure 8.8
might fall on the line itself. Figure 8.8 shows two possible straight lines that pass through
the data. Clearly, you would choose A as the better-fitting line over B because all the
points are closer to the line and the line appears to be in the middle of the data. The only
difference between the lines is the value of the slope and intercept; thus, we seek to deter-
mine the values of the slope and intercept that provide the best-fitting line.
Figure 8.9
We can find the best-fitting line using the Excel Trendline tool (with the linear option
chosen), as described earlier in this chapter.
Least-Squares Regression
The mathematical basis for the best-fitting regression line is called least-squares
regression. In regression analysis, we assume that the values of the dependent variable,
Y, in the sample data are drawn from some unknown population for each value of the
independent variable, X. For example, in the Home Market Value data, the first and fourth
observations come from a population of homes having 1,812 square feet; the second
observation comes from a population of homes having 1,914 square feet; and so on.
Because we are assuming that a linear relationship exists, the expected value of Y is β0 + β1X for each value of X. The coefficients β0 and β1 are population parameters that represent the intercept and slope, respectively, of the population from which a sample of observations is taken. The intercept is the mean value of Y when X = 0, and the slope is the change in the mean value of Y as X changes by one unit.
Thus, for a specific value of X, we have many possible values of Y that vary around the mean. To account for this, we add an error term, ε (the Greek letter epsilon), to the mean. This defines a simple linear regression model:

Y = β0 + β1X + ε    (8.1)
However, because we don’t know the entire population, we don’t know the true values of β0 and β1. In practice, we must estimate these as best we can from the sample data. Define b0 and b1 to be estimates of β0 and β1. Thus, the estimated simple linear regression equation is

Ŷ = b0 + b1X    (8.2)

Let Xi be the value of the independent variable of the ith observation. When the value of the independent variable is Xi, then Ŷi = b0 + b1Xi is the estimated value of Y for Xi.
One way to quantify the relationship between each point and the estimated regression
equation is to measure the vertical distance between them, as illustrated in Figure 8.10. We
Figure 8.10  Measuring the Errors in a Regression Model: errors (e1, e2) associated with individual observations
can think of these differences, ei, as the observed errors (often called residuals) associated with estimating the value of the dependent variable using the regression line. Thus, the error associated with the ith observation is:

ei = Yi − Ŷi    (8.3)
The best-fitting line should minimize some measure of these errors. Because some
errors will be negative and others positive, we might take their absolute value or simply
square them. Mathematically, it is easier to work with the squares of the errors.
Adding the squares of the errors, we obtain the following function:

Σi=1..n ei² = Σi=1..n (Yi − Ŷi)² = Σi=1..n [Yi − (b0 + b1Xi)]²    (8.4)
If we can find the best values of the slope and intercept that minimize the sum of squares
(hence the name “least squares”) of the observed errors ei, we will have found the best-
fitting regression line. Note that Xi and Yi are the values of the sample data and that b 0
and b 1 are unknowns in equation (8.4). Using calculus, we can show that the solution that
minimizes the sum of squares of the observed errors is
b1 = [ Σi=1..n XiYi − n X̄ Ȳ ] / [ Σi=1..n Xi² − n X̄² ]    (8.5)

b0 = Ȳ − b1 X̄    (8.6)
Although the calculations for the least-squares coefficients appear to be somewhat
complicated, they can easily be performed on an Excel spreadsheet. Even better, Excel has
built-in capabilities for doing this. For example, you may use the functions INTERCEPT(known_y’s, known_x’s) and SLOPE(known_y’s, known_x’s) to find the least-squares coefficients b0 and b1.
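Equations (8.5) and (8.6) are straightforward to compute directly; a sketch with illustrative data, checked against NumPy’s own least-squares fit (Excel’s SLOPE and INTERCEPT return the same values):

```python
import numpy as np

# Illustrative sample data (any x-y sample works)
x = np.array([2.0, 4.0, 6.0, 8.0, 10.0])
y = np.array([3.1, 5.2, 6.8, 9.3, 10.9])
n = x.size

# Equation (8.5): slope
b1 = (np.sum(x * y) - n * x.mean() * y.mean()) / (np.sum(x ** 2) - n * x.mean() ** 2)
# Equation (8.6): intercept
b0 = y.mean() - b1 * x.mean()

# np.polyfit solves the same least-squares problem
slope_np, intercept_np = np.polyfit(x, y, 1)
```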
We could stop at this point, because we have found the best-fitting line for the ob-
served data. However, there is a lot more to regression analysis from a statistical perspec-
tive, because we are working with sample data—and usually rather small samples—which
we know have a lot of variation as compared with the full population. Therefore, it is im-
portant to understand some of the statistical properties associated with regression analysis.
Figure 8.11
Figure 8.12
In the Regression Statistics section, Multiple R is another name for the sample corre-
lation coefficient, r, which was introduced in Chapter 4. Values of r range from -1 to 1,
where the sign is determined by the sign of the slope of the regression line. A Multiple R
value greater than 0 indicates positive correlation; that is, as the independent vari-
able increases, the dependent variable does also; a value less than 0 indicates negative
correlation—as X increases, Y decreases. A value of 0 indicates no correlation.
R-squared (R2) is called the coefficient of determination. Earlier we noted that R2 is a measure of how well the regression line fits the data; this value is also provided
by the Trendline tool. Specifically, R2 gives the proportion of variation in the dependent
variable that is explained by the independent variable of the regression model. The value
of R2 is between 0 and 1. A value of 1.0 indicates a perfect fit, and all data points lie on
the regression line, whereas a value of 0 indicates that no relationship exists. Although we
would like high values of R2, it is difficult to specify a “good” value that signifies a strong
relationship because this depends on the application. For example, in scientific applica-
tions such as calibrating physical measurement equipment, R2 values close to 1 would
be expected; in marketing research studies, an R2 of 0.6 or more is considered very good;
however, in many social science applications, values in the neighborhood of 0.3 might be
considered acceptable.
Adjusted R Square is a statistic that modifies the value of R2 by incorporating the
sample size and the number of explanatory variables in the model. Although it does not
give the actual percent of variation explained by the model as R2 does, it is useful when
comparing this model with other models that include additional explanatory variables. We
discuss it more fully in the context of multiple linear regression later in this chapter.
Standard Error in the Excel output is the variability of the observed Y-values from the predicted values (Ŷ). This is formally called the standard error of the estimate, SYX.
If the data are clustered close to the regression line, then the standard error will be small;
the more scattered the data are, the larger the standard error.
were not included in the model. The standard error of the estimate is $7,287.72. If we compare this to the standard deviation of the market value, which is $10,553, we see that the variation around the regression line ($7,287.72) is less than the variation around the sample mean ($10,553). This is because the independent variable in the regression model explains some of the variation.
H0: β1 = 0
H1: β1 ≠ 0    (8.7)
If we reject the null hypothesis, then we may conclude that the slope of the independent vari-
able is not zero and, therefore, is statistically significant in the sense that it explains some of the
variation of the dependent variable around the mean. Similar to our discussion in Chapter 7,
you needn’t worry about the mathematical details of how F is computed, or even its value,
especially since the tool does not provide the critical value for the test. What is important is
the value of Significance F, which is the p-value for the F-test. If Significance F is less than
the level of significance (typically 0.05), we would reject the null hypothesis.
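With sample data, the slope test in (8.7) can be sketched in SciPy, which reports the two-sided p-value for H0: β1 = 0 (for simple regression this matches Significance F, since F = t²). Illustrative data:

```python
import numpy as np
from scipy import stats

# Illustrative sample with a strong linear trend
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0])
y = np.array([2.3, 2.9, 4.1, 4.8, 6.2, 6.8, 8.1, 8.9])

result = stats.linregress(x, y)

# Two-sided p-value for H0: beta1 = 0; reject at the 0.05 level if below 0.05
significant = result.pvalue < 0.05
```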
H0: β1 = B1
H1: β1 ≠ B1

we need only check whether B1 falls within the confidence interval for the slope. If it does not, then we reject the null hypothesis; otherwise, we fail to reject it.
Recall that residuals are the observed errors, which are the differences between the actual
values and the estimated values of the dependent variable using the regression equation.
Figure 8.13 shows a portion of the residual table generated by the Excel Regression tool.
The residual output includes, for each observation, the predicted value using the estimated
regression equation, the residual, and the standard residual. The residual is simply the difference between the actual value of the dependent variable and the predicted value, or Yi − Ŷi. Figure 8.14 shows the residual plot generated by the Excel tool. This chart is actually a scatter chart of the residuals with the values of the independent variable on the x-axis.
Figure 8.13
Figure 8.14
Standard residuals are residuals divided by their standard deviation. Standard re-
siduals describe how far each residual is from its mean in units of standard deviations
(similar to a z-value for a standard normal distribution). Standard residuals are useful in
checking assumptions underlying regression analysis, which we will address shortly, and
to detect outliers that may bias the results. Recall that an outlier is an extreme value that
is different from the rest of the data. A single outlier can make a significant difference in
the regression equation, changing the slope and intercept and, hence, how they would be
interpreted and used in practice. Some consider a standardized residual outside of ±2 standard deviations as an outlier. A more conservative rule of thumb would be to consider outliers outside of a ±3 standard deviation range. (Commercial software packages have
more sophisticated techniques for identifying outliers.)
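Standardized residuals and the ±2/±3 rules of thumb can be sketched as follows (illustrative data with one planted outlier):

```python
import numpy as np

# Illustrative data following y ~ 2x, with one planted outlier at x = 7
x = np.arange(1.0, 11.0)
y = np.array([2.1, 4.0, 6.2, 7.9, 10.1, 12.0, 30.0, 16.1, 18.0, 19.9])

b1, b0 = np.polyfit(x, y, 1)
residuals = y - (b0 + b1 * x)

# One common scaling: divide residuals by their sample standard deviation
# (software packages differ slightly in how they standardize)
std_resid = residuals / residuals.std(ddof=1)

outliers_2sd = np.where(np.abs(std_resid) > 2)[0]  # liberal rule of thumb
outliers_3sd = np.where(np.abs(std_resid) > 3)[0]  # conservative rule of thumb
```

Here only the planted point exceeds ±2 standard deviations; the ±3 rule flags nothing.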
Checking Assumptions
The statistical hypothesis tests associated with regression analysis are predicated on some
key assumptions about the data.
The Durbin–Watson statistic is

D = [ Σi=2..n (ei − e(i−1))² ] / [ Σi=1..n ei² ]    (8.9)
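Equation (8.9) can be computed directly from the residuals; values of D near 2 suggest no autocorrelation, while values near 0 or 4 suggest positive or negative autocorrelation, respectively. A sketch with illustrative residuals:

```python
import numpy as np

# Residuals from some fitted model (illustrative values, alternating in sign)
e = np.array([1.2, -0.8, 0.5, -1.1, 0.9, -0.4, 0.7, -0.6])

# Equation (8.9): successive squared differences over total squared residuals
D = np.sum(np.diff(e) ** 2) / np.sum(e ** 2)
# The strongly alternating residuals here push D above 2,
# hinting at negative autocorrelation
```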
Figure 8.15  Histogram of Standard Residuals
When assumptions of regression are violated, then statistical inferences drawn from
the hypothesis tests may not be valid. Thus, before drawing inferences about regression
models and performing hypothesis tests, these assumptions should be checked. However,
other than linearity, these assumptions are not needed solely for model fitting and estima-
tion purposes.
Figure 8.16
propose that schools with students who have higher SAT scores, a lower acceptance rate,
a larger budget, and a higher percentage of students in the top 10% of their high school
classes will tend to retain and graduate more students.
A linear regression model with more than one independent variable is called a mul-
tiple linear regression model. Simple linear regression is just a special case of multiple
linear regression. A multiple linear regression model has the form:

Y = β0 + β1X1 + β2X2 + ⋯ + βkXk + ε

where Y is the dependent variable; X1, X2, …, Xk are the independent variables; β0 is the intercept; β1, β2, …, βk are the regression coefficients; and ε is the error term. The estimated model, Ŷ = b0 + b1X1 + b2X2 + ⋯ + bkXk, is used
to predict the value of the dependent variable. The partial regression coefficients repre-
sent the expected change in the dependent variable when the associated independent vari-
able is increased by one unit while the values of all other independent variables are held
constant.
For the college and university data, the proposed model would be

Graduation rate = β0 + β1(SAT) + β2(Acceptance rate) + β3(Expenditures per student) + β4(Top 10% HS) + ε

Thus, b2 would represent an estimate of the change in the graduation rate for a unit increase in the acceptance rate while holding all other variables constant.
As with simple linear regression, multiple linear regression uses least squares to es-
timate the intercept and slope coefficients that minimize the sum of squared error terms
over all observations. The principal assumptions discussed for simple linear regression
also hold here. The Excel Regression tool can easily perform multiple linear regression;
you need to specify only the full range for the independent variable data in the dialog. One
caution when using the tool: the independent variables in the spreadsheet must be in con-
tiguous columns. So, you may have to manually move the columns of data around before
applying the tool.
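Outside Excel, the same least-squares estimates can be sketched with NumPy by adding a column of ones for the intercept (illustrative data, not the college dataset):

```python
import numpy as np

# Illustrative data: two independent variables, one dependent variable
X1 = np.array([1.0, 2, 3, 4, 5, 6])
X2 = np.array([2.0, 1, 4, 3, 6, 5])
y = 1.0 + 2.0 * X1 + 3.0 * X2  # exact relationship, so the fit recovers it

# Design matrix: a column of ones for the intercept, then the X columns
A = np.column_stack([np.ones(X1.size), X1, X2])
coeffs, *_ = np.linalg.lstsq(A, y, rcond=None)
b0, b1, b2 = coeffs  # intercept and partial regression coefficients
```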
The results from the Regression tool are in the same format as we saw for simple
linear regression. However, some key differences exist. Multiple R and R Square (or R2)
are called the multiple correlation coefficient and the coefficient of multiple determi-
nation, respectively, in the context of multiple regression. They indicate the strength of
association between the dependent and independent variables. Similar to simple linear
regression, R2 explains the percentage of variation in the dependent variable that is ex-
plained by the set of independent variables in the model.
The interpretation of the ANOVA section is quite different from that in simple lin-
ear regression. For multiple linear regression, ANOVA tests for significance of the entire
model. That is, it computes an F-statistic for testing the hypotheses
H0: β1 = β2 = ⋯ = βk = 0
H1: at least one βj is not 0
The null hypothesis states that no linear relationship exists between the dependent and any
of the independent variables, whereas the alternative hypothesis states that the dependent
variable has a linear relationship with at least one independent variable. If the null hy-
pothesis is rejected, we cannot conclude that a relationship exists with every independent
variable individually.
The multiple linear regression output also provides information to test hypothe-
ses about each of the individual regression coefficients. Specifically, we may test the
null hypothesis that b0 (the intercept) or any bi equals zero. If we reject the null hy-
pothesis that the slope associated with independent variable i is zero, H0: bi = 0, then
we may state that independent variable i is significant in the regression model; that
is, it contributes to reducing the variation in the dependent variable and improves the
ability of the model to better predict the dependent variable. However, if we cannot
reject H0, then that independent variable is not significant and probably should not be
included in the model. We see how to use this information to identify the best model
in the next section.
Finally, for multiple regression models, a residual plot is generated for each indepen-
dent variable. This allows you to assess the linearity and homoscedasticity assumptions of
regression.
Figure 8.17  Multiple Regression Results for Colleges and Universities Data
Figure 8.18
From the ANOVA section, we may test for significance of regression. At a 5% significance level, we reject the null hypothesis because Significance F is essentially zero. Therefore, we may conclude that at least one slope is statistically different from zero.
Looking at the p-values for the independent variables in the last section, we see that all are less than 0.05; therefore, we reject the null hypothesis that each partial regression coefficient is zero and conclude that each of them is statistically significant.
Figure 8.18 shows one of the residual plots from the Excel output. The assumptions appear to be met, and the other residual plots (not shown) also validate these assumptions. The normal probability plot (also not shown) does not suggest any serious departures from normality.
Analytics in Practice: Using Linear Regression and Interactive Risk Simulators to Predict Performance at ARAMARK3
ARAMARK is a leader in professional services, providing award-winning food services, facilities management, and uniform and career apparel to health care institutions, universities and school districts, stadiums and arenas, and businesses around the world. Headquartered in Philadelphia, ARAMARK has approximately 255,000 employees serving clients in 22 countries.

ARAMARK’s Global Risk Management Department (GRM) needed a way to determine the statistical relationships between key business metrics (e.g., employee tenure, employee engagement, a trained workforce, account tenure, service offerings) and risk metrics (e.g., OSHA rate, workers’ compensation rate, customer injuries) to understand the impact of these risks on the business. GRM also needed a simple tool that field operators and the risk management team could use to predict the impact of business decisions on risk metrics before those decisions were implemented. Typical questions they would want to ask were, What would happen to our OSHA rate if …

… for use by their clients. They developed “Interactive Risk Simulators,” which are simple online tools that allowed users to manipulate the values of the independent variables in the regression models using interactive sliders that correspond to the business metrics and instantaneously view the values of the dependent variables (the risk metrics) on gauges similar to those found on the dashboard of a car.

Figure 8.19 illustrates the structure of the simulators. The gauges are updated instantly as the user adjusts the sliders, showing how changes in the business environment affect the risk metrics. This visual representation made the models easy to use and understand, particularly for nontechnical employees.
3The author expresses his appreciation to John Toczek, Manager of Decision Support and Analytics at
ARAMARK Corporation.
254 Chapter 8 Trendlines and Regression Analysis
Figure 8.19  Inputs (Independent Variables) → Regression Models → Outputs (Dependent Variables)
In the colleges and universities regression example, all the independent variables were
found to be significant by evaluating the p-values of the regression analysis. This will not
always be the case and leads to the question of how to build good regression models that
include the “best” set of variables.
Figure 8.20 shows a portion of the Excel file Banking Data, which provides data
acquired from banking and census records for different zip codes in the bank’s current
market. Such information can be useful in targeting advertising for new customers or
for choosing locations for branch offices. The data show the median age of the popula-
tion, median years of education, median income, median home value, median household
wealth, and average bank balance.
Figure 8.21 shows the results of regression analysis used to predict the average bank
balance as a function of the other variables. Although the independent variables explain
more than 94% of the variation in the average bank balance, you can see that at a 0.05
significance level, the p-values indicate that both Education and Home Value do not ap-
pear to be significant. A good regression model should include only significant indepen-
dent variables. However, it is not always clear exactly what will happen when we add or
remove variables from a model; variables that are (or are not) significant in one model
may (or may not) be significant in another. Therefore, you should not consider dropping
all insignificant variables at one time, but rather take a more structured approach.
Adding an independent variable to a regression model will always result in R2 equal
to or greater than the R2 of the original model. This is true even when the new independent
Figure 8.20
Figure 8.21
variable has little true relationship with the dependent variable. Thus, trying to maximize
R2 is not a useful criterion. A better way of evaluating the relative fit of different models is
to use adjusted R2. Adjusted R2 reflects both the number of independent variables and the
sample size and may either increase or decrease when an independent variable is added
or dropped, thus providing an indication of the value of adding or removing independent
variables in the model. An increase in adjusted R2 indicates that the model has improved.
This suggests a systematic approach to building good regression models:
1. Construct a model with all available independent variables. Check for signifi-
cance of the independent variables by examining the p-values.
2. Identify the independent variable having the largest p-value that exceeds the
chosen level of significance.
3. Remove the variable identified in step 2 from the model and evaluate adjusted
R2. (Don’t remove all variables with p-values that exceed α at the same time,
but remove only one at a time.)
4. Continue until all variables are significant.
In essence, this approach seeks to find a significant model that has the highest adjusted R2.
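The drop-one-variable step can be sketched by comparing adjusted R2 with and without a candidate variable (illustrative simulated data; a full implementation would also examine p-values):

```python
import numpy as np

def adjusted_r2(X, y):
    """Least-squares fit with intercept; returns adjusted R-squared."""
    n, k = X.shape
    A = np.column_stack([np.ones(n), X])
    coeffs, *_ = np.linalg.lstsq(A, y, rcond=None)
    r2 = 1 - np.sum((y - A @ coeffs) ** 2) / np.sum((y - y.mean()) ** 2)
    return 1 - (1 - r2) * (n - 1) / (n - k - 1)

rng = np.random.default_rng(7)
n = 40
x1 = rng.normal(size=n)
x2 = rng.normal(size=n)  # unrelated to y: a candidate for removal
y = 5 + 3 * x1 + rng.normal(scale=0.5, size=n)

adj_full = adjusted_r2(np.column_stack([x1, x2]), y)
adj_reduced = adjusted_r2(x1.reshape(-1, 1), y)
# Dropping the irrelevant variable leaves adjusted R-squared essentially
# unchanged (and often slightly higher), supporting its removal
```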
Figure 8.22
Figure 8.23
Figure 8.24
Figure 8.25
Some data of interest in a regression study may be ordinal or nominal. This is common when
including demographic data in marketing studies, for example. Because regression analysis
requires numerical data, we could include categorical variables by coding the variables. For
example, if one variable represents whether an individual is a college graduate or not, we
might code No as 0 and Yes as 1. Such variables are often called dummy variables.
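Dummy coding can be sketched directly (hypothetical data; the multi-category case anticipates the cutting-tool example later in the chapter):

```python
import numpy as np

# Hypothetical categorical data: college graduate or not
graduate = np.array(["Yes", "No", "Yes", "Yes", "No"])
grad_dummy = (graduate == "Yes").astype(int)  # No -> 0, Yes -> 1

# A variable with k > 2 categories needs k - 1 dummies (one category,
# here "A", serves as the baseline)
tool = np.array(["A", "B", "C", "D", "B"])
dummies = np.column_stack([(tool == level).astype(int) for level in ("B", "C", "D")])
```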
Figure 8.26
Figure 8.27
An interaction occurs when the effect of one variable (i.e., the slope) is dependent on
another variable. We can test for interactions by defining a new variable as the product of
the two variables, X3 = X1 * X2, and testing whether this variable is significant, leading
to an alternative model.
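Constructing the interaction term is a one-line product; whether X3 belongs in the model is then judged by its estimated coefficient and p-value. A sketch with simulated data that has a built-in interaction:

```python
import numpy as np

rng = np.random.default_rng(3)
n = 60
x1 = rng.normal(size=n)
x2 = rng.integers(0, 2, size=n).astype(float)  # e.g., a dummy variable

# Simulated data in which the slope of x1 truly differs between groups
y = 1 + 2 * x1 + 0.5 * x2 + 4 * x1 * x2 + rng.normal(scale=0.3, size=n)

x3 = x1 * x2  # interaction term X3 = X1 * X2
A = np.column_stack([np.ones(n), x1, x2, x3])
coeffs, *_ = np.linalg.lstsq(A, y, rcond=None)
# coeffs[3] estimates the interaction effect (near 4 for this simulation)
```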
Figure 8.28  Portion of Employee Salaries Modified for Interaction Term
Figure 8.29
Figure 8.30
X3 = 1 if tool type is C and 0 if not
X4 = 1 if tool type is D and 0 if not

Tool A: surface finish = 24.49 + 0.098 RPM − 13.31(0) − 20.49(0) − 26.04(0) = 24.49 + 0.098 RPM
Tool B: surface finish = 24.49 + 0.098 RPM − 13.31(1) − 20.49(0) − 26.04(0) = 11.18 + 0.098 RPM
Tool C: surface finish = 24.49 + 0.098 RPM − 13.31(0) − 20.49(1) − 26.04(0) = 4.00 + 0.098 RPM
Tool D: surface finish = 24.49 + 0.098 RPM − 13.31(0) − 20.49(0) − 26.04(1) = −1.55 + 0.098 RPM

Note that the only differences among these models are the intercepts; the slopes associated with RPM are the same. This suggests that we might wish to test for interactions between the type of cutting tool and RPM; we leave this to you as an exercise.
Figure 8.31
Figure 8.32
Figure 8.33
Linear regression models are not appropriate for every situation. A scatter chart of the
data might show a nonlinear relationship, or the residuals for a linear fit might result in a
nonlinear pattern. In such cases, we might propose a nonlinear model to explain the relationship. For instance, a second-order polynomial model would be

Y = β0 + β1X + β2X² + ε

Sometimes, this is called a curvilinear regression model. In this model, β1 represents the linear effect of X on Y, and β2 represents the curvilinear effect. However, although this
model appears to be quite different from ordinary linear regression models, it is still linear
in the parameters (the betas, which are the unknowns that we are trying to estimate). In
other words, all terms are a product of a beta coefficient and some function of the data,
which are simply numerical values. In such cases, we can still apply least squares to esti-
mate the regression coefficients.
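Because the model is linear in the betas, the same least-squares machinery applies with X and X² as columns of the design matrix (illustrative data):

```python
import numpy as np

# Illustrative data generated from an exact quadratic
x = np.array([1.0, 2, 3, 4, 5, 6, 7, 8])
y = 2 + 1.5 * x - 0.25 * x ** 2

# Design matrix with columns 1, X, X^2: still linear in the unknown betas
A = np.column_stack([np.ones(x.size), x, x ** 2])
b0, b1, b2 = np.linalg.lstsq(A, y, rcond=None)[0]
# b1 is the linear effect, b2 the curvilinear effect
```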
Curvilinear regression models are also often used in forecasting when the indepen-
dent variable is time. This and other applications of regression in forecasting are discussed
in the next chapter.
Figure 8.34
Figure 8.35
Figure 8.36  Curvilinear Regression Results for Beverage Sales
XLMiner is an Excel add-in for data mining that accompanies Analytic Solver Platform. Data mining is the subject of Chapter 10 and includes a wide variety of statistical procedures for exploring data, including regression analysis. The regression analysis tool in XLMiner has some advanced options not available in Excel's Regression tool, which we discuss in this section.
Best-subsets regression evaluates either all possible regression models for a set of independent variables or the best subsets of models for a fixed number of independent variables. It helps you to find the best model based on the adjusted R². Best-subsets regression evaluates models using a statistic called Cp, known as Mallows's Cp. Cp estimates the bias introduced in the estimates of the responses by having an underspecified model (a model with important predictors missing). If Cp is much greater than k + 1 (the number of independent variables plus 1), there is substantial bias. The full model always has Cp = k + 1. If all models except the full model have large values of Cp, it suggests that important predictor variables are missing. Models with a minimum value of Cp, or with Cp less than or at least close to k + 1, are good models to consider.
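As a sketch of how Cp is computed, the standard formula is Cp = SSEp/MSEfull − n + 2(p + 1), where p is the number of predictors in the subset and MSEfull is the mean squared error of the full model. The data below are hypothetical:

```python
import numpy as np
from itertools import combinations

def sse(X, y):
    """Residual sum of squares for a least-squares fit with an intercept."""
    A = np.column_stack([np.ones(len(y)), X])
    beta, *_ = np.linalg.lstsq(A, y, rcond=None)
    r = y - A @ beta
    return r @ r

# Hypothetical data: y really depends on only the first two of three predictors.
rng = np.random.default_rng(1)
X = rng.normal(size=(60, 3))
y = 1.0 + 2.0 * X[:, 0] - 1.0 * X[:, 1] + rng.normal(0, 0.5, 60)

n, k = X.shape
mse_full = sse(X, y) / (n - (k + 1))  # full-model mean squared error

# Cp = SSE_p / MSE_full - n + 2(p + 1) for a subset with p predictors.
for p in range(1, k + 1):
    for cols in combinations(range(k), p):
        cp = sse(X[:, cols], y) / mse_full - n + 2 * (p + 1)
        print(cols, round(cp, 1))
# The full model always has Cp = k + 1 = 4; good subsets have Cp near p + 1.
```

Running this, the subset containing the two真 predictors scores a small Cp while subsets that omit them score very large values, mirroring the bias interpretation described above.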
XLMiner offers five different procedures for selecting the best subsets of variables. Backward Elimination begins with all independent variables in the model and deletes one at a time until the best model is identified. Forward Selection begins with a model having no independent variables and successively adds one at a time until no additional variable makes a significant contribution. Stepwise Selection is similar to Forward Selection except that at each step, the procedure considers dropping variables that are no longer statistically significant. Sequential Replacement replaces variables sequentially, retaining those that improve performance. These procedures may terminate with different models. Exhaustive Search examines all combinations of variables to find the one with the best fit, but it can be time consuming when the number of variables is large.
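A simplified sketch of Forward Selection follows, using adjusted R² as the stopping criterion rather than XLMiner's own significance tests; the data and names are hypothetical:

```python
import numpy as np

def adj_r2(X, y):
    """Adjusted R^2 for a least-squares fit with an intercept."""
    n, p = len(y), X.shape[1]
    A = np.column_stack([np.ones(n), X])
    beta, *_ = np.linalg.lstsq(A, y, rcond=None)
    sse = np.sum((y - A @ beta) ** 2)
    sst = np.sum((y - y.mean()) ** 2)
    return 1 - (sse / (n - p - 1)) / (sst / (n - 1))

def forward_selection(X, y):
    """Add the best remaining variable while adjusted R^2 keeps improving."""
    selected, best = [], -np.inf
    while len(selected) < X.shape[1]:
        scores = {c: adj_r2(X[:, selected + [c]], y)
                  for c in range(X.shape[1]) if c not in selected}
        c, score = max(scores.items(), key=lambda kv: kv[1])
        if score <= best:
            break
        selected.append(c)
        best = score
    return selected

# Hypothetical data: only columns 1 and 3 actually drive y.
rng = np.random.default_rng(2)
X = rng.normal(size=(80, 4))
y = 3.0 * X[:, 1] - 2.0 * X[:, 3] + rng.normal(0, 0.5, 80)
print(forward_selection(X, y))  # columns 1 and 3 are picked first
```

Backward Elimination and Stepwise Selection work analogously, starting from the full model or allowing previously added variables to be dropped.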
Figure 8.37 XLMiner Ribbon
Figure 8.41 XLMiner Output Navigator
Figure 8.42 XLMiner Regression Output
XLMiner also provides cross-validation, a process that uses two sets of sample data: one to build the model (called the training set) and a second to assess the model's performance (called the validation set). We explain this in Chapter 10 when we study data mining in more depth; it is not necessary for standard regression analysis.
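The idea behind cross-validation can be sketched in a few lines. XLMiner performs the partitioning for you; here we do it by hand with hypothetical data:

```python
import numpy as np

# Hypothetical data; in practice these would come from a worksheet.
rng = np.random.default_rng(3)
X = rng.normal(size=(100, 2))
y = 1.0 + 2.0 * X[:, 0] + 0.5 * X[:, 1] + rng.normal(0, 0.3, 100)

# Randomly partition the sample: 70 rows to fit (training set),
# 30 rows to assess performance (validation set).
idx = rng.permutation(100)
train, valid = idx[:70], idx[70:]

def design(M):
    """Prepend a column of ones for the intercept."""
    return np.column_stack([np.ones(len(M)), M])

beta, *_ = np.linalg.lstsq(design(X[train]), y[train], rcond=None)

# Out-of-sample error on the validation set gauges predictive performance.
rmse = np.sqrt(np.mean((y[valid] - design(X[valid]) @ beta) ** 2))
print(round(rmse, 2))  # should be near the noise standard deviation, 0.3
```

A model that fits the training set well but predicts the validation set poorly is overfit, which is exactly what cross-validation is designed to detect.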
Problems and Exercises
1. Each worksheet in the Excel file LineFit Data contains a set of data that describes a functional relationship between the dependent variable y and the independent variable x. Construct a line chart of each data set, and use the Add Trendline tool to determine the best-fitting functions to model these data sets.

2. A consumer products company has collected some data relating monthly demand to the price of one of its products:

   Price    Demand
   $11      2,100
   $13      2,020
   $17      1,980
   $19      1,875

   What type of model would best represent these data? Use the Trendline tool to find the best among the options provided.

3. Using the data in the Excel file Demographics, determine if a linear relationship exists between unemployment rates and cost of living indexes by constructing a scatter chart. Visually, do there appear to be any outliers? If so, delete them and then find the best-fitting linear regression line using the Excel Trendline tool. What would you conclude about the strength of any relationship? Would you use regression to make predictions of the unemployment rate based on the cost of living?

4. Using the data in the Excel file Weddings, construct scatter charts to determine whether any linear relationship appears to exist between (1) the wedding cost and attendance, (2) the wedding cost and the value rating, and (3) the couple's income and wedding cost, only for the weddings paid for by the bride and groom. Then find the best-fitting linear regression lines using the Excel Trendline tool for each of these charts.

5. Using the data in the Excel file Student Grades, construct a scatter chart for midterm versus final exam grades and add a linear trendline. What is the regression model? If a student scores 70 on the midterm, what would you predict her grade on the final exam to be?

6. Using the results of fitting the Home Market Value regression line in Example 8.4, compute the errors associated with each observation using formula (8.3) and construct a histogram.
7. Set up an Excel worksheet to apply formulas (8.5) and (8.6) to compute the values of b0 and b1 for the data in the Excel file Home Market Value and verify that you obtain the same values as in Examples 8.4 and 8.5.

8. The managing director of a consulting group has the following monthly data on total overhead costs and professional labor hours to bill to clients:⁴

   Overhead Costs    Billable Hours
   $365,000          3,000
   $400,000          4,000
   $430,000          5,000
   $477,000          6,000
   $560,000          7,000
   $587,000          8,000

   a. Develop a trendline to identify the relationship between billable hours and overhead costs.
   b. Interpret the coefficients of your regression model. Specifically, what does the fixed component of the model mean to the consulting firm?
   c. If a special job requiring 1,000 billable hours that would contribute a margin of $38,000 before overhead were available, would the job be attractive?

9. Using the Excel file Weddings, apply the Excel Regression tool using the wedding cost as the dependent variable and attendance as the independent variable.
   a. Interpret all key regression results, hypothesis tests, and confidence intervals in the output.
   b. Analyze the residuals to determine if the assumptions underlying the regression analysis are valid.
   c. Use the standard residuals to determine if any possible outliers exist.
   d. If a couple is planning a wedding for 175 guests, how much should they budget?

10. Using the Excel file Weddings, apply the Excel Regression tool using the wedding cost as the dependent variable and the couple's income as the independent variable, only for those weddings paid for by the bride and groom.
   a. Interpret all key regression results, hypothesis tests, and confidence intervals in the output.
   b. Analyze the residuals to determine if the assumptions underlying the regression analysis are valid.
   c. Use the standard residuals to determine if any possible outliers exist.
   d. If a couple makes $70,000 together, how much would they probably budget for the wedding?

11. Using the data in the Excel file Demographics, apply the Excel Regression tool using unemployment rate as the dependent variable and cost of living index as the independent variable.
   a. Interpret all key regression results, hypothesis tests, and confidence intervals in the output.
   b. Analyze the residuals to determine if the assumptions underlying the regression analysis are valid.
   c. Use the standard residuals to determine if any possible outliers exist.

12. Using the data in the Excel file Student Grades, apply the Excel Regression tool using the midterm grade as the independent variable and the final exam grade as the dependent variable.
   a. Interpret all key regression results, hypothesis tests, and confidence intervals in the output.
   b. Analyze the residuals to determine if the assumptions underlying the regression analysis are valid.
   c. Use the standard residuals to determine if any possible outliers exist.

13. The Excel file National Football League provides various data on professional football for one season.
   a. Construct a scatter diagram for Points/Game and Yards/Game in the Excel file. Does there appear to be a linear relationship?
   b. Develop a regression model for predicting Points/Game as a function of Yards/Game. Explain the statistical significance of the model.
   c. Draw conclusions about the validity of the regression analysis assumptions from the residual plot and standard residuals.

14. A deep-foundation engineering contractor has bid on a foundation system for a new building housing the world headquarters for a Fortune 500 company.

⁴Modified from Charles T. Horngren, George Foster, and Srikant M. Datar, Cost Accounting: A Managerial Emphasis, 9th ed. (Englewood Cliffs, NJ: Prentice Hall, 1997): 371.
A part of the project consists of installing 311 auger cast piles. The contractor was given bid information for cost-estimating purposes, which consisted of the estimated depth of each pile; however, the actual drill footage of each pile could not be determined exactly until construction was performed. The Excel file Pile Foundation contains the estimates and actual pile lengths after the project was completed. Develop a linear regression model to estimate the actual pile length as a function of the estimated pile lengths. What do you conclude?

15. The Excel file Concert Sales provides data on sales dollars and the number of radio, TV, and newspaper ads promoting the concerts for a group of cities. Develop simple linear regression models for predicting sales as a function of the number of each type of ad. Compare these results to a multiple linear regression model using all of the independent variables. Examine the residuals of the best model for regression assumptions and possible outliers.

16. Using the data in the Excel file Home Market Value, develop a multiple linear regression model for estimating the market value as a function of both the age and size of the house. Predict the value of a house that is 30 years old and has 1,800 square feet, and one that is 5 years old and has 2,800 square feet.

17. The Excel file Cereal Data provides a variety of nutritional information about 67 cereals and their shelf location in a supermarket. Use regression analysis to find the best model that explains the relationship between calories and the other variables. Investigate the model assumptions and clearly explain your conclusions. Keep in mind the principle of parsimony!

18. The Excel file Salary Data provides information on current salary, beginning salary, previous experience (in months) when hired, and total years of education for a sample of 100 employees in a firm.
   a. Develop a multiple regression model for predicting current salary as a function of the other variables.
   b. Find the best model for predicting current salary using the t-value criterion.

19. The Excel file Credit Approval Decisions provides information on credit history for a sample of banking customers. Use regression analysis to identify the best model for predicting the credit score as a function of the other numerical variables. For the model you select, conduct further analysis to check for significance of the independent variables and for multicollinearity.

20. Using the data in the Excel file Freshman College Data, identify the best regression model for predicting the first-year retention rate. For the model you select, conduct further analysis to check for significance of the independent variables and for multicollinearity.

21. The Excel file Major League Baseball provides data on the 2010 season.
   a. Construct and examine the correlation matrix. Is multicollinearity a potential problem?
   b. Suggest an appropriate set of independent variables that predict the number of wins by examining the correlation matrix.
   c. Find the best multiple regression model for predicting the number of wins. How good is your model? Does it use the same variables you thought were appropriate in part (b)?

22. The Excel file Golfing Statistics provides data for a portion of the 2010 professional season for the top 25 golfers.
   a. Find the best multiple regression model for predicting earnings/event as a function of the remaining variables.
   b. Find the best multiple regression model for predicting average score as a function of the other variables except earnings and events.

23. Use the p-value criterion to find a good model for predicting the number of points scored per game by football teams using the data in the Excel file National Football League.

24. The State of Ohio Department of Education has a mandated ninth-grade proficiency test that covers writing, reading, mathematics, citizenship (social studies), and science. The Excel file Ohio Education Performance provides data on success rates (defined as the percent of students passing) in school districts in the greater Cincinnati metropolitan area along with state averages.
   a. Suggest the best regression model to predict math success as a function of success in the other subjects by examining the correlation matrix; then run the regression tool for this set of variables.
   b. Develop a multiple regression model to predict math success as a function of success in all other subjects using the systematic approach described in this chapter. Is multicollinearity a problem?
   c. Compare the models in parts (a) and (b). Are they the same? Why or why not?

regression, and examine the residual plot. What do you conclude? Construct a scatter chart and use the Excel Trendline feature to identify the best type of curvilinear trendline that maximizes R².

   Units Produced    Costs

⁵Horngren, Foster, and Datar, Cost Accounting: A Managerial Emphasis, 9th ed.: 349.
⁶Horngren, Foster, and Datar, Cost Accounting: A Managerial Emphasis, 9th ed.: 349.
In reviewing the PLE data, Elizabeth Burke noticed that defects received from suppliers have decreased (worksheet Defects After Delivery). Upon investigation, she learned that in 2010, PLE experienced some quality problems due to an increasing number of defects in materials received from suppliers. The company instituted an initiative in August 2011 to work with suppliers to reduce these defects, to more closely coordinate deliveries, and to improve materials quality through reengineering supplier production policies. Elizabeth noted that the program appeared to reverse an increasing trend in defects; she would like to predict what might have happened had the supplier initiative not been implemented and how the number of defects might further be reduced in the near future.

In meeting with PLE's human resources director, Elizabeth also discovered a concern about the high rate of turnover in its field service staff. Senior managers have suggested that the department look more closely at its recruiting policies, particularly to try to identify the characteristics of individuals that lead to greater retention. However, in a recent staff meeting, HR managers could not agree on these characteristics. Some argued that years of education and grade point averages were good predictors. Others argued that hiring more mature applicants would lead to greater retention. To study these factors, the staff agreed to conduct a statistical study to determine the effect that years of education, college grade point average, and age when hired have on retention. A sample of 40 field service engineers hired 10 years ago was selected to determine the influence of these variables on how long each individual stayed with the company. Data are compiled in the Employee Retention worksheet.

Finally, as part of its efforts to remain competitive, PLE tries to keep up with the latest in production technology. This is especially important in the highly competitive lawn-mower line, where competitors can gain a real advantage if they develop more cost-effective means of production. The lawn-mower division therefore spends a great deal of effort in testing new technology. When new production technology is introduced, firms often experience learning, resulting in a gradual decrease in the time required to produce successive units. Generally, the rate of improvement declines until the production time levels off. One example is the production of a new design for lawn-mower engines. To determine the time required to produce these engines, PLE produced 50 units on its production line; test results are given on the worksheet Engines in the database. Because PLE is continually developing new technology, understanding the rate of learning can be useful in estimating future production costs without having to run extensive prototype trials, and Elizabeth would like a better handle on this.

Use techniques of regression analysis to assist her in evaluating the data in these three worksheets and reaching useful conclusions. Summarize your work in a formal report with all appropriate results and analyses.