4 Regression Analysis
Introduction
• In engineering analysis, we often want to fit a trend line or curve to a set of x-y data.
• Consider a set of n measurements of some variable y as a function of another variable x.
• Typically, y is some measured output as a function of some known input, x.
• In general, in such a set of measurements, there may be:
Some scatter (precision error or random error).
A trend - in spite of the scatter, y may show an overall increase with x, or perhaps an
overall decrease with x.
• The linear correlation coefficient is used to determine if there is a trend.
• If there is a trend, regression analysis is used to find an equation for y as a function of x that provides the best fit to the data.
Linear regression analysis
• Linear regression analysis is also called linear least-squares fit analysis.
• The goal of linear regression analysis is to find the “best fit” straight line
through a set of y vs. x data.
• The technique for deriving equations for this best-fit or least-squares fit
line is as follows:
An equation for a straight line that attempts to fit the data pairs is chosen as Y = ax + b.
In the above equation, a is the slope (a = dy/dx – most of us are more
familiar with the symbol m rather than a for the slope of a line), and b is
the y-intercept – the y location where the line crosses the y axis (in other
words, the value of Y at x = 0).
An upper case Y is used for the fitted line to distinguish the fitted data from
the actual data values, y.
In linear regression analysis, coefficients a and b are optimized for the best
possible fit to the data.
The optimization process itself is actually very straightforward:
For each data pair (xi, yi), the error ei is defined as the difference between the predicted or fitted value and the actual value: ei = Yi − yi = axi + b − yi. ei is also called the residual.
Note: Here, what we call the actual value does not necessarily mean the
“correct” value, but rather the value of the actual measured data point.
We define E as the sum of the squared errors of the fit – a global
measure of the error associated with all n data points. The equation
for E is
E = Σ ei² = Σ (axi + b − yi)²,
where the sums are taken over all n data pairs (i = 1 to n).
It is now assumed that the best fit is the one for which E is the
smallest.
In other words, coefficients a and b that minimize E need to be
found. These coefficients are the ones that create the best-fit straight
line Y = ax + b.
How can a and b be found such that E is minimized? Well, as any
good engineer or mathematician knows, to find a minimum (or
maximum) of a quantity, that quantity is differentiated, and the
derivative is set to zero.
Here, two partial derivatives are required, since E is a function of two
variables, a and b. Therefore, we set
∂E/∂a = 0 and ∂E/∂b = 0.
After some algebra, which can be verified, the following equations
result for coefficients a and b:
a = [n Σ(xiyi) − (Σxi)(Σyi)] / [n Σ(xi²) − (Σxi)²]
b = [(Σxi²)(Σyi) − (Σxi)(Σxiyi)] / [n Σ(xi²) − (Σxi)²]
• Coefficients a and b can easily be calculated in a spreadsheet by the
following steps (a code sketch of the same calculation follows this list):
Create columns for xi, yi, xiyi, and xi².
Sum these columns over all n rows of data pairs.
Using these sums, calculate a and b with the above formulas.
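For readers who prefer code to spreadsheet columns, here is a minimal Python sketch of the same calculation; the data values are hypothetical placeholders, not from the example below.

```python
# Least-squares slope a and intercept b from the column sums,
# mirroring the spreadsheet procedure described above.
xs = [1.0, 2.0, 3.0, 4.0, 5.0]   # hypothetical x data
ys = [2.1, 3.9, 6.2, 8.1, 9.8]   # hypothetical y data
n = len(xs)

sum_x = sum(xs)
sum_y = sum(ys)
sum_xy = sum(x * y for x, y in zip(xs, ys))
sum_x2 = sum(x * x for x in xs)

denom = n * sum_x2 - sum_x ** 2
a = (n * sum_xy - sum_x * sum_y) / denom       # slope
b = (sum_x2 * sum_y - sum_x * sum_xy) / denom  # y-intercept

print(f"Y = {a:.4f} x + {b:.4f}")              # prints Y = 1.9600 x + 0.1400
```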
• Modern spreadsheets and programs like Matlab, MathCad, etc. have
built-in regression analysis tools, but it is good to understand what
the equations mean and where they come from. In the Excel
spreadsheet that accompanies this learning module, coefficients a
and b are calculated two ways for each example case – “by hand”
using the above equations, and with the built-in regression analysis
package. As can be seen, the agreement is excellent, confirming that
we have not made any algebra mistakes in the derivation.
Example:
• Given: 20 data pairs (y vs. x) – the same data used in a previous
example problem in the learning module about correlation and
trends. Recall that we calculated the linear correlation coefficient to
be rxy = 0.480.
• The data pairs are listed below, along with a scatter plot of the data.
• To do: Find the best linear fit to the data.
Solution:
• We use the above equations for coefficients a and b with n = 20; we
calculate a = 3.241 and b = 4.082, to four significant digits. Thus, the
best linear fit to the data is Y = 3.241x + 4.082.
• Alternately, using Excel’s built-in regression analysis macro, the
following output is generated:
• Office 2003 and older: Tools-Data Analysis-Regression
• Office 2007 and later: Data tab. In Analysis area, Data Analysis-
Regression
• In Excel’s notation, the y-intercept b is in the row called “Intercept”
and the column called “Coefficients”. The slope a is in the row called
“X Variable 1” and the same column (“Coefficients”). The values
agree with those calculated from the equations above, verifying our
algebra.
• Notice also the item called “Multiple R”. In Excel, Multiple R is the
absolute value of the linear correlation coefficient, rxy. For these
example data, rxy was calculated previously as 0.480, which agrees
with the result from Excel’s regression analysis (to about 7 significant
digits anyway).
• The best-fit line is plotted in the above figure as the solid blue line.
• The best-fit line (compared to any other line) has the smallest
possible sum of the squared errors, E, since coefficients a and b were
found by minimizing E (forcing the derivatives of E with respect to a
and b to be equal to zero).
• The upward trend of the data appears more obvious by eye when the
least-squares line is drawn through the data.
Discussion:
• Recall from the previous example problem that we could not judge by
eye whether or not there is a trend in these data. In the previous
problem we calculated the linear correlation coefficient and showed
that we can be more than 95% confident that a trend exists in these
data. In the present problem, we found the best-fit straight line that
quantifies the trend in the data.
Standard error
• A useful measure of error is called the standard error of estimate, Sy,x,
which is sometimes called simply standard error.
• For a linear fit, the standard error of estimate is defined as
Sy,x = √[ Σ(axi + b − yi)² / (n − 2) ],
which reduces to
Sy,x = √[ E / (n − 2) ].
• Sy,x is a measure of the data scatter about the best-fit line, and has
the same units as y itself.
• Sy,x is a kind of “standard deviation” of the predicted least-squares fit
values compared to the original data.
• Sy,x for this problem turns out to be about 3.601 (in y units), as
verified both by calculation with the above formula and by Excel’s
regression analysis summary. (See Excel’s Summary Output above –
Standard Error = 3.600806.)
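As a self-contained continuation of the earlier Python sketch (same hypothetical data; a and b carried over from that fit), the standard error of estimate can be computed directly from its definition:

```python
import math

xs = [1.0, 2.0, 3.0, 4.0, 5.0]   # hypothetical data from the earlier sketch
ys = [2.1, 3.9, 6.2, 8.1, 9.8]
n = len(xs)
a, b = 1.96, 0.14                # least-squares coefficients from the earlier sketch

# Sum of squared residuals E, then the standard error of estimate.
E = sum((a * x + b - y) ** 2 for x, y in zip(xs, ys))
S_yx = math.sqrt(E / (n - 2))    # n - 2 because two coefficients (a and b) were fit
print(f"S_y,x = {S_yx:.4f}")
```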
Some cautions about using linear regression analysis
• Scatter in the y data is assumed to be purely random. The scatter is
assumed to follow a normal or Gaussian distribution. This may not
actually be the case. For example, a jump in y at a certain x value may
be due to some real, repeatable effect, not just random noise.
• The x values are assumed to be error-free. In reality, there may be
errors in the measurement of x as well as y. These are not accounted
for in the simple regression analysis described above. (More
advanced regression analysis techniques are available that can
account for this.)
• The reverse equation is not guaranteed. In particular, the linear least-
squares fit for y versus x was found, satisfying the equation Y = ax + b.
Solving this equation for x gives x = (1/a)Y − b/a. This reverse
equation is not necessarily the best fit of x vs. y; if the linear
regression analysis were done on x vs. y instead of y vs. x, a different
line would generally result, as the sketch below illustrates.
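A brief numerical illustration of this point, with hypothetical data and numpy’s polyfit standing in for the hand-derived formulas: inverting the y-on-x fit and regressing x on y directly give different slopes whenever there is scatter.

```python
import numpy as np

rng = np.random.default_rng(0)                      # reproducible hypothetical data
x = np.linspace(0.0, 10.0, 50)
y = 2.0 * x + 1.0 + rng.normal(0.0, 2.0, x.size)    # line plus random scatter

a, b = np.polyfit(x, y, 1)                          # y-on-x fit: Y = a x + b
c, d = np.polyfit(y, x, 1)                          # x-on-y fit: X = c y + d

# Inverting the y-on-x fit predicts slope 1/a; the x-on-y fit actually gives c.
print(f"1/a = {1.0 / a:.4f}  vs.  c = {c:.4f}  (equal only for perfect correlation)")
```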
• The fit is strongly affected by erroneous data points. If there are some
data points that are far out of line with the majority (outliers), the
least-squares fit may not yield the desired result. The following
example illustrates this effect:
• With all the data points used, the three stray data points (outliers)
have ruined the rest of the fit (solid blue line). For this case, rxy =
0.5745 and Sy,x = 4.787.
• If these three outliers are removed, the least-squares fit follows the
overall trend of the other data points much more accurately (dashed
green line). For this case, rxy = 0.9956 and Sy,x = 0.5385. The linear
correlation coefficient is significantly higher (better correlation), and
the standard error is significantly lower (better fit). A small code
demonstration of this outlier effect appears after this list of cautions.
• In a separate learning module we discuss techniques for properly
removing outliers.
• To protect against such undesired effects, more complex least-squares
methods, such as the robust straight-line fit, are required. Discussion
of these methods is beyond the scope of the present course.
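The data from the figure above are not reproduced here, but the outlier effect is easy to demonstrate with hypothetical numbers: fitting once with and once without a few planted outliers shows how strongly they pull the least-squares line and inflate the standard error.

```python
import numpy as np

rng = np.random.default_rng(1)
x = np.arange(1.0, 21.0)                            # hypothetical x data
y = 0.5 * x + 2.0 + rng.normal(0.0, 0.3, x.size)    # linear trend plus mild scatter
y[[4, 10, 15]] += 8.0                               # plant three outliers

def fit_stats(xv, yv):
    """Return slope, intercept, and standard error of a linear least-squares fit."""
    a, b = np.polyfit(xv, yv, 1)
    s = np.sqrt(np.sum((yv - (a * xv + b)) ** 2) / (len(xv) - 2))
    return a, b, s

keep = np.ones(x.size, dtype=bool)
keep[[4, 10, 15]] = False                           # drop the known outliers

print("all points:       a=%.3f, b=%.3f, S_y,x=%.3f" % fit_stats(x, y))
print("outliers removed: a=%.3f, b=%.3f, S_y,x=%.3f" % fit_stats(x[keep], y[keep]))
```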
Linear regression with multiple variables
• Linear regression with multiple variables is a feature included with
most modern spreadsheets.
• Consider a response, y, which is a function of m independent variables
x1, x2, ..., xm, i.e., y = y(x1, x2, ..., xm).
• Suppose y is measured at n operating points (n sets of values of y as a
function of each of the other variables).
• To perform a linear regression on these data using Excel, select the
cells for y (in one column as previously), and a range of cells for x1, x2,
..., xm (in multiple columns), and then run the built-in regression
analysis.
• When there is more than one independent variable, we use a more
general equation for the standard error,
Sy,x = √[ E / (n − (m + 1)) ],
since m + 1 coefficients (b, a1, ..., am) are calculated by the fit.
• For example, for m = 3 the fitted equation is Y = b + a1x1 + a2x2 + a3x3,
with coefficients a1 = ∂y/∂x1, a2 = ∂y/∂x2, and a3 = ∂y/∂x3,
which are the slopes of y with respect to parameters x1, x2, and x3,
respectively.
• Note that we use partial derivatives (∂) rather than total derivatives
(d) here, since y is a function of more than one variable. (A code
sketch of such a fit follows.)
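As a sketch of the same multiple-variable fit outside Excel (hypothetical data; numpy’s least-squares solver stands in for the built-in regression tool):

```python
import numpy as np

rng = np.random.default_rng(2)
n = 30                                         # hypothetical operating points
X = rng.uniform(0.0, 5.0, size=(n, 3))         # columns: x1, x2, x3
y = 4.0 + 1.5 * X[:, 0] - 0.7 * X[:, 1] + 2.2 * X[:, 2] + rng.normal(0.0, 0.2, n)

A = np.column_stack([np.ones(n), X])           # prepend a column of ones for b
coef, *_ = np.linalg.lstsq(A, y, rcond=None)   # [b, a1, a2, a3]
b, a1, a2, a3 = coef

resid = y - A @ coef
m = 3                                          # number of independent variables
S_yx = np.sqrt(np.sum(resid ** 2) / (n - (m + 1)))
print(f"Y = {b:.3f} + {a1:.3f} x1 + {a2:.3f} x2 + {a3:.3f} x3,  S_y,x = {S_yx:.3f}")
```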
• A portion of the regression analysis results is shown below (image
copied from Excel), with the most important cells highlighted:
Discussion:
• The fit is pretty good, implying that there is little scatter in the data,
and the data fit well with the simple linear equation. We know this is
a good fit by looking at the linear correlation coefficient (Multiple R),
which is greater than 0.99, and the Standard Error, which is only 0.21
for y values ranging from about 4 to about 15. We can claim a
successful curve fit.
Comments:
• In addition to random scatter in the data, there may also be cross-talk
between some of the parameters. For example, y may have terms with
products like x1x2, x2x3², etc., which are clearly nonlinear terms.
Nevertheless, a multiple parameter linear regression analysis is often
performed only locally, around the operating point, and the linear
assumption is reasonably accurate, at least close to the operating point.
• In addition, variables x1, x2, and x3 may not be totally independent of each
other in a real experiment.
• Regression analysis with multiple variables becomes quite useful to us later
in the course when we discuss optimization techniques such as response
surface methodology.
Nonlinear and higher-order polynomial regression analysis
• Not all data are linear, and a straight line fit may not be appropriate. A
good example is thermocouple voltage versus temperature. The
relationship is nearly linear, but not quite; that is in fact the very
reason for the necessity of thermocouple tables.
• For nonlinear data, some transformation tricks can be employed,
using logarithms or other functions; a brief code sketch of one such
trick follows.
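For instance, if the data are expected to follow y = c·e^(kx), taking the natural log gives ln y = ln c + kx, which is linear in x. A minimal Python sketch of this trick, with hypothetical data:

```python
import numpy as np

rng = np.random.default_rng(3)
x = np.linspace(0.1, 4.0, 25)                                  # hypothetical x data
y = 3.0 * np.exp(0.8 * x) * rng.lognormal(0.0, 0.05, x.size)   # noisy exponential

k, ln_c = np.polyfit(x, np.log(y), 1)    # ordinary linear fit in log space
print(f"y ≈ {np.exp(ln_c):.3f} * exp({k:.3f} x)")
```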
• For some data, a good curve fit can be obtained using a polynomial fit
of some appropriate order. The order of a polynomial is defined by m,
the maximum exponent of x in the fit:
A zeroth-order polynomial (m = 0) is just a constant: y = b.
A first-order polynomial (m = 1) is a constant plus a linear term:
y = b + a1x. (A first-order polynomial fit is the same as the linear
least-squares fit we have already learned how to do.)
A second-order polynomial (m = 2) adds a quadratic term:
y = b + a1x + a2x². (A second-order polynomial fit is often called a
quadratic fit.)
A third-order polynomial (m = 3) adds a cubic term:
y = b + a1x + a2x² + a3x³. (A third-order polynomial fit is often
called a cubic fit.)
An mth-order polynomial continues this pattern up to amx^m:
y = b + a1x + a2x² + a3x³ + ... + amx^m.
• Excel can be manipulated to perform least-squares polynomial fits of
any order m, since Excel can perform regression analysis on more
than one independent variable simultaneously. The procedure is as
follows:
• To the right of the x column, add new columns for x², x³, ..., x^m.
• Perform a multiple variable regression analysis as previously, except
choose all the data cells (x, x², x³, ..., x^m) as the “Input X Range” in
the Regression working window.
• Note that m is the order of the polynomial, which is also treated as the
number of independent variables in the fit. Excel treats each of the m
columns as a separate variable. The output of the regression analysis
includes the y-intercept as previously (equal to our constant b), and also a
least-squares coefficient for each of the columns, i.e., for each of the
variables x, x², x³, ..., x^m:
• The coefficient for “X Variable 1” is a1, corresponding to the x variable.
• The coefficient for “X Variable 2” is a2, corresponding to the x² variable.
• The coefficient for “X Variable 3” is a3, corresponding to the x³ variable.
• ...
• The coefficient for “X Variable m” is am, corresponding to the x^m variable.
• Finally, the fitted curve is constructed from the equation, i.e.,
Y = b + a1x + a2x² + a3x³ + ... + amx^m. (A code sketch of this
column-building procedure follows.)
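The same column-building procedure is easy to mimic outside Excel; here is a minimal Python sketch (hypothetical data) of a cubic (m = 3) fit built exactly this way:

```python
import numpy as np

rng = np.random.default_rng(4)
x = np.linspace(-2.0, 2.0, 40)                     # hypothetical x data
y = 1.0 + 0.5 * x - 2.0 * x**2 + 0.8 * x**3 + rng.normal(0.0, 0.1, x.size)

m = 3                                              # order of the polynomial
# Build columns [1, x, x^2, ..., x^m], just like the spreadsheet columns.
A = np.column_stack([np.ones_like(x)] + [x ** p for p in range(1, m + 1)])
coef, *_ = np.linalg.lstsq(A, y, rcond=None)       # [b, a1, a2, ..., am]

b_fit, *a_fit = coef
terms = " + ".join(f"{a:.3f} x^{p}" for p, a in enumerate(a_fit, start=1))
print(f"Y = {b_fit:.3f} + {terms}")
```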
Example:
• Given: x and y data pairs, as shown: