
Analysis of Experimental Data
Regression Analysis
Introduction
• In engineering analysis, we often want to fit a trend line or curve to a set of x-y data.
• Consider a set of n measurements of some variable y as a function of another variable x.
• Typically, y is some measured output as a function of some known input, x.
• In general, in such a set of measurements, there may be:
Some scatter (precision error or random error).
A trend - in spite of the scatter, y may show an overall increase with x, or perhaps an
overall decrease with x.
• The linear correlation coefficient is used to determine if there is a trend.
• If there is a trend, regression analysis is used to find an equation for y as a function of x
that provides the best fit to the data.
Linear regression analysis
• Linear regression analysis is also called linear least-squares fit analysis.
• The goal of linear regression analysis is to find the “best fit” straight line
through a set of y vs. x data.
• The technique for deriving equations for this best-fit or least-squares fit
line is as follows:
An equation for a straight line that attempts to fit the data pairs is chosen
as Y = ax + b .
In the above equation, a is the slope (a = dy/dx – most of us are more
familiar with the symbol m rather than a for the slope of a line), and b is
the y-intercept – the y location where the line crosses the y axis (in other
words, the value of Y at x = 0).
An upper case Y is used for the fitted line to distinguish the fitted data from
the actual data values, y.
In linear regression analysis, coefficients a and b are optimized for the best
possible fit to the data.
The optimization process itself is actually very straightforward:
For each data pair (x_i, y_i), error e_i is defined as the difference between the
predicted or fitted value and the actual value: e_i = Y_i − y_i = a x_i + b − y_i.
e_i is also called the residual.
Note: Here, what we call the actual value does not necessarily mean the
“correct” value, but rather the value of the actual measured data point.
We define E as the sum of the squared errors of the fit – a global
measure of the error associated with all n data points. The equation
for E is

E = \sum_{i=1}^{n} e_i^2 = \sum_{i=1}^{n} (a x_i + b - y_i)^2
It is now assumed that the best fit is the one for which E is the
smallest.
In other words, coefficients a and b that minimize E need to be
found. These coefficients are the ones that create the best-fit straight
line Y = ax + b.
How can a and b be found such that E is minimized? Well, as any
good engineer or mathematician knows, to find a minimum (or
maximum) of a quantity, that quantity is differentiated, and the
derivative is set to zero.
Here, two partial derivatives are required, since E is a function of two
variables, a and b. Therefore, we set

\frac{\partial E}{\partial a} = 0 \qquad \text{and} \qquad \frac{\partial E}{\partial b} = 0

After some algebra, which can be verified, the following equations
result for coefficients a and b:

a = \frac{n \sum x_i y_i - \sum x_i \sum y_i}{n \sum x_i^2 - \left( \sum x_i \right)^2},
\qquad
b = \frac{\sum x_i^2 \sum y_i - \sum x_i \sum x_i y_i}{n \sum x_i^2 - \left( \sum x_i \right)^2}
• Coefficients a and b can easily be calculated in a spreadsheet by the
following steps:
Create columns for x_i, y_i, x_i y_i, and x_i^2.
Sum these columns over all n rows of data pairs.
Using these sums, calculate a and b with the above formulas.
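The same computation is easy to script. Below is a minimal Python sketch (NumPy assumed; the function name and sample data are invented for illustration, not taken from the example problem) that builds the same column sums and applies the formulas above:

```python
import numpy as np

def linear_fit(x, y):
    """Slope a and intercept b of the least-squares line Y = a*x + b,
    computed from the same column sums used in the spreadsheet."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    n = len(x)
    sum_x, sum_y = x.sum(), y.sum()
    sum_xy = (x * y).sum()   # the x_i * y_i column
    sum_x2 = (x * x).sum()   # the x_i^2 column
    a = (n * sum_xy - sum_x * sum_y) / (n * sum_x2 - sum_x**2)
    b = (sum_y - a * sum_x) / n   # equivalent closed form for the intercept
    return a, b

# Usage with invented sample data:
a, b = linear_fit([1.0, 2.0, 3.0, 4.0, 5.0], [2.1, 3.9, 6.2, 8.1, 9.8])
print(f"Y = {a:.4f} x + {b:.4f}")
```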
• Modern spreadsheets and programs like Matlab, MathCad, etc. have
built-in regression analysis tools, but it is good to understand what
the equations mean and from where they come. In the Excel
spreadsheet that accompanies this learning module, coefficients a
and b are calculated two ways for each example case – “by hand”
using the above equations, and with the built-in regression analysis
package. As can be seen, the agreement is excellent, confirming that
we have not made any algebra mistakes in the derivation.
Example:
• Given: 20 data pairs (y vs. x), the same data used in a previous
example problem in the learning module about correlation and
trends. Recall that we calculated the linear correlation coefficient to
be r_xy = 0.480.
• The data pairs are listed below, along with a scatter plot of the data.
• To do: Find the best linear fit to the data.
Solution:
• We use the above equations for coefficients a and b with n = 20; we
calculate a = 3.241 and b = 4.082, to four significant digits. Thus, the
best linear fit to the data is Y = 3.241x + 4.082.
• Alternatively, using Excel’s built-in regression analysis macro, the
following output is generated:
• Office 2003 and older: Tools-Data Analysis-Regression
• Office 2007 and later: Data tab. In Analysis area, Data Analysis-
Regression
• In Excel’s notation, the y-intercept b is in the row called “Intercept”
and the column called “Coefficients”. The slope a is in the row called
“X Variable 1” and the same column (“Coefficients”). The values
agree with those calculated from the equations above, verifying our
algebra.
• Notice also the item called “Multiple R”. In Excel, Multiple R is the
absolute value of the linear correlation coefficient, r_xy. For these
example data, r_xy was calculated previously as 0.480, which agrees
with the result from Excel’s regression analysis (to about 7 significant
digits anyway).
• The best-fit line is plotted in the above figure as the solid blue line.
• The best-fit line (compared to any other line) has the smallest
possible sum of the squared errors, E, since coefficients a and b were
found by minimizing E (forcing the derivatives of E with respect to a
and b to be equal to zero).
• The upward trend of the data appears more obvious by eye when the
least-squares line is drawn through the data.
Discussion:
• Recall from the previous example problem that we could not judge by
eye whether or not there is a trend in these data. In the previous
problem we calculated the linear correlation coefficient and showed
that we can be more than 95% confident that a trend exists in these
data. In the present problem, we found the best-fit straight line that
quantifies the trend in the data.
Standard error
• A useful measure of error is called the standard error of estimate, S_y,x,
which is sometimes called simply standard error.
• For a linear fit,

S_{y,x} = \sqrt{ \frac{\sum_{i=1}^{n} (Y_i - y_i)^2}{n - 2} }

which reduces to

S_{y,x} = \sqrt{ \frac{E}{n - 2} }

• S_y,x is a measure of the data scatter about the best-fit line, and has
the same units as y itself.
• S_y,x is a kind of “standard deviation” of the predicted least-squares fit
values compared to the original data.
• S_y,x for this problem turns out to be about 3.601 (in y units), as
verified both by calculation with the above formula and by Excel’s
regression analysis summary (see Excel’s Summary Output above:
Standard Error = 3.600806).
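As a quick check outside the spreadsheet, here is a minimal Python sketch (NumPy assumed; the helper name is invented) that evaluates E and S_y,x directly from the fitted coefficients:

```python
import numpy as np

def standard_error(x, y, a, b):
    """Standard error of estimate S_y,x = sqrt(E / (n - 2)) for a linear fit."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    residuals = (a * x + b) - y        # e_i = Y_i - y_i
    E = (residuals ** 2).sum()         # sum of squared errors
    return np.sqrt(E / (len(x) - 2))   # n - 2: two coefficients were fitted
```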
Some cautions about using linear regression analysis
• Scatter in the y data is assumed to be purely random. The scatter is
assumed to follow a normal or Gaussian distribution. This may not
actually be the case. For example, a jump in y at a certain x value may
be due to some real, repeatable effect, not just random noise.
• The x values are assumed to be error-free. In reality, there may be
errors in the measurement of x as well as y. These are not accounted
for in the simple regression analysis described above. (More
advanced regression analysis techniques are available that can
account for this.)
• The reverse equation is not guaranteed. In particular, the linear least-
squares fit for y versus x was found, satisfying the equation Y = ax +
b. The reverse of this equation is x = (1 / a)Y – b / a . This reverse
equation is not necessarily the best fit of x vs. y, if the linear
regression analysis were done on x vs. y instead of y vs. x.
• The fit is strongly affected by erroneous data points. If there are some
data points that are far out of line with the majority (outliers), the
least-squares fit may not yield the desired result. The following
example illustrates this effect:
• With all the data points used, the three stray data points (outliers)
have ruined the rest of the fit (solid blue line). For this case, r_xy =
0.5745 and S_y,x = 4.787.
• If these three outliers are removed, the least-squares fit follows the
overall trend of the other data points much more accurately (dashed
green line). For this case, r_xy = 0.9956 and S_y,x = 0.5385. The linear
correlation coefficient is significantly higher (better correlation), and
the standard error is significantly lower (better fit). A small synthetic
demonstration of this sensitivity appears at the end of this section.
• In a separate learning module we discuss techniques for properly
removing outliers.
• To protect against such undesired effects, more complex least-squares
methods, such as the robust straight-line fit, are required. Discussion
of these methods is beyond the scope of the present course.
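To make the outlier effect concrete, the following Python sketch (entirely synthetic, invented data; NumPy assumed) fits a line to clean data and then to the same data with one stray point injected; the fitted slope and intercept shift noticeably:

```python
import numpy as np

rng = np.random.default_rng(0)
x = np.arange(10, dtype=float)
y = 2.0 * x + 1.0 + rng.normal(0.0, 0.2, size=x.size)  # clean, nearly linear data
y_bad = y.copy()
y_bad[7] += 15.0                                       # inject one large outlier

for label, yy in (("clean", y), ("with outlier", y_bad)):
    a, b = np.polyfit(x, yy, 1)  # first-order (linear) least-squares fit
    print(f"{label:>13}: a = {a:.3f}, b = {b:.3f}")
```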
Linear regression with multiple variables
• Linear regression with multiple variables is a feature included with
most modern spreadsheets.
• Consider response, y, which is a function of m independent variables
x_1, x_2, ..., x_m, i.e., y = y(x_1, x_2, ..., x_m).
• Suppose y is measured at n operating points (n sets of values of y as a
function of each of the other variables).
• To perform a linear regression on these data using Excel, select the
cells for y (in one column as previously) and a range of cells for x_1, x_2,
..., x_m (in multiple columns), and then run the built-in regression
analysis.
• When there is more than one independent variable, we use a more
general equation for the standard error,

S_{y,x} = \sqrt{ \frac{\sum_{i=1}^{n} e_i^2}{df} } = \sqrt{ \frac{E}{df} }

• where df = degrees of freedom, df = n – (m + 1), n is the number of
data points or operating points, and m is the number of independent
variables.
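For reference, here is a minimal NumPy sketch of the same idea (the function name is invented, and this is a generic least-squares solve, not Excel's internal algorithm), returning the intercept, the slopes, and the general standard error:

```python
import numpy as np

def multiple_linear_fit(X, y):
    """Least-squares fit of y = b + a_1*x_1 + ... + a_m*x_m.
    X is an (n, m) array of operating points; y has length n."""
    X = np.asarray(X, dtype=float)
    y = np.asarray(y, dtype=float)
    n, m = X.shape
    A = np.column_stack([np.ones(n), X])        # leading column of 1s carries b
    coef, *_ = np.linalg.lstsq(A, y, rcond=None)
    b, a = coef[0], coef[1:]
    E = ((A @ coef - y) ** 2).sum()             # sum of squared residuals
    df = n - (m + 1)                            # degrees of freedom
    return b, a, np.sqrt(E / df)                # intercept, slopes, S_y,x
```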
Example:
• Given: In this example, we perform linear
regression analysis with multiple variables.
• We assume that the measured quantity y is a linear function of three
independent variables, x_1, x_2, and x_3, i.e., y = b + a_1x_1 + a_2x_2 + a_3x_3.
• Nine data points are measured by setting
three levels for each parameter, and the data
are placed into a simple data array as shown
to the right (the image is taken from an Excel
spreadsheet).
• To do: Calculate the y-intercept and the three slopes simultaneously,
one slope for each independent variable x_1, x_2, and x_3.
Solution:
• We perform a linear regression on these data points to determine the best
(least-squares) linear fit to the data.
• In Excel, the multiple variable regression analysis procedure is similar to that for a
single independent variable, except that we choose several columns of x data
instead of just one column:
• Launch the macro (Data Analysis-Regression). The default options are fine for
illustrative purposes.
• The nine values of y in the y-column are selected for Input Y range.
• All 27 values of x_1, x_2, and x_3, spanning nine rows and three columns, are selected
for Input X range.
• Output Range is selected, and some suitable cell is selected for placement of the
output. OK.
• Excel generates what it calls a Summary Output.
• From Excel’s output, the following information is needed to generate the
coefficients of the equation for which we are finding the best fit, y = b +
a_1x_1 + a_2x_2 + a_3x_3:
The y-intercept, which Excel calls Intercept. For our equation, b =
Intercept.
The three slopes, which Excel calls X Variable 1, X Variable 2, and X
Variable 3. For our equation,

a_1 = X Variable 1 = ∂y/∂x_1, a_2 = X Variable 2 = ∂y/∂x_2, a_3 = X Variable 3 = ∂y/∂x_3,

which are the slopes of y with respect to parameters x_1, x_2, and x_3,
respectively.
• Note that we use partial derivatives () rather than total derivatives
(d) here, since y is a function of more than one variable.
• A portion of the regression analysis results are shown below (image
copied from Excel), with the most important cells highlighted:
Discussion:
• The fit is pretty good, implying that there is little scatter in the data,
and the data fit well with the simple linear equation. We know this is
a good fit by looking at the linear correlation coefficient (Multiple R),
which is greater than 0.99, and the Standard Error, which is only 0.21
for y values ranging from about 4 to about 15. We can claim a
successful curve fit.
Comments:
• In addition to random scatter in the data, there may also be cross-talk
between some of the parameters. For example, y may have terms with
products like x_1x_2, x_2x_3^2, etc., which are clearly nonlinear terms.
Nevertheless, a multiple parameter linear regression analysis is often
performed only locally, around the operating point, and the linear
assumption is reasonably accurate, at least close to the operating point.
• In addition, variables x_1, x_2, and x_3 may not be totally independent of each
other in a real experiment.
• Regression analysis with multiple variables becomes quite useful to us later
in the course when we discuss optimization techniques such as response
surface methodology.
Nonlinear and higher-order polynomial regression analysis
• Not all data are linear, and a straight line fit may not be appropriate. A
good example is thermocouple voltage versus temperature. The
relationship is nearly linear, but not quite; that is in fact the very
reason for the necessity of thermocouple tables.
• For nonlinear data, some transformation tricks can be employed,
using logarithms or other functions, as sketched below.
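As one common example of such a trick (the data here are invented for illustration, and NumPy is assumed), data expected to follow y = c·e^(kx) can be linearized by fitting ln y against x:

```python
import numpy as np

# Invented exponential-looking data, for illustration only.
x = np.array([0.0, 1.0, 2.0, 3.0, 4.0])
y = np.array([1.1, 2.9, 8.2, 22.0, 60.3])

# Fit ln(y) = ln(c) + k*x: a straight line in the transformed variable.
k, ln_c = np.polyfit(x, np.log(y), 1)
print(f"y ~= {np.exp(ln_c):.3f} * exp({k:.3f} x)")
```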
• For some data, a good curve fit can be obtained using a polynomial fit
of some appropriate order. The order of a polynomial is defined by m,
the maximum exponent in the x data:
zeroth-order (m = 0) is just a constant: y = b.
first-order (m = 1) is a constant plus a linear term: y = b + a_1x. (A
first-order polynomial fit is the same as a linear least-squares fit, as
we have already learned how to do.)
second-order (m = 2) is a constant plus a linear term plus a
quadratic term: y = b + a_1x + a_2x^2. (A second-order polynomial fit is
often called a quadratic fit.)
third-order (m = 3) adds a cubic term: y = b + a_1x + a_2x^2 + a_3x^3. (A
third-order polynomial fit is often called a cubic fit.)
mth-order (m > 0) adds terms following this pattern up to a_mx^m:
y = b + a_1x + a_2x^2 + a_3x^3 + ... + a_mx^m.
• Excel can be manipulated to perform least-squares polynomial fits of
any order m, since Excel can perform regression analysis on more
than one independent variable simultaneously. The procedure is as
follows:
• To the right of the x column, add new columns for x^2, x^3, ..., x^m.
• Perform a multiple variable regression analysis as previously, except
choose all the data cells (x, x^2, x^3, ..., x^m) as the “Input X Range” in the
Regression working window.
• Note that m is the order of the polynomial, which is also treated as the
number of independent variables to be fit. Excel treats each of the m
columns as a separate variable. The output of the regression analysis
includes the y-intercept as previously (equal to our constant b), and also a
least-squares coefficient for each of the columns, i.e., for each of the
variables x, x^2, x^3, ..., x^m:
• The coefficient for “X Variable 1” is a_1, corresponding to the x variable.
• The coefficient for “X Variable 2” is a_2, corresponding to the x^2 variable.
• The coefficient for “X Variable 3” is a_3, corresponding to the x^3 variable.
• ...
• The coefficient for “X Variable m” is a_m, corresponding to the x^m variable.
• Finally, the fitted curve is constructed from the equation, i.e., y = b +
a_1x + a_2x^2 + a_3x^3 + ... + a_mx^m.
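The column-of-powers procedure maps directly onto code. Here is a minimal self-contained Python sketch (the function name is invented, and NumPy's generic least-squares solver stands in for Excel's regression macro):

```python
import numpy as np

def polynomial_fit(x, y, m):
    """Order-m least-squares polynomial fit, done exactly as described:
    treat the columns x, x^2, ..., x^m as m separate 'independent variables'."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    cols = [x ** k for k in range(1, m + 1)]        # x, x^2, ..., x^m columns
    A = np.column_stack([np.ones(len(x))] + cols)   # column of 1s carries b
    coef, *_ = np.linalg.lstsq(A, y, rcond=None)
    return coef  # coef[0] = b; coef[k] = a_k, the coefficient of x^k
```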
Example:
• Given: x and y data pairs, as shown in the accompanying plot.
• To do: Plot the data as symbols (no line), perform a linear
least-squares fit and plot it as a dashed line (no symbols), and
perform a second-order polynomial least-squares fit and plot it as a
solid line (no symbols).
Solution:
• We plot the data as symbols, as shown on the above plot.
• We perform a standard linear regression analysis, and then generate
the best-fit line by using the equation for the best-fit straight line, Y =
ax + b . For these data, a = 1.025 and b = 1.510. The result is plotted
as the dashed black line in the figure – the agreement is not so good.
The standard error is 0.1359.
• We add a column labeled x^2 between the x and y columns, and fill it in.
• We perform a multiple variable regression analysis, using the x and x^2
columns as our range of independent variables. We generate the
best-fit quadratic (2nd-order) polynomial curve by using the equation
y = b + a_1x + a_2x^2.
• For these data, b = 1.307, a_1 = 2.382, and a_2 = –1.358. The solid red
line is plotted above for this equation – the agreement is much better.
The standard error is 0.0316.
Discussion:
• These data fit much better to a second-order polynomial than to a
linear fit. We see this both “by eye”, and also by comparing the
standard error, which decreases by a factor of more than four when
we apply the quadratic (second-order) curve fit instead of the linear
curve fit.
