Statistical Analysis and Calibration

The document discusses random errors that can occur in measurements and how they combine. It notes that when four small random errors occur, there is one combination that leads to a deviation of +4U, four combinations that give +2U, and six that give 0U. This ratio of combinations forms a probability distribution. When many measurements are made, the distribution of results approaches a Gaussian or normal distribution curve. The document then discusses populations and samples, noting that characteristics of a population can be inferred from a sample if it is selected properly according to statistical principles.


STATISTICAL ANALYSIS OF RANDOM ERRORS

Imagine a situation in which four small random errors combine to give an overall error. We will assume that each error has an equal probability of occurring and that each can cause the final result to be high or low by a fixed amount ±U.

The following table shows all the possible ways the four errors can combine to give the indicated deviations from the mean value.
Note that only one combination leads to a deviation of +4U, four combinations give a deviation of +2U, and six give a deviation of 0U. The negative errors have the same relationship. This ratio of 1:4:6:4:1 is a measure of the probability of a deviation of each magnitude. If we make a sufficiently large number of measurements, we see a frequency distribution like that shown in the following figure. Note that the y-axis in the plot is the relative frequency of occurrence of the five possible combinations.

[Figure: Theoretical distribution for ten equal-sized errors. The most frequent occurrence is zero deviation from the mean; at the other extreme, a maximum deviation of 10U occurs only about once in 500 results.]

When the same procedure is applied to a very large number of individual errors, a bell-shaped curve results. Such a plot is called a Gaussian curve or a normal error curve.
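The 1:4:6:4:1 ratio can be checked by brute-force enumeration of the sixteen sign combinations; a minimal Python sketch, where ±1 stands for an error of ±U:

```python
from itertools import product
from collections import Counter

# Each of the four random errors is +1 or -1 (in units of U);
# count how often each total deviation occurs over all 2^4 combinations.
counts = Counter(sum(signs) for signs in product((+1, -1), repeat=4))

# Deviations +4, +2, 0, -2, -4 occur 1, 4, 6, 4, 1 times.
for deviation in sorted(counts, reverse=True):
    print(f"{deviation:+d} U: {counts[deviation]} combination(s)")
```

The counts are the binomial coefficients C(4, k), which is why the distribution approaches a Gaussian as the number of contributing errors grows.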
DISTRIBUTION OF EXPERIMENTAL RESULTS

From experience with many measurements, we find that the distribution of replicate data from most quantitative analytical experiments approaches that of the Gaussian curve.

The spread in a set of replicate measurements is the difference between the highest and lowest result.
Example: Flipping coins

If you flip a coin 10 times, how many heads will you get? Try it, and record your results. Repeat the experiment. Are your results the same? Ask friends or members of your class to perform the same experiment and tabulate the results.

The table below contains the results obtained by several classes of analytical chemistry students over an 18-year period. Now, plot a histogram from these results and find the mean value and the standard deviation.

A Gaussian, or normal error curve, is a curve that shows the symmetrical distribution of data around the mean of an infinite set of data.
Results of a coin-flipping experiment by 395 students over an 18-year period:

The smooth curve in the figure is a normal error curve for an infinite number of trials with the same population mean µ and population standard deviation σ as the data set. Note that the population mean µ of 5.04 is very close to the value of 5 that you would predict based on the laws of probability.

As the number of trials increases, the histogram approaches the shape of the smooth curve, and the mean µ approaches the value of 5.
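The convergence toward a Gaussian shape is easy to see in a quick simulation; a sketch assuming a fair coin (the trial count and seed are arbitrary):

```python
import random
import statistics

random.seed(1)

# Simulate many "students", each flipping a fair coin 10 times.
trials = 100_000
heads = [sum(random.random() < 0.5 for _ in range(10)) for _ in range(trials)]

# The mean approaches 5 and the histogram of head counts approaches
# the smooth normal error curve as the number of trials grows.
print(f"mean = {statistics.mean(heads):.2f}")
print(f"standard deviation = {statistics.stdev(heads):.2f}")
```

For a fair coin the theoretical values are a mean of 5 and a standard deviation of √(10 × 0.5 × 0.5) ≈ 1.58, which the simulation reproduces closely.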
POPULATION AND SAMPLE

Typically in a scientific study, we infer information about a population or universe from observations made on a subset or sample.

A population is the collection of all measurements of interest to an experimenter, while a sample is a subset of measurements selected from the population.

In some cases, the population is finite and real, while in others, the population is hypothetical or conceptual in nature.

As an example of a real population, consider a production run of multivitamin tablets that produces hundreds of thousands of tablets. Although the population is finite, we usually would not have the time or resources to test all the tablets for quality control purposes. Hence, we select a sample of tablets for analysis according to statistical sampling principles. We then infer the characteristics of the population from those of the sample.
In many of the cases encountered in analytical chemistry, the population is conceptual.

Consider, for example, the determination of calcium in a community water supply to determine water hardness. In this example, the population is the very large, nearly infinite, number of measurements that could be made if we analysed the entire water supply.

Similarly, in determining glucose in the blood of a patient, we could hypothetically make an extremely large number of measurements if we used the entire blood supply. The subset of the population analysed in both these cases is the sample.

We infer characteristics of the population from those obtained with the sample. Hence, it is very important to define the population being characterised.

Statistical laws have been derived for populations, but they can be used for samples after suitable modification. Such modifications are needed for small samples because a few data points may not represent the entire population. Here, we first describe the Gaussian statistics of populations. Then we show how these relationships can be modified and applied to small samples of data.
Do not confuse the statistical sample with the analytical sample. For example, consider four water samples taken from the same water supply and analysed in the laboratory for calcium. These four analytical samples result in four measurements selected from the population. They are thus a single statistical sample. This is an unfortunate duplication of the term sample.
PROPERTIES OF GAUSSIAN CURVES

Gaussian curves can be described by an equation that contains just two parameters, the population mean µ and the population standard deviation σ. The normalized Gaussian curve has the form:

y = e^(−(x − µ)² / 2σ²) / (σ√(2π))

The term parameter refers to quantities such as µ and σ that define a population or distribution.

Data values such as x are variables. The term statistic refers to an estimate of a parameter that is made from a sample of data, as discussed below.

The sample mean and sample standard deviation are examples of statistics that estimate the parameters µ and σ, respectively.
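The normalized form can be evaluated directly; a sketch with arbitrary illustrative values of µ and σ, checking numerically that the area under the curve is unity:

```python
import math

def gaussian(x, mu, sigma):
    """Normalized Gaussian: y = exp(-(x - mu)**2 / (2 * sigma**2)) / (sigma * sqrt(2*pi))."""
    return math.exp(-((x - mu) ** 2) / (2 * sigma ** 2)) / (sigma * math.sqrt(2 * math.pi))

mu, sigma = 5.0, 1.6          # illustrative parameter values, not from the text
step = 0.001
xs = [mu - 8 * sigma + i * step for i in range(int(16 * sigma / step) + 1)]
area = step * sum(gaussian(x, mu, sigma) for x in xs)
print(f"peak height = {gaussian(mu, mu, sigma):.4f}, total area = {area:.4f}")
```

Whatever µ and σ are chosen, the total area stays at 1, which is exactly what makes the curve a probability distribution.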
THE POPULATION MEAN µ AND THE SAMPLE MEAN

The sample mean x̄ defined in the previous lecture differs from the population mean µ.

The sample mean x̄ is the arithmetic average of a limited sample drawn from a population of data. It is defined as the sum of the measurement values divided by the number of measurements:

x̄ = (Σ x_i) / N

where the x_i are the individual measurements and N is the number of measurements in the sample set.

The population mean µ, in contrast, is the true mean for the population. It is defined by the same equation, with the added provision that N here represents the total number of measurements in the population:

µ = (Σ x_i) / N
In the absence of systematic error, the population mean is also the true value for the measured quantity.

More often than not, when N is small, x̄ differs from µ because a small sample of data may not exactly represent its population.

In most cases we do not know µ and must infer its value from x̄. The probable difference between x̄ and µ decreases rapidly as the number of measurements making up the sample increases. Usually, by the time N reaches 20 to 30, this difference is negligible.

Note that the sample mean x̄ is a statistic that estimates the population parameter µ.
STANDARD DEVIATION

Three terms are widely used to describe the precision of a set of replicate data in a sample: standard deviation, variance, and coefficient of variation.

All three are functions of how much an individual result x_i differs from the mean, called the deviation from the mean d_i:

d_i = |x_i − x̄|

The standard deviation (STD) of a sample, s, measures how closely the data are clustered about the mean value. The smaller the STD, the more closely the data are clustered about the mean:

s = √[ Σ(x_i − x̄)² / (N − 1) ] = √[ Σ d_i² / (N − 1) ]

where N − 1 is the number of degrees of freedom.

An experiment that produces a small STD is more precise than one that produces a large STD. The mean gives the center of the distribution; the STD measures the width of the distribution.

[Figure: Gaussian curves for two sets of light bulbs, one having a standard deviation half as great as the other. The number of bulbs described by each curve is the same.]
The square of the standard deviation is called the variance.

The standard deviation expressed as a percentage of the mean value is called the relative standard deviation, or the coefficient of variation (CV):

CV = (s / x̄) × 100%

PRACTICAL EXAMPLE 1

Find the average, standard deviation, and coefficient of variation for 821, 783, 834, and 855.

Solution

The average is:

x̄ = (821 + 783 + 834 + 855) / 4 = 823.2

To avoid accumulating round-off errors, retain one more digit in the mean than was present in the original data. The standard deviation is:

s = √{ [(821 − 823.2)² + (783 − 823.2)² + (834 − 823.2)² + (855 − 823.2)²] / (4 − 1) } = 30.3

The average and the standard deviation should both end at the same decimal place: for x̄ = 823.2, we write s = 30.3. The coefficient of variation is the percent relative standard deviation:

CV = 100 × s / x̄ = 100 × 30.3 / 823.2 = 3.7%

Tip: Express the mean and standard deviation in the form x̄ ± s (N = ___). Do not round off numbers during a calculation; retain all the extra digits in your calculator.
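The arithmetic above can be verified with Python's statistics module, whose stdev uses the same N − 1 denominator:

```python
import statistics

data = [821, 783, 834, 855]

mean = statistics.mean(data)   # arithmetic average
s = statistics.stdev(data)     # sample standard deviation (N - 1 denominator)
cv = 100 * s / mean            # coefficient of variation, %

print(f"mean = {mean:.1f}, s = {s:.1f}, CV = {cv:.1f}%")
```

Keeping full precision internally and rounding only for the printout follows the tip above about not rounding during the calculation.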
PRACTICAL EXAMPLE 2

The following results were obtained in the replicate determination of the lead content of a blood sample: 0.752, 0.756, 0.752, 0.751, and 0.760 ppm Pb. Find the mean and the standard deviation of this set of data.

Solution

To find s with a calculator that does not have a standard deviation key, the following rearrangement is easy to use:

s = √{ [Σx_i² − (Σx_i)²/N] / (N − 1) }

The following table summarises the calculation of the two sums:

Sample   x_i (ppm Pb)   x_i²
1        0.752          0.565504
2        0.756          0.571536
3        0.752          0.565504
4        0.751          0.564001
5        0.760          0.577600

Σx_i = 3.771    Σx_i² = 2.844145
The mean is:

x̄ = Σx_i / N = 3.771 / 5 = 0.7542 ≈ 0.754 ppm Pb

The correction term is:

(Σx_i)² / N = (3.771)² / 5 = 2.8440882

so the standard deviation is:

s = √[ (2.844145 − 2.8440882) / (5 − 1) ] = 0.00377 ≈ 0.004 ppm Pb

Note that the difference between 2.844145 and 2.8440882 is very small. If we had rounded these numbers before subtracting them, a serious error would have appeared in the calculated value of s. To avoid this source of error, never round a standard deviation calculation until the very end. Furthermore, and for the same reason, never use this equation to calculate the STD of large numbers containing five or more digits.
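Both forms of the sample standard deviation can be compared directly; a sketch using the blood-lead data from this example:

```python
import math

data = [0.752, 0.756, 0.752, 0.751, 0.760]  # ppm Pb
n = len(data)
mean = sum(data) / n

# Definition form: deviations from the mean, N - 1 degrees of freedom.
s_def = math.sqrt(sum((x - mean) ** 2 for x in data) / (n - 1))

# Calculator shortcut: only the sums of x and x^2 are needed.
s_short = math.sqrt((sum(x * x for x in data) - sum(data) ** 2 / n) / (n - 1))

print(f"mean = {mean:.4f} ppm Pb, s = {s_def:.5f} ppm Pb")
# In full double precision both forms agree; the shortcut is the one
# that fails badly if the intermediate sums are rounded first.
```

This mirrors the warning above: the shortcut subtracts two nearly equal numbers, so any premature rounding destroys the small difference that carries all the information.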
STANDARD DEVIATION OF THE MEAN

If a series of replicate sets of results, each containing N measurements, is taken randomly from a population of results, the mean of each set will show less and less scatter as N increases.

The standard deviation of the mean is known as the standard error of the mean and is given the symbol s_m. It is the standard deviation of the data set divided by the square root of the number of data points in the set:

s_m = s / √N

This equation tells us that the mean of four measurements is more precise by √4 = 2 than the individual measurements in the data set. For this reason, averaging results is often used to improve precision. However, the improvement to be gained by averaging is somewhat limited because of the square-root dependence on N: to increase the precision by a factor of 10 requires 100 times as many measurements. Therefore, it is much better, if possible, to decrease s than to keep averaging more results, since s_m is directly proportional to s but only inversely proportional to the square root of N.

The standard deviation can sometimes be decreased by being more precise in individual operations, by changing the procedure, and by using more precise measurement tools.
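The √N improvement can be demonstrated by simulation; a sketch assuming a unit-σ Gaussian population (the seed and counts are arbitrary):

```python
import random
import statistics

random.seed(7)

# Draw many sets of N = 4 values from a population with sigma = 1.0;
# the scatter of the set means should shrink to sigma / sqrt(4) = 0.5.
n_per_set, n_sets = 4, 20_000
means = [statistics.mean(random.gauss(0.0, 1.0) for _ in range(n_per_set))
         for _ in range(n_sets)]

print(f"standard deviation of the means = {statistics.stdev(means):.3f}")
```

The scatter of the means comes out at about half the scatter of the individual values, exactly the s/√N behaviour described above.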
STANDARD DEVIATION

The population standard deviation σ, which is a measure of the precision of the entire population, is given by the equation:

σ = √[ Σ(x_i − µ)² / N ]

where N is the number of measurements in the population. The quantity x_i − µ in this equation is the deviation of data value x_i from the mean of the population.

For comparison, the standard deviation of a sample of data is:

s = √[ Σ(x_i − x̄)² / (N − 1) ]

where N − 1 is the number of degrees of freedom.

[Figure: Two curves for two populations of data that differ only in their STDs. The standard deviation for the data set yielding the broader but lower curve B is twice that for the measurements yielding curve A.]

The breadth of the curves is a measure of the precision of the two sets of data. Thus, the precision of the data set leading to curve A is twice as good as that of the data set represented by curve B.
THE POPULATION STANDARD DEVIATION

The following figure shows two normal error (Gaussian) curves in which we plot the relative frequency y of various deviations from the mean versus the deviation from the mean d_i. The standard deviation for curve B is twice that for curve A, that is, σ_B = 2σ_A.

In (a) the abscissa is the deviation from the mean (x − µ) in the units of measurement. In (b) the abscissa is the deviation from the mean in units of σ; for this plot, the two curves A and B are identical.
NORMAL ERROR CURVE

The figure on the right shows the type of normal error curve in which the x axis is now a new variable z, which is defined as:

z = (x − µ) / σ

Note that z is the relative deviation of a data point from the mean, that is, the deviation relative to the standard deviation. Hence, when x − µ = σ, z is equal to one; when x − µ = 2σ, z is equal to two; and so forth.

Thus, the quantity z represents the deviation of a result from the population mean relative to the standard deviation. It is commonly given as a variable in statistical tables since it is a dimensionless quantity.

Since z is the deviation from the mean relative to the standard deviation, a plot of relative frequency versus z yields a single Gaussian curve that describes all populations of data regardless of standard deviation. Thus, the figure on the right is the normal error curve for both sets of data used to plot curves A and B in the previous slide.
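Standardization can be shown in a few lines; a sketch with hypothetical population values in which σ_B is twice σ_A:

```python
def z_score(x, mu, sigma):
    """Deviation from the population mean in units of the standard deviation."""
    return (x - mu) / sigma

# Hypothetical populations: same mean, sigma_B = 2 * sigma_A.
a = [z_score(x, mu=5.0, sigma=1.0) for x in (4.0, 5.0, 6.0)]
b = [z_score(x, mu=5.0, sigma=2.0) for x in (3.0, 5.0, 7.0)]

print(a, b)  # identical on the z scale: [-1.0, 0.0, 1.0] for both
```

Points that sit one σ from their own mean land on the same z value regardless of the size of σ, which is why the two curves collapse onto a single one.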
NORMAL ERROR CURVE

A normal error curve has several general properties:

 The mean occurs at the central point of maximum frequency;
 There is a symmetrical distribution of positive and negative deviations about the maximum; and
 There is an exponential decrease in frequency as the magnitude of the deviations increases. Thus, small errors are observed much more often than very large ones.

The equation for the Gaussian error curve in terms of the z variable is:

y = e^(−(x − µ)² / 2σ²) / (σ√(2π)) = e^(−z²/2) / (σ√(2π))

[Figure: A Gaussian curve in which µ = 0 and σ = 1. A Gaussian curve whose area is unity is called a normal error curve; in this case, the abscissa x is equal to z.]

Because it appears in the Gaussian error curve expression, the square of the standard deviation is also very important. This quantity is actually the population variance.
CALIBRATION

A very important part of all analytical procedures is the calibration process.

Calibration determines the relationship between the analytical response and the analyte concentration. This relationship is usually determined by the use of chemical standards. The standards used can be prepared from purified reagents, if available, or standardised by classical quantitative methods.

Most commonly, the standards used are prepared externally to the analyte solutions (external standard methods).

In some cases, an attempt is made to reduce interferences from other constituents in the sample matrix, called concomitants, by using standards added to the analyte solution (internal standard methods or standard addition methods) or by matrix matching or modification.

Almost all analytical methods require some type of calibration with chemical standards. Gravimetric methods and some coulometric methods are among the few absolute methods that do not rely on calibration with chemical standards.
EXTERNAL STANDARD CALIBRATION

In external standard calibration, a series of standard solutions is prepared separately from the sample.

The standards are used to establish the instrument calibration function, which is obtained from analysis of the instrument response as a function of the known analyte concentration.

Ideally, three or more standard solutions are used in the calibration process, although in some routine determinations, two-point calibrations can be reliable.

The calibration function can be obtained graphically or in mathematical form.

Generally, a plot of instrument response versus known analyte concentrations is used to produce a calibration curve, sometimes called a working curve.

It is often desirable that the calibration curve be linear in at least the range of the analyte concentrations.
Below is a linear calibration curve of absorbance versus analyte concentration for a series of standards (data for the standards are shown as solid circles). For graphical methods, a straight line is drawn through the data points.

Residuals are distances on the y-axis between the data points and the predicted line, as shown in the inset.

The linear calibration curve is used in an inverse fashion to obtain the concentration of an unknown analyte solution with an absorbance of 0.505. The absorbance is located on the line, and the concentration corresponding to that absorbance is obtained by extrapolating to the x-axis (dashed lines).

The concentration found is then related back to the analyte concentration in the original sample by applying appropriate dilution factors from the sample preparation steps.
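Reading the curve "in an inverse fashion" amounts to solving the fitted line for the concentration; a sketch in which the slope, intercept, and dilution factor are illustrative assumptions, not values from the text:

```python
# Given an absorbance A and a fitted calibration line A = slope * c + intercept,
# solve for the concentration c, then undo the sample-preparation dilution.
def concentration_from_absorbance(absorbance, slope, intercept):
    """Invert the linear calibration function A = slope * c + intercept."""
    return (absorbance - intercept) / slope

slope, intercept = 0.090, 0.002   # hypothetical fit, absorbance per (ug/mL)
c_measured = concentration_from_absorbance(0.505, slope, intercept)

dilution_factor = 10              # hypothetical 1:10 dilution during preparation
c_original = c_measured * dilution_factor
print(f"measured: {c_measured:.3f} ug/mL, original sample: {c_original:.2f} ug/mL")
```

The final multiplication is the "appropriate dilution factor" step mentioned above, relating the concentration read from the curve back to the original sample.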
THE METHOD OF LEAST SQUARES

For most chemical analyses, the response of the procedure must be evaluated for known quantities of analyte (called standards) so that the response to an unknown quantity can be interpreted. For this purpose, we commonly prepare a calibration curve. Most often, we work in a region where the calibration curve is a straight line.

We use the method of least squares to draw the "best" straight line through experimental data points that have some scatter and do not lie perfectly on a straight line. The best line will be such that some of the points lie above and some lie below the line.

We will learn to estimate the error in a chemical analysis from the errors in the calibration curve and in the measured response to replicate samples of unknown.
The calibration curve shown in the previous figure is for the determination of Ni(II) by reaction with excess thiocyanate to form an absorbing complex ion, Ni(SCN)⁺.

The ordinate is the dependent variable, absorbance, while the abscissa is the independent variable, concentration of Ni(II).

As is typical and usually desirable, the plot approximates a straight line. Note that, because of indeterminate errors in the measurement process, not all the data fall exactly on the line. Thus, the investigator must try to draw the "best" straight line among the data points. Regression analysis provides the means for objectively obtaining such a line and also for specifying the errors associated with its subsequent use.

We consider here only the basic method of least squares for two-dimensional data.
ASSUMPTIONS OF THE LEAST-SQUARES METHOD

Two assumptions are made in using the method of least squares:

1. There is actually a linear relationship between the measured response y and the standard analyte concentration x, and the errors (standard deviations) in all the y values are similar. The mathematical relationship that describes this assumption is called the regression model, which may be represented as:

y = mx + b

where b is the y-intercept (the value of y when x is zero) and m is the slope of the line.

2. We also assume that any deviation of the individual points from the straight line arises from error in the measurement, which is plotted on the y-axis. That is, we assume there is no error in the x values of the points (concentrations), or that the errors in the y values are substantially greater than the errors in the x values.

The linear least-squares method assumes an actual linear relationship between the response y and the independent variable x. In addition, it is assumed that there is no error in the x values.
Both of these assumptions are appropriate for many analytical methods, but bear in mind that, whenever there is significant uncertainty in the x data, basic linear least-squares analysis may not give the best straight line.

Simple least-squares analysis may not be appropriate when the errors in the y values vary significantly with x. In that instance, it may be necessary to apply different weighting factors to the points and perform a weighted least-squares analysis.

Thus, when there is an error in the x values, basic least-squares analysis may not give the best straight line. Instead, a correlation analysis should be used.

Slope (m) = Δy / Δx = (y₂ − y₁) / (x₂ − x₁)
y-intercept (b) = crossing point on the y-axis
THE METHOD OF LEAST SQUARES

The least-squares procedure can be illustrated with the aid of the calibration curve for the determination of Ni(II) shown in the figure on the right. Thiocyanate was added to the Ni(II) standards, and the absorbances were measured as a function of the Ni(II) concentration.

The vertical deviation of each point from the straight line is called a residual, as shown in the inset. As mentioned above, residuals are distances on the y-axis between the data points and the predicted line.

The line generated by the least-squares method is the one that minimises the sum of the squares of the residuals for all the points.

In addition to providing the best fit between the experimental points and the straight line, the method gives the standard deviations for m and b.
FINDING THE REGRESSION LINE

Straight-line predictability and consistency will determine the accuracy of the unknown calculation. All measurements have a degree of uncertainty, and so will the plotted straight line.

Graphing (curve fitting) is often done intuitively, that is, by simply "eyeballing" the best straight line by placing a ruler through the points, which invariably have some scatter.

A better approach is to apply statistics to define the most probable straight-line fit of the data. The availability of statistical functions in spreadsheets today makes it straightforward to prepare straight-line, or even nonlinear, fits.

We should note that a straight line is a model of the relationship between observations and the amount of an analyte. One can always blindly apply least-squares fitting to any random set of numbers; that does not necessarily mean a linear model is appropriate. Perhaps one should be fitting logarithms, or fitting sigmoids. What line can you draw through these points?

We fit models to data, not data to models.
[Figure: Least-squares curve fitting, showing the regression line, a residual, and a Gaussian curve drawn over one of the points.]

A regression line is simply a single line that best fits the data (in terms of having the smallest overall distance from the line to the points). Statisticians call this technique for finding the best-fitting line a simple linear regression analysis using the least-squares method.

The Gaussian curve drawn over the point (3, 3) indicates schematically how each value of y_i is normally distributed about the straight line. That is, the most probable value of y will fall on the line, but there is a finite probability of measuring y some distance from the line.
The vertical deviation (residual) for the point (x_i, y_i) is y_i − y, where y is the ordinate of the straight line when x = x_i:

vertical deviation = d_i = y_i − y = y_i − (mx_i + b)

Some of the deviations are positive and some are negative. Because we wish to minimize the magnitude of the deviations irrespective of their signs, we square all the deviations so that we are dealing only with positive numbers:

d_i² = (y_i − y)² = (y_i − mx_i − b)²

Because we minimise the squares of the deviations, this is called the method of least squares. It can be shown that minimising the squares of the deviations (rather than simply their magnitudes) corresponds to assuming that the set of y values is the most probable set.

Finding the values of m and b that minimise the sum of the squares of the vertical deviations involves some calculus, which we omit.
The sum of the squares of the vertical deviations (residuals) is:

S = Σ [y_i − (b + mx_i)]²

where N is the number of points used.

The calculation of the slope and intercept is simplified when three quantities, Sxx, Syy, and Sxy, are defined as follows:

Sxx = Σ(x_i − x̄)² = Σx_i² − (Σx_i)²/N

Syy = Σ(y_i − ȳ)² = Σy_i² − (Σy_i)²/N
Note that Sxx and Syy are the sums of the squares of the deviations from the mean for the individual values of x and y.

The variables x_i and y_i are the individual pairs (points) of data for x and y, N is the number of pairs (points), and x̄ and ȳ are the average (mean) values of x and y:

x̄ = Σx_i / N        ȳ = Σy_i / N

Note that the equations for Sxx and Syy are the numerators in the equations for the variance in x and the variance in y. Likewise, Sxy is the numerator in the covariance of x and y:

Sxy = Σ(x_i − x̄)(y_i − ȳ) = Σx_i y_i − (Σx_i)(Σy_i)/N
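The S-quantities, slope, intercept, and standard deviation about regression can be computed in a few lines; a sketch on a small hypothetical calibration set (the x, y values are illustrative only):

```python
import math

# Least-squares fit via the S-quantities on hypothetical data.
x = [0.0, 1.0, 2.0, 3.0]
y = [1.1, 2.9, 5.0, 7.2]
n = len(x)

sxx = sum(xi * xi for xi in x) - sum(x) ** 2 / n
syy = sum(yi * yi for yi in y) - sum(y) ** 2 / n
sxy = sum(xi * yi for xi, yi in zip(x, y)) - sum(x) * sum(y) / n

m = sxy / sxx                                   # slope
b = sum(y) / n - m * sum(x) / n                 # intercept: b = y_bar - m * x_bar
sr = math.sqrt((syy - m * m * sxx) / (n - 2))   # standard deviation about regression

print(f"y = {m:.3f} x + {b:.3f}, s_r = {sr:.4f}")
```

For this data set the fit is y = 2.040x + 0.990 with s_r ≈ 0.145, and the same numbers come out of any spreadsheet regression function.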
Six useful quantities can be derived from Sxx, Syy, and Sxy.

1. The slope of the line:

m = Sxy / Sxx

2. The intercept:

b = ȳ − m x̄

3. The standard deviation about regression:

s_r = √[ (Syy − m² Sxx) / (N − 2) ]

In this equation the number of degrees of freedom is N − 2, since one degree of freedom is lost in calculating m and one in determining b.

The standard deviation about regression s_r, also called the standard error, is a rough measure of the magnitude of a typical deviation from the regression line. It is actually the standard deviation for y when the deviations are measured not from the mean of y (as is the usual case) but from the straight line that results from the least-squares prediction.
The value of s_r is closely related to the sum of the squares of the vertical deviations, S:

s_r = √{ Σ [y_i − (b + mx_i)]² / (N − 2) } = √[ S / (N − 2) ]

The standard deviation about regression is often called the standard error of the estimate. It roughly corresponds to the size of a typical deviation from the estimated regression line.

With computers, the calculations are typically done using a spreadsheet program, such as Microsoft® Excel.
4. The standard deviation of the slope:

s_m = √( s_r² / Sxx )

5. The standard deviation of the intercept:

s_b = s_r √[ Σx_i² / (N Σx_i² − (Σx_i)²) ] = s_r √[ 1 / (N − (Σx_i)² / Σx_i²) ]

6. The standard deviation for results obtained from the calibration curve:

s_c = (s_r / m) √[ 1/M + 1/N + (ȳ_c − ȳ)² / (m² Sxx) ]

The last equation gives us a way to calculate the standard deviation from the mean ȳ_c of a set of M replicate analyses of unknowns when a calibration curve that contains N points is used. Recall that ȳ is the mean value of y for the N calibration points.

This equation is only approximate and assumes that the slope and intercept are independent parameters, which is not strictly true.
PRACTICAL EXAMPLE 1

Carry out a least-squares analysis of the calibration data for the determination of isooctane in a hydrocarbon mixture provided in the first two columns of the table. Columns 3, 4, and 5 of the table contain the computed values of x_i², y_i², and x_i y_i, with their sums appearing as the last entry in each column.

Solution

Note that the number of digits carried in the computed values should be the maximum allowed by the calculator or computer; that is, rounding should not be performed until the calculation is complete.
36
The slope of the line: m = Sxy / Sxx The intercept: b = – m

FINDING THE
REGRESSION LINE

Thus, the equation for the least-squares line is:


PRACTICAL EXAMPLE 1
y = 2.09x + 0.26
Carry out a least-squares analysis of the
calibration data for the determination
The standard deviation about regression: of isooctane in a hydrocarbon mixture
provided in the first two columns of the
table.

Columns 3, 4, and 5 of the table contain


computed values for , , and , with their
sums appearing as the last entry in each
column.

Continue → 37
The standard deviation of the slope:

s_m = √( s_r² / Sxx )

The standard deviation of the intercept:

s_b = s_r √[ Σx_i² / (N Σx_i² − (Σx_i)²) ]

The numerical values follow from the sums computed in the table.
FINDING THE REGRESSION LINE

PRACTICAL EXAMPLE 2

The calibration curve found in the previous example was used for the chromatographic determination of isooctane in a hydrocarbon mixture. A peak area of 2.65 was obtained. Calculate the mole percent of isooctane in the mixture and the standard deviation if the area was (a) the result of a single measurement and (b) the mean of four measurements.

Solution

In either case, the unknown concentration is found by rearranging the least-squares equation for the line, which gives:

x = (y − b)/m = (2.65 − 0.26)/2.09 = 1.14 mole %

The standard deviation:

sc = (sr/m) √[ 1/M + 1/N + (ȳc − ȳ)²/(m²Sxx) ]

where M is the number of replicate measurements of the unknown, N is the number of calibration points, ȳc is the measured response for the unknown, and ȳ = 12.51/5 = 2.502 is the mean of the calibration y values.

(a) Substituting sr = 0.1442 into this equation with M = 1 gives sc = 0.076 mole %.

(b) For the mean of four measurements (M = 4), sc = 0.046 mole %.

39
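A minimal Python sketch of this unknown-concentration calculation (added for illustration, using the fitted values quoted in the example):

```python
# Unknown concentration and its standard deviation from the
# isooctane calibration line y = 2.0925x + 0.2567.
import math

m, b, sr = 2.0925, 0.2567, 0.1442     # fit results from Example 1
Sxx, N, ybar = 1.145, 5, 12.51 / 5    # calibration statistics

yc = 2.65                              # measured peak area of the unknown
x_unknown = (yc - b) / m               # rearranged calibration equation

def s_c(M):
    """Standard deviation of the unknown for M replicate measurements."""
    return (sr / m) * math.sqrt(1/M + 1/N + (yc - ybar)**2 / (m**2 * Sxx))

print(f"x = {x_unknown:.2f} mole %")
print(f"sc (single measurement) = {s_c(1):.3f}, sc (mean of 4) = {s_c(4):.3f}")
```

Replicating the unknown measurement (M = 4) shrinks the 1/M term and hence the uncertainty, which is why sc drops from 0.076 to 0.046.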
The closer the data points are to the line predicted by a least-squares analysis, the smaller the residuals.

The sum of the squares of the residuals, S, measures the variation


in the observed values of the dependent variable (y values) that
are not explained by the presumed linear relationship between x
and y:

S = Σᵢ₌₁ᴺ [yᵢ − (b + mxᵢ)]²

FINDING THE REGRESSION LINE

We can also define a total sum of squares, Stot, which is the same as Syy and which is a measure of the total variation in the observed values of y, since the deviations are measured from the mean value of y.

INTERPRETATION OF LEAST-SQUARES RESULTS
A coefficient of determination (R²) measures the fraction of the observed variation in y that is explained by the linear relationship and is given by:

R² = 1 − (S/Stot)

The closer R² is to unity, the better the linear model explains the y variations.
40
The difference between Stot and S is the sum of the squares due to
regression, Sreg.

In contrast to S, Sreg is a measure of the explained variation:


FINDING THE
REGRESSION LINE
Sreg = Stot − S

R² = 1 − (S/Stot) = Sreg / Stot


INTERPRETATION OF
By dividing the sum of squares by the appropriate number of LEAST-SQUARES RESULTS
degrees of freedom, we can obtain the mean square values for
regression and for the residuals (error) and then the F value.
A significant regression is one in which the
variation in the y values due to the
The F value gives us an indication of the significance of the presumed linear relationship is large
regression. The F value is used to test the null hypothesis that the compared to that due to error (residuals).
total variance in y is equal to the variance due to error.
When the regression is significant, a large
A value of F smaller than the value from the tables at the chosen value of F occurs.
confidence level indicates that the null hypothesis should be
accepted and that the regression is not significant. A large value of
F indicates that the null hypothesis should be rejected and that
the regression is significant. 41
Solution

For each value of xᵢ, we can find a predicted value of yᵢ from the linear relationship. Let us call the predicted values ŷᵢ. We can write:

ŷᵢ = b + mxᵢ

Below is the previous table of the observed values yᵢ, the predicted values ŷᵢ, the residuals yᵢ − ŷᵢ, and the squares of the residuals (yᵢ − ŷᵢ)². By summing the latter values, we obtain S as seen in the table:

FINDING THE REGRESSION LINE

PRACTICAL EXAMPLE 3

Find the coefficient of determination for the chromatographic data of the previous example.

xᵢ      yᵢ      ŷᵢ         yᵢ − ŷᵢ     (yᵢ − ŷᵢ)²
0.352   1.09    0.99326     0.09674     0.00936
0.803   1.78    1.93698    −0.15698     0.02464
1.08    2.60    2.51660     0.08340     0.00696
1.38    3.03    3.14435    −0.11435     0.01308
1.75    4.01    3.91857     0.09143     0.00836
Sum     5.365   12.51                   0.06240

Continue → 42
First, we calculate the value of Syy and the coefficient of determination:

Syy = Σyᵢ² − (Σyᵢ)²/N = 36.3775 − (12.51)²/5 = 5.07748

R² = 1 − Σ(yᵢ − ŷᵢ)²/Syy = 1 − 0.0624/5.07748 = 0.9877

FINDING THE REGRESSION LINE

This calculation shows that over 98% of the variation in peak area can be explained by the linear model.

We can also calculate Sreg = Stot − S = 5.07748 − 0.06240 = 5.01508

Let us now calculate the F value. There were five x-y pairs used for the analysis. The total sum of the squares has 4 degrees of freedom associated with it, since one is lost in calculating the mean of the y values. The sum of the squares due to the residuals has 3 degrees of freedom because two parameters, m and b, are estimated. Hence the value Sreg has only one degree of freedom, since it is the difference between Stot and S.

In our case, we can find F = (Sreg/1) / (S/3) = (5.01508/1) / (0.0624/3) = 241.11

This very large value of F has a very small chance of occurring by random chance, and we therefore conclude that this is a significant regression.

43
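The whole R² and F calculation can be verified with a short Python sketch (an added illustration; it refits the five calibration pairs from the table above):

```python
# Coefficient of determination and F value for the isooctane fit.
x = [0.352, 0.803, 1.08, 1.38, 1.75]
y = [1.09, 1.78, 2.60, 3.03, 4.01]
N = len(x)

Sxx = sum(xi**2 for xi in x) - sum(x)**2 / N
Syy = sum(yi**2 for yi in y) - sum(y)**2 / N
Sxy = sum(xi * yi for xi, yi in zip(x, y)) - sum(x) * sum(y) / N
m = Sxy / Sxx
b = sum(y)/N - m * sum(x)/N

# Residual sum of squares: S = sum of [yi - (b + m*xi)]^2
S = sum((yi - (b + m * xi))**2 for xi, yi in zip(x, y))

R2 = 1 - S / Syy                 # coefficient of determination (Stot = Syy)
Sreg = Syy - S                   # explained (regression) sum of squares
F = (Sreg / 1) / (S / (N - 2))   # 1 and N - 2 = 3 degrees of freedom

print(f"R^2 = {R2:.4f}, F = {F:.1f}")
```

The computed F (about 241) far exceeds the tabulated critical value for 1 and 3 degrees of freedom, so the regression is significant.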
The following table shows real data from a protein analysis
that produces a colored product.
CONSTRUCTING A CALIBRATION CURVE

Amount of     Absorbance of               Range    Corrected absorbance
protein (μg)  independent samples
0             0.099  0.099  0.100         0.001    −0.0003  −0.0003   0.0007
5.0           0.185  0.187  0.188         0.003     0.0857   0.0877   0.0887
10.0          0.282  0.272  0.272         0.010     0.1827   0.1727   0.1727
15.0          0.345  0.347  0.392*        0.047     0.2457   0.2477   —
20.0          0.425  0.425  0.430         0.005     0.3257   0.3257   0.3307
25.0          0.483  0.488  0.496         0.013     0.3837   0.3887   0.3967

A calibration curve shows the response of an analytical method to known quantities of analyte. Solutions containing known concentrations of analyte are called standard solutions. Solutions containing all reagents and solvents used in the analysis, but no deliberately added analyte, are called blank solutions. Blanks measure the response of the analytical procedure to impurities or interfering species in the reagents.

A spectrophotometer measures the absorbance of light, which is proportional to the quantity of protein analysed.

When we scan across the three absorbance values in each row of the table, the number 0.392 seems out of line. This number is called an outlier.

Continue → 44
The outlier at 0.392 is inconsistent with the other values for 15.0 μg, and the range of values for the 15.0-μg samples is much bigger than the range for the other samples.

CONSTRUCTING A CALIBRATION CURVE

The linear relation between the average values of absorbance up to the 20.0-μg sample also indicates that the value 0.392 is in error.

We choose to omit 0.392 from subsequent calculations.

It is reasonable to ask whether all three absorbances for the 25.0-μg samples are low for some unknown reason, because this point falls below the straight line in this figure.

Repetition of this analysis shows that the 25.0-μg point is consistently below the straight line and there is nothing “wrong” with the data in the table.
45
We adopt the following procedure for constructing a calibration
curve:
CONSTRUCTING
A CALIBRATION
Step 1: Prepare known samples of analyte covering a range of
concentrations expected for unknowns. Measure the response of CURVE
the analytical procedure to these standards to generate data like
the first four columns in the table.
Amount of     Absorbance of
protein (μg)  independent samples
0             0.099  0.099  0.100
5.0           0.185  0.187  0.188
10.0          0.282  0.272  0.272
15.0          0.345  0.347  0.392*
20.0          0.425  0.425  0.430
25.0          0.483  0.488  0.496

Step 2: Subtract the average absorbance (0.0993) of the blank samples from each measured absorbance to obtain corrected absorbance. The blank measures the response of the procedure when no protein is present.

Absorbance of the blank can arise from the color of starting reagents, reactions of impurities, and reactions of interfering species.

Blank values can vary from one set of reagents to another, but
corrected absorbance should not.
Continue → 46
Step 3: Make a graph of corrected absorbance versus quantity of
protein analysed (see on the right).
CONSTRUCTING
A CALIBRATION
Use the least-squares procedure to find the best straight line through the linear portion of the data, up to and including 20.0 μg of protein (14 points, including the 3 corrected blanks, in the shaded portion of the table).

Calibration curve for protein analysis

Find the slope and intercept and their standard errors.

The equation of the solid straight line fitting the 14 data points (open circles) from 0 to 20 μg, derived by the method of least squares, is:

y = 0.01630 (±0.00022) x + 0.0047 (±0.0026)
      m    (± sm)           b   (± sb)

The standard deviation about regression:

sr = √[ (Syy − m²Sxx) / (N − 2) ] = 0.0059
Continue → 47
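Step 3 can be sketched in Python (an added illustration; the 14 x-y pairs are the corrected absorbances from the table, with the 15.0-μg outlier removed):

```python
# Linear least-squares fit of the protein calibration data:
# x = micrograms of protein, y = corrected absorbance.
import math

x = [0, 0, 0, 5, 5, 5, 10, 10, 10, 15, 15, 20, 20, 20]
y = [-0.0003, -0.0003, 0.0007, 0.0857, 0.0877, 0.0887,
     0.1827, 0.1727, 0.1727, 0.2457, 0.2477, 0.3257, 0.3257, 0.3307]
N = len(x)

Sxx = sum(xi**2 for xi in x) - sum(x)**2 / N
Syy = sum(yi**2 for yi in y) - sum(y)**2 / N
Sxy = sum(xi * yi for xi, yi in zip(x, y)) - sum(x) * sum(y) / N

m = Sxy / Sxx                    # slope
b = sum(y)/N - m * sum(x)/N      # intercept
sr = math.sqrt((Syy - m**2 * Sxx) / (N - 2))   # std. dev. about regression

print(f"absorbance = {m:.5f} x (ug of protein) + {b:.4f}  (sr = {sr:.4f})")
```

This reproduces the quoted calibration line, absorbance = 0.01630 × (μg of protein) + 0.0047.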
CONSTRUCTING A CALIBRATION CURVE

The equation of the quadratic calibration curve (blue dashed line) that fits all 17 data points ranging from 0 to 25 μg, determined by a nonlinear least-squares procedure, is:

y = −1.17 (±0.21) × 10⁻⁴ x² + 0.01858 (±0.00046) x − 0.0007 (±0.0010)

with standard deviation about regression sr = 0.0046.
Calibration curve for protein analysis
The equation of the linear calibration line is:

y (±sy) = [m (±sm)] x + [b (±sb)]

absorbance = m × (μg of protein) + b = (0.01630)(μg of protein) + 0.0047

The value y is actually the corrected absorbance, which is the observed absorbance minus blank absorbance.

Step 4: If you analyse an unknown at a future time, run a blank at that


time. Subtract the new blank absorbance from the unknown
absorbance to obtain corrected absorbance. 48
CONSTRUCTING
A CALIBRATION
Practical Example
CURVE
An unknown protein sample gave an absorbance of 0.406 and a blank had an absorbance of 0.104. How many micrograms of protein are in the unknown?

Calibration curve for protein analysis

Solution:

The corrected absorbance is 0.406 − 0.104 = 0.302, which lies on the linear portion of the calibration curve in the figure.

0.302 = m × (μg of protein) + b = (0.01630)(μg of protein) + 0.0047

μg of protein = (0.302 − 0.0047) / 0.01630 = 18.24 μg
49
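The same two-step calculation, sketched in Python (an added illustration of Step 4: subtract the fresh blank, then invert the linear calibration):

```python
# Invert the protein calibration line y = 0.01630 x + 0.0047
# to find the micrograms of protein in an unknown.
m, b = 0.01630, 0.0047            # calibration slope and intercept

A_unknown, A_blank = 0.406, 0.104
corrected = A_unknown - A_blank   # corrected absorbance, 0.302
micrograms = (corrected - b) / m  # rearranged calibration equation

print(f"protein = {micrograms:.2f} ug")
```

Running the sketch gives 18.24 μg, matching the worked example.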
We prefer calibration procedures with a linear response, in which the corrected analytical signal (= signal from sample − signal from blank) is proportional to the quantity of analyte.

CALIBRATION CURVE RANGES

Calibration curve illustrating linear and dynamic ranges

Although we try to work in the linear range, you can obtain valid results beyond the linear region (> 20 μg) in this experiment.

The dashed curve that goes up to 25 μg of protein comes from a least-squares fit of the data to the quadratic equation:

y = ax² + bx + c

y = −1.17 (±0.21) × 10⁻⁴ x² + 0.01858 (±0.00046) x − 0.0007 (±0.0010)

The linear range of an analytical method is the analyte


concentration range over which response is linearly proportional
to concentration.

A related quantity in the figure on the right is dynamic range – the


concentration range over which there is a measurable response to
analyte, even if the response is not linear.
50
Practical Example

Consider an unknown whose corrected absorbance of 0.375 lies beyond the linear region.

CALIBRATION CURVE RANGES

Calibration curve illustrating linear and dynamic ranges

Solution:

We can fit all the data points with the quadratic equation:

y = −1.17 × 10⁻⁴ x² + 0.01858 x − 0.0007

To find the quantity of protein, substitute the corrected absorbance into the above quadratic equation:

0.375 = −1.17 × 10⁻⁴ x² + 0.01858 x − 0.0007

This equation can be rearranged to:

1.17 × 10⁻⁴ x² − 0.01858 x + 0.3757 = 0

This is a quadratic equation of the form:

ax² + bx + c = 0 Continue → 51
CALIBRATION CURVE RANGES

Calibration curve for protein analysis

There are two possible solutions to the quadratic equation:

x₁ = [−b + √(b² − 4ac)] / 2a        x₂ = [−b − √(b² − 4ac)] / 2a

Substituting a = 1.17 × 10⁻⁴, b = −0.01858, and c = 0.3757 into these equations gives:

x₁ = 135 μg        x₂ = 23.8 μg

The calibration curve tells us that the correct choice is 23.8 μg, not 135 μg.

52
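The root selection can be sketched in Python (an added illustration; note that b = −0.01858 in the rearranged equation):

```python
# Solve 1.17e-4 x^2 - 0.01858 x + 0.3757 = 0 via the quadratic formula
# and keep the root that lies within the calibrated range.
import math

a, b, c = 1.17e-4, -0.01858, 0.3757

disc = b**2 - 4 * a * c
x1 = (-b + math.sqrt(disc)) / (2 * a)
x2 = (-b - math.sqrt(disc)) / (2 * a)

# Only x2 falls inside the 0-25 ug range covered by the standards.
print(f"x1 = {x1:.0f} ug, x2 = {x2:.1f} ug")
```

Only one root (23.8 μg) lies within the range of the standards, which is why the calibration curve resolves the ambiguity.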
Always make a graph of your data. The graph gives you an
opportunity to reject bad data or the stimulus to repeat a
measurement or decide that a straight line is not appropriate.

It is not reliable to extrapolate any calibration curve, linear or


nonlinear, beyond the measured range of standards. Measure
standards in the entire concentration range of interest.

At least six calibration concentrations and two replicate GOOD PRACTICE


measurements of unknown are recommended.

The most rigorous procedure is to make each calibration solution


independently from a certified material. Avoid serial dilution of a
single stock solution. Serial dilution propagates systematic error
in the stock solution.

Measure calibration solutions in random order, not in consecutive


order by increasing concentration.
53
Error bars on a graph help us judge the quality of the data and the
fit of a curve to the data.
ADDING ERROR
BARS

Let’s plot the mean absorbance of


columns 2 to 4 versus sample mass of
protein in column 1. Then we will add
error bars corresponding to the 95%
confidence interval for each point.

The Excel figure on the right lists mass


in column A and mean absorbance in
column B.

The standard deviation of absorbance is


given in column C. The 95% confidence
interval for absorbance is computed in
column D with the formula in the
margin.

54
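The column-D calculation can be sketched in Python instead of Excel (an added illustration; it assumes n = 3 replicates per point, so Student's t for 95% confidence at 2 degrees of freedom is 4.303):

```python
# 95% confidence-interval error bars for each protein standard:
# half-width = t * s / sqrt(n), with t(df = 2, 95%) = 4.303.
import math

T95_DF2 = 4.303  # Student's t, 95% confidence, 2 degrees of freedom
absorbances = {
    0.0: [0.099, 0.099, 0.100],
    5.0: [0.185, 0.187, 0.188],
    10.0: [0.282, 0.272, 0.272],
}

def ci95(values):
    """Return (mean, 95% CI half-width) for a list of replicates."""
    n = len(values)
    mean = sum(values) / n
    s = math.sqrt(sum((v - mean)**2 for v in values) / (n - 1))
    return mean, T95_DF2 * s / math.sqrt(n)

for mass, vals in absorbances.items():
    mean, half = ci95(vals)
    print(f"{mass:5.1f} ug: {mean:.4f} +/- {half:.4f}")
```

The half-width from `ci95` is exactly the quantity plotted as the error bar for each point.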
