Statistical Analysis and Calibration
STATISTICAL ANALYSIS OF RANDOM ERRORS

Suppose that four small random errors combine to give an overall error. We will assume that each error has an equal probability of occurring and that each can cause the final result to be high or low by a fixed amount ±U.

The following table shows all the possible ways the four errors can combine to give the indicated deviations from the mean value:
Note that only one combination leads to a deviation of +4 U, four combinations give a deviation of +2 U, and six give a deviation of 0 U. The negative deviations have the same relationship. This ratio of 1:4:6:4:1 is a measure of the probability of a deviation of each magnitude. If we make a sufficiently large number of measurements, we can expect a frequency distribution like that shown in the following figure.
If you flip a coin 10 times, how many heads will you get? Try it, and record
your results. Repeat the experiment. Are your results the same? Ask friends
or members of your class to perform the same experiment and tabulate the
results.
The table below contains the results obtained by several classes of analytical
chemistry students over an 18-year period:
DISTRIBUTION OF EXPERIMENTAL RESULTS

Figure: results of a coin-flipping experiment by 395 students over an 18-year period.
The smooth curve in the figure is a normal error curve for an infinite number of trials with the same population mean µ and population standard deviation σ as the data set. Note that the population mean µ of 5.04 is very close to the value of 5 that you would predict based on the laws of probability.

A Gaussian, or normal error, curve is a curve that shows the symmetrical distribution of data around the mean of an infinite set of data.

As the number of trials increases, the histogram approaches the shape of the smooth curve, and the mean µ approaches the value of 5.
Typically in a scientific study, we infer information about a
population or universe from observations made on a subset or
sample.
POPULATION AND SAMPLE

In many of the cases encountered in analytical chemistry, the population is conceptual.

Consider, for example, the determination of calcium in a community water supply in order to assess water hardness. In this example, the population is the very large, nearly infinite, number of measurements that could be made if we analysed the entire water supply. Similarly, in determining glucose in the blood of a patient, we could hypothetically make an extremely large number of measurements if we used the entire blood supply.

The subset of the population analysed in both these cases is the sample. We infer characteristics of the population from those obtained with the sample. Hence, it is very important to define the population being characterised.

Statistical laws have been derived for populations, but they can be used for samples after suitable modification. Such modifications are needed for small samples because a few data points may not represent the entire population. Here, we first describe the Gaussian statistics of populations; then we show how these relationships can be modified and applied to small samples of data.
Do not confuse the statistical sample with the analytical sample. For example, consider four water samples taken from the same water supply and analysed in the laboratory for calcium. These four analytical samples result in four measurements selected from the population. They are thus a single statistical sample. This is an unfortunate duplication of the term sample.
PROPERTIES OF GAUSSIAN CURVES

Gaussian curves can be described by an equation that contains just two parameters, the population mean µ and the population standard deviation σ. The normalized Gaussian curve has the form:

$$y = \frac{e^{-(x-\mu)^2/2\sigma^2}}{\sigma\sqrt{2\pi}}$$
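As a quick numerical sketch (the function name and the example values of µ and σ are ours, chosen only for illustration), the equation above can be evaluated directly:

```python
import math

def gaussian_y(x: float, mu: float, sigma: float) -> float:
    """Relative frequency y of the normalized Gaussian (normal error) curve."""
    return math.exp(-((x - mu) ** 2) / (2 * sigma ** 2)) / (sigma * math.sqrt(2 * math.pi))

# The curve is highest at x = mu and symmetrical about it.
mu, sigma = 5.0, 1.6
for x in (mu - 2 * sigma, mu - sigma, mu, mu + sigma, mu + 2 * sigma):
    print(f"x = {x:5.2f}   y = {gaussian_y(x, mu, sigma):.4f}")
```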
THE POPULATION MEAN µ AND THE SAMPLE MEAN

The sample mean x̄ defined in the previous lecture differs from the population mean µ. The sample mean is

$$\bar{x} = \frac{\sum_{i=1}^{N} x_i}{N}$$

where the xᵢ are the individual measurements and N is the number of measurements in the sample set.

The population mean µ, in contrast, is the true mean for the population. It is defined by the same equation, with the added provision that N here represents the total number of measurements in the population:

$$\mu = \frac{\sum_{i=1}^{N} x_i}{N}$$
In the absence of systematic error, the population mean µ is also the true value for the measured quantity.

In most cases we do not know µ and must infer its value from x̄. The probable difference between x̄ and µ decreases rapidly as the number of measurements making up the sample increases. Usually, by the time N reaches 20 to 30, this difference is negligible.
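A small simulation makes this concrete. The sketch below (our own illustration; the population parameters are arbitrary) draws repeated samples of increasing size N from a normal population and reports the average distance between the sample mean x̄ and the population mean µ:

```python
import numpy as np

rng = np.random.default_rng(seed=7)
mu, sigma = 10.00, 0.50              # assumed population parameters (illustrative only)

for n in (2, 5, 10, 20, 30, 100):
    # 2000 repeated samples of size n; average |x̄ - µ| shrinks roughly as sigma/sqrt(n)
    sample_means = rng.normal(mu, sigma, size=(2000, n)).mean(axis=1)
    print(f"N = {n:3d}: average |x-bar - mu| = {np.abs(sample_means - mu).mean():.3f}")
```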
STANDARD DEVIATION

Three terms are widely used to describe the precision of a set of replicate data in a sample: the standard deviation, the variance, and the coefficient of variation. All three are functions of how much an individual result xᵢ differs from the mean, called the deviation from the mean:

$$d_i = |x_i - \bar{x}|$$

The sample standard deviation is

$$s = \sqrt{\frac{\sum_{i=1}^{N}(x_i - \bar{x})^2}{N-1}} = \sqrt{\frac{\sum_{i=1}^{N} d_i^2}{N-1}}$$

where N − 1 is the number of degrees of freedom.

An experiment that produces a small standard deviation is more precise than one that produces a large standard deviation. The mean gives the center of the distribution; the standard deviation measures the width of the distribution.

Figure: Gaussian curves for two sets of light bulbs, one having a standard deviation half as great as the other. The number of bulbs described by each curve is the same.
The square of the standard deviation is called the variance.

To find s with a calculator that does not have a standard deviation key, the following rearrangement is easy to use:

$$s = \sqrt{\frac{\sum_{i=1}^{N} x_i^2 - \dfrac{\left(\sum_{i=1}^{N} x_i\right)^2}{N}}{N-1}}$$
PRACTICAL EXAMPLE 2

The following results were obtained in the replicate determination of the lead content of a blood sample: 0.752, 0.756, 0.752, 0.751, and 0.760 ppm Pb. Find the mean and the standard deviation of this set of data.

Solution:

$$\bar{x} = \frac{\sum x_i}{N} = \frac{3.771}{5} = 0.7542 \approx 0.754 \text{ ppm Pb}$$

$$\frac{\left(\sum x_i\right)^2}{N} = \frac{(3.771)^2}{5} = 2.8440882 \qquad \sum x_i^2 = 2.844145$$

$$s = \sqrt{\frac{2.844145 - 2.8440882}{5-1}} = 0.00377 \approx 0.004 \text{ ppm Pb}$$

Note that the difference between 2.844145 and 2.8440882 is very small. If we had rounded these numbers before subtracting them, a serious error would have appeared in the calculated value of s. To avoid this source of error, never round a standard deviation calculation until the very end. Furthermore, and for the same reason, never use this rearranged equation to calculate the standard deviation of numbers containing five or more digits.
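The arithmetic in Practical Example 2 can be checked in a few lines of Python; the snippet below (our own verification, not part of the original solution) evaluates both the rearranged "calculator" formula and the defining formula for s:

```python
import math

x = [0.752, 0.756, 0.752, 0.751, 0.760]   # ppm Pb
n = len(x)

mean = sum(x) / n                                             # 3.771 / 5 = 0.7542
sum_sq = sum(xi ** 2 for xi in x)                             # 2.844145
s_shortcut = math.sqrt((sum_sq - sum(x) ** 2 / n) / (n - 1))  # rearranged formula
s_direct = math.sqrt(sum((xi - mean) ** 2 for xi in x) / (n - 1))

print(f"mean = {mean:.4f} ppm Pb")           # ≈ 0.754 ppm Pb
print(f"s (rearranged) = {s_shortcut:.5f}")  # ≈ 0.00377, rounds to 0.004 ppm Pb
print(f"s (direct)     = {s_direct:.5f}")    # same value, less prone to round-off
```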
If a series of replicate sets of results, each containing N measurements, is taken randomly from a population of results, the mean of each set will show less and less scatter as N increases.

Figure: two curves for two populations of data that differ only in their standard deviations. The standard deviation for the data set yielding the broader but lower curve B is twice that for the measurements yielding curve A.

The breadth of the curves is a measure of the precision of the two sets of data. Thus, the precision of the data set leading to curve A is twice as good as that of the data set represented by curve B.
THE POPULATION STANDARD DEVIATION

The following figure shows two normal error (Gaussian) curves in which we plot the relative frequency y of various deviations from the mean versus the deviation from the mean, dᵢ. In (a) the abscissa is the deviation from the mean (x − µ) in the units of measurement; in (b) the abscissa is the deviation from the mean in units of σ. For the plot in (b), the two curves A and B are identical.
NORMAL ERROR CURVE

The figure on the right shows the type of normal error curve in which the x axis is a new variable z, defined as

$$z = \frac{x - \mu}{\sigma}$$

Note that z is the relative deviation of a data point from the mean, that is, the deviation relative to the standard deviation. Hence, when x − µ = σ, z is equal to one; when x − µ = 2σ, z is equal to two; and so forth.

The equation for the Gaussian error curve in terms of the z variable is

$$y = \frac{e^{-(x-\mu)^2/2\sigma^2}}{\sigma\sqrt{2\pi}} = \frac{e^{-z^2/2}}{\sigma\sqrt{2\pi}}$$

Figure: a Gaussian curve in which µ = 0 and σ = 1. A Gaussian curve whose area is unity is called a normal error curve; in this case, the abscissa x is equal to z.

Because it appears in the Gaussian error curve expression, the square of the standard deviation is also very important. This quantity is the population variance, σ².
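The following brief sketch (using the mean and standard deviation from the lead example purely as illustrative values of µ and σ) confirms that computing y directly from x and computing it through z give the same result, as the identity above requires:

```python
import math

mu, sigma = 0.754, 0.004     # illustrative values only (borrowed from the lead example)

def y_from_x(x):
    return math.exp(-((x - mu) ** 2) / (2 * sigma ** 2)) / (sigma * math.sqrt(2 * math.pi))

def y_from_z(z):
    return math.exp(-(z ** 2) / 2) / (sigma * math.sqrt(2 * math.pi))

for x in (mu, mu + sigma, mu + 2 * sigma):
    z = (x - mu) / sigma                    # relative deviation from the mean
    print(f"z = {z:.1f}: y(x) = {y_from_x(x):.2f}, y(z) = {y_from_z(z):.2f}")
```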
A very important part of all analytical procedures is the
calibration process.
EXTERNAL STANDARD CALIBRATION

Below is a linear calibration curve of absorbance versus analyte concentration for a series of standards. For graphical methods, a straight line is drawn through the data points (shown as circles).

Figure: data for the standards are shown as solid circles. Residuals are the distances on the y-axis between the data points and the predicted line, as shown in the inset.

The best line will be such that some of the points lie above and some lie below the line.
THE METHOD OF LEAST SQUARES

The calibration curve shown in the previous figure is for the determination of Ni(II) by reaction with excess thiocyanate to form an absorbing complex ion, Ni(SCN)⁺. The ordinate is the dependent variable, absorbance, while the abscissa is the independent variable, the concentration of Ni(II).

The investigator must try to draw the "best" straight line among the data points. Regression analysis provides the means for objectively obtaining such a line and also for specifying the errors associated with its subsequent use. We consider here only the basic method of least squares for two-dimensional data.
ASSUMPTIONS OF THE LEAST-SQUARES METHOD

Two assumptions are made in using the method of least squares:

1. There is actually a linear relationship between the measured response y and the standard analyte concentration x. The mathematical relationship that describes this assumption is called the regression model, which may be represented as y = mx + b, where m is the slope and b is the y-intercept.

2. Any deviation of the individual points from the straight line arises from error in the measurement of y; that is, we assume there is no error in the x values, and the errors (standard deviations) in all the y values are similar.
Both of these assumptions are appropriate for many analytical methods, but bear in mind that, whenever there is significant uncertainty in the x data, basic linear least-squares analysis may not give the best straight line. In that case, a correlation analysis should be used instead.

Slope: m = Δy/Δx = (y₂ − y₁)/(x₂ − x₁). y-Intercept: b, the point where the line crosses the y-axis.
The least-squares procedure can be illustrated with the aid of the calibration curve for the determination of Ni(II) shown in the figure on the right. Thiocyanate was added to the Ni(II) standards, and the absorbances were measured as a function of the Ni(II) concentration.
FINDING THE REGRESSION LINE

The vertical deviation (residual) for the point (xᵢ, yᵢ) is yᵢ − y, where y is the ordinate of the straight line when x = xᵢ:

vertical deviation = dᵢ = yᵢ − y = yᵢ − (mxᵢ + b)

Some of the deviations are positive and some are negative. Because we wish to minimize the magnitude of the deviations irrespective of their signs, we square all the deviations so that we are dealing only with positive numbers:

$$S_{xx} = \sum_{i=1}^{N}(x_i - \bar{x})^2 = \sum_{i=1}^{N} x_i^2 - \frac{\left(\sum_{i=1}^{N} x_i\right)^2}{N}$$

$$S_{yy} = \sum_{i=1}^{N}(y_i - \bar{y})^2 = \sum_{i=1}^{N} y_i^2 - \frac{\left(\sum_{i=1}^{N} y_i\right)^2}{N}$$
Note that Sxx and Syy are the sums of the squares of the deviations from the mean for the individual values of x and y, where

$$\bar{x} = \frac{\sum_{i=1}^{N} x_i}{N} \qquad\qquad \bar{y} = \frac{\sum_{i=1}^{N} y_i}{N}$$

Note also that the equations for Sxx and Syy are the numerators in the equations for the variance in x and the variance in y. Similarly,

$$S_{xy} = \sum_{i=1}^{N}(x_i - \bar{x})(y_i - \bar{y}) = \sum_{i=1}^{N} x_i y_i - \frac{\sum_{i=1}^{N} x_i \sum_{i=1}^{N} y_i}{N}$$
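These sums of squares are straightforward to compute for any calibration data set. A minimal sketch (the x and y arrays are placeholder values, not the lecture's isooctane or Ni(II) data):

```python
import numpy as np

# Placeholder calibration data: standard concentration (x) and instrument response (y)
x = np.array([0.0, 2.0, 4.0, 6.0, 8.0])
y = np.array([0.03, 0.16, 0.31, 0.44, 0.59])
N = len(x)

Sxx = np.sum((x - x.mean()) ** 2)                 # = sum(x**2) - (sum(x))**2 / N
Syy = np.sum((y - y.mean()) ** 2)                 # = sum(y**2) - (sum(y))**2 / N
Sxy = np.sum((x - x.mean()) * (y - y.mean()))     # = sum(x*y) - sum(x)*sum(y) / N

print(f"Sxx = {Sxx:.4f}, Syy = {Syy:.5f}, Sxy = {Sxy:.4f}")
```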
Six useful quantities can be derived from Sxx, Syy, and Sxy:

1. The slope of the line: m = Sxy / Sxx

2. The intercept: b = ȳ − m x̄

3. The standard deviation about regression, often called the standard error of the estimate:

$$s_r = \sqrt{\frac{S_{yy} - m^2 S_{xx}}{N-2}} = \sqrt{\frac{\sum_{i=1}^{N}\left[y_i - (b + m x_i)\right]^2}{N-2}}$$

In this equation the number of degrees of freedom is N − 2, since one degree of freedom is lost in calculating m and one in calculating b.
4. The standard deviation of the slope:

$$s_m = \sqrt{\frac{s_r^2}{S_{xx}}}$$

5. The standard deviation of the intercept:

$$s_b = s_r\sqrt{\frac{\sum x_i^2}{N\sum x_i^2 - \left(\sum x_i\right)^2}} = s_r\sqrt{\frac{1}{N - \left(\sum x_i\right)^2 / \sum x_i^2}}$$

6. The standard deviation for results obtained from the calibration curve:

$$s_c = \frac{s_r}{m}\sqrt{\frac{1}{M} + \frac{1}{N} + \frac{(\bar{y}_c - \bar{y})^2}{m^2 S_{xx}}}$$

where ȳ_c is the mean signal for a set of M replicate analyses of the unknown and ȳ is the mean value of y for the N calibration points. s_c is the standard deviation from the mean of a set of M replicate analyses of unknowns when a calibration curve that contains N points is used. This equation is only approximate and assumes that the slope and intercept are independent parameters, which is not strictly true.
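Putting the six quantities together, the following is a hedged sketch of the whole calculation (the function name and the demonstration numbers are ours; the lecture's worked example uses its own data table):

```python
import numpy as np

def linear_calibration(x, y, y_unknown, M):
    """Return m, b, s_r, s_m, s_b, s_c for a linear calibration of N standards.

    y_unknown is the mean signal of M replicate measurements of the unknown."""
    x, y = np.asarray(x, float), np.asarray(y, float)
    N = len(x)

    Sxx = np.sum((x - x.mean()) ** 2)
    Syy = np.sum((y - y.mean()) ** 2)
    Sxy = np.sum((x - x.mean()) * (y - y.mean()))

    m = Sxy / Sxx                                   # 1. slope
    b = y.mean() - m * x.mean()                     # 2. intercept
    s_r = np.sqrt((Syy - m ** 2 * Sxx) / (N - 2))   # 3. std dev about regression
    s_m = np.sqrt(s_r ** 2 / Sxx)                   # 4. std dev of the slope
    s_b = s_r * np.sqrt(np.sum(x ** 2)
                        / (N * np.sum(x ** 2) - np.sum(x) ** 2))   # 5. intercept
    s_c = (s_r / m) * np.sqrt(1 / M + 1 / N
                              + (y_unknown - y.mean()) ** 2 / (m ** 2 * Sxx))  # 6.
    return m, b, s_r, s_m, s_b, s_c

# Demonstration with placeholder data (not the lecture's table):
m, b, s_r, s_m, s_b, s_c = linear_calibration(
    x=[0.0, 2.0, 4.0, 6.0, 8.0], y=[0.03, 0.16, 0.31, 0.44, 0.59],
    y_unknown=0.25, M=3)
print(f"m = {m:.4f} ± {s_m:.4f}, b = {b:.4f} ± {s_b:.4f}, s_r = {s_r:.4f}, s_c = {s_c:.4f}")
```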
FINDING THE REGRESSION LINE

PRACTICAL EXAMPLE 1

Carry out a least-squares analysis of the calibration data for the determination of isooctane in a hydrocarbon mixture provided in the first two columns of the table.

Note that the number of digits carried in the computed values should be the maximum allowed by the calculator or computer; that is, rounding should not be performed until the calculation is complete.

Solution: the slope of the line is m = Sxy / Sxx, and the intercept is b = ȳ − m x̄.
The standard deviation of the slope is then found from s_m = √(s_r²/Sxx), as defined above.
The calibration curve found in the previous example was used for the chromatographic determination of isooctane in a hydrocarbon mixture.

Solution:

$$s_c = \frac{s_r}{m}\sqrt{\frac{1}{M} + \frac{1}{N} + \frac{(\bar{y}_c - \bar{y})^2}{m^2 S_{xx}}}$$

(a) Substituting ȳ_c = 0.1442 into this equation gives the standard deviation of the result obtained from the calibration curve.
INTERPRETATION OF LEAST-SQUARES RESULTS

The closer the data points are to the line predicted by a least-squares analysis, the smaller are the residuals. The sum of the squares of the residuals, S, measures the variation in the observed values of y that is not explained by the presumed linear relationship:

$$S = \sum_{i=1}^{N}\left[y_i - (b + m x_i)\right]^2$$

We can also define a total sum of squares, Stot, which is equal to Syy and which is a measure of the total variation in the observed values of y, since the deviations are measured from the mean value of y.

The coefficient of determination (R²) measures the fraction of the observed variation in y that is explained by the linear relationship and is given by:

$$R^2 = 1 - \frac{S}{S_{\text{tot}}}$$

The closer R² is to unity, the better the linear model explains the variations in y.
The difference between Stot and S is the sum of the squares due to
regression, Sreg.
PRACTICAL EXAMPLE 3

Find the coefficient of determination for the chromatographic data of the previous example.

Solution: First, we calculate S (the sum of the squares of the residuals) and Syy, and then the coefficient of determination:

$$R^2 = 1 - \frac{\sum_{i=1}^{N}(y_i - \hat{y}_i)^2}{S_{yy}} = 1 - \frac{0.0624}{5.07748} = 0.9877$$

This calculation shows that over 98% of the variation in peak area can be explained by the linear model.

We can also calculate Sreg = Stot − S = 5.07748 − 0.06240 = 5.01508.

Let us now calculate the F value. Five x-y pairs were used in the analysis. The total sum of squares has 4 degrees of freedom associated with it, since one is lost in calculating the mean of the y values. The sum of squares due to the residuals has 3 degrees of freedom because two parameters, m and b, are estimated. Hence Sreg has only 1 degree of freedom, since it is the difference between Stot and S.

$$F = \frac{S_{\text{reg}}/1}{S/3} = \frac{5.01508/1}{0.0624/3} = 241.11$$
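The numbers quoted in this example are easy to verify; the snippet below (our own check) simply plugs in the two sums of squares given above for the five calibration points:

```python
S_resid = 0.0624      # sum of squares of the residuals (S)
S_tot = 5.07748       # total sum of squares (Stot = Syy)
n_points, n_params = 5, 2

R2 = 1 - S_resid / S_tot                    # coefficient of determination
S_reg = S_tot - S_resid                     # sum of squares due to regression
df_resid = n_points - n_params              # 3 degrees of freedom for the residuals
F = (S_reg / 1) / (S_resid / df_resid)      # F value (1 and 3 degrees of freedom)

print(f"R^2  = {R2:.4f}")     # 0.9877
print(f"Sreg = {S_reg:.5f}")  # 5.01508
print(f"F    = {F:.2f}")      # ≈ 241
```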
CONSTRUCTING A CALIBRATION CURVE

A calibration curve shows the response of an analytical method to known quantities of analyte. Solutions containing known concentrations of analyte are called standard solutions. Solutions containing all reagents and solvents used in the analysis, but no deliberately added analyte, are called blank solutions; blanks measure the response of the analytical procedure to impurities or interfering species in the reagents.

Absorbance data for the protein standards:

Protein (µg) | Absorbance of three independent samples | Range | Corrected absorbance
5.0  | 0.185  0.187  0.188   | 0.003 | 0.0857  0.0877  0.0887
10.0 | 0.282  0.272  0.272   | 0.010 | 0.1827  0.1727  0.1727
15.0 | 0.345  0.347  0.392*  | 0.047 | 0.2457  0.2477  —
20.0 | 0.425  0.430  0.430   | 0.005 | 0.3257  0.3257  0.3307
25.0 | 0.483  0.488  0.496   | 0.013 | 0.3837  0.3887  0.3967

A spectrophotometer measures the absorbance of light, which is proportional to the quantity of protein analysed. When we scan across the three absorbance values in each row of the table, the number 0.392 (marked *) seems out of line. This number is called an outlier.
The outlier at 0.392 is inconsistent with the other values for 15.0 µg, and the range of values for the 15.0-µg samples is much bigger than the ranges for the other samples. The linear relation between the average values of absorbance up to the 20.0-µg sample also indicates that the value 0.392 is in error.
The blank measures the response of the procedure when no analyte is deliberately added. Blank values can vary from one set of reagents to another, but the corrected absorbance (measured absorbance minus blank absorbance) should not.
Step 3: Make a graph of corrected absorbance versus quantity of protein analysed (the calibration curve for protein analysis, shown on the right).

Use the least-squares procedure to find the best straight line through the linear portion of the data, up to and including 20.0 µg of protein (14 points, including the 3 corrected blanks, in the shaded portion of the table). The equation of the solid straight line fitting the 14 data points (open circles) from 0 to 20 µg, derived by the method of least squares, is

y = 0.01630 (±0.00022) x + 0.0047 (±0.0026)

where the quantities in parentheses are the standard deviation of the slope (s_m) and of the intercept (s_b). The standard deviation about regression is

$$s_r = \sqrt{\frac{S_{yy} - m^2 S_{xx}}{N-2}} = 0.0059$$
The equation of the quadratic calibration curve (blue dashed line) that fits all 17 data points ranging from 0 to 25 µg, determined by a nonlinear least-squares procedure, is

y = −1.17 (±0.21) × 10⁻⁴ x² + 0.01858 (±0.00046) x − 0.0007 (±0.0010)

with s_r = 0.0046 (i.e., y ± 0.0046).

For the linear region, the equation of the calibration line relating absorbance y to micrograms of protein x is

y = (0.01630)(µg of protein) + 0.0047
CALIBRATION CURVE RANGES

We prefer calibration procedures with a linear response, in which the corrected analytical signal (= signal from sample − signal from blank) is proportional to the quantity of analyte. Although we try to work in the linear range, valid results can be obtained beyond the linear region (> 20 µg) in this experiment (see the figure illustrating linear and dynamic ranges).

The dashed curve that extends up to 25 µg of protein comes from a least-squares fit of the data to the quadratic equation y = ax² + bx + c. To find the quantity of protein x that corresponds to a measured signal y, we solve ax² + bx + (c − y) = 0.
There are two possible solutions, given by the quadratic formula:

$$x = \frac{-b \pm \sqrt{b^2 - 4a(c - y)}}{2a}$$

Here they are x₁ = 135 µg and x₂ = 23.8 µg. The calibration curve tells us that the correct choice is 23.8 µg, not 135 µg.
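A short sketch of this inversion is shown below. The quadratic coefficients are the ones quoted above; the observed corrected absorbance of 0.375 is an assumption on our part, back-calculated so that the two roots reproduce the 23.8 µg and 135 µg quoted in the text.

```python
import math

# Quadratic calibration y = a*x**2 + b*x + c (coefficients from the fitted curve above)
a, b, c = -1.17e-4, 0.01858, -0.0007
y_obs = 0.375          # assumed corrected absorbance (illustrative; not given in the text)

# Solve a*x**2 + b*x + (c - y_obs) = 0 with the quadratic formula
disc = b ** 2 - 4 * a * (c - y_obs)
x1 = (-b + math.sqrt(disc)) / (2 * a)
x2 = (-b - math.sqrt(disc)) / (2 * a)

print(f"x1 = {x1:.1f} µg, x2 = {x2:.1f} µg")   # ≈ 23.8 µg (physical) and ≈ 135 µg (rejected)
```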
Always make a graph of your data. The graph gives you an
opportunity to reject bad data or the stimulus to repeat a
measurement or decide that a straight line is not appropriate.