
Omair A. Khan

Understanding Statistics for Quality by Design
A TECHNICAL DOCUMENT SERIES

West Pharmaceutical Services · Exton, Pennsylvania

July 31, 2015


Contents

PART I   PROCESS DESIGN AND PERFORMANCE

1 A Practical Guide to Utilizing Cp and Cpk   1
2 Confidence Intervals for Process Capability Indices   7

PART II   PRODUCT DESIGN AND PERFORMANCE

3 Statistical Procedures for Inter-Rater Reliability   11
4 Statistical Procedures for Testing Equivalence   15
Part I

Process Design and Performance


A Practical Guide to Utilizing Cp and Cpk

Omair A. Khan¹
June 19, 2015 · omair@prettynumbe.rs
¹ West Pharmaceutical Services, R&D Statistical Engineering Intern

This document explains the use of process capability indices as a way to understand and improve manufacturing processes. It is intended to be an empirical and pragmatic approach to capability analysis without developing the underlying statistical theory. After reading this document, the reader will have a strong grasp of the motivation for process capability indices and what is needed to calculate them, focusing on Cp and Cpk.

Why quantify capability?

The ability to manufacture a product within a customer's specifications or tolerances is known as capability. Statistical process control (SPC) is a methodology for achieving process stability and improving capability through the reduction of variability. In any production process, a certain amount of natural variability will always exist (chance or common causes of variation). Occasionally, assignable or special causes of variation can be present in the output of a process, arising from improperly adjusted machines, operator errors, or defective raw material. This type of variability is usually large compared to natural variability and tends to suggest an unacceptable level of process performance. A process that is operating with only chance causes of variation is said to be in statistical control. Conversely, a process that is operating under the presence of assignable causes is said to be an out-of-control process.

Since process variation can never be totally eliminated, the control of this variation is the key to product quality. Maintaining a stable process average and systematically reducing process variation are the keys to achieving superior quality. If process variation is controlled, then a process becomes predictable. If predictability and consistency are achieved, then a description of the capability of the process to produce acceptable products is possible. A process capability index is a statistical measure of process capability. These indices can be used in the following ways:²

² Deleryd M (1996)

1. As a basis in the improvement process.
2. As an alarm clock.
3. As specifications for investments. By giving specifications for levels of process capability indices, expected to be reached by new machines, the purchasing process is facilitated.
4. As a certificate for customers. The supplier is able to attach the results from the process capability studies conducted when the actual products were produced, along with the delivery.
5. As a basis for new constructions. By knowing the capability of the production processes, the designer knows how to set reasonable specifications in order to make the product manufacturable.
6. For control of maintenance efforts. By continuously conducting process capability studies it is possible to see if some machines are gradually deteriorating.
7. As specifications for introducing new products.
8. For assessing the reasonableness of customer demands.
9. For motivation of co-workers.
10. For deciding priorities in the improvement process.
11. As a base for inspection activities.
12. As a receipt for improvements.
13. For formulating quality improvement programs.

Process capability indices in practice

SPC is primarily a method for monitoring process performance. Many engineers believe that Cpk can be used to quantify product quality. This is simply untrue. While Cpk can be used to calculate process fallout (Table 1), the decision to accept or reject a production lot of items must be made by acceptance sampling. Sampling plans can be derived using a variety of statistical techniques but are commonly chosen by consulting tables outlined in ANSI/ASQ Z1.4 (for attribute data) or ANSI/ASQ Z1.9 (for variables data).

    Cpk     Sigma level    Yield          Fallout (PPM)
    0.33    1              68.27%         317311
    0.67    2              95.45%         45500
    1.00    3              99.73%         2700
    1.33    4              99.99%         63
    1.67    5              99.9999%       1
    2.00    6              99.9999998%    0.002

Table 1: Relationship between Cpk and non-conforming items (measured in PPM).
As mentioned above, the main goal of capability analysis is to help reduce variability in the manufacturing process. Higher capability indices generally correspond to higher profits, as they imply fewer non-conforming parts and better customer satisfaction. Table 2 contains commonly used minimum values for a variety of processes.

    Situation            Minimum Capability
    Existing process
      Regular            1.33
      Critical           1.50
    New process
      Regular            1.50
      Critical           1.67
    Six Sigma process    2.00

Table 2: Recommended capability values for two-sided specifications.

Finally, it is important to understand that Cpk does not give us the whole picture. One of the disadvantages of Cpk is that it does not take into consideration the target or nominal specification. Figure 1 illustrates how the same Cpk value can describe two very different processes. For this reason, it is good practice not to base decisions solely on the numerical value of a statistic, but also to graphically visualize the data. Another way to address this difficulty is to use a process capability index that is a better indicator of centering, such as Cpm or Cpkm.

[Figure 1: Two processes with Cpk = 1.0 (Montgomery, 2009).]

How to calculate capability indices

While there are numerous process capability indices, the two that are most commonly used in industry are Cp and Cpk. These random variables are estimated with the following equations (note the use of the hat to denote the estimate):

    \hat{C}_p = \frac{USL - LSL}{6\hat{\sigma}}

    \hat{C}_{pk} = \min\left[ \frac{USL - \hat{\mu}}{3\hat{\sigma}},\; \frac{\hat{\mu} - LSL}{3\hat{\sigma}} \right]

where USL and LSL are the upper and lower specification limits given by the customer, \hat{\sigma} is the sample standard deviation (s), and \hat{\mu} is the sample mean (X̄). Cp estimates what the process is capable of producing if the process mean were to be centered between the specification limits. Many times, the mean is not exactly centered and Cp overestimates the process capability. Cpk more accurately quantifies capability in these cases and is generally used in place of Cp regardless of the location of the process mean. Note that Cp = Cpk when the mean is actually centered between the specification limits.

(Note: It is recommended that the standard deviation be estimated as

    s = \sqrt{\frac{\sum_{i=1}^{n} (X_i - \bar{X})^2}{n - 1}}

rather than \hat{\sigma} = \bar{R}/d_2. This second equation is commonly used by Six Sigma practitioners but is less statistically tractable than the first.)
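To make the calculation concrete, here is a minimal sketch in Python. The data, specification limits, and function name are hypothetical; the code simply implements the two estimators above using the n − 1 sample standard deviation.

```python
import numpy as np

def cp_cpk(data, lsl, usl):
    """Point estimates of Cp and Cpk from a sample of individual values."""
    mu_hat = np.mean(data)
    sigma_hat = np.std(data, ddof=1)  # sample standard deviation (n - 1 denominator)
    cp = (usl - lsl) / (6 * sigma_hat)
    cpk = min((usl - mu_hat) / (3 * sigma_hat),
              (mu_hat - lsl) / (3 * sigma_hat))
    return cp, cpk

# Hypothetical example: 50 measurements against specifications 9.0 to 11.0
rng = np.random.default_rng(1)
data = rng.normal(loc=10.2, scale=0.25, size=50)
cp, cpk = cp_cpk(data, lsl=9.0, usl=11.0)
print(f"Cp = {cp:.2f}, Cpk = {cpk:.2f}")  # Cp > Cpk because the mean is off-center
```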

Verifying assumptions

In order to use Cp and Cpk properly, three main assumptions must be verified:

1. The individual data must be normally distributed. Normality can be verified by visually inspecting a Q-Q plot or by using the Anderson-Darling or Shapiro-Wilk tests. (Data that deviate from normality can sometimes be transformed to behave better. In practice, one should instead determine the cause for non-normality if the data are expected to be normal, e.g. dimensional data.)

2. The individual data must be independent (a particular observation X_t cannot depend on a previous observation X_{t-1}). Independence can be assumed if a plot of the data against the order in which it was collected displays no obvious pattern. One can also use the Durbin-Watson test for autocorrelation.

3. The process must be under statistical control, which is verified using Shewhart control charts. All data points (or subgroup averages) must fall between the calculated control limits (not to be confused with customer-determined specification limits).

If any of these assumptions is not true, then the process capability indices have absolutely no interpretive value!
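The first two checks are easy to script. Below is a hedged sketch using SciPy and statsmodels (the data array is hypothetical); a Durbin-Watson statistic near 2 is consistent with independence.

```python
import numpy as np
from scipy import stats
from statsmodels.stats.stattools import durbin_watson

rng = np.random.default_rng(2)
data = rng.normal(10.2, 0.25, size=50)  # hypothetical measurements in collection order

# 1. Normality: Shapiro-Wilk (a small p-value suggests non-normality)
w_stat, p_value = stats.shapiro(data)
print(f"Shapiro-Wilk: W = {w_stat:.3f}, p = {p_value:.3f}")

# 2. Independence: Durbin-Watson on deviations from the mean
dw = durbin_watson(data - data.mean())
print(f"Durbin-Watson: {dw:.2f} (values near 2 suggest no autocorrelation)")
```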

Determining how much to sample

Simulations have shown that one must have at least 30 samples in order to estimate Cp or Cpk. The exact number needed depends on the desired power and the type I error one is willing to tolerate. One must also take into account the length of an operator's shift and the type of manufacturing process to determine the frequency of sampling. OC (operating characteristic) curves are typically used for these types of calculations. However, it is usually easier to use the sampling plans tabulated in the aforementioned ANSI/ASQ standards.

Using confidence intervals

Because in practice we must estimate Cpk with Ĉpk, the point estimate is subject to a certain degree of error. If we would like to ensure that our process has a Cpk of c_k = 1.33, for example, our measured Ĉpk must be higher. This value (the lower confidence bound) is a function of the desired Cpk (c_k), the sample size (n), and the probability of type I error one is willing to tolerate (usually α = 0.05). Table 3 shows a few of these values for a variety of sample sizes.

    c_k     n = 10   20     30     40     50     75     100    125    150
    1.30    2.29     1.87   1.73   1.66   1.61   1.55   1.51   1.48   1.47
    1.40    2.45     2.01   1.86   1.78   1.73   1.66   1.62   1.59   1.58
    1.50    2.62     2.14   1.99   1.90   1.85   1.78   1.73   1.71   1.69

Table 3: The minimum value of Ĉpk for which the process is considered capable (i.e. Cpk ≥ c_k) 95% of the time. (Adapted from Chou et al., 1990)

These values assume that the data were collected individually. When rational subgrouping is employed, the required minimum value of Ĉpk will be less than what is tabulated. The exact calculation is beyond the scope of this document, and the reader is referred to Scholz and Vangel (1998) for more details.

Final words: beware of statistical terrorism

Cp and Cpk can be extremely useful when used as part of a more comprehensive capability plan. However, these process capability indices have a high potential to be misused. The result is often an atmosphere of "statistical terrorism" within an organization. Burke et al. (1991) define statistical terrorism as "the use (or misuse) of valid statistical techniques along with threats and intimidation to achieve a business objective, even if the objective may be reasonable." Below are some examples of statistical terrorism that Burke et al. have outlined. (This section is largely adapted from Kotz and Lovelace, 1998.)

• "Bandwagon" terrorism. Customers require suppliers to commit to implementing SPC aggressively and may even demand a commitment to a deadline date for SPC implementation, after which proof of quality via control charts will be required with each shipment. The result? Vendors ignore the statistical methodology and focus on making attractive charts. The vendors simply won't send out-of-control charts to the customer, since they fear the material will be rejected. In this case, statistical terrorism causes the vendor to lie.

• "Russian roulette" terrorism. Vendors are contractually bound to a specific quality criterion, measured statistically with Cp or Cpk during a special qualification run. Because of the "random variability" of random variables, which include Cp and Cpk, sampling variability may result in a calculated value of Cp or Cpk below the specified minimum value, even if the process is truly capable. If only a single estimate of Cp or Cpk is required, and the process is exactly capable (say 1.33), there is a 50% chance that the estimate will be below the minimum value. Without including confidence limits, you are playing Russian roulette in terms of meeting the requirements.

• "Tax audit" terrorism. The use of standards by large customer companies forces vendors to estimate capability based on their guidelines, which may not be appropriate, for example, with non-normal data. The rigid standards deny the vendors the opportunity to understand their own processes and adjust estimation techniques to match them. The consequences of not meeting the standard may not be made clear, and these standards keep the improvement focus on the products, not the processes. The processes have to be improved in order to improve the products.

• Other forms of terrorism. These include "self-inflicted wound" terrorism, which results from extreme pressure that managers place upon their own employees to achieve some statistical goal. There is also "academy award" terrorism, which is the requirement that an organization compete for some renowned quality award, internal or external. Finally, there is "one true statistician" terrorism, where an organization succumbs to the teachings of a specific individual to the exclusion of any other perspective.

Burke et al. (1991) suggest that statistical terrorism may be countered in the same way as physical terrorism: by intelligence, speed, and strength. The vendor should be intimately familiar with what the customer needs in their products (intelligence). Statistical expertise should be developed in-house or be readily available from a qualified outside source, so that quick statistical analysis requests by the customer can be met accurately (speed). Finally, the strength of the quality program comes from knowledge: knowledge of your own processes and how to statistically analyze them in an accurate manner.

References and Recommended Reading

ANSI/ASQ Z1.4–2003 (R2013). Sampling Procedures and Tables for Inspection by Attributes.

ANSI/ASQ Z1.9–2003 (R2013). Sampling Procedures and Tables for Inspection by Variables for Percent Nonconforming.

Chou Y-M, Owen DB, and Borrego SA (1990). Lower confidence limits on process capability indices. Journal of Quality Technology, 22(3), 223-229.

Deleryd M (1996). Process capability studies in theory and practice. Licentiate thesis, Luleå University of Technology, Luleå, Sweden.

Kotz S and Lovelace CR (1998). Process Capability Indices in Theory and Practice. Arnold, London.

Montgomery DC (2009). Introduction to Statistical Quality Control, 6th edn. John Wiley and Sons, New York.

Scholz F and Vangel M (1998). Tolerance Bounds and Cpk Confidence Bounds Under Batch Effects. In Kahle W, et al. (Eds.), Advances in Stochastic Models for Reliability, Quality and Safety (pp. 361-379). Birkhäuser, Boston.
Confidence Intervals for Process Capability Indices

Omair A. Khan¹
July 3, 2015 · omair@prettynumbe.rs
¹ West Pharmaceutical Services, R&D Statistical Engineering Intern

This document introduces the use of confidence intervals for process capability indices. We begin with a basic description of estimation followed by various equations for calculating the interval for Cp and Cpk. The final section describes a custom-developed web application to automate these calculations for West engineers.

Interval estimation

Process capability indices are most commonly reported as single point estimates. The point estimates of Cp and Cpk are calculated as

    \hat{C}_p = \frac{USL - LSL}{6\hat{\sigma}}

and

    \hat{C}_{pk} = \min\left[ \frac{USL - \hat{\mu}}{3\hat{\sigma}},\; \frac{\hat{\mu} - LSL}{3\hat{\sigma}} \right].

Due to the variability involved in sampling, this is not the most accurate method for quantifying capability. When a process is exactly capable, for example, there is a 50% chance that the estimate will be below the minimum value.²

² Khan OA (June 19, 2015). A Practical Guide to Utilizing Cp and Cpk. West Pharmaceutical Services Technical Document.

A better estimate can be obtained by calculating the 100(1 − α)% confidence interval of the process capability index. Here, α is the probability of type I error one is willing to tolerate. The most common choice for α is 0.05, resulting in a 95% confidence interval. This is interpreted as follows: if repeated samples are taken and the 95% confidence interval is computed for each sample, 95% of the intervals will contain the true population parameter. Higher confidence levels correspond to wider intervals. (Note that it is not entirely correct to say that there is a 95% chance that the population parameter lies within the interval.)

Confidence interval for Cp

Assuming normally distributed process data, the sampling distribution of Ĉp is based on the chi-square distribution. A 100(1 − α)% confidence interval for Cp is simple to calculate using Equation 1 (Kane, 1986):

    \hat{C}_p \sqrt{\frac{\chi^2_{\alpha/2,\,n-1}}{n-1}} \;\le\; C_p \;\le\; \hat{C}_p \sqrt{\frac{\chi^2_{1-\alpha/2,\,n-1}}{n-1}}    (1)

Confidence interval for Cpk

Confidence intervals for Cpk are difficult to construct because the sampling distribution involves the joint distribution of two non-central t-distributed random variables. Several authors have proposed approximate confidence intervals based on various arguments. No single equation is considered best in practice (Kotz & Lovelace, 1998). Four equations are included in this section. (The interested reader is referred to Pearn and Lin, 2004, for the exact derivation of the cumulative distribution function of Ĉpk.)

Heavlin (1988)

    C_{pk} = \hat{C}_{pk} \pm Z_{1-\alpha/2} \sqrt{\frac{n-1}{9n(n-3)} + \frac{\hat{C}_{pk}^2}{2(n-3)}\left(1 + \frac{6}{n-1}\right)}    (2)

Bissell (1990)

    C_{pk} = \hat{C}_{pk} \pm Z_{1-\alpha/2} \sqrt{\frac{1}{9n} + \frac{\hat{C}_{pk}^2}{2(n-1)}}    (3)

Kushler-Hurley (1992)

    C_{pk} = \hat{C}_{pk} \left(1 \pm \frac{Z_{1-\alpha/2}}{\sqrt{2(n-1)}}\right)    (4)

Minitab

This formula, used in Minitab 16 and 17, is unique in that it takes batch effects into consideration:

    C_{pk} = \hat{C}_{pk} \pm Z_{1-\alpha/2} \sqrt{\frac{1}{N(m/2)^2} + \frac{\hat{C}_{pk}^2}{2\nu}}    (5)

Here, N is the total number of observations, m is the sigma tolerance value (6 by default), ν is the degrees of freedom (calculated as Σ(n_i − 1) by default), and n_i is the subgroup size.

(The source for this equation is not clear. It seems to be a modification of the formula proposed by Bissell (Equation 3). Minitab's technical support is trying to find a proper citation. This document will be updated when more information is available.)
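As a concreteness check, here is a minimal sketch in Python that computes the Kane interval for Cp (Equation 1) and the Bissell interval for Cpk (Equation 3). The sample data and specification limits are hypothetical.

```python
import numpy as np
from scipy import stats

def cp_interval(data, lsl, usl, alpha=0.05):
    """100(1 - alpha)% confidence interval for Cp via Equation 1 (Kane, 1986)."""
    n = len(data)
    cp_hat = (usl - lsl) / (6 * np.std(data, ddof=1))
    lo = cp_hat * np.sqrt(stats.chi2.ppf(alpha / 2, n - 1) / (n - 1))
    hi = cp_hat * np.sqrt(stats.chi2.ppf(1 - alpha / 2, n - 1) / (n - 1))
    return lo, hi

def cpk_interval(data, lsl, usl, alpha=0.05):
    """Approximate 100(1 - alpha)% CI for Cpk via Equation 3 (Bissell, 1990)."""
    n = len(data)
    mu, s = np.mean(data), np.std(data, ddof=1)
    cpk_hat = min((usl - mu) / (3 * s), (mu - lsl) / (3 * s))
    z = stats.norm.ppf(1 - alpha / 2)
    half = z * np.sqrt(1 / (9 * n) + cpk_hat**2 / (2 * (n - 1)))
    return cpk_hat - half, cpk_hat + half

rng = np.random.default_rng(3)
data = rng.normal(10.2, 0.25, size=50)
print("Cp  95% CI:", cp_interval(data, 9.0, 11.0))
print("Cpk 95% CI:", cpk_interval(data, 9.0, 11.0))
```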

Web application for PCI interval estimation

An online tool for calculating the confidence interval of Cp and Cpk or its lower bound inverse is available at https://westelastomer.shinyapps.io/pci_confidence. A screenshot of the app is shown in Figure 1. The user can specify whether to input the measured PCI (output: the 100(1 − α)% confidence interval) or the desired PCI (output: the lower confidence bound for which the process will be capable 100(1 − α)% of the time). The second option also produces a plot of the inverse function and a searchable table of values.

[Figure 1: Screenshot of web application.]

The web application uses Equations 1 and 4 for the calculations. These were chosen because their lower bound inverses f⁻¹(Cp) and f⁻¹(Cpk) are single-valued functions.
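The inversion is simple for Equation 4: if the lower bound Ĉpk(1 − Z/√(2(n − 1))) must reach the target c_k, the required measured value is Ĉpk = c_k / (1 − Z/√(2(n − 1))). A hedged sketch follows (this is not the app's actual source code, and I assume a one-sided Z₁₋α for the lower bound); being an approximation, it will not reproduce Table 3 of the previous document exactly, since that table was derived from the exact non-central t distribution.

```python
import numpy as np
from scipy import stats

def required_cpk(c_k, n, alpha=0.05):
    """Measured Cpk needed so that the one-sided 100(1 - alpha)% lower
    confidence bound from Equation 4 is at least c_k (an approximation)."""
    z = stats.norm.ppf(1 - alpha)  # one-sided critical value (assumption)
    return c_k / (1 - z / np.sqrt(2 * (n - 1)))

for n in (10, 30, 100):
    print(n, round(required_cpk(1.33, n), 2))
```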
The application will load in any modern browser (including In-
ternet Explorer 9). The tool was developed using Shiny, a web appli-
cation framework for R. The source code is available on my GitHub
page (https://github.com/prettynumbers/cpk_app) for forking and
modification. Because of the usability limitations on shinyapps.io
(25 active hours per month on the free account), I recommend that
West host the application on their own server or upgrade to a paid
account if it is found to be popular.

References and Recommended Reading

Bissell AF (1990). How reliable is your capability index? Journal of Applied Statistics, 39, 331-340.

Chou Y-M, Owen DB, and Borrego SA (1990). Lower confidence limits on process capability indices. Journal of Quality Technology, 22(3), 223-229.

Heavlin WD (1988). Statistical properties of capability indices. Technical Report 320, Technical Library, Advanced Micro Devices, Inc., Sunnyvale, CA.

Johnson N and Kotz S (1970). Continuous Univariate Distributions, vol 2, 1st edn. John Wiley and Sons, New York. p. 224.

Kane VE (1986). Process capability indices. Journal of Quality Technology, 18(1), 41-52.

Kotz S and Lovelace CR (1998). Process Capability Indices in Theory and Practice. Arnold, London.

Kushler RH and Hurley P (1992). Confidence bounds for capability indices. Journal of Quality Technology, 24, 188-196.

Montgomery DC (2009). Introduction to Statistical Quality Control, 6th edn. John Wiley and Sons, New York.

Nagata Y and Nagahata H (1994). Approximation formulas for the confidence intervals of process capability indices. Okayama Economic Review, 25, 301-314.

Pearn WL and Lin PC (2004). Testing process performance based on capability index Cpk with critical values. Computers & Industrial Engineering, 47, 351-369.
Part II

Product Design and Performance


Statistical Procedures for Inter-Rater Reliability

Omair A. Khan¹
July 17, 2015 · omair@prettynumbe.rs
¹ West Pharmaceutical Services, R&D Statistical Engineering Intern

This document describes various statistical procedures for measuring inter-rater reliability. These methods quantify the homogeneity in ratings and can be used to show how well two methods of measurement agree.

ANOVA gauge R&R

This method for determining the capability of a measurement system utilizes a designed factorial experiment. The data from this experiment is analyzed using the random effects model analysis of variance (ANOVA). If there are a randomly selected parts and b randomly selected operators, and each operator measures every part n times, then the measurements (i = part, j = operator, k = measurement) can be represented by the model

    y_{ijk} = \mu + P_i + O_j + (PO)_{ij} + \epsilon_{ijk},
        i = 1, 2, \ldots, a;\quad j = 1, 2, \ldots, b;\quad k = 1, 2, \ldots, n    (1)

With this model, the variance of any observation is

    V(y_{ijk}) = \sigma_P^2 + \sigma_O^2 + \sigma_{PO}^2 + \sigma^2 = \sigma_P^2 + \sigma_{Gauge}^2.

The gauge variability can be decomposed into the repeatability variance component (σ²) and the gauge reproducibility (σ_O² + σ_{PO}²). (The experiment can easily be extended to study different measurement systems by adding an M_ℓ term and its two-way and three-way interactions with P_i and O_j.) It is common to compare the estimate of gauge capability to the width of the specifications or the tolerance band for the part that is being measured. This is called the precision-to-tolerance (P/T) ratio:

    P/T = \frac{k \hat{\sigma}_{Gauge}}{USL - LSL}    (2)

In Equation 2, popular choices for the constant k are k = 5.15 and k = 6. The value k = 5.15 corresponds to the limiting value of the number of standard deviations between bounds of a 95% tolerance interval that contains at least 99% of a normal population, and k = 6 corresponds to the number of standard deviations between the usual natural tolerance limits of a normal population. Values of the estimated P/T ratio of 0.1 or less are often taken to imply adequate gauge capability (Montgomery, 2009).
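Below is a hedged sketch of the variance-component arithmetic, using simulated data and the standard method-of-moments estimates from the two-way random effects ANOVA mean squares. The column names and simulation parameters are my own; this is not a full gauge R&R package.

```python
import numpy as np
import pandas as pd
import statsmodels.api as sm
from statsmodels.formula.api import ols

# Simulated gauge study: a = 10 parts, b = 3 operators, n = 2 replicates
a, b, n = 10, 3, 2
rng = np.random.default_rng(4)
parts = rng.normal(0, 1.0, a)   # part-to-part variation
ops = rng.normal(0, 0.2, b)     # operator (reproducibility) variation
rows = [(i, j, 10 + parts[i] + ops[j] + rng.normal(0, 0.15))
        for i in range(a) for j in range(b) for _ in range(n)]
df = pd.DataFrame(rows, columns=["part", "operator", "y"])

# Two-way ANOVA with interaction; mean squares = sum_sq / df
anova = sm.stats.anova_lm(ols("y ~ C(part) * C(operator)", data=df).fit(), typ=2)
ms = anova["sum_sq"] / anova["df"]
ms_o, ms_po, ms_e = ms["C(operator)"], ms["C(part):C(operator)"], ms["Residual"]

# Method-of-moments variance components (clipped at zero)
var_rep = ms_e                            # repeatability, sigma^2
var_po = max((ms_po - ms_e) / n, 0)       # part x operator interaction
var_o = max((ms_o - ms_po) / (a * n), 0)  # operator (reproducibility)
sigma_gauge = np.sqrt(var_rep + var_po + var_o)

usl, lsl = 13.0, 7.0  # hypothetical specification limits
print(f"P/T = {6 * sigma_gauge / (usl - lsl):.3f}")  # k = 6
```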

Concordance correlation coefficient

The concordance correlation coefficient (ρ̂_c) for measuring agreement between continuous, normally-distributed variables X and Y is calculated as follows for an n-length data set:

    \hat{\rho}_c = \frac{2 s_{xy}}{s_x^2 + s_y^2 + (\bar{x} - \bar{y})^2}    (3)

Equation 3 is an estimate of the population concordance correlation coefficient:

    \rho_c = \frac{2 \rho \sigma_x \sigma_y}{\sigma_x^2 + \sigma_y^2 + (\mu_x - \mu_y)^2}    (4)

Just like the familiar Pearson correlation coefficient, a value of ρ_c = +1 corresponds to perfect agreement, a value of ρ_c = −1 corresponds to perfect negative agreement, and a value of ρ_c = 0 corresponds to no agreement.

McBride (2005) suggests the following descriptive scale for values of the concordance correlation coefficient:

    Value of ρ_c    Strength of agreement
    < 0.90          Poor
    0.90 - 0.95     Moderate
    0.95 - 0.99     Substantial
    > 0.99          Almost perfect
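A minimal NumPy implementation of Equation 3 follows; I assume the conventional biased sample moments (1/n denominators), and the paired example data are hypothetical.

```python
import numpy as np

def concordance_cc(x, y):
    """Sample concordance correlation coefficient (Equation 3)."""
    x, y = np.asarray(x, float), np.asarray(y, float)
    s_xy = np.mean((x - x.mean()) * (y - y.mean()))  # 1/n covariance
    s2_x, s2_y = x.var(), y.var()                    # 1/n variances
    return 2 * s_xy / (s2_x + s2_y + (x.mean() - y.mean()) ** 2)

method_a = np.array([10.1, 9.8, 10.4, 10.0, 9.7, 10.2])
method_b = np.array([10.0, 9.9, 10.5, 10.1, 9.6, 10.2])
print(f"rho_c = {concordance_cc(method_a, method_b):.3f}")
```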

Cohen's kappa

Cohen's kappa statistic (κ) is a measure of agreement between categorical variables X and Y. It is unique in that it takes into consideration agreement by chance. Kappa can be used to compare the ability of different raters to classify parts or defects into one of several groups. It can also be used to assess the agreement between alternative methods of categorical assessment when new techniques are under study.

Kappa is calculated from the observed and expected frequencies on the diagonal of a square contingency table. Suppose that there are n parts on which X and Y are measured, and suppose that there are g distinct categorical outcomes for both X and Y. Let f_ij denote the frequency of the number of parts with the ith categorical response for variable X and the jth categorical response for variable Y. Then the frequencies can be arranged in the following g × g table:

            Y = 1    Y = 2    ···    Y = g
    X = 1   f_11     f_12     ···    f_1g
    X = 2   f_21     f_22     ···    f_2g
    ...     ...      ...      ···    ...
    X = g   f_g1     f_g2     ···    f_gg

The observed proportional agreement between X and Y is defined using the diagonal values as:

    p(a) = \frac{1}{n} \sum_{i=1}^{g} f_{ii}    (5)

and the expected agreement by chance is:

    p(e) = \frac{1}{n^2} \sum_{i=1}^{g} f_{i+} f_{+i}    (6)

where f_{i+} is the total for the ith row and f_{+i} is the total for the ith column. The kappa statistic is:

    \kappa = \frac{p(a) - p(e)}{1 - p(e)}    (7)

Cohen's kappa is generally between 0 and 1; however, negative values are possible when there is less than chance agreement. For ordinal data and partial scoring, it is possible to use a weighted form of kappa (Cohen, 1968). When there are more than two raters being compared, one can use Fleiss's kappa.

Viera & Garrett (2005) suggest the following descriptive scale for values of Cohen's kappa statistic:

    Value of κ     Strength of agreement
    < 0            Less than chance
    0.01 - 0.20    Slight
    0.21 - 0.40    Fair
    0.41 - 0.60    Moderate
    0.61 - 0.80    Substantial
    0.81 - 0.99    Almost perfect
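A direct NumPy translation of Equations 5-7, applied to a hypothetical 3 × 3 contingency table (scikit-learn's cohen_kappa_score gives the same result when starting from raw labels):

```python
import numpy as np

def cohens_kappa(table):
    """Cohen's kappa from a square contingency table (Equations 5-7)."""
    f = np.asarray(table, float)
    n = f.sum()
    p_a = np.trace(f) / n                                # observed agreement, Eq. (5)
    p_e = (f.sum(axis=1) * f.sum(axis=0)).sum() / n**2   # chance agreement, Eq. (6)
    return (p_a - p_e) / (1 - p_e)                       # Eq. (7)

# Hypothetical counts: rows are rater X's category, columns are rater Y's
table = [[20, 3, 1],
         [4, 15, 2],
         [0, 2, 13]]
print(f"kappa = {cohens_kappa(table):.3f}")
```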

Bland-Altman plot

The Bland-Altman plot (also known as the Tukey mean-difference plot) provides a quick, graphical method to determine if two raters or test methods agree. Figure 1 shows an example of this plot.

[Figure 1: Bland-Altman plot. One observation in the lower-left corner is outside the reference interval.]

Each test method is performed on n paired samples. The mean of the two tests is plotted on the horizontal axis and the difference is plotted on the vertical axis, i.e.

    S(x, y) = \left( \frac{S_1 + S_2}{2},\; S_1 - S_2 \right).

The bias of the two methods is the mean of these differences (S̄_y). A reference interval known as the limits of agreement is often calculated as S̄_y ± 1.96s. However, this equation is not valid for smaller sample sizes. The most accurate formula, which can be used in any case, is (Hayes & Krippendorff, 2007):

    \bar{S}_y \pm t_{0.05,\,n-1}\, s \sqrt{1 + \frac{1}{n}}    (8)

The limits of agreement provide insight into how much random variation may be influencing the ratings or test methods. If the measurements tend to agree, the differences between the two sets of observations will be near zero. If one rater or method is usually higher or lower than the other by a consistent amount, the bias will be different from zero.
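A hedged sketch of the computation behind the plot: I read the document's t_{0.05,n−1} as the two-sided 95% quantile, so the code uses t.ppf(0.975, n − 1), and the paired data are hypothetical.

```python
import numpy as np
from scipy import stats

method_1 = np.array([10.2, 9.9, 10.6, 10.1, 9.8, 10.4, 10.0, 9.7])
method_2 = np.array([10.0, 10.0, 10.4, 10.3, 9.6, 10.5, 9.9, 9.9])

diffs = method_1 - method_2
n = len(diffs)
bias = diffs.mean()
s = diffs.std(ddof=1)

# Limits of agreement per Equation 8 (two-sided 95% t quantile assumed)
t_crit = stats.t.ppf(0.975, n - 1)
half_width = t_crit * s * np.sqrt(1 + 1 / n)
print(f"bias = {bias:.3f}, limits of agreement = "
      f"[{bias - half_width:.3f}, {bias + half_width:.3f}]")
```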

References and Recommended Reading

Cohen J (1968). Weighted kappa: Nominal scale agreement with provision for scaled disagreement or partial credit. Psychological Bulletin, 70(4), 213-220.

Hayes AF & Krippendorff K (2007). Answering the call for a standard reliability measure for coding data. Communication Methods and Measures, 1, 77-89.

McBride GB (2005). A proposal for strength-of-agreement criteria for Lin's Concordance Correlation Coefficient. NIWA Client Report: HAM2005-062.

Montgomery DC (2009). Introduction to Statistical Quality Control, 6th edn. John Wiley and Sons, New York.

Viera AJ & Garrett JM (2005). Understanding Interobserver Agreement: The Kappa Statistic. Family Medicine, 37(5), 360-363.
Statistical Procedures for Testing Equivalence

Omair A. Khan¹
July 30, 2015 · omair@prettynumbe.rs
¹ West Pharmaceutical Services, R&D Statistical Engineering Intern

This document covers procedures for testing the equality of two or more means, including t-tests, one-way ANOVA, and post-hoc procedures.

How can we ensure that a certain product produced by multiple manufacturing plants is actually the same? This question was the motivation behind this final technical document of the series. Once we can confirm that there is strong agreement between different measurement systems at different sites,² we can then perform statistical tests to verify equal product quality attributes. This paper describes a hypothesis testing approach to comparing product measurements. If it is found that two or more sets of products that should be the same are actually not equivalent, a closer inspection of the manufacturing processes is warranted.

² Khan OA (July 17, 2015). Statistical Procedures for Inter-Rater Reliability. West Pharmaceutical Services Technical Document.

Testing the equality of two means

The most common method for testing H₀: µ₁ = µ₂ vs. H₁: µ₁ ≠ µ₂ is the t-test. The observations in each group must follow a normal distribution. The statistic is calculated differently for equal and unequal sample sizes and variances. The t-statistic is then compared to the critical value t_critical = t_{1-α/2, ν} (found in a table) to make a decision:

    if |t| < t_critical, then do not reject H₀
    if |t| ≥ t_critical, then reject H₀

(The Mann-Whitney U test, or Wilcoxon rank-sum test, is the analogous nonparametric test for testing whether two samples come from the same population. It does not require that the samples be normally distributed.)
Equal sample sizes, equal variances

The t-statistic is calculated as:

    t = \frac{\bar{X}_1 - \bar{X}_2}{s_{X_1 X_2} \sqrt{2/n}}    (1)

where s_{X_1 X_2} = \sqrt{(s_{X_1}^2 + s_{X_2}^2)/2} is the pooled standard deviation, and the t-statistic has ν = 2n − 2 degrees of freedom.

Unequal sample sizes, equal variances

The t-statistic is calculated as:

    t = \frac{\bar{X}_1 - \bar{X}_2}{s_{X_1 X_2} \cdot \sqrt{\frac{1}{n_1} + \frac{1}{n_2}}}    (2)

where

    s_{X_1 X_2} = \sqrt{\frac{(n_1 - 1) s_{X_1}^2 + (n_2 - 1) s_{X_2}^2}{n_1 + n_2 - 2}}

and the t-statistic has ν = n₁ + n₂ − 2 degrees of freedom.

Equal or unequal sample sizes, unequal variances

Welch's t-test is an adaptation of Student's t-test for unequal variances. The t-statistic is calculated as:

    t = \frac{\bar{X}_1 - \bar{X}_2}{\sqrt{\frac{s_1^2}{n_1} + \frac{s_2^2}{n_2}}}    (3)

and the degrees of freedom are calculated as:

    \nu \approx \frac{\left( \frac{s_1^2}{n_1} + \frac{s_2^2}{n_2} \right)^2}{\frac{s_1^4}{n_1^2 \nu_1} + \frac{s_2^4}{n_2^2 \nu_2}}    (4)

Here, ν₁ = n₁ − 1 and ν₂ = n₂ − 1 are the degrees of freedom associated with the two variance estimates.

Paired samples

When the same set of samples is used in both groups, we can use the paired t-test to get more power. The t-statistic is calculated as:

    t = \frac{\bar{X}_D}{s_D / \sqrt{n}}    (5)

For this equation, the differences between all pairs must be calculated. The average (X̄_D) and standard deviation (s_D) of those differences are used in the equation. The degrees of freedom for the hypothesis test are calculated as ν = n − 1.
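For completeness, SciPy implements all three of these tests directly; a minimal sketch with hypothetical data:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(5)
site_1 = rng.normal(10.0, 0.3, size=25)
site_2 = rng.normal(10.1, 0.3, size=30)

# Pooled-variance t-test (Equations 1-2) and Welch's test (Equations 3-4)
print(stats.ttest_ind(site_1, site_2, equal_var=True))
print(stats.ttest_ind(site_1, site_2, equal_var=False))

# Paired t-test (Equation 5) requires equal-length, paired samples
before, after = rng.normal(10, 0.3, 20), rng.normal(10, 0.3, 20)
print(stats.ttest_rel(before, after))
```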

Two One-Sided Tests (TOST)

In some cases, it is acceptable to conclude equivalence if the difference of the two means falls between an upper and lower bound. (Minitab 17 has the functionality to do TOST under the menu heading "Equivalence Tests.") The null hypotheses for non-equivalence are:

    H_{0,1}: µ₁ − µ₂ ≤ δ_L    and    H_{0,2}: µ₁ − µ₂ ≥ δ_U

and the alternative hypothesis of equivalence is:

    H₁: δ_L < µ₁ − µ₂ < δ_U

If we can assume that the two groups of normally-distributed values have the same variance, the calculation of the two one-sided test statistics uses the following equations:

    t_L = \frac{(\bar{X}_2 - \bar{X}_1) - \delta_L}{SE}    (6)

    t_U = \frac{(\bar{X}_2 - \bar{X}_1) - \delta_U}{SE}    (7)

where the standard error is:

    SE = \sqrt{ \frac{\sum_{i=1}^{n_1} (X_{1i} - \bar{X}_1)^2 + \sum_{j=1}^{n_2} (X_{2j} - \bar{X}_2)^2}{n_1 + n_2 - 2} \left( \frac{1}{n_1} + \frac{1}{n_2} \right) }    (8)

The critical value t_critical = t_{1-α, n₁+n₂-2} is used to make a two-part decision:

    if t_L < t_critical or t_U > −t_critical, then do not reject H₀
    if t_L ≥ t_critical and t_U ≤ −t_critical, then reject H₀ (conclude equivalence)
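A direct implementation of Equations 6-8 follows; the equivalence margins and data are hypothetical.

```python
import numpy as np
from scipy import stats

def tost(x1, x2, delta_l, delta_u, alpha=0.05):
    """Pooled-variance TOST per Equations 6-8; True means equivalence."""
    x1, x2 = np.asarray(x1, float), np.asarray(x2, float)
    n1, n2 = len(x1), len(x2)
    ss = ((x1 - x1.mean()) ** 2).sum() + ((x2 - x2.mean()) ** 2).sum()
    se = np.sqrt(ss / (n1 + n2 - 2) * (1 / n1 + 1 / n2))
    diff = x2.mean() - x1.mean()
    t_l = (diff - delta_l) / se
    t_u = (diff - delta_u) / se
    t_crit = stats.t.ppf(1 - alpha, n1 + n2 - 2)
    return t_l >= t_crit and t_u <= -t_crit

rng = np.random.default_rng(6)
plant_a = rng.normal(10.00, 0.3, 40)
plant_b = rng.normal(10.05, 0.3, 40)
print("equivalent:", tost(plant_a, plant_b, delta_l=-0.25, delta_u=0.25))
```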

Testing the equality of three or more means

When we are interested in testing the equality of more than two means, we can perform a one-way analysis of variance (ANOVA). In the special case of two groups, the F-test used in the ANOVA is equivalent to the t-test (since F = t²). The hypothesis we are testing is:

    H₀: µ₁ = µ₂ = ··· = µ_k    vs.    H₁: at least one mean is different

The theory and procedure of the ANOVA are beyond the scope of this document. The reader is encouraged to look at any introductory statistics book for a discussion of this versatile test. It is relatively robust to small deviations from normality; however, the assumption of homoscedasticity (equal variance) must be satisfied. The ANOVA cannot be used to determine which means are different if the null hypothesis is rejected. Post-hoc testing procedures are therefore necessary and are described in the next section.

(The Kruskal-Wallis one-way analysis of variance is the analogous nonparametric test for testing whether three or more samples come from the same population. It does not require that the samples be normally distributed.)

Multiple comparisons

Sometimes we are interested in considering a set of statistical inferences simultaneously. It is not acceptable to sequentially perform these tests without alteration, as the probability of incorrectly rejecting a null hypothesis (type I error) compounds with each additional test. For example, if we are interested in making 10 pairwise comparisons between k = 5 groups and try to do a series of t-tests with individual 95% confidence levels, our overall confidence level falls to (95%)¹⁰ = 59.9%!

[Figure 1: The overall confidence level of a set of simultaneous inferences. The blue and pink lines are for 95% and 90% confidence levels respectively.]

To overcome this issue, multiple testing correction methods must be used. This section covers the most commonly used procedures. Except for the Bonferroni correction, all of the methods below are post-hoc analyses and should only be run if the ANOVA procedure indicates that the means are not equal. In calculating the test statistic for these methods, MSE is the mean squared error from the ANOVA output.

Bonferroni correction

This is the simplest method for multiple comparisons. It does not require an ANOVA to be run prior to performing the tests. The procedure involves a series of t-tests performed at an adjusted confidence level of 100(1 − α*)%, where α* = α/m and m is the number of comparisons being made. While the individual pairwise tests are performed at a higher confidence level, the overall confidence level is still approximately 100(1 − α)%.

Tukey-Kramer (Tukey's HSD) test

When doing all pairwise comparisons, this method is considered the best available for unequal sample sizes. When sample sizes are equal and confidence intervals are not needed, Tukey's test is slightly less powerful than the Bonferroni correction, but the loss in power is very small unless the groups are large. We are interested in testing the hypothesis:

    H₀: µ_i = µ_j    vs.    H₁: µ_i ≠ µ_j

The test statistic is calculated as:

    q_{obs} = \frac{\bar{X}_i - \bar{X}_j}{SE}    (9)

where SE = \sqrt{ \frac{MSE}{2} \left( \frac{1}{n_i} + \frac{1}{n_j} \right) } is the standard error. The critical value q_critical = q_{α, k, N−k} can be found in a table of values and is used to make a decision for each pairwise comparison:

    if |q_obs| < q_critical, then do not reject H₀
    if |q_obs| ≥ q_critical, then reject H₀
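In practice these are rarely computed by hand; statsmodels provides the Tukey HSD procedure directly. A minimal sketch with hypothetical group labels and data:

```python
import numpy as np
from statsmodels.stats.multicomp import pairwise_tukeyhsd

rng = np.random.default_rng(7)
values = np.concatenate([rng.normal(10.0, 0.3, 20),
                         rng.normal(10.2, 0.3, 20),
                         rng.normal(10.1, 0.3, 20)])
groups = np.repeat(["plant_a", "plant_b", "plant_c"], 20)

# Performs all pairwise comparisons at a 5% familywise error rate
print(pairwise_tukeyhsd(values, groups, alpha=0.05))
```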

Dunnett's test

When we are only interested in comparing k treatments against a control (for a total of k + 1 groups), Dunnett's test is the preferred post-hoc analysis. Observations are allocated as n for each treatment and n_control = n√k for the control group. We are interested in testing the hypothesis:

    H₀: µ_i = µ_control    vs.    H₁: µ_i ≠ µ_control

(For example, if we choose a total sample size of N = 60 with k = 4 treatments, then each treatment should have n = N/(k + √k) = 60/(4 + √4) = 10 observations and the control should have n_control = n√k = 10 · √4 = 20 observations.)

The test statistic is calculated as:

    q_{obs} = \frac{\bar{X}_{control} - \bar{X}_i}{SE}    (10)

where SE = \sqrt{ MSE \left( \frac{1}{n_{control}} + \frac{1}{n_i} \right) } is the standard error. The critical value q_critical = q_{α, k+1, N−k−1} can be found in a table of values (note that this is not the same q_critical as the one for Tukey's HSD test) and is used to make a decision for each pairwise comparison with the control:

    if |q_obs| < q_critical, then do not reject H₀
    if |q_obs| ≥ q_critical, then reject H₀

Scheffé's test

This is the most flexible multiple testing procedure, as it allows for comparing any number of possible contrasts. If only pairwise comparisons are to be made, the Tukey-Kramer method will result in narrower confidence limits and is preferable. In the general case, when many or all contrasts might be of interest, Scheffé's test tends to give narrower confidence limits and is therefore the recommended method.

Some of the possible null hypotheses for Scheffé's test include:

    H₀: µ_i = µ_j
    H₀: µ₃ − (µ₁ + µ₂)/2 = 0
    H₀: (µ₃ + µ₄ + µ₅)/3 − (µ₁ + µ₂)/2 = 0

For an arbitrary contrast C = \sum_{i=1}^{k} c_i \mu_i where \sum_{i=1}^{k} c_i = 0, the test statistic is calculated as:

    S_{obs} = \frac{\sum c_i \bar{X}_i}{SE}    (11)

where SE = \sqrt{ MSE \sum \frac{c_i^2}{n_i} } is the standard error. The critical value S_critical = \sqrt{(k - 1) F_{\alpha, k-1, N-k}} is calculated and used to make a decision:

    if |S_obs| < S_critical, then do not reject H₀
    if |S_obs| ≥ S_critical, then reject H₀
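A sketch of Equation 11 for a single contrast, computing the MSE from a one-way ANOVA by hand; the groups and the contrast coefficients are hypothetical.

```python
import numpy as np
from scipy import stats

groups = [np.random.default_rng(s).normal(m, 0.3, 15)
          for s, m in [(8, 10.0), (9, 10.1), (10, 10.4)]]
k, N = len(groups), sum(len(g) for g in groups)

# Mean squared error from the one-way ANOVA (within-group sum of squares)
mse = sum(((g - g.mean()) ** 2).sum() for g in groups) / (N - k)

# Contrast: group 3 vs. the average of groups 1 and 2
c = np.array([-0.5, -0.5, 1.0])
means = np.array([g.mean() for g in groups])
ns = np.array([len(g) for g in groups])

s_obs = (c @ means) / np.sqrt(mse * (c**2 / ns).sum())
s_crit = np.sqrt((k - 1) * stats.f.ppf(0.95, k - 1, N - k))  # alpha = 0.05
print(f"|S_obs| = {abs(s_obs):.2f}, S_critical = {s_crit:.2f}")
```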

References and Recommended Reading

DeGroot MH and Schervish MJ (2011). Probability and Statistics, 4th edn. Addison-Wesley, Boston.

Hogg RV, Tanis E, and Zimmermann D (2014). Probability and Statistical Inference, 9th edn. Pearson.

Lehmann EL and Romano JP (2005). Testing Statistical Hypotheses, 3rd edn. Springer, New York.
