6 Sample Size and Power

The question of the size of the sample, the number of observations to be used in scientific
experiments, is of extreme importance. Most experiments raise the question of sample size.
Particularly when time and cost are critical factors, one wishes to use the minimum sample size
to achieve the experimental objectives. Even when time and cost are less crucial, the scientist
wishes to have some idea of the number of observations needed to yield sufficient data to answer
the objectives. An elegant experiment will make the most of the resources available, resulting
in a sufficient amount of information from a minimum sample size. For simple comparative
experiments, where one or two groups are involved, the calculation of sample size is relatively
simple. Knowledge of the α level (level of significance), the β level (1 − power), the standard
deviation, and a meaningful "practically significant" difference is necessary in order to calculate
the sample size.
Power is defined as 1 − β (i.e., β = 1 − power). Power is the ability of a statistical test to
show significance if a specified difference truly exists. The magnitude of power depends on the
level of significance, the standard deviation, and the sample size. Thus power and sample size
are related.
In this chapter, we present methods for computing the sample size for relatively simple
situations for normally distributed and binomial data. The concept and calculation of power
are also introduced.

6.1 INTRODUCTION
The question of sample size is a major consideration in the planning of experiments, but may not
be answered easily from a scientific point of view. In some situations, the choice of sample size
is limited. Sample size may be dictated by official specifications, regulations, cost constraints,
and/or the availability of sampling units such as patients, manufactured items, animals, and
so on. The USP content uniformity test is an example of a test in which the sample size is fixed
and specified [1].
The sample size is also specified in certain quality control sampling plans such as those
described in MIL-STD-105E [2]. These sampling plans are used when sampling products for
inspection for attributes such as product defects, missing labels, specks in tablets, or ampul
leakage. The properties of these plans have been thoroughly investigated and defined as described
in the document cited above. The properties of the plans include the chances (probability) of
rejecting or accepting batches with a known proportion of rejects in the batch (sect. 12.3).
Sample-size determination in comparative clinical trials is a factor of major importance.
Since very large experiments will detect very small, perhaps clinically insignificant, differences
as being statistically significant, and small experiments will often find large, clinically significant
differences as statistically insignificant, the choice of an appropriate sample size is critical in the
design of a clinical program to demonstrate safety and efficacy. When cost is a major factor in
implementing a clinical program, the number of patients to be included in the studies may be
limited by lack of funds. With fewer patients, a study will be less sensitive. Decreased sensitivity
means that the comparative treatments will be relatively more difficult to distinguish statistically
if they are, in fact, different.
The problem of choosing a “correct” sample size is related to experimental objectives and
the risk (or probability) of coming to an incorrect decision when the experiment and analysis
are completed. For simple comparative experiments, certain prior information is required in


order to compute a sample size that will satisfy the experimental objectives. The following
considerations are essential when estimating sample size.
1. The α level must be specified; it determines, in part, the difference needed to represent a
statistically significant result. To review, the α level is defined as the risk of concluding that
treatments differ when, in fact, they are the same. The level of significance is usually (but
not always) set at the traditional value of 5%.
2. The β error must be specified for some specified treatment difference, Δ. Beta, β, is the risk
(probability) of erroneously concluding that the treatments are not significantly different
when, in fact, a difference of size Δ or greater exists. The assessment of β and Δ, the
"practically significant" difference, prior to the initiation of the experiment, is not easy.
Nevertheless, an educated guess is required. β is often chosen to be between 5% and 20%.
Hence, one may be willing to accept a 20% (1 in 5) chance of not arriving at a statistically
significant difference when the treatments are truly different by an amount equal to (or
greater than) Δ. The consequences of committing a β error should be considered carefully.
If a true difference of practical significance is missed and the consequence is costly, β should
be made very small, perhaps as small as 1%. Costly consequences of missing an effective
treatment should be evaluated not only in monetary terms, but should also include public
health issues, such as the possible loss of an effective treatment in a serious disease.
3. The difference to be detected, Δ (that difference considered to have practical significance),
should be specified as described in (2) above. This difference should not be arbitrarily or
capriciously determined, but should be considered carefully with respect to meaningfulness
from both a scientific and commercial marketing standpoint. For example, when comparing
two formulas for time to 90% dissolution, a difference of one or two minutes might be
considered meaningless. A difference of 10 or 20 minutes, however, may have practical
consequences in terms of in vivo absorption characteristics.
4. A knowledge of the standard deviation (or an estimate) for the significance test is necessary.
If no information on variability is available, an educated guess, or results of studies reported
in the literature using related compounds, may be sufficient to give an estimate of the
relevant variability. The assistance of a statistician is recommended when estimating the
standard deviation for purposes of determining sample size.
To compute the sample size in a comparative experiment, (a) α, (b) β, (c) Δ, and (d) σ
must be specified. The computations to determine sample size are described below (Fig. 6.1).

6.2 DETERMINATION OF SAMPLE SIZE FOR SIMPLE COMPARATIVE EXPERIMENTS FOR NORMALLY DISTRIBUTED VARIABLES
The calculation of sample size will be described with the aid of Figure 6.1. This explanation is
based on normal distribution or t tests. The derivation of sample-size determination may appear
complex. The reader not requiring a “proof” can proceed directly to the appropriate formulas
below.

Figure 6.1 Scheme to demonstrate calculation of sample size based on α, β, Δ, and σ: α = 0.05, β = 0.10,
Δ = 5, σ = 7; H0: Δ = 0, Ha: Δ = 5.


6.2.1 Paired-Sample and Single-Sample Tests


We will first consider the case of a paired-sample test where the null hypothesis is that the
two treatment means are equal: H0: Δ = 0. In the case of an experiment comparing a new
antihypertensive drug candidate and a placebo, an average difference of 5 mm Hg in blood
pressure reduction might be considered of sufficient magnitude to be interpreted as a difference
of "practical significance" (Δ = 5). The standard deviation for the comparison was known, equal
to 7, based on a large amount of experience with this drug.
In Figure 6.1, the normal curve labeled A represents the distribution of differences with
mean equal to 0 and σ equal to 7. This is the distribution under the null hypothesis (i.e., drug
and placebo are identical). Curve B is the distribution of differences when the alternative, Ha:
Δ = 5,* is true (i.e., the difference between drug and placebo is equal to 5). Note that curve B is
identical to curve A except that B is displaced 5 mm Hg to the right. Both curves have the same
standard deviation, 7.
With the standard deviation, 7, known, the statistical test is performed at the 5% level as
follows [Eq. (5.4)]:

$$Z = \frac{\delta - 0}{\sigma/\sqrt{N}} = \frac{\delta - 0}{7/\sqrt{N}}. \qquad (6.1)$$

For a two-tailed test, if the absolute value of Z is 1.96 or greater, the difference is significant.
According to Eq. (6.1), to obtain significance,

$$\delta \ge \frac{\sigma Z}{\sqrt{N}} = \frac{7(1.96)}{\sqrt{N}} = \frac{13.7}{\sqrt{N}}. \qquad (6.2)$$
Therefore, values of δ equal to or greater than 13.7/√N (or equal to or less than −13.7/√N)
will lead to a declaration of significance. These points are designated as δ_L and δ_U in Figure 6.1,
and represent the cutoff points for statistical significance at the 5% level; that is, observed
differences equal to or more remote from the mean than these values result in "statistically
significant differences."
If curve B is the true distribution (i.e., Δ = 5), an observed mean difference greater than
13.7/√N (or less than −13.7/√N) will result in the correct decision; H0 will be rejected and we
conclude that a difference exists. If Δ = 5, observations of a mean difference between 13.7/√N
and −13.7/√N will lead to an incorrect decision, the acceptance of H0 (no difference) (Fig. 6.1).
By definition, the probability of making this incorrect decision is equal to β.
In the present example, β will be set at 10%. In Figure 6.1, β is represented by the area in
curve B below 13.7/√N (δ_U), equal to 0.10. (This area, β, represents the probability of accepting
H0 if Δ = 5.)
We will now compute the value of δ that cuts off 10% of the area in the lower tail of the normal
curve with a mean of 5 and a standard deviation of 7 (curve B in Fig. 6.1). Table IV.2 shows
that 10% of the area in the standard normal curve is below −1.28. The value of δ (mean difference
in blood pressure between the two groups) that corresponds to a given value of Z (−1.28, in this
example) is obtained from the formula for the Z transformation [Eq. (3.14)] as follows:

$$\delta = \Delta + Z_\beta\,\frac{\sigma}{\sqrt{N}}$$

$$Z_\beta = \frac{\delta - \Delta}{\sigma/\sqrt{N}}. \qquad (6.3)$$

Applying Eq. (6.3) to our present example, δ = 5 − 1.28(7/√N). The value of δ in Eqs. (6.2)
and (6.3) is identically the same, equal to δ_U. This is illustrated in Figure 6.1.

* Δ is considered to be the true mean difference, similar to μ. δ will be used to denote the observed mean difference.


Table 6.1 Sample Size as a Function of Beta with Δ = 5 and σ = 7: Paired Test (α = 0.05)

Beta (%)    Sample size, N
1           36
5           26
10          21
20          16


From Eq. (6.2), δ_U = 13.7/√N, satisfying the definition of α. From Eq. (6.3), δ_U = 5 −
1.28(7)/√N, satisfying the definition of β. We have two equations in two unknowns (δ_U and N),
and N is evaluated as follows:

$$\frac{13.7}{\sqrt{N}} = 5 - \frac{1.28(7)}{\sqrt{N}}$$

$$N = \frac{(13.7 + 8.96)^2}{5^2} = 20.5 \cong 21.$$

In general, Eqs. (6.2) and (6.3) can be solved for N to yield the following equation:

$$N = \left(\frac{\sigma}{\Delta}\right)^2 (Z_\alpha + Z_\beta)^2, \qquad (6.4)$$

where Z_α and Z_β† are the appropriate normal deviates obtained from Table IV.2. In our example,
N = (7/5)²(1.96 + 1.28)² ≅ 21. A sample size of 21 will result in a statistical test with 90% power
(β = 10%) against an alternative of 5, at the 5% level of significance. Table 6.1 shows how the
choice of β can affect the sample size for a test at the 5% level with Δ = 5 and σ = 7.
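Equation (6.4) is easy to program as a check on hand calculations. The following is a minimal sketch in Python (the function name and the use of scipy's normal quantiles are our own choices, not from the text); exact quantiles can give an N one unit different from the two-decimal Z arithmetic above.

```python
from math import ceil
from scipy.stats import norm

def n_paired(delta, sigma, alpha=0.05, beta=0.10):
    """Eq. (6.4): N = (sigma/delta)^2 * (Z_alpha + Z_beta)^2
    for a paired- or single-sample test with two-sided alpha."""
    z_alpha = norm.ppf(1 - alpha / 2)  # two-sided Z_alpha (Table 6.2)
    z_beta = norm.ppf(1 - beta)        # one-sided Z_beta
    return ceil((sigma / delta) ** 2 * (z_alpha + z_beta) ** 2)

# Reproduces Table 6.1 (delta = 5, sigma = 7):
for beta in (0.01, 0.05, 0.10, 0.20):
    print(beta, n_paired(5, 7, beta=beta))
# -> 37, 26, 21, 16 (the text's Table 6.1 lists 36 for beta = 0.01,
#    because it uses the rounded value Z = 2.32)
```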
The formula for computing the sample size if the standard deviation is known [Eq. (6.4)]
is appropriate for a paired-sample test or for the test of a mean from a single population. For
example, consider a test to compare the mean drug content of a sample of tablets to the labeled
amount, 100 mg. The two-sided test is to be performed at the 5% level. Beta is designated as
10% for a difference of −5 mg (95 mg potency or less). That is, we wish to have a power of 90%
to detect a difference from 100 mg if the true potency is 95 mg or less. If ␴ is equal to 3, how
many tablets should be assayed? Applying Eq. (6.4), we have
$$N = \left(\frac{3}{5}\right)^2 (1.96 + 1.28)^2 = 3.8.$$

Assaying four tablets will satisfy the α and β probabilities. Note that Z = 1.28 cuts off 90%
of the area under curve B (the "alternative" curve) in Figure 6.2, leaving 10% (β) of the area in
the upper tail of the curve. Table 6.2 shows values of Z_α and Z_β for various levels of α and β
to be used in Eq. (6.4). In this example, and most examples in practice, β is based on one tail of
the normal curve. The other tail contains an insignificant area relating to β (the right side of the
normal curve, B, in Fig. 6.1).
Equation (6.4) is correct for computing the sample size for a paired- or one-sample test if
the standard deviation is known.
In most situations, the standard deviation is unknown and a prior estimate of the standard
deviation is necessary in order to calculate sample size requirements. In this case, the estimate
of the standard deviation replaces σ in Eq. (6.4), but the calculation results in an answer that is
slightly too small. The underestimation occurs because the values of Z_α and Z_β are smaller than

† Z_β is taken as the positive value of Z in this formula.



Table 6.2 Values of Z_α and Z_β for Sample-Size Calculations

α or β    Z_α (one-sided)   Z_α (two-sided)   Z_β^a
1%        2.32              2.58              2.32
5%        1.65              1.96              1.65
10%       1.28              1.65              1.28
20%       0.84              1.28              0.84

^a The value of β is for a single specified alternative. For a two-sided test,
the probability of rejecting the alternative, if true (i.e., accepting H0), is virtually
all contained in the tail nearest the alternative mean.

Figure 6.2 Illustration of the calculation of N for tablet assays: X = 95 + σZ_β/√N = 100 − σZ_α/√N.

the corresponding t values that should be used in the formula when the standard deviation is
unknown. The situation is somewhat complicated by the fact that the value of t depends on the
sample size (d.f.), which is yet unknown. The problem can be solved by an iterative method,
but for practical purposes, one can use the appropriate values of Z to compute the sample size
[as in Eq. (6.4)] and add on a few extra samples (patients, tablets, etc.) to compensate for the
use of Z rather than t. Guenther has shown that the simple addition of 0.5Z_α², which is equal
to approximately 2 for a two-sided test at the 5% level, results in a very close approximation to
the correct answer [3]. In the problem illustrated above (tablet assays), if the standard deviation
were unknown but estimated as being equal to 3 based on previous experience, a better estimate
of the sample size would be N + 0.5Z_α² = 3.8 + 0.5(1.96)² ≅ 6 tablets.
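The iterative method mentioned above can also be sketched in a few lines (our own construction, assuming scipy): replace Z by t in Eq. (6.4) and repeat until N stabilizes, since the degrees of freedom depend on N itself.

```python
from math import ceil
from scipy.stats import t

def n_one_sample_t(delta, sd, alpha=0.05, beta=0.10):
    """Sample size with sigma estimated: use t in place of Z in
    Eq. (6.4) and iterate, because the d.f. depend on N."""
    n = 4  # any small starting guess; the Z-based answer also works
    for _ in range(100):  # fixed-point iteration; converges quickly here
        df = max(n - 1, 1)
        t_a = t.ppf(1 - alpha / 2, df)  # two-sided alpha
        t_b = t.ppf(1 - beta, df)       # one-sided beta
        n_new = ceil((sd / delta) ** 2 * (t_a + t_b) ** 2)
        if n_new == n:
            break
        n = n_new
    return n

print(n_one_sample_t(5, 3))  # -> 6 tablets, agreeing with the
                             #    0.5*Z_alpha^2 correction above
```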

6.2.2 Determination of Sample Size for Comparison of Means in Two Groups


For a two independent groups test (parallel design), with the standard deviation known and
an equal number of observations per group, the formula for N (where N is the sample size for each
group) is

$$N = 2\left(\frac{\sigma}{\Delta}\right)^2 (Z_\alpha + Z_\beta)^2. \qquad (6.5)$$

If the standard deviation is unknown and a prior estimate is available (s.d.), substitute the
s.d. for σ in Eq. (6.5) and compute the sample size, but add on 0.25Z_α² to the sample size for each
group.
Example 1: This example illustrates the determination of the sample size for a two independent
groups (two-sided test) design. Two variations of a tablet formulation are to be compared
with regard to dissolution time. All ingredients except for the lubricating agent were the same
in these two formulations. In this case, a decision was made that if the formulations differed by
10 minutes or more to 80% dissolution, it would be extremely important that the experiment
shows a statistically significant difference between the formulations. Therefore, the pharmaceutical
scientist decided to fix the β error at 1% in a statistical test at the traditional 5% level. Data
were available from dissolution tests run during the development of formulations of the drug

and the standard deviation was estimated as 5 minutes. With the information presented above,
the sample size can be determined from Eq. (6.5). We will add 0.25Z_α² samples to the answer
because the standard deviation is unknown.

$$N = 2\left(\frac{5}{10}\right)^2 (1.96 + 2.32)^2 + 0.25(1.96)^2 = 10.1.$$

The study was performed using 12 tablets from each formulation rather than the 10 or
11 suggested by the answer in the calculation above. Twelve tablets were used because the
dissolution apparatus could accommodate six tablets per run.
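As a check on Example 1, a hypothetical helper implementing Eq. (6.5) with the 0.25Z_α² add-on might look as follows (a sketch of our own, not the text's code; exact quantiles return 11 where the text's rounded Z values give 10.1):

```python
from math import ceil
from scipy.stats import norm

def n_two_groups(delta, sd, alpha=0.05, beta=0.01):
    """Eq. (6.5) per-group N for two independent groups, plus the
    0.25*Z_alpha^2 add-on used when sigma is estimated."""
    z_a = norm.ppf(1 - alpha / 2)
    z_b = norm.ppf(1 - beta)
    n = 2 * (sd / delta) ** 2 * (z_a + z_b) ** 2 + 0.25 * z_a ** 2
    return ceil(n)

# Example 1: delta = 10 minutes, sd = 5, alpha = 0.05, beta = 0.01
print(n_two_groups(10, 5))  # -> 11 (text: 10.1 using Z = 2.32)
```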
Example 2: A bioequivalence study was being planned to compare the bioavailability of a
final production batch to a previously manufactured pilot-sized batch of tablets that were made
for clinical studies. Two parameters resulting from the blood-level data would be compared:
area under the plasma level versus time curves (AUC) and peak plasma concentration (Cmax ).
The study was to have 80% power (β = 0.20) to detect a difference of 20% or more between the
formulations. The test is done at the usual 5% level of significance. Estimates of the standard
deviations of the ratios of the values of each of the parameters [(final product)/(pilot batch)]
were determined from a small pilot study. The standard deviations were different for the
parameters. Since the researchers could not agree that one of the parameters was clearly critical
in the comparison, they decided to use a “maximum” number of patients based on the variable
with the largest relative variability. In this example, Cmax was most variable, the ratio having a
standard deviation of approximately 0.30. Since the design and analysis of the bioequivalence
study is a variation of the paired t test, Eq. (6.4) was used to calculate the sample size, adding
on 0.5Z_α², as recommended previously:

$$N = \left(\frac{\sigma}{\Delta}\right)^2 (Z_\alpha + Z_\beta)^2 + 0.5Z_\alpha^2 = \left(\frac{0.3}{0.2}\right)^2 (1.96 + 0.84)^2 + 0.5(1.96)^2 = 19.6. \qquad (6.6)$$

Twenty subjects were used for the comparison of the bioavailabilities of the two
formulations.
For sample-size determination for bioequivalence studies using FDA recommended
designs, see Table 6.5 and section 11.4.4.
Sometimes the sample sizes computed to satisfy the desired α and β errors can be
inordinately large when time and cost factors are taken into consideration. Under these circumstances,
a compromise must be made, most easily accomplished by relaxing the α and β requirements‡
(Table 6.1). The consequence of this compromise is that the probabilities of making an incorrect
decision based on the statistical test will be increased. Other ways of reducing the required
sample size are (a) to increase the precision of the test by improving the assay methodology
or carefully controlling extraneous conditions during the experiment, for example, or (b) to
compromise by increasing Δ, that is, accepting a larger difference that one considers to be of
practical importance.
Table 6.3 gives the sample size for some representative values of the ratio σ/Δ, α, and β,
where the s.d. (s) is estimated.

6.3 DETERMINATION OF SAMPLE SIZE FOR BINOMIAL TESTS


The formulas for calculating the sample size for comparative binomial tests are similar to those
described for normal curve or t tests. The major difference is that the value of σ², which is
assumed to be the same under H0 and Ha in the two-sample independent groups t or Z tests,
is different for the distributions under H0 and Ha in the binomial case. This difference occurs
because σ² is dependent on P, the probability of success, in the binomial. The value of P will

‡ In practice, α is often fixed by regulatory considerations and β is determined as a compromise.


Table 6.3 Sample Size Needed for Two-Sided t Test with Standard Deviation Estimated

One-sample test:

              Alpha = 0.05                Alpha = 0.01
Estimated     Beta:                       Beta:
s/Δ           0.01   0.05   0.10   0.20   0.01   0.05   0.10   0.20
4.0           296    211    170    128    388    289    242    191
2.0           76     54     44     34     100    75     63     51
1.5           44     32     26     20     58     54     37     30
1.0           21     16     13     10     28     22     19     16
0.8           14     11     9      8      19     15     13     11
0.67          11     8      7      6      15     12     11     9
0.5           7      6      5      4      10     8      8      7
0.4           6      5      4      4      8      7      6      6
0.33          5      4      4      3      7      6      6      5

Two-sample test (N units per group):

              Alpha = 0.05                Alpha = 0.01
Estimated     Beta:                       Beta:
s/Δ           0.01   0.05   0.10   0.20   0.01   0.05   0.10   0.20
4.0           588    417    337    252    770    572    478    376
2.0           148    106    86     64     194    145    121    96
1.5           84     60     49     37     110    82     69     55
1.0           38     27     23     17     50     38     32     26
0.8           25     18     15     12     33     25     21     17
0.67          18     13     11     9      24     18     15     13
0.5           11     8      7      6      14     11     10     8
0.4           8      6      5      4      10     8      7      6
0.33          6      5      4      4      8      6      6      5

be different depending on whether H0 or Ha represents the true situation. The appropriate
formulas for determining sample size for the one- and two-sample tests are as follows.
One-sample test:

$$N = \frac{1}{2}\,\frac{p_0 q_0 + p_1 q_1}{\Delta^2}\,(Z_\alpha + Z_\beta)^2, \qquad (6.7)$$

where Δ = p1 − p0; p1 is the proportion that would result in a meaningful difference, and p0 is
the hypothetical proportion under the null hypothesis.
Two-sample test:

$$N = \frac{p_1 q_1 + p_2 q_2}{\Delta^2}\,(Z_\alpha + Z_\beta)^2, \qquad (6.8)$$

where Δ = p1 − p2; p1 and p2 are prior estimates of the proportions in the experimental groups.
The values of Z_α and Z_β are the same as those used in the formulas for the normal curve or
t tests. N is the sample size for each group. If it is not possible to estimate p1 and p2 prior to
the experiment, one can make an educated guess of a meaningful value of Δ and set p1 and p2
both equal to 0.5 in the numerator of Eq. (6.8). This will maximize the sample size, resulting in a
conservative estimate of sample size.
Fleiss [4] gives a fine discussion of an approach to estimating Δ, the practically significant
difference, when computing the sample size. For example, one approach is first to estimate
the proportion for the more well-studied treatment group. In the case of a comparative clinical
study, this could very well be a standard treatment. Suppose this treatment has shown a success
rate of 50%. One might argue that if the comparative treatment is additionally successful for 30%
of the patients who do not respond to the standard treatment, then the experimental treatment
would be valuable. Therefore, the success rate for the experimental treatment should be 50% +
0.3 (50%) = 65% to show a practically significant difference. Thus, p1 would be equal to 0.5 and
p2 would be equal to 0.65.
Example 3: A reconciliation of quality control data over several years showed that the
proportion of unacceptable capsules for a stable encapsulation process was 0.8% (p0). A sample
size for inspection is to be determined so that if the true proportion of unacceptable capsules
is equal to or greater than 1.2% (Δ = 0.4%), the probability of detecting this change is 80%
(β = 0.2). The comparison is to be made at the 5% level using a one-sided test. According to
Eq. (6.7),

$$N = \frac{1}{2}\,\frac{0.008 \cdot 0.992 + 0.012 \cdot 0.988}{(0.008 - 0.012)^2}\,(1.65 + 0.84)^2 = \frac{7670}{2} = 3835.$$

The large sample size resulting from this calculation is typical of that resulting from
binomial data. If 3835 capsules are too many to inspect, α, β, and/or Δ must be increased. In
the example above, management decided to increase α. This is a conservative decision in that
more good batches would be "rejected" if α is increased; that is, the increase in α results in an
increased probability of rejecting good batches, those with 0.8% unacceptable or less.
Example 4: Two antibiotics, a new product and a standard product, are to be compared
with respect to the two-week cure rate of a urinary tract infection, where a cure is bacteriological
evidence that the organism no longer appears in urine. From previous experience, the cure rate
for the standard product is estimated at 80%. From a practical point of view, if the new product
shows an 85% or better cure rate, the new product can be considered superior. The marketing

division of the pharmaceutical company felt that this difference would support claims of better
efficacy for the new product. This is an important claim. Therefore, β is chosen to be 1% (power
= 99%). A two-sided test will be performed at the 5% level to satisfy FDA guidelines. The test
is two-sided because, a priori, the new product is not known to be better or worse than the
standard. The calculation of sample size to satisfy the conditions above makes use of Eq. (6.8);
here p1 = 0.8 and p2 = 0.85.

$$N = \frac{0.80 \cdot 0.2 + 0.85 \cdot 0.15}{(0.80 - 0.85)^2}\,(1.96 + 2.32)^2 = 2107.$$

The trial would have to include 4214 patients, 2107 on each drug, to satisfy the α and
β risks of 0.05 and 0.01, respectively. If this number of patients is greater than can be
accommodated, the β error can be increased to 5% or 10%, for example. A sample size of 1499
per group is obtained for a β of 5%, and 1207 patients per group for β equal to 10%.
Although Eq. (6.8) is adequate for computing the sample size for most situations, the
calculation of N can be improved by considering the continuity correction [4]. This would be
particularly important for small sample sizes:

$$N' = \frac{N}{4}\left(1 + \sqrt{1 + \frac{8}{N\,|p_2 - p_1|}}\right)^2,$$

where N is the sample size computed from Eq. (6.8) and N' is the corrected sample size. In the
example, for α = 0.05 and β = 0.01, the corrected sample size is

$$N' = \frac{2107}{4}\left(1 + \sqrt{1 + \frac{8}{2107\,|0.80 - 0.85|}}\right)^2 = 2186.$$
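Equations (6.7) and (6.8), with the optional continuity correction, can be sketched as follows (illustrative functions of our own; exact normal quantiles give answers that differ slightly from the text's two-decimal Z arithmetic):

```python
from math import ceil, sqrt
from scipy.stats import norm

def z_values(alpha, beta, two_sided=True):
    z_a = norm.ppf(1 - alpha / 2) if two_sided else norm.ppf(1 - alpha)
    return z_a, norm.ppf(1 - beta)

def n_one_sample(p0, p1, alpha=0.05, beta=0.20, two_sided=False):
    """Eq. (6.7): N = 0.5*(p0*q0 + p1*q1)/delta^2 * (Z_a + Z_b)^2."""
    z_a, z_b = z_values(alpha, beta, two_sided)
    return ceil(0.5 * (p0*(1-p0) + p1*(1-p1)) / (p1 - p0)**2
                * (z_a + z_b)**2)

def n_two_sample(p1, p2, alpha=0.05, beta=0.01, corrected=False):
    """Eq. (6.8) per-group N, with Fleiss's continuity correction."""
    z_a, z_b = z_values(alpha, beta)
    n = (p1*(1-p1) + p2*(1-p2)) / (p1 - p2)**2 * (z_a + z_b)**2
    if corrected:
        n = n / 4 * (1 + sqrt(1 + 8 / (n * abs(p2 - p1))))**2
    return ceil(n)

print(n_one_sample(0.008, 0.012))  # Example 3 -> 3824 (text: 3835)
print(n_two_sample(0.80, 0.85))    # Example 4 -> 2113 (text: 2107)
print(n_two_sample(0.80, 0.85, corrected=True))  # -> 2193 (text: 2186)
```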

6.4 DETERMINATION OF SAMPLE SIZE TO OBTAIN A CONFIDENCE INTERVAL OF SPECIFIED WIDTH
The problem of estimating the number of samples needed to estimate the mean with a known
precision by means of the confidence interval is easily solved by using the formula for the
confidence interval (see sect. 5.1). This approach has been used as an aid in predicting election
results based on preliminary polls where the samples are chosen by simple random sampling.
For example, one may wish to estimate the proportion of voters who will vote for candidate A
within 1% of the actual proportion.
We will consider the application of this problem to the estimation of proportions. In
quality control, one can closely estimate the true proportion of percent defects to any given
degree of precision. In a clinical study, a suitable sample size may be chosen to estimate the
true proportion of successes within certain specified limits. According to Eq. (5.3), a two-sided
confidence interval with confidence coefficient p for a proportion is

$$\hat{p} \pm Z\sqrt{\frac{\hat{p}\hat{q}}{N}}.$$

To obtain a 99% confidence interval with a width of 0.01 (i.e., construct an interval that is
within ±0.005 of the observed proportion, p̂ ± 0.005),

$$Z_p\sqrt{\frac{\hat{p}\hat{q}}{N}} = 0.005$$

or

$$N = \frac{Z_p^2\,\hat{p}\hat{q}}{(W/2)^2} \qquad (6.9)$$

$$N = \frac{(2.58)^2\,\hat{p}\hat{q}}{(0.005)^2}.$$

A more exact formula for the sample size for small values of N is given in Ref. [5].
Example 5: A quality control supervisor wishes to have an estimate of the proportion of
tablets in a batch that weigh between 195 and 205 mg, where the proportion of tablets in this
interval is to be estimated within ±0.05 (W = 0.10). How many tablets should be weighed? Use
a 95% confidence interval.
To compute N, we must have an estimate of p̂ [see Eq. (6.9)]. If p̂ and q̂ are chosen to
be equal to 0.5, N will be at a maximum. Thus, if one has no inkling as to the magnitude of
the outcome, using p̂ = 0.5 in Eq. (6.9) will result in a sufficiently large sample size (probably,
too large). Otherwise, estimate p̂ and q̂ based on previous experience and knowledge. In the
present example from previous experience, approximately 80% of the tablets are expected to
weigh between 195 and 205 mg ( p̂ = 0.8). Applying Eq. (6.9),

$$N = \frac{(1.96)^2 (0.8)(0.2)}{(0.10/2)^2} = 245.9.$$

A total of 246 tablets should be weighed. In the actual experiment, 250 tablets were
weighed, and 195 of the tablets (78%) weighed between 195 and 205 mg. The 95% confidence
interval for the true proportion, according to Eq. (5.3), is

$$\hat{p} \pm 1.96\sqrt{\frac{\hat{p}\hat{q}}{N}} = 0.78 \pm 1.96\sqrt{\frac{(0.78)(0.22)}{250}} = 0.78 \pm 0.051.$$

The interval is slightly greater than ±5% because p is somewhat less than 0.8 (pq is larger
for p = 0.78 than for p = 0.8). Although 5.1% is acceptable, to ensure a sufficient sample size, in
general, one should estimate p closer to 0.5 in order to cover possible poor estimates of p.
If p̂ had been chosen equal to 0.5, we would have calculated

$$N = \frac{(1.96)^2 (0.5)(0.5)}{(0.10/2)^2} = 384.2.$$
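A sketch of Eq. (6.9) in Python (an illustrative helper of our own, assuming scipy):

```python
from math import ceil
from scipy.stats import norm

def n_for_ci_width(width, p_hat=0.5, confidence=0.95):
    """Eq. (6.9): N = Z^2 * p*q / (W/2)^2, so that a two-sided CI for
    a proportion has total width W.  p_hat = 0.5 maximizes N and is
    the conservative choice when nothing is known about p."""
    z = norm.ppf(1 - (1 - confidence) / 2)
    return ceil(z**2 * p_hat * (1 - p_hat) / (width / 2)**2)

print(n_for_ci_width(0.10, p_hat=0.8))  # Example 5 -> 246
print(n_for_ci_width(0.10, p_hat=0.5))  # conservative -> 385 (384.2 rounded up)
```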

Example 6: A new vaccine is to undergo a nationwide clinical trial. An estimate is desired
of the proportion of the population that would be afflicted with the disease after vaccination. A
good guess of the expected proportion of the population diseased without vaccination is 0.003.
Pilot studies show that the incidence will be about 0.001 (0.1%) after vaccination. What size
sample is needed so that the width of a 99% confidence interval for the proportion diseased in
the vaccinated population should be no greater than 0.0002? To ensure that the sample size is
sufficiently large, the value of p to be used in Eq. (6.9) is chosen to be 0.0012, rather than the
expected 0.0010.

$$N = \frac{(2.58)^2 (0.9988)(0.0012)}{(0.0002/2)^2} = 797{,}809.$$

The trial will have to include approximately 800,000 subjects in order to yield the desired
precision.

6.5 POWER
Power is the probability that the statistical test results in rejection of H0 when a specified
alternative is true. The "stronger" the power, the better the chance that the null hypothesis will
be rejected (i.e., the test results in a declaration of "significance") when, in fact, H0 is false. The
larger the power, the more sensitive the test. Power is defined as 1 − β. The larger the β error,
the weaker the power. Remember that β is the error resulting from accepting H0 when H0 is
false. Therefore, 1 − β is the probability of rejecting H0 when H0 is false.
From an idealistic point of view, the power of a test should be calculated before an experiment
is conducted. In addition to defining the properties of the test, power is used to help
compute the sample size, as discussed above. Unfortunately, many experiments proceed without
consideration of power (or β). This results from the difficulty of choosing an appropriate
value of β. There is no traditional value of β to use, as is the case for α, where 5% is usually
used. Thus, the power of the test is often computed after the experiment has been completed.
Power is best described by diagrams such as those shown previously in this chapter
(Figs. 6.1 and 6.2). In these figures, β is the area of the curve represented by the alternative
hypothesis that is included in the region of acceptance defined by the null hypothesis.
The concept of power is also illustrated in Figure 6.3. To illustrate the calculation of power,
we will use the data presented for the test of a new antihypertensive agent (sect. 6.2), a paired-sample
test, with σ = 7 and H0: Δ = 0. The test is performed at the 5% level of significance. Let us
suppose that the sample size is limited by cost. The sponsor of the test had sufficient funds
to pay for a study that included only 12 subjects. The design described earlier in this chapter
(sect. 6.2) used 26 patients with β specified equal to 0.05 (power = 0.95). With 12 subjects,
the power will be considerably less than 0.95. The following discussion shows how power is
calculated.
The cutoff points for statistical significance (which specify the critical region) are defined
by α, N, and σ. Thus, the values of δ that will lead to a significant result for a two-sided test are
as follows:

$$Z = \frac{\delta}{\sigma/\sqrt{N}}, \qquad \delta = \frac{\pm Z\sigma}{\sqrt{N}}.$$

In our example, Z = 1.96 (α = 0.05), σ = 7, and N = 12.

$$\delta = \frac{\pm(1.96)(7)}{\sqrt{12}} = \pm 3.96.$$

Figure 6.3 Illustration of beta or power (1 − β).

Values of δ greater than 3.96 or less than −3.96 will lead to the decision that the products
differ at the 5% level. Having defined the values of δ that will lead to rejection of H0, we obtain
the power for the alternative, Ha: Δ = 5, by computing the probability that an average result, δ,
will be greater than 3.96, if Ha is true (i.e., Δ = 5).
This concept is illustrated in Figure 6.3. Curve B is the distribution with mean equal to 5
and σ = 7. If curve B is the true distribution, the probability of observing a value of δ below
3.96 is the probability of accepting H0 if the alternative hypothesis is true (Δ = 5). This is the
definition of β. This probability can be calculated using the Z transformation.

$$Z = \frac{3.96 - 5}{7/\sqrt{12}} = -0.51.$$

Referring to Table IV.2, the area below +3.96 (Z = −0.51) for curve B is approximately
0.31. The power is 1 − β = 1 − 0.31 = 0.69. The use of 12 subjects results in a power of 0.69 to
"detect" a difference of +5, compared to the 0.95 power to detect such a difference when 26
subjects were used. A power of 0.69 means that if the true difference were 5 mm Hg, the statistical
test will result in significance with a probability of 69%; 31% of the time, such a test will result in
acceptance of H0.
A power curve is a plot of the power, 1 − β, versus alternative values of Δ. Power curves can
be constructed by computing β for several alternatives and drawing a smooth curve through
these points. For a two-sided test, the power curve is symmetrical around the hypothetical
mean, Δ = 0 in our example. The power is equal to α when the alternative is equal to the
hypothetical mean under H0. Thus, the power is 0.05 where Δ = 0 in the power curve (Fig. 6.4).
The power curve for the present example is shown in Figure 6.4.
The following conclusions may be drawn concerning the power of a test if α is kept
constant:

1. The larger the sample size, the larger the power.
2. The larger the difference to be detected (Ha), the larger the power. A large sample size will
be needed in order to have strong power to detect a small difference.
3. The larger the variability (s.d.), the weaker the power.
4. If α is increased, power is increased (β is decreased) (Fig. 6.3). An increase in α (e.g., to 10%)
results in a smaller Z. The cutoff points move closer to the hypothetical mean, and the area of curve B below the cutoff
point is smaller.

Power is a function of N, Δ, σ, and α.

Figure 6.4 Power curve for N = 12, α = 0.05, σ = 7, and H0: Δ = 0.



A simple way to compute the approximate power of a test is to use the formula for sample
size [Eqs. (6.4) and (6.5), for example] and solve for Z_β. In the previous example, a single-sample
or paired test, Eq. (6.4) is appropriate:

$$N = \left(\frac{\sigma}{\Delta}\right)^2 (Z_\alpha + Z_\beta)^2 \qquad (6.4)$$

$$Z_\beta = \frac{\Delta}{\sigma}\sqrt{N} - Z_\alpha. \qquad (6.10)$$

Once having calculated Z_β, the probability determined directly from Table IV.2 is equal to
the power, 1 − β. See the discussion and examples below.
In the problem discussed above, applying Eq. (6.10) with Δ = 5, σ = 7, N = 12, and
Z_α = 1.96,

$$Z_\beta = \frac{5}{7}\sqrt{12} - 1.96 = 0.51.$$

According to the notation used for Z (Table 6.2), β is the area above Z_β. Power is the area
below Z_β (power = 1 − β). In Table IV.2, the area above Z = 0.51 is approximately 31%. The
power is 1 − β. Therefore, the power is 69%.§
If N is small and the variance is unknown, appropriate values of t should be used in place
of Z_α and Z_β. Alternatively, we can adjust N by subtracting 0.5Z_α² or 0.25Z_α² from the actual
sample size for a one- or two-sample test, respectively. The following examples should make
the calculations clearer.

§ The value corresponding to Z in Table IV.2 gives the power directly. In this example, the area in the table
corresponding to a Z of 0.51 is approximately 0.69.
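The power computation of Eq. (6.10) is a one-liner to program. A minimal sketch (our own, assuming scipy), including the N adjustment used when the standard deviation is estimated:

```python
from math import sqrt
from scipy.stats import norm

def power_paired(delta, sigma, n, alpha=0.05, sigma_known=True):
    """Eq. (6.10): Z_beta = (delta/sigma)*sqrt(N) - Z_alpha, with
    power = Phi(Z_beta).  If sigma was estimated, subtract roughly
    0.5*Z_alpha^2 from N first, as in Example 7 below."""
    z_a = norm.ppf(1 - alpha / 2)
    if not sigma_known:
        n = n - 0.5 * z_a**2
    return norm.cdf((delta / sigma) * sqrt(n) - z_a)

print(power_paired(5, 7, 12))                         # -> ~0.70 (text: ~0.69)
print(power_paired(0.2, 0.3, 18, sigma_known=False))  # Example 7 -> ~0.76
```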
Example 7: A bioavailability study has been completed in which the ratio of the AUCs for
two comparative drugs was submitted as evidence of bioequivalence. The FDA asked for the
power of the test as part of their review of the submission. (Note that this analysis is different
from that presently required by FDA.) The null hypothesis for the comparison is H0: R = 1,
where R is the true average ratio. The test was two-sided with α equal to 5%. Eighteen subjects
took each of the two comparative drugs in a paired-sample design. The standard deviation was
calculated from the final results of the study, and was equal to 0.3. The power is to be determined
for a difference of 20% for the comparison. This means that if the test product is truly more than
20% greater or smaller than the reference product, we wish to calculate the probability that the
ratio will be judged to be significantly different from 1.0. The value of Δ to be used in Eq. (6.10)
is 0.2.

$$Z_\beta = \frac{0.2}{0.3}\sqrt{16} - 1.96 = 0.707.$$

Note that the value of N is taken as 16. This is the inverse of the procedure for determining
sample size, where 0.5Z_α² was added to N. Here we subtract 0.5Z_α² (approximately 2) from N:
18 − 2 = 16. According to Table IV.2, the area corresponding to Z = 0.707 is approximately 0.76.
Therefore, the power of this test is 76%. That is, if the true difference between the formulations
is 20%, a significant difference will be found between the formulations 76% of the time. This
is very close to the 80% power that was recommended before current FDA guidelines were
implemented for bioavailability tests (where Δ = 0.2).
Example 8: A drug product is prepared by two different methods. The average tablet
weights of the two batches are to be compared, weighing 20 tablets from each batch. The average
weights of the two 20-tablet samples were 507 and 511 mg. The pooled standard deviation was
calculated to be 12 mg. The director of quality control wishes to be “sure” that if the average
weights truly differ by 10 mg or more, the statistical test will show a significant difference. When
he was asked, "How sure?", he said 95% sure. This can be translated into a β of 5% or a power
of 95%. This is a two independent groups test. Solving for Z_β from Eq. (6.5), we have

$$Z_\beta = \frac{\Delta}{\sigma}\sqrt{\frac{N}{2}} - Z_\alpha = \frac{10}{12}\sqrt{\frac{19}{2}} - 1.96 = 0.609. \qquad (6.11)$$

As discussed above, the value of N is taken as 19 rather than 20, by subtracting 0.25Z_α²
from N for the two-sample case. Referring to Table IV.2, we note that the power is approximately
73%. The experiment does not have sufficient power according to the director's standards. To
obtain the desired power, we can increase the sample size (i.e., weigh more tablets). (See Exercise
Problem 10.)

6.6 SAMPLE SIZE AND POWER FOR MORE THAN TWO TREATMENTS
(ALSO SEE CHAP. 8)
The problem of computing power or sample size for an experiment with more than two treat-
ments is somewhat more complicated than the relatively simple case of designs with two
treatments. The power will depend on the number of treatments and the form of the null
and alternative hypotheses. Dixon and Massey [5] present a simple approach to determining
power and sample size. The following notation will be used in presenting the solution to this
problem.
Let M1, M2, M3, . . . , Mk be the hypothetical population means of the k treatments. The null
hypothesis is M1 = M2 = M3 = · · · = Mk. As for the two-sample case, we must specify the alternative
values of Mi. The alternative means are expressed as a grand mean, Mt, ± some deviation, Di,
where Σ(Di) = 0. For example, if three treatments are compared for pain, Active A, Active B,
and Placebo (P), the values for the alternative hypothesized means, based on a VAS scale for
pain relief, could be 75 + 10 (85), 75 + 10 (85), and 75 − 20 (55) for the two actives and placebo,
respectively. The sum of the deviations from the grand mean, 75, is 10 + 10 − 20 = 0. The power
is computed based on the following equation:

$$\psi^2 = \frac{\sum (M_i - M_t)^2 / k}{S^2/n}, \qquad (6.12)$$

where n is the number of observations in each treatment group (n is the same for each treatment)
and S² is the common variance. The value of ψ² is referred to Table 6.4 to estimate the required
sample size.
Consider the following example of three treatments in a study measuring the analgesic
properties of two actives and a placebo as described above. Fifteen subjects are in each treatment
group and the variance is 1000. According to Eq. (6.12),

$$\psi^2 = \frac{[(85 - 75)^2 + (85 - 75)^2 + (55 - 75)^2]/3}{1000/15} = 3.0.$$

Table 6.4 gives the approximate power for various values of ψ, at the 5% level, as a function
of the number of treatment groups and the d.f. for error for 3 and 4 treatments. (More detailed
tables, in addition to graphs, are given in Dixon and Massey [5].) Here, we have 42 d.f. and three
treatments with ψ = √3 = 1.73. The power is approximately 0.72 by simple linear interpolation
(42 d.f. for ψ = 1.7). The correct answer with more extensive tables is closer to 0.73.
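For readers without the Dixon and Massey tables, the same power can be approximated from the noncentral F distribution: the noncentrality parameter λ = nΣ(M_i − M_t)²/S² equals kψ² in the notation of Eq. (6.12). A sketch of our own, assuming scipy:

```python
from scipy.stats import f, ncf

def anova_power(means, s2, n, alpha=0.05):
    """Power of the one-way ANOVA F test for hypothesized group
    means, with n subjects per group and common variance s2."""
    k = len(means)
    grand = sum(means) / k
    lam = n * sum((m - grand)**2 for m in means) / s2  # = k * psi^2
    df1, df2 = k - 1, k * (n - 1)
    f_crit = f.ppf(1 - alpha, df1, df2)           # 5%-level critical value
    return 1 - ncf.cdf(f_crit, df1, df2, lam)     # noncentral F tail area

# Two actives (85, 85) and placebo (55), S^2 = 1000, n = 15 per group:
print(anova_power([85, 85, 55], 1000, 15))  # should come out near 0.73
# Trial and error over n (see below) shows n = 17 gives power near 0.80.
```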

Table 6.4 Factors for Computing Power for Analysis of Variance

Alpha = 0.05, k = 3
d.f. error   ψ      Power
10           1.6    0.42
             2.0    0.76
             2.4    0.80
             3.0    0.984
20           1.6    0.62
             1.92   0.80
             2.00   0.83
             3.0    >0.99
30           1.6    0.65
             1.9    0.80
             2.0    0.85
             3.0    >0.99
60           1.6    0.67
             1.82   0.80
             2.0    0.86
             3.0    >0.99
inf          1.6    0.70
             1.8    0.80
             2.0    0.88
             3.0    >0.99

Alpha = 0.05, k = 4
d.f. error   ψ      Power
10           1.4    0.48
             2.0    0.80
             2.6    0.96
20           1.4    0.56
             2.0    0.88
             2.6    0.986
30           1.4    0.59
             2.0    0.90
             2.6    >0.99
60           1.4    0.61
             2.0    0.92
             2.6    >0.99
inf          1.4    0.65
             2.0    0.94
             2.6    >0.99

Table 6.4 can also be used to determine sample size. For example, how many patients
per treatment group are needed to obtain a power of 0.80 in the above example? Applying
Eq. (6.12),

$$\frac{[(85 - 75)^2 + (85 - 75)^2 + (55 - 75)^2]/3}{1000/n} = \psi^2.$$

Solving for ψ²,

$$\psi^2 = 0.2n.$$

We can calculate n by trial and error. For example, with N = 20,

$$0.2N = 4 = \psi^2 \quad \text{and} \quad \psi = 2.$$

For ψ = 2 and N = 20 (d.f. = 57), the power is approximately 0.86 (for d.f. = 60, power =
0.86). For N = 15 (d.f. = 42, ψ = √3), we have calculated (above) that the power is approximately
0.72. A sample size of between 15 and 20 patients per treatment group would give a power of
0.80. In this example, we might guess that 17 patients per group would result in approximately
80% power. Indeed, more exact tables show that a sample size of 17 (ψ = √(0.2 × 17) = 1.85)
corresponds to a power of 0.79.
The same approach can be used for two-way designs, using the appropriate error term
from the analysis of variance.

6.7 SAMPLE SIZE FOR BIOEQUIVALENCE STUDIES (ALSO SEE CHAP. 11)
In its early evolution, bioequivalence was based on the acceptance or rejection of a hypothesis
test. Sample sizes could then be determined by conventional techniques as described in section
6.2. Because of inconsistencies in the decision process based on this approach, the criterion for
acceptance was changed to a two-sided 90% confidence interval or, equivalently, two one-sided
t tests, where the hypotheses are (μ1/μ2) < 0.8 and (μ1/μ2) > 1.25 versus the alternative of
0.8 < (μ1/μ2) < 1.25. This test is based on the antilog of the difference between the averages of
the log-transformed parameters (the geometric means). The test is equivalent to requiring a two-sided 90%
confidence interval for the ratio of means to fall in the interval 0.80 to 1.25 in order to accept
the hypothesis of equivalence. Again, for the currently accepted log-transformed data, the 90%
confidence interval for the antilog of the difference between means must lie between 0.80 and
1.25. The sample-size determination in this case is not as
simple as the conventional determination of sample size described earlier in this chapter. The
method for sample-size determination for nontransformed data has been published by Phillips
[6], along with plots of power as a function of sample size, relative standard deviation (computed
from the ANOVA), and treatment differences. Although the theory behind this computation is
beyond the scope of this book, Chow and Liu [7] give a simple way of approximating the power
and sample size. The sample size for each sequence group is approximately
$$N = (t_{\alpha,2N-2} + t_{\beta,2N-2})^2 \left(\frac{CV}{V - \delta}\right)^2, \qquad (6.13)$$

where N is the number of subjects per sequence, t the appropriate value from the t distribution, α
the significance level (usually 0.10), 1 − β the power (usually 0.8), CV the coefficient of variation,
V the bioequivalence limit, and δ the difference between products.
One would have to have an approximation of the magnitude of the required sample size
in order to approximate the t values. For example, suppose that RSD = 0.20, δ = 0.10, the power is
0.8, and an initial approximation of the sample size is 20 per sequence (a total of 40 subjects).
Applying Eq. (6.13),

$$n = (1.69 + 0.85)^2 \left[\frac{0.20}{0.20 - 0.10}\right]^2 = 25.8.$$

Use a total of 52 subjects. This agrees closely with Phillips's more exact computations.
Dilletti et al. [8] have published a method for determining sample size based on the log-transformed
variables, which is the currently preferred method. Table 6.5, showing sample sizes
for various values of CV, power, and product differences, is taken from their publication.
Based on these tables, using log-transformed estimates of the parameters would result in
a sample size estimate of 38 for a power of 0.8, ratio of 0.9, and CV = 0.20. If the assumed ratio
is 1.1, the sample size is estimated as 32.
Equation (6.13) can also be used to approximate these sample sizes using log values for V
and δ: n = (1.69 + 0.85)²[0.20/(0.223 − 0.105)]² = 19 per sequence, or 38 subjects in total, where
0.223 is the log of 1.25 and 0.105 is the absolute value of the log of 0.9.
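Equation (6.13) can be iterated on N, since the t quantiles depend on 2N − 2 degrees of freedom. A minimal sketch on the log scale (our own construction, assuming scipy; α = 0.05 in each of the two one-sided tests, i.e., the 90% confidence interval):

```python
from math import ceil, log
from scipy.stats import t

def n_per_sequence(cv, ratio, limit=1.25, alpha=0.05, beta=0.20):
    """Eq. (6.13) per-sequence N for a 2x2 crossover on the log
    scale, iterated because the t values have 2N - 2 d.f.
    (For ratio = 1.0 the text uses a two-sided t_beta instead;
    that case is not handled here.)"""
    v, delta = log(limit), abs(log(ratio))
    n = 20  # initial guess, as in the text
    for _ in range(20):
        df = 2 * n - 2
        t_a, t_b = t.ppf(1 - alpha, df), t.ppf(1 - beta, df)
        n = ceil((t_a + t_b)**2 * (cv / (v - delta))**2)
    return n

print(n_per_sequence(0.20, 0.90))  # -> 19 per sequence (38 subjects)
print(n_per_sequence(0.20, 1.10))  # -> 16 per sequence (32 subjects)
```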

Table 6.5 Sample Sizes for Given CV, Power, and Ratio (μT/μR) for Log-Transformed Parameters^a

                         Ratio (μT/μR)
CV (%)   Power (%)   0.85   0.90   0.95   1.00   1.05   1.10   1.15   1.20
5.0      70          10     6      4      4      4      4      6      16
7.5      70          16     6      6      4      6      6      10     34
10.0     70          28     10     6      6      6      8      16     58
12.5     70          42     14     8      8      8      12     24     90
15.0     70          60     18     10     10     10     16     32     128
17.5     70          80     22     12     12     12     20     44     172
20.0     70          102    30     16     14     16     26     56     224
22.5     70          128    36     20     16     20     30     70     282
25.0     70          158    44     24     20     22     38     84     344
27.5     70          190    52     28     24     26     44     102    414
30.0     70          224    60     32     28     32     52     120    490
5.0      80          12     6      4      4      4      6      8      22
7.5      80          22     8      6      6      6      8      12     44
10.0     80          36     12     8      6      8      10     20     76
12.5     80          54     16     10     8      10     14     30     118
15.0     80          78     22     12     10     12     20     42     168
17.5     80          104    30     16     14     16     26     56     226
20.0     80          134    38     20     16     18     32     72     294
22.5     80          168    46     24     20     24     40     90     368
25.0     80          206    56     28     24     28     48     110    452
27.5     80          248    68     34     28     34     58     132    544
30.0     80          292    80     40     32     38     68     156    642
5.0      90          14     6      4      4      4      6      8      28
7.5      90          28     10     6      6      6      8      16     60
10.0     90          48     14     8      8      8      14     26     104
12.5     90          74     22     12     10     12     18     40     162
15.0     90          106    30     16     12     16     26     58     232
17.5     90          142    40     20     16     20     34     76     312
20.0     90          186    50     26     20     24     44     100    406
22.5     90          232    64     32     24     30     54     124    510
25.0     90          284    78     38     28     36     66     152    626
27.5     90          342    92     44     34     44     78     182    752
30.0     90          404    108    52     40     52     92     214    888

^a Source: From Ref. [8].

For a ratio of 1.10 (log = 0.0953), the sample size is n = (1.69 + 0.85)²[0.20/(0.223 − 0.0953)]² =
16 per sequence, or 32 subjects in total.
If the difference between products is specified as zero (ratio = 1.0), the value for t_{β,2n−2}
in Eq. (6.13) should be two sided (Table 6.2). For example, for 80% power (and a large sample
size) use 1.28 rather than 0.84. In the example above with a ratio of 1.0 (0 difference between
products), a power of 0.8, and a CV = 0.2, use a value of (approximately) 1.34 for t_{β,2n−2}:

n = (1.75 + 1.34)²[0.2/0.223]² = 7.7 per group, or 16 total subjects.

An Excel program to calculate the number of subjects required for a crossover study under
various conditions of power and product differences, for both parametric and binary (binomial)
data, is available on the disk accompanying this volume.
This approach to sample-size determination can also be used for studies where the out-
come is dichotomous, often used as the criterion in clinical studies of bioequivalence (cured or
not cured) for topically unabsorbed products or unabsorbed oral products such as sucralfate.
This topic is presented in section 11.4.8.


2 DATA GRAPHICS

“The preliminary examination of most data is facilitated by the use of diagrams. Diagrams prove
nothing, but bring outstanding features readily to the eye; they are therefore no substitute for
such critical tests as may be applied to the data, but are valuable in suggesting such tests, and in
explaining the conclusions founded upon them.” This quote is from Ronald A. Fisher, the father
of modern statistical methodology [1]. Tabulation of raw data can be thought of as the initial
and least refined way of presenting experimental results. Summary tables, such as frequency
distribution tables, are much easier to digest and can be considered a second stage of refine-
ment of data presentation. Summary statistics such as the mean, median, variance, standard
deviation, and the range are concise descriptions of the properties of data, but much informa-
tion is lost in this processing of experimental results. Graphical methods of displaying data
are to be encouraged and are important adjuncts to data analysis and presentation. Graphical
presentations clarify and also reinforce conclusions based on formal statistical analyses. Finally,
the researcher has the opportunity to design aesthetic graphical presentations that command
attention. The popular cliché “A picture is worth a thousand words” is especially apropos to
statistical presentations. We will discuss some key concepts of the various ways in which data
are depicted graphically.

2.1 INTRODUCTION
The diagrams and plots that we will be concerned with in our discussion of statistical methods
can be placed broadly into two categories:
1. Descriptive plots are those whose purpose is to transmit information. These include dia-
grams describing data distributions such as histograms and cumulative distribution plots
(see sect. 1.2.3). Bar charts and pie charts are examples of popular modes of communicating
survey data or product comparisons.
2. Plots that describe relationships between variables usually show an underlying, but unknown
analytic relationship between the variables that we wish to describe and understand. These
relationships can range from relatively simple to very complex, and may involve only two
variables or many variables. One of the simplest relationships, but probably the one with
greatest practical application, is the straight-line relationship between two variables, as
shown in the Beer’s law plot in Figure 2.1. Chapter 7 is devoted to the analysis of data
involving variables that have a linear relationship.
When analyzing and depicting data that involve relationships, we are often presented
with data in pairs (X, Y pairs). In Figure 2.1, the optical density Y and the concentration X are
the data pairs. When considering the relationship of two variables, X and Y, one variable can
often be considered the response variable, which is dependent on the selection of the second
or causal variable. The response variable Y (optical density in our example) is known as the
dependent variable. The value of Y depends on the value of the independent variable, X (drug
concentration). Thus, in the example in Figure 2.1, we think of the value of optical density as
being dependent on the concentration of drug.

2.2 THE HISTOGRAM


The histogram, sometimes known as a bar graph, is one of the most popular ways of presenting
and summarizing data. All of us have seen bar graphs, not only in scientific reports but also in
advertisements and other kinds of presentations illustrating the distribution of scientific data.

Figure 2.1 Beer’s law plot illustrating a linear relationship between two variables.

The histogram can be considered as a visual presentation of a frequency table. The frequency, or
proportion, of observations in each class interval is plotted as a bar, or rectangle, where the area
of the bar is proportional to the frequency (or proportion) of observations in a given interval.
An example of a histogram is shown in Figure 2.2, where the data from the frequency table in
Table 1.2 have been used as the data source. As is the case with frequency tables, class intervals
for histograms should be of equal width. When the intervals are of equal width, the height of
the bar is proportional to the frequency of observations in the interval. If the intervals are not of
equal width, the histogram is not easily or obviously interpreted, as shown in Figure 2.2(B).
The choice of intervals for a histogram depends on the nature of the data, the distribution
of the data, and the purpose of the presentation. In general, rules of thumb similar to those used
for frequency distribution tables (sect. 1.2) can be used. Eight to twenty equally spaced intervals
usually are sufficient to give a good picture of the data distribution.

Figure 2.2 Histogram of data derived from Table 1.2.
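As an illustration, a histogram with equal-width intervals takes only a few lines of matplotlib; the data below are hypothetical stand-ins, since the Table 1.2 values are not reproduced in this excerpt:

```python
import matplotlib.pyplot as plt

# Hypothetical tablet potencies (mg); stand-ins for the Table 1.2 data.
potencies = [89, 92, 94, 95, 96, 97, 97, 98, 99, 99, 100, 100, 100,
             101, 101, 102, 102, 103, 104, 105, 106, 108, 110, 112]

# 8 to 20 equal-width intervals is the usual rule of thumb; with equal
# widths, bar height (and area) is proportional to the class frequency.
plt.hist(potencies, bins=10, edgecolor="black")
plt.xlabel("Potency (mg)")
plt.ylabel("Frequency")
plt.title("Histogram of tablet potencies")
plt.show()
```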

2.3 CONSTRUCTION AND LABELING OF GRAPHS


Proper construction and labeling of graphs are crucial elements in graphical data representation.
The design and actual construction of graphs are not in themselves difficult. The preparation
of a good graph, however, requires careful thought and competent technical skills. One needs
not only a knowledge of statistical principles, but also, in particular, computer and drafting
competency. There are no firm rules for preparing good graphical presentations. Mostly, we
rely on experience and a few guidelines. Both books and research papers have addressed the
need for a more scientific guide to optimal graphics, whose quality, after all, is measured by how well the
graph communicates the intended message(s) to the individuals who are intended to read and
interpret the graphs. Still, no rules will cover all situations. One must be clear that no matter
how well a graph or chart is conceived, if the draftsmanship and execution are poor, the graph
will fail to achieve its purpose.
A “good” graph or chart should be as simple as possible, yet clearly transmit its intended
message. Superfluous notation, confusing lines or curves, and inappropriate draftsmanship
(lettering, etc.) that can distract the reader are signs of a poorly constructed graph. The books
Statistical Graphics, by Schmid [2], and The Visual Display of Quantitative Information by Tufte
[3] are recommended for those who wish to study examples of good and poor renderings of
graphic presentations. For example, Schmid notes that visual contrast should be intentionally
used to emphasize important characteristics of the graph. Here, we will present a few examples
to illustrate the recommendations for good graphic presentation as well as examples of graphs
that are not prepared well or fail to illustrate the facts fairly.
Figure 2.3 shows the results of a clinical study that was designed to compare an active
drug to a placebo for the treatment of hypertension. This graph was constructed from the X, Y
pairs, time and blood pressure, respectively. Each point on the graph is the average blood
pressure for either drug or placebo at some point in time subsequent to the initiation of the
study.
Proper construction and labeling of the typical rectilinear graph should include the fol-
lowing considerations:

1. A title should be given. The title should be brief and to the point, enabling the reader to
understand the purpose of the graph without having to resort to reading the text. The title
can be placed below or above the graph as in Figure 2.3.
2. The axes should be clearly delineated and labeled. In general, the zero (0) points of both axes
should be clearly indicated. The ordinate (the Y axis) is usually labeled with the description
parallel to the Y axis. Both the ordinate and abscissa (X axis) should each be appropriately

labeled and subdivided in units of equal width (of course, the X and Y axes almost always
have different subdivisions). In the example in Figure 2.3, note the units of mm Hg and
weeks for the ordinate and abscissa, respectively. Grid lines may be added [Fig. 2.4(E)] but,
if used, should be kept to a minimum, not be prominent, and should not interfere with the
interpretation of the figure.

[Figure 2.3 appears here: a plot of diastolic blood pressure (mm Hg), scale 80 to 115, versus time (weeks, 0 to 8) after initiation of the study, for the drug and placebo groups.]

Figure 2.3 Blood pressure as a function of time in a clinical study comparing drug (+, average of 50 patients) and placebo (average of 45 patients) with a regimen of one tablet per day.

[Figure 2.4 appears here: five panels (A)-(E) plotting exercise time (sec), or the difference in exercise time between drugs (sec), against time after dosing (hr) for Drug I and Drug II, drawn with different axis scales and breaks.]

Figure 2.4 Various graphs of the same data presented in different ways. Exercise time at various time intervals after administration of single doses of two nitrate products (Drug I and Drug II).
3. The numerical values assigned to the axes should be appropriately spaced so as to nicely
cover the extent of the graph. This can easily be accomplished by trial and error and a little
manipulation. The scales and proportions should be constructed to present a fair picture of
the results and should not be exaggerated so to prejudice the interpretation. Sometimes, it
may be necessary to skip or omit some of the data to achieve this objective. In these cases,
the use of a “broken line” is recommended to clearly indicate the range of data not included
in the graph (Fig. 2.4).

4. If appropriate, a key explaining the symbols used in the graph should be used. For example,
at the bottom of Figure 2.3, the key identifies the symbols for placebo and drug. In
many cases, labeling the curves directly on the graph (Fig. 2.4) results in more clarity.
5. In situations where the graph is derived from laboratory data, inclusion of the source of the
data (name, laboratory notebook number, and page number, for example) is recommended.

Usually graphs should stand on their own, independent of the main body of the text.
Examples of various ways of plotting data, derived from a study of exercise time at various
time intervals after administration of a single dose of two long-acting nitrate products to anginal
patients, are shown in Figures 2.4(A) to 2.4(E). All of these plots are accurate representations of
the experimental results, but each gives the reader a different impression. It would be wrong to
expand or contract the axes of the graph, or otherwise distort the graph, in order to convey an
incorrect impression to the reader. Most scientists are well aware of how data can be manipulated
to give different impressions. If obvious deception is intended, the experimental results will not
be taken seriously.
When examining the various plots in Figure 2.4, one could not say which plot best repre-
sents the meaning of the experimental results without knowledge of the experimental details,
in particular the objective of the experiment, the implications of the experimental outcome, and
the message that is meant to be conveyed. For example, if an improvement of exercise time of
120 seconds for one drug compared to the other is considered to be significant from a medical
point of view, the graphs labeled A, C, and E in Figure 2.4 would all seem appropriate in con-
veying this message. The graphs labeled B and D show this difference less clearly. On the other
hand, if 120 seconds is considered to be of little medical significance, B and D might be a better
representation of the data.
Note that in plot A of Figure 2.4, the ordinate (exercise time) is broken, indicating that
some values have been skipped. This is not meant to be deceptive, but is intentionally done
to better show the differences between the two drugs. As long as the zero point and the break
in the axis are clearly indicated, and the message is not distorted, such a procedure is entirely
acceptable.
Figures 2.4(B) and 2.5 are exaggerated examples of plots that may be considered not to
reflect accurately the significance of the experimental results. In Figure 2.4(B), the clinically
significant difference of approximately 120 seconds is made to look very small, tending to
diminish drug differences in the viewer’s mind. Also, fluctuations in the hourly results appear
to be less than the data truly suggest. In Figure 2.5, a difference of 5 seconds in exercise time
between the two drugs appears very large. Care should be taken when constructing (as well as
reading) graphs so that experimental conclusions come through clear and true.

6. If more than one curve appears on the same graph, a convenient way to differentiate the
curves is to use different symbols for the experimental points (e.g., ○, ×, +) and, if
necessary, to connect the points in different ways (e.g., solid, dashed, and dotted lines). A key or
label is used, which is helpful in distinguishing the various curves, as shown in Figures 2.3
to 2.6. Other ways of differentiating curves include different kinds of crosshatching and use
of different colors.

Figure 2.5 Exercise time at various time intervals after administration of two nitrate products. •, product I; +, product II.

Figure 2.6 Plot of dissolution of four successive batches of a commercial tablet product; batches 1 to 4 are plotted with different symbols.

7. One should take care not to place too many curves on the same graph, as this can result in
confusion. There are no specific rules in this regard. The decision depends on the nature of
the data, and how the data look when they are plotted. The curves graphed in Figure 2.7
are cluttered and confusing. The curves should be presented differently or separated into
two or more graphs. Figure 2.8 is a clearer depiction of the dissolution results of the five
formulations shown in Figure 2.7.
8. The standard deviation may be indicated on graphs as shown in Figure 2.9. However, when
the standard deviation is indicated on a graph (or in a table, for that matter), it should be
made clear whether the variation described in the graph is an indication of the standard
deviation (S) or the standard deviation of the mean (Sx̄ ). The standard deviation of the
mean, if appropriate, is often preferable to the standard deviation, not only because the
values on the graph are mean values, but also because Sx̄ is smaller than the s.d. and
therefore produces less clutter. Overlapping standard deviations, as shown in Figure 2.10, should
be avoided, as this representation of the experimental results is usually more confusing than
clarifying.
9. The manner in which the points on a graph should be connected is not always obvious.
Should the individual points be connected by straight lines, or should a smooth curve that
approximates the points be drawn through the data? (See Fig. 2.11.) If the graphs represent
functional relationships, the data should probably be connected by a smooth curve. For
example, the blood level versus time data shown in Figure 2.11 are described most accurately
by a smooth curve. Although, theoretically, the points should not be connected by straight
lines as shown in Figure 2.11(A), such graphs are often depicted this way. Connecting the
individual points with straight lines may be considered acceptable if one recognizes that
this representation is meant to clarify the graphical presentation, or is done for some other
appropriate reason. In the blood-level example, the area under the curve is proportional to
the amount of drug absorbed. The area is often computed by the trapezoidal rule [4], and
depiction of the data as shown in Figure 2.11(A) makes it easier to visualize and perform
such calculations.
Figure 2.12 shows another example in which connecting points by straight lines is con-
venient but may not be a good representation of the experimental outcome. The straight line
connecting the blood pressure at zero time (before drug administration) to the blood pressure
after two weeks of drug administration suggests a gradual decrease (a linear decrease) in blood

Figure 2.7 Plot of dissolution time of five different commercial formulations (products A to E) of the same drug, each plotted with a different symbol.

Figure 2.8 Individual plots of dissolution of the five formulations shown in Fig. 2.7.

pressure over the two-week period. In fact, no measurements were made during the initial
two-week interval. The 10-mm Hg decrease observed after two weeks of therapy may have
occurred before the two-week reading (e.g., in one week, as indicated by the dashed line in
Fig. 2.12). One should be careful to ensure that graphs constructed in such a manner are not
misinterpreted.
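The trapezoidal-rule calculation mentioned in item 9 is easy to carry out in a few lines of code. The following is a minimal Python sketch; the helper name and the times and concentrations below are hypothetical illustrations, not data from this chapter.

```python
# Illustrative sketch (function name and data are hypothetical, not from the text).
def trapezoidal_auc(times, levels):
    """Approximate area under a blood-level curve by summing trapezoids
    formed between successive (time, concentration) points."""
    auc = 0.0
    for i in range(len(times) - 1):
        width = times[i + 1] - times[i]               # length of the interval
        avg_height = (levels[i] + levels[i + 1]) / 2  # mean of the two levels
        auc += width * avg_height                     # area of one trapezoid
    return auc

t = [0, 1, 2, 4, 6, 8]                  # hr (hypothetical sampling times)
c = [0.0, 4.2, 3.1, 1.8, 0.9, 0.4]      # mcg/mL (hypothetical blood levels)
print(trapezoidal_auc(t, c))            # AUC in mcg*hr/mL (here 14.65)
```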

Figure 2.9 Plot of exercise time as a function of time for an antianginal drug showing mean values and standard
error of the mean.

Figure 2.10 Graph comparing two antianginal drugs that is confusing and cluttered because of the overlapping
standard deviations. •, drug A; o, drug B.

2.4 SCATTER PLOTS (CORRELATION DIAGRAMS)


Although the applications of correlation will be presented in some detail in chapter 7, we will
introduce the notion of scatter plots (also called correlation diagrams or scatter diagrams) at this
time. This type of plot or diagram is commonly used when presenting results of experiments.
A typical scatter plot is illustrated in Figure 2.13. Data are collected in pairs (X and Y) with the
objective of demonstrating a trend or relationship (or lack of relationship) between the X and
Y variables. Usually, we are interested in showing a linear relationship between the variables
(i.e., a straight line). For example, one may be interested in demonstrating a relationship (or
correlation) between the time to 80% dissolution of various tablet formulations of a particular
drug and the fraction of the dose absorbed when human subjects take the various tablets.
Figure 2.11 Plot of blood level versus time data illustrating two ways of drawing the curves.

Figure 2.12 Graph of blood pressure reduction with time of an antihypertensive drug illustrating possible misinterpretation that may occur when points are connected by straight lines.

[Figure 2.13 here] Figure 2.13 Scatter plot showing the correlation of dissolution time and in vivo absorption of six tablet formulations (A to F), each plotted with a different symbol. Ordinate: time to 80% dissolution (min), 0 to 45; abscissa: fraction of dose absorbed in vivo, 0.0 to 1.0.

The data plotted
in Figure 2.13 show pictorially that as dissolution increases (i.e., the time to 80% dissolution
decreases) in vivo absorption increases. Scatter plots involve data pairs, X and Y, both of which
are variable. In this example, dissolution time and fraction absorbed are both random variables.

2.5 SEMILOGARITHMIC PLOTS


Several important kinds of experiments in the pharmaceutical sciences result in data such
that the logarithm of the response (Y) is linearly related to an independent variable, X. The
semilogarithmic plot is useful when the response (Y) is best depicted as proportional changes
relative to changes in X, or when the spread of Y is very large and cannot be easily depicted
on a rectilinear scale. Semilog graph paper has the usual equal interval scale on the X axis and
the logarithmic scale on the Y axis. In the logarithmic scale, equal intervals represent ratios. For
example, the distance between 1 and 10 will exactly equal the distance between 10 and 100 on a
logarithmic scale. In particular, first-order kinetic processes, often apparent in drug degradation
and pharmacokinetic systems, show a linear relationship when log C is plotted versus time.
First-order processes can be expressed by the following equation:

log C = log C0 − kt/2.3        (2.1)

where C is the concentration at time t, C0 the concentration at time 0, k the first-order rate
constant, t the time, and log represents logarithm to the base 10.
Table 2.1 shows blood-level data obtained after an intravenous injection of a drug
described by a one-compartment model [3].
Figure 2.14 shows two ways of plotting the data in Table 2.1 to demonstrate the linearity
of the log C versus t relationship.
1. Figure 2.14(A) shows a plot of log C versus time. The resulting straight line is a consequence
of the relationship of log concentration and time as shown in Eq. 2.1. This is an equation of
a straight line with the Y intercept equal to log C0 and a slope equal to −k/2.3. Straight-line
relationships are discussed in more detail in chapter 8.

Table 2.1 Blood Levels After Intravenous Injection of Drug

Time after injection, t (hr) Blood level, C (␮g/mL) Log blood level
0 20 1.301
1 10 1.000
2 5 0.699
3 2.5 0.398
4 1.25 0.097

[Figure 2.14, panels (A) and (B), here] Figure 2.14 Linearizing plots of data from Table 2.1. (Plot A) log C versus time; (plot B) semilog plot. Panel A ordinate: log concentration, 0 to 1.2; panel B ordinate: concentration, 1 to 100, on a two-cycle logarithmic scale; abscissa for both: time after injection (hr), 0 to 5.

2. Figure 2.14(B) shows a more convenient way of plotting the data of Table 2.1, making use of
semilog graph paper. This paper has a logarithmic scale on the Y axis and the usual arithmetic,
linear scale on the X axis. The logarithmic scale is constructed so that the spacing corresponds
to the logarithms of the numbers on the Y axis. For example, the distance between 1 and 2 is
the same as that between 2 and 4. (Log 2−log 1) is equal to (log 4−log 2). The semilog graph
paper depicted in Figure 2.14(B) is two-cycle paper. The Y (log) axis has been repeated two
times. The decimal point for the numbers on the Y axis is accommodated to the data. In our
example, the data range from 1.25 to 20 and the Y axis is adjusted accordingly, as shown in
Figure 2.14(B). The data may be plotted directly on this paper without the need to look up
the logarithms of the concentration values.
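As a brief numerical illustration of Eq. (2.1), the following Python sketch fits a straight line to log C versus t for the data of Table 2.1 and recovers the first-order rate constant. It uses 2.303 (= ln 10) for the factor written as 2.3 in Eq. (2.1).

```python
import numpy as np

# Data of Table 2.1
t = np.array([0, 1, 2, 3, 4])           # hr
c = np.array([20, 10, 5, 2.5, 1.25])    # mcg/mL

slope, intercept = np.polyfit(t, np.log10(c), 1)  # least-squares straight line
k = -slope * 2.303                                # Eq. (2.1): slope = -k/2.3
print(round(slope, 3), round(intercept, 3), round(k, 3))
# slope = -0.301, log C0 = 1.301 (i.e., C0 = 20 mcg/mL), k = 0.693 per hr
```

The fitted intercept returns log C0 and the slope returns −k/2.303, consistent with the 1-hour half-life evident in Table 2.1.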

2.6 OTHER DESCRIPTIVE FIGURES


Most of the discussion in this chapter has been concerned with plots that show relationships
between variables such as blood pressure changes following two or more treatments, or drug
decomposition as a function of time. Often occasions arise in which graphical presentations are
better made using other more pictorial techniques. These approaches include the popular bar
and pie charts. Schmid [2] differentiates bar charts into two categories: (a) column charts in which
there is a vertical orientation and (b) bar charts in which the bars are horizontal. In general, the
bar charts are more appropriate for comparison of categorical variables, whereas the column
chart is used for data showing relationships such as comparisons of drug effect over time.
Bar charts are very simple but effective visual displays. They are usually used to compare
some experimental outcome or other relevant data where the length of the bar represents the
magnitude. There are many variations of the simple bar chart [2]; an example is shown in Figure
2.15. In Figure 2.15(A), patients are categorized as having a good, fair, or poor response. Forty
percent of the patients had a good response, 35% had a fair response, and 25% had a poor
response.
Figure 2.15(B) shows bars in pairs to emphasize the comparative nature of two treatments.
It is clear from this diagram that Treatment X is superior to Treatment Y. Figure 2.15(C) is another
way of displaying the results shown in Figure 2.15(B). Which chart do you think better sends
the message of the results of this comparative study, Figure 2.15(B) or 2.15(C)? One should be
aware that the results correspond only to the length of the bar. If the order in which the bars
are presented is not obvious, displaying bars in order of magnitude is recommended. In the
example in Figure 2.15, the order is based on the nature of the results, “Good,” “Fair,” and
“Poor.” Beyond this, the principal objective is to prepare an aesthetic presentation that
emphasizes but does not exaggerate the
results. For example, the use of graphic techniques such as shading, crosshatching, and color,
tastefully executed, can enhance the presentation.
Column charts are prepared in a similar way to bar charts. As noted above, whether or not
a bar or column chart is best to display data is not always clear. Data trends over time usually
are best shown using columns. Figure 2.16 shows the comparison of exercise time for two drugs
using a column chart. This is the same data used to prepare Figure 2.4(A) (also, see Exercise
Problem 8 at the end of this chapter).

Figure 2.15 Graphical representation of patient responses to drug therapy.


[Figure 2.16 here] Figure 2.16 Exercise time for two drugs in the form of a column chart using data of Figure 2.4. Ordinate: exercise time (sec), 250 to 450; abscissa: time after dosing (hr), 1 to 5; paired columns for Drug 1 and Drug 2.

Pie charts are popular ways of presenting categorical data. Although the principles used in
the construction of these charts are relatively simple, thought and care are necessary to convey
the correct message. For example, dividing the circle into too many categories can be confusing
and misleading. As a rule of thumb, no more than six sectors should be used. Another problem
with pie charts is that it is not always easy to differentiate two segments that are reasonably
close in size, whereas in the bar graph, values close in size are easily differentiated, since length
is the critical feature.
The circle (or pie) represents 100%, or all of the results. Each segment (or slice of pie) has an
area proportional to the area of the circle, representative of the contribution due to the particular
segment. In the example shown in Figure 2.17(A), the pie represents the anti-inflammatory
drug market. The slices are proportions of the market accounted for by major drugs in this
therapeutic class. These charts are frequently used for business and economic descriptions, but
can be applied to the presentation of scientific data in appropriate circumstances. Figure 2.17(B)
shows the proportion of patients with good, fair, and poor responses to a drug in a clinical trial
(see also Fig. 2.15).
Of course, we have not exhausted all possible ways of presenting data graphically. We
have introduced the cumulative plot in section 1.2.3. Other kinds of plots are the stick diagram
(analogous to the histogram) and frequency polygon [5]. The number of ways in which data
can be presented is limited only by our own ingenuity. An elegant pictorial presentation of
data can “make” a report or government submission. On the other hand, poor presentation of
data can detract from an otherwise good report. The book Statistical Graphics by Calvin Schmid
is recommended for those who wish detailed information on the presentation of graphs and
charts.

Figure 2.17 Examples of pie charts.



KEY TERMS
Bar charts
Bar graphs
Column charts
Correlation
Data pairs
Dependent variables
Histogram
Independent variables
Key
Pie charts
Scatter plots
Semilog plots

EXERCISES
1. Plot the following data, preparing and labeling the graph according to the guidelines out-
lined in this chapter. These data are the result of preparing various modifications of a
formulation and observing the effect of the modifications on tablet hardness.

Formulation modification

Starch (%) Lactose (%) Tablet hardness (kg)


10 5 8.3
10 10 9.1
10 15 9.6
10 20 10.2
5 5 9.1
5 10 9.4
5 15 9.8
5 20 10.4

(Hint: Plot these data on a single graph where the Y axis is tablet hardness and the X axis
is lactose concentration. There will be two curves, one at 10% starch and the other at 5%
starch.)
2. Prepare a histogram from the data of Table 1.3. Compare this histogram to that shown in
Figure 2.2(A). Which do you think is a better representation of the data distribution?
3. Plot the following data and label the graph appropriately.

Patient   X: response to product A   Y: response to product B
1         2.5                        3.8
2         3.6                        2.4
3         8.9                        4.7
4         6.4                        5.9
5         9.5                        2.1
6         7.4                        5.0
7         1.0                        8.5
8         4.7                        7.8

What conclusion(s) can you draw from this plot if the responses are pain relief scores, where
a high score means more relief?
4. A batch of tablets was shown to have 70% with no defects, 15% slightly chipped, 10%
discolored, and 5% dirty. Construct a pie chart from these data.
5. The following data from a dose–response experiment, a measure of physical activity, are the
responses of five animals at each of three doses.

ANOVA:
Analysis of Variance
The basic ANOVA situation
Two variables: 1 Categorical, 1 Quantitative

Main Question: Does the mean of the quantitative variable depend on which group (given by the categorical variable) the individual is in?

If the categorical variable has only 2 values:
• 2-sample t-test
ANOVA allows for 3 or more groups
An example ANOVA situation
Subjects: 25 patients with blisters
Treatments: Treatment A, Treatment B, Placebo
Measurement: # of days until blisters heal

Data [and means]:


• A: 5,6,6,7,7,8,9,10 [7.25]
• B: 7,7,8,9,9,10,10,11 [8.875]
• P: 7,9,9,10,10,10,11,12,13 [10.11]

Are these differences significant?


Informal Investigation
Graphical investigation:
• side-by-side box plots
• multiple histograms

Whether the differences between the groups are


significant depends on
• the difference in the means
• the standard deviations of each group
• the sample sizes

ANOVA determines P-value from the F statistic


Side-by-Side Boxplots
[Boxplots of days (y axis, about 5 to 13) by treatment A, B, P (x axis).]
What does ANOVA do?
At its simplest (there are extensions) ANOVA
tests the following hypotheses:
H0: The means of all the groups are equal.

Ha: Not all the means are equal


• doesn’t say how or which ones differ.
• Can follow up with “multiple comparisons”

Note: we usually refer to the sub-populations as


“groups” when doing ANOVA.
Assumptions of ANOVA
• each group is approximately normal
  – check this by looking at histograms and/or normal quantile plots, or rely on assumptions about the population
  – can handle some nonnormality, but not severe outliers
• standard deviations of each group are approximately equal
  – rule of thumb: ratio of largest to smallest sample st. dev. must be less than 2:1
Normality Check
We should check for normality using:
• assumptions about the population
• histograms for each group
• a normal quantile plot for each group

With such small data sets, there really isn't a good way to check normality from the data, but we make the common assumption that physical measurements of people tend to be normally distributed.
Standard Deviation Check

Variable treatment N Mean Median StDev


days A 8 7.250 7.000 1.669
B 8 8.875 9.000 1.458
P 9 10.111 10.000 1.764

Compare largest and smallest standard deviations:


• largest: 1.764
• smallest: 1.458
• 1.458 x 2 = 2.916 > 1.764

Note: variance ratio of 4:1 is equivalent.


Notation for ANOVA
• n = number of individuals all together
• I = number of groups
• x̄ = mean for the entire data set

Group i has:
• n_i = # of individuals in group i
• x_ij = value for individual j in group i
• x̄_i = mean for group i
• s_i = standard deviation for group i
How ANOVA works (outline)
ANOVA measures two sources of variation in the data and compares their relative sizes:
• variation BETWEEN groups: for each data value, look at the difference between its group mean and the overall mean, (x̄_i − x̄)²
• variation WITHIN groups: for each data value, look at the difference between that value and the mean of its group, (x_ij − x̄_i)²

The ANOVA F-statistic is a ratio of the Between Group Variation divided by the Within Group Variation:

F = Between/Within = MSG/MSE

A large F is evidence against H0, since it indicates that there is more difference between groups than within groups.
Minitab ANOVA Output
Analysis of Variance for days
Source DF SS MS F P
treatment 2 34.74 17.37 6.45 0.006
Error 22 59.26 2.69
Total 24 94.00

R ANOVA Output
Df Sum Sq Mean Sq F value Pr(>F)
treatment 2 34.7 17.4 6.45 0.0063 **
Residuals 22 59.3 2.7
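The output above can be reproduced in a few lines. A minimal Python sketch, assuming SciPy is available, run on the blister-healing data:

```python
from scipy import stats

a = [5, 6, 6, 7, 7, 8, 9, 10]               # treatment A
b = [7, 7, 8, 9, 9, 10, 10, 11]             # treatment B
p = [7, 9, 9, 10, 10, 10, 11, 12, 13]       # placebo

f_stat, p_value = stats.f_oneway(a, b, p)   # one-way ANOVA
print(round(f_stat, 2), round(p_value, 4))  # 6.45 0.0063, as in the output above
```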
How are these computations made?
We want to measure the amount of variation due to BETWEEN group variation and WITHIN group variation. For each data value, we calculate its contribution to:
• BETWEEN group variation: (x̄_i − x̄)²
• WITHIN group variation: (x_ij − x̄_i)²
An even smaller example
Suppose we have three groups
• Group 1: 5.3, 6.0, 6.7
• Group 2: 5.5, 6.2, 6.4, 5.7
• Group 3: 7.5, 7.2, 7.9
We get the following statistics:

SUMMARY
Groups Count Sum Average Variance
Column 1 3 18 6 0.49
Column 2 4 23.8 5.95 0.176667
Column 3 3 22.6 7.533333 0.123333
Excel ANOVA Output

Source of Variation   SS         df   MS         F          P-value    F crit
Between Groups        5.127333   2    2.563667   10.21575   0.008394   4.737416
Within Groups         1.756667   7    0.250952
Total                 6.884      9

Degrees of freedom: Between Groups df = 1 less than the number of groups; Within Groups df = number of data values − number of groups (equals the df for each group added together); Total df = 1 less than the number of individuals (just like other situations).
Computing the ANOVA F statistic

                            WITHIN difference:            BETWEEN difference:
data   group   group mean   data − group mean   squared   group mean − overall mean   squared
5.3    1       6.00         −0.70               0.490     −0.44                       0.194
6.0    1       6.00          0.00               0.000     −0.44                       0.194
6.7    1       6.00          0.70               0.490     −0.44                       0.194
5.5    2       5.95         −0.45               0.203     −0.49                       0.240
6.2    2       5.95          0.25               0.063     −0.49                       0.240
6.4    2       5.95          0.45               0.203     −0.49                       0.240
5.7    2       5.95         −0.25               0.063     −0.49                       0.240
7.5    3       7.53         −0.03               0.001      1.09                       1.188
7.2    3       7.53         −0.33               0.109      1.09                       1.188
7.9    3       7.53          0.37               0.137      1.09                       1.188
TOTAL                                           1.757                                 5.106
TOTAL/df                                        0.251                                 2.553

overall mean: 6.44; F = 2.553/0.251 ≈ 10.17 (10.21575 from the unrounded sums, as in the Excel output)


Minitab ANOVA Output
Analysis of Variance for days
Source      DF   SS      MS      F      P
treatment   2    34.74   17.37   6.45   0.006
Error       22   59.26   2.69
Total       24   94.00

Degrees of freedom: treatment DF = 1 less than the # of groups; Error DF = # of data values − # of groups (equals the df for each group added together); Total DF = 1 less than the # of individuals (just like other situations).
Minitab ANOVA Output
Analysis of Variance for days
Source      DF   SS      MS      F      P
treatment   2    34.74   17.37   6.45   0.006
Error       22   59.26   2.69
Total       24   94.00

SS stands for sum of squares. ANOVA splits the total sum of squares, SST = Σ_obs (x_ij − x̄)², into the between-group part SSG = Σ_obs (x̄_i − x̄)² (treatment) and the within-group part SSE = Σ_obs (x_ij − x̄_i)² (error).
Minitab ANOVA Output
Analysis of Variance for days
Source      DF   SS      MS      F      P
treatment   2    34.74   17.37   6.45   0.006
Error       22   59.26   2.69
Total       24   94.00

MSG = SSG/DFG; MSE = SSE/DFE; F = MSG/MSE. The P-value comes from the F(DFG, DFE) distribution. (P-values for the F statistic are in Table E.)
So how big is F?
Since F is Mean Square Between / Mean Square Within = MSG/MSE, a large value of F indicates relatively more difference between groups than within groups (evidence against H0).
To get the P-value, we compare to the F(I−1, n−I) distribution:
• I − 1 degrees of freedom in the numerator (# of groups − 1)
• n − I degrees of freedom in the denominator (the rest of the df)
Connections between SST, MST, and standard deviation
If we ignore the groups for a moment and just compute the standard deviation of the entire data set, we see

s² = Σ(x_ij − x̄)²/(n − 1) = SST/DFT = MST

So SST = (n − 1)s², and MST = s². That is, SST and MST measure the TOTAL variation in the data set.
Connections between SSE, MSE, and standard deviation
Remember: s_i² = Σ_j (x_ij − x̄_i)²/(n_i − 1) = SS[Within Group i]/df_i
So SS[Within Group i] = (s_i²)(df_i).
This means that we can compute SSE from the standard deviations and sizes (df) of each group:

SSE = SS[Within] = Σ SS[Within Group i] = Σ s_i²(n_i − 1) = Σ s_i²(df_i)
Pooled estimate for st. dev
One of the ANOVA assumptions is that all groups have the same standard deviation. We can estimate this with a weighted average:

s_p² = [(n₁ − 1)s₁² + (n₂ − 1)s₂² + … + (n_I − 1)s_I²]/(n − I)
     = [(df₁)s₁² + (df₂)s₂² + … + (df_I)s_I²]/(df₁ + df₂ + … + df_I)

so s_p² = SSE/DFE = MSE: the MSE is the pooled estimate of variance.
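These identities can be checked numerically from the summary statistics given earlier (the N and StDev columns for groups A, B, and P). A minimal Python sketch:

```python
ns = [8, 8, 9]                # group sizes (A, B, P)
sds = [1.669, 1.458, 1.764]   # group standard deviations

sse = sum((n - 1) * s ** 2 for n, s in zip(ns, sds))  # SSE = sum of s_i^2 * df_i
dfe = sum(ns) - len(ns)                               # DFE = n - I = 22
mse = sse / dfe                                       # pooled variance estimate
print(round(sse, 2), round(mse, 2), round(mse ** 0.5, 3))
# 59.27 2.69 1.641 -- matching SSE, MSE, and "Pooled StDev" in the Minitab
# output (up to rounding of the printed standard deviations)
```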
In Summary
SST = Σ_obs (x_ij − x̄)² = s²(DFT)
SSE = Σ_obs (x_ij − x̄_i)² = Σ_groups s_i²(df_i)
SSG = Σ_obs (x̄_i − x̄)² = Σ_groups n_i(x̄_i − x̄)²
SSE + SSG = SST;  MS = SS/DF;  F = MSG/MSE
R² Statistic
R² gives the percent of variance due to between-group variation:

R² = SS[Between]/SS[Total] = SSG/SST

We will see R² again when we study regression.
Where’s the Difference?
Once ANOVA indicates that the groups do not all
appear to have the same means, what do we do?

Analysis of Variance for days


Source DF SS MS F P
treatmen 2 34.74 17.37 6.45 0.006
Error 22 59.26 2.69
Total 24 94.00
Individual 95% CIs For Mean
Based on Pooled StDev
Level N Mean StDev ----------+---------+---------+------
A 8 7.250 1.669 (-------*-------)
B 8 8.875 1.458 (-------*-------)
P 9 10.111 1.764 (------*-------)
----------+---------+---------+------
Pooled StDev = 1.641 7.5 9.0 10.5

Clearest difference: P is worse than A (CI’s don’t overlap)


Multiple Comparisons
Once ANOVA indicates that the groups do not all have the same means, we can compare them two by two using the 2-sample t test.
• We need to adjust our p-value threshold because we are doing multiple tests with the same data.
• There are several methods for doing this.
• If we really just want to test the difference between one pair of treatments, we should set the study up that way.
Tukey's Pairwise Comparisons
Tukey's pairwise comparisons (Minitab)
95% confidence; Family error rate = 0.0500; Individual error rate = 0.0199
Critical value = 3.55. Use alpha = 0.0199 for each test.
Intervals for (column level mean) − (row level mean):

        A                    B
B   (−3.685, 0.435)
P   (−4.863, −0.859)     (−3.238, 0.766)

These give 98.01% CIs for each pairwise difference. Only P vs A is significant (both interval endpoints have the same sign); the 98% CI for A − P is (−4.86, −0.86).
Tukey’s Method in R
Tukey multiple comparisons of means
95% family-wise confidence level

diff lwr upr


B-A 1.6250 -0.43650 3.6865
P-A 2.8611 0.85769 4.8645
P-B 1.2361 -0.76731 3.2395
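The same comparisons can be reproduced in Python. A minimal sketch, assuming the statsmodels package, whose pairwise_tukeyhsd function implements Tukey's HSD:

```python
import numpy as np
from statsmodels.stats.multicomp import pairwise_tukeyhsd

days = np.array([5, 6, 6, 7, 7, 8, 9, 10,            # A
                 7, 7, 8, 9, 9, 10, 10, 11,          # B
                 7, 9, 9, 10, 10, 10, 11, 12, 13])   # P
groups = ['A'] * 8 + ['B'] * 8 + ['P'] * 9

# Tukey's HSD at a 5% family-wise error rate
print(pairwise_tukeyhsd(days, groups, alpha=0.05))
# diffs: B-A 1.625, P-A 2.861, P-B 1.236 -- matching the R intervals above
```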
One-Way ANOVA

Introduction to Analysis of Variance


(ANOVA)
What is ANOVA?

• ANOVA is short for ANalysis Of VAriance
• Used with 3 or more groups to test for MEAN DIFFS, e.g., a caffeine study with 3 groups:
  – No caffeine
  – Mild dose
  – Jolt group
• A Level is the value, kind, or amount of the IV
• A Treatment Group is the people who get a specific treatment or level of the IV
• A Treatment Effect is the size of the difference in means
Rationale for ANOVA (1)

• We have at least 3 means to test, e.g., H0: μ1 = μ2 = μ3.
• Could take them 2 at a time, but we really want to test all 3 (or more) at once.
• Instead of using a mean difference, we can use the variance of the group means about the grand mean over all groups.
• The logic is just the same as for the t-test: compare the observed variance among means (the observed difference in means in the t-test) to what we would expect to get by chance.
Rationale for ANOVA (2)
Suppose we drew 3 samples from the same population. Our results might look like this: [figure]
Note that the means from the 3 groups are not exactly the same, but they are close, so the variance among means will be small.
Rationale for ANOVA (3)
Suppose we sample people from 3 different populations. Our results might look like this: [figure]
Note that the sample means are far away from one another, so the variance among means will be large.
Rationale for ANOVA (4)
Suppose we complete a study and find the following results (either graph). How would we know or decide whether there is a real effect or not?
To decide, we can compare our observed variance in means to what we would expect to get on the basis of chance given no true difference in means.
Review

• When would we use a t-test versus 1-way ANOVA?
• In ANOVA, what happens to the variance in means (between cells) if the treatment effect is large?
Rationale for ANOVA
We can break the total variance in a study into meaningful pieces that correspond to treatment effects and error. That's why we call this Analysis of Variance.

Definitions of Terms Used in ANOVA:
X̄_G    The Grand Mean, taken over all observations.
X̄_A    The mean of any level of a treatment.
X̄_A1   The mean of a specific level (1 in this case) of a treatment.
X_i    The observation or raw data for the ith person.
The ANOVA Model
A treatment effect is the difference between the overall, grand mean and the mean of a cell (treatment level):
IV Effect = X̄_A − X̄_G
Error is the difference between a score and a cell (treatment level) mean:
Error = X_i − X̄_A
The ANOVA Model:
X_i = X̄_G + (X̄_A − X̄_G) + (X_i − X̄_A)
An individual's score = the grand mean + a treatment or IV effect + error.
The ANOVA Model
X_i = X̄_G + (X̄_A − X̄_G) + (X_i − X̄_A)
(the grand mean + a treatment or IV effect + error)
The graph shows the terms in the equation. There are three cells or levels in this study. The IV effect and error for the highest-scoring cell are shown.
ANOVA Calculations
Sums of squares (squared deviations from the mean) tell the story of variance. The simple ANOVA designs have 3 sums of squares:

SS_tot = Σ(X_i − X̄_G)²   The total sum of squares comes from the distance of all the scores from the grand mean. This is the total; it's all you have.

SS_W = Σ(X_i − X̄_A)²   The within-group or within-cell sum of squares comes from the distance of the observations to the cell means. This indicates error.

SS_B = Σ N_A(X̄_A − X̄_G)²   The between-cells or between-groups sum of squares tells of the distance of the cell means from the grand mean. This indicates IV effects.

SS_tot = SS_B + SS_W
Computational Example: Caffeine on Test Scores

              G1: Control   G2: Mild      G3: Jolt
Test scores   75 = 79 − 4   80 = 84 − 4   70 = 74 − 4
              77 = 79 − 2   82 = 84 − 2   72 = 74 − 2
              79 = 79 + 0   84 = 84 + 0   74 = 74 + 0
              81 = 79 + 2   86 = 84 + 2   76 = 74 + 2
              83 = 79 + 4   88 = 84 + 4   78 = 74 + 4
Means         79            84            74
SDs (N−1)     3.16          3.16          3.16
Total Sum of Squares: SS_tot = Σ(X_i − X̄_G)²

                  X_i   X̄_G   (X_i − X̄_G)²
G1 (Control)      75    79    16
M = 79            77    79    4
SD = 3.16         79    79    0
                  81    79    4
                  83    79    16
G2                80    79    1
M = 84            82    79    9
SD = 3.16         84    79    25
                  86    79    49
                  88    79    81
G3                70    79    81
M = 74            72    79    49
SD = 3.16         74    79    25
                  76    79    9
                  78    79    1
Sum                           370

In the total sum of squares, we are finding the squared distance of each score from the Grand Mean. If we took the average, we would have a variance.
Within Sum of Squares: SS_W = Σ(X_i − X̄_A)²

                  X_i   X̄_A   (X_i − X̄_A)²
G1 (Control)      75    79    16
M = 79            77    79    4
SD = 3.16         79    79    0
                  81    79    4
                  83    79    16
G2                80    84    16
M = 84            82    84    4
SD = 3.16         84    84    0
                  86    84    4
                  88    84    16
G3                70    74    16
M = 74            72    74    4
SD = 3.16         74    74    0
                  76    74    4
                  78    74    16
Sum                           120

The within sum of squares refers to the variance within cells, that is, the difference between scores and their cell means. SS_W estimates error.
Between Sum of Squares: SS_B = Σ N_A(X̄_A − X̄_G)²

               X̄_A   X̄_G   (X̄_A − X̄_G)²
G1 (M = 79)    79    79    0     (5 observations: 5 × 0 = 0)
G2 (M = 84)    84    79    25    (5 observations: 5 × 25 = 125)
G3 (M = 74)    74    79    25    (5 observations: 5 × 25 = 125)
Sum                        250

The between sum of squares relates the Cell Means to the Grand Mean. This is related to the variance of the means.
ANOVA Source Table (1)

Source           SS    df           MS              F
Between Groups   250   k − 1 = 2    250/2 = 125     MSB/MSW = 125/10 = 12.5
Within Groups    120   N − k = 12   120/12 = 10
Total            370   N − 1 = 14
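The source table can be checked with a short computation. A minimal Python sketch of the between/within decomposition for the caffeine data:

```python
groups = {
    "control": [75, 77, 79, 81, 83],
    "mild":    [80, 82, 84, 86, 88],
    "jolt":    [70, 72, 74, 76, 78],
}
scores = [x for g in groups.values() for x in g]
grand_mean = sum(scores) / len(scores)               # 79

def mean(g):
    return sum(g) / len(g)

ss_b = sum(len(g) * (mean(g) - grand_mean) ** 2 for g in groups.values())
ss_w = sum((x - mean(g)) ** 2 for g in groups.values() for x in g)

df_b = len(groups) - 1             # k - 1 = 2
df_w = len(scores) - len(groups)   # N - k = 12
f = (ss_b / df_b) / (ss_w / df_w)
print(ss_b, ss_w, f)               # 250.0 120.0 12.5, matching the table
```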
ANOVA Source Table (2)

• df – Degrees of freedom. Divide the sum of squares by degrees of freedom to get
• MS, Mean Squares, which are population variance estimates.
• F is the ratio of two mean squares. F is another distribution like z and t. There are tables of F used for significance testing.
The F Distribution
F Table – Critical Values
Numerator df: dfB

dfW 1 2 3 4 5

5 5% 6.61 5.79 5.41 5.19 5.05


1% 16.3 13.3 12.1 11.4 11.0
10 5% 4.96 4.10 3.71 3.48 3.33
1% 10.0 7.56 6.55 5.99 5.64
12 5% 4.75 3.89 3.49 3.26 3.11
1% 9.33 6.94 5.95 5.41 5.06
14 5% 4.60 3.74 3.34 3.11 2.96
1% 8.86 6.51 5.56 5.04 4.70
Review

• What are critical values of a statistic (e.g., critical values of F)?
• What are degrees of freedom?
• What are mean squares?
• What does MSW tell us?
Review: 6 Steps
1. Set alpha (.05).
2. State Null & Alternative: H0: μ1 = μ2 = μ3; H1: not all μ are equal.
3. Calculate the test statistic: F = 12.5.
4. Determine the critical value: F.05(2,12) = 3.89.
5. Decision rule: if the test statistic > the critical value, reject H0.
6. Decision: the test is significant (12.5 > 3.89). The means in the population are different.
Post Hoc Tests
• If the t-test is significant, you have a difference in population means.
• If the F-test is significant, you have a difference in population means. But you don't know where.
• With 3 means, it could be A=B>C or A>B>C or A>B=C.
• We need a test to tell which means are different. Many are available; we will use one.
Tukey HSD (1)
Use with equal sample size per cell. HSD means honestly significant difference.

HSD_α = q_α √(MSW / N_A)

where:
α is the Type I error rate (.05);
q_α is a value from a table of the studentized range statistic, based on alpha, dfW (12 in our example), and k, the number of groups (3 in our example);
MSW is the mean square within groups (10);
N_A is the number of people in each group (5).

Result for our example (q from the table): HSD.05 = 3.77 √(10/5) = 5.33.
Tukey HSD (2)
To see which means are significantly different, we compare the observed differences among our means to the critical value of the Tukey test. The differences are:
1 vs 2: 79 − 84 = −5 (say 5 to keep it positive).
1 vs 3: 79 − 74 = 5.
2 vs 3: 84 − 74 = 10. Because 10 is larger than 5.33, this result is significant (group 2 is different from group 3). The other differences are not significant. (Review the 6 steps.)
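A minimal Python sketch of the same HSD computation, using SciPy's studentized range distribution (available in SciPy 1.7 or later) in place of the printed q table:

```python
from math import sqrt
from scipy.stats import studentized_range  # requires SciPy >= 1.7

k, df_w, msw, n_per_group = 3, 12, 10, 5
q = studentized_range.ppf(0.95, k, df_w)   # q ~ 3.77 for alpha = .05
hsd = q * sqrt(msw / n_per_group)
print(round(q, 2), round(hsd, 2))          # 3.77 5.33

for pair, diff in {"1 vs 2": 5, "1 vs 3": 5, "2 vs 3": 10}.items():
    print(pair, "significant" if diff > hsd else "not significant")
```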
Review

• What is a post hoc test? What is its use?
• Describe the HSD test. What does HSD stand for?
Test

• Another name for mean square is _________.
1. standard deviation
2. sum of squares
3. treatment level
4. variance

Test

• When do we use post hoc tests?
a. after a significant overall F test
b. after a nonsignificant overall F test
c. in place of an overall F test
d. when we want to determine the impact of different factors
ANOVA ‐ Analysis of Variance

• Extends the independent‐samples t test
• Compares the means of groups of independent observations
  – Don't be fooled by the name. ANOVA does not compare variances.
• Can compare more than two groups
ANOVA – Null and Alternative Hypotheses
Say the sample contains K independent groups.

• ANOVA tests the null hypothesis
  H0: μ1 = μ2 = … = μK
  – That is, "the group means are all equal"
• The alternative hypothesis is
  H1: μi ≠ μj for some i, j
  – or, "the group means are not all equal"
Example:  
Accuracy of Implant 
Placement

Implants were placed in a 
manikin using placement 
guides of various widths.

15 implants were placed 
using each guide.

Error (discrepancies with a 
reference implant) was 
measured for each implant. 
Example: Accuracy of Implant Placement
The overall mean of the entire sample was 0.248 mm. This is called the "grand" mean, and is often denoted by X̄.
If H0 were true, then we'd expect the group means to be close to the grand mean.
Example: Accuracy of Implant Placement
The ANOVA test is based on the combined distances from X̄. If the combined distances are large, that indicates we should reject H0.
The ANOVA Statistic
To combine the differences from the grand mean we
– square the differences,
– multiply by the numbers of observations in the groups, and
– sum over the groups:

SSB = 15(X̄_4mm − X̄)² + 15(X̄_6mm − X̄)² + 15(X̄_8mm − X̄)²

where the X̄_* are the group means. "SSB" = Sum of Squares Between groups.
Note: This looks a bit like a variance.
How big is big?

• For the Implant Accuracy Data, SSB = 0.0047

• Is that big enough to reject H0?

• As with the t test, we compare the statistic to the 
variability of the individual observations.

• In ANOVA the variability is estimated by the Mean 
Square Error, or MSE
MSE: Mean Square Error
The Mean Square Error is a measure of the variability after the group effects have been taken into account:

MSE = [1/(N − K)] Σ_j Σ_i (x_ij − X̄_j)²

where x_ij is the ith observation in the jth group.
Note that the variation of the means seems quite small compared to the variance of observations within groups.
Notes on MSE
• If there are only two groups, the MSE is equal to the 
pooled estimate of variance used in the equal‐
variance t test.

• ANOVA assumes that all the group variances are 
equal.

• Other options should be considered if group 
variances differ by a factor of 2 or more.
ANOVA F Test
• The ANOVA F test is based on the F statistic

F = [SSB/(K − 1)] / MSE

where K is the number of groups.
• Under H0 the F statistic has an "F" distribution, with K−1 and N−K degrees of freedom (N is the total number of observations).
Implant Data: F test p-value
To get a p-value we compare our F statistic to an F(2, 42) distribution.
In our example

F = (0.0047/2) / (0.466/42) = 0.211

The p-value is P(F(2,42) > 0.211) = 0.81
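A minimal Python sketch, assuming SciPy, that evaluates this upper-tail F probability:

```python
from scipy.stats import f

p = f.sf(0.211, 2, 42)   # survival function: P(F(2, 42) > 0.211)
print(round(p, 2))       # 0.81
```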


ANOVA Table
Results are often displayed using an ANOVA Table:

                 Sum of Squares   df   Mean Square   F      Sig.
Between Groups   .005             2    .002          .211   .811
Within Groups    .466             42   .011
Total            .470             44

Pop Quiz!: Where are the following quantities presented in this table? The Sum of Squares Between (SSB), the Mean Square Error (MSE), the F statistic, and the p value.
Post Hoc Tests
NHANES I data, women 40–60 yrs old. Compare cholesterol between periodontal groups.
The ANOVA shows good evidence (p = 0.002) that the means are not all the same.

                 Sum of Squares   df     Mean Square   F     Sig.
Between Groups   33383            3      11128         5.1   .002
Within Groups    4417119          2007   2201
Total            4450502          2010

Which means are different? We can directly compare the subgroups using "post hoc" tests.
Least Significant Difference test

                N     Mean    Std. Deviation
Healthy         802   221.5   46.2
Gingivitis      490   223.5   45.3
Periodontitis   347   227.3   48.9
Edentulous      372   232.4   48.8

The simplest post hoc test is called the Least Significant Difference Test. The computation is very similar to the equal-variance t test: compute an equal-variance t test, but replace the pooled variance (s²) with the MSE from the ANOVA table (2201).
Least Significant Difference Test: Examples
Compare the Healthy group to the Periodontitis group:

T = (221.5 − 227.3) / √(2201(1/802 + 1/347)) = −1.92
p = 2·P(t₁₁₄₇ > 1.92) = 0.055

Compare the Gingivitis group to the Periodontitis group:

T = (223.5 − 227.3) / √(2201(1/490 + 1/347)) = −1.15
p = 2·P(t₈₃₅ > 1.15) = 0.25

(The group means and the MSE are taken from the tables above.)
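A minimal Python sketch of the LSD comparison, substituting the MSE (2201) for the pooled variance as described above; the helper name is illustrative, and the degrees of freedom follow the slide's choice of n1 + n2 − 2:

```python
from math import sqrt
from scipy.stats import t

MSE = 2201  # from the cholesterol ANOVA table above

def lsd_test(mean1, n1, mean2, n2):   # illustrative helper, not a library call
    se = sqrt(MSE * (1 / n1 + 1 / n2))
    t_stat = (mean1 - mean2) / se
    df = n1 + n2 - 2                  # df as used on the slide
    return t_stat, 2 * t.sf(abs(t_stat), df)   # two-sided p-value

print(lsd_test(221.5, 802, 227.3, 347))  # about (-1.92, 0.055)
print(lsd_test(223.5, 490, 227.3, 347))  # about (-1.15, 0.25)
```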
Post Hoc Tests: Multiple Comparisons
• Post‐hoc testing usually involves multiple comparisons.
• For example, if the data contain 4 groups, then 6 different pairwise comparisons can be made:

Healthy          Gingivitis
Periodontitis    Edentulous
(every pair of the four groups can be compared)
Post Hoc Tests: Multiple Comparisons
• Each time a hypothesis test is performed at significance level α, there is probability α of rejecting in error.
• Performing multiple tests increases the chances of rejecting in error at least once.
• For example:
  – if you did 6 independent hypothesis tests at the α = 0.05 level,
  – and if, in truth, H0 were true for all six,
  – the probability that at least one test rejects H0 is 26%:
  – P(at least one rejection) = 1 − P(no rejections) = 1 − .95⁶ = .26
Bonferroni Correction for Multiple Comparisons
• The Bonferroni correction is a simple way to adjust 
for the multiple comparisons.

Bonferroni Correction
• Perform each test at significance level α.
• Multiply each p-value by the number of tests
performed.
• The overall significance level (chance of any of the
tests rejecting in error) will be less than α.
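A minimal Python sketch of the p-value form of the correction (the helper name is illustrative); the example p-values are the six LSD p-values from the table that follows:

```python
def bonferroni(p_values):             # illustrative helper
    m = len(p_values)                 # number of tests performed
    return [min(1.0, p * m) for p in p_values]

raw = [0.46, 0.055, 0.00021, 0.25, 0.0056, 0.147]   # the six LSD p-values
print(bonferroni(raw))
# [1.0, 0.33, 0.00126, 1.0, 0.0336, 0.882] -- the table's Bonferroni column
```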
Example: Cholesterol Data post‐hoc comparisons

Group 1         Group 2         Mean Difference      LSD       Bonferroni
                                (Group 1 − Group 2)  p-value   p-value
Healthy         Gingivitis      −2.0                 .46       1.0
Healthy         Periodontitis   −5.8                 .055      .330
Healthy         Edentulous      −10.9                .00021    .00126
Gingivitis      Periodontitis   −3.9                 .25       1.0
Gingivitis      Edentulous      −8.9                 .0056     .0336
Periodontitis   Edentulous      −5.1                 .147      .88

Conclusion: The Edentulous group is significantly different than the Healthy group and the Gingivitis group (p < 0.05), after adjustment for multiple comparisons.
Summarizing Scores with
Measures of Central Tendency:
The Mean, Median, and Mode
Outline of the Course
III. Descriptive Statistics
A. Measures of Central Tendency (Chapter 3)
1. Mean
2. Median
3. Mode
B. Measures of Variability (Chapter 4)
1. Range
2. Mean deviation
3. Variance
4. Standard Deviation
C. Skewness (Chapter 2)
1. Positive skew
2. Normal distribution
3. Negative skew
D. Kurtosis
1. Platykurtic
2. Mesokurtic
3. Leptokurtic
Measures of Central Tendency

• The goal of measures of central tendency is to come up with the one single number that best describes a distribution of scores.
• Lets us know if the distribution of scores tends to be composed of high scores or low scores.
Measures of Central Tendency

• There are three basic measures of central tendency, and choosing one over another depends on two different things:
1. The scale of measurement used, so that a summary makes sense given the nature of the scores.
2. The shape of the frequency distribution, so that the measure accurately summarizes the distribution.
Measures of Central Tendency
Mode
• The most common observation in a group of scores.
• Distributions can be unimodal, bimodal, or multimodal.
• If the data are categorical (measured on the nominal scale), then only the mode can be calculated.
• The most frequently occurring score (mode) is Vanilla.

Flavor         f
Vanilla        28
Chocolate      22
Strawberry     15
Neapolitan     8
Butter Pecan   12
Rocky Road     9
Fudge Ripple   6

[Bar chart of the flavor frequencies, f from 0 to 30]
Measures of Central Tendency
Mode
• The mode can also be calculated with ordinal and higher data, but it often is not appropriate.
• If other measures can be calculated, the mode would never be the first choice!
• 7, 7, 7, 20, 23, 23, 24, 25, 26 has a mode of 7, but obviously it doesn't make much sense.
Measures of Central Tendency
Median
• The number that divides a distribution of scores exactly in half.
• The median is the same as the 50th percentile.
• Better than the mode because only one score can be the median, and the median will usually be around where most scores fall.
• If the data are perfectly normal, the mode is the median.
• The median is computed when data are ordinal scale or when they are highly skewed.
Measures of Central Tendency
Median
There are three methods for computing the median, depending on the distribution of scores:
• First, if you have an odd number of scores, pick the middle score: 1 4 6 7 12 14 18 → the median is 7.
• Second, if you have an even number of scores, take the average of the middle two: 1 4 6 7 8 12 14 16 → the median is (7 + 8)/2 = 7.5.
• Third, if you have several scores with the same value in the middle of the distribution, use the formula for percentiles (not found in your book).
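The first two rules are what Python's statistics.median implements; a minimal sketch:

```python
from statistics import median

print(median([1, 4, 6, 7, 12, 14, 18]))     # odd n: middle score -> 7
print(median([1, 4, 6, 7, 8, 12, 14, 16]))  # even n: average of middle two -> 7.5
```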
Measures of Central Tendency
Mean
• The arithmetic average, computed simply by adding together all scores and dividing by the number of scores.
• It uses information from every single score.
• For a population: μ = ΣX/N.  For a sample: X̄ = ΣX/n.
Measures of Central Tendency
Mean: Other Notes
• If data are perfectly normal, then the mean, median, and mode are exactly the same.
• I would prefer to use the mean whenever possible since it uses information from EVERY score.
• Though the preferred symbol for the mean is an X with a line over the top (X̄), creating this symbol is pretty tricky on the computer. APA style says: X̄ = M.
Measures of Central Tendency
The Shape of Distributions
• With perfectly bell-shaped distributions, the mean, median, and mode are identical.
• With positively skewed data, the mode is lowest, followed by the median and mean.
• With negatively skewed data, the mean is lowest, followed by the median and mode.
Measures of Central Tendency
Mean vs. Median: Salary Example
• On one block, the incomes of the families are (in thousands of dollars) 40, 42, 41, 45, 38, 40, 42, 500.
• ΣX = 788, so X̄ = ΣX/n = 788/8 = 98.5.
• The mean salary for this sample is $98,500, which is more than twice almost all of the scores.
• Arrange the scores: 38, 40, 40, 41, 42, 42, 45, 500.
• The middle two numbers are 41 and 42, so the median is $41,500, perhaps a more accurate measure of central tendency.
Measures of Central Tendency
Mean vs. Median: Reaction Time Example
• The data are times to complete a task (in s): 45, 34, 87, 56, 21, didn't finish, 49.
• It is not possible to compute a mean with this unknown number.
• Even though we do not know this person's time, we do know it is REALLY big.
• 21, 34, 45, 49, 56, 87, something bigger.
• The median is the middle number, 49.
Measures of Central Tendency
Mean: Algebra Revisited
• It's useful to consider the formula as being like any other algebraic formula, subject to the same rules:
  X̄ = ΣX/n, and therefore X̄·n = ΣX.
• So, if we know the mean of a group of scores, we can figure out the ΣX.
Measures of Central Tendency
Mean: Weighted Mean
• Let's pretend that one semester's class of 23 students scored M1 = 18 points on a quiz. The same quiz was then given the next semester to 34 students, who got M2 = 22 points. What is the overall (weighted) mean for these 57 students?
• ΣX1 can be computed by multiplying M1 by the sample size: ΣX1 = M1·n1 = 18·23 = 414.
• For the second class, ΣX2 = M2·n2 = 22·34 = 748.
• ΣXtotal = ΣX1 + ΣX2 = 414 + 748 = 1206.
• ntotal = n1 + n2 = 23 + 34 = 57.
• Mtotal = ΣXtotal/ntotal = 1206/57 = 21.158.
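A minimal Python sketch of the same weighted-mean computation (the helper name is illustrative):

```python
def weighted_mean(means, ns):   # illustrative helper
    total = sum(m * n for m, n in zip(means, ns))  # reconstruct sum of X per class
    return total / sum(ns)

print(round(weighted_mean([18, 22], [23, 34]), 3))  # 1206 / 57 = 21.158
```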


Measures of Central Tendency
Mean: Adding a Score
• On the first exam, 15 students had M = 85.
• One kid came in late, took the test, and scored 53. What is Mnew?
• ΣXoriginal = Moriginal·noriginal = 85·15 = 1275.
• ΣXnew = ΣXoriginal + new score = 1275 + 53 = 1328.
• nnew = noriginal + 1 = 16.
• Mnew = ΣXnew/nnew = 1328/16 = 83.
Measures of Central Tendency
Mean: Changing an Existing Score
• On the first exam, 16 students had M = 83.
• One kid came in after the test and complained; I listened and decided to give him 10 extra points. Now what?
• ΣXoriginal = Moriginal·noriginal = 83·16 = 1328.
• ΣXnew = ΣXoriginal + extra points = 1328 + 10 = 1338.
• Mnew = ΣXnew/n = 1338/16 = 83.625.
Measures of Central Tendency
Mean: Transformations
• If a constant is added to (or subtracted from) each score, the same constant will be added to (or subtracted from) the mean.
  – If M for an exam is 82 and I find that I screwed up a question and give everyone 5 extra points, M simply becomes 82 + 5 = 87.
• If every score is multiplied or divided by a constant number, then the mean will also be multiplied or divided by the same number.
  – This last property is particularly useful when converting between units of measurement.
  – If the M for the height of a group of first-graders is 47 inches, but I need to know their heights in cm, I could take every kid's height × 2.54 and then recompute M, or I could simply take the mean times 2.54 and conclude that the M height of these kids is 119.38 cm.
Measures of Central Tendency
Deviations around the Mean
A common formula we will be working with extensively is the deviation: X − X̄.

Exam Score   X − X̄
7            (7 − 9) = −2
6            (6 − 9) = −3
8            (8 − 9) = −1
9            (9 − 9) = 0
12           (12 − 9) = 3
10           (10 − 9) = 1
11           (11 − 9) = 2
9            (9 − 9) = 0

ΣX = 72, n = 8, X̄ = ΣX/n = 72/8 = 9, and Σ(X − X̄) = 0.
Measures of Central Tendency
Using the Mean to Interpret Data: Predicting Scores
• If asked to predict a score, and you know nothing else, then predict the mean.
• However, we will probably be wrong, and our error will equal X − X̄.
• A score's deviation indicates the amount of error we have when using the mean to predict an individual score.
Measures of Central Tendency
Using the Mean to Interpret Data: Describing a Score's Location
• If you take a test and get a score of 45, the 45 means nothing in and of itself. However, if you learn that M = 50, then we know more: your score was 5 units BELOW M.
• Positive deviations are above M.
• Negative deviations are below M.
• Large deviations indicate a score far from M.
• Large deviations occur less frequently.
Measures of Central Tendency
Using the Mean to Interpret Data
Describing the Population Mean
z Remember, we usually want to know population
parameters, but populations are too large.
z So, we use the sample mean to estimate the
population mean.

X̄ ≈ μ
Measures of Central Tendency
Consider the Measurements and Frequency Table
Generated in the previous lecture
87, 85, 79, 75, 81, 88, 92, 86, 77, 72, 75, 77, 81, 80, 77,
73, 69, 71, 76, 79, 83, 81, 78, 75, 68, 67, 71, 73, 78, 75,
84, 81, 79, 82, 87, 89, 85, 81, 79, 77, 81, 78, 74, 76, 82,
85, 86, 81, 72, 69, 65, 71, 73, 78, 81, 77, 74, 77, 72, 68
Class          Midpoint   Count   Relative Frequency
64.5 – 69.5    67         6       0.100
69.5 – 74.5    72         11      0.183
74.5 – 79.5    77         20      0.333
79.5 – 84.5    82         13      0.217
84.5 – 89.5    87         9       0.150
89.5 – 94.5    92         1       0.0167
Measures of Central Tendency

We have determined that the lowest and highest readings
in this set of measurements are:
• Low = 65
• High = 92

This gives a range = 92 − 65 = 27

The simplest measurement of central tendency in this
population is the midrange.

Define: midrange = (low value + high value)/2

Midrange = (65 + 92)/2 = 157/2 = 78.5


Measures of Central Tendency

The most descriptive measure of central tendency in a
population is its mean, because the mean of a sample taken
from the population can be shown to be predictive of the
population mean within some range determined by the
sampling error.

Define the population mean by the formula

μ = Σi xi / N

where
μ = the population mean
Σi xi = the sum of xi over each member of the population
N = the number of items in the population
Measures of Central Tendency

For the 60 temperature readings in this population we
obtain:
87, 85, 79, 75, 81, 88, 92, 86, 77, 72, 75, 77, 81, 80, 77,
73, 69, 71, 76, 79, 83, 81, 78, 75, 68, 67, 71, 73, 78, 75,
84, 81, 79, 82, 87, 89, 85, 81, 79, 77, 81, 78, 74, 76, 82,
85, 86, 81, 72, 69, 65, 71, 73, 78, 81, 77, 74, 77, 72, 68

μ = (87 + 85 + 79 + … + 72 + 68)/60 = 4673/60 = 77.883


Measures of Central Tendency
A third measure of central tendency is the median
The median of a population of size N is found by
1. Arranging the individual measurements in ascending
order, and
2. If N is odd, selecting the value in the middle of this list as
the median (there will be the same number of values
above and below the median)
3. If N is even, finding the values at positions N/2 and N/2 + 1 in
this list (call them x(N/2) and x(N/2+1)) and letting the median be
given by median = (x(N/2) + x(N/2+1))/2, i.e., the value
halfway between these two measurements.
Note! When N is even the median will usually not be an
actual value in the population
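The two cases translate directly into code. A minimal sketch in Python of the procedure just described (the test values are the odd- and even-N examples used later in these notes):

    # Median by the textbook procedure: sort, then take (or average) the middle.
    def median(values):
        data = sorted(values)                      # step 1: ascending order
        n = len(data)
        mid = n // 2
        if n % 2 == 1:                             # N odd: the single middle value
            return data[mid]
        return (data[mid - 1] + data[mid]) / 2     # N even: halfway between them

    print(median([3, 5, 8, 10, 11]))    # 8
    print(median([3, 3, 4, 5, 7, 8]))   # 4.5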
Measures of Central Tendency
We now find the median of the population of temperature
readings
87, 85, 79, 75, 81, 88, 92, 86, 77, 72, 75, 77, 81, 80, 77,
73, 69, 71, 76, 79, 83, 81, 78, 75, 68, 67, 71, 73, 78, 75,
84, 81, 79, 82, 87, 89, 85, 81, 79, 77, 81, 78, 74, 76, 82,
85, 86, 81, 72, 69, 65, 71, 73, 78, 81, 77, 74, 77, 72, 68
Arrange these 60 measurements in ascending order

65, 67, 68, 68, 69, 69, 71, 71, 71, 72, 72, 72, 73, 73, 73,
74, 74, 75, 75, 75, 75, 76, 76, 77, 77, 77, 77, 77, 77, 78,
78, 78, 78, 79, 79, 79, 79, 80, 81, 81, 81, 81, 81, 81, 81,
81, 82, 82, 83, 84, 85, 85, 85, 86, 86, 87, 87, 88, 89, 92
Since N/2 = 30 and both the 30th and 31st values in the list are
the same, we obtain median = 78
Measures of Central Tendency
One further parameter of a population that may give some
indication of central tendency of the data is the mode

Define: mode = most frequently occurring value in the
population

From the previous data,
65, 67, 68, 68, 69, 69, 71, 71, 71, 72, 72, 72, 73, 73, 73,
74, 74, 75, 75, 75, 75, 76, 76, 77, 77, 77, 77, 77, 77, 78,
78, 78, 78, 79, 79, 79, 79, 80, 81, 81, 81, 81, 81, 81, 81,
81, 82, 82, 83, 84, 85, 85, 85, 86, 86, 87, 87, 88, 89, 92
we see that the value 81 occurs 8 times, so mode = 81.

Note! If two different values were to occur most frequently, the
distribution would be bimodal. A distribution may be multi-modal.
Measures of Central Tendency
Next we show where each of these parameters occurs in the
frequency distribution graph for this tabulated data:
[Frequency polygon (frequency % vs. temperature) over the class
midpoints 67, 72, 77, 82, 87, 92, with the mean (77.883), median (78),
midrange (78.5), and mode (81) marked.]
Measures of Central Tendency

In the previous pages we have calculated the mean from the
raw data. We can also use the tabulated data to calculate the
mean of the population.

Use the formula

μ = Σi (fi · xi) / Σi fi

where xi = the midpoint of the ith class
and fi = the number of items in the ith class
Measures of Central Tendency
From the table we obtain
Class          Midpoint (x)   Count (f)   Relative Frequency   f·x
64.5 – 69.5    67             6           0.100                402
69.5 – 74.5    72             11          0.183                792
74.5 – 79.5    77             20          0.333                1540
79.5 – 84.5    82             13          0.217                1066
84.5 – 89.5    87             9           0.150                783
89.5 – 94.5    92             1           0.0167               92
Total                         60                               4675

μ = Σi (fi · xi) / Σi fi = 4675/60 = 77.917

The small discrepancy between these two values for the mean is due to the
way the data are accumulated into classes. The mean of the raw data is more
accurate; the mean of the tabulated data is often more convenient to obtain.
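Both versions of the calculation are easy to verify. A minimal sketch in Python comparing the raw mean with the grouped (midpoint-weighted) mean for the temperature data:

    # Raw mean vs. grouped mean for the 60 temperature readings.
    readings = [87, 85, 79, 75, 81, 88, 92, 86, 77, 72, 75, 77, 81, 80, 77,
                73, 69, 71, 76, 79, 83, 81, 78, 75, 68, 67, 71, 73, 78, 75,
                84, 81, 79, 82, 87, 89, 85, 81, 79, 77, 81, 78, 74, 76, 82,
                85, 86, 81, 72, 69, 65, 71, 73, 78, 81, 77, 74, 77, 72, 68]
    raw_mean = sum(readings) / len(readings)    # 4673 / 60 ≈ 77.883

    midpoints = [67, 72, 77, 82, 87, 92]
    counts = [6, 11, 20, 13, 9, 1]
    grouped = sum(f * x for f, x in zip(counts, midpoints)) / sum(counts)
    print(raw_mean, grouped)                    # ≈ 77.883 and ≈ 77.917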
Measure of central tendency
„ Central tendency
„ A statistical measure that identifies a single
score as representative for an entire
distribution. The goal of central tendency is
to find the single score that is most typical
or most representative of the entire group.
Measure of central tendency
„ The mean
„ Population mean vs. sample mean
μ = ΣX / N        x̄ = ΣX / n
„ N = 4: 3, 7, 4, 6
x̄ = ΣX / n = 20/4 = 5
Measure of central tendency
„ The weighted mean
„ Group A: x̄ = 6, n = 12
„ Group B: x̄ = 7, n = 8
„ Weighted mean = (ΣX1 + ΣX2) / (n1 + n2)
= (72 + 56) / (12 + 8) = 6.4
„ The mean is seriously sensitive to extreme scores.
Measure of central tendency
„ Median
„ The score that divides a distribution exactly
in half. Exactly 50 percent of the
individuals in a distribution have scores at
or below the median.
„ odd: 3, 5, 8, 10, 11 → median = 8
„ even: 3, 3, 4, 5, 7, 8 →
median = (4+5)/2 = 4.5
Measure of central tendency
„ Median
„ The median is often used as a measure of
central tendency when the number of
scores is relatively small, when the data
have been obtained by rank-order
measurement, or when a mean score is
not appropriate.
Measure of central tendency
„ Mode
„ Most frequently obtained score in the data
„ Problems:
„ There may be no mode, or more than one mode
Measure of central tendency
„ Choosing a measure of central tendency
„ the level of measurement of the variable concerned
(nominal, ordinal, interval or ratio);
„ the shape of the frequency distribution;
„ what is to be done with the figure obtained.
„ The mean is really suitable only for ratio and interval
data. For ordinal variables, where the data can be ranked
but one cannot validly talk of 'equal differences' between
values, the median, which is based on ranking, may be
used. Where it is not even possible to rank the data, as in
the case of a nominal variable, the mode may be the
only measure available.
Measure of central tendency
„ Central tendency and the shape of the
distribution
Summary
1. The purpose of central tendency is to determine the single value that best
represents the entire distribution of scores. The three standard measures of
central tendency are the mode, the median, and the mean.
2. The mean is the arithmetic average. It is computed by summing all the
scores and then dividing by the number of scores. Conceptually, the mean
is obtained by dividing the total (ΣX) equally among the number of
individuals (N or n). Although the calculation is the same for a population
or a sample mean, a population mean is identified by the symbol μ and a
sample mean is identified by X̄.
3. Changing any score in the distribution will cause the mean to be changed.
When a constant value is added to (or subtracted from) every score in a
distribution, the same constant value is added to (or subtracted from) the
mean. If every score is multiplied by a constant, the mean will be multiplied
by the same constant. In nearly all circumstances, the mean is the best
representative value and is the preferred measure of central tendency.
Summary
1. The median is the value that divides a distribution exactly in half. The
median is the preferred measure of central tendency when a
distribution has a few extreme scores that displace the value of the
mean. The median also is used when there are undetermined
(infinite) scores that make it impossible to compute a mean.
2. The mode is the most frequently occurring score in a distribution. It
is easily located by finding the peak in a frequency distribution graph.
For data measured on a nominal scale, the mode is the appropriate
measure of central tendency. It is possible for a distribution to have
more than one mode.
3. For symmetrical distributions, the mean will equal the median. If
there is only one mode, then it will have the same value, too.
4. For skewed distributions, the mode will be located toward the side
where the scores pile up, and the mean will be pulled toward the
extreme scores in the tail. The median will be located between these
two values.
Homework

Imagine that you received the following data on the vocabulary test mentioned earlier:
20 22 23 23 23
23 23 23 24 25
28 29 30 30 30
30 30 30 31 32
32 33 33 34 35
35 36 36 37 37

1. Chart the data and draw the frequency polygon.

2. Compute the mean, mode, and median of the data and decide which of the three you
believe to be best for the central tendency of the data.
Measure of variability
„ Variability provides a quantitative
measure of the degree to which scores
in a distribution are spread out or
clustered together.
Measure of variability
„ Range
„ range = Xhighest − Xlowest
„ Quartile:
„ A statistical term describing a division of observations into four defined intervals
based upon the values of the data and how they compare to the entire set of
observations. Each quartile contains 25% of the total observations. Generally, the
data are ordered from smallest to largest: observations in the lowest 25% fall in
the 1st quartile, observations between 25% and 50% in the 2nd quartile,
observations between 50% and 75% in the 3rd quartile, and the remaining
observations in the 4th quartile.
„ Interquartile: The interquartile range is a measure of spread or dispersion. It
is the difference between the 75th percentile (often called Q3) and the 25th
percentile (Q1). The formula for interquartile range is therefore: Q3-Q1.
„ Semi-interquartile: The semi-interquartile range is a measure of spread or
dispersion. It is computed as one half the difference between the 75th
percentile [often called (Q3)] and the 25th percentile (Q1). The formula for
semi-interquartile range is therefore: (Q3-Q1)/2.
„ TOEFL: (560-470)/2=45
Measure of variability
„ Variance
„ Deviation: the deviation of one score from the
mean
„ Variance: takes the distribution of all
scores into account; it is based on the
sum of squares (SS)
Measure of variability
„ Standard deviation

score   mean    deviation   squared deviation
  8     9.67     −1.67        2.79
 25     9.67    +15.33      235.01
  7     9.67     −2.67        7.13
  5     9.67     −4.67       21.81
  8     9.67     −1.67        2.79
  3     9.67     −6.67       44.49
 10     9.67     +0.33        0.11
 12     9.67     +2.33        5.43
  9     9.67     −0.67        0.45
                 sum of squared dev = 320.01

Standard deviation = square root(sum of squared deviations / (N − 1))
= square root(320.01 / (9 − 1))
= square root(40)
= 6.32
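The same computation in a few lines of Python, using the nine scores from the table:

    import math

    # Sample SD: sum of squared deviations over N - 1, then the square root.
    scores = [8, 25, 7, 5, 8, 3, 10, 12, 9]
    mean = sum(scores) / len(scores)            # ≈ 9.67

    ss = sum((x - mean) ** 2 for x in scores)   # ≈ 320.0
    sd = math.sqrt(ss / (len(scores) - 1))      # sqrt(40) ≈ 6.32
    print(sd)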
Measure of variability
„ The larger the standard deviation figure, the wider
the range of distribution away from the measure of
central tendency
Measure of variability
„ Adding a constant to each score does
not change the standard deviation.
„ Multiplying each score by a constant
causes the standard deviation to be
multiplied by the same constant.
Measure of variability
Group A Group B
11 20
8 10
10 1
9 8
8 0
12 30
10 13
11 6
Measure of variability

Reporting the standard deviation (APA):

                 Type of instrument
           Listening            Watching
           Mean      SD         Mean     SD
Males      15.72     4.43       6.94     2.26
Females     3.47     1.12       2.61     0.98
Measure of variability
„ Standard deviation and normal distribution
Homework
1. Calculate the mean, median, mode, range and standard
deviation for the following sample:

Midterm Exam
X X
100 85
88 82
83 96
105 107
78 102
98 113
126 94
85 119
67 91
88 100
88 72
77 88
114 85
Homework
2. Suppose that the following scores were obtained on administering a
language proficiency test to ten aphasics who had undergone a course
of treatment, and ten otherwise similar aphasics who had not
undergone the treatment:
Experimental group Control group
15 31
28 34
62 47
17 41
31 28
58 54
45 36
11 38
76 45
43 32
Calculate the mean score and standard deviation for each group, and
comment on the results.
Locating scores and finding
scales in a distribution
Percentiles, quartiles, deciles
Mind work

Imagine that you conducted an in-service course for ESL teachers. To receive university credit for the
course, the teachers must take examinations--in this case, a midterm and a final. The midterm was a
multiple-choice test of 50 items and the final exam presented teachers with 10 problem situations to
solve. Sue, like most teachers, was a whiz at taking multiple-choice exams, but bombed out on the
problem-solving final exam. She received a 48 on the midterm and a 1 on the final. Becky didn't do so
well on the midterm. She kept thinking of exceptions to answers on the multiple-choice exam. Her score
was 39. However, she really did shine on the final, scoring a 10. Since you expect students to do well on
both exams, you reason that Becky has done a creditable job on each and Sue has not. Becky gets the
higher grade. Yet, if you add the points together, Sue has 49 and Becky has 49. The question is whether
the points are really equal.
Should Sue also do this bit of arithmetic, she might come to your office to complain of the injustice of it
all. How will you show her that the value of each point on the two tests is different?
Locating scores and finding
scales in a distribution
„ Standard score (z-scores):
z = (X − x̄) / s
Locating scores and finding
scales in a distribution
Mind work

Suppose that we have measured the times taken by a very large number of
people to utter a particular sentence, and have shown these times to be
normally distributed with a mean of 3.45 sec and a standard deviation of
0.84 sec. Armed with this information, we can answer various questions.
1. What proportion of the (potentially infinite) population of utterance
times would be expected to fall below 3 sec?
2. What proportion would lie between 3 and 4 sec?
3. What is the time below which only 1 per cent of the times would be
expected to fall?
Mind work

„ 1. z-score for 3 sec: z = (3 − 3.45)/0.84 = −0.54
„ 2. Check the normal distribution table: about 29.46 per cent of the
times fall below 3 sec
„ 3. z-score for 4 sec: z = (4 − 3.45)/0.84 ≈ 0.66, so about 25.46 per cent
of the times fall above 4 sec
„ 4. Between 3 and 4 sec: 100 − 29.46 − 25.46 = 45.1 per cent
„ 5. z-score cutting off the lowest 1 per cent: −2.33
„ 6. −2.33 = (x − 3.45)/0.84, so x = (−2.33 × 0.84) + 3.45 = 1.49 sec
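These table lookups can be reproduced programmatically. A minimal sketch in Python using scipy.stats.norm (the exact answers differ slightly from the hand-rounded table values):

    from scipy.stats import norm

    mu, sigma = 3.45, 0.84    # utterance-time distribution from the slide

    p_below_3 = norm.cdf(3, mu, sigma)               # about 0.30
    p_between = norm.cdf(4, mu, sigma) - p_below_3   # about 0.45
    t_1_percent = norm.ppf(0.01, mu, sigma)          # about 1.50 sec
    print(p_below_3, p_between, t_1_percent)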
Normal Distribution Table
Locating scores and finding
scales in a distribution
„ T-score
„ T score = 10(z) + 50
„ For a scale with mean 500 and SD 100 (e.g., the NMET below): z = (score − 500)/100
„ X = z·s + x̄
Mind work

A foreign languages institute stipulated for its graduate program that any
student scoring below 75 on any single course examination lost the right
to write a thesis. Clearly, this is unscientific, because it effectively forces
comparisons between examinations that are not of the same kind. The
same score of 75 means different things on different examinations: on a
very easy examination it may be a rather low score, while on a difficult
examination it may be a rather high one. Barring everyone below that
score from writing a thesis is neither scientific nor fair. The scientific
approach is to convert the scores on each course into standard scores and
then set a standard-score cutoff below which students may not write a
thesis. As in the example above, once we have standard scores we can also
combine the scores on all courses into a total, or compute an average,
rank the students, and then set a criterion for what total or average score
qualifies a student to write a thesis.
Locating scores and finding
scales in a distribution
„ Distributions with nominal data
„ Implicational scaling (Guttman scaling)
„ Coefficient of scalability
Homework
I. The following scores are obtained by 50 subjects on a language aptitude test:
42 62 44 32 47 42 52 76 36 43
55 27 46 55 47 28 53 44 15 61
18 59 58 57 49 55 88 49 50 62
61 82 66 80 64 50 40 53 28 63
63 25 58 71 82 52 73 67 58 77

1. Draw a histogram to show the distribution of the scores.


2. Calculate the mean and standard deviation of the scores.
3. Suppose Lihua scored 55 in this test, what’s her position in the
whole class?
II. Suppose there will be 418,900 test takers for the NMET in 2006 in
Guangdong, the key universities in China plan to enroll altogether 32,000
students in Guangdong. What score is the lowest threshold for a student to
be enrolled by the key universities? (Remember the mean is 500, standard
deviation is 100).
Sample statistics and population
parameter: estimation
„ Standard error
„ Sampling distribution of the mean
„ Standard error of the mean
„ Standard error = s / √N
„ In order to halve the standard error, we should have to
take a sample which was four times as big.
„ Central limit theorem:
„ For any population with mean μ and standard deviation
σ, the distribution of sample means for sample size n
will approach a normal distribution with mean μ and
standard deviation σ/√n as n approaches infinity.
„ In practice, samples above 30 count as large.
Sample statistics and population
parameter: estimation
„ Interpreting standard error: confidence
limits
Sample statistics and population
parameter: estimation

„ Normal distribution: sample is large


„ t-distribution: sample is small
„ Degrees of freedom: N − 1
„ When the sample is large, t ≈ z
Sample statistics and population
parameter: estimation
„ Interpreting standard error: confidence
limits
„ Mean = 58.2
„ s = 23.6
„ N = 50
„ Standard error = s/√N = 23.6/√50 = 3.3
„ z = (58.2 − x)/3.3 with −1.96 ≤ z ≤ 1.96
gives the 95% limits 51.7 ≤ x ≤ 64.7
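A minimal sketch of the same interval in Python (58.2 and 23.6 are the sample mean and SD from the slide):

    import math

    mean, s, n = 58.2, 23.6, 50
    se = s / math.sqrt(n)         # ≈ 3.3

    lower = mean - 1.96 * se      # ≈ 51.7
    upper = mean + 1.96 * se      # ≈ 64.7
    print(lower, upper)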
Sample statistics and population
parameter: estimation
„ Confidence limits for proportions
„ Standard error = √(p(1 − p)/N)
„ Confidence limits = proportion in sample
± (critical value × standard error)
Sample statistics and population
parameter: estimation
Suppose that we have taken a random sample of 500 finite verbs from a text,
and found that 150 of them have present tense form. How can we set
confidence limits for the proportion of present tense finite verbs in the whole
text, the population from which the sample is taken?

95% confidence limits = proportion in sample ± (1.96 × standard error)
= 0.30 ± (1.96 × 0.02)
= 0.30 ± 0.04 = 0.26 to 0.34.
We can thus be 95 per cent confident that the proportion of present tense
finite verbs in the population lies between 26 and 34 per cent.
Sample statistics and population
parameter: estimation
„ Estimating required sample sizes
„ Standard error = √(p(1 − p)/N)

In a paragraph there are 46 word tokens, of which 11 are two-
letter words. The proportion of such words is thus 11/46 or 0.24.
How big a sample of words should we need in order to be 95 per
cent confident that we had measured the proportion to within an
accuracy of 1 per cent?
0.01 = 1.96 × standard error
Standard error = 0.01 / 1.96
N = (0.24 × 0.76) / (0.01/1.96)² = 7007
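A minimal sketch of this sample-size calculation in Python:

    # Required N so that the 95% margin of error on the proportion is 0.01.
    p = 0.24          # proportion of two-letter words (11/46, rounded)
    margin, z = 0.01, 1.96

    se_needed = margin / z                       # the standard error we can tolerate
    n_required = p * (1 - p) / se_needed ** 2    # ≈ 7007
    print(round(n_required))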
Homework
I. The following are the times (in seconds) taken for a group of 30 subjects
to carry out the detransformation of a sentence into its simplest form:
0.55 0.56 0.52 0.59 0.51 0.50
0.42 0.41 0.37 0.22 0.24 0.41
0.49 0.59 0.75 0.65 0.63 0.61
0.72 0.77 0.76 0.39 0.26 0.68
0.30 0.32 0.44 0.61 0.54 0.47

Calculate (i) the mean, (ii) the standard deviation, (iii) the standard error
of the mean, (iv) the 99 per cent confidence limits for the mean.

II. A random sample of 300 finite verbs is taken from a text, and it is found
that 63 of these are auxiliaries. Calculate the 95 per cent confidence
limits for the proportion of finite verbs which are auxiliaries in the text
as a whole.
III. Using the data in question II, calculate the size of the sample of finite
verbs which would be required in order to estimate the proportion of
auxiliaries to within an accuracy of 1 per cent, with 95 per cent
confidence.
Probability and Hypothesis
Testing
„ Null hypothesis (H0)
„ The null hypothesis states that in the general
population there is no change, no difference, or no
relationship. In the context of an experiment, H0
predicts that the independent variable (treatment)
will have no effect on the dependent variable for
the population. H0: μA- μB=0 or μA= μB
„ Alternative hypothesis (H1)
„ The alternative hypothesis (H1) states that there is
a change, a difference, or a relationship for the
general population. H1: μA≠ μB
Probability and Hypothesis
Testing
„ Null hypothesis (H0)
„ When we reject the null hypothesis, we want the probability to be
very low that we are wrong. If, on the other hand, we must accept
the null hypothesis, we still want the probability to be very low that
we are wrong in doing so.
„ Type I error and Type II error
„ A Type I error is made when the researcher rejects the null
hypothesis when it should not have been rejected.
„ A Type II error is made when the null hypothesis is accepted when
it should have been rejected.
„ In research, we test our hypothesis by finding the probability of
our results. Probability is the proportion of times that any
particular outcome would happen if the research were repeated
an infinite number of times.
Probability and Hypothesis
Testing
„ Two-tailed and one-tailed hypothesis
„ When we specify no direction for the null hypothesis (i.e.,
whether our score will be higher or lower than more typical
scores), we must consider both tails of the distribution. This
is called two-tailed hypothesis.
„ If we have good reason to believe that we will find a
difference (e.g., previous studies or research findings
suggest this is so), then we will use a one-tailed hypothesis.
One-tailed tests specify the direction of the predicted
difference. We use previous findings to tell us which
direction to select.
Critical values of z:   α = .05   α = .01
1-tailed                1.64      2.33
2-tailed                1.96      2.58
Probability and Hypothesis
Testing
„ Steps in hypothesis testing
1. State the null hypothesis.
2. Decide whether to test it as a one- or two-tailed hypothesis. If there is
no research evidence on the issue, select a two-tailed hypothesis. This
will allow you to reject the null hypothesis in favor of an alternative
hypothesis. If there is research evidence on the issue, select a
one-tailed hypothesis. This will allow you to reject the null
hypothesis in favor of a directional hypothesis.
3. Set the probability level (α level). Justify your choice.
4. Select the appropriate statistical test(s) for the data.
5. Collect the data and apply the statistical test(s).
6. Report the test results and interpret them correctly.
Probability and Hypothesis
Testing
„ Parametric vs. nonparametric
„ Parametric procedures
„ Make strong assumptions about the distribution of the
data
„ Assume the data are NOT frequencies or ordinal scales
but interval data
„ Data are normally distributed
„ Nonparametric procedures
„ Do not make strong assumptions about the shape of the
distribution of the data
„ Work with frequencies and rank-ordered scales
„ Used when the sample size is small
Homework
Lecture 14 chi-square test, P-
value
• Measurement error (review from lecture 13)
• Null hypothesis; alternative hypothesis
• Evidence against null hypothesis
• Measuring the Strength of evidence by P-
value
• Pre-setting significance level
• Conclusion
• Confidence interval
Some general thoughts about
hypothesis testing
• A claim is any statement made about the truth; it
could be a theory made by a scientist, or a
statement from a prosecutor, a manufacturer or a
consumer
• Data cannot prove a claim, however, because there
may be other data that could contradict the theory
• Data can be used to reject the claim if there is a
contradiction to what may be expected
• Put any claim in the null hypothesis H0
• Come up with an alternative hypothesis and put it
as H1
• Study the data and find a hypothesis-testing statistic,
which is an informative summary of the data
• The test statistic is chosen by experience or
statistical training; it depends on the
formulation of the problem and how the data
are related to the hypothesis.
• Find the strength of evidence by the P-value:
from a future set of data, compute the
probability that the summary test statistic
will be as large as or even greater than the
one obtained from the current data. If the
P-value is very small, then either the null
hypothesis is false or you are extremely
unlucky. So the statistician will argue that this is
strong evidence against the null hypothesis.
If the P-value is smaller than a pre-specified level
(called the significance level, 5% for example),
the null hypothesis is rejected.
Back to the microarray example
• H0 : true SD σ = 0.1 (denote 0.1 by σ0)
• H1 : true SD σ > 0.1 (because this is the main
concern; you don't care if SD is small)
• Summary:
• Sample SD (s) = square root of (sum of
squares / (n − 1)) = 0.18
• where sum of squares = (1.1 − 1.3)² + (1.2 − 1.3)²
+ (1.4 − 1.3)² + (1.5 − 1.3)² = 0.1, n = 4
• The ratio s/σ0 = 1.8; is it too big?
• The P-value consideration:
• Suppose a future data set (n = 4) will be
collected.
• Let s be the sample SD from this future
dataset; it is random; so what is the
probability that s/σ0 will be at least this big?
• P(s/σ0 > 1.8)
• But to find the probability we need to use the chi-
square distribution:
• Recall that sum of squares / true variance
follows a chi-square distribution;
• Therefore, equivalently, we compute
• P(future sum of squares / σ0² > sum of
squares from the currently available data / σ0²),
where σ0 is the value claimed under the null hypothesis.
Once again, if data were generated again, then sum of
squares / true variance is random and follows a chi-square
distribution with n − 1 degrees of freedom, where sum of squares = the
sum of squared distances between each data point and the sample mean.
(Note: sum of squares = (n − 1) · sample variance = (n − 1)(sample SD)².)
The value computed from the available data = 0.10/0.01 = 10
(sum of squares = 0.1, true variance claimed under H0 = σ0² = 0.01), so
P-value = P(chi-square random variable > computed value from data)
= P(chi-square random variable > 10.0)
For our case, n = 4, so look at the chi-square distribution with df = 3.
From the table, P(χ² > 9.348) = .025 and P(χ² > 11.34) = .01, so the
P-value is between .025 and .01: reject the null hypothesis at the 5%
significance level.
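The P-value can be computed directly instead of being bracketed from a table. A minimal sketch in Python using scipy.stats.chi2:

    from scipy.stats import chi2

    data = [1.1, 1.2, 1.4, 1.5]
    n = len(data)
    mean = sum(data) / n                        # 1.3
    ss = sum((x - mean) ** 2 for x in data)     # 0.1

    sigma0_sq = 0.1 ** 2                        # variance claimed under H0
    stat = ss / sigma0_sq                       # 10.0
    p_value = chi2.sf(stat, df=n - 1)           # ≈ 0.019, between .01 and .025
    print(stat, p_value)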
Confidence interval
• A 95% confidence interval for true variance σ2
is
• (Sum of squares/C2, sum of squares/C1)
• Where C1 and C2 are the cutting points from
chi-square table with d.f=n-1 so that
• P(chisquare random variable > C1)= .975
• P(chisquare random variable>C2)=.025
• This interval is derived from
P(C1 < sum of squares/σ² < C2) = .95
• For our data, sum of squares = 0.1; from the chi-square table with
d.f. = 3, C1 = .216 and C2 = 9.348, so the 95% confidence interval for σ²
is 0.1/9.348 = 0.0107 to 0.1/0.216 = 0.4630. How about a confidence
interval for σ? (Take square roots: about 0.10 to 0.68.)
Chi-Square Procedures

11.1
Chi-Square Goodness of Fit
Characteristics of the Chi-Square Distribution
1. It is not symmetric.
2. The shape of the chi-square distribution
depends upon the degrees of freedom, just like
Student’s t-distribution.
3. As the number of degrees of freedom
increases, the chi-square distribution becomes
more symmetric as is illustrated in Figure 1.
4. The values are non-negative. That is, the
values of χ² are greater than or equal to 0.
The Chi-Square Distribution
A goodness-of-fit test is an inferential
procedure used to determine whether a
frequency distribution follows a claimed
distribution.
Expected Counts
Suppose there are n independent trials of an
experiment with k ≥ 3 mutually exclusive possible
outcomes. Let p1 represent the probability of
observing the first outcome and E1 represent the
expected count of the first outcome, p2 represent
the probability of observing the second outcome
and E2 represent the expected count of the second
outcome, and so on. The expected count for
each possible outcome is given by

Ei = μi = npi for i = 1, 2, …, k
EXAMPLE Finding Expected Counts
A sociologist wishes to determine whether the distribution for
the number of years grandparents who are responsible for
their grandchildren is different today than it was in 2000.
According to the United States Census Bureau, in 2000,
22.8% of grandparents have been responsible for their
grandchildren less than 1 year; 23.9% of grandparents have
been responsible for their grandchildren 1 or 2 years; 17.6%
of grandparents have been responsible for their
grandchildren 3 or 4 years; and 35.7% of grandparents have
been responsible for their grandchildren for 5 or more years.
If the sociologist randomly selects 1,000 grandparents that
are responsible for their grandchildren, compute the
expected number within each category assuming the
distribution has not changed from 2000.
Test Statistic for Goodness-of-Fit Tests
Let Oi represent the observed count of category i,
Ei represent the expected count of category i, k
represent the number of categories, and n represent
the number of independent trials of an experiment.
Then,

χ² = Σi (Oi − Ei)² / Ei,  i = 1, 2, …, k

approximately follows the chi-square distribution
with k − 1 degrees of freedom provided (1) all
expected frequencies are greater than or equal to 1
(all Ei ≥ 1) and (2) no more than 20% of the
expected frequencies are less than 5. NOTE: Ei =
npi for i = 1, 2, ..., k.
The Chi-Square Goodness-of-Fit Test
If a claim is made regarding a distribution, we
can use the following steps to test the
claim provided
1. the data is randomly selected
2. all expected frequencies are greater than
or equal to 1.
3. no more than 20% of the expected
frequencies are less than 5.
Step 1: A claim is made regarding a
distribution. The claim is used to
determine the null and alternative
hypotheses.

H0: the random variable follows the
claimed distribution
H1: the random variable does not follow
the claimed distribution
Step 2: Calculate the expected frequencies for
each of the k categories. The expected
frequencies are npi for i = 1, 2, …, k where n is
the number of trials and pi is the probability of the
ith category assuming the null hypothesis is true.
Step 3: Verify the requirements for the
goodness-of-fit test are satisfied.
(1) all expected frequencies are greater
than or equal to 1 (all Ei ≥ 1)
(2) no more than 20% of the expected
frequencies are less than 5.
EXAMPLE Testing a Claim Using the Goodness-of-Fit
Test
A sociologist wishes to determine whether the distribution
for the number of years grandparents who are
responsible for their grandchildren is different today than
it was in 2000. According to the United States Census
Bureau, in 2000, 22.8% of grandparents have been
responsible for their grandchildren less than 1 year;
23.9% of grandparents have been responsible for their
grandchildren 1 or 2 years; 17.6% of grandparents have
been responsible for their grandchildren 3 or 4 years; and
35.7% of grandparents have been responsible for their
grandchildren for 5 or more years. The sociologist
randomly selects 1,000 grandparents that are responsible
for their grandchildren and obtains the following data.
Solution:
• Step 1. Construct the Hypothesis
• H0 : The distribution for the number of years
grandparents who are responsible for their
grandchildren is the same today as it was in 2000.

H1 : The distribution for the number of years


grandparents who are responsible for their
grandchildren is different today from what it was
in 2000.
• Step 2. Compute the expected counts for each
category, assuming that the null hypothesis is true.
Number of Years      Frequency (Oi)     Expected Frequency (Ei)
                     (observed count)   (expected count)
Less than 1 year     252                228
1 or 2 years         255                239
3 or 4 years         162                176
5 or more years      331                357
Solution (cont'd):
• Step 3. Verify that the requirements for the
goodness-of-fit test are satisfied:
1. All expected frequencies (expected counts)
are greater than or equal to 1.
2. No more than 20% of the expected
frequencies are less than 5.
• Step 4. Find the critical value and determine the
critical region.
α = 0.05, k = 4, degrees of freedom = k − 1 = 3
Looking in Table IV,
χ²α = 7.815
Critical region C = (7.815, ∞)
• Step 5. Compute the test statistic

χ² = (252 − 228)²/228 + (255 − 239)²/239
+ (162 − 176)²/176 + (331 − 357)²/357
= 6.605
• Step 6. Compare the test statistic with the critical value:
the test statistic < the critical value,
i.e., the test statistic does not lie in the critical region.
• Step 7. Conclusion:
There is not sufficient evidence at the α = 0.05 level of significance to reject
the null hypothesis, i.e., we retain the claim that the distribution for the
number of years grandparents are responsible for their grandchildren is the
same today as it was in 2000.
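The same test runs in a few lines of Python, as a cross-check of the hand calculation (scipy.stats.chisquare takes the observed and expected counts):

    from scipy.stats import chisquare

    observed = [252, 255, 162, 331]
    expected = [228, 239, 176, 357]   # 1000 x (0.228, 0.239, 0.176, 0.357)

    stat, p_value = chisquare(observed, f_exp=expected)
    print(stat, p_value)   # ≈ 6.605 and ≈ 0.086; p > 0.05, so do not reject H0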
CORRELATION
Correlation
„key concepts:
Types of correlation
Methods of studying correlation
a) Scatter diagram
b) Karl pearson’s coefficient of correlation
c) Spearman’s Rank correlation coefficient
d) Method of least squares
Correlation
„ Correlation: The degree of relationship between the
variables under consideration is measured through
correlation analysis.
„ The measure of correlation is called the correlation coefficient.
„ The degree of relationship is expressed by a coefficient which
ranges from −1 to +1 (−1 ≤ r ≤ +1).
„ The direction of change is indicated by a sign.
„ Correlation analysis enables us to have an idea about the
degree & direction of the relationship between the two
variables under study.
Correlation
„ Correlation is a statistical tool that helps
to measure and analyze the degree of
relationship between two variables.
„ Correlation analysis deals with the
association between two or more
variables.
Correlation & Causation
„ Causation means cause & effect relation.
„ Correlation denotes the interdependency among the
variables. For correlating two phenomena, it is essential
that the two phenomena have a cause-effect
relationship; if such a relationship does not exist then the
two phenomena cannot be correlated.
„ If two variables vary in such a way that movements in one
are accompanied by movements in the other, these variables
have a cause and effect relationship.
„ Causation always implies correlation but correlation does
not necessarily imply causation.
Types of Correlation
Type I

Correlation

Positive Correlation Negative Correlation


Types of Correlation Type I
„ Positive Correlation: The correlation is said to
be positive correlation if the values of two
variables changing with same direction.
Ex. Pub. Exp. & sales, Height & weight.
„ Negative Correlation: The correlation is said to
be negative correlation when the values of
variables change with opposite direction.
Ex. Price & qty. demanded.
Direction of the Correlation
„ Positive relationship – Variables change in the
same direction. Indicated by a (+) sign.
„ As X is increasing, Y is increasing
„ As X is decreasing, Y is decreasing
„ E.g., As height increases, so does weight.
„ Negative relationship – Variables change in
opposite directions. Indicated by a (−) sign.
„ As X is increasing, Y is decreasing
„ As X is decreasing, Y is increasing
„ E.g., As TV time increases, grades decrease
More examples
„ Positive relationships:
„ water consumption and temperature
„ study time and grades
„ Negative relationships:
„ alcohol consumption and driving ability
„ price & quantity demanded
Types of Correlation
Type II

Correlation

Simple Multiple

Partial Total
Types of Correlation Type II
„ Simple correlation: Under simple correlation,
only two variables are studied.
„ Multiple Correlation: Under Multiple
Correlation three or more than three variables
are studied. Ex. Qd = f ( P,PC, PS, t, y )
„ Partial correlation: analysis recognizes more
than two variables but considers only two
variables keeping the other constant.
„ Total correlation: is based on all the relevant
variables, which is normally not feasible.
Types of Correlation
Type III

Correlation

LINEAR NON LINEAR


Types of Correlation Type III
„ Linear correlation: Correlation is said to be linear
when the amount of change in one variable tends to
bear a constant ratio to the amount of change in the
other. The graph of the variables having a linear
relationship will form a straight line.
Ex X = 1, 2, 3, 4, 5, 6, 7, 8,
Y = 5, 7, 9, 11, 13, 15, 17, 19,
Y = 3 + 2x
„ Non Linear correlation: The correlation would be
non linear if the amount of change in one variable
does not bear a constant ratio to the amount of change
in the other variable.
Methods of Studying Correlation

„ Scatter Diagram Method


„ Graphic Method
„ Karl Pearson’s Coefficient of
Correlation
„ Method of Least Squares
Scatter Diagram Method

„ A scatter diagram is a graph of observed
plotted points where each point
represents the values of X & Y as a
coordinate. It portrays the relationship
between these two variables graphically.
A perfect positive correlation
[Scatterplot: the weights of persons A and B plotted against their
heights fall on a straight line, i.e., a linear relationship.]
High degree of positive correlation
„ Positive relationship, r = +.80 [scatterplot: weight vs. height]
Degree of correlation
„ Moderate positive correlation, r = +0.4 [scatterplot: shoe size vs. weight]
Degree of correlation
„ Perfect negative correlation, r = −1.0 [scatterplot: TV watching per week
vs. exam score]
Degree of correlation
„ Moderate negative correlation, r = −.80 [scatterplot: TV watching per
week vs. exam score]
Degree of correlation
„ Weak negative correlation, r = −0.2 [scatterplot: shoe size vs. weight]
Degree of correlation
„ No correlation (horizontal line), r = 0.0 [scatterplot: IQ vs. height]
Degree of correlation (r)
[Four scatterplot panels: r = +.80, r = +.60, r = +.40, r = +.20]
Advantages of Scatter Diagram
„ Simple & non-mathematical method
„ Not influenced by the size of extreme items
„ First step in investigating the relationship
between two variables
Disadvantage of scatter diagram
Cannot give an exact degree of
correlation
Karl Pearson's
Coefficient of Correlation
„ Pearson’s ‘r’ is the most common
correlation coefficient.
„ Karl Pearson’s Coefficient of Correlation
denoted by- ‘r’ The coefficient of
correlation ‘r’ measure the degree of
linear relationship between two variables
say x & y.
Karl Pearson's
Coefficient of Correlation
„ Karl Pearson's Coefficient of
Correlation is denoted by r,
−1 ≤ r ≤ +1
„ Degree of correlation is expressed by the
value of the coefficient
„ Direction of change is indicated by the sign,
(−ve) or (+ve)
Karl Pearson's
Coefficient of Correlation
„ When deviations are taken from the actual mean:
r(x, y) = Σxy / √(Σx² · Σy²)
„ When deviations are taken from an assumed mean:
r = [N Σdxdy − Σdx Σdy] /
[√(N Σdx² − (Σdx)²) · √(N Σdy² − (Σdy)²)]
Procedure for computing the
correlation coefficient
„ Calculate the means of the two series 'x' & 'y'.
„ Calculate the deviations of 'x' & 'y' in the two series from their
respective means.
„ Square each deviation of 'x' & 'y', then obtain the sums of
the squared deviations, i.e., Σx² & Σy².
„ Multiply each deviation under x by the corresponding deviation
under y & obtain the products 'xy'. Then obtain the sum of the
products of x, y, i.e., Σxy.
„ Substitute the values in the formula.
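The procedure translates directly into code. A minimal sketch in Python of the deviations-from-the-mean formula (the x and y values here are made up purely for illustration):

    import math

    # Pearson's r: r = Σxy / sqrt(Σx² · Σy²), with x and y taken as deviations.
    x = [1, 2, 3, 4, 5]
    y = [2, 4, 5, 4, 5]

    mx, my = sum(x) / len(x), sum(y) / len(y)
    dx = [xi - mx for xi in x]
    dy = [yi - my for yi in y]

    sum_xy = sum(a * b for a, b in zip(dx, dy))
    r = sum_xy / math.sqrt(sum(a * a for a in dx) * sum(b * b for b in dy))
    print(r)   # ≈ 0.77 for these made-up values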
Interpretation of Correlation
Coefficient (r)
„ The value of correlation coefficient ‘r’ ranges
from -1 to +1
„ If r = +1, then the correlation between the two
variables is said to be perfect and positive
„ If r = -1, then the correlation between the two
variables is said to be perfect and negative
„ If r = 0, then there exists no correlation between
the variables
Properties of Correlation coefficient
„ The correlation coefficient lies between −1 & +1,
symbolically (−1 ≤ r ≤ +1)
„ The correlation coefficient is independent of the
change of origin & scale.
„ The coefficient of correlation is the geometric mean of the
two regression coefficients:
r = √(bxy · byx)
„ If one regression coefficient is (+ve), the other regression
coefficient is also (+ve), and the correlation coefficient is (+ve)
Assumptions of Pearson’s
Correlation Coefficient
„ There is linear relationship between two
variables, i.e. when the two variables are
plotted on a scatter diagram a straight line
will be formed by the points.
„ Cause and effect relation exists between
different forces operating on the item of
the two variable series.
Advantages of Pearson’s Coefficient

„ It summarizes in one value both the
degree of correlation & the direction
of correlation.
Limitations of Pearson's Coefficient

„ Always assumes a linear relationship
„ Interpreting the value of r is difficult.
„ The value of the correlation coefficient is
affected by extreme values.
„ Time-consuming method
Coefficient of Determination
„ The convenient way of interpreting the value of
correlation coefficient is to use of square of
coefficient of correlation which is called
Coefficient of Determination.
„ The Coefficient of Determination = r2.
„ Suppose: r = 0.9, r2 = 0.81 this would mean that
81% of the variation in the dependent variable
has been explained by the independent variable.
Coefficient of Determination
„ The maximum value of r2 is 1 because it is
possible to explain all of the variation in y but it
is not possible to explain more than all of it.
„ Coefficient of Determination = Explained
variation / Total variation
Coefficient of Determination: An example
„ Suppose: r = 0.60 and r = 0.30. It does not mean that the first
correlation is twice as strong as the second;
'r' can be understood by computing the value of r².
When r = 0.60, r² = 0.36 -----(1)
When r = 0.30, r² = 0.09 -----(2)
This implies that in the first case 36% of the total
variation is explained, whereas in the second case
9% of the total variation is explained.
Spearman’s Rank Coefficient of
Correlation
„ When the variables under study in a statistical series
are not capable of quantitative measurement but can
be arranged in serial order, Pearson's correlation
coefficient cannot be used; in such a case Spearman's
rank correlation can be used.
„ R = 1 − (6 ΣD²) / [N(N² − 1)]
„ R = rank correlation coefficient
„ D = difference of ranks between paired items in the two series
„ N = total number of observations
Interpretation of Rank
Correlation Coefficient (R)
„ The value of rank correlation coefficient, R
ranges from -1 to +1
„ If R = +1, then there is complete agreement in
the order of the ranks and the ranks are in the
same direction
„ If R = -1, then there is complete agreement in
the order of the ranks and the ranks are in the
opposite direction
„ If R = 0, then there is no correlation
Rank Correlation Coefficient (R)
a) Problems where actual ranks are given:
1) Calculate the difference 'D' of the two ranks,
i.e., (R1 − R2).
2) Square the differences & calculate the sum of
the squared differences, i.e., ΣD²
3) Substitute the values obtained in the
formula.
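A minimal sketch of these steps in Python (the two rank series are made up purely for illustration):

    # Spearman's R = 1 - 6·ΣD² / (N(N² - 1)), with ranks already given.
    ranks1 = [1, 2, 3, 4, 5]
    ranks2 = [2, 1, 4, 3, 5]

    d2 = sum((a - b) ** 2 for a, b in zip(ranks1, ranks2))   # ΣD² = 4
    n = len(ranks1)
    R = 1 - 6 * d2 / (n * (n ** 2 - 1))                      # 1 - 24/120 = 0.8
    print(R)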
Rank Correlation Coefficient
b) Problems where ranks are not given: If the
ranks are not given, then we need to assign
ranks to the data series. The lowest value in the
series can be assigned rank 1 or the highest
value in the series can be assigned rank 1. We
need to follow the same scheme of ranking for
the other series.
Then calculate the rank correlation coefficient in
similar way as we do when the ranks are given.
Rank Correlation Coefficient (R)
„ Equal ranks or ties in ranks: In such cases
average ranks should be assigned to the tied
items, and the formula becomes
R = 1 − [6(ΣD² + AF)] / [N(N² − 1)]

AF = (1/12)(m1³ − m1) + (1/12)(m2³ − m2) + …, one term for each set of ties

m = the number of times an item is repeated
Merits Spearman’s Rank Correlation
„ This method is simpler to understand and easier
to apply compared to Karl Pearson's correlation
method.
„ This method is useful where we can give the
ranks and not the actual data (qualitative terms).
„ This method is used where the initial data are in
the form of ranks.
Limitation Spearman’s Correlation
„ Cannot be used for finding out correlation in a
grouped frequency distribution.
„ This method should not be applied where N
exceeds 30.
Advantages of Correlation studies
„ Show the amount (strength) of relationship
present
„ Can be used to make predictions about the
variables under study
„ Can be used in many places, including natural
settings, libraries, etc.
„ Easier to collect correlational data
Regression Analysis
„ Regression Analysis is a very
powerful tool in the field of statistical
analysis in predicting the value of one
variable, given the value of another
variable, when those variables are
related to each other.
Regression Analysis
„ Regression Analysis is mathematical measure of
average relationship between two or more
variables.
„ Regression analysis is a statistical tool used in
prediction of value of unknown variable from
known variable.
Advantages of Regression Analysis
„ Regression analysis provides estimates of
values of the dependent variables from the
values of independent variables.
„ Regression analysis also helps to obtain a
measure of the error involved in using the
regression line as a basis for estimation.
„ Regression analysis helps in obtaining a
measure of the degree of association or
correlation that exists between the two variables.
Assumptions in Regression Analysis
„ Existence of actual linear relationship.
„ The regression analysis is used to estimate the
values within the range for which it is valid.
„ The relationship between the dependent and
independent variables remains the same till the
regression equation is calculated.
„ The dependent variable takes any random value but
the values of the independent variables are fixed.
„ In regression, we have only one dependent variable
in our estimating equation. However, we can use
more than one independent variable.
Regression line
„ Regression line is the line which gives the best
estimate of one variable from the value of any
other given variable.
„ The regression line gives the average
relationship between the two variables in
mathematical form.
„ The Regression would have the following
properties: a) ∑( Y – Yc ) = 0 and
b) ∑( Y – Yc )2 = Minimum
Regression line
„ For two variables X and Y, there are always two
lines of regression –
„ Regression line of X on Y : gives the best
estimate for the value of X for any specific
given values of Y
„ X=a+bY a = X - intercept
„ b = Slope of the line
„ X = Dependent variable
„ Y = Independent variable
Regression line
„ For two variables X and Y, there are always two
lines of regression –
„ Regression line of Y on X : gives the best
estimate for the value of Y for any specific given
values of X
„ Y = a + bx a = Y - intercept
„ b = Slope of the line
„ Y = Dependent variable
„ x= Independent variable
The Explanation of Regression Line
„ In case of perfect correlation (positive or
negative) the two lines of regression coincide.
„ If the two regression lines are far from each other, the
degree of correlation is less, & vice versa.
„ The mean values of X & Y can be obtained as
the point of intersection of the two regression
lines.
„ The higher the degree of correlation between the
variables, the smaller the angle between the lines,
& vice versa.
Regression Equation / Line
& Method of Least Squares
„ Regression Equation of y on x
Y = a + bx
In order to obtain the values of ‘a’ & ‘b’
∑y = na + b∑x
∑xy = a∑x + b∑x2
„ Regression Equation of x on y
X = c + dy
In order to obtain the values of ‘c’ & ‘d’
∑x = nc + d∑y
∑xy = c∑y + d∑y2
Regression Equation / Line when
Deviation taken from Arithmetic Mean
„ Regression equation of y on x:
Y = a + bx
In order to obtain the values of 'a' & 'b':
a = Ȳ − bX̄,  b = Σxy / Σx²
„ Regression equation of x on y:
X = c + dy
c = X̄ − dȲ,  d = Σxy / Σy²
Regression Equation / Line when
Deviations taken from Arithmetic Mean
„ Regression equation of y on x:
Y − Ȳ = byx (X − X̄)
byx = Σxy / Σx²
byx = r (σy / σx)
„ Regression equation of x on y:
X − X̄ = bxy (Y − Ȳ)
bxy = Σxy / Σy²
bxy = r (σx / σy)
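A minimal sketch in Python of the y-on-x fit using the deviation formulas above (the same small made-up data as in the correlation sketch):

    # Least-squares line of y on x: byx = Σxy / Σx² (deviations), a = Ȳ - byx·X̄
    x = [1, 2, 3, 4, 5]
    y = [2, 4, 5, 4, 5]

    mx, my = sum(x) / len(x), sum(y) / len(y)
    sxy = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y))   # 6
    sxx = sum((xi - mx) ** 2 for xi in x)                      # 10

    b = sxy / sxx        # slope byx = 0.6
    a = my - b * mx      # intercept = 4 - 0.6*3 = 2.2
    print(a, b)          # the fitted line is Y = 2.2 + 0.6x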
Properties of the Regression Coefficients
„ The coefficient of correlation is the geometric mean of the two
regression coefficients: r = √(byx · bxy)
„ If byx is positive then bxy should also be positive & vice
versa.
„ If one regression coefficient is greater than one, the other
must be less than one.
„ The coefficient of correlation will have the same sign as
the regression coefficients.
„ The arithmetic mean of byx & bxy is equal to or greater than
the coefficient of correlation: (byx + bxy)/2 ≥ r
„ Regression coefficients are independent of origin but not of
scale.
Standard Error of Estimate.
„ Standard error of estimate is the measure of
variation around the computed regression line.
„ Standard error of estimate (SE) of Y measures the
variability of the observed values of Y around the
regression line.
„ Standard error of estimate gives us a measure of the
scatter of the observations about the line of regression.
Standard Error of Estimate.
„ Standard error of estimate of Y on X is:
S.E. of Y on X (SExy) = √[Σ(Y − Ye)² / (n − 2)]
Y = observed value of y
Ye = estimated value from the estimating equation that corresponds
to each y value
e = the error term (Y − Ye)
n = number of observations in the sample

„ The convenient formula:
(SExy) = √[(ΣY² − aΣY − bΣXY) / (n − 2)]
X = value of the independent variable
Y = value of the dependent variable
a = Y intercept
b = slope of the estimating equation
n = number of data points
Correlation analysis vs.
Regression analysis.
„ Regression is the average relationship between two
variables.
„ Correlation need not imply a cause & effect
relationship between the variables under study, whereas
regression analysis clearly indicates the cause and effect
relationship between the variables.
„ There may be nonsense correlation between two
variables, but there is no such thing as nonsense
regression.
What is regression?
„ Fitting a line to the data using an equation in order
to describe and predict data
„ Simple Regression
„ Uses just 2 variables (X and Y)
„ Other: Multiple Regression (one Y and many X’s)
„ Linear Regression
„ Fits data to a straight line
„ Other: Curvilinear Regression (curved line)
We’re doing: Simple, Linear Regression
From Geometry:
„ Any line can be described by an equation
„ For any point on a line for X, there will be a
corresponding Y
„ the equation for this is y = mx + b
„ m is the slope, b is the Y-intercept (when X = 0)
„ Slope = change in Y per unit change in X
„ Y-intercept = where the line crosses the Y axis
(when X = 0)
Regression equation
„ Find a line that fits the data best, i.e., find a
line that minimizes the distance from all the
data points to that line
„ Regression equation: Ŷ (Y-hat) = bX + a
„ Ŷ (Y-hat) is the predicted value of Y given a
certain X
„ b is the slope
„ a is the y-intercept
Regression Equation:
Ŷ = .823X − 4.239
We can predict a Y score from an X by
plugging a value for X into the equation and
calculating Ŷ.
What would we expect a person to get on quiz
#4 if they got a 12.5 on quiz #3?

Ŷ = .823(12.5) − 4.239 = 6.049
Chapter 5: Regression

'Regression' (Latin) means 'retreat',
'going back to', 'stepping back'.
In a 'regression' we try to (stepwise)
retreat from our data and explain
them with one or more explanatory
predictor variables. We draw a
'regression line' that serves as the
(linear) model of our observed data.
Correlation vs. regression
z Correlation
z In a correlation, we look at the
relationship between two variables
without knowing the direction of causality.
z Regression
z In a regression, we try to predict the
outcome of one variable from one or
more predictor variables. Thus, the
direction of causality can be established.
z 1 predictor = simple regression
z >1 predictor = multiple regression
Correlation vs. regression
Correlation:
For a correlation you do not need to know anything
about the possible relation between the two variables.
Many variables correlate with each other for unknown
reasons.
Correlation underlies regression but is descriptive only.

Regression:
For a regression you do want to find out about those
relations between variables, in particular, whether one
'causes' the other. Therefore, an unambiguous causal
template has to be established between the causer and
the causee before the analysis! This template is inferential.
Regression is THE statistical method underlying ALL
inferential statistics (t-test, ANOVA, etc.). All that follows
is a variation of regression.
Linear regression
Independent and dependent variables
In a regression, the predictor variables are
labelled 'independent' variables. They predict
the outcome variable labelled 'dependent'
variable.

A regression in SPSS is always a linear
regression, i.e., a straight line represents the
data as a model.
Method of least squares
In order to know which line to choose as the best
model of a given data cloud, the method of least
squares is used. We select the line for which the
sum of all squared deviations (SS) of all data
points is lowest. This line is labelled 'line of best
fit', or 'regression line'.
Regression line
Simple regression: regression coefficients
(In mathematics, a coefficient is a constant multiplicative
factor of a certain object. For example, the coefficient in
9x² is 9. http://en.wikipedia.org/wiki/Coefficient)

The linear regression equation (5.2) is:

Yi = (b0 + b1Xi) + εi

Yi = outcome we want to predict
b0 = intercept of the regression line  (b0 and b1 are the regression coefficients)
b1 = slope of the regression line
Xi = score of subject i on the predictor variable
εi = residual term, error
Slope/gradient and intercept

z Slope/gradient: steepness of the line;
negative or positive
z Intercept: where the line crosses the y-axis

Yi = (−4 + 1.33Xi) + εi
'goodness-of-fit'

The line of best fit (regression line) is compared


with the most basic model. The former should be
significantly better than the latter. The most basic
model is the mean of the data.
Relation between tobacco and alcohol consumption
[scatterplot illustrating sums of squares]
Mean of Y as basic model
The summed squared differences between the observed
values Yi and the mean Ȳ (the sum of squares total, SST)
are big; hence the mean is not a good model of the data.

Regression line as a model
The summed squared differences between the observed
values and the regression line (the sum of squares
residual, SSR) are smaller; hence this regression line is a
much better model of the data.

Model
SSM: the sum of squared differences between the mean
of Y, Ȳ, and the regression line (as our model), i.e., the
sum of squares model.
Comparing the basic model and the
regression model: R²

The improvement by the regression model
can be expressed by dividing the sum of
squares of the regression model, SSM, by the
sum of squares of the basic model, SST:

R² = SSM / SST

The basic comparison in statistics is always
to compare the amount of variance that our
model can explain with the total amount of
variation there is. If the model is good it can
explain a significant proportion of this overall
variance.

This is the same measure as the R² in chapter 4 on
correlation. Take the square root of R² and you have the
Pearson correlation coefficient r!
Comparing the basic model and the regression model: F-test

In the F-test, the ratio of the improvement due to the model (SSM) to the difference between the model and the observed data (SSR) is calculated. We take the mean sums of squares, or mean squares (MS), for the model (MSM) and for the observed data (MSR):

F = MSM / MSR

The F-ratio should be high, since the model should have improved the prediction considerably (as expressed in MSM), while MSR, the difference between the model and the observed data (the residual), should be small.
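As a sketch of these quantities, continuing the hypothetical data above (the names are illustrative, not SPSS output):

    import numpy as np

    X = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
    Y = np.array([2.1, 3.9, 6.2, 7.8, 10.1])
    b1, b0 = np.polyfit(X, Y, 1)            # least-squares line
    Y_hat = b0 + b1 * X

    SST = np.sum((Y - Y.mean()) ** 2)       # total: deviations from the basic model (the mean)
    SSR = np.sum((Y - Y_hat) ** 2)          # residual: deviations from the regression line
    SSM = SST - SSR                         # model: improvement due to the regression

    R2 = SSM / SST                          # proportion of variance explained
    F = (SSM / 1) / (SSR / (len(Y) - 2))    # MSM / MSR with df = 1 and n - 2
    print(R2, F)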
The coefficient of a predictor

The coefficient of the predictor X is b1. b1 indicates the gradient/slope of the regression line: it says how much Y changes when X is changed by one unit. In a good model, b1 should always be different from 0, since the slope is either positive or negative. Only a bad model, i.e., the basic model of the mean, has a slope of 0. If b1 = 0, this means:

• a change of one unit in the predictor X does not change the predicted variable Y
• the gradient of the regression line is 0
t-test of the coefficient of the predictor

A good predictor variable should have a b1 that is different from 0 (the regression coefficient of the basic model, the mean). Whether this difference is significant can be tested by a t-test: the b of the expected values (under the null hypothesis, i.e., 0) is subtracted from the b of the observed values and divided by the standard error of b.

t = (b_observed − b_expected) / SE_b

Since b_expected = 0:

t = b_observed / SE_b

t should be significantly different from 0.
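A sketch of this t-test in Python (scipy assumed available), using the exact ADVERTS coefficient and standard error reported in the SPSS output further below:

    from scipy import stats

    b_observed = 0.09612448597388   # slope for ADVERTS (exact value from the output below)
    se_b = 0.00963236621523         # its standard error
    t = (b_observed - 0) / se_b     # b_expected = 0 under H0 --> t = 9.979
    p = 2 * stats.t.sf(abs(t), df=198)   # two-tailed p, df = 200 - 2
    print(t, p)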
Simple regression on SPSS
(using the Record1.sav data)
Descriptive glance: scatterplot of the correlation between advertisement and record sales

Graphs --> Interactive --> Scatterplot

Under 'Fit', tick 'include constant' and 'fit line to total'.
Comparing the mean and the regression model
(using the Record1.sav data)

Graphs --> Interactive --> Scatterplot

Under 'Fit', tick 'mean'.

--> The regression line is quite different from the mean.
Simple regression on SPSS
(using the Record1.sav data)

Analyze --> Regression --> Linear

Predictor: how much money (in 1000s) you spend on advertisement.
What you want to predict: number of records (in 1000s) sold.
Output of simple regression on SPSS
(using the Record1.sav data)
Analyze --> Regression --> Linear

R is the simple Pearson correlation between 'advertisement' and 'records sold'; R² is the amount of explained variance.

R² = .335: 33.5% of the total variance can be explained by the predictor 'advertisement'; 66.5% of the variance cannot be explained.

ANOVA for the SSM (F-test): advertisement predicts sales significantly.

F = MSM / MSR = 433687.833 / 4354.87 = 99.587

The ANOVA table reports SSM, SSR, SST (sum of squares total), MSM and MSR.
The coefficients table gives the regression coefficients b0 and b1:

• b0 (intercept, where the regression line crosses the Y axis) = 134.140: when no money is spent (X = 0), 134,140 records are sold.
• b1 (gradient) = .09612: if the predictor X is increased by 1 unit (1000), then 96.12 extra records will be sold.
• t = B / SE_B, e.g. for the constant: 134.14 / 7.537 = 17.799.
A closer look at the t-values

The equation for computing the t-value is t = B / SE_B.

For the constant: 134.14 / 7.537 = 17.799.
For ADVERTS: B = 0.09612 and SE_B = .010, so B / SE_B should result in 9.612; however, t = 9.979.

What's wrong? Nothing, this is a rounding error. If you double-click on the output table 'Coefficients', more exact numbers are shown:

9.612E-02 = 0.09612448597388
.010 = 0.00963236621523

If you re-compute the equation with these numbers, the result is correct:

0.09612448597388 / 0.00963236621523 = 9.979
Using the model for prediction

Imagine the record company wants to spend 100,000 £ on advertisement. Using equation (5.2), we can fill in the values of b0 and b1:

Yi = b0 + b1Xi = 134.14 + (.09612 × Advertising Budget_i)

Example: if 100,000 £ are spent on ads, 134.14 + (.09612 × 100) = 143.75, so about 144,000 records should be sold in the first week. Is that a good deal?
Multiple regression

In a multiple regression, we predict the outcome of a dependent variable Y by a linear combination of more than one independent predictor variable Xi:

Outcome_i = (Model_i) + error_i

Every variable has its own coefficient b1, b2, ..., bn:

(5.9)  Yi = (b0 + b1X1 + b2X2 + ... + bnXn) + εi

b1X1 = 1st predictor variable with its coefficient
b2X2 = 2nd predictor variable with its coefficient, etc.
εi = residual term
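A minimal sketch of equation (5.9) with two made-up predictors (the data and coefficients below are invented for illustration; in the course itself this is done via Analyze --> Regression --> Linear):

    import numpy as np

    rng = np.random.default_rng(0)
    X1 = rng.uniform(0, 100, 50)     # e.g. advertisement budget
    X2 = rng.uniform(0, 40, 50)      # e.g. plays on radio
    Y = 20 + 0.1 * X1 + 3.0 * X2 + rng.normal(0, 5, 50)

    # Design matrix with a leading column of 1s for the intercept b0
    D = np.column_stack([np.ones_like(X1), X1, X2])
    b, *_ = np.linalg.lstsq(D, Y, rcond=None)   # b = [b0, b1, b2]
    print(b)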
Multiple Regression on SPSS
using file record2.sav

We want to predict record sales (Y) by two


predictors:
X1 = advertisement budget
X2 = number of plays on Radio 1

Record Salesi = b0 + b1Adi + b2Playi + εi

Instead of a regression line, a regression plane (2 dimensions) is now fitted to the data (3 dimensions).
3D scatterplot of the relation between record sales (Y), advertisement budget (X1), and number of plays on Radio 1 per week (X2):

Graphs --> Interactive --> Scatterplot --> 3D

Multiple regression with 2 variables can be visualized as a 3D scatterplot. More variables cannot be accommodated visually.
Regression planes and confidence intervals of multiple regression

Under the menu 'Fit', specify the corresponding options in the 3D scatterplot. If adjusted appropriately, you can see the regression plane and the confidence planes almost like lines. The regression planes are chosen so as to cover most of the data points in the three-dimensional data cloud.
Sum of squares, R, R2
The terms we encountered for simple regression,
SST, SSR, SSM, still mean the same, but are more
complicated to compute now.

Instead of the simple correlation coefficient R, we use a multiple correlation coefficient, Multiple R. Multiple R is the correlation between the predicted and observed values of the outcome. As with simple R, Multiple R should be large. Multiple R² is a measure of the variance of Y explained by the predictor variables X1-Xn.
Methods of regression
The predictors of the model should be selected
carefully, e.g., based on past research or
theoretically well motivated.
• Hierarchical method (ordered entry): first, known predictors are entered, then new ones, either blockwise (all together) or stepwise.
• Forced entry ('Enter'): all predictors are forced into the model simultaneously.
• Stepwise methods: Forward: predictors are introduced one by one, according to their predictive power. Stepwise: same as Forward, plus a removal test. Backward: predictors are judged against a removal criterion and eliminated accordingly.
How to choose one's predictors

• Based on the theoretical literature, choose predictors in their order of importance. Do not choose too many.
• Run an initial multiple regression.
• Eliminate useless predictors.
• Take ca. n = 15 subjects per predictor.
Evaluating the model

1. The model must fit the data sample.
2. The model should generalize beyond the sample.
Evaluating the model - diagnostics
1. Fitting the observed data (Analyze --> Regression --> Linear; under 'Save', specify the required diagnostics):
- Check for outliers, which bias the model and enlarge the residual.
- Look at standardized residuals (z-scores): if more than 1% lie outside the margins of +/- 2.58, the model is poor.
- Look at studentized residuals (unstandardized residuals divided by an SD that varies point by point); these yield a more exact estimate of the error variance.

Note: SPSS adds the computed scores as new columns in the data file.
Evaluating the model - diagnostics
- continued
• Identify influential cases and see how the model changes if they are excluded. This is done by running the regression without that particular case and then using the new model to predict the value of the just-excluded case (its 'adjusted predicted value'). If the case is similar to all other cases, its adjusted predicted value will not differ much from its predicted value given the model including it.
Evaluating the model - continued
• DFBeta: a measure of the influence of a case on the values of bi.
• DFFit: "...difference between the adjusted predicted value and the original predicted value of a particular case." (Field 2005, 729)
• Deleted residual: residual based on the adjusted predicted value; "...the difference between the adjusted predicted value for a case and the original observed value for that case." (Field 2005, 728)
• A way of standardizing the deleted residual is to divide it by its SD --> studentized deleted residual.
Evaluating the model
- continued
• Identify influential cases and see how the model changes if they are excluded.
• Cook's distance measures the influence of a case on the overall model's ability to predict all cases.
• Leverage estimates "the influence of the observed value of the outcome variable over the predicted values." (Field 2005, 736) Leverage values lie between 0 and 1 and may be used to define cut-off points for excluding influential cases.
• Mahalanobis distances measure the distance of cases from the means of the predictor variables.
Example for using DFBeta as an
indicator of an 'influential case'
using file dfbeta.sav
• Run a simple regression with all data (including the outlier, case 30):

Analyze --> Regression --> Linear

Specify what you want to predict (dependent) and your predictor (independent).
Example for using DFBeta as an
indicator of an 'influential case'
using file dfbeta.sav
• All data (including outlier, case 30): b0 = 29; b1 = -.90
• Case 30 removed (with Data --> Select cases --> use filter variable): b0 = 31; b1 = -1

--> Both regression coefficients, b0 (constant/intercept) and b1 (gradient/slope), changed!
Example for using DFBeta as an
indicator of an 'influential case'
using file dfbeta.sav

The DFBeta of the constant (dfb0) and of the predictor x (dfb1) are much higher for case 30 than for the other cases.
Summary of both calculations (scatterplots for both samples: with case 30, the outlier, and without case 30)

Parameter (b)   + case 30        - case 30        Difference
Constant (b0)   29.00            31.00            -2.00
Gradient (b1)   -.90             -1               .10
Model           Y = (-.9)X + 29  Y = (-1)X + 31
Predicted Y     28.10            30               -1.90
DFBetas, DFFit, CVR's
All of the following measures capture the difference between a model including and one excluding influential cases:

• Standardized DFBeta: the difference between a parameter estimated using all cases and estimated when one case is excluded, e.g. the DFBetas of the parameters b0 and b1.
• Standardized DFFit: the difference between the predicted value for a case in a model including vs. a model excluding this case.
• Covariance ratio (CVR): a measure of whether a case influences the variance of the regression parameters. This ratio should be close to 1.
Help window, topic index 'Linear Regression': window 'Save new variables'

"I find it hard to remember what all those influence statistics mean..."
"Why don't you look them up in the Help window?"
Residuals and influence statistics (using the file pubs.sav)

The correlation between the number of pubs in London districts and deaths, with and without the outlier. Scatterplot of both variables: Graphs --> Interactive --> Scatterplot.

Note: the residual for the outlier, fitted to the regression line including it, is small. However, its influence statistics are huge. Why? The outlier is the 'City of London' district, where there are a lot of pubs but only few residents. The ones drinking in those pubs are visitors; hence the ratio of deaths of citizens, given the overall consumption of alcohol, is relatively low.
Case summary: 8 London districts

Case   St. Res.   Leverage   St. DFFIT    St. DFB Intercept   St. DFB Pubs
1      -1.34      0.04       -0.74        -0.74               0.37
2      -0.88      0.03       -0.41        -0.41               0.18
3      -0.42      0.02       -0.18        -0.17               0.07
4      0.04       0.02       0.02         0.02                -0.01
5      0.5        0.01       0.2          0.19                -0.06
6      0.96       0.01       0.4          0.38                -0.1
7      1.42       0          0.68         0.63                -0.12
8      -0.28      0.86       -4.60E+08    92676016            -4.30E+08

The residual of the outlier, case 8, is small because it actually sits very close to the regression line; its influence statistics, however, are huge!
Excluding the outlier
(pubs.sav)
If you create a variable "num_dist" (number of the district) in the variables list of the pubs.sav file and simply allocate a number to each district (1-8), you can use this variable to exclude the problematic district 8:

Data --> Select cases --> If condition is satisfied --> num_dist ~= 8
Excluding the outlier – continued
(pubs.sav)
Look at the scatterplot again now that district 8 has been excluded (Graphs --> Interactive --> Scatterplot): the 7 remaining districts all line up perfectly on the (idealized) regression line.
Will our sample regression
generalize to the population?
If we want to generalize our findings from one sample to the population, we have to check some assumptions:

• Variable types: predictor variables must be quantitative (interval) or categorical (binary); the outcome variable must be quantitative, continuous and unbounded (the whole range must be instantiated).
• Non-zero variance of predictors.
• No perfect correlation between 2 or more predictors.
• Predictors are uncorrelated with any 'third variable' which was not included in the regression.
• All levels of the predictor variables should have the same variance.
Will our sample regression
generalize to the population?
- continued

• Independent errors: the residual terms of any two observations should be uncorrelated (Durbin-Watson test).
• Residuals should be normally distributed.
• All of the values of the outcome variable are independent.
• Predictors and outcome have a linear relation.

If these assumptions are not met, we cannot draw valid conclusions from our model!
Two methods for the cross-
validation of the model
If our model is generalizable, it should be able to predict the outcome of a different sample.

• Adjusted R²: indicates the loss of predictive power (shrinkage) if the model were applied to the population. By Stein's formula:

adj R² = 1 − [(n−1)/(n−k−1)] × [(n−2)/(n−k−2)] × [(n+1)/n] × (1−R²)

R² = unadjusted value
n = number of cases
k = number of predictors in the model

• Data splitting: the entire sample is split into two. Regressions are computed and compared for both halves. A nice method, but one rarely has that many data.
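A sketch of Stein's formula as reconstructed above; the values plugged in are those reported later for the record-sales model (R² = .665, n = 200, k = 3):

    def stein_adjusted_r2(r2, n, k):
        # r2: unadjusted R^2, n: number of cases, k: number of predictors
        shrink = ((n - 1) / (n - k - 1)) * ((n - 2) / (n - k - 2)) * ((n + 1) / n)
        return 1 - shrink * (1 - r2)

    print(stein_adjusted_r2(0.665, 200, 3))   # about .65, i.e. little shrinkage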
Sample size
The required sample size for a regression
depends on

• the number of predictors k
• the size of the effect
• the size of the statistical power

e.g.,
large effect --> n = 80 (for up to 20 predictors)
medium effect --> n = 200
small effect --> n = 600
(Multi-)Collinearity
If ≥ 2 predictors are inter-correlated, we speak of
collinearity. In the worst case, 2 variables have a
correlation of 1. This is bad for a regression, since
the regression cannot be computed reliably
anymore. This is because the variables become
interchangeable.
High collinearity is rare, but some degree of
collinearity is always around.
Problems with collinearity:
• It underestimates the variance of a second variable if this variable is strongly intercorrelated with the first variable: it adds little unique variance although, taken by itself, it would explain a lot.
• We can't decide which variable is important and should be included.
• The regression coefficients (b-values) become unstable.
How to deal with collinearity

SPSS has some collinearity diagnostics, found in the 'Statistics' window of the 'Linear Regression' menu:
• Variance inflation factor (VIF)
• Tolerance statistics
• ...
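SPSS computes these for you; purely as a sketch of what the variance inflation factor measures (each predictor is regressed on the remaining ones, and VIF_j = 1 / (1 − R_j²)):

    import numpy as np

    def vif(X):
        # Variance inflation factor for each column of predictor matrix X (a sketch)
        X = np.asarray(X, dtype=float)
        n, k = X.shape
        out = []
        for j in range(k):
            others = np.column_stack([np.ones(n), np.delete(X, j, axis=1)])
            coef, *_ = np.linalg.lstsq(others, X[:, j], rcond=None)
            resid = X[:, j] - others @ coef
            r2 = 1 - resid.var() / X[:, j].var()
            out.append(1.0 / (1.0 - r2))
        return out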
Multiple Regression on SPSS
(using the file Record2.sav)
Example: Predicting the record sales from 3
predictors:
• X1: advertisement budget
• X2: times played on radio
• X3: attractiveness of the band

Since we already know that money for ads is a predictor, it will be entered into the regression first (1st block: variable 1), and the 2 new predictors later (2nd block: variables 2 and 3) --> hierarchical method ('Enter').
What the 'Statistics' box should look like: Analyze --> Regression --> Linear

Regression plots

Plotting *ZRESID (standardized residuals = errors) against *ZPRED (standardized predicted values) helps us determine whether the assumptions of random errors and homoscedasticity (equal variances) are met. Heteroscedasticity occurs when the residuals at each level of the predictor variables have unequal variances.
Regression diagnostics

The regression diagnostics are saved in the data file, each as a separate variable in a new column.

Options: leave them as they are.
Interpreting Multiple Regression

The 'Descriptives' give you a brief summary of the variables.

Correlations: Pearson R's between all variables and significance levels: the R of each of predictors 1, 2, 3 with the outcome, and of each predictor with the others. Predictor 2 (plays on radio) is the best predictor. Predictors should not correlate higher than .9 (collinearity).
Summary of model

• R: the correlation between the predictor(s) and the outcome. Model 1 has only advertisement as predictor; Model 2 has 3 predictors.
• R²: the explained variance by the predictor(s); it changes from 0 to .335 (Model 1), with another change of .330 (Model 2).
• Adjusted R²: how well the model generalizes; values similar to R² are good (here only 5% shrinkage).
• R² change, with F-values and degrees of freedom (df1: p − 1; df2: N − p − 1; N = sample size, p = number of predictors): the model(s) bring about a significant change.
• Durbin-Watson: checks whether the errors are independent; if the value is close to 2, then OK.
ANOVA for the model against the basic model (the mean)

• df total: equal to the number of cases minus 1 (200 − 1 = 199).
• df model: equal to the number of predictors.
• df residual: equal to the number of cases minus the number of coefficients (b0, b1): 200 − 2 = 198.
• Mean squares: SS/df, e.g. 433687.8 / 1 = 433687.8 and 862264.2 / 198 = 4354.87.
• F-values (MSM/MSR): Model 1: 433687.833 / 4354.87 = 99.587; Model 2: 287125.806 / 2217.217 = 129.498.
• Significance level: both Model 1 and Model 2 have improved the prediction significantly; Model 2 (3 predictors) even better than Model 1 (1 predictor).
(1 predictor)
Record sales increase
by .511 SD's when
the predictor (ads)
changes 1 SD;
Model parameters
With 95% confidence the b-values
b1 and b2 have equal 'gains' lie within these boundaries
Tight boundaries are good
Model 1= same
as in first
analysis

b0
b1 *
b2
b3

Pearson Corr of
predictor x outcome
controlled for each single
other predictor
Pearson Corr of
The 'Coefficients' table tells us the predictor x outcome
individual contribution of variables to the controlled for all
other predictor
regression model. The Standardized Beta's 'unique relationship'
tell us the importance of each predictor
Excluded variables

SPSS gives a summary of those predictors that were not entered in the model (here only for Model 1) and evaluates the contribution of the excluded variables: what contribution would this predictor have made to a model containing it?
Regression equation for Model 2 (including all 3 predictor variables)

Sales_i = b0 + b1·Advertising_i + b2·Airplay_i + b3·Attractiveness_i
        = -26.61 + (0.08·Ad_i) + (3.37·Airplay_i) + (11.09·Attract_i)

Interpretation: if Ad increases 1 unit, sales increase .08 units; if airplay increases 1 unit, sales increase 3.37 units; if attractiveness increases 1 unit, sales increase 11.09 units, each independent of the contributions of the other predictors.
No Multicollinearity
(In this regression, variables are not closely linearly related.)

Each predictor's variance proportions load highly on a different dimension (eigenvalue) --> they are not intercorrelated, hence no collinearity.
Casewise diagnostics

The casewise diagnostics list cases that lie outside the boundaries of 2 SD. In the z-distribution, only 5% of cases should be beyond 1.96 SD and only 1% beyond 2.58 SD. Case 169 deviates most and needs to be followed up.
Following up influential cases with 'Case summaries' --> everything OK:
• no DFBetas > 1 (all OK)
• leverage values < .06 (all OK)
• Cook's distances < 1 (all OK)
• Mahalanobis distances < 15 (all OK)
Identify influential cases by the case summary

• In the standardized residuals, no more than 5% may have absolute values exceeding 2, and no more than 1% exceeding 3.
• Cook's distances > 1 might pose a problem.
• Leverage: values must not be more than two or three times the average leverage ((number of predictors + 1) / sample size).
• Mahalanobis distance: cases with values > 25 in large samples (n = 500) and > 15 in small samples (n = 100) can be problematic.
• Absolute values of DFBeta should not exceed 1.
• Determine the upper and lower limits of the covariance ratio (CVR): upper limit = 1 + 3(average leverage); lower limit = 1 − 3(average leverage).
Checking assumptions: heteroscedasticity

(Heteroscedasticity: the residuals (errors) at each level of the predictor have different variances.) Here the variances are equal: in the plot of standardized residuals (*ZRESID) against standardized predicted values (*ZPRED), the points are randomly and evenly dispersed --> the assumptions of linearity and homoscedasticity are met.
Checking assumptions
Normality of residuals

The distribution of the residuals is normal (left-hand picture), and the observed probabilities correspond to the expected ones (right-hand side).

The Kolmogorov-Smirnov test for the standardized residuals is n.s. --> normal distribution. Boxplots, too, show the normality (note the 3 outliers!).
Checking assumptions
Partial Regression Plots

Scatterplots of the residuals of the outcome variable and each of the predictors separately. No indication of outliers; an evenly spaced-out cloud of dots (only the residual variance of 'attractiveness of band' seems to be uneven).
EPIDEMIOLOGIC MEASURES:
INCIDENCE & PREVALENCE

Measuring Epidemiological Outcomes

• Ratio: a relationship between any two numbers (e.g. males / females)
• Proportion: a ratio where the numerator is included in the denominator (e.g. males / total births)
• Rate: a proportion with the specification of time (e.g. deaths in 2000 / population in 2000)
Definitions

• Incidence is the rate of new cases of a disease or condition in a population at risk during a time period.
• Prevalence is the proportion of the population affected.
Incidence = (Number of new cases during a time period) / (Population at risk during that time period)

• Incidence is a rate
• Calculated for a given time period (time interval)
• Reflects risk of disease or condition
Prevalence = (Number of existing cases) / (Total number in the population at risk)

• Prevalence is a proportion
• Point prevalence: at a particular instant in time
• Period prevalence: during a particular interval of time (existing cases + new cases)
Prevalence = Incidence × Duration

Prevalence depends on the rate of occurrence (incidence) AND the duration or persistence of the disease. At any point in time:
• More new cases (increased risk) yields more existing cases
• Slow recovery or slow progression increases the number of affected individuals
The population perspective requires measuring disease in populations

• Science is built on classification and measurement.
• Reality is infinitely detailed, infinitely complex.
• Classification and measurement seek to capture the essential attributes.

Measurement "captures" the phenomenon

Classification and measurement are based on:
1. Objective of the classification
2. Conceptual model (understanding of the phenomenon)
3. Availability of data (technology)
An example population (N=200)

[population diagram: 200 individuals, affected cases marked O]
How can we quantify disease in populations?

[series of population diagrams: cases appear one by one, marked O, in the population of 200]

How can we quantify the frequency?
Rate of occurrence of new cases per unit time (e.g., 1 per month)

[population diagram: cumulative cases marked O in the population of 200]
Month-by-month build-up in the example population:
• 1 new case in month 1
• 1 new case in month 2
• 1 new case in month 3, for a total of 3 cases
• 2 new cases in month 4
• 1 new case in month 5 (total = 6)
• 1 case in month 6
• 1 new case in month 7
• 2 new cases in month 8
• 2 cases in month 9

Rate of occurrence of new cases during 9 months: 1 case/month to 2 cases/month.

[population diagrams omitted: each slide marks the cumulative cases with O in the population of 200]
Number of cases depends on length of interval

Divide by the length of the time interval, so we can compare across intervals:

Rate of new cases = (Number of new cases) / (Time interval)
                  = 12 cases / 9 months = 1.33 cases / month
Number of cases depends on population size

So, divide by population and time:

Incidence rate = (Number of new cases) / (Population-time)
How to estimate population-time?

Population at risk: the people eligible to become a case and to be counted as one. In this example that population declines as each case occurs. So estimate population-time as:

• Method 1: add up the time that each person is at risk
• Method 2: add up the population at risk during each time segment
• Method 3: multiply the average size of the population at risk by the length of the time interval
Estimating population-time - method 2
Total population-time over 9 months
= 200 + 199 + 198 + 197 + 195 + 194 + 193 + 192 + 190
= 1,758 person-months
= 146.5 person-years

However, cases are not at risk for a full month.
Estimating population-time - method 2
- better
Total population-time over 9 months
= 199.5 + 198.5 + 197.5 + 196 + 194.5 + 193.5 + 192.5 + 191 + 189
= 1,752 person-months
= 146 person-years

assuming that cases develop, on average, in the middle of the month.
Estimating population-time - method 3
Average size of the population at risk during the 9 months = 195.3 (1,758 / 9), or approximately (200 + 188) / 2 = 194.

Population-time = 195.3 × 9 months, or (approximately) 194 × 9 months
= 1,746 person-months
= 145.5 person-years
Equivalent to - method 3
Take the initial size of the population at risk and reduce it for the time the people were not at risk due to acquiring the disease:

200 − 12/2 = 194 (approximately)

Population-time = 194 × 9 months = 1,746 person-months = 145.5 person-years
Incidence rate ("incidence density")

Incidence rate = (Number of new cases) / (Avg population at risk × Time interval)
               = (Number of new cases) / (Population-time)
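A sketch of the person-time bookkeeping for this 9-month example (monthly case counts as in the slides above; cases are assumed to arise mid-month, as in the refined method 2):

    new_cases = [1, 1, 1, 2, 1, 1, 1, 2, 2]   # new cases in months 1-9 (total 12)
    remaining = 200                            # initial population at risk
    person_months = 0.0
    for c in new_cases:
        person_months += remaining - c / 2     # each new case counts for half the month
        remaining -= c

    print(person_months)                       # 1752.0 person-months (= 146 person-years)
    print(12 / person_months)                  # incidence rate per person-month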
What proportion of the population at risk is affected?

• after 1 month: 1/200
• after 2 months: 2/200
• after 3 months: 3/200
• after 4 months: 5/200
• after 5 months: 6/200 = 0.03 = 3% = 30 / 1,000

[population diagrams omitted: cumulative cases marked O in the population of 200]
Incidence proportion ("cumulative incidence")

5-month CI = (Number of new cases) / (Population at risk)

Incidence proportion estimates risk.

Incidence rate versus incidence proportion

• Incidence rate measures how rapidly cases are occurring.
• Incidence proportion is cumulative.
• When we care only about the "bottom line" (i.e., what has happened by the end of a given period): use the incidence proportion (CI).

Incidence rate versus incidence proportion (continued)

• If the risk period is long (e.g., cancer), we usually observe only a portion of it.
• To compare results from studies with different lengths of follow-up, use the incidence rate (IR).
• If the risk period is short, we usually observe all of it and can use the incidence proportion.
Incidence rate versus incidence proportion: worked examples for a rare disease (IR = 0.005/month) and a common disease (IR = 0.1/month) are in the spreadsheet at epidemiolog.net/studymat/.
Case fatality rate

"Case fatality rate" (but it's really a proportion) = the proportion of cases who die (in a specified time interval).

• Like a "cumulative incidence of death" in cases.
• [The "incidence rate of death" in cases = "termination rate" = 1 / (average survival time).]
Mortality rate
Mortality rate = (Number of deaths) / (Population at risk × Time interval)

Annual mortality rate = (Number of deaths) / (Mid-year population × 1 yr)
Mortality rates versus incidence rates
• Mortality data are more generally available.
• Fatality reflects many factors, so mortality rates may not be a good surrogate for incidence rates.
• Death-certificate cause of death is not always accurate or useful.
Prevalence - another important proportion

Prevalence = (Number of existing (and new) cases) / (Population at risk)
Following the population further:
• 1 new case, 1 death
• 1 new case, 1 new death
• 2 new cases, no deaths
• 2 new cases, 1 new death

What is the prevalence? 9 / 197

[population diagrams omitted: existing cases marked O; deaths leave the population at risk]
Fine points...

• Who is "at risk"?
• Endometrial cancer? Prostate cancer? Breast cancer?
• Only women who have not had a hysterectomy?

"Could" develop the condition + "would" be counted.

More fine points
• Age?
• Immunity?
• Genetically susceptible?
More fine points...

• How do we measure time?
• Are 10 people followed for 10 years the same as 100 people followed for 1 year?
• Aging of the cohort? Secular changes?
Fine points...

• Importance of stating units and scaling unless they are clear from the context
  - e.g., 120 per 100,000 person-years = 10 per 100,000 person-months
  - Hazards from lack of clarity
"You can never, never take anything for granted."

Noel Hinners, vice president for flight systems at Lockheed Martin Astronautics in Denver, concerning the loss of the Martian Climate Orbiter due to the Lockheed Martin spacecraft team's having reported measurements in English units while the orbiter's navigation team at the Jet Propulsion Laboratory (JPL) in Pasadena, California assumed the measurements were in metric units.
Relation of incidence and prevalence

• Prevalence depends on incidence.
• Higher incidence leads to higher prevalence if the duration of cases does not change.
• Limitation of the bathtub analogy: the flow rate needs to be expressed relative to the size of the source.
• Introducing a new analogy...
[diagram: the population at risk flows into the pool of existing cases, which empties through deaths, cures, etc.]
Incidence, prevalence, duration of hospitalization
Remote community of 101,000 people; one hospital, patient census = 1,000; steady state; 500 admissions per week.

Prevalence = 1,000 / 101,000 = 9.9 / 1,000
IR = 500 / 100,000 = 5 / 1,000 / week
Duration = Prevalence / IR = 2 weeks
Relation of incidence and prevalence

Under somewhat special conditions,

Prevalence odds = incidence × duration
Prevalence ≈ incidence × duration (when prevalence is low)

(see spreadsheet at www.epidemiolog.net/studymat/)
Standardization
• When the objective is comparability, we need to adjust for different distributions of other determinants.
• Strategy:
  - Analyze within each subgroup (stratum)
  - Take a weighted average across strata
  - Use the same weights for all populations

(See the Evolving Text on www.epidemiolog.net)
Familiar example of weighted averages
• Liters of petrol per kilometer (LpK) differs for Interstate (0.050 LpK) and non-Interstate (0.100 LpK) driving.
• To compare different cars, we can:
  - Compare them for each type of driving separately (stratified analysis)
  - Average for each car, using one set of weights (e.g., 80% Interstate, 20% non-Interstate):
    0.80 × 0.050 LpK + 0.20 × 0.100 LpK = 0.060 LpK
Comparing a Suburu and a Mazda
Juan drives a Suburu 800 km on Interstate highways
and 200 km on other roads. His car uses 0.050 LpK
on Interstates and 0.100 LpK on other roads, for a
total of 60 liters of petrol, an average of 0.060 LpK
(60 L / 1000 km). His overall LpK can be expressed
as a weighted average:
(800/1000) × 0.050 LpK + (200/1000) × 0.100 LpK
= 0.80 × 0.050 LpK + 0.20 × 0.100 LpK = 0.060 LpK
Comparing a Suburu and a Mazda
Shizu drives her Mazda on a different route, with
only 200 km on Interstate and 800 km on other
roads. She uses 0.045 LpK on Interstate highways and 0.080 LpK on non-Interstate, a total of 73 liters, or 0.073 LpK. Her overall LpK can be expressed as a weighted average:

(200/1,000) × 0.045 LpK + (800/1,000) × 0.080 LpK
= 0.20 × 0.045 LpK + 0.80 × 0.080 LpK = 0.073 LpK
How can we compare their fuel efficiency?

            Juan              Shizu
            Km      LpK       Km      LpK
Interstate  800     0.050     200     0.045
Other       200     0.100     800     0.080
Total       1,000   0.060     1,000   0.073

Total fuel efficiency is not comparable because the weights are different:

            Juan              Shizu
            %       LpK       %       LpK
Interstate  80      0.050     20      0.045
Other       20      0.100     80      0.080
Total       100%    0.060     100%    0.073
By adopting a "standard" set of weights we can compare fairly:

              Juan              Shizu
              %       LpK       %       LpK
Interstate    60      0.050     60      0.045
Other         40      0.100     40      0.080
Total         100     0.060     100     0.073
Standardized          0.070             0.059
Comparing a Suburu and a Mazda

• Juan's Suburu: 0.60 × 0.050 LpK + 0.40 × 0.100 LpK = 0.070 LpK
• Shizu's Mazda: 0.60 × 0.045 LpK + 0.40 × 0.080 LpK = 0.059 LpK

The choice of weights may often affect the results of the comparison.
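Standardization is just a weighted average computed with a common set of weights; a sketch using the figures above:

    weights = {"interstate": 0.60, "other": 0.40}    # one "standard" set of weights

    juan  = {"interstate": 0.050, "other": 0.100}    # stratum-specific LpK
    shizu = {"interstate": 0.045, "other": 0.080}

    def standardize(rates, weights):
        # Weighted average of stratum-specific rates using the common weights
        return sum(weights[s] * rates[s] for s in weights)

    print(standardize(juan, weights))    # 0.070
    print(standardize(shizu, weights))   # 0.059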
t- and F-tests: testing hypotheses

Overview
• Distribution & probability
• Standardized normal distribution
• t-test
• F-test (ANOVA)

Starting point
• Central aim of statistical tests: determining the likelihood of a value in a sample, given that the null hypothesis is true: P(value|H0)
• H0: no statistically significant difference between sample & population (or between samples)
• H1: statistically significant difference between sample & population (or between samples)
• Significance level: P(value|H0) < 0.05
Types of error

                        Population
                   H0                        H1
Sample  H0         1 − α                     β-error (Type II error)
        H1         α-error (Type I error)    1 − β
Distribution & probability

If we know something about the distribution of events, we know something about the probability of these events.

Mean: x̄ = (Σ xi) / n

Standard deviation: s = sqrt( Σ (xi − x̄)² / n )
Standardized normal distribution

Population: z = (x − μ) / σ        Sample: zi = (xi − x̄) / s        (z-scores have x̄_z = 0, s_z = 1)

• The z-score represents a value on the x-axis for which we know the p-value.
• 2-tailed: z = ±1.96 encloses 2 SD around the mean = 95% --> 'significant'
• 1-tailed: z = +1.65 (or −1.65) marks off 95% counting from minus (or plus) infinity
t-tests: testing hypotheses about means

t = (x̄1 − x̄2) / s_(x̄1−x̄2),   where   s_(x̄1−x̄2) = sqrt( s1²/n1 + s2²/n2 )

t = (difference between sample means) / (estimated standard error of the difference between means)
Degrees of freedom (df)
• Number of scores in a sample that are free to vary.
• n = 4 scores, mean = 10 --> df = n − 1 = 4 − 1 = 3
  - The sum must be 40 (mean = 40/4 = 10)
  - E.g.: score1 = 10, score2 = 15, score3 = 5 --> score4 must be 10
Kinds of t-tests
Formula is slightly different for each:
• Single-sample:
• tests whether a sample mean is significantly different from a
pre-existing value (e.g. norms)
• Paired-samples:
• tests the relationship between 2 linked samples, e.g. means
obtained in 2 conditions by a single group of participants
• Independent-samples:
• tests the relationship between 2 independent populations
• formula see previous slide
Independent sample t-test

Number of words recalled:

Group 1    Group 2 (Imagery)
21         22
19         25
18         27
18         24
23         26
17         24
19         28
16         26
21         30
18         28
mean = 19  mean = 26
SS = 40    SS = 50

(Pooled s² = (40 + 50) / 18 = 5, so the standard error of the difference is sqrt(5/10 + 5/10) = 1.)

df = (n1 − 1) + (n2 − 1) = 18

t = (x̄1 − x̄2) / s_(x̄1−x̄2) = (19 − 26) / 1 = −7

t(0.05,18) = ±2.101;  |t| > t(0.05,18)  --> reject H0
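The same test on the recall data, as a sketch (scipy's independent-samples t-test, equal variances assumed):

    from scipy import stats

    group1 = [21, 19, 18, 18, 23, 17, 19, 16, 21, 18]   # control, mean 19
    group2 = [22, 25, 27, 24, 26, 24, 28, 26, 30, 28]   # imagery, mean 26
    t, p = stats.ttest_ind(group1, group2)               # t = -7.0 on df = 18
    print(t, p)                                          # p well below .05 --> reject H0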
Bonferroni correction
• To control for false positives when making n comparisons, divide the significance level by n:

pc = p / n

• E.g. four comparisons: pc = 0.05 / 4 = 0.0125
F-tests / Analysis of Variance (ANOVA)
t-tests allow inferences about 2 sample means. But what if you have more than 2 conditions? E.g. placebo, drug 20mg, drug 40mg, drug 60mg:

Placebo vs. 20mg    20mg vs. 40mg
Placebo vs. 40mg    20mg vs. 60mg
Placebo vs. 60mg    40mg vs. 60mg

The chance of making a Type I error increases as you do more t-tests. ANOVA controls this error by testing all means at once: it can compare k means. Drawback = loss of specificity.
F-tests / Analysis of Variance (ANOVA)
Different types of ANOVA depending upon experimental
design (independent, repeated, multi-factorial)

Assumptions:
• observations within each sample are independent
• samples are normally distributed
• samples have equal variances
F-tests / Analysis of Variance (ANOVA)

t = (obtained difference between sample means) / (difference expected by chance (error))

F = (variance (differences) between sample means) / (variance (differences) expected by chance (error))

The difference between sample means is easy to compute for 2 samples (e.g. x̄1 = 20, x̄2 = 30, difference = 10), but if x̄3 = 35 the concept of differences between sample means gets tricky.
F-tests / Analysis of Variance (ANOVA)
The solution is to use variance, which is related to SD: standard deviation = sqrt(variance).

E.g.   Set 1: 20, 30, 35  --> s² = 58.3
       Set 2: 28, 30, 31  --> s² = 2.33

These 2 variances provide a relatively accurate representation of the size of the differences.
F-tests / Analysis of Variance (ANOVA)
Simple ANOVA example: partitioning total variability

• Between-treatments variance: measures differences due to (1) treatment effects and (2) chance
• Within-treatments variance: measures differences due to (1) chance
F-tests / Analysis of Variance (ANOVA)

F = MS_between / MS_within

When the treatment has no effect, differences between groups/treatments are entirely due to chance; the numerator and denominator will be similar, and the F-ratio should have a value around 1.00.

When the treatment does have an effect, the between-treatment differences (numerator) should be larger than chance (denominator), and the F-ratio should be noticeably larger than 1.00.
F-tests / Analysis of Variance (ANOVA)
Simple independent-samples ANOVA example: F(3, 8) = 9.00, p < 0.05

        Placebo   Drug A   Drug B   Drug C
Mean    1.0       1.0      4.0      6.0
SD      1.73      1.0      1.0      1.73
n       3         3        3        3

There is a difference somewhere; we have to use post-hoc tests (essentially t-tests corrected for multiple comparisons) to examine it further.
F-tests / Analysis of Variance (ANOVA)
It gets more complicated than that, though. A bit of notation first:
• An independent variable is called a factor, e.g. if we compare doses of a drug, then dose is our factor.
• The different values of our independent variable are our levels, e.g. 20mg, 40mg, 60mg are the 3 levels of our factor.
F-tests / Analysis of Variance (ANOVA)
We can test more complicated hypotheses, for example a 2-factor ANOVA (data modelled on Schachter, 1968). Factors:
1. Weight: normal vs. obese participants
2. Full stomach vs. empty stomach

Participants have to rate 5 types of crackers; the dependent variable is how many they eat. This experiment is a 2x2 factorial design: 2 factors x 2 levels.
F-tests / Analysis of Variance (ANOVA)
Mean number of crackers eaten:

          Empty   Full
Normal    22      15     (row total = 37)
Obese     17      18     (row total = 35)
          = 39    = 33   (column totals)

Result: no main effect for factor A (normal/obese); no main effect for factor B (empty/full).
F-tests / Analysis of Variance (ANOVA)
Mean number of crackers eaten:

          Empty   Full
Normal    22      15
Obese     17      18

[interaction plot: crackers eaten against empty vs. full stomach, separate lines for normal and obese participants; the lines cross]
F-tests / Analysis of Variance (ANOVA)
Application to imaging...

Early days => subtraction methodology => t-tests corrected for multiple comparisons, e.g.:

Pain condition − appropriate rest/visual task = statistical parametric map
F-tests / Analysis of Variance (ANOVA)

This is still a fairly simple analysis. It shows the main effect of pain (collapsing across the pain source) and the individual conditions. More complex analyses can look at interactions between factors.

Derbyshire, Whalley, Stenger, Oakley, 2004
References

• Gravetter & Wallnau, Statistics for the Behavioural Sciences
• Last year's presentation, thank you to Louise Whiteley & Elisabeth Rounis
• http://www.fil.ion.ucl.ac.uk/spm/doc/mfd-2004.html
The geometric mean is the positive number X such that:

A / X = X / B

To find the geometric mean:

Example: find the geometric mean between 4 and 16. Write a proportion with X in the two mean positions, cross-multiply, then find the square root and simplify:

4 / X = X / 16  -->  X² = 64  -->  X = 8

Another example: find the geometric mean between 4 and 20:

4 / X = X / 20  -->  X² = 80  -->  X = √80 = 4√5

What if you know the geometric mean? Example: 6 is the geometric mean between 9 and what other number? Set up the proportion this way, cross-multiply, and finish (no square root this time!):

X / 6 = 6 / 9  -->  9X = 36  -->  X = 4
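A sketch of the computation (x = sqrt(A·B) follows from cross-multiplying A/X = X/B):

    import math

    def geometric_mean(a, b):
        # Positive x with a/x = x/b, i.e. x = sqrt(a * b)
        return math.sqrt(a * b)

    print(geometric_mean(4, 16))   # 8.0
    print(geometric_mean(4, 20))   # 8.944..., i.e. 4*sqrt(5)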
Research: Hypothesis
Definition

• The word hypothesis is derived from the Greek words "hypo" (under) and "tithemi" (place): it is placed under the known facts of the problem to explain the relationship between them.
• A hypothesis is a statement subject to verification; a guess, but an experienced guess based on some facts.
• It is a hunch, assumption, suspicion, assertion or an idea about a phenomenon, relationship, or situation, the reality or truth of which one does not know. A researcher calls these assumptions, assertions, statements, or hunches hypotheses, and they become the basis of an inquiry.
• In most cases, the hypothesis will be based upon either previous studies or the researcher's own or someone else's observations.
• "Hypothesis is a conjectural statement of relationship between two or more variables" (Kerlinger, Fred N., Foundations of Behavioural Research, 3rd edition, New York: Holt, Rinehart and Winston, 1986).
Definition

• "Hypothesis is a proposition, condition or principle which is assumed, perhaps without belief, in order to draw out its logical consequences and by this method to test its accord with facts which are known or may be determined" (Webster's New International Dictionary of English).
• "A tentative statement about something, the validity of which is usually unknown" (Black, James A. & Dean J. Champion, Methods and Issues in Social Research, New York: John Wiley & Sons, Inc., 1976).
• "Hypothesis is a proposition that is stated in a testable form and that predicts a particular relationship between two or more variables. In other words, if we think that a relationship exists, we first state it as a hypothesis and then test the hypothesis in the field" (Bailey, Kenneth D., Methods of Social Research, 3rd edition, New York: The Free Press, 1978).
Definition

• "A hypothesis is written in such a way that it can be proven or disproven by valid and reliable data; it is in order to obtain these data that we perform our study" (Grinnell, Richard, Jr., Social Work Research and Evaluation, 3rd edition, Itasca, Illinois: F.E. Peacock Publishers, 1988).
• "A hypothesis may be defined as a tentative theory or supposition set up and adopted provisionally as a basis of explaining certain facts or relationships and as a guide in the further investigation of other facts or relationships" (Crisp, Richard D., Marketing Research, New York: McGraw Hill Book Co., 1957).
Characteristics
A hypothesis has the following characteristics:
• a tentative proposition
• unknown validity
• specifies a relation between two or more variables
Functions
• Brings clarity to the research problem and serves the following functions:
  - provides a study with focus
  - signifies what specific aspects of a research problem are to be investigated
  - tells what data are to be collected and what not
  - enhances the objectivity of the study
  - helps to formulate the theory
  - enables conclusions about what is true or false
Characteristics
• Simple, specific, and contextually clear
• Capable of verification
• Related to the existing body of knowledge
• Operationalisable
Typologies
Three types:
• working hypothesis
• null hypothesis
• alternate hypothesis

Working hypothesis
The working or trial hypothesis is provisionally adopted to explain the relationship between some observed facts, for guiding a researcher in the investigation of a problem. A statement that constitutes a trial or working hypothesis is to be tested and confirmed, modified or even abandoned as the investigation proceeds.
Typologies
Null hypothesis
A null hypothesis is formulated against the working hypothesis; it opposes the statement of the working hypothesis. It is contrary to the positive statement made in the working hypothesis, formulated to disprove the contrary of a working hypothesis. When a researcher rejects a null hypothesis, he/she actually supports the working hypothesis.

In statistics, a null hypothesis is usually denoted H0. For example,

H0: Q = O

where Q is the property of the population under investigation and O is the hypothesized value.
Typologies
Alternate hypothesis
An alternate hypothesis is formulated when a researcher rejects the null hypothesis; he/she develops such a hypothesis with adequate reasons. The notation used for the alternate hypothesis is H1, e.g.

H1: Q > O

i.e., Q is greater than O.
Example
• Working hypothesis: population influences the number of bank branches in a town.
• Null hypothesis (H0): population does not have any influence on the number of bank branches in a town.
• Alternate hypothesis (H1): population has a significant effect on the number of bank branches in a town. A researcher formulates this hypothesis only after rejecting the null hypothesis.
Hypothesis Testing
• Goal: Make statement(s) regarding unknown population
parameter values based on sample data
• Elements of a hypothesis test:
– Null hypothesis - Statement regarding the value(s) of unknown
parameter(s). Typically will imply no association between
explanatory and response variables in our applications (will
always contain an equality)
– Alternative hypothesis - Statement contradictory to the null
hypothesis (will always contain an inequality)
– Test statistic - Quantity based on sample data and null
hypothesis used to test between null and alternative hypotheses
– Rejection region - Values of the test statistic for which we
reject the null in favor of the alternative hypothesis
Hypothesis Testing
                          Test result: H0 True    Test result: H0 False
True state: H0 True       Correct decision        Type I error
True state: H0 False      Type II error           Correct decision

α = P(Type I error)    β = P(Type II error)

• Goal: keep α and β reasonably small.
Example - Efficacy Test for New drug
• Drug company has new drug, wishes to compare it
with current standard treatment
• Federal regulators tell company that they must
demonstrate that new drug is better than current
treatment to receive approval
• Firm runs clinical trial where some patients
receive new drug, and others receive standard
treatment
• Numeric response of therapeutic effect is obtained
(higher scores are better).
• Parameter of interest: μNew - μStd
Example - Efficacy Test for New drug
• Null hypothesis: the new drug is no better than the standard treatment

H0: μ_New − μ_Std ≤ 0   (μ_New − μ_Std = 0)

• Alternative hypothesis: the new drug is better than the standard treatment

HA: μ_New − μ_Std > 0

• Experimental (sample) data: ȳ_New, ȳ_Std; s_New, s_Std; n_New, n_Std
Sampling Distribution of Difference in Means

• In large samples, the difference in two sample means is approximately normally distributed:

Ȳ1 − Ȳ2 ~ N( μ1 − μ2, σ1²/n1 + σ2²/n2 )

• Under the null hypothesis, μ1 − μ2 = 0 and:

Z = (Ȳ1 − Ȳ2) / sqrt( σ1²/n1 + σ2²/n2 ) ~ N(0, 1)

• σ1² and σ2² are unknown and estimated by s1² and s2².
Example - Efficacy Test for New drug

• Type I error: concluding that the new drug is better than the standard (HA) when in fact it is no better (H0). An ineffective drug is deemed better.
  - Traditionally α = P(Type I error) = 0.05
• Type II error: failing to conclude that the new drug is better (HA) when in fact it is. An effective drug is deemed to be no better.
  - Traditionally a clinically important difference (Δ) is assigned and sample sizes chosen so that β = P(Type II error | μ1 − μ2 = Δ) ≤ .20
Elements of a Hypothesis Test
• Test statistic: the difference between the sample means, scaled to the number of standard deviations (standard errors) from the null difference of 0 for the population means:

T.S.:  z_obs = (ȳ1 − ȳ2) / sqrt( s1²/n1 + s2²/n2 )

• Rejection region: the set of values of the test statistic that are consistent with HA, such that the probability it falls in this region when H0 is true is α (we will always set α = 0.05):

R.R.:  z_obs ≥ z_α;   α = 0.05 ⇒ z_α = 1.645
P-value (aka Observed Significance Level)
• P-value: a measure of the strength of evidence the sample data provide against the null hypothesis: P(evidence this strong or stronger against H0 | H0 is true).

P-val:  p = P(Z ≥ z_obs)
Large-sample test of H0: μ1 − μ2 = 0 vs HA: μ1 − μ2 > 0

• H0: μ1 − μ2 = 0 (no difference in population means)
• HA: μ1 − μ2 > 0 (population mean 1 > population mean 2)
• T.S.: z_obs = (ȳ1 − ȳ2) / sqrt( s1²/n1 + s2²/n2 )
• R.R.: z_obs ≥ z_α
• P-value: P(Z ≥ z_obs)
• Conclusion: reject H0 if the test statistic falls in the rejection region, or equivalently if the P-value is ≤ α.
Example - Botox for Cervical Dystonia

• Patients - Individuals suffering from cervical dystonia


• Response - Tsui score of severity of cervical dystonia
(higher scores are more severe) at week 8 of Tx
• Research (alternative) hypothesis - Botox A
decreases mean Tsui score more than placebo
• Groups - Placebo (Group 1) and Botox A (Group 2)
• Experimental (sample) results (Wissel, et al, 2001):

ȳ1 = 10.1   s1 = 3.6   n1 = 33
ȳ2 = 7.7    s2 = 3.4   n2 = 35
Example - Botox for Cervical Dystonia
Test whether Botox A produces lower mean Tsui scores than placebo (α = 0.05):

• H0: μ1 − μ2 = 0
• HA: μ1 − μ2 > 0
• T.S.: z_obs = (10.1 − 7.7) / sqrt( (3.6)²/33 + (3.4)²/35 ) = 2.4 / 0.85 = 2.82
• R.R.: z_obs ≥ z_α = z.05 = 1.645
• P-val: P(Z ≥ 2.82) = .0024

Conclusion: Botox A produces lower mean Tsui scores than placebo (since 2.82 > 1.645 and the P-value < 0.05).
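The same computation as a sketch (scipy assumed; stats.norm.sf gives the upper-tail probability):

    import math
    from scipy import stats

    y1, s1, n1 = 10.1, 3.6, 33    # placebo
    y2, s2, n2 = 7.7, 3.4, 35     # Botox A

    se = math.sqrt(s1**2 / n1 + s2**2 / n2)   # about 0.85
    z_obs = (y1 - y2) / se                    # about 2.82
    p = stats.norm.sf(z_obs)                  # one-sided P-value, about .0024
    print(z_obs, p)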
2-Sided Tests
• Many studies don't assume a direction with respect to the difference μ1 − μ2.
• H0: μ1 − μ2 = 0;  HA: μ1 − μ2 ≠ 0
• The test statistic is the same as before.
• Decision rule:
  - Conclude μ1 − μ2 > 0 if z_obs ≥ z_α/2 (α = 0.05 ⇒ z_α/2 = 1.96)
  - Conclude μ1 − μ2 < 0 if z_obs ≤ −z_α/2 (α = 0.05 ⇒ −z_α/2 = −1.96)
  - Do not reject μ1 − μ2 = 0 if −z_α/2 ≤ z_obs ≤ z_α/2
• P-value: 2·P(Z ≥ |z_obs|)
Power of a Test
• Power: the probability a test rejects H0 (depends on μ1 − μ2)
  - H0 true: power = P(Type I error) = α
  - H0 false: power = 1 − P(Type II error) = 1 − β
• Example: H0: μ1 − μ2 = 0, HA: μ1 − μ2 > 0, with σ1² = σ2² = 25 and n1 = n2 = 25
• Decision rule: reject H0 (at the α = 0.05 significance level) if:

z_obs = (ȳ1 − ȳ2) / sqrt( σ1²/n1 + σ2²/n2 ) = (ȳ1 − ȳ2) / sqrt(2) ≥ 1.645  ⇒  ȳ1 − ȳ2 ≥ 2.326
Power of a Test
• Now suppose in reality that μ1 − μ2 = 3.0 (HA is true).
• Power now refers to the probability we (correctly) reject the null hypothesis. Note that the sampling distribution of the difference in sample means is approximately normal, with mean 3.0 and standard deviation (standard error) sqrt(2) = 1.414.
• Decision rule (from the last slide): conclude the population means differ if the sample mean for group 1 is at least 2.326 higher than the sample mean for group 2.
• Power for this case can be computed as:

P(Ȳ1 − Ȳ2 ≥ 2.326),   where  Ȳ1 − Ȳ2 ~ N(3, 1.414)
Power of a Test

Power = P(Ȳ1 − Ȳ2 ≥ 2.326) = P( Z ≥ (2.326 − 3) / 1.414 = −0.48 ) = .6844

All else being equal:
• As sample sizes increase, power increases
• As population variances decrease, power increases
• As the true mean difference increases, power increases
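A sketch of the power computation for this example:

    from scipy import stats

    se = (25/25 + 25/25) ** 0.5                # sqrt(2) = 1.414
    crit = 1.645 * se                          # reject H0 if ybar1 - ybar2 >= 2.326
    power = stats.norm.sf((crit - 3.0) / se)   # P(Z >= -0.48), about .68
    print(power)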
Power of a Test
[figure: sampling distributions of the difference under H0 and under HA]

Power curves for group sample sizes of 25, 50, 75, 100 and varying true values of μ1 − μ2 with σ1 = σ2 = 5:
• For a given μ1 − μ2, power increases with sample size
• For a given sample size, power increases with μ1 − μ2
Sample Size Calculations for Fixed Power
• Goal - Choose sample sizes to have a favorable chance of detecting a clinically meaningful difference
• Step 1 - Define an important difference in means:
– Case 1: σ approximated from prior experience or a pilot study - the difference can be stated in units of the data
– Case 2: σ unknown - the difference must be stated in units of standard deviations of the data:

δ = (μ1 − μ2)/σ

• Step 2 - Choose the desired power to detect the clinically meaningful difference (1 − β, typically at least .80). For a 2-sided test:

n1 = n2 = 2(zα/2 + zβ)² / δ²
Example - Rosiglitazone for HIV-1
Lipoatrophy
• Trts - Rosiglitazone vs Placebo
• Response - Change in Limb fat mass
• Clinically Meaningful Difference - 0.5 (std dev’s)
• Desired Power - 1-β = 0.80
• Significance Level - α = 0.05

zα/2 = 1.96   zβ = z.20 = 0.84

n1 = n2 = 2(1.96 + 0.84)² / (0.5)² = 63
Source: Carr, et al (2004)
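A minimal Python sketch of this calculation (SciPy assumed; rounding up to the next whole subject):

  from math import ceil
  from scipy import stats

  alpha, power, delta = 0.05, 0.80, 0.5

  z_a2 = stats.norm.isf(alpha / 2)      # 1.96
  z_b = stats.norm.isf(1 - power)       # 0.84
  n = 2 * (z_a2 + z_b)**2 / delta**2    # about 62.8
  print(ceil(n))                        # 63 per group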
Confidence Intervals
• Normally Distributed data - approximately 95% of
individual measurements lie within 2 standard
deviations of the mean
• Difference between 2 sample means is
approximately normally distributed in large
samples (regardless of shape of distribution of
individual measurements):

Ȳ1 − Ȳ2 ~ N( μ1 − μ2 , σ1²/n1 + σ2²/n2 )
• Thus, we can expect (with 95% confidence) that our sample
mean difference lies within 2 standard errors of the true difference
(1-α)100% Confidence Interval for μ1-μ2

• Large sample Confidence Interval for μ1-μ2:

(ȳ1 − ȳ2) ± zα/2 √(s1²/n1 + s2²/n2)
• Standard level of confidence is 95% (z.025 = 1.96 ≈ 2)
• (1-α)100% CI’s and 2-sided tests reach the same
conclusions regarding whether μ1-μ2= 0
Example - Viagra for ED
• Comparison of Viagra (Group 1) and Placebo (Group 2)
for ED
• Data pooled from 6 double-blind trials
• Subjects - White males
• Response - Percent of successful intercourse attempts in past 4 weeks (each subject reports his own percentage)

ȳ1 = 63.2   s1 = 41.3   n1 = 264
ȳ2 = 23.5   s2 = 42.3   n2 = 240

95% CI for μ1- μ2:


(63.2 − 23.5) ± 1.96 √((41.3)²/264 + (42.3)²/240) ≡ 39.7 ± 7.3 ≡ (32.4, 47.0)
Source: Carson, et al (2002)
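The interval can be checked with a few lines of Python (SciPy assumed):

  from math import sqrt
  from scipy import stats

  ybar1, s1, n1 = 63.2, 41.3, 264   # Viagra
  ybar2, s2, n2 = 23.5, 42.3, 240   # Placebo

  diff = ybar1 - ybar2
  moe = stats.norm.isf(0.025) * sqrt(s1**2/n1 + s2**2/n2)   # margin of error, about 7.3
  print(diff - moe, diff + moe)                             # about (32.4, 47.0)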
Hypothesis Testing: Preliminaries

A hypothesis is a statement that something is true.


Null hypothesis: A hypothesis to be tested. We use
the symbol H0 to represent the null hypothesis
Alternative hypothesis: A hypothesis to be
considered as an alternative to the null hypothesis. We
use the symbol Ha to represent the alternative
hypothesis.
- The alternative hypothesis is the one believed to be true, or what you are trying to prove is true.
In this course, we will always assume that the null
hypothesis for a population parameter, Θ , always specifies
a single value for that parameter. So, an equal sign always
appears:
H 0 :Θ = Θ 0

If the primary concern is deciding whether a population


parameter is different than a specified value, the alternative
hypothesis should be:
H a :Θ ≠ Θ 0

This form of alternative hypothesis is called a two-tailed


test.
Example: You suspect that the equilibrium wage of low
skilled workers is not the federal minimum wage level of
$5.15
*If the primary concern is whether a population
parameter, Θ , is less than a specified value Θ 0 , the
alternative hypothesis should be:
H a :Θ < Θ 0

A hypothesis test whose alternative hypothesis has


this form is called a left-tailed test.
*If the primary concern is whether a population
parameter, Θ , is greater than a specified value Θ 0,
the alternative hypothesis should be:
H a :Θ > Θ 0

A hypothesis test whose alternative hypothesis has


this form is called a right-tailed test.
A hypothesis test is called a one-tailed test if it is
either right- or left-tailed, i.e.,if it is not a two-tailed
test.
After we have the null hypothesis, we have to determine
whether to reject it or fail to reject it.
The decision to reject or fail to reject is based on information
contained in a sample drawn from the population of interest.
The sample values are used to compute a single number, corresponding to a point on a line, which operates as a decision maker. This decision maker is called the test statistic.
If the test statistic falls in a region that supports the alternative hypothesis, we reject the null hypothesis. This region is called the rejection region.
If it falls in a region that supports the null hypothesis, we fail to reject the null hypothesis. This region is called the acceptance region.
The point dividing the rejection region from the acceptance region is called the critical value.
We can make mistakes in the test.
Type I error: reject the null hypothesis when it is true.
probability of type I error is denoted by α
Type II error: accept the null hypothesis when it is wrong.
probability of type II error is denoted by β
Test of hypothesis for a population mean

• We are basically asking: What observed value of x


bar would be different enough from my null
hypothesis value to convince me that my null is
wrong
• We always talk in terms of type I errors, alpha,
which are always small (.1, .05, .01)
• The smaller alpha gets, the tighter your proof that the alternative is correct, because the probability of a type I error is reduced; but the chance of a type II error is increased.
Test of hypothesis for a population mean
(two tailed and large sample)
1) Hypothesis: H 0 :μ = μ0
H a :μ ≠ μ0
2) Test statistic: large sample case
zobs = (x̄ − μ0) / (σ/√n)
3) Critical value, rejection and acceptance region:
- The bigger the absolute value of z is, the more
possible to reject null hypothesis.
- The critical value depend on the significance level α
- rejection region: |zobs| > zα/2
Test of hypothesis for a population mean
(one tailed test and large sample)
1) Hypothesis: H 0 :μ = μ0
H a :μ > μ0 or H a :μ < μ0
2) Test statistic: large sample case
zobs = (x̄ − μ0) / (σ/√n)
3) Critical value, rejection and acceptance region:
rejection region: zobs > zα or zobs < − zα
Example: a sample of 60 students’ grades is taken from a large class; the average grade in the sample is 80 with a sample standard deviation of 10.
Test the hypothesis that the average grade is 75 with
5% significance level (probability of making a type I
error).
Test of hypothesis for a population mean
(two tailed and small sample)
1) Hypothesis: H 0 :μ = μ0
H a :μ ≠ μ0
2) Test statistic: small sample case
t = (x̄ − μ0) / (s/√n)
3) Critical value, rejection and acceptance region:
- The bigger the absolute value of t is, the more
possible to reject null hypothesis.
- The critical value depends on significance level α

- rejection region: | t |> tα / 2 d.f.=n-1


Test of hypothesis for a population mean
(one tailed test and small sample)
1) Hypothesis: H 0 :μ = μ0
H a : μ > μ 0 or H a :μ < μ0
2) Test statistic: small sample case
t = (x̄ − μ0) / (s/√n)
3) Critical value, rejection and acceptance region:
rejection region: t > tα or t < −tα d.f.=n-1
Example: suppose you have a sample of 11 Econ 70
midterm exam grades. The mean of that sample
is 81 with a standard deviation of 9.
1) Test hypothesis that average grade of the
population is 75 with 5% significance level.
2) Test hypothesis that average grade of the
population is greater than 80 with 5%
significance level.
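A sketch of how these two tests might be run in Python (SciPy assumed):

  from math import sqrt
  from scipy import stats

  n, xbar, s = 11, 81, 9
  se = s / sqrt(n)

  # 1) Two-tailed test of H0: mu = 75
  t1 = (xbar - 75) / se                      # about 2.21
  print(t1, stats.t.isf(0.025, df=n - 1))    # critical value 2.228; |t| < 2.228

  # 2) One-tailed test of H0: mu = 80 vs Ha: mu > 80
  t2 = (xbar - 80) / se                      # about 0.37
  print(t2, stats.t.isf(0.05, df=n - 1))     # critical value 1.812; t < 1.812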
STATA
• ttest
Test of difference between two
population means

Population 1: faculty in public schools


Population 2: faculty in private schools

μ1 =mean salary of faculty in public schools


μ 2 =mean salary of faculty in private schools
Two samples: one from public the other from
private
H 0 : μ1 = μ 2
H a : μ1 ≠ μ 2
In the large-sample case, the sampling distribution of the difference between sample means, x̄1 − x̄2, is a normal distribution with mean

μ(x̄1 − x̄2) = μ1 − μ2

and standard deviation

σ(x̄1 − x̄2) = √(σ1²/n1 + σ2²/n2)
Test of hypothesis for difference of two population means
(two tailed and large sample)
1) Hypothesis: D0 is some specified difference that you
wish to test. For many tests, you will wish to
hypothesize that there is no difference between two
means, that is D0=0
H 0 : μ1 − μ 2 = D0
H a : μ1 − μ 2 ≠ D0
2) Test statistic: large sample case
zobs = (x̄1 − x̄2 − D0) / σ(x̄1 − x̄2) = (x̄1 − x̄2 − D0) / √(σ1²/n1 + σ2²/n2)

3) Critical value, rejection and acceptance region:


rejection region: | zobs |> zα / 2
Test of hypothesis for difference of two population
means(one tailed test and large sample)

1) Hypothesis:
H 0 : μ1 − μ 2 = D0
H a : μ1 − μ 2 > D 0 or H a : μ1 − μ 2 < D0
2) Test statistic: large sample case

zobs = (x̄1 − x̄2 − D0) / σ(x̄1 − x̄2) = (x̄1 − x̄2 − D0) / √(σ1²/n1 + σ2²/n2)

3) Critical value, rejection and acceptance region:


rejection region: zobs > zα or zobs < − zα
Example: compare salary difference.

Population 1: faculty in public schools


Population 2: faculty in private schools
μ1 = mean salary of faculty in public schools
μ2 = mean salary of faculty in private schools
Sample 1: salaries of faculty members in public schools
(n=30)
Sample 2: salaries of faculty members in private schools
(n=35)
x1 = 57.48 x2 = 66.39
s1 = 9 s 2 = 9 .5

Test the hypothesis that salaries are lower for faculty in public schools, with 5% significance level
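A sketch of this one-tailed large-sample test in Python (SciPy assumed):

  from math import sqrt
  from scipy import stats

  x1, s1, n1 = 57.48, 9.0, 30    # public
  x2, s2, n2 = 66.39, 9.5, 35    # private

  z = (x1 - x2) / sqrt(s1**2/n1 + s2**2/n2)   # about -3.88
  print(z, z < -stats.norm.isf(0.05))         # -3.88 < -1.645, so reject H0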
In the small-sample case, the sampling distribution of the difference between two means is the t-distribution with mean

μ(x̄1 − x̄2) = μ1 − μ2

and estimated standard error

s √(1/n1 + 1/n2)

where

s² = [(n1 − 1)s1² + (n2 − 1)s2²] / (n1 + n2 − 2)

with n1 + n2 − 2 degrees of freedom


Test of hypothesis for difference of two population
means
(two tailed and small sample)

1) Hypothesis:
H 0 : μ1 − μ 2 = D0
H a : μ1 − μ 2 ≠ D0

2) Test statistic: small sample case


tobs = (x̄1 − x̄2 − D0) / [s √(1/n1 + 1/n2)]

3) Critical value, rejection and acceptance region:


rejection region: | tobs |> tα / 2 d.f=n1+n2-2
Test of hypothesis for difference of two population means
(one tailed test and small sample)
1) Hypothesis:
H 0 : μ1 − μ 2 = D0
H a : μ 1 − μ 2 > D 0 or H a : μ1 − μ 2 < D0
2) Test statistic: small sample case
tobs = (x̄1 − x̄2 − D0) / [s √(1/n1 + 1/n2)]

3) Critical value, rejection and acceptance region:


rejection region: tobs > tα or tobs < −tα d.f.=n1+n2-2
Example: compare salary difference.

Population 1: faculty in public schools


Population 2: faculty in private schools

μ 1 =mean salary of faculty in public schools


μ 2 =mean salary of faculty in private schools

Sample 1: salaries of faculty members in public schools (n=10)


Sample 2: salaries of faculty members in private schools (n=15)

x1 = 57.48 x2 = 66.39
s1 = 9 s 2 = 9 .5
Test the hypothesis that the salaries are the same for faculty in
public and private school with 5% significance level
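A sketch of the pooled two-sample t test in Python (SciPy assumed):

  from math import sqrt
  from scipy import stats

  x1, s1, n1 = 57.48, 9.0, 10    # public
  x2, s2, n2 = 66.39, 9.5, 15    # private

  sp2 = ((n1 - 1)*s1**2 + (n2 - 1)*s2**2) / (n1 + n2 - 2)   # pooled variance
  t = (x1 - x2) / (sqrt(sp2) * sqrt(1/n1 + 1/n2))           # about -2.34
  t_crit = stats.t.isf(0.025, df=n1 + n2 - 2)               # about 2.069
  print(t, abs(t) > t_crit)                                 # |t| > 2.069, so reject H0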
Test of hypothesis for binomial proportion
1) Hypothesis: H 0 : p = p0

Two-tailed: H a : p ≠ p0

One-tailed: H a : p > p0 or H a : p < p0

2) Test statistic: large sample case


zobs = (p̂ − p0) / √(p0 q0 / n), where p̂ = x/n
3) Critical value, rejection and acceptance region:
rejection region: two-tailed: |zobs| > zα/2
one-tailed: zobs > zα or zobs < −zα
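No worked example is given here, so the following Python sketch uses made-up counts purely for illustration (SciPy assumed): 270 successes in 500 trials, testing H0: p = 0.5 two-tailed.

  from math import sqrt
  from scipy import stats

  x, n, p0 = 270, 500, 0.5          # hypothetical data

  p_hat = x / n
  z = (p_hat - p0) / sqrt(p0 * (1 - p0) / n)   # about 1.79
  print(z, 2 * stats.norm.sf(abs(z)))          # two-tailed p, about .074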


STATA
• prtest
Test of hypothesis for difference in binomial
proportions

1) Hypothesis:
H0: p1 − p2 = D0
HA: p1 − p2 ≠ D0 (two-tailed) or p1 − p2 > D0 or p1 − p2 < D0 (one-tailed)
2) Test statistic:

zobs = (p̂1 − p̂2 − D0) / √(p1q1/n1 + p2q2/n2)
Test of hypothesis for difference in binomial
proportions

• Because p1 and p2 are not known, use a pooled p̂ in the sample standard error when you are testing whether the difference is zero:

p̂ = (x1 + x2) / (n1 + n2)
• And when you are testing whether the difference
is something other than zero use the estimated
proportions from the two different samples
• Section 8.8 in the book has this spelled out nicely
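A sketch of the pooled version (testing D0 = 0) with hypothetical counts, in Python (SciPy assumed):

  from math import sqrt
  from scipy import stats

  x1, n1 = 45, 120                   # hypothetical data
  x2, n2 = 30, 130

  p1_hat, p2_hat = x1/n1, x2/n2
  p_pool = (x1 + x2) / (n1 + n2)     # pooled estimate, used only when D0 = 0
  se = sqrt(p_pool * (1 - p_pool) * (1/n1 + 1/n2))
  z = (p1_hat - p2_hat) / se
  print(z, 2 * stats.norm.sf(abs(z)))   # z about 2.49, two-tailed p about .013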
P-values
The smallest value of alpha for which the test results are statistically significant, in other words statistically different from the null hypothesis value; the smallest alpha at which you still reject the null.
Example 1: You see a p-value of .025
- You would fail to reject at the 1% level of significance, but reject at the 5% level
Example 2: 60 students are polled; an average of 72 is observed with a standard deviation of 10. What is the p-value of the test of whether the population average is 75?
P-value
1. Calculate the z observed value for your observation
2. Find the area to the right of this value
3. If this is a two tailed test multiply this area by 2, if this is
a one-tail test you are done

Example: 60 students are polled; an average of 72 is observed with a standard deviation of 10. What is the p-value of the test of whether the population average is 75?
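A sketch of the answer in Python (SciPy assumed), following the three steps above:

  from math import sqrt
  from scipy import stats

  n, xbar, s, mu0 = 60, 72, 10, 75

  z = (xbar - mu0) / (s / sqrt(n))   # step 1: about -2.32
  p_right = stats.norm.sf(abs(z))    # step 2: area beyond |z|, about .010
  print(2 * p_right)                 # step 3: two-tailed p-value, about .020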
Power of a statistical test
- Power = P(reject the null hypothesis when it is false) = 1 − β
- (1 − α) is the probability we accept the null when it is in fact true
- (1 − β) is the probability we reject when the null is in fact false; this is the power of the test
- You would prefer to have larger power
- The power changes depending on what the actual population parameter is
Correlation
In this section you will cover covariance and Pearson’s product moment correlation coefficient.

[Figure: quadrant diagram around the point (x̄, ȳ) showing the sign of (x − x̄)(y − ȳ) in each quadrant]

Σ(x − x̄)(y − ȳ) / n = Sxy = covariance

Pearson’s Product Moment Correlation Coefficient (PMCC)
The idea is to standardise the covariance so that it can be interpreted easily. It converts the covariance to a number between -1 and 1, where:
• -1 is a perfect negative correlation
• 1 is a perfect positive correlation
• 0 is no correlation
[Portrait: Karl Pearson, 1857-1936]

r = [Σ(x − x̄)(y − ȳ)/n] / √[ (Σ(x − x̄)²/n) (Σ(y − ȳ)²/n) ]

Pearson’s Product Moment Correlation Coefficient (cont.)

r = [(1/n) Σ(x − x̄)(y − ȳ)] / (Sx Sy)
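A minimal Python sketch of these formulas, using a small made-up data set purely for illustration:

  from math import sqrt

  x = [1, 2, 3, 4, 5]    # hypothetical paired data
  y = [2, 4, 5, 4, 5]

  n = len(x)
  xbar, ybar = sum(x)/n, sum(y)/n
  sxy = sum((xi - xbar)*(yi - ybar) for xi, yi in zip(x, y)) / n   # covariance
  sx = sqrt(sum((xi - xbar)**2 for xi in x) / n)
  sy = sqrt(sum((yi - ybar)**2 for yi in y) / n)
  print(sxy / (sx * sy))   # r, always between -1 and 1 (about 0.77 here)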
The Kruskal-Wallis H Test

• The Kruskal-Wallis H Test is a


nonparametric procedure that can be used to
compare more than two populations in a
completely randomized design.
• All n = n1+n2+…+nk measurements are jointly
ranked (i.e.treat as one large sample).
• We use the sums of the ranks of the k samples
to compare the distributions.
The Kruskal-Wallis H Test

• Rank the total measurements in all k samples from 1 to n. Tied observations are assigned the average of the ranks they would have gotten if not tied.
• Calculate Ti = rank sum for the ith sample, i = 1, 2,…,k
• And the test statistic:

H = [12/(n(n+1))] Σ (Ti²/ni) − 3(n + 1)
The Kruskal-Wallis H Test

H0: the k distributions are identical versus


Ha: at least one distribution is different
Test statistic: Kruskal-Wallis H
When H0 is true, the test statistic H has an
approximate chi-square distribution with df
= k-1.
Use a right-tailed rejection region or p-
value based on the Chi-square distribution.
Example
Four groups of students were randomly
assigned to be taught with four different
techniques, and their achievement test scores
were recorded. Are the distributions of test
scores the same, or do they differ in location?
1 2 3 4
65 75 59 94
87 69 78 89
73 83 67 80
79 81 62 88
Teaching Methods
1        2        3        4
65 (3)   75 (7)   59 (1)   94 (16)
87 (13)  69 (5)   78 (8)   89 (15)
73 (6)   83 (12)  67 (4)   80 (10)
79 (9)   81 (11)  62 (2)   88 (14)
Ti  31       35       15       55

Rank the 16 measurements from 1 to 16, and calculate the four rank sums.
H0: the distributions of scores are the same
Ha: the distributions differ in location

Test statistic: H = [12/(n(n+1))] Σ (Ti²/ni) − 3(n + 1)
= [12/(16·17)] [(31² + 35² + 15² + 55²)/4] − 3(17) = 8.96
Teaching Methods
H0: the distributions of scores are the same
Ha: the distributions differ in location

Test statistic: H = [12/(n(n+1))] Σ (Ti²/ni) − 3(n + 1)
= [12/(16·17)] [(31² + 35² + 15² + 55²)/4] − 3(17) = 8.96

Rejection region: for a right-tailed chi-square test with α = .05 and df = 4 − 1 = 3, reject H0 if H ≥ 7.81.
Conclusion: Reject H0. There is sufficient evidence to indicate that there is a difference in test scores for the four teaching techniques.
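The same H statistic can be obtained in Python (SciPy assumed; scipy.stats.kruskal also returns the chi-square P-value):

  from scipy import stats

  g1 = [65, 87, 73, 79]
  g2 = [75, 69, 83, 81]
  g3 = [59, 78, 67, 62]
  g4 = [94, 89, 80, 88]

  H, p = stats.kruskal(g1, g2, g3, g4)
  print(H, p)   # H about 8.96, p below .05, so reject H0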
Key Concepts
I. Nonparametric Methods
These methods can be used when the data cannot be measured on a
quantitative scale, or when
• The numerical scale of measurement is arbitrarily set by the
researcher, or when
• The parametric assumptions such as normality or constant
variance are seriously violated.
Key Concepts
Kruskal-Wallis H Test: Completely Randomized Design
1. Jointly rank all the observations in the k samples (treat as one
large sample of size n say). Calculate the rank sums, Ti = rank
sum of sample i, and the test statistic
H = [12/(n(n+1))] Σ (Ti²/ni) − 3(n + 1)

2. If the null hypothesis of equality of distributions is false, H


will be unusually large, resulting in a one-tailed test.
3. For sample sizes of five or greater, the rejection region for H is
based on the chi-square distribution with (k − 1) degrees of
freedom.
The Mann-Whitney U test
Peter Shaw
Introduction
• We meet our first inferential test.
• You should not get put off by the messy-looking formulae - it’s usually run on a PC anyway.
• The important bit is to understand the philosophy of the test.
Imagine..
• That you have acquired a set of measurements from 2 different sites.
– Maybe one is alleged to be polluted, the other clean, and you measure residues in the soil.
– Maybe these are questionnaire returns from students identified as M or F.
• You want to know whether these 2 sets of measurements genuinely differ. The issue here is that you need to rule out the possibility of the results being random noise.
The formal procedure:
• Involves the creation of two competing explanations for the data recorded.
– Idea 1: These are pattern-less random data. Any observed patterns are due to chance. This is the null hypothesis H0.
– Idea 2: There is a defined pattern in the data. This is the alternative hypothesis H1.
• Without the statement of the competing hypotheses, no meaningful test can be run.
Occam’s razor
• If competing explanations exist, choose the simpler unless there is good reason to reject it.
• Here, you must assume H0 to be true until you can reject it.
• In point of fact you can never ABSOLUTELY prove that your observations are non-random. Any pattern could arise in random noise, by chance. Instead you work out how likely H0 is to be true.
Example
You conduct a questionnaire survey of homes in the
Heathrow flight path, and also a control population of
homes in South west London. Responses to the question
“How intrusive is plane noise in your daily life” are
tabulated:
Noise complaints: 1 = no complaint, 5 = very unhappy
Homes near airport   Control site
5                    3
4                    2
4                    4
3                    1
5                    2
4                    1
5
Stage 1: Eyeball the data!
• These data are ordinal, but not normally distributed (allowable scores are 1, 2, 3, 4 or 5).
• Use non-parametric statistics.
• It does look as though people are less happy under the flightpath, but recall that we must state our hypotheses H0, H1:
– H0: There is no difference in attitudes to plane noise between the two areas - any observed differences are due to chance.
– H1: Responses to the question differed between the two areas.
Now we assess how likely it is that this pattern could occur by chance:
• This is done by performing a calculation. Don’t worry yet about what the calculation entails.
• What matters is that the calculation gives an answer (a test statistic) whose likelihood can be looked up in tables. Thus by means of this tool - the test statistic - we can work out an estimate of the probability that the observed pattern could occur by chance in random data.
One philosophical hurdle to go:
• The test statistic generates a probability - a number from 0 to 1, which is the probability of H0 being true.
• If p = 0, H0 is certainly false. (Actually this is over-simple, but a good approximation.)
• If p is large, say p = 0.8, H0 must be accepted as true.
• But how about p = 0.1, p = 0.01?
Significance
• We have to define a threshold, a boundary, and say that if p is below this threshold H0 is rejected and H1 accepted; otherwise H0 is accepted.
• This boundary is called the significance level. By convention it is set at p = 0.05 (1:20), but you can choose any other number - as long as you specify it in the write-up of your analyses.
• WARNING!! This means that if you analyse 100 sets of random data, the expectation (long-term average) is that 5 will generate a significant test.
The procedure:
1. Set up H0, H1.
2. Decide significance level: p = 0.05.
3. Data (airport vs control): 5, 4, 4, 3, 5, 4, 5 vs 3, 2, 4, 1, 2, 1.
4. Calculate the test statistic: U = 15.5; probability of H0 being true: p = 0.03.
5. Is p above the critical level? Yes: accept H0. No: reject H0. (Here p = 0.03 < 0.05, so reject H0.)
This particular test:
• The Mann-Whitney U test is a non-parametric test which examines whether 2 columns of data could have come from the same population (i.e. “should” be the same).
• It generates a test statistic called U (no idea why it’s U). By hand we look U up in tables; PCs give you an exact probability.
• It requires 2 sets of data - these need not be paired, nor need they be normally distributed, nor need there be equal numbers in each set.
How to do it
• 1: Rank all data into ascending order, then re-code the data set, replacing raw data with ranks.
• 2: Harmonize ranks where the same value occurs more than once (tied values each take the average of the ranks they would otherwise occupy).

Raw data     Ranks (#)          Harmonized ranks
5    3       5 #13    3 #5      12      5.5
4    2       4 #10    2 #4      8.5     3.5
4    4       4 #9     4 #7      8.5     8.5
3    1       3 #6     1 #2      5.5     1.5
5    2       5 #12    2 #3      12      3.5
4    1       4 #8     1 #1      8.5     1.5
5            5 #11              12
Once data are ranked:
• Add up the ranks for each column; call these rx and ry.
• (Optional but a good check: rx + ry = n²/2 + n/2, or you have an error.)
• Calculate:
– Ux = NxNy + Nx(Nx+1)/2 − Rx
– Uy = NxNy + Ny(Ny+1)/2 − Ry
• Take the SMALLER of these 2 values and look it up in tables. If U is LESS than the critical value, reject H0.
• NB This test is unique in one feature: here low values of the test statistic are significant - this is not true for any other test.
In this case:
Harmonized rank sums: rx = 67, ry = 24.
Check: rx + ry = 91; 13·13/2 + 13/2 = 91. CHECK.
Ux = 6·7 + 7·8/2 − 67 = 3
Uy = 6·7 + 6·7/2 − 24 = 39
Lowest U value is 3.
Critical value of U(7,6) = 4 at p = 0.01. Calculated U is < tabulated U, so reject H0.
At p = 0.01 these two sets of data differ.
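On a PC this whole calculation is one call; a Python sketch (SciPy assumed):

  from scipy import stats

  airport = [5, 4, 4, 3, 5, 4, 5]
  control = [3, 2, 4, 1, 2, 1]

  # SciPy reports U for the first sample (39 here); the smaller of the
  # two U values is 7*6 - 39 = 3, matching the hand calculation above.
  U, p = stats.mannwhitneyu(airport, control, alternative='two-sided')
  print(U, p)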
Tails.. Generally use 2-tailed tests
• 2-tailed test: These populations DIFFER (rejection regions in both the lower and upper tails of the distribution).
• 1-tailed test: Population X is Greater than Y (or Less than Y) (rejection region in one tail only).
Kruskal-Wallis: The U test’s big cousin
When we have 2 groups to compare (M/F, site 1/site 2, etc.) the U test is applicable and safe.
How do we handle cases with 3 or more groups?
The simple answer is to run the Kruskal-Wallis test. This is run on a PC, but behaves very much like the M-W U. It will give one significance value, which simply tells you whether at least one group differs from another.
[Diagram: two groups (Males vs Females) - do males differ from females? Three groups (Site 1, Site 2, Site 3) - do results differ between these sites?]
Your coursework:
I will give each of you a sheet with data collected from 3 sites. (Don’t try copying - each one is different and I know who gets which dataset!)
I want you to show me your data processing skills as follows:
1: Produce a boxplot of these data, showing how values differ between the categories.
2: Run 3 separate Mann-Whitney U tests on them, comparing 1-2, 1-3 and 2-3. Only call the result significant if the p value is < 0.01.
3: Run a Kruskal-Wallis anova on the three groups combined, and comment on your results.
Mann-Whitney U test
pair no.   S. male thorax width   P. male thorax width
1          4                      2.8
2          3                      2.7
3          2.6                    2.6
4          3.85                   2.7
5          2.65                   2.6
6          2.7                    2.6
7          2.85                   2.7
8          2.85                   2.8
9          3.2                    2.9
10         2.9                    2.6
Mann-Whitney U test

[Screenshot: spreadsheet sort button - push this to sort the data in ascending order]
Mann-Whitney U test
S. male (rank / thorax width)   P. male (rank / thorax width)
3 / 2.6                         3 / 2.6
6 / 2.65                        3 / 2.6
8.5 / 2.7                       3 / 2.6
13.5 / 2.85                     3 / 2.6
13.5 / 2.85                     8.5 / 2.7
15.5 / 2.9                      8.5 / 2.7
17 / 3                          8.5 / 2.7
18 / 3.2                        11.5 / 2.8
19 / 3.85                       11.5 / 2.8
20 / 4                          15.5 / 2.9
• Rank both lists as one combined list
• I found this a time-consuming task
Mann-Whitney U test
• Sum the ranks for each sample (from the ranked table above): R1 = 134, R2 = 76
• N1 = # obs in sample 1; N2 = # obs in sample 2 (here N1 = N2 = 10)
Mann-Whitney U test
• Normally you would now use the formulas and chart in the Brown reading.
• U1 = (N1)(N2) + [(N1)(N1+1)]/2 − R1 = 100 + 55 − 134 = 21
• U2 = (N1)(N2) + [(N2)(N2+1)]/2 − R2 = 100 + 55 − 76 = 79
• However the sample size is larger than the table will allow, because any sample greater than 20 can be assumed to mimic normality.
• We therefore use the equation to convert the U statistic to a Z-score.
Mann-Whitney U test
• U1 = 21, U2 = 79; N1 = 10, N2 = 10
• Z = {largest U value − [(N1)(N2)]/2} / √{[(N1)(N2)(N1+N2+1)]/12}
• Z = (79 − 50)/√175 ≈ 2.19
• If Z > 1.96 then P < 0.05
• Therefore there is a significant difference between the thorax width of single and mated males
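A sketch of this normal-approximation step in Python, using the rank sums from the table above:

  from math import sqrt

  N1 = N2 = 10
  R1, R2 = 134, 76                        # rank sums of the two samples
  U1 = N1*N2 + N1*(N1 + 1)/2 - R1         # 21
  U2 = N1*N2 + N2*(N2 + 1)/2 - R2         # 79 (check: U1 + U2 = N1*N2)
  z = (max(U1, U2) - N1*N2/2) / sqrt(N1*N2*(N1 + N2 + 1)/12)
  print(U1, U2, z)                        # z about 2.19, which exceeds 1.96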
Wilcoxon Signed Rank
• When N > 15, use a z-score conversion
• μT+ = N(N+1)/4
• VarT+ = N(N+1)(2N+1)/24
• Z = (T+ − μT+) / √(VarT+) = {T+ − [N(N+1)/4]} / √{[N(N+1)(2N+1)]/24}
• If Z > 1.96 then P < 0.05: reject the null hypothesis
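A sketch with hypothetical numbers (N = 20 pairs, positive-rank sum T+ = 160), purely to illustrate the conversion:

  from math import sqrt

  N, T_plus = 20, 160                 # hypothetical values

  mu = N*(N + 1)/4                    # 105
  var = N*(N + 1)*(2*N + 1)/24        # 717.5
  z = (T_plus - mu) / sqrt(var)       # about 2.05
  print(z, z > 1.96)                  # significant at the .05 level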
MEASURES OF
DISPERSION
Measures of Dispersion
• While measures of central tendency indicate what value
of a variable is (in one sense or other) “average” or
“central” or “typical” in a set of data, measures of
dispersion (or variability or spread) indicate (in one
sense or other) the extent to which the observed values
are “spread out” around that center — how “far apart”
observed values typically are from each other and
therefore from some average value (in particular, the
mean). Thus:
– if all cases have identical observed values (and thereby are also
identical to [any] average value), dispersion is zero;
– if most cases have observed values that are quite “close
together” (and thereby are also quite “close” to the average
value), dispersion is low (but greater than zero); and
– if many cases have observed values that are quite “far away”
from many others (or from the average value), dispersion is high.
• A measure of dispersion provides a summary statistic
that indicates the magnitude of such dispersion and, like
a measure of central tendency, is a univariate statistic.
Importance of the Magnitude of Dispersion Around the Average
• Dispersion around the mean test score.

• Baltimore and Seattle have about the same mean daily


temperature (about 65 degrees) but very different
dispersions around that mean.

• Dispersion (Inequality) around average household


income.
Hypothetical Ideological Dispersion
Hypothetical Ideological Dispersion (cont.)
Dispersion in Percent Democratic in CDs
Measures of Dispersion
• Because dispersion is concerned with how “close
together” or “far apart” observed values are (i.e., with the
magnitude of the intervals between them), measures of
dispersion are defined only for interval (or ratio)
variables,
– or, in any case, variables we are willing to treat as interval (like
IDEOLOGY in the preceding charts).
– There is one exception: a very crude measure of dispersion
called the variation ratio, which is defined for ordinal and even
nominal variables. (It will be discussed briefly in the Answers &
Discussion to PS #7.)

• There are two principal types of measures of dispersion:


range measures and deviation measures.
Range Measures of Dispersion
• Range measures are based on the distance between
pairs of (relatively) “extreme” values observed in the
data.
– They are conceptually connected with the median as a measure
of central tendency.

• The (“total” or “simple”) range is the maximum (highest)


value observed in the data [the value of the case at the
100th percentile] minus the minimum (lowest) value
observed in the data [the value of the case at the 0th
percentile]
– That is, it is the “distance” or “interval” between the values of the
two most extreme cases,
– e.g., range of test scores
TABLE 1 – PERCENT OF POPULATION AGED 65 OR HIGHER
IN THE 50 STATES
(UNIVARIATE DATA)

Alabama 12.4 Montana 12.5


Alaska 3.6 Nebraska 13.8
Arizona 12.7 Nevada 10.6
Arkansas 14.6 New Hampshire 11.5
California 10.6 New Jersey 13.0
Colorado 9.2 New Mexico 10.0
Connecticut 13.4 New York 13.0
Delaware 11.6 North Carolina 11.8
Florida 17.8 North Dakota 13.3
Georgia 10.0 Ohio 12.5
Hawaii 10.1 Oklahoma 12.8
Idaho 11.5 Oregon 13.7
Illinois 12.1 Pennsylvania 14.8
Indiana 12.1 Rhode Island 14.7
Iowa 14.8 South Carolina 10.7
Kansas 13.6 South Dakota 14.0
Kentucky 12.3 Tennessee 12.4
Louisiana 10.8 Texas 9.7
Maine 13.4 Utah 8.2
Maryland 10.7 Vermont 11.9
Massachusetts 13.7 Virginia 10.6
Michigan 11.5 Washington 11.8
Minnesota 12.6 West Virginia 13.9
Mississippi 12.1 Wisconsin 13.2
Missouri 13.8 Wyoming 8.9
Range in a Histogram
Problems with the [Total] Range

• The problem with the [total] range as a measure of


dispersion is that it depends on the values of just two
cases, which by definition have (possibly extraordinarily)
atypical values.
– In particular, the range makes no distinction between a polarized
distribution in which almost all observed values are close to
either the minimum or maximum values and a distribution in
which almost all observed values are bunched together but there
are a few extreme outliers.
• Recall Ideological Dispersion bar graphs =>
– Also the range is undefined for theoretical distributions that are
“open-ended,” like the normal distribution (that we will take up in
the next topic) or the upper end of an income distribution type of
curve (as in previous slides).
Two Ideological Distributions with
the Same Range
The Interdecile Range
• Therefore other variants of the range measure that do
not reach entirely out to the extremes of the frequency
distribution are often used instead of the total range.

• The interdecile range is the value of the case that stands


at the 90th percentile of the distribution minus the value
of the case that stands at the 10th percentile.
– That is, it is the “distance” or “interval” between the
values of these two rather less extreme cases.
The Interquartile Range

• The interquartile range is the value of the case that


stands at the 75th percentile of the distribution minus the
value of the case that stands at the 25th percentile.
– The first quartile is the median observed value among
all cases that lie below the overall median and the
third quartile is the median observed value among all
cases that lie above the overall median.
– In these terms, the interquartile range is third quartile
minus the first quartile.
The Standard Margin of Error Is a Range Measure
• Suppose the Gallup Poll takes a random sample of n respondents
and reports that the President's current approval rating is 62% and
that this sample statistic has a margin of error of ±3%. Here is what
this means: if (hypothetically) Gallup were to take a great many
random samples of the same size n from the same population (e.g.,
the American VAP on a given day), the different samples would give
different statistics (approval ratings), but 95% of these samples
would give approval ratings within 3 percentage points of the true
population parameter.

• Thus, if our data is the list of sample statistics produced


by the (hypothetical) “great many” random samples, the
margin of error specifies the range between the value of
the sample statistic that stands at the 97.5th percentile
minus the sample statistic that stands at the 2.5th
percentile (so that 95% of the sample statistics lie within
this range). Specifically (and letting P be the value of
the population parameter) this “95% range” is
(P + 3%) - (P -3%) = 6%, i.e., twice the margin error.
Deviation Measures of Dispersion
• Deviation measures are based on average deviations
from some average value.
– Since dispersion measures pertain to interval variables, we
can calculate means, and deviation measures are typically
based on the mean deviation from the mean value.
– Thus the (mean and) standard deviation measures are
conceptually connected with the mean as a measure of central
tendency.
• Review: Suppose we have a variable X and a set of
cases numbered 1, 2, . . . , n. Let the observed value of
the variable in each case be designated x1, x2, etc.
Deviation Measures of Dispersion: Example
Deviation Measures of Dispersion (cont.)
• The deviation from the mean for a representative case i is xi - mean
of x.
– If almost all of these deviations are close to zero, dispersion is
small.
– If many of these deviations much different from zero, dispersion
is large.
• This suggests we could construct a measure D of dispersion that
would simply be the average (mean) of all the deviations.

But this does not work because, as we saw earlier, it is a property


of the mean that all deviations from it add up to zero (regardless of
how much dispersion there is).
Deviation Measures of Dispersion: Example
(cont.)
The Mean Deviation
• A practical way around this problem is simply to ignore
the fact that some deviations are negative while others
are positive by averaging the absolute values of the
deviations.
• This measure (called the mean deviation) tells us the
average (mean) amount that the values for all cases
deviate (regardless of whether they are higher or lower)
from the average (mean) value.
• Indeed, the Mean Deviation is an intuitive, understandable,
and perfectly reasonable measure of dispersion,
and it is occasionally used in research.
The Mean Deviation (cont.)
The Variance
• Statisticians dislike this measure because the formula is
mathematically messy by virtue of being “non-algebraic”
(in that it ignores negative signs).
• Therefore statisticians, and most researchers, use
another slightly different deviation measure of dispersion
that is “algebraic.”
– This measure makes use of the fact that the square of any real
(positive or negative) number other than zero is itself always
positive.
• This measure --- the average of the squared deviations
from the mean (as opposed the average of the absolute
deviations) --- is called the variance.
The Variance (cont.)
The Variance (cont.)
• The variance is the average squared deviation from the
mean.
– The total (and average) squared deviation from the mean
value of X is smaller than the total (and average) squared deviation
from any other value of X.
• The variance is the usual measure of dispersion in
statistical theory, but it has a drawback when researchers
want to describe the dispersion in data in a practical way.
– Whatever units the original data (and its average values and its
mean dispersion) are expressed in, the variance is expressed in
the square of those units, which may not make much (or any)
intuitive or practical sense.
– This can be remedied by finding the (positive) square root of the
variance (which takes us back to the original units).
• The square root of the variance is called the standard
deviation.
The Standard Deviation
The Standard Deviation (cont.)

• In order to interpret a standard deviation, or to make a


plausible estimate of the SD of some data, it is useful to
think of the mean deviation because
– it is easier to estimate (or guess) the magnitude of the MD than
the SD; and
– the standard deviation has approximately the same numerical
magnitude as the mean deviation, though it is almost always
somewhat larger.
• The SD is never less than the MD;
• the SD is equal to the mean deviation if the data is distributed in a
maximally “polarized” fashion;
• Otherwise the SD is somewhat larger than the MD — typically about
20-50% larger.
Standard Deviation Worksheet
1. Set up a worksheet like the one shown in the previous slides.
2. In the first column, list the values of the variable X for each of the n cases.
[This is the raw data.]
3. Find the mean value of the variable in the data, by adding up the values in
each case and dividing by the number of cases.
4. In the second column, subtract the mean from each value to get, for each
case, the deviation from the mean. Some deviations are positive, others
negative, and (apart from rounding error) they must add up to zero; add
them up as an arithmetic check.
5. In the third column, square each deviation from the mean, i.e., multiply the
deviation by itself. Since the product of two negative numbers is positive,
every squared deviation is non-negative, i.e., either positive or zero (in the event
a case has a value that coincides with the mean value).
6. Add up the squared deviations over all cases.
7. Divide the sum of the squared deviations by the number of cases; this gives
the average squared deviation from the mean, commonly called the
variance.
8. The standard deviation is the (positive) square root of the variance. (The
square root of x is that number which when multiplied by itself gives x.)
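The worksheet translates directly into a short Python sketch (illustrative data; note the n divisor, as in this handout):

  from math import sqrt

  x = [10, 12, 14, 16, 18]   # hypothetical raw data (step 2)

  n = len(x)
  mean = sum(x) / n                            # step 3
  deviations = [xi - mean for xi in x]         # step 4: these sum to zero
  squared = [d**2 for d in deviations]         # step 5
  variance = sum(squared) / n                  # step 7: here 8.0
  sd = sqrt(variance)                          # step 8: here about 2.83
  print(mean, sum(deviations), variance, sd)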
The Mean, Deviations, Variance, and SD

• What is the effect of adding a constant amount to (or


subtracting from) each observed value?
• What is the effect of multiplying each observed value (or
dividing it by) a constant amount?
Adding (subtracting) the same amount to (from) every
observed value changes the mean by the same amount
but does not change the dispersion (for either range or
deviation measures)
Multiplying (or dividing) every observed value by the same
factor changes the mean and the SD [or MD] by that same
factor and changes the variance by that factor squared.
Sample Estimates of Population
Dispersion
• Random sample statistics that are percentages or averages
provide unbiased estimates of the corresponding population
parameters.

• However, sample statistics that are dispersion measures


provide estimates of population dispersion that are biased
(at least slightly) downward.
– This is most obvious in the case of the range; it should
be evident that a sample range is almost always smaller
than, and can never be larger than, the corresponding
population range.
Sample Estimates of Population Dispersion (cont.)
• The sample standard deviation (or variance) is also biased
downward, but only slightly if the sample is at all large.
– While the SD of a particular sample can be larger than the population
SD, sample SDs are on average slightly smaller than the corresponding
population SDs.
• The sample SD can be adjusted to provide an unbiased estimate of
the population SD
– This simple adjustment consists of dividing the sum of the squared
deviations by n - 1, rather than by n.
– Clearly this adjustment makes no practical difference unless the sample
is quite small.
• Notice that if you apply the SD [or MD or any Range] formula in the
event that you have just a single observation in your sample, sample
dispersion = 0 regardless of what the observed value is.
– More intuitively, you can get no sense of how much dispersion there is
in a population with respect to some variable until you observe at least
two cases and can see how “far apart” they are.
• This is why you will often see the formula for the variance and SD
with an n - 1 divisor (and scientific calculators often build in this
formula).
– However, for POLI 300 problem sets and tests, you should use the
formula given in the previous section of this handout.
Dispersion in Ratio Variables
• Given a ratio variable (e.g. income), the interesting
“dispersion question” may pertain not to the interval
between two observed values or between an observed
value and the mean value but to the ratio between the
two values.
– For example, fifty years ago, the income of the household at the
25th percentile was about $5,000 and the income of the
household at the 75th percentile was about $10,000, while today
the figures are about $40,000 and $80,000 respectively.
• While the interval between the two income levels (the interquartile
range) has increased from $5,000 to $40,000, the ratio between the
two income levels has remained a constant 2 to 1.
• Other examples pertain to income:
– One household “poverty level” is defined as half of median
household income.
– Households with more than twice the median income are
sometimes characterized as “well off.”
– The average compensation of CEOs today is about 250 times
that of the average worker, whereas 50 years ago it was only about
40 times that of the average worker.
Dispersion in Ratio Variables (cont.)

• The degree of dispersion in ratio variables can naturally
be referred to as the degree of inequality.
– For example the two sets of income levels ($5K vs.
$10K and $40K vs. $80K) at the 25th and 75th
percentiles respectively seem to be “equally unequal”
because they are in the same ratio.

• Thus the SD does not work well as a measure of


inequality (of income, etc.), because it takes no account
of the ratio property of [ratio] variables.
The Coefficient of Variation
• One ratio measure of dispersion/inequality is called the
coefficient of variation, which is simply the standard
deviation divided by the mean.
– It answers the question: how big is the SD of the distribution
relative to the mean of the distribution?

• Recall PS#6, Question #7, comparing the distributions of
height and weight among American adults.
– We naturally want to say that in some sense American
adults exhibit more dispersion in weight than height.
– But if by dispersion we mean [any kind of] range, mean
deviation, or variance/SD, the claim is strictly meaningless
because the two variables are measured in different units
(pounds, kilograms, etc. vs. inches, feet, centimeters, etc.), so
the numerical comparison is not valid.
Coefficient of Variation (cont.)
Summary statistics for WEIGHT and HEIGHT (both ratio variables) of American adults in
different units:
Weight Height
Mean 160 pounds 66 inches
72.6 kilograms 5.5 feet
.08 tons 168 centimeters

SD 30 pounds 4 inches
13.6 kilograms .33 feet
.015 tons 10.2 centimeters

Which variable [WEIGHT or HEIGHT] has greater dispersion? [No meaningful answer can be given.]
Which variable has greater dispersion relative to its average, i.e., a greater Coefficient of Variation (SD relative to mean)?

Weight: 30/160 = 13.6/72.6 = .015/.08 = .18
Height: 4/66 = .33/5.5 = 10.2/168 = .06

Note that the Coefficient of Variation is a pure number, not expressed in any units, and is
the same whatever units the variable is measured in.
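The computation is a one-line ratio; a Python sketch with the figures from the table above:

  weight_cv = 30 / 160   # SD/mean for weight, about .18 in any unit system
  height_cv = 4 / 66     # SD/mean for height, about .06 in any unit system
  print(weight_cv, height_cv)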
Coefficient of Variation

• The old and new SDs are the same.


• The old Coefficient of Variation was
SD/Mean = 2/14 = 1/7 = 0.143
• while the new Coefficient of variation is
SD/Mean = 2/4 = 0.5
Coefficient of Variation (cont)

• The old and new SDs are the same.


• The old Coefficient of Variation was
– SD/mean = 2/14 = 1/7 = 0.143
• The new Coefficient of Variation is
– SD/mean = 2/114 = 0.0175
Coefficient of Variation (cont)

• The new SD is 10 times the old SD.


• But the old and new Coefficients of Variation are the
same:
SD/mean = 2/14 = 20/140 = 1/7 = 0.143
Introduction to Hypothesis Testing

The One-Sample z Test

• Conditions of Applicability:
– One group of subjects
– Comparing to population with known mean and variance.

• Note: this is not a common situation in Psychology!



Example: Finish times for the 2005 Toronto Marathon
(Oct 16, 2005)

• Suppose your population of interest is women who ran the marathon (slightly artificial).
• You hypothesize that women in their early twenties (20-24) are
faster than the average woman who ran the marathon.
• Here the ‘treatment’ is ‘youth’.



Null Hypothesis Testing

• Largely due to English mathematician Sir R.A. Fisher (1890-1962)


• ‘Proof by contradiction’
• Suppose the null hypothesis is true
– In our example, the null hypothesis is that the finishing times for young
women are drawn from the same distribution as for the rest of the
female contestants.
– Knowing the mean and standard deviation of the population, we can
compute the sampling distribution of the mean for a sample of size n.
This is the null hypothesis distribution.
– The mean time for our sample of young women should be plausible
under this sampling distribution.
– If it is not plausible, it suggests that the null hypothesis is false.
– This lends credence to our alternate hypothesis (that young women are
faster).



How do we judge the plausibility of the null hypothesis?

• The sample mean should be plausible under the sampling distribution of the mean.
[Figure: sampling distribution p(X̄) centered at μ with standard error σX̄; sample means near μ are highly plausible, farther out fairly plausible, and far in the tail implausible]


Plausibility of the null hypothesis

• The plausibility of the null hypothesis is judged by computing the probability p of observing a sample mean that is at least as deviant from the population mean as the value we have observed.
[Figure: sampling distribution p(X̄) with the tail area p beyond the observed X̄ shaded]
Plausibility of the null hypothesis

• This computation is simplified by converting to z-scores:

z = (X̄ − μ) / σX̄

• Under the assumption of normality, we can determine this probability from a standard normal table.
[Figure: standard normal density p(z) with the tail area p beyond the observed z shaded]
Results for 2005 Toronto Marathon

n = 420
μ = 4hr 16min = 256 min
σ = 33min



Results for Random Sample
of Women Under 25

n = 38
X = 4hr 9min = 249 min



Statistical Decisions

• We now know the probability that an observation like ours could


have been drawn from the general female contestant population, i.e.
that our ‘treatment of youth’ had no effect.
• This probability is pretty small. Should we reject the null
hypothesis? This is the process of turning a continuous probability
(a real number) into a binary decision (yes or no).
• If we reject the null hypothesis, there is a chance we will be wrong.
We have to decide what chance we are willing to take, i.e. the
maximum p-value we will accept as grounds for rejecting the null
hypothesis.
• We call this probability threshold the alpha (α) level. A typical value
is .05.
• The α−level must be decided prior to the experiment.



Type I and Type II Errors

• Type I Error: the null hypothesis is true and we reject it.


• Type II Error: the null hypothesis is false and we fail to
reject it.

Actual Situation

Researcher’s Decision         Null Hypothesis is True          Null Hypothesis is False
Accept the Null Hypothesis    p(accept H0 | H0 true) = 1 − α   p(accept H0 | H0 false) = β
Reject the Null Hypothesis    p(reject H0 | H0 true) = α       p(reject H0 | H0 false) = 1 − β (power)


Type I and Type II Errors

• Which is more serious?


– Type I can be bad, as rejecting the null hypothesis (e.g., ‘This
stuff really works’), may cause actions to be taken that have no
value.
– Type II may not be so bad, if it is understood that the treatment
may still have an effect (we fail to reject the null hypothesis, but
we do not reject the alternate hypothesis).
– But Type II may be bad if it leads to inaction when action would
have produced good results (e.g., a cure for cancer).



One-Tailed vs Two-Tailed Tests

• Our marathon hypothesis was one-tailed, because we made a specific prediction about the direction of the effect (young women are faster).
• Suppose we had simply hypothesized that young women are different.


Two-Tailed Test

z = (X̄ − μ) / σX̄

[Figure: standard normal density p(z) with both tails beyond ±z shaded; the p-value is the total area in the two tails.]


One-Tailed vs Two-Tailed Tests

• Use a one-tailed test when you have a specific reason to believe the effect will be in a particular direction, and you do not care if the effect is in the opposite direction.
• When the effect is in the predicted direction, a one-tailed test yields a smaller p value (half the two-tailed p for a z test), and hence a greater chance of reaching significance for your directional hypothesis.
• Otherwise, use a two-tailed test.
• The decision of whether to perform one-tailed or two-tailed tests must be made prior to data collection.


Basic Procedure for Statistical Inference

1. State the hypothesis


2. Select the statistical test and significance level
3. Select the sample and collect the data
4. Find the region of rejection
5. Calculate the test statistic
6. Make the statistical decision



Step 1. State the Hypothesis

Null hypothesis: marathon times for young women are the same as for the general female contestant population.
Alternate hypothesis: young women are faster.

H0: μ = μ0
HA: μ < μ0


Step 2. Select the Statistical Test and the Significance Level

• We are comparing a sample mean to a population with known mean and standard deviation → z-test
• α = .05 is probably appropriate.


Step 3. Select the Sample and Collect the
Data
• Ideally, we would randomly assign the treatment to a
random sample of the population (Toronto Marathon
women). Is this possible?
• Instead, we randomly sample female contestants under
25.



Step 4. Find the Region of Rejection

• The z value defining the rejection region is called the critical value for your test, and is a function of the selected α-level. For this reason, we often denote the critical value as zα.

[Figure: standard normal density p(z) with the lower tail shaded. For α = .05, one-tailed (lower tail), zα = −1.645.]
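As a check, the critical value can be computed with SAS’s probit() function (the standard normal quantile); this sketch is not part of the original slides:

data crit;
   z_alpha = probit(0.05);   /* -1.645: lower-tail critical value for alpha = .05 */
run;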
Step 5. Calculate the Test Statistic

z = (X̄ − μ) / σX̄ = (249 − 256) / (33 / √38) ≈ −1.31


Step 6. Make the Statistical Decision

• p ≤ α: Reject the null hypothesis.
• p > α: Fail to reject the null hypothesis.


Example: Height of Female Psychology Graduate Students

Canadian adult female population: μ = 162.10 cm, σ = 6.55 cm
Sample: female students enrolled in PSYC 6130C, 2008-09


Assumptions Underlying the One-Sample z Test

• Random sampling
• Variable is normal
  – CLT: deviations from normality are OK as long as the sample is large.
• Dispersion of the sampled population is the same as for the comparison population
  – e.g., suppose the means are the same, but the dispersion of the sampled population is greater than the dispersion of the comparison population.


Limitations of the One-Sample Test

• Strongly depends on random sampling.
• Better to have two groups of subjects: a test (treatment) group and a control group.
• The problem of random sampling then reduces to the problem of random assignment to two groups: much easier!


Reporting your results

• Express your result in evocative English, then include the required numbers.
• Follow APA style.
• Example:
  – Young female runners were not found to be significantly faster than the general female contestant population, z = −1.31, p = .095, one-tailed.


More on Type I and Type II Errors

[Diagram: of all experiments that yield significant results, those where H0 is true contribute a proportion α and those where H0 is false contribute a proportion 1 − β.]

• Consistent use of a fixed alpha-level determines the proportion of null experiments that generate significant results.
• We don’t have enough information to know how many reported results are errors, because:
  – We don’t know the relative proportion of cases where H0 is true and H0 is false.
  – We don’t know the power of effective experiments.
  – Typically only significant results are reported (publication bias).


Chapter 9

Nonparametric Statistics



Learning Objectives

1. Distinguish Parametric & Nonparametric Test Procedures
2. Explain Commonly Used Nonparametric Test Procedures
3. Perform Hypothesis Tests Using Nonparametric Procedures


Hypothesis Testing Procedures

[Figure: taxonomy of hypothesis-testing procedures, parametric and nonparametric. Many more tests exist!]
Parametric Test Procedures

1. Involve Population Parameters (e.g., the Mean)
2. Have Stringent Assumptions (e.g., Normality)
3. Examples: Z Test, t Test, χ2 Test, F Test
Nonparametric Test Procedures

1. Do Not Involve Population Parameters (Examples: Probability Distributions, Independence)
2. Data Measured on Any Scale (Ratio or Interval, Ordinal or Nominal)
3. Example: Wilcoxon Rank Sum Test


Advantages of Nonparametric Tests

1. Used With All Scales
2. Easier to Compute
3. Make Fewer Assumptions
4. Need Not Involve Population Parameters
5. Results May Be as Exact as Parametric Procedures


Disadvantages of Nonparametric Tests

1. May Waste Information (a parametric model is more efficient if the data permit)
2. Difficult to Compute by Hand for Large Samples
3. Tables Not Widely Available


Popular Nonparametric Tests
1. Sign Test

2. Wilcoxon Rank Sum Test

3. Wilcoxon Signed Rank Test



Sign Test



Sign Test

1. Tests One Population Median, η
2. Corresponds to t-Test for 1 Mean
3. Assumes Population Is Continuous
4. Small-Sample Test Statistic: # Sample Values Above (or Below) the Median
5. Can Use Normal Approximation If n ≥ 10


Sign Test Concepts

• Make a null hypothesis about the true median
• Let S = number of values greater than the median
• Each sampled item is independent
• If the null hypothesis is true, S should have a binomial distribution with success probability .5


Sign Test Example

• You’re an analyst for Chef-Boy-R-Dee. You’ve asked 7 people to rate a new ravioli on a 5-point scale (1 = terrible, …, 5 = excellent). The ratings are: 2 5 3 4 1 4 5.
• At the .05 level, is there evidence that the median rating is less than 3?


Sign Test Solution

• H0: η = 3
• Ha: η < 3
• α = .05
• Test Statistic: S = 2 (ratings 1 and 2 are less than η = 3: 2, 5, 3, 4, 1, 4, 5)
• P-Value: P(S ≥ 2) = 1 − P(S ≤ 1) = 1 − .0625 = .9375 (Binomial Table, n = 7, p = 0.50)
  – Is observing 2 or more a small-probability event? No.
• Decision: Do Not Reject at α = .05
• Conclusion: There is no evidence that the median rating is less than 3
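
The binomial p-value above can be reproduced in SAS; this sketch is not from the original slides. probbnml(p, n, m) returns P(X ≤ m) for a Binomial(n, p) variable:

data signtest;
   n = 7;
   s = 2;                                  /* values below the hypothesized median */
   p_value = 1 - probbnml(0.5, n, s - 1);  /* P(S >= 2) = 1 - P(S <= 1) = .9375 */
run;
proc print data=signtest; run;

PROC UNIVARIATE with the mu0=3 option also reports a sign test for the median among its tests for location (two-sided in its standard output).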
Wilcoxon Rank Sum Test


Wilcoxon Rank Sum Test

1. Tests Two Independent Population Probability Distributions
2. Corresponds to t-Test for 2 Independent Means
3. Assumptions: Independent, Random Samples; Populations Are Continuous
4. Can Use Normal Approximation If ni ≥ 10
Wilcoxon Rank Sum Test Procedure

1. Assign Ranks, Ri, to the n1 + n2 Sample Observations
   – If Unequal Sample Sizes, Let n1 Refer to the Smaller-Sized Sample
   – Smallest Value = 1
2. Sum the Ranks, Ti, for Each Sample
   – Test Statistic Is TA (Smallest Sample)
   – Null hypothesis: both samples come from the same underlying distribution
   – The distribution of T is not as simple as the binomial, but it can be computed


Wilcoxon Rank Sum Test Example

• You’re a production planner. You want to see if the operating rates for 2 factories are the same. For factory 1, the rates (% of capacity) are 71, 82, 77, 92, 88. For factory 2, the rates are 85, 82, 94 & 97. Do the factory rates have the same probability distributions at the .10 level?


Wilcoxon Rank Sum Test Solution (setup)

• H0: Identical Distributions
• Ha: Shifted Left or Right
• α = .10
• n1 = 4, n2 = 5
• Critical Values: 12 and 28 (reject if the rank sum, Σ ranks, for the smaller sample is ≤ 12 or ≥ 28)
• The test statistic comes from the computation table below.
Wilcoxon Rank Sum Table: Table 12 (Rosner) (Portion), α = .05 two-tailed

[Table not reproduced; for n1 = 4 and n2 = 5 it gives the lower and upper critical values 12 and 28 used below.]


Wilcoxon Rank Sum Test Computation Table

Factory 1            Factory 2
Rate   Rank          Rate   Rank
71     1             85     5
82     3.5           82     3.5
77     2             94     8
92     7             97     9
88     6
Rank Sum: 19.5       Rank Sum: 25.5

(The tied rates of 82 would have received ranks 3 and 4; each is assigned the average rank, 3.5.)


Wilcoxon Rank Sum Test Solution

• H0: Identical Distributions
• Ha: Shifted Left or Right
• α = .10
• n1 = 4, n2 = 5
• Test Statistic: T2 = 5 + 3.5 + 8 + 9 = 25.5 (smallest sample)
• Critical Values: 12 and 28
• Decision: 12 < 25.5 < 28, so Do Not Reject at α = .10
• Conclusion: There is no evidence of unequal distributions
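
The same analysis can be run in SAS with PROC NPAR1WAY (a sketch, not part of the original slides). The WILCOXON option reports the rank sums (19.5 and 25.5) and a normal-approximation p-value; the EXACT statement requests the exact test, which is more appropriate for samples this small:

data factories;
   input factory rate @@;
   datalines;
1 71 1 82 1 77 1 92 1 88
2 85 2 82 2 94 2 97
;
run;

proc npar1way data=factories wilcoxon;
   class factory;
   var rate;
   exact wilcoxon;
run;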
Regression Analysis:

A statistical procedure used to find relationships among a set of variables.

In regression analysis, there is a dependent variable, which is the one you are trying to explain, and one or more independent variables that are related to it.

You can express the relationship as a linear equation, such as:

y = a + bx

• y is the dependent variable
• x is the independent variable
• a is a constant
• b is the slope of the line
• For every increase of 1 in x, y changes by an amount equal to b
• Some relationships are perfectly linear and fit this equation exactly. Your cell phone bill, for instance, may be:

Total Charges = Base Fee + 30¢ × (overage minutes)

If you know the base fee and the number of overage minutes, you can predict the total charges exactly.
Other relationships may not be so exact. Weight, for instance, is to some degree a function of height, but there are variations that height does not explain. On average, you might have an equation like:

Weight = −222 + 5.7 × Height

If you take a sample of actual heights and weights, you might see something like the graph below.

[Figure: scatter plot of Weight (roughly 100 to 220) against Height (roughly 60 to 75) with the fitted regression line.]
The line in the graph shows the average relationship described by the equation. Often, none of the actual observations lie on the line. The difference between the line and any individual observation is the error. The new equation is:

Weight = −222 + 5.7 × Height + e

This equation does not mean that people who are short enough will have a negative weight. The observations that contributed to this analysis were all for heights between 5’ and 6’4”. The model will likely provide a reasonable estimate for anyone in this height range. You cannot, however, extrapolate the results to heights outside of those observed. The regression results are only valid for the range of actual observations.
Regression finds the line that best fits the observations. It
does this by finding the line that results in the lowest sum of
squared errors. Since the line describes the mean of the
effects of the independent variables, by definition, the sum
of the actual errors will be zero. If you add up all of the
values of the dependent variable and you add up all the
values predicted by the model, the sum is the same. That
is, the sum of the negative errors (for points below the line)
will exactly offset the sum of the positive errors (for points
above the line). Summing just the errors wouldn’t be useful
because the sum is always zero. So, instead, regression
uses the sum of the squares of the errors. An Ordinary
Least Squares (OLS) regression finds the line that results
in the lowest sum of squared errors.
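
As a concrete illustration, an OLS fit such as the height-weight line above takes only a few lines in SAS. This is a sketch: the data set name htwt and its variables height and weight are assumptions, since the data are not listed here.

proc reg data=htwt;
   model weight = height;   /* least-squares fit of weight = a + b*height */
run;

The parameter estimates in the output correspond to the constant a and the slope b in the equation above.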
Multiple Regression

What if there are several factors affecting the dependent variable?

As an example, think of the price of a home as a dependent variable. Several factors contribute to the price of a home… among them are square footage, the number of bedrooms, the number of bathrooms, the age of the home, whether or not it has a garage or a swimming pool, if it has both central heat and air conditioning, how many fireplaces it has, and, of course, location.
The Multiple Regression Equation

Each of these factors has a separate relationship with the price of a home. The equation that describes a multiple regression relationship is:

y = a + b1x1 + b2x2 + b3x3 + … + bnxn + e

This equation separates each individual independent variable from the rest, allowing each to have its own coefficient describing its relationship to the dependent variable. If square footage is one of the independent variables, and it has a coefficient of $50, then every additional square foot of space adds $50, on average, to the price of the home.
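
In SAS this is again PROC REG, now with several independent variables on the model statement. A sketch, assuming a data set homes with hypothetical variable names:

proc reg data=homes;
   model price = sqft bedrooms bathrooms age garage pool;   /* one coefficient per factor */
run;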
How Do You Run a
Regression?
In a Multiple Regression Analysis of home prices,
you take data from actual homes that have sold
recently. You include the selling price, as well as
the values for the independent variables (square
footage, number of bedrooms, etc.). The multiple
regression analysis finds the coefficients for each
independent variable so that they make the line
that has the lowest sum of squared errors.
How Good is the Model?
One of the measures of how well the model
explains the data is the R2 value. Differences
between observations that are not explained by
the model remain in the error term. The R2 value
tells you what percent of those differences is
explained by the model. An R2 of .68 means that
68% of the variance in the observed values of the
dependent variable is explained by the model, and
32% of those differences remains unexplained in
the error term.
Sometimes There’s No
Accounting for Taste
Some of the error is random, and no model
will explain it. A prospective homebuyer
might value a basement playroom more than
other people because it reminds her of her
grandmother’s house where she played as a
child. This can’t be observed or measured,
and these types of effects will vary randomly
and unpredictably. Some variance will
always remain in the error term. As long as
it is random, it is of no concern.
“p-values” and Significance Levels

Each independent variable has another number attached to it in the regression results… its “p-value” or significance level.

The p-value is a probability, expressed as a percentage. It tells you how likely it is that a coefficient as large as the one estimated would have emerged by chance when there is no real relationship.

A p-value of .05 means that, if there were no real relationship, there would be only a 5% chance of obtaining an estimate like this one; the smaller the p-value, the stronger the evidence that the relationship is real.

It is generally accepted practice to consider variables with a p-value of less than .1 as significant, though the only basis for this cutoff is convention.
Significance Levels of “F”

There is also a significance level for the model as a whole. This is the “Significance F” value in Excel; some other statistical programs call it by other names. It measures the likelihood that a model fitting this well would have emerged at random, rather than from a real relationship. As with the p-value, the lower the Significance F value, the stronger the evidence that the relationships in the model are real.
Some Things to Watch Out For

• Multicollinearity

• Omitted Variables

• Endogeneity

• Other
Multicollinearity
Multicollinearity occurs when one or more of your independent variables
are related to one another. The coefficient for each independent variable
shows how much an increase of one in its value will change the dependent
variable, holding all other independent variables constant. But what if you
cannot hold them constant? If you have two houses that are exactly the
same, and you add a bedroom to one of them, the value of the house may
go up by, say, $10,000. But you have also added to its square footage.
How much of that $10,000 is a result of the extra bedroom and how much
is a result of the extra square footage? If the variables are very closely
related, and/or if you have only a small number of observations, it can be
difficult to separate these effects. Your regression gives you the
coefficients that best describe your set of data, but the independent
variables may not have a good p-value if multicollinearity is present.
Sometimes it may be appropriate to remove a variable that is related to
others, but it may not always be appropriate. In the home value example,
both the number of bedrooms and the square footage are important on
their own, in addition to whatever combined effects they may have.
Removing them may be worse than leaving them in. This does not
necessarily mean that the model as a whole is hurt, but it may mean that
the model should not be used to draw conclusions about the relationship of
individual independent variables with the dependent variable.
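
One common diagnostic for multicollinearity (a sketch, not mentioned in the original text) is the variance inflation factor, which PROC REG prints when the VIF option is added to the model statement; a rule of thumb treats values above about 10 as a warning sign. The data set and variable names below are the same hypothetical ones used earlier:

proc reg data=homes;
   model price = sqft bedrooms bathrooms age / vif;   /* prints a VIF for each predictor */
run;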
Omitted Variables
If independent variables that have significant relationships with the
dependent variable are left out of the model, the results will not be as
good as if they are included. In the home value example, any real
estate agent will tell you that location is the most important variable of
all. But location is hard to measure. Locations are more or less
desirable based on a number of factors. Some of them, like population
density or crime rate, may be measurable factors that can be included.
Others, like perceived quality of the local schools, may be more difficult.
You must also decide what level of specificity to use. Do you use the
crime rate for the whole city, a quadrant of the city, the zip code, the
street? Is the data even available at the level of specificity you want to
use? These factors can lead to omitted variable bias… variance in the
error term that is not random and that could be explained by an
independent variable that is not in the model. Such bias can distort the
coefficients on the other independent variables, as well as decreasing
the R2 and increasing the Significance F. Sometimes data just isn’t
available, and some variables aren’t measurable. There are methods
for reducing the bias from omitted variables, but it can’t always be
completely corrected.
Endogeneity
Regression measures the effect of changes in the
independent variable on the dependent variable.
Endogeneity occurs when that relationship is either
backwards or circular, meaning that changes in the
dependent variable cause changes in the independent
variable. In the home value example, we had discussed
earlier that the perceived quality of the local schools might
affect home values. But the perceived quality is likely also
related to the actual quality, and the actual quality is at
least partially a result of funding levels. Funding levels are
often related to the property tax base, or the value of local
homes. So… good schools increase home values, but high
home values also improve schools. This circular
relationship, if it is strong, can bias the results of the
regression. There are strategies for reducing the bias if
removing the endogenous variable is not an option.
Others
There are several other types of biases that can
exist in a model for a variety of reasons. As with
the types already described, there are tests to
measure the levels of bias, and there are
strategies that can be used to reduce it.
Eventually, though, one may have to accept a
certain amount of bias in the final model,
especially when there are data limitations. In that
case, the best that can be done is to describe the
problem and the effects it might have when
presenting the model.
The 136 System Model Regression Equation

Local Revenue per Pupil =
  −236 (y-intercept)
  + .0041 × County-area Property per Pupil
  + .0032 × System Unshared Property per Pupil
  + .0202 × County-area Sales per Pupil
  + .0022 × System Unshared Sales per Pupil
  + .0471 × System State-shared Taxes per Pupil
  + 296 × [County-area Commercial, Industrial, Utility and Business Personal Property Assessment ÷ Total Assessment]
  + 327 × [System Commercial, Industrial, Utility and Business Personal Property Assessment ÷ Total Assessment]
  + .0209 × County-area Median Household Income
  − 795 × System Child Poverty Rate
A Step by Step Guide to
Learning SAS

Objective

• Familiarize yourselves with the SAS programming environment and language.
• Learn how to create and manipulate data sets in SAS and how to use existing data sets outside of SAS.
• Learn how to conduct a regression analysis.
• Learn how to create simple plots to illustrate relationships.
LECTURE OUTLINE
• Getting Started with SAS
• Elements of the SAS program
• Basics of SAS programming
• Data Step
• Proc Reg and Proc Plot
• Example
• Tidbits
• Questions/Comments
Getting Started with SAS
1.1 Windows or Batch Mode?
1.1.1 Pros and Cons
1.1.2 Windows
1.1.3 Batch Mode

Reference:

www.cquest.utoronto.ca/stats/sta332s/sas.html

1.1.1 Pros and Cons
Windows:
Pros:
• SAS online help available.
• You can avoid learning any Unix commands.
• Many people like to point and click.
Cons:
• SAS online help is incredibly annoying.
• Possibly very difficult to use outside CQUEST
lab.
• Number of windows can be hard to manage.

1.1.1 cont’d…

Batch Mode:
Pros:
• Easily usable outside CQUEST labs.
• Simpler to use if you are already familiar with Unix.
• Established Unix programs perform most tasks better than SAS’s built-in utilities.
Cons:
• Can’t access SAS’s online help.
• Requires some basic knowledge of Unix.
1.1.2 Windows
• You can get started using either of these
two ways:
1. Click on Programs at the top left of the
screen and select
CQUEST_APPLICATIONS and then sas.
2. In a terminal window type: sas

A bunch of windows will appear –


don’t get scared!
1.1.3 Batch Mode
• First, make sure you have set up your account
so you can use batch mode.
• Second, you need to create a SAS program.
• Then ask SAS to run your program (foo) using
the command:
sas foo or sas foo.sas
Either way, SAS will create files with the same
name as your program with respective
extensions for a log and output file (if there were
no fatal errors).

1.2 SAS Help

• If you are running SAS in a windowed environment then there is online SAS help available.
• How is it helpful?
  You may want more information about a command or some other aspect of SAS than what you remember from today or than what is in this guide.
• How do you access SAS Help?
  1. Click on the Help button in the task bar.
  2. Use the menu command – Online documentation
• There are three tabs: Contents, Index and Find
1.3 SAS Run

• If you are running SAS in a windowed environment then simply click on the Run icon. It’s the icon with a picture of a person running!
• For batch mode, simply type the command: sas filename.sas
Elements of the SAS Software
2.1 SAS Program Editor: Enhanced Editor
2.2 Important SAS Windows: Log and
Output Windows
2.3 Other SAS Windows: Explorer and
Results Windows

2.1 SAS Program Editor
• What is the Enhanced Editor Window?
This is where you write your SAS programs. It will contain
all the commands to run your program correctly.
• What should be in it?
All the essentials to SAS programming such as the
information on your data and the required steps to
conduct your analysis as well as any comments or titles
should be written in this window (for a single problem).
See Section 3-6.
• Where should I store the files?
In your home directory. SAS will read and save files
directly from there.
2.2 Log and Output Windows
• How do you know whether your program is
syntactically correct?
Check the Log window every time you run a
program to check that your program ran
correctly – at least syntactically. It will indicate
errors and also provide you with the run time.
• You ran your program but where’s your output?
There is an output window which uses the
extension .lst to save the file.
If something went seriously wrong – evidence will
appear in either or both of these windows.

2.3 Other SAS Windows
• There are two other windows that SAS executes
when you start it up: Results and Explorer
Windows
• Both of these can be used as data/file
management tools.
• The Results Window helps to manage the
contents of the output window.
• The SAS Explorer is a kind of directory
navigation tool. (Useful for heavy SAS users).

Basics of SAS Programming
3.1 Essentials
3.1.1 A program!
3.1.2 End of a command line/statement
3.1.3 Run Statement
3.2 Extra Essentials
3.2.1 Comments
3.2.2 Title
3.2.3 Options
3.2.4 Case (in)sensitivity

3.1 Essentials
of SAS Programming
3.1.1 Program
• You need a program containing some
SAS statements.
• It should contain one or more of the
following:
1) data step: consists of statements that
create a data set
2) proc step: used to analyze the data

3.1 cont’d…

3.1.2 End of a command line or statement
• Every statement must end with a semi-colon (;), and each statement should start on a new line.
• This is a very common mistake in SAS programming – so check very carefully to see that you have placed a ; at the end of each statement.

3.1.3 Run command or keyword
• In order to run the SAS program, type the command run; at the end of the last data or proc step.
• You still need to click on the running man in order to process the whole program.
3.2 Extra Essentials
of SAS Programming
3.2.1 Comments
• In order to put comments in your SAS
program (which are words used to explain
what the program is doing but not which
SAS is to execute as commands), use /*
to start a comment and */ to end a
comment. For example,
/* My SAS commands go here. */

3.2 cont’d…

3.2.2 Title
• To create a SAS title in your output, simply type the command:
  Title 'Regression Analysis of Crime Data';
• If you have several lines of titles or titles for different steps in your program, you can number the title command. For example,
  Title1 'This is the first title';
  Title2 'This is the second title';
• You can use either single quotes or double quotes. Do not use contractions such as don’t in a single-quoted title, or the apostrophe will be read as the closing quotation mark.
3.2 cont’d…

3.2.3 Options
• There is a statement which allows you to control the line size and page size. You can also control whether you want the page numbers or date to appear. For example,
  options nodate nonumber ls=78 ps=60;

3.2.4 Case (in)sensitivity
• SAS is not case sensitive. So please don’t use the same name – once with capitals and once without – because SAS reads the word as the same variable name or data set name.
4. Data Step
• 4.1 What is it?
• 4.2 What are the ingredients?
• 4.3 What can you do within it?
• 4.4 Some Basic Examples
• 4.5 What can you do with it?
• 4.6 Some More Examples

4.1 What is a Data Step?

• A data step begins by setting up the data set. It is usually the first big step in a SAS program that tells SAS about the data.
• A data statement names the data set. The name can be anything you like as long as it starts with a letter or underscore and contains only letters, numbers or underscores (at most 8 characters in older releases; 32 from SAS V8 on).
• A data step has countless options and variations. Fortunately, almost all your DATA sets will come prepared, so there will be little or no manipulation required.
4.2 Ingredients of a Data Step

4.2.1 Input statement
• INPUT is the keyword that defines the names of the variables. You can use any name for the variables as long as it is a valid SAS name (see the length rule above).
• Variables can be either numeric or character (also called alphanumeric). SAS will assume that variables are numeric unless specified. To assign a variable name to have a character value, use the dollar sign $.

4.2.2 Datalines statement (internal raw data)
• This statement signals the beginning of the lines of data.
• A ; is placed both at the end of the datalines statement and on the line following the last line of data.
• Spacing in data lines does matter.
4.2 cont’d…

4.2.3 Raw Data Files
• The datalines statement is used when referring to internal raw data.
• The infile statement is used when your data comes from an external file. The keyword is placed directly before the input statement. The path and name are enclosed within single quotes. You will also need a filename statement before the data step.
• Here are some examples of infile statements under 1) Windows and 2) UNIX operating environments:
  1) infile 'c:\MyDir\President.dat';
  2) infile '/home/mydir/president.dat';
4.3 What can you do within it?
• A data step not only allows you to create a data
set, but it also allows you to manipulate the data
set.
• For example, you may wish to add two variables
together to get the cumulative effect or you may
wish to create a variable that is the log of
another variable (Meat example) or you may
simply want a subset of the data. This can be
done very easily within a data step.
• More information on this will be provided in a
supplementary documentation to follow.
4.4.1 Basic Example of a Data Step

options ls=79;

data meat;
input steer time pH;
datalines;
1 1 7.02
2 1 6.93
3 2 6.42
4 2 6.51
5 4 6.07
6 4 5.99
7 6 5.59
8 6 5.80
9 8 5.51
10 8 5.36
;
run;
4.4.2 Manipulating the Existing Data

options ls=79;

data meat;
input steer time pH;
logtime=log(time);
datalines;
1 1 7.02
2 1 6.93
3 2 6.42
4 2 6.51
5 4 6.07
6 4 5.99
7 6 5.59
8 6 5.80
9 8 5.51
10 8 5.36
;
run;
4.4.3 Designating a Character Variable

options ls=79;

/*
Data on Violent and Property Crimes in 23 US Metropolitan Areas
violcrim = number of violent crimes
propcrim = number of property crimes
popn = population in 1000's
*/

data crime;
/* city is a character-valued variable so it is followed by
   a dollar sign in the input statement */
input city $ violcrim propcrim popn;
datalines;
AllentownPA 161.1 3162.5 636.7
BakersfieldCA 776.6 7701.3 403.1
;
run;
4.4.4 Data from an External File

options nodate nonumber ls=79 ps=60;

filename datain 'car.dat';

data cars;
infile datain;   /* data are read from the external file, so no datalines are needed */
input mpg;
run;
4.5 What can you do with it?
4.5.1 View the data set
• Suppose that you have done some
manipulation to the original data set. If
you want to see what has been done, use
a proc print statement to view it.

proc print data=meat;


run;
4.5 cont’d…

4.5.2 Create a new from an old data set
• Suppose you already have a data set and now you want to manipulate it but want to keep the old as is. You can use the set statement to do it.

4.5.3 Merge two data sets together
• Suppose you have created two datasets about the sample (subjects) and now you wish to combine the information. You can use a merge statement. There must be a common variable in both data sets to merge (see the sketch below).
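A minimal sketch of both statements (not from the original guide); the second pair of data sets and their variables are made up for illustration:

data meat2;          /* new data set from the old one, via SET */
   set meat;
   logpH = log(pH);  /* add a variable; the original meat data set is unchanged */
run;

data demo;           /* two small data sets sharing the variable id */
   input id age;
   datalines;
1 30
2 45
;
data scores;
   input id score;
   datalines;
1 85
2 90
;
proc sort data=demo; by id; run;
proc sort data=scores; by id; run;

data combined;       /* MERGE requires both inputs sorted by the BY variable */
   merge demo scores;
   by id;
run;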
4.6 Some Comments
• If you don’t want to view all the variables, you
can use the keyword var to specify which
variables the proc print procedure should
display.
• The command by is very useful in the previous
examples and of the procedures to follow. We
will take a look at its use through some
examples.
• Let’s look at the Meat Example again using SAS
to demonstrate the steps explained in 4.5.

5. Regression Analysis
5.1 What is proc reg?
5.2 What are the important ingredients?
5.3 What does it do?
5.4 What else can you do with it?
5.5 The cigarette example
5.6 The Output – regression analysis

5.1 Proc Reg
• What is a proc procedure?
It is a procedure used to do something to the
data – sort it, analyze it, print it, or plot it.
• What is proc reg?
It is a procedure used to conduct regression
analyses. It uses a model statement to
define the theoretical model for the
relationship between the independent and
dependent variables.
5.2 Ingredients of Proc Reg
5.2.1 General Form
proc reg data=somedata <options>;
by variables;
model dependent=independent
<options>;
plot yvar*xvar <options>;
run;

5.2 cont’d…
5.2.2 What you need and don’t need?
• You need to assign 1) the data to be
analyzed, and 2) the theoretical model to
be fit to the data.
• You don’t need the other statements
shown in 5.2.1 such as the by and plot
keywords nor do you need any of the
possible <options>; however, they can
prove useful, depending on the analysis.
5.2 cont’d… options

• There are more options for each keyword and the proc reg statement itself.
• Besides defining the data set to be used in the proc reg statement, you can also use the option simple to provide descriptive statistics for each variable.
• For the model statement, here are some options (see the sketch below):
  p    prints observed, predicted and residual values
  r    prints everything above plus standard errors of the predicted values and residuals, studentized residuals and Cook’s D statistic
  clm  prints 95% confidence intervals for the mean of each observation
  cli  prints 95% prediction intervals
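For instance, using the meat data set created in Section 4 (a sketch, not from the original guide):

proc reg data=meat;
   model pH = time / p clm cli;   /* predicted values plus confidence and prediction intervals */
run;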
5.2 cont’d… more options

• And yes, there are more options….
• Within proc reg you can also plot!
• The plot statement allows you to create a plot that shows the predicted regression line.
• Use the variables in the model statement and some special variables created by SAS such as p. (predicted), r. (residuals), student. (studentized residuals), L95. and U95. (cli model option limits), and L95M. and U95M. (clm model option limits). Note the (.) at the end of each variable name. A sketch follows.
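A sketch of the plot statement inside proc reg, again using the meat data set:

proc reg data=meat;
   model pH = time;
   plot pH*time;   /* data with the fitted regression line */
   plot r.*p.;     /* residuals against predicted values */
run;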
5.3 What does it do?
• Most simply, it analyzes the theoretical
model proposed.
• However, it (SAS) may have done all the
computational work, but it is up to you to
interpret it.
• Let’s look at an example to illustrate these
various options in SAS.

5.4 What else can you do with it?
• Plot it (of course!) using another procedure.
• There are two procedures that can be used: proc plot
and proc gplot.
• These procedures are very similar (in form) but the latter
allows you to do a lot more.
• Here is the general form:
proc gplot data=somedata;
plot yvar*xvar;
run;
• Again, you need to identify a data set and the plot
statement. The plot keyword works similarly to the way
it works in proc reg.

5.4 cont’d… plot options

• Some plot options:
  yvar*xvar=‘char’   obs. plotted using the character specified
  yvar*(xvar1 xvar2)   two plots appear on separate pages
  yvar*(xvar1 xvar2)=‘char1’   two plots appear on separate pages
  yvar*(xvar1 xvar2)=‘char2’   two plots appear on the same plot, distinguished by the character specification
5.5 An Example

• Let’s take a look at a complete example. Consider the cigarette example.
• Suppose you want to (1) find the estimated regression line, (2) plot the estimated regression line, and (3) generate confidence intervals and prediction intervals. A sketch of the corresponding program follows.
• We’ll look at all the key elements needed to create the SAS program in order to perform the analysis, as well as interpreting the output.
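A sketch of such a program; the data set name cigs and the variables tar and co are assumptions, since the cigarette data are not listed in this guide:

proc reg data=cigs;
   model co = tar / clm cli;   /* (1) estimated line, (3) confidence and prediction intervals */
   plot co*tar;                /* (2) data with the fitted line */
run;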
5.6 Output
• Identify all the different components
displayed in the SAS output and determine
what they mean.
• Begin by identifying what the sources of
variation are and their respective degrees
of freedom.
• The last page contains your predicted,
observed and residual values as well as
confidence and prediction intervals.
Analysis – some questions
Now let’s answer the following questions in order
to understand all the output displayed.
• What do the sums of squares tell us? Or What
do they account for?
• How do you determine the mean square(s)?
• How do you determine the F-statistics? What is
it used for? What does the p-value indicate?
• What are the root mean square error, the
dependent mean and the coeff var? What do
they measure?

More questions….
• What is the R-square? What does it measure?
• What are the parameter estimates? What is the
fitted model expression? What does this mean?
• What do the estimated standard errors tells us?
• How do you determine t-statistics? What are
they used for? What does the p-value indicate?

Now you can….

You should now be able to:
• Create a data set using a data step in order to:
  – manipulate a data set (in various ways)
  – use external raw data files
• Use various procedures in order to:
  – find the estimated regression line
  – plot the estimated regression line with data
  – generate confidence intervals and prediction intervals
6. Hints and Tidbits

• For assignments, summarize the output and write the answers to the questions being asked, clearly interpreting the results and indicating where in the output the numbers came from.
• You will need to be able to do this for your tests too – so you might as well practice…..
• Practice with the examples provided in class and the practice problems suggested by Professor Gibbs.
• Before going into the lab to use SAS, read over the questions carefully and determine what needs to be done. Look over examples that have already been presented to you to give you an idea. It will save you lots of time!
• Always check the log file for any errors!
Last Comments & Contact
• I will provide you with a short
supplementary document to help with the
SAS language and simple programming
steps (closer to the assignment time).

Anjali Mazumder
E-mail: mazumder@utstat.toronto.edu
www.utstat.toronto.edu/mazumder
References

1. Delwiche, Lora D. (1996). The Little SAS Book: A Primer (2nd ed.).
2. Elliott, Rebecca J. (2000). Learning SAS in the Computer Lab (2nd ed.).
3. Freund, Rudolf J. and Littell, Ramon C. (2000). SAS System for Regression (3rd ed.).
Introduction to SAS
Why use statistical packages
• Built‐in functions
• Data manipulation
• Updated often to include new applications
• Different packages complete certain tasks 
more easily than others
• Packages we will introduce
– SAS
– R (S‐plus)
SAS
• Easy to input and output data sets
• Preferred for data manipulation
• “proc” used to complete analyses with built‐in 
functions
• Macros used to build your own functions

Outline
• SAS Structure
• Efficient SAS Code for Large Files
• SAS Macro Facility

Common errors
• Missing semicolon
• Misspelling
• Unmatched quotes/comments
• Mixed proc and data statement
• Using wrong options

SAS Structure
• Data Step: input, create, manipulate or output 
data
– Always start with a data line
– Ex. data one;
• Procedure Step: complete an operation on 
data
– Always start with a proc line
– Ex. proc contents;

Statements for Reading Data
• data statement names the data set you are 
making
• Can use any of the following commands to 
input data
– infile Identifies an external raw data file to read 
with an INPUT statement 
– input Lists variable names in the input file
– cards Indicates internal data 
– set Reads a SAS data set

Example

data temp;
infile 'g:\shared\BIO271summer\baby.csv' delimiter=',' dsd;
input id headcir length bwt gestwks mage mnocig mheight mppwt fage fedyrs fnocig fheig;
run;

proc print data = temp (obs=10);
run;
Delimiter Option

• blank space (default)
• DELIMITER= option specifies that the INPUT 
statement use a character other than a blank 
as a delimiter for data values that are read 
with list input 

Delimiter Example

Sometimes you want to input the data yourself. Try the following data step:

data nums;
infile datalines dsd delimiter='&';
input X Y Z;
datalines;
1&2&3
4&5&6
7&8&9
;

Notice that there are no semicolons until the end of the datalines.
DSD option
• Change how SAS treats delimiters when list input is used and 
sets the default delimiter to a comma. When you specify DSD, 
SAS treats two consecutive delimiters as a missing value and 
removes quotation marks from character values. 
• Use the DSD option and list input to read a character value 
that contains a delimiter within a quoted string. The INPUT 
statement treats the delimiter as a valid character and 
removes the quotation marks from the character string before 
the value is stored. Use the tilde (~) format modifier to retain 
the quotation marks.

Example: Reading Delimited Data

SAS data step:

data scores;
infile datalines delimiter=',';
input test1 test2 test3;
datalines;
91,87,95
97,,92
,1,1
;

Output:

Obs test1 test2 test3
1    91    87    95
2    97    92     1

Without the DSD option, consecutive delimiters are treated as one, so SAS skips the missing values and reads ahead into the next data line.
Example: Correction

SAS data step:

data scores;
infile datalines delimiter=',' dsd;
input test1 test2 test3;
datalines;
91,87,95
97,,92
,1,1
;

Output:

Obs test1 test2 test3
1    91    87    95
2    97     .    92
3     .     1     1
Modified List Input
Read data that are separated by commas and that may 
contain commas as part of a character value: 

data scores;
infile datalines dsd;
input Name : $9. Score Team : $25. Div $;
datalines;
Joseph,76,"Red Racers, Washington",AAA
Mitchel,82,"Blue Bunnies, Richmond",AAA
Sue Ellen,74,"Green Gazelles, Atlanta",AA
;
Modified List Input
Output:

Obs Name Score Team Div


1 Joseph 76 Red Racers, Washington AAA
2 Mitchel 82 Blue Bunnies, Richmond AAA
3 Sue Ellen 74 Green Gazelles, Atlanta AA

Dynamic Data Exchange (DDE)
• Dynamic Data Exchange (DDE) is a method of dynamically 
exchanging information between Windows applications. DDE 
uses a client/server relationship to enable a client application 
to request information from a server application. In Version 8, 
the SAS System is always the client. In this role, the SAS 
System requests data from server applications, sends data to 
server applications, or sends commands to server 
applications. 
• You can use DDE with the DATA step, the SAS macro facility, 
SAS/AF applications, or any other portion of the SAS System 
that requests and generates data. DDE has many potential 
uses, one of which is to acquire data from a Windows 
spreadsheet or database application. 

Dynamic Data Exchange (DDE)
• NOTAB  is used only in the context of Dynamic 
Data Exchange (DDE). This option enables you 
to use nontab character delimiters between 
variables. 

DDE Example
FILENAME biostat DDE 'Excel|book1!r1c1:r27c2';
DATA NEW;
INFILE biostat dlm='09'x notab dsd missover;
INFORMAT seqno 10. no 2.;
INPUT seqno no; RUN;

Note:
SAS reads in the first 27 rows and 2 columns of the
spreadsheet named book1 in a open Excel file
through the Dynamic Data Exchange (DDE).

Statements for Outputting Data
• file: Specifies the current output file for PUT 
statements
• put: Writes lines to the SAS log, to the SAS procedure 
output file, or to an external file that is specified in 
the most recent FILE statement.
Example:
data _null_; 
set new;
file 'c:\out.csv' delimiter=',' dsd;
put seqno no ; 
run; 

Comparisons

• The INFILE statement specifies the input file for any INPUT statements in the DATA step. The FILE statement specifies the output file for any PUT statements in the DATA step.
• Both the FILE and INFILE statements allow you to use options that provide SAS with additional information about the external file being used.
• An INFILE statement usually identifies data from an external file. A DATALINES statement indicates that data follow in the job stream. You can use the INFILE statement with the file specification DATALINES to take advantage of certain data-reading options that affect how the INPUT statement reads in-stream data.
Read Dates with Formatted Input

DATA Dates;
INPUT @1 A date11.
      @13 B ddmmyy6.
      @20 C mmddyy10.
      @31 D yymmdd8.;
duration=A-mdy(1,1,1970);   /* days since January 1, 1970 */
FORMAT A B C D mmddyy10.;
cards;
13/APR/1999 130499 04-13-1999 99 04 13
01/JAN/1960 010160 01-01-1960 60 01 01
;
RUN;

Obs  A           B           C           D           duration
1    04/13/1999  04/13/1999  04/13/1999  04/13/1999   10694
2    01/01/1960  01/01/1960  01/01/1960  01/01/1960   -3653
Procedures To Import/Export Data

• IMPORT: reads data from an external data source and writes it to a SAS data set.
• CPORT: writes SAS data sets, SAS catalogs, or SAS data libraries to sequential file formats (transport files).
• CIMPORT: imports a transport file that was created (exported) by the CPORT procedure. It restores the transport file to its original form as a SAS catalog, SAS data set, or SAS data library.
PROC IMPORT
• Syntax: 
PROC IMPORT 
DATAFILE="filename" | TABLE="tablename" 
OUT=SAS‐data‐set 
<DBMS=identifier><REPLACE>; 

PROC IMPORT

Space.txt:
MAKE MPG WEIGHT PRICE
AMC 22 2930 4099
AMC 17 3350 4749
AMC 22 2640 3799
Buick 20 3250 4816
Buick 15 4080 7827

proc import datafile="space.txt" out=mydata
     dbms=dlm replace;
getnames=yes;
datarow=4;   /* begin reading data at row 4 of the file */
run;
Common DBMS Specifications

Identifier  Input Data Source                                Extension
ACCESS      Microsoft Access database                        .MDB
DBF         dBASE file                                       .DBF
EXCEL       Excel file                                       .XLS
DLM         delimited file (default delimiter is a blank)    .*
CSV         comma-separated file                             .CSV
TAB         tab-delimited file                               .TXT
SAS Programming Efficiency
• CPU time
• I/O time
• Memory
• Data storage
• Programming time

Use ELSE statement to reduce CPU time

/* slower: every IF is evaluated for every observation */
IF agegrp=3 THEN DO;...END;
IF agegrp=2 THEN DO;...END;
IF agegrp=1 THEN DO;...END;

/* faster: evaluation stops at the first condition that is true */
IF agegrp=3 THEN DO;...END;
ELSE IF agegrp=2 THEN DO;...END;
ELSE IF agegrp=1 THEN DO;...END;
Subset a SAS Dataset

/* slower: two passes through the data */
DATA div1; SET adults;
IF division=1; RUN;
DATA div2; SET adults;
IF division=2; RUN;

/* faster: one pass, two output data sets */
DATA div1 div2;
SET adults;
IF division=1 THEN OUTPUT div1;
ELSE IF division=2 THEN OUTPUT div2;
RUN;
MODIFY is Better Than SET

/* SET rewrites the entire data set */
DATA salary;
SET salary;
wages=wages*0.1;
RUN;

/* MODIFY updates the data set in place */
DATA salary;
MODIFY salary;
wages=wages*0.1;
RUN;
Save Space by DROP or KEEP
DATA new;
SET old (KEEP=a b c);
RUN;

DATA new;
SET old (DROP=a);
RUN;

Save Space by Deleting Data Sets
DATA three;
MERGE one two;
BY type;
RUN;

PROC DATASETS;
DELETE one two;
RUN;

Save Space by Compress
DATA new (COMPRESS=YES);
SET  old;

PROC SORT DATA=a OUT=b (COMPRESS=YES);

PROC SUMMARY;
VAR score;
OUTPUT OUT=SUM1 (COMPRESS=YES) SUM=;

Read Only What You Need

/* reads the full record for every observation */
DATA large;
INFILE myDATA;
INPUT @15 type $2. @ ;
INPUT @1 X $1. @2 Y $5. ;
RUN;

/* reads the rest of the record only for the types of interest */
DATA large;
INFILE myDATA;
INPUT @15 type $2. @ ;
IF type in ('10','11','12') THEN
INPUT @1 X $1. @2 Y $5.;
RUN;
PROC FORMAT Is Better Than IF-THEN

DATA new;
SET old;
IF 0 LE age LE 10 THEN agegroup=0;
ELSE IF 10 LE age LE 20 THEN agegroup=10;
ELSE IF 20 LE age LE 30 THEN agegroup=20;
ELSE IF 30 LE age LE 40 THEN agegroup=30;
RUN;

PROC FORMAT;
VALUE age 0-9=0 10-19=10 20-29=20 30-39=30;
RUN;

DATA new;
SET old;
agegroup=PUT(age,age.);
RUN;
Shorten Expressions with Functions

/* long way: loop over an array, skipping missing values */
array c{10} cost1-cost10;
tot=0;
do i=1 to 10;
   if c{i} ne . then tot+c{i};
end;

/* shorter: the SUM function skips missing values automatically */
tot=sum(of cost1-cost10);
IF-THEN Better Than AND

IF status1=1 and status2=9 THEN OUTPUT;

IF status1=1 THEN
   IF status2=9 THEN OUTPUT;
Use SAS Functions Whenever Possible

DATA new; SET old;
meanxyz = (x+y+z)/3;   /* missing if any of x, y, z is missing */
RUN;

DATA new; SET old;
meanxyz = mean(x, y, z);   /* mean of the non-missing values */
RUN;
Use RETAIN to Initialize Constants

DATA new; SET old;
a = 5; b = 13;
(programming statements);
RUN;

DATA new; SET old;
retain a 5 b 13;
(programming statements);
RUN;
Efficient Sort

PROC SORT;
BY vara varb varc vard vare;
RUN;

DATA new; SET old;
sortvar=vara||varb||varc||vard||vare;
RUN;
PROC SORT;
BY sortvar;
RUN;
Use Arrays and Macros

Using arrays and macros can save you the time of having to repeatedly type groups of statements.

Example: Convert Missing Values to 0

data one;
input chr $ a b c;
cards;
x 2 . 9
y . 3 .
z 8 . .
;
data two; set one; drop i;
array x(*) _numeric_;
do i=1 to dim(x);
   if x(i) = . then x(i)=0;
end;
run;
When w has many missing values…

DATA new;
SET old;
wyzsum = 26 + y + z + w;
RUN;

/* skip the computation when w is missing */
DATA new;
SET old;
IF w > . THEN wyzsum = 26 + y + z + w;
RUN;
Put Loops With the Fewest Iterations Outermost

/* outer loop runs 100 times */
DATA new;
SET old;
DO i = 1 TO 100;
   DO j = 1 TO 10;
      (programming statements);
   END;
END;
RUN;

/* better: outer loop runs only 10 times */
DATA new;
SET old;
DO i = 1 TO 10;
   DO j = 1 TO 100;
      (programming statements);
   END;
END;
RUN;
IN Better Than OR

IF status=1 OR status=5 THEN newstat="single";
ELSE newstat="not single";

IF status IN (1,5) THEN newstat="single";
ELSE newstat="not single";
SAS Macro 
What can we do with Macro?

• Avoid repetitious SAS code


• Create generalizable and flexible SAS code
• Pass information from one part of a SAS job to
another
• Conditionally execute data steps and PROCs
• Dynamically create code at execution time

SAS Macro Facility
• SAS macro variable
• SAS Macro
• Autocall Macro Facility 
• Stored Compiled Macro Facility

SAS Macro Delimiters
Two delimiters will trigger the macro
processor in a SAS program.

• &macro-name
This refers to a macro variable. The current
value of the variable will replace &macro-name;

• %macro-name
This refers to a macro, which consists of one or
more complete SAS statements, or even whole data
or proc steps.

SAS Macro Variables
• SAS Macro variables can be defined and used 
anywhere in a SAS program, except in data 
lines. They are independent of a SAS dataset. 
• Macro variables contain a single character 
value that remains constant until it is 
explicitly changed.

SAS Macro Variables

%LET: assigns text to a macro variable:
  %LET macrovar = value;
  1. macrovar is the name of a global macro variable;
  2. value is the macro variable value: a character string (without quotation marks) or a macro expression.

%PUT: displays macro variable values as text in the SAS log: %put _all_; %put _user_;

&macrovar: substitutes the value of a macro variable in a program.
SAS Macro Variables 
• SAS-supplied Macro Variables:
%put &SYSDAY; Tuesday
%put &SYSDATE; 30SEP03
%put &SYSTIME; 11:02
%put &SYSVER; 8.2

• %put _all_ shows SAS-supplied


automatic and user-defined macro
variables.

SAS Macro Variables 
Combine Macro Variables with Text 
%LET first = John; 
%LET last = Smith; 
%put &first.&last; (combine)
%put &first. &last; (blank separate)
%put Mr. &first. &last; (prefix)
%put &first. &last. HSPH; (suffix)
output:
JohnSmith
John Smith
Mr. John Smith
John Smith HSPH
Create SAS Macro

• Definition:
%MACRO macro-name (parm1, parm2, …, parmk);
   macro definition (&parm1, &parm2, …, &parmk)
%MEND macro-name;

• Application:
%macro-name(values of parm1, parm2, …, parmk);
SAS Macro Example

Import Excel to SAS Datasets by a Macro

%macro excelsas(in=, out=);
proc import out=work.&out
   datafile="c:\&in"
   dbms=excel2000 replace;
getnames=yes;
run;
%mend excelsas;

%excelsas(in=class1, out=score1)
%excelsas(in=class2, out=score2)

(Since the macro defines keyword parameters, the calls must pass them by name, with no space between % and the macro name.)
SAS System Options

• System options are global instructions that affect the entire SAS session and control the way SAS performs operations. SAS system options differ from SAS data set options and statement options in that once you invoke a system option, it remains in effect for all subsequent data and proc steps in a SAS job, unless you reset it.
• In order to view which options are available and in effect for your SAS session, use proc options:
PROC OPTIONS; RUN;
SAS system options
• NOCAPS Translate quoted strings and titles to upper case?
• CENTER Center SAS output?
• DATE Date printed in title?
• ERRORS=20 Maximum number of observations with error messages
• FIRSTOBS=1 First observation of each data set to be processed
• FMTERR Treat missing format or informat as an error?
• LABEL Allow procedures to use variable labels?
• LINESIZE=96 Line size for printed output
• MISSING=. Character printed to represent numeric missing values
• NUMBER Print page number on each page of SAS output?
• OBS=MAX Number of last observation to be processed
• PAGENO=1 Resets the current page number on the print file
• PAGESIZE=54 Number of lines printed per page of output
• YEARCUTOFF=1900 Cutoff year for DATE7. informat

Log, output and procedure options
• center controls whether SAS procedure output is centered. By default, output is centered. To 
specify not centered, use nocenter.
• date prints the date and time to the log and output window. By default, the date and time is 
printed. To suppress the printing of the date, use nodate.
• label allows SAS procedures to use labels with variables. By default, labels are permitted. To 
suppress the printing of labels, use nolabel.
• notes controls whether notes are printed to the SAS log. By default, notes are printed. To 
suppress the printing of notes, use nonotes.
• number controls whether page numbers are printed. By default, page numbers are printed. 
To suppress the printing of page numbers, use nonumber.
• linesize= specifies the line size (printer line width) for the SAS log and the SAS procedure 
output file used by the data step and procedures.
• pagesize= specifies # of lines that can be printed per page of SAS output.
• missing= specifies the character to be printed for missing numeric values.
• formchar= specifies the list of graphics characters that define table boundaries. 

Example:
OPTIONS NOCENTER NODATE NONOTES LINESIZE=80 MISSING=. ;

SAS data set control options
SAS data set control options specify how SAS data sets 
are input, processed, and output. 
• firstobs= causes SAS to begin reading at a specified observation in a data 
set. The default is firstobs=1.
• obs= specifies the last observation from a data set or the last record from 
a raw data file that SAS is to read. To return to using all observations in a 
data set use obs=all 
• replace specifies whether permanently stored SAS data sets are to be 
replaced. By default, the SAS system will over‐write existing SAS data sets 
if the SAS data set is re‐specified in a data step. To suppress this option, 
use noreplace.

Example:
• OPTIONS OBS=100 NOREPLACE;
56
Error handling options
Error handling options specify how the SAS System 
reports on and recovers from error conditions. 
• errors= controls the maximum number of observations for which 
complete error messages are printed. The default maximum number of 
complete error messages is errors=20
• fmterr controls whether the SAS System generates an error message 
when the system cannot find a format to associate with a variable. SAS 
will generate an ERROR message for every unknown format it 
encounters and will terminate the SAS job without running any following 
data and proc steps. To read a SAS system data set without requiring a 
SAS format library, use nofmterr.

Example:
OPTIONS ERRORS=100 NOFMTERR;
57
Using where statement

The where statement allows us to run procedures on a 
subset of records.

Examples:

PROC PRINT DATA=auto; 
WHERE (rep78 >= 3); 
VAR make rep78; 
RUN;

PROC PRINT DATA=auto; 
WHERE (rep78 <= 2) and (rep78 ^= .) ; 
VAR make price rep78 ; 
RUN; 
58
Missing Values
As a general rule, SAS procedures that perform 
computations handle missing data by omitting 
the missing values. 

59
Summary of how missing values 
are handled in SAS procedures 
• proc means
For each variable, the number of non‐missing values 
are used 
• proc freq
By default, missing values are excluded and 
percentages are based on the number of non‐missing 
values. If you use the missing option on the tables
statement, the percentages are based on the total 
number of observations (non‐missing and missing) 
and the percentage of missing values are reported in 
the table.
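
A minimal sketch of the difference (assuming the auto data set used elsewhere in these notes, with missing values in rep78):

PROC FREQ DATA=auto; 
  TABLES rep78;             /* percentages use non-missing values only */
RUN; 

PROC FREQ DATA=auto; 
  TABLES rep78 / MISSING;   /* percentages use all observations */
RUN; 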

60
Summary of how missing values 
are handled in SAS procedures
• proc corr
By default, correlations are computed based on the 
number of pairs with non‐missing data (pairwise 
deletion of missing data). The nomiss option can be 
used to request that correlations be computed only 
for observations that have non‐missing data for all 
variables on the var statement (listwise deletion of 
missing data). 
• proc reg
If any of the variables on the model or var statement 
are missing, they are excluded from the analysis (i.e., 
listwise deletion of missing data)
61
Summary of how missing values 
are handled in SAS procedures
• proc glm
If you have an analysis with just one variable on the 
left side of the model statement (just one outcome 
or dependent variable), observations are eliminated 
if any of the variables on the model statement are 
missing. Likewise, if you are performing a repeated 
measures ANOVA or a MANOVA, then observations 
are eliminated if any of the variables in the model 
statement are missing. For other situations, see the 
SAS/STAT manual about proc glm. 

62
Missing values in assignment 
statements
• As a general rule, computations involving missing 
values yield missing values.
2 + 2 yields 4
2 + . yields .
• mean(of var1‐varn): average of the non‐
missing values in a list of variables: 
avg = mean(of var1‐var10);
• N(of var1‐varn): number of non‐
missing values in a list of variables: 
n = N(var1, var2, var3);
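
A small sketch combining the two functions in one data step (data set and variable names hypothetical):

DATA scores2; 
  SET scores; 
  avg  = MEAN(of var1-var10);   /* average of the non-missing values */
  nobs = N(of var1-var10);      /* how many of var1-var10 are non-missing */
RUN; 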

63
Missing values in logical 
statements
• SAS treats a missing value as the smallest possible 
value (e.g., negative infinity) in logical statements.
DATA times6; 
SET times ; 
if (var1 <= 1.5) then varc1 = 0; else varc1 = 1 ; 
RUN ; 

Output:
Obs id var1 varc1
1 1 1.5 0
2 2 . 0
3 3 2.1 1
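
Because missing sorts below every number, the missing var1 above lands in the varc1 = 0 group. A guarded version of the same step (a sketch) keeps missing as missing:

DATA times7; 
  SET times; 
  IF var1 = . THEN varc1 = .;            /* propagate the missing value */
  ELSE IF var1 <= 1.5 THEN varc1 = 0;    /* only real values reach here */
  ELSE varc1 = 1; 
RUN; 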

64
Subsetting Data
Subsetting variables using keep or drop statements

Example:
DATA auto2; 
SET auto; 
KEEP make mpg price; 
RUN;

DATA auto3; 
SET auto; 
DROP rep78 hdroom trunk weight length turn displ gratio foreign; 
RUN;

65
Subsetting Data
Subsetting observations using if statements

Example:
DATA auto4; 
SET auto; 
IF rep78 ^= . ;
RUN;

DATA auto5; 
SET auto; 
IF rep78 > 3 THEN DELETE ; 
RUN;

66
Labeling variables
Variable label: Use the label statement in the data step to assign 
labels to the variables. You could also assign labels to 
variables in proc steps, but then the labels only exist for that 
step. When labels are assigned in the data step they are 
available for all procedures that use that data set.

Example:
DATA auto2; 
SET auto; 
LABEL rep78 ="1978 Repair Record" mpg ="Miles Per Gallon" foreign="Where Car 
Was Made"; 
RUN; 
PROC CONTENTS DATA=auto2; 
RUN;

67
Labeling variable values
Labeling values is a two step process. First, you must create the 
label formats with proc format using a value statement. Next, 
you attach the label format to the variable with a format
statement. This format statement can be used in either proc
or data steps.
Example:
*first create the label formats forgnf and makef;
PROC FORMAT; 
VALUE forgnf 0="domestic" 1="foreign" ; 
VALUE $makef "AMC" ="American Motors" "Buick" ="Buick (GM)" "Cad." ="Cadallac (GM)" 
"Chev." ="Cheverolet (GM)" "Datsun" ="Datsun (Nissan)"; 
RUN;
*now we link them to the variables foreign and make;
PROC FREQ DATA=auto2; 
FORMAT foreign forgnf. make $makef.; 
TABLES foreign make; RUN;
68
Sort data
Use proc sort to sort this data file. 
Examples:
PROC SORT DATA=auto ; BY foreign ; RUN ; 

PROC SORT DATA=auto OUT=auto2 ; 
BY foreign ; RUN ; 

PROC SORT DATA=auto OUT=auto3; 
BY descending foreign ; RUN ; 

PROC SORT DATA=auto OUT=auto2 noduplicates;
BY foreign ; RUN ; 
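
noduplicates removes observations that are identical on every variable; the related nodupkey option instead keeps only the first observation for each value of the BY variable. A sketch:

PROC SORT DATA=auto OUT=auto6 NODUPKEY; 
  BY foreign;    /* keeps one observation per value of foreign */
RUN; 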

69
Making and using permanent SAS 
data files
• Use a libname statement.
libname diss 'c:\dissertation\'; 
data diss.salary; 
input sal1996‐sal2000 ; 
cards;  
14000 16500 18000 22000 29000 

run;
• specify the name of the data file by directly specifying the 
path name of the file 
data 'c:\dissertation\salarylong'; 
input Salary1996‐Salary2000 ; 
cards; 
14000 16500 18000 22000 29000 

run;
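
To read a permanent data set back in a later session, point a libname at the same folder again (in recent SAS versions, diss.salary is stored on disk as salary.sas7bdat):

libname diss 'c:\dissertation\'; 
proc print data=diss.salary; 
run; 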

70
Merge data files
One‐to‐one merge: there are three steps to match 
merge two data files dads and faminc on the same 
variable famid.
1. Use proc sort to sort dads on famid and save that 
file (we will call it dads2) 
PROC SORT DATA=dads OUT=dads2;   BY famid; RUN; 

2. Use proc sort to sort faminc on famid and save that 
file (we will call it faminc2) 
PROC SORT DATA=faminc OUT=faminc2;   BY famid; RUN; 

3. Merge the dads2 and faminc2 files based on famid
DATA dadfam ;   MERGE dads2 faminc2;   BY famid; RUN; 

71
Merge data files
One‐to‐many merge: there are three steps to match 
merge two data files dads and kids on the same 
variable famid.
1. Use proc sort to sort dads on famid and save that 
file (we will call it dads2) 
PROC SORT DATA=dads OUT=dads2;   BY famid; RUN; 

2. Use proc sort to sort kids on famid and save that 
file (we will call it kids2) 
PROC SORT DATA=kids OUT=kids2;   BY famid; RUN; 

3. Merge the dads2 and kids2 files based on famid
DATA dadkid;   MERGE dads2 kids2;   BY famid; RUN; 

72
Merge data files: mismatch
• Mismatching records in one‐to‐one merge: use the in
option to create a 0/1 variable
DATA merge121; 
MERGE dads(IN=fromdadx) faminc(IN=fromfamx); 
BY famid; 
fromdad = fromdadx; 
fromfam = fromfamx; 
RUN;

• Variables with the same name, but different 
information: rename variables
DATA merge121; 
MERGE faminc(RENAME=(inc96=faminc96 inc97=faminc97 inc98=faminc98)) 
dads(RENAME=(inc98=dadinc98)); 
BY famid; 
RUN; 
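
The in= flags can also drive subsetting. A sketch that keeps only the famid values present in both files:

DATA both; 
  MERGE dads(IN=fromdad) faminc(IN=fromfam); 
  BY famid; 
  IF fromdad AND fromfam;    /* keep matched records only */
RUN; 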

73
Concatenating data files in SAS
• Use set to stack data files
DATA dadmom;   SET dads moms; RUN;
• Use rename to stack two data files with different variable 
names for the same thing
DATA momdad; 
SET dads(RENAME=(dadinc=inc)) moms(RENAME=(mominc=inc));
RUN; 
• Two data files with different lengths for variables of the same 
name
DATA momdad; 
LENGTH name $ 4; 
SET dads moms; 
RUN; 

74
Concatenating data files in SAS
• The two data files have variables with the same 
name but different codes
dads:                          moms:
famid name inc   fulltime      famid name inc   fulltime
1     Bill 30000 1             1     Bess 15000 N
2     Art  22000 0             2     Amy  18000 N
3     Paul 25000 3             3     Pat  50000 Y
DATA dads; SET dads; full=fulltime; DROP fulltime;RUN;

DATA moms; SET moms; 
IF fulltime="Y" THEN full=1; IF fulltime="N" THEN full=0; 
DROP fulltime;RUN;

DATA momdad; SET dads moms;RUN;

75
SAS Macro 
What can we do with Macro?

• Avoid repetitious SAS code


• Create generalizable and flexible SAS code
• Pass information from one part of a SAS job to
another
• Conditionally execute data steps and PROCs (see the sketch after this list)
• Dynamically create code at execution time
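
A minimal sketch of conditional execution with %IF (macro name and parameter are hypothetical):

%macro summarize(stats=no);
  proc print data=auto; run;
  %if &stats = yes %then %do;    /* add a proc step only when requested */
    proc means data=auto; run;
  %end;
%mend summarize;

%summarize(stats=yes)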

76
SAS Macro Facility
• SAS macro variable
• SAS Macro
• Autocall Macro Facility 
• Stored Compiled Macro Facility

77
SAS Macro Delimiters
Two delimiters will trigger the macro
processor in a SAS program.

• &macro-name
This refers to a macro variable. The current
value of the variable will replace &macro-name;

• %macro-name
This refers to a macro, which consists of one or
more complete SAS statements, or even whole data
or proc steps.

78
SAS Macro Variables
• SAS Macro variables can be defined and used 
anywhere in a SAS program, except in data 
lines. They are independent of a SAS dataset. 
• Macro variables contain a single character‐
string value that remains constant until it is 
explicitly changed.
• The macro facility is controlled by the macro
system option (on by default): 
options macro;

79
SAS Macro Variables 
%LET: assign text to a macro variable:
%LET macrovar = value;
1. Macrovar is the name of a global macro variable;
2. Value is the macro variable value: a character string
(no quotation marks needed) or a macro expression.

%PUT: display macro variable values as text in
the SAS log: %put _all_; %put _user_;

&macrovar: substitute the value of a macro
variable in a program.
80
SAS Macro Variables 
• SAS-supplied Macro Variables:
%put &SYSDAY; Tuesday
%put &SYSDATE; 30SEP03
%put &SYSTIME; 11:02
%put &SYSVER; 8.2

• %put _all_ shows SAS-supplied
automatic and user-defined macro
variables.

81
SAS Macro Variables 
Combine Macro Variables with Text 
%LET first = John; 
%LET last = Smith; 
%put &first.&last; (combine)
%put &first. &last; (blank separate)
%put Mr. &first. &last; (prefix)
%put &first. &last. HSPH; (suffix)
output:
JohnSmith
John Smith
Mr. John Smith
John Smith HSPH
82
Create SAS Macro 
• Definition:
%MACRO macro‐name (parm1, parm2,…parmk);
Macro definition (&parm1,&parm2,…&parmk)
%MEND macro‐name;

• Application:
%macro‐name(values of parm1, parm2,…,parmk);

83
SAS Macro Example
Import Excel to SAS Datasets by a Macro
%macro excelsas(in,out);
proc import out=work.&out
datafile="c:\&in"
dbms=excel2000 replace;
getnames=yes; run;
%mend excelsas;

%excelsas(class1, score1)
%excelsas(class2, score2)
84
SAS Macro Example
Use proc means by a Macro
%macro auto(var1, var2); 
proc sort data=auto;
by &var2;
run;

proc means data=auto;
var  &var1;
by &var2;
run;
%mend auto;

%auto(price, rep78) ;
%auto(price, foreign); 
85
Inclass practice
Use the auto data to do the following
• check missing values for each variable
• create a new variable model (first part of 
make)
• get means/frequencies for each variable by 
model
• create 5 data files with 1‐5 repairs using macro

86
SCATTER DIAGRAM
A scatter diagram is a tool for analyzing relationships between two variables. One
variable is plotted on the horizontal axis and the other is plotted on the vertical axis. The
pattern of their intersecting points can graphically show relationship patterns. Most often
a scatter diagram is used to prove or disprove cause-and-effect relationships. While the
diagram shows relationships, it does not by itself prove that one variable causes the
other. In addition to showing possible cause-and-effect relationships, a scatter diagram
can show that two variables are from a common cause that is unknown or that one
variable can be used as a surrogate for the other.

When to use it:


Use a scatter diagram to examine theories about cause-and-effect relationships and to
search for root causes of an identified problem. Use a scatter diagram to design a
control system to ensure that gains from quality improvement efforts are maintained.

How to use it:


Collect data. Gather 50 to 100 paired samples of data that show a possible
relationship.
Draw the diagram. Draw roughly equal horizontal and vertical axes of the diagram,
creating a square plotting area. Label the axes in convenient multiples (1, 2, 5, etc.)
increasing on the horizontal axis from left to right and on the vertical axis from bottom
to top. Label both axes.

Plot the paired data. Plot the data on the chart, using concentric circles to indicate
repeated data points.

Title and label the diagram.


Interpret the data.
Scatter diagrams will generally show one of six possible correlations between the
variables:
Strong Positive Correlation
The value of Y clearly increases as the value of X increases.

Strong Negative Correlation


The value of Y clearly decreases as the value of X increases.

Weak Positive Correlation


The value of Y increases slightly as the value of X increases.

Weak Negative Correlation


The value of Y decreases slightly as the value of X increases.

Complex Correlation
The value of Y seems to be related to the value of X, but the relationship is not easily
determined.

No Correlation
There is no demonstrated connection between the two variables.

Scatter Diagram Example

[Figure: six example scatter diagrams, one for each pattern above -
strong positive, strong negative, weak positive, weak negative,
complex, and no correlation]
Inferential statistics

We move from descriptive to inferential
statistics. No longer are we simply comparing
data sets; we are now seeking 'cause and effect'
relationships.

Note that a 'statistical relationship' can occur
where no 'meaningful relationship' is possible.
Such a relationship is spurious.

So any positive statistical result must always be
backed up by sound reasoning.
Scatter graphs

It is usually wise to draw a scatter graph if
undertaking any correlation. It is an easy way to
highlight any relationship that may exist and its
type, whether direct or inverse.

[Figure: three scatter graphs (y against x, both axes 0-40) illustrating
a direct relationship, an inverse relationship, and no relationship]
GNP and adult literacy

Let us test whether there is a relationship between
GNP per capita and educational provision.

GNP per capita % adult literacy


Nepal 210 39.7
Sudan 290 55.5
Gambia 340 34.5
Peru 2460 89
Turkey 3160 81.4
Brazil 4570 84
Argentina 8970 97
Israel 15940 96
U.A.E. 18220 74.3
Netherlands 24760 100
GNP and adult literacy

First, construct a null hypothesis (Ho) – that there is no
relationship between GNP per capita and % adult
literacy.
GNP per capita % adult literacy
Nepal 210 39.7
Sudan 290 55.5
Gambia 340 34.5
Peru 2460 89
Turkey 3160 81.4
Brazil 4570 84
Argentina 8970 97
Israel 15940 96
U.A.E. 18220 74.3
Netherlands 24760 100
GNP and adult literacy
Spearman’s Rank correlation coefficient (Rs) is the best
method to use, as the GNP data is skewed.

GNP per capita   % adult literacy
Nepal              210    39.7
Sudan              290    55.5
Gambia             340    34.5
Peru              2460    89
Turkey            3160    81.4
Brazil            4570    84
Argentina         8970    97
Israel           15940    96
U.A.E.           18220    74.3
Netherlands      24760   100

Remember, Spearman's Rank can only be used
with ordinal (ranked) data. It is necessary,
therefore, to rank-order the data first.
Setting out Spearman’s Rank . . .
Set the data out as shown, with columns for 'ranking' the
variables, and 'n =' and 'Σd²' at the base:
GNP % adult
per capita literacy
X Rank X Y Rank Y d d^2
Nepal 210 39.7
Sudan 290 55.5
Gambia 340 34.5
Peru 2460 89
Turkey 3160 81.4
Brazil 4570 84
Argentina 8970 97
Israel 15940 96
U.A.E. 18220 74.3
Netherlands 24760 100
n=
Ranking the X data . . .
The GNP (X) is already rank-ordered ; all it needs is to
enter the ranking, in this case from lowest to highest.
GNP % adult
per capita literacy
X Rank X Y Rank Y d d^2
Nepal 210 1 39.7
Sudan 290 2 55.5
Gambia 340 3 34.5
Peru 2460 89
Turkey 3160 81.4
Brazil 4570 84
Argentina 8970 97
Israel 15940 96
U.A.E. 18220 74.3
Netherlands 24760 100
n=
Ranking the Y data . . .
Ranking is now complete for GNP. Do the same for %
adult literacy (again start with the lowest value. . .)
GNP % adult
per capita literacy
X Rank X Y Rank Y d d^2
Nepal 210 1 39.7 2
Sudan 290 2 55.5 3
Gambia 340 3 34.5 1
Peru 2460 4 89
Turkey 3160 5 81.4
Brazil 4570 6 84
Argentina 8970 7 97
Israel 15940 8 96
U.A.E. 18220 9 74.3
Netherlands 24760 10 100
n=
Rank ordering completed . . .
Both variables X and Y are now ranked. It is time to find
the difference of (Rank X - Rank Y) or ‘d’.
GNP % adult
per capita literacy
X Rank X Y Rank Y d d^2
Nepal 210 1 39.7 2
Sudan 290 2 55.5 3
Gambia 340 3 34.5 1
Peru 2460 4 89 7
Turkey 3160 5 81.4 5
Brazil 4570 6 84 6
Argentina 8970 7 97 9
Israel 15940 8 96 8
U.A.E. 18220 9 74.3 4
Netherlands 24760 10 100 10
n=
Squaring ‘d’ . . .
To get rid of the minuses, square (d) . . .

GNP % adult
per capita literacy
X Rank X Y Rank Y d d^2
Nepal 210 1 39.7 2 -1
Sudan 290 2 55.5 3 -1
Gambia 340 3 34.5 1 2
Peru 2460 4 89 7 -3
Turkey 3160 5 81.4 5 0
Brazil 4570 6 84 6 0
Argentina 8970 7 97 9 -2
Israel 15940 8 96 8 0
U.A.E. 18220 9 74.3 4 5
Netherlands 24760 10 100 10 0
n = 10
Summing ‘d^2’ . . .

Now sum ‘d^2’ . . . . . . which gives the answer 44.

GNP % adult
per capita literacy
X Rank X Y Rank Y d d^2
Nepal 210 1 39.7 2 -1 1
Sudan 290 2 55.5 3 -1 1
Gambia 340 3 34.5 1 2 4
Peru 2460 4 89 7 -3 9
Turkey 3160 5 81.4 5 0 0
Brazil 4570 6 84 6 0 0
Argentina 8970 7 97 9 -2 4
Israel 15940 8 96 8 0 0
U.A.E. 18220 9 74.3 4 5 25
Netherlands 24760 10 100 10 0 0
n = 10 44
Getting ‘n’ . . .

There are ten countries – so ‘n’ = 10

GNP % adult
per capita literacy
X Rank X Y Rank Y d d^2
Nepal 210 1 39.7 2 -1 1
Sudan 290 2 55.5 3 -1 1
Gambia 340 3 34.5 1 2 4
Peru 2460 4 89 7 -3 9
Turkey 3160 5 81.4 5 0 0
Brazil 4570 6 84 6 0 0
Argentina 8970 7 97 9 -2 4
Israel 15940 8 96 8 0 0
U.A.E. 18220 9 74.3 4 5 25
Netherlands 24760 10 100 10 0 0
n = 10 44
Calculating Spearman’s Rank . . .

Insert the figures into the equation . . .

GNP % adult
per capita literacy
X Rank X Y Rank Y d d^2
Nepal 210 1 39.7 2 -1 1
Sudan 290 2 55.5 3 -1 1
Gambia 340 3 34.5 1 2 4
Peru 2460 4 89 7 -3 9
Turkey 3160 5 81.4 5 0 0
Brazil 4570 6 84 6 0 0
Argentina 8970 7 97 9 -2 4
Israel 15940 8 96 8 0 0
U.A.E. 18220 9 74.3 4 5 25
Netherlands 24760 10 100 10 0 0
n = 10 44

Rs = 1 - ( 6 x 44 )
      (1000 – 10)
The answer to Spearman’s . . .

. . . and, hey presto, the answer to Rs is 0.733.

GNP % adult
per capita literacy
X Rank X Y Rank Y d d^2
Nepal 210 1 39.7 2 -1 1
Sudan 290 2 55.5 3 -1 1
Gambia 340 3 34.5 1 2 4
Peru 2460 4 89 7 -3 9
Turkey 3160 5 81.4 5 0 0
Brazil 4570 6 84 6 0 0
Argentina 8970 7 97 9 -2 4
Israel 15940 8 96 8 0 0
U.A.E. 18220 9 74.3 4 5 25
Netherlands 24760 10 100 10 0 0
n = 10 44

Rs = 1 – 264/990 = 1 – 0.267 = 0.733
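
In general form this is the standard Spearman's Rank formula, with n³ − n = n(n² − 1):

$$ r_s = 1 - \frac{6\sum d^2}{n(n^2-1)} = 1 - \frac{6 \times 44}{10(10^2-1)} = 1 - \frac{264}{990} \approx 0.733 $$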
Your (Rs) findings . . .
Null hypothesis (Ho) was that there was no relationship
between GNP per capita and % adult literacy.

The degrees of freedom are (n – 1). So 'df' = 9.

Spearman's Rank correlation coefficient (Rs) result of
0.733 exceeds the 95% probability value of 0.60 at 9
degrees of freedom.

Therefore the Ho must be rejected and replaced by the
alternative hypothesis (H1) – that there is a relationship
between GNP per capita and % adult literacy.
SPSS for Beginners

1
What is in this workshop
• SPSS interface: data view and variable view
• How to enter data in SPSS
• How to import external data into SPSS
• How to clean and edit data
• How to transform variables
• How to sort and select cases
• How to get descriptive statistics 

2
Data used in the workshop
• We use 2009 Youth Risk Behavior Surveillance 
System (YRBSS, CDC) as an example.
– YRBSS monitors priority health‐risk behaviors and 
the prevalence of obesity and asthma among 
youth and young adults.  
– The target population is high school students
– Multiple health behaviors include drinking, 
smoking, exercise, eating habits, etc. 

3
SPSS interface
• Data view
– The place to enter data
– Columns: variables
– Rows: records
• Variable view
– The place to enter variables
– List of all variables
– Characteristics of all variables

4
Before the data entry
• You need a code book/scoring guide
• You give an ID number to each case (NOT the real 
identification numbers of your subjects) if you 
use a paper survey.
• If you use an online survey, you need something 
to identify your cases.
• You also can use Excel to do data entry. 

5
Example of a code book

A code book is about how you code your 
variables. What is in a code book? 
1. Variable names
2. Values for each response option
3. How to recode variables

6
Enter data in SPSS 19.0

Columns: 
variables

Rows: cases

Under Data 
View

7
Enter variables

1. Click Variable View
2. Type the variable name under the 
Name column (e.g. Q01). 
NOTE: Variable name can be 64 
bytes long, and the first 
character must be a letter or 
one of the characters @, #, or $.
3. Type: Numeric, string, etc.
4. Label: description of variables. 

8
Enter variables

Based on your 
code book!

9
Enter cases

1. Two variables in the data set.
2. They are: Code and Q01.
3. Code is an ID variable, used to identify individual case 
(NOT people’s real IDs).  
4. Q01 is about participants’ ages: 1 = 12 years or younger, 
2 = 13 years, 3 = 14 years…

Under Data 
View

10
Import data from Excel
• Select File            Open            Data
• Choose Excel as file type
• Select the file you want to import
• Then click Open 

11
Open Excel files in SPSS

12
Import data from CSV file
• CSV is a comma‐separated values file.
• If you use Qualtrics to collect data (online 
survey), you will get a CSV data file. 
• Select File             Open             Data
• Choose All files as file type
• Select the file you want to import
• Then click Open

13
[Slides 14-19: screenshots of the import steps]
Continue

Save this file as SPSS data 

20
Clean data after importing data files
• Key in values and labels for each variable
• Run frequencies for each variable
• Check outputs to see if you have variables 
with wrong values.
• Check missing values against the physical surveys if 
you use paper surveys, and make sure they 
are really missing. 
• Sometimes, you need to recode string 
variables into numeric variables
21
Continue

Wrong 
entries

22
Variable transformation
• Recode variables
1. Select Transform           Recode 
into Different Variables 
2. Select variable that you want to 
transform (e.g. Q20): we want
1= Yes and 0 = No
3. Click Arrow button to put your 
variable into the right window
4. Under Output Variable: type 
name for new variable and 
label, then click Change
5. Click Old and New Values

23
Continue
6. Type 1 under Old Value
and 1 under New Value, 
click Add. Then type 2
under Old Value, and 0
under New Value, click 
Add.
7. Click Continue after 
finishing all the changes. 
8. Click OK
24
Variable transformation
• Compute variable (use YRBSS 2009 data)
• Example 1. Create a new variable: drug_use (During the past 30 
days, any use of cigarettes, alcohol, and marijuana is defined as 
use, else as non‐use). There are two categories for the new 
variable (use vs. non‐use). Coding: 1 = Use and 0 = Non‐use
1. Use Q30, Q41, and Q47 from the 2009 YRBSS survey
2. Non‐users means those who answered 0 days/times to all three 
questions.
3. Go to Transform          Compute Variable

25
Continue
4. Type “drug_use” under 
Target Variable
5. Type “0” under Numeric 
Expression. 0 means 
Non‐use
6.  Click If button. 

26
Continue
7. With the help of the 
Arrow button, type 
Q30 = 1 & Q41 = 1 & Q47 = 1
(& means AND; | means OR),
then click Continue
8. Do the same thing for 
Use, but the numeric
expression is different:
Q30 > 1 | Q41 > 1 | Q47 > 1

27
Continue
9. Click OK
10. After click OK,
a small window asks
if you want to
change existing
variable because
drug_use was already
created when you
first define non‐use.
11. Click ok.   

28
Continue
• Compute variables
• Example 2. Create a new variable drug_N that 
assesses the total number of drugs that adolescents 
used during the last 30 days.
1. Use Q30 (cigarettes), 41 (alcohol), 47 
(marijuana), and 50 (cocaine). The number of 
drugs used should be between 0 and 4.
2. First, recode all four variables into two 
categories: 0 = non‐use (0 days), 1 = use (at least 
1 day/time) 
3. The four variables have 6 or 7 categories
29
Continue
4. Recode four  variables: 1 (old) = 0 (new), 2‐6/7 (old) = 1 (New). 
5. Then select Transform           Compute  Variable

30
Continue
6. Type drug_N under Target Variable
7. Numeric Expression: SUM (Q30r,Q41r,Q47r,Q50r)
8. Click OK

31
Continue
• Compute variables
– Example 3: Convert string variable into numeric 
variable 
1. Enter 1 at Numeric 
Expression.
2. Click If button and type 
Q2 = ‘Female’
3. Then click Ok.
4. Enter 2 at Numeric 
Expression.
5. Click If button and type 
Q2 = ‘Male’
6. Then click Ok

32
Sort and select cases
• Sort cases by variables: Data           Sort Cases 
• You can use Sort Cases to find missing. 

33
Sort and select cases
• Select cases
– Example 1. Select Females for analysis.
1. Go to Data           Select Cases
2. Under Select: check the second one
3. Click If button

34
Continue
4. Q2 (gender) = 1,
1 means Female
5. Click Continue
6. Click Ok
Unselected 
cases : 
Q2 = 2

35
Sort and select cases
7. You will see a new variable: filter_$ (Variable 
view)

36
Sort and select cases
• Select cases
– Example 2. Select cases who used any of cigarettes, alcohol, and marijuana 
during the last 30 days. 
1. Data            Select Cases
2. Click If button
3. Type Q30  > 1 | Q41 > 1 | Q47 > 1, click Continue

37
Basic statistical analysis
• Descriptive statistics
– Purposes: 
1. Find wrong entries
2. Have basic knowledge about the sample and 
targeted variables in a study
3. Summarize data

Analyze         Descriptive statistics        Frequency

38
Continue

39
Frequency table

40
1. Skewness: a measure of the 
asymmetry of a distribution.
The normal distribution is
symmetric and has a skewness
value of zero. 
Positive skewness: a long right tail. 
Negative skewness: a long left tail. 
Departure from symmetry: a
skewness value more than twice 
its standard error.
2. Kurtosis: a measure of the extent
to which observations cluster around 
a central point. For a normal 
distribution, the value of the kurtosis 
statistic is zero. Leptokurtic data 
values are more peaked, whereas 
platykurtic data values are flatter and 
more dispersed along the X axis. 
[Figure: normal curve]
41
42
• About the four windows in SPSS

• The basics of managing data files

• The basic analysis in SPSS

• Originally it is an acronym of Statistical
Package for the Social Sciences, but now it
stands for Statistical Product and Service
Solutions

• One of the most popular statistical
packages, which can perform highly
complex data manipulation and analysis
with simple instructions
• Data Editor
Spreadsheet-like system for defining, entering, editing,
and displaying data. Extension of the saved file will be
".sav"
• Output Viewer
Displays output and errors. Extension of the saved file will
be ".spv"
• Syntax Editor
Text editor for syntax composition. Extension of the
saved file will be ".sps"
• Script Window
Provides the opportunity to write full-blown programs
in a BASIC-like language. Extension of the saved file will
be ".sbs"
• Start → All Programs → SPSS Inc → SPSS 16.0 →
SPSS 16.0
• The default window will have the data editor
• There are two sheets in the window:
1. Data view  2. Variable view
• The Data View window
This sheet is visible when you first open the Data Editor
and this sheet contains the data
• Click on the tab labeled Variable View
• This sheet contains information about the data set that is stored
with the dataset
• Name
◦ The first character of the variable name must be alphabetic
◦ Variable names must be unique, and have to be less than 64
characters.
◦ Spaces are NOT allowed.
• Type
◦ Click on the 'type' box. The two basic types of variables
that you will use are numeric and string. This column
enables you to specify the type of variable.
• Width
◦ Width allows you to determine the number of
characters SPSS will allow to be entered for the
variable
• Decimals
◦ Number of decimals
◦ It has to be less than or equal to 16
• Label
◦ You can specify the details of the variable
◦ You can write characters with spaces up to 256
characters
• Values
◦ This is used to suggest which numbers
represent which categories when the
variable represents a category
• Click the cell in the values column as shown below
• For the value and the label, you can put up to 60
characters.
• After defining the values, click Add and then click OK.
• How would you put the following information into SPSS?

Value = 1 represents Male and Value = 2 represents Female
• To save the data file you created, simply click 'file' and
click 'save as.' You can save the file in different forms
by clicking "Save as type."
• Click 'Data' and then click Sort Cases
• Double click 'Name of the students.' Then click
OK.
• How would you sort the data by the
'Height' of students in descending order?
• Answer
◦ Click data, sort cases, double click 'height of
students,' click 'descending,' and finally click OK.
• Click 'Transform' and then click 'Compute Variable…'
• Example: Adding a new variable named 'lnheight' which is
the natural log of height
◦ Type lnheight in the 'Target Variable' box. Then type
'ln(height)' in the 'Numeric Expression' box. Click OK.
• A new variable 'lnheight' is added to the table
• Create a new variable named "sqrtheight"
which is the square root of height.
• Answer: type sqrtheight as the Target Variable and
'sqrt(height)' as the Numeric Expression.
• Frequencies
◦ This analysis produces frequency tables showing
frequency counts and percentages of the values of
individual variables.
• Descriptives
◦ This analysis shows the maximum, minimum,
mean, and standard deviation of the variables
• Linear regression analysis
◦ Linear Regression estimates the coefficients of the
linear equation
• Open 'Employee data.sav' from the SPSS samples
◦ Go to "File," "Open," and click Data
◦ Go to "Program Files," "SPSSInc," "SPSS16," and
"Samples" folder.
◦ Open "Employee Data.sav" file
• Click 'Analyze,' 'Descriptive statistics,' then
click 'Frequencies'
• Click gender and put it into the variable box.
• Click 'Charts.'
• Then click 'Bar charts' and click 'Continue.'
• Finally click OK in the Frequencies box.
• Click 'Analyze,' 'Descriptive statistics,' then
click 'Frequencies.'
• Put 'Gender' in the Variable(s) box.
• Then click 'Charts,' 'Bar charts,' and click
'Continue.'
• Click 'Paste.'
• Highlight the commands in the Syntax editor
and then click the run icon.
• You can do the same thing by right-clicking the
highlighted area and then clicking 'Run
Current'
• Do a frequency analysis on the
variable "minority"

• Create pie charts for it

• Do the same analysis using the
syntax editor
• Click 'Analyze,' 'Descriptive statistics,' then
click 'Descriptives…'
• Click 'Educational level' and 'Beginning
Salary,' and put them into the variable box.
• Click Options
• The Options box allows you to analyze other
descriptive statistics besides the mean and std.
• Click 'variance' and 'kurtosis'
• Finally click 'Continue'
• Finally click OK in the Descriptives box. You will
be able to see the result of the analysis.
• Click 'Analyze,' 'Regression,' then click
'Linear' from the main menu.
• For example, let's analyze the model salbegin = β0 + β1·edu + ε
• Put 'Beginning Salary' as Dependent and 'Educational Level' as
Independent.
• Clicking OK gives the result
• Click 'Graphs,' 'Legacy Dialogs,'
'Interactive,' and 'Scatterplot' from the
main menu.
• Drag 'Current Salary' into the vertical axis box
and 'Beginning Salary' into the horizontal axis box.
• Click the 'Fit' bar. Make sure the Method is
regression in the Fit box. Then click 'OK'.
• Find out whether or not the previous
experience of workers has any effect on their
beginning salary.
◦ Take the variables "salbegin" and "prevexp" as
dependent and independent variables respectively.

• Plot the regression line for the above analysis
using the "scatter plot" menu.
(Click on the "fit" tab to make
sure the method is regression.)
For further Questions:
kentaka@mail.uri.edu
Standard Deviation
Two classes took a
recent quiz. There
were 10 students in
each class, and each
class had an average
score of 81.5
Since the averages are
the same, can we
assume that the students
in both classes all did
pretty much the same on
the exam?
The answer is… No.

The average (mean)
does not tell us anything
about the distribution or
variation in the grades.
Here are Dot-Plots of the
grades in each class:
[Figure: dot-plots of the two classes' grades, with the mean marked]
So, we need to come up
with some way of
measuring not just the
average, but also the
spread of the distribution
of our data.
Why not just give an
average and the range
of data (the highest and
lowest values) to
describe the distribution
of the data?
Well, for example, let's say
from a set of data, the
average is 17.95 and the
range is 23.

But what if the data looked
like this?
[Figure: dot-plot with the average and range marked. Most of
the numbers sit in one small area and are not evenly
distributed throughout the range.]
The Standard Deviation
is a number that
measures how far away
each number in a set of
data is from their mean.
If the Standard Deviation is
large, it means the numbers
are spread out from their
mean.

If the Standard Deviation is
small, it means the numbers
are close to their mean.
Here are the scores on the
math quiz for Team A:
72, 76, 80, 80, 81, 83, 84, 85, 85, 89
Average: 81.5
The Standard Deviation measures how far away each
number in a set of data is from their mean.
For example, start with the lowest score, 72. How far
away is 72 from the mean of 81.5?
72 - 81.5 = - 9.5

Or, start with the highest score, 89. How far away is 89
from the mean of 81.5?
89 - 81.5 = 7.5
So, the first step to finding the Standard
Deviation is to find all the distances from the
mean.

Score   Distance from Mean
72      -9.5
76
80
80
81
83
84
85
85
89       7.5
Score   Distance from Mean
72      -9.5
76      -5.5
80      -1.5
80      -1.5
81      -0.5
83       1.5
84       2.5
85       3.5
85       3.5
89       7.5
Next, you need to square each of the distances
to turn them all into positive numbers.

Score   Distance   Distance²
72      -9.5       90.25
76      -5.5       30.25
80      -1.5
80      -1.5
81      -0.5
83       1.5
84       2.5
85       3.5
85       3.5
89       7.5
Score   Distance   Distance²
72      -9.5       90.25
76      -5.5       30.25
80      -1.5        2.25
80      -1.5        2.25
81      -0.5        0.25
83       1.5        2.25
84       2.5        6.25
85       3.5       12.25
85       3.5       12.25
89       7.5       56.25
Add up all of the squared distances:
90.25 + 30.25 + 2.25 + 2.25 + 0.25 + 2.25 + 6.25 + 12.25
+ 12.25 + 56.25 = 214.5
Divide by (n - 1), where n is how many numbers
you have:
214.5 / (10 - 1) = 23.8
Finally, take the square root of that average
squared distance:
√23.8 = 4.88
This is the Standard Deviation: 4.88
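
Written as a single formula (the sample standard deviation, dividing by n − 1):

$$ s = \sqrt{\frac{\sum (x-\bar{x})^2}{n-1}} = \sqrt{\frac{214.5}{10-1}} = \sqrt{23.8} \approx 4.88 $$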
Now find the Standard Deviation for the other class's grades:

Score   Distance   Distance²
57     -24.5      600.25
65     -16.5      272.25
83       1.5        2.25
94      12.5      156.25
95      13.5      182.25
96      14.5      210.25
98      16.5      272.25
93      11.5      132.25
71     -10.5      110.25
63     -18.5      342.25

Sum = 2280.5;   2280.5 / (10 - 1) = 253.4;   √253.4 = 15.91
Now, let's compare the two classes again:

                      Team A   Team B
Average on the Quiz    81.5     81.5
Standard Deviation     4.88    15.91
Variance and Standard Deviation
Variance: a measure of how data
points differ from the mean
• Data Set 1: 3, 5, 7, 10, 10
• Data Set 2: 7, 7, 7, 7, 7

What is the mean and median of each data set?
Data Set 1: mean = 7, median = 7
Data Set 2: mean = 7, median = 7

But we know that the two data sets are not identical! The variance
shows how they are different.

We want to find a way to represent these two data sets numerically.
How to Calculate?
• If we conceptualize the spread of a distribution as the
extent to which the values in the distribution differ from
the mean and from each other, then a reasonable
measure of spread might be the average deviation, or
difference, of the values from the mean:
Σ(x − X̄) / N
• Although this might seem reasonable, this expression always equals 0,
because the negative deviations about the mean always cancel out the
positive deviations about the mean.
• We could just drop the negative signs, which is the same mathematically as
taking the absolute value; this is known as the mean deviation.
• The concept of absolute value does not lend itself to the kind of advanced
mathematical manipulation necessary for the development of inferential
statistical formulas.
• The average of the squared deviations about the mean is called the variance.

σ² = Σ(x − X̄)² / N         (population variance)

s² = Σ(x − X̄)² / (n − 1)   (sample variance)
Score X (n = 5): 3, 5, 7, 10, 10   Total = 35

The mean is 35/5 = 7.


        Score X   X − X̄       (X − X̄)²
1         3       3 - 7 = -4     16
2         5       5 - 7 = -2      4
3         7       7 - 7 =  0      0
4        10      10 - 7 =  3      9
5        10      10 - 7 =  3      9
Totals   35                      38

s² = Σ(x − X̄)² / n = 38 / 5 = 7.6
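
Note that the slide divides by n = 5 here, treating the five scores as a complete population; with the sample formula (n − 1) given above, the same data yield a larger value:

$$ \sigma^2 = \frac{38}{5} = 7.6 \qquad\text{vs.}\qquad s^2 = \frac{38}{5-1} = 9.5 $$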
Example 2

Dive Mark Myrna


1 28 27
2 22 27
3 21 28
4 26 6
5 18 27
Find the mean, median, and range.

         Mark   Myrna
mean      23     23
median    22     27
range     10     22

What can be said about this data?

Due to the outlier (Myrna's dive 4 score of 6), the median is more typical of overall performance.

Which diver was more consistent?


Dive    Mark's Score X   X − X̄   (X − X̄)²
1            28             5        25
2            22            -1         1
3            21            -2         4
4            26             3         9
5            18            -5        25
Totals      115             0        64

Mark's Variance = 64 / 5 = 12.8

Myrna's Variance = 362 / 5 = 72.4

Conclusion: Mark has a lower variance, therefore he is more consistent.


standard deviation - a measure of
variation of scores about the mean
• Can think of standard deviation as the average distance to
the mean; although that's not numerically accurate, it's
conceptually helpful. All ways of saying the same thing:
higher standard deviation indicates higher spread, less
consistency, and less clustering.

• sample standard deviation:
s = √( Σ(x − X̄)² / (n − 1) )

• population standard deviation:
σ = √( Σ(x − μ)² / N )
Another formula
• Definitional formula for variance for data in a frequency
distribution:
S² = Σ( (X − X̄)² · f ) / Σf
• Definitional formula for standard deviation for data in a
frequency distribution:
S = √( Σ( (X − X̄)² · f ) / Σf )
The mean is 23.

Round-off rule: carry one more decimal place than was
present in the original data.

Myrna's Score X    f    X − X̄   (X − X̄)²   (X − X̄)² × f
28                 1      5        25           25
27                 3      4        16           48
6                  1    -17       289          289
Totals           115     5                     362

Variance = S² = 362 / 5 = 72.4

Standard Deviation = √72.4 = 8.5
Bell shaped curve
• empirical rule for data (68-95-99.7) - only applies to a set of
data having a distribution that is approximately bell-shaped:
(figure pg 220)
• ≈ 68% of all scores fall within 1 standard deviation of the
mean
• ≈ 95% of all scores fall within 2 standard deviations of the
mean
• ≈ 99.7% of all scores fall within 3 standard deviations of the
mean
T-Test
• Hypothesis testing involves making a decision concerning
some hypothesis or statement about a population parameter,
such as the population mean μ, using the sample mean X̄ to
decide whether this statement about the value of μ is valid or
not.
• The steps of hypothesis testing:

1- The first step is to formulate a null hypothesis, written H0. The
statement for H0 is usually expressed as an equation or
inequality as follows:
H0: μ = given value
H0: μ ≤ given value
H0: μ ≥ given value
1
Also in this step an alternative hypothesis, written Ha, is stated:
a statement that indicates the opinion of the conductor of the test
as to the actual value of μ. Ha is expressed as follows:
Ha: μ ≠ given value
Ha: μ < given value
Ha: μ > given value

We conduct a hypothesis test on a given value of μ to find out if
actual observation would lead us to reject the stated value.

2
T-Test
The alternative hypothesis suggests the direction of the actual
value of the parameter relative to the stated value. A statement
of Ha in the form of a "not equal" inequality indicates that the
investigator has no opinion as to whether the actual value of μ is
more than or less than the stated value, but the feeling is that the
stated value is incorrect. In this case the test is a two-tail test.
Statements in the form of a strictly greater than or strictly less than
relationship indicate that the investigator has an opinion as to the
direction of the value of the parameter relative to the stated
value. In this case it is called a one-tail test.

3
T-Test
• 2- State the level of significance of the test and the
corresponding Z values (for large sample tests) or the
corresponding T values (for small sample tests). The
hypothesis test is frequently conducted at the 5%, 1% and 10%
levels of significance. For a test
conducted at any other level of significance, we simply use the
normal distribution table to determine a corresponding Z value.
• 3- Calculate the test statistic for the sample that has been taken.
• There are three cases:

4
T-Test
• Case 1: The variable has a normal distribution and σ² is known.
In this case the test statistic is
Z = (X̄ − μ0) / (σ/√n),
which has a standard normal distribution if μ = μ0 in H0.
• Case 2: The variable has a normal distribution and σ² is
unknown. The test statistic is
T = (X̄ − μ0) / (S/√n),
which has a t(n−1) distribution if H0 is true.
5
T-Test
• Case 3: The variable is not normal but n is large (n > 30);
σ² may be known or unknown.
• The test statistic is
Z = (X̄ − μ0) / (σ/√n)   if σ² is known,
or
Z = (X̄ − μ0) / (s/√n)   if σ² is unknown.
• By the central limit theorem it has approximately a standard
normal distribution N(0, 1) if H0 is true.

6
T-Test
4- Determine the boundary (or boundaries) of the
rejection regions using either X̄c or Zc values. A critical value is
the boundary or limit value that requires us to reject the
statement of the null hypothesis.

7
T-Test

[Figure: sampling distribution with rejection regions in both tails,
below lower X̄c and above upper X̄c, centered at μ]

In a non-directional test there are two critical values, when:
Ha: μ ≠ μ0

8
T-Test

[Figure: rejection region in the upper tail, above upper X̄c]

In a directional test there is one critical value (upper boundary) when:
Ha: μ > μ0

9
T-Test

[Figure: rejection region in the lower tail, below lower X̄c]

In a directional test there is one critical value (lower boundary) when:
Ha: μ < μ0

10
• The critical value X̄c is simply the maximum or minimum value that we are
willing to accept as being consistent with the stated parameter. The mean of the
distribution is given by:
μ(X̄) = μ
• The standard deviation of the distribution is given by:
σ(X̄) = σ/√n
• 5- Formulate a decision rule on the basis of the boundary values obtained in step
4. When we conduct a hypothesis test, we are required to make one of two
decisions:
• a- Reject H0, or
• b- Accept H0

11
It is possible to make two errors in decision. One error is
called a type I error or α-error. We make a type I
error whenever we reject the statement of H0 when it is in
fact true. The probability of making a type I error is the
level of significance of the test. The second error we can
make in a hypothesis test is called a type II error, or β-
error. We commit a type II error if we fail to reject the
statement of H0 when it is in fact false. The four combinations
of truth values of H0 and the resulting decisions are
summarized below:

12
                 H0 True            H0 False
Reject H0        Type I error       Correct decision
Accept H0        Correct decision   Type II error

13
When we lower the level of significance of a
hypothesis test, we always increase the possibility of
committing a β-error.
6- State a conclusion for the hypothesis test based on the
sample data obtained and the decision rule stated in the steps.

14
• P-value of a test:
• The p-value is the probability of getting a value more
extreme than the observed value of the test statistic, Zobs.
• When Ha is ≠: p-value = 2P(Z > |Zobs|)
• When Ha is >: p-value = P(Z > Zobs)
• When Ha is <: p-value = P(Z < Zobs)

15
• If we have a T statistic with a t(n−1) distribution and
observed value tobs, these p-values become:
• ≠ alternative: p-value = 2P(t(n−1) > |tobs|)
• > alternative: p-value = P(t(n−1) > tobs)
• < alternative: p-value = P(t(n−1) < tobs)

16
• Thus H0 is rejected if p-value < α. When data is
collected from a normally distributed population and the
sample size is small, the t values of the Student t
distribution must be used in the hypothesis test, not the Z
values of the normal distribution. This is due to the fact
that the central limit theorem does not apply when n < 30.

17
• Ex:
• Suppose we measure the sulfur content (as a percent) of
15 samples of crude oil from a particular Middle Eastern
area, obtaining:
• 1.9, 2.3, 2.9, 2.5, 2.1, 2.7, 2.8, 2.6, 2.6, 2.5, 2.7, 2.2, 2.8, 2.7, 3.0
• Assume that sulfur contents are normally distributed. Can
we conclude that the average sulfur content in this area is
less than 2.6? Use a level of significance of .05.

18
n = 15   X̄ = 2.533   S = .3091   α = .05

H0: μ = 2.6
Ha: μ < 2.6

19
[Figure: t distribution with lower-tail rejection region of area .05
(acceptance region .95); critical value −t(14, .05) = −1.761]

20
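The calculation shown on the (screenshotted) slides can be reconstructed from the figures above:

$$ T = \frac{\bar{X} - \mu_0}{S/\sqrt{n}} = \frac{2.533 - 2.6}{0.3091/\sqrt{15}} \approx -0.84 $$

Since −0.84 lies above the critical value −1.761, the test statistic falls in the acceptance region: we cannot conclude that the average sulfur content is less than 2.6.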
• Testing for the Difference in Two Population Means:
• Often we have two populations for which we would
like to compare the means. Independent random
samples of sizes n1 and n2 are selected from the two
populations, with no relationship between the elements
drawn from the two populations. The statistical
hypotheses are given by:

23
H0: μ1 = μ2   vs   Ha: μ1 ≠ μ2
              or   Ha: μ1 > μ2
              or   Ha: μ1 < μ2

24
• There are three cases, which depend on what is known
about the population variances σ1² and σ2².
• Case 1:
• Population variances are known for normal populations
(or non-normal populations with both n1 and n2 large).
In this case the test statistic is:
Z = (X̄1 − X̄2) / √( σ1²/n1 + σ2²/n2 )

25
• Case 2:
• Population variances are unknown but assumed to be equal
(σ1² = σ2² = σ²) in normal populations. In this case, we pool our
estimates to get the pooled two-sample variance:

Sp² = ( (n1 − 1)S1² + (n2 − 1)S2² ) / (n1 + n2 − 2)

26
• And the test statistic is
T = (X̄1 − X̄2) / √( Sp² (1/n1 + 1/n2) ),
which has a t(n1+n2−2) distribution if H0 is true.

27
• Case 3:
• σ1² and σ2² are unknown and unequal in normal
populations. In this case the test statistic is given by:
T′ = (X̄1 − X̄2) / √( S1²/n1 + S2²/n2 ),
which does not have a known distribution.

28
Ex:
The amount of solar ultraviolet light of wavelength from 290 to 320
nm which reached the earth's surface in the Riyadh area was
measured for independent samples of days in cooler months
(October to March) and in warmer months (April to September):
• Cooler: 5.31, 4.36, 3.71, 3.74, 4.51, 4.58, 4.64, 3.83, 3.16, 3.67, 4.34, 2.95,
3.62, 3.29, 2.45.
• Warmer: 4.07, 3.83, 4.75, 4.84, 5.03, 5.48, 4.11, 4.15, 3.9, 4.39, 4.55, 4.91,
4.11, 3.16, 2.99, 3.01, 3.5, 3.77.

29
• Assuming normal distributions with equal variances,
test whether there is a difference in the average ultraviolet
light reaching Riyadh in the cooler and warmer months.
Use a level of significance of .05.

30
n1 = 15   n2 = 18

X̄1 = 3.877   X̄2 = 4.142

S1 = .751   S2 = .709

H0: μ1 = μ2
Ha: μ1 ≠ μ2

31
• The pooled two-sample variance is
Sp² = ( (n1 − 1)S1² + (n2 − 1)S2² ) / (n1 + n2 − 2) = .531

• And the test statistic is
T = (X̄1 − X̄2) / √( Sp² (1/n1 + 1/n2) ) = −1.033

32
[Figure: t distribution with two-tailed rejection regions of area .025
each (acceptance region .95); critical values ±t(31, .025) = ±2.0423]

33
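The decision shown on the (screenshotted) slides follows directly:

$$ |T| = 1.033 < t_{31,\,.025} = 2.0423 $$

so the test statistic falls in the acceptance region.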
• Since the value of the test statistic is in the
acceptance region, H0 is accepted at α = .05.
• It means that there is no difference in the average
ultraviolet light reaching Riyadh in the cooler and
warmer months.

36
• Dependent Samples:

• The method of comparing parameters of populations
using paired dependent samples requires that we pair the
items of data as we sample them from the two
populations. Furthermore, the size of the two samples
selected from both populations is the same,
that is n1 = n2 = n

37
• For each Xi (the elements of the sample before the
experiment) and Yi (the elements of the sample after the
experiment) obtained in the two samples, we compute a
value di of a random variable D which represents the
difference between the two populations; n is the
number of items of data obtained in each of the two
samples.

38
The samples drawn from the two populations are
therefore converted to a single sample – a sample of di's.
The mean, d̄, and the standard deviation, Sd, of the
distribution of di's are obtained as follows:

d̄ = Σdi / n = Σ(xi − yi) / n

Sd = √( Σ(di − d̄)² / (n − 1) )

39
• We are interested in testing one of the hypotheses:
H0: μd = 0   vs   Ha: μd ≠ 0
              or   Ha: μd > 0
              or   Ha: μd < 0
• Thus the quantity
T = (d̄ − μd) / (Sd/√n)
has a t(n−1) distribution.

40
Ex:
In an experiment comparing two feeding methods for
calves, eight pairs of twins were used-one twin receiving
Method A and the other twin receiving Method B. At the
end of a given time, the calves were slaughtered and
cooked, and the meat was rated for its taste (with a higher
number indicating a better taste).

41
Twin pair Method A Method B
1 27 23
2 37 28
3 31 30
4 38 32
5 29 27
6 35 29
7 41 36
8 37 31

42
Assuming approximate normality, test if the average taste
score for calves fed by Method B is less than the average
taste for calves fed by Method A. Use α = .05 .

43
di     di²
4      16
9      81
1       1
6      36
2       4
6      36
5      25
6      36
Σ: 39  235
44
H0: μd = 0   vs   Ha: μd > 0

d̄ = Σdi / n = 39/8 = 4.875

Sd = √( (Σdi² − n·d̄²) / (n − 1) ) = √( (235 − 8(4.875)²) / 7 ) = 2.532

45
The test statistic is

T = (d̄ − μd) / (Sd/√n) = 5.447

46
[Figure: t distribution with upper-tail rejection region of area .05
(acceptance region .95); critical value t(n−1, α) = t(7, .05) = 1.8946]

Since T = 5.447 > 1.8946, H0 is rejected at α = .05: the average
taste score for Method A is higher than for Method B.

47
Quality Control
• A “defect” is an instance of a failure to meet a requirement
imposed on a unit with respect to a single quality characteristic. In
inspection or testing, each unit is checked to see if it does or does
not contain any defects. If every dosage unit could
be tested, the expense would probably be prohibitive both to
manufacturer and consumer. It may also cause misclassification
of items and other errors. Quality can be accurately and precisely
estimated by testing only part of the total material (a sample). This
requires small samples for inspection or analysis.

51
• Data obtained from this sampling can then be treated
statistically to estimate population parameters. After
inspecting (n) units we will have found, say, (d) of them to
be defectives and (n − d) of them to be good ones. On the
other hand we may count and record the number of
defects, c, we find on a single unit. This count may be
0, 1, 2, …. Such an approach of counting defects on a
unit becomes especially useful if most of the units contain
one or more defects.

52
• Control charts can be applied during in-process
manufacturing operations, for finished product
characteristics, and in research and development for
repetitive procedures. We may always convert a
measurable characteristic of a unit to an attribute by
setting limits, say L (lower bound) and U (upper bound),
for x. Then if x lies between them, the unit is a good one; if
outside, it is a defective one. An example for the
control chart is the tablet weight.

53
• We are interested in ensuring that tablet weights remain
close to a target value under "statistical control". To
achieve this objective, we will periodically sample a group
of tablets, measuring the mean weight and variability.
Variability can be calculated on the basis of the standard
deviation or the range. The range is the difference between
the lowest and highest value.

54
• If the sample size is not large (<10), the range is an
efficient estimator of the standard deviation. The mean
weight and variability of each sample (subgroup) are
plotted sequentially as a function of time. The control
chart is a graph that has time or order of submission of
sequential lots on the X axis and the average test result on
the Y axis. The subgroups should be as homogeneous as
possible relative to the overall process. They are usually (but
not always) taken as units manufactured close in time.

55
• Four to five items per subgroup is usually an adequate
sample size. In our example, (10) tablets are individually
weighed at approximately (1) hour intervals. The mean
and range are calculated for each of the subgroup
samples. As long as the mean and range of the 10-tablet
samples do not vary "too much" from subgroup to
subgroup, the product is considered to be in control (it
means that the observed variation is due only to the
random, uncontrolled variation inherent in the process).

56
• We will define upper and lower limits for the mean and
range of the subgroups. The construction of these limits is
based on the normal distribution. In particular, a value more
than (3) standard deviations from the mean is highly
unlikely and can be considered to be probably due to some
systematic, assignable cause. The average line (the target
value) may be determined from the history of the product
with regular updating, or may be determined from the product
specifications.

57
• The action lines (the limits) are constructed to represent
±3 standard deviations (±3σ limits) from the
target value. The upper and lower limits for the mean (X̄)
chart are given by:
X̄ ± A·R̄,
where R̄ = ΣR / K
is the average range, K is the number of samples
(subgroups), and A is a factor which is obtained from a table
according to the sample size.

58
• The central line, the upper limit, and the lower limit for the range
chart are given by:
• Central line = R̄ = ΣR / K
• Lower limit = D_L·R̄
• Upper limit = D_U·R̄

59
• Where D_L and D_U are factors which are
obtained from a table according to the sample size. It is
noticed that the sample size is constant.
• Ex:
• Tablet weights and ranges from a tablet manufacturing
process (data are the average and range of 10 tablets):

60
Date Time Mean Range
X R
3/1 11 a.m. 302.4 16
12 p.m. 298.4 13
1 p.m. 300.2 10
2 p.m. 299 9
3/5 11 a.m. 300.4 13
12 p.m. 302.4 5
1 p.m. 300.3 12
2 p.m. 299 17
61
Date Time Mean Range R
X

3/9 11 a.m. 300.8 18


12 p.m. 301.5 6
1 p.m. 301.6 7
2 p.m. 301.3 8
3/11 11 a.m. 301.7 12
12 p.m. 303 9
1 p.m. 300.5 9
2 p.m. 299.3 11
62
Date Time Mean X Range R

3/16 11 a.m. 300 13


12 p.m. 299.1 8
1 p.m. 300.1 8
2 p.m. 303.5 10
3/22 11 a.m. 297.2 14
12 p.m. 296.2 9
1 p.m. 297.4 11
2 p.m. 296 12
63
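As a check, the average range over the K = 24 subgroups listed above is

$$ \bar{R} = \frac{\sum R}{K} = \frac{260}{24} = 10.833 $$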
X̄ chart

The central line (the target value) is X̄ = 300

A = .31 at n = 10,   R̄ = 10.833

L, U = X̄ ± A·R̄ = 300 ± (.31)(10.833) = 300 ± 3.358

Lower Limit = 296.642

Upper Limit = 303.358

64
R chart

The central line = R̄ = 10.833

D_L = .22,  D_U = 1.78 at n = 10

Lower Limit = D_L·R̄ = 2.383

Upper Limit = D_U·R̄ = 19.283

65
[Figure: X̄ control chart of the subgroup means by date (3/1 to 3/22),
with CL = 300, UCL = 303.358, LCL = 296.642]

66
[Figure: R control chart of the subgroup ranges by date (3/1 to 3/22),
with CL = 10.833, UCL = 19.283, LCL = 2.383]
67
Non‐parametric tests
• Note: When valid, use parametric
• Commonly used
Wilcoxon
Chi square etc.
• Performance comparable to parametric
• Useful for non‐normal data
• If normalization not possible
• Note: CI derivation‐difficult/impossible
Wilcoxon signed rank test

To test difference between paired 
data
STEP 1
• Exclude any differences which are zero

• Put the rest of differences in ascending order

• Ignore their signs

• Assign them ranks

• If any differences are equal, average their ranks
STEP 2

• Count up the ranks of +ives as T+

• Count up the ranks of –ives as T‐
STEP 3
• If there is no difference between drug (T+) and 
placebo (T‐), then T+  & T‐ would be similar

• If there were a difference  
one sum would be much smaller and    
the other much larger than expected

• The smaller sum is denoted as T

• T = smaller of T+ and T‐
STEP 4
• Compare the value obtained with the critical 
values (5%, 2% and 1% ) in table

• N is the number of differences that were 
ranked (not the total number of differences)

• So the zero differences are excluded
Hours of sleep Rank
Patient Drug Placebo Difference Ignoring sign
1 6.1 5.2 0.9 3.5*
2 7.0 7.9 -0.9 3.5*
3 8.2 3.9 4.3 10
4 7.6 4.7 2.9 7
5 6.5 5.3 1.2 5
6 8.4 5.4 3.0 8
7 6.9 4.2 2.7 6
8 6.7 6.1 0.6 2
9 7.4 3.8 3.6 9
10 5.8 6.3 -0.5 1
3rd & 4th ranks are tied hence averaged
T= smaller of T+ (50.5) and T- (4.5)
Here T=4.5 significant at 2% level indicating the drug (hypnotic) is
more effective than placebo
Wilcoxon rank sum test

• To compare two groups

• Consists of 3 basic steps
Non‐parametric equivalent of 
t test
Step 1

• Rank the data of both the groups in 
ascending order

• If any values are equal average their ranks
Step 2

• Add up the ranks in group with smaller 
sample size

• If the two groups are of the same size either 
one may be picked

• T= sum of ranks in group with smaller sample 
size
Step 3
• Compare this sum with the critical ranges 
given in table

• Look up the rows corresponding to the sample 
sizes of the two groups

• A range will be shown for the 5% significance 
level
Non-smokers (n=15) Heavy smokers (n=14)
Birth wt (Kg) Rank Birth wt (Kg) Rank
3.99 27 3.18 7
3.79 24 2.84 5
3.60* 18 2.90 6
3.73 22 3.27 11
3.21 8 3.85 26
3.60* 18 3.52 14
4.08 28 3.23 9
3.61 20 2.76 4
3.83 25 3.60* 18
3.31 12 3.75 23
4.13 29 3.59 16
3.26 10 3.63 21
3.54 15 2.38 2
3.51 13 2.34 1
2.71 3
Sum=272 Sum=163

* 17, 18 & 19are tied hence the ranks are averaged


NON-PARAMETRIC TESTS

Statistical tests fall into two categories:
(i) Parametric tests
(ii) Non-parametric tests

Parametric tests make the following assumptions:
• the population is normally distributed;
• the variances are homogeneous.

If any of these assumptions is untrue, the results of the test may be invalid, and it is safest to use a non-parametric test.
ADVANTAGES OF NON-PARAMETRIC TESTS
• If the sample size is small, there may be no alternative.
• They can be applied to nominal or ordinal data.
• They are much easier to apply.

DISADVANTAGES OF NON-PARAMETRIC TESTS
(i) They discard information by converting observations to ranks.
(ii) Parametric tests are more powerful.
(iii) Tables of critical values may not be easily available.
(iv) They serve only for hypothesis testing; confidence limits cannot readily be calculated.
Non-parametric tests
• Note: when parametric tests are valid, use them; the performance of non-parametric tests is otherwise comparable.
• Commonly used:
  Wilcoxon signed-rank test
  Wilcoxon rank-sum test
  Spearman rank correlation
  Chi-square test, etc.
• Useful for non-normal data; if possible, use a normalizing transformation first.
• If normalization is not possible, use a non-parametric test.
• Note: deriving confidence intervals is difficult or impossible.
Wilcoxon signed rank test

• Used to test differences between paired data.
EXAMPLE
Hours of sleep
Patient Drug Placebo

1 6.1 5.2
2 7.0 7.9
3 8.2 3.9
4 7.6 4.7
5 6.5 5.3
6 8.4 5.4
7 6.9 4.2
8 6.7 6.1
9 7.4 3.8
10 5.8 6.3

Null hypothesis: Hours of sleep are the same with the drug as with placebo.
STEP 1
• Exclude any differences which are zero.
• Ignoring their signs, put the remaining differences in ascending order.
• Assign them ranks.
• If any differences are equal, average their ranks.
STEP 2
• Count up the ranks of the positive differences as T+.
• Count up the ranks of the negative differences as T-.
STEP 3
• If there is no difference between drug (T+) and placebo (T-), then T+ and T- will be similar.
• If there is a difference, one sum will be much smaller and the other much larger than expected.
• The larger sum is denoted T: T = larger of T+ and T-. (Some tables are instead entered with the smaller sum; either convention works with its matching table of critical values.)
STEP 4
• Compare the value obtained with the critical values (5%, 2% and 1%) in the table.
• N is the number of differences that were ranked (not the total number of differences), since the zero differences are excluded.


Hours of sleep Rank
Patient Drug Placebo Difference Ignoring sign
1 6.1 5.2 0.9 3.5*
2 7.0 7.9 -0.9 3.5*
3 8.2 3.9 4.3 10
4 7.6 4.7 2.9 7
5 6.5 5.3 1.2 5
6 8.4 5.4 3.0 8
7 6.9 4.2 2.7 6
8 6.7 6.1 0.6 2
9 7.4 3.8 3.6 9
10 5.8 6.3 -0.5 1
* The 3rd and 4th ranks are tied, hence averaged. T = larger of T+ (50.5) and T- (4.5).
Here the calculated value of T = 50.5 exceeds the tabulated value of 47 (at 5%), so the result is significant at the 5% level, indicating that the drug (a hypnotic) is more effective than placebo.
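For readers who want to reproduce this result by machine, here is a minimal sketch using SciPy's implementation of the signed-rank test (scipy.stats.wilcoxon). Note that SciPy reports the smaller of T+ and T- as its statistic and returns a p-value rather than using printed critical-value tables.

from scipy.stats import wilcoxon

drug    = [6.1, 7.0, 8.2, 7.6, 6.5, 8.4, 6.9, 6.7, 7.4, 5.8]
placebo = [5.2, 7.9, 3.9, 4.7, 5.3, 5.4, 4.2, 6.1, 3.8, 6.3]

# Paired test on the differences; zero differences are dropped automatically.
result = wilcoxon(drug, placebo)
print(result.statistic)   # 4.5 -> the smaller rank sum (T-)
print(result.pvalue)      # about 0.02, significant at the 5% level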
Wilcoxon rank sum test

• Used to compare two independent groups; it is the non-parametric equivalent of the two-sample t test.
• Consists of 3 basic steps.
Non-smokers (n=15) Heavy smokers (n=14)
Birth wt (Kg) Birth wt (Kg)
3.99 3.18
3.79 2.84
3.60* 2.90
3.73 3.27
3.21 3.85
3.60* 3.52
4.08 3.23
3.61 2.76
3.83 3.60*
3.31 3.75
4.13 3.59
3.26 3.63
3.54 2.38
3.51 2.34
2.71
Null hypothesis: Birth weight is the same for non-smokers and heavy smokers.
Step 1
• Rank the data of both groups together in ascending order.
• If any values are equal, average their ranks.
Step 2
• Add up the ranks in the group with the smaller sample size.
• If the two groups are of the same size, either one may be picked.
• T = sum of the ranks in the group with the smaller sample size.
Step 3
• Compare this sum with the critical ranges given in the table.
• Look up the row corresponding to the sample sizes of the two groups.
• A range will be shown for the 5% significance level.
Non-smokers (n=15) Heavy smokers (n=14)
Birth wt (Kg) Rank Birth wt (Kg) Rank
3.99 27 3.18 7
3.79 24 2.84 5
3.60* 18 2.90 6
3.73 22 3.27 11
3.21 8 3.85 26
3.60* 18 3.52 14
4.08 28 3.23 9
3.61 20 2.76 4
3.83 25 3.60* 18
3.31 12 3.75 23
4.13 29 3.59 16
3.26 10 3.63 21
3.54 15 2.38 2
3.51 13 2.34 1
2.71 3
Sum=272 Sum=163
* Ranks 17, 18 and 19 are tied (three values of 3.60 kg), hence averaged to 18.
Here the calculated value of T = 163 (the rank sum of the smaller group, the heavy smokers); the tabulated value for group sizes (14, 15) is 151. The mean birth weights of non-smokers and heavy smokers are therefore not the same; they are significantly different.
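The same comparison can be run in Python. SciPy exposes it as the Mann-Whitney U test (scipy.stats.mannwhitneyu), which is equivalent to the Wilcoxon rank-sum test, so a p-value is reported instead of a critical range.

from scipy.stats import mannwhitneyu

non_smokers   = [3.99, 3.79, 3.60, 3.73, 3.21, 3.60, 4.08, 3.61,
                 3.83, 3.31, 4.13, 3.26, 3.54, 3.51, 2.71]
heavy_smokers = [3.18, 2.84, 2.90, 3.27, 3.85, 3.52, 3.23, 2.76,
                 3.60, 3.75, 3.59, 3.63, 2.38, 2.34]

# Two-sided test: are the two birth-weight distributions the same?
result = mannwhitneyu(non_smokers, heavy_smokers, alternative="two-sided")
print(result.pvalue)   # about 0.04 -> significant at the 5% level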
Spearman's Rank Correlation Coefficient

• Based on the ranks of the items rather than their actual values.
• Can also be used with the actual values.

Examples
• To assess the correlation between the honesty and the wisdom of the boys of a class.
• To find the degree of agreement between the judgements of two examiners or two judges.
R (rank correlation coefficient) = 1 - (6 ΣD²) / (N(N² - 1)),

where D = the difference between the ranks of the two items and N = the number of observations.

Note: -1 ≤ R ≤ 1.

(i) When R = +1: perfect positive correlation, or complete agreement in the same direction.
(ii) When R = -1: perfect negative correlation, or complete agreement in the opposite direction.
(iii) When R = 0: no correlation.
Computation
i. Assign ranks to the values of the items. Generally the item with the highest value is ranked 1, and the others are given ranks 2, 3, 4, ... in decreasing order of their values.
ii. Find the difference D = R1 - R2, where R1 = rank of x and R2 = rank of y. Note that ΣD = 0 (always).
iii. Calculate D² and then find ΣD².
iv. Apply the formula.

If there is a tie between two or more items, give each the average rank. If m is the number of items of equal rank, the factor (m³ - m)/12 is added to ΣD²; if there is more than one such tie, this factor is added once for each tie.
Student No. | Rank in Maths (R1) | Rank in Stats (R2) | D = R1 - R2 | D²
 1 |  1 |  3 | -2 |  4
 2 |  3 |  1 |  2 |  4
 3 |  7 |  4 |  3 |  9
 4 |  5 |  5 |  0 |  0
 5 |  4 |  6 | -2 |  4
 6 |  6 |  9 | -3 |  9
 7 |  2 |  7 | -5 | 25
 8 | 10 |  8 |  2 |  4
 9 |  9 | 10 | -1 |  1
10 |  8 |  2 |  6 | 36
N = 10, ΣD = 0, ΣD² = 96

Hence R = 1 - 6(96)/(10(10² - 1)) = 1 - 576/990 ≈ 0.42, a moderate positive agreement between the two rankings.
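The same coefficient can be checked in Python with scipy.stats.spearmanr, which accepts the ranks (or the raw scores) directly; the manual formula above is reproduced alongside it for comparison.

from scipy.stats import spearmanr

maths = [1, 3, 7, 5, 4, 6, 2, 10, 9, 8]    # ranks in Maths (R1)
stats = [3, 1, 4, 5, 6, 9, 7, 8, 10, 2]    # ranks in Stats (R2)

# Manual formula: R = 1 - 6*sum(D^2) / (N*(N^2 - 1))
n = len(maths)
sum_d2 = sum((r1 - r2) ** 2 for r1, r2 in zip(maths, stats))
print(sum_d2)                               # 96
print(1 - 6 * sum_d2 / (n * (n**2 - 1)))    # 0.418...

rho, p = spearmanr(maths, stats)            # SciPy gives the same coefficient
print(rho)                                  # 0.418...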
Pharmacoepidemiologic Study Designs
Study Designs

1. Case Reports

2. Case Series

3. Analyses of Secular Trends

4. Case – Control Studies

5. Cohort Studies

6. Randomized Clinical Trials

1. Case Reports
• Simply reports of events observed in single patients.
• Useful for raising hypotheses about drug effects, which can then be tested with more rigorous study designs.
• Very rarely usable to make a statement of causation.
• An exception is when the outcome is so rare and so characteristic that one knows it is due to the exposure; causation may also be accepted when rechallenge is out of the question because the outcome is life-threatening.
2. Case Series
• Collections of patients, all of whom have a single exposure, whose clinical outcomes are then evaluated and described.
• Alternatively, a case series can be a collection of patients with a single outcome, looking at their antecedent exposures.
• Useful for quantifying the incidence of an adverse reaction, or for checking whether it occurs in a larger population.
• Provides only clinical descriptions of a disease or of the patients who receive an exposure.
3. Analyses of Secular Trends
• Also called ecological studies.
• Examine trends in an exposure that is a presumed cause and trends in a disease that is a presumed effect, and test whether the trends coincide.
• Vital statistics and record linkage are often used in these studies.
• Useful for rapidly providing evidence for or against a hypothesis.
• Unable to control for confounding variables; e.g., cigarette smoking might be the presumed cause of a trend in lung cancer, yet occupational hazards still cannot be ruled out.
4. Case–Control Studies
• Compare cases with the disease to controls without the disease, looking for differences in antecedent exposure.
• Multiple possible causes of a single disease can be studied.
• Help in studying relatively rare diseases, since a smaller sample size is required.
• Information is generally obtained retrospectively from medical records, interviews, or questionnaires.
• Limitations: the validity of retrospective information, and the selection of controls, which is a challenging task; inappropriate control selection can lead to incorrect conclusions.
5. Cohort Studies
• Identify subsets of a defined population and follow them over time, looking for differences in their outcomes.
• Used to compare exposed patients with unexposed patients; can also be used to compare one exposure with another, or when multiple outcomes of a single exposure are to be studied.
• May be conducted either prospectively or retrospectively.
• Give a more reliable indication of causal association.
• But they require large sample sizes (especially for an uncommon outcome) and can require prolonged periods to study delayed outcomes.
Differences between a cohort study and a case–control study (2 × 2 layout):

                                  Disease
                          Present (Cases)   Absent (Controls)
Factor  Present (Exposed)
        Absent (Unexposed)

A case–control study selects subjects by disease status and compares the columns for differences in exposure; a cohort study selects subjects by exposure status and follows the rows for differences in outcome.
6. Randomized Clinical Trials
• Experimental studies: the investigator controls the therapy to be received by each participant.
• Their major strength is randomization.
• Problems include ethical issues and expense; they are of less importance after marketing.
Computer System in Hospital Pharmacy

1. Pattern of Computer Use in Hospital Pharmacy
• Hospital pharmacy has been very slow to adopt computers: only about 60% of pharmacies are computerized to any extent, and institutional pharmacy managers may be wary of computerization.
• This is because hospital pharmacy is a more difficult and complex operation than retail pharmacy. Retail pharmacies dispense prescriptions in more or less the same way, whereas a hospital pharmacy department distributes different types of drug products and provides different types of services, each with its own information requirements.
• Institutional pharmacists are also aware that many computer systems have performed in a less than satisfactory manner: one survey revealed that only about 69% of hospital pharmacies were fully satisfied with their computer system.
• Nearly two-thirds of hospital pharmacies believed that their computer system had improved some pharmacy operations, such as billing and the quality of drug therapy in the hospital; even so, there is considerable room for improvement.
• The most common system features are the ability to:
  1. Generate drug order labels
  2. Maintain patient profiles
  3. Generate drug use review data
  4. Maintain a drug formulary
  5. Update drug prices
  6. Transfer patients' drug charges to the billing department
  7. Perform some inventory control functions
  8. Check food and drug interactions
Capabilities of a Hospital Pharmacy Computer System
The system should be capable of performing:
1. Patient record database management
2. Medication order entry
3. Drug labels and lists
4. Intravenous solutions and admixtures
5. Patient medication profiles
6. Drug utilization review
7. Drug therapy problem detection
8. Drug therapy monitoring
9. Formulary search and update
10. Purchasing and inventory control
11. Billing procedures
12. Management information systems and decision support
13. Integration with other hospital departments
1. Patient Record Database Management
• A hospital pharmacy computer should ensure that the pharmacy's patient record database is continually updated to reflect the current status of each patient.
• Updating is done by accessing information from the admitting department's database to determine recent admissions, discharges, and patient transfers (ADT).
• The system should be capable of producing a current roster of patients, identifying name, age, sex, room number, and hospital service unit.
• The system must also be capable of displaying:
  1. Present diagnoses
  2. Other diseases present
  3. Allergies
  4. Weight
  5. Height
  6. Physician
  7. Special notes about the patient
2. Medication Order Entry
• Rapid processing of drug orders is an essential function of the computer system. Typically, orders are entered at a terminal by a technician.
• A formatted data-entry screen should allow easy entry and retrieval of orders, and a pharmacist should be able to retrieve an order for review or verification prior to administration to the patient.
• All drug orders should contain at least the following data elements:
  1. Physician
  2. Drug code
  3. Drug generic name and strength
  4. Route of administration
  5. Dosage administration schedule
  6. Start date
  7. Stop date
  8. Order status: conditional, active, or discontinued
  9. Pharmacist verification code
• The system should be capable of easily and rapidly aggregating and displaying entries by name, chart, or room number, and of separating scheduled orders from p.r.n. orders and patients with active therapy from patients with no therapy.
• The system should be capable of automatically scheduling start and stop dates for the administration of each drug. A minimal sketch of such an order record appears below.
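As an illustration of these data elements, here is a hypothetical Python sketch of a medication order record; the field names and status values are taken from the list above, but the class name and the validation rule are assumptions for illustration only, not any particular pharmacy system's design.

from dataclasses import dataclass
from datetime import date

@dataclass
class MedicationOrder:
    # Hypothetical record mirroring the data elements listed above.
    physician: str
    drug_code: str
    generic_name: str
    strength: str
    route: str               # route of administration
    schedule: str            # dosage administration schedule, e.g. "q8h"
    start_date: date
    stop_date: date
    status: str              # "conditional", "active", or "discontinued"
    pharmacist_code: str     # pharmacist verification code

    def is_active_on(self, day: date) -> bool:
        """An order is administered only while active and within its dates."""
        return self.status == "active" and self.start_date <= day <= self.stop_date

# Usage: a fill list for a given day keeps only currently active orders.
order = MedicationOrder("Dr. A", "D123", "amoxicillin", "500 mg", "oral",
                        "q8h", date(2024, 3, 1), date(2024, 3, 7),
                        "active", "PH42")
print(order.is_active_on(date(2024, 3, 3)))   # True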
3. Drug Labels and Lists
The system should be capable of generating medication container labels and reports in the form of:
1. The patient medication profile
2. A "fill list" for the preparation of individual doses in medication carts
3. A list of medication changes since the last fill list was prepared
4. A drug order renewal list for the prescriber
5. The medication administration record (MAR)
Medication Administration Record (MAR)
• The MAR is designed to provide nurses with a document for administering drugs to patients more easily and accurately.
• It contains:
  1. Name
  2. Bed number
  3. Diagnosis
  4. Sex
  5. Weight
  6. Height
  7. Allergies
• The medications currently scheduled for administration are listed, with drug, strength, dosage schedule, and start and stop dates.
4. Intravenous Solutions and Admixtures
• IV solutions are prepared separately from other medication orders and require separate computer reports.
• Orders for these solutions should contain:
  1. Patient identifying information
  2. Medication order information
  3. Start and stop dates
  4. Administration rate
  5. Order status (conditional, active)
• The system should calculate the required volumes of the IV solution components and the flow rate, and checks for incompatibilities and interactions with other medications should occur.
• The system should be capable of producing a report of the solutions to be prepared at the pharmacy, together with all necessary labels.
• Current IV medications should be listed in a separate section of the MAR.
• Finally, the system should generate lists of solutions soon to expire, for possible renewal, in the same manner as for other medication orders.
5. Patient Medication Profile
• Medication orders should be separated into current orders, discontinued orders, IVs, and other injectables, all updated in a common profile.
• The profile should be easily generated at any time for review.
• Physician drug renewal lists are used to remind the physician to consider reviewing a medication order that is about to expire. Each entry contains the medication order, drug name, dosage, administration schedule, original start date, and stop date.
Purchasing and Inventory Control
• Drug inventory is a delicate balance between not ordering too much and avoiding out-of-stock situations. Two approaches are used: the perpetual and the periodic inventory control systems.
• The perpetual system is the most sophisticated; it is practically impossible to maintain in a pharmacy except by computer, and a hospital pharmacy usually starts with a periodic control system instead. The perpetual system involves maintaining a running balance of all drugs in stock.
• All drugs are entered into the database as they are received in the pharmacy and added to the beginning inventory level, so the database reflects the current stock level of each drug. The quantities of all drugs leaving the pharmacy are similarly subtracted from the inventory balance; this is done automatically whenever a drug order is processed.
• The residual inventory balance can be checked by the inventory system manager to verify the current balance against the minimum order balance, and a list of the drugs that need to be ordered is produced on demand. A minimal sketch of this running-balance logic appears below.
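Here is a hypothetical Python sketch of the running-balance logic just described; the class and method names are illustrative assumptions, not any particular pharmacy system's API.

# Hypothetical sketch of perpetual inventory control: receipts add to the
# balance, dispensed orders subtract from it, and items at or below their
# minimum order level appear on the reorder list on demand.
class PerpetualInventory:
    def __init__(self):
        self.balance = {}     # drug code -> units on hand
        self.min_level = {}   # drug code -> reorder point

    def receive(self, drug, qty):
        self.balance[drug] = self.balance.get(drug, 0) + qty

    def dispense(self, drug, qty):
        # Called automatically whenever a drug order is processed.
        self.balance[drug] = self.balance.get(drug, 0) - qty

    def reorder_list(self):
        return [d for d, q in self.balance.items()
                if q <= self.min_level.get(d, 0)]

inv = PerpetualInventory()
inv.min_level["D123"] = 100
inv.receive("D123", 500)      # stock received into the pharmacy
inv.dispense("D123", 420)     # orders processed during the week
print(inv.reorder_list())     # ['D123'] -> 80 on hand, below the minimum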
• The perpetual inventory system is a remarkable labor-saving device and helps avoid costly stock-outs. For it to work, the physical inventory must agree with the computer-maintained file, so pharmacy personnel must be careful to ensure that all drugs entering or leaving the pharmacy are entered into the computer.
• In the periodic system, the computer file is checked against the shelf inventory periodically, usually once a week, to assure accuracy. The amount of inventory on hand is compared with the minimum and maximum stock levels; the existing stock level is entered into the computer manually, and the computer then generates a placement order copy.
• A purchase order may then be generated for each supplier, taking into account the minimum order quantity for that supplier. The computer also provides reports of drug purchase history that are invaluable in hospital inventory control management.
A sophisticated computer system can perform the following purchase-order processing and inventory management functions:
1. Detect items that have reached predetermined minimum order levels and list those items, sorted by supplier, in the form of a purchase order
2. Track purchase orders through the hospital and vendor purchasing systems to avoid duplicate orders
3. Display any requested drug to show the current stock and give the information necessary for processing orders
4. Provide aggregate drug usage statistics
5. Provide periodic (e.g., 6-month) drug order history reports containing price and stock number information
6. Automatically recalculate and update optimum reorder points based on order history and supplier lead times
7. Detect and report infrequently purchased items
8. Automatically print physical inventory reconciliation and inventory shrinkage reports for the perpetual inventory control system
9. Report the inventory on hand, stock turnover, and the value of drugs purchased from each individual supplier
10. Update the cost information in the database
Management Reports and Statistics
• The pharmacy needs to develop and maintain an information system to assist managerial decision making.
• The computer system should generate management reports relating to the pharmacy workload for a specific period of time (e.g., monthly). These reports help pharmacy managers plan and monitor work schedules and budgets and improve operational efficiency. Typical reports include:
  1. Hospital census data: number of admissions, discharges, and patient-days
  2. Aggregate drug usage: total drug orders and doses produced; types of drug orders (oral, topical, injectable); pharmacy preparation hours
  3. Drug use per patient by drug, diagnosis, or hospital service unit: the number of patients receiving each drug, the average number of doses, and the total cost; usage by drug category, with average number of doses and total cost
Use of Computers in Community Pharmacy
- SAI KUMAR

Computers have invaded every walk of life, and almost all commercial organizations and business firms have undergone significant computerization; community pharmacy establishments are no exception. At present, community pharmacies use computers for selected pharmaceutical purposes. While there are several possible purposes, the following is a list of the majority of community pharmacy functions that could be computerized.

(1) Clerical: Preparation of prescription labels. Providing a receipt for the patient. Generation of a hard-copy record of the transaction. Calculation of total prescription cost. Maintenance of a perpetual inventory record. Accumulation of suggested orders based on suggested order quantities. Automatic ordering of required inventory via electronic transmission. Calculation and storage of annual withholding statements.

(2) Managerial: Preparation of daily sales reports. Generation of complete sales analyses as required, for a day, week, month, year, and year-to-date, covering the number of prescriptions handled and the amount in cash. Estimation of profit and financial ratio analysis. Production of drug usage reports. Calculation of gross margin, reported in all manner of detail. Calculation of the number of prescriptions handled per unit time, to help in staff scheduling. Printing of billing and payment summaries.

(3) Professional: Building a patient profile. Storing information on drug and other allergies to warn about possible problems. Retrieval of the current drug regimen for review. Updating of patient information on file. Printing of drug–drug and drug–food interactions. Maintaining a physicians file, including specialty, designation, address, phone, office hours, etc.

(4) Clinical support: Patient medication profile; patient education profile; consulting pharmacist activities; drug utilization monitoring.
DRUG INFORMATION RETRIEVAL AND STORAGE

A complete search of the drug information is necessary for the clinical pharmacist so as to satisfy queries about pharmacology, drug interactions, adverse drug reactions, toxicology, etc. This job of searching can be simplified by using computers.

In 1964, the National Library of Medicine created a computerised medical information retrieval system, MEDLARS (Medical Literature Analysis and Retrieval System). In 1971, it developed a faster on-line system, MEDLINE (MEDLARS ON-LINE).

Computerised information retrieval has the following advantages over a manual search:
(i) It is time saving and pleasant.
(ii) It is more thorough and timely than a manual search.

To operate the information retrieval system, the equipment needed includes a microcomputer, a printer, a telephone line, and a modem.

For information retrieval, the choice of a database is also very important. The databases may be:
(a) Bibliographic databases,
(b) Journal information, and
(c) Textbook material.

Generally, a bibliographic database is adopted, as usually there is a medical library nearby from where one can get the articles. A medicine-oriented database, like MEDLINE, or a pharmacy-oriented one, like International Pharmaceutical Abstracts, may be chosen. Some on-line databases of the medical and pharmaceutical literature are shown in Table 15.1.

Table 15.1

Database | Produced by | Data
1. Medline | National Library of Medicine (NLM) | Around 3000 biomedical journals dating back to 1966
2. Toxicology Data Bank (TDB) | (not stated) | Toxicological data
3. International Pharmaceutical Abstracts | American Society of Hospital Pharmacists | More than 600 publications from 1970 onwards are covered
4. Biosis | Bioscience Information Service | Biological Abstracts
Some other databases available on-line are:
(i) MEDLARS (Medical Literature Analysis and Retrieval System): the earliest effort to computerize medical information retrieval resulted in the creation of MEDLARS by the NLM.
(ii) MEDLINE (MEDLARS ON-LINE).
(iii) NLM (National Library of Medicine).
(iv) MICROMEDEX: a microcomputer-based retrieval system that uses a laser disk at the site for storage of data.
(v) International Pharmaceutical Abstracts (I.P.A.): available on-line and in a print version.
(vi) Pharmaceutical News Index (PNI): contains current news about devices, cosmetics, and related health industries. It provides newsletters which are not covered in print abstracts and indexes. It is updated weekly.

Advantages: The most important advantage is the time saved in conducting literature searches. A pharmacist may require several hours to research a particular therapeutic question from a literature search covering about 10 years of articles; by computer it can be done in minutes, and the search is more pleasant. Only a few seconds are required to broaden a computer search from a specific drug to an entire therapeutic class, whereas manually it is a tedious job to search the INDEX MEDICUS.

A pharmacist should be able to retrieve orders for review prior to administration to the patient. Data may be entered by use of codes for drug name and dosing schedule. All drug orders should contain the following:
• Drug name (code)
• Drug generic name and strength
• Route of administration
• Dosage schedule
• Starting date
• Stopping date
• Physician code
• Pharmacist verification code

Features: The system should be capable of rapidly collecting and displaying entries by patient name or room number. It should be capable of scheduling starting and stopping dates automatically and of separating the orders of patients on active drug therapy from those with no drug therapy.
(3) Preparation of Lists

The computer system should be capable of producing labels and reports in the form of:
(i) The patient medication profile.
(ii) "Fill lists" for the preparation of individual doses.
(iii) A list of medication changes.
(iv) Drug order renewal lists for the prescriber.
(v) The medication administration record (MAR).

Orders entered for IV solutions, admixtures, and total parenteral nutrition (TPN) should be prepared separately. These orders should contain the following information:
• "Patient" and "medication order" identifying information.
• Start date and stop date.
• Administration rate.
• Order status (conditional or active).

The system should be capable of calculating flow rates and checking incompatibilities. It should allow conditional entry and its checking by a pharmacist. The system should be able to prepare lists of solutions soon to expire and of solutions to be prepared at the pharmacy.
CHAPTER 23
Applications of Computers in Pharmaceutical and Clinical Studies

23.1 Introduction
Nowadays computers are used in pharmaceutical industries, hospitals, and various departments for drug information, education, evaluation, analysis, medication history, and the maintenance of financial records. They have become indispensable in the development of clinical pharmacy, hospital pharmacy, and pharmaceutical research, and they co-ordinate effective communication and support clinical and financial management functions.

The effective functioning of any organization largely depends upon a continuous flow of information: receiving information, storing it, processing it, and disseminating it. An effective management information system provides the needed information in the right form, at the right time, and at the right place. Each piece of information in an organization is connected with the others through communication channels, making the organizational entity a decision-making point.

Computers play an effective role in the retrieval of information. In hospitals, data management involves creating, modifying, adding, and deleting data in patient files to generate reports, which many doctors, connected through personal computers, share for further investigation. Some popular database management system (DBMS) packages for personal computers are dBase III+ and FoxBase+. Computers thus help in maintaining the overall health care system, as can best be illustrated by listing their applications.

23.2 Patient Monitoring
Patient monitoring includes the monitoring of physiological processes in patients, such as blood pressure, pulse rate, and temperature. This information plays a special role in the detection and prevention of critical conditions: it gives warning of critical conditions for immediate nursing attention, enables the medical staff to make accurate judgments of the patient's progress, and in the long term provides data for research purposes in monitoring patients under intensive care.

Computers therefore play an important role in communication, acquiring data about the patient's metabolism and communicating it to the medical staff by displaying graphs. Detecting critical conditions and generating alarms involves both numerical and logical data processing. Computers can also control a number of instruments simultaneously to obtain samples of body fluids and have them analyzed by an auto-analyser for their physical and chemical parameters.

After the parameters are analyzed, "AND/OR" statements express a logical relationship, whereas "IF...THEN" marks a conditional computation. This combination of logical and conditional data processing makes the patient monitoring system a decision-making instrument for the interpretation of results.
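This combination of logical and conditional processing can be pictured with a small Python sketch; the vital-sign thresholds and the rule itself are invented for illustration and are not taken from any monitoring product.

# Hypothetical patient-monitoring rule combining logical (and/or) and
# conditional (if ... then) processing to raise an alarm.
def check_vitals(bp_systolic, pulse, temp_c):
    hypotensive = bp_systolic < 90     # illustrative thresholds only
    tachycardic = pulse > 120
    febrile     = temp_c > 39.0

    # AND/OR statements express the logical relationship ...
    critical = hypotensive or (tachycardic and febrile)

    # ... and IF ... THEN marks the conditional computation.
    if critical:
        return "ALARM: critical condition - notify nursing staff"
    return "stable"

print(check_vitals(bp_systolic=85, pulse=110, temp_c=38.2))   # ALARM
print(check_vitals(bp_systolic=120, pulse=80, temp_c=37.0))   # stable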
23.3 Medication Monitoring
To meet the goal of optimum drug therapy, medication monitoring is essential. The prescriptions a patient receives over a period of time are entered into the computer, which then serves as a chronological drug file of the patient and helps in suggesting drugs along with their dosage schedules. Computers provide two types of information: (a) pharmacokinetic and (b) non-pharmacokinetic.

Pharmacokinetic information: NONLIN is a computer program which can predict pharmacokinetic parameters very easily. These parameters include volume of distribution, bioavailability, rate of clearance, etc. It helps in maintaining the dosage schedules of various drugs, such as antibiotics and aminoglycosides.

Non-pharmacokinetic information: This includes allergic reactions, drug interactions, adverse drug reactions, etc. For such information two computer programs are available:
1. MEDIPHOR (Monitoring and Evaluation of Drug Interactions by a Pharmacy-Oriented Reporting system)
2. PAD (Pharmacy Automated Drug interaction screening)

23.4 Maintenance of Records
Various records, such as the patient's medication history, current treatment, and financial records, are maintained in computers by feeding in accurate data: 'DATA' is a collection of facts, and the computer works as a 'DATABASE' manager. MEDLINE is a database package used for this purpose. It gives current information on patients regarding name, age, sex, room number, weight, allergic reactions, etc. These records are stored in a 'FILE', such as a "Physician name" file, a "Direction" file, or a "Drug interaction" file. These files contain specific information, such as the physician's name, registration number, phone number, and address, and provide it whenever required.

23.5 Materials Management
Computers play a vital role in material planning, purchasing, inventory control, and the forecasting of prices. Inventory control is essential because it maintains the balance between stock in hand and excessive capital investment. Techniques such as ABC analysis and EOQ can easily be programmed, eliminating the tedious and time-consuming task of calculation. Computers are used to detect the items which have reached their minimum order level; the computer then prepares a list and purchase orders for further supplies. Generally there are two systems of inventory control:
(a) Periodic inventory control system: stock levels are checked manually, and the amount of inventory in hand is compared with the minimum and maximum stock maintained in the computer. The computer helps in the placement of orders with different suppliers, after checking their terms and conditions, because all the stock entries are present in it.
(b) Perpetual system: the computer reports the present position of all drugs because, when drugs are received, they are added to the initial stock to give the current stock. When drugs are delivered to the various departments, the quantities are subtracted accordingly. Such additions to and deletions from the inventory balance are made with the help of a database package.

The output from the computer may be obtained in various forms, such as:
• Planning of materials
• Drug formulary
• Vendor details for procurement
• Tender rates and analysis
• Pending supply orders
• Inventory analysis
• Reorder points
• Safety stocks
• Ledger for narcotics
• Over/under stocking
• Slow-moving/fast-moving items
• Expired drugs

23.6 Data Storage and Retrieval
The hospital administration computer helps in rapid data storage and retrieval, particularly when the stored data are subject to infrequent changes and when groups of items based on the stored data need to be retrieved. The admission of in-patients and their discharge from a hospital involve data which change every minute; for example, the admission of an in-patient ties up resources such as clinical and nursing staff, a bed, the operation theatre, the intensive care unit, the pharmacy department, radiological services, etc. Hence the decision to admit a new patient is not a simple one. Even the availability of a suitable bed is difficult to determine across male and female wards, isolation wards, etc. A prediction must be made that a suitable bed will be available at a future date, because if the estimate is over-optimistic, patients who are called in may be turned away at the last minute, while if the prediction is over-pessimistic, expensive resources lie idle and the waiting period for treatment is extended.

Once the patient is admitted, the computer records and stores information such as clinical information, catering information, diagnosis, sex, medication, etc. It helps in providing detailed information about the medical and paramedical staff, including their duty charts, and helps senior personnel keep a check on the ward-by-ward loading of nursing staff and allocate additional help whenever required.

23.7 Diagnostic Laboratories
Computers meet the growing demands on testing laboratories: manual procedures were lengthy and time consuming, whereas automated computerized instruments perform a number of tasks with accuracy. Generally a LIS (Laboratory Information System) is used to manage the large amount of data. Instruments contain preprocessors which convert raw data into digital format and help in transmitting numerical values for report generation. The LIS also performs administrative and managerial functions, including specimen tracking, product analysis, and quality control. Similarly, many instruments have microprocessors that facilitate all phases of the testing process, from calibration of the instrument to reporting of results.

The development of powerful computers also offers the opportunity for improved viewing and interpretation in the radiology department, where many of the latest imaging techniques, such as computerized tomography (CT) and magnetic resonance imaging (MRI), are inherently digital. Here the computer creates a "functional image" by performing complex calculations on measured data.

23.8 Pharmaceutical Education
Computer-aided instruction helps in overcoming the shortcomings of traditional teaching methods. It provides a medium for interactive learning, offers immediate student-specific feedback, supports individually tailored instruction, and finally forms a basis for objective testing.

23.9 Hospital Pharmacy and Retail Pharmacy
Computers are used in pharmacies to maintain accessible, legible, and up-to-date medication records. They help in overall patient care by maintaining patient records, drug consumption, registration numbers, and detailed records of the accounts and purchase sections. For the retail pharmacist too, computers are of valuable assistance in prescription processing, including the display of information about the patient and the drug, adverse drug reactions, causation, duplication of orders, labeling conditions, etc.

Other applications in hospital and retail pharmacy are:
• Calculation of monthly gross income
• Generating pay slips
• Updating employee information
• Placement of supply orders
• Keeping track of total payments and amounts due to suppliers
• Checking the quality and quantity of hospital supplies received and identifying any discrepancies
• Recording purchases for accounting purposes

A number of computer programs have been developed to assist physicians in dosing and scheduling drugs. There are certain drugs to which particular patients are extremely sensitive; for such patients, physicians use computer programs to forecast drug levels and to choose the amounts and schedule of drug doses that will achieve the target level. Similarly, 'HELP' is a system which identifies abnormal chemistry levels, concurrent diseases, and other related patient conditions.

23.10 Hospital Setting
The duties of the pharmacist have been changing tremendously, and it has become impossible to remember and recheck everything. Therefore the computer manages the hospital systems and allows the pharmacist to check the work. Software is available for the pharmacist to provide professional services and to automate the work of the technical staff. This ultimately results in efficient and cost-effective operation and further maximizes the clinical and patient-oriented functions of pharmacists.

23.11 Patient Counselling
Computers play an important role in patient counselling. Sophisticated software is available to educate patients by producing patient education leaflets. These leaflets provide information about the name of the medication, its uses, side effects, precautions, drug interactions, missed doses, storage, how to take the medication, etc. It should always be kept in mind that the computer program should be an intelligent one, so that it does not do harm by giving the patient too much or too little information.

23.12 Drug Interactions
Pharmacists cannot remember every medication, its therapeutic usage, its effects, and its drug interactions. Computers therefore offer knowledge-base systems to extend professional services. A computerized pharmacy can alert the physician or pharmacist to serious drug–drug, drug–food, and drug–disease interactions which are likely to occur with a prescription. Examples of such databases and online services are MEDLINE, IDIS, and PharmLine.

23.13 Community Pharmacy
Computers help in streamlining the refilling of prescriptions, ending the long-standing problem of waiting in a queue for a refill. They are becoming popular because they remind the patient about refilling and compliance with medication. These systems not only help in filling individual prescriptions but also in processing prescriptions in the right manner, and they enable the community pharmacy to manage inventory, sales, accounts, etc.

23.14 Drug Information Services
Various software, Internet, intranet, and online services are available for the pharmacist to provide a drug information service to medical and paramedical professionals and to patients. Computer-aided drug design helps the chemist to formulate a new drug molecule possessing the desired therapeutic action; these new drug entities can be generated through graphics and by changing the molecular configuration. CD-ROM technology has helped greatly in the evolution of compact electronic libraries. Various software programs are available from different companies.
Computerized Recordkeeping

Regulations relating to the storage and retrieval of prescription records include computerized recordkeeping. This administrative regulation provides standards for those desiring to use computerized recordkeeping.

Section 1. The following information shall be entered into the system:


(1) All information pertinent to a prescription shall be entered into the system, including, but
not limited to, each of the following:
(a) The prescription number;
(b) The patient’s name and address;
(c) The prescriber’s name and address;
(d) The prescriber’s Federal Drug Enforcement Administration number, if appropriate;
(e) Refill authorization;
(f) Any prescriber’s instructions or patient’s preference permitted by law or administrative
regulation;
(g) The name, strength, dosage form, and quantity of the drug dispensed originally and upon
each refill; and

(h) The date of dispensing of the prescription and the identifying designation of the dispensing pharmacist for the original filling and each refill.

(2) The entries shall be made into the system at the time the prescription is first filled and at the time of each refill, except that the format of the record may be organized so that the data already entered may appear for the prescription or refill without reentering that data. Records that are received or sent electronically may be kept electronically. The dispensing pharmacist shall be responsible for the completeness and accuracy of the entries.


(3) The original prescription and a record of each refill, if received written or oral, shall be preserved as a hard copy for a period of three (3) years and thereafter be preserved as a hard copy or electronically for no less than an additional two (2) years. The original prescription and a record of each refill, if received by facsimile, shall be preserved as a hard copy, the original electronic image, or electronically for a period of three (3) years and thereafter be preserved as a hard copy, the original electronic image, or electronically for no less than an additional two (2) years. The original and electronic prescription shall be subject to inspection by authorized agents. An original prescription shall not be obstructed in any manner.
(4) The original prescription and a record of each refill, if received as an e-prescription, shall be preserved electronically for a period of no less than five (5) years. The electronic prescription shall be subject to inspection by authorized agents. An original prescription shall not be obstructed in any manner.
(5) The required information shall be entered into the system for all prescriptions filled at the
pharmacy.
(6) The system shall provide adequate safeguards against improper manipulation or alteration
of the data.
(7) The system shall have the capability of producing a hard-copy printout of all original and
refilled prescription data as required in Section 1 of this administrative regulation. A hard-copy
printout of the required data shall be made available to an authorized agent within forty-eight
(48) hours of the receipt of a written request.
(8) The system shall maintain a record of each day’s prescription data as follows:
(a) This record shall be verified, dated, and signed by the pharmacist(s) who filled those
prescription orders either:
1. Electronically;
2. Manually; or
3. In a log.
(b) This record shall be maintained for no less than five (5) years; and
(c) This record shall be readily retrievable and shall be subject to inspection by authorized
agents.
(9) An auxiliary recordkeeping system shall be established for the documentation of refills if
the automated data processing system is inoperative for any reason. The auxiliary system shall
insure that all refills are authorized by the original prescription order and that the maximum
number of refills is not exceeded. If the automated data processing system is restored to
operation, the information regarding prescriptions filled and refilled during the inoperative
period shall be entered into the automated data processing system within seventy-two (72) hours.
(10) Controlled substance data shall be identifiable apart from other items appearing in the
record.
(11) The pharmacist shall be responsible to assure continuity in the maintenance of records
throughout any transition in record systems utilized.

Section 2. A computer malfunction or a data processing service provider's negligence shall not be a defense against charges of improper recordkeeping.

Section 3. This administrative regulation is not applicable to the recordkeeping for drugs prescribed for and administered to patients confined as inpatients in an acute care facility.
